Searching Discourse via API


#1

As part of the new securedrop.org website logic, we’re trying to integrate Discourse content in the “search” functionality, so that folks searching will find relevant results on this here forum. We’ve got a Discourse API token set up for that, and already have app code logic that queries the Discourse API to update the search index and store it in the website database.

There’s a problem, however: the Discourse server is blocking our requests due to a ratelimit trigger. It looks like the ratelimit is triggered at around 60 or 61 requests. Unfortunately it doesn’t look like we can reduce the number of GETs via smart pagination in the API client calls—the docs imply by omission that this isn’t possible.

Question for @dachary and @fpoulain: Would you be amenable to raising the ratelimit to support the search index updates? We’ll try to isolate certain URL paths if that’s useful to you—it depends on where the ratelimit is occurring. I assume the ratelimit is implemented at the webserver level, rather than the firewall, but it could also be Discourse itself (i.e. application-level). Here’s an example error message from local testing:

429 Client Error: Too Many Requests for url: https://forum.securedrop.club/t/579.json

The “update search indices” requests will likely run on a daily schedule, to strike a balance between keeping the search results current and limiting the load on the forum server. Happy to work with you on a solution that works for all involved.


#2

@conorsch I will look at it.


#3

It seems that nginx has issued some 429 responses to requests from 159.89.238.97.

We can try to increase the limit following https://www.nginx.com/blog/rate-limiting-nginx/ and hope that the firewall lets all of them through.

Questions:

  • Which IP should be whitelisted?
  • What request rate is needed?
  • Is there any burst behavior? If so, how many requests per burst?

#4

PS: Currently, the forum is deployed manually from the upstream Docker image. Migration to the Ansible repo/deployment should happen in the coming weeks or months.


#5

@conorsch Would it make sense to whitelist FPF’s VPN IPs for dev as well as the new securedrop.org server?

I’m not 100% sure how many requests are required for us to index the posts, but we reindex them completely every time and Discourse’s API requires a lot of pagination, so this will grow with the size of the forum. At the current size, I’d estimate ~12 requests just to get a list of topics and then at least one request per topic (more if the topic has a lot of comments), so my guess is that we currently clock in somewhere below 200 requests. If this is or becomes a major issue, we might look into a different approach (like maybe using Discourse’s webhooks to update the index).

There’s no burst behavior right now. I am fairly sure that the script fires the requests consecutively. I can add in a delay if that would be helpful.
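
For illustration, a fixed delay between consecutive requests could look roughly like the sketch below. The helper name `throttled` and the one-second interval in the usage note are assumptions made for this example; the thread does not state the actual rate-limit window on the server.

```python
import time


def throttled(items, min_interval, sleep=time.sleep, clock=time.monotonic):
    """Yield items no faster than one per `min_interval` seconds.

    `sleep` and `clock` are injectable so the pacing can be tested
    without actually waiting.
    """
    last = None
    for item in items:
        if last is not None:
            # Wait only for the part of the interval that has not elapsed yet.
            remaining = min_interval - (clock() - last)
            if remaining > 0:
                sleep(remaining)
        last = clock()
        yield item
```

Looping with `for url in throttled(urls, 1.0): ...` would then space the requests at least one second apart.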


#6

@fpoulain I’d prefer not to use IP whitelisting here, since the systems are managed separately, and any changes on the securedrop.org webserver side, e.g. due to rebuild, would break the whitelisting, and require that we correspond with you about the change so you can implement a fix again.

Would you be amenable to loosening the ratelimit a bit on the URL routes the API will hit? We can generate a list of routes, e.g. /t/*.json or similar, and then only those routes would have the ratelimit adjusted, and you can preserve the existing ratelimit site-wide.

Will work on generating a list of the expected GET requests from the API consumer, to confirm that the routes are predictable and appropriate for the selective ratelimiting strategy proposed above, and share them here.


#7

Here are the URLs we GET (with * being a topic id):

  • /latest.json
  • /t/*.json
  • /t/*/posts.json

We query most of these with querystrings, in case that makes a difference.


#8

@harris @conorsch

Is there an issue in securedrop.org somewhere that explains how the import from the forum is designed? I’m not sure I understand why it would take over 60 requests to just get the latest messages from the forum. But (disclaimer) I’m ignorant of the API :wink:

It feels like we should not be having this problem at all because the activity of the forum is way below the threshold.


#9

@dachary The Discourse API, sadly, doesn’t provide any way of identifying or retrieving latest posts, just latest topics, which doesn’t let us know if there’s new content to index under an old topic. As a result, we’re doing a complete reindexing of all topics every time we run the script, hence the large number of requests.

The script does these things:

  • Paginates over the entirety of latest.json to generate a complete list of public topics (the URL for this endpoint is sort of a misnomer—it makes available all topics sorted by date created)
  • Gets details for each topic at /t/{topic_id}.json
  • If a topic has posts, paginate through those posts at /t/{topic_id}/posts.json
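
The three steps above could be sketched roughly as follows. This is not the securedrop.org code. The `fetch` callable is a hypothetical stand-in for whatever HTTP client is used, and the response keys (`topic_list.topics`, `post_stream.posts`, `post_stream.stream`) and the `page` and `post_ids[]` parameters are assumptions about the JSON Discourse returns, to be checked against a live response.

```python
from typing import Callable, Dict, Iterator, List

# fetch(path, params) -> decoded JSON body (hypothetical helper).
Fetch = Callable[[str, dict], dict]


def iter_topics(fetch: Fetch) -> Iterator[dict]:
    """Page through /latest.json until a page comes back empty."""
    page = 0
    while True:
        data = fetch("/latest.json", {"page": page})
        topics = data.get("topic_list", {}).get("topics", [])
        if not topics:
            return
        yield from topics
        page += 1


def fetch_all_posts(fetch: Fetch, topic_id: int, chunk: int = 20) -> List[dict]:
    """Get a topic, then request any posts the first response left out."""
    detail = fetch(f"/t/{topic_id}.json", {})
    stream = detail.get("post_stream", {})
    posts = list(stream.get("posts", []))
    seen = {post["id"] for post in posts}
    missing = [pid for pid in stream.get("stream", []) if pid not in seen]
    # One extra request per `chunk` post ids that are still missing.
    for i in range(0, len(missing), chunk):
        extra = fetch(f"/t/{topic_id}/posts.json",
                      {"post_ids[]": missing[i:i + chunk]})
        posts.extend(extra.get("post_stream", {}).get("posts", []))
    return posts


def index_forum(fetch: Fetch) -> Dict[int, List[dict]]:
    """Map every topic id to the full list of its posts."""
    return {t["id"]: fetch_all_posts(fetch, t["id"]) for t in iter_topics(fetch)}
```

Every topic costs at least one request on top of the listing pages, which is why the total grows with the size of the forum.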

Using Discourse’s webhooks to get new partial content updates immediately might be a superior alternative to this sometime down the line, but, unfortunately, I don’t think I have time to build that out before our planned launch for securedrop.org, so it’d be great to get this working, even if as a stopgap.


#10

It should be ok. @harris can you try please?

I added to nginx:

    # we bypass limits for json
    location ~ ^/(latest|t/).*\.json {
      add_header Referrer-Policy 'no-referrer-when-downgrade';
      add_header Strict-Transport-Security 'max-age=31536000'; # remember the certificate for a year and automatically connect to HTTPS for this domain
      proxy_set_header Host $http_host;
      proxy_set_header X-Real-IP $remote_addr;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header X-Forwarded-Proto $thescheme;
      proxy_pass http://discourse;
      break;
    }

Before adding it I got:

# while curl https://forum.securedrop.club/t/searching-discourse-via-api/629/8.json -s -I | grep -q 'HTTP/1.1 200 OK'> /dev/null; do echo -n "."; done; echo "";
................................................................
# while curl https://forum.securedrop.club/t/searching-discourse-via-api/629/8.json -s -I | grep -q 'HTTP/1.1 200 OK'> /dev/null; do echo -n "."; done; echo "";
...................................................................................

After adding it, I got more than 300 requests through without error.


#11

This looks great! Thanks for the super fast turnaround, we’ll test the calls again and let you know!


#12

@fpoulain I’m still getting 429 after 61 requests. I’m not sure why. I’m going to experiment with building in a delay before retrying after getting that error, but I’m still confused. :confused:


#13

I managed to resolve this by sleeping the script for 30 seconds and retrying every time it encounters a 429 response, so I think we’re actually good here regardless, @fpoulain. Thanks for your help. Having successfully run it, I can now state with certainty that with 380 forum topics, the script requires 410 requests to index the entire thing! I don’t think we’re gonna knock over the server or anything, but still Yikes! We’ll be looking into alternative options for this process sometime after launch.
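
A minimal sketch of that sleep-and-retry approach, for anyone hitting the same limit. The function name and the `max_attempts` cap are assumptions added for this example (the script described above simply retries every time), and `get` is any callable that returns an object with a `status_code` attribute.

```python
import time


def get_with_retry(get, url, wait_seconds=30, max_attempts=5, sleep=time.sleep):
    """Call get(url); on HTTP 429, pause for `wait_seconds` and try again."""
    for attempt in range(1, max_attempts + 1):
        response = get(url)
        if response.status_code != 429:
            return response
        # Rate limited: wait before the next attempt, unless this was the last.
        if attempt < max_attempts:
            sleep(wait_seconds)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts: {url}")
```

With python-requests this could be called as `get_with_retry(requests.get, "https://forum.securedrop.club/t/579.json")`.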


#14

It works for me:

$ while curl https://forum.securedrop.club/t/searching-discourse-via-api/629/8.json -s -I | grep -q 'HTTP/2 200'> /dev/null; do echo -n "."; done; echo ""
......................................................................................................................................................................................................................................................................................^C

Maybe you are requesting something that does not match location ~ ^/(latest|t/).*\.json { in the modification @fpoulain made?
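
One quick way to check this is to run the request URLs through the same pattern. The sketch below assumes Python's `re` module behaves like nginx's PCRE for this simple expression, and relies on nginx matching locations against the URI path only, without the query string. The function name is hypothetical.

```python
import re
from urllib.parse import urlsplit

# Same pattern as the nginx location block above.
LOCATION = re.compile(r"^/(latest|t/).*\.json")


def bypasses_limit(url: str) -> bool:
    """Return True if the URL's path matches the rate-limit bypass location."""
    path = urlsplit(url).path  # drop scheme, host and query string
    return LOCATION.search(path) is not None
```

For example, `bypasses_limit("/t/107.json")` and `bypasses_limit("/latest.json?page=2")` are both true, while `bypasses_limit("/t/107")` is false.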


#15

Weird.

I see e.g.

[19/Apr/2018:20:03:43 +0000] "forum.securedrop.club" 159.89.238.97 "GET /t/107.json HTTP/1.1" "python-requests/2.18.4" "-" 429 514 "-" 0.019 0.019 "-"

But nginx didn’t log it as an error like it did before:

2018/04/18 16:35:51 [error] 2981#2981: *1370411 limiting requests, excess: 12.144 by zone "flood", client: 94.23.6.219, server: _, request: "HEAD /t/searching-discourse-via-api/629/8.json HTTP/1.1", host: "forum.securedrop.club"
2018/04/18 16:36:01 [error] 2981#2981: *1370579 limiting requests, excess: 100.742 by zone "bot", client: 94.23.6.219, server: _, request: "HEAD /t/searching-discourse-via-api/629/8.json HTTP/1.1", host: "forum.securedrop.club"

But on my side I can do:

# while curl https://forum.securedrop.club/t/107.json -s -I | grep -q 'HTTP/1.1 200 OK'> /dev/null; do echo -n "."; done; echo "";
.......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................^C

Leading to

[20/Apr/2018:09:12:49 +0000] "forum.securedrop.club" 94.23.6.219 "HEAD /t/107.json HTTP/1.1" "curl/7.38.0" "topics/show" 200 501 "-" 0.032 0.032 "-"

I suspect you are being limited by limit_conn connperip 20;. So I reloaded nginx to return 418 for this particular limit. Could you please retry?


#16

Thanks, @fpoulain, that’s quite helpful. Will check our logs and compare, I’m optimistic we can iron something out on the request rate that’ll work for all involved.


#17

Looks like we’ve got successful index updates via Discourse! The state of the config at present works well for our needs. Thanks again for your collaboration here, @fpoulain!