Pagination Pitfalls

Wed 21 November 2018

A common pattern in many REST APIs that return a large number of results is to do something like this:

  • There's an endpoint /item that returns items starting with the most recent, limited to a fixed amount per page, e.g. 100 items per page.
  • You can get the total number of items by making a request to /item/count.
  • You can get a specific page by doing something like: /item?page=10
  • Therefore to get all items you take the count and divide by the limit per page to work out how many pages there are. You can then iterate over the pages to collect all items, as sketched below.
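
A minimal sketch of that pattern in Python might look like the following. The base URL, the shape of the count response (assumed here to be JSON like {"count": 199}), the assumption that /item returns a plain JSON list, and pages starting at 1 are all illustrative, not taken from any particular API:

    import requests

    BASE_URL = "https://api.example.com"  # hypothetical base URL
    PER_PAGE = 100                        # items per page, per the API's limit

    def fetch_all_items():
        # Total number of items, as reported by the count endpoint.
        total = requests.get(f"{BASE_URL}/item/count").json()["count"]

        # Number of pages, rounding up so the final partial page is included.
        pages = -(-total // PER_PAGE)

        items = []
        for page in range(1, pages + 1):
            response = requests.get(f"{BASE_URL}/item", params={"page": page})
            items.extend(response.json())
        return items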

The Problem

Doing so can result in duplicated and/or missing data if the number of items grows while the paginated requests are happening.

To illustrate by example, suppose at the start of the request there are 199 items. Let's number them starting from 1 for the oldest item. These would be split amongst the pages like so:

page   items
1      100 - 199
2      1 - 99

Now if 2 items are added (items 200 and 201), they would be split like this:

page   items
1      102 - 201
2      2 - 101
3      1

So if those 2 items are added between the request for page 1 and the request for page 2, following the above pattern for getting items you would end up with these items:

page   items
1      100 - 199
2      2 - 101

You end up with 2 duplicates (items 100 and 101) and 1 missing (item 1).

Cursors

Many modern APIs that deal with huge amounts of constantly changing data use cursors instead of page numbers. This prevents duplicate or missing data by keeping a reference to the exact position where the next slice needs to start. Each request returns the cursor to use for the next request.
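
A rough sketch of a cursor-based loop, assuming the response carries the items under an "items" key and the cursor for the following request under a "next_cursor" key (both names, and the "cursor" query parameter, are made up for illustration):

    import requests

    BASE_URL = "https://api.example.com"  # hypothetical base URL

    def fetch_all_items_with_cursor():
        items = []
        cursor = None
        while True:
            params = {"cursor": cursor} if cursor else {}
            data = requests.get(f"{BASE_URL}/item", params=params).json()
            items.extend(data["items"])
            cursor = data.get("next_cursor")
            # No cursor in the response means there are no more slices to fetch.
            if not cursor:
                break
        return items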

This does solve the problem of missing or duplicated data. However, it means that requests must be carried out in series, since you don't know the cursor for the next request until the previous one has returned, unlike with page numbers, where you can make multiple requests at once.

Perhaps some of these API owners prefer that users do not make too many requests at once, and for many use cases a client only needs to make a few requests. But when gathering data for analysis, where we want to get all items, making the requests one after another rather than many at once can be the difference between waiting a few hours and waiting a few minutes for the results.
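
To make the contrast concrete, here is a sketch of fetching page-numbered results concurrently with a thread pool; the endpoint and response shape are the same assumptions as in the earlier sketch:

    from concurrent.futures import ThreadPoolExecutor

    import requests

    BASE_URL = "https://api.example.com"  # hypothetical base URL

    def fetch_page(page):
        return requests.get(f"{BASE_URL}/item", params={"page": page}).json()

    def fetch_all_pages_concurrently(pages, workers=10):
        # The page numbers are known up front, so the requests can run in parallel.
        with ThreadPoolExecutor(max_workers=workers) as executor:
            pages_of_items = list(executor.map(fetch_page, range(1, pages + 1)))
        # Flatten the per-page lists into a single list of items.
        return [item for page_items in pages_of_items for item in page_items]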

Reversing the flow

An alternative way to deal with this is to request that the items be sorted from oldest to newest (where an API supports such an option). This keeps the contents of the pages consistent even when new items are added.
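
Where such an option exists, this might be nothing more than an extra query parameter on the earlier sketch; the parameter name and value below are hypothetical, as APIs vary here:

    # Hypothetical "sort" parameter; the real name depends on the API.
    response = requests.get(f"{BASE_URL}/item",
                            params={"page": page, "sort": "oldest"})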

This still wouldn't help if items can also be deleted, as the pages would shift and other items would go missing. For example, if we start with 200 items, the pages would be:

page   items
1      1 - 100
2      101 - 200

If item number 100 is deleted, the pages would contain:

page   items
1      1 - 101 (excluding the deleted item 100)
2      102 - 200

So if item number 100 is deleted between the request for page 1 and the request for page 2, we end up with item number 101 missing:

page   items
1      1 - 100
2      102 - 200

Conclusion

None of the above methods for getting paginated data are perfect. Cursors guarantee accuracy but sacrifice speed of access. Page numbers can result in data going missing or being duplicated. It is important to be aware of this issue because it may go unnoticed for a while, especially with data that doesn't change very often, where the effects would be sporadic. Just being aware of it can sometimes be enough to mitigate the issues: duplicates can easily be dropped, and the total count can be verified to check for missing items.
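
As a final sketch, dropping duplicates and checking the count could look something like this, assuming each item carries a unique "id" field (an assumption; the identifying field will vary by API):

    def deduplicate_and_verify(items, expected_count):
        # Keep one copy of each item, keyed on its (assumed) unique "id" field.
        unique = list({item["id"]: item for item in items}.values())
        if len(unique) != expected_count:
            # The count reported by the API doesn't match what was collected,
            # so some items are probably missing (or were added mid-fetch).
            print(f"Expected {expected_count} items, got {len(unique)}")
        return unique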