As part of our partnership with The New York Armory Show this year, we installed a number of terminals throughout the fair. These screens used our own real-time data to display an ever shifting set of trending artworks, artists, and booths, to the attendees.
Out of this work, we've open-sourced Forgetsy, a lightweight Ruby trending library. Put simply, Forgetsy implements data structures that forget. Loosely based on Bit.ly's Forget Table concept, Forgetsy uses decaying counters to track temporal trends in categorical distributions.
Anatomy of a Trend
To clarify the term 'trend', let's take this graph of cumulative artist searches over time as an example.
On the left-hand side, we see a steepening gradient (denoted by the dashed lines) for Banksy during his residency in New York (Oct 2013), but in contrast a linear rise in searches for Warhol over the same period. We define a 'trend' as this rise in the rate of observations of a particular event over a time period, which we'll call τ.
In Forgetsy, trends are encapsulated by a construct named delta. A delta consists of two sets of counters, each of which implements exponential decay of the following form.
Where the inverse of the decay rate (λ) is the lifetime of an observation in the set, τ. By normalising one set by a set with half the decay rate (or double the lifetime), we obtain a trending score for each category in a distribution. This score expresses the change in the rate of observations of a category over the lifetime of the set, as a proportion in the range [0,1].
Forgetsy removes the need for manually sliding time windows or explicitly maintaining rolling counts, as observations naturally decay away over time. It's designed for heavy writes and sparse reads, as it implements decay at read time. Each set is implemented as a redis sorted set, and keys are scrubbed when a count is decayed to near zero, providing storage efficiency.
As a result, Forgetsy handles distributions with up to around 106 active categories, receiving hundreds of writes per second, without much fuss.
Take a social network in which users can follow each other. You want to track trending users. You construct a delta with a one week lifetime, to capture trends in your follows data over one week periods:
The delta consists of two sets of counters indexed by category identifiers. In this example, the identifiers will be user ids. One set decays over the mean lifetime specified by τ, and another set decays over double the lifetime.
You can now add observations to the delta, in the form of follow events. Each time a user follows another, you increment the followed user id. You can also do this retrospectively:
1 2 3 4
Providing an explicit date is useful if you are processing data asynchronously. You can also use the
incr_by option to increment a counter in batches. You can now consult your follows delta to find your top trending users:
Each user is given a dimensionless score in the range [0,1] corresponding to the normalised follows delta over the time period. This expresses the proportion of follows gained by the user over the last week compared to double that lifetime.
Optionally fetch the top n users, or an individual user's trending score:
For more information on usage, check out the github project page.
In the Wild
In practice, we use linear, weighted combinations of deltas to produce trending scores for any given domain, such as artists. Forgetsy doesn't provide a server, but we send events to an rpc service that updates the deltas in a streamed manner. These events might include artist follows, artwork favorites, auction lot sales or individual page views.
One requirement we have is lifetime flexibility. Forgetsy lets us stipulate the trending period τ on a delta by delta basis. This allows us to lower the lifetime significantly in a fair context, in which we track trends over just a few hours, contrasted with a general art market context, in which we're interested in trends over weeks and months.
In summary, the delta structures provided by Forgetsy provide you with a simple, scalable, transparent base for a trending algorithm that can be tuned to suit the specific dynamics of the domain in question.