How Mastodon is Crawled
When a curator is first added to the group, their timeline is crawled and imported as far back as possible.
Right now each account is updated once a day. A major rewrite is almost complete and will allow the following.
Soon Mastodon.Social will be crawled every hour. The curator who has gone longest without an update is downloaded first; if their first toot has already been seen, the algorithm moves on to the next person.
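That scheduling rule can be sketched as follows. This is a minimal illustration, not the actual implementation: the record fields (`acct`, `last_crawled`, `first_toot_seen`) are assumed names for the purposes of the example.

```python
from datetime import datetime, timezone

# Hypothetical curator records (field names are assumptions, not the real schema).
curators = [
    {"acct": "alice", "last_crawled": datetime(2024, 1, 3, tzinfo=timezone.utc), "first_toot_seen": False},
    {"acct": "bob",   "last_crawled": datetime(2024, 1, 1, tzinfo=timezone.utc), "first_toot_seen": True},
    {"acct": "carol", "last_crawled": datetime(2024, 1, 2, tzinfo=timezone.utc), "first_toot_seen": False},
]

def next_curator_to_crawl(curators):
    """Return the least-recently-crawled curator whose history is incomplete."""
    # Stalest first; skip anyone whose first toot has already been seen.
    for c in sorted(curators, key=lambda c: c["last_crawled"]):
        if not c["first_toot_seen"]:
            return c
    return None

print(next_curator_to_crawl(curators)["acct"])  # carol (bob is staler, but fully imported)
```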
The software downloads a page of toots at a time and attempts to create the toots, as described here.
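Paging back through a timeline looks roughly like this. Mastodon's statuses endpoint (`/api/v1/accounts/:id/statuses`) pages backwards via a `max_id` parameter; the sketch below stubs the network call with a fake timeline so the loop structure is the point, not the HTTP details.

```python
FAKE_TIMELINE = list(range(100, 0, -1))  # toot ids, newest first

def fetch_page(max_id=None, limit=20):
    """Stand-in for the API call: up to `limit` toots strictly older than max_id."""
    toots = [t for t in FAKE_TIMELINE if max_id is None or t < max_id]
    return toots[:limit]

def crawl_timeline():
    """Walk back through the whole timeline one page at a time."""
    seen, max_id = [], None
    while True:
        page = fetch_page(max_id=max_id)
        if not page:
            break          # reached the curator's first toot
        seen.extend(page)  # the real crawler would attempt to create each toot here
        max_id = page[-1]  # oldest id on this page; continue from there
    return seen

print(len(crawl_timeline()))  # 100
```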
For each successfully created toot, the software attempts to create the referenced articles and add their default images. Sometimes the downloads fail.
Every article gets a unique import time, an integer, which is how articles are looked up; it is unlikely that more than one article will be recommended in any given second. Each article also gets a unique canonical URL of the form /the-article-name.
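A sketch of both keys, under the assumptions just stated: the integer import time is a Unix timestamp in seconds (so it relies on at most one article per second), and the canonical URL is a slug derived from the title. Neither function is the real implementation.

```python
import re
import time

def import_key():
    """Integer import time (seconds since the epoch) used to look articles up.
    Collides only if two articles arrive in the same second, which the
    one-article-per-second assumption rules out."""
    return int(time.time())

def canonical_url(title):
    """Derive a canonical URL path like /the-article-name from a title."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return "/" + slug

print(canonical_url("The Article Name!"))  # /the-article-name
```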
If the curator has been crawled within the last day, only the last day's toots are recrawled. If not, the last six days are recrawled to pick up any updates to the number of boosts.
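The recrawl-window rule reduces to a single comparison. A minimal sketch:

```python
from datetime import datetime, timedelta, timezone

def recrawl_window(last_crawled, now):
    """How far back to recrawl: one day if the curator was crawled within the
    last day, otherwise six days, to catch late changes to boost counts."""
    if now - last_crawled <= timedelta(days=1):
        return timedelta(days=1)
    return timedelta(days=6)

now = datetime(2024, 1, 10, tzinfo=timezone.utc)
print(recrawl_window(datetime(2024, 1, 9, 12, tzinfo=timezone.utc), now))  # 1 day, 0:00:00
print(recrawl_window(datetime(2024, 1, 5, tzinfo=timezone.utc), now))      # 6 days, 0:00:00
```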
It all sounds quite easy, but curiously enough it is incredibly complicated to get it all working right. I am now on the third iteration of this download software.
Built using the Forest Map Wiki