While Artsy is the largest database of Contemporary Art online, it’s not exactly “big data”. To date, we have published over 500,000 artworks by more than 50,000 artists from over 4,000 galleries and 700 museums and institutions, across more than 40,000 shows. Our team has written thousands of articles and hosted hundreds of art fairs and a few dozen auctions. We have over 1,000 genes from The Art Genome Project, too.
Just over a million web pages are generated from this data on artsy.net. Generating sitemaps to submit to Google and other search engines for a million pages never seemed like a big deal. In this post I’ll describe three generations of code, including our most recent iteration, which uses Apache Spark to generate static sitemap files in S3.