Micro-Services make sense for an engineering team of our size. You can scope a domain of your business to a particular small unit of abstraction, like an API. Doing so makes it easy to work in isolation, experiment with new ideas and evolve in many directions.
We've been carefully pushing for years to move away from our single monolithic API to a collection of smaller, more focused projects. Our highlights docs showcase this well. The movement to smaller composable services works great from an isolated platform/systems perspective, but can be a bit tricky to handle with front-end clients. Until 2018, the way that we've addressed the growing complexity in our service layer has been to migrate the complexity inside our main GraphQL API, metaphysics. Metaphysics is our GraphQL API gateway: it consolidates many API sources into a single service, then extends and interleaves their data to make clients easier to write.
However, as more services have been created, and grown - so has metaphysics. This creates a worrying trend, as the growth of code in metaphysics isn't quite linear.
Our main line of thought on how to address this is GraphQL schema stitching. We've been running experiments in stitching for over a year, and have been running with stitching enabled in production for a few months.
What is Schema Stitching?
The core idea behind schema stitching is that because GraphQL talks in type systems, you should be able to merge type systems from many GraphQL APIs into a single source of truth. Schema stitching came out at the end of 2017 via graphql-tools and became production-ready in April.
We started experimenting on staging last year and would occasionally run into edge-case issues. This meant the state of the project would ebb & flow between being blocked, or no-one having the bandwidth to work on it. This was fine, because our aim was incremental evolutions over bold revolution.
Before we dive into implementation details, here's a quick glossary of terms:
- GraphQL Type - the shape of an object exposed from your GraphQL API
- GraphQL Schema - a representation of your GraphQL API's type system, containing all types and the fields on them
- GraphQL Resolver - every field accessed in a query resolves to a corresponding value; the function doing that is a resolver
- Schema Merging - taking two GraphQL schemas, and merging all the types and resolvers into one schema
- Schema Stitching - extending a GraphQL Schema programmatically, with the ability to delegate to merged schemas
Stitching is one of the end-goals, but merging may be enough for a lot of cases. Both of the two launch posts above give a much more in-depth explanation of how everything comes together, but these should be enough for this post.
How Do We Do It?
We have 5 GraphQL APIs inside the Artsy ecosystem, and our aim is to cautiously include these APIs inside metaphysics. We don't need the entire contents of those APIs, and as you'll learn, we couldn't do that even if we wanted to.
The technique we settled on was:
- Download the schema of each external API into metaphysics' source code
- Have each schema trimmed to just the essentials that we need today
- Merge in each schema incrementally
- Stitch in any desired schema changes
Let's dig, with some code, into how we do each of these steps.
We created a pretty minimal script which can be run periodically from a developer's computer.
The script uses an apollo-http-link to grab our schema and store it in our repo, see src/data/convection.graphql. This means that when someone wants to update to a new version of the schema, it will go through code review and a normal testing flow. The trade-off is that it will always be a little out of date, but you can make guarantees about the current schema. This is a reasonable trade-off, as GraphQL schemas should always be forward compatible for queries, and when someone wants to use a new field from another service they can update the schema definition in the git repo.
This file is the GraphQL SDL representation of the entire type system for that schema. This means we have a local copy of each schema, which we can use in tests for the next few steps.
Each API is written for its own domain. This can be problematic when you use a User in one API which isn't generic enough to be a User in a global API of all services combined. When thinking about this problem, we created a guide for ourselves on how to think about schema design at a local and a global level.
We use a few of the transform APIs available in graphql-tools to make the merges work. The first approach is to force a namespace by prefixing the merged Types with their domain.
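To make the namespacing idea concrete without depending on any library, here is a minimal sketch in plain TypeScript. It is not the code that runs in metaphysics and not the graphql-tools API: `SchemaModel`, `prefixTypes` and the `Billing` service are hypothetical names, and a schema is modelled as a plain object mapping type names to their fields. Root types and built-in scalars are left alone, since every schema shares them.

```typescript
// Simplified stand-in for a schema: type name -> field name -> the field's type name.
type SchemaModel = { [typeName: string]: { [fieldName: string]: string } }

// Names that must never be prefixed: root types and GraphQL's built-in scalars.
const SHARED_NAMES = ["Query", "Mutation", "String", "Int", "Float", "Boolean", "ID"]

// Returns a copy of `schema` in which every other type is renamed to
// prefix + name, including references to that type from fields.
function prefixTypes(schema: SchemaModel, prefix: string): SchemaModel {
  const rename = (name: string): string =>
    SHARED_NAMES.indexOf(name) !== -1 ? name : prefix + name

  const result: SchemaModel = {}
  for (const typeName of Object.keys(schema)) {
    const fields: { [fieldName: string]: string } = {}
    for (const fieldName of Object.keys(schema[typeName])) {
      fields[fieldName] = rename(schema[typeName][fieldName])
    }
    result[rename(typeName)] = fields
  }
  return result
}

// Hypothetical remote API with its own, domain-specific idea of a `User`.
const billing: SchemaModel = {
  Query: { user: "User" },
  User: { id: "ID", email: "String" },
}

const namespaced = prefixTypes(billing, "Billing")
console.log(Object.keys(namespaced)) // the types are now Query and BillingUser
```

The root field `user` still exists after the rename, but it now returns a `BillingUser`, so it can no longer collide with a `User` type from another schema.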
Another example is to outright remove almost everything in the schema, and to only allow Types and fields which we know to be useful.
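The allowlist approach can be sketched the same way. This is an illustration under assumptions rather than the real transform: `SchemaModel`, `allowOnly` and the example types are hypothetical, and the schema is again a plain object instead of a real GraphQL schema.

```typescript
// Simplified stand-in for a schema: type name -> field name -> the field's type name.
type SchemaModel = { [typeName: string]: { [fieldName: string]: string } }

// Keeps only the allowlisted types, and only the allowlisted fields on the root Query type.
function allowOnly(
  schema: SchemaModel,
  allowedTypes: string[],
  allowedRootFields: string[]
): SchemaModel {
  const result: SchemaModel = {}
  for (const typeName of Object.keys(schema)) {
    if (typeName === "Query") {
      const rootFields: { [fieldName: string]: string } = {}
      for (const fieldName of Object.keys(schema[typeName])) {
        if (allowedRootFields.indexOf(fieldName) !== -1) {
          rootFields[fieldName] = schema[typeName][fieldName]
        }
      }
      result[typeName] = rootFields
    } else if (allowedTypes.indexOf(typeName) !== -1) {
      result[typeName] = schema[typeName]
    }
  }
  return result
}

// Hypothetical remote schema with more surface area than we want to expose.
const remote: SchemaModel = {
  Query: { submission: "Submission", asset: "Asset", internalReport: "Report" },
  Submission: { id: "ID", artist_id: "String" },
  Asset: { id: "ID" },
  Report: { id: "ID" },
}

const trimmed = allowOnly(remote, ["Submission"], ["submission"])
console.log(Object.keys(trimmed)) // only Query and Submission remain
```

Starting from nothing and adding names explicitly means a new type in the remote API never reaches clients until someone opts in to it.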
We can write tests for this by running a query which returns all of the types in a schema, and validating what exists:
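A check along these lines could look like the sketch below. The introspection query is standard GraphQL; everything else (`typeNames`, `checkTypes`, and the hand-written `response`) is hypothetical, and the response is hard-coded here rather than fetched from a running schema.

```typescript
// Introspection query that asks a GraphQL API for the name of every type it exposes.
const typeNamesQuery = `{ __schema { types { name } } }`

// Shape of the JSON that query returns.
interface TypeNamesResult {
  data: { __schema: { types: { name: string }[] } }
}

// Pulls the type names out of the response, skipping introspection types such as `__Schema`.
function typeNames(result: TypeNamesResult): string[] {
  return result.data.__schema.types
    .map(type => type.name)
    .filter(name => name.indexOf("__") !== 0)
}

// Lists expected types that are missing and unwanted types that leaked through.
function checkTypes(names: string[], expected: string[], unwanted: string[]) {
  return {
    missing: expected.filter(name => names.indexOf(name) === -1),
    leaked: unwanted.filter(name => names.indexOf(name) !== -1),
  }
}

// Hypothetical response from running `typeNamesQuery` against a trimmed remote schema.
const response: TypeNamesResult = {
  data: { __schema: { types: [{ name: "Query" }, { name: "Submission" }, { name: "__Schema" }] } },
}

const report = checkTypes(typeNames(response), ["Submission"], ["Artwork"])
console.log(report) // both lists are empty when the schema is as expected
```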
This one is interesting: we don't want the version of Artwork from Gravity's GraphQL implementation, because the hand-rolled Artwork type which lives in the source code of Metaphysics right now is a combination of many sources and front-end-client-specific code. If we allowed Gravity's Artwork to overwrite the existing implementation, it would be a massively breaking change. For example, compare the Artwork type from Gravity's GraphQL (5 fields) with the one from Metaphysics' GraphQL (~90 fields): accidentally switching the types would cripple our front-ends.
There are two classes of schemas involved in our stitching: Local Schemas, meaning our existing schema (i.e. the resolvers live inside the current source code), and Remote Schemas (i.e. you make an API request to run those resolvers). Merging schemas has a pretty small API surface and doesn't mind which type of schemas you merge together.
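As a rough model of what merging does, the sketch below combines the root fields of several schemas and copies their types into one. It is a plain-TypeScript illustration, not the graphql-tools merge function: `SchemaModel`, `mergeModels` and the example schemas are hypothetical. One deliberate choice, which is an assumption of this sketch and not a description of the library's behavior, is that a duplicate type name throws, so an accidental overwrite of something like Artwork is loud instead of silent.

```typescript
// Simplified stand-in for a schema: type name -> field name -> the field's type name.
type SchemaModel = { [typeName: string]: { [fieldName: string]: string } }

// Merges schemas into one: root Query fields are combined, other types are copied.
// Throws when two schemas define the same type or the same root field.
function mergeModels(schemas: SchemaModel[]): SchemaModel {
  const merged: SchemaModel = { Query: {} }
  for (const schema of schemas) {
    for (const typeName of Object.keys(schema)) {
      if (typeName === "Query") {
        for (const fieldName of Object.keys(schema[typeName])) {
          if (merged["Query"][fieldName]) throw new Error(`Duplicate root field: ${fieldName}`)
          merged["Query"][fieldName] = schema[typeName][fieldName]
        }
      } else {
        if (merged[typeName]) throw new Error(`Type conflict: ${typeName}`)
        merged[typeName] = schema[typeName]
      }
    }
  }
  return merged
}

// Hypothetical local schema (resolvers in this codebase) and remote schema (another API).
const localModel: SchemaModel = {
  Query: { artwork: "Artwork" },
  Artwork: { id: "ID", title: "String" },
}
const remoteModel: SchemaModel = {
  Query: { submission: "Submission" },
  Submission: { id: "ID", artist_id: "String" },
}

const mergedModel = mergeModels([localModel, remoteModel])
console.log(Object.keys(mergedModel["Query"])) // root fields from both schemas
```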
It's a pretty simple composition model, which makes it really easy to run verification tests using the same techniques as above.
The next step from merging is stitching. Stitching is about taking the merged schemas and taking data from one and re-applying it via another API. For example, we have a consignments API (for when you want to sell a work at auction) and a consignment references the artwork's artist. These live inside an API called convection.
In this case, the consignment has an artist_id which represents an Artist type which lives in metaphysics. We would like to stitch an Artist in from the local schema, into a ConsignmentSubmission which has come in from a remote schema.
The API works by using Type Extensions which are a way of opening up an existing Type and adding new fields on it. We want to be working with the highest level abstraction, which in this case is directly writing GraphQL SDL (basically writing the interface) and then hooking that up to its resolvers.
Here's what that looks like in our app:
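The shape of such an extension can be sketched as follows. This is an illustration, not the code from the app: the SDL uses the standard `extend type` syntax with the `ConsignmentSubmission`, `artist_id` and `Artist` names from the text, while `Delegate`, `extensionResolvers` and the in-memory `artists` data are hypothetical. The `delegate` function stands in for whatever mechanism forwards a sub-query to the local schema.

```typescript
// SDL for the extension: open up the remote type and add a field to it.
const extensionSDL = `
  extend type ConsignmentSubmission {
    artist: Artist
  }
`

interface Artist { id: string; name: string }
interface ConsignmentSubmission { id: string; artist_id: string }

// Stand-in for "run this root field, with these arguments, on the local schema".
type Delegate = (rootField: string, args: { id: string }) => Artist | null

// Resolvers for the extension, keyed by type name and then field name.
function extensionResolvers(delegate: Delegate) {
  return {
    ConsignmentSubmission: {
      // The remote submission only carries an ID, so hand it to the local `artist` field.
      artist: (submission: ConsignmentSubmission): Artist | null =>
        delegate("artist", { id: submission.artist_id }),
    },
  }
}

// Hypothetical in-memory data in place of the real local schema.
const artists: { [id: string]: Artist } = {
  "andy-warhol": { id: "andy-warhol", name: "Andy Warhol" },
}
const delegateToLocal: Delegate = (rootField, args) =>
  rootField === "artist" ? artists[args.id] || null : null

const stitchResolvers = extensionResolvers(delegateToLocal)
const stitchedArtist = stitchResolvers.ConsignmentSubmission.artist({ id: "1", artist_id: "andy-warhol" })
console.log(stitchedArtist ? stitchedArtist.name : "not found")
```

Clients then query `artist` on a submission as if it had always been there, without knowing that two services answered.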
This file consolidates the two steps of merging and then stitching:
We extend the merge schema function to also include the SDL for our stitching, and de-structure in the extension resolvers. We're still exploring how to write useful tests for this part.
Validating your changes
We had some useful tools which were used to make the switch to using schema-stitching in production.
In order to validate that the runtime behavior of our queries wasn't changing, we took the persistent queries generated by our iOS app Emission and used a script to create JSON dumps of the results of many API calls in both stitched and un-stitched environments, then compared the results.
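The comparison step can be as simple as a recursive walk over the two dumps. The sketch below is one way to do it and not the script we used: `Json`, `diffPaths` and the two example results are hypothetical.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json }

// Returns the path of every value that differs between two JSON documents.
function diffPaths(a: Json, b: Json, path: string = "$"): string[] {
  const isContainer = (value: Json): boolean => typeof value === "object" && value !== null
  // Scalars, and scalar-vs-container mismatches, are compared directly.
  if (!isContainer(a) || !isContainer(b)) return a === b ? [] : [path]
  if (Array.isArray(a) !== Array.isArray(b)) return [path]

  // Arrays are walked by index, objects by key.
  const left = a as { [key: string]: Json }
  const right = b as { [key: string]: Json }
  const keys = Object.keys(left).concat(Object.keys(right).filter(key => !(key in left)))
  let differences: string[] = []
  for (const key of keys) {
    const childPath = `${path}.${key}`
    if (!(key in left) || !(key in right)) differences.push(childPath)
    else differences = differences.concat(diffPaths(left[key], right[key], childPath))
  }
  return differences
}

// Hypothetical results of the same query in each environment.
const unstitchedResult: Json = {
  data: { artist: { name: "Andy Warhol", artworks: [{ id: "a" }, { id: "b" }] } },
}
const stitchedResult: Json = {
  data: { artist: { name: "Andy Warhol", artworks: [{ id: "a" }, { id: "c" }] } },
}

console.log(diffPaths(unstitchedResult, stitchedResult)) // one difference, at the second artwork's id
```

An empty list for every persisted query is the signal that stitching did not change what clients receive.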
SDL dump comparison
We can use the GraphQL type system to validate our changes don't break clients. We used a schema dump script to validate the type system was the same across stitched and un-stitched environments.
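A crude version of that comparison is sketched below. It is not the dump script itself: `sdlMembers` and `missingFrom` are hypothetical helpers, and they assume SDL printed with one field per line and no descriptions. They also only look at `type`, `interface`, `input` and `enum` blocks, so unions, scalars and `extend` blocks are ignored.

```typescript
// Flattens an SDL dump into entries like "Artist" and "Artist.name: String".
function sdlMembers(sdl: string): string[] {
  const members: string[] = []
  let current = ""
  for (const raw of sdl.split("\n")) {
    const line = raw.trim()
    const header = /^(type|interface|input|enum)\s+(\w+)/.exec(line)
    if (header) {
      current = header[2]
      members.push(current)
    } else if (line === "}") {
      current = ""
    } else if (current !== "" && line !== "" && line.charAt(0) !== "#") {
      members.push(`${current}.${line}`)
    }
  }
  return members
}

// Everything present in `before` but absent from `after` is a potential breaking change.
function missingFrom(before: string, after: string): string[] {
  const kept = sdlMembers(after)
  return sdlMembers(before).filter(member => kept.indexOf(member) === -1)
}

// Hypothetical dumps from the un-stitched and stitched environments.
const unstitchedSDL = `
type Artist {
  id: ID!
  name: String
}
`
const stitchedSDL = `
type Artist {
  id: ID!
}
`

console.log(missingFrom(unstitchedSDL, stitchedSDL)) // reports the dropped name field on Artist
```

Because a field's type is part of its entry, a field that changes type shows up as missing too, which is the conservative answer for clients.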
The advantage here is that we completely remove the accelerating growth of complexity in Metaphysics, because the merging and stitching occur outside of it. Metaphysics is merged and stitched in, just as all our other APIs are.
The downside is that it's another hop to get what you want, and changes could require being updated in more places. We'd be able to use the above ideas for validating that the API is working as expected.
Today we stitch all new APIs by default, see Kaws' integration PR. We're slowly trying to retroactively migrate existing APIs into stitching and then delete the existing code, but that's real tricky when those APIs are being used or use advanced features of GraphQL.
We've been using GraphQL since mid-2015, and we've also used it with Relay for the past two years. This has meant we have quite a few interesting edge cases in our use of GraphQL. We got in touch with Mikhail Novikov, who contracted to help us with most of these issues, and I'd strongly recommend doing the same (with any OSS dependency, but that's, like, just my opinion man.)
GraphQL Stitching solves the problem of API consolidation in a really well thought out abstraction, and I consider it one of the most interesting avenues of exploration into what GraphQL will be in the future (see Is GraphQL The Future? for a more philosophical take also.)