This is a guest blog post by Michael Song and Rajesh Mikkilineni at Takeda. In their own words, “Takeda is a global, values-based, R&D-driven biopharmaceutical leader committed to discover and deliver life-transforming treatments, guided by our commitment to patients, our people and the planet. Takeda’s R&D data engineering team aspires to build a robust and flexible data platform for their scientists and researchers to access data and derive value from it.”
The Global Medical Affairs team and other R&D teams at Takeda needed to access their data hub, a data repository shared across different teams, without going through the AWS Management Console. They used JupyterHub deployed on Amazon Elastic Compute Cloud (Amazon EC2) instances to access the data, which avoided the overhead of managing user permissions with AWS Identity and Access Management (IAM) roles. To enable rapid product iterations, they wanted self-service, API-based access that lets data scientists fetch their datasets with minimal interaction with the engineering team. Providing API-based access to data with the flexibility of GraphQL enables researchers to access data the way they need it.
This post explains how Takeda set up this architecture for their researchers and scientists. Takeda uses a wide range of AWS services, such as Amazon Simple Storage Service (Amazon S3) for storage, Amazon EMR for processing, and Amazon Athena for data analytics.
Overview of solution
GraphQL is a query language for APIs that enables developers to query and manipulate data from multiple data sources and other APIs easily through a flexible runtime, using an intuitive syntax that describes data requirements and interactions with the backend. GraphQL has an API component to expose and access data, and a runtime component with which you can customize your business logic directly at the API layer.
AWS AppSync is a managed serverless GraphQL service that simplifies application development by letting you create a flexible API to securely access, manipulate, and combine data from one or more data sources with a single network call. With AWS AppSync, you can build scalable applications on a range of data sources, including Amazon DynamoDB NoSQL tables, Amazon Aurora Serverless relational databases, Amazon Elasticsearch Service (Amazon ES) clusters, HTTP APIs, and serverless functions powered by AWS Lambda.
To deploy a GraphQL API on AWS AppSync, you need to define three components:
- GraphQL schema – This is where the API definition is modeled in a GraphQL schema definition language (SDL)
- Data source – This is the component that points AWS AppSync to where the data is stored (DynamoDB, Aurora, Amazon ES, Lambda, HTTP/REST APIs, or other AWS services)
- Resolvers – These provide business logic linking or resolving types or fields defined in the GraphQL schema with the data in the data sources
In this post, we first focus on setting up the GraphQL schema in AWS AppSync, then we configure the data source using Lambda. This setup provides the flexibility to fetch and transform data from multiple data sources in addition to the ones directly supported by AWS AppSync. We then provide an example of how data scientists can use JupyterHub to access data using GraphQL.
We use the COVID-19 dataset published by Johns Hopkins University. This dataset reports confirmed cases, recoveries, and deaths related to COVID-19, broken down by country, province, and state.
The following diagram illustrates the high-level architecture, in which data from Amazon S3 is served using Athena and accessed using GraphQL APIs configured in AWS AppSync.
In this post, we walk through the following steps:
- Set up an S3 bucket and its access using Athena.
- Create a GraphQL API and define the schema in AWS AppSync.
- Create a Lambda function that connects with Athena.
- Configure the Lambda function as the data source in AWS AppSync.
- Configure our GraphQL API settings in AWS AppSync.
- Access the data using JupyterHub.
Set up an S3 bucket and its access using Athena
This post assumes familiarity with creating a database in Athena. If you’re new to Athena, refer to the Getting Started guide and create a database before continuing.
The COVID-19 dataset is fetched at regular intervals and loaded as files into an S3 bucket. Set up your S3 bucket, then configure an AWS Glue crawler to connect to the bucket, infer the data structures from the files in Amazon S3, and write tables into the AWS Glue Data Catalog. Finally, set up Athena to access the data in Amazon S3 using the Data Catalog.
Create the GraphQL API and define the schema in AWS AppSync
To create your GraphQL API and define its schema, complete the following steps:
- On the AWS AppSync console, choose APIs.
- Choose Create API.
- In the Customize your API or import from Amazon DynamoDB section, select Build from scratch.
- Choose Start.
- For API name, enter a name.
- Choose Create.
- Under Define the schema, choose Edit Schema.
A GraphQL service is created by defining types and fields on those types, then providing functions for each field on each type. For example, if we want to get COVID-19 information by country, we can write a query like the following:
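As a rough illustration of what a query by country could look like, here is a minimal sketch held in a Python string, together with the JSON payload a client would send. The operation name `CovidByCountry`, the field `getCovidByCountry`, and the selected columns are assumptions for illustration, not the names used in Takeda's API.

```python
import json

# Hypothetical query: the field and column names are illustrative only.
QUERY = """
query CovidByCountry($country: String!) {
  getCovidByCountry(country: $country) {
    country
    province
    confirmed
    recovered
    deaths
  }
}
"""

# GraphQL over HTTP sends the query text and its variables as one JSON document.
payload = json.dumps({"query": QUERY, "variables": {"country": "Japan"}})
```

The client names only the fields it needs, so a researcher interested in deaths by province could drop `confirmed` and `recovered` from the selection without any change on the server side.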
The following code is the GraphQL schema for the COVID-19 dataset at Takeda:
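For orientation, a schema of this general shape might look like the sketch below. It is kept in a Python string purely for illustration, and every type and field name (`CovidRecord`, `getCovidByCountry`, and the columns) is an assumption rather than Takeda's actual schema.

```python
import re

# Hypothetical schema in GraphQL SDL; names are illustrative only.
COVID_SCHEMA = """
type CovidRecord {
  country: String!
  province: String
  confirmed: Int
  recovered: Int
  deaths: Int
}

type Query {
  getCovidByCountry(country: String!): [CovidRecord]
}

schema {
  query: Query
}
"""


def declared_types(sdl):
    """Return the names of the object types declared in an SDL string."""
    return re.findall(r"^type\s+(\w+)", sdl, flags=re.MULTILINE)
```

The `Query` type lists the entry points that clients can call, and each entry point returns one of the declared object types. The `declared_types` helper is only a quick sanity check on the text, not a GraphQL parser.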
Create a Lambda function to connect with Athena
To create the Lambda function, complete the following steps:
- On the Lambda console, choose Create function.
- Select Author from scratch.
- For Function name, enter
- For Runtime, choose Python 3.8.
- For Choose or create an execution role, select Create new role with basic Lambda permissions.
- Choose Create function.
- On the Permissions tab, update the IAM role to have a policy that gives access to Athena, AWS Glue, and to the Athena query results location in Amazon S3.
In addition, make sure the role is also associated with a policy that grants Amazon S3 read access to the bucket where the COVID-19 data file is stored. The role should also have a trust relationship with AWS AppSync as shown in the policy below:
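A trust policy of this kind can be sketched as a Python dictionary and serialized to JSON for the role's trust relationship. The service principals are the standard ones for AWS AppSync and Lambda. Listing the Lambda principal alongside AWS AppSync is an assumption here, made so that Lambda can still assume its own execution role.

```python
import json

# Sketch of a trust policy. Including lambda.amazonaws.com is an assumption;
# the text above only requires the trust relationship with AWS AppSync.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": ["lambda.amazonaws.com", "appsync.amazonaws.com"]
            },
            "Action": "sts:AssumeRole",
        }
    ],
}

# JSON text suitable for pasting into the role's trust relationship editor.
trust_policy_json = json.dumps(TRUST_POLICY, indent=2)
```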
- On the Configuration tab, replace the existing text with the following code, which uses Lambda to connect Athena and AWS AppSync:
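A function of this kind could be sketched as follows. This is not Takeda's implementation: the database, table, and column names, the results location, and the assumption that the resolver passes the GraphQL arguments under `event["arguments"]` are all hypothetical. The Athena calls (`start_query_execution`, `get_query_execution`, `get_query_results`) are standard boto3 client methods.

```python
import os
import re
import time

# Hypothetical names; replace with your own database, table, and results location.
DATABASE = os.environ.get("ATHENA_DATABASE", "covid19_db")
TABLE = os.environ.get("ATHENA_TABLE", "covid19_cases")
OUTPUT_LOCATION = os.environ.get("ATHENA_OUTPUT", "s3://example-bucket/athena-results/")
NUMERIC_COLUMNS = ("confirmed", "recovered", "deaths")


def build_sql(country=None):
    """Build the Athena query, optionally filtered by country."""
    sql = f'SELECT country, province, confirmed, recovered, deaths FROM "{TABLE}"'
    if country is not None:
        # Accept only simple names so user input cannot alter the SQL statement.
        if not re.fullmatch(r"[A-Za-z .()-]+", country):
            raise ValueError(f"unsupported country value: {country!r}")
        sql += f" WHERE country = '{country}'"
    return sql


def rows_to_dicts(result_set):
    """Convert an Athena ResultSet (first row is the header) into a list of dicts."""
    rows = result_set["Rows"]
    if not rows:
        return []
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    records = []
    for row in rows[1:]:
        record = dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for name in NUMERIC_COLUMNS:
            # Athena returns every value as text; convert the count columns to int.
            if record.get(name) is not None:
                record[name] = int(record[name])
        records.append(record)
    return records


def lambda_handler(event, context):
    import boto3  # provided by the Lambda Python runtime

    athena = boto3.client("athena")
    country = (event.get("arguments") or {}).get("country")
    query_id = athena.start_query_execution(
        QueryString=build_sql(country),
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(0.5)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")

    # Reads the first page of results only; larger results need pagination.
    results = athena.get_query_results(QueryExecutionId=query_id)
    return rows_to_dicts(results["ResultSet"])
```

Keeping the SQL construction and row conversion in separate functions makes them easy to check locally without AWS credentials. The polling loop should finish well within the function's configured timeout, so you may need to raise the timeout above the Lambda default for slower queries.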
Configure the Lambda function as a data source and resolver in AWS AppSync
Use the Lambda function created in the previous step to connect AWS AppSync with Athena.
- On the AWS AppSync console, under My AppSync App, choose Data Sources.
- Choose Create data source.
- Enter a name for your data source.
- For Data source type, choose AWS Lambda Function.
AWS AppSync can identify a DynamoDB table, Amazon ES domain, Lambda function, relational database, or HTTP endpoint as the data source. For this post, we have datasets stored in Amazon S3 and registered in Athena, so we create a Lambda function to connect AWS AppSync to Athena.
Now that we’ve registered our Lambda function as the data source and have a valid GraphQL schema, we can connect our GraphQL fields to our data source using resolvers.
- Attach a resolver to the Lambda function for the following fields:
Configure your API settings and authorization strategy
You can define any authorization strategy for your API access. In this case, we use API key-based authorization.
- On the AWS AppSync console, in the navigation pane, choose Settings.
- For API URL and API ID, enter the URL and ID to access the AWS AppSync APIs, respectively.
- For API Key, enter the authorization parameter.
These details are needed when accessing the API from external clients.
Use JupyterHub and Python to access data through GraphQL
We can now use JupyterHub and Python to access the data through GraphQL.
- Open a new Jupyter notebook.
- Import the necessary packages:
- Use the following code to load the dataset we want from the data source:
- Now we can run a query to fetch the data:
We can see in the following code that GraphQL is starting to fetch the data:
- We now load the data into a Pandas data frame:
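The notebook steps above can be sketched end to end as follows. The API URL, API key, query, and field names are placeholders, and this sketch uses the standard library's `urllib` rather than whichever HTTP package the original notebook imported. AWS AppSync expects the API key in the `x-api-key` request header.

```python
import json
import urllib.request

# Placeholders: use the API URL and API key shown on your AWS AppSync Settings page.
API_URL = "https://example.appsync-api.us-east-1.amazonaws.com/graphql"
API_KEY = "da2-exampleapikey"

# Hypothetical query; the field and column names depend on your schema.
QUERY = """
query CovidByCountry($country: String!) {
  getCovidByCountry(country: $country) {
    country
    province
    confirmed
    recovered
    deaths
  }
}
"""


def build_request(api_url, api_key, query, variables=None):
    """Build the HTTP POST request that carries a GraphQL query."""
    body = json.dumps({"query": query, "variables": variables or {}}).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=body,
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def run_query(request):
    """Send the request and return the 'data' part of the GraphQL response."""
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    if result.get("errors"):
        raise RuntimeError(result["errors"])
    return result["data"]


def to_frame(records):
    """Load a list of record dicts into a Pandas data frame."""
    import pandas as pd  # imported here so the helpers above work without pandas

    return pd.DataFrame(records)
```

In a notebook cell, the calls would then be `data = run_query(build_request(API_URL, API_KEY, QUERY, {"country": "Japan"}))` followed by `df = to_frame(data["getCovidByCountry"])`, where `getCovidByCountry` is the hypothetical field name used in the query.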
The following screenshot shows our results.
In this post, we walked through the process of deploying an AWS AppSync API and using JupyterHub to access data via GraphQL. We explored how we use Lambda to connect AWS AppSync to Athena. We then used JupyterHub to create a simple query to fetch the dataset using the GraphQL API we deployed in AWS AppSync.
About the Authors
Michael Song is a data engineer at Takeda Pharmaceuticals in Cambridge. Michael joined Takeda in 2018 and has worked on many projects within the firm’s data engineering and data science initiatives. Most recently, Michael has been working on R&D IT’s API strategy, focusing on AWS AppSync and GraphQL.
Rajesh Mikkilineni is a lead data engineer at Takeda Pharmaceuticals in Cambridge. Raj is an experienced software developer and pipeline developer who has worked on multiple projects to derive actionable insights from data. He has implemented cloud-based data platforms at multiple companies.
Karl Gutwin is the Director for Software Engineering Services at BioTeam, based in the Boston, MA area. Since 2017, Karl has worked with numerous clients on software development, cloud architecture and implementation, data management, and more. As part of Takeda’s long-standing relationship with BioTeam, Karl has been a key part of building their data infrastructure in AWS.
Anusha Dharmalingam is a Solutions Architect at Amazon Web Services, with a passion for Application Development and Big Data solutions. Anusha works with enterprise customers to help them architect, build, and scale applications to achieve their business goals.