Field Notes: Gaining Insights into Labeling Jobs for Machine Learning
In an era where more and more data is generated, it becomes critical for businesses to derive value from it. With the help of supervised learning, it is possible to generate models to automatically make predictions or decisions by leveraging historical data. For example, image recognition for self-driving cars, predicting anomalies on X-rays, fraud detection in finance and more. With supervised learning, these models learn from labeled data. The success of those models is highly dependent on readily available, high quality labeled data.
However, you might encounter cases where a high percentage of your pre-existing data is unlabeled. In these situations, providing correct labeling to previously unlabeled data points would directly translate to higher model accuracy.
Amazon SageMaker Ground Truth helps you with exactly that. It lets you build highly accurate training datasets for machine learning quickly. SageMaker Ground Truth provides your labelers with built-in workflows and interfaces for common labeling tasks. This process could take several hours or more depending on the size of your unlabeled dataset, and you might have a need to track the progress easily, preferably in the form of a dashboard.
In this blog post we show how to gain deep insights into the progress of labeling and the performance of the workers by using Amazon Athena and Amazon QuickSight. We use Amazon Athena to set up several views, each providing a specific insight into the labeling progress. We then reference these views in Amazon QuickSight to visualize the data in a dashboard.
This approach also illustrates how to combine multiple AWS services in general. AWS provides many building blocks that you can mix and match to create a unique, integrated solution with cohesive insights. In this blog post we use data produced by one service (Ground Truth), prepare it with another (Athena), and visualize it with a third (QuickSight). The following diagram shows this architecture.
Solution Architecture
Mapping a JSON structure to a table structure
Ground Truth creates several directories in your Amazon S3 output path. These directories contain the results of your labeling job and other artifacts of the job. The top-level directory for a labeling job has the same name as your labeling job, while the output directories are placed inside it. We will create all insights from what SageMaker Ground Truth calls worker responses.
All respective JSON files reside in the path s3://bucket/<job-name>/annotations/worker-response/.
To analyze the labeling data with Amazon Athena we need to understand the structure of the underlying JSON files. Let’s review the example below. For each item that was labeled, we see the label itself, followed by the submission time and a workerId pointing to an identity. This identity lives in Amazon Cognito, a fully managed service that provides the user directory for our labelers.
{ "answers": [ { "answerContent": { "crowd-classifier": { "label": "Compute" } }, "submissionTime": "2020-03-27T10:31:04.210Z", "workerId": "private.eu-west-1.1111111111111111", "workerMetadata": { "identityData": { "identityProviderType": "Cognito", "issuer": "https://cognito-idp.eu-west-1.amazonaws.com/eu-west-1_111111111", "sub": "11111111-1111-1111-1111-111111111111" } } }, ... ]
}
Although the data is stored in Amazon S3 object storage, we can use SQL to access it with Amazon Athena. Since we now understand the JSON structure shown in the preceding code, we use Athena to define how to interpret the data that is relevant to us. We do so by first creating a database using the Athena Query Editor:
CREATE DATABASE analyze_labels_db;
Once inside the database, we add the table schema. The actual files remain on Amazon S3; using the metadata catalog, Athena knows where the data lies and how to interpret it. The AWS Glue Data Catalog is a central repository for storing structural and operational metadata for all your data assets. For a given dataset, you can store its table definition and physical location, add business-relevant attributes, and track how the data has changed over time. Besides Athena, the AWS Glue Data Catalog also provides out-of-the-box integration with Amazon EMR and Amazon Redshift Spectrum. Once you add your table definitions to the Glue Data Catalog, they are available for ETL. They are also readily available for querying in Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum, so that you can have a common view of your data between these services.
When going from JSON to SQL, we are crossing format boundaries. To make the JSON-formatted data easier to read, we use SerDe properties to replace the hyphen in crowd-classifier with an underscore due to DDL constraints. Finally, we point the location to our Amazon S3 bucket containing the individual worker responses. Note in the following script that we translate the nested structure of the JSON file into a hierarchical, nested data structure in the schema definition. Also, we could leave out the workerMetadata as we don’t need it at this time. The data would still stay in the files on Amazon S3, so we could later add the workerMetadata STRUCT to the table definition for our analysis.
CREATE EXTERNAL TABLE annotations_raw (
  answers array<
    struct<
      answercontent: struct<crowd_classifier: struct<label: string> >,
      submissionTime: string,
      workerId: string,
      workerMetadata: struct<
        identityData: struct<identityProviderType: string, issuer: string, sub: string>
      >
    >
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( "mapping.crowd_classifier"="crowd-classifier" )
LOCATION 's3://<YOUR_BUCKET>/<JOB_NAME>/annotations/worker-response/'
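Before building views on top of this table, it can be useful to confirm that Athena parses the nested JSON as expected. The following is a minimal sanity check we add here for illustration; the LIMIT value is arbitrary:

-- Preview a few raw records to verify that the schema mapping works
SELECT answers
FROM annotations_raw
LIMIT 10;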
Creating Views in Athena
Now we have nested data in our annotations_raw table. For many use cases, especially analytical ones, representing data in a tabular fashion, as rows, is more natural. This is also the standard approach when using SQL and business intelligence tools. To unnest the hierarchical data into flattened rows, we create the following view, which will serve as the foundation for the other views we create. For an in-depth look into unnesting data with Amazon Athena, read this blog post.
Some of the information we’re interested in is not part of the document itself, but is encoded in the path. We use a trick in Athena: the $path variable from the Presto Hive Connector, which identifies the Amazon S3 file that contains the data returned by a specific row in an Athena table. This way we can find out which data object an annotation belongs to. Since Athena is built on top of Presto, we can use Presto’s built-in regexp_extract function to extract the iteration as well as the data object id for each labeling result. We also cast the submission time to a date in order to later determine the labeling progress per day.
CREATE OR REPLACE VIEW annotations_view AS
SELECT
  regexp_extract("$path", 'iteration-[0-9]*') AS iteration,
  regexp_extract("$path", '(iteration-[0-9]*/([0-9]*))', 2) AS dataRecord,
  answer.answercontent.crowd_classifier.label,
  cast(from_iso8601_timestamp(answer.submissionTime) AS timestamp) AS submissionTime,
  cast(from_iso8601_timestamp(answer.submissionTime) AS date) AS submissionDay,
  answer.workerId,
  answer.workerMetadata.identityData.identityProviderType,
  answer.workerMetadata.identityData.issuer,
  answer.workerMetadata.identityData.sub,
  "$path" AS path
FROM annotations_raw
CROSS JOIN UNNEST(answers) AS t(answer)
This view, annotations_view, will be the starting point for the other views we create later in this post.
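If you want to see how the two regexp_extract expressions in the view behave, you can test them against a literal string. The path below is a hypothetical example of a worker-response object key, not an output of the actual job:

-- Standalone test of the path-parsing expressions (hypothetical example path)
SELECT regexp_extract('s3://my-bucket/my-job/annotations/worker-response/iteration-1/42/response.json',
                      'iteration-[0-9]*') AS iteration,                 -- returns 'iteration-1'
       regexp_extract('s3://my-bucket/my-job/annotations/worker-response/iteration-1/42/response.json',
                      '(iteration-[0-9]*/([0-9]*))', 2) AS dataRecord   -- returns '42'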
Visualizing with QuickSight
In this section, we explore a way to visualize the views we build in Athena by pointing Amazon QuickSight to the respective view. Amazon QuickSight lets you create and publish interactive dashboards that include ML Insights. Dashboards can then be accessed from any device, and embedded into your applications, portals, and websites.
Thanks to the tight integration between Athena and QuickSight, we are able to map one dataset in QuickSight to one Athena view. In order to further optimize the performance of the dashboard, we can optionally import the datasets into the in-memory optimized calculation engine for Amazon QuickSight called SPICE. With the datasets in place we can now create an analysis in order to interact with the visuals we’re going to add. You can think of an analysis as a container for a set of related visuals. You can use multiple datasets in an analysis, although any given visual can only use one of those datasets. After you create an analysis and an initial visual, you can expand the analysis. You can do this for example by adding datasets and visuals.
Let’s start with our first insight.
Annotations per worker
We’d like to gain insights not only into the total number of labeled items but also into the contribution of each individual worker. This could give us an indication of whether the labels were created by a diverse crowd of labelers or by a few productive ones. A largely disproportionate amount of contributions from a handful of workers could mean that those workers’ individual biases are reflected in the dataset.
SageMaker Ground Truth calls a labeled data object an annotation, which is the result of a single worker’s labeling task.
Luckily we encapsulated all the heavy lifting of format conversion in the annotations_view, so that it is now easy to create a view for the annotations per user:
CREATE OR REPLACE VIEW annotations_per_user AS
SELECT COUNT(sub) AS LabeledItems,
sub AS User
FROM annotations_view
GROUP BY sub
ORDER BY LabeledItems DESC
Next we visualize this view in QuickSight. We add a visual to our analysis, select the respective dataset for the view and use the AutoGraph feature, which chooses the most appropriate visual type. Since we already arranged our view in Athena by the number of labeled items in descending order, there is no need now to sort the data in QuickSight. In the following screenshot, worker c4ef78e4... contributed more labels compared to their peers.
This view gives you an indicator for checking whether the leading worker might have introduced a bias.
Annotations per label
One thing we want to be aware of is potential imbalances between classes in our dataset. This especially affects simple machine learning models, which may learn to frequently predict a label that is massively overrepresented in the dataset. If we can identify an imbalance, we can apply mitigation actions such as upsampling data of underrepresented classes. With the following view we list the total number of annotations per label.
CREATE OR REPLACE VIEW annotations_per_label AS
SELECT COUNT(dataRecord) AS TotalLabels,
       label AS Label
FROM annotations_view
GROUP BY label
ORDER BY TotalLabels DESC, Label;
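If you prefer to see the imbalance as proportions rather than absolute counts, a small variation of this query (a sketch, not one of the views we use in the dashboard) computes each label's share of all annotations:

-- Share of total annotations per label, using a window function over the grouped counts
SELECT label AS Label,
       count(dataRecord) AS TotalLabels,
       round(100.0 * count(dataRecord) / sum(count(dataRecord)) OVER (), 1) AS PercentOfTotal
FROM annotations_view
GROUP BY label
ORDER BY TotalLabels DESC;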
As before, we create a dataset in QuickSight pointing to the annotations_per_label view, open the analysis, add a new visual and leverage the AutoGraph functionality. The result is the following visual representation:
One can clearly see that the Analytics & AI/ML class is massively underrepresented. At this point, you might want to try getting more data or think about upsampling data for that class.
Annotations per day
Seeing the total number of annotations per label and per worker is good, but we are also interested in how the labeling progress changes over time. This way we might see spikes related to labeler activations. We can also estimate how long it will take to reach a certain goal of annotations given the current pace. For this purpose we create the following view aggregating the total annotations per day.
CREATE OR REPLACE VIEW annotations_per_day AS
SELECT COUNT(datarecord) AS LabeledItems,
submissionDay
FROM annotations_view
GROUP BY submissionDay
ORDER BY submissionDay, LabeledItems DESC
This time the QuickSight AutoGraph provides us with the following line chart. You might have noticed that the axis labels do not match the column names in Athena. That is because we renamed them in QuickSight for better readability.
In the preceding chart we see that there is no consistent pace of labeling, which makes it hard to predict when a certain amount of labeled data will be reached. In this example, after a strong start the progress immediately went down. Knowing this, we might want to take action to motivate our workers to contribute more, and validate the effectiveness of these actions with the help of this chart. The spikes indicate an effective short-term action.
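If the pace were steadier, a rough completion estimate could be derived directly from the annotations_per_day view. The following sketch assumes a hypothetical target of 5,000 total annotations and simply divides the remaining work by the average daily pace observed so far; note that it only averages over days with labeling activity, so calendar gaps would make the real figure higher:

-- Estimated days remaining to reach a hypothetical target of 5,000 annotations
SELECT (5000 - sum(LabeledItems)) / avg(LabeledItems) AS EstimatedDaysRemaining
FROM annotations_per_day;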
Distribution of total annotations by user
We already have insights into annotations per worker, per label and per day. Let us now see what insights we can get from aggregating some of this information.
The bigger your labeling workforce gets, the harder it can become to see the whole picture. For that reason we will now create a histogram consisting of five buckets. Each bucket represents an interval of total annotations (for example, 0-25 annotations) mapped to the number of users whose total number of annotations lies in that interval. This allows us to get a sense of what kind of bias might be introduced if the majority of annotations is contributed by a small number of workers.
To do that, we use the Presto function width_bucket, which assigns each worker’s number of labeled data objects to one of the five buckets we defined, each with a size of 25. We define these buckets by creating an array with five elements that specify the boundaries.
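The following standalone query illustrates how width_bucket maps example values to bucket indexes; the input values are made up for illustration and the expected results follow Presto's documented behavior:

-- width_bucket returns the 1-based bucket index for a value, given the array of boundaries
SELECT width_bucket(12,  ARRAY[0, 25, 50, 75, 100]) AS bucket_for_12,   -- expected: 1 (0-25)
       width_bucket(42,  ARRAY[0, 25, 50, 75, 100]) AS bucket_for_42,   -- expected: 2 (25-50)
       width_bucket(130, ARRAY[0, 25, 50, 75, 100]) AS bucket_for_130;  -- expected: 5 (100+)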
CREATE OR REPLACE VIEW users_per_bucket_annotations AS
SELECT bucket,
       numberOfUsers,
       CASE
         WHEN bucket = 5 THEN 'B' || cast(bucket AS VARCHAR(10)) || ': ' || cast(((bucket-1) * 25) AS VARCHAR(10)) || '+'
         ELSE 'B' || cast(bucket AS VARCHAR(10)) || ': ' || cast(((bucket-1) * 25) AS VARCHAR(10)) || '-' || cast((bucket * 25) AS VARCHAR(10))
       END AS NumberOfAnnotations
FROM
  (SELECT width_bucket(labeleditems, ARRAY[0,25,50,75,100]) AS bucket,
          count(user) AS numberOfUsers
   FROM annotations_per_user
   GROUP BY 1
   ORDER BY bucket)
A SELECT * FROM users_per_bucket_annotations produces the following result:
Let’s now investigate the same data via QuickSight:
Now that we can look at the data visually, it becomes clear that we have a bimodal distribution, with many labelers having done very little and many labelers having done quite a lot. This may warrant interviewing some labelers to find out whether something is holding users back from progressing, and how we can keep engagement high over time.
Putting it all together in QuickSight
Since we created all previous visuals in one analysis, we can now use it as a central place to consume our insights in a user-friendly way. Moreover, we can share our insights with others as a read-only snapshot, which QuickSight calls a dashboard. Users who are dashboard viewers can view and filter the dashboard data as shown below:
Furthermore, you can generate a report and let QuickSight send it either once or on a schedule (daily, weekly or monthly) to your peers. This way users do not have to sign in and they can get reminders to check the progress of the labeling job. Lastly, sending out those reports is an opportunity to stay in touch with the labelers and keep the engagement high.
Conclusion
In this blog post, we have shown one example of combining multiple AWS services in order to build a solution tailored to your needs. We took the Amazon S3 output generated by SageMaker Ground Truth and showed how it can be further processed and analyzed with Athena. Finally, we created a central place to consume our insights in a user-friendly way with QuickSight. By putting it all together in a dashboard we were able to share our insights with our peers.
You can take the same pattern and apply it to other situations: take some of the many building blocks AWS provides and mix and match them to create a unique, integrated solution with cohesive insights, just as we did with Ground Truth, Athena, and QuickSight.