Tag: Analytics

Performance updates to Apache Spark in Amazon EMR 5.24 – Up to 13x better performance compared to Amazon EMR 5.16

Performance updates to Apache Spark in Amazon EMR 5.24 – Up to 13x better performance compared to Amazon EMR 5.16

Amazon EMR release 5.24.0 includes several optimizations in Spark that improve query performance. To evaluate the performance improvements, we used TPC-DS benchmark queries with 3-TB scale and ran them on a 6-node c4.8xlarge EMR cluster with data in Amazon S3. We observed up to 13X better query performance on EMR 5.24 compared to EMR 5.16 when operating with a similar configuration.

Customers use Spark for a wide array of analytics use cases ranging from large-scale transformations to streaming, data science, and machine learning. They choose to run Spark on EMR because EMR provides the latest, stable open source community innovations, performant storage with Amazon S3, and the unique cost savings capabilities of Spot Instances and Auto Scaling.

Each monthly EMR release offers the latest open source packages, alongside new features such as multiple master nodes, and cluster reconfiguration. The team also adds performance improvements with each release.

Each of those optimizations helps you run faster and reduce costs. With EMR 5.24, we have made several new optimizations and are detailing three critical ones in this post.

Setup

To get started with EMR, sign into the console, launch a cluster, and process data.

To replicate the setup for the benchmarking queries, use the following configuration:

  • Applications installed on the cluster: Ganglia, Hive, Spark, Hadoop (installed by default).
  • EMR release: EMR 5.24.0
  • Cluster configuration
    • Master instance group: 1 c4.8xlarge instance with 512 GiB of GP2 EBS storage (4 volumes of 128 GiB each)
    • Core instance group: 5 c4.8xlarge instances with 512 GiB of GP2 EBS storage (4 volumes of 128 GiB each)
Classification Properties
yarn-site yarn.nodemanager.resource.memory-mb : 53248
yarn.scheduler.maximum-allocation-vcores : 36
spark-defaults spark.executor.memory : 4743m
spark.driver.memory : 2g
spark.sql.optimizer.distinctBeforeIntersect.enabled : true
spark.sql.dynamicPartitionPruning.enabled : true
spark.sql.optimizer.flattenScalarSubqueriesWithAggregates.enabled : true
spark.executor.cores : 4
spark.executor.memoryOverhead : 890m

Results observed using TPC-DS benchmarks

The following two graphs compare the total aggregate runtime and geometric mean for all queries in the TPC-DS 3TB query dataset between the EMR releases.

The per-query runtime improvement between EMR 5.16 and EMR 5.24 is also illustrated in the following chart. The horizontal axis shows each of the queries in the TPC-DS 3 TB benchmark. The vertical axis shows the orders of magnitude of performance improvement seen in EMR 5.24.0 relative to EMR 5.16.0 as measured by query execution time. The largest performance improvements can be seen in 26 of the queries. In each of these queries, the performance was at least 2X better than EMR 5.16.

Performance optimizations in EMR 5.24

While AWS made several incremental performance improvements aggregating to the overall speedup, this post describes three major improvements in EMR 5.24 that affect the most common customer workloads:

  • Dynamic partition pruning
  • Flatten scalar subqueries
  • DISTINCT before INTERSECT

Dynamic partition pruning

Dynamic partition pruning improves job performance by selecting specific partitions within a table that must be read and processed for a query. By reducing the amount of data read and processed, queries run faster. The open source version of Spark (2.4.2) only supports pushing down static predicates that can be resolved at plan time. Examples of static predicate push down include the following:

partition_col = 5

partition_col IN (1,3,5)

partition_col BETWEEN 1 AND 3

partition_col = 1 + 3

With dynamic partition pruning turned on, Spark on EMR infers the partitions that must be read at runtime. Dynamic partition pruning is disabled by default, and can be enabled by setting the Spark property spark.sql.dynamicPartitionPruning.enabled from within Spark or when creating clusters. For more information, see Configure Spark.

Here’s an example that joins two tables and relies on dynamic partition pruning to improve performance. The store_sales table contains total sales data partitioned by region, and store_regions table contains a mapping of regions for each country. In this representative query, you want to only get data from a specific country.

SELECT ss.quarter, ss.region, ss.store, ss.total_sales 
FROM store_sales ss, store_regions sr
WHERE ss.region sr.region AND sr.country = ’North America’

Without dynamic partition pruning, this query reads all regions, before filtering out the subset of regions that match the results of the subquery. With dynamic partition pruning, only the partitions for the regions returned in the subquery are read and processed. This saves time and resources by both reading less data from storage, and processing fewer records.

The following graph shows the performance improvements to Queries 72, 80, 17, and 25 from the TPC-DS suite that we tested with 3-TB data.

Flatten scalar subqueries

This optimization can improve query performance where multiple conditions must be applied to rows from a specific table. The optimization prevents the table from being read multiple times for each condition. This optimization detects such cases, and optimizes the query to read the table only one time.

Flatten scalar subqueries is disabled by default and can be enabled by setting the Spark property spark.sql.optimizer.flattenScalarSubqueriesWithAggregates.enabled from within Spark or when creating clusters.

To give an example of how this works, use the same total_sales table from the previous optimization. In this example, you want to group stores by their average sales when their average sales are in between specific ranges.

SELECT (SELECT avg(total_sales) FROM store_sales 
WHERE total_sales BETWEEN 5000000 AND 10000000) AS group1, 
(SELECT avg(total_sales) FROM store_sales 
WHERE total_sales BETWEEN 10000000 AND 15000000) AS group2, 
(SELECT avg(total_sales) FROM store_sales 
WHERE total_sales BETWEEN 15000000 AND 20000000) AS group3  

With this optimization disabled, the total_sales table is read for each sub query. With the optimization enabled, the query is rewritten as follows to apply each of the conditions to the rows returned by reading the table only one time.

SELECT c1 AS group1, c2 AS group2, c3 AS group3 
FROM (SELECT avg (IF(total_sales BETWEEN 5000000 AND 10000000, total_sales, null)) AS c1, 
avg (IF(total_sales BETWEEN 10000000 AND 15000000, total_sales, null)) AS c2, 
avg (IF(total_sales BETWEEN 15000000 AND 20000000, total_sales, null)) AS c3 FROM store_sales);  

This optimization saves time and resources by both reading less data from storage, and processing fewer records.

To illustrate, take the example of Q9 from the TPCDS suite. The query runs 2.9x faster in version 5.24 compared to 5.16, when the relevant Spark property is switched on.

DISTINCT before INTERSECT

When producing the intersection of two collections, the result of that intersection is a set of unique values found in each collection. When dealing with large collections, many duplicate records must be both processed, and shuffled between hosts to finally calculate the intersection. This optimization eliminates duplicate values in each collection before computing the intersection, improving performance by reducing the amount of data shuffled between hosts.

This optimization is disabled by default and can be enabled by setting the Spark property spark.sql.optimizer.distinctBeforeIntersect.enabled from within Spark or when creating clusters.

For example (simplified from TPC-DS query14), you want to find all of the brands that are sold both in store and catalog sale channels. In this example, the store_sales table contains sale made through the store channel, the catalog_sales table contains sale made through catalog, and the item table contains each unique product’s formulation (e.g. brand, manufactuer).

(SELECT item.brand ss_brand FROM store_sales, item
WHERE store_sales.item_id = item.item_id)
INTERSECT
(SELECT item.brand cs_brand FROM catalog_sales, item 
WHERE catalog_sales.item_id = item.item_id) 

With this optimization disabled, the first SELECT statement produces 2,600,000 records (same number of records as store_sales) with only 1,200 unique brands. The second SELECT statement produces 1,500,000 records (same number of records as catalog_sales) with 300 unique brands. This results in all 4,100,000 rows being fed into the intersect operation to produce the 200 brands that exist in both results.

With the optimization enabled, a distinct operation is performed on each collection before being fed into the intersect operator, resulting in only 1,200 + 300 records being fed into the intersect operator. This optimization saves time and resources by shuffling less data between hosts.

Summary

With each of these performance optimizations to Apache Spark, you benefit from better query performance on EMR 5.24 compared to EMR 5.16. We look forward to feedback on how these optimizations benefit your real world workloads.

Stay tuned as we roll out additional updates to improve Apache Spark performance in EMR. To keep up-to-date, subscribe to the Big Data blog’s RSS feed to learn about more great Apache Spark optimizations, configuration best practices, and tuning advice. Be sure not to miss other great optimizations like using S3 Select with Spark, and the EMRFS S3-Optimized Committer from previous EMR releases.

 


About the Author

Paul Codding is a senior product manager for EMR at Amazon Web Services.

 

 

 

 

Peter Gvozdjak is a senior engineering manager for EMR at Amazon Web Services.

 

 

 

 

Joseph Marques is a principal engineer for EMR at Amazon Web Services.

 

 

 

 

Yuzhou Sun is a software development engineer for EMR at Amazon Web Services.

 

 

 

 

Atul Payapilly is a software development engineer for EMR at Amazon Web Services.

 

 

 

 

Surya Vadan Akivikolanu is a software development engineer for EMR at Amazon Web Services.

 

 

 

 

from AWS Big Data Blog

Introducing Amazon QuickSight fine-grained access control over Amazon S3 and Amazon Athena

Introducing Amazon QuickSight fine-grained access control over Amazon S3 and Amazon Athena

Today, AWS is excited to announce the availability of fine-grained access control for AWS Identity and Access Management (IAM)-permissioned resources in Amazon QuickSight. Fine-grained access control allows Amazon QuickSight account administrators to control authors’ default access to connected AWS resources. Fine-grained access control enables administrators to use IAM policies to scope down access permissions, limiting specific authors’ access to specific items within the AWS resources. Administrators can now apply this new level of access control to Amazon S3, Amazon Athena, and Amazon RDS/Redshift database discovery.

Fine-Grained Access Control Setup

Here’s how fine-grained access control works in Amazon QuickSight:

Imagine an AWS customer, Acme. Acme’s account contains three S3 buckets, called b1, b2, and b3. Amazon QuickSight is configured to read data via Athena. Assume that Acme’s administrators have configured Amazon QuickSight service role permissions with access to all three buckets. The new default access setting introduced today enables administrators to limit access to Acme’s data for all users by default. Administrators grant access to specific buckets (b1, b2, b3) to individual users or groups within Acme via IAM policies.

In the following example, the policies assigned to Acme users A and B, and Group X grant them access to buckets 1, 2, and 3. Group Y is not assigned, as shown in the following diagram.

When User A attempts to read data via Athena from bucket 1, AWS evaluates the IAM policy associated with the user. Since the policy assigned to User A grants access to bucket 1, the query succeeds. User A can access data from bucket 1. Similarly, User B and users in group X can access data in buckets 2 and 3, respectively.

However, when a user from group Y tries to access bucket 2, QuickSight doesn’t allow any access to data. Remember, group Y has no user-level assignments. Users from group Y are denied access to bucket 2 because Amazon QuickSight requires explicit permissions to access data, as shown in the following diagram.

In an S3 data lake, Amazon QuickSight enables administrators to restrict each author’s data access using IAM policies. (Other AWS mechanisms, outside of Amazon QuickSight, provide for the modification of the policies themselves.) This fine-grained access control relies on the underlying IAM policy evaluation infrastructure. Currently, there is a ten-policy limit on policies that may be linked together at runtime to evaluate the user’s permissions.

Amazon QuickSight also offers an updated UI for AWS permissions management. Administrators can now access an account’s default resource setting, as shown in the following screenshot:

You can set Default resource access to Allow or Deny, based on the administration model. Choose the appropriate option and press Update. (Or, if you decide not to proceed, press Cancel.)

As before, you can specify AWS resource permissions for the Amazon QuickSight account through the new UI. Select resources from the checkboxes on the right side of the screen, as shown in the following screenshot:

And, as shown in the next screenshot, S3 access can be granted to buckets within the AWS account or other AWS accounts:

To control access for specific users or groups, use the newly introduced fine-grained access control feature, illustrated in this screenshot:

The IAM policy assignments button leads you to a page displaying all assignments in the account and lets you create a new assignment, as shown in the following screenshot.

Creating a new policy assignment involves only two steps:

  1. Pick from an IAM policy from those on the AWS account list.
  2. Assign to specific users or groups.

If you haven’t yet configured groups in Amazon QuickSight, you can do so using AWS APIs for accounts using SSO or Amazon QuickSight-native credentials. Groups are also natively available in an AD-connected account.

Fine-grained access control in Amazon QuickSight, when combined with data in S3 and Athena, allows you to set up a secure environment for data exploration across the organization. This feature is available in Amazon QuickSight Enterprise Edition in all supported AWS Regions starting today.

 


About the Author

Jose Kunnackal is a principal product manager for Amazon QuickSight.

 

 

 

from AWS Big Data Blog

Why 3M Health Information Systems chose Amazon Redshift to build a healthcare analytics platform

Why 3M Health Information Systems chose Amazon Redshift to build a healthcare analytics platform

3M Health Information Systems (HIS), a business of 3M Health Care, works with providers, payers, and government agencies to anticipate and navigate a changing healthcare landscape. 3M provides healthcare performance measurement and management solutions, analysis and strategic services that help clients move from volume to value-based health care, resulting in millions of dollars in savings, improved provider performance and higher quality care. 3M’s innovative software is designed to raise the bar for computer-assisted coding, clinical documentation improvement, performance monitoring, quality outcomes reporting, and terminology management.

There was an ongoing initiative at 3M HIS to migrate applications installed on-premises or on other cloud hosting providers to Amazon Web Services (AWS). 3M HIS began migration to AWS to take advantage of compute, storage and network elasticity. We wanted to build on a solid foundation that would help us focus more of our efforts on delivering customer value while also scaling to support business growth that we expected over the next several years. 3M HIS was already processing healthcare data for many customers which is complex in nature, requiring a lot of complex transformations to get data into a format useful for analytics or machine learning.
After reviewing many solutions, 3M HIS chose Amazon Redshift as the appropriate data warehouse solution. We concluded Amazon Redshift met our needs; a fast, fully managed, petabyte-scale data warehouse solution that uses columnar storage to minimize I/O, provides high data compression rates, and offers fast performance. We quickly spun up a cluster in our development environment, built out the dimensional model, loaded data, and made it available to perform benchmarking and testing of the user data. An extract, transform, load (ETL) tool was used to process and load the data from various sources into Amazon Redshift.

After reviewing many solutions, 3M HIS chose Amazon Redshift as the appropriate data warehouse solution. We concluded Amazon Redshift met our needs; a fast, fully managed, petabyte-scale data warehouse solution that uses columnar storage to minimize I/O, provides high data compression rates, and offers fast performance. We quickly spun up a cluster in our development environment, built out the dimensional model, loaded data, and made it available to perform benchmarking and testing of the user data. An extract, transform, load (ETL) tool was used to process and load the data from various sources into Amazon Redshift.

3M legacy implementation

3M HIS processes a large volume of data through this data warehouse. We ingest healthcare data from clients, representing millions of procedure codes, diagnosis codes and all the associated meta data for each of those codes. The legacy process loaded this data into the data warehouse once every two weeks.

For reporting, we published 25 static reports and 6 static HTML reports for over 1000 customers every week. To provide business intelligence reports, we built analytical cubes on a legacy relational database and provided reports from those cubes.

With the ever-increasing amounts of data to be processed meeting the SLA’s was a challenge. It was time to replace our SQL database with a modern architecture and tooling that would auto scale based on the data to be processed and be performant.

How 3M modernized the data warehouse with Amazon Redshift

When choosing a new solution, first we had to ensure that we were able to load data in near real-time. Second, we had to ensure that the solution was scalable, because the amount of data stored in the data warehouse would be 10x larger as compared to the existing solution. Third, the solution needed to be able to provide reports for massive queries in reasonable amount of time without impacting the ETL processing that would be running 24/7. Last, we needed a cost-effective data warehouse that could integrate with other analytics services that were part of the overall solution.

We evaluated data warehouses such as Amazon Redshift and Snowflake. We chose Amazon Redshift because it fulfilled the preceding criteria and aligned with our preference for native AWS managed services. But also, we did so because Amazon Redshift would be a solution for the future that could keep pace with the growth of the business in an economically sustainable way.

To build the reporting tool, we migrated our multi-terabyte data warehouse to Amazon Redshift. The data was processed through the ETL workflow into an Amazon S3 bucket and then bulk-copied into Amazon Redshift. The GitHub repository provided by AWS, a collection of scripts and utilities, helped us set up the cluster and get the best performance possible from Amazon Redshift.

Top lessons that we learned during the implementation

During the initial development, we encountered challenges loading the data into the Amazon Redshift tables because we were trying to load the data from an Amazon RDS staging instance into Amazon Redshift. After some research, we identified that a bulk load from Amazon S3 bucket was the best practice to get large amounts of data loaded into the Amazon Redshift tables.

The second challenge was that the Amazon Redshift VACUUM and ANALYZE operations were holding up our ETL pipeline, because these operations had been baked into the ETL process. We performed frequent data loads into the Amazon Redshift tables and a lot of DELETE operations as part of the ETL processing. These two concerns meant that the VACUUM and ANALYZE operations had to be performed frequently, resulting in the tables being locked for the operations’ duration and conflicting with the ETL process. Triggering the process after all the loads were complete helped eliminate the performance issues we were encountering. VACUUM and ANALYZE have recently been automated, which we expect to prevent such issues arising in the future.

The final challenge was to find a way to use windowing functions which previously resided in the Analysis Services cube layer, whose functionality Amazon Redshift now fulfilled. However, most of the windowing functions that we need are built into Amazon Redshift allowing for an easy transition to port the existing functionality to Amazon Redshift and provide the same results.

During the port, we used Amazon Redshift’s comprehensive best practices guide and tuning techniques. We found these helped us set up the Amazon Redshift cluster for optimal performance.

Flow diagram of the new implementation

Benefits of the updated implementation

With the legacy solution, our implementation had grown complex, we found it difficult to support the growing volume of new data we needed to incorporate into the database and then report on in near real-time. Reports executing on the data were slowly drifting away from the initial SLA requirements. With Amazon Redshift, we’re working to solve those problems with less need for manual maintenance and care and feeding of the solution. First, it has the potential to allow us to store a larger quantity of data for a longer time. Second, adding nodes to the cluster when needed is simple and we can do it in minutes with the Elastic Resize feature. Likewise, we can scale back nodes when cost sensitivity is an issue. Third, Amazon Redshift also gives better support for computing analytics over large sets of grouped data than our previous solution. Often, we want to look at how recent data compares to historical data. In some cases, we want more than a year or two of historical data to rule out seasonality and we have found that Amazon Redshift is a more scalable solution for this type of operation.

Conclusion

At 3M HIS, we’re transforming health care from a system that treats disease to one that improves health and wellness with accurate health and clinical information from the start. 3M’s nearly 40 years of clinical knowledge, from data management and nosology to reimbursement and risk adjustment, opens the door for our providers and payers clients to find innovative solutions that improve outcomes across the continuum of care. We help our clients ensure accurate and compliant reimbursement, as well as leverage 3M’s analytical capabilities, powered by AWS, to improve health system and health plan performance while lowering costs.

 


About the Author

Dhanraj Shriyan is an Enterprise Data Architect at 3M Health Information Systems with a Masters in Predictive Analytics from Northwestern University, Chicago. He loves helping customers in exploring their data and providing valuable insights and implementing a scalable solution using the right database technology based on their needs. He has several years of experience in building large scale datawarehouse Business Intelligence solutions in cloud as well as on prem. Currently, exploring graph technologies and Lake formation services in AWS.

 

 

 

 

from AWS Big Data Blog

Amazon EMR Migration Guide

Amazon EMR Migration Guide

Businesses worldwide are discovering the power of new big data processing and analytics frameworks like Apache Hadoop and Apache Spark, but they are also discovering some of the challenges of operating these technologies in on-premises data lake environments. They may also have concerns about the future of their current distribution vendor.

To address this, we’ve introduced the Amazon EMR Migration Guide (first published June 2019.) This paper is a comprehensive guide to offer sound technical advice to help customers in planning how to move from on-premises big data deployments to EMR.

Common problems of on-premises big data environments include a lack of agility, excessive costs, and administrative headaches, as IT organizations wrestle with the effort of provisioning resources, handling uneven workloads at large scale, and keeping up with the pace of rapidly changing, community-driven, open-source software innovation. Many big data initiatives suffer from the delay and burden of evaluating, selecting, purchasing, receiving, deploying, integrating, provisioning, patching, maintaining, upgrading, and supporting the underlying hardware and software infrastructure.

A subtler, if equally critical, problem is the way companies’ data center deployments of Apache Hadoop and Apache Spark directly tie together the compute and storage resources in the same servers, creating an inflexible model where they must scale in lock step. This means that almost any on-premises environment pays for high amounts of under-used disk capacity, processing power, or system memory, as each workload has different requirements for these components. Typical workloads run on different types of clusters, at differing frequencies and times of day. These big data workloads should be freed to run whenever and however is most efficient, while still accessing the same shared underlying storage or data lake. See Figure 1. below for an illustration.

How can smart businesses find success with their big data initiatives? Migrating big data (and machine learning) to the cloud offers many advantages. Cloud infrastructure service providers, such as AWS offer a broad choice of on-demand and elastic compute resources, resilient and inexpensive persistent storage, and managed services that provide up-to-date, familiar environments to develop and operate big data applications. Data engineers, developers, data scientists, and IT personnel can focus their efforts on preparing data and extracting valuable insights.

Services like Amazon EMR, AWS Glue, and Amazon S3 enable you to decouple and scale your compute and storage independently, while providing an integrated, well-managed, highly resilient environment, immediately reducing so many of the problems of on-premises approaches. This approach leads to faster, more agile, easier to use, and more cost-efficient big data and data lake initiatives.

However, the conventional wisdom of traditional on-premises Apache Hadoop and Apache Spark isn’t always the best strategy in cloud-based deployments. A simple lift and shift approach to running cluster nodes in the cloud is conceptually easy but suboptimal in practice. Different design decisions go a long way towards maximizing your gains as you migrate big data to a cloud architecture.

This guide provides the best practices for:

  • Migrating data, applications, and catalogs
  • Using persistent and transient resources
  • Configuring security policies, access controls, and audit logs
  • Estimating and minimizing costs, while maximizing value
  • Leveraging the AWS Cloud for high availability (HA) and disaster recovery (DR)
  • Automating common administrative tasks

Although not intended as a replacement for professional services, this guide covers a wide range of common questions, and scenarios as you migrate your big data and data lake initiatives to the cloud.

When starting your journey for migrating your big data platform to the cloud, you must first decide how to approach migration. One approach is to re-architect your platform to maximize the benefits of the cloud. The other approach is known as lift and shift, is to take your existing architecture and complete a straight migration to the cloud. A final option is a hybrid approach, where you blend a lift and shift with re-architecture. This decision is not straightforward as there are advantages and disadvantages of both approaches.

A lift and shift approach is usually simpler with less ambiguity and risk. Additionally, this approach is better when you are working against tight deadlines, such as when your lease is expiring for a data center. However, the disadvantage to a lift and shift is that it is not always the most cost effective, and the existing architecture may not readily map to a solution in the cloud.

A re-architecture unlocks many advantages, including optimization of costs and efficiencies. With re-architecture, you move to the latest and greatest software, have better integration with native cloud tools, and lower operational burden by leveraging native cloud products and services.

This paper provides advantages and disadvantages of each migration approach from the perspective of the Apache Spark and Hadoop ecosystems. To read the paper, download the Amazon EMR Migration Guide now.

For a more general resource on deciding which approach is ideal for your workflow, see An E-Book of Cloud Best Practices for Your Enterprise, which outlines the best practices for performing migrations to the cloud at a higher level.

 


About the Author

Nikki Rouda is the principal product marketing manager for data lakes and big data at AWS. Nikki has spent 20+ years helping enterprises in 40+ countries develop and implement solutions to their analytics and IT infrastructure challenges. Nikki holds an MBA from the University of Cambridge and an ScB in geophysics and math from Brown University.

 

 

 

from AWS Big Data Blog

Amazon QuickSight updates: Multiple sheets in dashboards, axis label orientation options, and more

Amazon QuickSight updates: Multiple sheets in dashboards, axis label orientation options, and more

Today, we are pleased to announce a set of updates to Amazon QuickSight:

  • Richer dashboards with multiple sheets in your regular and embedded dashboards
  • Multiple axis label orientation options for better readability of dashboards
  • More calculations such as standard deviation, variance and conditional string functions on SPICE
  • Enhanced URL actions for supporting a broader set of interaction scenarios
  • One-click duplication of visuals for faster authoring

Multiple sheets in Dashboards

First, let’s look at the update to dashboards that allows multiple sheets, accessed via a new tab control. This allows better organization of information within the dashboard, where visuals related to specific subject areas or topics can be organized in separate sheets and distinctly identified through the tab name. As a reader, you can then navigate to a single dashboard to get a comprehensive view of all the insights related to a topic.

Each sheet within a dashboard has a distinct URL, which allows you to bookmark the specific sheet, or share a link to the specific sheet with others. Additionally, dashboards can also be setup to filter content at a dashboard level, or at a per-sheet level, with the filters being passed either via on-screen controls or using the URL.

Filtering through on-screen controls

Navigating to another dashboard and passing filters via URL

If you are authoring a dashboard, you will find that every dashboard now starts off as a single sheet. You can use the “add sheet” icon in the tabs section to add a new sheet to your dashboard. Note that the tab control only shows up on a dashboard if there is more than one sheet. When authoring, you will also be able to rename a sheet, change the order of sheets or delete a sheet. The first sheet will always be the one presented to your readers by default, so this would be where you want to provide a summary of the dashboard with KPIs, ML-Insights, or high level trends. Email reports will send the first sheet of the dashboard to all your viewers; we are working on additional rendering options for email reports.

Previously, on the Big Data Blog, we covered ways to add parameters and on-screen controls to dashboards. Parameter values are global across a dashboard, which allow you to maintain user inputs and filter content according to user selections across the dashboard. On-screen controls are specific to each sheet, since the context of these controls is often confined to the sheet itself. If you want to discreetly filter data based on user or group-level attributes, you may also use row-level security to do so.

Multi-sheet dashboards can be shared through the regular website, or can be embedded in your custom application. To embed a dashboard in your application, visit Embed interactive dashboards in your application with Amazon QuickSight.

Axis label orientation

When authoring a dashboard, you will see that Amazon QuickSight now automatically selects a horizontal or angled layout to provide better readability. You can also manually navigate to a visual’s options menu and choose “Format visual”, and then choose the axis label orientation that is most suited to your dashboard.

Duplicate visual option

When authoring a dashboard, there are often situations where you want to quickly clone a visual and make changes. With the new “Duplicate visual” option, Amazon QuickSight allows you to achieve this via a one-click action from the visual’s options menu. This new option creates a copy of the visual, along with its filters and other settings such as visual size and format, within the same sheet or a different sheet within the dashboard, thus removing the overhead of creating visuals from scratch each time.

URL action open options

URL actions allow you to configure actions outside of QuickSight that can be triggered by a user viewing a QuickSight dashboard. With today’s release we will allow you to choose how these URLs are opened, with options of overriding the viewer’s QuickSight browser tab, loading in a new tab, or a new browser window altogether.

In addition to these features that improve author and reader experience, QuickSight now supports additional statistical capabilities through new functions in calculated fields. These statistical functions include standard deviation and variance calculations and conditional functions on string fields. With standard deviation and variance, you can now determine dispersion of a metric relative to the mean. You can do this across your entire data set or partition by a select dimension. For more information, see Functions by Category in the Amazon QuickSight User Guide.

With conditional string functions on SPICE, you can now use coalesce, ifelse, isNotNull, isNull, nullif while building reports on SPICE datasets. For more information, see Calculated Field Function and Operator Reference for Amazon QuickSight in the Amazon QuickSight user guide.

All of these new features are now available in both Standard and Enterprise Editions in all supported Amazon QuickSight regions – US East (N. Virginia and Ohio), US West (Oregon), EU (Ireland), and Asia Pacific (Singapore, Sydney and Tokyo).


About the Author

Sahitya Pandiri is a technical program manager with Amazon Web Services. Sahitya has been in the product/program management for 5 years now, and has built multiple products in the retail, healthcare and analytics spaces. She enjoys problem solving, and leveraging technology to simplify processes.

 

 

 

from AWS Big Data Blog

Optimizing downstream data processing with Amazon Kinesis Data Firehose and Amazon EMR running Apache Spark

Optimizing downstream data processing with Amazon Kinesis Data Firehose and Amazon EMR running Apache Spark

For most organizations, working with ever-increasing volumes of data and incorporating new data sources can be a challenge.  Often, AWS customers have messages coming from various connected devices and sensors that must be efficiently ingested and processed before further analysis.  Amazon S3 is a natural landing spot for data of all types.  However, the way data is stored in Amazon S3 can make a significant difference in the efficiency and cost of downstream data processing.  Specifically, Apache Spark can be over-burdened with file operations if it is processing a large number of small files versus fewer larger files.  Each of these files has its own overhead of a few milliseconds for opening, reading metadata information, and closing. This overhead of file operations on these large numbers of files results in slow processing. This blog post shows how to use Amazon Kinesis Data Firehose to merge many small messages into larger messages for delivery to Amazon S3.  This results in faster processing with Amazon EMR running Spark.

Like Amazon Kinesis Data Streams, Kinesis Data Firehose accepts a maximum incoming message size of 1 MB.  If a single message is greater than 1 MB, it can be compressed before placing it on the stream.  However, at large volumes, a message or file size of 1 MB or less is usually too small.  Although there is no right answer for file size, 1 MB for many datasets would just yield too many files and file operations.

This post also shows how to read the compressed files using Apache Spark that are in Amazon S3, which does not have a proper file name extension and store back in Amazon S3 in parquet format.

Solution overview

The steps we follow in this blog post are:

  1. Create a virtual private cloud (VPC) and an Amazon S3 bucket.
  2. Provision a Kinesis data stream, and an AWS Lambda function to process the messages from the Kinesis data stream.
  3. Provision Kinesis Data Firehose to deliver messages to Amazon S3 sent from the Lambda function in step 2. This step also provisions an Amazon EMR cluster to process the data in Amazon S3.
  4. Generate test data with custom code running on an Amazon EC2
  5. Run a sample Spark program from the Amazon EMR cluster’s master instance to read the files from Amazon S3, convert them into parquet format and write back to an Amazon S3 destination.

The following diagram explains how the services work together:

The AWS Lambda function in the diagram reads the messages, append additional data to them, and compress them with gzip before sending to Amazon Kinesis Data Firehose. The reason for this is most customers need some enrichment to the data before arriving to Amazon S3.

Amazon Kinesis Data Firehose can buffer incoming messages into larger records before delivering them to your Amazon S3 bucket. It does so according to two conditions, buffer size (up to 128 MB) and buffer interval (up to 900 seconds). Record delivery is triggered once either of these conditions has been satisfied.

An Apache Spark job reads the messages from Amazon S3, and stores them in parquet format. With parquet, data is stored in a columnar format that provides more efficient scanning and enables ad hoc querying or further processing by services like Amazon Athena.

Considerations

The maximum size of a record sent to Kinesis Data Firehose is 1,000 KB. If your message size is greater than this value, compressing the message before it is sent to Kinesis Data Firehose is the best approach. Kinesis Data Firehose  also offers compression of messages after they are written to the Kinesis Data Firehose data stream. Unfortunately, this does not overcome the message size limitation, because this compression happens after the message is written. When Kinesis Data Firehose delivers a previously compressed message to Amazon S3 it is written as an object without a file extension. For example, if a message is compressed with gzip before it is written to Kinesis Data Firehose, it is delivered to Amazon S3 without the .gz extension. This is problematic if you are using Apache Spark for downstream processing because a “.gz” extension is required.

We will see how to overcome this issue by reading the files using the Amazon S3 API operations later in this blog.

Prerequisites and assumptions

To follow the steps outlined in this blog post, you need the following:

  • An AWS account that provides access to AWS services.
  • An AWS Identity and Access Management (IAM) user with an access key and secret access key to configure the AWS CLI.
  • The templates and code are intended to work in the US East (N. Virginia) Region only.

Additionally, be aware of the following:

  • We configure all services in the same VPC to simplify networking considerations.
  • Important: The AWS CloudFormation templates and the sample code that we provide use hardcoded user names and passwords and open security groups. These are for testing purposes only. They aren’t intended for production use without any modifications.

Implementing the solution

You can use this downloadable template for single-click deployment. This template is launched in the US East (N. Virginia) Region by default. Do not change to a different Region. The template is designed to work only in the US East (N. Virginia) Region. To launch directly through the console, choose the Launch Stack button.

This template takes the following parameters. Some of the parameters have default values, and you can’t edit these. These predefined names are hardcoded in the code. For some of the parameters, you must provide the values. The following table provides additional details.

For this parameter Provide this
StackName Provide the stack name.

ClientIP

 

The IP address range of the client that is allowed to connect to the cluster using SSH.
FirehoseDeliveryStreamName The name of the Amazon Firehose delivery stream. Default value is set to “AWSBlogs-LambdaToFireHose”.
InstanceType The EC2 instance type.
KeyName The name of an existing EC2 key pair to enable access to login.
KinesisStreamName The name of the Amazon Kinesis Stream. Default value is set to “AWS-Blog-BaseKinesisStream”
Region AWS Region – By default it is us-east-1 — US East (N. Virginia). Do not change this as the scripts are developed to work in this Region only.

EMRClusterName

 

A name for the EMR cluster.
S3BucketName The name of the bucket that is created in your account. Provide some unique name to this bucket. This bucket is used for storing the messages and output from the Spark code.

After you specify the template details, choose Next. On the options page, choose Next again. On the Review page, select the check box for I acknowledge that AWS CloudFormation might create IAM resources with custom names and for I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND. And then click on the Create button.

If you use this one-step solution, you can skip to Step 7: Generate test dataset and load into Kinesis Data Streams.

To create each component individually, use the following steps.

1. Use the AWS CloudFormation template to configure Amazon VPC and create an Amazon S3 bucket

In this step, we set up a VPC, public subnet, internet gateway, route table, and a security group. The security group has two inbound access rules. The first inbound rule allows access to the TCP port 22 (SSH) from the provided client IP CIDR range and the second inbound rule allows access to any TCP port from any host with in the same security group. We use this VPC and subnet for all other services that are created in the next steps. In addition to these resources, we will also create a standard Amazon S3 bucket with a provided bucket name to store the incoming data and processed data. You can use this downloadable AWS CloudFormation template to set up the previous components. To launch directly through the console, choose Launch Stack.

This template takes the following parameters. The following table provides additional details.

For this parameter Do this
StackName Provide the stack name.
S3BucketName Provide a unique Amazon S3 bucket. This bucket is created in your account.
ClientIp Provide a CIDR IP address range that is added to inbound rule of the security group. You can get your current IP address from “checkip.amazon.com” web url.

After you specify the template details, choose Next. On the Review page, choose Create.

When the stack launch is complete, it should return outputs similar to the following.

Key Value
StackName Name
VPCID Vpc-xxxxxxx
SubnetID subnet-xxxxxxxx
SecurityGroup sg-xxxxxxxxxx
S3BucketDomain <S3_BUCKET_NAME>.s3.amazonaws.com
S3BucketARN arn:aws:s3:::<S3_BUCKET_NAME>

Make a note of the output, because you use this information in the next step. You can view the stack outputs on the AWS Management Console or by using the following AWS CLI command:

$ aws cloudformation describe-stacks --stack-name <stack_name> --region us-east-1 --query 'Stacks[0].Outputs'

2.  Use the AWS CloudFormation template to create necessary IAM Roles

In this step, we set up two AWS IAM roles. One of the IAM roles will be used by an AWS Lambda function to allow access to Amazon S3 service, Amazon Kinesis Data Firehose, Amazon CloudWatch Logs, and Amazon EC2 instances.  The second IAM role is used by the Amazon Kinesis Data Firehose service to access Amazon S3 service. You can use this downloadable CloudFormation template to set up the previous components. To launch directly through the console, choose Launch Stack.

This template takes the following parameters. The following table provides additional details.

For this parameter Do this
StackName Provide the stack name.

After you specify the template details, choose Next. On the options page, choose Next again. On the Review page, select the check box for I acknowledge that AWS CloudFormation might create IAM resources with custom names. Choose Create.

When the stack launch is complete, it should return outputs similar to the following.

Key Value
LambdaRoleArn arn:aws:iam::<ACCOUNT_NUMBER>:role/small-files-lamdarole
FirehoseRoleArn arn:aws:iam::<ACCOUNT_NUMBER>:role/small-files-firehoserole

When the stack launch is complete, it returns the output with information about the resources that were created. Make a note of the output, because you use this information in the next step. You can view the stack outputs on the AWS Management Console or by using the following AWS CLI command:

$ aws cloudformation describe-stacks --stack-name <stack_name> --region us-east-1 --query 'Stacks[0].Outputs'

3. Use an AWS CloudFormation template to configure the Amazon Kinesis Data Firehose data stream

In this step, we set up Amazon Kinesis Data Firehose with Amazon S3 as destination for the incoming messages. We select the Uncompressed option for compression format, buffering options with 128 MB size and interval seconds of 300. You can use this downloadable AWS CloudFormation template to set up the previous components. To launch directly through the console, choose Launch Stack.

This template takes the following parameters. The following table provides additional details.

For this parameter Do this
StackName Provide the stack name.
FirehoseDeliveryStreamName Provide the name of the Amazon Kinesis Data Firehose delivery stream. The default value is set to “AWSBlogs-LambdaToFirehose”
Role Provide the Kinesis Data Firehose IAM role ARN that was created as part of step 2.
S3BucketARN Select the S3BucketARN. You can get this from the step 1 AWS CloudFormation output.

After you specify the template details, choose Next. On the options page, choose Next again. On the Review page, choose Create.

4. Use an AWS CloudFormation template to create a Kinesis data stream and a Lambda function

In this step, we set up a Kinesis data stream and an AWS Lambda function. We can use the AWS Lambda function to process incoming messages in a Kinesis data stream. An event source mapping is also created as part of this template. This adds a trigger to the AWS Lambda function for the Kinesis data stream source. For more information about creating event source mapping, see Creating an Event Source Mapping. This Kinesis data stream is created with 10 shards and the Lambda function is created with a Java 8 runtime. We allocate memory size of 1920 MB and timeout of 300 seconds. You can use this downloadable AWS CloudFormation template to set up the previous components. To launch directly through the console, choose Launch Stack.

This template takes the following parameters. The following table provides details.

For this parameter Do this
StackName Provide the stack name.
KinesisStreamName Provide the name of the Amazon Kinesis stream. Default value is set to ‘AWS-Blog-BaseKinesisStream’
Role Provide the IAM Role created for Lambda function as part of the second AWS CloudFormation template. Get the value from the output of second AWS CloudFormation template.
S3Bucket Provide the existing Amazon S3 bucket name that was created using first AWS CloudFormation template. Do not use the domain name. Provide the bucket name only.
Region Select the AWS Region. By default it is us-east-1 — US East (N. Virginia).

After you specify the template details, choose Next. On the options page, choose Next again. On the Review page, choose Create.

5. Use an AWS CloudFormation template to configure the Amazon EMR cluster

In this step, we set up an Amazon EMR 5.16.0 cluster with “Spark”, “Ganglia” and “Hive” applications. We create this cluster with one master and two core nodes, and use an r4.xlarge instance type. The template uses an AWS Glue metastore for the Amazon EMR hive metastore. This Amazon EMR cluster is used to process the messages in Amazon S3 bucket that are created by the Amazon Kinesis Data Firehose data stream. You can use this downloadable AWS CloudFormation template to set up the previous components. To launch directly through the console, choose Launch Stack.

This template takes the following parameters. The following table provides additional details.

For this parameter Do this
EMRClusterName Provide the name for the EMR cluster.
ClusterSecurityGroup Select the security group ID that was created as part of the first AWS CloudFormation template.
ClusterSubnetID Select the subnet ID that was created as part of the first AWS CloudFormation template.
AllowedCIDR Provide the IP address range of the client that is allowed to connect to the cluster.
KeyName Provide the name of an existing EC2 key pair to access the Amazon EMR cluster.

After you specify the template details, choose Next. On the options page, choose Next again. On the Review page, choose Create.

When the stack launch is complete, it should return outputs similar to the following.

Key Value
EMRClusterMaster ssh [email protected] -i <KEY_PAIR_NAME>.pem

Make a note of the output, because you use this information in the next step. You can view the stack outputs on the AWS Management Console or by using the following AWS CLI command:

$ aws cloudformation describe-stacks --stack-name <stack_name> --region us-east-1 --query 'Stacks[0].Outputs'

6. Use an AWS CloudFormation template to create an Amazon EC2 Instance to generate test data

In this step, we set up an Amazon EC2 instance and install open-jdk version 1.8. The AWS CloudFormation script that creates this EC2 instance runs two additional steps. First, it downloads and installs open-jdk version 1.8. Second, it downloads a Java program jar file on to the EC2 instance’s ec2-user home directory. We use this Java program to generate test data messages with an approximate size of ~900 KB. We then send them to the Kinesis data stream that was created as part of the previous steps. The Java jar file name is: “sample-kinesis-producer-1.0-SNAPSHOT-jar-with-dependencies.jar”.

You can use this downloadable AWS CloudFormation template to set up the previous components. To launch directly through the console, choose Launch Stack.

This template takes the following parameters. The following table provides additional details.

For this parameter Do this
EC2SecurityGroup Select the security group ID that was created from the first AWS CloudFormation template.
EC2Subnet Select the subnet that was created from the first AWS CloudFormation template.
InstanceType Select the provided instance type. By default, it selects r4.4xlarge instance.
KeyName Name of an existing EC2 key pair to enable SSH access to the EC2 instance.

After you specify the template details, choose Next. On the options page, choose Next again. On the Review page select “I acknowledge that AWS CloudFormation might create IAM resources with custom names” option and, click Create button.

When the stack launch is complete, it should return outputs similar to the following.

Key Value
EC2Instance ssh [email protected]<Public-IP> -i <KEY_PAIR_NAME>.pem

Make a note of the output, because you use this information in the next step. You can view the stack outputs on the AWS Management Console or by using the following AWS CLI command:

$ aws cloudformation describe-stacks --stack-name <stack_name> --region us-east-1 --query 'Stacks[0].Outputs'

7. Generate the test dataset and load into Kinesis Data Streams

After all of the previous AWS CloudFormation stacks are created successfully, log in to the EC2 instance that was created as part of the step 6. Use the “ssh” command as shown in the CloudFormation stack template output. This template copies the “sample-kinesis-producer-1.0-SNAPSHOT-jar-with-dependencies.jar” file, which we use to generate the test data and send to Amazon Kinesis Data Streams. You can find the code corresponding to this sample Kinesis producer in this Git repository.

Make sure your EC2 instance’s security group allows ssh port 22 (Inbound) from your IP address. If not, update your security group inbound access.

  • ssh [email protected]<Public IP Address of the EC2 Instance> -i <SSH_KEY_PAIR_NAME>.pem

Run the following commands to generate some test data.

$ cd;

 

$ ls -ltra sample-kinesis-producer-1.0-SNAPSHOT-jar-with-dependencies.jar

-rwxr-xr-x 1 ec2-user ec2-user 27536802 Oct 29 21:19 sample-kinesis-producer-1.0-SNAPSHOT-jar-with-dependencies.jar

 

$java -Xms1024m -Xmx25600m -XX:+UseG1GC -cp sample-kinesis-producer-1.0-SNAPSHOT-jar-with-dependencies.jar com.optimize.downstream.entry.Main 10000

 

This java program uses PutRecords API method that allows many records to be sent with a single HTTP request. For more information on this you can check this AWS blog. Once you run the above java program, you will see the below output that shows messages are in the process of sending to Kinesis Data Stream.

“Starting producer and consumer.....
Inserting a message into blocking queue before sending to Kinesis Firehose and Message number is : 0
Producer Thread # 9 is going to sleep mode for 500 ms.
Inserting a message into blocking queue before sending to Kinesis Firehose and Message number is : 1
Inserting a message into blocking queue before sending to Kinesis Firehose and Message number is : 2
Inserting a message into blocking queue before sending to Kinesis Firehose and Message number is : 3
::
::
Record sent to Kinesis Stream. Record size is ::: 5042850 KB
Sending a record to Kinesis Stream with 5 messages grouped together.
Record sent to Kinesis Stream. Record size is ::: 5042726 KB
Sending a record to Kinesis Stream with 5 messages grouped together.
Record sent to Kinesis Stream. Record size is ::: 5042729 KB”

When running the sample Kinesis producer jar, notice that the number of messages is 10,000. This program generates the test data messages and is not a replacement for your load testing tool. This is created to demonstrate the use case presented in this post.

After all of the messages generated and sent to Amazon Kinesis Data Streams, program will exit gracefully.

The sample JSON input message format is shown as follows:

   "processedDate":"2018/10/30 19:05:19",
   "currentDate":"2018/10/30 19:05:07",
   "hashDeviceId":"0c2745e4-c2d6-4d43-8339-9c2401e80e92",
   "deviceId":"94581b5f-a117-484a-8e3c-4fcc2dbd53b7",
   "accelerometerSensorList":[  
      {  
         "accelerometer_Y":8,
         "gravitySensor_X":5,
         "accelerometer_X":9,
         "gravitySensor_Z":4,
         "accelerometer_Z":1,
         "gravitySensor_Y":5,
         "linearAccelerationSensor_Z":3,
         "linearAccelerationSensor_Y":9,
         "linearAccelerationSensor_X":9
      },
      {  
         "accelerometer_Y":1,
         "gravitySensor_X":3,
         "accelerometer_X":5,
         "gravitySensor_Z":5,
         "accelerometer_Z":7,
         "gravitySensor_Y":9,
         "linearAccelerationSensor_Z":6,
         "linearAccelerationSensor_Y":5,
         "linearAccelerationSensor_X":3
      },
 {
   …
 },
 {
   …
 },
 :
 :
   ],
   "tempSensorList":[  
      {  
         "kelvin":585.4928040286752,
         "celsius":43.329574923775425,
         "fahrenheit":50.13864584530086
      },
      {  
         "kelvin":349.95625855125814,
         "celsius":95.68423052685313,
         "fahrenheit":7.854854574219985
      },
 {
   …
 },
 {
   …
 },
 :
 :
 
   ],
   "illuminancesSensorList":[  
      {  
         "illuminance":44.65135784368194
      },
      {  
         "illuminance":98.15404017082403
      },
 {
   …
 },
 {
   …
 },
 :
 :
   ],
   "gpsSensorList":[  
      {  
         "altitude":4.38273213294682,
         "heading":7.416314616289915,
         "latitude":5.759723677991661,
         "longitude":1.4732885894731842
      },
      {  
         "altitude":9.816473807569487,
         "heading":5.118919157684835,
         "latitude":3.581361614110458,
         "longitude":1.3699272610616127
      },
 {
   …
 },
 {
   …
 },
 :
 :
   }

 

Log in to the Kinesis Data Streams console, then choose the Kinesis data stream that was created as part of the step 4.  Choose the Monitor tab to see the graphs. Run the data generation utility for at least 15 mins to generate enough data.

8. Processing Kinesis Data Streams messages using AWS Lambda

As part of the previously-described setup, we also use an AWS Lambda function (name:LambdaForProcessingKinesisRecords) to process the messages from the Kinesis data stream. This Lambda function reads each message content and appends “additional data.” This demonstrates that the incoming message from Kinesis data stream is read, and appended with some additional information to make the message size more than 1 MB.  Several customers have a use case like this to enrich the incoming messages by adding additional information. After the AWS Lambda function appends additional data to incoming messages, it sends them to Amazon Kinesis Data Firehose. Because Kinesis Data Firehose accepts only messages that are less than 1 MB, we must compress the messages before sending to it. In the Lambda function, we are compressing the message using gzip compression before sending it to Kinesis Data Firehose. In addition to compressing each message, we are also appending a new line character (“/n”) to each message after compressing it to separate the messages.

We set the buffer size to 128 MB and duration of the buffer is 900 seconds while creating the Kinesis Data Firehose. This helps merge the incoming compressed messages into larger messages and sends to the provided Amazon S3 bucket.

The AWS Lambda function appends the following content to the original message in Kinesis Data Streams after reading it.

"testAdditonalDataList": [
  {
    "dimesnion_X": 9,
    "dimesnion_Y": 2,
    "dimesnion_Z": 2
  },
  {
    "dimesnion_X": 3,
    "dimesnion_Y": 10,
    "dimesnion_Z": 5
  }
  {
    …
  },
  {
    …
  },
  :
  :
]

If we do not compress the message before sending to Kinesis Data Firehose, it throws this error message in the Amazon CloudWatch Logs.

Here is the code snippet where we are compressing the message in the AWS Lambda function. The complete code can be found in this Git repository.

private String sendToFireHose(String mergedJsonString)
{
    PutRecordResult res = null;
    try {
        //To Firehose -
        System.out.println("MESSAGE SIZE BEFORE COMPRESSION IS : " + mergedJsonString.toString().getBytes(charset).length);
        System.out.println("MESSAGE SIZE AFTER GZIP COMPRESSION IS : " + compressMessage(mergedJsonString.toString().getBytes(charset)).length);
        PutRecordRequest req = new PutRecordRequest()
                .withDeliveryStreamName(firehoseStreamName);

        // Without compression - Send to Firehose
        //Record record = new Record().withData(ByteBuffer.wrap((mergedJsonString.toString() + "\r\n").getBytes()));

        // With compression - send to Firehose
        Record record = new Record().withData(ByteBuffer.wrap(compressMessage((mergedJsonString.toString() + "\r\n").getBytes())));
        req.setRecord(record);
        res = kinesisFirehoseClient.putRecord(req);
    }
    catch (IOException ie) {
        ie.printStackTrace();
    }
    return res.getRecordId();
}

You can check the provided bucket to see if the messages are flowing into the bucket. The Amazon S3 bucket should show something similar to the following example:

You see the files generated from Kinesis Data Firehose that do not have any extension. By default, Kinesis Data Firehose does not provide any extension to the files that are generated in Amazon S3 bucket unless you select a compression option. But in our use case, since the size of the uncompressed input message is greater than 1 MB, we are compressing it before sending to Kinesis Data Firehose. As the message is already compressed, we are not selecting any compression option in Kinesis Data Firehose, as it double-compresses the message and the downstream Spark application cannot process this.

9. Reading and converting the data into parquet format using Apache Spark program with Amazon EMR

As we noted down from the previous screen shot, Kinesis Data Firehose by default does not generate any file extensions to the files that are written into Amazon S3 bucket. This creates a problem while reading the files using Apache Spark. Apache Spark, by default, checks for a valid file name extension if the file is compressed. In this case for gzip compression, it looks for <filename>.gz to successfully read it.

To overcome this issue, we can use Amazon S3 API operations, particularly AmazonS3Client class, to list all the Amazon S3 keys and use Spark’s parallelize method to read the contents of the files. After reading the file content, we can uncompress it using GZipInputStream class. You can find the code snippet below. The complete code can be found in the Git repository.

val allLinesRDD = spark.sparkContext.parallelize(s3ObjectKeys).flatMap
{ key => Source.fromInputStream
  (
   new GZipInputStream(s3Client.getObject(bucketName, key).getObjectContent:   InputStream)
  ).getLines 
}

var finalDF = spark.read.json(allLinesRDD).toDF()

Once the Amazon EMR cluster creation is completed successfully, login to the Amazon EMR master machine using the following command. You can get the “ssh” login command from the AWS CloudFormation stack 5 (step 5) outputs parameter “EMRClusterMaster”.

  • ssh [email protected] -i <KEYPAIR_NAME>.pem
  • Make sure the security port 22 is opened to connect to the Amazon EMR master machine.

Run the Spark program using the following Spark submit command.

spark-submit --class com.optimize.downstream.process.ProcessFilesFromS3AndConvertToParquet --master yarn --deploy-mode client s3://aws-bigdata-blog/artifacts/aws-blog-optimize-downstream-data-processing/appjars/spark-process-1.0-SNAPSHOT-jar-with-dependencies.jar <S3_BUCKET_NAME> fromfirehose/<YEAR>/ output-emr-parquet/

Change the S3_BUCKET_NAME and YEAR values from the previous Spark command.

Argument # Property Value
1 –class com.optimize.downstream.process.ProcessFilesFromS3AndConvertToParquet
2 –master yarn
3 –deploy-mode client
4 s3://aws-bigdata-blog/artifacts/aws-blog-avoid-small-files/appjars/spark-process-1.0-SNAPSHOT-jar-with-dependencies.jar
5 S3_BUCKET_NAME The Amazon S3 bucket name that was created as part of the AWS CloudFormation template. The source files are created in this bucket.
6 <INPUT S3 LOCATION> “fromfirehose/<YYYY>/”. The input files are created in this Amazon S3 key location under the bucket that was created. “YYYY” represents the current year. For example, “fromfirehose/2018/”
7 <OUTPUT S3 LOCATION> Provide an output directory name that will be created under the above provided Amazon S3 bucket. For example: “output-emr-parquet/”

 

When the program finishes running, you can check the Amazon S3 output location to see the files that are written in parquet format.

Cleaning up after the migration

After completing and testing this solution, clean up the resources by stopping your tasks and deleting the AWS CloudFormation stacks. The stack deletion fails if you have any files in the created Amazon S3 bucket. Make sure that you cleaned up the Amazon S3 bucket that was created before deleting the AWS CloudFormation templates.

Conclusion

In this post, we described the process of avoiding small file creation in Amazon S3 by sending the incoming messages to Amazon Kinesis Data Firehose. We also went through the process of reading and storing the data in parquet format using Apache Spark with an Amazon EMR cluster.

 


About the Author

Photo of Srikanth KodaliSrikanth Kodali is a Sr. IOT Data analytics architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on building IoT data and analytics solutions, helping them improve the value of their solutions when using AWS.

 

 

from AWS Big Data Blog

How to export an Amazon DynamoDB table to Amazon S3 using AWS Step Functions and AWS Glue

How to export an Amazon DynamoDB table to Amazon S3 using AWS Step Functions and AWS Glue

In typical AWS fashion, not a week had gone by after I published How Goodreads offloads Amazon DynamoDB tables to Amazon S3 and queries them using Amazon Athena on the AWS Big Data blog when the AWS Glue team released the ability for AWS Glue crawlers and AWS Glue ETL jobs to read from DynamoDB tables natively. I was actually pretty excited about this. Less code means fewer bugs. The original architecture had been around for at least 18 months and could be simplified significantly with a little bit of work.

Refactoring the data pipeline

The AWS Data Pipeline architecture outlined in my previous blog post is just under two years old now. We had used data pipelines as a way to back up Amazon DynamoDB data to Amazon S3 in case of a catastrophic developer error. However, with DynamoDB point-in-time recovery we have a better, native mechanism for disaster recovery. Additionally, with data pipelines we still own the operations associated with the clusters themselves, even if they are transient. A common challenge is keeping our clusters up with recent releases of Amazon EMR to help mitigate any outstanding bugs. Another is the inefficiency of needing to spin up an EMR cluster for each DynamoDB table.

I decided to take a step back and list the capabilities I wanted to have in the next iteration:

  • Export tables using AWS Glue instead of EMR.
    • AWS Glue provides a serverless ETL environment where I don’t have to worry about the underlying infrastructure. This minimizes operational tasks like keeping up with the EMR release tags.
  • Use a workflow solution that works across services like AWS Glue and Amazon Athena.
    • In the first iteration, the workflow was spread across various services. Unless you had the entire pipeline in your head, it was difficult to get a bird’s-eye view of how the pipeline was progressing.
  • Ability to select different formats.
    • For data engineering, I prefer Apache Parquet. However, customers might prefer a different format.
  • Add exported data to Athena.
    • I find that the easier it is for the data to be queried, the more likely it’s used.

Architecture overview

At a high level, this is the architecture:

  • We’re using AWS Step Functions as the workflow engine.
    • Each step is either a built-in Step Functions state, a service integration, or a simple Python AWS Lambda For example, GlueStartJobRun is using the synchronous job run service integration, as discussed in the documentation.
    • We get a visual representation of the entire pipeline.
    • It’s quick to onboard new developers.
  • An event in Amazon CloudWatch Events, which is disabled to start, triggers a Step Functions state machine with a JSON payload that contains the following:
    • AWS Glue job name
    • Export destination
    • DynamoDB table name
    • Desired read percentage
    • AWS Glue crawler name
  • AWS Glue exports a DynamoDB table in your preferred format to S3 as snapshots_your_table_name. The data is partitioned by the snapshot_timestamp
  • An AWS Glue crawler adds or updates your data’s schema and partitions in the AWS Glue Data Catalog.
  • Finally, we create an Athena view that only has data from the latest export snapshot.

A simple AWS Glue ETL job

The script that I created accepts AWS Glue ETL job arguments for the table name, read throughput, output, and format. Behind the scenes, AWS Glue scans the DynamoDB table. AWS Glue makes sure that every top-level attribute makes it into the schema, no matter how sparse your attributes are (as discussed in the DynamoDB documentation).

Here’s the script:

import sys
import datetime
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext

ARG_TABLE_NAME = "table_name"
ARG_READ_PERCENT = "read_percentage"
ARG_OUTPUT = "output_prefix"
ARG_FORMAT = "output_format"

PARTITION = "snapshot_timestamp"

args = getResolvedOptions(sys.argv,
  [
    'JOB_NAME',
    ARG_TABLE_NAME,
    ARG_READ_PERCENT,
    ARG_OUTPUT,
    ARG_FORMAT
  ]
)

table_name = args[ARG_TABLE_NAME]
read = args[ARG_READ_PERCENT]
output_prefix = args[ARG_OUTPUT]
fmt = args[ARG_FORMAT]

print("Table name:", table_name)
print("Read percentage:", read)
print("Output prefix:", output_prefix)
print("Format:", fmt)

date_str = datetime.datetime.utcnow().strftime('%Y-%m-%dT%H:%M')
output = "%s/%s=%s" % (output_prefix, PARTITION, date_str)

sc = SparkContext()
glueContext = GlueContext(sc)

table = glueContext.create_dynamic_frame.from_options(
  "dynamodb",
  connection_options={
    "dynamodb.input.tableName": table_name,
    "dynamodb.throughput.read.percent": read
  }
)

glueContext.write_dynamic_frame.from_options(
  frame=table,
  connection_type="s3",
  connection_options={
    "path": output
  },
  format=fmt,
  transformation_ctx="datasink"
)

There’s not a lot here. We’re creating a DynamicFrameReader of connection type dynamodb and passing in the table name and desired maximum read throughput consumption. We pass that data frame to a DynamicFrameWriter that writes the table to S3 in the specified format.

Athena views

Most teams at Amazon own applications that have multiple DynamoDB tables, including my own team. Our current application uses five primary tables. Ideally, at the end of an export workflow you can write simple, obvious queries across a consistent view of your tables. However, each exported table is partitioned by the timestamp from when the table was exported. This makes querying across one or more tables very cumbersome, because you have to add a WHERE snapshot_timestamp = clause to every table reference in your query. Additionally, each table might have a different snapshot_timestamp value for any given day!

The final step in this export workflow creates an Athena view that adds that WHERE clause for you. This means that you can interact with your DynamoDB exports as if they were one sane view of your exported DynamoDB tables.

Setting up the infrastructure

The AWS CloudFormation stacks I create are split into two stacks. The common stack contains shared infrastructure, and you need only one of these per AWS Region. The table stacks are designed in such a way that you can create one per table-format combination in any given AWS Region. It contains the CloudWatch event logic and AWS Glue components needed to export and transform DynamoDB tables.

Creating the common stack

The common stack contains the majority of the infrastructure. That includes the Step Functions state machine and Lambda functions to trigger and check the state of asynchronous jobs. It also includes IAM roles that the export stacks use, and the S3 bucket to store the exports.

To create the common stack, do the following:

  1. Choose this Launch Stack
  2. Choose I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  3. Choose Create Stack.

Creating the table export stack

If you don’t have a DynamoDB table to export, follow the original blog post. Start with the Working with the Reviews stack section and continue until you’ve added the two Items to the table. Otherwise, feel free to point this CloudFormation stack at your favorite DynamoDB table that is using provisioned throughput. Tables that use on-demand throughput are not currently supported.

Because so much of this architecture is shareable, there’s not much in the table export stack. This stack defines the CloudWatch event used to trigger the Step Functions state machine with a JSON payload containing all the necessary metadata. Additionally, it contains the AWS Glue ETL job that exports the table and the AWS Glue Crawler that updates metadata in the AWS Glue Data Catalog.

Technically, you can define the AWS Glue ETL job in the common stack because it’s already parameterized. However, the default limit for concurrent runs for an AWS Glue job is three. This is a soft limit, but with this architecture you have headroom to export up to 25 tables before asking for a limit increase.

To create the table export stack, do the following:

  1. Choose this Launch Stack
  2. Choose an output format from the list. All the available formats are supported by Athena natively.
  3. Enter your DynamoDB table name.
  4. Enter the percentage of Read Capacity Units (RCUs) that the job should consume from your table’s currently provisioned throughput. This percentage is expressed as a float between 0.1 and 1.0 inclusive. The default is 0.25 (25 percent).

As an example: Suppose that your table’s RCUs are set to 100 and you use the default 0.25, 25 percent. Then the AWS Glue job consume 25 RCUs while running.

  1. Choose Create.

Kicking off a state machine execution

To demonstrate how this works, we run the DynamoDB export state machine manually by passing it the JSON payload that the CloudWatch event would pass to Step Functions.

Getting the JSON payload from CloudWatch Events

To get the JSON payload, do the following:

  1. Open CloudWatch in the AWS Management Console.
  2. In the left column under Events, choose Rules.
  3. Choose your rule from the list. It is prefixed by AWSBigDataBlog-.
  4. For Actions, choose Edit.
  5. Copy the JSON payload from the Configure input section of Targets.
  6. Choose Cancel to exit edit mode.

Starting a state machine execution

To start an execution of the state machine, take the following steps:

  1. Open Step Functions in the console.
  2. Choose the DynamoDBExportAndAthenaLoad state machine.
  3. Choose Start execution.
  4. Paste the JSON payload into the Input
  5. Choose Start execution.

There are a few ways to follow along with the execution. As steps are entered and exited, entries are added to the Execution event history list. This is a great way to see what state (event in Lambda speak) is passed to each step, in case you need to debug.

You can also expand the Visual workflow. It’s a great high-level view to see how the workflow is progressing.

After the workflow is finished, you see two new tables under the dynamodb_exports database in your AWS Glue Data Catalog. Your DynamoDB snapshots table name is prefixed with snapshots_. The schema is formatted for the AWS Glue Data Catalog (lowercase and hyphens transformed to underscores). You also have a view table with the same table name formatted for AWS Glue Data Catalog but without the snapshots_ prefix.

Querying your data

To showcase how having a separate view table of the most recent snapshot of a table is useful, I use the Reviews table from the previous blog post. The table has two items. I have also run the export workflow twice. As you can see when you preview the table, there are four items total. That’s because each snapshot contains two items.

From the items, the latest snapshot_timestamp is 2019-01-11T23:26. When I run the same preview query against the view table reviews, we see that there are only two items, which is what we expect. The view takes care of specifying the where snapshot_timestamp=… clause so you don’t have to.

Wrapping up

In this post, I showed you how to use AWS Glue’s DynamoDB integration and AWS Step Functions to create a workflow to export your DynamoDB tables to S3 in Parquet. I also show how to create an Athena view for each table’s latest snapshot, giving you a consistent view of your DynamoDB table exports.


About the Author

Joe Feeney is a Software Engineer at Amazon Go, where he does secret stuff and he’s quite chuffed with that. He enjoys embarrassing his family by taking Mario Kart entirely too seriously.

 

 

 

from AWS Big Data Blog

Trigger cross-region replication of pre-existing objects using Amazon S3 inventory, Amazon EMR, and Amazon Athena

Trigger cross-region replication of pre-existing objects using Amazon S3 inventory, Amazon EMR, and Amazon Athena

In Amazon Simple Storage Service (Amazon S3), you can use cross-region replication (CRR) to copy objects automatically and asynchronously across buckets in different AWS Regions. CRR is a bucket-level configuration, and it can help you meet compliance requirements and minimize latency by keeping copies of your data in different Regions. CRR replicates all objects in the source bucket, or optionally a subset, controlled by prefix and tags.

Objects that exist before you enable CRR (pre-existing objects) are not replicated. Similarly, objects might fail to replicate (failed objects) if permissions aren’t in place, either on the IAM role used for replication or the bucket policy (if the buckets are in different AWS accounts).

In our work with customers, we have seen situations where large numbers of objects aren’t replicated for the previously mentioned reasons. In this post, we show you how to trigger cross-region replication for pre-existing and failed objects.

Methodology

At a high level, our strategy is to perform a copy-in-place operation on pre-existing and failed objects. This operation uses the Amazon S3 API to copy the objects over the top of themselves, preserving tags, access control lists (ACLs), metadata, and encryption keys. The operation also resets the Replication_Status flag on the objects. This triggers cross-region replication, which then copies the objects to the destination bucket.

To accomplish this, we use the following:

  • Amazon S3 inventory to identify objects to copy in place. These objects don’t have a replication status, or they have a status of FAILED.
  • Amazon Athena and AWS Glue to expose the S3 inventory files as a table.
  • Amazon EMR to execute an Apache Spark job that queries the AWS Glue table and performs the copy-in-place operation.

Object filtering

To reduce the size of the problem (we’ve seen buckets with billions of objects!) and eliminate S3 List operations, we use Amazon S3 inventory. S3 inventory is enabled at the bucket level, and it provides a report of S3 objects. The inventory files contain the objects’ replication status: PENDING, COMPLETED, FAILED, or REPLICA. Pre-existing objects do not have a replication status in the inventory.

Interactive analysis

To simplify working with the files that are created by S3 inventory, we create a table in the AWS Glue Data Catalog. You can query this table using Amazon Athena and analyze the objects.  You can also use this table in the Spark job running on Amazon EMR to identify the objects to copy in place.

Copy-in-place execution

We use a Spark job running on Amazon EMR to perform concurrent copy-in-place operations of the S3 objects. This step allows the number of simultaneous copy operations to be scaled up. This improves performance on a large number of objects compared to doing the copy operations consecutively with a single-threaded application.

Account setup

For the purpose of this example, we created three S3 buckets. The buckets are specific to our demonstration. If you’re following along, you need to create your own buckets (with different names).

We’re using a source bucket named crr-preexisting-demo-source and a destination bucket named crr-preexisting-demo-destination. The source bucket contains the pre-existing objects and the objects with the replication status of FAILED. We store the S3 inventory files in a third bucket named crr-preexisting-demo-inventory.

The following diagram illustrates the basic setup.

You can use any bucket to store the inventory, but the bucket policy must include the following statement (change Resource and aws:SourceAccount to match yours).

{
    "Version": "2012-10-17",
    "Id": "S3InventoryPolicy",
    "Statement": [
        {
            "Sid": "S3InventoryStatement",
            "Effect": "Allow",
            "Principal": {
                "Service": "s3.amazonaws.com"
            },
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::crr-preexisting-demo-inventory/*",
            "Condition": {
                "StringEquals": {
                    "s3:x-amz-acl": "bucket-owner-full-control",
                    "aws:SourceAccount": "111111111111"
                }
            }
        }
    ]
}

In our example, we uploaded six objects to crr-preexisting-demo-source. We added three objects (preexisting-*.txt) before CRR was enabled. We also added three objects (failed-*.txt) after permissions were removed from the CRR IAM role, causing CRR to fail.

Enable S3 inventory

You need to enable S3 inventory on the source bucket. You can do this on the Amazon S3 console as follows:

On the Management tab for the source bucket, choose Inventory.

Choose Add new, and complete the settings as shown, choosing the CSV format and selecting the Replication status check box. For detailed instructions for creating an inventory, see How Do I Configure Amazon S3 Inventory? in the Amazon S3 Console User Guide.

After enabling S3 inventory, you need to wait for the inventory files to be delivered. It can take up to 48 hours to deliver the first report. If you’re following the demo, ensure that the inventory report is delivered before proceeding.

Here’s what our example inventory file looks like:

You can also look on the S3 console on the objects’ Overview tab. The pre-existing objects do not have a replication status, but the failed objects show the following:

Register the table in the AWS Glue Data Catalog using Amazon Athena

To be able to query the inventory files using SQL, first you need to create an external table in the AWS Glue Data Catalog. Open the Amazon Athena console at https://console.aws.amazon.com/athena/home.

On the Query Editor tab, run the following SQL statement. This statement registers the external table in the AWS Glue Data Catalog.

CREATE EXTERNAL TABLE IF NOT EXISTS
crr_preexisting_demo (
    `bucket` string,
    key string,
    replication_status string
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    ESCAPED BY '\\'
    LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://crr-preexisting-demo-inventory/crr-preexisting-demo-source/crr-preexisting-demo/hive';

After creating the table, you need to make the AWS Glue Data Catalog aware of any existing data and partitions by adding partition metadata to the table. To do this, you use the Metastore Consistency Check utility to scan for and add partition metadata to the AWS Glue Data Catalog.

MSCK REPAIR TABLE crr_preexisting_demo;

To learn more about why this is required, see the documentation on MSCK REPAIR TABLE and data partitioning in the Amazon Athena User Guide.

Now that the table and partitions are registered in the Data Catalog, you can query the inventory files with Amazon Athena.

SELECT * FROM crr_preexisting_demo where dt='2019-02-24-04-00';

The results of the query are as follows.

The query returns all rows in the S3 inventory for a specific delivery date. You’re now ready to launch an EMR cluster to copy in place the pre-existing and failed objects.

Note: If your goal is to fix FAILED objects, make sure that you correct what caused the failure (IAM permissions or S3 bucket policies) before proceeding to the next step.

Create an EMR cluster to copy objects

To parallelize the copy-in-place operations, run a Spark job on Amazon EMR. To facilitate EMR cluster creation and EMR step submission, we wrote a bash script (available in this GitHub repository).

To run the script, clone the GitHub repo. Then launch the EMR cluster as follows:

$ git clone https://github.com/aws-samples/amazon-s3-crr-preexisting-objects
$ ./launch emr.sh

Note: Running the bash script results in AWS charges. By default, it creates two Amazon EC2 instances, one m4.xlarge and one m4.2xlarge. Auto-termination is enabled so when the cluster is finished with the in-place copies, it terminates.

The script performs the following tasks:

  1. Creates the default EMR roles (EMR_EC2_DefaultRole and EMR_DefaultRole).
  2. Uploads the files used for bootstrap actions and steps to Amazon S3 (we use crr-preexisting-demo-inventory to store these files).
  3. Creates an EMR cluster with Apache Spark installed using the create-cluster

After the cluster is provisioned:

  1. A bootstrap action installs boto3 and awscli.
  2. Two steps execute, copying the Spark application to the master node and then running the application.

The following are highlights from the Spark application. You can find the complete code for this example in the amazon-s3-crr-preexisting-objects repo on GitHub.

Here we select records from the table registered with the AWS Glue Data Catalog, filtering for objects with a replication_status of "FAILED" or “”.

query = """
        SELECT bucket, key
        FROM {}
        WHERE dt = '{}'
        AND (replication_status = '""'
        OR replication_status = '"FAILED"')
        """.format(inventory_table, inventory_date)

print('Query: {}'.format(query))

crr_failed = spark.sql(query)

We call the copy_object function for each key returned by the previous query.

def copy_object(self, bucket, key, copy_acls):
        dest_bucket = self._s3.Bucket(bucket)
        dest_obj = dest_bucket.Object(key)

        src_bucket = self._s3.Bucket(bucket)
        src_obj = src_bucket.Object(key)

        # Get the S3 Object's Storage Class, Metadata, 
        # and Server Side Encryption
        storage_class, metadata, sse_type, last_modified = \
            self._get_object_attributes(src_obj)

        # Update the Metadata so the copy will work
        metadata['forcedreplication'] = runtime

        # Get and copy the current ACL
        if copy_acls:
            src_acl = src_obj.Acl()
            src_acl.load()
            dest_acl = {
                'Grants': src_acl.grants,
                'Owner': src_acl.owner
            }

        params = {
            'CopySource': {
                'Bucket': bucket,
                'Key': key
            },
            'MetadataDirective': 'REPLACE',
            'TaggingDirective': 'COPY',
            'Metadata': metadata,
            'StorageClass': storage_class
        }

        # Set Server Side Encryption
        if sse_type == 'AES256':
            params['ServerSideEncryption'] = 'AES256'
        elif sse_type == 'aws:kms':
            kms_key = src_obj.ssekms_key_id
            params['ServerSideEncryption'] = 'aws:kms'
            params['SSEKMSKeyId'] = kms_key

        # Copy the S3 Object over the top of itself, 
        # with the Storage Class, updated Metadata, 
        # and Server Side Encryption
        result = dest_obj.copy_from(**params)

        # Put the ACL back on the Object
        if copy_acls:
            dest_obj.Acl().put(AccessControlPolicy=dest_acl)

        return {
            'CopyInPlace': 'TRUE',
            'LastModified': str(result['CopyObjectResult']['LastModified'])
        }

Note: The Spark application adds a forcedreplication key to the objects’ metadata. It does this because Amazon S3 doesn’t allow you to copy in place without changing the object or its metadata.

Verify the success of the EMR job by running a query in Amazon Athena

The Spark application outputs its results to S3. You can create another external table with Amazon Athena and register it with the AWS Glue Data Catalog. You can then query the table with Athena to ensure that the copy-in-place operation was successful.

CREATE EXTERNAL TABLE IF NOT EXISTS
crr_preexisting_demo_results (
  `bucket` string,
  key string,
  replication_status string,
  last_modified string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  STORED AS TEXTFILE
LOCATION 's3://crr-preexisting-demo-inventory/results';

SELECT * FROM crr_preexisting_demo_results;

The results appear as follows on the console.

Although this shows that the copy-in-place operation was successful, CRR still needs to replicate the objects. Subsequent inventory files show the objects’ replication status as COMPLETED. You can also verify on the console that preexisting-*.txt and failed-*.txt are COMPLETED.

It is worth noting that because CRR requires versioned buckets, the copy-in-place operation produces another version of the objects. You can use S3 lifecycle policies to manage noncurrent versions.

Conclusion

In this post, we showed how to use Amazon S3 inventory, Amazon Athena, the AWS Glue Data Catalog, and Amazon EMR to perform copy-in-place operations on pre-existing and failed objects at scale.

Note: Amazon S3 batch operations is an alternative for copying objects. The difference is that S3 batch operations will not check each object’s existing properties and set object ACLs, storage class, and encryption on an object-by-object basis. For more information, see Introduction to Amazon S3 Batch Operations in the Amazon S3 Console User Guide.

 


About the Authors

Michael Sambol is a senior consultant at AWS. He holds an MS in computer science from Georgia Tech. Michael enjoys working out, playing tennis, traveling, and watching Western movies.

 

 

 

 

Chauncy McCaughey is a senior data architect at AWS. His current side project is using statistical analysis of driving habits and traffic patterns to understand how he always ends up in the slow lane.

 

 

 

from AWS Big Data Blog

Easily query AWS service logs using Amazon Athena

Easily query AWS service logs using Amazon Athena

Many organizations use Amazon Simple Storage Service (Amazon S3) as a primary storage destination for a wide variety of logs including AWS service logs. One of the benefits of storing log data in Amazon S3 is that you can access it in any number of ways. One popular option is to query it using Amazon Athena, a serverless query engine for data on S3. Common use cases for querying logs are service and application troubleshooting, performance analysis, and security audits. To get the best performance and reduce query costs in Athena, we recommend following common best practices, as outlined in Top 10 Performance Tuning Tips for Amazon Athena on the AWS Big Data Blog. These best practices include converting the data to a columnar format like Apache Parquet and partitioning the resulting data in S3.

In this post, we’re open-sourcing a Python library known as Athena Glue Service Logs (AGSlogger). This library has predefined templates for parsing and optimizing the most popular log formats. The library provides a mechanism for defining schemas, managing partitions, and transforming data within an extract, transform, load (ETL) job in AWS Glue. AWS Glue is a serverless data transformation and cataloging service. You can use this library in conjunction with AWS Glue ETL jobs to enable a common framework for processing log data.

Using Python libraries with AWS Glue ETL

One of the features of AWS Glue ETL is the ability to import Python libraries into a job (as described in the documentation). We take advantage of this feature in our approach. With this capability, you first provide a link to a .zip file in Amazon S3 containing selected Python modules to AWS Glue. Then AWS Glue imports them at runtime.

We want our AWS Glue jobs to be as simple as possible while enabling the ability to easily roll out new versions of the library. To accomplish this, all of the setup, configuration, and transformation logic is contained in the library and AWS Glue simply executes the job. As new log formats are added or updated in the library, a new version of the .zip file can be deployed to S3. It’s then automatically imported by the relevant AWS Glue job. Here is an example ETL script:

from athena_glue_service_logs.job import JobRunner
 
job_run = JobRunner(service_name='s3_access')
job_run.convert_and_partition()

About the AGSlogger library

The library is available on GitHub in the athena-glue-service-logs repository. It’s designed to do an initial conversion of AWS Service logs and also perform ongoing conversion as new logs are delivered to S3. The following log types are supported:

  • Application Load Balancer
  • Classic Load Balancer
  • AWS CloudTrail
  • Amazon CloudFront
  • S3 Access
  • Amazon VPC Flow

To convert additional logs, update the service_name variable in the script, and also the different job parameters that point to your desired table names and Amazon S3 locations.

There are some limitations of the script:

  • The script has not been tested with large volumes of log data (greater than 100 GiB).
  • If you have a large number of log files, you might need to increase your Apache Spark executor settings. Edit the AWS Glue job and add the following job parameter:

key: --conf
value: spark.yarn.executor.memoryOverhead=1G

  • If you do not have any recent logs (less than 30 days old) for certain log types like S3 Access, the script may not be able to properly populate the optimized table.
  • Several CloudTrail fields such as requestParameters and responseElements are left as JSON strings – you can use Athena to extract data from this JSON at the time of query.

Before you begin

There are a few prerequisites before you get started:

  1. Create an IAM role to use with AWS Glue. For more information, see Create an IAM Role for AWS Glue in the AWS Glue documentation.
  2. Ensure that you have access to Athena from your account.
  3. We use Amazon S3 server access logs as our example for this script, so enable access logging on an Amazon S3 bucket. For more information, see How to Enable Server Access Logging in the
    S3 documentation.
  4. Download and store the Python library in an Amazon S3 bucket in the same AWS Region in which you run the AWS Glue ETL job. Download the latest release from https://github.com/awslabs/athena-glue-service-logs/releases. Then, copy the .zip file to your Amazon S3 bucket, as follows:

aws s3 cp athena_glue_converter_v5.3.0.zip s3://<bucket>/glue_scripts/

Now, you are ready to create the AWS Glue ETL job.

Create an AWS Glue ETL job using the library

For this post, we focus on Amazon S3 server access logs. (described in the documentation). By default, these logs are delivered to a single location in Amazon S3. Converting to Parquet and partitioning these logs can significantly improve query performance and decrease query costs.

If you’ve cloned the repository associated with this release, you can use a “make” command to automate the job creation. We also walk through the job creation process in the AWS Glue console. There are a few specific settings on the Job properties page we need to set.

To create the AWS Glue ETL job

  1. In the AWS Glue console, choose Jobs under ETL on the navigation pane, and then choose Add Job. Follow the job creation wizard. Ensure that “A new script to be authored by you” is selected. We provide the code for it later. Our ETL language is Python. Under advanced properties, enable the Job bookmark. Job metrics can also be useful when monitoring your job, but not required.
  2. Under Script libraries in the Python library path section, put the full path to the .zip file that you uploaded to your Amazon S3 bucket as shown previously:
    s3://<bucket>/glue_scripts/athena_glue_converter_v5.3.0.zip

    You can adjust the DPUs if you think you need more or less processing power. For our purposes, you can leave it at 10.
  1. Specify a few different types of parameters, described in detail following:
  • The source of your Amazon S3 Server Access Logs.
  • The destination where to save the converted logs.

AWS service logs can be stored in a number of different locations, as discussed in Service Log Specifics. For storing Amazon S3 server access logs, specify the bucket and prefix matching those that you configured on the S3 bucket where you enabled access logging.

  • The names of the databases and tables that are created in the AWS Glue Data Catalog.

By default, the converted logs are partitioned by date. The script creates the necessary tables and keeps the partitions up-to-date on subsequent runs of the job. You don’t need to use AWS Glue crawlers, although they can provide similar functionality. Here are the different properties you need to configure:

Key Value
--raw_database_name source_logs
--raw_table_name s3_access
--converted_database_name aws_service_logs
--converted_table_name s3_access
--s3_converted_target s3://<bucket>/converted/s3_access
--s3_source_location

s3://<bucket>/s3_access

 

  1. Continue with the rest of the wizard, finishing the job creation flow. The script editor opens. Replace all the code in the script editor, even the import lines, with these lines:
    from athena_glue_service_logs.job import JobRunner
     
    job_run = JobRunner(service_name='s3_access')
    job_run.convert_and_partition()

  1. Save the script and choose Run Job! When the job begins, you see log output from the job scrolling under the script.

The script you just created is saved to S3 in a standard bucket. You can also use the AWS Command Line Interface to create the AWS Glue ETL job. Copy the script preceding to S3 first and provide that as the ScriptLocation parameter.

aws glue create-job --name S3AccessLogConvertor \
--description "Convert and partition S3 Access logs" \
--role AWSGlueServiceRoleDefault \
--command Name=glueetl,ScriptLocation=s3://<bucket>/glue_scripts/s3_access_job.py \
--default-arguments '{
  "--extra-py-files":"s3://<bucket>/glue_scripts/athena_glue_converter_v5.3.0.zip",
  "--job-bookmark-option":"job-bookmark-enable",
  "--raw_database_name":"source_logs",
  "--raw_table_name":"s3_access",
  "--converted_database_name":"aws_service_logs",
  "--converted_table_name":"s3_access",
  "--TempDir":"s3://<bucket>/tmp",
  "--s3_converted_target":"s3://<bucket>/converted/s3_access",
  "--s3_source_location":"s3://<bucket>/s3_access/"
}'

Scheduling future runs

By default, this job is configured to run on a manual basis. To run it on a regular basis, set up a new schedule trigger in AWS Glue to run the job at your desired frequency. We recommend scheduling it at hourly to make it easier to locate recent logs for your optimized queries.

On every run of the job, the script looks for the new log data and converts it to Parquet format. The script then adds any new partitions that might have been added as a result of the conversion. The script uses the AWS Glue job bookmarks to ensure that it processes newly delivered data. To find more information about bookmarks in the AWS Glue documentation, see Tracking Processed Data Using Job Bookmarks.

Querying your optimized data in Athena: examples

Now that you’ve converted your data from row-based log files to columnar-based Parquet, you can write queries against this data using Athena. After the first run of the script, the tables specified in the AWS Glue ETL job properties are created for you. Here are several sample queries to get you started.

Example 1: Most requested S3 keys

SELECT key, COUNT(*) AS count
FROM "aws_service_logs"."s3_access"
WHERE operation IN ('REST.GET.OBJECT', 'REST.COPY.OBJECT', 'REST.COPY.OBJECT_GET')
GROUP BY 1
ORDER BY 2 DESC
limit 100;

Example 2: Top IP addresses that accessed the bucket yesterday

SELECT remote_ip, COUNT(*) FROM "aws_service_logs"."s3_access"
WHERE year=date_format(current_date, '%Y') AND month=date_format(current_date, '%m') AND day=date_format(current_date + interval '-1' day, '%d')
GROUP BY 1
ORDER BY 2 DESC
limit 100;

Note the use of numbers instead of strings in the use of the GROUP BY and ORDER BY operations. This is one of the optimizations for Athena queries. For other optimizations, be sure to check out the Top 10 Performance Tuning Tips blog post.

In addition, we use the year, month, and day partition columns to limit the amount of data scanned and decrease the cost of the query.

Summary

This post introduces a new open-source library that you can use to efficiently process various types of AWS service logs using AWS Glue. The library automates the application of common best practices to allow high-performing and cost-effective querying of the data using Amazon Athena and Amazon Redshift. We hope this library comes in handy, and we’re open to pull requests. If you want to add a new log type, check out the code in the AWS Labs athena-glue-service-logs repository!

 


About the Author

Damon Cortesi is a big data architect with Amazon Web Services.

 

 

 

from AWS Big Data Blog

EMR Notebooks: A managed analytics environment based on Jupyter notebooks

EMR Notebooks: A managed analytics environment based on Jupyter notebooks

Notebooks are increasingly becoming the standard tool for interactively developing big data applications. It’s easy to see why. Their flexible architecture allows you to experiment with data in multiple languages, test code interactively, and visualize large datasets. To help scientists and developers easily access notebook tools, we launched Amazon EMR Notebooks, a managed notebook environment that is based on the popular open-source Jupyter notebook application. EMR Notebooks support Spark Magic kernels, which allows you to submit jobs remotely on your EMR cluster using languages like PySpark, Spark SQL, Spark R, and Scala. The kernels submit your Spark code through Apache Livy, which is a REST server for Spark running on your cluster.

EMR Notebooks is designed to make it easy for you to experiment and build applications with Apache Spark. In this blog post, I first cover some of the benefits that EMR Notebooks offers. Then I introduce you to some of its capabilities such as detaching and attaching a notebook to different EMR clusters, monitoring Spark activity from within the notebook, using tags to control user permissions, and setting up user-impersonation to track notebook users and their actions. To learn about creating and using EMR Notebooks, you can visit Using Amazon EMR Notebooks or follow along with the AWS Online Tech Talks webinar.

Benefits of EMR Notebooks

One of the useful features of EMR Notebooks is the separation of the notebook environment from your underlying cluster infrastructure. The separation makes it easy for you to execute notebook code against transient clusters without worrying about deploying or configuring your notebook infrastructure every time you bring up a new cluster. You can create multiple serverless notebooks from the AWS Management Console for EMR and access the notebook UI without spending time setting up SSH access or configuring your browser for port-forwarding. Each notebook you create is launched instantly with its own Spark context. This capability enables you to attach multiple notebooks to a single shared cluster and submit parallel jobs without fear of job conflicts in a multi-tenant environment. This way you make efficient use of your clusters.

You can also connect EMR Notebooks to an EMR cluster as small as a one node. This gives you a budget-friendly sandbox environment to develop your Spark application.

Finally, with EMR Notebooks, you don’t have to spend time to manually configure your notebook to store files persistently. Your notebook files are saved automatically to a chosen Amazon S3 bucket periodically so you don’t have to worry about losing your work if your cluster is shut down. You can retrieve your saved notebooks from the console or download it locally from your S3 bucket in Jupyter “ipynb” format.

Detaching and attaching EMR Notebooks to different clusters

With EMR Notebooks, you can detach an active notebook from a cluster and attach it to a different cluster and promptly resume your work. This capability can be useful in scenarios where you want to move your notebook from a sandbox development cluster to a production environment or attach to a different cluster with appropriate CPU or memory resources and library packages required to execute your notebook against large datasets. To detach an active notebook:

First select the notebook name and then choose Stop.

Wait for the notebook status shown next to the notebook name to change from Ready to Stopped and then choose Change Cluster.

After you stop the notebook, you can choose to attach it to a different cluster in the same VPC or create a new cluster. EMR Notebooks automatically attaches the notebook to the cluster and re-starts the notebook.

Monitoring and debugging Spark jobs

EMR Notebooks supports a built-in Jupyter notebook widget called SparkMonitor that allows you to monitor the status of all your Spark jobs launched from the notebook without connecting to the Spark web UI server.

A widget appears and automatically integrates within the cell structure of your notebook and displays detailed status about the job submitted from each cell of your notebook, providing you with real-time progress of the different stages of the job. For any failed jobs, this widget also offers embedded links to the container logs in Amazon S3, allowing you to get to the relevant logs and debug your jobs.

Additionally, if you have configured your cluster to accept SSH connections, then you can access the Spark application web UI and Hadoop jobs history server from within the notebook. This allows you to view the event timelines, visualize Directed Acyclic Graphs (DAG) of each job, and view the detailed system and runtime information to inspect and debug your code. These web UIs are available automatically the first time you start to run your Spark code from a notebook.

Writing tag-based policies to control user permissions for notebooks and users

EMR Notebooks are by default shared resources that anyone from your organization with access to your AWS account can open, edit, or even delete. If you want more control over your notebook, you can use tags to label your notebook and write IAM policies that control access for other users. To get you started, when you create a notebook, a default tag with a key string, creatorUserId, is set to the value of the IAM User ID of the user who created the notebook.

You can use this tag to limit allowed actions on the notebook to only the creator of the notebook. For example, the permissions policy statement below, when attached to a role or user, enables the IAM user to view, start, stop, edit, or delete only those notebooks that they have created. This policy statement uses the default tag that is applied by EMR Notebooks when you create the notebook.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "elasticmapreduce:DescribeEditor",
                "elasticmapreduce:StartEditor",
                "elasticmapreduce:StopEditor",
                "elasticmapreduce:DeleteEditor",
                "elasticmapreduce:OpenEditorInConsole"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "elasticmapreduce:ResourceTag/creatorUserId": "${aws:userId}"
                }
            }
        }
    ]

You can write policies that enforce the creation of tags before starting a notebook. For example, the policy below requires that the user not change or delete the creatorUserID tag that is added by default. The variable ${aws:userId}, specifies the currently active user’s User ID, which is the default value of the tag.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "elasticmapreduce:CreateEditor"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "elasticmapreduce:RequestTag/creatorUserId": "${aws:userid}"
                }
            }
        }
    ]
}

You can use the notebook tags along with EMR cluster tags to control notebook user access to your cluster. Tagging your notebook and clusters, in addition to securing your resources, also allows you to categorize, track, and allocate your EMR cluster costs across your different line of businesses. For example, the policy below allows a user to create a notebook only if the notebook has a tag with a key string “department” with its value set to “Analytics” and only if the notebook is attached to the EMR cluster that has a tag with the key string “cost-center” and value set to “12345.”

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "elasticmapreduce:StartEditor"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:elasticmapreduce:*:123456789012:editor/*",
            "Condition": {
                "StringEquals": {
                    "elasticmapreduce:ResourceTag/department": [
                        "Analytics"
                    ]
                }
            }
        },
        {
            "Action": [
                "elasticmapreduce:StartEditor"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:elasticmapreduce:*:123456789012:cluster/*",
            "Condition": {
                "StringEquals": {
                    "elasticmapreduce:ResourceTag/cost-center": [
                        "12345"
                    ]
                }
            }
        }
    ]
}

You can learn more about notebook and cluster tags by visiting Using Tags to Control User Permissions and Tag Clusters. Also, visit Using Cost Allocation Tags to learn more using tags to generate cost allocation report in the AWS Billing and Cost Management Console.

Tracking notebook users by enforcing user-impersonation

EMR Notebooks enables multiple users to execute their notebooks’ code concurrently in a shared EMR cluster, improving cluster utilization. By default, all Spark jobs spawned by these different users from their notebook run as the same user (the livy user) on your EMR cluster. If your corporate policy requires you to set up an audit trail and track individual notebook user actions on shared clusters, you can use the user-impersonation feature of EMR Notebooks. This capability allows you to discriminate and audit all notebook users by associating the jobs they executed from their notebook with the user’s IAM identity. To use this feature, you should enable Livy user-impersonation by configuring the core-site and livy-conf configuration classifications when you create and launch an EMR cluster as follows:

[
    {
        "Classification": "core-site",
        "Properties": {
          "hadoop.proxyuser.livy.groups": "*",
          "hadoop.proxyuser.livy.hosts": "*"
        }
    },
    {
        "Classification": "livy-conf",
        "Properties": {
          "livy.impersonation.enabled": "true"
        }
    }
]

Visit Configuring Applications to learn more about configuring application. After this feature is enabled, EMR Notebooks creates HDFS user directories on the master node for each user identity. This means all the Spark jobs from the notebook run as the IAM user instead of the indistinct user livy. For example, if user NB_User1 runs code from the notebook editor, then a user directory named user_NB_User1 is created on the master node and all Spark jobs run as user_NB_User1. You can then use a service like AWS CloudTrail to audit the record of actions by the user NbUser1 by creating a trail. To learn more about setting up audit trails, see logging Amazon EMR API calls in AWS CloudTrail.

Conclusion

In this post, I highlighted some of the capabilities of EMR Notebooks such as the ability to change clusters, monitor Spark jobs from each notebook cell, control user permission, and categorize resource costs. You can do this by using notebook and cluster tags and setting up user-impersonation to track notebook user actions.

Oh by the way, there is no additional charge for using EMR Notebooks and you only pay for the use of the EMR cluster as-usual!

If you have questions or suggestions, feel free to leave a comment.

 


About the Authors

Vignesh Rajamani is a senior product manager for EMR at AWS.

 

 

 

 

Nikki Rouda is the principal product marketing manager for data lakes and big data at AWS. Nikki has spent 20+ years helping enterprises in 40+ countries develop and implement solutions to their analytics and IT infrastructure challenges. Nikki holds an MBA from the University of Cambridge and an ScB in geophysics and math from Brown University.

 

 

 

from AWS Big Data Blog