Saturday, June 7, 2025

Using AWS Glue Data Catalog views with Apache Spark in EMR Serverless and Glue 5.0


The AWS Glue Data Catalog has expanded its Data Catalog views feature and now supports Apache Spark environments in addition to Amazon Athena and Amazon Redshift. This enhancement, launched in March 2025, makes it possible to create, share, and query multi-engine SQL views across Amazon EMR Serverless, Amazon EMR on Amazon EKS, and AWS Glue 5.0 Spark, as well as Athena and Amazon Redshift Spectrum. Multi-dialect views let data teams create a SQL view once and query it through any supported engine: Athena for ad hoc analytics, Amazon Redshift for data warehousing, or Spark for large-scale data processing. This cross-engine compatibility means data engineers can focus on building data products rather than managing multiple view definitions or complex permission schemes. Using AWS Lake Formation permissions, organizations can share these views within the same AWS account, across different AWS accounts, and with AWS IAM Identity Center users and groups, without granting direct access to the underlying tables. Lake Formation features such as fine-grained access control (FGAC) using Lake Formation tag-based access control (LF-TBAC) can be applied to Data Catalog views, enabling scalable sharing and access control across organizations.

In an earlier blog post, we demonstrated creating Data Catalog views using Athena, adding a SQL dialect for Amazon Redshift, and querying the view using Athena and Amazon Redshift. In this post, we guide you through the process of creating a Data Catalog view using EMR Serverless, adding the Athena SQL dialect to the view, sharing it with another account using LF-Tags, and then querying the view in the recipient account using a separate EMR Serverless workspace, an AWS Glue 5.0 Spark job, and Athena. This demonstration showcases the versatility and cross-account capabilities of Data Catalog views and access through various AWS analytics services.

Benefits of Data Catalog views

The following are key benefits of Data Catalog views for enterprise solutions:

  • Targeted data sharing and access control – Data Catalog views, combined with the sharing capabilities of Lake Formation, enable organizations to provide specific data subsets to different teams or departments without duplicating data. For example, a retail company can create views that provide sales data to regional managers while restricting access to sensitive customer information. By applying LF-TBAC to these views, companies can efficiently manage data access across large, complex organizational structures, maintaining compliance with data governance policies while promoting data-driven decision-making.
  • Multi-service analytics integration – The ability to create a view in one analytics service and query it across Athena, Amazon Redshift, EMR Serverless, and AWS Glue 5.0 Spark breaks down data silos and promotes a unified analytics approach. This feature allows businesses to use the strengths of different services for various analytics needs. For instance, a financial institution could create a view of transaction data and use Athena for ad hoc queries, Amazon Redshift for complex aggregations, and EMR Serverless for large-scale data processing, all without moving or duplicating the data. This flexibility accelerates insights and improves resource utilization across the analytics stack.
  • Centralized auditing and compliance – With views stored in the central Data Catalog, businesses can maintain a comprehensive audit trail of data access across linked accounts using AWS CloudTrail logs. This centralization is crucial for industries with strict regulatory requirements, such as healthcare or finance. Compliance officers can seamlessly monitor and report on data access patterns, detect unusual activities, and demonstrate adherence to data protection regulations like GDPR or HIPAA. This centralized approach simplifies compliance processes and reduces the risk of regulatory violations.

These capabilities of Data Catalog views provide powerful options for businesses to strengthen data governance, improve analytics efficiency, and maintain robust compliance measures across their data ecosystem.

Solution overview

An example company has several datasets containing details of their customers' purchases mixed with personally identifiable information (PII). They categorize their datasets based on the sensitivity of the information. The data steward wants to share a subset of their preferred customers' data for further analysis downstream by their data engineering team.

To demonstrate this use case, we use sample Apache Iceberg tables customer and customer_address. We create a Data Catalog view from these two tables to filter for preferred customers. We then use LF-Tags to share restricted columns of this view with the downstream engineering team. The solution is represented in the following diagram.

arch diagram

Prerequisites

To implement this solution, you need two AWS accounts with an AWS Identity and Access Management (IAM) admin role. We use this role to run the provided AWS CloudFormation templates, and we also add the same IAM role as a Lake Formation administrator.

Set up infrastructure in the producer account

We provide a CloudFormation template that deploys the following resources and completes the data lake setup:

  • Two Amazon Simple Storage Service (Amazon S3) buckets: one for scripts, logs, and query results, and one for the data lake storage.
  • Lake Formation administrator and catalog settings. The IAM admin role that you provide is registered as a Lake Formation administrator. The cross-account sharing version is set to 4. Default permissions for newly created databases and tables are set to use Lake Formation permissions only.
    data catalog settings
  • An IAM role with read, write, and delete permissions on the data lake bucket objects. The data lake bucket is registered with Lake Formation using this IAM role.
    data lake locations
  • An AWS Glue database for the data lake.
  • Lake Formation tags. These tags are attached to the database.
    lf-tags
  • CSV and Iceberg format tables in the AWS Glue database. The CSV tables point to s3://redshift-downloads/TPC-DS/2.13/10GB/ and the Iceberg tables are stored in the account's data lake bucket.
  • An Athena workgroup.
  • An IAM role and an AWS Lambda function to run Athena queries. Athena queries are run in the Athena workgroup to insert data from the CSV tables into the Iceberg tables. Relevant Lake Formation permissions are granted to the Lambda role.
    lf-tables
  • An EMR Studio and associated virtual private cloud (VPC), subnet, route table, security groups, and EMR Studio service IAM role.
  • An IAM role with policies for the EMR Studio runtime. Relevant Lake Formation permissions are granted to this role on the Iceberg tables. This role will be used as the definer role to create the Data Catalog view. A definer role is the IAM role with the necessary permissions to access the referenced tables, and it runs the SQL statement that defines the view.

Complete the following steps in your producer AWS account:

  1. Sign in to the AWS Management Console as an IAM administrator role.
  2. Launch the CloudFormation stack.

Allow approximately 5 minutes for the CloudFormation stack to complete creation. After the stack has finished launching, proceed with the following instructions.

  1. If you're using the producer account in Lake Formation for the first time, on the Lake Formation console, create a database named default and grant describe permission on the default database to the runtime role GlueViewBlog-EMRStudio-RuntimeRole.
    data permissions

Create an EMR Serverless application

Complete the following steps to create an EMR Serverless application in your EMR Studio:

  1. On the Amazon EMR console, under EMR Studio in the navigation pane, choose Studios.
  2. Choose GlueViewBlog-emrstudio and choose the URL link of the Studio to open it.
    glueviewblog-emrstudio
  3. On the EMR Studio dashboard, choose Create application.
    emr-studio-dashboard

You'll be directed to the Create application page in EMR Studio. Let's create a Lake Formation enabled EMR Serverless application.

  1. Under Application settings, provide the following information:
    1. For Name, enter a name (for example, emr-glueview-application).
    2. For Type, choose Spark.
    3. For Release version, choose emr-7.8.0.
    4. For Architecture, choose x86_64.
  2. Under Application setup options, select Use custom settings.
  3. Under Interactive endpoint, select Enable endpoint for EMR Studio.
  4. Under Additional configurations, for Metastore configuration, select Use AWS Glue Data Catalog as metastore, then select Use Lake Formation for fine-grained access control.
  5. Under Network connections, choose emrs-vpc for VPC, enter any two private subnets, and enter emr-serverless-sg for Security groups.
  6. Choose Create and start application.

Create an EMR Workspace

Complete the following steps to create an EMR Workspace:

  1. On the EMR Studio console, choose Workspaces in the navigation pane and choose Create Workspace.
  2. Enter a Workspace name (for example, emrs-glueviewblog-workspace).
  3. Leave all other settings as default and choose Create Workspace.
  4. Choose Launch Workspace. Your browser might ask you to allow pop-ups the first time you launch the Workspace.
  5. After the Workspace is launched, in the navigation pane, choose Compute.
  6. For Compute type, select EMR Serverless application and enter emr-glueview-application for the application and GlueViewBlog-EMRStudio-RuntimeRole for Interactive runtime role.
  7. Make sure the kernel attached to the Workspace is PySpark.

Create a Data Catalog view and verify

Complete the following steps:

  1. Download the notebook glueviewblog_producer.ipynb. The code creates a Data Catalog view customer_nonpii_view from the two Iceberg tables, customer_iceberg and customer_address_iceberg, in the database glueviewblog_<account-id>_db.
  2. In your EMR Workspace emrs-glueviewblog-workspace, go to the File browser section and choose Upload files.
  3. Upload glueviewblog_producer.ipynb.
  4. Update the data lake bucket name, AWS account ID, and AWS Region to match your resources.
  5. Update the database_name, table1_name, and table2_name to match your resources.
  6. Save the notebook.
  7. Choose the double arrow icon to restart the kernel and rerun the notebook.

The Data Catalog view customer_nonpii_view is created and verified.
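The notebook's central step is a single DDL statement passed to spark.sql(). The following is a minimal Python sketch, not the notebook's exact code: the CREATE PROTECTED MULTI DIALECT VIEW ... SECURITY DEFINER syntax and the join columns are assumptions based on the tables used in this post, and glueviewblog_producer.ipynb remains the authoritative source.

```python
def build_view_ddl(db: str, view: str) -> str:
    """Compose the Spark DDL for a protected multi-dialect view (sketch).

    The definer role running this statement must already hold Lake
    Formation permissions on both referenced Iceberg tables.
    """
    return f"""CREATE PROTECTED MULTI DIALECT VIEW {db}.{view}
SECURITY DEFINER
AS
SELECT c_customer_id, c_customer_sk, c_last_review_date,
       ca_country, ca_location_type
FROM {db}.customer_iceberg, {db}.customer_address_iceberg
WHERE c_current_addr_sk = ca_address_sk
  AND c_preferred_cust_flag = 'Y'"""

# Hypothetical account ID; in the notebook you would run spark.sql(ddl).
ddl = build_view_ddl("glueviewblog_123456789012_db", "customer_nonpii_view")
```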

  1. In the navigation pane of the Lake Formation console, under Data Catalog, choose Views.
  2. Choose the new view customer_nonpii_view.
  3. On the SQL definitions tab, verify that EMR with Apache Spark shows up for Engine name.
  4. Choose the LF-Tags tab. The view should show the LF-Tag sensitivity=pii-confidential inherited from the database.
  5. Choose Edit LF-Tags.
  6. On the Values dropdown menu, choose confidential to overwrite the Data Catalog view's sensitivity key value from pii-confidential.
  7. Choose Save.

With this, we have created a non-PII view to share with the data engineering team from datasets that contain customers' PII.

Add the Athena SQL dialect to the view

Because the view customer_nonpii_view was created by the EMR runtime role GlueViewBlog-EMRStudio-RuntimeRole, the Admin only has describe permissions on it as the database creator and Lake Formation administrator. In this step, the Admin grants itself alter permissions on the view in order to add the Athena SQL dialect to the view.

  1. On the Lake Formation console, in the navigation pane, choose Data permissions.
  2. Choose Grant and provide the following information:
    1. For Principals, enter Admin.
    2. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
    3. For Key, choose sensitivity.
    4. For Values, choose confidential and pii-confidential.
    5. Under Database permissions, select Super for Database permissions and Grantable permissions.
    6. Under Table permissions, select Super for Table permissions and Grantable permissions.
    7. Choose Grant.
  3. Verify the LF-Tag based permissions for the Admin.
  4. Open the Athena query editor, choose the workgroup GlueViewBlogWorkgroup, and choose the AWS Glue database glueviewblog_<accountID>_db.
  5. Run the following query. Replace <accountID> with your account ID.
    ALTER VIEW glueviewblog_<accountID>_db.customer_nonpii_view ADD DIALECT
    AS
    SELECT c_customer_id, c_customer_sk, c_last_review_date, ca_country, ca_location_type
    FROM glueviewblog_<accountID>_db.customer_iceberg, glueviewblog_<accountID>_db.customer_address_iceberg
    WHERE c_current_addr_sk = ca_address_sk AND c_preferred_cust_flag = 'Y';

  6. Verify the Athena dialect by running a preview on the view.
  7. On the Lake Formation console, verify the SQL dialects on the view customer_nonpii_view.
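You can also check the registered dialects programmatically. The helper below parses a Glue GetTable response; the Table.ViewDefinition.Representations shape is an assumption based on our reading of the Glue API and should be verified against the current boto3 documentation, and the sample response is illustrative rather than captured from a real call.

```python
def list_view_dialects(get_table_response: dict) -> list:
    """Extract engine dialect names from a Glue GetTable response.

    Assumes multi-dialect view metadata lives under
    Table.ViewDefinition.Representations, one entry per engine dialect.
    """
    view_def = get_table_response.get("Table", {}).get("ViewDefinition", {})
    return [rep["Dialect"] for rep in view_def.get("Representations", [])]

# Illustrative response fragment; a real call would be:
# response = boto3.client("glue").get_table(DatabaseName=db, Name=view)
sample = {"Table": {"ViewDefinition": {"Representations": [
    {"Dialect": "SPARK", "DialectVersion": "3.5"},
    {"Dialect": "ATHENA", "DialectVersion": "3"},
]}}}
print(list_view_dialects(sample))  # ['SPARK', 'ATHENA']
```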

Share the view with the consumer account

Complete the following steps to share the Data Catalog view with the consumer account:

  1. On the Lake Formation console, in the navigation pane, choose Data permissions.
  2. Choose Grant and provide the following information:
    1. For Principals, select External accounts and enter the consumer account ID.
    2. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
    3. For Key, choose sensitivity.
    4. For Values, choose confidential.
    5. Under Database permissions, select Describe for Database permissions and Grantable permissions.
    6. Under Table permissions, select Describe and Select for Table permissions and Grantable permissions.
    7. Choose Grant.
  3. Verify the granted permissions on the Data permissions page.

With this, the producer account data steward has created a Data Catalog view of a subset of data from two tables in their Data Catalog, using the EMR runtime role as the definer role. They have shared it with their analytics account using LF-Tags for further downstream processing of the data.
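The same LF-Tag grant can be scripted. The sketch below only composes the keyword arguments for Lake Formation's grant_permissions API so the shape is easy to inspect; the field names follow the Lake Formation API as we understand it and the account ID is hypothetical, so verify against the current boto3 documentation before relying on it.

```python
def build_lf_tag_grant(consumer_account_id: str) -> dict:
    """Compose kwargs mirroring the console grant above: DESCRIBE/SELECT
    on tables matched by sensitivity=confidential, grantable, shared to
    an external consumer account."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": {"LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "sensitivity",
                            "TagValues": ["confidential"]}],
        }},
        "Permissions": ["DESCRIBE", "SELECT"],
        "PermissionsWithGrantOption": ["DESCRIBE", "SELECT"],
    }

kwargs = build_lf_tag_grant("111122223333")
# A real grant would then be:
# boto3.client("lakeformation").grant_permissions(**kwargs)
```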

Set up infrastructure in the consumer account

We provide a CloudFormation template to deploy the following resources and set up the data lake:

  • An S3 bucket for Amazon EMR and AWS Glue logs
  • Lake Formation administrator and catalog settings similar to the producer account setup
  • An AWS Glue database for the data lake
  • An EMR Studio and associated VPC, subnet, route table, security groups, and EMR Studio service IAM role
  • An IAM role with policies for the EMR Studio runtime

Complete the following steps in your consumer AWS account:

  1. Sign in to the console as an IAM administrator role.
  2. Launch the CloudFormation stack.

Allow approximately 5 minutes for the CloudFormation stack to complete creation. After the stack has finished launching, proceed with the following instructions.

  1. If you're using the consumer account's Lake Formation for the first time, on the Lake Formation console, create a database named default and grant describe permission on the default database to the runtime role GlueViewBlog-EMRStudio-Consumer-RuntimeRole.

Accept AWS RAM shares in the consumer account

Now you can log in to the consumer AWS account and accept the AWS RAM invitations:

  1. Open the AWS RAM console with the IAM role that has AWS RAM access.
  2. In the navigation pane, choose Resource shares under Shared with me.

You should see two pending resource shares from the producer account.

  1. Accept both invitations.
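Accepting the shares can also be automated. The helper below filters pending invitations out of a RAM get_resource_share_invitations response; the response field names are assumptions based on the AWS RAM API, and the sample response is illustrative.

```python
def pending_invitation_arns(response: dict) -> list:
    """Return ARNs of resource share invitations still awaiting acceptance."""
    return [inv["resourceShareInvitationArn"]
            for inv in response.get("resourceShareInvitations", [])
            if inv.get("status") == "PENDING"]

# Illustrative response; a real call would be:
# response = boto3.client("ram").get_resource_share_invitations()
sample = {"resourceShareInvitations": [
    {"resourceShareInvitationArn":
     "arn:aws:ram:us-east-1:111122223333:resource-share-invitation/abc",
     "status": "PENDING"},
    {"resourceShareInvitationArn":
     "arn:aws:ram:us-east-1:111122223333:resource-share-invitation/def",
     "status": "ACCEPTED"},
]}
for arn in pending_invitation_arns(sample):
    # Each pending ARN would then be passed to:
    # boto3.client("ram").accept_resource_share_invitation(
    #     resourceShareInvitationArn=arn)
    print(arn)
```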

Create a resource link for the shared view

To access the view that was shared by the producer AWS account, you need to create a resource link in the consumer AWS account. A resource link is a Data Catalog object that is a link to a local or shared database, table, or view. After you create a resource link to a view, you can use the resource link name wherever you would use the view name. Additionally, you can grant permission on the resource link to the job runtime role GlueViewBlog-EMRStudio-Consumer-RuntimeRole to access the view through EMR Serverless Spark.

To create a resource link, complete the following steps:

  1. Open the Lake Formation console as the Lake Formation data lake administrator in the consumer account.
  2. In the navigation pane, choose Tables.
  3. Choose Create and Resource link.
  4. For Resource link name, enter the name of the resource link (for example, customer_nonpii_view_rl).
  5. For Database, choose the glueviewblog_customer_<accountID>_db database.
  6. For Shared table region, choose the Region of the shared table.
  7. For Shared table, choose customer_nonpii_view.
  8. Choose Create.
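Under the hood, a resource link is just a Data Catalog table whose TableInput carries a TargetTable pointer. The sketch below composes the kwargs for Glue's create_table call; the field names follow the Glue API as we understand it and the account IDs are hypothetical, so confirm against the current boto3 documentation.

```python
def build_resource_link(producer_account_id: str, region: str,
                        consumer_db: str) -> dict:
    """Compose kwargs for glue.create_table that mirror the console steps:
    a link named customer_nonpii_view_rl pointing at the shared view."""
    return {
        "DatabaseName": consumer_db,
        "TableInput": {
            "Name": "customer_nonpii_view_rl",
            "TargetTable": {
                "CatalogId": producer_account_id,
                "DatabaseName": f"glueviewblog_{producer_account_id}_db",
                "Name": "customer_nonpii_view",
                "Region": region,
            },
        },
    }

kwargs = build_resource_link("111122223333", "us-east-1",
                             "glueviewblog_customer_123456789012_db")
# A real call would then be:
# boto3.client("glue").create_table(**kwargs)
```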

Grant permissions on the database to the EMR job runtime role

Complete the following steps to grant permissions on the database glueviewblog_customer_<accountID>_db to the EMR job runtime role:

  1. On the Lake Formation console, in the navigation pane, choose Databases.
  2. Select the database glueviewblog_customer_<accountID>_db, and on the Actions menu, choose Grant.
  3. In the Principals section, select IAM users and roles, and choose GlueViewBlog-EMRStudio-Consumer-RuntimeRole.
  4. In the Database permissions section, select Describe.
  5. Choose Grant.

Grant permissions on the resource link to the EMR job runtime role

Complete the following steps to grant permissions on the resource link customer_nonpii_view_rl to the EMR job runtime role:

  1. On the Lake Formation console, in the navigation pane, choose Tables.
  2. Select the resource link customer_nonpii_view_rl, and on the Actions menu, choose Grant.
  3. In the Principals section, select IAM users and roles, and choose GlueViewBlog-EMRStudio-Consumer-RuntimeRole.
  4. In the Resource link permissions section, select Describe for Resource link permissions.
  5. Choose Grant.

This allows the EMR Serverless job runtime role to describe the resource link. We don't make any selections for grantable permissions because runtime roles should not be able to grant permissions to other principals.

Grant permissions on the target of the resource link to the EMR job runtime role

Complete the following steps to grant permissions on the target of the resource link customer_nonpii_view_rl to the EMR job runtime role:

  1. On the Lake Formation console, in the navigation pane, choose Tables.
  2. Select the resource link customer_nonpii_view_rl, and on the Actions menu, choose Grant on target.
  3. In the Principals section, select IAM users and roles, and choose GlueViewBlog-EMRStudio-Consumer-RuntimeRole.
  4. In the View permissions section, select Select and Describe.
  5. Choose Grant.

Set up an EMR Serverless application and Workspace in the consumer account

Repeat the steps to create an EMR Serverless application in the consumer account.

Repeat the steps to create a Workspace in the consumer account. For Compute type, select EMR Serverless application and enter emr-glueview-application for the application and GlueViewBlog-EMRStudio-Consumer-RuntimeRole as the runtime role.

Verify access using interactive notebooks from EMR Studio

Complete the following steps to verify access in EMR Studio:

  1. Download the notebook glueviewblog_emr_consumer.ipynb. The code runs a select statement on the view shared from the producer.
  2. In your EMR Workspace emrs-glueviewblog-workspace, navigate to the File browser section and choose Upload files.
  3. Upload glueviewblog_emr_consumer.ipynb.
  4. Update the data lake bucket name, AWS account ID, and Region to match your resources.
  5. Update the database to match your resources.
  6. Save the notebook.
  7. Choose the double arrow icon to restart the kernel with the PySpark kernel and rerun the notebook.
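The consumer notebook's query itself is simple; the resource link name stands in for the shared view. The following is a minimal sketch in which the aggregation is illustrative, not the notebook's exact code.

```python
def build_consumer_query(consumer_db: str) -> str:
    """Compose a select against the resource link; Lake Formation resolves
    it to the shared view and enforces fine-grained access."""
    return (f"SELECT ca_country, COUNT(*) AS preferred_customers "
            f"FROM {consumer_db}.customer_nonpii_view_rl "
            f"GROUP BY ca_country")

query = build_consumer_query("glueviewblog_customer_123456789012_db")
# In the notebook: spark.sql(query).show()
```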

Verify access using interactive notebooks from AWS Glue Studio

Complete the following steps to verify access using AWS Glue Studio:

  1. Download the notebook glueviewblog_glue_consumer.ipynb.
  2. Open the AWS Glue Studio console.
  3. Choose Notebook, then choose Upload notebook.
  4. Upload the notebook glueviewblog_glue_consumer.ipynb.
  5. For IAM role, choose GlueViewBlog-EMRStudio-Consumer-RuntimeRole.
  6. Choose Create notebook.
  7. Update the data lake bucket name, AWS account ID, and Region to match your resources.
  8. Update the database to match your resources.
  9. Save the notebook.
  10. Run all the cells to verify fine-grained access.

Verify access using the Athena query editor

Because the view from the producer account was shared with the consumer account, the Lake Formation administrator has access to the view in the producer account. Also, because the lake admin role created the resource link pointing to the view, it also has access to the resource link. Go to the Athena query editor and run a simple select query on the resource link.

The analytics team in the consumer account was able to access a subset of the data from a business data producer team, using their analytics tools of choice.

Clean up

To avoid incurring ongoing costs, clean up your resources:

  1. In your consumer account, delete the AWS Glue notebook, stop and delete the EMR application, and then delete the EMR Workspace.
  2. In your consumer account, delete the CloudFormation stack. This should remove the resources launched by the stack.
  3. In your producer account, log in to the Lake Formation console and revoke the LF-Tag based permissions you granted to the consumer account.
  4. In your producer account, stop and delete the EMR application, and then delete the EMR Workspace.
  5. In your producer account, delete the CloudFormation stack. This should delete the resources launched by the stack.
  6. Review and clean up any additional AWS Glue and Lake Formation resources and permissions.

Conclusion

In this post, we demonstrated a robust, enterprise-grade solution for cross-account data sharing and analysis using AWS services. We walked you through the following key steps:

  • Create a Data Catalog view using Spark in EMR Serverless within one AWS account
  • Securely share this view with another account using LF-TBAC
  • Access the shared view in the recipient account using Spark in both EMR Serverless and AWS Glue ETL
  • Implement this solution with Iceberg tables (it's also compatible with other open table formats like Apache Hudi and Delta Lake)

The multi-dialect Data Catalog view approach presented in this post is particularly valuable for enterprises looking to modernize their data infrastructure while optimizing costs, improve cross-functional collaboration while strengthening data governance, and accelerate business insights while maintaining control over sensitive information.

Refer to the following information about Data Catalog views with individual analytics services, and try out the solution. Let us know your feedback and questions in the comments section.


About the Authors

Aarthi Srinivasan is a Senior Big Data Architect with Amazon SageMaker Lakehouse. As part of the SageMaker Lakehouse team, she works with AWS customers and partners to architect lakehouse solutions, enhance product features, and establish best practices for data governance.

Praveen Kumar is an Analytics Solutions Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-based services. His areas of interest are serverless technology, data governance, and data-driven AI applications.

Dhananjay Badaya is a Software Developer at AWS, specializing in distributed data processing engines including Apache Spark and Apache Hadoop. As a member of the Amazon EMR team, he focuses on designing and implementing enterprise governance features for EMR Spark.
