Assume temporary IAM credentials in AWS Glue jobs

In this post we will look at how to assume an IAM Role and use temporary credentials inside an AWS Glue job. Why? Because a Glue job currently supports only a single IAM Role. In a multi-tenant environment you may want to assume a Role whose permissions are scoped to a specific tenant's S3 bucket; the tenant name can then be passed in as a job parameter rather than having a separate job per tenant.

AWS Glue job IAM Role

Setting up the pre-reqs

  1. Create an S3 bucket to be used as the source/destination for the Glue job, e.g. my-glue-data.
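
    If you prefer to script this step, a minimal boto3 sketch (the bucket name is the example value from above and the region is an assumption; adjust both for your account):

    import boto3

    # Create the example data bucket (bucket names are globally unique, so pick your own)
    s3 = boto3.client('s3', region_name='eu-west-1')
    s3.create_bucket(
        Bucket='my-glue-data',
        CreateBucketConfiguration={'LocationConstraint': 'eu-west-1'},
    )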

  2. Create an IAM Role named my-glue-job-role using the AWS Glue use case. This Role will be assigned to the Glue job itself. Add an inline policy giving access to the Glue assets S3 bucket (where Glue will pull the job script from).

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GlueAssets",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": "arn:aws:s3:::aws-glue-assets-<account-id>-<region>/*"
            }
        ]
    }
    

    Note: if you want the job to log to CloudWatch you will also need to grant CloudWatch Logs permissions here.
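
    As a sketch, a statement along these lines added to the policy above should cover the standard /aws-glue/ log groups (the actions and resource ARN here are assumptions; tighten them to match your own logging setup):

    {
        "Sid": "GlueCloudWatchLogs",
        "Effect": "Allow",
        "Action": [
            "logs:CreateLogGroup",
            "logs:CreateLogStream",
            "logs:PutLogEvents"
        ],
        "Resource": "arn:aws:logs:<region>:<account-id>:log-group:/aws-glue/*"
    }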

  3. Create a second IAM Role, this time with a custom trust policy that allows it to be assumed by my-glue-job-role.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowGlueAssume",
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::123456789012:role/my-glue-job-role"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }
    
    
  4. Name this second Role my-glue-assumed-role and give it permissions to access the my-glue-data bucket. s3:PutObject and s3:GetObject should be sufficient, as in the sketch below.
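
    A minimal inline policy for my-glue-assumed-role could look something like this (the bucket name matches the example from step 1; if your job also needs to list keys, add s3:ListBucket on the bucket ARN itself):

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "TenantDataAccess",
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject"
                ],
                "Resource": "arn:aws:s3:::my-glue-data/*"
            }
        ]
    }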

  5. Go back to my-glue-job-role and edit its inline policy to allow sts:AssumeRole on my-glue-assumed-role.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AssumeTenantRole",
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                "Resource": "arn:aws:iam::123456789012:role/my-glue-assumed-role"
            }
        ]
    }
    

Now we have everything in place to create the Glue job. To recap we have two IAM roles:

  • my-glue-job-role = can be assumed by the Glue service and is allowed to assume my-glue-assumed-role. It has no access to the data bucket (only s3:GetObject on the Glue assets bucket).
  • my-glue-assumed-role = permissions to access the S3 bucket containing the data

Creating the Glue job

  1. In the AWS Glue console create a new job. You can start with the Visual with a source and target option, using Amazon S3 for both.

    AWS Glue job create

  2. Set your desired source path, destination path and any transformations required (this article does not cover setting this up in detail).

    AWS Glue job editor

  3. Under Job details set the IAM Role to the my-glue-job-role Role created earlier. Give the job a name and click Save.

  4. Now select the Script tab and click Edit script. You will receive a warning - click Confirm. We only needed the visual editor to generate boilerplate code.

    AWS Glue script edit warning

  5. Your script should look something like this:

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
       
       
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()

    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
       
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)
       
    # Script generated for node S3 bucket
    S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
        format_options={"multiline": False},
        connection_type="s3",
        format="json",
        connection_options={"paths": ["s3://my-glue-data/MOCK_DATA.json"]},
        transformation_ctx="S3bucket_node1",
    )
    
    ...
    
  6. First, add a section near the top of the script to assume the IAM Role using boto3:

    import boto3

    # Assume the scoped Role and retrieve a set of temporary credentials
    sts_connection = boto3.client('sts')
    response = sts_connection.assume_role(
        RoleArn='arn:aws:iam::123456789012:role/my-glue-assumed-role',
        RoleSessionName='GlueTenantASession',
        DurationSeconds=3600,
    )
    credentials = response['Credentials']
    
  7. Next, add a section to pass those credentials to the S3A connector via the SparkContext's Hadoop configuration:

    # Tell the S3A connector to use the temporary credentials instead of the job Role
    sc._jsc.hadoopConfiguration().set('fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
    sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', credentials['AccessKeyId'])
    sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', credentials['SecretAccessKey'])
    sc._jsc.hadoopConfiguration().set('fs.s3a.session.token', credentials['SessionToken'])
    
  8. Finally, change any S3 paths from the s3:// scheme to the s3a:// scheme so that reads and writes go through the S3A connector, which picks up the temporary credentials configured above - e.g. s3://my-glue-data/MOCK_DATA.json becomes s3a://my-glue-data/MOCK_DATA.json.

  9. The final script should look something like this:

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    import boto3
       
    # Assume the scoped Role and retrieve a set of temporary credentials
    sts_connection = boto3.client('sts')
    response = sts_connection.assume_role(
        RoleArn='arn:aws:iam::123456789012:role/my-glue-assumed-role',
        RoleSessionName='GlueTenantASession',
        DurationSeconds=3600,
    )
    credentials = response['Credentials']
       
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    # Tell the S3A connector to use the temporary credentials instead of the job Role
    sc._jsc.hadoopConfiguration().set('fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
    sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', credentials['AccessKeyId'])
    sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', credentials['SecretAccessKey'])
    sc._jsc.hadoopConfiguration().set('fs.s3a.session.token', credentials['SessionToken'])
       
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
       
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)
       
    # Script generated for node S3 bucket
    S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
        format_options={"multiline": False},
        connection_type="s3",
        format="json",
        connection_options={"paths": ["s3a://test-glue-bucket/MOCK_DATA.json"]},
        transformation_ctx="S3bucket_node1",
    )
    ...
    
  10. Once you have finished editing the script, run the job. It should complete successfully with no AccessDenied errors, and you should find the output files in the destination path on the S3 bucket.
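
    If you want to check the output programmatically rather than in the console, a quick boto3 sketch (the output prefix here is hypothetical; use whatever destination path you configured in step 2):

    import boto3

    s3 = boto3.client('s3')
    # List whatever the job wrote under the destination prefix
    result = s3.list_objects_v2(Bucket='my-glue-data', Prefix='output/')
    for obj in result.get('Contents', []):
        print(obj['Key'], obj['Size'])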

You could extend this further by passing a Session Policy with the AssumeRole call. This allows you to generate a scoped-down policy dynamically at runtime, for example from a tenant name passed in as a job parameter when the job is run.
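
As a rough sketch, assuming a job parameter named TENANT and a bucket-per-tenant naming convention (both made up for illustration), the assume-role call could be scoped down like this:

    import sys
    import json
    import boto3
    from awsglue.utils import getResolvedOptions

    # Resolve the tenant name passed in as a job parameter
    args = getResolvedOptions(sys.argv, ["JOB_NAME", "TENANT"])
    tenant_bucket = f"my-glue-data-{args['TENANT']}"

    # Scope the temporary credentials down to the tenant's bucket only.
    # The effective permissions are the intersection of this session policy
    # and the permissions attached to my-glue-assumed-role.
    session_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{tenant_bucket}/*",
            }
        ],
    }

    sts_connection = boto3.client('sts')
    response = sts_connection.assume_role(
        RoleArn='arn:aws:iam::123456789012:role/my-glue-assumed-role',
        RoleSessionName=f"GlueSession-{args['TENANT']}",
        DurationSeconds=3600,
        Policy=json.dumps(session_policy),
    )
    credentials = response['Credentials']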