Creating AWS Glue jobs to process data with PySpark involves several key steps: developing the script, transforming the data, and finally writing the results to Amazon S3 or Amazon Redshift. Below is a detailed guide to each phase:
### 1. Setting Up the Environment
Before creating a Glue job, ensure that:
- You have access to the AWS Management Console.
- The AWS Glue service is enabled in your AWS account.
- You have the necessary permissions to create and manage AWS Glue resources (a quick programmatic check is sketched below).
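If you want to verify access programmatically, a minimal sketch using boto3 is shown below; the region name is an assumption, so adjust it to your account:

```python
import boto3

# Sanity check: list a few Glue databases with the current credentials.
# The region name is only an example; use the region you actually work in.
glue = boto3.client("glue", region_name="us-east-1")
response = glue.get_databases(MaxResults=10)
print([db["Name"] for db in response["DatabaseList"]])
```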
### 2. Creating a Glue Job
#### a. Define a Glue Job
1. **Navigate to AWS Glue Console**: Go to the AWS Glue service in the AWS Management Console.
2. **Create the Job**:
- Click on "Jobs" in the navigation pane.
- Choose "Add job".
3. **Configure the Job**:
- **Name**: Give your job a meaningful name.
- **IAM Role**: Select an IAM role with the necessary permissions to read/write to S3, access AWS Glue, etc.
- **Type**: Choose "Spark" as the job type.
- **Glue Version**: Select the Glue version compatible with your PySpark script. The version dictates the Spark and Python versions available.
- **Worker Type and Number**: Choose an appropriate worker type (Standard, G.1X, G.2X) and number of workers based on your processing needs (a boto3 equivalent of this configuration is sketched after this list).
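The same configuration can also be created programmatically. The following is a minimal sketch using boto3's `create_job` call; the job name, role ARN, script location, and worker settings are placeholders that mirror the console options above:

```python
import boto3

glue = boto3.client("glue")

# Create a Spark (PySpark) job; every name, ARN, and path below is a placeholder.
glue.create_job(
    Name="my-etl-job",
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",
    Command={
        "Name": "glueetl",  # Spark ETL job type
        "ScriptLocation": "s3://my-bucket/scripts/my_etl_job.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
)
```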
#### b. Develop the PySpark Script
1. **Script Editor**:
- AWS Glue provides a script editor where you can write your PySpark code.
- You can also upload an existing script from your local machine.
2. **Basic Structure** (the database, table, bucket, and connection names below are placeholders; replace them with your own):

```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Resolve job arguments and initialize the GlueContext and Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Load data from the Glue Data Catalog
data_source = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
    transformation_ctx="data_source")

# Transformations: map source fields to target fields and types
transformed_data = ApplyMapping.apply(
    frame=data_source,
    mappings=[("field1", "string", "field1", "string"),
              ("field2", "int", "field2", "int")],
    transformation_ctx="transformed_data")

# Write to S3 as Parquet
output_path = "s3://my-bucket/output/"
glueContext.write_dynamic_frame.from_options(
    frame=transformed_data,
    connection_type="s3",
    connection_options={"path": output_path},
    format="parquet",
    transformation_ctx="s3_write")

# Optionally write to Amazon Redshift via a Glue JDBC connection
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=transformed_data,
    catalog_connection="my_redshift_connection",
    connection_options={"dbtable": "my_table",
                        "database": "my_redshift_db"},
    redshift_tmp_dir="s3://my-bucket/temp/",
    transformation_ctx="redshift_write")

job.commit()
```
### 3. Transformations
Transformations in AWS Glue with PySpark are typically done through the built-in transforms (a short sketch follows this list):
- **Mappings**: Mapping fields to new names, structures, or types.
- **Filters**: Removing unwanted data records.
- **Join**: Combining multiple DynamicFrames on shared keys.
- **Aggregate**: Aggregating data, similar to SQL GROUP BY operations.
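As a rough illustration of the Filter and Join transforms, the sketch below reuses the `glueContext` from the script above; the database, table, field, and key names are all placeholders:

```python
from awsglue.transforms import Filter, Join

# Two source DynamicFrames (database and table names are placeholders)
orders = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="orders", transformation_ctx="orders")
customers = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="customers", transformation_ctx="customers")

# Filter: keep only records whose "status" field is "active" (field name is a placeholder)
active_orders = Filter.apply(
    frame=orders,
    f=lambda row: row["status"] == "active",
    transformation_ctx="active_orders")

# Join: combine the two frames on a shared key (key name is a placeholder)
orders_with_customers = Join.apply(
    frame1=active_orders, frame2=customers,
    keys1=["customer_id"], keys2=["customer_id"],
    transformation_ctx="orders_with_customers")
```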
### 4. Writing Output
AWS Glue supports writing data to several destinations. In PySpark, you often write to S3, Redshift, or other databases.
#### a. Writing to S3
- Use `glueContext.write_dynamic_frame.from_options()` with `connection_type="s3"` and specify the path and format (e.g., Parquet, CSV). A partitioned variant is sketched below.
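For example, a hedged variation of the S3 write from the script above adds Hive-style partitioning through the `partitionKeys` connection option; the bucket path and partition column are placeholders:

```python
# Write Parquet output partitioned by a column (path and column name are placeholders)
glueContext.write_dynamic_frame.from_options(
    frame=transformed_data,
    connection_type="s3",
    connection_options={
        "path": "s3://my-bucket/partitioned-output/",
        "partitionKeys": ["field2"],
    },
    format="parquet",
    transformation_ctx="s3_partitioned_write")
```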
#### b. Writing to Amazon Redshift
- Use the `glueContext.write_dynamic_frame.from_jdbc_conf()` method.
- Ensure you have set up a JDBC connection in AWS Glue's connections.
- Specify your database and table, along with a temporary directory in S3 for intermediate data (see the sketch after this list).
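As a hedged example of the Redshift-specific options, the sketch below reuses the frame from the script above; the connection name, table, database, temporary directory, and the `preactions` SQL are all placeholders:

```python
# Write to Redshift, truncating the target table first via a preaction
# (connection, table, database, temp dir, and SQL below are placeholders)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=transformed_data,
    catalog_connection="my_redshift_connection",
    connection_options={
        "dbtable": "public.my_table",
        "database": "my_redshift_db",
        "preactions": "TRUNCATE TABLE public.my_table;",
    },
    redshift_tmp_dir="s3://my-bucket/redshift-temp/",
    transformation_ctx="redshift_write_with_preactions")
```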
### 5. Running and Monitoring
After script development and configuration:
1. Save and run the Glue job.
2. Monitor the job execution via the AWS Glue Console, where you can see logs and job status. A boto3 sketch for starting and polling a run follows.
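Jobs can also be started and monitored programmatically. The following is a minimal boto3 sketch; the job name is a placeholder:

```python
import boto3

glue = boto3.client("glue")

# Start the job (the name is a placeholder) and check its status
run = glue.start_job_run(JobName="my-etl-job")
run_id = run["JobRunId"]

status = glue.get_job_run(JobName="my-etl-job", RunId=run_id)
print(status["JobRun"]["JobRunState"])  # e.g. STARTING, RUNNING, SUCCEEDED, FAILED
```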
### Conclusion
Creating AWS Glue jobs with PySpark centers on script development, which holds the core data transformation logic, followed by writing the results to S3, Redshift, or other destinations. With AWS Glue's automated infrastructure provisioning, job scheduling, and monitoring capabilities, managing and deploying data workflows becomes efficient and scalable.