The original post is located here – https://github.com/naeemmohd/serverless/blob/master/serverless001-processdata-using-s3lambdadynamodb/README.md
This article shows how to use three AWS services – S3, Lambda and DynamoDB – to process structured files.
The following steps would be needed to accomplish the objective:
- First of all, we will have monthly salary data files for an organisation, with Employee ID, Employee Name and Salary as the fields.
- Next, we will upload this file to S3.
- Next, the upload will fire a Lambda trigger event, and the Lambda function will process the uploaded file.
- Next, once the data is processed, it will be stored in a DynamoDB table.
Let's tabulate the steps as below:
Steps | Actions |
---|---|
Prerequisite | Generate the data files for 12 months for 100 employees |
S3 | Create an S3 bucket to upload files |
Lambda | Create a Lambda function with a trigger which gets invoked when a file is uploaded to S3 |
DynamoDB | As each file is processed, write and update the data in a table |
Prerequisite – Generate the data files for 12 months for 100 employees
- Since we don’t have data files with us, let’s generate them with a sample Python script.
- We need some random names and random salary values to generate such a file.
- Python has two libraries for this – names to generate names and random to generate numbers.
- The names package has to be installed first (random ships with Python):
# **install pip and dependencies**
sudo apt-get install python-pip python-dev build-essential
# **install names package**
pip install names
# **random is part of the Python standard library – no separate install is needed**
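To quickly sanity-check the install, you can print a single random name from the shell (just a sketch – the name printed will differ on every run):
python -c "import names; print(names.get_full_name())"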
- Here is the code for the salary data generator (DataFileGenerator.py):
# **import the names package**
import names
# **import the random package**
import random
# **import the os package to create the output folder**
import os

# **create the logs output folder if it does not exist**
if not os.path.exists('logs'):
    os.makedirs('logs')

# **As we need to generate 12 months of data, create the range - range(1, 13)**
monthNumbers = range(1, 13)
# **Iterate through the month numbers**
for monthNumber in monthNumbers:
    # **Open/create a salarydata file to write the salary data**
    with open('logs/salarydata-' + str(monthNumber) + '.csv', 'w+') as outfile:
        # **Number of rows in each file - 100**
        numRows = range(1, 101)
        # **Write the first line as the header**
        outfile.write('EmpID,EmpName,EmpSalary\n')
        strRow = ''
        # **Iterate through the row numbers to generate the rows**
        for numRow in numRows:
            # **EmpID is the concatenated string str(monthNumber) + str(numRow)**
            # **EmpName is a random name from the names package - names.get_full_name()**
            # **EmpSalary is a random salary value between 5555 and 7777 - random.randint(5555, 7777)**
            # **For even rows the name will be male, otherwise female**
            if numRow % 2 == 0:
                strRow = str(monthNumber) + str(numRow) + ',' + names.get_full_name(gender='male') + ',' + str(random.randint(5555, 7777))
            else:
                strRow = str(monthNumber) + str(numRow) + ',' + names.get_full_name(gender='female') + ',' + str(random.randint(5555, 7777))
            # **Write the generated row to the file**
            outfile.write(strRow + '\n')
- Now save and execute the above Python file (DataFileGenerator.py):
python DataFileGenerator.py
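The script writes 12 files, logs/salarydata-1.csv through logs/salarydata-12.csv. The first few lines of a generated file look roughly like the sample below; the names and salary figures here are made-up illustrations, since every run produces different random values:
EmpID,EmpName,EmpSalary
11,Mary Johnson,6012
12,James Miller,5731
13,Patricia Davis,7204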
S3 | Create an S3 bucket to upload files
- Create a bucket – any globally unique, lowercase name is okay (S3 bucket names cannot contain uppercase letters).
- Let’s put the name as – s3bucketprocesssalarydata2019
- Please see the snapshot below
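If you prefer the command line to the console, the bucket can also be created and a generated file uploaded with the AWS CLI. This is only a sketch – it assumes the AWS CLI is installed and configured with credentials, and that you substitute your own unique bucket name:
# **create the bucket (the name must be globally unique and lowercase)**
aws s3 mb s3://s3bucketprocesssalarydata2019
# **upload one of the generated salary data files**
aws s3 cp logs/salarydata-1.csv s3://s3bucketprocesssalarydata2019/salarydata-1.csv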
Lambda | Create a Lambda function with a trigger which gets invoked when a file is uploaded to S3
- Create a Lambda function named – process_salary_data.
- Add a trigger that invokes the function whenever an object is added to the above bucket.
- Add the function code as below:
# **import the boto3, csv and json packages**
import boto3
import csv
import json

# **create clients for S3 and DynamoDB**
s3 = boto3.client('s3')
dynamodb = boto3.client('dynamodb')

# **The main Lambda handler function**
def handler(event, context):
    # **get the bucket name and the salary data file name as the key**
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    # **Skip the .psv files this function uploads back to the bucket, otherwise the trigger would loop**
    if '.psv' in key:
        return 'This .psv file is already processed.'
    # **Don't process files that do not have a .csv extension**
    if '.csv' not in key:
        return 'Please upload .csv files only.'
    # **download the .csv file to the /tmp folder**
    s3.download_file(bucket, key, '/tmp/' + key)
    psvName = 'processed_' + key[0:-4] + '.psv'
    # **Open the .csv file to process it, then upload the processed .psv file**
    with open('/tmp/' + key, 'r') as infile, \
            open('/tmp/' + psvName, 'w') as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, reader.fieldnames, delimiter='|')
        writer.writeheader()
        writer.writerows(reader)
    s3.upload_file('/tmp/' + psvName, bucket, psvName)
    # **Use DynamoDB atomic counters to add/update the data in the table**
    with open('/tmp/' + key, 'r') as infile:
        # **skip the header row**
        first_line = infile.readline()
        for row in infile:
            fields = row.strip().split(',')
            ddb_empId = fields[0]
            ddb_empName = fields[1]
            ddb_empSalary = fields[2]
            response = dynamodb.update_item(
                TableName='EmployeeSalary',
                Key={
                    'EmpID': {'N': ddb_empId},
                    'EmpName': {'S': ddb_empName},
                },
                UpdateExpression='ADD EmpSalary :empSalary',
                ExpressionAttributeValues={
                    ':empSalary': {'N': ddb_empSalary}
                },
                ReturnValues="UPDATED_NEW"
            )
            print(response)
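To try the function without uploading a new file, you can configure a test event in the Lambda console. A trimmed-down S3 put event carrying just the fields the handler reads (the bucket name and the object key) would look roughly like this; the bucket and file names below are the placeholder values used in this walkthrough:
{
  "Records": [
    {
      "s3": {
        "bucket": { "name": "s3bucketprocesssalarydata2019" },
        "object": { "key": "salarydata-1.csv" }
      }
    }
  ]
}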
DynamoDB | As each file is processed, write and update the data in a table
- Create a DynamoDB table ‘EmployeeSalary’ with ‘EmpID’ (a Number) as the Primary Key and ‘EmpName’ (a String) as the Sort Key, matching the attribute types the Lambda function writes.
- Please see the snapshot below.
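As a sketch, the same table can also be created from the AWS CLI, assuming the CLI is configured; the attribute types match what the Lambda code writes (EmpID as a Number, EmpName as a String):
aws dynamodb create-table \
    --table-name EmployeeSalary \
    --attribute-definitions AttributeName=EmpID,AttributeType=N AttributeName=EmpName,AttributeType=S \
    --key-schema AttributeName=EmpID,KeyType=HASH AttributeName=EmpName,KeyType=RANGE \
    --billing-mode PAY_PER_REQUEST
After uploading a salary file to the bucket, running aws dynamodb scan --table-name EmployeeSalary --max-items 5 should show the first few processed rows.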