Chaining MapReduces - Google AppEngine - python

I'm trying to send the output of a reduce to a map by chaining MapReduces in a pipeline, similar to this question:
I would like to chain multiple mapreduce jobs in google app engine in Python
I tried the solution given there, but it didn't work.
The flow of my pipeline is:
Map1
Reduce1
Map2
Reduce2
I'm saving the output of Reduce1 to the blobstore under a blob_key, and then trying to access the blob from Map2. But I get the following error while executing the second map: "BadReaderParamsError: Could not find blobinfo for key <blob_key here>".
Here's the pipeline code:
class SongsPurchasedTogetherPipeline(base_handler.PipelineBase):
    def run(self, filekey, blobkey):
        bucket_name = app_identity.get_default_gcs_bucket_name()
        intermediate_output = yield mapreduce_pipeline.MapreducePipeline(
            "songs_purchased_together_intermediate",
            "main.songs_purchased_together_map1",
            "main.songs_purchased_together_reduce1",
            "mapreduce.input_readers.BlobstoreLineInputReader",
            "mapreduce.output_writers.GoogleCloudStorageOutputWriter",
            mapper_params={
                "blob_keys": blobkey,
            },
            reducer_params={
                "output_writer": {
                    "bucket_name": bucket_name,
                    "content_type": "text/plain",
                }
            },
            shards=1)
        yield StoreOutput("SongsPurchasedTogetherIntermediate", filekey, intermediate_output)
        intermediate_output_key = yield BlobKey(intermediate_output)
        output = yield mapreduce_pipeline.MapreducePipeline(
            "songs_purchased_together",
            "main.songs_purchased_together_map2",
            "main.songs_purchased_together_reduce2",
            "mapreduce.input_readers.BlobstoreLineInputReader",
            "mapreduce.output_writers.GoogleCloudStorageOutputWriter",
            mapper_params=(intermediate_output_key),
            reducer_params={
                "output_writer": {
                    "bucket_name": bucket_name,
                    "content_type": "text/plain",
                }
            },
            shards=1)
        yield StoreOutput("SongsPurchasedTogether", filekey, output)
and here's the BlobKey class which takes the intermediate output and generates the blob key for Map2 to use:
class BlobKey(base_handler.PipelineBase):
    def run(self, output):
        blobstore_filename = "/gs" + output[0]
        blobstore_gs_key = blobstore.create_gs_key(blobstore_filename)
        return {
            "blob_keys": blobstore_gs_key
        }
The StoreOutput class is the same as the one in Google's MapReduce demo https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/demo/main.py, and does the same thing as the BlobKey class, but additionally sends the blob's URL to HTML as a link.
Manually visiting the URL appname/blobstore/<blob_key> in a browser (after Reduce1 succeeds but Map2 fails) displays the output expected from Reduce1. Why can't Map2 find the blob? Sorry, I'm a newbie to App Engine, and I'm probably going wrong somewhere because I don't fully understand blob storage.

Okay, I found out that Google has removed the BlobstoreOutputWriter from the list of standard writers in the GAE GitHub repository, which makes things a little more complicated. I had to write to Google Cloud Storage and read from there. I wrote a helper class which generates mapper parameters for the GoogleCloudStorageInputReader.
class GCSMapperParams(base_handler.PipelineBase):
    def run(self, GCSPath):
        bucket_name = app_identity.get_default_gcs_bucket_name()
        return {
            "input_reader": {
                "bucket_name": bucket_name,
                "objects": [path.split('/', 2)[2] for path in GCSPath],
            }
        }
The pipeline takes as its argument the output of a MapReduce stage that uses a GoogleCloudStorageOutputWriter, and returns a dictionary which can be assigned to mapper_params of the next MapReduce stage.
Basically, the output of the first MapReduce stage is a list of paths of the form <app_name>/<pipeline_name>/key/output-[i], with one entry per shard. To use a GoogleCloudStorageInputReader, the object keys must be passed through the objects variable in mapper_params, and each key must be of the form key/output-[i], so the helper class simply strips the leading <app_name>/<pipeline_name>/ from each path.
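For context, here is a rough sketch of how the helper might be wired into the pipeline from the question: the second stage switches to the GoogleCloudStorageInputReader and takes the yielded parameters as its mapper_params. The class and function names are carried over from the question and not verified against a running app.
        # ...inside SongsPurchasedTogetherPipeline.run(), after the first stage...
        second_mapper_params = yield GCSMapperParams(intermediate_output)
        output = yield mapreduce_pipeline.MapreducePipeline(
            "songs_purchased_together",
            "main.songs_purchased_together_map2",
            "main.songs_purchased_together_reduce2",
            "mapreduce.input_readers.GoogleCloudStorageInputReader",
            "mapreduce.output_writers.GoogleCloudStorageOutputWriter",
            mapper_params=second_mapper_params,
            reducer_params={
                "output_writer": {
                    "bucket_name": bucket_name,
                    "content_type": "text/plain",
                }
            },
            shards=1)
        yield StoreOutput("SongsPurchasedTogether", filekey, output)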

Related

Error trying to write CSV file to Google Cloud Storage from Dataflow pipeline

I'm working on building a Dataflow pipeline that reads a CSV file (containing 250,000 rows) from my Cloud Storage bucket, modifies the value of each row and then writes the modified contents to a new CSV in the same bucket. With the code below I'm able to read and modify the contents of the original file, but when I attempt to write the contents of the new file in GCS I get the following error:
google.api_core.exceptions.TooManyRequests: 429 POST https://storage.googleapis.com/upload/storage/v1/b/my-bucket/o?uploadType=multipart: {
  "error": {
    "code": 429,
    "message": "The rate of change requests to the object my-bucket/product-codes/URL_test_codes.csv exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
    "errors": [
      {
        "message": "The rate of change requests to the object my-bucket/product-codes/URL_test_codes.csv exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
        "domain": "usageLimits",
        "reason": "rateLimitExceeded"
      }
    ]
  }
}
: ('Request failed with status code', 429, 'Expected one of', <HTTPStatus.OK: 200>) [while running 'Store Output File']
My code in Dataflow:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import traceback
import sys
import pandas as pd
from cryptography.fernet import Fernet
import google.auth
from google.cloud import storage

fernet_secret = 'aD4t9MlsHLdHyuFKhoyhy9_eLKDfe8eyVSD3tu8KzoP='
bucket = 'my-bucket'
inputFile = f'gs://{bucket}/product-codes/test_codes.csv'
outputFile = 'product-codes/URL_test_codes.csv'

# Pipeline Logic
def product_codes_pipeline(project, env, region='us-central1'):
    options = PipelineOptions(
        streaming=False,
        project=project,
        region=region,
        staging_location="gs://my-bucket-dataflows/Templates/staging",
        temp_location="gs://my-bucket-dataflows/Templates/temp",
        template_location="gs://my-bucket-dataflows/Templates/Generate_Product_Codes.py",
        subnetwork='https://www.googleapis.com/compute/v1/projects/{}/regions/us-central1/subnetworks/{}-private'.format(project, env)
    )

    # Transform function
    def genURLs(code):
        f = Fernet(fernet_secret)
        encoded = code.encode()
        encrypted = f.encrypt(encoded)
        decrypted = f.decrypt(encrypted.decode().encode())
        decoded = decrypted.decode()
        if code != decoded:
            print(f'Error: Code {code} and decoded code {decoded} do not match')
            sys.exit(1)
        url = 'https://some-url.com/redeem/product-code=' + encrypted.decode()
        return url

    class WriteCSVFIle(beam.DoFn):
        def __init__(self, bucket_name):
            self.bucket_name = bucket_name

        def start_bundle(self):
            self.client = storage.Client()

        def process(self, urls):
            df = pd.DataFrame([urls], columns=['URL'])
            bucket = self.client.get_bucket(self.bucket_name)
            bucket.blob(f'{outputFile}').upload_from_string(df.to_csv(index=False), 'text/csv')
    # End function

    p = beam.Pipeline(options=options)
    (p | 'Read Input CSV' >> beam.io.ReadFromText(inputFile, skip_header_lines=1)
       | 'Map Codes' >> beam.Map(genURLs)
       | 'Store Output File' >> beam.ParDo(WriteCSVFIle(bucket)))
    p.run()
The code produces URL_test_codes.csv in my bucket, but the file only contains one row (not including the 'URL' header) which tells me that my code is writing/overwriting the file as it processes each row. Is there a way to bulk write the contents of the entire file instead of making a series of requests to update the file? I'm new to Python/Dataflow so any help is greatly appreciated.
Let's point out the issues: the evident one is a quota issue on the GCS side, reflected by the 429 error code. But as you noted, this stems from the underlying issue, which is how you write your data to your blob.
Since a Beam pipeline works on a parallel collection of elements (a PCollection), each pipeline step is executed once per element; in other words, your ParDo function tries to write something to your output file once per element in your PCollection.
So there are some issues with your WriteCSVFIle function. For example, in order to write your PCollection to GCS, it would be better to use a separate pipeline task focused on writing the whole PCollection, such as the following:
First, you can import this Function already included in Apache Beam:
from apache_beam.io import WriteToText
Then, you use it at the end of your pipeline:
| 'Write PCollection to Bucket' >> WriteToText('gs://{0}/{1}'.format(bucket_name, outputFile))
With this option you don't need to create a storage client or reference a blob; the transform just needs to receive the GCS URI where it should write the final result, and you can adjust it with the parameters you can find in the documentation.
With this, you only need to address the DataFrame created in your WriteCSVFIle function. Each pipeline step creates a new PCollection, so with your current logic a DataFrame-creating step that receives elements from a PCollection of URLs would produce one DataFrame per URL. Since it seems you just want to write the results from genURLs, and 'URL' is the only column in your DataFrame, going directly from genURLs to WriteToText should output what you're looking for.
Either way, you can adjust your pipeline accordingly, but at least with the WriteToText transform it would take care of writing your whole final PCollection to your Cloud Storage bucket.
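A minimal sketch of the adjusted pipeline under those assumptions, reusing the bucket, inputFile, outputFile and options variables from the question (the header and num_shards arguments are optional conveniences; by default WriteToText appends a shard suffix to the file name):
from apache_beam.io import WriteToText

p = beam.Pipeline(options=options)
(p | 'Read Input CSV' >> beam.io.ReadFromText(inputFile, skip_header_lines=1)
   | 'Map Codes' >> beam.Map(genURLs)
   | 'Write PCollection to Bucket' >> WriteToText('gs://{0}/{1}'.format(bucket, outputFile),
                                                  header='URL',   # optional: re-add the header row
                                                  num_shards=1))  # optional: keep a single output file
p.run()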

Look for specific ec2 instances in list of all running instances

Background
I am writing a small script where I am trying to get a list of all the running EC2 instances in a particular region. Out of that list, I am trying to see if there are instances with specific names.
I have done the following.
My Code
import boto3

ec2_client = boto3.client("ec2", region_name="us-east-1")
reservations = ec2_client.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
).get("Reservations")

tag_list = []
for reservation in reservations:
    instance = reservation["Instances"][0]
    if "Tags" in instance:
        tag_list.extend(instance["Tags"])

for tag in tag_list:
    if tag["Key"] == "Name":
        if tag["Value"] == "primary_node":
            print("primary node is still running.")
        if "asg-prod" in tag["Value"]:
            print("asg instances are still running")
I am wondering if there is a way to simplify this and do the above more effectively. For instance, can I just add the tag value I am looking for in the describe_instances(Filters=[{ part of the code? I am open to any suggestions that help me do the above more effectively, as I suspect I might not be doing so.
You're specifying the filtering incorrectly. You need multiple filters, not one, and how you specify the individual filter name/value pairs needs to change.
Here's an example:
import boto3

ec2_client = boto3.client("ec2", region_name="us-east-1")
instances = ec2_client.describe_instances(
    Filters=[
        {"Name": "tag:Name", "Values": ["WebServer01"]},
        {"Name": "instance-state-name", "Values": ["running"]}
    ]
)

for reservation in instances["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance['InstanceId'])
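To cover both names from the question in a single call: values inside one filter are ORed together, separate filters are ANDed, and filter values accept * wildcards. A sketch along those lines, using the tag values from the question as assumed names:
import boto3

ec2_client = boto3.client("ec2", region_name="us-east-1")

# One request: "primary_node" OR anything starting with "asg-prod", AND state == running.
response = ec2_client.describe_instances(
    Filters=[
        {"Name": "tag:Name", "Values": ["primary_node", "asg-prod*"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)

for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
        # Pull the Name tag back out of the matched instance for reporting.
        name = next((t["Value"] for t in instance.get("Tags", []) if t["Key"] == "Name"), "")
        if name == "primary_node":
            print("primary node is still running.")
        elif name.startswith("asg-prod"):
            print("asg instances are still running")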

How to test feeding a PDF to a lambda function

I have written a lambda function whose goal is to accept a PDF, parse it using PyPDF2, and return specific text fields as a payload.
import PyPDF2 as pypdf

def lambda_handler(event, context):
    file_path = event['body']
    pdfdoc = open(file_path, 'rb')
    pdfreader = pypdf.PdfFileReader(pdfdoc)
    # dictionary of extracted text fields
    dic = pdfreader.getFormTextFields()
    # fields we want to send in our payload
    firstname = dic['name'].split(' ')[0]
    lastname = dic['name[0]'].split(' ')[1]
    payload = {"FirstName": firstname,
               "LastName": lastname,
               }
    return jsonify(payload)
The problem is, I don't know whether this is going to work, and I want to test it by feeding it a PDF and seeing what error it spits out at me. How can I manually feed my function a PDF without tying it to API Gateway or another app? I've tested the code locally, but I've never passed a PDF to a lambda function before, so this is the part that is throwing me off.
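One hedged sketch of a purely local test: since the handler above treats event['body'] as a file path, a plain Python script can call it directly with a fake event and a test PDF, without API Gateway. The module name here is hypothetical.
# Hypothetical local test harness for the handler above.
from my_lambda_module import lambda_handler  # assumed module name containing lambda_handler

fake_event = {"body": "/tmp/sample_form.pdf"}  # path to a test PDF on disk
print(lambda_handler(fake_event, None))  # context is unused by the handler, so None is fine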

Best way to change type of column in dynamoDB

I have a DynamoDB table that is filled from different services/sources. The table has the following schema:
{
    "Id": 14782,
    "ExtId": 1478240974,  // pay attention, it is a Number
    "Name": "name1"
}
Some time after the services started running, I found that one of them sends data in an incorrect format. It looks like:
{
    "Id": 14782,
    "ExtId": "1478240974",  // pay attention, it is a String
    "Name": "name1"
}
DynamoDB is a NoSQL database, so I now have millions of mixed records that are difficult to query or scan. I understand that my main fault was the missing validation.
Now I have to go through all the records and, for each one with the wrong type, remove it and re-add it with the same data in the correct format. Is there a more graceful way to do this?
So it was pretty easy. It is possible to do this with the attribute_type method.
First of all, I added imports:
import boto3
from boto3.dynamodb.conditions import Attr
And my code:
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-table')  # table name assumed

attr = Attr('ExtId').attribute_type('S')
response = table.scan(FilterExpression=attr)
items = response['Items']
while 'LastEvaluatedKey' in response:
    response = table.scan(FilterExpression=attr, ExclusiveStartKey=response['LastEvaluatedKey'])
    items.extend(response['Items'])
More condition customizations can be found in the following article: DynamoDB Customization Reference.
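The scan above only collects the mismatched items. A rough sketch of rewriting them in place, assuming Id is the table's partition key and its only key attribute:
# Hypothetical fix-up pass: store ExtId as a Number for every item found above.
for item in items:
    table.update_item(
        Key={'Id': item['Id']},  # assumes 'Id' is the partition key
        UpdateExpression='SET ExtId = :v',
        ExpressionAttributeValues={':v': int(item['ExtId'])}
    )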

How to upload a file to Google Drive using a Python script?

I need to backup various file types to GDrive (not just those convertible to GDocs formats) from some linux server.
What would be the simplest, most elegant way to do that with a python script? Would any of the solutions pertaining to GDocs be applicable?
You can use the Documents List API to write a script that writes to Drive:
https://developers.google.com/google-apps/documents-list/
Both the Documents List API and the Drive API interact with the same resources (i.e. same documents and files).
This sample in the Python client library shows how to upload an unconverted file to Drive:
http://code.google.com/p/gdata-python-client/source/browse/samples/docs/docs_v3_example.py#180
The current documentation for saving a file to google drive using python can be found here:
https://developers.google.com/drive/v3/web/manage-uploads
However, the way that the google drive api handles document storage and retrieval does not follow the same architecture as POSIX file systems. As a result, if you wish to preserve the hierarchical architecture of the nested files on your linux file system, you will need to write a lot of custom code so that the parent directories are preserved on google drive.
On top of that, Google makes it difficult to gain write access to a normal Drive account. Your permissions must include the following scope: https://www.googleapis.com/auth/drive, and to obtain a token to access a user's normal account, that user must first join a group that provides access to non-reviewed apps. And any OAuth token that is created has a limited shelf life.
However, if you obtain an access token, the following script should allow you to save any file on your local machine to the same (relative) path on google drive.
def migrate(file_path, access_token, drive_space='drive'):
    '''
    a method to save a posix file architecture to google drive

    NOTE: to write to a google drive account using a non-approved app,
          the oauth2 grantee account must also join this google group
          https://groups.google.com/forum/#!forum/risky-access-by-unreviewed-apps

    :param file_path: string with path to local file
    :param access_token: string with oauth2 access token grant to write to google drive
    :param drive_space: string with name of space to write to (drive, appDataFolder, photos)
    :return: string with id of file on google drive
    '''

    # construct drive client
    import httplib2
    from googleapiclient import discovery
    from oauth2client.client import AccessTokenCredentials
    google_credentials = AccessTokenCredentials(access_token, 'my-user-agent/1.0')
    google_http = httplib2.Http()
    google_http = google_credentials.authorize(google_http)
    google_drive = discovery.build('drive', 'v3', http=google_http)
    drive_client = google_drive.files()

    # prepare file body
    from googleapiclient.http import MediaFileUpload
    media_body = MediaFileUpload(filename=file_path, resumable=True)

    # determine file modified time
    import os
    from datetime import datetime
    modified_epoch = os.path.getmtime(file_path)
    modified_time = datetime.utcfromtimestamp(modified_epoch).isoformat()

    # determine path segments
    path_segments = file_path.split(os.sep)

    # construct upload kwargs
    create_kwargs = {
        'body': {
            'name': path_segments.pop(),
            'modifiedTime': modified_time
        },
        'media_body': media_body,
        'fields': 'id'
    }

    # walk through parent directories
    parent_id = ''
    if path_segments:

        # construct query and creation arguments
        walk_folders = True
        folder_kwargs = {
            'body': {
                'name': '',
                'mimeType': 'application/vnd.google-apps.folder'
            },
            'fields': 'id'
        }
        query_kwargs = {
            'spaces': drive_space,
            'fields': 'files(id, parents)'
        }
        while path_segments:
            folder_name = path_segments.pop(0)
            folder_kwargs['body']['name'] = folder_name

            # search for folder id in existing hierarchy
            if walk_folders:
                walk_query = "name = '%s'" % folder_name
                if parent_id:
                    walk_query += " and '%s' in parents" % parent_id
                query_kwargs['q'] = walk_query
                response = drive_client.list(**query_kwargs).execute()
                file_list = response.get('files', [])
            else:
                file_list = []
            if file_list:
                parent_id = file_list[0].get('id')
            # or create folder
            # https://developers.google.com/drive/v3/web/folder
            else:
                if not parent_id:
                    if drive_space == 'appDataFolder':
                        folder_kwargs['body']['parents'] = [drive_space]
                    else:
                        # discard any stale parents entry (it may not exist yet)
                        folder_kwargs['body'].pop('parents', None)
                else:
                    folder_kwargs['body']['parents'] = [parent_id]
                response = drive_client.create(**folder_kwargs).execute()
                parent_id = response.get('id')
                walk_folders = False

    # add parent id to file creation kwargs
    if parent_id:
        create_kwargs['body']['parents'] = [parent_id]
    elif drive_space == 'appDataFolder':
        create_kwargs['body']['parents'] = [drive_space]

    # send create request
    file = drive_client.create(**create_kwargs).execute()
    file_id = file.get('id')

    return file_id
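A hypothetical call, with a placeholder token standing in for a real OAuth2 access token with the drive scope:
# Hypothetical usage: mirrors data/reports/report.pdf into the same relative path on Drive.
new_file_id = migrate('data/reports/report.pdf', access_token='ya29.placeholder-token')
print(new_file_id)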
PS. I have modified this script from the labpack python module. There is a class called driveClient in that module, written by rcj1492, which handles saving, loading, searching and deleting files on google drive in a way that preserves the POSIX file system.
from labpack.storage.google.drive import driveClient
I found that PyDrive handles the Drive API elegantly, and it also has great documentation (especially walking the user through the authentication part).
EDIT: Combine that with the material on Automating pydrive verification process and Pydrive google drive automate authentication, and that makes for some great documentation to get things going. Hope it helps those who are confused about where to start.
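As a rough sketch of the PyDrive approach (assuming a client_secrets.json has been set up as described in the PyDrive docs, and the file name is just an example):
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

# Authenticate via the local webserver flow (requires client_secrets.json).
gauth = GoogleAuth()
gauth.LocalWebserverAuth()
drive = GoogleDrive(gauth)

# Upload an arbitrary (non-GDocs) file from the local machine.
backup = drive.CreateFile({'title': 'backup.tar.gz'})
backup.SetContentFile('backup.tar.gz')
backup.Upload()
print(backup['id'])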
