I am currently testing a Python script on an EMR instance with the Boto package.
The script reads each row of a file called fileC, compares its content with the rows of fileAC, and writes the filtered rows to a separate file. Since I am comparing two big files, I created an intermediate filter with a third file, fileA, to save time.
The problem is the following: during my tests on a local machine the script works fine and filters out more than 200 rows from fileC. But once I try it on AWS with the Boto package, the script doesn't filter fileC at all (p = 0, no "found" printed, and the same number of rows as fileC). It seems that the two files fileA and fileAC are not read. With boto, I used the cache_files= argument of StreamingStep() to distribute the two files (fileA and fileAC) to each node of the cluster. It used to work for other scripts, but here it doesn't. Any thoughts?
Here is the script:
import fileinput
import os
import sys

sys.path.append(os.path.dirname(__file__))

def main(argv):
    filenameAC = 'activities.log'
    filenameA = 'activitiesCookieCountry.log'
    fileC = fileinput.FileInput(sys.argv[1:])
    fileA = open(filenameA, 'r')
    fileAC = open(filenameAC, 'r')
    fileA = [line.rstrip('\n') for line in fileA]
    Alines = set(fileA)
    for lineC in fileC:
        fieldC = lineC.split('#')
        fieldComp = fieldC[0] + '#' + fieldC[2]
        p = 0
        if fieldComp in Alines:
            fileAC.seek(0)
            for lineAC in fileAC:
                fieldAC = lineAC.split('#')
                if (fieldAC[0] == fieldC[0]) and (fieldAC[2] == fieldC[2]) and (fieldAC[1] < fieldC[1]):
                    p = 1
                    print('found')
        if p == 0:
            sys.stdout.write(lineC)

if __name__ == "__main__":
    main(sys.argv)
And here is the script that runs it on EMR:
utils = ['s3n://blablabla/activities.log#activities.log',
         's3n://blablabla/Activities/activitiesCookieCountry.log#activitiesCookieCountry.log']
sargs = ['-jobconf', 'mapred.output.compress=true',
         '-jobconf', 'mapred.output.compression.type=block',
         '-jobconf', 'mapred.compress.map.output=true',
         '-jobconf', 'stream.map.output.field.separator="#"',
         '-jobconf', 'mapred.reduce.tasks="1"']
cACstep = StreamingStep(
    name='ClickActivityCheck',
    mapper='s3n://blablabla/click-formatting-ACheck-S3.py',
    reducer=None,
    input='s3n://blablabla/ClickCleanedFeb14/*.gz',
    output='s3n://blablabla/ClickCleaned2Feb14',
    cache_files=utils,
    step_args=sargs
)
jobid = conn.run_jobflow(
    name='AWS_Flow_Test',
    log_uri='s3n://blablabla/Logging/jobflow_logs',
    ec2_keyname='xxxx',
    availability_zone=None,
    master_instance_type='m1.small',
    slave_instance_type='m1.small',
    num_instances=4,
    action_on_failure=None,
    keep_alive=False,
    enable_debugging=True,
    hadoop_version='1.0.3',
    steps=[cACstep],
    bootstrap_actions=[],
    instance_groups=None,
    additional_info=None,
    ami_version='2.4.1',
    api_params=None,
    visible_to_all_users=True)
My suspicion is that the two files shouldn't be referenced like this:
filenameAC = 'activities.log'
filenameA = 'activitiesCookieCountry.log'
But I don't really know how else to define them …
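One way to narrow this down is to check, from inside the mapper itself, whether the cached files actually exist on the task nodes before opening them. The sketch below is my own debugging helper, not part of boto or Hadoop; it assumes (as Hadoop streaming normally does) that cache_files get symlinked into the task's current working directory:

```python
import os
import sys

def resolve_cache_file(filename, search_dirs=('.',)):
    """Return the first existing path to a distributed-cache file, or None.

    Hadoop streaming normally symlinks cache_files into the task's current
    working directory, hence the default search path.
    """
    for d in search_dirs:
        candidate = os.path.join(d, filename)
        if os.path.exists(candidate):
            return candidate
    # write to stderr so the message lands in the task attempt logs,
    # not in the job output
    sys.stderr.write('cache file not found: %s\n' % filename)
    return None
```

If `resolve_cache_file('activities.log')` logs "not found" in the task attempt logs, the cache files were never materialized on the nodes, and the plain `open()` calls are the real failure point.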
Related
Weird problem I've run into. I'm currently using the following code:
generic.py
def function_in_different_pyfile(input_folder):
    # do stuff here

folder_1 = f"/folder_1"
folder_1_virtualdir = CONTAINER_CLIENT.list_blobs(name_starts_with=folder_1)
folder_2 = f"/folder_2"
folder_2_virtualdir = CONTAINER_CLIENT.list_blobs(name_starts_with=folder_2)

if len([file for file in folder_1_virtualdir]) != len([file for file in folder_2_virtualdir]):
    generic.function_in_different_pyfile(folder_1_virtualdir)
else:
    print('Already done')
So what I'm trying to do is:
Check the number of files in folder_1_virtualdir and folder_2_virtualdir
If they aren't equal, run the function.
If they are, then print statement/pass.
The problem:
generic.function() runs but doesn't do anything when you pass in the list comprehension.
generic.function() works totally fine if you don't have a list comprehension in the code, e.g.:
folder_1 = f"/folder_1"
folder_1_virtualdir = CONTAINER_CLIENT.list_blobs(name_starts_with=folder_1)
folder_2 = f"/folder_2"
folder_2_virtualdir = CONTAINER_CLIENT.list_blobs(name_starts_with=folder_2)
generic.function_in_different_pyfile(folder_1_virtualdir)
Will work completely fine.
There are no error messages. It passes through the function as if the function doesn't do anything.
What I've tried:
I've tested this by modifying the function:
generic.py
def function_in_different_pyfile(input_folder):
    print('Start of the function')
    # do stuff here
    print('End of the function')
You will see these print statements, but the function doesn't process any of the files in the input_folder argument if you include the list comprehension.
This extends to the case where the list comprehension is ANYWHERE in the code:
folder_1 = f"/folder_1"
folder_1_virtualdir = CONTAINER_CLIENT.list_blobs(name_starts_with=folder_1)
folder_1_contents = [file for file in folder_1_virtualdir]
folder_2 = f"/folder_2"
folder_2_virtualdir = CONTAINER_CLIENT.list_blobs(name_starts_with=folder_2)
generic.function_in_different_pyfile(folder_1_virtualdir)
# Function doesn't run.
I'm fairly new to Python and can't seem to understand why the list comprehension here completely prevents the function from running correctly.
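One likely explanation (an assumption on my part, but it matches the symptom): CONTAINER_CLIENT.list_blobs() returns a one-shot paged iterator, and the list comprehension inside the len() call consumes it, so the function later receives an already-exhausted iterator. A minimal reproduction with a plain generator standing in for the blob listing:

```python
def make_blobs():
    # stands in for CONTAINER_CLIENT.list_blobs(...), which yields each blob once
    return (name for name in ["a.txt", "b.txt", "c.txt"])

blobs = make_blobs()
count = len([b for b in blobs])   # the comprehension consumes the iterator
remaining = [b for b in blobs]    # nothing is left to yield

print(count)      # 3
print(remaining)  # [] -- so a function receiving `blobs` now sees no files

# fix: materialize once, then reuse the list everywhere
blob_list = list(make_blobs())
print(len(blob_list), blob_list[0])  # 3 a.txt
```

Passing `blob_list` (or calling list_blobs() a second time) to the function instead of the consumed iterator avoids the symptom.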
You could try this code if the number of files in the folder is less than 5000:
folder_1 = f"/folder_1"
folder_1_virtualdir = CONTAINER_CLIENT.list_blobs(name_starts_with=folder_1)
folder_2 = f"/folder_2"
folder_2_virtualdir = CONTAINER_CLIENT.list_blobs(name_starts_with=folder_2)
folder_1_count = len(folder_1_virtualdir)
folder_2_count = len(folder_2_virtualdir)

if folder_1_count != folder_2_count:
    generic.function_in_different_pyfile(folder_1_virtualdir)
else:
    print('Already done')
If it is greater than 5000, you need to get the count by iterating through your blobs:
count = 0
for count, item in enumerate(blobs):
    print("number", count + 1, "in the list is", item)
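If memory is a concern, the count can also be taken without building a list at all (a sketch; the generator below stands in for the blob iterator):

```python
# stands in for CONTAINER_CLIENT.list_blobs(...): a one-shot iterator of names
blobs = ("blob-%d" % i for i in range(7))

# count lazily, one item at a time, without materializing a list
count = sum(1 for _ in blobs)
print(count)  # 7
```

Note that this consumes the iterator too, so the listing must be requested again before passing it on.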
I have created a basic web app that takes input, does math, then returns output. I added a function to take the inputs and write them to a .csv file.
After tinkering with it, I got it to work exactly as I wanted running as localhost. I uploaded the new app and the blank .csv file, but whenever I run the app now it does not load the results page and nothing is written to the .csv. I have even put identical .csv files in multiple locations (in the static, templates, and root folders) in case it wasn't looking where I expected.
I am still learning Python and Flask and dealing with hosting, and because I am not getting an error output, just a non-loading web page, I don't know where to start. Ideas?
Here is my code. As I said, it works on localhost, and the site worked fine before I added the section that writes to a .csv.
(Note: the indents are off because of pasting it in here. They are correct in practice.)
@app.route('/', methods=['GET', 'POST'])
def step_bet():
    if request.method == 'POST':
        name = request.form['name']
        people_start = request.form['people_start']
        bet_amount = request.form['bet_amount']
        people_remain = request.form['people_remain']
        beta = request.form['beta']
        #form = web.input(name="nobody", people_start="null", bet="null", people_remain="null", beta="0")
        if people_start == None or bet_amount == None or people_remain == None:
            return render_template('error_form.html')
        else:
            people_startf = float(people_start)
            betf = float(bet_amount)
            people_remainf = float(people_remain)
            if beta == "Yes":
                cut = .125
            elif beta == "Members":
                cut = 0
            else:
                cut = .25
            revenue = round((((people_startf * betf) * (1 - cut)) / people_remainf), 2)
            if revenue < betf:
                revenue = betf
            profit = round((revenue - betf), 2)
            people_remain_needed = int(((people_startf * betf) * (1 - cut)) / betf)
            people_needed = int(people_remainf - people_remain_needed)
            if people_needed < 0:
                people_needed = 0
            # This array is the fields your csv file has; change it to your
            # actual csv's fields.
            fieldnames = ['name', 'people_start', 'bet_amount', 'people_remain',
                          'beta', 'revenue', 'profit', 'people_needed']
            # The second parameter of open is the mode: "w" is write, "a" is
            # append. With append it automatically seeks to the end of the file.
            with open('step_bet_save.csv', 'a') as inFile:
                # DictWriter treats the csv rows as dictionaries, so you don't
                # have to assemble each line manually.
                writer = csv.DictWriter(inFile, fieldnames=fieldnames)
                # writerow() writes one row to your csv file
                writer.writerow({'name': name, 'people_start': people_start,
                                 'bet_amount': bet_amount, 'people_remain': people_remain,
                                 'beta': beta, 'revenue': revenue, 'profit': profit,
                                 'people_needed': people_needed})
            return render_template('results.html', revenue=revenue, profit=profit,
                                   name=name, people_needed=people_needed)
    else:
        return render_template('stepbet_form.html')
Check your permissions on the file/folder and ensure there is write access to the folder.
Refer to CHMOD if you're not 100% sure how to do this on a *nix based web server.
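If you want to verify this from Python rather than the shell, os.access can report whether the process may write to the target directory. A small sketch (the csv filename is taken from the question; the helper is my own):

```python
import os

def can_write_to(path):
    """Return True if the current process may create or write the given file."""
    directory = os.path.dirname(os.path.abspath(path)) or "."
    return os.access(directory, os.W_OK)

# prints False on the server if the app's working directory is not writable
print(can_write_to("step_bet_save.csv"))
```

Dropping that print into the route temporarily would show in the server logs whether the write is even possible.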
I'm trying to execute this file, test.py, from the command line:
from brpy import init_brpy
import requests # or whatever http request lib you prefer
import MagicalImageURLGenerator # made up
# br_loc is /usr/local/lib by default,
# you may change this by passing a different path to the shared objects
br = init_brpy(br_loc='/path/to/libopenbr')
br.br_initialize_default()
br.br_set_property('algorithm','CatFaceRecognitionModel') # also made up
br.br_set_property('enrollAll','true')
mycatsimg = open('mycats.jpg', 'rb').read() # cat picture not provided =^..^=
mycatstmpl = br.br_load_img(mycatsimg, len(mycatsimg))
query = br.br_enroll_template(mycatstmpl)
nqueries = br.br_num_templates(query)
scores = []
for imurl in MagicalImageURLGenerator():
    # load and enroll image from URL
    img = requests.get(imurl).content
    tmpl = br.br_load_img(img, len(img))
    targets = br.br_enroll_template(tmpl)
    ntargets = br.br_num_templates(targets)
    # compare and collect scores
    scoresmat = br.br_compare_template_lists(targets, query)
    for r in range(ntargets):
        for c in range(nqueries):
            scores.append((imurl, br.br_get_matrix_output_at(scoresmat, r, c)))
    # clean up - no memory leaks
    br.br_free_template(tmpl)
    br.br_free_template_list(targets)
# print top 10 match URLs
scores.sort(key=lambda s: s[1])
for s in scores[:10]:
    print(s[0])
# clean up - no memory leaks
br.br_free_template(mycatstmpl)
br.br_free_template_list(query)
br.br_finalize()
This script file is in /myfolder/, while the brpy library is in /myfolder/scripts/brpy.
The brpy folder contains three files: "face_cluster_viz.py", "html_viz.py" and "__init__.py".
When I try to execute this file from cmd it shows an error:
NameError: name 'init_brpy' is not defined
Why? What am I doing wrong? Is it possible to execute this script from the command line?
Thanks
The problem is the following line:
br = init_brpy(br_loc='/path/to/libopenbr')
You have to replace '/path/to/libopenbr' with the actual path of the openbr library on your system.
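Separately, since brpy lives in /myfolder/scripts/brpy rather than next to test.py, Python also needs the scripts directory on its module search path before the import can resolve. A sketch (the directory layout is taken from the question; adjust the path to your setup):

```python
import os
import sys

# test.py sits in /myfolder and the package in /myfolder/scripts/brpy,
# so put /myfolder/scripts on the search path before importing.
scripts_dir = os.path.join("/myfolder", "scripts")
sys.path.insert(0, scripts_dir)

# from brpy import init_brpy  # should now be resolvable
print(sys.path[0])
```

A NameError (rather than an ImportError) suggests the import line itself never ran successfully, so checking the path is a reasonable first step.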
I'm looking for a Pythonic way to compare two files, file1 and file2, obtain the differences in the form of a patch file, and merge the differences into file2. The code should do something like this:
diff file1 file2 > diff.patch
apply the patch diff.patch to file2 // this must behave like git apply.
I have seen the post Implementing Google's DiffMatchPatch API for Python 2/3 on Google's Python API diff_match_patch for finding differences, but I'm looking for a solution to create and apply a patch.
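For the diff half alone, the standard library's difflib can already produce a unified diff with no third-party packages (a sketch with made-up file contents; applying the patch still needs an external tool such as patch/git apply, or a library):

```python
import difflib

# stand-ins for the contents of file1 and file2
old_lines = ["alpha\n", "beta\n", "gamma\n"]
new_lines = ["alpha\n", "BETA\n", "gamma\n"]

# unified_diff yields the patch lines lazily; join them into one patch text
patch_text = "".join(
    difflib.unified_diff(old_lines, new_lines, fromfile="file1", tofile="file2")
)
print(patch_text)
```

The resulting text is in the same format as `diff -u file1 file2 > diff.patch`.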
First you need to install diff_match_patch.
Here is my code:
import sys
import diff_match_patch as dmp_module

def readFileToText(filePath):
    file = open(filePath, "r")
    s = ''
    for line in file:
        s = s + line
    return s

dmp = dmp_module.diff_match_patch()
origin = sys.argv[1]
latest = sys.argv[2]
originText = readFileToText(origin)
latestText = readFileToText(latest)
patch = dmp.patch_make(originText, latestText)
patchText = dmp.patch_toText(patch)

# folder = sys.argv[1]
folder = '/Users/test/Documents/patch'
print(folder)
patchFilePath = folder
patchFile = open(patchFilePath, "w")
patchFile.write(patchText)
patchFile.close()
print(patchText)
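To cover the "apply" half the question asks about, diff_match_patch also provides patch_fromText and patch_apply. A minimal in-memory sketch (the string contents are made up):

```python
import diff_match_patch as dmp_module

dmp = dmp_module.diff_match_patch()
old_text = "the quick brown fox\n"
new_text = "the quick red fox\n"

# create a textual patch, as in the script above
patch_text = dmp.patch_toText(dmp.patch_make(old_text, new_text))

# later: parse the patch back and apply it to the old text
patches = dmp.patch_fromText(patch_text)
result, applied = dmp.patch_apply(patches, old_text)

print(result)   # the quick red fox
print(applied)  # [True] -- one flag per patch hunk, True if it applied cleanly
```

Writing `result` back to file2 completes the diff-then-merge round trip.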
I have a Python script that calls a system program, reads the output from a file out.txt, acts on that output, and loops. However, it doesn't work: a close investigation showed that the script opens out.txt just once and then keeps reading from that old copy. How can I make the script reread the file on each iteration? I saw a similar question here on SO, but it was about a Python script running alongside a program, not calling it, and the solution doesn't work. I tried closing the file before looping back, but it didn't do anything.
EDIT:
I already tried closing and opening, it didn't work. Here's the code:
import subprocess, os, sys

filename = sys.argv[1]
file = open(filename, 'r')
foo = open('foo', 'w')
foo.write(file.read().rstrip())
foo = open('foo', 'a')
crap = open(os.devnull, 'wb')
numSolutions = 0

while True:
    subprocess.call(["minisat", "foo", "out"], stdout=crap, stderr=crap)
    out = open('out', 'r')
    if out.readline().rstrip() == "SAT":
        numSolutions += 1
        clause = out.readline().rstrip()
        clause = clause.split(" ")
        print clause
        clause = map(int, clause)
        clause = map(lambda x: -x, clause)
        output = ' '.join(map(lambda x: str(x), clause))
        print output
        foo.write('\n' + output)
        out.close()
    else:
        break

print "There are ", numSolutions, " solutions."
You need to flush foo so that the external program can see its latest changes. When you write to a file, the data is buffered in the local process and sent to the system in larger blocks. This is done because updating the system file is relatively expensive. In your case, you need to force a flush of the data so that minisat can see it.
foo.write('\n'+output)
foo.flush()
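You can see the buffering in action by checking the file size on disk before and after the flush, i.e. what another process such as minisat would observe (a sketch using a temp file):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "flush_demo.txt")
f = open(path, "w")
f.write("hello")
size_before = os.path.getsize(path)  # typically 0: data still in the buffer
f.flush()
size_after = os.path.getsize(path)   # 5: now visible to other processes
f.close()
os.remove(path)

print(size_before, size_after)
```

Until the flush, the bytes live only in the process's buffer, which is exactly why the external program keeps seeing stale input.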
I rewrote it to hopefully be a bit easier to understand:
import os
from shutil import copyfile
import subprocess
import sys

TEMP_CNF = "tmp.in"
TEMP_SOL = "tmp.out"
NULL = open(os.devnull, "wb")

def all_solutions(cnf_fname):
    """
    Given a file containing a set of constraints,
    generate all possible solutions.
    """
    # make a copy of original input file
    copyfile(cnf_fname, TEMP_CNF)
    while True:
        # run minisat to solve the constraint problem
        subprocess.call(["minisat", TEMP_CNF, TEMP_SOL], stdout=NULL, stderr=NULL)
        # look at the result
        with open(TEMP_SOL) as result:
            line = next(result)
            if line.startswith("SAT"):
                # Success - return solution
                line = next(result)
                solution = [int(i) for i in line.split()]
                yield solution
            else:
                # Failure - no more solutions possible
                break
        # disqualify found solution
        with open(TEMP_CNF, "a") as constraints:
            new_constraint = " ".join(str(-i) for i in solution)
            constraints.write("\n")
            constraints.write(new_constraint)

def main(cnf_fname):
    """
    Given a file containing a set of constraints,
    count the possible solutions.
    """
    count = sum(1 for i in all_solutions(cnf_fname))
    print("There are {} solutions.".format(count))

if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("Usage: {} cnf.in".format(sys.argv[0]))
Open the file at the top of each loop iteration and end the iteration with file_var.close():
for ... :
    ga_file = open('out.txt', 'r')
    ... do stuff
    ga_file.close()
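The same idea with a context manager, which closes the handle for you even if the loop body raises; because each iteration reopens the file, every read sees the current contents. A runnable sketch with a temp file standing in for out.txt:

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "out_demo.txt")
seen = []
for text in ("first run", "second run"):
    # the external program would rewrite the file at this point
    with open(path, "w") as writer:
        writer.write(text)
    # reopen on every iteration instead of reusing a stale handle
    with open(path) as ga_file:
        seen.append(ga_file.read())
os.remove(path)

print(seen)  # ['first run', 'second run']
```

Reusing one open handle across iterations is what produces the "old copy" behavior described in the question.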
Demo of an implementation below (as simple as possible, this is all of the Jython code needed)...
__author__ = ''

import time

var = 'false'
while var == 'false':
    out = open('out.txt', 'r')
    content = out.read()
    time.sleep(3)
    print content
    out.close()
generates this output:
2015-01-09, 'stuff added'
2015-01-09, 'stuff added' # <-- this is when i just saved my update
2015-01-10, 'stuff added again :)' # <-- my new output from file reads
I strongly recommend reading the error messages. They hold quite a lot of information.
I think the full file name should be printed for debugging purposes.