mrjob with JSON data - python

A friend and I are working on a rather large JSON file. We want to perform MapReduce on parts of this file, as quickly as possible. Since it appears to be hard to feed a JSON file directly into an mrjob job, we attempted to write the needed data into a text file (where each line is an array element extracted from the JSON). This intermediate step takes far too much time because of disk write operations.
Below is an example of our mrjob test file.
from mrjob.job import MRJob
import json

class ReduceData(MRJob):

    def mapper(self, _, line):
        lineJSON = json.loads(line)
        yield lineJSON[2], 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    ReduceData.run()
The code above is run as follows:
$ python reducedata.py data.txt
read_json.py is illustrated below:
import ijson

f = open('testData.json')
parser = ijson.parse(f)

if __name__ == '__main__':
    for prefix, event, value in parser:
        if (prefix, event) == ('data.item', 'start_array'):
            item = []
        elif prefix == 'data.item.item' and value is not None:
            item.append(value)
        elif (prefix, event) == ('data.item', 'end_array'):
            item = []
            # yield data as output, or something?
With the above in mind, I have two questions:
1) Is there a way to provide the output from read_json.py as input to reducedata.py without performing write-to-disk operations?
2) If 1) is possible, how do I specify the output? mrjob expects a file and invokes the mapper line by line. Each yield (bottom comment) in read_json.py is supposed to be a "line".
Thanks in advance!
-Superdids
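For what it's worth, here is one direction that avoids the intermediate file (a sketch, not a tested solution): have read_json.py write each extracted item to stdout as one JSON line and pipe that straight into the mrjob script, which reads standard input when no input file is given. Note that mrjob may still spool stdin to a temporary file internally, so this removes the explicit intermediate step rather than every disk write.
# read_json.py - sketch: stream each array element to stdout as one JSON line
import json
import sys

import ijson

if __name__ == '__main__':
    with open('testData.json') as f:
        item = []
        for prefix, event, value in ijson.parse(f):
            if (prefix, event) == ('data.item', 'start_array'):
                item = []
            elif prefix == 'data.item.item' and value is not None:
                item.append(value)
            elif (prefix, event) == ('data.item', 'end_array'):
                # default=str because ijson yields Decimal for JSON numbers
                sys.stdout.write(json.dumps(item, default=str) + '\n')
The two scripts would then be chained with a pipe instead of a temporary data.txt:
$ python read_json.py | python reducedata.py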

Related

Add Bittrex API Results from Loop into Dataframe

I am new to Python. By following code from others, I have put together the following script. The code provides the output from "GetLatestTick" by looping through each "pair" in the list under the __main__ section. I can even perform calculations on the line being processed. How can I organize this output into a DataFrame for sorting and other manipulation?
from multiprocessing import Process, Lock
import requests, json
from tabulate import tabulate

def bittrex(lock, pair):
    lock.acquire()
    #print(pair)
    get_request_link = ('https://bittrex.com/Api/v2.0/pub/market/GetLatestTick?marketName='
                        + pair + '&tickInterval=thirtyMin')
    api = requests.get(get_request_link)
    data = json.loads(api.text)
    for i in data:
        if i == 'result':
            for h in data[i]:
                print(pair, h['O'], h['H'], h['L'], h['C'], h['V'], h['BV'],
                      (((h['O'])/(h['H']))-1), h['T'])
    lock.release()
    return data

def main():
    bitt_all = bittrex(lock, pair) # This still required

if __name__ == '__main__':
    lock = Lock()
    pair = ['BTC-MUE','USD-ETH','USDT-TRX','USDT-BTC']
    for pair in pair:
        Process(target=bittrex, args=(lock, pair)).start()
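One way to get this into a DataFrame (a sketch of an alternative approach rather than a fix of the code above, and it assumes the Bittrex endpoint still returns a 'result' list of tick dicts): have each worker return plain rows instead of printing them, collect the rows in the parent process, and build the DataFrame there, for example with multiprocessing.Pool.
from multiprocessing import Pool

import pandas as pd
import requests

COLUMNS = ['pair', 'O', 'H', 'L', 'C', 'V', 'BV', 'OH_change', 'T']

def fetch_rows(pair):
    """Fetch the latest ticks for one pair and return them as plain lists."""
    url = ('https://bittrex.com/Api/v2.0/pub/market/GetLatestTick?marketName='
           + pair + '&tickInterval=thirtyMin')
    data = requests.get(url).json()
    rows = []
    for h in data.get('result') or []:
        rows.append([pair, h['O'], h['H'], h['L'], h['C'], h['V'], h['BV'],
                     h['O'] / h['H'] - 1, h['T']])
    return rows

if __name__ == '__main__':
    pairs = ['BTC-MUE', 'USD-ETH', 'USDT-TRX', 'USDT-BTC']
    with Pool(4) as pool:
        # one list of rows per pair, flattened into a single list
        all_rows = [row for rows in pool.map(fetch_rows, pairs) for row in rows]
    df = pd.DataFrame(all_rows, columns=COLUMNS)
    print(df.sort_values(['pair', 'T']))
Pool.map already returns the results to the parent process, so no Lock is needed and the resulting DataFrame can be sorted and manipulated like any other.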

Python 3.6 script surprisingly slow on Windows 10 but not on Ubuntu 17.10

I recently had to write a solution to a challenge for a company: merge 3 CSV files into one based on the first attribute of each (the key attribute is repeated across all the files).
I wrote the code and sent it to them, but they said it took 2 minutes to run. That was funny, because it ran in 10 seconds on my machine. My machine had the same processor as theirs, 16 GB of RAM, and an SSD as well. Very similar environments.
I tried optimising it and resubmitted it. This time they said they ran it on an Ubuntu machine and got 11 seconds, while the code still ran for 100 seconds on Windows 10.
Another peculiar thing was that when I tried profiling it with the Profile module, it went on forever and I had to terminate it after 450 seconds. I switched to cProfile and it recorded the run in 7 seconds.
EDIT: The exact formulation of the problem is:
Write a console program to merge the files provided in a timely and efficient manner. File paths should be supplied as arguments so that the program can be evaluated on different data sets. The merged file should be saved as CSV; use the id column as the unique key for merging; the program should do any necessary data cleaning and error checking.
Feel free to use any language you’re comfortable with – only restriction is no external libraries as this defeats the purpose of the test. If the language provides CSV parsing libraries (like Python), please avoid using them as well as this is a part of the test.
Without further ado here's the code:
#!/usr/bin/python3
import sys
from multiprocessing import Pool

HEADERS = ['id']

def csv_tuple_quotes_valid(a_tuple):
    """
    checks if the quotes in each attribute of an entry (i.e. a tuple) agree with the csv format
    returns True or False
    """
    for attribute in a_tuple:
        in_quotes = False
        attr_len = len(attribute)
        skip_next = False
        for i in range(0, attr_len):
            if not skip_next and attribute[i] == '\"':
                if i < attr_len - 1 and attribute[i + 1] == '\"':
                    skip_next = True
                    continue
                elif i == 0 or i == attr_len - 1:
                    in_quotes = not in_quotes
                else:
                    return False
            else:
                skip_next = False
        if in_quotes:
            return False
    return True
def check_and_parse_potential_tuple(to_parse):
    """
    receives a string and returns an array of the attributes of the csv line
    if the string was not a valid csv line, then returns False
    """
    a_tuple = []
    attribute_start_index = 0
    to_parse_len = len(to_parse)
    in_quotes = False
    i = 0
    #iterate through the string (line from the csv)
    while i < to_parse_len:
        current_char = to_parse[i]
        #this works the following way: if we meet a quote ("), it must be in one
        #of five cases: "" | ", | ," | "\0 | (start_of_string)"
        #in case we are inside a quoted attribute (i.e. "123"), then commas are ignored
        #the following code also extracts the tuples' attributes
        if current_char == '\"':
            if i == 0 or (to_parse[i - 1] == ',' and not in_quotes): # (start_of_string)" and ," case
                #not including the quote in the next attr
                attribute_start_index = i + 1
                #starting a quoted attr
                in_quotes = True
            elif i + 1 < to_parse_len:
                if to_parse[i + 1] == '\"': # "" case
                    i += 1 #skip the next " because it is part of a ""
                elif to_parse[i + 1] == ',' and in_quotes: # ", case
                    a_tuple.append(to_parse[attribute_start_index:i].strip())
                    #not including the quote and comma in the next attr
                    attribute_start_index = i + 2
                    in_quotes = False #the quoted attr has ended
                    #skip the next comma - we know what it is for
                    i += 1
                else:
                    #since we cannot have a random " in the middle of an attr
                    return False
            elif i == to_parse_len - 1: # "\0 case
                a_tuple.append(to_parse[attribute_start_index:i].strip())
                #reached end of line, so no more attr's to extract
                attribute_start_index = to_parse_len
                in_quotes = False
            else:
                return False
        elif current_char == ',':
            if not in_quotes:
                a_tuple.append(to_parse[attribute_start_index:i].strip())
                attribute_start_index = i + 1
        i += 1
    #in case the last attr was left empty or unquoted
    if attribute_start_index < to_parse_len or (not in_quotes and to_parse[-1] == ','):
        a_tuple.append(to_parse[attribute_start_index:])
    #line ended while parsing; i.e. a quote was opened but not closed
    if in_quotes:
        return False
    return a_tuple
def parse_tuple(to_parse, no_of_headers):
    """
    parses a string and returns an array with no_of_headers number of headers
    raises an error if the string was not a valid CSV line
    """
    #get rid of the newline at the end of every line
    to_parse = to_parse.strip()
    # return to_parse.split(',') #if we assume the data is in a valid format
    #the following checking of the format of the data increases the execution
    #time by a factor of 2; if the data is known to be valid, uncomment 3 lines above here
    #if there are more commas than fields, then we must take into consideration
    #how the quotes parse and then extract the attributes
    if to_parse.count(',') + 1 > no_of_headers:
        result = check_and_parse_potential_tuple(to_parse)
        if result:
            a_tuple = result
        else:
            raise TypeError('Error while parsing CSV line %s. The quotes do not parse' % to_parse)
    else:
        a_tuple = to_parse.split(',')
        if not csv_tuple_quotes_valid(a_tuple):
            raise TypeError('Error while parsing CSV line %s. The quotes do not parse' % to_parse)
    #if the format is correct but more data fields were provided
    #the following works faster than an if statement that checks the length of a_tuple
    try:
        a_tuple[no_of_headers - 1]
    except IndexError:
        raise TypeError('Error while parsing CSV line %s. Unknown reason' % to_parse)
    #this replaces the use of my own hashtables to store the duplicated values for the attributes
    for i in range(1, no_of_headers):
        a_tuple[i] = sys.intern(a_tuple[i])
    return a_tuple
def read_file(path, file_number):
    """
    reads the csv file and returns (dict, int)
    the dict is the mapping of id's to attributes
    the integer is the number of attributes (headers) for the csv file
    """
    global HEADERS
    try:
        file = open(path, 'r')
    except FileNotFoundError as e:
        print("error in %s:\n%s\nexiting..." % (path, e.args))
        exit(1)
    main_table = {}
    headers = file.readline().strip().split(',')
    no_of_headers = len(headers)
    HEADERS.extend(headers[1:]) #keep the headers from the file
    lines = file.readlines()
    file.close()
    args = []
    for line in lines:
        args.append((line, no_of_headers))
    #pool is a pool of worker processes parsing the lines in parallel
    with Pool() as workers:
        try:
            all_tuples = workers.starmap(parse_tuple, args, 1000)
        except TypeError as e:
            print('Error in file %s:\n%s\nexiting thread...' % (path, e.args))
            exit(1)
    for a_tuple in all_tuples:
        #add quotes to key if needed
        key = a_tuple[0] if a_tuple[0][0] == '\"' else ('\"%s\"' % a_tuple[0])
        main_table[key] = a_tuple[1:]
    return (main_table, no_of_headers)
def merge_files():
    """
    produces a file called merged.csv
    """
    global HEADERS
    no_of_files = len(sys.argv) - 1
    processed_files = [None] * no_of_files
    for i in range(0, no_of_files):
        processed_files[i] = read_file(sys.argv[i + 1], i)
    out_file = open('merged.csv', 'w+')
    merged_str = ','.join(HEADERS)
    all_keys = {}
    #this is to ensure that we include all keys in the final file,
    #even those that are missing from some files and present in others
    for processed_file in processed_files:
        all_keys.update(processed_file[0])
    for key in all_keys:
        merged_str += '\n%s' % key
        for i in range(0, no_of_files):
            (main_table, no_of_headers) = processed_files[i]
            try:
                for attr in main_table[key]:
                    merged_str += ',%s' % attr
            except KeyError:
                print('NOTE: no values found for id %s in file \"%s\"' % (key, sys.argv[i + 1]))
                merged_str += ',' * (no_of_headers - 1)
    out_file.write(merged_str)
    out_file.close()
if __name__ == '__main__':
    # merge_files()
    import cProfile
    cProfile.run('merge_files()')
    # import time
    # start = time.time()
    # print(time.time() - start);
Here is the profiler report I got on my Windows machine.
EDIT: The rest of the csv data provided is here. Pastebin was taking too long to process the files, so...
It might not be the best code and I know that, but my question is: what slows Windows down so much that doesn't slow Ubuntu down? The merge_files() function takes the longest, with 94 seconds just for itself, not including the calls to other functions. And there doesn't seem to be anything too obvious to me about why it is so slow.
Thanks
EDIT: Note: we both used the same dataset to run the code on.
It turns out that Windows and Linux handle very long strings differently. When I moved the out_file.write(merged_str) inside the outer for loop (for key in all_keys:) and stopped appending everything to one huge merged_str, it ran in 11 seconds as expected. I don't have enough knowledge of either OS's memory management to predict why the difference is so large.
But I would say that the second approach (writing inside the loop) is the more fail-safe method, because it is unreasonable to keep a 30 MB string in memory. It just turns out that Linux copes with that better, and doesn't always try to rebuild the string or keep it in cache the way Windows apparently does.
Funnily enough, I initially ran it a few times on my Linux machine with both writing strategies, and the one with the large string seemed to go faster, so I stuck with it. I guess you never know.
Here's the modified code
for key in all_keys:
    merged_str = '%s' % key
    for i in range(0, no_of_files):
        (main_table, no_of_headers) = processed_files[i]
        try:
            for attr in main_table[key]:
                merged_str += ',%s' % attr
        except KeyError:
            print('NOTE: no values found for id %s in file \"%s\"' % (key, sys.argv[i + 1]))
            merged_str += ',' * (no_of_headers - 1)
    out_file.write(merged_str + '\n')
out_file.close()
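A further note of my own (not part of the original answer): accumulating the fields of each row in a list and joining them once per row avoids repeated string concatenation altogether. A minimal sketch of the same section, with the diagnostic print dropped for brevity:
for key in all_keys:
    fields = [key]
    for main_table, no_of_headers in processed_files:
        # a missing key contributes the right number of empty columns
        fields.extend(main_table.get(key, [''] * (no_of_headers - 1)))
    out_file.write(','.join(fields) + '\n')
out_file.close()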
When I run your solution on Ubuntu 16.04 with the three given files, it seems to take ~8 seconds to complete. The only modification I made was to uncomment the timing code at the bottom and use it.
$ python3 dimitar_merge.py file1.csv file2.csv file3.csv
NOTE: no values found for id "aaa5d09b-684b-47d6-8829-3dbefd608b5e" in file "file2.csv"
NOTE: no values found for id "38f79a49-4357-4d5a-90a5-18052ef03882" in file "file2.csv"
NOTE: no values found for id "766590d9-4f5b-4745-885b-83894553394b" in file "file2.csv"
8.039648056030273
$ python3 dimitar_merge.py file1.csv file2.csv file3.csv
NOTE: no values found for id "38f79a49-4357-4d5a-90a5-18052ef03882" in file "file2.csv"
NOTE: no values found for id "766590d9-4f5b-4745-885b-83894553394b" in file "file2.csv"
NOTE: no values found for id "aaa5d09b-684b-47d6-8829-3dbefd608b5e" in file "file2.csv"
7.78482985496521
I rewrote my first attempt without using csv from the standard library and am now getting times of ~4.3 seconds.
$ python3 lettuce_merge.py file1.csv file2.csv file3.csv
4.332579612731934
$ python3 lettuce_merge.py file1.csv file2.csv file3.csv
4.305467367172241
$ python3 lettuce_merge.py file1.csv file2.csv file3.csv
4.27345871925354
This is my solution code (lettuce_merge.py):
from collections import defaultdict

def split_row(csv_row):
    return [col.strip('"') for col in csv_row.rstrip().split(',')]

def merge_csv_files(files):
    file_headers = []
    merged_headers = []
    for i, file in enumerate(files):
        current_header = split_row(next(file))
        unique_key, *current_header = current_header
        if i == 0:
            merged_headers.append(unique_key)
        merged_headers.extend(current_header)
        file_headers.append(current_header)
    result = defaultdict(lambda: [''] * (len(merged_headers) - 1))
    for file_header, file in zip(file_headers, files):
        for line in file:
            key, *values = split_row(line)
            for col_name, col_value in zip(file_header, values):
                result[key][merged_headers.index(col_name) - 1] = col_value
        file.close()
    quotes = '"{}"'.format
    with open('lettuce_merged.csv', 'w') as f:
        f.write(','.join(quotes(a) for a in merged_headers) + '\n')
        for key, values in result.items():
            f.write(','.join(quotes(b) for b in [key] + values) + '\n')

if __name__ == '__main__':
    from argparse import ArgumentParser, FileType
    from time import time

    parser = ArgumentParser()
    parser.add_argument('files', nargs='*', type=FileType('r'))
    args = parser.parse_args()

    start_time = time()
    merge_csv_files(args.files)
    print(time() - start_time)
I'm sure this code could be optimized even further but sometimes just seeing another way to solve a problem can help spark new ideas.
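Purely as an aside of my own, and not a valid submission for this challenge (which explicitly disallows CSV parsing libraries): the same merge written against the standard csv module might look roughly like this, which makes a useful baseline to compare timings and quoting behaviour against.
import csv
import sys
from collections import defaultdict

def merge_with_csv_module(paths, out_path='csv_merged.csv'):
    """Merge CSV files on their first column using the standard csv module."""
    merged_headers = []          # first entry is the key column name
    rows = defaultdict(dict)     # key -> {column name: value}
    for path in paths:
        with open(path, newline='') as f:
            reader = csv.reader(f)
            key_col, *headers = next(reader)
            if not merged_headers:
                merged_headers.append(key_col)
            merged_headers.extend(h for h in headers if h not in merged_headers)
            for row in reader:
                if not row:
                    continue
                key, *values = row
                rows[key].update(zip(headers, values))
    with open(out_path, 'w', newline='') as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerow(merged_headers)
        for key, cols in rows.items():
            writer.writerow([key] + [cols.get(h, '') for h in merged_headers[1:]])

if __name__ == '__main__':
    merge_with_csv_module(sys.argv[1:])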

Creating loop for __main__

I am new to Python, and I want your advice on something.
I have a script that runs one input value at a time, and I want it to be able to run a whole list of such values without me typing them one at a time. I have a hunch that a "for loop" is needed for the main method listed below. The value is "gene_name", so effectively, I want to feed in a list of "gene_names" that the script can run through nicely.
Hope I phrased the question correctly, thanks! The chunk in question seems to be
def get_probes_from_genes(gene_names)
import json
import urllib2
import os
import pandas as pd

api_url = "http://api.brain-map.org/api/v2/data/query.json"

def get_probes_from_genes(gene_names):
    if not isinstance(gene_names, list):
        gene_names = [gene_names]
    #in case there are white spaces in gene names
    gene_names = ["'%s'" % gene_name for gene_name in gene_names]
    api_query = "?criteria=model::Probe"
    api_query += ",rma::criteria,[probe_type$eq'DNA']"
    api_query += ",products[abbreviation$eq'HumanMA']"
    api_query += ",gene[acronym$eq%s]" % (','.join(gene_names))
    api_query += ",rma::options[only$eq'probes.id','name']"
    data = json.load(urllib2.urlopen(api_url + api_query))
    d = {probe['id']: probe['name'] for probe in data['msg']}
    if not d:
        raise Exception("Could not find any probes for %s gene. Check " \
                        "http://help.brain-map.org/download/attachments/2818165/HBA_ISH_GeneList.pdf?version=1&modificationDate=1348783035873 " \
                        "for list of available genes." % gene_name)
    return d

def get_expression_values_from_probe_ids(probe_ids):
    if not isinstance(probe_ids, list):
        probe_ids = [probe_ids]
    #in case there are white spaces in gene names
    probe_ids = ["'%s'" % probe_id for probe_id in probe_ids]
    api_query = "?criteria=service::human_microarray_expression[probes$in%s]" % (','.join(probe_ids))
    data = json.load(urllib2.urlopen(api_url + api_query))
    expression_values = [[float(expression_value) for expression_value in data["msg"]["probes"][i]["expression_level"]] for i in range(len(probe_ids))]
    well_ids = [sample["sample"]["well"] for sample in data["msg"]["samples"]]
    donor_names = [sample["donor"]["name"] for sample in data["msg"]["samples"]]
    well_coordinates = [sample["sample"]["mri"] for sample in data["msg"]["samples"]]
    return expression_values, well_ids, well_coordinates, donor_names

def get_mni_coordinates_from_wells(well_ids):
    package_directory = os.path.dirname(os.path.abspath(__file__))
    frame = pd.read_csv(os.path.join(package_directory, "data", "corrected_mni_coordinates.csv"), header=0, index_col=0)
    return list(frame.ix[well_ids].itertuples(index=False))

if __name__ == '__main__':
    probes_dict = get_probes_from_genes("SLC6A2")
    expression_values, well_ids, well_coordinates, donor_names = get_expression_values_from_probe_ids(probes_dict.keys())
    print get_mni_coordinates_from_wells(well_ids)
Whoa, first things first. Python ain't Java, so do yourself a favor and use a nice """xxx\nyyy""" string, with triple quotes for multiline text.
api_query = """?criteria=model::Probe"
,rma::criteria,[probe_type$eq'DNA']
...
"""
or something like that. You will get the whitespace as typed, so you may need to adjust.
If, as suggested, you opt to loop over the calls to your function by reading gene names from a file, you will need to either try/except your data-not-found exception or handle missing data without throwing an exception. I would opt for returning an empty result myself and letting the caller worry about what to do with it.
If you do opt for raising an Exception, create your own rather than using a generic one. That way your code can catch your expected exception first.
class MyNoDataFoundException(Exception):
    pass

#replace your current raise code with...
if not d:
    raise MyNoDataFoundException("your message here")
Clarification about catching exceptions, using the accepted answer as a starting point:
if __name__ == '__main__':
    with open(r"/tmp/genes.txt","r") as f:
        for line in f.readlines():
            #keep track of your input data
            search_data = line.strip()
            try:
                probes_dict = get_probes_from_genes(search_data)
            except MyNoDataFoundException, e:
                #and do whatever you feel you need to do here...
                print "bummer about search_data:%s:\nexception:%s" % (search_data, e)
            expression_values, well_ids, well_coordinates, donor_names = get_expression_values_from_probe_ids(probes_dict.keys())
            print get_mni_coordinates_from_wells(well_ids)
You may want to create a file with gene names, then read the content of the file and call your function in a loop. Here is an example:
if __name__ == '__main__':
    with open(r"/tmp/genes.txt","r") as f:
        for line in f.readlines():
            probes_dict = get_probes_from_genes(line.strip())
            expression_values, well_ids, well_coordinates, donor_names = get_expression_values_from_probe_ids(probes_dict.keys())
            print get_mni_coordinates_from_wells(well_ids)
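If the gene names are already in a Python list rather than in a file, the same loop applies directly. A minimal sketch; the gene names other than SLC6A2 are made-up examples and may not all be in the Allen Institute list:
if __name__ == '__main__':
    gene_names = ["SLC6A2", "BDNF", "COMT"]  # hypothetical example list
    for gene_name in gene_names:
        probes_dict = get_probes_from_genes(gene_name)
        expression_values, well_ids, well_coordinates, donor_names = \
            get_expression_values_from_probe_ids(probes_dict.keys())
        print get_mni_coordinates_from_wells(well_ids)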

Python refresh file from disk

I have a python script that calls a system program and reads the output from a file out.txt, acts on that output, and loops. However, it doesn't work, and a close investigation showed that the python script just opens out.txt once and then keeps on reading from that old copy. How can I make the python script reread the file on each iteration? I saw a similar question here on SO but it was about a python script running alongside a program, not calling it, and the solution doesn't work. I tried closing the file before looping back but it didn't do anything.
EDIT:
I already tried closing and reopening the file, but it didn't work. Here's the code:
import subprocess, os, sys

filename = sys.argv[1]
file = open(filename,'r')
foo = open('foo','w')
foo.write(file.read().rstrip())
foo = open('foo','a')
crap = open(os.devnull,'wb')
numSolutions = 0
while True:
    subprocess.call(["minisat", "foo", "out"], stdout=crap,stderr=crap)
    out = open('out','r')
    if out.readline().rstrip() == "SAT":
        numSolutions += 1
        clause = out.readline().rstrip()
        clause = clause.split(" ")
        print clause
        clause = map(int,clause)
        clause = map(lambda x: -x,clause)
        output = ' '.join(map(lambda x: str(x),clause))
        print output
        foo.write('\n'+output)
        out.close()
    else:
        break
print "There are ", numSolutions, " solutions."
You need to flush foo so that the external program can see its latest changes. When you write to a file, the data is buffered in the local process and sent to the system in larger blocks. This is done because updating the system file is relatively expensive. In your case, you need to force a flush of the data so that minisat can see it.
foo.write('\n'+output)
foo.flush()
I rewrote it to hopefully be a bit easier to understand:
import os
from shutil import copyfile
import subprocess
import sys

TEMP_CNF = "tmp.in"
TEMP_SOL = "tmp.out"
NULL = open(os.devnull, "wb")

def all_solutions(cnf_fname):
    """
    Given a file containing a set of constraints,
    generate all possible solutions.
    """
    # make a copy of original input file
    copyfile(cnf_fname, TEMP_CNF)
    while True:
        # run minisat to solve the constraint problem
        subprocess.call(["minisat", TEMP_CNF, TEMP_SOL], stdout=NULL, stderr=NULL)
        # look at the result
        with open(TEMP_SOL) as result:
            line = next(result)
            if line.startswith("SAT"):
                # Success - return solution
                line = next(result)
                solution = [int(i) for i in line.split()]
                yield solution
            else:
                # Failure - no more solutions possible
                break
        # disqualify found solution
        with open(TEMP_CNF, "a") as constraints:
            new_constraint = " ".join(str(-i) for i in solution)
            constraints.write("\n")
            constraints.write(new_constraint)

def main(cnf_fname):
    """
    Given a file containing a set of constraints,
    count the possible solutions.
    """
    count = sum(1 for i in all_solutions(cnf_fname))
    print("There are {} solutions.".format(count))

if __name__=="__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("Usage: {} cnf.in".format(sys.argv[0]))
Open the file inside the loop (so you get a fresh file object on each pass) and end the loop body with file_var.close().
for ... :
    ga_file = open('out.txt', 'r')
    ... do stuff
    ga_file.close()
Demo of an implementation below (as simple as possible, this is all of the Jython code needed)...
__author__ = ''
import time

var = 'false'
while var == 'false':
    out = open('out.txt', 'r')
    content = out.read()
    time.sleep(3)
    print content
    out.close()
generates this output:
2015-01-09, 'stuff added'
2015-01-09, 'stuff added' # <-- this is when i just saved my update
2015-01-10, 'stuff added again :)' # <-- my new output from file reads
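Either way, the key point is to reopen the file on every pass so you get a fresh file object; a with block then does the closing for you. A minimal sketch (the external command is a stand-in for whatever program rewrites out.txt):
import subprocess

for _ in range(10):                        # however many passes you need
    subprocess.call(["some_program"])      # hypothetical command that rewrites out.txt
    with open('out.txt', 'r') as ga_file:  # fresh file object each pass
        first_line = ga_file.readline().rstrip()
    # ga_file is closed automatically when the with block exits
    print(first_line)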
I strongly recommend reading the error messages. They hold quite a lot of information.
I think the full file name should be written for debug purposes.

Successive multiprocessing

I am filtering huge text files using multiprocessing.py. The code basically opens the text files, works on it, then closes it.
Thing is, I'd like to be able to launch it successively on multiple text files. Hence, I tried to add a loop, but for some reason it doesn't work (while the code works on each file). I believe this is an issue with:
if __name__ == '__main__':
However, I am looking for something else. I tried to create Launcher and LauncherCount files like this:
LauncherCount.py:
def setLauncherCount(n):
    global LauncherCount
    LauncherCount = n
and,
Launcher.py:
import os
import LauncherCount
LauncherCount.setLauncherCount(0)
os.system("OrientedFilterNoLoop.py")
LauncherCount.setLauncherCount(1)
os.system("OrientedFilterNoLoop.py")
...
I import LauncherCount.py, and use LauncherCount.LauncherCount as my loop index.
Of course, this doesn't work either, as it edits the variable LauncherCount.LauncherCount locally, so the change is not seen by the imported version of LauncherCount.
Is there any way to edit a global variable in an imported file? Or is there some other way to do this? What I need is to run the code multiple times, changing one value each time, apparently without using any loop.
Thanks!
Edit: Here is my main code if necessary. Sorry for the bad style ...
import multiprocessing
import config
import time
import LauncherCount

class Filter:
    """ Filtering methods """

    def __init__(self):
        print("launching methods")

    # Return the list: [Latitude,Longitude] (elements are floating point numbers)
    def LatLong(self, line):
        comaCount = []
        comaCount.append(line.find(','))
        comaCount.append(line.find(',', comaCount[0] + 1))
        comaCount.append(line.find(',', comaCount[1] + 1))
        Lat = line[comaCount[0] + 1 : comaCount[1]]
        Long = line[comaCount[1] + 1 : comaCount[2]]
        try:
            return [float(Lat), float(Long)]
        except ValueError:
            return [0, 0]

    # Return a boolean:
    # - True if the Lat/Long is within the Lat/Long rectangle defined by:
    #   tupleFilter = (minLat,maxLat,minLong,maxLong)
    # - False if not
    def LatLongFilter(self, LatLongList, tupleFilter):
        if tupleFilter[0] <= LatLongList[0] <= tupleFilter[1] and \
           tupleFilter[2] <= LatLongList[1] <= tupleFilter[3]:
            return True
        else:
            return False

    def writeLine(self, key, line):
        filterDico[key][1].write(line)

def filteringProcess(dico):
    myFilter = Filter()
    while True:
        try:
            currentLine = readFile.readline()
        except ValueError:
            break
        if len(currentLine) == 0:  # Breaks at the end of the file
            break
        if len(currentLine) < 35:  # Deletes wrong lines (too short)
            continue
        LatLongList = myFilter.LatLong(currentLine)
        for key in dico:
            if myFilter.LatLongFilter(LatLongList, dico[key][0]):
                myFilter.writeLine(key, currentLine)
###########################################################################
# Main
###########################################################################

# Open read files:
readFile = open(config.readFileList[LauncherCount.LauncherCount][1], 'r')

# Generate writing files:
pathDico = {}
filterDico = config.filterDico

# Create outputs
for key in filterDico:
    output_Name = config.readFileList[LauncherCount.LauncherCount][0][:-4] \
                  + '_' + key + '.log'
    pathDico[output_Name] = config.writingFolder + output_Name
    filterDico[key] = [filterDico[key], open(pathDico[output_Name], 'w')]

p = []
CPUCount = multiprocessing.cpu_count()
CPURange = range(CPUCount)

startingTime = time.localtime()

if __name__ == '__main__':
    ### Create and start processes:
    for i in CPURange:
        p.append(multiprocessing.Process(target=filteringProcess,
                                         args=(filterDico,)))
        p[i].start()

    ### Kill processes:
    while True:
        if [p[i].is_alive() for i in CPURange] == [False for i in CPURange]:
            readFile.close()
            for key in config.filterDico:
                config.filterDico[key][1].close()
                print(key, "is Done!")
            endTime = time.localtime()
            break

    print("Process started at:", startingTime)
    print("And ended at:", endTime)
To process groups of files in sequence while working on files within a group in parallel:
#!/usr/bin/env python
from multiprocessing import Pool

def work_on(args):
    """Process a single file."""
    i, filename = args
    print("working on %s" % (filename,))
    return i

def files():
    """Generate input filenames to work on."""
    #NOTE: you could read the file list from a file, get it using glob.glob, etc
    yield "inputfile1"
    yield "inputfile2"

def process_files(pool, filenames):
    """Process filenames using pool of processes.

    Wait for results.
    """
    for result in pool.imap_unordered(work_on, enumerate(filenames)):
        #NOTE: in general the files won't be processed in the original order
        print(result)

def main():
    p = Pool()
    # to do "successive" multiprocessing
    for filenames in [files(), ['other', 'bunch', 'of', 'files']]:
        process_files(p, filenames)

if __name__=="__main__":
    main()
Each process_files() call runs in sequence after the previous one has completed, i.e., the files from different calls to process_files() are not processed in parallel.
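As for the original question about editing a global in an imported module: nothing set with LauncherCount.setLauncherCount() in Launcher.py can reach the filtering script, because os.system() starts a brand-new Python process that re-imports LauncherCount from scratch. A common workaround (my own sketch, reusing the file names from the question) is to pass the index on the command line instead:
# Launcher.py - run the filter script once per input file index
import subprocess
import sys

for launcher_count in range(2):    # as many runs as there are input files (2 here, as in the question)
    subprocess.call([sys.executable, "OrientedFilterNoLoop.py", str(launcher_count)])
Inside OrientedFilterNoLoop.py, import LauncherCount would then be replaced by something like LauncherCount = int(sys.argv[1]) (and LauncherCount.LauncherCount by plain LauncherCount), so each run reads its index from its own arguments.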
