How to store all the output before multiprocessing finishes? - python

I want to run multiprocessing in Python.
Here is an example:
def myFunction(name,age):
    output = paste(name,age)
    return output

names = ["A","B","C"]
ages = ["1","2","3"]

with mp.Pool(processes=no_cpus) as pool:
    results = pool.starmap(myFunction,zip(names,ages))

results_table = pd.concat(results)
results_table.to_csv(file,sep="\t",index=False)
myFunction in the real case takes a really long time. Sometimes I have to interrupt the run and start again. However, the results are only written to the output file once all of pool.starmap is done. How can I store the intermediate/cached results before it finishes?
I don't want to change myFunction from return to .to_csv()
Thanks!

Instead of using starmap, use the imap method, which returns an iterator that, when iterated, yields each result one by one as it becomes available (i.e. as it is returned by myFunction). However, the results are still returned in order. If you do not care about the order, then use imap_unordered.
As each dataframe is returned and iterated, it is written out to the CSV file, either with or without a header according to whether it is the first result being processed.
import pandas as pd
import multiprocessing as mp

def paste(name, age):
    return pd.DataFrame([[name, age]], columns=['Name', 'Age'])

def myFunction(t):
    name, age = t # unpack passed tuple
    output = paste(name, age)
    return output

# Required for Windows:
if __name__ == '__main__':
    names = ["A","B","C"]
    ages = ["1","2","3"]
    no_cpus = min(len(names), mp.cpu_count())
    csv_file = 'test.txt'
    with mp.Pool(processes=no_cpus) as pool:
        # Results from imap must be iterated
        for index, result in enumerate(pool.imap(myFunction, zip(names,ages))):
            if index == 0:
                # First return value: create the file and write the header
                header = True
                open_flags = "w"
            else:
                header = False
                open_flags = "a"
            with open(csv_file, open_flags, newline='') as f:
                result.to_csv(f, sep="\t", index=False, header=header)
Output of test.txt:
Name Age
A 1
B 2
C 3
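If the order in which rows land in the file does not matter, only the pool call needs to change; a minimal sketch of the swap, reusing the names from the example above (results then arrive in completion order rather than submission order):

for index, result in enumerate(pool.imap_unordered(myFunction, zip(names, ages))):
    header = index == 0                      # header only for the first row written
    open_flags = "w" if index == 0 else "a"  # create the file once, then append
    with open(csv_file, open_flags, newline='') as f:
        result.to_csv(f, sep="\t", index=False, header=header)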

Related

Why does it output differently when printing vs. exporting to a .csv file?

When I run this command in script 1, print(test.dataScraper(merchantID, productID)), it prints multiple values.
But when I export it to .csv with
df = pd.DataFrame(script2.dataScraper(merchantID, productID))
df.to_csv("plsWork.csv")
it only writes the last value and not all of them.
SCRIPT 1
import script2

with open('productID.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    next(csv_reader)
    for line in csv_reader:
        merchantID = (line[2])
        productID = (line[4])
        if len(productID) == 0:
            break
        df = pd.DataFrame(script2.dataScraper(merchantID, productID))
        df.to_csv("plsWork.csv")
        #print(test.dataScraper(merchantID, productID))
SCRIPT 2
def dataScraper(merchantID, productID):
    ## Product Information
    data_dict['Product ID'] = data['data']['id']
    data_dict['Product Name'] = data['data']['name']
    data_dict['Product Size'] = bottleSize
    data_dict['Product Option ID'] = optionID
    data_dict['Quantity In Stock'] = availableQuantity
    master_list.append(data_dict)
    if size == 0:
        break
    return(master_list)
It is a bit difficult to parse your code because it is not a minimal reproducible example, but you have some obvious issues:
dataScraper takes two parameters and uses neither of them. Instead, you're relying on variables that are not local to the function. This is bad practice.
dataScraper continuously overwrites a non-local variable, data_dict.
dataScraper returns either None or the now-modified list.
But your main problem is probably that you are calling to_csv on every iteration of your for line in csv_reader loop. If master_list is re-created on each iteration, it will only ever contain the last value you process, and because to_csv is not appending to the file, it's overwriting it.
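As an aside not made in the original answer: if you really did need to write from inside the loop, DataFrame.to_csv can append rather than overwrite, e.g.:

df.to_csv("plsWork.csv", mode="a", header=False)  # append rows instead of rewriting the file

Collecting everything first and writing once, as in the sketch below, is usually cleaner.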
To at least isolate your problems, you should make your functions utilize only local variables. Something like:
def process_file(file_to_read: str, output_file: str) -> None:
    with open(file_to_read, 'r') as csv_file:
        csv_reader = csv.reader(csv_file)
        next(csv_reader)  # Skip column headers
        processed_dicts = [scrape_data(line[2], line[4]) for line in csv_reader]
    pprint(processed_dicts)  # Just to prove to ourselves it's correct; comment out for prod
    df = pd.DataFrame(processed_dicts)
    df.to_csv(output_file)

def scrape_data(merchant_id, product_id) -> dict:
    data = ...?  # Either this should be passed in as a param or fetched from somewhere.
    processed_dict = {
        'Product Id': data['data']['id'],
        ...  # all your other things
    }
    return processed_dict
As you can see, there is a big hole where data is, but this structure should keep your problems constrained. Remember: global variables are not your friend!
(Also, as an aside, you should really use idiomatic Python snake_case for your names, e.g. scrape_data rather than dataScraper.)
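For completeness, a hypothetical call using the filenames from the question, once the data source in scrape_data is filled in, would be (with the imports the sketch relies on):

import csv
from pprint import pprint
import pandas as pd

process_file('productID.csv', 'plsWork.csv')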

Exporting a List is producing an empty CSV file

I have two lists that are zipped together. I am able to print the zipped list out to view it, but when I try to export it into a CSV file, the CSV file is created but it's empty. I'm not sure why, as I'm using the same method to save the two lists separately and it works.
import csv
import random
import datetime
import calendar

with open("Duty Name List.csv") as CsvNameList:
    NameList = CsvNameList.read().split("\n")

date = datetime.datetime.now()
MaxNumofDays = calendar.monthrange(date.year,date.month)
print(NameList)
print(date.year)
print(date.month)
print(MaxNumofDays[1])
x = MaxNumofDays[1] + 1
daylist = list(range(1,x))
print(daylist)
ShuffledList = random.sample(NameList,len(daylist))
print(ShuffledList)
RemainderList = set(NameList) - set(ShuffledList)
print(RemainderList)

with open("remainder.csv","w") as f:
    wr = csv.writer(f,delimiter="\n")
    wr.writerow(RemainderList)

AssignedDutyList = zip(daylist,ShuffledList)
print(list(AssignedDutyList))

with open("AssignedDutyList.csv","w") as g:
    wr = csv.writer(g)
    wr.writerow(list(AssignedDutyList))
No error messages are produced.
In Python 3, this line
AssignedDutyList = zip(daylist,ShuffledList)
creates an iterator named AssignedDutyList.
This line
print(list(AssignedDutyList))
exhausts the iterator. When this line is executed
wr.writerow(list(AssignedDutyList))
the iterator has no further output, so nothing is written to the file.
The solution, in cases where the content of an iterator must be reused, is to store the result of calling list on the iterator in a variable, rather than the iterator itself.
AssignedDutyList = list(zip(daylist,ShuffledList))
print(AssignedDutyList)

with open("AssignedDutyList.csv","w") as g:
    wr = csv.writer(g)
    wr.writerow(AssignedDutyList)
As a bonus, the name AssignedDutyList now refers to an actual list, and so is less confusing for future readers of the code.
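Not part of the original answer, but worth noting: writerow writes the entire zipped list as one CSV row. If the intent is one (day, name) pair per line, csv.writer also provides writerows, e.g.:

with open("AssignedDutyList.csv", "w", newline="") as g:
    wr = csv.writer(g)
    wr.writerows(AssignedDutyList)  # one (day, name) pair per row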

Writing to a text file in a specific way using a list in Python

I am trying to write to a text file in Python so that the output in the file has a specific format.
I have a class called Phonebook and a list containing objects of the Phonebook class.
My constructor looks like this:
def __init__(self,name,number):
    self.name = name
    self.number = number
The method where I add a new object to the list looks like this:
def add(self):
    name = input()
    number = input()
    p = Phonebook(name,number)
    list.append(p)
The function that writes my list to the text file looks like this:
def save():
    f = open("textfile.txt","w")
    for x in list:
        f.write(x.number+";"+x.name+";")
    f.close()
And it writes out:
12345;david;12345;dave;12345;davey;09876;cathryn;09876;cathy; and so on..
It should look like this:
12345;david,dave,davey
09876;cathryn,cathy,
78887;peter,pete,petr,petemon
My question is then: how do I implement this save function so that it writes each unique number only once, followed by all the names connected to that number?
It feels like it's impossible to do with only a list containing names and numbers... Maybe I'm wrong.
Dictionaries in Python give you fast access to items based on their key. So a good solution to your problem would be to index the Phonebook objects, using Phonebook.number as the key and storing a list of Phonebook objects as the value. Then at the end just handle the printing based on however you want each line to appear.
This example should work in your case:
phone_dict = dict() # Used to store Phonebook objects instead of a list

def add(self):
    name = input()
    number = input()
    p = Phonebook(name,number)
    if p.number in phone_dict:
        phone_dict[p.number].append(p) # Append p to list of Phonebooks for same number
    else:
        phone_dict[p.number] = [p] # Create list for new phone number key

def save():
    f = open("textfile.txt","w")
    # Loop through all keys in dict
    for number in phone_dict:
        f.write(number + ";") # Write out number
        phone_books = phone_dict[number]
        # Loop through all phone_books associated with number
        for i, pb in enumerate(phone_books):
            f.write(pb.name)
            # Only append comma if not last value
            if i < len(phone_books) - 1:
                f.write(",")
        f.write("\n") # Go to next line for next number
    f.close()
So how would the load function look?
I have tried writing one, and it loads everything into the dictionary, but the program doesn't work with my other functions like it did before I saved the data and reloaded it into the program.
def load(self,filename):
    self.dictList = {}
    f = open(filename,"r")
    for readLine in f:
        readLine = readLine.split(";")
        number = readLine[0]
        nameLength = len(readLine[1:])
        name = readLine[1:nameLength]
        p = phonebook(name)
        self.dictList[number] = [p]
    print(self.dictList)
    f.close()
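No answer to the follow-up appears above, but a minimal sketch of a load that matches the save format from the answer (one number;name1,name2 line per number) might look like the following; it assumes the constructor still takes (name, number), and self.dictList is kept only because the follow-up used that name:

def load(self, filename):
    self.dictList = {}
    with open(filename, "r") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            number, names = line.split(";", 1)  # e.g. "12345;david,dave,davey"
            # Rebuild one Phonebook object per name, keyed by the shared number
            self.dictList[number] = [Phonebook(name, number)
                                     for name in names.split(",") if name]
    print(self.dictList)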

Easiest way to implement multithreading in this function [Python]

So I have data known as id_list coming into the function in this format: [(u'SGP-3630', 1202), (u'MTSCR-534', 1244)]. The format is two values paired together; there could be one pair or a hundred pairs.
This is the function:
def ListParser(id_list):
    list_length = len(id_list)
    count = 0
    table = ""
    while count < list_length:
        jira = id_list[count][0]
        stash = id_list[count][1]
        count = count + 1
        table = table + RetrieveFromAPI(stash, jira)
    table = TableFormatter(table)
    table = TableColouriser(table)
    return table
What this function does is go through the list, extract the pairs, and put them through a function called RetrieveFromAPI(), which fetches information from a URL.
Does anyone have an idea how to implement multithreading here? I've had a shot at splitting both lists up into their own lists and getting the pool to iterate through each list, but it hasn't quite worked.
def ListParser(id_list):
    pool = ThreadPool(4)
    list_length = len(id_list)
    count = 0
    table = ""
    jira_list = list()
    stash_list = list()
    while count < list_length:
        jira_list = jira_list.extend(id_list[count][0])
        print jira_list
        stash_list = stash_list.extend(id_list[count][1])
        print stash_list
        count = count + 1
    table = table + pool.map(RetrieveFromAPI, stash_list, jira_list)
    table = TableFormatter(table)
    table = TableColouriser(table)
    return table
The error I'm getting for this attempt is TypeError: 'int' object is not iterable
EDIT 2: Okay so I've managed to get the first list with tuples split up into two different lists, but I'm unsure how to get multithreading working with it.
jira,stash= map(list,zip(*id_list))
You're working too hard! From help(multiprocessing.pool.ThreadPool):
map(self, func, iterable, chunksize=None)
    Apply `func` to each element in `iterable`, collecting the results
    in a list that is returned.
The second argument is an iterable of the arguments you want to pass to the worker threads. You have a list of lists and you want the first two items from the inner list for each call. id_list is already iterable, so we're close. A small function (in this case implemented as a lambda) bridges the gap.
I worked up a full mock solution just to make sure it works, so here it goes. As an aside, you can benefit from a fairly large pool size since they spend much of their time waiting on I/O.
from multiprocessing.pool import ThreadPool

def RetrieveFromAPI(stash, jira):
    # boring mock of api
    return '{}-{}.'.format(stash, jira)

def TableFormatter(table):
    # mock
    return table

def TableColouriser(table):
    # mock
    return table

def ListParser(id_list):
    if id_list:
        pool = ThreadPool(min(12, len(id_list)))
        table = ''.join(pool.map(lambda item: RetrieveFromAPI(item[1], item[0]),
                                 id_list, chunksize=1))
        pool.close()
        pool.join()
    else:
        table = ''
    table = TableFormatter(table)
    table = TableColouriser(table)
    return table

id_list = [[0,1,'foo'], [2,3,'bar'], [4,5, 'baz']]
print(ListParser(id_list))
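Not part of the original answer, but for completeness: the same fan-out can also be written with the standard-library concurrent.futures module; a sketch reusing the mock functions above:

from concurrent.futures import ThreadPoolExecutor

def ListParser(id_list):
    table = ''
    if id_list:
        # Threads suit this workload because the workers mostly wait on I/O
        with ThreadPoolExecutor(max_workers=min(12, len(id_list))) as executor:
            # executor.map preserves input order, like pool.map
            table = ''.join(executor.map(lambda item: RetrieveFromAPI(item[1], item[0]),
                                         id_list))
    return TableColouriser(TableFormatter(table))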

Condensing Repetitive Code

I created a simple application that loads multiple CSVs and stores them in lists.
import csv
import collections
list1=[]
list2=[]
list3=[]
l = open("file1.csv")
n = open("file2.csv")
m = open("file3.csv")
csv_l = csv.reader(l)
csv_n = csv.reader(n)
csv_p = csv.reader(m)
for row in csv_l:
    list1.append(row)
for row in csv_n:
    list2.append(row)
for row in csv_p:
    list3.append(row)
l.close()
n.close()
m.close()
I wanted to create a function that would be responsible for this, so that I could avoid repetition and clean up the code. I was thinking of something like this:
def read(filename):
    x = open(filename)
    y = csv.reader(x)
    for row in y:
        list1.append(row)
    x.close()
However, it gets tough for me when I get to the for loop which appends to the list. This works for appending to one list, but if I pass another file name into the function, it will append to the same list. I'm not sure of the best way to go about this.
You just need to create a new list each time, and return it from your function:
def read(filename):
    rows = []
    x = open(filename)
    y = csv.reader(x)
    for row in y:
        rows.append(row)
    x.close()
    return rows
Then call it as follows
list1 = read("file1.csv")
Another option is to pass the list in as an argument to your function - then you can choose whether to create a new list each time, or append multiple CSVs to the same list:
def read(filename, rows):
    x = open(filename)
    y = csv.reader(x)
    for row in y:
        rows.append(row)
    x.close()
    return rows

# One list per file:
list1 = []
read("file1.csv", list1)

# Multiple files combined into one list:
listCombined = []
read("file2.csv", listCombined)
read("file3.csv", listCombined)
I have used your original code in my answer, but see also Malik Brahimi's answer for a better way to write the function body itself using with and list(), and DogWeather's comments - there are lots of different choices here!
You can make a single function, but use a with statement to condense even further:
def parse_csv(path):
    with open(path) as csv_file:
        return list(csv.reader(csv_file))
I like #DNA's approach. But consider a purely functional style. This can be framed as a map operation which converts
["file1.csv", "file2.csv", "file3.csv"]
to...
[list_of_rows, list_of_rows, list_of_rows]
This function would be invoked like this:
l, n, m = map_to_csv(["file1.csv", "file2.csv", "file3.csv"])
And map_to_csv could be implemented something like this:
def map_to_csv(filenames):
    return [list(csv.reader(open(filename))) for filename in filenames]
The functional solution is shorter and doesn't need temporary variables.
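One caveat not raised in the answers: the one-line comprehension never closes the files it opens. A sketch of the same functional approach with explicit closing, assuming the same csv import:

def map_to_csv(filenames):
    results = []
    for filename in filenames:
        with open(filename, newline='') as csv_file:  # file is closed when the block exits
            results.append(list(csv.reader(csv_file)))
    return results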
