According to this answer, starmap should be used when multiprocessing a function with multiple arguments. The problem I am having is that one of my arguments is a constant dataframe. When I create a list of arguments for my function and call starmap, the dataframe gets copied over and over. I thought I could get around this problem using a Namespace, but I can't figure it out. My code below hasn't thrown an error, but after 30 minutes no files have been written. The code runs in under 10 minutes without multiprocessing, just calling write_file directly.
import pandas as pd
import numpy as np
import multiprocessing as mp

def write_file(df, colIndex, splitter, outpath):
    with open(outpath + splitter + ".txt", 'a') as oFile:
        data = df[df.iloc[:,colIndex] == splitter]
        data.to_csv(oFile, sep = '|', index = False, header = False)

mgr = mp.Manager()
ns = mgr.Namespace()
df = pd.read_table(file_, delimiter = '|', header = None)
ns.df = df.iloc[:,1] = df.iloc[:,1].astype(str)
fileList = list(df.iloc[:, 1].astype('str').unique())

for item in fileList:
    with mp.Pool(processes=3) as pool:
        pool.starmap(write_file, np.array((ns, 1, item, outpath)).tolist())
To anyone else struggling with this issue: my solution was to split the dataframe into chunks and pair each chunk with the remaining arguments, giving an iterable of tuples:

from itertools import product
iterable = product(np.array_split(data, 15), [args])
Then, pass this iterable to starmap:
pool.starmap(func, iterable)
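For completeness, here is a fuller sketch of that pattern; the function body, paths and column index are my own illustrative assumptions, not from the original post. Each task receives one chunk plus the shared arguments, so the whole dataframe is never sent along with every task.

import multiprocessing as mp
from itertools import product

import numpy as np
import pandas as pd

def func(chunk, args):
    colIndex, outpath = args
    # Hypothetical work: write each chunk's rows, grouped by the key column.
    # Several processes may append to the same output file, so row order in
    # those files is not guaranteed.
    for splitter, data in chunk.groupby(chunk.iloc[:, colIndex]):
        with open(outpath + str(splitter) + ".txt", "a") as oFile:
            data.to_csv(oFile, sep="|", index=False, header=False)

if __name__ == "__main__":
    df = pd.read_table("input.txt", delimiter="|", header=None)  # assumed input path
    args = (1, "./out_")                                         # assumed (colIndex, outpath)
    iterable = product(np.array_split(df, 15), [args])
    with mp.Pool(processes=3) as pool:
        pool.starmap(func, iterable)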
I had the same issue: I needed to pass two existing dataframes to the function via starmap. It turns out there is no need to declare the dataframes as function arguments at all; you can simply reference them as globals, as described in the accepted answer here: Pandas: local vs global dataframe in functions
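One way to realize that idea is a pool initializer, so each worker process builds its own module-level dataframe once and the tasks only carry the small arguments. This is only a sketch; the paths, prefix and column index are assumptions.

import multiprocessing as mp
import pandas as pd

df = None  # set in each worker by init_worker

def init_worker(path):
    # Runs once per worker process; afterwards write_file can use df as a global.
    global df
    df = pd.read_table(path, delimiter='|', header=None)

def write_file(colIndex, splitter, outpath):
    data = df[df.iloc[:, colIndex].astype(str) == splitter]
    data.to_csv(outpath + splitter + '.txt', sep='|', index=False, header=False)

if __name__ == '__main__':
    file_ = 'input.txt'   # assumed input path
    outpath = './out_'    # assumed output prefix
    items = pd.read_table(file_, delimiter='|', header=None).iloc[:, 1].astype(str).unique()
    with mp.Pool(processes=3, initializer=init_worker, initargs=(file_,)) as pool:
        pool.starmap(write_file, [(1, item, outpath) for item in items])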
I'm trying to understand how iteration works in Python. Isn't the first method below the same as the second one? How would I write the first method in the traditional way, with an explicit loop?
With the second way I get this error:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
import pandas as pd
import glob
import os

path = r'C:\Documents'
all_files = glob.glob(os.path.join(path, "*.txt"))

df_from_each_file = (pd.read_csv(f) for f in all_files)  # 1. way

# 2. way
# for f in all_files:
#     df_from_each_file = (pd.read_csv(f))

concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
print(concatenated_df)
In the second case the line df_from_each_file = (pd.read_csv(f)) just reassigns the variable on every iteration, so after the loop it holds only the dataframe from the last file. You then pass a single object (the data from the last file) to pd.concat, which has nothing to concatenate. That's why it says TypeError: first argument must be an iterable of pandas objects.
To make it work use some iterable container, e.g. list:
df_from_each_file = []
for f in all_files:
    df_from_each_file.append(pd.read_csv(f))
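With the list built, the concat call from the first method works unchanged:

concatenated_df = pd.concat(df_from_each_file, ignore_index=True)

The generator expression in the first method also works as-is, because a generator is itself an iterable of DataFrames; it is the overwriting loop that is not equivalent.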
P.S. How can we use pd.read_csv arguments like sep, encoding, header etc. with pd.concat(map(pd.read_csv, all_files))?
For this purpose we use functools.partial:
import pandas as pd
from functools import partial
from io import StringIO
data1 = '''name;value
one;1
two;2
'''
data2 = '''name;value
three;3
four;4
'''
csv_files = [
    StringIO(data1),
    StringIO(data2),
]
my_read_csv = partial(pd.read_csv, sep=";")
df = pd.concat(map(my_read_csv, csv_files))
display(df)
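The same pattern takes any other pd.read_csv keyword; for example (the encoding and header values here are just placeholders):

my_read_csv = partial(pd.read_csv, sep=';', encoding='utf-8', header=0)
df = pd.concat(map(my_read_csv, csv_files), ignore_index=True)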
I have been trying to get the names of the files in a folder on my computer, open an Excel workbook, and write the file names into a specific column. However, it fails with the following error: "TypeError: Value must be a list, tuple, range or generator, or a dict. Supplied value is <class 'str'>".
The code is:
from openpyxl import load_workbook
import os
import glob, os

os.chdir("/content/drive/MyDrive/picture")

ox = []
for file in glob.glob("*.*"):
    for j in range(0, 15):
        replaced_text = file.replace('.JPG', '')
        ox.append(replaced_text)

oxx = ['K', ox]  # K is a column

file1 = load_workbook(filename = '/content/drive/MyDrive/Default.xlsx')
sheet1 = file1['Enter Data Draft']

for item in oxx:
    sheet1.append(item)
I've taken a slightly different approach, but looking at your code the problem is with the looping.
The problem
for item in oxx: sheet1.append(item)
When looping over oxx there are two items: the string 'K', and then a single list holding the filenames (each repeated 15 times). openpyxl's append expects each appended item to be an iterable of cell values for one row (a list, tuple, range, generator, or dict; documentation here), so the bare string 'K' raises the TypeError.
The solution
Not knowing what other data you might already have on the worksheet, I've changed the approach to hopefully still satisfy the expected outcome.
I got the following to work as expected.
from openpyxl import load_workbook
import glob, os

os.chdir("/content/drive/MyDrive/picture")

ox = []
for file in glob.glob("*.*"):
    for j in range(0, 15):  # I've kept this in, assuming you wanted each file name listed 15 times?
        replaced_text = file.replace('.JPG', '')
        ox.append(replaced_text)

file_dir = '/content/drive/MyDrive/Default.xlsx'
file1 = load_workbook(filename=file_dir)
sheet1 = file1['Enter Data Draft']

# If you are appending below data already in that column, use this:
# last_row = len(sheet1['K'])
# else use this:
last_row = 1  # Excel rows start at 1; adjust if that column has a header

for counter, item in enumerate(ox):
    # K is the 11th column.
    sheet1.cell(row=(last_row + counter), column=11).value = item

# Need to save the file or the changes won't be reflected
file1.save(file_dir)
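If you prefer addressing cells by column letter instead of index, this is an equivalent way to write the same loop (same assumptions as above):

for counter, item in enumerate(ox):
    sheet1['K' + str(last_row + counter)] = item
file1.save(file_dir)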
I am trying to do a calculation and write the result to another txt file using a multiprocessing program. I am getting a count mismatch in the output txt file; every time I execute it I get a different output count.
I am new to Python; could someone please help?
import pandas as pd
import multiprocessing as mp

source = "\\share\usr\data.txt"
target = "\\share\usr\data_masked.txt"
Chunk = 10000

def process_calc(df):
    '''
    get source df do calc and return newdf
    ...
    '''
    return(newdf)

def calc_frame(df):
    output_df = process_calc(df)
    output_df.to_csv(target,index=None,sep='|',mode='a',header=False)

if __name__ == '__main__':
    reader = pd.read_table(source,sep='|',chunksize = chunk,encoding='ANSI')
    pool = mp.Pool(mp.cpu_count())
    jobs = []
    for each_df in reader:
        process = mp.Process(target=calc_frame,args=(each_df)
        jobs.append(process)
        process.start()
    for j in jobs:
        j.join()
You have several issues in the code as posted that would prevent it from even compiling, let alone running. I have attempted to correct those while also solving your main problem, but do check the code below thoroughly to make sure the corrections make sense.
First, the args argument to the Process constructor should be a tuple. You specified args=(each_df), but (each_df) is not a tuple, it is just a parenthesized expression; you need (each_df,) to make it a tuple (the statement is also missing a closing parenthesis).
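For example, the corrected constructor call would be:

process = mp.Process(target=calc_frame, args=(each_df,))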
Beyond that, you make no provision against multiple processes appending to the same file simultaneously, and you cannot be assured of the order in which the processes complete, so you have no real control over the order in which the dataframes are appended to the csv file.
The solution is to use a processing pool with the imap method. The iterable to pass to this method is just the reader, which when iterated yields the next dataframe to process. The return value of imap is itself an iterable that, when iterated, yields the next return value from calc_frame in task-submission order, i.e. the same order in which the dataframes were submitted. So as these new, modified dataframes come back, the main process can simply append them to the output file one by one:
import pandas as pd
import multiprocessing as mp

source = r"\\share\usr\data.txt"
target = r"\\share\usr\data_masked.txt"
Chunk = 10000

def process_calc(df):
    '''
    get source df do calc and return newdf
    ...
    '''
    return newdf

def calc_frame(df):
    output_df = process_calc(df)
    return output_df

if __name__ == '__main__':
    with mp.Pool() as pool:
        reader = pd.read_table(source, sep='|', chunksize=Chunk, encoding='ANSI')
        for output_df in pool.imap(calc_frame, reader):
            output_df.to_csv(target, index=None, sep='|', mode='a', header=False)
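A small variant of the same idea (my own sketch, not part of the original answer): open the target file once in the main process instead of re-opening it in append mode for every chunk.

if __name__ == '__main__':
    reader = pd.read_table(source, sep='|', chunksize=Chunk, encoding='ANSI')
    with mp.Pool() as pool, open(target, 'w', newline='') as out:
        for output_df in pool.imap(calc_frame, reader):
            output_df.to_csv(out, index=False, sep='|', header=False)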
Hello, I am trying to read in multiple files, create a dataframe of the specific key information I need from each file, and then append each of those dataframes to a main dataframe called topics. I have tried the following code.
import pandas as pd
import numpy as np
from lxml import etree
import os

topics = pd.DataFrame()

for filename in os.listdir('./topics'):
    if not filename.startswith('.'):
        #print(filename)
        tree = etree.parse('./topics/'+filename)
        root = tree.getroot()
        childA = []
        elementT = []
        ElementA = []
        for child in root:
            elementT.append(str(child.tag))
            ElementA.append(str(child.attrib))
            childA.append(str(child.attrib))
            for element in child:
                elementT.append(str(element.tag))
                #childA.append(child.attrib)
                ElementA.append(str(element.attrib))
                childA.append(str(child.attrib))
                for sub in element:
                    #print('***', child.attrib , ':' , element.tag, ':' , element.attrib, '***')
                    #childA.append(child.attrib)
                    elementT.append(str(sub.tag))
                    ElementA.append(str(sub.attrib))
                    childA.append(str(child.attrib))
        df = pd.DataFrame()
        df['c'] = np.array(childA)
        df['t'] = np.array(ElementA)
        df['a'] = np.array(elementT)
        file = df['t'].str.extract(r'([A-Z][A-Z].*[words.xml])#')
        start = df['t'].str.extract(r'words([0-9]+)')
        stop = df['t'].str.extract(r'.*words([0-9]+)')
        tags = df['a'].str.extract(r'.*([topic]|[pointer]|[child])')
        rootTopic = df['c'].str.extract(r'rdhillon.(\d+)')
        df['f'] = file
        df['start'] = start
        df['stop'] = stop
        df['tags'] = tags
        # c = topic
        # r = pointerr
        # d = child
        df['topicID'] = rootTopic
        df = df.iloc[:,3:]
        topics.append(df)
However, when I call topics I get the following output:
topics
Out[19]:_
Can someone please let me know where I am going wrong? Also, any suggestions on improving my messy code would be appreciated.
Unlike with lists, appending to a DataFrame returns a new object. So topics.append(df) returns an object that you never store anywhere, and topics remains the empty DataFrame you declared before the loop. You can fix this with
topics = topics.append(df)
However, appending to a DataFrame inside a loop is a very costly exercise. Instead, append each per-file DataFrame to a list within the loop and call pd.concat() on that list of DataFrames after the loop.
import pandas as pd

topics_list = []

for filename in os.listdir('./topics'):
    # All of your code
    topics_list.append(df)  # Lists are modified by append

# After the loop, one call to concat
topics = pd.concat(topics_list)
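If you want a fresh 0..n index on the combined frame rather than each file's original index, pd.concat also accepts ignore_index=True:

topics = pd.concat(topics_list, ignore_index=True)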
I can combine 2 CSV files with the script below and it works well.
import pandas
csv1=pandas.read_csv('1.csv')
csv2=pandas.read_csv('2.csv')
merged=csv1.merge(csv2,on='field1')
merged.to_csv('output.csv',index=False)
Now, I would like to combine more than 2 csvs using the same method as above.
I have list of CSV which I defined to something like this
import pandas

collection=['1.csv','2.csv','3.csv','4.csv']
for i in collection:
    csv=pandas.read_csv(i)
    merged=csv.merge(??,on='field1')
merged.to_csv('output2.csv',index=False)
I haven't got it to work so far with more than 2 CSVs. I guess it's just a matter of iterating inside the list... any ideas?
You need special handling for the first loop iteration:
import pandas

collection=['1.csv','2.csv','3.csv','4.csv']
result = None
for i in collection:
    csv = pandas.read_csv(i)
    if result is None:
        result = csv
    else:
        result = result.merge(csv, on='field1')

if result is not None:
    result.to_csv('output2.csv', index=False)
Another alternative would be to load the first CSV outside the loop, but this breaks when the collection is empty:
import pandas

collection=['1.csv','2.csv','3.csv','4.csv']
result = pandas.read_csv(collection[0])
for i in collection[1:]:
    csv = pandas.read_csv(i)
    result = result.merge(csv, on='field1')

result.to_csv('output2.csv', index=False)
A third idea was to start from an empty DataFrame and merge everything into it. Creating one is easy (pandas.DataFrame()), but the approach does not really work here, because an empty frame has no 'field1' column to merge on:

import pandas

collection = ['1.csv','2.csv','3.csv','4.csv']
result = pandas.DataFrame()  # empty frame: the merge below fails on the first iteration
for i in collection:
    csv = pandas.read_csv(i)
    result = result.merge(csv, on='field1')
result.to_csv('output2.csv', index=False)
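A more compact variant (my own sketch, not from the answer above) chains the merges with functools.reduce, which also sidesteps the empty-frame question:

import pandas
from functools import reduce

collection = ['1.csv', '2.csv', '3.csv', '4.csv']
frames = [pandas.read_csv(f) for f in collection]
if frames:
    merged = reduce(lambda left, right: left.merge(right, on='field1'), frames)
    merged.to_csv('output2.csv', index=False)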