I have working code that writes from a large dataframe to separate sheets in an Excel file, but it takes a long time, about 30-40 minutes. I would like to find a way for it to run faster using multiprocessing.
I tried to rewrite it using multiprocessing so that writing to each Excel tab could be done in parallel across multiple processes. The revised code runs without errors, but it is not writing to the Excel file properly. Any suggestions would be helpful.
Original working section of code:
import os
from excel_writer import append_df_to_excel
import pandas as pd

path = os.path.dirname(
    os.path.abspath(__file__)) + '\\fund_data.xlsx'   # get path to current directory and excel filename for data

data_cols = df_all.columns.values.tolist()            # Create a list of the columns in the final dataframe
# print(data_cols)

for column in data_cols:                              # For each column in the dataframe
    df_col = df_all[column].unstack(level = -1)       # unstack so Dates are across the top oldest to newest
    df_col = df_col[df_col.columns[::-1]]             # reorder so dates are newest to oldest
    # print(df_col)
    append_df_to_excel(path, df_col, sheet_name = column, truncate_sheet = True,
                       startrow = 0)                  # Add data to excel file
Revised code trying multiprocessing:
import os
from excel_writer import append_df_to_excel
import pandas as pd
import multiprocessing

def data_to_excel(col, excel_fn, data):
    data_fr = pd.DataFrame(data)   # switch list back to dataframe for putting into excel file sheets
    append_df_to_excel(excel_fn, data_fr, sheet_name = col, truncate_sheet = True, startrow = 0)   # Add data to sheet in excel file

if __name__ == "__main__":
    path = os.path.dirname(
        os.path.abspath(__file__)) + '\\fund_data.xlsx'   # get path to current directory and excel filename for data

    data_cols = df_all.columns.values.tolist()             # Create a list of the columns in the final dataframe
    # print(data_cols)

    pool = multiprocessing.Pool(processes = multiprocessing.cpu_count())
    for column in data_cols:                                # For each column in the dataframe
        df_col = df_all[column].unstack(level = -1)         # unstack so Dates are across the top oldest to newest
        df_col = df_col[df_col.columns[::-1]]               # reorder so dates are newest to oldest
        # print(df_col)
        data_col = df_col.values.tolist()                   # convert dataframe column to a list to use in pool
        pool.apply_async(data_to_excel, args = (column, path, data_col))
    pool.close()
    pool.join()
I do not know the proper way to write to a single file from multiple processes, but I had to solve a similar problem. I solved it by creating a writer process which receives data through a Queue. You can see my solution here (sorry, it is not documented).
Simplified version (draft)
import logging
import queue
from multiprocessing import Process, Queue

input_queue = Queue()
res_queue = Queue()
process_list = []

def do_calculation(in_queue, out_queue, calculate_function):
    try:
        while True:
            data = in_queue.get(False)
            try:
                res = calculate_function(**data)
                out_queue.put(res)
            except ValueError as e:
                out_queue.put("fail")
                logging.error(f"fail on {data}")
    except queue.Empty:
        return

# put data in input queue

def save_process(out_queue, file_path, count):
    for i in range(count):
        data = out_queue.get()
        if data == "fail":
            continue
        # write to excel here

# process_num, calculate_function, path_to_excel and data_size are placeholders in this draft
for i in range(process_num):
    p = Process(target=do_calculation, args=(input_queue, res_queue, calculate_function))
    p.start()
    process_list.append(p)

p2 = Process(target=save_process, args=(res_queue, path_to_excel, data_size))
p2.start()

p2.join()
for p in process_list:
    p.join()
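For what it is worth, here is a rough, untested sketch of how that queue-plus-writer pattern could be mapped onto the original question, reusing the names from it (append_df_to_excel, df_all, data_cols, path). Only the writer process ever touches the workbook:

import queue
from multiprocessing import Process, Queue

def prepare_column(in_queue, out_queue, df_all):
    # Workers only do the reshaping; they never touch the Excel file.
    try:
        while True:
            column = in_queue.get(timeout=1)
            df_col = df_all[column].unstack(level=-1)              # dates across the top
            out_queue.put((column, df_col[df_col.columns[::-1]]))  # newest date first
    except queue.Empty:
        return

def excel_writer(out_queue, excel_fn, count):
    # The only process that ever opens the workbook, so writes cannot collide.
    for _ in range(count):
        sheet_name, df_col = out_queue.get()
        append_df_to_excel(excel_fn, df_col, sheet_name=sheet_name,
                           truncate_sheet=True, startrow=0)

if __name__ == "__main__":
    in_queue, out_queue = Queue(), Queue()
    for column in data_cols:
        in_queue.put(column)

    writer = Process(target=excel_writer, args=(out_queue, path, len(data_cols)))
    writer.start()

    workers = [Process(target=prepare_column, args=(in_queue, out_queue, df_all))
               for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    writer.join()

Note that the Excel writes themselves are still serialized in the writer process, so this mainly helps if preparing each sheet's data is the slow part.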
So this is kind of weird but I'm new to Python and I'm committed to seeing my first project with Python through to the end.
So I am reading about 100 .xlsx files in from a file path. I then trim each file and send only the important information to a list, as an individual and unique dataframe. So now I have a list of 100 unique dataframes, but iterating through the list and writing to Excel just overwrites the data in the file. I want to append to the end of the .xlsx file. The biggest catch to all of this is that I can only use Excel 2010; I do not have any other version of the application. The openpyxl library seems to have some interesting stuff, so I've tried something like this:
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

wb = load_workbook(outfile_path)
ws = wb.active

for frame in main_df_list:
    for r in dataframe_to_rows(frame, index = True, header = True):
        ws.append(r)
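One thing that stands out in that snippet is that the workbook is never saved, so the appended rows never reach the file on disk. A minimal sketch, assuming outfile_path and main_df_list are defined as in the snippet above:

from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

wb = load_workbook(outfile_path)
ws = wb.active

for frame in main_df_list:
    for r in dataframe_to_rows(frame, index=True, header=True):
        ws.append(r)

wb.save(outfile_path)   # without a save, the file on disk is left unchanged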
Note: In another post I was told it's not best practice to read dataframes line by line using loops, but when I started I didn't know that. I am however committed to this monstrosity.
Edit after reading Comments
So my code scrapes .xlsx files and stores specific data, based on a keyword comparison, into dataframes. These dataframes are stored in a list; I will include the entire program below so hopefully I can explain what's in my head. Also, feel free to roast my code, because I have no idea what good Python practice actually is vs. what is not.
import os
import pandas as pd
from openpyxl import load_workbook

#the file path I want to pull from
in_path = r'W:\R1_Manufacturing\Parts List Project\Tool_scraping\Excel'
#the file path where row search items are stored
search_parameters = r'W:\R1_Manufacturing\Parts List Project\search_params.xlsx'
#the file I will write the dataframes to
outfile_path = r'W:\R1_Manufacturing\Parts List Project\xlsx_reader.xlsx'

#establishing the lists that I will store looped data into
file_list = []
main_df = []
master_list = []

#open the file path to store the directory in files
files = os.listdir(in_path)

#database with terms that I want to track
search = pd.read_excel(search_parameters)
search_size = search.index

#searching only for files that end with .xlsx
for file in files:
    if file.endswith('.xlsx'):
        file_list.append(in_path + '/' + file)

#read in the files to a dataframe, main loop the files will be manipulated in
for current_file in file_list:
    df = pd.read_excel(current_file)
    #get column headers and a range for total rows
    columns = df.columns
    total_rows = df.index
    #lists to store where the headers are located in the DF
    row_list = []
    column_list = []
    header_list = []
    for name in columns:
        for number in total_rows:
            cell = df.at[number, name]
            if isinstance(cell, str) == False:
                continue
            elif cell == '':
                continue
            for place in search_size:
                search_loop = search.at[place, 'Parameters']
                #main compare, if str and matches search params, then do...
                if insensitive_compare(search_loop, cell) == True:
                    if cell not in header_list:
                        header_list.append(df.at[number, name])   #store data headers
                        row_list.append(number)                   #store row number where it is in that data frame
                        column_list.append(name)                  #store column number where it is in that data frame
                    else:
                        continue
                else:
                    continue
    for thing in column_list:
        df = pd.concat([df, pd.DataFrame(0, columns=[thing], index = range(2))], ignore_index = True)
    #turns the dataframe into a set of booleans where it's true if
    #there's something there
    na_finder = df.notna()
    #create a new dataframe to write the output to
    outdf = pd.DataFrame(columns = header_list)
    for i in range(len(row_list)):
        k = 0
        while na_finder.at[row_list[i] + k, column_list[i]] == True:
            #I turn the dataframe into booleans and read until False
            if df.at[row_list[i] + k, column_list[i]] not in header_list:
                #Store actual dataframe into my output dataframe, outdf
                outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
            k += 1
    main_df.append(outdf)
So main_df is a list that has 100+ dataframes in it. For this example I will only use 2 of them. I would like them to print out into Excel like the example shown (image omitted here).
So the comment from Ashish really helped me: all of the dataframes had different column titles, so my 100+ dataframes eventually concatenated into a dataframe that is 569x52. Here is the code that I used. I completely abandoned openpyxl, because once I was able to concat all of the dataframes together, I just had to export it using pandas:
# what I want to do here is grab all the data in the same column as each
# header, then move to the next column
for i in range(len(row_list)):
    k = 0
    while na_finder.at[row_list[i] + k, column_list[i]] == True:
        if df.at[row_list[i] + k, column_list[i]] not in header_list:
            outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
        k += 1
main_df.append(outdf)

to_xlsx_df = pd.DataFrame()
for frame in main_df:
    to_xlsx_df = pd.concat([to_xlsx_df, frame])

to_xlsx_df.to_excel(outfile_path)
The output to Excel ended up looking something like the example shown (image omitted here).
Hopefully this can help someone else out too.
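One small side note for anyone adapting this: pd.concat also accepts the whole list of frames at once, which avoids re-copying the accumulated result on every loop iteration. A sketch using the same names as above:

to_xlsx_df = pd.concat(main_df)   # concatenate the whole list in one call
to_xlsx_df.to_excel(outfile_path)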
I am trying to do a calculation and write the result to another txt file using a multiprocessing program. I am getting a count mismatch in the output txt file; every time I execute it I get a different output count.
I am new to Python; could someone please help?
import pandas as pd
import multiprocessing as mp

source = "\\share\usr\data.txt"
target = "\\share\usr\data_masked.txt"

Chunk = 10000

def process_calc(df):
    '''
    get source df do calc and return newdf
    ...
    '''
    return(newdf)

def calc_frame(df):
    output_df = process_calc(df)
    output_df.to_csv(target,index=None,sep='|',mode='a',header=False)

if __name__ == '__main__':
    reader = pd.read_table(source,sep='|',chunksize = chunk,encoding='ANSI')
    pool = mp.Pool(mp.cpu_count())
    jobs = []
    for each_df in reader:
        process = mp.Process(target=calc_frame,args=(each_df)
        jobs.append(process)
        process.start()
    for j in jobs:
        j.join()
You have several issues in your source as posted that would prevent it from even compiling, let alone running. I have attempted to correct those in an effort to also solve your main problem. But do check the code below thoroughly just to make sure the corrections make sense.
First, the args argument to the Process constructor should be specified as a tuple. You have specified args=(each_df), but (each_df) is not a tuple; it is just a parenthesized expression. You need (each_df,) to make it a tuple (the statement is also missing a closing parenthesis).
Beyond making no provision against multiple processes simultaneously appending to the same file, the other problem is that you cannot be assured of the order in which the processes complete, and thus you have no real control over the order in which the dataframes are appended to the csv file.
The solution is to use a processing pool with the imap method. The iterable to pass to this method is just the reader, which when iterated returns the next dataframe to process. The return value from imap is an iterable that, when iterated, returns the next return value from calc_frame in task-submission order, i.e. the same order in which the dataframes were submitted. So as these new, modified dataframes are returned, the main process can simply append them to the output file one by one:
import pandas as pd
import multiprocessing as mp

source = r"\\share\usr\data.txt"
target = r"\\share\usr\data_masked.txt"

Chunk = 10000

def process_calc(df):
    '''
    get source df do calc and return newdf
    ...
    '''
    return(newdf)

def calc_frame(df):
    output_df = process_calc(df)
    return output_df

if __name__ == '__main__':
    with mp.Pool() as pool:
        reader = pd.read_table(source, sep='|', chunksize=Chunk, encoding='ANSI')
        for output_df in pool.imap(calc_frame, reader):
            output_df.to_csv(target, index=None, sep='|', mode='a', header=False)
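A small optional variation on the main block above (my own sketch, not part of the original answer): because the parent process is now the only writer, the target file can be opened once in write mode, which also avoids stale rows lingering from a previous run behind mode='a':

if __name__ == '__main__':
    reader = pd.read_table(source, sep='|', chunksize=Chunk, encoding='ANSI')
    with mp.Pool() as pool, open(target, 'w', newline='') as out:
        # imap returns results in submission order, so the output keeps the input order
        for output_df in pool.imap(calc_frame, reader):
            output_df.to_csv(out, index=False, sep='|', header=False)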
I tried out the pool.map approach given in similar answers, but I ended up with 8 files of 23 GB each, which is worse.
import os
import pandas as pd
from tqdm import tqdm
from multiprocessing import Pool

#csv file name to be read in
in_csv = 'mycsv53G.csv'

#get the number of lines of the csv file to be read
number_lines = sum(1 for row in tqdm((open(in_csv, encoding = 'latin1')), desc = 'Reading number of lines....'))
print (number_lines)

#size of rows of data to write to the CSV,
#you can change the row size according to your need
rowsize = 11367260 #decided based on your CPU core count

#start looping through data writing it to a new file for each set
def reading_csv(filename):
    for i in tqdm(range(1, number_lines, rowsize), desc = 'Reading CSVs...'):
        print ('in reading csv')
        df = pd.read_csv(in_csv, encoding='latin1',
                         low_memory=False,
                         header=None,
                         nrows = rowsize,   #number of rows to read at each loop
                         skiprows = i)      #skip rows that have been read
        #csv to write data to a new file with indexed name. input_1.csv etc.
        out_csv = './csvlist/input' + str(i) + '.csv'
        df.to_csv(out_csv,
                  index=False,
                  header=False,
                  mode='a',            #append data to csv file
                  chunksize=rowsize)   #size of data to append for each loop

def main():
    # get a list of file names
    files = os.listdir('./csvlist')
    file_list = [filename for filename in tqdm(files) if filename.split('.')[1] == 'csv']
    # set up your pool
    with Pool(processes=8) as pool:   # or whatever your hardware can support
        print ('in Pool')
        # have your pool map the file names to dataframes
        try:
            df_list = pool.map(reading_csv, file_list)
        except Exception as e:
            print (e)

if __name__ == '__main__':
    main()
The above approach took 4 hours just to split the file concurrently, and parsing each resulting CSV will take even longer. I am not sure whether multiprocessing helped at all!
Currently, I read the CSV file through this code:
import pandas as pd
import datetime
import numpy as np

for chunk in pd.read_csv(filename, chunksize = 10**5, encoding='latin-1', skiprows=1, header=None):
    #chunk processing
    final_df = final_df.append(agg, ignore_index=True)

final_df.to_csv('Final_Output/'+output_name, encoding='utf-8', index=False)
It takes close to 12 hours to process the large CSV at once.
What improvements can be made here?
Any suggestions?
I am willing to try out dask now, since I have no other options left.
I would read the csv file line by line and feed the lines into a queue from which the processes pick their tasks. This way, you don't have to split the file first.
See this example here: https://stackoverflow.com/a/53847284/4141279
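To make that idea concrete, here is a rough sketch (my own, not the linked answer verbatim). The file name and encoding are taken from the question; the per-line processing is left as a stub:

from multiprocessing import Process, Queue

def worker(line_queue, out_path):
    # Consume lines until the sentinel arrives; do the real parsing/aggregation here.
    with open(out_path, 'w', encoding='latin1') as out:
        while True:
            line = line_queue.get()
            if line is None:          # sentinel: no more work
                break
            out.write(line)           # placeholder for the actual per-line processing

def main():
    line_queue = Queue(maxsize=10000)  # bounded, so the reader cannot outrun the workers
    workers = [Process(target=worker, args=(line_queue, 'out_%d.csv' % i))
               for i in range(8)]
    for w in workers:
        w.start()
    with open('mycsv53G.csv', encoding='latin1') as f:
        next(f)                        # skip the header row
        for line in f:
            line_queue.put(line)
    for _ in workers:
        line_queue.put(None)           # one sentinel per worker
    for w in workers:
        w.join()

if __name__ == '__main__':
    main()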
So I have looked through multiple ways to compare the two files.
One of the ways I discovered to compare the two files was to use the pandas module. The other way I discovered was to use the numpy module within Python; I am also using various other modules to help me work with Excel sheets. The main thing is that I have an ASCII text file that I need to compare with another file. Both files are the same size, and I have even included a check to see if the files are the same size, but I think there is something wrong with the conditional statements that check the overall size of the two files. So basically I need some advice here on how to compare the two files.
The text file uses UTF-8 encoding.
The information will look like this:
StudentID,Code,Date,FirstName,LastName,GradeLevel,CampusID
000001,R,mm/dd/yyyy/,JOHN,SMITH,01,00001
The header is not seen within the original file I have to compare.
Original File Header:
StudentID,Code,Date,FirstName,LastName,GradeLevel,CampusID
The file I pull out from our SIS database, has headers that match
StudentID,Code,Date,FirstName,LastName,GradeLevel,CampusID
But some of the formatting is a little different.
For example, the date is not mm/dd/yyyy and the CampusID is only ###.
The documentation that I have looked at to help me has shown the following:
Using Pandas to Compare Two Dataframes
Using Pandas to Compare Two Excel Files
Working with Missing Data
Pandas Cookbook
Now I have been able to print out data in a concatenated data frame, but I have not really been able to run comparisons yet or highlight the differences between the two files, be it text or Excel files. I was curious if anyone could point me in a direction, or a better direction if they know how to compare files.
I am using the following code right now and it is at least printing the data frames, but it doesn't seem to be doing anything other than printing them as a pandas data frame.
#!/bin/python
# ===========================================================
# Created By: Richard Barrett
# Organization: DVISD
# DepartmenT: Data Services
# Purpose: Dynamic Excel Diff Comparison Report
# Date: 02/28/2020
# ===========================================================
import getpass
import json
import logging
import numpy as np
import os
import pandas as pd
import platform
import shutil
import subprocess
import threading
import time
import unittest
import xlsxwriter
from datetime import date
# System Variables
today = date.today()
date = today.strftime("%m/%d/%Y")
node = platform.node()
system = platform.system()
username = getpass.getuser()
version = platform.version()
working_directory = os.getcwd()
pd.set_option('display.max_rows', None)
# File Variables on Relative Path within CWD
file_1 = "ExportPOSStudents.xlsx"
file_2 = "ExportNutrikidsSkywardCompare.xlsx"
# Column Variables to Compare
e_code = "Eligibility Code"
e_date = "Effective Date"
f_name = "First Name"
l_name = "Last Name"
# Logging Variables
# Ensure that the Files Exist
if os.path.exists(file_1) and os.path.exists(file_2):
    print("The Files Exist.")
else:
    print("One of the files might not exist.")
# Create Dataframes
df1 = pd.read_excel(file_1)
df2 = pd.read_excel(file_2)
print(df1)
print(df2)
# Check to See if Files are Same Size
df1.equals(df2)
if print(df1.equals(df2)) is False:
    print("Dataframes are not the same size.")
else:
    print("Dataframes are the same size.")

df1[e_date].equals(df2[e_date])

if print(df1[e_date].equals(df2[e_date])) is False:
    print("The Entries are not the same within column for e_date.")
else:
    print("The Entries are the same within the columns for e_date.")
#comparison_values = df1.values == df2.values
#print(comparison_values)
#if df2.equals(df1) == False:
# print("Datframes are not of the the same size.")
#else df2.equals(df1) == True:
# print("Dataframes are of the same size.")
# If Files are Not Same Size Check Indexes and Column Names and Format
# Check Indexes and Size
# Compare Dataframe Values
#if comparison_values = df1.values == df2.values
# print(comparison_values)
#else:
# print("Cannot compare Dataframes.")
# Get-Index of Cell with Parameter == False
#rows,cols=np.where(comparison_values==False)
# Iterate over Cells and Update (df1) value to display changed value in second dataframe (df2)
#for item in zip(rows,cols):
# df1.iloc[item[0], item[1]] = '{} --> {}'.format(df1.iloc[item[0], item[1]],df2.i
# Export to Excel after df1(Old Value) --> df2(New Value)
#df1.to_excel('./excel_diff.xlsx',index=False,header=True)
You can see the main code and process here that I am trying to achieve: Link to Code and Process
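As an aside on the size checks in the question: print() always returns None, so a test like if print(df1.equals(df2)) is False: never looks at the actual comparison result (None is False is always False). A corrected form of those two checks, using the names from the code above, would be something like:

if not df1.equals(df2):
    print("Dataframes are not the same.")
else:
    print("Dataframes are the same.")

if not df1[e_date].equals(df2[e_date]):
    print("The entries are not the same within column e_date.")
else:
    print("The entries are the same within column e_date.")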
I was able to compare two Excel sheets using the following approach.
Afterwards, you will have a new spreadsheet with three worksheets in it.
import pandas as pd
from pathlib import Path

def excel_diff(path_OLD, path_NEW, index_col):
    df_OLD = pd.read_excel(path_OLD, index_col=index_col).fillna(0)
    df_NEW = pd.read_excel(path_NEW, index_col=index_col).fillna(0)

    # Perform Diff
    dfDiff = df_NEW.copy()
    droppedRows = []
    newRows = []

    cols_OLD = df_OLD.columns
    cols_NEW = df_NEW.columns
    sharedCols = list(set(cols_OLD).intersection(cols_NEW))

    for row in dfDiff.index:
        if (row in df_OLD.index) and (row in df_NEW.index):
            for col in sharedCols:
                value_OLD = df_OLD.loc[row, col]
                value_NEW = df_NEW.loc[row, col]
                if value_OLD == value_NEW:
                    dfDiff.loc[row, col] = df_NEW.loc[row, col]
                else:
                    dfDiff.loc[row, col] = ('{}→{}').format(value_OLD, value_NEW)
        else:
            newRows.append(row)

    for row in df_OLD.index:
        if row not in df_NEW.index:
            droppedRows.append(row)
            dfDiff = dfDiff.append(df_OLD.loc[row, :])

    dfDiff = dfDiff.sort_index().fillna('')
    print(dfDiff)
    print('\nNew Rows: {}'.format(newRows))
    print('Dropped Rows: {}'.format(droppedRows))

    # Save output and format
    fname = '{} vs {}.xlsx'.format(path_OLD.stem, path_NEW.stem)
    writer = pd.ExcelWriter(fname, engine='xlsxwriter')

    dfDiff.to_excel(writer, sheet_name='DIFF', index=True)
    df_NEW.to_excel(writer, sheet_name=path_NEW.stem, index=True)
    df_OLD.to_excel(writer, sheet_name=path_OLD.stem, index=True)

    # get xlsxwriter objects
    workbook = writer.book
    worksheet = writer.sheets['DIFF']
    worksheet.hide_gridlines(2)
    worksheet.set_default_row(15)

    # define formats
    date_fmt = workbook.add_format({'align': 'center', 'num_format': 'yyyy-mm-dd'})
    center_fmt = workbook.add_format({'align': 'center'})
    number_fmt = workbook.add_format({'align': 'center', 'num_format': '#,##0.00'})
    cur_fmt = workbook.add_format({'align': 'center', 'num_format': '$#,##0.00'})
    perc_fmt = workbook.add_format({'align': 'center', 'num_format': '0%'})
    grey_fmt = workbook.add_format({'font_color': '#E0E0E0'})
    highlight_fmt = workbook.add_format({'font_color': '#FF0000', 'bg_color': '#B1B3B3'})
    new_fmt = workbook.add_format({'font_color': '#32CD32', 'bold': True})

    # set format over range
    ## highlight changed cells
    worksheet.conditional_format('A1:ZZ1000', {'type': 'text',
                                               'criteria': 'containing',
                                               'value': '→',
                                               'format': highlight_fmt})

    # highlight new/changed rows
    for row in range(dfDiff.shape[0]):
        if row + 1 in newRows:
            worksheet.set_row(row + 1, 15, new_fmt)
        if row + 1 in droppedRows:
            worksheet.set_row(row + 1, 15, grey_fmt)

    # save
    writer.save()
    print('\nDone.\n')

def main():
    path_OLD = Path('v1.xlsx')
    path_NEW = Path('v2.xlsx')

    # get index col from data
    df = pd.read_excel(path_NEW)
    index_col = df.columns[0]
    print('\nIndex column: {}\n'.format(index_col))

    excel_diff(path_OLD, path_NEW, index_col)

if __name__ == '__main__':
    main()
You will have to work with the highlighting in the workbook, as I am still having an issue with it. Note: this answer works only with Excel spreadsheets.
I'm building a script that receives data from a json/REST stream and then adds it to a database. I would like to build a buffer that collects the data from the stream and stores it until it is successfully inserted into the database.
The idea is that one thread would stream the data from the api into the dataframe and the other thread would try to submit the data into the database, removing the items from the dataframe once they are successfully inserted into the database.
I wrote the following code to test the concept - only problem is, it doesn't work!
import threading
from threading import Thread
import pandas as pd
import numpy as np
import time
from itertools import count

# set delay
d = 5

# add items to dataframe every few seconds
def appender():
    testdf = pd.DataFrame([])
    print('starting streamsim')
    for i in count():
        testdf1 = pd.DataFrame(np.random.randint(0,100,size=(np.random.randint(0,25), 4)), columns=list('ABCD'))
        testdf = testdf.append(testdf1)
        print('appended')
        print('len is now {0}'.format(len(testdf)))
        return testdf
        time.sleep(np.random.randint(0,5))

# combine the dfs, and operate on them
def dumper():
    print('starting dumpsim')
    while True:
        # check if there are values in the df
        if len(testdf.index) > 0:
            print('searching for values')
            for index, row in testdf.iterrows():
                if row['A'] < 10:
                    testdf.drop(index, inplace=True)
                    print('val dropped')
                else:
                    print('no vals found')

            # try to add rows to csv to simulate sql insert command
            for index, row in testdf.iterrows():
                # try to append to csv
                try:
                    with open('footest.csv', 'a') as f:
                        row.to_csv(f, mode= 'a', header=True)
                except:
                    print('append operation failed, skipping')
                    pass
                #if operation succeeds, drop the row
                else:
                    testdf.drop(index)
                    print('row dropped after success')

        if len(testdf.index) == 0:
            print('df is empty')
            pass

        time.sleep(d)

if __name__ == '__main__':
    Thread(target = appender).start()
    Thread(target = dumper).start()
Is there a way to make this work? Or is a DataFrame 'locked' while one thread is working on it?
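For what it is worth, a common way to get the buffer described above without two threads mutating the same DataFrame is a thread-safe queue.Queue: the streaming thread puts batches in, the writer thread takes them out and only discards a batch once the insert succeeds. A rough sketch of that idea, with the database insert stubbed out as a CSV append (footest.csv as in the test code above):

import queue
import threading
import time
import numpy as np
import pandas as pd

buffer = queue.Queue()          # thread-safe buffer between the two threads

def appender():
    """Simulate the stream: push small batches of rows into the buffer."""
    while True:
        batch = pd.DataFrame(np.random.randint(0, 100, size=(np.random.randint(1, 25), 4)),
                             columns=list('ABCD'))
        buffer.put(batch)
        time.sleep(np.random.randint(0, 5))

def dumper():
    """Drain the buffer; a batch is only discarded after it is written successfully."""
    while True:
        batch = buffer.get()
        try:
            batch.to_csv('footest.csv', mode='a', header=False, index=False)
        except OSError:
            buffer.put(batch)   # write failed: put the batch back and retry later
            time.sleep(5)

if __name__ == '__main__':
    threading.Thread(target=appender, daemon=True).start()
    threading.Thread(target=dumper, daemon=True).start()
    while True:
        time.sleep(1)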