I/O efficiency in Python


I am writing a program that:
Reads the content of each row of one Excel sheet (90,000 rows in total)
Compares that content with each row of another Excel sheet (600,000 rows in total)
If a match occurs, writes the matching entry into a new Excel sheet
I have written the script and everything works fine. However, the computation time is huge: in an hour it has processed just 200 rows from the first sheet, writing 200 different files.
Is there a different way to save the matches, given that I am going to use them later on? Could I store them in a matrix or something similar?
import xlrd
import xlsxwriter
import os, itertools
from datetime import datetime
# choose the incident excel sheet
book_1 = xlrd.open_workbook('D:/Users/d774911/Desktop/Telstra Internship/Working files/Incidents.xlsx')
# choose the trap excel sheet
book_2 = xlrd.open_workbook("D:/Users/d774911/Desktop/Telstra Internship/Working files/Traps.xlsx")
# choose the features sheet
book_3 = xlrd.open_workbook("D:/Users/d774911/Desktop/Telstra Internship/Working files/Features.xlsx")
# select the working sheet, either by name or by index
Traps = book_2.sheet_by_name('Sheet1')
# select the working sheet, either by name or by index
Incidents = book_1.sheet_by_name('Sheet1')
# select the working sheet, either by name or by index
Features_Numbers = book_3.sheet_by_name('Sheet1')
#return the total number of rows for the traps sheet
Total_Number_of_Rows_Traps = Traps.nrows
# return the total number of rows for the incident sheet
Total_Number_of_Rows_Incidents = Incidents.nrows
# open a file to write down the non-matching incident numbers
print(Total_Number_of_Rows_Traps, Total_Number_of_Rows_Incidents)
write_no_matching = open('C:/Users/d774911/PycharmProjects/GlobalData/No_Matching.txt', 'w')
# For loop to iterate for all the row for the incident sheet
for Rows_Incidents in range(Total_Number_of_Rows_Incidents):
# Store content for the comparable cell for incident sheet
Incidents_Content_Affected_resources = Incidents.cell_value(Rows_Incidents, 47)
# Store content for the comparable cell for incident sheet
Incidents_Content_Product_Type = Incidents.cell_value(Rows_Incidents, 29)
# Convert Excel date type into python type
Incidents_Content_Date = xlrd.xldate_as_tuple(Incidents.cell_value(Rows_Incidents, 2), book_1.datemode)
# extract the year, month and day
Incidents_Content_Date = str(Incidents_Content_Date[0]) + ' ' + str(Incidents_Content_Date[1]) + ' ' + str(Incidents_Content_Date[2])
# Store content for the comparable cell for incident sheet
Incidents_Content_Date = datetime.strptime(Incidents_Content_Date, '%Y %m %d')
# extract the incident number
Incident_Name = Incidents.cell_value(Rows_Incidents, 0)
# Create a workbook for the selected incident
Incident_Name_Book = xlsxwriter.Workbook(os.path.join('C:/Users/d774911/PycharmProjects/GlobalData/Test/', Incident_Name + '.xlsx'))
# Create sheet name for the created workbook
Incident_Name_Sheet = Incident_Name_Book.add_worksheet('Sheet1')
# insert the first row that contains the features
Incident_Name_Sheet.write_row(0, 0, Features_Numbers.row_values(0))
Insert_Row_to_Incident_Sheet = 0
# For loop to iterate for all the row for the traps sheet
for Rows_Traps in range(Total_Number_of_Rows_Traps):
# Store content for the comparable cell for traps sheet
Traps_Content_Node_Name = Traps.cell_value(Rows_Traps, 3)
# Store content for the comparable cell for traps sheet
Traps_Content_Event_Type = Traps.cell_value(Rows_Traps, 6)
# extract date temporally
Traps_Content_Date_temp = Traps.cell_value(Rows_Traps, 10)
# Store content for the comparable cell for traps sheet
Traps_Content_Date = datetime.strptime(Traps_Content_Date_temp[0:10], '%Y-%m-%d')
# If the content matches partially or full
if len(str(Traps_Content_Node_Name)) * len(str(Incidents_Content_Affected_resources)) != 0 and \
str(Incidents_Content_Affected_resources).lower().find(str(Traps_Content_Node_Name).lower()) != -1 and \
len(str(Traps_Content_Event_Type)) * len(str(Incidents_Content_Product_Type)) != 0 and \
str(Incidents_Content_Product_Type).lower().find(str(Traps_Content_Event_Type).lower()) != -1 and \
len(str(Traps_Content_Date)) * len(str(Incidents_Content_Date)) != 0 and \
Traps_Content_Date <= Incidents_Content_Date:
# counter for writing inside the new incident sheet
Insert_Row_to_Incident_Sheet = Insert_Row_to_Incident_Sheet + 1
# Write the Incident information
Incident_Name_Sheet.write_row(Insert_Row_to_Incident_Sheet, 0, Incidents.row_values(Rows_Incidents))
# Write the Traps information
Incident_Name_Sheet.write_row(Insert_Row_to_Incident_Sheet, 107, Traps.row_values(Rows_Traps))
Incident_Name_Book.close()
Thanks

What you're doing is seeking and reading a little bit of data for each cell. This is very inefficient.
Try reading all the information in one go into the most basic sensible Python data structure (lists, dicts etc.), make your comparisons/operations on this data set in memory, and write all results in one go. If not all the data fits into memory, try to partition it into sub-tasks.
Having to read the data set 10 times, extracting a tenth of the data each time, will likely still be hugely faster than reading each cell independently.
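A minimal sketch of the "everything in memory" idea, assuming the rows have already been read into plain Python lists. The helper name `match_in_memory`, the simplified three-column layout and the sample rows are all hypothetical; the matching rule mirrors the question (case-insensitive substring match plus a date check). The matches end up in a dict keyed by incident instead of one workbook per incident.

```python
from collections import defaultdict
from datetime import date

def match_in_memory(incidents, traps):
    """Return {incident_id: [matching trap rows]} using only in-memory data.

    incidents: list of (incident_id, affected_resources, product_type, date)
    traps:     list of (node_name, event_type, date)
    """
    # Lower-case the trap strings once instead of once per comparison,
    # and drop traps with an empty node name or event type.
    prepared = [(node.lower(), event.lower(), day, (node, event, day))
                for node, event, day in traps if node and event]
    matches = defaultdict(list)
    for incident_id, resources, product, incident_day in incidents:
        resources, product = resources.lower(), product.lower()
        for node, event, trap_day, original in prepared:
            if node in resources and event in product and trap_day <= incident_day:
                matches[incident_id].append(original)
    return dict(matches)

# Hypothetical sample data
incidents = [("INC1", "Router-A, Router-B", "DSL service", date(2016, 1, 10)),
             ("INC2", "Switch-9", "Fibre", date(2016, 1, 5))]
traps = [("router-a", "dsl", date(2016, 1, 9)),
         ("router-a", "dsl", date(2016, 1, 11)),   # later than the incident: no match
         ("switch-9", "fibre", date(2016, 1, 5)),
         ("", "dsl", date(2016, 1, 1))]            # empty node name: skipped

result = match_in_memory(incidents, traps)
print(result)
```

This still compares every incident with every trap, so it only removes the per-cell read and per-incident file overhead; the number of comparisons stays N*M.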

I don't see how your code can work; the second loop works on variables which change for every row in the first loop but the second loop isn't inside of the first one.
That said, comparing files in this way has a complexity of O(N*M), which means that the runtime explodes quickly. In your case you try to execute 54'000'000'000 (54 billion) loop iterations.
If you run into this kind of problem, the solution is always a three step process:
Transform the data to make it easier to process
Put the data into efficient structures (sorted lists, dict)
Search the data with the efficient structures
You have to find a way to get rid of the find(). Try to get rid of all the junk in the cells that you want to compare so that you can use == instead. Once you have this, you can put the rows into a dict to find matches. Or you could load the data into a SQL database and use SQL queries (don't forget to add indexes!)
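As a rough illustration of the dict approach, here is a sketch that assumes the cells can be cleaned up (stripped and lower-cased) so that an exact comparison is enough. The helper `build_index` and the sample rows are hypothetical.

```python
from collections import defaultdict

def build_index(rows, key_column):
    """Map each normalised key to the list of rows that carry it."""
    index = defaultdict(list)
    for row in rows:
        key = str(row[key_column]).strip().lower()
        if key:                       # ignore empty cells
            index[key].append(row)
    return index

# Hypothetical sample data
traps = [("Router-A ", "link down"), ("switch-9", "power"), ("router-a", "reboot")]
incidents = [("INC1", "router-a"), ("INC2", "Switch-9"), ("INC3", "unknown")]

trap_index = build_index(traps, key_column=0)      # built once: O(M)
matches = {incident_id: trap_index.get(node.strip().lower(), [])
           for incident_id, node in incidents}     # one lookup per incident: O(N)
print(matches)
```

The work drops from N*M comparisons to roughly N+M operations, at the price of requiring exact keys rather than substring matches.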
One last trick is to use sorted lists. If you can sort the data in the same way, then you can use two lists:
Sort the data from the two sheets into two lists
Use two row counters (one per list)
If the current item from the first list is less than the current one from the second list, then there is no match and you have to advance the first row counter.
If the current item from the first list is bigger than the current one from the second list, then there is no match and you have to advance the second row counter.
If the items are the same, you have a match. Process the match and advance both counters.
This allows you to process all the items in one go.
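The steps above can be sketched as follows. This is only an illustration: the function name `sorted_matches` and the sample keys are made up, and it assumes each key appears at most once per list (with duplicates, advancing both counters on a match would skip some pairs).

```python
def sorted_matches(first, second):
    """Return the keys present in both sorted lists (keys assumed unique)."""
    i = j = 0
    found = []
    while i < len(first) and j < len(second):
        if first[i] < second[j]:
            i += 1                    # no partner for first[i]
        elif first[i] > second[j]:
            j += 1                    # no partner for second[j]
        else:
            found.append(first[i])    # match: process it, advance both
            i += 1
            j += 1
    return found

# Hypothetical sample keys
incident_keys = sorted(["router-c", "router-a", "switch-9"])
trap_keys = sorted(["switch-9", "router-b", "router-a", "hub-1"])
common = sorted_matches(incident_keys, trap_keys)
print(common)
```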

I would suggest that you use pandas. This module provides a large number of functions for comparing datasets. It also has very fast import/export routines for Excel files.
IMHO you should use the merge function and provide the arguments how='inner' and on=[list of your columns to compare]. That will create a new dataset containing only the rows that occur in both tables (having the same values in the specified columns). You can then export this new dataset to your Excel file.
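A small sketch of such a merge, with made-up column names and rows. Note that merge matches on equal values, so this assumes the columns have been cleaned up enough for exact comparison (the question's substring matching would need a different approach).

```python
import pandas as pd

# Hypothetical stand-ins for the two sheets
incidents = pd.DataFrame({
    "node": ["router-a", "switch-9", "router-c"],
    "event": ["dsl", "fibre", "dsl"],
    "incident": ["INC1", "INC2", "INC3"],
})
traps = pd.DataFrame({
    "node": ["router-a", "router-a", "switch-9", "hub-1"],
    "event": ["dsl", "dsl", "fibre", "dsl"],
    "trap_id": [101, 102, 103, 104],
})

# Keep only rows whose node AND event appear in both frames
matched = incidents.merge(traps, how="inner", on=["node", "event"])
print(matched)
# matched.to_excel("matches.xlsx", index=False)  # needs an Excel writer such as openpyxl
```

Every incident is paired with each trap that has the same key, so one incident can produce several output rows.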

Related

Python - How to optimize code to run faster? (lots of for loops in DataFrame)

I have code that works quite extensively with an Excel file (an SAP download), performing data transformation and calculation steps.
I need to loop through all the lines (a couple of thousand rows) a few times. I previously wrote code that adds the DataFrame columns separately, so I could do everything in one for loop, which was of course quite quick. However, I had to change the data source, which meant a change in the raw data structure.
In the raw data the first 3 rows are empty, then comes a title row with the column names, then 2 empty rows, and the first column is also empty. I decided to wipe these, assign column names and make them headers (steps below). Since then, however, adding the columns separately and later calculating everything in one for statement no longer fills any of these columns with data.
How could I optimize this code?
I have deleted some calculation steps since they are quite long and make the code even less readable.
#This function adds new column to the dataframe
def NewColdfConverter(*args):
for i in args:
dfConverter[i] = '' #previously used dfConverter[i] = NaN
#This function creates dataframe from excel file
def DataFrameCreator(path,sheetname):
excelFile = pd.ExcelFile(path)
global readExcel
readExcel = pd.read_excel(excelFile,sheet_name=sheetname)
#calling my function to create dataframe
DataFrameCreator(filePath,sheetName)
dfConverter = pd.DataFrame(readExcel)
#dropping NA values from Orders column (right now called Unnamed)
dfConverter.dropna(subset=['Unnamed: 1'], inplace=True)
#dropping rows and deleting other unnecessary columns
dfConverter.drop(dfConverter.head(1).index, inplace=True)
dfConverter.drop(dfConverter.columns[[0,11,12,13,17,22,23,48]], axis = 1,inplace = True)
#renaming columns from Unnamed 1: etc to proper names
dfConverter = dfConverter.rename(columns={'Unnamed: 1': 'propername1', 'Unnamed: 2': 'propername2'}) # etc.
#calling new column function -> this Day column appears in the 1st for loop
NewColdfConverter("Day")
#example for loop that worked prior, but not working since new dataset and new header/column steps added:
for i in range(len(dfConverter)):
#Day column-> floor Entry Date -1, if time is less than 5:00:00
if(dfConverter['Time'][i] <= time(hour=5,minute=0,second=0)):
dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])-timedelta(days=1)
else:
dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])
Problem is, there are many columns that build on one another, so I cannot get them in one for loop, for instance in below example I need to calculate reqsWoSetUpValue, so I can calculate requirementsValue, so I can calculate otherReqsValue, but I'm not able to do this within 1 for loop by assigning the values to the dataframecolumn[i] row, because the value will just be missing, like nothing happened.
(dfsorted is the same as dfConverter, but a sorted version of it)
#example code of getting reqsWoSetUpValue
for i in range(len(dfSorted)):
reqsWoSetUpValue[i] = #calculationsteps...
#inserting column with value
dfSorted.insert(49,'Reqs wo SetUp',reqsWoSetUpValue)
#getting requirements value with previously calculated Reqs wo SetUp column
for i in range(len(dfSorted)):
requirementsValue[i] = #calc
dfSorted.insert(50,'Requirements',requirementsValue)
#Calculating Other Reqs value with previously calculated Requirements column.
for i in range(len(dfSorted)):
otherReqsValue[i] = #calc
dfSorted.insert(51,'Other Reqs',otherReqsValue)
Anyone have a clue, why I cannot do this in 1 for loop anymore by 1st adding all columns by the function, like:
NewColdfConverter('Reqs wo setup','Requirements','Other reqs')
#then in 1 for loop:
for i in range(len(dfsorted)):
dfSorted['Reqs wo setup'] = #calculationsteps
dfSorted['Requirements'] = #calculationsteps
dfSorted['Other reqs'] = #calculationsteps
Thank you

General comment: How to identify bottlenecks
To get started, you should try to identify which parts of the code are slow.
Method 1: time code sections using the time package
Wrap blocks of code in statements like this:
import time
t = time.time()
# do something
print("time elapsed: {:.1f} seconds".format(time.time() - t))
Method 2: use a profiler
E.g. Spyder has a built-in profiler. This allows you to check which operations are most time consuming.
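If no IDE profiler is at hand, the standard library's cProfile module does the same job. A minimal sketch, where `slow_sum` is just a placeholder for the code being measured:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Deliberately naive loop used as the code under test."""
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 5 entries
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

The report lists each function with its call count and time, so the slowest calls appear at the top.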
Vectorize your operations
Your code will be orders of magnitude faster if you vectorize your operations. It looks like your loops are all avoidable.
For example, rather than calling pd.to_datetime on every row separately, you should call it on the entire column at once
# slow (don't do this):
for i in range(len(dfConverter)):
    dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])
# fast (do this instead):
dfConverter['Day'] = pd.to_datetime(dfConverter['Entry Date'])
If you want to perform an operation on a subset of rows, you can also do this in a vectorized operation by using loc:
mask = dfConverter['Time'] <= time(hour=5,minute=0,second=0)
dfConverter.loc[mask,'Day'] = pd.to_datetime(dfConverter.loc[mask,'Entry Date']) - timedelta(days=1)
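A self-contained sketch of the whole "Day" calculation in vectorized form, using made-up rows. One detail to watch: the timing snippet earlier uses `import time`, which would clash with `from datetime import time`, so the sketch imports the class under the alias `clock_time` (an arbitrary name).

```python
from datetime import time as clock_time  # alias avoids clashing with "import time"

import pandas as pd

# Hypothetical sample rows
df = pd.DataFrame({
    "Entry Date": ["2021-03-02", "2021-03-02", "2021-03-03"],
    "Time": [clock_time(4, 30), clock_time(5, 0, 1), clock_time(5, 0)],
})

entry = pd.to_datetime(df["Entry Date"])          # whole column at once
early = df["Time"] <= clock_time(5, 0)            # boolean mask
# Keep the entry date where the row is not early, otherwise use the day before
df["Day"] = entry.where(~early, entry - pd.Timedelta(days=1))
print(df)
```

Because every row gets its value in a single assignment, no separate "else" branch or pre-created empty column is needed.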

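Not sure this would improve performance, but you could calculate the dependent columns at the same time, row by row, with DataFrame.iterrows(). Assigning through .loc avoids the chained-indexing form df[col][index], which may write to a temporary copy and silently leave the dataframe unchanged:
for index, data in dfSorted.iterrows():
    dfSorted.loc[index, 'Reqs wo setup'] = ...  # calculation steps
    dfSorted.loc[index, 'Requirements'] = ...  # calculation steps
    dfSorted.loc[index, 'Other reqs'] = ...  # calculation steps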

How to create one large dataframe of specific excel information using python pandas from a list of excel paths

Perhaps an easy fix.
I am looking to extract specific information from many of the same style of excel workbooks within a directory and concatenate the specific information all in into one workbook (while changing the format). I have completed every part of this task except for successfully creating one big dataframe of n columns from the different workbooks(proportional to the number of xlsx files read). Each of the read workbooks has only one sheet ['Sheet1']. Does this sound like I am taking the right approach? I am currently using a for loop to gather this data.
Upon much research online (Github, youtube, stackoverflow), others say to make one big dataframe, then concatenate. I have tried to use a for loop to create this dataframe; however, I have not seen users "piece together" bits of data to form a dataframe the way I have. I don't believe this should hinder the operation. I realize I am not appending or concatenating, just not sure where to go with it.
for i in filepaths: #filepaths is a list of n filepaths
    df = pd.read_excel(i) #read the excel sheets
    info = otherslices #condensed form of added slices from df
    Final = pd.DataFrame(info) #expected big dataframe
The expected results should be columns directly next to each other (one from each excel sheet respectively)
Excel1 Excel2 -> Excel(n)
info1a info1b
info2a info2b
info3a info3b
... ...
What I currently get when using "print(Final)" in loop is
Excel1
info1a
info2a
info3a
...
Excel2
info1b
info2b
info3b
...
|
Excel(n)
However, the dataframe I get from this loop (when I type "Final") is only
the very last excel workbook's data

I would create a list of data frames, append to it in each loop iteration, and after the loop concatenate the list into a single data frame. Something like this:
Final=[]
for i in filepaths: #filepaths is a list of n filepaths
    df = pd.read_excel(i) #read the excel sheets
    info = otherslices #condensed form of added slices from df
    Final.append(info) #collect one piece per workbook
Final=pd.concat(Final)
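Since the question asks for the columns to sit next to each other, it may help to know that pd.concat stacks the pieces vertically by default and places them side by side with axis=1. A small sketch with stand-in data (the dict `slices` replaces the per-workbook reading step and is purely illustrative):

```python
import pandas as pd

# Stand-ins for the per-workbook slices; in the question these come from pd.read_excel(i)
slices = {
    "Excel1": pd.Series(["info1a", "info2a", "info3a"]),
    "Excel2": pd.Series(["info1b", "info2b", "info3b"]),
}

pieces = []
for name, info in slices.items():
    pieces.append(info.rename(name))     # one named column per workbook

final = pd.concat(pieces, axis=1)        # axis=1 places the pieces side by side
print(final)
```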

I discovered my own solution to this problem.
Final = pd.DataFrame(index=range(95)) #95 is the number of rows I have for each column
n=0
for i in filepaths: #filepaths is a list of n filepaths
    df = pd.read_excel(i) #read the excel sheets
    info = otherslices #condensed form of added slices from df
    Final[n]=pd.DataFrame(info)
    n+=1
# Final is now the big dataframe of n columns; appending Final to itself would only duplicate its rows
Final

Searching for specific text in csv(excel format) file

CSV Sample
So I have a csv file (sample in link above), with variable names in row 7 and values in row 8. The variables all have units after them, and the values are just numbers, like this:
Velocity (ft/s)   Volumetric (Mgal/d)   Mass Flow (klb/d)   Sound Speed (ft/s)
.-0l.121          1.232                 1.4533434           1.233423
There are a lot more variables, but basically I need some way to search the csv file for specific unit groups, and then append the values associated with each one to a list. For example, search for the text "(ft/s)", and then make a dictionary with Velocity and Sound Speed as keys, and their associated values. I am unable to do this because the csv is formatted like an Excel spreadsheet, and each cell contains the whole variable name with its unit.
In the end I will have a dictionary for each unit group. I need to do it this way because the unit groups change with each generated csv file (ft/s becomes m/s). I also can't use an Excel reader, because it doesn't work in IronPython.

You can use csv module to read the appropriate lines into lists.
defaultdict is a good choice for data aggregation, while variable
names and units can be easily separated by splitting on '('.
import csv
import collections
with open(csv_file_name) as fp:
    reader = csv.reader(fp)
    for k in range(6): # skip 6 lines
        next(reader)
    varnames = next(reader) # 7th line
    values = next(reader) # 8th line
groups = collections.defaultdict(dict)
for i, (col, value) in enumerate(zip(varnames, values)):
    if i < 2:
        continue
    name, units = map(str.strip, col.strip(')').split('(', 1))
    groups[units][name] = float(value)
Edit: added the code to skip first two columns
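To see what the resulting structure looks like, here is the same approach run against an in-memory sample. The file content is invented for illustration (six preamble lines, two leading non-numeric columns named Date and Time, and a made-up velocity value), so only the shape of the result carries over to a real file.

```python
import collections
import csv
import io

# Hypothetical file content: six preamble lines, then names (line 7) and values (line 8)
sample = "\n".join(
    ["preamble"] * 6
    + ["Date,Time,Velocity (ft/s),Volumetric (Mgal/d),Mass Flow (klb/d),Sound Speed (ft/s)",
       "2015-01-01,12:00,0.121,1.232,1.4533434,1.233423"]
)

reader = csv.reader(io.StringIO(sample))
for _ in range(6):                       # skip the six preamble lines
    next(reader)
varnames = next(reader)
values = next(reader)

groups = collections.defaultdict(dict)
for col, value in list(zip(varnames, values))[2:]:    # skip the two leading columns
    name, units = map(str.strip, col.strip(")").split("(", 1))
    groups[units][name] = float(value)

speeds = groups["ft/s"]                  # one dictionary per unit group
print(dict(groups))
```

Because the unit string itself is the dictionary key, a file that reports m/s instead of ft/s simply produces a group under that key.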

I'll help with the part I think you're stuck on, which is trying to extract the units from the category. Given your data, your best bet may be to use regex, the following should work:
import re
f = open('data.csv')
# I assume the first row has the header you listed in your question
header = f.readline().split(',') #since you said its a csv
for item in header:
    print(re.search(r'\(.+\)', item).group())
    print(re.sub(r'\(.+\)', '', item))
That should print the following for you:
(ft/s)
Velocity
(Mgal/d)
Volumetric
(klb/d)
Mass Flow
(ft/s)
Sound Speed
You can modify the above to store these in a list, then iterate through them to find duplicates and merge the appropriate strings to dictionaries or whatnot.

Is there a more efficient tool than iterrows() in this situation?

Okay so, here's the thing. I'm working with a lot of pandas data frames and arrays. Often times, I need to pair up a value from one frame with a value from another, ideally combining the information into one frame in the end.
Say I'm looking at image files. There's a set of information specific to each file. Sometimes there's certain types of image files that share the same kind of information. Simple example:
FILEPATH, TYPE, COLOR, VALUE_I
/img2.jpg, A, 'green', 0.6294
/img45.jpg, B, 'green', 0.1846
/img87.jpg, A, 'blue', 34.78
Often, this information is indexed out by type/color/value etc and fed into some other function that gives me another important output, let's say VALUE_II. But I can't concatenate it directly onto the original dataframe because the indices won't match, either because of the nature of the output or because I only fed part of the frame.
Or another situation: I learn that images of a certain TYPE have a specific value attached to them, so I make a dictionary of types and their value. Again, this column doesn't exist, so in this case I would use iterrows() to march down the frame, see if the type matches a specific key, and if it does append it to an array. Then in the end, I convert that array to a dataframe and concatenate it onto the original.
Here's the worst offender. With up to 1800 rows in each frame, it takes FOREVER:
newColumn = []
for index, row in originalDataframe.iterrows():
    for indx, rw in otherDataframe.iterrows():
        if row['filename'] in rw['filepath']:
            newColumn.append([rw['VALUE_I'],rw['VALUE_II'], rw['VALUE_III']])
newColumn = pd.DataFrame(newColumn, columns = ['VALUE_I', 'VALUE_II', 'VALUE_III'])
originalDataframe = pd.concat([originalDataframe, newColumn], axis=1)
Solutions would be appreciated!

If you can split the filename out of otherDataframe["filepath"], you can compare it for equality with originalDataframe's filename instead of checking with in. After that you can simplify the calculation with pandas.DataFrame.join, which for each filename in originalDataframe will find the same filename in otherDataframe and add all the other columns from it.
import os
otherDataframe["filename"] = otherDataframe["filepath"].map(os.path.basename)
joinedDataframe = originalDataframe.join(otherDataframe.set_index("filename"), on="filename")
If there are columns with the same name in originalDataframe and otherDataframe you should set lsuffix or rsuffix.
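A small sketch of the join with an overlapping column, using invented frames. Here both frames have a TYPE column, so rsuffix (the value "_other" is arbitrary) renames the one coming from the right-hand frame.

```python
import os

import pandas as pd

# Hypothetical sample frames; both contain a TYPE column
original = pd.DataFrame({"filename": ["img2.jpg", "img45.jpg", "img87.jpg"],
                         "TYPE": ["A", "B", "A"]})
other = pd.DataFrame({"filepath": ["/data/img87.jpg", "/data/img2.jpg"],
                      "TYPE": ["A", "A"],
                      "VALUE_II": [1.5, 2.5]})

other["filename"] = other["filepath"].map(os.path.basename)
joined = original.join(other.set_index("filename"), on="filename", rsuffix="_other")
print(joined)
```

join keeps every row of the left frame by default, so files without a partner (img45.jpg here) get NaN in the added columns.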

Focusing on the second half of your question, as that's what you provided code for: your program is checking every row of df1 against every row in df2, yielding potentially 1800 * 1800, or 3,240,000, possible combinations. If there is only one possible match for each row, then adding 'break' will help some, but is not ideal.
            newColumn.append([rw['VALUE_I'],rw['VALUE_II'], rw['VALUE_III']])
            break
If the structure of your data allows it, I would try something like:
ref = {}
for i, path in enumerate(otherDataframe['filepath']):
    *_, file = path.split('\\')
    ref[file] = i
originalDataframe['VALUE_I'] = None
originalDataframe['VALUE_II'] = None
originalDataframe['VALUE_III'] = None
for i, file in enumerate(originalDataframe['filename']):
    try:
        j = ref[file]
        originalDataframe.loc[i, 'VALUE_I'] = otherDataframe.loc[j, 'VALUE_I']
        originalDataframe.loc[i, 'VALUE_II'] = otherDataframe.loc[j, 'VALUE_II']
        originalDataframe.loc[i, 'VALUE_III'] = otherDataframe.loc[j, 'VALUE_III']
    except KeyError:
        # no file of that name in otherDataframe; leave the values as None
        pass
Here we iterate through the paths in otherDataframe (I assume they follow a pattern of C:\asdf\asdf\file), split the path on \ to pull out the file, and then construct a dictionary of files to row numbers. Next we initialize the 3 columns in originalDataframe that you want to write to.
Lastly we iterate through the files in originalDataframe, check whether that file exists in our dictionary of files in otherDataframe (done inside a try to catch missing keys), and pull the row number out of the dictionary, which we then use to write the values from other to original. This assumes both frames use the default 0..n-1 index, since .loc looks rows up by label.
Side note: you describe your paths as being in the vein of 'C:/asd/fdg/img2.jpg', in which case you should use:
*_, file = path.split('/')
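If the separator style is not known in advance, one option is the standard library's ntpath module, whose basename treats both \ and / as separators. A minimal sketch with made-up paths (the helper name `file_name` is arbitrary):

```python
import ntpath

def file_name(path):
    """Return the last path component for either separator style."""
    return ntpath.basename(path)

# Hypothetical sample paths
paths = ["C:\\asdf\\asdf\\img2.jpg", "C:/asd/fdg/img45.jpg", "img87.jpg"]
ref = {file_name(path): row for row, path in enumerate(paths)}
print(ref)
```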

Read/write/append to CSV using python

I am attempting to use the CSV module of python to modify a CSV file. The file represents a stock and lists (as columns) the date, open price, high price, low price, close price, and volume for the day. What I would like to do is create multiple new columns by performing algebra on the existing data. For instance, I would like to create a column for the percentage from the open price to the high price for any given day and another for the percentage change from yesterday's close to today's close (no end in sight here, as of now thinking of about 10 columns to add).
Is there a compact way to do this? As of now, I am opening the original file and reading into a list(s) the values of interest. Then writing onto some temp file the modified values using that list(s). Then writing onto some new file using a for loop and adding the rows from each spreadsheet. Then writing the entire contents of that new file onto the original csv, as I would like to maintain the name of the csv (ticker.csv).
Hopefully I have made my issue clear. If you would like any clarification or further details, please do not hesitate.
edit: I have included a snippet of the code for one function below. The function seeks to create a new column that has the percent change from yesterday's close to today's close.
def add_col_pchange(ticker):
    """
    Add column with percent change in closing price.
    """
    original = open('file1', 'rb')
    reader = csv.reader(original)
    reader.next()
    close = list()
    for row in reader:
        # build list of close values; entries from left to right are reverse chronological
        # index 4 corresponds to "Close" column
        close.append(float(row[4]))
    original.close()
    new = open(file2, 'wb')
    writer = csv.writer(new)
    writer.writerow(["Percent Change"])
    pchange = list()
    for i in range(len(close) - 1):
        x = (close[i]-close[i+1])/close[i+1]
        pchange.append(x)
    new.close()
    # open original and new csv's as read, write out to some new file.
    # later, copy that entire file to original csv in order to maintain
    # original csv's name and include new data

Hope this helps
import csv  # written for Python 3

def add_col_pchange(ticker):
    """
    Add column with percent change in closing price.
    """
    # always use with to transparently manage opening/closing files
    with open('ticker.csv', newline='') as original:
        spam = csv.reader(original)
        headers = next(spam) # get header row
        # get all of the data at one time, then transpose it using zip
        data = list(zip(*spam))
    # build list of close values; entries from left to right are reverse chronological
    # index 4 corresponds to "Close" column
    close = data[4] # the 5th column has close values
    # use map to process whole column at one time
    f_pchange = lambda close0, close1: 100 * (float(close0) - float(close1)) / float(close1)
    Ndays = len(close) # length of table
    pchange = map(f_pchange, close[:-1], close[1:]) # percent changes, newest first
    pchange = tuple(pchange) + (None,) # the oldest (last) day has no previous close
    headers.append("Percent Change") # add column name to headers
    data.append(pchange)
    data = zip(*data) # transpose back to rows
    with open('ticker.csv', 'w', newline='') as new:
        spam = csv.writer(new)
        spam.writerow(headers) # write headers
        for row in data:
            spam.writerow(row)
# open original and new csv's as read, write out to some new file.
# later, copy that entire file to original csv in order to maintain
# original csv's name and include new data
You should also check out numpy; you could use loadtxt() and vector math, but @lightalchemist is right: pandas was designed just for this.
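For comparison, a rough pandas sketch of the same task. The column names and prices are invented, and the rows are assumed to be newest first, as in the question.

```python
import io

import pandas as pd

# Hypothetical ticker data, newest row first
csv_text = """Date,Open,High,Low,Close,Volume
2013-01-04,10.0,11.0,9.5,11.0,1000
2013-01-03,9.0,10.5,9.0,10.0,1200
2013-01-02,8.0,9.0,7.5,8.0,900
"""

df = pd.read_csv(io.StringIO(csv_text))
df["Open to High %"] = 100 * (df["High"] - df["Open"]) / df["Open"]
# shift(-1) pulls the next row (the previous trading day) up by one position
previous_close = df["Close"].shift(-1)
df["Percent Change"] = 100 * (df["Close"] - previous_close) / previous_close
print(df)
# df.to_csv("ticker.csv", index=False) would overwrite the original file in one step
```

Each new column is one line, so adding ten of them does not require temporary files; the oldest row has no previous close and stays NaN.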
