I have a DataFrame loaded with orders. Some of them contain negative quantities because they are actually cancellations of prior orders.
The problem: there is no unique key that would let me trace which order corresponds to which cancellation.
So I've built the following code ('cancelations' is a subset of the original data containing only the rows that correspond to... well... cancelations):
for i, item in cancelations.iterrows():
    #find a row similar to the cancelation we are currently studying:
    #(iterrows() yields (index, row) pairs, so 'item' is the row itself)
    mask1 = (copy['CustomerID'] == item['CustomerID'])
    mask2 = (copy['Quantity'] == item['Quantity'])
    mask3 = (copy['Description'] == item['Description'])
    subset = copy[mask1 & mask2 & mask3]
    if subset.shape[0] > 0: #if we find one or several corresponding orders:
        print('possible corresponding orders:', subset.index.tolist())
        copy = copy.drop(subset.index.tolist()[0]) #remove only the first of them from the copy of the data
So, this works, but:
first, it takes forever to run; and second, I read somewhere that whenever you find yourself writing complex code to manipulate dataframes, there's already a method for it.
So perhaps one of you knows something that could help me?
Thank you for your time!
Edit: note that sometimes several orders could correspond to the cancelation at hand. This is why I didn't use drop_duplicates with only some columns specified: it eliminates all duplicates (or all but one), whereas I need to drop only one of them.
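One vectorized way to drop exactly one matching order per cancelation (a sketch, not from the original thread; it assumes 'copy' holds the candidate orders under a plain integer index) is to number repeated key combinations with groupby().cumcount() and merge on that counter:

import pandas as pd

keys = ['CustomerID', 'Quantity', 'Description']

# Rank repeated (CustomerID, Quantity, Description) combinations on both sides,
# so the i-th cancelation of a combination pairs with the i-th matching order
# and no order is consumed twice. (If cancelations store negative quantities,
# negate them first so they match the orders.)
orders = copy.reset_index().rename(columns={'index': 'orig_idx'})
orders['rank'] = orders.groupby(keys).cumcount()
cancs = cancelations.copy()
cancs['rank'] = cancs.groupby(keys).cumcount()

matched = cancs.merge(orders[keys + ['rank', 'orig_idx']], on=keys + ['rank'])
copy = copy.drop(matched['orig_idx'])  # exactly one order removed per cancelation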
I know that one can compare a whole column of a dataframe and make a list of all rows that contain a certain value with:
values = parsedData[parsedData['column'] == valueToCompare]
But is it possible to build such a list by comparing two columns at once, like:
values = parsedData[parsedData['column01'] == valueToCompare01 and parsedData['column02'] == valueToCompare02]
Thank you!
It is completely possible, but the Python keyword and will not work for masking a dataframe; use the element-wise & operator instead. Note that you must wrap each condition in parentheses, both for clarity and because & binds more tightly than ==:
values = parsedData[(parsedData['column01'] == valueToCompare01) & (parsedData['column02'] == valueToCompare02)]
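A minimal runnable illustration (made-up data) showing why & works where and does not:

import pandas as pd

parsedData = pd.DataFrame({'column01': [1, 1, 2], 'column02': ['a', 'b', 'a']})

# Element-wise & combines the two Boolean Series; only the first row matches.
values = parsedData[(parsedData['column01'] == 1) & (parsedData['column02'] == 'a')]
print(values)

# Using the keyword 'and' instead raises:
# ValueError: The truth value of a Series is ambiguous.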
I have been trying to understand a piece of code that uses try and except to filter data based on a specific/required date:
required_date = '2021-02-11'
index_for_date = (data_dict['date'] == required_date)
data_filtered_by_date = {}
for key in data_dict.keys():
    try:
        data_filtered_by_date[key] = np.float_(data_dict[key][index_for_date])
    except:
        data_filtered_by_date[key] = data_dict[key][index_for_date]
I do not understand why the try and except would be used, or how the code as a whole functions. I have researched specifics such as np.float_ and why two sets of square brackets (e.g. [key][index_for_date]) appear next to each other. Hopefully I can get further clarification on this code, as I am very new to Python and have done various forms of research without finding a clear answer.
Let's start with an explanation of your code. data_dict is a dictionary of data columns, one of which is the 'date' column on which you want to filter. index_for_date = (data_dict['date'] == required_date) constructs a Boolean index over that column (an array which is all False except where it matches the desired date).
You then loop over the columns of data_dict. For each column, you select the entries matching the date using the Boolean index: data_dict[key][index_for_date]. That is why there are two sets of square brackets: the first is dict indexing, the second is Boolean array indexing.
Then, in the try clause, you attempt to cast the selected values to floats using np.float_. If this fails (i.e. it throws an exception), you fall back to the 'raw', un-cast values.
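A small standalone illustration of that fallback (the column names are made up):

import numpy as np

data_dict = {
    'date':  np.array(['2021-02-10', '2021-02-11', '2021-02-11']),
    'price': np.array(['1.5', '2.0', '3.25']),     # numeric strings: the cast succeeds
    'city':  np.array(['Oslo', 'Bergen', 'Oslo']), # text: the cast fails, raw values kept
}
index_for_date = (data_dict['date'] == '2021-02-11')  # array([False, True, True])

data_filtered_by_date = {}
for key in data_dict:
    try:
        data_filtered_by_date[key] = np.float_(data_dict[key][index_for_date])
    except ValueError:
        data_filtered_by_date[key] = data_dict[key][index_for_date]

# 'price' ends up as array([2., 3.25]); 'date' and 'city' stay as strings.
# (np.float_ is an alias of np.float64 and was removed in NumPy 2.0; use np.float64 there.)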
I have a code that works with an Excel file (an SAP download) quite extensively (data transformation and calculation steps).
I need to loop through all the lines (a couple thousand rows) a few times. I had previously written code that added the DataFrame columns separately, so I could do everything in one for loop, which was of course quite quick. However, I had to change the data source, which meant a change in the raw data structure.
The new raw data structure has the first 3 rows empty, then a title row with the column names, then 2 more empty rows, and the 1st column is also empty. I decided to wipe these and assign proper column names as headers (steps below); however, since then, separately adding the column names and later calculating everything in one for statement does not fill data into any of these specific columns.
How could I optimize this code?
I have deleted some calculation steps, since they are quite long and would make the code even less readable.
#This function adds new columns to the dataframe
def NewColdfConverter(*args):
    for i in args:
        dfConverter[i] = '' #previously used dfConverter[i] = NaN

#This function creates a dataframe from an Excel file
def DataFrameCreator(path, sheetname):
    excelFile = pd.ExcelFile(path)
    global readExcel
    readExcel = pd.read_excel(excelFile, sheet_name=sheetname)

#calling my function to create the dataframe
DataFrameCreator(filePath, sheetName)
dfConverter = pd.DataFrame(readExcel)
#dropping NA values from Orders column (right now called Unnamed)
dfConverter.dropna(subset=['Unnamed: 1'], inplace=True)
#dropping rows and deleting other unnecessary columns
dfConverter.drop(dfConverter.head(1).index, inplace=True)
dfConverter.drop(dfConverter.columns[[0,11,12,13,17,22,23,48]], axis=1, inplace=True)
#renaming columns from 'Unnamed: 1' etc. to proper names
dfConverter = dfConverter.rename(columns={'Unnamed: 1': 'propername1', 'Unnamed: 2': 'propername2'})  #etc.
#calling new column function -> this Day column appears in the 1st for loop
NewColdfConverter("Day")
#example for loop that worked prior, but is not working since the new dataset and new header/column steps were added:
for i in range(len(dfConverter)):
    #Day column -> Entry Date minus 1 day if the time is earlier than 5:00:00
    if dfConverter['Time'][i] <= time(hour=5, minute=0, second=0):
        dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i]) - timedelta(days=1)
    else:
        dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])
The problem is that many columns build on one another, so I cannot compute them in one for loop. For instance, in the example below I need to calculate reqsWoSetUpValue so I can calculate requirementsValue, so I can in turn calculate otherReqsValue; but I'm not able to do this within one for loop by assigning the values to dataframecolumn[i], because the values just end up missing, as if nothing happened.
(dfSorted is the same as dfConverter, but a sorted version of it)
#example code of getting reqsWoSetUpValue
for i in range(len(dfSorted)):
    reqsWoSetUpValue[i] = ...  #calculationsteps...
#inserting column with value
dfSorted.insert(49, 'Reqs wo SetUp', reqsWoSetUpValue)

#getting requirements value with previously calculated Reqs wo SetUp column
for i in range(len(dfSorted)):
    requirementsValue[i] = ...  #calc
dfSorted.insert(50, 'Requirements', requirementsValue)

#Calculating Other Reqs value with previously calculated Requirements column
for i in range(len(dfSorted)):
    otherReqsValue[i] = ...  #calc
dfSorted.insert(51, 'Other Reqs', otherReqsValue)
Does anyone have a clue why I can no longer do this in 1 for loop, by first adding all the columns with the function, like:

NewColdfConverter('Reqs wo setup', 'Requirements', 'Other reqs')

#then in 1 for loop:
for i in range(len(dfSorted)):
    dfSorted['Reqs wo setup'][i] = ...  #calculationsteps
    dfSorted['Requirements'][i] = ...   #calculationsteps
    dfSorted['Other reqs'][i] = ...     #calculationsteps
Thank you
General comment: How to identify bottlenecks
To get started, you should try to identify which parts of the code are slow.
Method 1: time code sections using the time module
Wrap blocks of code in statements like this:
import time
t = time.time()
# do something
print("time elapsed: {:.1f} seconds".format(time.time() - t))
Method 2: use a profiler
E.g. Spyder has a built-in profiler. This allows you to check which operations are most time consuming.
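Outside Spyder, the standard library's cProfile gives the same kind of breakdown; for example, sorted by cumulative time:

import cProfile

# Profile a hypothetical entry point of your script, sorted by cumulative time.
cProfile.run('main()', sort='cumtime')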
Vectorize your operations
Your code will be orders of magnitude faster if you vectorize your operations. It looks like your loops are all avoidable.
For example, rather than calling pd.to_datetime on every row separately, you should call it on the entire column at once:
# slow (don't do this):
for i in range(len(dfConverter)):
    dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])

# fast (do this instead):
dfConverter['Day'] = pd.to_datetime(dfConverter['Entry Date'])
If you want to perform an operation on a subset of rows, you can also do this in a vectorized operation by using loc:
mask = dfConverter['Time'] <= time(hour=5,minute=0,second=0)
dfConverter.loc[mask,'Day'] = pd.to_datetime(dfConverter.loc[mask,'Entry Date']) - timedelta(days=1)
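Both branches can also be combined into a single assignment with numpy.where (a sketch assuming the same Time and Entry Date columns as in the question):

import numpy as np
import pandas as pd
from datetime import time, timedelta

entry = pd.to_datetime(dfConverter['Entry Date'])
before_5am = dfConverter['Time'] <= time(hour=5, minute=0, second=0)
# Subtract one day where the time is at or before 5:00:00, else keep the date.
dfConverter['Day'] = np.where(before_5am, entry - timedelta(days=1), entry)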
Not sure whether this would improve performance, but you could calculate the dependent columns at the same time, row by row, with DataFrame.iterrows():
for index, data in dfSorted.iterrows():
    dfSorted['Reqs wo setup'][index] = ...  #calculationsteps
    dfSorted['Requirements'][index] = ...   #calculationsteps
    dfSorted['Other reqs'][index] = ...     #calculationsteps
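Note that chained assignment like dfSorted['Requirements'][index] = ... can fail silently on a copy (the SettingWithCopyWarning); per-cell writes are safer with .at (same sketch, with the calculations still elided):

for index, data in dfSorted.iterrows():
    dfSorted.at[index, 'Reqs wo setup'] = ...  #calculationsteps
    dfSorted.at[index, 'Requirements'] = ...   #calculationsteps
    dfSorted.at[index, 'Other reqs'] = ...     #calculationsteps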
Okay so, here's the thing. I'm working with a lot of pandas data frames and arrays. Oftentimes, I need to pair up a value from one frame with a value from another, ideally combining the information into one frame in the end.
Say I'm looking at image files. There's a set of information specific to each file. Sometimes there's certain types of image files that share the same kind of information. Simple example:
FILEPATH,   TYPE, COLOR,   VALUE_I
/img2.jpg,  A,    'green', 0.6294
/img45.jpg, B,    'green', 0.1846
/img87.jpg, A,    'blue',  34.78
Often, this information is indexed out by type/color/value etc. and fed into some other function that gives me another important output, let's say VALUE_II. But I can't concatenate it directly onto the original dataframe, because the indices won't match, either because of the nature of the output or because I only fed in part of the frame.
Or another situation: I learn that images of a certain TYPE have a specific value attached to them, so I make a dictionary of types and their values. Again, this column doesn't exist, so in this case I would use iterrows() to march down the frame, see if the type matches a specific key, and if it does, append it to an array. Then in the end, I convert that array to a dataframe and concatenate it onto the original.
Here's the worst offender. With up to 1800 rows in each frame, it takes FOREVER:
newColumn = []
for index, row in originalDataframe.iterrows():
    for indx, rw in otherDataframe.iterrows():
        if row['filename'] in rw['filepath']:
            newColumn.append([rw['VALUE_I'], rw['VALUE_II'], rw['VALUE_III']])

newColumn = pd.DataFrame(newColumn, columns=['VALUE_I', 'VALUE_II', 'VALUE_III'])
originalDataframe = pd.concat([originalDataframe, newColumn], axis=1)
Solutions would be appreciated!
If you can split the filename out of otherDataframe["filepath"], you can then just compare it for equality with originalDataframe's filename, with no need for the in check. After that you can simplify the computation with pandas.DataFrame.join, which for each filename in originalDataframe will find the same filename in otherDataframe and add all of its other columns.
import os
otherDataframe["filename"] = otherDataframe["filepath"].map(os.path.basename)
joinedDataframe = originalDataframe.join(otherDataframe.set_index("filename"), on="filename")
If there are columns with the same name in originalDataframe and otherDataframe you should set lsuffix or rsuffix.
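A minimal illustration with made-up frames:

import os
import pandas as pd

originalDataframe = pd.DataFrame({'filename': ['img2.jpg', 'img87.jpg']})
otherDataframe = pd.DataFrame({
    'filepath': ['C:/imgs/img2.jpg', 'C:/imgs/img45.jpg', 'C:/imgs/img87.jpg'],
    'VALUE_I': [0.6294, 0.1846, 34.78],
})

# Derive the filename column, then join on it.
otherDataframe['filename'] = otherDataframe['filepath'].map(os.path.basename)
joined = originalDataframe.join(otherDataframe.set_index('filename'), on='filename')
print(joined)  # each row of originalDataframe now carries its VALUE_I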
Focusing on the second half of your question, as that's what you provided code for: your program checks every row of df1 against every row of df2, yielding potentially 1800 × 1800 = 3,240,000 combinations. If there is only one possible match for each row, then adding a break will help some, but it is not ideal:

            newColumn.append([rw['VALUE_I'], rw['VALUE_II'], rw['VALUE_III']])
            break
If the structure of your data allows it, I would try something like:

ref = {}
for i, path in enumerate(otherDataframe['filepath']):
    *_, file = path.split('\\')
    ref[file] = i

originalDataframe['VALUE_I'] = None
originalDataframe['VALUE_II'] = None
originalDataframe['VALUE_III'] = None

for i, file in enumerate(originalDataframe['filename']):
    try:
        j = ref[file]
        originalDataframe.loc[i, 'VALUE_I'] = otherDataframe.loc[j, 'VALUE_I']
        originalDataframe.loc[i, 'VALUE_II'] = otherDataframe.loc[j, 'VALUE_II']
        originalDataframe.loc[i, 'VALUE_III'] = otherDataframe.loc[j, 'VALUE_III']
    except KeyError:
        pass
Here we iterate through the paths in otherDataframe (I assume they follow a pattern like C:\asdf\asdf\file), split each path on \ to pull out the file name, and then construct a dictionary mapping files to row numbers. Next we initialize the 3 columns in originalDataframe that you want to write to.
Lastly we iterate through the files in originalDataframe, check whether each file exists in our dictionary of files from otherDataframe (done inside a try to catch the KeyError when it doesn't), and pull the row number out of the dictionary, which we then use to copy the values from other to original.
Side note: you describe your paths as being in the vein of 'C:/asd/fdg/img2.jpg', in which case you should use:
*_, file = path.split('/')
df is an unsorted dataframe containing 12 million+ rows.
Each row has a GROUP ID.
The end goal is to randomly select 1 row per unique GROUP ID, populating a new column named SELECTED, where 1 means selected and 0 means the opposite.
There may be 5000+ unique GROUP IDs.
I'm seeking a better and faster solution than the following; potentially a multi-threaded one?
for sec in df['GROUP'].unique():
    sz = df.loc[df.GROUP == sec, ['SELECTED']].size
    sel = [0] * sz
    sel[random.randint(0, sz - 1)] = 1
    df.loc[df.GROUP == sec, ['SELECTED']] = sel
You could try a vectorized version, which will probably speed things up if you have many groups.

import numpy as np
import pandas as pd

# get fake data
df = pd.DataFrame(np.random.rand(10))
df['GROUP'] = df[0].astype(str).str[2]

# mark one element of each group as selected
df['selected'] = df.index.isin(   # Is the current index in the selected list?
    df.groupby('GROUP')           # Get a GroupBy object.
    .apply(pd.Series.sample)      # Select one row from each group.
    .index.levels[1]              # The index is a (group, old_id) pair; take the old_id.
).astype(int)                     # Convert the Booleans to ints.
Note that this may fail if duplicate indices are present.
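On pandas 1.1+ the same idea is a one-liner with DataFrameGroupBy.sample, which avoids the per-group Python loop entirely (a sketch with synthetic data):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'GROUP': rng.integers(0, 5000, size=1_000_000)})

# Pick one random row per GROUP, then flag those rows in a 0/1 column.
picked = df.groupby('GROUP').sample(n=1).index
df['SELECTED'] = df.index.isin(picked).astype(int)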
I do not know pandas dataframes well, but if you simply set SELECTED where it needs to be 1, and later treat the absence of the attribute as meaning not selected, you could avoid updating all elements.
You may also do something like this, collecting one random row index per group:

selected = []
for sec in df['GROUP'].unique():
    selected.append(random.choice(df.index[df['GROUP'] == sec]))

or with a list comprehension:

selected = [random.choice(df.index[df['GROUP'] == sec]) for sec in df['GROUP'].unique()]
Maybe this can speed things up, because you will not need to allocate new memory and update all elements of your dataframe.
If you really want multithreading, have a look at concurrent.futures: https://docs.python.org/3/library/concurrent.futures.html