How to vectorise looping through pandas groupby dataframes whilst applying conditionals - python

I am trying to organise some data into groups based on the contents of certain columns. The current code I have depends on loops and I would like to vectorise instead to improve performance. I know this is the way forwards with pandas, and whilst I can vectorise some problems I am really struggling with this one.
What I need to do is group the data by ClientNumber and link the genuine and incomplete rows so that, for each ClientNumber, every genuine row gets a different process ID and each incomplete row is given the same process ID as the nearest genuine row whose StartDate is greater than the incomplete row's StartDate (essentially, incomplete rows should be connected to a genuine row if one is present, and once a genuine row is found it should close that grouping and treat later rows as separate events). Then I must set a process start date for each row, equal to the lowest StartDate within that process ID group, and mark the last row (the one with the greatest StartDate) with a ProcessCount in a separate column.
Apologies for my lack of descriptive ability here; hopefully the code I have so far (written in Python 3.6) will better explain my desired outcome. The code works, but as you can see it relies on nested loops, which I don't like. I have tried researching how to vectorise this, but I am struggling to get my head around the concept for this problem.
Any help you can provide in straightening out the loops in this code would be most appreciated and would really help me understand how to apply this to other tasks going forwards.
Data
import numpy as np
import pandas as pd
from pandas import Timestamp

df_dict = {'ClientNumber': {0: 1234, 1: 1234, 2: 1234, 3: 123, 4: 123, 5: 123, 6: 12, 7: 12, 8: 1}, 'Genuine_Incomplete': {0: 'Incomplete', 1: 'Genuine', 2: 'Genuine', 3: 'Incomplete', 4: 'Incomplete', 5: 'Genuine', 6: 'Incomplete', 7: 'Incomplete', 8: 'Genuine'}, 'StartDate': {0: Timestamp('2018-01-01 00:00:00'), 1: Timestamp('2018-01-05 00:00:00'), 2: Timestamp('2018-03-01 00:00:00'), 3: Timestamp('2018-01-01 00:00:00'), 4: Timestamp('2018-01-03 00:00:00'), 5: Timestamp('2018-01-10 00:00:00'), 6: Timestamp('2018-01-01 00:00:00'), 7: Timestamp('2018-06-02 00:00:00'), 8: Timestamp('2018-01-01 00:00:00')}}
df = pd.DataFrame(data=df_dict)
df["ID"] = df.index
df["Process_Start_Date"] = np.nan
df["ProcessCode"] = np.nan
df["ProcessCount"] = np.nan
grouped_df = df.groupby('ClientNumber')
for key, item in grouped_df:
    newdf = grouped_df.get_group(key)
    newdf.sort_values(by=["StartDate"], inplace=True)
    c = 1
    for i in newdf.iterrows():
        i = i[0]
        GI = df.loc[i, "Genuine_Incomplete"]
        proc_code = "{}_{}".format(df.loc[i, "ClientNumber"], c)
        df.loc[i, "ProcessCode"] = proc_code
        if GI == "Genuine":
            c += 1
grouped_df = df.groupby('ProcessCode')
for key, item in grouped_df:
    newdf = grouped_df.get_group(key)
    newdf.sort_values(by=["StartDate"], inplace=True)
    df.loc[newdf.ID.iat[-1], "ProcessCount"] = 1
    for i in newdf.iterrows():
        i = i[0]
        df.loc[i, "Process_Start_Date"] = df.loc[newdf.ID.iat[0], "StartDate"]
Note - You may have noticed my use of df["ID"], which is just a copy of the index. I know this is not good practice, but I couldn't work out how to set values from other columns using the index. Any suggestions for doing this are also very welcome.
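For illustration, here is one possible vectorised sketch of the logic described above. It is only a sketch, assuming the df built above; the is_genuine/grp helper names are mine, and rows that tie on the greatest StartDate may be marked differently than in the loop version:
# Sort once, then count how many Genuine rows precede each row within its client.
df = df.sort_values(["ClientNumber", "StartDate"]).copy()
is_genuine = df["Genuine_Incomplete"].eq("Genuine")

# A Genuine row closes its group, so a row's group number is
# 1 + the number of Genuine rows seen before it for that client.
prev_genuine = is_genuine.groupby(df["ClientNumber"]).shift(fill_value=False)
df["grp"] = prev_genuine.astype(int).groupby(df["ClientNumber"]).cumsum() + 1
df["ProcessCode"] = df["ClientNumber"].astype(str) + "_" + df["grp"].astype(str)

# Earliest StartDate per ProcessCode, and mark the row(s) with the latest StartDate.
start = df.groupby("ProcessCode")["StartDate"]
df["Process_Start_Date"] = start.transform("min")
df["ProcessCount"] = np.where(df["StartDate"].eq(start.transform("max")), 1, np.nan)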

Related

fuzzy duplicate check using python dedupe library error

I'm trying to use the python dedupe library to perform a fuzzy duplicate check on my mock data:
{'Vendor': {0: 'ABC', 1: 'ABC', 2: 'TIM'},
'Doc Date': {0: '5/12/2019', 1: '5/13/2019', 2: '4/15/2019'},
'Invoice Date': {0: '5/10/2019', 1: '5/10/2019', 2: '4/10/2019'},
'Invoice Ref Num': {0: 'ABCDE56.', 1: 'ABCDE56', 2: 'RTET5SDF'},
'Invoice Amount': {0: '56', 1: '56', 2: '100'}}
but I keep getting this error:
IndexError: Cannot choose from an empty sequence
Here's the code that I'm using:
import pandas as pd
import pandas_dedupe

df = pd.read_csv("duptest.csv")
df.columns
df = pandas_dedupe.dedupe_dataframe(df, ['Vendor', 'Invoice Ref Num', 'Invoice Amount'])
Any idea what I'm doing wrong? Thanks.
pandas-dedupe creates a sample of observations that you need to label.
The default sample size is 30% of your dataframe.
In your case you have too few examples in your dataframe to start active learning.
If you set sample_size=1 as follows:
df = pandas_dedupe.dedupe_dataframe(df, ['Vendor', 'Invoice Ref Num', 'Invoice Amount'], sample_size=1)
you will be able to dedupe your data :)

Unable to create dataframe with pandas DateRange and multiple columns

I'm working on a df as follows:
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({'ID': {0: 'S0001', 1: 'S0002', 2: 'S0003'},
                   'StartDate': {0: Timestamp('2018-01-01 00:00:00'),
                                 1: Timestamp('2019-01-01 00:00:00'),
                                 2: Timestamp('2019-04-01 00:00:00')},
                   'EndDate': {0: Timestamp('2019-01-02 00:00:00'),
                               1: Timestamp('2020-01-02 00:00:00'),
                               2: Timestamp('2020-04-01 00:00:00')},
                   'Color': {0: 'Blue', 1: 'Green', 2: 'Red'},
                   'Type': {0: 'Small', 1: 'Mid', 2: 'Mid'}})
Now I want to create a df with 366 rows between Start and End dates and I want to add the Color, Type, ID for every row between Start and End Date.
I'm doing the following, which works well:
OutputDF = pd.concat([pd.DataFrame(data=Row['ID'],
                                   index=pd.date_range(Row['StartDate'], Row['EndDate'], freq='1D', closed='left'),
                                   columns=['ID'])
                      for index, Row in df.iterrows()])
and I get a df with two columns: the ID and the days in the Start/End date range.
I'm able to add the Color/Type by doing a pd.merge on 'ID', but I think there should be a direct way to add the Color and Type columns when creating the df.
I've tried data = [Row['ID'], Row['Type'], Row['Color']] and data = Row[['ID', 'Color', 'Type']], but neither works.
So how should I create my dataframe so that it has the Color for every item across the whole 366 rows directly, without requiring the merge?
Sample of current output (screenshot not included): it goes on for all the days between Start/End dates for each item.
Desired output (screenshot not included).
Thanks
Try the pd.DataFrame constructor with a dictionary for data:
pd.concat([pd.DataFrame({'ID': Row['ID'],
                         'Color': Row['Color'],
                         'Type': Row['Type']},
                        index=pd.date_range(Row['StartDate'],
                                            Row['EndDate'],
                                            freq='1D',
                                            closed='left'))
           for index, Row in df.iterrows()])
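If you want to avoid iterrows() entirely, here is a rough alternative sketch (my own suggestion, not part of the answer above; it needs pandas 0.25+ for explode, and note that newer pandas versions use inclusive= instead of closed=):
# Build each row's list of dates, then explode to one row per day.
df['Date'] = [list(pd.date_range(s, e, freq='1D', closed='left'))
              for s, e in zip(df['StartDate'], df['EndDate'])]
OutputDF = df.explode('Date').set_index('Date')[['ID', 'Color', 'Type']]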

Compare two dataframe columns for matching percentage

I want to compare a data frame of one column with another data frame of multiple columns and return the header of the column having maximum match percentage.
I am not able to find any match functions in pandas. First data frame, first column:
cars
----
swift
maruti
wagonor
hyundai
jeep
First data frame, second column:
bikes
-----
RE
Ninja
Bajaj
pulsar
One-column data frame:
words
---------
swift
RE
maruti
waganor
hyundai
jeep
bajaj
Desired output:
100% match header - cars
Try the isin function of pandas DataFrame. Assuming df is your first dataframe and words is a list:
In[1]: (df.isin(words).sum()/df.shape[0])*100
Out[1]:
cars 100.0
bikes 20.0
dtype: float64
You may need to lowercase strings in your df and in the words list to avoid any casing issue.
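For example, a quick sketch of that lowercasing step (my own illustration, assuming words is a plain list of strings):
# Lowercase both the dataframe values and the word list before comparing.
words_lower = [w.lower() for w in words]
(df.apply(lambda col: col.str.lower()).isin(words_lower).sum() / df.shape[0]) * 100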
You can first get the columns into lists:
dfCarsList = df['cars'].tolist()
dfWordsList = df['words'].tolist()
dfBikesList = df['Bikes'].tolist()
And then iterate over the lists for comparison:
numberCars = sum(any(m in L for m in dfCarsList) for L in dfWordsList)
numberBikes = sum(any(m in L for m in dfBikesList) for L in dfWordsList)
You can then use the higher number for your output.
Construct a Series using numpy.in1d and ndarray.mean then call the Series.idxmax and max methods:
# Setup
df1 = pd.DataFrame({'cars': {0: 'swift', 1: 'maruti', 2: 'waganor', 3: 'hyundai', 4: 'jeep'}, 'bikes': {0: 'RE', 1: 'Ninja', 2: 'Bajaj', 3: 'pulsar', 4: np.nan}})
df2 = pd.DataFrame({'words': {0: 'swift', 1: 'RE', 2: 'maruti', 3: 'waganor', 4: 'hyundai', 5: 'jeep', 6: 'bajaj'}})
match_rates = pd.Series({col: np.in1d(df1[col], df2['words']).mean() for col in df1})
print('{:.0%} match header - {}'.format(match_rates.max(), match_rates.idxmax()))
[out]
100% match header - cars
Here is a solution with a function that returns a tuple (column_name, match_percentage) for the column with the maximum match percentage. It accepts a pandas dataframe (bikes and cars in your example) and a series (words) as arguments.
def match(df, se):
    max_matches = 0
    max_col = None
    for col in df.columns:
        # Get the number of matches in a column
        n_matches = sum([1 for row in df[col] if row in se.unique()])
        if n_matches > max_matches:
            max_col = col
            max_matches = n_matches
    return max_col, max_matches / df.shape[0]
With your example, you should get the following output.
df = pd.DataFrame()
df['Cars'] = ['swift', 'maruti', 'wagonor', 'hyundai', 'jeep']
df['Bikes'] = ['RE', 'Ninja', 'Bajaj', 'pulsar', '']
se = pd.Series(['swift', 'RE', 'maruti', 'wagonor', 'hyundai', 'jeep', 'bajaj'])
In [1]: match(df, se)
Out[1]: ('Cars', 1.0)

How to calculate a value grouped by one attribute, but provided in the second column in pandas

I have a dataframe with an order ID, a client ID, a DATE_START and some metrics (not too important).
For every row, I want to get the ID of that client's previous (most recent earlier) order.
I tried this one:
data = pd.DataFrame({'ID': [133853.0, 155755.0, 149331.0, 337270.0, 775727.0,
                            200868.0, 138453.0, 738497.0, 666802.0, 697070.0,
                            128148.0, 1042225.0, 303441.0, 940515.0, 143548.0],
                     'CLIENT': [235632.0, 231562.0, 235632.0, 231562.0, 734243.0,
                                235632.0, 235632.0, 734243.0, 231562.0, 734243.0,
                                235632.0, 734243.0, 231562.0, 734243.0, 235632.0],
                     'DATE_START': ['2017-09-01 00:00:00', '2017-10-05 00:00:00',
                                    '2017-09-26 00:00:00', '2018-03-23 00:00:00',
                                    '2018-12-21 00:00:00', '2017-11-23 00:00:00',
                                    '2017-09-08 00:00:00', '2018-12-12 00:00:00',
                                    '2018-11-21 00:00:00', '2018-12-01 00:00:00',
                                    '2017-08-22 00:00:00', '2019-02-06 00:00:00',
                                    '2018-02-20 00:00:00', '2019-01-20 00:00:00',
                                    '2017-09-17 00:00:00']})
data.groupby('CLIENT').apply(lambda x:max(x['ID']))
It takes all the IDs into account and displays only three rows (one max ID per client), but for every row of the DataFrame I need to look only among that client's previous orders. Help please.
import pandas as pd
data=pd.DataFrame({
'ID': [133853.0,155755.0,149331.0,337270.0,
775727.0,200868.0,138453.0,738497.0,
666802.0,697070.0,128148.0,1042225.0,
303441.0,940515.0,143548.0],
'CLIENT':[235632.0,231562.0,235632.0,231562.0,734243.0,
235632.0,235632.0,734243.0,231562.0,734243.0,
235632.0,734243.0,231562.0,734243.0,235632.0],
'DATE_START': [('2017-09-01 00:00:00'), ('2017-10-05 00:00:00'),
('2017-09-26 00:00:00'), ('2018-03-23 00:00:00'),
('2018-12-21 00:00:00'), ('2017-11-23 00:00:00'),
('2017-09-08 00:00:00'), ('2018-12-12 00:00:00'),
('2018-11-21 00:00:00'), ('2018-12-01 00:00:00'),
('2017-08-22 00:00:00'), ('2019-02-06 00:00:00'),
('2018-02-20 00:00:00'), ('2019-01-20 00:00:00'),
('2017-09-17 00:00:00')]
})
data.groupby('CLIENT').apply(lambda df:
    df[df['DATE_START'] == df['DATE_START'].max()].iloc[0][['ID', 'DATE_START']]
)
Output:
CLIENT ID DATE_START
231562.0 666802.0 2018-11-21 00:00:00
235632.0 200868.0 2017-11-23 00:00:00
734243.0 1042225.0 2019-02-06 00:00:00
Let's break this down:
1.) Group by CLIENT. This will form an iterable of dataframes, grouped by CLIENT.
2.) Apply a function to each dataframe in the group (that's what the apply(lambda df: ...) part is for).
3.) For each dataframe, find the most recent DATE_START, then subset the dataframe to show only the ID(s) with the latest DATE_START (that's what df[df['DATE_START'] == df['DATE_START'].max()] is for).
4.) At this point, I don't know what logic you want to apply if there are multiple orders from a client on the same date. In this case, I used the first match (.iloc[0]).
5.) Then I return the ID and the DATE_START.
6.) pandas will then understand that you want the logic applied to each dataframe in the iterable to be combined row-wise, which is why the output looks the way it does.
Let me know if this is what you're looking for.
data['id_last_order']= data.sort_values('DATE_START').groupby('CLIENT')['ID'].transform(lambda x: x.shift())
Or by creating a function:
def select_last_order_id(row):
    df = data[(data['CLIENT'] == row['CLIENT']) & (data['DATE_START'] < row['DATE_START'])]
    try:
        value = df.groupby(by=['ID', 'CLIENT'], as_index=False, sort=False).agg('max')['ID'].values[0]
    except Exception:
        value = None
    return value

data['id_last_order'] = data.apply(select_last_order_id, axis=1)

Python Pandas String matching from different columns

I have an excel-1 (Raw Data) and an excel-2 (Reference Document).
In excel-1, the "Comments" column should be matched against the excel-2 "Comments" column. If the string in the excel-1 "Comments" column contains any of the substrings in the excel-2 "Comments" column, the Primary Reason and Secondary Reason from excel-2 should be populated against each row in excel-1.
Excel-1
{'Item': {0: 'rr-1', 1: 'ss-2'}, 'Order': {0: 1, 1: 2}, 'Comments': {0: 'Good;Stock out of order,#1237-MF, Closing the stock ', 1: 'no change, bad, next week delivery,09/12/2018-MF*'}}
Excel-2
{'Comments': {0: 'Good', 1: 'Stock out of order', 2: 'Stock closed ', 3: 'No Change', 4: 'Bad stock', 5: 'Next week delivery '}, 'Primary Reason': {0: 'Quality', 1: 'Warehouse', 2: 'Logistics ', 3: 'Feedback', 4: 'Warehouse', 5: 'Logistics '}, 'Secondary Reason': {0: 'Manufacture', 1: 'Stock', 2: 'Warehouse', 3: 'Feedback', 4: 'Stock', 5: 'Warehouse'}}
Please help to build the logic.
I get the answer when there is a single match using the pd.DataFrame.str.contains/isin functions, but how do I write the logic to search for multiple matches and write them out in a particular structured format?
for value in df['Comments']:
    string = re.sub(r'[?|$|.|!|,|;]', r'', value)
    for index, value in df1.iterrows():
        substring = df1.Comment[index]
        if substring in string:
            df['Primary Reason'] = df1['Primary Reason'][index]
            df['Secondary Reason'] = df1['Secondary Reason'][index]
for value in df['Comments']:
    string = re.sub(r'[?|$|.|!|,|;]', r'', value)
    for index, value in df1.iterrows():
        substring = df1.Comment[index]
        if substring in string:
            df['Primary Reason'] = df1['Primary Reason'][index]
            df['Secondary Reason'] = df1['Secondary Reason'][index]
Analysis of the above code:
Basically you are comparing row 1 of excel-1 with row 1 of excel-2, matching the substring against the string and getting the Primary and Secondary Reason, right?
Here you are overwriting the same output location on every match, which is why you always end up with only one result.
Issue is in the following code:
df['Primary Reason']= df1['Primary Reason'][index]
df['Secondary Reason']=df1['Secondary Reason'][index]
Come up with logic that appends each result on the same line, in a format like:
res1, res2 ... etc
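A rough sketch of that idea (my own illustration, assuming df is excel-1 and df1 is excel-2 with a 'Comments' column, rather than the answer's exact code):
import re
import pandas as pd

def find_reasons(comment, ref):
    # Collect every reference comment that appears as a substring,
    # instead of overwriting the output columns on each match.
    cleaned = re.sub(r'[?$.!,;]', '', str(comment)).lower()
    primary, secondary = [], []
    for _, ref_row in ref.iterrows():
        if str(ref_row['Comments']).strip().lower() in cleaned:
            primary.append(str(ref_row['Primary Reason']).strip())
            secondary.append(str(ref_row['Secondary Reason']).strip())
    return pd.Series([', '.join(primary), ', '.join(secondary)])

df[['Primary Reason', 'Secondary Reason']] = df['Comments'].apply(find_reasons, ref=df1)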
