I'm trying to use the Python pandas-dedupe library to perform a fuzzy duplicate check on my mock data:
{'Vendor': {0: 'ABC', 1: 'ABC', 2: 'TIM'},
'Doc Date': {0: '5/12/2019', 1: '5/13/2019', 2: '4/15/2019'},
'Invoice Date': {0: '5/10/2019', 1: '5/10/2019', 2: '4/10/2019'},
'Invoice Ref Num': {0: 'ABCDE56.', 1: 'ABCDE56', 2: 'RTET5SDF'},
'Invoice Amount': {0: '56', 1: '56', 2: '100'}}
but I keep getting this error:
IndexError: Cannot choose from an empty sequence
Here's the code I'm using:
import pandas as pd
import pandas_dedupe
df = pd.read_csv("duptest.csv")
df.columns
df = pandas_dedupe.dedupe_dataframe(df, ['Vendor', 'Invoice Ref Num', 'Invoice Amount'])
Any idea what I'm doing wrong? Thanks.
pandas-dedupe creates a sample of observations that you need to label. The default sample size is 30% of your dataframe. In your case, the dataframe has too few rows to start active learning. If you set sample_size=1 as follows:
df = pandas_dedupe.dedupe_dataframe(df, ['Vendor', 'Invoice Ref Num', 'Invoice Amount'], sample_size=1)
you will be able to dedupe your data :)
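(For reference: dedupe_dataframe first asks you to label a few candidate pairs in the console; if I remember the pandas-dedupe API correctly, it then returns the dataframe with extra cluster id and confidence columns that you can use to inspect the matches.)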
Dataset: top 50 novels sold each year.
Genre column: fiction, nonfiction (only two unique values).
How can I summarize the data so that I get a table of author names and the number of fiction and nonfiction books each has written, in two different columns?
Here is a minimized version of the dataset:
"""{'Name': {0: '10-Day Green Smoothie Cleanse',
1: '11/22/63: A Novel',
2: '12 Rules for Life: An Antidote to Chaos'},
'Author': {0: 'JJ Smith', 1: 'Stephen King', 2: 'Jordan B. Peterson'},
'User Rating': {0: 4.7, 1: 4.6, 2: 4.7},
'Reviews': {0: '17,350', 1: '2,052', 2: '18,979'},
'Price': {0: '$8.00', 1: '$22.00', 2: '$15.00'},
'Price_r': {0: '$8', 1: '$22', 2: '$15'},
'Year': {0: 2016, 1: 2011, 2: 2018},
'Genre': {0: 'Non Fiction', 1: 'Fiction', 2: 'Non Fiction'}}
df.groupby(['Author']).Genre.value_counts().sort_values(ascending = False)
I have tried using groupby but I am not getting separate columns for fiction and non fiction.
Going by the column names in your dict ('Author' and 'Genre'), something like this should do the work:
df.groupby(["Author", "Genre"]).count()
Or, to get "Author" and "Genre" back as columns instead of having them in the index:
df.groupby(["Author", "Genre"]).count().reset_index()
I am getting a KeyError: 197 when I run this code. However, when I debug by removing the for loop and using concrete values instead of i, it works.
import pandas as pd
d = {'Customer Type': ['physician/doctor/gp', 'private hospital', 'private hospital', 'private hospital', 'pharmacy-retail'],
'Invoice Date Month J&J': ['Oct', 'May', 'Oct', 'Nov', 'May'],
'Invoice Date Year J&J': [2015.0, 2016.0, 2016.0, 2017.0, 2018.0],
'Matching Type': ['Credit/Other', 'Credit/Other', 'Credit/Other', 'Credit/Other', 'Credit/Other']}
df_CT = pd.DataFrame(data=d)
hosp = ['hospital', 'hosp', 'hosps','hospitals', 'hsp','clinics',
'clinic','clin',"hosp'tl", 'health', 'doctor', 'pract', 'clinical','patients',
'hlthcare','nurse', 'clncal', 'cln', 'doctors', 'clin','hlth', 'nursing']
for i in range(df_CT.shape[0]):
    if any(j in df_CT.loc[i, 'Customer Type'] for j in hosp):
        df_CT.loc[i, 'CustomerType_Level0'] = 'Hospitals'
The error is as follows:
KeyError: 197
As I see it, you are trying to add a new column to an existing DataFrame; try the insert method, e.g.:
df_CT.insert(loc=1, column='CustomerType_Level0', value=float('nan'))
I'm sorry for creating a new answer, but I can't comment until I reach 50 rep. Your problem is that CustomerType_Level0 does not exist; you should add that column first (it can be nullable), and then you can easily set values with .loc.
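For what it's worth, a sketch that sidesteps the loop entirely, assuming 'Customer Type' always holds strings: a KeyError like this usually means the label (here 197) is missing from the index, which happens when the index is not a plain 0..n-1 range (for example after filtering rows), and a vectorized str.contains never looks up labels at all:
pattern = '|'.join(hosp)  # the hosp entries are plain words, so they are safe as a regex alternation
df_CT.loc[df_CT['Customer Type'].str.contains(pattern, na=False), 'CustomerType_Level0'] = 'Hospitals'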
I'm working on a df as follows:
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({'ID': {0: 'S0001', 1: 'S0002', 2: 'S0003'},
'StartDate': {0: Timestamp('2018-01-01 00:00:00'),
1: Timestamp('2019-01-01 00:00:00'),
2: Timestamp('2019-04-01 00:00:00')},
'EndDate': {0: Timestamp('2019-01-02 00:00:00'),
1: Timestamp('2020-01-02 00:00:00'),
2: Timestamp('2020-04-01 00:00:00')},
'Color': {0: 'Blue', 1: 'Green', 2: 'Red'},
'Type': {0: 'Small', 1: 'Mid', 2: 'Mid'}})
Now I want to create a df with one row per day between the Start and End dates (366 rows), carrying the Color, Type, and ID on every row.
I'm doing the following, which works well:
OutputDF = pd.concat([pd.DataFrame(data=Row['ID'],
                                   index=pd.date_range(Row['StartDate'], Row['EndDate'], freq='1D', closed='left'),
                                   columns=['ID'])
                      for index, Row in df.iterrows()])
and I get a df with the ID column and the days in the Start/End date range as the index.
I'm able to add Color/Type by doing a pd.merge on 'ID', but I think there is a direct way to add the Color and Type columns when creating the df.
I've tried data=[Row['ID'], Row['Type'], Row['Color']] and data=Row[['ID', 'Color', 'Type']], but neither works.
So how should I create my dataframe so that it has the Color and Type for every item for the whole 366 rows directly, without requiring the merge?
Sample of current output (screenshot not included): it goes on for all the days between the Start/End dates for each item.
Desired output (screenshot not included).
Thanks
Try the pd.DataFrame constructor with a dictionary for data:
pd.concat([pd.DataFrame({'ID':Row['ID'],
'Color':Row['Color'],
'Type':Row['Type']},
index = pd.date_range(Row['StartDate'],
Row['EndDate'],
freq='1D',
closed = 'left'))
for index, Row in df.iterrows()])
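If you then want the dates as a regular column rather than as the index, a small follow-up (assuming you want to name the column 'Date'):
OutputDF = OutputDF.rename_axis('Date').reset_index()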
I have Excel-1 (raw data) and Excel-2 (reference document).
The "Comments" in Excel-1 should be matched against the Excel-2 "Comments" column. If the string in the Excel-1 "Comments" column contains any of the substrings in the Excel-2 "Comments" column, the Primary Reason and Secondary Reason from Excel-2 should be populated against that row in Excel-1.
Excel-1
{'Item': {0: 'rr-1', 1: 'ss-2'}, 'Order': {0: 1, 1: 2}, 'Comments': {0: 'Good;Stock out of order,#1237-MF, Closing the stock ', 1: 'no change, bad, next week delivery,09/12/2018-MF*'}}
Excel-2
{'Comments': {0: 'Good', 1: 'Stock out of order', 2: 'Stock closed ', 3: 'No Change', 4: 'Bad stock', 5: 'Next week delivery '}, 'Primary Reason': {0: 'Quality', 1: 'Warehouse', 2: 'Logistics ', 3: 'Feedback', 4: 'Warehouse', 5: 'Logistics '}, 'Secondary Reason': {0: 'Manufacture', 1: 'Stock', 2: 'Warehouse', 3: 'Feedback', 4: 'Stock', 5: 'Warehouse'}}
Please help me build the logic.
I get the answer when there is a single match, using pd.Series.str.contains/isin, but how do I write the logic to find multiple matches and write them out in a particular structured format?
for value in df['Comments']:
    string = re.sub(r'[?|$|.|!|,|;]', r'', value)
    for index, value in df1.iterrows():
        substring = df1.Comment[index]
        if substring in string:
            df['Primary Reason'] = df1['Primary Reason'][index]
            df['Secondary Reason'] = df1['Secondary Reason'][index]
Analysis of the above code:
Basically you are comparing each row of Excel-1 against each row of Excel-2, matching the substring against the string, and fetching the Primary and Secondary Reason, right?
The problem is that you are overwriting the same output location every time a match is found, and because of this you always end up with only one result.
The issue is in the following code:
df['Primary Reason'] = df1['Primary Reason'][index]
df['Secondary Reason'] = df1['Secondary Reason'][index]
Come up with logic that appends each new match to the result instead of overwriting it, in a format like:
res1, res2 ... etc.
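A sketch of that logic, assuming (as in your snippet) that df holds Excel-1, df1 holds Excel-2, and the reference column is named 'Comments' as in your dict:
import re
import pandas as pd

def match_reasons(comment):
    # normalise the raw comment the same way the question does
    cleaned = re.sub(r'[?|$|.|!|,|;]', '', str(comment)).lower()
    # keep every reference row whose (stripped, lowercased) comment is a substring
    hits = df1[df1['Comments'].str.strip().str.lower().apply(lambda s: s in cleaned)]
    # join all matches as 'res1, res2, ...' instead of overwriting
    return pd.Series({'Primary Reason': ', '.join(hits['Primary Reason'].str.strip()),
                      'Secondary Reason': ', '.join(hits['Secondary Reason'].str.strip())})

df[['Primary Reason', 'Secondary Reason']] = df['Comments'].apply(match_reasons)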
I am trying to organise some data into groups based on the contents of certain columns. The current code I have depends on loops, and I would like to vectorise it instead to improve performance. I know this is the way forward with pandas, and whilst I can vectorise some problems, I am really struggling with this one.
What I need to do is group the data by ClientNumber and link the genuine and incomplete rows, so that within each ClientNumber every genuine row gets a different process ID, and each incomplete row gets the same process ID as the nearest genuine row whose StartDate is greater than its own. Essentially, incomplete rows should be connected to a genuine row if one is present, and once a genuine row is found it closes that grouping, so later rows are treated as separate events. Then I must set a process start date for each row equal to the lowest StartDate within that ProcessID group, and mark the last row (the one with the greatest StartDate) with a ProcessCount in a separate column.
Apologies for my lack of descriptive ability here; hopefully the code (written in Python 3.6) below will better explain my desired outcome. The code works, but as you can see it relies on nested loops, which I don't like. I have tried researching how to vectorise this, but I am struggling to get my head around the concept for this problem.
Any help you can provide in straightening out the loops in this code would be most appreciated, and it would really help me better understand how to apply this to other tasks going forward.
Data
import numpy as np
import pandas as pd
from pandas import Timestamp

df_dict = {'ClientNumber': {0: 1234, 1: 1234, 2: 1234, 3: 123, 4: 123, 5: 123, 6: 12, 7: 12, 8: 1}, 'Genuine_Incomplete': {0: 'Incomplete', 1: 'Genuine', 2: 'Genuine', 3: 'Incomplete', 4: 'Incomplete', 5: 'Genuine', 6: 'Incomplete', 7: 'Incomplete', 8: 'Genuine'}, 'StartDate': {0: Timestamp('2018-01-01 00:00:00'), 1: Timestamp('2018-01-05 00:00:00'), 2: Timestamp('2018-03-01 00:00:00'), 3: Timestamp('2018-01-01 00:00:00'), 4: Timestamp('2018-01-03 00:00:00'), 5: Timestamp('2018-01-10 00:00:00'), 6: Timestamp('2018-01-01 00:00:00'), 7: Timestamp('2018-06-02 00:00:00'), 8: Timestamp('2018-01-01 00:00:00')}}
df = pd.DataFrame(data=df_dict)
df["ID"] = df.index
df["Process_Start_Date"] = np.nan
df["ProcessCode"] = np.nan
df["ProcessCount"] = np.nan
grouped_df = df.groupby('ClientNumber')
for key, item in grouped_df:
    newdf = grouped_df.get_group(key)
    newdf.sort_values(by=["StartDate"], inplace=True)
    c = 1
    for i in newdf.iterrows():
        i = i[0]
        GI = df.loc[i, "Genuine_Incomplete"]
        proc_code = "{}_{}".format(df.loc[i, "ClientNumber"], c)
        df.loc[i, "ProcessCode"] = proc_code
        if GI == "Genuine":
            c += 1

grouped_df = df.groupby('ProcessCode')
for key, item in grouped_df:
    newdf = grouped_df.get_group(key)
    newdf.sort_values(by=["StartDate"], inplace=True)
    df.loc[newdf.ID.iat[-1], "ProcessCount"] = 1
    for i in newdf.iterrows():
        i = i[0]
        df.loc[i, "Process_Start_Date"] = df.loc[newdf.ID.iat[0], "StartDate"]
Note: you may have noticed my use of df["ID"], which is just a copy of the index. I know this is not good practice, but I couldn't work out how to set values on the original frame from within the sorted group without it. Any suggestions for doing this are also very welcome.
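For comparison, here is a rough, untested sketch of the vectorised shape this could take. It is only a sketch, and it assumes (as the loops above do) that each Genuine row closes the process it belongs to:
df = df.sort_values(['ClientNumber', 'StartDate'])
gen = df['Genuine_Incomplete'].eq('Genuine')
# process number = 1 + number of Genuine rows seen earlier within the same client
proc_num = gen.groupby(df['ClientNumber']).transform(lambda s: s.shift(fill_value=False).cumsum()) + 1
df['ProcessCode'] = df['ClientNumber'].astype(str) + '_' + proc_num.astype(str)
# every row takes the earliest StartDate of its process
df['Process_Start_Date'] = df.groupby('ProcessCode')['StartDate'].transform('min')
# flag the last (latest-dated) row of each process
df['ProcessCount'] = np.where(~df.duplicated('ProcessCode', keep='last'), 1, np.nan)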