I have excel-1 (Raw Data) and excel-2 (Reference Document).
In excel-1 the "Comments" column should be matched against the "Comments" column in excel-2. If the string in the excel-1 "Comments" column contains any of the substrings in the excel-2 "Comments" column, the Primary Reason and Secondary Reason from excel-2 should be populated against each row in excel-1.
Excel-1
{'Item': {0: 'rr-1', 1: 'ss-2'}, 'Order': {0: 1, 1: 2}, 'Comments': {0: 'Good;Stock out of order,#1237-MF, Closing the stock ', 1: 'no change, bad, next week delivery,09/12/2018-MF*'}}
Excel-2
{'Comments': {0: 'Good', 1: 'Stock out of order', 2: 'Stock closed ', 3: 'No Change', 4: 'Bad stock', 5: 'Next week delivery '}, 'Primary Reason': {0: 'Quality', 1: 'Warehouse', 2: 'Logistics ', 3: 'Feedback', 4: 'Warehouse', 5: 'Logistics '}, 'Secondary Reason': {0: 'Manufacture', 1: 'Stock', 2: 'Warehouse', 3: 'Feedback', 4: 'Stock', 5: 'Warehouse'}}
Please help me build the logic.
I get the answer when there is a single match using the pandas str.contains / isin functions, but how do I write the logic to search for multiple matches and write them out in a particular structured format?
import re

for value in df['Comments']:
    # strip punctuation from the raw comment before matching
    string = re.sub(r'[?|$|.|!|,|;]', r'', value)
    for index, _ in df1.iterrows():
        substring = df1['Comments'][index]
        if substring in string:
            # assigns to the whole column, so earlier matches get overwritten
            df['Primary Reason'] = df1['Primary Reason'][index]
            df['Secondary Reason'] = df1['Secondary Reason'][index]
Analysis of the above code:
Basically you are comparing row 1 of excel-1 against row 1 of excel-2, matching the substring against the string, and pulling out the Primary and Secondary Reason, right?
Here you are overwriting the same location, i.e. the output location, and because of this you always end up with only one result.
The issue is in the following code:
df['Primary Reason'] = df1['Primary Reason'][index]
df['Secondary Reason'] = df1['Secondary Reason'][index]
Come up with logic that accumulates the results in the same row, in the format below:
res1, res2, ... etc.
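A minimal sketch of that idea, assuming df holds excel-1 and df1 holds excel-2 with the columns shown above (the match_reasons helper is purely illustrative): collect every matching reason for a row and join the hits into one cell as "res1, res2, ...".

import re
import pandas as pd

def match_reasons(comment, ref):
    # strip punctuation and lower-case the raw comment before substring matching
    cleaned = re.sub(r'[?$.!,;]', ' ', str(comment)).lower()
    primary, secondary = [], []
    for _, row in ref.iterrows():
        if row['Comments'].strip().lower() in cleaned:
            primary.append(str(row['Primary Reason']).strip())
            secondary.append(str(row['Secondary Reason']).strip())
    # join every hit into a single cell: "res1, res2, ..."
    return pd.Series({'Primary Reason': ', '.join(primary),
                      'Secondary Reason': ', '.join(secondary)})

df = df.join(df['Comments'].apply(match_reasons, args=(df1,)))

With the sample above, the first row would get 'Quality, Warehouse' as Primary Reason and 'Manufacture, Stock' as Secondary Reason, since it matches both 'Good' and 'Stock out of order'.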
Dataset: top 50 novels sold each year.
Genre column: fiction, nonfiction (only two unique values).
How can I summarize the data so that I get a table of author names, with the number of fiction and nonfiction books they have written in two different columns?
Here is a minimized version of the dataset:
"""{'Name': {0: '10-Day Green Smoothie Cleanse',
1: '11/22/63: A Novel',
2: '12 Rules for Life: An Antidote to Chaos'},
'Author': {0: 'JJ Smith', 1: 'Stephen King', 2: 'Jordan B. Peterson'},
'User Rating': {0: 4.7, 1: 4.6, 2: 4.7},
'Reviews': {0: '17,350', 1: '2,052', 2: '18,979'},
'Price': {0: '$8.00', 1: '$22.00', 2: '$15.00'},
'Price_r': {0: '$8', 1: '$22', 2: '$15'},
'Year': {0: 2016, 1: 2011, 2: 2018},
'Genre': {0: 'Non Fiction', 1: 'Fiction', 2: 'Non Fiction'}}
df.groupby(['Author']).Genre.value_counts().sort_values(ascending = False)
I have tried using groupby but am not getting separate columns for fiction and nonfiction.
We don't have the columns' names, but from what I understand, something like this should do the job:
df.groupby(["author", "genre"]).count()
Or (to get back "author" and "genre" as columns instead of having them in the index):
df.groupby(["author", "genre"]).count().reset_index()
I want to drop columns that have no content in any of the rows, and also drop other columns that start with the same name.
In this example, Line of business > Organization should be dropped since there are only blanks in all the rows. And since this column is dropped, all other columns starting with "Line of business >" should also be dropped from the pandas DataFrame. The complete DataFrame follows the same structure of [some text] > [Organization/Department/Employees].
data = pd.DataFrame({'Process name': {0: 'Ad campaign', 1: 'Payroll', 2: ''},
'Line of business > Organization': {0: "", 1: "", 2:''},
'Line of business > Department': {0: "Social media", 1: "People", 2:''},
'Line of business > Employees': {0: "Linda, Tom", 1: "Manuel, Olaf", 2:''}})
Result:
output = pd.DataFrame({'Process name': {0: 'Ad campaign', 1: 'Payroll', 2: ''}})
I hope I understand the case correctly, but I think you could try this:
First, replace the empty "" values with NaN (this assumes import numpy as np):
data.replace('', np.nan, inplace=True)
Then, identify the empty cols like this:
empty_cols = [col for col in data.columns if data[col].isnull().all()]
Next, identify the columns to be deleted (this assumes that '>' separates the prefix text used to identify them):
delete_cols= [col for col in data.columns for empty_col in empty_cols if col.split('>')[0] == empty_col.split('>')[0]]
Finally, drop the columns you don't need and drop the null values from the remaining columns:
data = data.drop(delete_cols, axis=1).dropna()
I'm trying to use the Python dedupe library (via pandas_dedupe) to perform a fuzzy duplicate check on my mock data, but I keep getting an error. Here is the data:
{'Vendor': {0: 'ABC', 1: 'ABC', 2: 'TIM'},
'Doc Date': {0: '5/12/2019', 1: '5/13/2019', 2: '4/15/2019'},
'Invoice Date': {0: '5/10/2019', 1: '5/10/2019', 2: '4/10/2019'},
'Invoice Ref Num': {0: 'ABCDE56.', 1: 'ABCDE56', 2: 'RTET5SDF'},
'Invoice Amount': {0: '56', 1: '56', 2: '100'}}
IndexError: Cannot choose from an empty sequence
Here's the code that I'm using:
import pandas as pd
import pandas_dedupe
df = pd.read_csv("duptest.csv")
df.columns
df = pandas_dedupe.dedupe_dataframe(df,['Vendor','Invoice Ref Num','Invoice Amount'])
Any idea what I'm doing wrong? Thanks.
pandas-dedupe creates a sample of observations that you need to label.
The default sample size is equal to 30% of your dataframe.
In your case you have too few examples in your dataframe to start active learning.
If you set sample_size=1 as follows:
df = pandas_dedupe.dedupe_dataframe(df,['Vendor','Invoice Ref Num','Invoice Amount'], sample_size=1)
you will be able to dedupe your data :)
I'm working on a df as follows:
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({'ID': {0: 'S0001', 1: 'S0002', 2: 'S0003'},
                   'StartDate': {0: Timestamp('2018-01-01 00:00:00'),
                                 1: Timestamp('2019-01-01 00:00:00'),
                                 2: Timestamp('2019-04-01 00:00:00')},
                   'EndDate': {0: Timestamp('2019-01-02 00:00:00'),
                               1: Timestamp('2020-01-02 00:00:00'),
                               2: Timestamp('2020-04-01 00:00:00')},
                   'Color': {0: 'Blue', 1: 'Green', 2: 'Red'},
                   'Type': {0: 'Small', 1: 'Mid', 2: 'Mid'}})
Now I want to create a df with 366 rows between the Start and End dates, and I want to add the Color, Type and ID for every row between StartDate and EndDate.
I'm doing the following, which works well:
OutputDF = pd.concat([pd.DataFrame(data = Row['ID'], index = pd.date_range(Row['StartDate'], Row['EndDate'], freq='1D', closed = 'left'), columns = ['ID']) for index, Row in df.iterrows()])
and I get a df with the ID column and the days in the Start/End date range.
I'm able to add the Color/Type by doing a pd.merge on 'ID', but I think there is a way to add the Color and Type columns directly when creating the df.
I've tried data = [Row['ID'], Row['Type'], Row['Color']] or data = Row[['ID', 'Color', 'Type']] but neither works.
So how should I create my dataframe so that it has the Color for every item across the whole 366 rows directly, without requiring the merge?
Sample of current output (it goes on like this for all the days between the Start/End dates for each item):
Desired output:
Thanks
Try the pd.DataFrame constructor with a dictionary for data:
pd.concat([pd.DataFrame({'ID':Row['ID'],
'Color':Row['Color'],
'Type':Row['Type']},
index = pd.date_range(Row['StartDate'],
Row['EndDate'],
freq='1D',
closed = 'left'))
for index, Row in df.iterrows()])
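Note that in recent pandas releases the closed= argument of pd.date_range has been deprecated in favour of inclusive=, so on those versions the range call would become something like:

pd.date_range(Row['StartDate'], Row['EndDate'], freq='1D', inclusive='left')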
I am trying to organise some data into groups based on the contents of certain columns. The current code I have depends on loops, and I would like to vectorise instead to improve performance. I know this is the way forward with pandas, and whilst I can vectorise some problems, I am really struggling with this one.
What I need to do is group the data by ClientNumber and link the genuine and incomplete rows, so that for each ClientNumber every genuine row gets a different process ID, and each incomplete row is given the same process ID as the nearest genuine row whose StartDate is greater than the StartDate of the incomplete row. (Essentially, incomplete rows should be attached to a genuine row if one is present; once a genuine row is found it should close that grouping and future rows should be treated as separate events.) Then I must be able to set a process start date for each row, equal to the lowest StartDate within that process ID group, and mark the last row (the one with the greatest StartDate) with a ProcessCount in a separate column.
Apologies for my lack of descriptive ability here; hopefully the code (written in Python 3.6) I have so far will better explain my desired outcome. The code works, but as you can see it relies on nested loops, which I don't like. I have researched how to vectorise this but am struggling to get my head around the concept for this problem.
Any help in straightening out the loops in this code would be much appreciated and would really help me understand how to apply this to other tasks going forward.
Data
import numpy as np
import pandas as pd
from pandas import Timestamp

df_dict = {'ClientNumber': {0: 1234, 1: 1234, 2: 1234, 3: 123, 4: 123, 5: 123, 6: 12, 7: 12, 8: 1}, 'Genuine_Incomplete': {0: 'Incomplete', 1: 'Genuine', 2: 'Genuine', 3: 'Incomplete', 4: 'Incomplete', 5: 'Genuine', 6: 'Incomplete', 7: 'Incomplete', 8: 'Genuine'}, 'StartDate': {0: Timestamp('2018-01-01 00:00:00'), 1: Timestamp('2018-01-05 00:00:00'), 2: Timestamp('2018-03-01 00:00:00'), 3: Timestamp('2018-01-01 00:00:00'), 4: Timestamp('2018-01-03 00:00:00'), 5: Timestamp('2018-01-10 00:00:00'), 6: Timestamp('2018-01-01 00:00:00'), 7: Timestamp('2018-06-02 00:00:00'), 8: Timestamp('2018-01-01 00:00:00')}}
df = pd.DataFrame(data=df_dict)
df["ID"] = df.index
df["Process_Start_Date"] = np.nan
df["ProcessCode"] = np.nan
df["ProcessCount"] = np.nan
grouped_df = df.groupby('ClientNumber')
for key, item in grouped_df:
    newdf = grouped_df.get_group(key)
    newdf.sort_values(by=["StartDate"], inplace=True)
    c = 1
    for i in newdf.iterrows():
        i = i[0]
        GI = df.loc[i, "Genuine_Incomplete"]
        proc_code = "{}_{}".format(df.loc[i, "ClientNumber"], c)
        df.loc[i, "ProcessCode"] = proc_code
        if GI == "Genuine":
            c += 1
grouped_df = df.groupby('ProcessCode')
for key, item in grouped_df:
    newdf = grouped_df.get_group(key)
    newdf.sort_values(by=["StartDate"], inplace=True)
    df.loc[newdf.ID.iat[-1], "ProcessCount"] = 1
    for i in newdf.iterrows():
        i = i[0]
        df.loc[i, "Process_Start_Date"] = df.loc[newdf.ID.iat[0], "StartDate"]
Note: you may have noticed my use of df["ID"], which is just a copy of the index. I know this is not good practice, but I couldn't work out how to set values from other columns using the index. Any suggestions for doing this are also very welcome.
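For what it's worth, here is a rough vectorised sketch of the logic described above (an interpretation of the intended behaviour, not code from the post): within each ClientNumber, sorted by StartDate, the process number is one plus the count of earlier Genuine rows, so it can be derived with a grouped cumulative sum instead of nested loops.

df = df.sort_values(["ClientNumber", "StartDate"])
is_genuine = df["Genuine_Incomplete"].eq("Genuine").astype(int)

# process number = 1 + number of earlier Genuine rows within the same client
proc_num = is_genuine.groupby(df["ClientNumber"]).cumsum() - is_genuine + 1
df["ProcessCode"] = df["ClientNumber"].astype(str) + "_" + proc_num.astype(str)

# process start date = earliest StartDate within each ProcessCode
df["Process_Start_Date"] = df.groupby("ProcessCode")["StartDate"].transform("min")

# flag the row with the greatest StartDate in each ProcessCode
df["ProcessCount"] = np.nan
df.loc[df.groupby("ProcessCode")["StartDate"].idxmax(), "ProcessCount"] = 1

On the sample data this produces the same ProcessCode, Process_Start_Date and ProcessCount as the looped version, although ties in StartDate could be broken differently.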