Pandas: using unique row values as columns (like pivot) - python

Dataset: top 50 novels sold each year.
Genre column: Fiction, Non Fiction (only two unique values).
How can I summarize the data so that I get a table with the author name and the number of Fiction and Non Fiction books they have written, in two separate columns?
Here is a minimized version of the dataset:
"""{'Name': {0: '10-Day Green Smoothie Cleanse',
1: '11/22/63: A Novel',
2: '12 Rules for Life: An Antidote to Chaos'},
'Author': {0: 'JJ Smith', 1: 'Stephen King', 2: 'Jordan B. Peterson'},
'User Rating': {0: 4.7, 1: 4.6, 2: 4.7},
'Reviews': {0: '17,350', 1: '2,052', 2: '18,979'},
'Price': {0: '$8.00', 1: '$22.00', 2: '$15.00'},
'Price_r': {0: '$8', 1: '$22', 2: '$15'},
'Year': {0: 2016, 1: 2011, 2: 2018},
'Genre': {0: 'Non Fiction', 1: 'Fiction', 2: 'Non Fiction'}} """
df.groupby(['Author']).Genre.value_counts().sort_values(ascending = False)
I have tried using groupby (above), but I am not getting separate columns for Fiction and Non Fiction.

We don't have the exact column names, but from what I understand, something like this should do the trick:
df.groupby(["author", "genre"]).count()
Or (to get back "author" and "genre" as columns instead of having them in the index):
df.groupby(["author", "genre"]).count().reset_index()

Pandas: How to Find a % of a Group?

*** Disclaimer: I am a total noob. I am trying to learn pandas by solving a problem at work. This is a subset of my overall problem, but I am trying to solve the pieces before I tackle the whole project. I appreciate your patience! ***
I am trying to find out what percentage each Fund makes up of its State's total.
Concept: we have funds (departments) that are based in states. The funds have different levels of compensation for different projects. I first need to total (group) the compensation per fund so I know the total compensation for each fund.
I also need to total (group) the compensation by state so I can later figure out each fund's percentage of its state.
I have converted my data to sample code here:
import pandas as pd

# sample data
data = {'Fund': ['1000', '1000', '2000', '2000', '3000', '3000', '4000', '4000'],
        'State': ['AL', 'AL', 'FL', 'FL', 'AL', 'AL', 'NC', 'NC'],
        'Compensation': [2000, 2500, 1500, 1750, 4000, 3200, 1450, 3000]}
employees = pd.DataFrame(data)
In case the picture doesn't come through, here is what I did:
print(employees)
employees.groupby('Fund').Compensation.sum()
employees.groupby('State').Compensation.sum()
I've spent a good portion of the day on my actual data trying to figure out how to get:
"Fund's compensation is __% of the total compensation for its State"
or...
"Fund_1000 is 38% of AL's total compensation."
Thanks for your patience and your help!
John
Here is one solution. You can first do a groupby to get to the lowest level of aggregation, and then use a groupby transform to divide those values by the state totals.
agg = df.groupby(['Fund', 'State'], as_index=False)['Compensation'].sum()
agg['percentage'] = (agg['Compensation'] / agg.groupby('State')['Compensation'].transform('sum')) * 100
agg.to_dict()
{'Fund': {0: '1000', 1: '2000', 2: '3000', 3: '4000'},
'State': {0: 'AL', 1: 'FL', 2: 'AL', 3: 'NC'},
'Compensation': {0: 4500, 1: 3250, 2: 7200, 3: 4450},
'percentage': {0: 38.46153846153847,
1: 100.0,
2: 61.53846153846154,
3: 100.0}}
This should do the trick:
df['total_state_compensation'] = df.groupby('State')['Compensation'].transform('sum')
df['total_state_fund_compensation'] = df.groupby(['State', 'Fund'])['Compensation'].transform('sum')
df['ratio'] = df['total_state_fund_compensation'].div(df['total_state_compensation'])
>>> df.groupby(['State', 'Fund'])['ratio'].mean().to_dict()
Out[1]: {('AL', '1000'): 0.38461538461538464,
('AL', '3000'): 0.6153846153846154,
('FL', '2000'): 1.0,
('NC', '4000'): 1.0}
You can also calculate the totals separately and merge the data frames:
import pandas as pd

data = {
    "Fund": ["1000", "1000", "2000", "2000", "3000", "3000", "4000", "4000"],
    "State": ["AL", "AL", "FL", "FL", "AL", "AL", "NC", "NC"],
    "Compensation": [2000, 2500, 1500, 1750, 4000, 3200, 1450, 3000],
}
# Create dataframe from the dictionary provided
df = pd.DataFrame.from_dict(data)
# First group compensation by fund and state
df_fund = df.groupby(["Fund", "State"]).Compensation.sum().reset_index()
# Calculate the total by state in a new df
df_total = df_fund.groupby("State").Compensation.sum().reset_index()
# Merge the dataframes so each fund row gets its state total
merged = df_fund.merge(df_total, how="outer", left_on="State", right_on="State")
# Add a percentage column to the merged dataframe
merged["percentage"] = merged["Compensation_x"] / merged["Compensation_y"] * 100

fuzzy duplicate check using python dedupe library error

I'm trying to use the python dedupe library to perform a fuzzy duplicate check on my mock data. Here is the data:
{'Vendor': {0: 'ABC', 1: 'ABC', 2: 'TIM'},
'Doc Date': {0: '5/12/2019', 1: '5/13/2019', 2: '4/15/2019'},
'Invoice Date': {0: '5/10/2019', 1: '5/10/2019', 2: '4/10/2019'},
'Invoice Ref Num': {0: 'ABCDE56.', 1: 'ABCDE56', 2: 'RTET5SDF'},
'Invoice Amount': {0: '56', 1: '56', 2: '100'}}
but I keep getting this error:
IndexError: Cannot choose from an empty sequence
Here's the code that I'm using:
import pandas as pd
import pandas_dedupe
df = pd.read_csv("duptest.csv")
df.columns
df = pandas_dedupe.dedupe_dataframe(df,['Vendor','Invoice Ref Num','Invoice Amount'])
Any idea what I'm doing wrong? Thanks.
pandas-dedupe creates a sample of observations that you need to label.
The default sample size is equal to 30% of your dataframe.
In your case, you have too few examples in your dataframe to start active learning.
If you set sample_size=1 as follows:
df = pandas_dedupe.dedupe_dataframe(df, ['Vendor', 'Invoice Ref Num', 'Invoice Amount'], sample_size=1)
you will be able to dedupe your data :)

Unable to create dataframe with pandas DateRange and multiple columns

I'm working on a df as follows:
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({'ID': {0: 'S0001', 1: 'S0002', 2: 'S0003'},
                   'StartDate': {0: Timestamp('2018-01-01 00:00:00'),
                                 1: Timestamp('2019-01-01 00:00:00'),
                                 2: Timestamp('2019-04-01 00:00:00')},
                   'EndDate': {0: Timestamp('2019-01-02 00:00:00'),
                               1: Timestamp('2020-01-02 00:00:00'),
                               2: Timestamp('2020-04-01 00:00:00')},
                   'Color': {0: 'Blue', 1: 'Green', 2: 'Red'},
                   'Type': {0: 'Small', 1: 'Mid', 2: 'Mid'}})
Now I want to create a df with 366 rows between the Start and End dates, and I want to add the Color, Type and ID to every row between StartDate and EndDate.
I'm doing the following, which works well:
OutputDF = pd.concat([pd.DataFrame(data = Row['ID'], index = pd.date_range(Row['StartDate'], Row['EndDate'], freq='1D', closed = 'left'), columns = ['ID']) for index, Row in df.iterrows()])
and I get a df with two columns: ID and the days in the StartDate/EndDate range.
I'm able to add the Color/Type by doing a pd.merge on 'ID', but I think there should be a direct way to add the Color and Type columns when creating the df.
I've tried data = [Row['ID'], Row['Type'], Row['Color']] and data = Row[['ID', 'Color', 'Type']], but neither works.
So how can I create my dataframe with the Color and Type populated for the whole 366 rows directly, without requiring the merge?
Sample of current output (image omitted): one row per day between StartDate and EndDate for each item, with only the ID column populated.
Desired output (image omitted): the same rows, but with the Color and Type columns populated as well.
Thanks
Try the pd.DataFrame constructor with a dictionary for the data:
pd.concat([pd.DataFrame({'ID': Row['ID'],
                         'Color': Row['Color'],
                         'Type': Row['Type']},
                        index=pd.date_range(Row['StartDate'],
                                            Row['EndDate'],
                                            freq='1D',
                                            closed='left'))
           for index, Row in df.iterrows()])
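As a rough alternative sketch that avoids iterrows (column names assumed from the sample df above; note that in pandas 1.4+ the closed= keyword of date_range was renamed inclusive=):
import pandas as pd

# Build one date_range per row, explode it so each day becomes its own row,
# and keep ID, Color and Type alongside the dates.
out = (df.assign(Date=[pd.date_range(s, e, freq='1D', closed='left')
                       for s, e in zip(df['StartDate'], df['EndDate'])])
         .explode('Date')
         .set_index('Date')[['ID', 'Color', 'Type']])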

Python Pandas String matching from different columns

I have an excel-1 (raw data) and an excel-2 (reference document).
The "Comments" column in excel-1 should be matched against the "Comments" column in excel-2. If the string in excel-1's "Comments" column contains any of the substrings from excel-2's "Comments" column, the Primary Reason and Secondary Reason from excel-2 should be populated against each matching row in excel-1.
Excel-1
{'Item': {0: 'rr-1', 1: 'ss-2'}, 'Order': {0: 1, 1: 2}, 'Comments': {0: 'Good;Stock out of order,#1237-MF, Closing the stock ', 1: 'no change, bad, next week delivery,09/12/2018-MF*'}}
Excel-2
{'Comments': {0: 'Good', 1: 'Stock out of order', 2: 'Stock closed ', 3: 'No Change', 4: 'Bad stock', 5: 'Next week delivery '}, 'Primary Reason': {0: 'Quality', 1: 'Warehouse', 2: 'Logistics ', 3: 'Feedback', 4: 'Warehouse', 5: 'Logistics '}, 'Secondary Reason': {0: 'Manufacture', 1: 'Stock', 2: 'Warehouse', 3: 'Feedback', 4: 'Stock', 5: 'Warehouse'}}
Please help me build the logic.
I get the answer when there is a single match using the pd.DataFrame.str.contains/isin functions, but how do I write the logic to search for multiple matches and write them out in a particular structured format?
for value in df['Comments']:
    string = re.sub(r'[?|$|.|!|,|;]', r'', value)
    for index, value in df1.iterrows():
        substring = df1.Comment[index]
        if substring in string:
            df['Primary Reason'] = df1['Primary Reason'][index]
            df['Secondary Reason'] = df1['Secondary Reason'][index]
Analysis of the above code:
Basically, you are comparing each row of excel-1 against each row of excel-2, matching the substring against the string and picking up the Primary and Secondary Reason, right?
The problem is that you are overwriting the same output location each time, which is why you always end up with only one result.
Issue is in the following code:
df['Primary Reason']= df1['Primary Reason'][index]
df['Secondary Reason']=df1['Secondary Reason'][index]
Instead, come up with logic that accumulates the results on the same line, in a format like:
res1, res2, ... etc.
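A rough sketch of that idea (column names taken from the sample dicts above; the cleanup regex and the ", " separator are just assumptions):
import re
import pandas as pd

def match_reasons(comment, ref):
    # Strip punctuation and lowercase the raw comment before substring matching.
    cleaned = re.sub(r'[?$.!,;]', '', str(comment)).lower()
    primary, secondary = [], []
    for _, row in ref.iterrows():
        if str(row['Comments']).strip().lower() in cleaned:
            primary.append(str(row['Primary Reason']).strip())
            secondary.append(str(row['Secondary Reason']).strip())
    # Join every match into one cell: "res1, res2, ..."
    return pd.Series({'Primary Reason': ', '.join(primary),
                      'Secondary Reason': ', '.join(secondary)})

df[['Primary Reason', 'Secondary Reason']] = df['Comments'].apply(match_reasons, ref=df1)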

How to vectorise looping through pandas groupby dataframes whilst applying conditionals

I am trying to organise some data into groups based on the contents of certain columns. The current code I have depends on loops, and I would like to vectorise it instead to improve performance. I know this is the way forward with pandas, and whilst I can vectorise some problems, I am really struggling with this one.
What I need to do is group the data by ClientNumber and link the Genuine and Incomplete rows, so that for each ClientNumber every Genuine row gets a different process ID, and each Incomplete row gets the same process ID as the nearest Genuine row whose StartDate is greater than the StartDate of the Incomplete row. Essentially, Incomplete rows should be attached to a Genuine row if one is present, and once a Genuine row is found it should close that grouping and treat later rows as separate events. Then I must set a process start date for each row equal to the lowest StartDate within that process ID group, and mark the last row (the one with the greatest StartDate) with a ProcessCount in a separate column.
Apologies for my lack of descriptive ability here; hopefully the code (written in Python 3.6) I have so far will better explain my desired outcome. The code works, but as you can see it relies on nested loops, which I don't like. I have tried researching how to vectorise this, but I am struggling to get my head around the concept for this problem.
Any help you can provide in straightening out the loops in this code would be much appreciated, and would really help me understand how to apply this to other tasks going forwards.
Data
import numpy as np
import pandas as pd
from pandas import Timestamp

df_dict = {'ClientNumber': {0: 1234, 1: 1234, 2: 1234, 3: 123, 4: 123, 5: 123, 6: 12, 7: 12, 8: 1}, 'Genuine_Incomplete': {0: 'Incomplete', 1: 'Genuine', 2: 'Genuine', 3: 'Incomplete', 4: 'Incomplete', 5: 'Genuine', 6: 'Incomplete', 7: 'Incomplete', 8: 'Genuine'}, 'StartDate': {0: Timestamp('2018-01-01 00:00:00'), 1: Timestamp('2018-01-05 00:00:00'), 2: Timestamp('2018-03-01 00:00:00'), 3: Timestamp('2018-01-01 00:00:00'), 4: Timestamp('2018-01-03 00:00:00'), 5: Timestamp('2018-01-10 00:00:00'), 6: Timestamp('2018-01-01 00:00:00'), 7: Timestamp('2018-06-02 00:00:00'), 8: Timestamp('2018-01-01 00:00:00')}}
df = pd.DataFrame(data=df_dict)
df["ID"] = df.index
df["Process_Start_Date"] = np.nan
df["ProcessCode"] = np.nan
df["ProcessCount"] = np.nan
grouped_df = df.groupby('ClientNumber')
for key, item in grouped_df:
    newdf = grouped_df.get_group(key)
    newdf.sort_values(by=["StartDate"], inplace=True)
    c = 1
    for i in newdf.iterrows():
        i = i[0]
        GI = df.loc[i, "Genuine_Incomplete"]
        proc_code = "{}_{}".format(df.loc[i, "ClientNumber"], c)
        df.loc[i, "ProcessCode"] = proc_code
        if GI == "Genuine":
            c += 1

grouped_df = df.groupby('ProcessCode')
for key, item in grouped_df:
    newdf = grouped_df.get_group(key)
    newdf.sort_values(by=["StartDate"], inplace=True)
    df.loc[newdf.ID.iat[-1], "ProcessCount"] = 1
    for i in newdf.iterrows():
        i = i[0]
        df.loc[i, "Process_Start_Date"] = df.loc[newdf.ID.iat[0], "StartDate"]
Note - you may have noticed my use of df["ID"], which is just a copy of the index. I know this is not good practice, but I couldn't work out how to set values from other columns using the index. Any suggestions for doing this are also very welcome.
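Not an authoritative answer, but a rough vectorised sketch of the same logic (assuming the df built above): within each ClientNumber, a row's process number is one plus the count of Genuine rows that come before it once the frame is sorted by StartDate, so a grouped cumulative sum can replace the outer loops.
# Sort so that "before" means an earlier StartDate within each client.
df = df.sort_values(["ClientNumber", "StartDate"])
is_genuine = df["Genuine_Incomplete"].eq("Genuine")
# 1 + number of Genuine rows strictly before this row within the client.
proc_num = is_genuine.groupby(df["ClientNumber"]).cumsum() - is_genuine + 1
df["ProcessCode"] = df["ClientNumber"].astype(str) + "_" + proc_num.astype(str)
# Earliest StartDate per ProcessCode, and flag the row with the latest StartDate.
df["Process_Start_Date"] = df.groupby("ProcessCode")["StartDate"].transform("min")
df.loc[df.groupby("ProcessCode")["StartDate"].idxmax(), "ProcessCount"] = 1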
