I have a dataframe containing many columns, and I am trying to build a pivot table like this.
Data sample
program | InWappTable | InLeadExportTrack
VIC     | True        | 1
VIC     | True        | 1
VIC     | True        | 1
VIC     | True        | 1
Here is my code
rec.groupby(['InWappTable', 'InLeadExportTrack','program']).size()
And Expected Output is
IIUC, you can try this:
df_new = df.groupby('program')[['InWappTable', 'InLeadExportTrack']].count().reset_index()
total = df_new.sum(numeric_only=True)  # sum only the count columns
total['program'] = 'Total'
df_new = pd.concat([df_new, total.to_frame().T], ignore_index=True)  # .append was removed in pandas 2.0
print(df_new)
I do not believe that you require a pivot_table here, though a pivot_table with aggfunc can also be used effectively; a sketch of that variant follows.
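For completeness, that pivot_table variant might look like this (a minimal sketch against the sample df built below; margins=True appends the Total row):
pd.pivot_table(df, index='program',
               values=['InWappTable', 'InLeadExportTrack'],
               aggfunc='count', margins=True, margins_name='Total')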
Here is how I approached it:
Generate some data
import pandas as pd

a = [['program', 'InWappTable', 'InLeadExportTrack'],
     ['VIC', True, 1],
     ['Mall', False, 15],
     ['VIC', True, 101],
     ['VIC', True, 1],
     ['Mall', True, 74],
     ['Mall', True, 11],
     ['VIC', False, 44]]
df = pd.DataFrame(a[1:], columns=a[0])
print(df)
program InWappTable InLeadExportTrack
0 VIC True 1
1 Mall False 15
2 VIC True 101
3 VIC True 1
4 Mall True 74
5 Mall True 11
6 VIC False 44
First, do a GROUP BY with count aggregation:
df_grouped = df.groupby(['program']).count()
print(df_grouped)
InWappTable InLeadExportTrack
program
Mall 3 3
VIC 4 4
Then, to get the sum of all columns:
num_cols = ['InWappTable','InLeadExportTrack']
df_grouped[num_cols] = df_grouped[num_cols].astype(int)
df_grouped.loc['Total']= df_grouped.sum(axis=0)
df_grouped.reset_index(drop=False, inplace=True)
print(df_grouped)
program InWappTable InLeadExportTrack
0 Mall 3 3
1 VIC 4 4
2 Total 7 7
EDIT
Based on the comments in the OP, df_grouped = df.groupby(['program']).count() could be replaced by df_grouped = df.groupby(['program']).sum(). In this case, the output is shown below
program InWappTable InLeadExportTrack
0 Mall 2 100
1 VIC 3 147
2 Total 5 247
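The pivot_table sketch from the note above covers this sum variant as well; swapping aggfunc produces the Total row in one call (again a sketch, not the original code):
pd.pivot_table(df, index='program',
               values=['InWappTable', 'InLeadExportTrack'],
               aggfunc='sum', margins=True, margins_name='Total')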
Related
I have a pandas dataframe as below:
driver_id | trip_id | pickup_from | drop_off_to | date
1 | 3 | stop 1 | city1 | 2018-02-04
1 | 7 | city 2 | city3 | 2018-02-04
1 | 4 | city 4 | stop1 | 2018-02-04
2 | 8 | stop 1 | city7 | 2018-02-06
2 | 9 | city 8 | stop1 | 2018-02-06
2 | 12 | stop 1 | city5 | 2018-02-07
2 | 10 | city 3 | city1 | 2018-02-07
2 | 1 | city 4 | city7 | 2018-02-07
2 | 6 | city 2 | stop1 | 2018-02-07
I want to calculate the longest trip for each driver between stop 1 in the pickup_from column and stop 1 in the drop_off_to column. For example, driver 1 started from stop 1, went to city 1, then city 2, city 3, and city 4, and then came back to stop 1, so his max number of cities visited is 4.
Driver 2 started from stop 1, went to city 7 and city 8, and returned to stop 1, so he visited 2 cities. He then started from stop 1 again and visited city 5, city 3, city 1, city 4, city 7, and city 2 before going back to stop 1, so in that run he worked in 6 cities. So for driver 2 the max number of cities visited is 6. The date doesn't matter in this calculation.
How can I do this using Pandas?
Define the following function, which computes the longest trip for a driver:
def maxTrip(grp):
    # Interleave the pickup/drop-off columns into one chronological city sequence
    trip = pd.DataFrame({'city': grp[['pickup_from', 'drop_off_to']]
                        .values.reshape(1, -1).squeeze()})
    # Each city starting with 'stop' opens a new sub-trip; count the unique
    # cities per sub-trip and subtract 1 for the stop itself
    return trip.groupby(trip.city.str.match('stop').cumsum())\
               .apply(lambda grp2: grp2.drop_duplicates().city.size).max() - 1
Then apply it:
result = df.groupby('driver_id').apply(maxTrip)
The result, for your data sample, is:
driver_id
1 4
2 6
dtype: int64
Note: It is up to you whether you want to eliminate repeated cities within one sub-trip (from leaving the stop to the return). I assumed that they are to be eliminated; if you don't want this, drop .drop_duplicates() from my code. For your data sample this does not matter, since city names are unique within each sub-trip, but it can happen that a driver visits a city, goes to another city, and some time later (before returning to the stop) visits the same city again.
This would be best done using a plain function. Assuming the name of your dataframe above is df (with the default 0-based index, so driver 1 occupies rows 0-2 and driver 2 rows 3-8), consider this as a continuation:
drop_cities1 = list(df.loc[0:2, "drop_off_to"])  # all of driver 1's drop-offs
pick_cities1 = list(df.loc[0:2, "pickup_from"])  # all of driver 1's pickups
drop_cities2 = list(df.loc[3:, "drop_off_to"])   # all of driver 2's drop-offs
pick_cities2 = list(df.loc[3:, "pickup_from"])   # all of driver 2's pickups

def max_trips(pick_list, drop_list):
    # Walk the pickup/drop-off pairs in order. A sub-trip ends whenever the
    # drop-off is the stop; count every non-stop city and keep the best sub-trip.
    max_cities = 0
    no_of_cities = 0
    for pick, drop in zip(pick_list, drop_list):
        if not pick.startswith("stop"):
            no_of_cities += 1                    # pickup city counts as visited
        if drop.startswith("stop"):
            max_cities = max(max_cities, no_of_cities)
            no_of_cities = 0                     # back at the stop: sub-trip over
        else:
            no_of_cities += 1                    # drop-off city counts as visited
    return max_cities

no_of_c1 = max_trips(pick_cities1, drop_cities1)  # cities visited by driver 1 -> 4
no_of_c2 = max_trips(pick_cities2, drop_cities2)  # cities visited by driver 2 -> 6
I have a sample dataframe/table as below and I would like to do a simple pivot table in Python to calculate the % difference from the previous year.
DataFrame
Year Month Count Amount Retailer
2019 5 10 100 ABC
2019 3 8 80 XYZ
2020 3 8 80 ABC
2020 5 7 70 XYZ
...
Expected Output
     MONTH   %Diff
ABC      7  -0.200
XYZ      8  -0.125
Thanks,
EDIT: To reiterate, I want to create the table shown above, not join the two tables.
It looks like you need a groupby, not a pivot:
gdf = df.groupby('Retailer')[['Amount']].pct_change()  # rows assumed already sorted by Year
Then rename and merge with original df
df = gdf.rename(columns={'Amount': '%Diff'}).dropna().merge(df, how='left', left_index=True, right_index=True)
   %Diff  Year  Month  Count  Amount Retailer
2 -0.200  2020      3      8      80      ABC
3 -0.125  2020      5      7      70      XYZ
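Since the question asks about a pivot table, a pivot_table-based sketch works too (an assumption here: exactly one row per Retailer and Year, with the years 2019 and 2020 as in the sample):
wide = df.pivot_table(index='Retailer', columns='Year', values='Amount')
pct = (wide[2020] / wide[2019] - 1).rename('%Diff')
print(pct)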
I am trying to create a column whose values are percentages computed from other columns in Python. For example, let's assume that we have the following dataset.
+------------------------------------+------------+--------+
| Teacher | grades | counts |
+------------------------------------+------------+--------+
| Teacher1 | 1 | 1 |
| | 2 | 2 |
| | 3 | 1 |
| Teacher2 | 2 | 1 |
| Teacher3 | 3 | 2 |
| Teacher4 | 2 | 2 |
| | 3 | 2 |
+------------------------------------+------------+--------+
As you can see, we have teachers in the first column, the grades a teacher gives (1, 2, and 3) in the second column, and the number of times the corresponding grade was given in the third column. I am trying to get, for each teacher, the percentage of grades 1 and 2 among all grades given. For instance, teacher 1 gave one grade 1, two grade 2s, and one grade 3, so the percentage of grades 1 and 2 in the total is 75%. Teacher 2 gave only one grade 2, so the percentage is 100%. Similarly, teacher 3 gave two grade 3s, so the percentage is 0% because he/she did not give any grades 1 or 2. These percentages should be added as a new column in the dataset. Honestly, I couldn't think of anything to try, and I didn't find anything when I searched here. Could you please help me create this column?
I am not sure this is the most efficient way, but I find it quite readable and easy to follow.
percents = {}  # store Teacher: percent
for t, g in df.groupby('Teacher'):  # t, g are short for teacher, group
    total = g.counts.sum()
    one_two = g.loc[g.grades.isin([1, 2])].counts.sum()  # consider only grades 1 & 2
    percent = (one_two / total) * 100
    #print(t, percent)
    percents[t] = [percent]

xf = pd.DataFrame(percents).T.reset_index()  # make a df from the dict
xf.columns = ['Teacher', 'percent']  # rename columns
df = df.merge(xf)  # merge with the initial df
print(df)
Teacher grades counts percent
0 Teacher1 1 1 75.0
1 Teacher1 2 2 75.0
2 Teacher1 3 1 75.0
3 Teacher2 2 1 100.0
4 Teacher3 3 2 0.0
5 Teacher4 2 2 50.0
6 Teacher4 3 2 50.0
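For comparison, a more vectorized take using groupby().transform (a sketch against the same df; where() zeroes out the counts for grades other than 1 and 2):
mask = df['grades'].isin([1, 2])
one_two = df['counts'].where(mask, 0).groupby(df['Teacher']).transform('sum')
total = df.groupby('Teacher')['counts'].transform('sum')
df['percent'] = one_two / total * 100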
I believe this will solve your query:
teachers = data['Teachers'].unique()
y = 0
data['Percentage'] = None
for teacher in teachers:
    x = data[data['Teachers'] == teacher]
    total = sum(x['Counts'])
    condition1 = 1 in set(x['Grades'])
    condition2 = 2 in set(x['Grades'])
    if condition1 or condition2:
        for i in range(y, y + len(x)):
            # each row's share of the teacher's total, in percent
            data.loc[i, 'Percentage'] = (data['Counts'].iloc[i] / total) * 100
    else:
        for i in range(y, y + len(x)):
            data.loc[i, 'Percentage'] = 0
    y = y + len(x)
Output:
Teachers Grades Counts Percentage
0 Teacher1 1 1 25
1 Teacher1 2 2 50
2 Teacher1 3 1 25
3 Teacher2 2 1 100
4 Teacher3 3 2 0
5 Teacher4 2 2 50
6 Teacher4 3 2 50
I have made use of boolean masking to segregate the data on the basis of each teacher. Most of the code is self-explanatory. For any other clarification, please feel free to leave a comment.
I want to create a contingency table in Pandas. I can do it with the following code but I wondered if there is a pandas function that would do it for me.
For a reproducible example:
toy_data = pd.read_json('{"Light":{"321":"no_light","476":"night_light","342":"lamp","454":"lamp","25":"night_light","53":"night_light","120":"night_light","346":"night_light","360":"lamp","55":"no_light","391":"night_light","243":"no_light","101":"night_light","377":"night_light","124":"no_light","368":"lamp","400":"no_light","247":"night_light","270":"lamp","208":"night_light"},"Nearsightedness":{"321":"No","476":"Yes","342":"Yes","454":"Yes","25":"No","53":"Yes","120":"Yes","346":"No","360":"No","55":"Yes","391":"Yes","243":"No","101":"No","377":"Yes","124":"No","368":"No","400":"No","247":"No","270":"Yes","208":"No"}}')
toy_data.head()
Light Nearsightedness
321 no_light No
476 night_light Yes
342 lamp Yes
454 lamp Yes
25 night_light No
df = pd.DataFrame(toy_data.groupby(['Light', 'Nearsightedness']).size())
df = df.unstack('Nearsightedness')
df.columns = df.columns.droplevel()
df
Nearsightedness No Yes
Light
lamp 2 3
night_light 5 5
no_light 4 1
pd.crosstab will do the trick:
pd.crosstab(df.Light, df.Nearsightedness)
Output:
Nearsightedness No Yes
Light
lamp 2 3
night_light 5 5
no_light 4 1
You can use pd.crosstab:
res = pd.crosstab(df['Light'], df['Nearsightedness'].eq('Yes'))
print(res)
Nearsightedness False True
Light
lamp 2 3
night_light 5 5
no_light 4 1
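Both answers count occurrences; crosstab can also add totals or produce proportions directly (a small extension beyond the answers above):
pd.crosstab(df['Light'], df['Nearsightedness'], margins=True)       # adds row/column totals
pd.crosstab(df['Light'], df['Nearsightedness'], normalize='index')  # row proportions instead of counts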
I'm working on a database of incidents affecting different sectors in different countries, and I want to create a table tallying the incident breakdown for each country.
The database currently looks like this:
Incident Name | Country Affected | Sector Affected
incident_1 | US,TW,CN | Engineering,Media
incident_2 | FR,RU,CN | Government
etc., etc.
My aim would be to build a table which looks like this:
Country | Engineering | Media | Government
CN | 3 | 0 | 5
etc.
Right now my method is basically to use an if loop to check whether the country column contains a specific string (for example 'CN') and, if this returns True, to run Counter from collections to create a dictionary of the initial tally, then save this.
My issue is how to scale this up so it runs across the entire database, and how to actually save the dictionary produced by Counter.
pd.Series.str.get_dummies and pd.DataFrame.dot: get_dummies turns each comma-separated cell into indicator columns, and the dot product of the transposed country indicators with the sector indicators tallies their co-occurrences.
c = df['Country Affected'].str.get_dummies(sep=',')
s = df['Sector Affected'].str.get_dummies(sep=',')
c.T.dot(s)
Engineering Government Media
CN 1 1 1
FR 0 1 0
RU 0 1 0
TW 1 0 1
US 1 0 1
A bigger example:
import numpy as np

np.random.seed([3, 1415])
countries = ['CN', 'FR', 'RU', 'TW', 'US', 'UK', 'JP', 'AU', 'HK']
sectors = ['Engineering', 'Government', 'Media', 'Commodidty']

def pick_rnd(x):
    # pick a random number of distinct items and join them with commas
    i = np.random.randint(1, len(x))
    j = np.random.choice(x, i, False)
    return ','.join(j)

df = pd.DataFrame({
    'Country Affected': [pick_rnd(countries) for _ in range(10)],
    'Sector Affected': [pick_rnd(sectors) for _ in range(10)]
})
df
Country Affected Sector Affected
0 CN Government,Media
1 FR,TW,JP,US,UK,CN,RU,AU Commodidty,Government
2 HK,AU,JP Commodidty
3 RU,CN,FR,JP,UK Media,Commodidty,Engineering
4 CN,RU,FR,JP,TW,HK,US,UK Government,Media,Commodidty
5 FR,CN Commodidty
6 FR,HK,JP,TW,US,AU,CN Commodidty
7 CN,HK,RU,TW,UK,US,FR,JP Media,Commodidty
8 JP,UK,AU Engineering,Media
9 RU,UK,FR Media
Then
c = df['Country Affected'].str.get_dummies(sep=',')
s = df['Sector Affected'].str.get_dummies(sep=',')
c.T.dot(s)
Commodidty Engineering Government Media
AU 3 1 1 1
CN 6 1 3 4
FR 6 1 2 4
HK 4 0 1 2
JP 6 2 2 4
RU 4 1 2 4
TW 4 0 2 2
UK 4 2 2 5
US 4 0 2 2
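An equivalent route via built-in reshaping, if you prefer to avoid the dot product (a sketch assuming pandas >= 0.25 for explode):
ex = df.assign(
    country=df['Country Affected'].str.split(','),
    sector=df['Sector Affected'].str.split(','),
).explode('country').explode('sector')
print(pd.crosstab(ex['country'], ex['sector']))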