I have created a groupby object with two columns: the names of directors and how much revenue their films have returned. I would like to sum the total revenue for each director into a DataFrame, but the output my code returns is formatted a bit oddly.
Any suggestions?
gross_revised.groupby(['director','updated_inflation_values']).size()
and the output it's returning looks like this:
director        updated_inflation_values
Art Stevens     9088096.00         1
Ben Sharpsteen  528279994.00       1
                2187090808.00      1
Clyde Geronimi  91305448.00        1
                138612686.00       1
                920608730.00       1
David Hand      1078510579.00      1
                5228953251.00
Is there any way to have the sum value show as a single number per director?
Thank you for your help!
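One approach (a minimal sketch using the gross_revised frame and column names from the question): group by 'director' only and aggregate the revenue column with sum() instead of counting rows with size(). Passing as_index=False keeps the result as a regular two-column DataFrame.
# Total revenue per director as a two-column DataFrame.
totals = (
    gross_revised
    .groupby('director', as_index=False)['updated_inflation_values']
    .sum()
)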
Essentially what I am trying to do is go through the "External_Name" column, row by row, and get a count of unique substrings within each string, kind of like .value_counts().
External_Name                              Specialty
ABMC Hyperbaric Medicine and Wound Care    Hyperbaric/Wound Care
ABMC Kaukauna Laboratory Services          Laboratory
AHCM Sinai Bariatric Surgery Clinic        General Surgery
...                                        ...
For example, after running through the first three rows in "External_Name" the output would be something like
Output       Count
ABMC         2
Hyperbaric   1
Medicine     1
and          1
Wound        1
Care         1
So on and so forth. Any help would be really appreciated!
You can split at whitespace with str.split(), then explode the resulting word lists into individual rows and count the values with value_counts.
>>> df.External_Name.str.split().explode().value_counts()
ABMC 2
Hyperbaric 1
Medicine 1
and 1
Wound 1
Care 1
Kaukauna 1
Laboratory 1
Services 1
AHCM 1
Sinai 1
Bariatric 1
Surgery 1
Clinic 1
Name: External_Name, dtype: int64
I have a dataframe shown in image 1. It is a sample of pubs in London, UK (3,337 pubs/rows), and the geometry is at an LSOA level. In some LSOAs there is more than one pub. I want my dataframe to summarise the number of pubs in every LSOA. I already have the information by using
psdf['lsoa11nm'].value_counts()
prints out:
City of London 001F 103
City of London 001G 40
Westminster 013B 36
Westminster 018A 36
Westminster 013E 30
...
Lambeth 005A 1
Croydon 043C 1
Hackney 002E 1
Merton 022D 1
Bexley 008B 1
Name: lsoa11nm, Length: 1630, dtype: int64
I can't use this as a new dataframe because it is an index plus a single column, as opposed to two columns where one would be lsoa11nm and the other the pub count.
Does anyone know how to groupby the dataframe so that there will be only one row for every lsoa, that says how many pubs are in it?
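A possible approach (a sketch, assuming the psdf frame and lsoa11nm column described above): keep the value_counts() result but move the index into a column, or equivalently group by the column and take the group sizes.
# value_counts gives an indexed Series; rename_axis + reset_index turns it
# into a two-column DataFrame: the LSOA name and the pub count.
pub_counts = (
    psdf['lsoa11nm']
    .value_counts()
    .rename_axis('lsoa11nm')
    .reset_index(name='pub_count')
)
# Equivalent with groupby: one row per LSOA, with the group size as the count.
pub_counts = psdf.groupby('lsoa11nm').size().reset_index(name='pub_count')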
I have a pandas dataframe that looks something like this:
name job jobchange_rank date
Thisguy Developer 1 2012
Thisguy Analyst 2 2014
Thisguy Data Scientist 3 2015
Anotherguy Developer 1 2018
The jobchange_rank column represents each individual's (based on name) ranked change in position, where rank 1 represents their first position, rank 2 their second position, etc.
Now for the fun part. I want to create a new column where I can see a person's previous job, something like this:
name job jobchange_rank date previous_job
Thisguy Developer 1 2012 None
Thisguy Analyst 2 2014 Developer
Thisguy Data Scientist 3 2015 Analyst
Anotherguy Developer 1 2018 None
I've created the following code to get the "None" values where there was no job change:
df.loc[df['jobchange_rank'].sub(df['jobchange_rank'].min()) == 0, 'previous_job'] = 'None'
Sadly, I can't seem to figure out how to get the values from the other column where the needed condition applies.
Any help is more than welcome!
Thanks in advance.
This answer assumes that your DataFrame is sorted by name and jobchange_rank, if that is not the case, sort first.
# df = df.sort_values(['name', 'jobchange_rank'])
m = df['name'].eq(df['name'].shift())
df['job'].shift().where(m)
0 NaN
1 Developer
2 Analyst
3 NaN
Name: job, dtype: object
Or using a groupby + shift (assuming at least sorted by jobchange_rank)
df.groupby('name')['job'].shift()
0 NaN
1 Developer
2 Analyst
3 NaN
Name: job, dtype: object
Although the groupby + shift is more concise, on larger inputs, if your data is already sorted like your example, it may be faster to avoid the groupby and use the first solution.
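If you want the shifted values as a column and the literal 'None' strings from the expected output, a small sketch continuing from the code above (the fillna step is only needed if you prefer strings over NaN):
# Attach the previous job per person; rows with no earlier job get NaN.
df['previous_job'] = df.groupby('name')['job'].shift()
# Optional: match the expected output by replacing NaN with the string 'None'.
df['previous_job'] = df['previous_job'].fillna('None')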
TL;DR:
I have two dataframes of different sizes, but both have an 'id' column that is supposed to act as an index. I need to merge them, group by 'sector' and 'gender', and count the entries in each group.
Long version:
I have a dataframe with 'id' and 'sector', among other columns, for company personnel, and another dataframe with 'id' and 'gender'. Examples below:
df1:
row* id sector other columns
1 0 Operational ...
2 0 Administrative ...
3 1 Sales ...
4 2 IT ...
5 3 Operational ...
6 3 IT ...
7 4 Sales ...
[...]
150 100 Operational ...
151 100 Sales ...
152 101 IT ...
*I don't really have a 'row' column, it's there just to make it easier to understand my problem.
df2:
row* id gender
1 0 Male
2 1 Female
3 2 Female
4 3 Male
5 4 Male
[...]
101 100 Male
102 101 Female
As you can see, one person can be in more than one sector (which seems to make my problem more complicated).
I need to merge them together and then count how many males and females are in each sector.
FIRST PROBLEM
Decided to make a new df to get only the columns 'id' and 'sector'.
df3 = df1[['id','sector']]
df3 = df3.merge(df2)
I get:
No common columns to perform merge on. Merge options: left_on=None,
right_on=None, left_index=False, right_index=False
Tried using .join() instead of .merge() and I get:
['id'] not in index
Then I tried reset_index(), found in some of the answers around here, but it didn't really solve my issue.
df1 = df1.reset_index()
df3 = df1[['id','sector']]
df3 = df3.join(df2)
What I got was this:
row* id sector gender
1 0 Operational Male
2 0 Administrative Female
3 1 Sales Female
4 2 IT Male
5 3 Operational Male
6 3 IT ...
7 4 Sales ...
[...]
150 100 Operational NaN
151 100 Sales NaN
152 101 IT NaN
It didn't respect the 'id' and just concatenated the column to the side. Since df2 only had 102 rows, I got NaN in the remaining rows (103 to 152), aside from the fact that the 'gender' was no longer accurate.
SECOND PROBLEM
Decided to power through that in order to get the rest of the work done. I tried this:
df3 = df3.groupby('sector','gender').size()
It raises:
No axis named gender for object type <class 'pandas.core.frame.DataFrame'>
Which doesn't really make sense to me, because I can call df3.gender and get the (entire) expected series. If I remove 'gender' from the line above, it actually groups, but that alone doesn't work for me. I also tried passing the column names before groupby, to no avail.
Expected result should be something like this:
sector gender sum
operational male 20
operational female 5
administrative male 10
administrative female 17
sales male 12
sales female 13
IT male 1
IT female 11
Not sure if I can answer my own question, but I think I should since the issue is resolved.
The solutions were very simple, even though I don't understand some of the issues I got.
First problem: I added on='id' to the merge
df3 = df1[['id','sector']].merge(df2, on='id')
Second problem: it was just missing a list, as pointed out by @DYZ
df3.groupby(['sector','gender']).size()
Feeling quite stupid right now... Must be tired. Thanks DYZ and sorry for the trouble.
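For completeness, both fixes combined (a sketch using the column names from the question):
# Merge on the shared 'id' column, then count rows per (sector, gender) pair.
df3 = df1[['id', 'sector']].merge(df2, on='id')
counts = df3.groupby(['sector', 'gender']).size().reset_index(name='count')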
I am trying to get a daily status count from the following DataFrame (it's a subset, the real data set is ~14k jobs with overlapping dates, only one status at any given time within a job):
Job Status User
Date / Time
1/24/2011 10:58:04 1 A Ted
1/24/2011 10:59:20 1 C Bill
2/11/2011 6:53:14 1 A Ted
2/11/2011 6:53:23 1 B Max
2/15/2011 9:43:13 1 C Bill
2/21/2011 15:24:42 1 F Jim
3/2/2011 15:55:22 1 G Phil Jr.
3/4/2011 14:57:45 1 H Ted
3/7/2011 14:11:02 1 I Jim
3/9/2011 9:57:34 1 J Tim
8/18/2014 11:59:35 2 A Ted
8/18/2014 13:56:21 2 F Bill
5/21/2015 9:30:30 2 G Jim
6/5/2015 13:17:54 2 H Jim
6/5/2015 14:40:38 2 I Ted
6/9/2015 10:39:15 2 J Tom
1/16/2015 7:45:58 3 A Phil Jr.
1/16/2015 7:48:23 3 C Jim
3/6/2015 14:09:42 3 A Bill
3/11/2015 11:16:04 3 K Jim
My initial thought (from the following link) was to groupby the job column, fill in the missing dates for each group and then ffill the statuses down.
Pandas reindex dates in Groupby
I was able to make this work, kind of: if two statuses occurred on the same date, one would not be included in the output, and consequently some statuses were missing.
I then found the following, it supposedly handles the duplicate issue, but I am unable to get it to work with my data.
Efficiently re-indexing one level with "forward-fill" in a multi-index dataframe
Am I on the right path in thinking that filling in the missing dates and then forward-filling the statuses down is the correct way to ultimately capture daily counts of individual statuses? Is there another method that might make better use of pandas features that I'm missing?
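One way to sketch it (assuming the frame, call it df, has a parsed DatetimeIndex for the date/time column plus the Job and Status columns shown above): for each job, keep the last status recorded on each day, reindex to a full daily range, forward-fill, then count statuses per day. Note that .last() keeps only one status per job per day, which is usually what you want for an end-of-day count.
import pandas as pd

def fill_daily(group):
    # Last status recorded on each calendar day, reindexed to a full daily
    # range over the job's lifetime, then forward-filled.
    return (group['Status']
            .resample('D').last()
            .ffill()
            .rename_axis('day'))

# One status per job per day: a Series with a (Job, day) MultiIndex.
daily = df.groupby('Job').apply(fill_daily).rename('Status')

# How many jobs sit in each status on each day.
status_counts = daily.reset_index().groupby(['day', 'Status']).size()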