Output after groupby and aggregation - python

I have a pandas DataFrame. When I do a groupby with an aggregation function such as min or max, I get only partial results, namely the numeric column on which I made the min/max aggregation. How can I get the full row, i.e. all the data corresponding to this min/max?
The dataframe looks like:
Place Year Time TimeS
BOSTON 1973 02:16:03 8163
FUKUOKA 1973 02:11:45 7905
NEW YORK 1973 02:21:54 8514
BERLIN 1974 02:44:53 9893
BOSTON 1974 02:13:39 8019
FUKUOKA 1974 02:11:32 7892
NEW YORK 1974 02:26:30 8790
I want the min or max time per year and the corresponding city. I can only get the time with (marathon is the name of the pandas DataFrame):
marathon.groupby('Year').Time.max()
that gives:
1973 02:21:54
1974 02:44:53
How can I have the place that corresponds to this time?
Namely:
NEW YORK 1973 02:21:54
BERLIN 1974 02:44:53

There are definitely many ways to do this. Here are two:
marathon[marathon.TimeS == marathon.groupby('Year').TimeS.transform('max')]
or
marathon[marathon.TimeS.isin(marathon.groupby('Year').TimeS.max())]
Let's check out some of these intermediate objects:
In [29]: marathon.groupby('Year').TimeS.max()
Out[29]:
Year
1973 8514
1974 9893
Name: TimeS, dtype: int64
So we get a Series of only two values. We can index the dataframe wherever the column value equals one of these, which is what the second solution does.
The first solution uses transform('max') instead, which preserves the size of the dataframe:
In [30]: marathon.groupby('Year').TimeS.transform('max')
Out[30]:
0 8514
1 8514
2 8514
3 9893
4 9893
5 9893
6 9893
Name: TimeS, dtype: int64
So now it's the same length as the dataframe, and we can compare it directly against the TimeS column.
Note that if the max value occurs multiple times within a year, both of these methods will also return the duplicates; that may or may not be what you want.
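If you want exactly one row per group even when the maximum is tied, a third option is idxmax, which returns the index label of the (first) maximum within each group. A minimal runnable sketch, reconstructing the example frame from the question:
import pandas as pd

marathon = pd.DataFrame({
    'Place': ['BOSTON', 'FUKUOKA', 'NEW YORK', 'BERLIN', 'BOSTON', 'FUKUOKA', 'NEW YORK'],
    'Year':  [1973, 1973, 1973, 1974, 1974, 1974, 1974],
    'Time':  ['02:16:03', '02:11:45', '02:21:54', '02:44:53', '02:13:39', '02:11:32', '02:26:30'],
    'TimeS': [8163, 7905, 8514, 9893, 8019, 7892, 8790],
})

# idxmax gives the index label of the max TimeS within each year;
# .loc then pulls the corresponding full rows, one per group
print(marathon.loc[marathon.groupby('Year')['TimeS'].idxmax()])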

Related

When merging DataFrames on a common column like ID (primary key), how do you handle data that appears more than once for a single ID in the second df?

So I have two dfs.
DF1
Superhero ID Superhero City
212121 Spiderman New york
364331 Ironman New york
678523 Batman Gotham
432432 Dr Strange New york
665544 Thor Asgard
123456 Superman Metropolis
555555 Nightwing Gotham
666666 Loki Asgard
Df2
SID Mission End date
665544 10/10/2020
665544 03/03/2021
212121 02/02/2021
665544 05/12/2020
212121 15/07/2021
123456 03/06/2021
666666 12/10/2021
I need to create a new df that summarizes how many heroes are in each city and in which quarter their missions will be complete. I'll be able to match the superhero (and their city) in df1 to the mission end date via their Superhero ID ('Superhero ID' == 'SID' in Df2). Superhero IDs appear only once in Df1 but can appear multiple times in Df2.
Ultimately I need a count for the total no. of heroes in the different cities (which I can do - see below) as well as how many heroes will be free per quarter.
These are the thresholds for the quarters
Quarter 1 – Apr, May, Jun
Quarter 2 – Jul, Aug, Sept
Quarter 3 – Oct, Nov, Dec
Quarter 4 – Jan, Feb, Mar
The following code tells me how many heroes are in each city:
df_Count = pd.DataFrame(df1.City.value_counts().reset_index())
Which produces:
City Count
New york 3
Gotham 2
Asgard 2
Metropolis 1
I can also convert the dates into datetime format via the following operation:
#Convert to datetime series
Df2['Mission End date'] = pd.to_datetime(Df2['Mission End date'], dayfirst=True)  # dates are day/month/year
Ultimately I need a new df that looks like this
City Total Count No. of heroes free in Q3 No. of heroes free in Q4 Free in Q1 2021+
New york 3 2 0 1
Gotham 2 2 2 0
Asgard 2 1 2 0
Metropolis 1 0 0 1
If anyone can help me create the appropriate quarters and sort them into the appropriate columns I'd be extremely grateful. I'd also like a way to handle heroes having multiple mission end dates; I can't ignore them, I still need to count them. I suspect I'll need to create a custom function which I can then apply to each row via the apply() method and a lambda expression. This issue has been a pain for a while now, so I'd appreciate all the help I can get. Thank you very much :)
After merging your dataframe with
df = df1.merge(df2, left_on='Superhero ID', right_on='SID')
And converting your date column to datetime format (the dates are day-first)
df = df.assign(mission_end_date=lambda x: pd.to_datetime(x['Mission End date'], dayfirst=True))
You can create two columns: one to extract the quarter and one to extract the year of the newly created datetime column
df = (df.assign(quarter_end_date=lambda x: x.mission_end_date.dt.quarter)
        .assign(year_end_date=lambda x: x.mission_end_date.dt.year))
And combine them into a column that shows the quarter in the format Qx, yyyy
df = df.assign(quarter_year_end=lambda x: 'Q' + x.quarter_end_date.astype(str) + ', ' + x.year_end_date.astype(str))
Finally, group by the city and quarter, count the number of superheroes, and pivot the dataframe to get your desired result
(df.groupby(['City', 'quarter_year_end'])
   .count()
   .reset_index()
   .pivot(index='City', columns='quarter_year_end', values='Superhero'))
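One caveat: dt.quarter gives calendar quarters (Q1 = Jan-Mar), while the question defines Q1 as Apr-Jun. A minimal sketch of one way to get that numbering, using quarterly periods with a March year-end (the reconstruction of Df2 below is hypothetical, and the dates are parsed day-first):
import pandas as pd

df2 = pd.DataFrame({
    'SID': [665544, 665544, 212121, 665544, 212121, 123456, 666666],
    'Mission End date': ['10/10/2020', '03/03/2021', '02/02/2021',
                         '05/12/2020', '15/07/2021', '03/06/2021', '12/10/2021'],
})

end = pd.to_datetime(df2['Mission End date'], dayfirst=True)

# 'Q-MAR' means quarterly periods for a year ending in March, so
# Q1 = Apr-Jun, Q2 = Jul-Sep, Q3 = Oct-Dec, Q4 = Jan-Mar,
# which matches the thresholds given in the question.
df2['quarter_year_end'] = end.dt.to_period('Q-MAR').astype(str)   # e.g. '2021Q3'
print(df2)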

How to aggregate rows in a pandas dataframe

I have a dataframe shown in image 1. It is a sample of pubs in London, UK (3337 pubs/rows), and the geometry is at LSOA level. In some LSOAs there is more than one pub. I want my dataframe to summarise the number of pubs in every LSOA. I can already get the information by using
psdf['lsoa11nm'].value_counts()
prints out:
City of London 001F 103
City of London 001G 40
Westminster 013B 36
Westminster 018A 36
Westminster 013E 30
...
Lambeth 005A 1
Croydon 043C 1
Hackney 002E 1
Merton 022D 1
Bexley 008B 1
Name: lsoa11nm, Length: 1630, dtype: int64
I can't use this as a new dataframe because it is a Series with the LSOA as the index, as opposed to two columns where one would be lsoa11nm and the other the pub count.
Does anyone know how to group the dataframe so that there is only one row for every LSOA, saying how many pubs are in it?
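A minimal sketch of one way to turn that count into a proper two-column dataframe (the miniature psdf below is hypothetical; only the lsoa11nm column matters here):
import pandas as pd

psdf = pd.DataFrame({'lsoa11nm': ['City of London 001F', 'City of London 001F',
                                  'Westminster 013B', 'Lambeth 005A']})

# One row per LSOA, with the pub count as its own column
pub_counts = psdf.groupby('lsoa11nm').size().reset_index(name='pub_count')

# Equivalently, the existing value_counts() result can be converted:
# pub_counts = psdf['lsoa11nm'].value_counts().rename_axis('lsoa11nm').reset_index(name='pub_count')
print(pub_counts)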

Calculating new rows in a Pandas Dataframe on two different columns

So I'm a beginner at Python and I have a dataframe with Country, avgTemp and year.
What I want to do is calculate new rows for each country where the year adds 20 and avgTemp is multiplied by a variable called tempChange. I don't want to remove the previous values though; I just want to append the new values.
This is how the dataframe looks:
Preferably I would also want to create a loop that runs the code a certain number of times.
Super grateful for any help!
If you need to copy the values from the dataframe as an example you can have it here:
Country avgTemp year
0 Afghanistan 14.481583 2012
1 Africa 24.725917 2012
2 Albania 13.768250 2012
3 Algeria 23.954833 2012
4 American Samoa 27.201417 2012
243 rows × 3 columns
If you want to repeat the rows, I'd create a new dataframe, perform any operation on the new dataframe (add 20 years, multiply the temperature by a constant or an array, etc.), and then use concat() to append it to the original dataframe:
import pandas as pd
tempChange=1.15
data = {'Country':['Afghanistan','Africa','Albania','Algeria','American Samoa'],'avgTemp':[14,24,13,23,27],'Year':[2012,2012,2012,2012,2012]}
df = pd.DataFrame(data)
df_2 = df.copy()
df_2['avgTemp'] = df['avgTemp']*tempChange
df_2['Year'] = df['Year']+20
df = pd.concat([df,df_2]) #ignore_index=True if you wish to not repeat the index value
print(df)
Output:
Country avgTemp Year
0 Afghanistan 14.00 2012
1 Africa 24.00 2012
2 Albania 13.00 2012
3 Algeria 23.00 2012
4 American Samoa 27.00 2012
0 Afghanistan 16.10 2032
1 Africa 27.60 2032
2 Albania 14.95 2032
3 Algeria 26.45 2032
4 American Samoa 31.05 2032
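The question also asks for a loop that runs this a certain number of times. A minimal sketch, reusing data and tempChange from above (n_steps is a hypothetical parameter):
# Project the original frame forward n_steps times, 20 years per step
base = pd.DataFrame(data)
n_steps = 3                   # hypothetical number of 20-year steps
frames = [base]
for _ in range(n_steps):
    step = frames[-1].copy()  # start from the most recent projection
    step['avgTemp'] = step['avgTemp'] * tempChange
    step['Year'] = step['Year'] + 20
    frames.append(step)
result = pd.concat(frames, ignore_index=True)
print(result)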
Where df is your dataframe name:
df['tempChange'] = df['year']+ 20 * df['avgTemp']
This will add a new column to your df with the logic above. I'm not sure I understood your logic correctly, so the math may need some work.
I believe that what you're looking for is
dfName['newYear'] = dfName.apply(lambda x: x['year'] + 20,axis=1)
dfName['tempDiff'] = dfName.apply(lambda x: x['avgTemp']*tempChange,axis=1)
This is how you apply a function to each row.
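As a side note, apply with a lambda works row by row, but these are plain column-wise operations, so the same result can usually be had more directly (a sketch reusing the names above):
dfName['newYear'] = dfName['year'] + 20
dfName['tempDiff'] = dfName['avgTemp'] * tempChange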

Concat 2 Dataframes with 54 entries yields 1 row

I have created 2 dataframes with a common index based on Year and District. There are 58 rows in each dataframe and the Year and District values are exact matches. Yet when I try to join them, I get a new dataframe with all of the columns combined (which is what I want) but only one single row: New York City. That row exists in both dataframes, as do all the rest, but only this one makes it to the merged DF. I have tried a few different methods of joining the dataframes but they all do the same thing. This example uses:
pd.concat([groupeddf, Popdf], axis=1)
This is the Popdf with (Year, District) as Index:
Population
Year District
2017 Albany 309612
Allegany 46894
Broome 193639
Cattaraugus 77348
Cayuga 77603
This is the groupeddf indexed on Year and District (some columns eliminated for clarity):
Total SNAP Households Total SNAP Persons
Year District
2017 Albany 223057 416302
Allegany 36935 69802
Broome 201586 363504
Cattaraugus 75567 144572
Cayuga 64168 121988
This is the merged DF after executing pd.concat([ groupeddf,Popdf], axis=1):
Population Total SNAP Households Total SNAP Persons
Year District
2017 New York City 8622698 11314598 19987958
This shows the merged dataframe has only 1 entry:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1 entries, (2017, New York City) to (2017, New York City)
Data columns (total 4 columns):
Population 1 non-null int64
Total SNAP Households 1 non-null int64
Total SNAP Persons 1 non-null int64
Total SNAP Benefits 1 non-null float64
dtypes: float64(1), int64(3)
memory usage: 170.0+ bytes
UPDATE: I tried another approach and it demonstrates that the indices, which appear identical to me, are not being seen as identical.
When I execute this code, I get duplicates instead of a merge:
combined_df = groupeddf.merge(Popdf, how='outer', left_index=True, right_index=True)
The results look like this:
Year District
2017 Albany 223057.0 416302.0
Albany NaN NaN
Allegany 36935.0 69802.0
Allegany NaN NaN
Broome 201586.0 363504.0
Broome NaN NaN
Cattaraugus 75567.0 144572.0
Cattaraugus NaN NaN
The only exception is New York City. That one does not duplicate, so it actually is seen as the same index. So there is something wrong with the data, but I am not sure what.
Did you try using merge, like this:
combined_df = pd.merge(groupeddf, Popdf, how='inner', on=['Year', 'District'])
I used an inner join since you want to combine only rows where the district and year exist in both dataframes. If you want to keep everything in the left dataframe but only matching rows from the right, do a left join, and so on.
It took a while but I finally sorted it out. The District name in the population dataframe had a trailing space, where there was no space in the SNAP df:
"Albany " vs "Albany"

Pandas: Delete rows of a DataFrame if total count of a particular column occurs only 1 time

I'm looking to delete rows of a DataFrame where the value in a particular column occurs only 1 time in total.
Example of raw table (values are arbitrary for illustrative purposes):
print(df)
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
9 Bolivia #McDonalds 3456
10 Kenya #Schools 3455
11 Ukraine #Cars 3456
12 US #Tshirts 3456789
Intended outcome:
print(df)
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
I know that df.Series.value_counts() > 1 will identify which df.Series values occur more than once, and that the result returned will look something like the following:
Population     True
GDP            True
#McDonalds    False
#Schools      False
#Cars         False
#Tshirts      False
I want to write something like the following, so that my new DataFrame drops the rows whose df.Series value occurs only 1 time, but this doesn't work:
df.drop(df.Series.value_counts()==1,axis=1,inplace=True)
You can do this by creating a boolean list/array, either with a list comprehension or by using pandas' string manipulation methods.
The list comprehension approach is:
vc = df['Series'].value_counts()
u = [i not in set(vc[vc==1].index) for i in df['Series']]
df = df[u]
The other approach is to use the str.contains method to check whether the values of the Series column contain a given string or match a given regular expression (used here, since we are matching against multiple strings):
vc = df['Series'].value_counts()
pat = r'|'.join(vc[vc==1].index) #Regular expression
df = df[~df['Series'].str.contains(pat)] #Tilde is to negate boolean
Using this regular expression approach is a bit more hackish and may require some extra processing (character escaping, etc.) of pat in case the strings you want to filter out contain regex metacharacters (which requires some basic regex knowledge). However, it's worth noting that this approach is about 4x faster than the list comprehension approach (tested on the data provided in the question).
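For example, re.escape can be used to neutralise any metacharacters before building the pattern (a sketch based on the code above):
import re

vc = df['Series'].value_counts()
pat = r'|'.join(re.escape(s) for s in vc[vc==1].index)
df = df[~df['Series'].str.contains(pat)]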
As a side note, I recommend avoiding using the word Series as a column name as that's the name of a pandas object.
This is an old question, but the current answer doesn't scale well to moderately large dataframes. A much faster and more "dataframe" way is to add a value-count column and filter on it.
Create the dataset:
df = pd.DataFrame({'Country': 'Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US'.split(),
                   'Series': 'Pop Pop Pop Pop GDP GDP GDP GDP McDonalds Schools Cars Tshirts'.split()})
Drop rows that have a count < 1 for the column ('Series' in this case):
# Group values for Series and add 'cnt' column with count
df['cnt'] = df.groupby(['Series'])['Country'].transform('count')
# Drop rows where the count is 1, and drop the helper 'cnt' column
df.drop(df[df.cnt==1].index)[['Country','Series']]
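A variant of the same idea that avoids the helper column, assuming the same df, is to filter on a per-group size computed with transform:
# Keep only rows whose 'Series' value occurs more than once
df[df.groupby('Series')['Series'].transform('size') > 1]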
