I have a dataframe with a MultiIndex (company, year), created by a groupby mean, that looks like this:
company  year  mean salary
ABC      2018         3000
         2019         3400
LOL      2018         1200
         2019         3500
I want to select the data that belongs to "LOL"; my desired outcome would be:
company  year  mean salary
LOL      2018         1200
         2019         3500
Is there a way to select only a certain group? I tried the .filter function on the dataframe, but I could only apply it to row values (e.g. lambda x: x > 1000), not to index values.
Any advice will be appreciated!
Use DataFrame.xs with drop_level=False to avoid dropping the first level:
df1 = df.xs('LOL', drop_level=False)
Or filter by first level with Index.get_level_values:
df1 = df[df.index.get_level_values(0) == 'LOL']
print(df1)
              mean salary
company year
LOL     2018         1200
        2019         3500
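As a minimal runnable sketch (the sample data is rebuilt here; the column name 'mean salary' follows the question):

import pandas as pd

# Rebuild the grouped frame with a (company, year) MultiIndex
idx = pd.MultiIndex.from_tuples(
    [('ABC', 2018), ('ABC', 2019), ('LOL', 2018), ('LOL', 2019)],
    names=['company', 'year'])
df = pd.DataFrame({'mean salary': [3000, 3400, 1200, 3500]}, index=idx)

# Cross-section keeps the 'company' level because of drop_level=False
print(df.xs('LOL', drop_level=False))

# A boolean mask on the first index level returns the same rows
print(df[df.index.get_level_values(0) == 'LOL'])

Both approaches return the same two rows; xs is terser, while get_level_values generalizes to arbitrary boolean conditions on a level.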
I'm working with panel data looking like this (only relevant columns included):
Ticker  Year  Account_number  Industry
AAA     2018  xxxx            Fossil
        2019  xxxx            Fossil
        2020  xxxx            Fossil
BBB     2018  yyyy            Materials
        2019  yyyy            Services
        2020  yyyy            Materials
CCC     2018  zzzz            Services
        2019  zzzz            Services
        2020  zzzz            Services
Tickers (level 0 of MultiIndex) are used to identify individual and unique units in the panel. Each unit is observed over 3 years (level 1 of MultiIndex).
When I groupby('Industry') I end up double-counting the units since the same ticker is associated with more than one industry (as with ticker 'BBB').
The goal is to identify and print the tickers having this issue, and to assign them to a single industry.
I'm thinking of some code that returns the ticker if the string in the industry column is not unique, so that I can manually change it later.
Thanks for your help!
PS This is my first question here so pls let me know if you want me to be more specific or show more details about the df
If all of the Industry values should be the same for each Ticker, then you should do this the other way round.
Instead of using groupby() on Industry, use groupby() on Ticker and keep only the groups for which grouped_df.Industry.nunique() > 1, as sketched below.
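A short sketch of that idea, assuming the (Ticker, Year) MultiIndex shown in the question (reset it first so Ticker is a regular column):

import pandas as pd

flat = df.reset_index()

# Number of distinct industries per ticker
n_industries = flat.groupby('Ticker')['Industry'].nunique()

# Tickers associated with more than one industry
problem_tickers = n_industries[n_industries > 1].index.tolist()
print(problem_tickers)  # ['BBB'] for the sample data

# One possible automatic fix: assign each ticker its most frequent industry
modal_industry = flat.groupby('Ticker')['Industry'].agg(lambda s: s.mode()[0])

The last line is only one option; you could equally print problem_tickers and fix the industries by hand, as the question suggests.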
I have a pandas dataframe like this:
Name Year Sales
Ann 2010 500
Ann 2011 500
Bob 2010 400
Bob 2011 700
Ed 2010 300
Ed 2011 300
I want to be able to combine the figures in the sales column for each name returning:
Name Sales
Ann 1000
Bob 1100
Ed 600
Perhaps I need a for loop to go through and combine the 2 values for both years and create a new column, but I'm not quite sure. Is there a pandas function that can help me with this?
That's a simple dataframe groupby.
In that case you'll just have to select the two columns you need
df = df[["Name", "Sales"]]
And then apply the groupby
df.groupby(["name"], as_index=False).sum()
By default, groupby makes the grouped-by columns part of the index. If you want to keep them as columns, you need to specify as_index=False.
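Putting it together as a runnable sketch (the sample data is rebuilt inline):

import pandas as pd

df = pd.DataFrame({'Name': ['Ann', 'Ann', 'Bob', 'Bob', 'Ed', 'Ed'],
                   'Year': [2010, 2011, 2010, 2011, 2010, 2011],
                   'Sales': [500, 500, 400, 700, 300, 300]})

# Drop Year, then sum Sales per Name; as_index=False keeps Name as a column
result = df[['Name', 'Sales']].groupby('Name', as_index=False).sum()
print(result)
#   Name  Sales
# 0  Ann   1000
# 1  Bob   1100
# 2   Ed    600

Note that column names are case-sensitive, so groupby('name') would raise a KeyError here.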
I have a database of all customer transactions at the company I work for.
ID  Payment  Amount  Month  Year
A   Inward   100     2      2005
A   Outward  200     2      2005
B   Inward   100     7      2017
I'm struggling to compute the sum/count of Amount per customer ID per month/year.
The only thing I've succeeded at is the sum/count of Amount per customer ID:
Combined = data.groupby("ID")["Amount"].sum().rename("Sum").reset_index()
Can you please let me know what the possible solutions are?
Thank you in advance!
You can pass a list of columns to groupby, like:
>>> df.groupby(['ID', 'Year', 'Month', 'Payment'])['Amount'].agg(['sum', 'count'])
                       sum  count
ID Year Month Payment
A  2005 2     Inward   100      1
              Outward  200      1
B  2017 7     Inward   100      1
To go further, make outward amounts negative before aggregating (np.where requires import numpy as np):
>>> df.assign(Amount=np.where(df['Payment'].eq('Outward'),
-df['Amount'], df['Amount'])) \
.groupby(['ID', 'Year', 'Month'])['Amount'].agg(['sum', 'count'])
               sum  count
ID Year Month
A  2005 2     -100      2
B  2017 7      100      1
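A self-contained version of both snippets, with the import the prompt examples assume:

import numpy as np
import pandas as pd

data = pd.DataFrame({'ID': ['A', 'A', 'B'],
                     'Payment': ['Inward', 'Outward', 'Inward'],
                     'Amount': [100, 200, 100],
                     'Month': [2, 2, 7],
                     'Year': [2005, 2005, 2017]})

# Sum and count per customer, year, month and payment direction
print(data.groupby(['ID', 'Year', 'Month', 'Payment'])['Amount'].agg(['sum', 'count']))

# Net inward against outward by flipping the sign of outward amounts
netted = (data.assign(Amount=np.where(data['Payment'].eq('Outward'),
                                      -data['Amount'], data['Amount']))
              .groupby(['ID', 'Year', 'Month'])['Amount'].agg(['sum', 'count']))
print(netted)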
So I have two dfs.
DF1
Superhero ID Superhero City
212121 Spiderman New york
364331 Ironman New york
678523 Batman Gotham
432432 Dr Strange New york
665544 Thor Asgard
123456 Superman Metropolis
555555 Nightwing Gotham
666666 Loki Asgard
Df2
SID Mission End date
665544 10/10/2020
665544 03/03/2021
212121 02/02/2021
665544 05/12/2020
212121 15/07/2021
123456 03/06/2021
666666 12/10/2021
I need to create a new df that summarizes how many heroes are in each city and in which quarter their missions will be complete. I can match each superhero (and their city) in df1 to the mission end dates in Df2 via 'Superhero ID' == 'SID'. Superhero IDs appear only once in df1 but can appear multiple times in Df2.
Ultimately I need a count for the total no. of heroes in the different cities (which I can do - see below) as well as how many heroes will be free per quarter.
These are the thresholds for the quarters
Quarter 1 – Apr, May, Jun
Quarter 2 – Jul, Aug, Sept
Quarter 3 – Oct, Nov, Dec
Quarter 4 – Jan, Feb, Mar
The following code tells me how many heroes are in each city:
df_Count = pd.DataFrame(df1.City.value_counts().reset_index())
Which produces:
City Count
New york 3
Gotham 2
Asgard 2
Metropolis 1
I can also convert the dates into datetime format via the following operation:
#Convert to datetime series
Df2['Mission End date'] = pd.to_datetime(Df2['Mission End date'], dayfirst=True)  # dates like 15/07/2021 are day-first
Ultimately I need a new df that looks like this
City Total Count No. of heroes free in Q3 No. of heroes free in Q4 Free in Q1 2021+
New york 3 2 0 1
Gotham 2 2 2 0
Asgard 2 1 2 0
Metropolis 1 0 0 1
If anyone can help me create the appropriate quarters and sort the heroes into the appropriate columns, I'd be extremely grateful. I'd also like a way to handle heroes with multiple mission end dates; I can't ignore them, I still need to count them. I suspect I'll need a custom function applied to each row via the apply() method and a lambda expression. This issue has been a pain for a while, so I'd appreciate any help I can get. Thank you very much :)
After merging your dataframes with
df = df1.merge(df2, left_on='Superhero ID', right_on='SID')
convert your date column to datetime format (the column is 'Mission End date', the dates are day-first, and assign returns a new frame, so store the result):
df = df.assign(mission_end_date=lambda x: pd.to_datetime(x['Mission End date'], dayfirst=True))
You can then create two columns, one extracting the quarter and one extracting the year of the newly created datetime column (note that dt.quarter returns calendar quarters, Q1 = Jan-Mar; the sketch at the end maps months onto the April-based quarters from the question):
df = (df.assign(quarter_end_date=lambda x: x.mission_end_date.dt.quarter)
        .assign(year_end_date=lambda x: x.mission_end_date.dt.year))
And combine them into a column that shows the quarter in a Qx, yyyy format (an f-string with int() fails on whole columns, so build the strings vectorized):
df = df.assign(quarter_year_end=lambda x: 'Q' + x.quarter_end_date.astype(str) + ', ' + x.year_end_date.astype(str))
Finally, group by the city and quarter, count the number of superheroes, and pivot the dataframe to get your desired result:
(df.groupby(['City', 'quarter_year_end'])['Superhero']
   .count()
   .reset_index()
   .pivot(index='City', columns='quarter_year_end', values='Superhero'))
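For completeness, a hedged end-to-end sketch that also uses the asker's fiscal quarters (Q1 = Apr-Jun, ..., Q4 = Jan-Mar) and adds the total count per city; the data is rebuilt from the question, and cities whose heroes have no missions are kept via reindex:

import pandas as pd

df1 = pd.DataFrame({
    'Superhero ID': [212121, 364331, 678523, 432432, 665544, 123456, 555555, 666666],
    'Superhero': ['Spiderman', 'Ironman', 'Batman', 'Dr Strange',
                  'Thor', 'Superman', 'Nightwing', 'Loki'],
    'City': ['New york', 'New york', 'Gotham', 'New york',
             'Asgard', 'Metropolis', 'Gotham', 'Asgard']})
df2 = pd.DataFrame({
    'SID': [665544, 665544, 212121, 665544, 212121, 123456, 666666],
    'Mission End date': ['10/10/2020', '03/03/2021', '02/02/2021',
                         '05/12/2020', '15/07/2021', '03/06/2021', '12/10/2021']})

df = df1.merge(df2, left_on='Superhero ID', right_on='SID')
end = pd.to_datetime(df['Mission End date'], dayfirst=True)

# Shift months so April starts the year: Apr-Jun -> Q1, ..., Jan-Mar -> Q4
df['quarter'] = ('Q' + (((end.dt.month - 4) % 12) // 3 + 1).astype(str)
                 + ' ' + end.dt.year.astype(str))

out = (df.groupby(['City', 'quarter'])['Superhero'].count()
         .reset_index()
         .pivot(index='City', columns='quarter', values='Superhero')
         .reindex(df1['City'].unique())   # keep cities with no missions
         .fillna(0).astype(int))
out.insert(0, 'Total Count', df1.groupby('City')['Superhero'].count())
print(out)

Heroes with several mission end dates are counted once per mission, which matches the "free per quarter" reading; deduplicate with drop_duplicates(['Superhero ID', 'quarter']) first if you want one count per hero per quarter instead.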
I have a dataframe like this with more than 50 columns (years from 1963 to 2016). I want to select all countries with a population over a certain number (say 60 million). All the questions I found were about picking values from a single column, which is not the case here. I also tried
df[df.T[(df.T > 0.33)].any()] as suggested in an answer; it doesn't work. Any ideas?
The data frame looks like this:
Country Country_Code Year_1979 Year_1999 Year_2013
Aruba ABW 59980.0 89005 103187.0
Angola AGO 8641521.0 15949766 25998340.0
Albania ALB 2617832.0 3108778 2895092.0
Andorra AND 34818.0 64370 80788.0
First select only the columns with 'Year' in their names using DataFrame.filter, compare all values against the threshold, and then use DataFrame.any to test for at least one match per row:
df1 = df[(df.filter(like='Year') > 2000000).any(axis=1)]
print(df1)
Country Country_Code Year_1979 Year_1999 Year_2013
1 Angola AGO 8641521.0 15949766 25998340.0
2 Albania ALB 2617832.0 3108778 2895092.0
Or compare all columns except the first two, selected by position with DataFrame.iloc:
df1 = df[(df.iloc[:, 2:] > 2000000).any(axis=1)]
print(df1)
Country Country_Code Year_1979 Year_1999 Year_2013
1 Angola AGO 8641521.0 15949766 25998340.0
2 Albania ALB 2617832.0 3108778 2895092.0
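A runnable sketch with the sample rows (the threshold of 2,000,000 follows the answer; swap in 60_000_000 for the question's figure):

import pandas as pd

df = pd.DataFrame({'Country': ['Aruba', 'Angola', 'Albania', 'Andorra'],
                   'Country_Code': ['ABW', 'AGO', 'ALB', 'AND'],
                   'Year_1979': [59980.0, 8641521.0, 2617832.0, 34818.0],
                   'Year_1999': [89005, 15949766, 3108778, 64370],
                   'Year_2013': [103187.0, 25998340.0, 2895092.0, 80788.0]})

threshold = 2_000_000

# Keep rows where any Year_* column exceeds the threshold
print(df[(df.filter(like='Year') > threshold).any(axis=1)])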