Suppose I have a data set containing job titles and salaries over the past three years, and I want to calculate the difference in average salary from the first year to the last.
Using Pandas, how exactly would I go about doing that? I've managed to create a df with the average salaries for each year, but what I'm trying to do is say "for data scientist, subtract the 2020 average salary from the 2022 average salary" and iterate through all job_titles doing the same thing.
work_year job_title salary_in_usd
0 2020 AI Scientist 45896.000000
1 2020 BI Data Analyst 98000.000000
2 2020 Big Data Engineer 97690.333333
3 2020 Business Data Analyst 117500.000000
4 2020 Computer Vision Engineer 60000.000000
.. ... ... ...
93 2022 Machine Learning Scientist 141766.666667
94 2022 NLP Engineer 37236.000000
95 2022 Principal Data Analyst 75000.000000
96 2022 Principal Data Scientist 162674.000000
97 2022 Research Scientist 105569.000000
Create a function which does the thing you want on each group:
def first_to_last_year_diff(df):
    # Average salary in the latest year minus the earliest year
    last = df.loc[df.work_year == df.work_year.max(), "salary_in_usd"].iloc[0]
    first = df.loc[df.work_year == df.work_year.min(), "salary_in_usd"].iloc[0]
    return last - first
Then group on job title and apply your function:
df.groupby("job_title").apply(first_to_last_year_diff)
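If you'd rather avoid apply, the same difference can be read off a pivot. A minimal sketch, assuming the averaged frame shown above with work_year stored as an integer:

# Years become columns, one row per job title
wide = df.pivot(index="job_title", columns="work_year", values="salary_in_usd")
diff = wide[2022] - wide[2020]  # NaN where a title lacks one of the years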
I am trying to carry out what I thought would be a typical groupby and average problem on a DataFrame, but this has gotten a bit more complex than I had anticipated, since the problem deals with string/ordinal years and float values. I am using Python. I will explain below.
I have a data frame showing different model years for different models of refrigerators across several counties in a state. I want to find the average model year of refrigerator for each county.
I have this example dataframe (abbreviated since the full dataframe would be far too long to show):
County_ID Type Year Population
--------------------------------------------
1 A 2022 54355
1 A 2021 54645
1 A 2020 14554
...
1 B 2022 23454
1 B 2021 34657
1 B 2020 12343
...
1 C 2022 23454
1 C 2021 34537
1 C 2020 23323
...
2 A 2022 54355
2 A 2021 54645
2 A 2020 14554
...
2 B 2022 23454
2 B 2021 34657
2 B 2020 12343
...
2 C 2022 23454
2 C 2021 34537
2 C 2020 23323
...
3 A 2022 54355
3 A 2021 54645
3 A 2020 14554
...
3 B 2022 23454
3 B 2021 34657
3 B 2020 12343
...
3 C 2022 23454
3 C 2021 34537
3 C 2020 23323
...
I kept this abbreviated for space, but the full data covers many counties, with county IDs running from 1 all the way to 50. In this example there are 3 types of refrigerators, and for each type the model year vintages are shown, i.e. how old the refrigerator is. Population shows how many of each physical unit (a unique pair of type and year) is found in each county. What I am trying to find is, for each County ID, the average year.
And so I want to produce the following DataFrame:
County_ID Average_vintage
--------------------------------
1 XXXX.XX
2 XXXX.XX
3 XXXX.XX
4 XXXX.XX
5 XXXX.XX
6 XXXX.XX
...
But here is why this is confusing me: I want to find the average year, but year is ordinal data, not float, so I am a bit confused conceptually. What I want to do, I think, is weight by population. If you want the average vintage of refrigerators, a vintage with a higher population should have more influence on that average. So I want to weight the vintages by population and treat the years like floats, so the average can carry a decimal, e.g. "the average refrigerator vintage for County 22 is 2015.48" or something like that. That is what I am trying to go for. I am trying this:
avg_vintage = df.groupby(['County_ID']).mean()
but I don't think this makes much sense, since I need to account for how many of each refrigerator (population) there actually are in each county. How can I find the population-weighted average year/vintage for each county in Python?
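A population-weighted mean can be computed directly: multiply each year by its population, sum within the county, and divide by the county's total population. A minimal sketch, assuming the columns are named County_ID, Year, and Population as above, with Year convertible to float:

avg_vintage = (
    df.groupby("County_ID")
      .apply(lambda g: (g["Year"].astype(float) * g["Population"]).sum()
                       / g["Population"].sum())
      .rename("Average_vintage")
      .reset_index()
)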
Hi, I'm new to pandas and struggling with a challenging problem.
I have 2 dataframes:
Df1
Superhero ID Superhero City
212121 Spiderman New york
364331 Ironman New york
678523 Batman Gotham
432432 Dr Strange New york
665544 Thor Asgard
123456 Superman Metropolis
555555 Nightwing Gotham
666666 Loki Asgard
And
Df2
SID Mission End date
665544 10/10/2020
665544 03/03/2021
212121 02/02/2021
665544 05/12/2020
212121 15/07/2021
123456 03/06/2021
666666 12/10/2021
I need to create a new df that summarizes how many heroes are in each city and in which quarter their missions will be complete. Also note the dates are written in European format (day/month/year).
I am able to summarize how many heroes are in each city with the line:
df_Count = pd.DataFrame(df1.City.value_counts().reset_index())
Which gives me :
City Count
New york 3
Gotham 2
Asgard 2
Metropolis 1
I need to add more columns that list whether each hero will be free from missions in certain quarters.
Quarter 1 – Apr, May, Jun
Quarter 2 – Jul, Aug, Sept
Quarter 3 – Oct, Nov, Dec
Quarter 4 – Jan, Feb, Mar
If a hero ID in Df2 does not have a mission end date, the count should increase by one. If they do have an end date, they are counted under the quarter that end date falls into.
So in the end it should look like this:
City Total Count No. of heroes free in Q3 No. of heroes free in Q4 Free in Q1 2021+
New york 3 2 0 1
Gotham 2 2 2 0
Asgard 2 1 2 0
Metropolis 1 0 0 1
I think I need to use the Python datetime library to get the current date and time, then create a custom function which I can apply to each row using a lambda. Something similar to the code below:
from datetime import date
import pandas as pd

today = date.today()

# Quarter boundary dates, parsed from the day/month/year strings
q1 = pd.to_datetime('05/04/2021', dayfirst=True)
q3 = pd.to_datetime('05/10/2020', dayfirst=True)
q4 = pd.to_datetime('05/01/2021', dayfirst=True)

def quarter_count(end_date):
    # Classify one mission end date into the quarter when the hero is free
    if end_date < q3:
        return 'Q3'
    elif q3 < end_date < q4:
        return 'Q4'
    elif end_date > q1:
        return 'Q1 2021+'
    return None

# Parse the European-format dates, then classify each mission
df2['Mission End date'] = pd.to_datetime(df2['Mission End date'], dayfirst=True)
df2['free_in'] = df2['Mission End date'].apply(quarter_count)
Please help me correct my syntax or logic or let me know if there is a better way to do this. Learning pandas is challenging but oddly fun. I'd appreciate any help you can provide :)
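For the final table, one possible sketch, assuming the corrected quarter_count above and the df1/df2 frames from the question: take each hero's latest mission end date, merge it into df1, classify it, and pivot by city. Treating heroes with no recorded missions as already free is an assumption here.

# Latest mission end date per hero; heroes absent from df2 get NaT
last_end = df2.groupby('SID')['Mission End date'].max()

merged = df1.merge(last_end.rename('last_end'),
                   left_on='Superhero ID', right_index=True, how='left')

# Heroes with no recorded missions count as free already (assumption)
merged['free_in'] = merged['last_end'].apply(
    lambda d: 'Q3' if pd.isna(d) else quarter_count(d))

summary = merged.pivot_table(index='Superhero City', columns='free_in',
                             aggfunc='size', fill_value=0)
summary['Total Count'] = df1.groupby('Superhero City').size()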
I have a housing market dataset categorized by U.S. counties with columns such as total_homes_sold. I'm trying to show a comparison between housing sales YoY (e.g. Jan 2020 vs. Jan 2019) and by county (e.g. Aberdeen Mar 2020 vs. Suffolk Mar 2020). However, I'm not sure how to group the dates, as they are not sorted by months (Jan, Feb, Mar etc.) but rather by 4-week intervals: period_begin and period_end.
Intervals between years vary. The period_begin for Aberdeen (around Jan) for 2019 might be 1/7 to 2/3 but 1/6 to 2/2 for 2020.
I tried using a count (code below) to label each 4-week period as a number, thinking I could compare Aberdeen 2017-1 to Aberdeen 2020-1 (1 coded as the first time interval), but realized that some years for some regions have more 4-week periods than others (2017 has 13 whereas 2018 has 14).
df['count'] = df.groupby((df['region_name'] != df['region_name'].shift(1)).cumsum()).cumcount() + 1
Any ideas on what code I could use to closely categorize these two columns into month-like periods?
Let me know if you have any questions. Not sure I made sense! Thanks.
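One way to get month-like labels, as a sketch assuming period_begin and period_end are parseable date strings and the frame has region_name and total_homes_sold as described: label each 4-week interval by the calendar month containing its midpoint, so the ~January period of 2019 and the ~January period of 2020 get the same label even though their boundaries differ.

import pandas as pd

df["period_begin"] = pd.to_datetime(df["period_begin"])
df["period_end"] = pd.to_datetime(df["period_end"])

# The midpoint of each 4-week interval decides its month label
midpoint = df["period_begin"] + (df["period_end"] - df["period_begin"]) / 2
df["year"] = midpoint.dt.year
df["month"] = midpoint.dt.month

# YoY comparison: group by region and month, then compare across years
yoy = df.groupby(["region_name", "year", "month"])["total_homes_sold"].sum()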
I have a dataframe that includes the category of a project, currency, number of investors, goal, etc., and I want to create a new column which will be "average success rate of their category":
state category main_category currency backers country \
0 0 Poetry Publishing GBP 0 GB
1 0 Narrative Film Film & Video USD 15 US
2 0 Narrative Film Film & Video USD 3 US
3 0 Music Music USD 1 US
4 1 Restaurants Food USD 224 US
usd_goal_real duration year hour
0 1533.95 59 2015 morning
1 30000.00 60 2017 morning
2 45000.00 45 2013 morning
3 5000.00 30 2012 morning
4 50000.00 35 2016 afternoon
I have the average success rates in series format:
Dance 65.435209
Theater 63.796134
Comics 59.141527
Music 52.660558
Art 44.889045
Games 43.890467
Film & Video 41.790649
Design 41.594386
Publishing 34.701650
Photography 34.110847
Fashion 28.283186
Technology 23.785582
And now I want to add a new column, where each row will have the success rate matching its category, i.e. wherever the row is Technology, the new column will contain 23.78 for that row.
df['category_success_rate'] = ...  # I want this column to be the % success rate that matches the category in the "main_category" column
I think you need GroupBy.transform with a Boolean mask, df['state'].eq(1) or (df['state'] == 1):
df['category_success_rate'] = (df['state'].eq(1)
                               .groupby(df['main_category'])
                               .transform('mean') * 100)
Alternative:
df['category_success_rate'] = ((df['state'] == 1)
                               .groupby(df['main_category'])
                               .transform('mean') * 100)
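Alternatively, since you already have the averages as a Series indexed by category (as shown in the question), Series.map can look them up directly. A sketch, where success_rates is an assumed name for that Series:

# Look up each row's main_category in the precomputed Series
df['category_success_rate'] = df['main_category'].map(success_rates)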
I have a text file like this:
PERSONAL INFORMATION
First Name: Michael
Last Name: Junior
Birth Date: May 17, 1999
Location: Whitehurst Hall 301. City: Stillwater. State: OK
Taken on July 8, 2000 10:50:30 AM MST
WORK EXPERIENCE
Work type select Part-time
ID number 10124
Company name ABCDFG Inc.
Positions Software Engineer/Research Scientist
Data Analyst/Scientist
As you can see, the first column contains feature names and the second column contains values. I read it using this code:
import pandas as pd
import numpy as np
import scipy as sp
df=pd.read_table('personal.txt',skiprows=1)
pd.set_option('display.max_colwidth',10000)
pd.set_option('display.max_rows',1000)
df
But it merges columns and outputs:
PERSONAL INFORMATION
0 First Name: Michael
1 Last Name: Junior
2 Birth Date: May 17, 1999
3 Location: Whitehurst Hall 301. City: Stillwater. State: OK
4 Taken on July 8, 2000 10:50:30 AM MST
5 WORK EXPERIENCE
6 Work type select Part-time
7 ID number 10124
8 Company name ABCDFG Inc.
9 Positions Software Engineer/Research Scientist
10 Data Analyst/Scientist
I also need to skip the section titles PERSONAL INFORMATION and WORK EXPERIENCE. How can I read the file so that the results come out properly in two columns?
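read_table has no delimiter that fits this layout, so one approach is to read the file manually: skip the section titles, then split each line into a (feature, value) pair. A sketch; the split rule is an assumption — split on the first colon when one is present, otherwise keep the whole line as the value, since lines like "Work type select Part-time" have no reliable separator.

import pandas as pd

SECTION_TITLES = {"PERSONAL INFORMATION", "WORK EXPERIENCE"}

rows = []
with open("personal.txt") as f:
    for raw in f:
        line = raw.strip()
        if not line or line in SECTION_TITLES:
            continue  # skip blank lines and section titles
        if ":" in line:
            # "First Name: Michael" -> ("First Name", "Michael")
            feature, value = line.split(":", 1)
        else:
            # No colon: no reliable separator, keep the whole line
            feature, value = "", line
        rows.append((feature.strip(), value.strip()))

df = pd.DataFrame(rows, columns=["feature", "value"])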