I have a housing market dataset categorized by U.S. county, with columns such as total_homes_sold. I'm trying to compare housing sales year over year (e.g. Jan 2020 vs. Jan 2019) and by county (e.g. Aberdeen Mar 2020 vs. Suffolk Mar 2020). However, I'm not sure how to group the dates, as they are not organized by month (Jan, Feb, Mar, etc.) but rather by 4-week intervals: period_begin and period_end.
The intervals also vary between years. The period around January for Aberdeen might run 1/7 to 2/3 in 2019 but 1/6 to 2/2 in 2020 (image shown below).
I tried using a cumulative count (code below) to label each 4-week period with a number, thinking I could compare Aberdeen 2017-1 to Aberdeen 2020-1 (1 being the first interval), but I realized that, for some regions, some years contain more 4-week periods than others (2017 has 13 whereas 2018 has 14).
df['count'] = df.groupby((df['region_name'] != df['region_name'].shift(1)).cumsum()).cumcount() + 1
Any ideas on what code I could use to closely categorize these two columns into month-like periods?
Snippet of Dataset here
Let me know if you have any questions. Not sure I made sense! Thanks.
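A minimal sketch of one way to do this, assuming period_begin and period_end parse as dates and the column names below match your data: label each 4-week window with the calendar month of its midpoint, then group on that label so the same month lines up across years even though the window boundaries shift.

import pandas as pd

# Assumed columns: region_name, period_begin, period_end, total_homes_sold
df['period_begin'] = pd.to_datetime(df['period_begin'])
df['period_end'] = pd.to_datetime(df['period_end'])

# The calendar month of each window's midpoint is a stable month-like label
midpoint = df['period_begin'] + (df['period_end'] - df['period_begin']) / 2
df['month'] = midpoint.dt.to_period('M')

monthly = (df.groupby(['region_name', 'month'])['total_homes_sold']
             .sum()
             .reset_index())

From monthly you can then compare, say, Aberdeen 2020-01 with Aberdeen 2019-01, or Aberdeen 2020-03 with Suffolk 2020-03.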
Suppose I have a data set containing job titles and salaries over the past three years, and I want to calculate the difference in average salary from the first year to the last.
Using pandas, how exactly would I go about doing that? I've managed to create a df with the average salary for each job title and year, but what I'm trying to do is say "for Data Scientist, subtract the 2020 average salary from the 2022 average salary" and repeat the same thing for every job_title.
work_year job_title salary_in_usd
0 2020 AI Scientist 45896.000000
1 2020 BI Data Analyst 98000.000000
2 2020 Big Data Engineer 97690.333333
3 2020 Business Data Analyst 117500.000000
4 2020 Computer Vision Engineer 60000.000000
.. ... ... ...
93 2022 Machine Learning Scientist 141766.666667
94 2022 NLP Engineer 37236.000000
95 2022 Principal Data Analyst 75000.000000
96 2022 Principal Data Scientist 162674.000000
97 2022 Research Scientist 105569.000000
Create a function which does the thing you want on each group:
def first_to_last_year_diff(df):
    # Average salary in the latest year minus the average salary in the earliest year
    diff = (
        df.loc[df.work_year == df.work_year.max(), "salary_in_usd"].iloc[0]
        - df.loc[df.work_year == df.work_year.min(), "salary_in_usd"].iloc[0]
    )
    return diff
Then group on job title and apply your function:
df.groupby("job_title").apply(first_to_last_year_diff)
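An alternative sketch, assuming the frame already holds exactly one row per job title and year (as in the printout above): pivot the years into columns and subtract.

# Wide table: one row per job_title, one column per work_year
wide = df.pivot(index="job_title", columns="work_year", values="salary_in_usd")

# Last year minus first year; NaN where a title is missing one of the years
diff = wide[wide.columns.max()] - wide[wide.columns.min()]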
I know the year-on-year inflation rates for the past 5yrs. But I want to derive another column containing compounded inflation relative to the current year.
To illustrate, I have the below table where compound_inflation_to_2022 is the product of all yoy_inflation instances from each year prior to 2022.
So, for 2021 this is simply 2021's yoy_inflation rate.
For 2020 the compound rate is 2020 x 2021.
For 2019 the compound rate is 2019 x 2020 x 2021, and so on.
year  yoy_inflation  compound_inflation_to_2022
2021  1.048          1.048
2020  1.008          1.056
2019  1.014          1.071
2018  1.02           1.093
2017  1.027          1.122
2016  1.018          1.142
Does anyone have an elegant solution for calculating this compound inflation column in python?
So Pandas DataFrame has this feature called .cumprod() and I think it can be of utmost help to you.
df['compound_inflation_to_2022'] = df['yoy_inflation'].cumprod()
I hope this was what you were looking for ^_^
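One caveat: .cumprod() only reproduces the column above because the rows are already sorted by year in descending order. A sketch that makes the ordering explicit (assuming the columns are named year and yoy_inflation):

# Sort newest-first so the running product accumulates back through earlier years
df = df.sort_values("year", ascending=False)
df["compound_inflation_to_2022"] = df["yoy_inflation"].cumprod()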
I am trying to build charts from flight accident data. The CSV file contains the airline, the year of the accident, and a bunch of other columns. I want to add up all the incidents by year, and in a second chart, by year and by airline:
First chart desirable outcome:
year  incidents
2012  11
2013  12
Second chart desirable outcome:
year  incidents  Airline
2011  23         United
2011  20         Hawaii
2011  30         United
I tried to use dt.year, but it isn't working, because the year column in the CSV is already a plain number like 2018 or 2019 rather than a full date like 2018-10-12, so I can't treat it as datetime information.
Try:
import matplotlib.pyplot as plt

# Incidents per year (sort by year so the x-axis is in chronological order)
df.value_counts('year').sort_index().plot()

# Incidents per year, for each airline
df.value_counts(['year', 'Airline']).unstack('Airline').plot(kind='bar')
plt.show()
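If you also want the counts as tables rather than plots, a sketch using groupby (assuming the columns are named year and Airline):

# Incidents per year
per_year = df.groupby('year').size().reset_index(name='incidents')

# Incidents per year and airline
per_year_airline = (df.groupby(['year', 'Airline'])
                      .size()
                      .reset_index(name='incidents'))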
I have a dataset with sales per customer, per month. I have both a date field (e.g. June 2018) and a "month counter" which gives each month a progressive number (e.g., if data starts in Jan 2018, Jan 2018 is "1", Dec 2018 is "12", and Jan 2019 is "13").
Please see the image; the first 4 columns are a sample of the data I have.
I'd like, for each month and each customer, to sum the sales of the previous 6 months and of the next 6 months, like in the last 2 columns of the attached image.
For instance, for month 1 and customer "John", I'd like to sum the sales for months 2, 3, 4, 5, 6, and 7, looking only at "John"; this would be the "next 6 months sales" for John in month 1. The reverse logic applies to the "last 6 months sales".
I tried writing a for loop and some functions, but I didn't manage to build anything like what I need.
data
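A sketch of one approach, assuming hypothetical column names customer, month_counter, and sales, with at most one row per customer per month: reindex each customer onto a complete month range, then take 6-month rolling sums forwards and backwards, shifted by one month so the current month is excluded.

import pandas as pd

def add_window_sums(g):
    # One sales value per month for this customer, on a complete month range
    s = g.set_index("month_counter")["sales"]
    full = s.reindex(range(s.index.min(), s.index.max() + 1), fill_value=0)

    # Previous 6 months: 6-month rolling sum, shifted so it ends at the prior month
    prev6 = full.rolling(6, min_periods=1).sum().shift(1, fill_value=0)
    # Next 6 months: same idea on the reversed series, then reversed back
    next6 = full[::-1].rolling(6, min_periods=1).sum().shift(1, fill_value=0)[::-1]

    g = g.copy()
    g["last_6_months_sales"] = g["month_counter"].map(prev6)
    g["next_6_months_sales"] = g["month_counter"].map(next6)
    return g

df = df.groupby("customer", group_keys=False).apply(add_window_sums)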
I am analyzing whether there is a meaningful difference between June and December temperatures in Hawaii. I first calculated the average June temperature across all stations for each available year in the dataset, and did the same for December. Now I have average temperatures for both June and December for the years 2010-2017, shown below:
The average temperature in June 2010 is 74.9908 F
The average temperature in June 2011 is 73.9024 F
The average temperature in June 2012 is 74.0888 F
The average temperature in June 2013 is 74.6405 F
The average temperature in June 2014 is 75.0717 F
The average temperature in June 2015 is 75.0356 F
The average temperature in June 2016 is 75.1348 F
———————————————————————————
The average temperature in December 2010 is 73.125 F
The average temperature in December 2011 is 68.75 F
The average temperature in December 2012 is 70.1667 F
The average temperature in December 2013 is 73.1667 F
The average temperature in December 2014 is 71.625 F
The average temperature in December 2015 is 73.6 F
The average temperature in December 2016 is 73.7143 F
I now have to use a t-test to determine whether the difference in the means, if any, is statistically significant. Will I use a paired t-test, or an unpaired t-test? Why?
I am unclear on whether to use a paired or unpaired t-test. I know a paired test should be used for the same subjects measured at different times (e.g. rat tumor size before and after treatment). However, I am confused because here the temperature is measured at two different times (June and December) at the same location (the average recorded across all stations in Hawaii). Which t-test should I use for this example, and why? Thank you.
This is more of a question for Cross Validated, the exchange for statistics.
That being said, at first glance either type of t-test would suffice for a simple analysis. A paired t-test (pairing June and December by year) would additionally account for year-to-year variation in temperatures, at least to some degree, since average temperatures and years are surely correlated.
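For reference, a minimal sketch of both tests in Python, using the yearly averages listed in the question (scipy's ttest_rel for the paired version, ttest_ind for the unpaired one); the pairing matches each year's June average with the same year's December average:

from scipy import stats

# Yearly average temperatures (F) from the question, 2010-2016
june = [74.9908, 73.9024, 74.0888, 74.6405, 75.0717, 75.0356, 75.1348]
december = [73.1250, 68.7500, 70.1667, 73.1667, 71.6250, 73.6000, 73.7143]

# Paired: each year's June and December averages form a matched pair
print(stats.ttest_rel(june, december))

# Unpaired (Welch's): the two sets of yearly averages are treated as independent samples
print(stats.ttest_ind(june, december, equal_var=False))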