I have a pandas dataframe of US state temperature data that is grouped first by State and then by Year. I have already selected the first and last years of entries by subsetting the original dataframe. I want to create a new dataframe that shows the difference in AvgTemperature between the first year (1995) and the last year (2019) for all 50 states.
State     Year   AvgTemperature
Alabama   1995   63.66
Alabama   2019   66.32
Alaska    1995   35.97
...       ...    ...
I want a result that I can plot to show which states have changed the most over time, preferably simply with State as column 1 and Temperature_Change as column 2.
You can pivot, compute the difference, and plot it as a bar chart (note that recent pandas versions require keyword arguments for pivot):
(df.pivot(index='State', columns='Year', values='AvgTemperature')  # one column per year
 .diff(axis=1)   # per-row change between the year columns
 .iloc[:, -1]    # keep the last column, which now holds the 1995 -> 2019 change
 .rename('diff')
 .plot.bar()
)
NB. I used dummy data for Alaska in 2019.
Output: a bar chart of the temperature change for each state.
Try this:
# per state: AvgTemperature in the last year minus the first year
change = (
    df.sort_values(['State', 'Year'])
      .groupby('State')
      .apply(lambda g: g.iloc[-1]['AvgTemperature'] - g.iloc[0]['AvgTemperature'])
)
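This returns a Series indexed by State. To get the two-column layout the question asks for, a small follow-up sketch (the column name Temperature_Change is taken from the question):
change = change.reset_index(name='Temperature_Change')
change.plot.bar(x='State', y='Temperature_Change')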
I have a pandas dataframe like this:
Name Year Sales
Ann 2010 500
Ann 2011 500
Bob 2010 400
Bob 2011 700
Ed 2010 300
Ed 2011 300
I want to be able to combine the figures in the sales column for each name returning:
Name Sales
Ann 1000
Bob 1100
Ed 600
Perhaps I need a for loop to go through and combine the 2 values for both years and create a new column, but I'm not quite sure. Is there a pandas function that can help me with this?
That's a simple dataframe groupby.
In that case you'll just have to select the two columns you need
df = df[["Name", "Sales"]]
And then apply the groupby
df.groupby(["name"], as_index=False).sum()
By default the groupby will make the grouped by columns part of the index. If you want to keep them as colum you need to specify as_index=False
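Putting it together on the sample data, a minimal sketch:
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Ann", "Bob", "Bob", "Ed", "Ed"],
    "Year": [2010, 2011, 2010, 2011, 2010, 2011],
    "Sales": [500, 500, 400, 700, 300, 300],
})
result = df[["Name", "Sales"]].groupby("Name", as_index=False).sum()
print(result)
#   Name  Sales
# 0  Ann   1000
# 1  Bob   1100
# 2   Ed    600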
I have a pandas dataframe where the first 4 columns contain country information and the remaining columns hold the number of passengers per year, up to 2020. The dataframe has only one row, and I am trying to get the non-missing value closest to 2020.
Country Name     Country Code   1960   1961          1962
European Union   EUU            NaN    1.392831e+7   1.519181e+7
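A minimal sketch of one way to do this, assuming df is the question's dataframe, the year columns start after the 4 country-information columns, and "closest to 2020" means the most recent non-missing value:
row = df.iloc[0]        # the single row
yearly = row.iloc[4:]   # skip the 4 country-information columns
latest_value = yearly.dropna().iloc[-1]   # value of the latest year that is not NaN
latest_year = yearly.dropna().index[-1]   # that year's column label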
Every month I collect data that contains details of employees to be stored in our database.
I need to find a solution to compare the data stored in the previous month with the newly received data and return, in a new dataframe, every row in which any of the columns changed.
I would also need to know which columns changed in each row of this returned dataframe.
There are also some important details to mention:
Each column can also contain blank values in any of the dataframes;
The dataframes have the same column names but not necessarily the same data type;
The dataframes do not have the same number of rows necessarily;
If a row does not find its Index match, it should not be returned in the new dataframe;
The rows of the dataframes can be matched by a column named "Index"
So, for example, we would have these dataframes (just a slice of the real one, which has 63 columns):
df1:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack ax#i.com 01-01-2021
2 HR 7000 O'Donnel ay#i.com
3 MKT $7600 Maria d 30-06-2021
4 I'T 8000 Peter az#i.com 14-07-2021
df2:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack ax#i.com 01-01-2021
2 HR 7000 O'Donnel ay#i.com 01-01-2021
3 MKT 7600 Maria dy#i.com 30-06-2021
4 IT 8000 Peter az#i.com 14-07-2021
5 IT 9000 John NOT PROVIDED
6 IT 9900 John NOT PROVIDED
df3:
Index Department Salary Manager Email Start_Date
2 HR 7000 O'Donnel ay#i.com 01-01-2021
3 MKT 7600 Maria dy#i.com 30-06-2021
4 IT 8000 Peter az#i.com 14-07-2021
The differences in this example are:
Start date added in row of Index 2
Salary format corrected and email corrected for row Index 3
Department format corrected for row Index 4
What would be the best way to do this comparison?
I am not sure if there is an easy solution to understand what changed in each field but returning the dataframe with rows that had at least 1 change would be helpful.
Thank you for the support!
I think compare could do the trick: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html
But first you would need to align the rows between old and new dataframe via the index:
new_df_to_compare = new_df.loc[old_df.index]
When datatypes don't match, you would also need to align them:
new_df_to_compare = new_df_to_compare.astype(old_df.dtypes.to_dict())
Then compare should work just like this:
difference_df = old_df.compare(new_df_to_compare)
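Putting the steps together on the example above, a sketch that assumes the rows are matched on the "Index" column as in the question and drops rows without an Index match, per the requirements:
old = df1.set_index('Index')   # data stored from the previous month
new = df2.set_index('Index')   # newly received data

# keep only rows whose Index exists in both frames
common = old.index.intersection(new.index)
old, new = old.loc[common], new.loc[common]

# align dtypes so compare() does not complain about mismatched types
new = new.astype(old.dtypes.to_dict())

# keep only the rows and columns that differ; the 'self' level holds the
# old values and 'other' the new ones
difference_df = old.compare(new)

# per row, list which columns changed (unchanged cells come back as NaN)
changed = {
    idx: difference_df.loc[idx].dropna().index.get_level_values(0).unique().tolist()
    for idx in difference_df.index
}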
I am working on a Netflix dataset where some columns have comma-separated values.
I would like to have a count of shows released per country, but a single cell in the country column can hold several countries.
How do I split the data so there is one country per row? For example, if one show is released in 3 countries (Norway, Iceland, United States), the row should appear 3 times with a single country in the country column:
show_id   country
s5        Norway
s5        Iceland
NOTE: Using pandas
You can split the comma-separated string into a list and then apply explode to that column; stripping the leftover whitespace afterwards keeps the country names clean.
df['country'] = df['country'].str.split(',')  # comma-separated string -> list
df = df.explode('country')                    # one row per list element
df['country'] = df['country'].str.strip()     # drop spaces left after the commas
print(df)
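For example, on the question's sample row:
import pandas as pd

df = pd.DataFrame({
    'show_id': ['s5'],
    'country': ['Norway, Iceland, United States'],
})
df['country'] = df['country'].str.split(',')
df = df.explode('country')
df['country'] = df['country'].str.strip()
print(df)
#   show_id        country
# 0      s5         Norway
# 0      s5        Iceland
# 0      s5  United States
# note: explode repeats the original index; add .reset_index(drop=True) for 0, 1, 2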
One of the columns in the DataFrame is STNAME (state name). I want to create a pandas Series with index = STNAME and value = the number of entries in the DataFrame for that state. A sample of the output is shown below:
STNAME
Michigan 83
Arizona 15
Wisconsin 72
Montana 56
North Carolina 100
Utah 29
New Jersey 21
Wyoming 23
My current solution is the following, but it seems a bit clumsy due to the need to pick an arbitrary column, rename it, etc. I would like to know if there is a better way to do this.
import numpy as np

grouped = df.groupby('STNAME')
# Note: COUNTY is an arbitrary column name I picked from the dataframe
grouped_df = grouped['COUNTY'].agg(np.size)
grouped_df = grouped_df.rename('Num Counties')  # the result is a Series, so rename it (assigning .columns does nothing here)
You can achieve this using value_counts(). This function returns a pd.Series containing counts of unique values:
freq = df['STNAME'].value_counts()
The index will be STNAME and the value will be its frequency (the first element is the most frequently occurring one).
Note that NAs will be excluded by default.
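An equivalent via groupby, which returns the same counts sorted by state name rather than by frequency:
counts = df.groupby('STNAME').size()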