Using a function to do a %change on a dataset - python

I am using pandas in Python and have run into an issue. My dataset has countries as the columns and dates (months) as the rows. The data is the population of an item.
I need to calculate the % change in population month by month. Is there a function I can use to get a dataset with the month-by-month % change in the format attached?
I am trying to apply a function to the dataset, but getting the function to retrieve the previous month's population for the % change is the problem.
Does anyone have a good idea how to get this done? Thanks

You can use pct_change:
df.pct_change()
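As a rough illustration, assuming a small made-up frame with months on the rows and two hypothetical country columns (US and UK, not taken from the question's data), pct_change compares each row with the row above it:

```python
import pandas as pd

# Hypothetical data: months on the rows, countries on the columns.
df = pd.DataFrame(
    {"US": [100.0, 110.0, 121.0], "UK": [50.0, 55.0, 44.0]},
    index=pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
)

# Fractional change from the previous row, column by column.
# The first row has no predecessor, so it comes out as NaN.
monthly_change = df.sort_index().pct_change()

# Multiply by 100 to express the change in percent.
monthly_change_pct = monthly_change * 100
print(monthly_change_pct)
```

The result keeps the same shape as the input, so every country gets its own column of month-by-month changes.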

First order the data by month (if it isn't already), and then use the .shift() function for pandas dataframes. The change is divided by the previous month's value:
df['pct_change'] = (df.US - df.US.shift(1)) / df.US.shift(1)
.shift() allows you to shift rows up or down depending on the argument.
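To make the shift idea concrete, here is a minimal sketch on made-up numbers (the US column and its values are assumptions). It computes the change relative to the previous month by hand and sets it next to the built-in pct_change for comparison:

```python
import pandas as pd

# Hypothetical monthly population figures for a single column.
df = pd.DataFrame({"US": [100.0, 110.0, 121.0]})

# shift(1) moves values down one row, so each row sees the previous month's value.
previous = df["US"].shift(1)

# Percent change relative to the previous month.
manual = (df["US"] - previous) / previous

# The built-in method produces the same numbers (up to floating-point rounding).
builtin = df["US"].pct_change()
print(pd.concat({"previous": previous, "manual": manual, "builtin": builtin}, axis=1))
```

A negative argument, such as shift(-1), moves values up instead, which gives each row the following month's value.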

Related

pandas computing new column as a average of other two conditions

I have a dataset of temperatures. Each line describes the temperature in Celsius measured for one hour of a day.
I need to compute a new variable called avg_temp_ar_mensal, which represents the average temperature of a city in a month. In this dataset the city is represented as estacao and the month as mes.
I'm trying to do this using pandas. The following line of code is the one I'm trying to use to solve this problem:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes', 'estacao']).mean()
The goal of this code is to store the average temperature per city and month in a new column, but it doesn't work. If I try the following line of code:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes']).mean()
It works, but the result is wrong. It calculates the mean across every city in the dataset, which I don't want because it adds noise to my data. I need to separate the temperatures by month and city and then calculate the mean.
The DataFrame returned by groupby is smaller than the initial DataFrame, which is why your code runs into an error.
There are two ways to solve this problem. The first is to use transform:
df.groupby(['mes', 'estacao'])['temp_ar'].transform(lambda g: g.mean())
The second is to create a new dfn from groupby then merge back to df
dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left')
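Both approaches can be compared side by side. The sketch below uses a tiny made-up frame (the station names and temperatures are assumptions) and passes the string "mean" to transform, which is equivalent to the lambda here:

```python
import pandas as pd

# Hypothetical hourly readings: two stations (estacao), two months (mes).
df = pd.DataFrame({
    "estacao": ["A", "A", "A", "B", "B", "B"],
    "mes": [1, 1, 2, 1, 1, 2],
    "temp_ar": [20.0, 22.0, 25.0, 10.0, 14.0, 18.0],
})

# Option 1: transform returns one value per original row,
# so the result can be assigned straight to a new column.
df["avg_temp_ar_mensal"] = df.groupby(["mes", "estacao"])["temp_ar"].transform("mean")

# Option 2: aggregate to one row per group, then merge the result back.
dfn = df.groupby(["mes", "estacao"])["temp_ar"].mean().reset_index(name="average")
merged = pd.merge(df, dfn, on=["mes", "estacao"], how="left")
print(merged)
```

The two new columns hold the same values. transform is shorter, while the merge route is handy when the aggregated table is also needed on its own.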
You are calling a groupby on a single column when you are doing df2['temp_ar'].groupby(...). This doesn't make much sense since in a single column, there's nothing to group by.
Instead, you have to perform the groupby on all the columns you need. Also, make sure that the final output is a series and not a dataframe
df['new_column'] = df.groupby(['city_column', 'month_column'])['temp_column'].transform('mean')
This should do the trick if I understand your dataset correctly. If not, please provide a reproducible version of your df

Pycharm problem set (Stuck from step 3 onwards)

Using the ff_monthly.csv data set https://github.com/alexpetralia/fama_french,
use the first column as an index
(this contains the year and month of the data as a string).
Create a new column ‘Mkt’ as ‘Mkt-RF’ + ‘RF’
Create two new columns in the loaded DataFrame, ‘Month’ and ‘Year’ to
contain the year and month of the dataset extracted from the index column.
Create a new DataFrame with columns ‘Mean’ and ‘Standard
Deviation’ and the full set of years from (b) above.
Write a function which accepts (r_m,s_m) the monthly mean and standard
deviation of a return series and returns a tuple (r_a,s_a), the annualised
mean and standard deviation. Use the formulae: r_a = (1+r_m)^12 -1, and
s_a = s_m * 12^0.5.
Loop through each year in the data, and calculate the annualised mean and
standard deviation of the new ‘Mkt’ column, storing each in the newly
created DataFrame. Note that the values in the input file are % returns, and
need to be divided by 100 to return decimals (i.e the value for August 2022
represents a return of -3.78%).
Print the DataFrame and output it to a csv file.
Workings so far:
import pandas as pd
ff_monthly=pd.read_csv(r"file path")
ff_monthly=pd.read_csv(r"file path",index_col=0)
Mkt=ff_monthly['Mkt-RF']+ff_monthly['RF']
ff_monthly= ff_monthly.assign(Mkt=Mkt)
df=pd.DataFrame(ff_monthly)
There are a few things to pay attention to.
The Date is the index of your DataFrame. The index is treated differently from the normal columns, which is why df.Date raises an AttributeError: Date is not an attribute, but the index. Try df.index instead.
df.Date.str.split("_", expand=True) would work if your Date looked like 22_10. However, according to your picture it doesn't contain an underscore and it also contains the day, so this cannot work.
In fact, the format you have does not follow any standard. The best way to deal with it is to parse it into a proper datetime64[ns] type that pandas understands, with df.index = pd.to_datetime(df.index, format='%y%m%d'). See the Python documentation for the supported format strings.
If all this works, it should be rather straightforward to create the columns
df['year'] = df.index.year
In fact, this part has been asked before
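Putting the pieces together, the following is only a sketch under stated assumptions: the index strings ('210131' and so on) and the Mkt values are invented, and the '%y%m%d' format is the one quoted above, so it should be adjusted to whatever the real file contains. The annualise function (a hypothetical name) implements the two formulae from the problem statement.

```python
import pandas as pd

# Hypothetical stand-in for the loaded file: string index, returns in percent.
df = pd.DataFrame(
    {"Mkt": [1.0, 2.0, -1.0, 3.0]},
    index=["210131", "210228", "220131", "220228"],
)

# Parse the string index into a DatetimeIndex (the format is an assumption).
df.index = pd.to_datetime(df.index, format="%y%m%d")

# A DatetimeIndex exposes year and month directly, without the .dt accessor.
df["Year"] = df.index.year
df["Month"] = df.index.month


def annualise(r_m, s_m):
    """Return (r_a, s_a) from a monthly mean and standard deviation."""
    return (1 + r_m) ** 12 - 1, s_m * 12 ** 0.5


# Loop over the years, converting percent returns to decimals first.
records = []
for year, group in df.groupby("Year"):
    monthly = group["Mkt"] / 100
    r_a, s_a = annualise(monthly.mean(), monthly.std())
    records.append({"Year": year, "Mean": r_a, "Standard Deviation": s_a})

stats = pd.DataFrame(records).set_index("Year")
print(stats)
```

Calling stats.to_csv with a file name of your choice would cover the final output step.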

How can I add 'duration' column in the given DataFrame

I am pretty new to Python and am doing some project work on my own, so I need a little help to understand a few things.
I have a DataFrame that contains Netflix data.
I need to find the sum of the Duration column for each Profile Name, i.e. I want to know who watches Netflix the most.
How can I add up the Duration column? I am unable to understand the to_timedelta function.
You can use a combination of to_timedelta and GroupBy.sum:
out = (pd.to_timedelta(df['Duration']) # convert strings to timedelta
.groupby(df['Profile Name']).sum() # sum per Profile
.sort_values(ascending=False) # sort by total duration
)
print(out)
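Since to_timedelta was the sticking point, here is a small sketch on invented data (the profile names and durations are assumptions, with durations written as "HH:MM:SS" strings). It shows that the strings become Timedelta values, which can then be summed like numbers:

```python
import pandas as pd

# Hypothetical viewing log; durations are "HH:MM:SS" strings.
df = pd.DataFrame({
    "Profile Name": ["Ann", "Bob", "Ann", "Bob", "Cy"],
    "Duration": ["00:30:00", "01:15:00", "00:45:30", "00:05:00", "02:00:00"],
})

# Strings become Timedelta values, which support arithmetic.
durations = pd.to_timedelta(df["Duration"])

# Total watch time per profile, largest first.
out = durations.groupby(df["Profile Name"]).sum().sort_values(ascending=False)
print(out)
```

The first entry of out is the profile with the most watch time.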

Is there a function to get the difference between two values on a pandas dataframe timeseries?

I am messing around in the NYT covid dataset which has total covid cases for each county, per day.
I would like to find out the difference of cases between each day, so theoretically I could get the number of new cases per day instead of total cases. Taking a rolling mean, or resampling every 2 days using a mean/sum/etc all work just fine. It's just subtracting that is giving me such a headache.
Tried methods:
df.resample('2d').diff()
'DatetimeIndexResampler' object has no attribute 'diff'
df.resample('1d').agg(np.subtract)
ufunc() missing 1 of 2required positional argument(s)
df.rolling(2).diff()
'Rolling' object has no attribute 'diff'
df.rolling('2').agg(np.subtract)
ufunc() missing 1 of 2required positional argument(s)
Sample data:
pd.DataFrame(data={'state':['Alabama','Alabama','Alabama','Alabama','Alabama'],
'date':[dt.date(2020,3,13),dt.date(2020,3,14),dt.date(2020,3,15),dt.date(2020,3,16),dt.date(2020,3,17)],
'covid_cases':[1.2,2.0,2.9,3.6,3.9]
})
Desired sample output:
pd.DataFrame(data={'state':['Alabama','Alabama','Alabama','Alabama','Alabama'],
'date':[dt.date(2020,3,13),dt.date(2020,3,14),dt.date(2020,3,15),dt.date(2020,3,16),dt.date(2020,3,17)],
'new_covid_cases':[np.nan,0.8,0.9,0.7,0.3]
})
Recreate sample data from original NYT dataset:
df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv',parse_dates=['date'])
df.groupby(['state','date'])[['cases']].mean().reset_index()
Any help would be greatly appreciated! Would like to learn how to do this manually/via function rather than finding a "new cases" dataset as I will be working with timeseries a lot in the very near future.
Let's try this bit of complete code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')
df['date'] = pd.to_datetime(df['date'])
df_daily_state = df.groupby(['date','state'])['cases'].sum().unstack()
daily_new_cases_AL = df_daily_state.diff()['Alabama']
ax = daily_new_cases_AL.iloc[-30:].plot.bar(title='Last 30 days Alabama New Cases')
Details:
Download the historical case records from the NYTimes GitHub using the raw URL.
Convert the dtype of the 'date' column to datetime dtype.
Group by the 'date' and 'state' columns, sum 'cases', and unstack the state level of the index to get dates for rows and states for columns.
Take the difference by columns and select only the Alabama column.
Plot the last 30 days.
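If the data should stay in the long format of the question's sample, a per-state diff is one possible alternative to unstacking. This sketch reuses the sample values from the question and takes the new_covid_cases column name from the desired output:

```python
import datetime as dt

import pandas as pd

# Sample data from the question, in long format.
df = pd.DataFrame(data={
    "state": ["Alabama"] * 5,
    "date": [dt.date(2020, 3, d) for d in range(13, 18)],
    "covid_cases": [1.2, 2.0, 2.9, 3.6, 3.9],
})

# Sort so each state's rows are in date order, then difference within each state.
df = df.sort_values(["state", "date"])
df["new_covid_cases"] = df.groupby("state")["covid_cases"].diff()
print(df[["state", "date", "new_covid_cases"]])
```

Grouping by state keeps the first row of each state as NaN instead of subtracting the last value of the previous state.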
The diff function is correct, but if you look at your error message:
'DatetimeIndexResampler' object has no attribute 'diff'
in your first attempt, it's because diff is a function available for DataFrames, not for Resamplers, so turn it back into a DataFrame by specifying how you want to resample it.
If you have the total number of COVID cases for each day and want to resample it to 2 days, you probably only want to keep the latest update out of the two days, in which case something like df.resample('2d').last().diff() should work.
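A minimal sketch of that idea, assuming a made-up series of cumulative daily counts (the numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical cumulative case counts, one value per day.
cases = pd.Series(
    [1, 2, 4, 7, 11, 16],
    index=pd.date_range("2020-03-13", periods=6, freq="D"),
    name="cases",
)

# resample() alone returns a Resampler; an aggregation such as last()
# turns it back into a Series with one value per 2-day bin.
every_two_days = cases.resample("2D").last()

# diff() then gives the new cases accumulated over each 2-day bin.
new_cases = every_two_days.diff()
print(new_cases)
```

Using last() suits cumulative totals; for data that is already a daily count, sum() would be the more natural aggregation.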

Pandas Python Dataframe

I have a dataset with YYYY-MM dates, and I want to find the mean temperature for each year, so I need to combine the 12 months of each year into a single summary value. How do I do that using pandas?
An example of my data: (my dataset covers more than one year; I tried to reshape it, but that doesn't seem to work)
Let us do a string slice, then groupby + sum
s=df.groupby(df['month'].str[:4]).sum()
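Since the question asks for the yearly mean, the same string slice works with mean in place of sum. In this sketch the temp column name and all values are assumptions:

```python
import pandas as pd

# Hypothetical data: 'month' holds YYYY-MM strings, 'temp' is a made-up column name.
df = pd.DataFrame({
    "month": ["2019-01", "2019-02", "2020-01", "2020-02"],
    "temp": [10.0, 14.0, 11.0, 15.0],
})

# The first four characters of each string are the year.
year = df["month"].str[:4].rename("year")

# Select the numeric column before aggregating so the strings are left out.
yearly_mean = df.groupby(year)["temp"].mean()
print(yearly_mean)
```

Selecting the temperature column first avoids aggregating the month strings along with the numbers.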
