Pandas: Bin and Sum

I have the following data (in csv form):
Country,City,Year,Value1,Value2
Germany,Berlin,2020,9,3
Germany,Berlin,2017,1,4
Germany,Berlin,2011,1,4
Israel,Tel Aviv,2007,4.5,1
I would like to bin the Year column into 5-year ranges, so that instead of the specific year each row has a range, and then sum the values in Value1 and Value2, grouping by Country, City, and the bin ID (in the example below I call this YearRange).
For example, after running this process, the data would look like so:
Country,City,YearRange,Value1,Value2
Germany,Berlin,2016-2020,10,7
Germany,Berlin,2011-2015,1,4
Israel,Tel Aviv,2006-2010,4.5,1
If this simplifies things, I don't mind creating the possible ranges in advance (i.e. I will have a table with all possible ranges: 2016-2020, 2011-2015, 2006-2010, down to the earliest date in my data).
How can I achieve this using Pandas?
Thanks!

Using pd.cut with groupby
(df.groupby([df.Country, df.City,
             pd.cut(df.Year, [2006, 2011, 2016, 2020]).astype(str)])
   [['Value1', 'Value2']].sum().reset_index())
Out[254]:
   Country      City          Year  Value1  Value2
0  Germany    Berlin  (2006, 2011]     1.0       4
1  Germany    Berlin  (2016, 2020]    10.0       7
2   Israel  Tel Aviv  (2006, 2011]     4.5       1
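Note that the interval labels above don't match the YearRange format in the question, and with these edges 2011 lands in the (2006, 2011] bin rather than 2011-2015. Passing explicit right-inclusive edges and labels to pd.cut gives the exact output asked for — a sketch, with the sample data rebuilt inline:

import pandas as pd

df = pd.DataFrame({'Country': ['Germany', 'Germany', 'Germany', 'Israel'],
                   'City': ['Berlin', 'Berlin', 'Berlin', 'Tel Aviv'],
                   'Year': [2020, 2017, 2011, 2007],
                   'Value1': [9, 1, 1, 4.5],
                   'Value2': [3, 4, 4, 1]})

# Right-inclusive 5-year bins: (2005, 2010], (2010, 2015], (2015, 2020]
df['YearRange'] = pd.cut(df['Year'],
                         bins=[2005, 2010, 2015, 2020],
                         labels=['2006-2010', '2011-2015', '2016-2020'])

# observed=True drops empty (Country, City, YearRange) combinations,
# since pd.cut produces a categorical column
out = (df.groupby(['Country', 'City', 'YearRange'], observed=True)
         [['Value1', 'Value2']].sum().reset_index())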

Related

Time series convert/summarise series of values into a single value

Hi, I have a time series dataset which looks like this:
Date        Value1
2021/08/01  2
and
Date        Value1
2020/08/01  4
and
Date        Value1
2019/08/01  6
I want to compare the 2021 data with 2020 first, and then compare the 2021 data with 2019. I am doing a percent-change calculation to get both values, i.e. 2021 vs 2020 and 2021 vs 2019. I now want to condense these two changes into a single value.
For example, if the change in Value1 from 2021 vs 2020 is 10% and the change from 2021 vs 2019 is 15%, I want to summarize this as a single value. What would be a better way to summarise these than just taking the plain average, (10 + 15) / 2?
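For reference, a minimal sketch of the percent-change arithmetic described above, using the sample values (how best to combine the two changes is the open question):

value_2021, value_2020, value_2019 = 2, 4, 6   # Value1 from the three tables

chg_vs_2020 = (value_2021 - value_2020) / value_2020 * 100   # 2021 vs 2020 -> -50.0
chg_vs_2019 = (value_2021 - value_2019) / value_2019 * 100   # 2021 vs 2019 -> ~-66.7

plain_average = (chg_vs_2020 + chg_vs_2019) / 2   # the plain average the question mentions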

Calculate standard deviation for intervals in dataframe column

I would like to calculate standard deviations for non-rolling intervals.
I have a df like this:
value  std  year
    3  nan  2001
    2  nan  2001
    4  nan  2001
   19  nan  2002
   23  nan  2002
   34  nan  2002
and so on. I would like to calculate the standard deviation for every year and save it in every row of that year in the "std" column. I have the same amount of data for every year, so the length of the intervals never changes.
I already tried:
df["std"] = df.groupby("year").std()
but since the right-hand side returns a new dataframe that calculates the std for every column grouped by year, this obviously does not work.
Thank you all very much for your support!
IIUC:
try the transform() method:
df['std']=df.groupby("year")['value'].transform('std')
OR
If you want to find the standard deviation of multiple columns then:
df[['std1','std2']]=df.groupby("year")[['column1','column2']].transform('std')
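On the sample data above, transform('std') broadcasts each year's standard deviation back onto every row of that year. A quick runnable sketch:

import pandas as pd

df = pd.DataFrame({'value': [3, 2, 4, 19, 23, 34],
                   'year': [2001, 2001, 2001, 2002, 2002, 2002]})
df['std'] = df.groupby('year')['value'].transform('std')
# every 2001 row gets 1.0, every 2002 row gets ~7.77 (pandas' sample std)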

How can I create a unique id in this pandas dataframe with datetime and number entry?

I've got a pandas dataframe that looks like this
   miles  dollars  gallons       date  gal_cost        mpg  tank%_used       day
0  253.2    21.37   11.138 2019-01-15  1.918657  22.732986    0.821993   Tuesday
1  211.9    22.24   11.239 2019-01-26  1.978824  18.853991    0.829446  Saturday
2  258.1    22.70   11.708 2019-02-02  1.938845  22.044756    0.864059  Saturday
3  223.0    22.24   11.713 2019-02-15  1.898745  19.038675    0.864428    Friday
I'd like to create a new column called 'id' that is unique for each entry. For the first entry in the df, the id would be c0115201901 because it is from the df_c dataframe, the date is 01 15 2019 and it is the first entry.
I know I'll end up doing something like this
df_c = df_c.assign(id=('c'+df_c['date']) + ?????)
but I'd like to parse the df_c['date'] column to pull values for the day, month and year individually. The df_c['date'] column is a datetime64[ns] type.
The other issue is I'd like to have a counter at the end of the id to count which number entry for the date it is. For example, 01 for the first entry, 02 for the second, etc.
I also have a df_m dataframe, but I can repeat the process with a different letter for that dataframe.
Refer to the pandas datetime-properties docs.
The date components can be extracted easily with the .dt accessor:
df_c['date'].dt.day, df_c['date'].dt.month, df_c['date'].dt.year
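A sketch of one way to put the pieces together (the strftime format and the two-digit counter width are assumptions inferred from the c0115201901 example): format the date as MMDDYYYY with dt.strftime, and number the entries within each date with groupby().cumcount().

# assumes df_c['date'] is datetime64[ns], as in the question
entry_no = df_c.groupby(df_c['date'].dt.date).cumcount() + 1   # 1, 2, ... per date
df_c['id'] = ('c'
              + df_c['date'].dt.strftime('%m%d%Y')
              + entry_no.astype(str).str.zfill(2))
# first row -> 'c0115201901'

For the df_m dataframe, the same code applies with an 'm' prefix instead of 'c'.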

Index automatically replaced when creating a new column out of it

I am currently doing some exercises on a Pandas DataFrame indexed by date (DD/MM/YY). The current exercise requires me to groupby on Year to obtain average yearly values.
So what I tried to do was to create a new column containing only the years extracted from the DataFrame's index. The code I wrote is:
data["year"] = [t.year for t in data.index]
data.groupby("year").mean()
but for some reason the new column "year" ends up replacing the previous full-date index (which does not even become a regular column; it simply disappears), which came as a surprise. How can this be?
Thanks in advance!
For a sample dataframe:
            value
2016-01-22      1
2014-02-02      2
2014-08-27      3
2016-01-23      4
2014-03-18      5
If you would like to keep your logic, you just need to select the column you want to take the mean() of, use transform(), and then assign the result back to the value column:
data['year'] = [t.year for t in data.index]
data['value'] = data.groupby('year')['value'].transform('mean')
Yields:
               value  year
2016-01-22  2.500000  2016
2014-02-02  3.333333  2014
2014-08-27  3.333333  2014
2016-01-23  2.500000  2016
2014-03-18  3.333333  2014
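As a side note, the list comprehension is not needed: a DatetimeIndex exposes .year directly. And the full-date index is not actually destroyed by the column assignment; it is groupby("year").mean() that returns a new frame indexed by year, which is why the dates seem to vanish from that output. An equivalent spelling of the transform approach:

data['year'] = data.index.year   # same result as the list comprehension
data['value'] = data.groupby('year')['value'].transform('mean')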

Create a calculated column in pivot table in python

I need to add a new calculated column to a pivot table in python. The formula for that column should be like the one below:
math.log10(2.718281+(table['eventid']+table['nkill']+table['nwound'])/3).
I'm getting an error every time.
Could you, please, help me to solve this issue? Thank you!
I added the part of my pivot table. It is built by country and by year for three variables: eventid, nkill and nwound.
                   eventid     nkill    nwound
                     Crime     Crime     Crime
country_txt iyear
Afghanistan 1995         1  0.000000  0.000000
            2001         2  1.500000  0.500000
            2002         6  0.833333  0.800000
            2003        36  2.117647  2.968750
            2004        28  3.222222  2.538462
IIUC
Since you did not show the error message, I can only guess, but it is usually one of two things: first, mixing int and float, which is covered by .astype(float); second, an index mismatch when you assign the new column, which is covered by .values. Notice I use .mean(1) to get the average value of each row.

import numpy as np

table['New'] = np.log10(table.mean(1).astype(float).add(2.718281)).values
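For comparison, a sketch that spells out the question's formula column by column; with the MultiIndex pivot shown above, this assumes each variable has the single Crime sub-column:

import numpy as np

# select the ('<variable>', 'Crime') sub-columns explicitly (assumed names)
avg = (table[('eventid', 'Crime')]
       + table[('nkill', 'Crime')]
       + table[('nwound', 'Crime')]) / 3
table['New'] = np.log10(2.718281 + avg)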
