I need to add a new calculated column to a pivot table in Python. The formula for that column should be like the one below:
math.log10(2.718281+(table['eventid']+table['nkill']+table['nwound'])/3).
I'm getting an error every time.
Could you please help me solve this issue? Thank you!
I have added part of my pivot table below. It is built by country and by year for three variables: eventid, nkill and nwound.
                   eventid     nkill    nwound
                     Crime     Crime     Crime
country_txt iyear
Afghanistan 1995         1  0.000000  0.000000
            2001         2  1.500000  0.500000
            2002         6  0.833333  0.800000
            2003        36  2.117647  2.968750
            2004        28  3.222222  2.538462
IIUC
You did not show the error message, but based on my understanding there are usually two causes: first, mixing int and float, which is covered by .astype(float); second, an index mismatch when you assign the new column, which is covered by .values. Notice I am using .mean(1) to get the average value of each row.
import numpy as np
table['New'] = np.log10(table.mean(1).astype(float).add(2.718281)).values
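A minimal sketch of the same idea on a flat toy frame (values taken from the first three rows of the sample above, with plain rather than MultiIndex columns): math.log10 only accepts scalars, which is one reason the original formula fails, so the element-wise np.log10 is applied to the row-wise mean of the three columns instead.

import numpy as np
import pandas as pd

# toy stand-in for the pivot table
table = pd.DataFrame({"eventid": [1, 2, 6],
                      "nkill": [0.000000, 1.500000, 0.833333],
                      "nwound": [0.000000, 0.500000, 0.800000]})

# row-wise mean of the three columns, plus the constant, then element-wise log10
table["New"] = np.log10(table.mean(axis=1) + 2.718281)
print(table)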
I have a dataframe df with 10 years of daily stock market data, with columns Date, Open, Close.
I want to calculate the change between any two consecutive values (in Close) as a ratio of the previous value.
For example, in the sample below, the first entry (-0.0002) for Interday_return is calculated as (43.06-43.07)/43.07.
Similarly, the next value 0.0046 is calculated as (43.26-43.06)/43.06.
And so on.
I am able to create a new column Interday_Close_change, which is basically the difference between each two consecutive rows (i.e. the numerator of the fraction above), using the code below. However, I don't know how to divide each element of Interday_Close_change by the Close value in the preceding row to get a new column Interday_return.
import pandas as pd

df = pd.DataFrame(data, columns=columns)
# diff() gives the numerator: each Close minus the previous Close
df['Interday_Close_change'] = df['Close'].astype(float).diff()
df.fillna('', inplace=True)
This should do it:
df['Interday_Close_change'] = df['Close'].pct_change().fillna('')
Sample input:
Date Open Close
0 1/2/2018 42.54 43.07
1 1/3/2018 43.13 43.06
2 1/4/2018 43.14 43.26
3 1/5/2018 43.36 43.75
Sample output:
Date Open Close Interday_Close_change
0 1/2/2018 42.54 43.07
1 1/3/2018 43.13 43.06 -0.000232
2 1/4/2018 43.14 43.26 0.004645
3 1/5/2018 43.36 43.75 0.011327
Docs on pct_change.
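For reference, pct_change() computes exactly the ratio the question describes: the difference between consecutive values divided by the previous value. A minimal sketch on the sample Close column, comparing it with the explicit diff()/shift() computation:

import pandas as pd

df = pd.DataFrame({"Close": [43.07, 43.06, 43.26, 43.75]})

# pct_change(): (current - previous) / previous
df["Interday_return_a"] = df["Close"].pct_change()
# the same ratio spelled out with diff() and shift()
df["Interday_return_b"] = df["Close"].diff() / df["Close"].shift(1)
print(df)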
Say I have a df that looks like this:
name day var_A var_B
0 Pete Wed 4 5
1 Luck Thu 1 10
2 Pete Sun 10 10
And I want to sum var_A and var_B for every name/person and then divide this sum by the number of occurrences of that name/person.
Let's take Pete for example. Sum his variables (in this case, (4+10) + (5+10) = 29) and divide this sum by the number of occurrences of Pete in the df (29/2 = 14.5). The "day" column would be eliminated; there would be only one column for the name and another for the average.
Would look like this:
>>> df.method().method()
name avg
0 Pete 14.5
1 Luck 11.0
I've been trying to do this using groupby and other methods, but I eventually got stuck.
Any help would be appreciated.
I came up with
df.groupby('name')[['var_A', 'var_B']].apply(lambda g: g.stack().sum() / len(g)).rename('avg').reset_index()
which produces the correct result, but I'm not sure it's the most elegant way.
pandas' groupby is a lazy expression, and as such it is reusable:
# create group
group = df.drop(columns="day").groupby("name")
# compute output
group.sum().sum(axis=1) / group.size()
name
Luck 11.0
Pete 14.5
dtype: float64
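To get exactly the two-column shape the question asks for (name and avg), the same grouped result can be renamed and its index reset; a short sketch reusing the sample data:

import pandas as pd

df = pd.DataFrame({"name": ["Pete", "Luck", "Pete"],
                   "day": ["Wed", "Thu", "Sun"],
                   "var_A": [4, 1, 10],
                   "var_B": [5, 10, 10]})

group = df.drop(columns="day").groupby("name")
result = (group.sum().sum(axis=1) / group.size()).rename("avg").reset_index()
print(result)
#    name   avg
# 0  Luck  11.0
# 1  Pete  14.5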
I am trying to use PyCaret for time series, according to this tutorial.
My analysis did not work. When I created a new column
data['MA12'] = data['variable'].rolling(12).mean()
I got this new MA12 column with NA values only.
As a result, I decided to replicate the code from the tutorial using the AirPassengers dataset, but got the same issue.
When I print data, I get
Month Passengers MA12
0 1949-01-01 112 NaN
1 1949-02-01 118 NaN
2 1949-03-01 132 NaN
3 1949-04-01 129 NaN
4 1949-05-01 121 NaN
I would greatly appreciate any tips on what is going on here.
My only guess was that I use the default version of PyCaret and maybe need to install the full one. I tried that too, with the same result.
Since you want a moving average over the previous 12 readings, the first 11 rows will be NaN; you need at least 12 rows before you get a 12-period moving average. You can see this in the link you provided: the MA chart doesn't start right away.
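A quick sanity check with dummy data (not the AirPassengers set) shows the same behaviour: a 12-period rolling mean is NaN for the first eleven rows and only produces values from the twelfth row onward.

import pandas as pd

s = pd.Series(range(1, 25))      # 24 dummy monthly readings
ma = s.rolling(12).mean()
print(ma.head(13))               # rows 0-10 are NaN; row 11 holds the first average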
I would like to calculate standard deviations for non-rolling intervals.
I have a df like this:
value std year
3 nan 2001
2 nan 2001
4 nan 2001
19 nan 2002
23 nan 2002
34 nan 2002
and so on. I would just like to calculate the standard deviation for every year and save it in every cell in the respective row in "std". I have the same amount of data for every year, thus the length of the intervals never changes.
I already tried:
df["std"] = df.groupby("year").std()
but since the right-hand side gives a new dataframe with the std of every column grouped by year, this obviously does not work.
Thank you all very much for your support!
IIUC:
Try the transform() method:
df['std'] = df.groupby("year")['value'].transform('std')
OR
If you want to find the standard deviation of multiple columns then:
df[['std1', 'std2']] = df.groupby("year")[['column1', 'column2']].transform('std')
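For example, on the sample data from the question, transform('std') broadcasts each year's standard deviation back onto every row of that year:

import pandas as pd

df = pd.DataFrame({"value": [3, 2, 4, 19, 23, 34],
                   "year": [2001, 2001, 2001, 2002, 2002, 2002]})
df["std"] = df.groupby("year")["value"].transform("std")
print(df)
#    value  year       std
# 0      3  2001  1.000000
# 1      2  2001  1.000000
# 2      4  2001  1.000000
# 3     19  2002  7.767453
# 4     23  2002  7.767453
# 5     34  2002  7.767453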
I am currently doing some exercises on a Pandas DataFrame indexed by date (DD/MM/YY). The current exercise requires me to groupby on Year to obtain average yearly values.
So what I tried to do was to create a new column containing only the years extracted from the DataFrame's index. The code I wrote is:
data["year"] = [t.year for t in data.index]
data.groupby("year").mean()
but for some reason the new column "year" ends up replacing the previous full-date index (which does not even become a "standard" column; it simply disappears), which came as a bit of a surprise. How can this be?
Thanks in advance!
For a sample dataframe:
value
2016-01-22 1
2014-02-02 2
2014-08-27 3
2016-01-23 4
2014-03-18 5
If you would like to keep your logic, you just need to select the column you want the mean() of, use transform(), and assign the result back to the value column:
data['year'] = [t.year for t in data.index]
data['value'] = data.groupby('year')['value'].transform('mean')
Yields:
value year
2016-01-22 2.500000 2016
2014-02-02 3.333333 2014
2014-08-27 3.333333 2014
2016-01-23 2.500000 2016
2014-03-18 3.333333 2014
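As an aside, if the extra year column is not needed at all, you can group on the index's year attribute directly (this assumes the index is a DatetimeIndex), which leaves the original frame untouched:

import pandas as pd

data = pd.DataFrame(
    {"value": [1, 2, 3, 4, 5]},
    index=pd.to_datetime(["2016-01-22", "2014-02-02", "2014-08-27",
                          "2016-01-23", "2014-03-18"]),
)
yearly_means = data.groupby(data.index.year)["value"].mean()
print(yearly_means)
# 2014    3.333333
# 2016    2.500000
# Name: value, dtype: float64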