Calculate Rolling Mean Without Using pd.rolling_mean() - python

I know rolling_mean() exists, but this is for a school project, so I'm trying to avoid using rolling_mean().
I'm trying to use the following function on a DataFrame column:
def run_mean(array, period):
    ret = np.cumsum(array, dtype=float)
    ret[period:] = ret[period:] - ret[:-period]
    return ret[period - 1:] / period

data['run_mean'] = run_mean(data['ratio'], 150)
But I'm getting the error 'ValueError: cannot set using a slice indexer with a different length than the value'.
Using data['run_mean'] = pd.rolling_mean(raw_data['ratio'], 150) works fine, so what am I missing?

Fill the initial values up to period with NaN.
def run_mean(array, period):  # Vector
    ret = np.cumsum(array / period, dtype=float)  # First divide by period to avoid overflow.
    ret[period:] = ret[period:] - ret[:-period]
    ret[:period - 1] = np.nan
    return ret

run_mean(np.array(range(5)), 3)
Out[35]: array([ nan,  nan,   1.,   2.,   3.])
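Because the NaN padding keeps the output the same length as the input, the result can be assigned straight back as a new column. A minimal usage sketch (my assumption: the column is first converted to a plain ndarray with .values, for the alignment reason covered in the next answer):

data['run_mean'] = run_mean(data['ratio'].values, 150)  # same length as data, so the assignment works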

To quote the pandas documentation,
A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.
This example should illustrate what's going on:
In [1]: import numpy as np
...: import pandas as pd
In [2]: a = pd.Series(np.random.random(5))
In [3]: a
Out[3]:
0    0.740975
1    0.983654
2    0.274207
3    0.427542
4    0.874127
dtype: float64
In [4]: a[2:]
Out[4]:
2    0.274207
3    0.427542
4    0.874127
dtype: float64
In [5]: a[:-2]
Out[5]:
0    0.740975
1    0.983654
2    0.274207
dtype: float64
In [6]: a[2:] - a[:-2]
Out[6]:
0    NaN
1    NaN
2    0.0
3    NaN
4    NaN
dtype: float64
In [7]: a[2:] = _
The last statement will produce the ValueError you get.
Converting ret from a pandas Series to a numpy ndarray should give you the behaviour you're looking for.
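A minimal sketch of that conversion, keeping the rest of the function as in the question (the np.asarray call is the only change I am assuming here):

def run_mean(array, period):
    ret = np.asarray(np.cumsum(array, dtype=float))  # plain ndarray: slices now align by position, not by label
    ret[period:] = ret[period:] - ret[:-period]
    return ret[period - 1:] / period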

You're mixing up the use of : in DataFrame slicing.
Solution
What you want to use is shift()
def run_mean(array, period):
    ret = np.cumsum(array, dtype=float)
    roll = ret - ret.shift(period).fillna(0)
    return roll[(period - 1):] / period
Example Setup
import pandas as pd
import numpy as np

np.random.seed(314)
df = pd.DataFrame((np.random.rand(6, 5) * 10).astype(int), columns=list('ABCDE'))
print df

   A  B  C  D  E
0  9  5  2  7  9
1  8  7  2  9  2
2  7  2  1  3  8
3  2  0  6  5  5
4  6  6  4  3  5
5  4  8  8  1  0
Observe
print df[:4]

   A  B  C  D  E
0  9  5  2  7  9
1  8  7  2  9  2
2  7  2  1  3  8
3  2  0  6  5  5

print df[:-4]

   A  B  C  D  E
0  9  5  2  7  9
1  8  7  2  9  2

These are not the same length.
Demonstration
print run_mean(df, 3)

          A         B         C         D         E
2  8.000000  4.666667  1.666667  6.333333  6.333333
3  5.666667  3.000000  3.000000  5.666667  5.000000
4  5.000000  2.666667  3.666667  3.666667  6.000000
5  4.000000  4.666667  6.000000  3.000000  3.333333
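As a sanity check, the result can be compared with the built-in rolling mean. A hedged sketch: pd.rolling_mean was later removed from pandas in favour of the .rolling() method, so this spelling assumes a pandas version that has it.

print(df.rolling(3).mean().dropna())  # should reproduce the Demonstration table above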

Related

Scaling pandas column to be between specified min and max numbers

Let us say I have the following data frame:
Frequency
20
14
10
8
6
2
1
I want to scale Frequency value from 0 to 1.
Is there a way to do this in Python? I have found something similar here, but it doesn't serve my purpose.
I am sure there's a more standard way to do this in Python, but I use a self-defined function with which you can select the range to scale to:
def my_scaler(min_scale_num, max_scale_num, var):
    return (max_scale_num - min_scale_num) * ((var - min(var)) / (max(var) - min(var))) + min_scale_num

# You can input your range
df['scaled'] = my_scaler(0, 1, df['Frequency'].astype(float))    # scaled between 0 and 1
df['scaled2'] = my_scaler(-5, 5, df['Frequency'].astype(float))  # scaled between -5 and 5
df
   Frequency    scaled   scaled2
0         20  1.000000  5.000000
1         14  0.684211  1.842105
2         10  0.473684 -0.263158
3          8  0.368421 -1.315789
4          6  0.263158 -2.368421
5          2  0.052632 -4.473684
6          1  0.000000 -5.000000
Just change a, b = 10, 50 to a, b = 0, 1 in the linked answer to set the lower and upper values of the scale:
a, b = 0, 1
x, y = df.Frequency.min(), df.Frequency.max()
df['normal'] = (df.Frequency - x) / (y - x) * (b - a) + a
print (df)
   Frequency    normal
0         20  1.000000
1         14  0.684211
2         10  0.473684
3          8  0.368421
4          6  0.263158
5          2  0.052632
6          1  0.000000
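If a library route is acceptable, scikit-learn's MinMaxScaler does the same min-max scaling. A small sketch (my assumptions: scikit-learn is installed, and .ravel() is used to flatten the returned 2-D array before assigning it as a column):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))  # use (-5, 5) etc. for other ranges
df['scaled'] = scaler.fit_transform(df[['Frequency']]).ravel()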
You can use applymap to apply any function on each cell of the df.
For example:
df = pd.DataFrame([20, 14, 10, 8, 6, 2, 1], columns=["Frequency"])
col_min = df["Frequency"].min()  # use scalars here, not df.min()/df.max(), which return Series
col_max = df["Frequency"].max()
df2 = df.applymap(lambda x: (x - col_min) / (col_max - col_min))
df

   Frequency
0         20
1         14
2         10
3          8
4          6
5          2
6          1
df2

   Frequency
0   1.000000
1   0.684211
2   0.473684
3   0.368421
4   0.263158
5   0.052632
6   0.000000
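If every column of the frame should be scaled, the column-wise arithmetic is already vectorised, so applymap is not needed at all. A one-line sketch on the same df:

df2 = (df - df.min()) / (df.max() - df.min())  # per-column min-max scaling to [0, 1]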

Convert data type of multiple columns with for loop

I have a 21840x39 data frame. A few of my columns are numerically valued and I want to make sure they are all in the same data type (which I want to be a float).
Instead of naming all the columns out and converting them:
df[['A', 'B', 'C', ...]] = df[['A', 'B', 'C', ...]].astype(float)
Can I do a for loop that will allow me to say something like "convert to float from column 18 to column 35"?
I know how to do one column: df['A'] = df['A'].astype(float)
But how can I do multiple columns? I tried with list slicing within a loop but couldn't get it right.
First idea is to convert the selected columns; Python counts from 0, so for columns 18 to 35 use:
df.iloc[:, 17:35] = df.iloc[:, 17:35].astype(float)
If that does not work (assigning back through iloc can hit a possible bug), use another solution:
df = df.astype(dict.fromkeys(df.columns[17:35], float))
Sample - convert the 8th to 15th columns:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(10, size=(3, 18)),
                  columns=list('abcdefghijklmnopqr')).astype(str)
print (df)

   a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r
0  0  8  3  6  3  3  7  8  0  0  8  9  3  7  2  3  6  5
1  0  4  8  6  4  1  1  5  9  5  6  6  6  5  4  6  4  2
2  3  4  7  1  4  9  3  2  0  9  1  2  7  1  0  2  8  8

df = df.astype(dict.fromkeys(df.columns[7:15], float))
print (df)

   a  b  c  d  e  f  g    h    i    j    k    l    m    n    o  p  q  r
0  0  8  3  6  3  3  7  8.0  0.0  0.0  8.0  9.0  3.0  7.0  2.0  3  6  5
1  0  4  8  6  4  1  1  5.0  9.0  5.0  6.0  6.0  6.0  5.0  4.0  6  4  2
2  3  4  7  1  4  9  3  2.0  0.0  9.0  1.0  2.0  7.0  1.0  0.0  2  8  8
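Another option, offered here as a hedged alternative the answers above do not use: select the block of columns by position once and convert them with pd.to_numeric, which also copes with messy strings when errors='coerce' is passed:

cols = df.columns[7:15]
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce').astype(float)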
Tweaked @jezrael's code, as typing in column names is (I feel) a good option.
import pandas as pd
import numpy as np

np.random.seed(2020)
df = pd.DataFrame(np.random.randint(10, size=(3, 18)),
                  columns=list('abcdefghijklmnopqr')).astype(str)
print(df)

columns = list(df.columns)
# change the first and last column names below as required
df = df.astype(dict.fromkeys(
    df.columns[columns.index('h'):(columns.index('o') + 1)], float))
print (df)
Leaving the original answer below here, but note: never loop in pandas if vectorized alternatives exist.
If I had a dataframe and wanted to change columns 'col3' to 'col5' (human readable names) to floats I could...
import pandas as pd
import re

df = pd.read_csv('dummy_data.csv')
df

columns = list(df.columns)
# change the first and last column names below as required
start_column = columns.index('col3')
end_column = columns.index('col5')
for index, col in enumerate(columns):
    if (start_column <= index) & (index <= end_column):
        df[col] = df[col].astype(float)
df
...by just changing the column names. Perhaps it's easier to work in column names and 'from this one' and 'to that one' (inclusive).

How do I group all values in a pandas column that have a frequency of 2 or less

Here is a fictitious example:
id  cluster
 1        3
 2        3
 3        3
 4        1
 5        5
So the cluster for id 4 and 5 should be replaced by some text.
So, I'm able to find which values have a frequency of less than 3 using:
counts = distclust.groupby("cluster")["cluster"].count()
counts[counts < 3].index.values
Now, I'm not sure how to go and replace these values in my dataframe with some arbitrary text (e.g. "noise").
I think that is enough information; let me know if you'd like me to include anything else.
In [82]: df.groupby('cluster').filter(lambda x: len(x) <= 2)
Out[82]:
   id  cluster
3   4        1
4   5        5
updating:
In [95]: idx = df.groupby('cluster').filter(lambda x: len(x) <= 2).index

In [96]: df.loc[idx, 'cluster'] = -999

In [97]: df
Out[97]:
   id  cluster
0   1        3
1   2        3
2   3        3
3   4     -999
4   5     -999
df.cluster.replace((df.cluster.value_counts() <= 2).replace({True: 'noise', False: np.nan}).dropna())
Out[627]:
0        3
1        3
2        3
3    noise
4    noise
Name: cluster, dtype: object
Then assign it back:
df.cluster = df.cluster.replace((df.cluster.value_counts() <= 2).replace({True: 'noise', False: np.nan}).dropna())
df
Out[629]:
   id cluster
0   1       3
1   2       3
2   3       3
3   4   noise
4   5   noise
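A variant that reuses the counts already computed in the question (a minimal sketch; 'noise' is just the placeholder text the question mentions):

small = counts[counts < 3].index
distclust.loc[distclust['cluster'].isin(small), 'cluster'] = 'noise'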

Group DataFrame, apply function with inputs then add result back to original

Can't find this question anywhere, so I'll just try here instead:
What I'm trying to do is basically alter an existing DataFrame object using groupby-functionality, and a self-written function:
benchmark =

x  y  z  field_1
1  1  3  a
1  2  5  b
9  2  4  a
1  2  5  c
4  6  1  c
What I want to do, is to groupby field_1, apply a function using specific columns as input, in this case columns x and y, then add back the result to the original DataFrame benchmark as a new column called new_field. The function itself is dependent on the value in field_1, i.e. field_1=a will yield a different result compared to field_1=b etc. (hence the grouping to start with).
Pseudo-code would be something like:
1. grouped_data = benchmark.groupby(['field_1'])
2. apply own_function to grouped_data; with inputs ('x', 'y', grouped_data)
3. add back result from function to benchmark as column 'new_field'
Thanks,
Elaboration:
I also have a DataFrame separate_data containing separate values for x,
separate_data =

x  a  b   c
1  1  3   7
2  2  5   6
3  2  4   4
4  2  5   9
5  6  1  10
that will need to be interpolated onto the existing benchmark DataFrame. Which column in separate_data should be used for interpolation depends on the column field_1 in benchmark (i.e. the values in the set (a, b, c) above). The interpolated value in the new column is based on the x-value in benchmark.
Result:
benchmark =

x  y  z  field_1  field_new
1  1  3  a        interpolate using separate_data with x=1 and col=a
1  2  5  b        interpolate using separate_data with x=1 and col=b
9  2  4  a        ... etc
1  2  5  c        ...
4  6  1  c        ...

Makes sense?
I think you need to reshape separate_data first with set_index + stack, set the index names with rename_axis and set the name of the Series with rename.
Then it is possible to group by both levels and apply some function.
Then join it to benchmark with a default left join:
separate_data1 = separate_data.set_index('x').stack().rename_axis(('x', 'field_1')).rename('d')
print (separate_data1)

x  field_1
1  a           1
   b           3
   c           7
2  a           2
   b           5
   c           6
3  a           2
   b           4
   c           4
4  a           2
   b           5
   c           9
5  a           6
   b           1
   c          10
Name: d, dtype: int64
If necessary, apply some function; this is mainly useful if there are duplicates in the (x, field_1) pairs, because the groupby then returns nice unique pairs:
def func(x):
    # sample function
    return x / 2 + x ** 2
separate_data1 = separate_data1.groupby(level=['x','field_1']).apply(func)
print (separate_data1)
x  field_1
1  a            1.5
   b           10.5
   c           52.5
2  a            5.0
   b           27.5
   c           39.0
3  a            5.0
   b           18.0
   c           18.0
4  a            5.0
   b           27.5
   c           85.5
5  a           39.0
   b            1.5
   c          105.0
Name: d, dtype: float64
benchmark = benchmark.join(separate_data1, on=['x', 'field_1'])
print (benchmark)

   x  y  z field_1     d
0  1  1  3       a   1.5
1  1  2  5       b  10.5
2  9  2  4       a   NaN
3  1  2  5       c  52.5
4  4  6  1       c  85.5
I think you cannot use transform here, because the function needs to read multiple columns together.
So use apply:
df1 = benchmark.groupby(['field_1']).apply(func)
And then for the new column there are multiple solutions, e.g. use join (default left join) or map.
A sample solution with both methods is here.
Or it is possible to use flexible apply, which can return a new DataFrame with the new column.
Try something like this:
groups = benchmark.groupby(benchmark["field_1"])
benchmark = benchmark.join(groups.apply(your_function), on="field_1")
In your_function you would create the new column using the other columns that you need, e.g. average them, sum them, etc.
Documentation for apply.
Documentation for join.
Here is a working example:
# Sample function that sums x and y, then appends field_1 as a string.
def func(x, y, z):
    return (x + y).astype(str) + z

benchmark['new_field'] = benchmark.groupby('field_1')\
    .apply(lambda x: func(x['x'], x['y'], x['field_1']))\
    .reset_index(level=0, drop=True)
Result:
benchmark
Out[139]:
   x  y  z field_1 new_field
0  1  1  3       a        2a
1  1  2  5       b        3b
2  9  2  4       a       11a
3  1  2  5       c        3c
4  4  6  1       c       10c
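For the interpolation the question actually describes, here is a possible sketch with np.interp (assumptions on my part: separate_data is sorted by x, x-values outside its range are simply clamped to the edge values by np.interp, and every field_1 value matches a column name in separate_data):

import numpy as np

def interp_group(g):
    # g.name is the field_1 value of this group; it picks the column to interpolate from
    vals = np.interp(g['x'], separate_data['x'], separate_data[g.name])
    return pd.Series(vals, index=g.index)

benchmark['field_new'] = (benchmark.groupby('field_1', group_keys=False)
                                   .apply(interp_group))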

How do I get pandas' "update" function to overwrite numbers in one column but not another?

Currently, I'm using:
csvdata.update(data, overwrite=True)
How can I make it update and overwrite a specific column but not another? It's a small but simple question; is there a simple answer?
Rather than update with the entire DataFrame, just update with the sub-DataFrame of the columns you are interested in. For example:
In [11]: df1
Out[11]:
   A   B
0  1  99
1  3  99
2  5   6

In [12]: df2
Out[12]:
   A  B
0  a  2
1  b  4
2  c  6

In [13]: df1.update(df2[['B']])  # subset of cols = ['B']

In [14]: df1
Out[14]:
   A  B
0  1  2
1  3  4
2  5  6
If you want to do it for a single column:
import pandas
import numpy

csvdata = pandas.DataFrame({"a": range(12), "b": range(12)})
other = pandas.Series(list("abcdefghijk") + [numpy.nan])
csvdata["a"].update(other)
print csvdata

     a   b
0    a   0
1    b   1
2    c   2
3    d   3
4    e   4
5    f   5
6    g   6
7    h   7
8    i   8
9    j   9
10   k  10
11  11  11
or, as long as the column names match, you can do this:
other = pandas.DataFrame({"a": list("abcdefghijk") + [numpy.nan], "b": list("abcdefghijk") + [numpy.nan]})
csvdata.update(other["a"])
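One more detail worth illustrating (a hedged sketch with made-up frames): update aligns on the index and, by default, never writes NaN values from the other frame, so a partially filled column only overwrites where it actually has data:

import pandas
import numpy

df1 = pandas.DataFrame({"A": [1, 3, 5], "B": [99.0, 99.0, 6.0]})
df2 = pandas.DataFrame({"B": [2.0, numpy.nan, 4.0]})
df1.update(df2)   # only B is touched; the NaN at index 1 leaves 99.0 in place
print(df1)        # A stays 1, 3, 5 and B becomes 2.0, 99.0, 4.0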
