I have a dataset:
print(df['price'])
0 0.435
1 -2.325
2 -3.866
...
58 -35.876
59 -37.746
Name: price, dtype: float64
moving average:
m_a = df['price'].rolling(window=5).mean()
m_a.plot()
print(m_a)
0 NaN
1 NaN
2 NaN
3 NaN
4 -2.8976
5 -4.9628
...
58 -36.2204
59 -36.4632
(plot of the moving average, titled M/A)
How can I determine the trend for the last n rows: FLAT/UP/DOWN?
Ideally as a string (or an int code) returned from a function, like:
trend = gettrend(df, 5)
print(trend)
>> UP
You can use something like this with np.where, and expand on the logic as required:
import numpy as np

df['Trend'] = np.where(df['m_a'] < df['m_a'].shift(), 'DOWN',
              np.where(df['m_a'] > df['m_a'].shift(), 'UP', 'FLAT'))
price m_a Trend
0 1 2 FLAT
1 2 2 FLAT
2 3 4 UP
3 4 5 UP
4 5 6 UP
5 6 7 UP
6 7 -1 DOWN
7 8 2 UP
8 6 7 UP
9 7 -6 DOWN
10 8 -7 DOWN
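To reduce this to the single verdict the question asks for, you can wrap a comparison of the smoothed values in a function. A minimal sketch, reusing the question's price column and 5-row window; comparing the last moving-average value against the one n rows earlier is just one reasonable definition of "trend":
def gettrend(df, n, tol=0.0):
    """Return 'UP', 'DOWN' or 'FLAT' for the last n rows of the moving average."""
    m_a = df['price'].rolling(window=5).mean()
    # change of the smoothed series over the last n rows
    # (requires at least n + 1 rows; a NaN change falls through to 'FLAT')
    change = m_a.iloc[-1] - m_a.iloc[-1 - n]
    if change > tol:
        return 'UP'
    if change < -tol:
        return 'DOWN'
    return 'FLAT'

trend = gettrend(df, 5)
print(trend)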
I'd do it this way:
Set up a sample DF:
In [31]: df = pd.DataFrame(np.random.rand(20)*100, columns=['price'])
In [32]: df
Out[32]:
price
0 20.555945
1 58.312756
2 3.723192
3 22.298697
4 71.533725
5 71.257019
6 87.355602
7 55.076239
8 67.941031
9 77.437012
10 94.496416
11 16.937017
12 68.494663
13 79.112648
14 88.298477
15 59.028143
16 16.991677
17 14.835137
18 75.095696
19 95.177781
Solution:
In [33]: df['trend'] = np.sign(df['price']
    ...:                       .rolling(window=5)
    ...:                       .mean()
    ...:                       .diff()
    ...:                       .fillna(0)) \
    ...:                       .map({0: 'FLAT', 1: 'UP', -1: 'DOWN'})
    ...:
In [34]: df
Out[34]:
price trend
0 20.555945 FLAT
1 58.312756 FLAT
2 3.723192 FLAT
3 22.298697 FLAT
4 71.533725 FLAT
5 71.257019 UP
6 87.355602 UP
7 55.076239 UP
8 67.941031 UP
9 77.437012 UP
10 94.496416 UP
11 16.937017 DOWN
12 68.494663 UP
13 79.112648 UP
14 88.298477 UP
15 59.028143 DOWN
16 16.991677 UP
17 14.835137 DOWN
18 75.095696 DOWN
19 95.177781 UP
Plot:
In [39]: df.price.plot(figsize=(16,6))
Out[39]: <matplotlib.axes._subplots.AxesSubplot at 0xc16e4a8>
In [40]: plt.locator_params(nbins=len(df))
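If you then want one label for the last n rows rather than a per-row column, the column can be reduced; taking the most frequent label among the last five rows is my own assumption, not part of the solution above:
# hypothetical reduction: most frequent trend label among the last 5 rows
last_n = df['trend'].tail(5)
print(last_n.mode().iloc[0])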
Given the following example table:
Index  Date        Weekday  Value
1      05/12/2022  2        10
2      06/12/2022  3        20
3      07/12/2022  4        40
4      09/12/2022  6        10
5      10/12/2022  7        60
6      11/12/2022  1        30
7      12/12/2022  2        40
8      13/12/2022  3        50
9      14/12/2022  4        60
10     16/12/2022  6        20
11     17/12/2022  7        50
12     18/12/2022  1        10
13     20/12/2022  3        20
14     21/12/2022  4        10
15     22/12/2022  5        40
I want to calculate a rolling average of the last three observations dated at least a week earlier. I cannot use .shift, as some dates are randomly missing, so .shift would not produce reliable output.
Desired output example for last three rows in the example dataset:
Index 13: Avg of indices 8, 7, 6 = (30+40+50) / 3 = 40
Index 14: Avg of indices 9, 8, 7 = (40+50+60) / 3 = 50
Index 15: Avg of indices 9, 8, 7 = (40+50+60) / 3 = 50
What would be a working solution for this? Thanks!
Mostly inspired by @Aidis, you could make his solution an apply:
df['mean'] = df.apply(lambda y: df["Value"][df['Date'] <= y['Date'] - pd.Timedelta(1, "W")].tail(3).mean(), axis=1)
or, splitting the data at each call, which may run faster if you have lots of data (to be tested):
df['mean'] = df.apply(lambda y: df.loc[:y.name, "Value"][df.loc[:y.name, 'Date'] <= y['Date'] - pd.Timedelta(1, "W")].tail(3).mean(), axis=1)
which returns:
Index Date Weekday Value mean
0 1 2022-12-05 2 10 NaN
1 2 2022-12-06 3 20 NaN
2 3 2022-12-07 4 40 NaN
3 4 2022-12-09 6 10 NaN
4 5 2022-12-10 7 60 NaN
5 6 2022-12-11 1 30 NaN
6 7 2022-12-12 2 40 10.000000
7 8 2022-12-13 3 50 15.000000
8 9 2022-12-14 4 60 23.333333
9 10 2022-12-16 6 20 23.333333
10 11 2022-12-17 7 50 36.666667
11 12 2022-12-18 1 10 33.333333
12 13 2022-12-20 3 20 40.000000
13 14 2022-12-21 4 10 50.000000
14 15 2022-12-22 5 40 50.000000
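For large frames, both apply variants are quadratic. A vectorized sketch (my addition, not part of the original answers) precomputes the trailing 3-row mean and matches each row to the most recent row dated at least a week earlier with pd.merge_asof; it assumes Date is already datetime64 and sorted ascending:
import pandas as pd

# trailing mean of up to 3 rows, mirroring tail(3).mean() near the start
roll = pd.DataFrame({
    'Date': df['Date'],
    'mean': df['Value'].rolling(3, min_periods=1).mean(),
})
left = pd.DataFrame({'cutoff': df['Date'] - pd.Timedelta(7, 'D')})

# for each row, take the trailing mean at the last row dated <= cutoff
df['mean'] = pd.merge_asof(left, roll,
                           left_on='cutoff', right_on='Date',
                           direction='backward')['mean'].to_numpy()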
I apologize for this ugly code, but it seems to work:
df = df.set_index("Index")
# parse the day-first dates explicitly; a bare astype("datetime64")
# would read 05/12/2022 as May 12 and scramble the comparisons below
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
for idx in df.index:
    dfs = df.loc[:idx]
    mean = dfs["Value"][dfs['Date'] <= dfs.iloc[-1]['Date'] - pd.Timedelta(1, "W")].tail(3).mean()
    print(idx, mean)
Result:
1 nan
2 nan
3 nan
4 nan
5 nan
6 nan
7 10.0
8 15.0
9 23.333333333333332
10 23.333333333333332
11 36.666666666666664
12 33.333333333333336
13 40.0
14 50.0
15 50.0
I'm looking for a way to calculate, with Python Pandas, the rolling(*) min of a Series without a window.
Let's consider the following Series
In [26]: s = pd.Series([10, 12, 14, 9, 10, 8, 16, 20])
In [27]: s
Out[27]:
0 10
1 12
2 14
3 9
4 10
5 8
6 16
7 20
dtype: int64
I would like to get a Series like
0 10
1 10
2 10
3 9
4 9
5 8
6 8
7 8
dtype: int64
I tried
s.rolling().min()
but I'm getting the following error
TypeError: rolling() missing 1 required positional argument: 'window'
I did this:
r = s.copy()
val_min = r.iloc[0]
# walk the series, carrying the running minimum forward
for i, (idx, val) in enumerate(r.items()):
    if i > 0:
        if val < val_min:
            val_min = val
        else:
            r[idx] = val_min
and got the correct answer:
In [30]: r
Out[30]:
0 10
1 10
2 10
3 9
4 9
5 8
6 8
7 8
dtype: int64
but I think a Pandas method for this probably exists (and would be much more efficient), or if it doesn't exist, it should probably be implemented.
(*) "rolling" may not be the appropriate term; maybe it should instead be called a "local" min.
Edit: it's in fact named a cumulative minimum or expanding min
Use Series.cummin:
print(s.cummin())
0 10
1 10
2 10
3 9
4 9
5 8
6 8
7 8
dtype: int64
You can use np.minimum.accumulate:
import numpy as np
pd.Series(np.minimum.accumulate(s.values))
0 10
1 10
2 10
3 9
4 9
5 8
6 8
7 8
dtype: int64
Another way is to use s.expanding().min() (see Series.expanding):
s.expanding().min()
Output:
0 10.0
1 10.0
2 10.0
3 9.0
4 9.0
5 8.0
6 8.0
7 8.0
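One small difference worth noting (my observation, not from the answers above): cummin preserves the integer dtype, while the expanding/rolling machinery always computes in floats, which is why the output just above shows 10.0 rather than 10:
print(s.cummin().dtype)           # int64
print(s.expanding().min().dtype)  # float64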
I have a DataFrame as shown below:
df =
index value1 value2 value3
001 0.3 1.3 4.5
002 1.1 2.5 3.7
003 0.1 0.9 7.8
....
365 3.4 1.2 0.9
The index represents the days in a year (so sometimes the last index is 366). I want to group it into bins of a chosen number of days (for example 10 days or 30 days). I think the code would be something like:
df_new = df.groupby("method").mean()
In some questions I saw that a datetime type was used to group by; however, in my DataFrame the index is just numbers. Is there a better way to group it? Thanks in advance!
I think you need to floor-divide the index values and aggregate the mean:
df_new = df.groupby(df.index // 10).mean()
Another, more general solution if the index is not the default unique numeric one:
df_new = df.groupby(np.arange(len(df.index)) // 10).mean()
Sample:
c = 'val1 val2 val3'.split()
df = pd.DataFrame(np.random.randint(10, size=(20,3)), columns=c)
print (df)
val1 val2 val3
0 5 9 4
1 5 7 1
2 8 3 5
3 2 4 2
4 2 8 4
5 8 5 6
6 0 9 8
7 2 3 6
8 7 0 0
9 3 3 5
10 6 6 3
11 8 9 6
12 5 1 6
13 1 5 9
14 1 4 5
15 3 2 2
16 4 5 4
17 3 5 1
18 9 4 5
19 9 8 7
df_new = df.groupby(df.index // 10).mean()
print (df_new)
val1 val2 val3
0 4.2 5.1 4.1
1 4.9 4.9 4.8
Just create a new index via the floored-quotient operator // and group by it. Here is an example with 155 rows. You can drop the original index column from the result.
df = pd.DataFrame({'index': list(range(1, 156)),
'val1': np.random.rand(155),
'val2': np.random.rand(155),
'val3': np.random.rand(155)})
df['new_index'] = df['index'] // 10
res = df.groupby('new_index', as_index=False).mean().drop(columns='index')
# new_index val1 val2 val3
# 0 0 0.315851 0.462080 0.491779
# 1 1 0.377690 0.566162 0.588248
# 2 2 0.314571 0.471430 0.626292
# 3 3 0.725548 0.572577 0.530589
# 4 4 0.569597 0.466964 0.443815
# 5 5 0.470747 0.394189 0.321107
# 6 6 0.362968 0.362278 0.415093
# 7 7 0.403529 0.626155 0.322582
# 8 8 0.555819 0.415741 0.525251
# 9 9 0.454660 0.336846 0.524158
# 10 10 0.435777 0.495191 0.380897
# 11 11 0.345916 0.550897 0.487255
# 12 12 0.676762 0.464794 0.612018
# 13 13 0.524610 0.450550 0.472724
# 14 14 0.466074 0.542736 0.680481
# 15 15 0.456921 0.565800 0.442543
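If the bins should cover exactly 10 days each when the day numbers start at 1 (with // 10 the first bin holds only days 1-9), pd.cut is an alternative. A sketch under that assumption, reusing the df with the 'index' column from the example above:
import numpy as np
import pandas as pd

# right-closed bins (0, 10], (10, 20], ... so days 1-10 share a bin
edges = np.arange(0, df['index'].max() + 10, 10)
res = df.groupby(pd.cut(df['index'], bins=edges), observed=True).mean()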
I have a pandas DataFrame with a column of integers. I want the rows containing numbers greater than 10. I am able to evaluate True or False but not the actual value, by doing:
df['ints'] = df['ints'] > 10
I don't use Python very often so I'm going round in circles with this.
I've spent 20 minutes Googling but haven't been able to find what I need....
Edit:
observationID recordKey gridReference siteKey siteName featureKey startDate endDate pTaxonVersionKey taxonName authority commonName ints
0 463166539 1767 SM90 NaN NaN 150161 12/02/2006 12/02/2006 NBNSYS0100004720 Pipistrellus pygmaeus (Leach, 1825) Soprano Pipistrelle 2006
1 463166623 4325 TL65 NaN NaN 168651 21/12/2008 21/12/2008 NHMSYS0020001355 Pipistrellus pipistrellus sensu stricto (Schreber, 1774) Common Pipistrelle 2008
2 463166624 4326 TL65 NaN NaN 168651 18/01/2009 18/01/2009 NHMSYS0020001355 Pipistrellus pipistrellus sensu stricto (Schreber, 1774) Common Pipistrelle 2009
3 463166625 4327 TL65 NaN NaN 168651 15/02/2009 15/02/2009 NHMSYS0020001355 Pipistrellus pipistrellus sensu stricto (Schreber, 1774) Common Pipistrelle 2009
4 463166626 4328 TL65 NaN NaN 168651 19/12/2009 19/12/2009 NHMSYS0020001355 Pipistrellus pipistrellus sensu stricto (Schreber, 1774) Common Pipistrelle 2009
Sample DF:
In [79]: df = pd.DataFrame(np.random.randint(5, 15, (10, 3)), columns=list('abc'))
In [80]: df
Out[80]:
a b c
0 6 11 11
1 14 7 8
2 13 5 11
3 13 7 11
4 13 5 9
5 5 11 9
6 9 8 6
7 5 11 10
8 8 10 14
9 7 14 13
Present only the rows where b > 10:
In [81]: df[df.b > 10]
Out[81]:
a b c
0 6 11 11
5 5 11 9
7 5 11 10
9 7 14 13
Minimums (for all columns) for the rows satisfying the b > 10 condition:
In [82]: df[df.b > 10].min()
Out[82]:
a 5
b 11
c 9
dtype: int32
Minimum (for the b column) for the rows satisfying the b > 10 condition:
In [84]: df.loc[df.b > 10, 'b'].min()
Out[84]: 11
UPDATE: starting from Pandas 0.20.1, the .ix indexer is deprecated in favor of the stricter .iloc and .loc indexers.
You can also use query:
In [2]: df = pd.DataFrame({'ints': range(9, 14), 'alpha': list('ABCDE')})
In [3]: df
Out[3]:
ints alpha
0 9 A
1 10 B
2 11 C
3 12 D
4 13 E
In [4]: df.query('ints > 10')
Out[4]:
ints alpha
2 11 C
3 12 D
4 13 E
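Applied to the asker's original ints column, either approach returns the matching rows; note that the line in the question, df['ints'] = df['ints'] > 10, instead overwrites the column with booleans. A minimal sketch with made-up values:
import pandas as pd

df = pd.DataFrame({'ints': [2006, 2008, 9, 2009, 10]})

rows = df[df['ints'] > 10]    # boolean indexing keeps whole rows
rows = df.query('ints > 10')  # equivalent, via query
print(rows)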
I tried to use the following code to normalize a column in a Python DataFrame:
df['x_norm'] = df.apply(lambda x: (x['X'] - x['X'].mean()) / (x['X'].max() - x['X'].min()),axis=1)
but got the following error:
df['x_norm'] = df.apply(lambda x: (x['X'] - x['X'].mean()) / (x['X'].max() - x['X'].min()),axis=1)
AttributeError: ("'float' object has no attribute 'mean'", u'occurred at index 0')
Does anyone know what I missed here? Thanks!
I'm assuming you are using Pandas.
Instead of applying to the entire DataFrame, apply (documentation) only to the Series 'X'; also, you should pre-calculate the mean, max, and min values. Something like this:
avg = df['X'].mean()
diff = df['X'].max() - df['X'].min()
new_df = df['X'].apply(lambda x: (x-avg)/diff)
If you are looking to normalize the entire DataFrame check this answer:
df.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
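For a single column, apply is not needed at all: pandas arithmetic broadcasts scalars over a Series, which is shorter and faster. A sketch of what the question's line was presumably aiming for:
# mean-center and scale by the range, element-wise over the whole column
df['x_norm'] = (df['X'] - df['X'].mean()) / (df['X'].max() - df['X'].min())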
If you want to normalize values in column X:
df['x_norm'] = df.X.div(df.X.sum())
Step by step:
In [65]: df
Out[65]:
a b X
0 2 1 5
1 1 4 5
2 7 4 7
3 1 6 6
4 5 5 8
5 5 8 2
6 6 7 5
7 8 2 5
8 7 9 9
9 9 6 5
In [68]: df['x_norm'] = df.X.div(df.X.sum())
In [69]: df
Out[69]:
a b X x_norm
0 2 1 5 0.087719
1 1 4 5 0.087719
2 7 4 7 0.122807
3 1 6 6 0.105263
4 5 5 8 0.140351
5 5 8 2 0.035088
6 6 7 5 0.087719
7 8 2 5 0.087719
8 7 9 9 0.157895
9 9 6 5 0.087719
check:
In [70]: df.x_norm.sum()
Out[70]: 1.0