calculating moving average in pandas - python

So, this is a fairly new topic for me and I don't quite understand it yet. I wanted to make a new column in a dataset that contains the moving average of the Volume column. The window size is 5, and the moving average of row x is calculated from rows x-2, x-1, x, x+1 and x+2. For x=1 and x=2, the moving average is calculated using three and four rows, respectively.
I did this.
df['Volume_moving'] = df.iloc[:,5].rolling(window=5).mean()
df
Date Open High Low Close Volume Adj Close Volume_moving
0 2012-10-15 632.35 635.13 623.85 634.76 15446500 631.87 NaN
1 2012-10-16 635.37 650.30 631.00 649.79 19634700 646.84 NaN
2 2012-10-17 648.87 652.79 644.00 644.61 13894200 641.68 NaN
3 2012-10-18 639.59 642.06 630.00 632.64 17022300 629.76 NaN
4 2012-10-19 631.05 631.77 609.62 609.84 26574500 607.07 18514440.0
... ... ... ... ... ... ... ... ...
85 2013-01-08 529.21 531.89 521.25 525.31 16382400 525.31 17504860.0
86 2013-01-09 522.50 525.01 515.99 517.10 14557300 517.10 16412620.0
87 2013-01-10 528.55 528.72 515.52 523.51 21469500 523.51 18185340.0
88 2013-01-11 521.00 525.32 519.02 520.30 12518100 520.30 16443720.0
91 2013-01-14 502.68 507.50 498.51 501.75 26179000 501.75 18221260.0
However, I think the result is not accurate, as I tried it with a different dataframe and got the exact same result.
Can anyone please help me with this?

Try it with the column selected by name instead of by position:
df['Volume_moving'] = df['Volume'].rolling(window=5).mean()
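A quick sanity check on the output above: with the default trailing window, the first non-NaN value appears at row 4 and equals (15446500 + 19634700 + 13894200 + 17022300 + 26574500) / 5 = 18514440.0, exactly as shown. Note, though, that the question describes a centered window with shorter windows at the edges; a sketch of that variant, using rolling's center and min_periods arguments (my addition, not part of the answer above):
# Centered 5-row window; min_periods=1 lets the edge rows average over
# however many neighbours actually exist (three rows for x=1, four for x=2)
df['Volume_moving'] = df['Volume'].rolling(window=5, center=True, min_periods=1).mean()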


Turning 2D pd.DataFrame into 3D array

My dataset consists of 3535560 rows × 16 columns. However, it contains three 'index' variables I would like to use to reshape the dataset: 1830 days, 46 latitude values and 42 longitude values. The transformed dataset should therefore become (1830, 46, 42), but I have no idea how to do this. I saw that I could maybe use pd.pivot or pd.MultiIndex, but I am not able to find a solution.
In short, the data looks like:
time lat lon var 1 var 2 var 3 var 4 var 5
0 2021-01-01 60.125 -120.125 0.381828 0.917779 0.718022 0.064032 0.886050
1 2021-01-01 60.125 -119.875 0.221697 0.232657 0.298497 0.680900 0.124440
...
41 2021-01-01 60.125 -109.125 0.922149 0.708139 0.778329 0.267685 0.552542
42 2021-01-01 59.875 -120.125 0.569874 0.053829 0.740229 0.747286 0.194214
43 2021-01-01 59.875 -119.875 0.500091 0.185990 0.845510 0.877692 0.556584
....
1931 2021-01-01 48.875 -109.125 0.221697 0.232657 0.298497 0.680900 0.124440
1932 2021-01-02 60.125 -120.125 0.666589 0.849857 0.338648 0.552114 0.730678
1933 2021-01-02 60.125 -119.875 0.351144 0.467692 0.161488 0.530906 0.277561
As you can see in the table, after 42 longitude values it takes a new latitude value and loops over all longitude values again. It does this for all latitude values before going to the next day (after 1932 rows, i.e. 46×42).
Would someone be able to help me fix this?
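A minimal sketch of one way to do this with a plain reshape, assuming the rows really are ordered time → lat → lon with no gaps, exactly as described above (the sort just enforces that ordering; the variable names are illustrative):
df = df.sort_values(['time', 'lat', 'lon'])  # enforce a consistent ordering
# lon varies fastest, then lat, then time, so a straight reshape lines up
var1_3d = df['var 1'].to_numpy().reshape(1830, 46, 42)
Alternatively, if xarray is installed, df.set_index(['time', 'lat', 'lon']).to_xarray() produces a labelled (time, lat, lon) Dataset with one array per variable, which avoids hand-counting dimensions.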

Get maximum relative difference between row-values and row-mean in new pandas dataframe column

I want to have an extra column with the maximum relative difference [-] between the row values and the mean of those rows:
The df is filled with energy use data for several years.
The theoretical formula that should get me this is as follows:
df['max_rel_dif'] = max(abs(highest energy use − mean energy use), abs(lowest energy use − mean energy use)) / mean energy use
Initial dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014
0 23 22631 21954.0 22314.0 22032 21843
1 43 27456 29654.0 28159.0 28654 2000
2 36 61200 NaN NaN 31895 1600
3 87 87621 86542.0 87542.0 88456 86961
4 90 58951 57486.0 2000.0 0 0
5 98 24587 25478.0 NaN 24896 25461
Desired dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014 max_rel_dif
0 23 22631 21954.0 22314.0 22032 21843 0.02149
1 43 27456 29654.0 28159.0 28654 2000 0.91373
2 36 61200 NaN NaN 31895 1600 0.94931
3 87 87621 86542.0 87542.0 88456 86961 0.01179
4 90 58951 57486.0 2000.0 0 0 1.48870
5 98 24587 25478.0 NaN 24896 25461 0.02065
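As a worked check of the formula on row 0: the mean is (22631 + 21954 + 22314 + 22032 + 21843) / 5 = 22154.8, the larger absolute deviation from it is |22631 − 22154.8| = 476.2, and 476.2 / 22154.8 ≈ 0.02149.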
Tried code:
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [23, 43, 36, 87, 90, 98],
                   "y_2010": [22631, 27456, 61200, 87621, 58951, 24587],
                   "y_2011": [21954, 29654, np.nan, 86542, 57486, 25478],
                   "y_2012": [22314, 28159, np.nan, 87542, 2000, np.nan],
                   "y_2013": [22032, 28654, 31895, 88456, 0, 24896],
                   "y_2014": [21843, 2000, 1600, 86961, 0, 25461]})

print(df)

a = df.loc[:, ['y_2010','y_2011','y_2012','y_2013', 'y_2014']]


# calculate mean
mean = a.mean(1)
# calculate max_rel_dif
df['max_rel_dif'] = (((df.max(axis=1).sub(mean)).abs(),(df.min(axis=1).sub(mean)).abs()).max()).div(mean)
# AttributeError: 'tuple' object has no attribute 'max'
-> I'm obviously doing the wrong thing with the tuple; I just don't know how to take the maximum of the two values and then divide it by the mean in the proper Pythonic way.
I feel like the whole thing can be:
s = df.filter(like='y')
s.sub(s.mean(1), axis=0).abs().max(1) / s.mean(1)
0 0.021494
1 0.913736
2 0.949311
3 0.011800
4 1.488707
5 0.020653
dtype: float64
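To attach it as the new column from the question, the same expression can simply be assigned back:
s = df.filter(like='y')
df['max_rel_dif'] = s.sub(s.mean(1), axis=0).abs().max(1) / s.mean(1)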

Pandas: How to fill missing values in a large dataset?

I have a large dataset (around 8 million rows × 25 columns) in pandas and I am struggling to do one operation in a performant manner.
Here is what my dataset looks like:
temp size
location_id hours
135 78 12.0 100.0
79 NaN NaN
80 NaN NaN
81 15.0 112.0
82 NaN NaN
83 NaN NaN
84 14.0 22.0
I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
The rest of the data is numeric (float). I have only included 2 columns here, normally there are around 20 columns.
What I want to do is fill those NaN values by using the values around them. Basically, the value of hour 79 will be derived from the values of 78 and 81. For this example, the temp value of 79 will be 13.0 (basic linear interpolation).
I always know that only hours 78, 81, 84 and so on (multiples of 3) will have values and the rest will be NaN. That will always be the case for hours between 78 and 120.
With these in mind, I have implemented the following algorithm in Pandas:
df_relevant_data = df.loc[(df.index.get_level_values(1) >= 78) & (df.index.get_level_values(1) <= 120), :]
for location_id, data_of_location_id in df_relevant_data.groupby("location_id"):
    for hour in range(81, 123, 3):
        top_hour_data = data_of_location_id.loc[(location_id, hour), ['temp', 'size']]  # e.g. 81
        bottom_hour_data = data_of_location_id.loc[(location_id, (hour - 3)), ['temp', 'size']]  # e.g. 78
        difference = top_hour_data.values - bottom_hour_data.values
        bottom_bump = difference * (1/3)  # amount to add to calculate the 79th hour
        top_bump = difference * (2/3)  # amount to add to calculate the 80th hour
        df.loc[(location_id, (hour - 2)), ['temp', 'size']] = bottom_hour_data.values + bottom_bump
        df.loc[(location_id, (hour - 1)), ['temp', 'size']] = bottom_hour_data.values + top_bump
This works functionally, however the performance is horrible. It takes at least 10 minutes on my dataset, and that is currently not acceptable.
Is there a better/faster way to implement this? I am actually working on only a slice of the whole data (only hours between 78 and 120), so I would really expect it to work much faster.
I believe you are looking for interpolate:
print (df.interpolate())
temp size
location_id hours
135 78 12.000000 100.0
79 13.000000 104.0
80 14.000000 108.0
81 15.000000 112.0
82 14.666667 82.0
83 14.333333 52.0
84 14.000000 22.0
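One caveat worth adding (my note, not part of the answer above): called on the whole frame, interpolate() runs straight down each column, so the first missing hours of one location would be filled from the last values of the previous location. If that matters, interpolating per group keeps locations independent, at some extra cost:
# Interpolate within each location_id group only (assumes the index
# level is named 'location_id', as in the question)
df_filled = df.groupby(level='location_id', group_keys=False).apply(lambda g: g.interpolate())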

Calculating difference between two rows in Python / Pandas

In python, how can I reference previous row and calculate something against it? Specifically, I am working with dataframes in pandas - I have a data frame full of stock price information that looks like this:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
Here is how I created this dataframe:
import pandas
url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
data = pandas.read_csv(url)
## now I sorted the data frame ascending by date
data = data.sort_values('Date')  # .sort(columns='Date') in the older pandas this question used
Starting with row number 2, or in this case, I guess it's 250 (PS - is that the index?), I want to calculate the difference between 2011-01-03 and 2011-01-04, for every entry in this dataframe. I believe the appropriate way is to write a function that takes the current row, then figures out the previous row, and calculates the difference between them, then use the pandas apply function to update the dataframe with the value.
Is that the right approach? If so, should I be using the index to determine the difference? (note - I'm still in python beginner mode, so index may not be the right term, nor even the correct way to implement this)
I think you want to do something like this:
In [26]: data
Out[26]:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
In [27]: data.set_index('Date').diff()
Out[27]:
Close Adj Close
Date
2011-01-03 NaN NaN
2011-01-04 0.16 0.16
2011-01-05 -0.59 -0.58
2011-01-06 1.61 1.57
2011-01-07 -0.73 -0.71
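If the goal is to keep the change as a column on data rather than build a separate frame, a small variant (the new column names are illustrative):
data['Close_diff'] = data['Close'].diff()
data['Adj Close_diff'] = data['Adj Close'].diff()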
To calculate the difference for a single column, here is what you can do.
df=
A B
0 10 56
1 45 48
2 26 48
3 32 65
We want to compute the row difference in A only, and then keep only the rows where that difference is less than 15.
df['A_dif'] = df['A'].diff()
df=
   A   B  A_dif
0  10  56    NaN
1  45  48   35.0
2  26  48  -19.0
3  32  65    6.0
df = df[df['A_dif']<15]
df=
   A   B  A_dif
2  26  48  -19.0
3  32  65    6.0
Note that row 0 drops out here because NaN < 15 evaluates to False, while row 2 stays because -19.0 < 15.
I don't know pandas, and I'm pretty sure it has something specific for this; however, I'll give you the pure-Python solution, that might be of some help even if you need to use pandas:
import csv
import urllib
# This basically retrieves the CSV file and loads it into a list,
# converting all numeric values to floats
url='http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
reader = csv.reader(urllib.urlopen(url), delimiter=',')
# We sort the output list so the records are ordered by date
cleaned = sorted([[r[0]] + map(float, r[1:]) for r in list(reader)[1:]])
for i, row in enumerate(cleaned):  # enumerate() yields two-tuples: (<id>, <item>)
    if i == 0:
        continue  # the first row has no previous row to diff against
    # This calculates the difference of each numeric field with the same
    # field in the row before this one
    print row[0], [(row[j] - cleaned[i-1][j]) for j in range(1, 7)]
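The code above is Python 2 (print statement, urllib.urlopen, map returning a list). A sketch of the same idea ported to Python 3, kept purely as an illustration since the Yahoo URL has long been retired:
import csv
import io
import urllib.request

url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
with urllib.request.urlopen(url) as resp:
    rows = list(csv.reader(io.TextIOWrapper(resp), delimiter=','))[1:]  # drop the header row

# Sort so the records are ordered by date, converting numeric fields to float
cleaned = sorted([[r[0]] + [float(x) for x in r[1:]] for r in rows])
for i, row in enumerate(cleaned):
    if i == 0:
        continue  # the first row has no previous row to diff against
    print(row[0], [row[j] - cleaned[i - 1][j] for j in range(1, 7)])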
