Calculating difference between two rows in Python / Pandas - python

In Python, how can I reference the previous row and calculate something against it? Specifically, I am working with dataframes in pandas - I have a data frame full of stock price information that looks like this:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
Here is how I created this dataframe:
import pandas
url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
data = pandas.read_csv(url)
## now I sorted the data frame ascending by date
## (in current pandas this would be data.sort_values(by='Date'))
data = data.sort(columns='Date')
Starting with row number 2, or in this case, I guess it's 250 (PS - is that the index?), I want to calculate the difference between 2011-01-03 and 2011-01-04, for every entry in this dataframe. I believe the appropriate way is to write a function that takes the current row, figures out the previous row, and calculates the difference between them, then use the pandas apply function to update the dataframe with the value.
Is that the right approach? If so, should I be using the index to determine the difference? (note - I'm still in python beginner mode, so index may not be the right term, nor even the correct way to implement this)

I think you want to do something like this:
In [26]: data
Out[26]:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
In [27]: data.set_index('Date').diff()
Out[27]:
Close Adj Close
Date
2011-01-03 NaN NaN
2011-01-04 0.16 0.16
2011-01-05 -0.59 -0.58
2011-01-06 1.61 1.57
2011-01-07 -0.73 -0.71
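If you would rather keep the differences as extra columns next to the original data, diff() (or an explicit shift()) works directly on the columns as well. A minimal sketch using the values shown above; the Close_diff and Adj_Close_diff column names are just illustrative:
import pandas as pd

# Small frame with the rows shown in the question
data = pd.DataFrame({
    'Date': ['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06', '2011-01-07'],
    'Close': [147.48, 147.64, 147.05, 148.66, 147.93],
    'Adj Close': [143.25, 143.41, 142.83, 144.40, 143.69],
})

# diff() is shorthand for subtracting a shifted copy of the column
data['Close_diff'] = data['Close'].diff()
data['Adj_Close_diff'] = data['Adj Close'] - data['Adj Close'].shift()
print(data)
There is no need for apply() here: shift() aligns each row with the previous one, so the subtraction is fully vectorised.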

To calculate the difference for a single column only, here is what you can do:
df =
   A   B
0  10  56
1  45  48
2  26  48
3  32  65
We want to compute the row-to-row difference in A only, and then keep only the rows where that difference is less than 15.
df['A_dif'] = df['A'].diff()
df =
   A   B  A_dif
0  10  56    NaN
1  45  48   35.0
2  26  48  -19.0
3  32  65    6.0
df = df[df['A_dif'] < 15]
df =
   A   B  A_dif
2  26  48  -19.0
3  32  65    6.0
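Note that the NaN in the first row compares as False against 15, so that row is dropped by the filter. If you want to keep it as well, one option (a small sketch using the same toy data) is to allow NaN explicitly:
import pandas as pd

df = pd.DataFrame({'A': [10, 45, 26, 32], 'B': [56, 48, 48, 65]})
df['A_dif'] = df['A'].diff()

# Keep rows whose difference is below 15, plus the first row whose diff is NaN
print(df[df['A_dif'].lt(15) | df['A_dif'].isna()])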

I don't know pandas, and I'm pretty sure it has something specific for this; however, I'll give you a pure-Python solution, which might be of some help even if you end up using pandas:
import csv
import urllib.request

# Retrieve the CSV file and load it into a list of rows, converting
# all numeric fields to floats (the header row is skipped)
url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
lines = urllib.request.urlopen(url).read().decode().splitlines()
reader = csv.reader(lines, delimiter=',')
# Sort the rows so the records are ordered by date (the first field)
cleaned = sorted([r[0]] + [float(x) for x in r[1:]] for r in list(reader)[1:])
for i, row in enumerate(cleaned):  # enumerate() yields two-tuples: (<index>, <item>)
    if i == 0:
        continue  # the first row has no previous row to compare against
    # Difference of each numeric field with the same field in the previous row
    print(row[0], [row[j] - cleaned[i - 1][j] for j in range(1, len(row))])

Related


Pandas: How to fill missing values in a large dataset?

I have a large dataset (around 8 million rows x 25 columns) in Pandas and I am struggling to do one operation in a performant manner.
Here is what my dataset looks like:
                   temp   size
location_id hours
135         78     12.0  100.0
            79      NaN    NaN
            80      NaN    NaN
            81     15.0  112.0
            82      NaN    NaN
            83      NaN    NaN
            84     14.0   22.0
I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
The rest of the data is numeric (float). I have only included 2 columns here; normally there are around 20 columns.
What I want to do is fill those NaN values using the values around them. Basically, the value for hour 79 will be derived from the values for hours 78 and 81. In this example, the temp value for hour 79 would be 13.0 (simple linear interpolation).
I know that only hours 78, 81, 84, and so on (multiples of 3) will have values and the rest will be NaN. That is always the case, for hours between 78 and 120.
With these in mind, I have implemented the following algorithm in Pandas:
df_relevant_data = df.loc[(df.index.get_level_values(1) >= 78) & (df.index.get_level_values(1) <= 120), :]
for location_id, data_of_location_id in df_relevant_data.groupby("location_id"):
    for hour in range(81, 123, 3):
        top_hour_data = data_of_location_id.loc[(location_id, hour), ['temp', 'size']]  # e.g. 81
        bottom_hour_data = data_of_location_id.loc[(location_id, (hour - 3)), ['temp', 'size']]  # e.g. 78
        difference = top_hour_data.values - bottom_hour_data.values
        bottom_bump = difference * (1/3)  # amount to add to calculate the 79th hour
        top_bump = difference * (2/3)  # amount to add to calculate the 80th hour
        df.loc[(location_id, (hour - 2)), ['temp', 'size']] = bottom_hour_data.values + bottom_bump
        df.loc[(location_id, (hour - 1)), ['temp', 'size']] = bottom_hour_data.values + top_bump
This works correctly, but the performance is horrible: it takes at least 10 minutes on my dataset, which is not acceptable.
Is there a better/faster way to implement this? I am working on only a slice of the whole data (hours between 78 and 120), so I would expect it to run much faster.
I believe you are looking for interpolate:
print(df.interpolate())
                        temp   size
location_id hours
135         78     12.000000  100.0
            79     13.000000  104.0
            80     14.000000  108.0
            81     15.000000  112.0
            82     14.666667   82.0
            83     14.333333   52.0
            84     14.000000   22.0
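One caveat: called on the whole frame, interpolate() will also interpolate across the boundary between one location_id and the next. If that is not wanted, a minimal sketch that interpolates within each location only (the two-location frame below is a made-up stand-in for the real data):
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([[135, 136], range(78, 85)],
                                 names=['location_id', 'hours'])
df = pd.DataFrame({'temp': [12, np.nan, np.nan, 15, np.nan, np.nan, 14] * 2,
                   'size': [100, np.nan, np.nan, 112, np.nan, np.nan, 22] * 2},
                  index=idx)

# Interpolate each column per location, so values never bleed across locations
filled = df.groupby(level='location_id').transform(lambda s: s.interpolate())
print(filled)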

Filter rows less than the cumulative maximum

I'm trying to filter rows based on a relatively simple criterion: if the value of Open is less than the maximum value of the column up to that row, the row gets dropped; otherwise the row stays and becomes the new reference maximum.
This is the starting example dataframe:
import pandas as pd
import numpy as np
d = {'Date':['22-01-2019','23-01-2019','24-01-2019','25-01-2019','26-01-2019'],'Open': [40,54,54,79,67], 'Close': [43,53,65,80,61]}
df = pd.DataFrame(data=d)
print(df)
In this case I would like to do the filtering on the column Open:
Date Open Close
0 22-01-2019 40 43 #Max is 40
1 23-01-2019 54 53 #54 is higher than 40 so it stays
2 24-01-2019 54 65 #This is not higher than the previous max, should get dropped
3 25-01-2019 79 80 #This is higher than 54, so it stays
4 26-01-2019 67 61 #This is not higher than 79, should get dropped
The only way I could come up with to solve the problem is a for loop that iterates over each row, keeps an auxiliary variable holding the running maximum, and builds a boolean series. However, it's extremely inefficient when dealing with more than 100k rows. The final goal is to perform the same filter on the Close column and join the results, to know on which days (the original data is every 15 minutes) both the Open and Close values have risen above the highest value previously recorded.
Finally the output should look like this:
Date Open Close
0 22-01-2019 40 43
1 23-01-2019 54 53
3 25-01-2019 79 80
If doing the same operation for the Close column it should look like:
Date Open Close
0 22-01-2019 40 43
1 23-01-2019 54 53
2 24-01-2019 54 65
3 25-01-2019 79 80
The final goal (which I would know how to do once I get through the filtering part, but I am sharing it for the sake of the full case) is:
Date Open Close
0 22-01-2019 40 43
1 23-01-2019 54 53
3 25-01-2019 79 80
My solution is:
max_v = 0
list_for_filtering = []
for i, value in df.iterrows():
    if value['Open'] > max_v:
        max_v = value['Open']
        list_for_filtering.append(True)
    else:
        list_for_filtering.append(False)

df['T/F'] = list_for_filtering
And then filter, keeping only the rows where the flag is True.
One simple solution is to compare "Open" with the shifted cummax:
# thanks to Andy L. for the simplification!
df[df['Open'] > df['Open'].cummax().shift(fill_value=-np.inf)]
Date Open Close
0 22-01-2019 40 43
1 23-01-2019 54 53
3 25-01-2019 79 80
Where,
df['Open'].cummax().shift()
0 NaN
1 40.0
2 54.0
3 54.0
4 79.0
Name: Open, dtype: float64
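For the final goal from the question (keep only the rows where both Open and Close rise above their previous record highs), the same idea can be applied to each column and the two masks combined. A short sketch with the question's data:
import numpy as np
import pandas as pd

d = {'Date': ['22-01-2019', '23-01-2019', '24-01-2019', '25-01-2019', '26-01-2019'],
     'Open': [40, 54, 54, 79, 67], 'Close': [43, 53, 65, 80, 61]}
df = pd.DataFrame(data=d)

# A row survives only if both columns exceed their previous cumulative maximum
open_high = df['Open'] > df['Open'].cummax().shift(fill_value=-np.inf)
close_high = df['Close'] > df['Close'].cummax().shift(fill_value=-np.inf)
print(df[open_high & close_high])
This returns rows 0, 1 and 3, matching the expected final output in the question.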

Reading a pandas data frame having unequal columns in observations

I am trying to read this small data file,
Link - https://drive.google.com/open?id=1nAS5mpxQLVQn9s_aAKvJt8tWPrP_DUiJ
I am using the code -
df = pd.read_table('/Data/123451_date.csv', sep=';', index_col=0, engine='python', error_bad_lines=False)
It has ';' as a separator, and values are missing in the file for some columns in some observations (rows).
How can I read it properly? The dataframe I currently get is not loaded properly.
It looks like the data you use has some garbage in it. Precisely, rows 1-33 (inclusive) contain additional, unnecessary (non-GPS) information. You can either fix the file by manually removing the unneeded information, or use the following code snippet to skip the rows that include it:
from pandas import read_table
data = read_table('34_2017-02-06.gpx.csv', sep=';', skiprows=list(range(1, 34))).drop("Unnamed: 28", axis=1)
The drop("Unnamed: 28", axis=1) is simply there to remove an additional column that gets created, probably because each row in your datasheet ends with a ; (pandas reads the empty field at the end of each line as data).
The result of print(data.head()) is then as follows:
index cumdist ele ... esttotalpower lat lon
0 49 340 -34.8 ... 9 52.077362 5.114530
1 51 350 -34.8 ... 17 52.077468 5.114543
2 52 360 -35.0 ... -54 52.077521 5.114551
3 53 370 -35.0 ... -173 52.077603 5.114505
4 54 380 -34.8 ... 335 52.077677 5.114387
[5 rows x 28 columns]
To explain the role of the drop command even more, here is what would happen without it (notice the last, weird column)
index cumdist ele ... lat lon Unnamed: 28
0 49 340 -34.8 ... 52.077362 5.114530 NaN
1 51 350 -34.8 ... 52.077468 5.114543 NaN
2 52 360 -35.0 ... 52.077521 5.114551 NaN
3 53 370 -35.0 ... 52.077603 5.114505 NaN
4 54 380 -34.8 ... 52.077677 5.114387 NaN
[5 rows x 29 columns]
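As a side note, if you are on pandas 1.3 or newer, the error_bad_lines=False argument used in the question has been replaced by on_bad_lines='skip'. A rough equivalent of the read above with that change (same file name and row skipping as in the answer):
import pandas as pd

df = pd.read_csv(
    '34_2017-02-06.gpx.csv',   # same file as above
    sep=';',
    skiprows=range(1, 34),     # skip the non-GPS rows mentioned above
    on_bad_lines='skip',       # replacement for error_bad_lines=False
).drop("Unnamed: 28", axis=1)  # drop the empty trailing column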

How to map a function in pandas which compares each record in a column to previous and next records

I have a time series of water levels for which I need to calculate monthly and annual statistics in relation to several arbitrary flood stages. Specifically, I need to determine the duration per month that the water exceeded flood stage, as well as the number of times these excursions occurred. Additionally, because of the noise associated with the dataloggers, I need to exclude floods that lasted less than 1 hour as well as floods with less than 1 hour between events.
Mock up data:
import datetime

import numpy as np
import pandas as pd

start = datetime.datetime(2014,9,5,12,00)
daterange = pd.date_range(start, periods = 10000, freq = '30min', name = "Datetime")
data = np.random.random_sample((len(daterange), 3)) * 10
columns = ["Pond_A", "Pond_B", "Pond_C"]
df = pd.DataFrame(data = data, index = daterange, columns = columns)
flood_stages = [('Stage_1', 4.0), ('Stage_2', 6.0)]
My desired output is:
                     Pond_A_Stage_1_duration  Pond_A_Stage_1_events
2014-09-30 12:00:00                     35.5                      2
2014-10-31 12:00:00                     40.5                     31
2014-11-30 12:00:00                      100                     16
2014-12-31 12:00:00                       36                     12
etc. for the duration and events at each flood stage, at each reservoir.
I've tried grouping by month, iterating through the ponds and then iterating through each row like:
grouper = pd.TimeGrouper(freq = "1MS")
month_groups = df.groupby(grouper)
for name, group in month_groups:
    flood_stage_a = group.sum()[1]
    flood_stage_b = group.sum()[2]
    inundation_a = False
    inundation_30_a = False
    inundation_hour_a = False
    change_inundation_a = 0
    for level in group.values:
        if level[1]:
            inundation_a = True
        else:
            inundation_a = False
        if inundation_hour_a == False and inundation_a == True and inundation_30_a == True:
            change_inundation_a += 1
        inundation_hour_a = inundation_30_a
        inundation_30_a = inundation_a
But this is a caveman solution and the heuristics are getting messy, since I don't want to count a new event if a flood started in one month and continued into the next. This also doesn't combine events with less than one hour between their start and end. Is there a better way to compare a record to its previous and next records?
My other thought is to create new columns with the series shifted t+1, t+2, t-1, t-2, so I can evaluate each row once, but this still seems inefficient. Is there a smarter way to do this by mapping a function?
Let me give a quick, partial answer since no one has answered yet, and maybe someone else can do something better later on if this does not suffice for you.
You can do the time spent above flood stage pretty easily. I divided by 48 so the units are in days (the data has 48 half-hour readings per day).
df[ df > 4 ].groupby(pd.TimeGrouper( freq = "1MS" )).count() / 48
Pond_A Pond_B Pond_C
Datetime
2014-09-01 15.375000 15.437500 14.895833
2014-10-01 18.895833 18.187500 18.645833
2014-11-01 17.937500 17.979167 18.666667
2014-12-01 18.104167 18.354167 18.958333
2015-01-01 18.791667 18.645833 18.708333
2015-02-01 16.583333 17.208333 16.895833
2015-03-01 18.458333 18.458333 18.458333
2015-04-01 0.458333 0.520833 0.500000
Counting distinct events is a little harder, but something like this will get you most of the way. (Note that this produces an unrealistically high number of flooding events, but that's just because of how the sample data is set up and not reflective of a typical pond, though I'm not an expert on pond flooding!)
for c in df.columns:
    df[c+'_events'] = ((df[c] > 4) & (df[c].shift() <= 4))

df.iloc[:,-3:].groupby(pd.TimeGrouper( freq = "1MS" )).sum()
Pond_A_events Pond_B_events Pond_C_events
Datetime
2014-09-01 306 291 298
2014-10-01 381 343 373
2014-11-01 350 346 357
2014-12-01 359 352 361
2015-01-01 355 335 352
2015-02-01 292 337 316
2015-03-01 344 360 386
2015-04-01 9 10 9
A couple of things to note. First, an event can span months, and this method will group it with the month where the event began. Second, I'm ignoring the duration of the event here, but you can adjust that however you want. For example, if you want to say the event doesn't start unless there are 2 consecutive periods below flood level followed by 2 consecutive periods above flood level, just change the relevant line above to:
df[c+'_events'] = ((df[c] > 4) & (df[c].shift(1) <= 4) &
                   (df[c].shift(-1) > 4) & (df[c].shift(2) <= 4))
That produces a pretty dramatic reduction in the count of distinct events:
Pond_A_events Pond_B_events Pond_C_events
Datetime
2014-09-01 70 71 72
2014-10-01 91 85 81
2014-11-01 87 75 91
2014-12-01 88 87 77
2015-01-01 91 95 94
2015-02-01 79 90 83
2015-03-01 83 78 85
2015-04-01 0 2 2
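A final note: pd.TimeGrouper has since been removed from pandas, and pd.Grouper takes its place. A minimal sketch of the same monthly event count using the mock data from the question:
import numpy as np
import pandas as pd

# Same shape of mock data as in the question
daterange = pd.date_range("2014-09-05 12:00", periods=10000, freq="30min", name="Datetime")
df = pd.DataFrame(np.random.random_sample((len(daterange), 3)) * 10,
                  index=daterange, columns=["Pond_A", "Pond_B", "Pond_C"])

# A Stage_1 (4.0) event starts wherever a value crosses from <= 4 to > 4
events = (df > 4) & (df.shift() <= 4)

# Count event starts per calendar month; pd.Grouper replaces pd.TimeGrouper
print(events.groupby(pd.Grouper(freq="MS")).sum())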
