multiply dataframes based on timestamp intervals overlap - python

I have two pandas dataframes, each with two columns: a measurement and a timestamp. I need to multiply the first differences of the measurements, but only if there is a time overlap between the two measurement intervals. How can I do this efficiently, as the size of the dataframes gets large?
Example:
dfA
mesA timeA
0 125 2015-01-14 04:44:49
1 100 2015-01-14 05:16:23
2 115 2015-01-14 08:57:10
dfB
mesB timeB
0 140 2015-01-14 00:13:17
1 145 2015-01-14 08:52:01
2 120 2015-01-14 11:31:44
Here I would multiply (100-125)*(145-140) since there is a time overlap between the intervals [04:44:49, 05:16:23] and [00:13:17, 08:52:01], but not (100-125)*(120-145), since there is no overlap there. Similarly, I would have (115-100)*(145-140) but also (115-100)*(120-145), since both have a time overlap.
In the end I will have to sum all the relevant products into a single value, so the result need not be a dataframe. In this case:
s = (100-125)*(145-140)+(115-100)*(145-140)+(115-100)*(120-145) = -425
My current solution:
s = 0
for i in range(1, len(dfA)):
    startA = dfA['timeA'][i-1]
    endA = dfA['timeA'][i]
    for j in range(1, len(dfB)):
        startB = dfB['timeB'][j-1]
        endB = dfB['timeB'][j]
        if (endB > startA) & (startB < endA):
            s += (dfA['mesA'][i] - dfA['mesA'][i-1]) * (dfB['mesB'][j] - dfB['mesB'][j-1])
Although it seems to work, it is very inefficient and becomes impractical with very large datasets. I believe it could be vectorized more efficiently, perhaps using numexpr, but I still haven't found a way.
EDIT:
other data
mesA timeA
0 125 2015-01-14 05:54:03
1 100 2015-01-14 11:39:53
2 115 2015-01-14 23:58:13
mesB timeB
0 110 2015-01-14 10:58:32
1 120 2015-01-14 13:30:00
2 135 2015-01-14 22:29:26
s = 125

Edit: the original answer did not work, so I came up with another version that is not vectorized, but it requires both dataframes to be sorted by date.
arrA = dfA.timeA.to_numpy()
startA, endA = arrA[0], arrA[1]
arr_mesA = dfA.mesA.diff().to_numpy()
mesA = arr_mesA[1]
arrB = dfB.timeB.to_numpy()
startB, endB = arrB[0], arrB[1]
arr_mesB = dfB.mesB.diff().to_numpy()
mesB = arr_mesB[1]
s = 0
i, j = 1, 1
imax = len(dfA)-1
jmax = len(dfB)-1
while True:
    if (endB > startA) & (startB < endA):
        s += mesA * mesB
    if (endB > endA) and (i < imax):
        i += 1
        startA, endA, mesA = endA, arrA[i], arr_mesA[i]
    elif j < jmax:
        j += 1
        startB, endB, mesB = endB, arrB[j], arr_mesB[j]
    else:
        break
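For comparison, the pairwise overlap test can also be expressed without an explicit Python loop using numpy broadcasting. This is a minimal sketch of my own (the function name is mine, and it assumes both dataframes are sorted by time); it builds the full (len(dfA)-1) x (len(dfB)-1) overlap matrix, so memory grows quadratically, but it avoids the nested loop:
import numpy as np

def overlap_product_sum(dfA, dfB):
    # interval endpoints (assumes both frames are sorted by time)
    tA = dfA['timeA'].to_numpy()
    tB = dfB['timeB'].to_numpy()
    # first differences of the measurements
    dA = np.diff(dfA['mesA'].to_numpy())    # shape (len(dfA)-1,)
    dB = np.diff(dfB['mesB'].to_numpy())    # shape (len(dfB)-1,)
    # overlap[i, j] is True when interval i of A intersects interval j of B
    overlap = (tB[1:][None, :] > tA[:-1][:, None]) & (tB[:-1][None, :] < tA[1:][:, None])
    return (dA[:, None] * dB[None, :] * overlap).sum()
On the example data this reproduces the -425 and 125 values quoted above; the quadratic memory of the overlap matrix is the trade-off against the sorted two-pointer loop.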
Original not working answer
The idea is to create categories with pd.cut, based on the values in dfB['timeB'], in both dataframes to see where they could overlap. Then calculate the diff in measurements, merge both dataframes on the categories, and finally multiply and sum the whole thing.
# create bins
bins_dates = [min(dfB['timeB'].min(), dfA['timeA'].min()) - pd.DateOffset(hours=1)] \
             + dfB['timeB'].tolist() \
             + [max(dfB['timeB'].max(), dfA['timeA'].max()) + pd.DateOffset(hours=1)]
# work on dfB
dfB['cat'] = pd.cut(dfB['timeB'], bins=bins_dates,
labels=range(len(bins_dates)-1), right=False)
dfB['deltaB'] = -dfB['mesB'].diff(-1).ffill()
# work on dfA
dfA['cat'] = pd.cut(dfA['timeA'], bins=bins_dates,
labels=range(len(bins_dates)-1), right=False)
# need to calculate delta for both start and end of intervals
dfA['deltaAStart'] = -dfA['mesA'].diff(-1)
dfA['deltaAEnd'] = dfA['mesA'].diff().mask(dfA['cat'].astype(float).diff().eq(0))
# in the above method, for the end of interval, use a mask to not count twice
# intervals that are fully included in one interval of B
# then merge and calculate the multiplication you are after
df_ = dfB[['cat', 'deltaB']].merge(dfA[['cat','deltaAStart', 'deltaAEnd']])
s = (df_['deltaB'].to_numpy()[:,None]*df_[['deltaAStart', 'deltaAEnd']]).sum().sum()
print (s)
#-425.0

Related

Summarising features with multiple values in Python for Machine Learning model

I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
As you can see from the table above, I have multiple measurements per pregnancy (between 1 and 26 observations each).
I want to summarise the ultrasound measurements somehow such that I can replace the multiple measurements with a fixed amount of features per pregnancy. So I thought of creating 3 new features, one for each trimester of pregnancy that would hold the maximum measurement recorded during that trimester:
abdomCirc1st: this feature would hold the maximum value of all abdominal circumference measurements measured between 0 to 13 Weeks
abdomCirc2nd: this feature would hold the maximum value of all abdominal circumference measurements measured between 14 to 26 Weeks
abdomCirc3rd: this feature would hold the maximum value of all abdominal circumference measurements measured between 27 to 40 Weeks
So my final dataset would look like this:
PregnancyID MotherID abdomCirc1st abdomCirc2nd abdomCirc3rd
0 0 NaN 200 NaN
1 1 NaN 315 350
2 2 180 NaN NaN
The reason for using the maximum here is that a larger abdominal circumference is associated with the adverse outcome I am trying to predict.
But I am quite confused about how to go about this. I have used the groupby function previously to derive certain statistical features from the multiple measurements, however this is a more complex task.
What I want to do is the following:
Group all abdominal circumference measurements that belong to the same pregnancy into 3 trimesters based on gestationalAgeInWeeks value
Compute the maximum value of all abdominal circumference measurements within each trimester, and assign this value to the relevant feature: abdomCirc1st, abdomCirc2nd or abdomCirc3rd.
I think I have to do something along the lines of:
df["abdomCirc1st"] = df.groupby(['MotherID', 'PregnancyID', 'gestationalAgeInWeeks'])["abdomCirc"].transform('max')
But this code does not check what trimester the measurement was taken in (gestationalAgeInWeeks). I would appreciate some help with this task.
You can try this. It is a bit of a complicated query, but it seems to work:
(df.groupby(['MotherID', 'PregnancyID'])
   .apply(lambda d: d.assign(tm = (d['gestationalAgeInWeeks'] + 13 - 1) // 13)
                     .groupby('tm')['abdomCirc']
                     .apply(max))
   .unstack()
)
produces
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 294.0 350.0
2 2 180.0 NaN NaN
Let's unpick this a bit. First we groupby on MotherID, PregnancyID. Then we apply a function to each grouped dataframe (d).
For each d, we create a 'trimester' column tm via assign (I assume I got the math right here, but correct it if it is wrong!), then we groupby 'tm' and apply max. For each sub-dataframe d we then obtain a Series mapping tm to max(abdomCirc).
Then we unstack(), which moves tm to the column names.
You may want to rename these columns later, but I did not bother.
Solution 2
Come to think of it you can simplify the above a bit:
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns = 'gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
similar idea, same output.
There is a magic command called query. This should do your work for now:
abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()
abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()
abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()
If you want something more automatic (so that you don't have to manually change the MotherID and PregnancyID values for every different group of rows), you have to combine it with groupby (as you did on your own); see the sketch after the link below.
Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
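To combine the two ideas automatically, a rough sketch of my own (not tested on every edge case) bins gestationalAgeInWeeks into trimesters with pd.cut and pivots the per-trimester maximum into one column per trimester; the bin edges and column labels are assumptions taken from the trimester definitions above:
import pandas as pd

# assumed trimester boundaries: weeks (0, 13], (13, 26], (26, 40]
trimester = pd.cut(df['gestationalAgeInWeeks'],
                   bins=[0, 13, 26, 40],
                   labels=['abdomCirc1st', 'abdomCirc2nd', 'abdomCirc3rd'])

summary = (df.assign(trimester=trimester)
             .groupby(['MotherID', 'PregnancyID', 'trimester'])['abdomCirc']
             .max()        # maximum measurement per pregnancy and trimester
             .unstack())   # one column per trimester, NaN where nothing was measured
Depending on your pandas version you may need observed=False in the groupby call to keep trimesters with no measurements as NaN columns.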

Average between points based on time

I'm trying to use Python to get the time taken, as well as the average speed, of an object traveling between points.
The data looks somewhat like this,
location initialtime id speed distance
1 2020-09-18T12:03:14.485952Z car_uno 72 9km
2 2020-09-18T12:10:14.485952Z car_uno 83 8km
3 2020-09-18T11:59:14.484781Z car_duo 70 9km
7 2020-09-18T12:00:14.484653Z car_trio 85 8km
8 2020-09-18T12:12:14.484653Z car_trio 70 7.5km
The function I'm using currently is essentially like this,
Speeds.index = pd.to_datetime(Speeds.index)
..etc
Now if I were doing this usually, I would just take the unique values of the id's,
for x in speeds.id.unique():
    Speeds[speeds.id=="x"]...
But this method really isn't working.
What is the best approach for simply seeing if there are multiple id points over time, then taking the average of the speeds by that time given? Otherwise just returning the speed itself if there are not multiple values.
Is there a simpler pandas filter I could use?
Expected output is simply,
area - id - initial time - journey time - average speed.
The point is to get the average speed and journey time for a vehicle going past two points.
To get the average speed and journey times you can use groupby() and pass in the columns that determine one complete journey, like id or area.
import pandas as pd
from io import StringIO
data = StringIO("""
area initialtime id speed
1 2020-09-18T12:03:14.485952Z car_uno 72
2 2020-09-18T12:10:14.485952Z car_uno 83
3 2020-09-18T11:59:14.484781Z car_duo 70
7 2020-09-18T12:00:14.484653Z car_trio 85
8 2020-09-18T12:12:14.484653Z car_trio 70
""")
df = pd.read_csv(data, delim_whitespace=True)
df["initialtime"] = pd.to_datetime(df["initialtime"])
# change to ["id", "area"] if need more granular aggregation
group_cols = ["id"]
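# journey time per group: aggregate the first and last timestamps, then subtract them with eval('max-min')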
time = df.groupby(group_cols)["initialtime"].agg([max, min]).eval('max-min').reset_index(name="journey_time")
speed = df.groupby(group_cols)["speed"].mean().reset_index(name="average_speed")
pd.merge(time, speed, on=group_cols)
id journey_time average_speed
0 car_duo 00:00:00 70.0
1 car_trio 00:12:00 77.5
2 car_uno 00:07:00 77.5
I tried to use a very intuitive solution. I'm assuming the data has already been loaded into df.
df['initialtime'] = pd.to_datetime(df['initialtime'])
result = []
for car in df['id'].unique():
    _df = df[df['id'] == car].sort_values('initialtime', ascending=True)
    # Where the car is leaving "from" and where it's heading "to"
    _df['From'] = _df['location']
    _df['To'] = _df['location'].shift(-1, fill_value=_df['location'].iloc[0])
    # Auxiliary columns
    _df['end_time'] = _df['initialtime'].shift(-1, fill_value=_df['initialtime'].iloc[0])
    _df['end_speed'] = _df['speed'].shift(-1, fill_value=_df['speed'].iloc[0])
    # Desired columns
    _df['journey_time'] = _df['end_time'] - _df['initialtime']
    _df['avg_speed'] = (_df['speed'] + _df['end_speed']) / 2
    _df = _df[_df['journey_time'] >= pd.Timedelta(0)]
    _df.drop(['location', 'distance', 'speed', 'end_time', 'end_speed'],
             axis=1, inplace=True)
    result.append(_df)
final_df = pd.concat(result).reset_index(drop=True)
The final DataFrame is as follows:
initialtime id From To journey_time avg_speed
0 2020-09-18 12:03:14.485952+00:00 car_uno 1 2 0 days 00:07:00 77.5
1 2020-09-18 11:59:14.484781+00:00 car_duo 3 3 0 days 00:00:00 70.0
2 2020-09-18 12:00:14.484653+00:00 car_trio 7 8 0 days 00:12:00 77.5
Here is another approach. My results are different from those in other posts, so I may have misunderstood the requirements. In brief, I calculated each average speed as total distance divided by total time (for each car).
from io import StringIO
import pandas as pd
# speed in km / hour; distance in km
data = '''location initial-time id speed distance
1 2020-09-18T12:03:14.485952Z car_uno 72 9
2 2020-09-18T12:10:14.485952Z car_uno 83 8
3 2020-09-18T11:59:14.484781Z car_duo 70 9
7 2020-09-18T12:00:14.484653Z car_trio 85 8
8 2020-09-18T12:12:14.484653Z car_trio 70 7.5
'''
Now create data frame and perform calculations
# create data frame
df = pd.read_csv(StringIO(data), delim_whitespace=True)
df['elapsed-time'] = df['distance'] / df['speed'] # in hours
# utility function
def hours_to_hms(elapsed):
    '''Convert `elapsed` (in hours) to hh:mm:ss (round to nearest sec)'''
    h, m = divmod(elapsed, 1)
    m *= 60
    _, s = divmod(m, 1)
    s *= 60
    hms = '{:02d}:{:02d}:{:02d}'.format(int(h), int(m), int(round(s, 0)))
    return hms
# perform calculations
start_time = df.groupby('id')['initial-time'].min()
journey_hrs = df.groupby('id')['elapsed-time'].sum().rename('elapsed-hrs')
hms = journey_hrs.apply(lambda x: hours_to_hms(x)).rename('hh:mm:ss')
ave_speed = ((df.groupby('id')['distance'].sum()
              / df.groupby('id')['elapsed-time'].sum())
             .rename('ave speed (km/hr)')
             .round(2))
# assemble results
result = pd.concat([start_time, journey_hrs, hms, ave_speed], axis=1)
print(result)
initial-time elapsed-hrs hh:mm:ss \
id
car_duo 2020-09-18T11:59:14.484781Z 0.128571 00:07:43
car_trio 2020-09-18T12:00:14.484653Z 0.201261 00:12:05
car_uno 2020-09-18T12:03:14.485952Z 0.221386 00:13:17
ave speed (km/hr)
id
car_duo 70.00
car_trio 77.01
car_uno 76.79
You should provide a better dataset (i.e. with identical time points) so that we better understand the inputs, and an example of the expected output so that we understand the computation of the average speed.
Thus I'm just guessing that you may be looking for df.groupby('initialtime')['speed'].mean() if df is a dataframe containing your input data.

Pandas: duplicating dataframe entries while column higher or equal to 0

I have a dataframe containing clinical readings of hospital patients, for example a similar dataframe could look like this
heartrate pid time
0 67 151 0.0
1 75 151 1.2
2 78 151 2.5
3 99 186 0.0
In reality there are many more columns, but I will just keep those 3 to make the example more concise.
I would like to "expand" the dataset. In short, I would like to be able to give an argument n_times_back and another argument interval.
For each iteration, which corresponds to for i in range (n_times_back + 1), we do the following:
Create a new, unique pid [OLD ID | i] (although as long as the new pid is unique for each duplicated entry, the exact name isn't really important to me, so feel free to change this if it makes it easier)
For every patient (pid), remove the rows whose time column is greater than the final time of that patient - i * interval. For example, if i * interval = 2.0 and the times associated with one pid are [0, 0.5, 1.5, 2.8], the new times will be [0, 0.5], since final time - 2.0 = 0.8
iterate
Since I realize that explaining this textually is a bit messy, here is an example.
With the dataset above, if we let n_times_back = 1 and interval=1 then we get
heartrate pid time
0 67 15100 0.0
1 75 15100 1.2
2 78 15100 2.5
3 67 15101 0.0
4 75 15101 1.2
5 99 18600 0.0
For n_times_back = 2, the result would be
heartrate pid time
0 67 15100 0.0
1 75 15100 1.2
2 78 15100 2.5
3 67 15101 0.0
4 75 15101 1.2
5 67 15102 0.0
6 99 18600 0.0
n_times_back = 3 and above would lead to the same result as n_times_back = 2, as no patient data goes below that point in time
I have written code for this.
def expand_df(df, n_times_back, interval):
    for curr_patient in df['pid'].unique():
        patient_data = df[df['pid'] == curr_patient]
        final_time = patient_data['time'].max()
        for i in range(n_times_back + 1):
            new_data = patient_data[patient_data['time'] <= final_time - i * interval]
            new_data['pid'] = patient_data['pid'].astype(str) + str(i).zfill(2)
            new_data['pid'] = new_data['pid'].astype(int)
            # check if there is any time index left, if not don't add useless entry to dataframe
            if new_data['time'].count() > 0:
                df = df.append(new_data)
        df = df[df['pid'] != curr_patient]  # remove original patient data, now duplicate
    df.reset_index(inplace=True, drop=True)
    return df
As far as functionality goes, this code works as intended. However, it is very slow. I am working with a dataframe of 30'000 patients and the code has been running for over 2 hours now.
Is there a way to use pandas operations to speed this up? I have looked around but so far I haven't managed to reproduce this functionality with high level pandas functions
I ended up using a groupby function and breaking when no more times were available, as well as creating an "iteration" column that I append to the "pid" column at the end.
def expand_df(group, n_times, interval):
    df = pd.DataFrame()
    final_time = group['time'].max()
    for i in range(n_times + 1):
        new_data = group[group['time'] <= final_time - i * interval]
        new_data['iteration'] = str(i).zfill(2)
        # check if there is any time index left, if not don't add useless entry to dataframe
        if new_data['time'].count() > 0:
            df = df.append(new_data)
        else:
            break
    return df

new_df = df.groupby('pid').apply(lambda x: expand_df(x, n_times_back, interval))
new_df = new_df.reset_index(drop=True)
new_df['pid'] = new_df['pid'].map(str) + new_df['iteration']
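If you want to push this further, here is a rough, untested sketch of my own of a fully vectorized variant (the function name is mine, and how='cross' needs pandas 1.2 or newer); it builds every iteration at once with a cross join instead of appending inside a loop:
def expand_df_vectorized(df, n_times_back, interval):
    iters = pd.DataFrame({'iteration': range(n_times_back + 1)})
    out = df.merge(iters, how='cross')                 # every row repeated once per iteration
    cutoff = out.groupby('pid')['time'].transform('max') - out['iteration'] * interval
    out = out[out['time'] <= cutoff].copy()            # drop rows past the shifted final time
    out['pid'] = out['pid'].astype(str) + out['iteration'].astype(str).str.zfill(2)
    return out.drop(columns='iteration').reset_index(drop=True)
Unlike the loop version this leaves the new pid as a string; cast it with astype(int) afterwards if you need integers.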

How to map a function in pandas which compares each record in a column to previous and next records

I have a time series of water levels for which I need to calculate monthly and annual statistics in relation to several arbitrary flood stages. Specifically, I need to determine the duration per month that the water exceeded flood stage, as well as the number of times these excursions occurred. Additionally, because of the noise associated with the dataloggers, I need to exclude floods that lasted less than 1 hour as well as floods with less than 1 hour between events.
Mock up data:
import datetime
import numpy as np
import pandas as pd

start = datetime.datetime(2014, 9, 5, 12, 0)
daterange = pd.date_range(start, periods=10000, freq='30min', name="Datetime")
data = np.random.random_sample((len(daterange), 3)) * 10
columns = ["Pond_A", "Pond_B", "Pond_C"]
df = pd.DataFrame(data=data, index=daterange, columns=columns)
flood_stages = [('Stage_1', 4.0), ('Stage_2', 6.0)]
My desired output is:
Pond_A_Stage_1_duration Pond_A_Stage_1_events \
2014-09-30 12:00:00 35.5 2
2014-10-31 12:00:00 40.5 31
2014-11-30 12:00:00 100 16
2014-12-31 12:00:00 36 12
etc. for the duration and events at each flood stage, at each reservoir.
I've tried grouping by month, iterating through the ponds and then iterating through each row like:
grouper = pd.TimeGrouper(freq = "1MS")
month_groups = df.groupby(grouper)
for name, group in month_groups:
    flood_stage_a = group.sum()[1]
    flood_stage_b = group.sum()[2]
    inundation_a = False
    inundation_30_a = False
    inundation_hour_a = False
    change_inundation_a = 0
    for level in group.values:
        if level[1]:
            inundation_a = True
        else:
            inundation_a = False
        if inundation_hour_a == False and inundation_a == True and inundation_30_a == True:
            change_inundation_a += 1
        inundation_hour_a = inundation_30_a
        inundation_30_a = inundation_a
But this is a caveman solution and the heuristics are getting messy, since I don't want to count a new event if a flood started in one month and continued into the next. This also doesn't combine events with less than one hour between their start and end. Is there a better way to compare a record to its previous and next records?
My other thought is to create new columns with the series shifted t+1, t+2, t-1, t-2, so I can evaluate each row once, but this still seems inefficient. Is there a smarter way to do this by mapping a function?
Let me give a quick, partial answer since no one has answered yet, and maybe someone else can do something better later on if this does not suffice for you.
You can do the time spent above flood stage pretty easily. I divided by 48 so the units are in days.
df[ df > 4 ].groupby(pd.TimeGrouper( freq = "1MS" )).count() / 48
Pond_A Pond_B Pond_C
Datetime
2014-09-01 15.375000 15.437500 14.895833
2014-10-01 18.895833 18.187500 18.645833
2014-11-01 17.937500 17.979167 18.666667
2014-12-01 18.104167 18.354167 18.958333
2015-01-01 18.791667 18.645833 18.708333
2015-02-01 16.583333 17.208333 16.895833
2015-03-01 18.458333 18.458333 18.458333
2015-04-01 0.458333 0.520833 0.500000
Counting distinct events is a little harder, but something like this will get you most of the way. (Note that this produces an unrealistically high number of flooding events, but that's just because of how the sample data is set up and not reflective of a typical pond, though I'm not an expert on pond flooding!)
for c in df.columns:
    df[c+'_events'] = ((df[c] > 4) & (df[c].shift() <= 4))
df.iloc[:,-3:].groupby(pd.TimeGrouper( freq = "1MS" )).sum()
Pond_A_events Pond_B_events Pond_C_events
Datetime
2014-09-01 306 291 298
2014-10-01 381 343 373
2014-11-01 350 346 357
2014-12-01 359 352 361
2015-01-01 355 335 352
2015-02-01 292 337 316
2015-03-01 344 360 386
2015-04-01 9 10 9
A couple things to note. First, an event can span months and this method will group it with the month where the event began. Second, I'm ignoring the duration of the event here, but you can adjust that however you want. For example, if you want to say the event doesn't start unless there are 2 consecutive periods below flood level followed by 2 consecutive periods above flood level, just change the relevant line above to:
df[c+'_events'] = ((df[c] > 4) & (df[c].shift(1) <= 4) &
                   (df[c].shift(-1) > 4) & (df[c].shift(2) <= 4))
That produces a pretty dramatic reduction in the count of distinct events:
Pond_A_events Pond_B_events Pond_C_events
Datetime
2014-09-01 70 71 72
2014-10-01 91 85 81
2014-11-01 87 75 91
2014-12-01 88 87 77
2015-01-01 91 95 94
2015-02-01 79 90 83
2015-03-01 83 78 85
2015-04-01 0 2 2
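To also fold in the two one-hour rules from the question (ignore floods shorter than an hour, and merge floods separated by less than an hour), one option is run-length logic on the boolean series before counting starts. Below is a rough sketch of my own, not part of the original answer: the helper name is mine, it assumes the 30-minute sampling of the mock data (so one hour is two periods), and pd.Grouper is the current name for what is written as pd.TimeGrouper above.
def clean_events(above, min_len=2):
    # above: boolean Series, True while the pond is over flood stage
    s = above.copy()
    for target in (True, False):             # first drop short floods, then fill short dry gaps
        runs = (s != s.shift()).cumsum()     # label consecutive runs of equal values
        lengths = s.groupby(runs).transform('size')
        short = (s == target) & (lengths < min_len)
        s = s.where(~short, other=(not target))
    return s.astype(bool)

cleaned = clean_events(df['Pond_A'] > 4, min_len=2)      # 2 half-hour periods = 1 hour
starts = cleaned & ~cleaned.shift(fill_value=False)      # True at the first period of each event
monthly = pd.DataFrame({
    'duration_days': cleaned.groupby(pd.Grouper(freq='1MS')).sum() / 48,
    'events': starts.groupby(pd.Grouper(freq='1MS')).sum(),
})
Note this sketch also fills short dry runs at the very start and end of the series, and it still attributes an event to the month in which it begins; adjust as needed.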

Calculating difference between two rows in Python / Pandas

In python, how can I reference previous row and calculate something against it? Specifically, I am working with dataframes in pandas - I have a data frame full of stock price information that looks like this:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
Here is how I created this dataframe:
import pandas
url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
data = pandas.read_csv(url)
## now I sorted the data frame ascending by date
data = data.sort(columns='Date')
Starting with row number 2, or in this case, I guess it's 250 (PS - is that the index?), I want to calculate the difference between 2011-01-03 and 2011-01-04, for every entry in this dataframe. I believe the appropriate way is to write a function that takes the current row, figures out the previous row, and calculates the difference between them, then use the pandas apply function to update the dataframe with the value.
Is that the right approach? If so, should I be using the index to determine the difference? (note - I'm still in python beginner mode, so index may not be the right term, nor even the correct way to implement this)
I think you want to do something like this:
In [26]: data
Out[26]:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
In [27]: data.set_index('Date').diff()
Out[27]:
Close Adj Close
Date
2011-01-03 NaN NaN
2011-01-04 0.16 0.16
2011-01-05 -0.59 -0.58
2011-01-06 1.61 1.57
2011-01-07 -0.73 -0.71
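If you prefer to keep the frame as it is and reference the previous row explicitly, shift() gives the same differences; a small sketch (the new column names here are just illustrative):
# equivalent to diff(): subtract the previous row, obtained with shift(1)
data['Close_diff'] = data['Close'] - data['Close'].shift(1)
data['Adj_Close_diff'] = data['Adj Close'] - data['Adj Close'].shift(1)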
To calculate the difference of one column, here is what you can do.
df=
A B
0 10 56
1 45 48
2 26 48
3 32 65
We want to compute the row difference in A only, and keep only the rows where that difference is less than 15.
df['A_dif'] = df['A'].diff()
df=
A B A_dif
0 10 56 NaN
1 45 48 35.0
2 26 48 -19.0
3 32 65 6.0
df = df[df['A_dif']<15]
df=
A B A_dif
2 26 48 -19.0
3 32 65 6.0
I don't know pandas, and I'm pretty sure it has something specific for this; however, I'll give you the pure-Python solution, which might be of some help even if you end up using pandas:
import csv
import urllib
# This basically retrieves the CSV files and loads it in a list, converting
# All numeric values to floats
url='http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
reader = csv.reader(urllib.urlopen(url), delimiter=',')
# We sort the output list so the records are ordered by date
cleaned = sorted([[r[0]] + map(float, r[1:]) for r in list(reader)[1:]])
for i, row in enumerate(cleaned):  # enumerate() yields two-tuples: (<id>, <item>)
    # The try..except here is to skip the IndexError for line 0
    try:
        # This will calculate difference of each numeric field with the same field
        # in the row before this one
        print row[0], [(row[j] - cleaned[i-1][j]) for j in range(1, 7)]
    except IndexError:
        pass
