I am trying to add together the elements in the second column of two dataframes wherever the time (in the first column) is the same; however, the time in each DataFrame is spaced at different intervals. So, in the image below, I would like to add the y values of both lines together:
[image: plot of power vs. time for both dataframes]
So where they overlap, the combined value would be at around 3200.
Each dataframe has two columns: the first is time as a unix timestamp and the second is power in watts. The spacing between rows is usually 6 seconds, but sometimes more or less. Also, each dataframe starts and ends at a different time, although there is some overlap in the inner portion.
I've added the first few rows for ease of viewing:
df1:
time power
0 1355526770 1500
1 1355526776 1800
2 1355526782 1600
3 1355526788 1700
4 1355526794 1400
df2:
time power
0 1355526771 1250
1 1355526777 1200
2 1355526783 1280
3 1355526789 1290
4 1355526795 1300
My first thought was to reindex each dataframe, inserting a row for every second across its time range, and then linearly interpolate the power values between each timestamp. I would then add the dataframes together by adding the power values where the timestamps matched exactly.
The problem with this method is that it would increase the size of each dataframe by at least 6x, and since they're already pretty big, this would slow things down a lot.
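For reference, here is roughly what that reindex-and-interpolate idea would look like (an untested sketch; the helper name is made up and it follows the column names above):

import pandas as pd

# Upsample each frame to 1-second resolution and interpolate the power
# linearly; rows are 1 second apart after reindexing, so positional
# interpolation matches interpolation over time.
def upsample_1s(df):
    every_second = range(df['time'].min(), df['time'].max() + 1)
    return (df.set_index('time')['power']
              .reindex(every_second)
              .interpolate(method='linear'))

# Add the two series where their timestamps line up (NaN outside the overlap).
combined = upsample_1s(df1).add(upsample_1s(df2))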
If anyone knows another method to do this I would be very grateful.
Beyond what the other users have said, you could also consider trying out Modin instead of pure pandas for your datasets if you want another way to speed up computation. Modin is easily integrated into your workflow with just one line of code. Take a look here: Intel® Distribution of Modin
Using a merge_asof to align on the nearest time:
# Align each row of df1 with the row of df2 whose time is nearest,
# then add the two power columns and drop the helper column.
(pd.merge_asof(df1, df2, on='time', direction='nearest', suffixes=(None, '_2'))
   .assign(power=lambda d: d['power'].add(d.pop('power_2')))
)
Output:
time power
0 1355526770 2750
1 1355526776 3000
2 1355526782 2880
3 1355526788 2990
4 1355526794 2700
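Since the question mentions that the two frames only partially overlap, a possible refinement (the 6-second tolerance here is an assumption based on the stated row spacing) is to leave rows with no nearby match as NaN and treat them as zero when adding:

(pd.merge_asof(df1, df2, on='time', direction='nearest',
               suffixes=(None, '_2'), tolerance=6)
   .assign(power=lambda d: d['power'].add(d.pop('power_2'), fill_value=0))
)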
Related
Suppose I have a dataset that records camera sightings of some object over time, and I groupby date so that each group represents sightings within the same day. I'd then like to break one group into 'subgroups' based on the time between sightings -- if the gap is too large, then I want them to be in different groups.
Consider the following as one group.
Camera  Time
A 6
B 12
C 17
D 21
E 47
F 50
Suppose I had a cutoff matrix that told me how close the next sighting had to be for two adjacent cameras to be in the same group. For example, we might have cutoff_mat[d, e] = 10, which means that since cameras D and E are more than 10 units apart in time, I should break the group into two: after D and before E. I would like to do this in a way that allows efficient iteration over each of the resulting groups, since my real goal is to compute some other matrix using the values within each sub-group, and I may need to break one group into many, not just two. How do I do this? The dataset is large (>100M points), so something fast would be appreciated.
I am thinking I could do this by creating another column in the original dataset that represents the time between consecutive sightings on the same day, and somehow grouping by both date AND this new column, but I'm not quite sure how that would work. I also don't think pd.cut() works here, since I don't have pre-determined bins.
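One sketch of that idea (hedged: it assumes a single scalar cutoff rather than the full per-camera cutoff matrix, and the column names are made up):

import pandas as pd

# Hypothetical single-day example.
df = pd.DataFrame({
    'date':   ['2020-01-01'] * 6,
    'camera': list('ABCDEF'),
    'time':   [6, 12, 17, 21, 47, 50],
})

CUTOFF = 10  # simplification: one scalar cutoff instead of cutoff_mat[d, e]

# Gap to the previous sighting on the same day; a gap above the cutoff starts
# a new subgroup, and cumsum() turns those break points into group labels.
gap = df.groupby('date')['time'].diff()
df['subgroup'] = gap.gt(CUTOFF).cumsum()

# Iterate over each (date, subgroup) pair.
for (date, sub), g in df.groupby(['date', 'subgroup']):
    print(date, sub, list(g['camera']))
# 2020-01-01 0 ['A', 'B', 'C', 'D']
# 2020-01-01 1 ['E', 'F']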
Given a data frame with the start time of each new time period (a new work shift), sum all sales that occur up until the next time period (work shift).
import pandas as pd
df_checkpoints = pd.DataFrame({'time':[1,5,10], 'shift':['Adam','Ben','Carl']})
df_sales = pd.DataFrame({'time':[2,6,7,9,15], 'soldCount':[1,2,3,4,5]})
# This is the wanted output...
df_output = pd.DataFrame({'time':[1,5,10], 'shift':['Adam','Ben','Carl'], 'totSold':[1,9,5]})
So pd.merge_asof does what I want, except it only does a 1:1 merge. Best would be to get a MultiIndex dataframe with index[0] being the checkpoints and index[1] being the sales rows, so that I can aggregate freely afterwards. A last resort would be an ugly O(n) loop.
The number of rows in each df is a couple of million.
Any idea?
You can use pd.cut.
For instance, if you want to group by range, you can use it like this.
Note that I added 24 to mark the end of the last range:
pd.cut(df_sales["time"], [1,5,10,24])
If you want to automate this, you can do it like this:
Get the checkpoints, append 24 as the finishing time, group the sales by those bins, sum them, and drop the interval column so the result can be concatenated:
bins = pd.concat([df_checkpoints['time'], pd.Series([24])])  # Series.append is deprecated
group_and_sum = df_sales.groupby(pd.cut(df_sales["time"], bins), as_index=False).sum().drop('time', axis=1)
Then concatenate the two dataframes to attach the checkpoint times and shift names:
pd.concat([group_and_sum, df_checkpoints], axis=1)
Output:
soldCount time shift
0 1 1 Adam
1 9 5 Ben
2 5 10 Carl
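For completeness, since the question mentions pd.merge_asof: a backward asof-merge from the sales onto the checkpoints, followed by a groupby-sum, is another possible sketch that sidesteps the 1:1 limitation (untested against the full-size data):

# Keep the checkpoint's own start time so it can be grouped on after the merge.
checkpoints = df_checkpoints.assign(shift_start=df_checkpoints['time'])

# Attach each sale to the most recent checkpoint at or before it
# (both frames must already be sorted by 'time').
merged = pd.merge_asof(df_sales, checkpoints, on='time', direction='backward')

# Sum the sales per shift.
out = (merged.groupby(['shift_start', 'shift'], as_index=False)['soldCount'].sum()
             .rename(columns={'shift_start': 'time', 'soldCount': 'totSold'}))
# out matches df_output: Adam 1, Ben 9, Carl 5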
I have a time series (array of values) and I would like to find the starting points where a long drop in values begins (at least X consecutive values going down). For example:
Having a list of values
[1,2,3,4,3,4,5,4,3,4,5,4,3,2,1,2,3,2,3,4,3,4,5,6,7,8]
I would like to find a drop of at least 5 consecutive values. So in this case I would find the segment 5,4,3,2,1.
However, in a real scenario, there is noise in the data, so the actual drop includes a lot of little ups and downs.
I could write an algorithm for this. But I was wondering whether there is an existing library or standard signal processing method for this type of analysis.
You can do this pretty easily with pandas (which I know you have). Convert your list to a Series, then use a groupby with a size transform to flag runs of consecutively declining values:
v = pd.Series([1,2,3,4,3,4,5,4,3,4,5,4,3,2,1,2,3,2,3,4,3,4,5,6,7,8])

# Each increase (diff > 0) starts a new group; runs of non-increasing values
# share a group, so 'size' is the length of each declining run.
v[v.groupby(v.diff().gt(0).cumsum()).transform('size').ge(5)]
10 5
11 4
12 3
13 2
14 1
dtype: int64
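If only the starting index of each qualifying drop is needed (as the question asks), one possible follow-up on the same mask is:

mask = v.groupby(v.diff().gt(0).cumsum()).transform('size').ge(5)
starts = v.index[mask & ~mask.shift(fill_value=False)]  # first row of each run
# -> [10] for the example list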
I have a pandas dataframe containing OHLC 1-minute data (19724 rows). I want to add 2 new columns that keep track of the min and max price over the past 3 days (including today, up to the current bar, and ignoring missing days). However, I am running into performance issues, as %timeit on the for loop reports 57 seconds... I am looking for ways to speed this up (vectorization? I tried, but I'm struggling a little bit, I must admit).
import os
import numpy as np
import pandas as pd

#Import the data and put them in a DataFrame. The DataFrame should contain
#the following fields: DateTime (the index), Open, Close, High, Low, Volume.
#----------------------
#The following assumes the first column of the file is Datetime
dfData = pd.read_csv(os.path.join(DataLocation, FileName), index_col='Date')
dfData.index = pd.to_datetime(dfData.index, dayfirst=True)
dfData.index = dfData.index.tz_localize('Singapore')
# Calculate the list of unique dates in the dataframe to find T-2
ListOfDates=pd.to_datetime(dfData.index.date).unique()
#Add an ExtMin and an ExtMax column to the dataFrame to keep track of the min and max over a certain window
dfData['ExtMin']=np.nan
dfData['ExtMax']=np.nan
#For each line in the dataframe, calculate the minimum price reached over the past 3 days including today.
def addMaxMin(dfData):
    for index, row in dfData.iterrows():
        #Find the index in ListOfDates, strip out the time, offset by -2 rows
        Start = ListOfDates[max(0, ListOfDates.get_loc(index.date()) - 2)]
        #Populate the ExtMin and ExtMax columns
        dfData.loc[index, 'ExtMin'] = dfData[(Start <= dfData.index) & (dfData.index < index)]['LOW'].min()
        dfData.loc[index, 'ExtMax'] = dfData[(Start <= dfData.index) & (dfData.index < index)]['HIGH'].max()
    return dfData
%timeit addMaxMin(dfData)
Thanks.
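As an aside, a rolling time window is one vectorized direction to explore. The sketch below is not exactly equivalent to the loop above (it looks back 3 calendar days from each bar's own timestamp rather than from the start of T-2, and it does not skip missing days), but it avoids iterrows entirely:

# closed='left' excludes the current bar, mirroring dfData.index < index above.
dfData['ExtMin'] = dfData['LOW'].rolling('3D', closed='left').min()
dfData['ExtMax'] = dfData['HIGH'].rolling('3D', closed='left').max()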
I'm relatively new to pandas (and Python... and programming) and I'm trying to do a Monte Carlo simulation, but I have not been able to find a solution that takes a reasonable amount of time.
The data is stored in a data frame called "YTDSales", which has sales per day, per product:
Date Product_A Product_B Product_C Product_D ... Product_XX
01/01/2014 1000 300 70 34500 ... 780
02/01/2014 400 400 70 20 ... 10
03/01/2014 1110 400 1170 60 ... 50
04/01/2014 20 320 0 71300 ... 10
...
15/10/2014 1000 300 70 34500 ... 5000
What I want to do is simulate different scenarios, using for the rest of the year (from October 15 to year end) the historical distribution that each product had. For example, with the data presented, I would like to fill the rest of the year for Product_A with sales between 20 and 1110.
What I've done is the following
# creates range of "future dates"
last_historical = YTDSales.index.max()
year_end = dt.datetime(2014,12,30)
DatesEOY = pd.date_range(start=last_historical,end=year_end).shift(1)
# function that obtains a random sales number per product, between max and min
f = lambda x:np.random.randint(x.min(),x.max())
# create all the "future" dates and fill it with the output of f
for i in DatesEOY:
    YTDSales.loc[i] = YTDSales.apply(f)
The solution works, but takes about 3 seconds, which is a lot if I plan to run 1,000 iterations... Is there a way to avoid iterating?
Thanks
Use the size option for np.random.randint to get a sample of the needed size all at once.
One approach I would consider is, briefly, as follows.
Allocate the space you'll need into a new array that will have index values from DatesEOY, columns from the original DataFrame, and all NaN values. Then concatenate onto the original data.
Now that you know the length of each random sample you'll need, use the extra size keyword in numpy.random.randint to sample all at once, per column, instead of looping.
Overwrite the data with this batch sampling.
Here's what this could look like:
new_df = pd.DataFrame(index=DatesEOY, columns=YTDSales.columns)
num_to_sample = len(new_df)

# Sample num_to_sample values per product in one call, bounded by each
# column's historical min and max.
f = lambda x: np.random.randint(x[1].min(), x[1].max(), num_to_sample)

output = pd.concat([YTDSales, new_df], axis=0)
output.iloc[len(YTDSales):] = np.asarray(list(map(f, YTDSales.items()))).T
Along the way, I chose to make a totally new DataFrame by concatenating the old one with the new "placeholder" one. This could obviously be inefficient for very large data.
Another way to approach it is setting with enlargement, as you've done in your for-loop solution.
I did not play around with that approach long enough to figure out how to "enlarge" batches of indexes all at once. But if you figure that out, you can just "enlarge" the original data frame with all NaN values (at the index values from DatesEOY), and then apply the function above to YTDSales instead of bringing output into it at all.
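For what it's worth, one way that batch "enlargement" could look (a sketch, not something I tested against the original data): reindex YTDSales with the union of its index and DatesEOY, which adds the NaN rows in one go, and then write the batch samples into those rows:

# Add the future dates as NaN rows in one step.
enlarged = YTDSales.reindex(YTDSales.index.union(DatesEOY))

# Batch-sample per product, then fill the new rows by label.
samples = {col: np.random.randint(s.min(), s.max(), len(DatesEOY))
           for col, s in YTDSales.items()}
enlarged.loc[DatesEOY] = pd.DataFrame(samples, index=DatesEOY)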