I have a pandas dataframe with a datetime index and some column, 'value'. I would like to compare the 'value' value at a given time of day to the value at a different time of the same day. E.g. compare the 10am value to the 10pm value.
Right now I can get the value at either side using:
mask = df[(df.index.hour == hour)]
The problem is that this returns a dataframe still indexed by the full timestamp at that hour, so doing mask1.value - mask2.value returns NaNs since the indexes are different.
I can get around this in a convoluted way:
out = mask.value.loc["2020-07-15"].reset_index() - mask2.value.loc["2020-07-15"].reset_index() #assuming mask2 is the same as the mask call but at a different hour
but this is tiresome to loop over for a dataset that spans years. (Obviously I could timedelta +=1 in the loop to avoid the hard calls).
I don't actually care if some NaNs get into the end result when some values (e.g. the 10am ones) are missing.
Edit:
Initial dataframe:
index values
2020-05-10T10:00:00 23
2020-05-10T11:00:00 20
2020-05-10T12:00:00 5
.....
2020-05-30T22:00:00 8
2020-05-30T23:00:00 8
2020-05-30T24:00:00 9
Expected dataframe:
index date newval
0 2020-05-10 18
.....
x 2020-05-30 1
where newval is some subtraction of the two different times I described above (eg. the 10am measurement - the 12pm measurement so 23-5 = 18), second entry is made up
it doesn't matter to me if date is a separate column or the index.
A workaround:
mask1 = df[(df.index.hour == hour1)]
mask2 = df[(df.index.hour == hour2)]
out = mask1.values - mask2.values # df.values returns an np array without indices
result_df = pd.DataFrame(index=pd.date_range(start, end), data=out)
This should save you the effort of looping over the dates, provided both hours are present for every day so that the two arrays have the same length.
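If some days are missing one of the two hours, the arrays above will not line up. A minimal sketch of an index-aligned variant follows, assuming at most one row per hour per day; the helper name diff_between_hours and the sample data are made up for illustration. Re-labelling each selection by its calendar date lets pandas align the subtraction and leave NaN where one side is missing.

```python
import pandas as pd

# Hypothetical hourly data: three full days, value = hours since the start
idx = pd.date_range("2020-05-10", periods=72, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"value": range(72)}, index=idx)

def diff_between_hours(df, hour1, hour2):
    """Return value(hour1) - value(hour2) per calendar day."""
    a = df.loc[df.index.hour == hour1, "value"]
    b = df.loc[df.index.hour == hour2, "value"]
    # Drop the time-of-day part so both series share a date-only index
    a.index = a.index.normalize()
    b.index = b.index.normalize()
    # Subtraction aligns on the date; days missing either hour become NaN
    return (a - b).rename("newval")

out = diff_between_hours(df, 10, 12)
print(out)
```

Calling out.reset_index() afterwards would turn the date into an ordinary column if that is preferred.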
I have a data frame in the format (date range from start="2018-09-09", end="2020-02-02") with values from 1 to 513.
I have another data frame in the format (only 3 dates).
Based on the second data frame, I want the 2 dates before and the 1 date after each of its dates. What I mean is this:
Edited: Corrected answer as per the question
If you do this:
keep = []
for val in df2['value']:
    keep += [val-3, val-2, val-1, val]
df_final = df1.take(keep)
Assumption: your value column always starts from 1 and is sequential. Also, its datatype is integer, not string.
What it does:
The row numbers (indices) of every date = value of that row - 1, since indices start from 0.
So this keeps only the value-3 (2 days before), value-2 (1 day before), value-1 (that day present in df2) and value (1 day after) indices in the keep list.
Then DataFrame.take(indices) does the work for us, by only taking from the mentioned DataFrame df1 the rows with indices mentioned in the argument indices, a list.
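One thing the loop above does not handle is a date near the start or end of df1: DataFrame.take treats negative positions as counting from the end, and positions past the last row raise an error. A small sketch of a bounds-checked variant, using made-up frames (10 rows in df1, two values in df2) under the same "value starts at 1 and is sequential" assumption:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two frames in the question
df1 = pd.DataFrame({
    "date": pd.date_range("2018-09-09", periods=10),
    "value": range(1, 11),
})
df2 = pd.DataFrame({"value": [1, 5]})

# Row position = value - 1, so these offsets mean:
# 2 days before, 1 day before, the day itself, 1 day after
offsets = np.array([-3, -2, -1, 0])
positions = (df2["value"].to_numpy()[:, None] + offsets).ravel()

# Discard positions that fall outside df1 and remove duplicates
in_bounds = (positions >= 0) & (positions < len(df1))
positions = np.unique(positions[in_bounds])

df_final = df1.take(positions)
print(df_final)
```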
I have a Dataframe which has a column for Minutes and correlated value, the frequency is about 79 seconds but sometimes there is missing data for a period (no rows at all). I want to detect if there is a gap of 25 or more Minutes and delete the dataset if so.
How do I test whether there is a gap of that size?
The dataframe looks like this:
INDEX minutes data
0 23.000 1.456
1 24.185 1.223
2 27.250 0.931
3 55.700 2.513
4 56.790 1.446
... ... ...
So there is an irregular but short gap and one that exceeds 25 minutes. In this case I want the dataset to be empty.
I am quite new to Python, especially to Pandas so an explanation would be helpful to learn.
You can use numpy.roll to create a column with shifted values (i.e. the first value from the original column becomes the second value, the second becomes the third, etc):
import pandas as pd
import numpy as np
df = pd.DataFrame({'minutes': [23.000, 24.185, 27.250, 55.700, 56.790]})
np.roll(df['minutes'], 1)
# output: array([56.79 , 23. , 24.185, 27.25 , 55.7 ])
Add this as a new column to your dataframe and subtract the original column with the new column.
We also drop the first row beforehand, since we don't want to calculate the difference from your first timepoint in the original column and your last timepoint that got rolled to the start of the new column.
Then we just ask if any of the values resulting from the subtraction is above your threshold:
df['rolled_minutes'] = np.roll(df['minutes'], 1)
dropped_df = df.drop(index=0)
diff = dropped_df['minutes'] - dropped_df['rolled_minutes']
(diff >= 25).any()  # ">=" because the question asks for gaps of 25 minutes or more
# output: True
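The same check can also be written with Series.diff, which subtracts each row's predecessor directly and leaves NaN in the first row, so no helper column or row drop is needed. A minimal sketch on the sample data from the question; the 25-minute threshold and the choice to empty the frame with iloc[0:0] are taken from the question's wording, not from the answer above:

```python
import pandas as pd

df = pd.DataFrame({
    "minutes": [23.000, 24.185, 27.250, 55.700, 56.790],
    "data": [1.456, 1.223, 0.931, 2.513, 1.446],
})

# Difference between each row and the previous one (first row is NaN)
gaps = df["minutes"].diff()

# True if any gap is 25 minutes or longer; NaN compares as False
has_long_gap = bool((gaps >= 25).any())

if has_long_gap:
    # Keep the columns but drop every row
    df = df.iloc[0:0]

print(has_long_gap, len(df))
```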
I want to update the mergeAllGB.Intensity columns NaN values with values from another dataframe where ID, weekday and hour are matching. I'm trying:
mergeAllGB.Intensity[mergeAllGB.Intensity.isnull()] = precip_hourly[precip_hourly.SId == mergeAllGB.SId & precip_hourly.Hour == mergeAllGB.Hour & precip_hourly.Weekday == mergeAllGB.Weekday].Intensity
However, this returns ValueError: Series lengths must match to compare. How could I do this?
Minimal example:
Inputs:
_______
mergeAllGB
SId Hour Weekday Intensity
1 12 5 NaN
2 5 6 3
precip_hourly
SId Hour Weekday Intensity
1 12 5 2
Desired output:
________
mergeAllGB
SId Hour Weekday Intensity
1 12 5 2
2 5 6 3
TL;DR this will (hopefully) work:
# Set the index to compare by
df = mergeAllGB.set_index(["SId", "Hour", "Weekday"])
fill_df = precip_hourly.set_index(["SId", "Hour", "Weekday"])
# Fill the nulls with the relevant values of intensity
df["Intensity"] = df.Intensity.fillna(fill_df.Intensity)
# Cancel the special indexes
mergeAllGB = df.reset_index()
Alternatively, the line before the last could be
df.loc[df.Intensity.isnull(), "Intensity"] = fill_df.Intensity
Assignment and comparison in pandas are done by index (which isn't shown in your example).
In the example, running precip_hourly.SId == mergeAllGB.SId results in ValueError: Can only compare identically-labeled Series objects. This is because we try to compare the two columns by value, but precip_hourly doesn't have a row with index 1 (default indexing starts at 0), so the comparison fails.
Even if we assume the comparison succeeded, the assignment stage is problematic.
Pandas tries to assign according to the index - but this doesn't have the intended meaning.
Luckily, we can use this to our benefit: by setting the index to ["SId", "Hour", "Weekday"], any comparison or assignment is done relative to this index, so running df.Intensity = fill_df.Intensity will assign to df.Intensity the values in fill_df.Intensity wherever the indexes match, that is, wherever the rows have the same ["SId", "Hour", "Weekday"].
In order to assign only to the places where the Intensity is NA, we need to filter first (or use fillna). Note that filter by df.Intensity[df.Intensity.isnull()] will work, but assignment to it will probably fail if you have several values with the same (SId, Hour, Weekday) values.
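For reference, here is the fillna variant run end to end on the minimal example from the question. The extra (3, 7, 1) row in precip_hourly is made up, to show that unmatched rows in the lookup frame are ignored. This sketch assumes each (SId, Hour, Weekday) combination appears at most once in precip_hourly.

```python
import numpy as np
import pandas as pd

mergeAllGB = pd.DataFrame({
    "SId": [1, 2], "Hour": [12, 5], "Weekday": [5, 6],
    "Intensity": [np.nan, 3.0],
})
precip_hourly = pd.DataFrame({
    "SId": [1, 3], "Hour": [12, 7], "Weekday": [5, 1],
    "Intensity": [2.0, 9.0],
})

keys = ["SId", "Hour", "Weekday"]
df = mergeAllGB.set_index(keys)
fill_df = precip_hourly.set_index(keys)

# fillna aligns the replacement Series on the shared index,
# so only rows with matching keys and a missing Intensity change
df["Intensity"] = df["Intensity"].fillna(fill_df["Intensity"])

mergeAllGB = df.reset_index()
print(mergeAllGB)
```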
I've created a pandas dataframe from a 205MB csv (approx 1.1 million rows by 15 columns). It holds a column called starttime that is dtype object (it's more precisely a string). The format is as follows: 7/1/2015 00:00:03.
I would like to create two new dataframes from this pandas dataframe. One should contain all rows corresponding with weekend dates, the other should contain all rows corresponding with weekday dates.
Weekend dates are:
weekends = ['7/4/2015', '7/5/2015', '7/11/2015', '7/12/2015',
            '7/18/2015', '7/19/2015', '7/25/2015', '7/26/2015']
I attempted to convert the string to datetime (pd.to_datetime) hoping that would make the values easier to parse, but when I do it hangs for so long that I ended up restarting the kernel several times.
Then I decided to use df["date"], df["time"] = zip(*df['starttime'].str.split(' ').tolist()) to create two new columns in the original dataframe (one for date, one for time). Next I figured I'd use a boolean test to 'flag' weekend records (according to the new date field) as True and all others False and create another column holding those values, then I'd be able to group by True and False.
For example,
test1 = bikes['date'] == '7/1/2015' returns True for all 7/1/2015 values, but I can't figure out how to iterate over all items in weekends so that I get True for all weekend dates. I tried this and broke Python (hung again):
for i in weekends:
    for k in df['date']:
        test2 = df['date'] == i
I'd appreciate any help (with both my logic and my code).
First, create a DataFrame of string timestamps with 1.1m rows:
df = pd.DataFrame({'date': ['7/1/2015 00:00:03', '7/1/2015 00:00:04'] * 550000})
Next, you can simply convert them to Pandas timestamps as follows:
df['ts'] = pd.to_datetime(df.date)
This operation took just under two minutes. However, it took under seven seconds if you specify the format:
df['ts'] = pd.to_datetime(df.date, format='%m/%d/%Y %H:%M:%S')
Now, it is easy to set up a weekend flag as follows (which took about 3 seconds):
df['weekend'] = [d.weekday() >= 5 for d in df.ts]
Finally, it is easy to subset your DataFrame, which takes virtually no time:
df_weekdays = df.loc[~df.weekend, :]
df_weekends = df.loc[df.weekend, :]
The weekend flag is to help explain what is happening. You can simplify as follows:
df_weekdays = df.loc[df.ts.apply(lambda ts: ts.weekday() < 5), :]
df_weekends = df.loc[df.ts.apply(lambda ts: ts.weekday() >= 5), :]
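The list comprehension and apply calls above go through Python one row at a time. A sketch of the same split using the vectorized .dt accessor, which is usually faster on a frame this size (not timed here); the four sample rows are made up, and the column name starttime follows the question:

```python
import pandas as pd

df = pd.DataFrame({"starttime": [
    "7/1/2015 00:00:03",   # Wednesday
    "7/4/2015 08:15:00",   # Saturday
    "7/5/2015 23:59:59",   # Sunday
    "7/6/2015 12:00:00",   # Monday
]})

# An explicit format avoids per-row format inference
df["ts"] = pd.to_datetime(df["starttime"], format="%m/%d/%Y %H:%M:%S")

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
is_weekend = df["ts"].dt.dayofweek >= 5

df_weekends = df[is_weekend]
df_weekdays = df[~is_weekend]
print(df_weekends)
```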
I have a file with intraday prices every ten minutes. [0:41] times in a day. Each date is repeated 42 times. The multi-index below should "collapse" the repeated dates into one for all times.
There are 62,035 rows x 3 columns: [date, time, price].
I would like to write a function to get the difference of the ten minute prices, restricting differences to each unique date.
In other words, 09:30 is the first time of each day and 16:20 is the last: I cannot overlap differences between days of price from 16:20 - 09:30. The differences should start as 09:40 - 09:30 and end as 16:20 - 16:10 for each unique date in the dataframe.
Here is my attempt. Any suggestions would be greatly appreciated.
def diffSeries(rounded,data):
    '''This function accepts a column called rounded from 'data'
    The 2nd input 'data' is a dataframe
    '''
    df=rounded.shift(1)
    idf=data.set_index(['date', 'time'])
    data['diff']=['000']
    for i in range(0,length(rounded)):
        for day in idf.index.levels[0]:
            for time in idf.index.levels[1]:
                if idf.index.levels[1]!=1620:
                    data['diff']=rounded[i]-df[i]
                else:
                    day+=1
                    time+=2
    data[['date','time','price','II','diff']].to_csv('final.csv')
    return data['diff']
Then I call:
data=read_csv('file.csv')
rounded=roundSeries(data['price'],5)
diffSeries(rounded,data)
The traceback shows an AssertionError.
You can use groupby and then apply to achieve what you want:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
For a full example, suppose you create a test data set for 14 Nov to 16 Nov:
import pandas as pd
from numpy.random import randn
from datetime import datetime, time
# Create date range with 10 minute intervals, and filter out irrelevant times
# (pd.datetime no longer exists in recent pandas, so datetime.datetime is used directly)
times = pd.bdate_range(start=datetime(2012,11,14,0,0,0), end=datetime(2012,11,17,0,0,0), freq='10min')
filtered_times = [x for x in times if x.time() >= time(9,30) and x.time() <= time(16,20)]
prices = randn(len(filtered_times))
# Create MultiIndex and data frame matching the format of your CSV
arrays = [[x.date() for x in filtered_times],
          [x.time() for x in filtered_times]]
# zip returns an iterator in Python 3, so materialize it as a list
tuples = list(zip(*arrays))
m_index = pd.MultiIndex.from_tuples(tuples, names=['date', 'time'])
data = pd.DataFrame({'prices': prices}, index=m_index)
You should get a DataFrame a bit like this:
prices
date time
2012-11-14 09:30:00 0.696054
09:40:00 -1.263852
09:50:00 0.196662
10:00:00 -0.942375
10:10:00 1.915207
As mentioned above, you can then get the differences by grouping by the first index and then subtracting the previous row for each row:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
Which gives you something like:
prices
date time
2012-11-14 09:30:00 NaN
09:40:00 -1.959906
09:50:00 1.460514
10:00:00 -1.139036
10:10:00 2.857582
Since you are grouping by the date, the function is not applied for 16:20 - 09:30.
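On recent pandas versions the same per-date difference can probably be written more directly with groupby(level=...).diff(), which also sidesteps the extra index level that groupby.apply may add to the result. A small sketch with made-up prices for two dates and three times each:

```python
from datetime import date, time

import pandas as pd

# Hypothetical data in the same (date, time) MultiIndex layout
index = pd.MultiIndex.from_product(
    [[date(2012, 11, 14), date(2012, 11, 15)],
     [time(9, 30), time(9, 40), time(9, 50)]],
    names=["date", "time"],
)
data = pd.DataFrame(
    {"prices": [10.0, 12.0, 11.0, 20.0, 25.0, 23.0]}, index=index
)

# diff() restarts inside each date group, so the first time of
# every day is NaN and nothing is computed across days
diffs = data.groupby(level="date").diff()
print(diffs)
```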
You might want to consider using a Series with a DatetimeIndex (a time series) instead of a DataFrame with a MultiIndex, because it gives you far greater flexibility with this kind of data. Supposing you have already loaded your DataFrame from the CSV file, you can convert it and perform a similar operation to get the differences:
from datetime import datetime
dt_index = pd.DatetimeIndex([datetime.combine(i[0],i[1]) for i in data.index])
# or dt_index = pd.DatetimeIndex([datetime.combine(i.date,i.time) for i in data.index])
# if you don't have a multi-level index on data yet
ts = pd.Series(data.prices.values, dt_index)
diffs = ts.groupby(lambda idx: idx.date()).apply(lambda row: row - row.shift(1))
However, you would now have access to the built-in time series functions such as resampling. See the pandas time series documentation for more.
@MattiJohn's construction gives a filtered list of length 86,772 when run over 1/3/2007 to 8/30/2012 for 42 times a day (10 minute intervals). Observe the data cleaning issues.
Here the price data coming from the csv has length 62,034.
Hence, simply importing from the .csv, as follows, is problematic:
filtered_times = [x for x in times if x.time() >= time(9,30) and x.time() <= time(16,20)]
DF=pd.read_csv('MR10min.csv')
prices = DF.price
# I.E. rather than the generic: prices = randn(len(filtered_times)) above.
The fact that the real data falls short of the length it "should be" means there are data cleaning issues. Often we do not have the full set of times that bdate_range will generate (half days in the market, holidays, etc.).
Your solution is elegant. But I am not sure how to overcome the mismatch between the actual data and the a priori, prescribed dataframe.
Your second suggestion (the time series) still seems to require constructing a datetime index similar to the first one. For example, if I were to use the following lines to get the actual data of interest:
DF=pd.read_csv('MR10min.csv')
data=DF.set_index(['date','time'])
dt_index = pd.DatetimeIndex([datetime.combine(i[0],i[1]) for i in data.index])
It will generate a:
TypeError: combine() argument 1 must be datetime.date, not str
How does one make a bdate_range array completely informed by the actual data available?
Thank you to @MattiJohn and to anyone with interest in continuing this discussion.
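One possible way to get an index that is fully informed by the actual data is to skip the pre-generated range and parse the date and time columns that are already in the CSV. The sketch below is only an illustration under loud assumptions: the real layout of MR10min.csv is not shown in this thread, so the inline CSV text, the "YYYY-MM-DD" date strings and the "HH:MM" time strings are all hypothetical, and the format string would need to match the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for MR10min.csv (real column formats unknown)
csv_text = """date,time,price
2012-11-14,09:30,10.0
2012-11-14,09:40,10.5
2012-11-14,16:20,11.0
2012-11-15,09:30,12.0
2012-11-15,09:40,11.5
"""
DF = pd.read_csv(io.StringIO(csv_text))

# Build timestamps from the rows that actually exist, so holidays
# and half days never appear in the index in the first place
stamps = pd.to_datetime(DF["date"] + " " + DF["time"], format="%Y-%m-%d %H:%M")
ts = pd.Series(DF["price"].to_numpy(), index=pd.DatetimeIndex(stamps))

# Difference consecutive rows within each calendar day only
diffs = ts.groupby(ts.index.normalize()).diff()
print(diffs)
```

Note that this differences consecutive rows as they appear, so if a ten-minute bar is missing in the middle of a day the difference spans the gap (as with 09:40 to 16:20 in the made-up data).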