Using asfreq to resample a pandas dataframe - python

EDIT: I had made a mistake and my index was starting at 00:00:00, not at 06:00:00 (see below). So this question is spurious, but of course Wen's solution is correct.
I have a dataframe whose index goes like this:
2017-11-01 06:00:00
2017-11-02 06:00:00
2017-11-03 06:00:00
...
and so on. But I suspect there are missing entries; for instance, the index for 2017-11-04 06:00:00 could be missing. I have used
df = df.asfreq(freq="1D")
to fill the missing values with NaN, but it creates a new index that doesn't take the hours into consideration: it goes 2017-11-01, 2017-11-02 and so on, so the values in the adjacent column are all NaN!
How can I fix this? I don't see any option in asfreq that can solve it. Perhaps another tool? Thanks in advance.

It works fine on my side:
import numpy as np
import pandas as pd

l = ['2017-11-01 06:00:00',
     '2017-11-03 06:00:00']
ts = pd.Series(np.random.randn(len(l)), index=l)
ts.index = pd.to_datetime(ts.index)
ts.asfreq(freq="D")
Out[745]:
2017-11-01 06:00:00 -0.467919
2017-11-02 06:00:00 NaN
2017-11-03 06:00:00 1.610024
Freq: D, dtype: float64
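For completeness, a minimal sketch (with illustrative data) showing that asfreq anchors new rows on the first timestamp, so a series starting at 06:00 keeps the 06:00 offset, and that an explicit date_range plus reindex makes that anchor time an explicit choice:

```python
import numpy as np
import pandas as pd

# Illustrative series with a gap at 2017-11-02 06:00
idx = pd.to_datetime(["2017-11-01 06:00:00", "2017-11-03 06:00:00"])
s = pd.Series([1.0, 2.0], index=idx)

# asfreq anchors on the first timestamp, so the 06:00 offset is preserved
filled = s.asfreq("D")

# Equivalent: reindex against an explicit date_range, which makes the
# anchor time visible and lets you choose it independently of the data
full_idx = pd.date_range("2017-11-01 06:00:00", "2017-11-03 06:00:00", freq="D")
filled2 = s.reindex(full_idx)
```

Both produce a row for 2017-11-02 06:00:00 holding NaN, with the original values untouched.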

Related

selecting rows in dataframe using datetime.datetime

I'm new to Python.
I want to select a range of rows by using the datetime which is also the index.
I am not sure if having the datetime as the index is a problem or not.
my dataframe looks like this:
gradient
date
2022-04-15 10:00:00 0.013714
2022-04-15 10:20:00 0.140792
2022-04-15 10:40:00 0.148240
2022-04-15 11:00:00 0.016510
2022-04-15 11:20:00 0.018219
...
2022-05-02 15:40:00 0.191208
2022-05-02 16:00:00 0.016198
2022-05-02 16:20:00 0.043312
2022-05-02 16:40:00 0.500573
2022-05-02 17:00:00 0.955833
And I have made variables which contain the start and end date of the rows I want to select. This looks like this:
A_start_646 = datetime.datetime(2022,4,27, 11,0,0)
S_start_646 = datetime.datetime(2022,4,28, 3,0,0)
D_start_646 = datetime.datetime(2022,5,2, 15,25,0)
D_end_646 = datetime.datetime(2022,5, 2, 15,50,0)
So I would like to make a new dataframe. I saw some examples on the internet, but they use another way of expressing the date.
Does someone know a solution?
I feel kind of stupid and smart at the same time now, because I have already answered my own question. My apologies!
So this is the answer:
new_df = data_646_mean[A_start_646 : S_start_646]
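For reference, the same slice can also be written with .loc. A minimal sketch with an illustrative 20-minute index (the variable names mirror the question, the data is made up):

```python
import datetime

import numpy as np
import pandas as pd

# Illustrative 20-minute DatetimeIndex like the question's
idx = pd.date_range("2022-04-15 10:00", "2022-05-02 17:00", freq="20min")
df = pd.DataFrame({"gradient": np.random.rand(len(idx))}, index=idx)

A_start_646 = datetime.datetime(2022, 4, 27, 11, 0, 0)
S_start_646 = datetime.datetime(2022, 4, 28, 3, 0, 0)

# Label-based slicing on a sorted DatetimeIndex; both endpoints are inclusive
new_df = df.loc[A_start_646:S_start_646]
```

Having the datetime as the index is not a problem here; it is exactly what makes this kind of label-based slicing work (the index should be sorted).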

Comparing 2 csv files against one column with additional condition of matching values in different columns

At first I'd like to let you know that Python is just a tool for me, for a side project I'm currently working on. I have nothing to do with programming and learn everything ad hoc, so I'd really appreciate it if you kept this as simple as possible.
So far I managed to process and write my data into 2 csv files, which look like this:
Police.csv:
gmina, datetime
1, 2008-01-02 12:00:00
1, 2008-01-02 12:00:00
1, 2008-01-02 16:00:00
1, 2008-01-02 16:00:00
1, 2008-01-06 09:00:00
1, 2008-01-06 15:00:00
1, 2008-01-06 20:00:00
1, 2008-01-06 21:00:00
'gmina' goes from 1 to 10 with multiple dates
meteo.csv:
station, datetime, visibility
12100, 2000-01-09 14:00:00, 900.0
12100, 2000-01-09 15:00:00, 900.0
12100, 2000-01-16 06:00:00, 900.0
12100, 2000-01-16 07:00:00, 600.0
12100, 2000-01-16 08:00:00, 900.0
12100, 2000-01-16 12:00:00, 900.0
12100, 2000-01-16 13:00:00, 600.0
There are 10 different values for 'station'. The number of rows differs between the two CSVs.
What I want to do now is to find rows with the exact same date and hour and write them into a new csv, but only for pairs of keys and values like: 'gmina': '1' and 'station': '12100'; 'gmina': '2' and 'station': '12105'; and so on. I reckon I need a dictionary for that. I found something like this: Python: Comparing two CSV files and searching for similar items. I need something similar, only with this additional condition of matching values from 'gmina' and 'station'. Could you please give me a hint on how to implement this condition in the above code? Or maybe it'd be easier to parse those csv files into dataframes and work with pandas?
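Since the question asks whether pandas would be easier: here is a sketch of the dataframe route. The gmina-to-station pair table and the toy rows below are illustrative, not the real files; in practice you would build the frames with pd.read_csv(..., skipinitialspace=True, parse_dates=["datetime"]).

```python
import pandas as pd

# Hypothetical mapping of 'gmina' to 'station'; extend with the real pairs
pairs = pd.DataFrame({"gmina": [1, 2], "station": [12100, 12105]})

# Toy stand-ins for Police.csv and meteo.csv
police = pd.DataFrame({
    "gmina": [1, 1, 2],
    "datetime": pd.to_datetime([
        "2008-01-02 12:00:00", "2008-01-06 09:00:00", "2008-01-02 12:00:00"]),
})
meteo = pd.DataFrame({
    "station": [12100, 12100, 12105],
    "datetime": pd.to_datetime([
        "2008-01-02 12:00:00", "2008-01-09 15:00:00", "2008-01-02 12:00:00"]),
    "visibility": [900.0, 900.0, 600.0],
})

# Attach the station that belongs to each gmina, then keep only rows
# where that station has a record at exactly the same timestamp
matched = police.merge(pairs, on="gmina").merge(meteo, on=["station", "datetime"])
# matched.to_csv("matched.csv", index=False)
```

The second merge is an inner join on both keys at once, which is precisely the "same date and hour AND matching gmina/station pair" condition.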

pandas asfreq returns NaN if exact date DNE

Let's say I have financial data in a pandas.Series, called fin_series.
Here's a peek at fin_series.
In [565]: fin_series
Out[565]:
Date
2008-05-16 1000.000000
2008-05-19 1001.651747
2008-05-20 1004.137434
...
2014-12-22 1158.085200
2014-12-23 1150.139126
2014-12-24 1148.934665
Name: Close, Length: 1665
I'm interested in looking at the quarterly endpoints of the data. However, not all financial trading days fall exactly on the 'end of the quarter.'
For example:
In [566]: fin_series.asfreq('q')
Out[566]:
2008-06-30 976.169624
2008-09-30 819.518923
2008-12-31 760.429261
...
2009-06-30 795.768956
2009-09-30 870.467121
2009-12-31 886.329978
...
2011-09-30 963.304679
2011-12-31 NaN
2012-03-31 NaN
....
2012-09-30 NaN
2012-12-31 1095.757137
2013-03-31 NaN
2013-06-30 NaN
...
2014-03-31 1138.548881
2014-06-30 1168.248194
2014-09-30 1147.000073
Freq: Q-DEC, Name: Close, dtype: float64
Here's a little function that accomplishes what I'd like, along with the desired end result.
import numpy

def bmg_qt_asfreq(series):
    ind = series[1:].index.quarter != series[:-1].index.quarter
    ind = numpy.append(ind, True)
    return series[ind]
which gives me:
In [15]: bmg_qt_asfreq(fin_series)
Out[15]:
Date
2008-06-30 976.169425
2008-09-30 819.517607
2008-12-31 760.428770
...
2011-09-30 963.252831
2011-12-30 999.742132
2012-03-30 1049.848583
...
2012-09-28 1086.689824
2012-12-31 1093.943357
2013-03-28 1117.111859
Name: Close, dtype: float64
Note that I'm preserving the dates of the "closest prior price" instead of simply using pandas.asfreq(freq='q', method='ffill'), as preserving dates that exist within the original Series.Index is crucial.
This seems like a common problem that many people must have had, and one that the pandas time-manipulation functionality surely addresses, but I can't figure out how to do it with resample or asfreq.
I'd be grateful to anyone who can show me the built-in pandas functionality that accomplishes this.
Regards,
Assuming the input is a pandas Series, first do
import pandas as pd
fin_series.resample("QE").apply(pd.Series.last_valid_index)
to get a series with the last non-NA index for each quarter. Then
fin_series.resample("QE").last()
for the last non-NA value. You can then join these together. As you suggested in your comment:
fin_series[fin_series.resample("QE").apply(pd.Series.last_valid_index)]
df.asfreq('d').interpolate().asfreq('q')
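To see the idea end-to-end on concrete data, here is a sketch using a toy business-day series standing in for fin_series. Grouping by quarter Period sidesteps version differences in the quarterly resample alias, and the result keeps the real trading dates from the original index:

```python
import numpy as np
import pandas as pd

# Illustrative business-day series standing in for fin_series
idx = pd.bdate_range("2014-01-01", "2014-12-31")
fin_series = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Last valid original timestamp within each quarter...
per = fin_series.index.to_period("Q")
last_dates = fin_series.groupby(per).apply(pd.Series.last_valid_index)

# ...then look those dates up in the original series, so the
# result's index is the actual last trading day of each quarter
quarter_ends = fin_series.loc[last_dates.values]
```

Unlike asfreq('q'), which returns NaN whenever the exact quarter-end date is absent, this always picks the closest prior date that exists in the data.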

How to resample data in a single dataframe within 3 distinct groups

I've got a dataframe and want to resample certain columns (as hourly sums and means from 10-minutely data) WITHIN the 3 different 'users' that exist in the dataset.
A normal resample would use code like:
import pandas as pd
import numpy as np
df = pd.read_csv('example.csv')
df['Datetime'] = pd.to_datetime(df['date_datetime/_source'] + ' ' + df['time']) #create datetime stamp
df.set_index(df['Datetime'], inplace = True)
df = df.resample('1h').agg({'energy_kwh': 'sum', 'average_w': 'mean', 'norm_average_kw/kw': 'mean', 'temperature_degc': 'mean', 'voltage_v': 'mean'})
df
To get a result like (please forgive the column formatting; I have no idea how to paste this properly to make it look nice):
energy_kwh norm_average_kw/kw voltage_v temperature_degc average_w
Datetime
2013-04-30 06:00:00 0.027 0.007333 266.333333 4.366667 30.000000
2013-04-30 07:00:00 1.250 0.052333 298.666667 5.300000 192.500000
2013-04-30 08:00:00 5.287 0.121417 302.333333 7.516667 444.000000
2013-04-30 09:00:00 12.449 0.201000 297.500000 9.683333 726.000000
2013-04-30 10:00:00 26.101 0.396417 288.166667 11.150000 1450.000000
2013-04-30 11:00:00 45.396 0.460250 282.333333 12.183333 1672.500000
2013-04-30 12:00:00 64.731 0.440833 276.166667 13.550000 1541.000000
2013-04-30 13:00:00 87.095 0.562750 284.833333 13.733333 2084.500000
However, in the original CSV there is a column containing URLs; in the dataset of 100,000 rows there are 3 different URLs (effectively IDs). I want each resampled individually rather than having one 'lump' resample of everything (e.g. 9:00 AM on 2014-01-01 would have data for all 3 users, but each should have its own hourly sums and means).
I hope this makes sense - please let me know if I need to clarify anything.
FYI, I tried using the advice in the following 2 posts but to no avail:
Resampling a multi-index DataFrame
Resampling Within a Pandas MultiIndex
Thanks in advance
You can resample a groupby object, grouped by URLs, as in this minimal example:
In [157]:
df=pd.DataFrame({'Val': np.random.random(100)})
df['Datetime'] = pd.date_range('2001-01-01', periods=100, freq='5H') #create random dataset
df.set_index(df['Datetime'], inplace = True)
del df['Datetime']
df['Location']=np.tile(['l0', 'l1', 'l2', 'l3', 'l4'], 20)
In [158]:
print(df.groupby('Location').resample('10D', origin='start').agg({'Val': 'mean'}))
Val
Location Datetime
l0 2001-01-01 00:00:00 0.334183
2001-01-11 00:00:00 0.584260
l1 2001-01-01 05:00:00 0.288290
2001-01-11 05:00:00 0.470140
l2 2001-01-01 10:00:00 0.381273
2001-01-11 10:00:00 0.461684
l3 2001-01-01 15:00:00 0.703523
2001-01-11 15:00:00 0.386858
l4 2001-01-01 20:00:00 0.448857
2001-01-11 20:00:00 0.310914
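Applied to the question's schema, the same pattern takes a dict of per-column aggregations. This sketch uses toy data and hypothetical column names mirroring the question ('url' plays the role of the user ID):

```python
import numpy as np
import pandas as pd

# Toy 10-minutely frame mirroring the question's layout
rng = pd.date_range("2013-04-30 06:00", periods=60, freq="10min")
df = pd.DataFrame({
    "Datetime": np.tile(rng, 3),
    "url": np.repeat(["u1", "u2", "u3"], len(rng)),
    "energy_kwh": np.random.rand(3 * len(rng)),
    "average_w": np.random.rand(3 * len(rng)) * 100,
}).set_index("Datetime")

# Hourly sums/means computed separately within each URL group
hourly = (df.groupby("url")
            .resample("1h")
            .agg({"energy_kwh": "sum", "average_w": "mean"}))
```

The result has a (url, Datetime) MultiIndex, so each user keeps its own hourly totals instead of being lumped together.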

How to sum field across two DataFrames when the indexes don't line up?

I am brand new to complex data analysis in general, and to pandas in particular. I have a feeling that pandas should be able to handle this task easily, but my newbieness prevents me from seeing the path to a solution. I want to sum one column across two files at a given time each day, 3pm in this case. If a file doesn't have a record at 3pm that day, I want to use the previous record.
Let me give a concrete example. I have data in two CSV files. Here are a couple small examples:
datetime value
2013-02-28 09:30:00 0.565019720442
2013-03-01 09:30:00 0.549536266504
2013-03-04 09:30:00 0.5023031467
2013-03-05 09:30:00 0.698370467751
2013-03-06 09:30:00 0.75834927162
2013-03-07 09:30:00 0.783620442226
2013-03-11 09:30:00 0.777265379462
2013-03-12 09:30:00 0.785787872851
2013-03-13 09:30:00 0.784873183044
2013-03-14 10:15:00 0.802959366653
2013-03-15 10:15:00 0.802959366653
2013-03-18 10:15:00 0.805413095911
2013-03-19 09:30:00 0.80816233134
2013-03-20 10:15:00 0.878912249996
2013-03-21 10:15:00 0.986393922571
and the other:
datetime value
2013-02-28 05:00:00 0.0373634672097
2013-03-01 05:00:00 -0.24700085273
2013-03-04 05:00:00 -0.452964976056
2013-03-05 05:00:00 -0.2479288295
2013-03-06 05:00:00 -0.0326855588777
2013-03-07 05:00:00 0.0780461766619
2013-03-08 05:00:00 0.306247682656
2013-03-11 06:00:00 0.0194146154407
2013-03-12 05:30:00 0.0103653153719
2013-03-13 05:30:00 0.0350377752558
2013-03-14 05:30:00 0.0110884755383
2013-03-15 05:30:00 -0.173216846788
2013-03-19 05:30:00 -0.211785013352
2013-03-20 05:30:00 -0.891054563968
2013-03-21 05:30:00 -1.27207563599
2013-03-22 05:30:00 -1.28648629004
2013-03-25 05:30:00 -1.5459897419
Note that a) neither file actually has a 3pm record, and b) the two files don't always have records for any given day. (2013-03-08 is missing from the first file, while 2013-03-18 is missing from the second, and the first file ends before the second.) As output, I envision a dataframe like this (perhaps just the date without the time):
datetime value
2013-Feb-28 15:00:00 0.6023831876517
2013-Mar-01 15:00:00 0.302535413774
2013-Mar-04 15:00:00 0.049338170644
2013-Mar-05 15:00:00 0.450441638251
2013-Mar-06 15:00:00 0.7256637127423
2013-Mar-07 15:00:00 0.8616666188879
2013-Mar-08 15:00:00 0.306247682656
2013-Mar-11 15:00:00 0.7966799949027
2013-Mar-12 15:00:00 0.7961531882229
2013-Mar-13 15:00:00 0.8199109582998
2013-Mar-14 15:00:00 0.8140478421913
2013-Mar-15 15:00:00 0.629742519865
2013-Mar-18 15:00:00 0.805413095911
2013-Mar-19 15:00:00 0.596377317988
2013-Mar-20 15:00:00 -0.012142313972
2013-Mar-21 15:00:00 -0.285681713419
2013-Mar-22 15:00:00 -1.28648629004
2013-Mar-25 15:00:00 -1.5459897419
I have a feeling this is perhaps a three-liner in pandas, but it's not at all clear to me how to do it. Further complicating my thinking, more complex CSV files might have multiple records for a single day (same date, different times).

It seems I need to either generate a new pair of input dataframes with times at 15:00 and then sum across their value columns keying on just the date, or, during the sum operation, select the record with the greatest time on any given day with time <= 15:00:00. I suspect I have to group rows sharing the same date, then within each group select only the row nearest to (but not later than) 3pm. Kind of at that point my brain explodes.
I got nowhere looking at the documentation, as I don't really understand all the database-like operations pandas supports. Pointers to relevant documentation (especially tutorials) would be much appreciated.
First combine your DataFrames:
df3 = pd.concat([df1, df2])
so that everything is in one table; next use groupby to sum across timestamps:
df4 = df3.groupby('datetime').aggregate('sum')
now df4 has a value column that is the sum of the values from all rows sharing a datetime.
Assuming you have the timestamps as datetime objects, you can do whatever filtering you like at any stage:
filtered = df[df['datetime'] < datetime.datetime(year, month, day, hour, minute, second)]
I'm not sure exactly what you're trying to do; you may need to parse your timestamp columns before filtering.
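A sketch of the question's "previous record at or before 3pm" idea, using reindex with forward-fill on toy frames (the data below is illustrative, standing in for the two CSVs):

```python
import pandas as pd

# Toy stand-ins for the two files, indexed by their timestamps
df1 = pd.DataFrame({"value": [0.5, 0.7]},
                   index=pd.to_datetime(["2013-03-04 09:30", "2013-03-05 09:30"]))
df2 = pd.DataFrame({"value": [-0.4, -0.2]},
                   index=pd.to_datetime(["2013-03-04 05:00", "2013-03-06 05:00"]))

# One 15:00 stamp for every date seen in either file
dates = df1.index.normalize().union(df2.index.normalize())
at_3pm = dates + pd.Timedelta(hours=15)

# reindex with ffill picks the latest record at or before each 15:00 stamp,
# then the two aligned columns can simply be added
total = (df1.reindex(at_3pm, method="ffill")["value"]
         + df2.reindex(at_3pm, method="ffill")["value"])
```

This handles both wrinkles from the question: days missing from one file fall back to that file's previous record, and multiple records per day are resolved by taking the last one not after 15:00.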
