Create multiple columns in pandas aggregation function - python

I'd like to create multiple columns while resampling a pandas DataFrame like the built-in ohlc method.
def mhl(data):
    return pandas.Series([np.mean(data), np.max(data), np.min(data)],
                         index=['mean', 'high', 'low'])

ts.resample('30Min', how=mhl)
Dies with
Exception: Must produce aggregated value
Any suggestions? Thanks!

You can pass a dictionary of functions to the resample method:
In [35]: ts
Out[35]:
2013-01-01 00:00:00 0
2013-01-01 00:15:00 1
2013-01-01 00:30:00 2
2013-01-01 00:45:00 3
2013-01-01 01:00:00 4
2013-01-01 01:15:00 5
...
2013-01-01 23:00:00 92
2013-01-01 23:15:00 93
2013-01-01 23:30:00 94
2013-01-01 23:45:00 95
2013-01-02 00:00:00 96
Freq: 15T, Length: 97
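For reference, a series like the one shown can be rebuilt with this sketch (the values and dates are read off the output above):
import pandas as pd

# reconstruct the sample series: 97 points at 15-minute frequency
ts = pd.Series(range(97), index=pd.date_range('2013-01-01', periods=97, freq='15min'))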
Create a dictionary of functions:
mhl = {'m':np.mean, 'h':np.max, 'l':np.min}
Pass the dictionary to the how parameter of resample:
In [36]: ts.resample("30Min", how=mhl)
Out[36]:
h m l
2013-01-01 00:00:00 1 0.5 0
2013-01-01 00:30:00 3 2.5 2
2013-01-01 01:00:00 5 4.5 4
2013-01-01 01:30:00 7 6.5 6
2013-01-01 02:00:00 9 8.5 8
2013-01-01 02:30:00 11 10.5 10
2013-01-01 03:00:00 13 12.5 12
2013-01-01 03:30:00 15 14.5 14
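Note that in modern pandas the how keyword has been removed; the equivalent today is to call .agg() on the resampler, for example:
# modern equivalent: `how` was removed from resample in later pandas versions
ts.resample('30min').agg(['mean', 'max', 'min'])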

Related

Drop all rows for a month if a value in a column is repeated more than a threshold number of times

I have a dataframe with time data in the format:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 -9999
3 2013-01-01 03:00:00 -9999
4 2013-01-01 04:00:00 0.0
.. ... ...
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999
8759 2016-12-31 23:00:00 0.0
Suppose the value -9999 appears 200 times in the month of January and the threshold is 150: then the entire month of January, i.e. all of its rows, must be deleted.
date values repeated
1 2013-02 0
2 2013-03 2
4 2013-05 0
5 2013-06 0
6 2013-07 66
7 2013-08 0
8 2013-09 7
With this I think I can drop the repeated rows, but I want to drop the whole month.
import numpy as np

df['month'] = df['date'].dt.to_period('M')
# n_missing is never defined here, which is where this attempt breaks down
df['new_value'] = np.where((df['values'] == -9999) & (df['n_missing'] > 150),
                           np.nan, df['values'])
df = df.dropna()
How can I do that ?
One way is to use pandas.to_datetime together with pandas.DataFrame.groupby followed by filter.
Here's a sample with months that have -9999 repeated 2, 1, 0, 2 times each:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 -9999.0
3 2013-01-01 03:00:00 -9999.0
4 2013-01-01 04:00:00 0.0
5 2013-02-01 12:00:00 -9999.0
6 2013-03-01 12:00:00 0.0
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999.0
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999.0
8759 2016-12-31 23:00:00 0.0
Then we do filtering:
date = pd.to_datetime(df["date"]).dt.strftime("%Y-%m")
new_df = df.groupby(date).filter(lambda x: x["values"].eq(-9999).sum() < 2)
print(new_df)
Output:
date values
5 2013-02-01 12:00:00 -9999.0
6 2013-03-01 12:00:00 0.0
You can see the months with 2 or more repeats are deleted.
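For the original data and the question's threshold of 150, the same idea would look like this (a sketch; it assumes df['date'] is parseable as datetime):
import pandas as pd

# group by calendar month and keep only months with at most 150 occurrences of -9999
month = pd.to_datetime(df['date']).dt.to_period('M')
new_df = df.groupby(month).filter(lambda g: g['values'].eq(-9999).sum() <= 150)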

count contiguous NaN values by unique values

I have contiguous periods of NaN values per code. I want to count the NaN values in each contiguous run of NaNs by code, and I also want the start and end date of each contiguous run.
df :
CODE TMIN
1998-01-01 00:00:00 12 2.5
1999-01-01 00:00:00 12 NaN
2000-01-01 00:00:00 12 NaN
2001-01-01 00:00:00 12 2.2
2002-01-01 00:00:00 12 NaN
1998-01-01 00:00:00 41 NaN
1999-01-01 00:00:00 41 NaN
2000-01-01 00:00:00 41 5.0
2001-01-01 00:00:00 41 9.0
2002-01-01 00:00:00 41 8.0
1998-01-01 00:00:00 52 2.0
1999-01-01 00:00:00 52 NaN
2000-01-01 00:00:00 52 NaN
2001-01-01 00:00:00 52 NaN
2002-01-01 00:00:00 52 1.0
1998-01-01 00:00:00 91 NaN
Expected results :
Start_Date           End_Date             CODE  contiguous_missing
1999-01-01 00:00:00 2000-01-01 00:00:00 12 2
2002-01-01 00:00:00 2002-01-01 00:00:00 12 1
1998-01-01 00:00:00 1999-01-01 00:00:00 41 2
1999-01-01 00:00:00 2001-01-01 00:00:00 52 3
1998-01-01 00:00:00 1998-01-01 00:00:00 91 1
How can I solve this? Thanks!
You can try grouping by the cumsum of the non-null values:
df['group'] = df.TMIN.notna().cumsum()
(df[df.TMIN.isna()]
.groupby(['group','CODE'])
.agg(Start_Date=('group', lambda x: x.index.min()),
End_Date=('group', lambda x: x.index.max()),
cont_missing=('TMIN', 'size')
)
)
Output:
Start_Date End_Date cont_missing
group CODE
1 12 1999-01-01 00:00:00 2000-01-01 00:00:00 2
2 12 2002-01-01 00:00:00 2002-01-01 00:00:00 1
41 1998-01-01 00:00:00 1999-01-01 00:00:00 2
6 52 1999-01-01 00:00:00 2001-01-01 00:00:00 3
7 91 1998-01-01 00:00:00 1998-01-01 00:00:00 1
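If you want the flat layout of the expected results, you can drop the helper group level afterwards, e.g.:
# drop the helper `group` level and move CODE back to a column
result = (df[df.TMIN.isna()]
            .groupby(['group', 'CODE'])
            .agg(Start_Date=('group', lambda x: x.index.min()),
                 End_Date=('group', lambda x: x.index.max()),
                 cont_missing=('TMIN', 'size'))
            .droplevel('group')
            .reset_index())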

How do I resample a pandas Series using values around the hour

I have time series data recorded at 10-minute frequency. I want to average the values at one-hour intervals. But for each hour I want to take the 3 values before the hour, the value at the hour, and the 2 values after it, average them, and assign that average to the exact hour timestamp.
for example, I have the series
index = pd.date_range('2000-01-01T00:30:00', periods=63, freq='10min')
series = pd.Series(range(63), index=index)
series
2000-01-01 00:30:00 0
2000-01-01 00:40:00 1
2000-01-01 00:50:00 2
2000-01-01 01:00:00 3
2000-01-01 01:10:00 4
2000-01-01 01:20:00 5
2000-01-01 01:30:00 6
2000-01-01 01:40:00 7
2000-01-01 01:50:00 8
2000-01-01 02:00:00 9
2000-01-01 02:10:00 10
..
2000-01-01 08:50:00 50
2000-01-01 09:00:00 51
2000-01-01 09:10:00 52
2000-01-01 09:20:00 53
2000-01-01 09:30:00 54
2000-01-01 09:40:00 55
2000-01-01 09:50:00 56
2000-01-01 10:00:00 57
2000-01-01 10:10:00 58
2000-01-01 10:20:00 59
2000-01-01 10:30:00 60
2000-01-01 10:40:00 61
2000-01-01 10:50:00 62
Freq: 10T, Length: 63, dtype: int64
So, if I do
series.resample('1H').mean()
2000-01-01 00:00:00 1.0
2000-01-01 01:00:00 5.5
2000-01-01 02:00:00 11.5
2000-01-01 03:00:00 17.5
2000-01-01 04:00:00 23.5
2000-01-01 05:00:00 29.5
2000-01-01 06:00:00 35.5
2000-01-01 07:00:00 41.5
2000-01-01 08:00:00 47.5
2000-01-01 09:00:00 53.5
2000-01-01 10:00:00 59.5
Freq: H, dtype: float64
the first value is the average of 0, 1, 2 and is assigned to hour 0; the second is the average of the values from 1:00:00 to 1:50:00 and is assigned to 1:00:00, and so on.
What I would like instead is the first average centered at 1:00:00, calculated using values from 00:30:00 through 01:20:00; the second centered at 02:00:00, calculated from 01:30:00 to 02:20:00; and so on.
What will be the best way to do that?
Thanks!
You should be able to do that with:
# shift timestamps back 30 minutes so each desired window lines up with an hourly bin
series.index = series.index - pd.Timedelta(30, unit='m')
series_grouped_mean = series.groupby(pd.Grouper(freq='60min')).mean()
# the bin label is the left edge of the shifted window; move it forward to the window's centre hour
series_grouped_mean.index = series_grouped_mean.index + pd.Timedelta(60, unit='m')
series_grouped_mean
I got:
2000-01-01 01:00:00 2.5
2000-01-01 02:00:00 8.5
2000-01-01 03:00:00 14.5
2000-01-01 04:00:00 20.5
2000-01-01 05:00:00 26.5
2000-01-01 06:00:00 32.5
2000-01-01 07:00:00 38.5
2000-01-01 08:00:00 44.5
2000-01-01 09:00:00 50.5
2000-01-01 10:00:00 56.5
2000-01-01 11:00:00 61.0
Freq: H, dtype: float64
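An alternative that leaves the original series' index untouched is the offset argument of resample (a sketch; offset requires pandas >= 1.1):
# bins start at :30 past the hour, so each bin covers HH-0:30 .. HH+0:20
centered = series.resample('60min', offset='30min').mean()
# the bin label is the left edge (HH:30); shift it to the centre hour
centered.index = centered.index + pd.Timedelta('30min')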

Comparing dates of two pandas DataFrames and adding values if dates are similar?

I have two dataframes with different numbers of rows: one has daily values and the second has hourly values. I want to compare them and, where the dates match, append the daily value to every hourly row of the same day. The dataframes are:
import pandas as pd

df1 = pd.read_csv(r'C:\Users\ABC.csv')
df2 = pd.read_csv(r'C:\Users\DEF.csv')
df1['Datetime'] = pd.to_datetime(df1['Datetime'])
df2['Datetime'] = pd.to_datetime(df2['Datetime'])
df1.head()
Out [3]
Datetime Value
0 2016-02-02 21:00:00 0.6
1 2016-02-02 22:00:00 0.4
2 2016-02-02 23:00:00 0.4
3 2016-03-02 00:00:00 0.3
4 2016-03-02 01:00:00 0.2
df2.head()
Out [4]:
Datetime No of people
0 2016-02-02 56
1 2016-03-02 60
2 2016-04-02 91
3 2016-05-02 87
4 2016-06-02 90
What I would like to have is something like this;
Datetime Value No of People
0 2016-02-02 21:00:00 0.6 56
1 2016-02-02 22:00:00 0.4 56
2 2016-02-02 23:00:00 0.4 56
3 2016-03-02 00:00:00 0.3 60
4 2016-03-02 01:00:00 0.2 60
Any idea how to do this in Python using pandas? Please note there may be some dates missing.
You can set the index to df1.Datetime.dt.date for df1 and then join it with df2:
In [46]: df1.set_index(df1.Datetime.dt.date).join(df2.set_index('Datetime')).reset_index(drop=True)
Out[46]:
Datetime Value No_of_people
0 2016-02-02 21:00:00 0.6 56
1 2016-02-02 22:00:00 0.4 56
2 2016-02-02 23:00:00 0.4 56
3 2016-03-02 00:00:00 0.3 60
4 2016-03-02 01:00:00 0.2 60
Optionally, you may want to pass how='left' when calling join().
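A merge-based equivalent of the join above (a sketch; it assumes both Datetime columns are already parsed as datetime64):
# normalize hourly timestamps to midnight so they line up with the daily rows
out = (df1.assign(day=df1['Datetime'].dt.normalize())
          .merge(df2.rename(columns={'Datetime': 'day'}), on='day', how='left')
          .drop(columns='day'))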
You can just use pd.concat and fillna(method='ffill'), because each daily value lines up with the first row of its day:
from datetime import date

import numpy as np
import pandas as pd

df1 = pd.DataFrame(data={'day': np.random.randint(low=50, high=100, size=10),
                         'date': pd.date_range(date(2016, 1, 1), freq='D', periods=10)})
date day
0 2016-01-01 55
1 2016-01-02 51
2 2016-01-03 92
3 2016-01-04 78
4 2016-01-05 72
df2 = pd.DataFrame(data={'hour': np.random.randint(low=1, high=10, size=100),
                         'datetime': pd.date_range(date(2016, 1, 1), freq='H', periods=100)})
datetime hour
0 2016-01-01 00:00:00 5
1 2016-01-01 01:00:00 1
2 2016-01-01 02:00:00 4
3 2016-01-01 03:00:00 5
4 2016-01-01 04:00:00 2
like so:
pd.concat([df2.set_index('datetime'), df1.set_index('date')], axis=1).fillna(method='ffill')
to get:
hour day
2016-01-01 00:00:00 5.0 55.0
2016-01-01 01:00:00 1.0 55.0
2016-01-01 02:00:00 4.0 55.0
2016-01-01 03:00:00 5.0 55.0
2016-01-01 04:00:00 2.0 55.0
2016-01-01 05:00:00 3.0 55.0
2016-01-01 06:00:00 5.0 55.0
2016-01-01 07:00:00 6.0 55.0
2016-01-01 08:00:00 6.0 55.0
2016-01-01 09:00:00 8.0 55.0
2016-01-01 10:00:00 3.0 55.0
2016-01-01 11:00:00 5.0 55.0
2016-01-01 12:00:00 7.0 55.0
2016-01-01 13:00:00 7.0 55.0
2016-01-01 14:00:00 4.0 55.0
2016-01-01 15:00:00 5.0 55.0
2016-01-01 16:00:00 7.0 55.0
2016-01-01 17:00:00 4.0 55.0
2016-01-01 18:00:00 6.0 55.0
2016-01-01 19:00:00 1.0 55.0
2016-01-01 20:00:00 8.0 55.0
2016-01-01 21:00:00 8.0 55.0
2016-01-01 22:00:00 2.0 55.0
2016-01-01 23:00:00 3.0 55.0
2016-01-02 00:00:00 7.0 51.0
2016-01-02 01:00:00 6.0 51.0
2016-01-02 02:00:00 8.0 51.0
2016-01-02 03:00:00 6.0 51.0
2016-01-02 04:00:00 1.0 51.0
2016-01-02 05:00:00 5.0 51.0
... ... ...
2016-01-04 03:00:00 6.0 78.0
2016-01-04 04:00:00 9.0 78.0
2016-01-04 05:00:00 1.0 78.0
2016-01-04 06:00:00 6.0 78.0
2016-01-04 07:00:00 3.0 78.0
2016-01-04 08:00:00 9.0 78.0
2016-01-04 09:00:00 5.0 78.0
2016-01-04 10:00:00 3.0 78.0
2016-01-04 11:00:00 6.0 78.0
2016-01-04 12:00:00 4.0 78.0
2016-01-04 13:00:00 2.0 78.0
2016-01-04 14:00:00 4.0 78.0
2016-01-04 15:00:00 3.0 78.0
2016-01-04 16:00:00 4.0 78.0
2016-01-04 17:00:00 9.0 78.0
2016-01-04 18:00:00 8.0 78.0
2016-01-04 19:00:00 4.0 78.0
2016-01-04 20:00:00 7.0 78.0
2016-01-04 21:00:00 1.0 78.0
2016-01-04 22:00:00 6.0 78.0
2016-01-04 23:00:00 1.0 78.0
2016-01-05 00:00:00 5.0 72.0
2016-01-05 01:00:00 8.0 72.0
2016-01-05 02:00:00 6.0 72.0
2016-01-05 03:00:00 3.0 72.0
2016-01-06 00:00:00 3.0 87.0
2016-01-07 00:00:00 3.0 50.0
2016-01-08 00:00:00 3.0 65.0
2016-01-09 00:00:00 3.0 81.0
2016-01-10 00:00:00 3.0 65.0
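One caveat: recent pandas versions deprecate the method argument of fillna, so the same line today would use .ffill() instead:
pd.concat([df2.set_index('datetime'), df1.set_index('date')], axis=1).ffill()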

python pandas date_range when importing with read_csv

I have a .csv file with some data in the following format:
1.69511909, 0.57561167, 0.31437427, 0.35458831, 0.15841189, 0.28239582, -0.18180907, 1.34761404, -1.5059083, 1.29246638
-1.66764664, 0.1488095, 1.03832221, -0.35229205, 1.35705861, -1.56747104, -0.36783851, -0.57636948, 0.9854391, 1.63031066
0.87763775, 0.60757153, 0.64908314, -0.68357724, 0.33499838, -0.08557089, 1.71855596, -0.61235066, -0.32520105, 1.54162629
Every line corresponds to a specific day, and every record in a line corresponds to a specific hour in that day.
Is there a convenient way of importing the data with read_csv such that everything is correctly indexed, i.e. the importing function discriminates between days (lines) and hours within days (separate records in lines)?
Something like this. (Note that I couldn't copy your full string for some reason, so my dataset is cut off.)
Read in the string; it comes back as a DataFrame because the text has newlines, so it needs to be coerced to a Series. This assumes from io import StringIO and that the raw text is in data.
In [23]: s = pd.read_csv(StringIO(data)).values
In [24]: s
Out[24]:
array([[-1.66764664, 0.1488095 , 1.03832221, -0.35229205, 1.35705861,
-1.56747104, -0.36783851, -0.57636948, 0.9854391 , 1.63031066],
[ 0.87763775, 0.60757153, 0.64908314, -0.68357724, 0.33499838,
-0.08557089, 1.71855596, nan, nan, nan]])
In [25]: s = Series(pd.read_csv(StringIO(data)).values.ravel())
In [26]: s
Out[26]:
0 -1.667647
1 0.148810
2 1.038322
3 -0.352292
4 1.357059
5 -1.567471
6 -0.367839
7 -0.576369
8 0.985439
9 1.630311
10 0.877638
11 0.607572
12 0.649083
13 -0.683577
14 0.334998
15 -0.085571
16 1.718556
17 NaN
18 NaN
19 NaN
dtype: float64
Just set the index directly. Note that you are solely responsible for alignment here; it is very easy to end up off by one.
In [27]: s.index = pd.date_range('20130101',freq='H',periods=len(s))
In [28]: s
Out[28]:
2013-01-01 00:00:00 -1.667647
2013-01-01 01:00:00 0.148810
2013-01-01 02:00:00 1.038322
2013-01-01 03:00:00 -0.352292
2013-01-01 04:00:00 1.357059
2013-01-01 05:00:00 -1.567471
2013-01-01 06:00:00 -0.367839
2013-01-01 07:00:00 -0.576369
2013-01-01 08:00:00 0.985439
2013-01-01 09:00:00 1.630311
2013-01-01 10:00:00 0.877638
2013-01-01 11:00:00 0.607572
2013-01-01 12:00:00 0.649083
2013-01-01 13:00:00 -0.683577
2013-01-01 14:00:00 0.334998
2013-01-01 15:00:00 -0.085571
2013-01-01 16:00:00 1.718556
2013-01-01 17:00:00 NaN
2013-01-01 18:00:00 NaN
2013-01-01 19:00:00 NaN
Freq: H, dtype: float64
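One thing worth flagging: without header=None, read_csv treats the first line of numbers as column names and silently drops it from the data. A safer version of the call above would be:
# header=None keeps the first line as data rather than column names
s = Series(pd.read_csv(StringIO(data), header=None).values.ravel())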
First just read in the DataFrame (the regex separator needs a raw string, and regex separators are only supported by the Python parser engine):
df = pd.read_csv(file_name, sep=r',\s+', header=None, engine='python')
Then set the index to be the dates and the columns to be the hours*
df.index = pd.date_range('2012-01-01', freq='D', periods=len(df))
from pandas.tseries.offsets import Hour
df.columns = [Hour(7+t) for t in df.columns]
In [5]: df
Out[5]:
<7 Hours> <8 Hours> <9 Hours> <10 Hours> <11 Hours> <12 Hours> <13 Hours> <14 Hours> <15 Hours> <16 Hours>
2012-01-01 1.695119 0.575612 0.314374 0.354588 0.158412 0.282396 -0.181809 1.347614 -1.505908 1.292466
2012-01-02 -1.667647 0.148810 1.038322 -0.352292 1.357059 -1.567471 -0.367839 -0.576369 0.985439 1.630311
2012-01-03 0.877638 0.607572 0.649083 -0.683577 0.334998 -0.085571 1.718556 -0.612351 -0.325201 1.541626
Then stack it and add the Date and the Hour levels of the MultiIndex:
s = df.stack()
s.index = [x[0]+x[1] for x in s.index]
In [8]: s
Out[8]:
2012-01-01 07:00:00 1.695119
2012-01-01 08:00:00 0.575612
2012-01-01 09:00:00 0.314374
2012-01-01 10:00:00 0.354588
2012-01-01 11:00:00 0.158412
2012-01-01 12:00:00 0.282396
2012-01-01 13:00:00 -0.181809
2012-01-01 14:00:00 1.347614
2012-01-01 15:00:00 -1.505908
2012-01-01 16:00:00 1.292466
2012-01-02 07:00:00 -1.667647
2012-01-02 08:00:00 0.148810
...
* You can use different offsets, e.g. Minute, Second; see the pandas date offsets documentation.
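Putting the first answer's steps together, a self-contained sketch (the inline data and the start date are hypothetical stand-ins for the real file):
from io import StringIO

import pandas as pd

# hypothetical stand-in for the .csv contents: one line per day, one record per hour
data = '''1.69511909, 0.57561167, 0.31437427
-1.66764664, 0.1488095, 1.03832221
0.87763775, 0.60757153, 0.64908314'''

df = pd.read_csv(StringIO(data), sep=r',\s+', header=None, engine='python')
s = pd.Series(df.values.ravel())
s.index = pd.date_range('2013-01-01', freq='H', periods=len(s))
print(s)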
