I have a large time-series data set with over 200 recorded values (columns). Some values need to be averaged and some need to be summed, and I have a list that determines which is which. I need help figuring out how to feed that list into the how= argument of resample.
Example data:
"Timestamp","TZ","TAO (degF)","RHO (%)","WS (mph)","WD (deg)","RAIN (mm)","OAP (hPa)","INSOL (W/m2)","HAIL (hits/cm2)"......."
2014/04/01 01:01:01.005,n,45.3,88.2,0,0.6,0.339,1.0108,-0.270342,0,68.147808,40.91662,68.15884,40.672356,66.55452,......
2014/04/01 01:02:01.027,n,45.3,88,0,3.4,0.339,1.0108,-0.124948,0,68.216736,40.929836,68.15884,40.656932,66.560072,.......
2014/04/01 01:03:01.050,n,45.3,88,0,1.7,0.34,1.0108,-0.145394,0,68.156064,40.890184,68.103736,40.68332,66.557296,......
The best I can come up with is concatenating the list entries into strings to pass to the how= argument, but those concatenated strings make the underlying SeriesGroupBy error out.
df = pandas.read_csv(parsedatafile, parse_dates=True, date_parser=lambda x: datetime.datetime.strptime(x, '%Y/%m/%d %H:%M:%S.%f'), index_col=0)

i = 0
while i < len(recordname):
    if recordhow[i] == "Y":
        #parseavgsum[i]="sum"
        recordhow[i] = str(recordname[i]) + str(": sum")
    else:
        recordhow[i] = str(recordname[i]) + str(": mean")
        #parseavgsum[i]="mean"
    i += 1

df2 = df.resample('60Min', how=recordhow)
I would pass how a dictionary:
>>> df
WD (deg) RAIN (mm)
Timestamp
2014-04-01 01:01:01.005000 40.916620 68.158840
2014-04-01 01:02:01.027000 40.929836 68.158840
2014-04-01 01:03:01.050000 40.890184 68.103736
[3 rows x 2 columns]
>>> what_to_do = {"WD (deg)": "mean", "RAIN (mm)": "sum"}
>>> df.resample("60Min", how=what_to_do)
RAIN (mm) WD (deg)
Timestamp
2014-04-01 01:00:00 204.421416 40.912213
[1 rows x 2 columns]
I think using a recordhow list like you're doing is a little dangerous, because it's very easy for columns to get shuffled accidentally, in which case your means and sums would be off. It's much safer to work with column names. But if you have recordhow, you could do something like:
>>> recordhow = ["N", "Y"]
>>> how_map = {"Y": "sum", "N": "mean"}
>>> what_to_do = dict(zip(df.columns, [how_map[x] for x in recordhow]))
>>> what_to_do
{'RAIN (mm)': 'sum', 'WD (deg)': 'mean'}
but again, I recommend moving away from a bare list that doesn't know what maps to what as quickly as possible.
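For instance, here is a minimal sketch of the name-keyed version I am recommending. The agg_flags dict is hypothetical; build it from wherever your Y/N flags actually come from, keyed by column name instead of by position:
# Hypothetical per-column flags keyed by column name (not part of the original post).
agg_flags = {"RAIN (mm)": "Y", "WD (deg)": "N"}
how_map = {"Y": "sum", "N": "mean"}
what_to_do = {col: how_map[flag] for col, flag in agg_flags.items()}
df2 = df.resample("60Min", how=what_to_do)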
I'm seeing strange behaviour with the map function when applied to a DatetimeIndex: the first thing that gets mapped is the whole index, and only then is each element processed individually (as expected).
Here's a way to reproduce the issue (I have tried it on pandas 0.22.0, 0.23.0 and 0.24.0):
import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.random.randn(3, 1),
                  index=pd.DatetimeIndex(start='2018-05-03',
                                         periods=3,
                                         freq='D'))
df.index.map(lambda x: print(x))
yields:
DatetimeIndex(['2018-05-03', '2018-05-04', '2018-05-05'], dtype='datetime64[ns]', freq='D')
2018-05-03 00:00:00
2018-05-04 00:00:00
2018-05-05 00:00:00
Index([None, None, None], dtype='object')
EDIT: The very first line that the print is producing is what I find odd. If I use a RangeIndex this doesn't happen.
Surprising print behaviour
This unusual behaviour only affects a DatetimeIndex and not a Series. So to fix the bug, wrap your index in pd.Series() before mapping the lambda function:
pd.Series(df.index).map(lambda x: print(x))
Alternatively you can use the .to_series() method:
df.index.to_series().map(lambda x: print(x))
Note the return values of the pd.Series() version will be numerically indexed, while the return values of the .to_series() version will be datetime indexed.
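A quick sketch of that difference, assuming the df built in the question above:
s1 = pd.Series(df.index)    # values are the timestamps, index is a RangeIndex 0, 1, 2
s2 = df.index.to_series()   # values and index are both the timestamps
print(s1.index)             # RangeIndex(start=0, stop=3, step=1)
print(s2.index)             # DatetimeIndex(['2018-05-03', '2018-05-04', '2018-05-05'], dtype='datetime64[ns]', freq='D')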
Is this a bug?
Index.map(), like Series.map(), returns a container holding the return values of your lambda function (here an Index, since it was called on an Index).
In this case, print() just returns None, so you are correctly getting an Index of None values. The print behaviour is inconsistent with other types of pandas Indexes and Series, but this is an unusual application.
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randn(3, 1),
                  index=pd.DatetimeIndex(start='2018-05-03',
                                         periods=3,
                                         freq='D'))
example = df.index.map(lambda x: print(x))
# DatetimeIndex(['2018-05-03', '2018-05-04', '2018-05-05'], dtype='datetime64[ns]', freq='D')
# 2018-05-03 00:00:00
# 2018-05-04 00:00:00
# 2018-05-05 00:00:00
print(example)
# Index([None, None, None], dtype='object')
As you can see, there's nothing wrong with the return value. Or for a clearer example, where we add one day to each item:
example2 = df.index.map(lambda x: x + 1)
print(example2)
# DatetimeIndex(['2018-05-04', '2018-05-05', '2018-05-06'], dtype='datetime64[ns]', freq='D')
So the print behaviour is inconsistent with similar classes in pandas, but the return values are correct.
The question seems simple and arguably on the verge of stupid. But given my scenario, it seems that I would have to do exactly that in order to keep a bunch of calculations across several dataframes efficient.
Scenario:
I've got a bunch of pandas dataframes where the column names are constructed from a name part and a time part, such as 'AA_2018' and 'BB_2017'. I'm doing calculations on different columns from different dataframes, so I'll have to filter out the time part. As an MCVE, let's just say that I'd like to subtract the column containing 'AA' from the column containing 'BB' and ignore all other columns in this dataframe:
import pandas as pd
import numpy as np
dates = pd.date_range('20180101',periods=3)
df = pd.DataFrame(np.random.randn(3,3),index=dates,columns=['AA_2018', 'AB_2018', 'BB_2017'])
If I knew the exact names of the columns, this could easily be done using:
diff_series = df['AA_2018'] - df['BB_2017']
This returns a pandas Series since I'm using single brackets [], as opposed to a DataFrame if I had used double brackets [[]].
My challenge:
diff_series is of type pandas.core.series.Series. But since I've got some filtering to do, I'm using df.filter(), which returns a dataframe with one column and not a series:
# in:
colAA = df.filter(like = 'AA')
# out:
# AA_2018
# 2018-01-01 0.801295
# 2018-01-02 0.860808
# 2018-01-03 -0.728886
# in:
# type(colAA)
# out:
# pandas.core.frame.DataFrame
Since colAA is of type pandas.core.frame.DataFrame, the following returns a dataframe too:
# in:
colAA = df.filter(like = 'AA')
colBB = df.filter(like = 'BB')
df_filtered = colBB - colAA
# out:
AA_2018 BB_2017
2018-01-01 NaN NaN
2018-01-02 NaN NaN
2018-01-03 NaN NaN
And that is not what I'm after. This is:
# in:
diff_series = df['AA_2018'] - df['BB_2017']
# out:
2018-01-01 0.828895
2018-01-02 -1.153436
2018-01-03 -1.159985
Why am I adamant about doing it this way?
Because I'd like to end up with a dataframe using .to_frame() with a specified name based on the filters I've used.
My presumably inefficient approach is this:
# in:
colAA_values = [item for sublist in colAA.values for item in sublist]
# (because colAA.values returns a 2D array that has to be flattened)
colBB_values = [item for sublist in colBB.values for item in sublist]
serAA = pd.Series(colAA_values, colAA.index)
serBB = pd.Series(colBB_values, colBB.index)
df_diff = (serBB - serAA).to_frame(name = 'someFilter')
# out:
someFilter
2018-01-01 -0.828895
2018-01-02 1.153436
2018-01-03 1.159985
What I've tried / What I was hoping to work:
# in:
(df.filter(like = 'BB') - df.filter(like = 'AA')).to_frame(name = 'somefilter')
# out:
# AttributeError: 'DataFrame' object has no attribute 'to_frame'
# (Of course because df.filter() returns a one-column dataframe)
I was also hoping that df.filter() could be set to return a pandas series, but no.
I guess I could have asked this question instead: how do I convert a pandas dataframe column to a pandas series? But that does not seem to have an efficient built-in one-liner either; most search results handle the other way around. I've been messing around with potential workarounds for quite some time now, and an obvious solution might be right around the corner, but I'm hoping some of you have a suggestion on how to do this efficiently.
All code elements for an easy copy&paste:
import pandas as pd
import numpy as np
dates = pd.date_range('20180101',periods=3)
df = pd.DataFrame(np.random.randn(3,3),index=dates,columns=['AA_2018', 'AB_2018', 'BB_2017'])
#diff_series = df[['AA_2018']] - df[['BB_2017']]
#type(diff_series)
colAA = df.filter(like = 'AA')
colBB = df.filter(like = 'BB')
df_filtered = colBB - colAA
#type(df_filtered)
#type(colAA)
#colAA.values
#colAA.values returns a 2D array that has to be flattened for use in pd.Series
colAA_values = [item for sublist in colAA.values for item in sublist]
colBB_values = [item for sublist in colBB.values for item in sublist]
serAA = pd.Series(colAA_values, colAA.index)
serBB = pd.Series(colBB_values, colBB.index)
df_diff = (serBB - serAA).to_frame(name = 'someFilter')
# Attempts:
# (df.filter(like = 'BB') - df.filter(like = 'AA')).to_frame(name = 'somefilter')
You need the opposite of to_frame: DataFrame.squeeze, which converts a one-column DataFrame to a Series:
colAA = df.filter(like = 'AA')
colBB = df.filter(like = 'BB')
df_filtered = colBB.squeeze() - colAA.squeeze()
print (df_filtered)
2018-01-01 -0.479247
2018-01-02 -3.801711
2018-01-03 1.567574
Freq: D, dtype: float64
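If the end goal is the named one-column DataFrame from the question, a small sketch that chains to_frame onto the squeezed difference (the name 'someFilter' is just the one used in the question):
# Squeeze both filtered frames to Series, subtract, then name the result.
df_diff = (colBB.squeeze() - colAA.squeeze()).to_frame(name='someFilter')
print(df_diff)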
I have two datasets with different time formats, like this:
df1 = pd.DataFrame( {'A': [1499503900, 1512522054, 1412525061, 1502527681, 1512532303]})
df2 = pd.DataFrame( {'B' : ['2017-12-15T11:47:58.119Z', '2017-05-31T08:27:41.943Z', '2017-06-05T14:44:56.425Z', '2017-05-30T16:24:03.175Z' , '2017-07-03T10:20:46.333Z', '2017-06-16T10:13:31.535Z' , '2017-12-15T12:26:01.347Z', '2017-06-15T16:00:41.017Z', '2017-11-28T15:25:39.016Z', '2017-08-10T08:48:01.347Z'] })
I need to find the nearest date in the second dataset for each entry in the first dataset. It doesn't matter how far away it is; I just need the nearest time. For example:
1499503900 for '2017-07-03T10:20:46.333Z'
1512522054 for '2017-12-15T12:26:01.347Z'
1412525061 for '2017-05-31T08:27:41.943Z'
1502527681 for '2017-08-10T08:48:01.347Z'
1512532303 for '2017-06-05T14:44:56.425Z'
Here are a couple of helpers I already have.
This one converts an ISO date string to an epoch timestamp:
import calendar
import datetime

def time1(date_text):
    date = datetime.datetime.strptime(date_text, "%Y-%m-%dT%H:%M:%S.%fZ")
    return calendar.timegm(date.utctimetuple())

x = '2017-12-15T12:26:01.347Z'
print(time1(x))
out: 1513340761
And this one converts an epoch timestamp back to ISO format:
import datetime as DT

def time_covert(time):
    seconds_since_epoch = time
    return DT.datetime.utcfromtimestamp(seconds_since_epoch).isoformat()

y = 1499503900
print(time_covert(y))
out = 2017-07-08T08:51:40
Any idea will be extremely useful.
Thank you all in advance!
Here's a quick start:
from datetime import datetime

import numpy as np
import pandas as pd

def time_covert(time):
    seconds_since_epoch = time
    return datetime.utcfromtimestamp(seconds_since_epoch)

# real time series
df2['B'] = pd.to_datetime(df2['B'])
df2.index = df2['B']
del df2['B']

for a in df1['A']:
    print(time_covert(a))
    i = np.argmin(np.abs(df2.index.to_pydatetime() - time_covert(a)))
    print(df2.iloc[i])
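A more vectorized variant (my suggestion, not part of the quick start above), starting again from the raw df1 and df2 defined in the question:
# Index.get_indexer(..., method='nearest') needs a sorted index; utc=True keeps
# both sides timezone-consistent since the strings in df2['B'] end in 'Z'.
idx = pd.DatetimeIndex(pd.to_datetime(df2['B'], utc=True)).sort_values()
targets = pd.to_datetime(df1['A'], unit='s', utc=True)
pos = idx.get_indexer(targets, method='nearest')
print(idx[pos])  # the nearest df2 timestamp for each value in df1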
I would like to approach this as an algorithmic question rather than a pandas-specific one. My approach is to sort the "df2" series and, for each datetime in df1, perform a binary search on the sorted df2 to get the insertion index, then check the indexes just below and above that position to get the desired output.
Here is the code for the above procedure.
Use standard pandas datetimes for easy comparison:
df1 = pd.DataFrame( {'A': pd.to_datetime([1499503900, 1512522054, 1412525061, 1502527681, 1512532303], unit='s')})
df2 = pd.DataFrame( {'B' : pd.to_datetime(['2017-12-15T11:47:58.119Z', '2017-05-31T08:27:41.943Z', '2017-06-05T14:44:56.425Z', '2017-05-30T16:24:03.175Z' , '2017-07-03T10:20:46.333Z', '2017-06-16T10:13:31.535Z' , '2017-12-15T12:26:01.347Z', '2017-06-15T16:00:41.017Z', '2017-11-28T15:25:39.016Z', '2017-08-10T08:48:01.347Z']) })
Sort df2 by date and get the insertion positions using binary search:
df2 = df2.sort_values('B').reset_index(drop=True)
ind = df2['B'].searchsorted(df1['A'])
Now check for the minimum difference between the index just above and just below the insertion position:
for index, row in df1.iterrows():
    i = ind[index]
    if i not in df2.index:
        print(df2.iloc[i-1]['B'])
    elif i-1 not in df2.index:
        print(df2.iloc[i]['B'])
    else:
        if abs(df2.iloc[i]['B'] - row['A']) > abs(df2.iloc[i-1]['B'] - row['A']):
            print(df2.iloc[i-1]['B'])
        else:
            print(df2.iloc[i]['B'])
The test outputs are these, for each value in df1 respectively. (Note: please recheck the outputs given in the question; they do not correspond to the minimum difference.)
2017-07-03 10:20:46.333000
2017-11-28 15:25:39.016000
2017-05-30 16:24:03.175000
2017-08-10 08:48:01.347000
2017-11-28 15:25:39.016000
The above procedure has a time complexity of O(N log N) for sorting and O(log N) per lookup, where N = len(df2). If "df1" is large, this is a fairly fast approach.
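As a possible built-in alternative (my suggestion, not part of the answer above), pandas 0.20+ ships pd.merge_asof, which with direction='nearest' pairs each row with the closest key on the other side; both frames must be sorted on their keys:
# Assumes df1['A'] and df2['B'] are already datetime64, as in the snippet above.
nearest = pd.merge_asof(
    df1.sort_values('A'),
    df2.sort_values('B'),
    left_on='A',
    right_on='B',
    direction='nearest',
)
print(nearest)  # each A paired with its nearest B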
I'm trying to replace months represented as character abbreviations (e.g. 'NOV') with their numerical counterparts ('-11-'). I can get the following piece of code to work properly.
df_cohorts['ltouch_datetime'] = df_cohorts['ltouch_datetime'].str.replace('NOV','-11-')
df_cohorts['ltouch_datetime'] = df_cohorts['ltouch_datetime'].str.replace('DEC','-12-')
df_cohorts['ltouch_datetime'] = df_cohorts['ltouch_datetime'].str.replace('JAN','-01-')
However, to avoid redundancy, I'd like to use a dictionary and .replace to replace the character variable for all months.
r_month1 = {'JAN':'-01-','FEB':'-02-','MAR':'-03-','APR':'-04-','MAY':'-05-','JUN':'-06-','JUL':'-07-','AUG':'-08-','SEP':'-09-','OCT':'-10-','NOV':'-11-','DEC':'-12-'}
df_cohorts.replace({'conversion_datetime': r_month1,'ltouch_datetime': r_month1})
When I enter the code above, my output dataset is unchanged. For reference, please see my sample data below.
User_ID ltouch_datetime conversion_datetime
001 11NOV14:13:12:56 11NOV14:16:12:00
002 07NOV14:17:46:14 08NOV14:13:10:00
003 04DEC14:17:46:14 04DEC15:13:12:00
Thanks!
Let me suggest a different approach: You could parse the date strings into a column of pandas TimeStamps like this:
import pandas as pd
df = pd.read_table('data', sep='\s+')
for col in ('ltouch_datetime', 'conversion_datetime'):
    df[col] = pd.to_datetime(df[col], format='%d%b%y:%H:%M:%S')
print(df)
# User_ID ltouch_datetime conversion_datetime
# 0 1 2014-11-11 13:12:56 2014-11-11 16:12:00
# 1 2 2014-11-07 17:46:14 2014-11-08 13:10:00
# 2 3 2014-12-04 17:46:14 2015-12-04 13:12:00
I would stop right here, since representing dates as TimeStamps is the ideal
form for the data in Pandas.
However, if you need/want date strings with 3-letter months like 'NOV' converted to -11-, then you can convert the Timestamps with strftime and apply:
for col in ('ltouch_datetime', 'conversion_datetime'):
    df[col] = df[col].apply(lambda x: x.strftime('%d-%m-%y:%H:%M:%S'))
print(df)
yields
User_ID ltouch_datetime conversion_datetime
0 1 11-11-14:13:12:56 11-11-14:16:12:00
1 2 07-11-14:17:46:14 08-11-14:13:10:00
2 3 04-12-14:17:46:14 04-12-15:13:12:00
To answer your question literally, in order to use Series.str.replace you need a column with the month string abbreviations all by themselves. You can arrange for that by first calling Series.str.extract. Then you can join the columns back into one using apply:
import pandas as pd
import calendar
month_map = {calendar.month_abbr[m].upper(): '-{:02d}-'.format(m)
             for m in range(1, 13)}

df = pd.read_table('data', sep='\s+')
for col in ('ltouch_datetime', 'conversion_datetime'):
    tmp = df[col].str.extract(r'(.*?)(\D+)(.*)')
    tmp[1] = tmp[1].replace(month_map)
    df[col] = tmp.apply(''.join, axis=1)
print(df)
yields
User_ID ltouch_datetime conversion_datetime
0 1 11-11-14:13:12:56 11-11-14:16:12:00
1 2 07-11-14:17:46:14 08-11-14:13:10:00
2 3 04-12-14:17:46:14 04-12-15:13:12:00
Finally, although you haven't asked for this directly, it's good to be aware
that if your data is in a file, you can parse the datestring columns into
TimeStamps directly using
import pandas as pd
import datetime as DT
df = pd.read_table(
    'data', sep='\s+', parse_dates=[1, 2],
    date_parser=lambda x: DT.datetime.strptime(x, '%d%b%y:%H:%M:%S'))
This might be the most convenient method of all (assuming you want TimeStamps).
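One side note (my addition, not something the answer above relies on): the reason the df_cohorts.replace(...) call in the question left the data unchanged is that, without regex=True, replace only matches whole cell values. A sketch of the substring-replacement variant:
# With regex=True the dict keys are treated as patterns and substituted inside
# each string, e.g. '11NOV14:13:12:56' -> '11-11-14:13:12:56'.
for col in ('ltouch_datetime', 'conversion_datetime'):
    df_cohorts[col] = df_cohorts[col].replace(r_month1, regex=True)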
I'm a beginner with Python and I'm having trouble working with time series data.
Below is my 1-minute OHLC data.
2011-11-01,9:00:00,248.50,248.95,248.20,248.70
2011-11-01,9:01:00,248.70,249.00,248.65,248.85
2011-11-01,9:02:00,248.90,249.25,248.70,249.15
...
2011-11-01,15:03:00,250.25,250.30,250.05,250.15
2011-11-01,15:04:00,250.15,250.60,250.10,250.60
2011-11-01,15:15:00,250.55,250.55,250.55,250.55
2011-11-02,9:00:00,245.55,246.25,245.40,245.80
2011-11-02,9:01:00,245.85,246.40,245.75,246.35
2011-11-02,9:02:00,246.30,246.45,245.75,245.80
2011-11-02,9:03:00,245.75,245.85,245.30,245.35
...
I'd like to extract the last ("CLOSE") value from each row and reshape the data into the following format:
2011-11-01, 248.70, 248.85, 249.15, ... 250.15, 250.60, 250.55
2011-11-02, 245.80, 246.35, 245.80, ...
...
I'd also like to calculate the highest close value and its time (minute) for EACH DAY, like the following:
2011-11-01, 10:23:03, 250.55
2011-11-02, 11:02:36, 251.00
....
Any help would be very appreciated.
Thank you in advance,
You can use the pandas library. In the case of your data you can get the max as:
import pandas as pd
# Read in the data and parse the first two columns as a
# date-time and set it as index
df = pd.read_csv('your_file', parse_dates=[[0,1]], index_col=0, header=None)
# get only the fifth column (close)
df = df[[5]]
# Resample to date frequency and get the max value for each day.
df.resample('D', how='max')
If you also want to show the times, keep them in your DataFrame as a column and pass a function that determines the max close value and returns that row:
>>> df = pd.read_csv('your_file', parse_dates=[[0,1]], index_col=0, header=None,
...                  usecols=[0, 1, 5], names=['d', 't', 'close'])
>>> df['time'] = df.index
>>> df.resample('D', how=lambda group: group.iloc[group['close'].argmax()])
close time
d_t
2011-11-01 250.60 2011-11-01 15:04:00
2011-11-02 246.35 2011-11-02 09:01:00
And if you want a list of the prices per day, then just do a groupby per day and return the list of all the prices from every group, using apply on the grouped object:
>>> df.groupby(lambda dt: dt.date()).apply(lambda group: list(group['close']))
2011-11-01 [248.7, 248.85, 249.15, 250.15, 250.6, 250.55]
2011-11-02 [245.8, 246.35, 245.8, 245.35]
For more information take a look at the docs: Time Series
Update for the concrete data set:
The problem with your data set is that you have some days without any data, so the function passed in as the resampler should handle those cases:
def func(group):
    if len(group) == 0:
        return None
    return group.iloc[group['close'].argmax()]

df.resample('D', how=func).dropna()
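A note for newer pandas versions (my addition): the how= argument has since been removed from resample, so a rough modern equivalent of the daily-maximum logic, assuming the same df with a DatetimeIndex and a 'close' column, would look like this:
# Daily max close; empty days come out as NaN and can be dropped.
daily_max = df['close'].resample('D').max().dropna()

# Time and value of the daily maximum: group by calendar day and pick the row
# whose 'close' is largest in each group (empty days simply form no group).
rows = df.loc[df.groupby(df.index.floor('D'))['close'].idxmax()]
print(daily_max)
print(rows)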