I have a column of timestamps that needs to be converted into a monthly period, e.g.
1985-12-31 00:00:00 to 1985-12
Pandas has a .to_period() function, but:
pd.DatetimeIndex.to_period only works on a timestamp index, not a column. So you can only have a period index, but not a period column?
and that function only works if the timestamps are the only index, i.e. not if the timestamps are part of a MultiIndex.
Anyway, how do I use this on an arbitrary pandas column, not just a timestamp or period index?
I came across this thread today, and after further digging found that pandas 0.15 affords an easier option via the .dt accessor: you can avoid the step of creating an index and create the column directly. You can use the following to get the same result:
df[1] = df[0].dt.to_period('M')
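A minimal self-contained sketch (the integer column name 0 just mirrors the example below; adjust to your own column name):
import pandas as pd

df = pd.DataFrame({0: pd.to_datetime(['1985-12-31', '1986-01-15'])})
df[1] = df[0].dt.to_period('M')  # each timestamp becomes a monthly Period
print(df)
#            0        1
# 0 1985-12-31  1985-12
# 1 1986-01-15  1986-01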
You're right, you need to do this on DatetimeIndex objects rather than just columns of datetimes. However, this is pretty easy - just wrap the column in a DatetimeIndex constructor:
In [11]: df = pd.DataFrame(pd.date_range('2014-01-01', freq='2w', periods=12))
In [12]: df
Out[12]:
0
0 2014-01-05
1 2014-01-19
2 2014-02-02
3 2014-02-16
4 2014-03-02
5 2014-03-16
6 2014-03-30
7 2014-04-13
8 2014-04-27
9 2014-05-11
10 2014-05-25
11 2014-06-08
In [13]: pd.DatetimeIndex(df[0]).to_period('M')
Out[13]:
<class 'pandas.tseries.period.PeriodIndex'>
freq: M
[2014-01, ..., 2014-06]
length: 12
This is a PeriodIndex, but you can make it a column:
In [14]: df[1] = pd.DatetimeIndex(df[0]).to_period('M')
In [15]: df
Out[15]:
0 1
0 2014-01-05 2014-01
1 2014-01-19 2014-01
2 2014-02-02 2014-02
3 2014-02-16 2014-02
4 2014-03-02 2014-03
5 2014-03-16 2014-03
6 2014-03-30 2014-03
7 2014-04-13 2014-04
8 2014-04-27 2014-04
9 2014-05-11 2014-05
10 2014-05-25 2014-05
11 2014-06-08 2014-06
You can do a similar trick if the timestamps are part of a MultiIndex: extract that "column" with df.index.get_level_values and pass it to DatetimeIndex as above. For example:
df[2] = 2
df.set_index([0, 1], inplace=True)
df.index.get_level_values(0) # returns a DatetimeIndex
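Since get_level_values(0) already returns a DatetimeIndex here, to_period can be applied to it directly (a sketch continuing the example above):
df[3] = df.index.get_level_values(0).to_period('M')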
Related
I have a pandas DataFrame (df) with many columns. For the sake of simplicity, I am posting three columns with dummy data here.
Timestamp Source Length
0 1 5
1 1 5
2 1 5
3 2 5
4 2 5
5 3 5
6 1 5
7 3 5
8 2 5
9 1 5
Using pandas functions, first I set the timestamp as the index of the df:
index = pd.DatetimeIndex(df['Timestamp'] * 10**9)  # convert integer seconds to nanosecond timestamps
df = df.set_index(index)                           # set Timestamp as index
Next, I can use groupby and pd.TimeGrouper to group the data into 5-second bins and compute the cumulative length for each bin as follows:
df_length = df['Length'].groupby(pd.TimeGrouper('5S')).sum()
So the df_length dataframe should look like:
Timestamp Length
0 25
5 25
Now the problem is: I want to get the same bins of 5 seconds, but I want to compute the cumulative length for each source (1, 2 and 3) in separate columns, in the following format:
Timestamp 1 2 3
0 15 10 0
5 10 5 10
I think I can use df.groupby with some conditions to get it, but I'm confused and tired now :(
I'd appreciate a solution using pandas functions only.
You can add the Source column to the groupby, which yields a Series with a MultiIndex, and then reshape by unstacking the last level of that MultiIndex into columns (pd.Grouper(freq='5S') is the modern spelling of the older pd.TimeGrouper('5S')):
print (df['Length'].groupby([pd.Grouper(freq='5S'), df['Source']]).sum())
Timestamp Source
1970-01-01 00:00:00 1 15
2 10
1970-01-01 00:00:05 1 10
2 5
3 10
Name: Length, dtype: int64
df1 = (df['Length'].groupby([pd.Grouper(freq='5S'), df['Source']])
                   .sum()
                   .unstack(fill_value=0))
print (df1)
Source 1 2 3
Timestamp
1970-01-01 00:00:00 15 10 0
1970-01-01 00:00:05 10 5 10
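For reference, here is a self-contained sketch that rebuilds the question's sample (treating the integer timestamps as seconds since the epoch, matching the *10**9 conversion above) and uses pivot_table, which does the grouping, summing, and reshaping in one step:
import pandas as pd

df = pd.DataFrame({'Timestamp': range(10),
                   'Source': [1, 1, 1, 2, 2, 3, 1, 3, 2, 1],
                   'Length': [5] * 10})
df = df.set_index(pd.DatetimeIndex(df['Timestamp'] * 10**9))

print (df.pivot_table(index=pd.Grouper(freq='5S'), columns='Source',
                      values='Length', aggfunc='sum', fill_value=0))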
I have the following time series dataset of the number of sales per day as a pandas DataFrame.
date, sales
20161224,5
20161225,2
20161227,4
20161231,8
Now, if I have to include the missing data points here (i.e. the missing dates) with a constant value (zero) and make it look the following way, how can I do this efficiently (assuming the DataFrame is ~50 MB) using pandas?
date, sales
20161224,5
20161225,2
20161226,0**
20161227,4
20161228,0**
20161229,0**
20161230,0**
20161231,8
** Missing rows which have been added to the DataFrame.
Any help will be appreciated.
You can first cast the date column with to_datetime, then set_index, reindex by a date_range spanning the min and max of the index, reset_index and, if necessary, change the format with strftime:
df.date = pd.to_datetime(df.date, format='%Y%m%d')
df = df.set_index('date')
df = (df.reindex(pd.date_range(df.index.min(), df.index.max()), fill_value=0)
        .reset_index()
        .rename(columns={'index': 'date'}))
print (df)
date sales
0 2016-12-24 5
1 2016-12-25 2
2 2016-12-26 0
3 2016-12-27 4
4 2016-12-28 0
5 2016-12-29 0
6 2016-12-30 0
7 2016-12-31 8
Last, if you need to change the format:
df.date = df.date.dt.strftime('%Y%m%d')
print (df)
date sales
0 20161224 5
1 20161225 2
2 20161226 0
3 20161227 4
4 20161228 0
5 20161229 0
6 20161230 0
7 20161231 8
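Starting again from the raw frame, an alternative sketch (assuming pandas 0.20+, where asfreq accepts fill_value) collapses the reindex step into asfreq:
df.date = pd.to_datetime(df.date, format='%Y%m%d')
df = df.set_index('date').asfreq('D', fill_value=0).reset_index()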
I had no success looking for answers to this question in the forum, since it is hard to put it in keywords. Any keyword suggestions are appreciated, so that I can make this question more accessible and others can benefit from it.
The closest question I found doesn't really answer mine.
My problem is the following:
I have one DataFrame that I called ref, and a list of dates called pub. ref has dates as its index, but those dates differ from the dates in pub (there will be a few matching values). I want to create a new DataFrame that contains all the dates from pub, but filled with the "last available data" from ref.
Thus, say ref is:
Dat col1 col2
2015-01-01 5 4
2015-01-02 6 7
2015-01-05 8 9
And pub
2015-01-01
2015-01-04
2015-01-06
I'd like to create a DataFrame like:
Dat col1 col2
2015-01-01 5 4
2015-01-04 6 7
2015-01-06 8 9
Performance is an issue here, so I'm looking for the fastest (or at least a fast) way of doing this.
Thanks in advance.
You can do an outer merge, set the new index to Dat, sort it, forward fill, and then reindex based on the dates in pub.
import pandas as pd

dates = ['2015-01-01', '2015-01-04', '2015-01-06']
pub = pd.DataFrame({'Dat': pd.to_datetime(dates)})  # datetime values, so they align with ref's dates
>>> (ref
.merge(pub, on='Dat', how='outer')
.set_index('Dat')
.sort_index()
.ffill()
.reindex(pub.Dat))
col1 col2
Dat
2015-01-01 5 4
2015-01-04 6 7
2015-01-06 8 9
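A more direct route (a sketch, assuming ref has Dat as a sorted column as above) skips the merge: set the index and reindex onto pub's dates with forward fill, which picks the last available row at or before each date:
ref.set_index('Dat').reindex(pub.Dat, method='ffill')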
Use np.searchsorted to find the index just after each date (the 'right' option is needed to handle equality properly):
In [27]: pub = ['2015-01-01', '2015-01-04', '2015-01-06']
In [28]: df
Out[28]:
col1 col2
Dat
2015-01-01 5 4
2015-01-02 6 7
2015-01-05 8 9
In [29]: y = np.searchsorted(list(df.index), pub, 'right')
# array([1, 2, 3], dtype=int64)
Then just rebuild:
In [30]: pd.DataFrame(df.iloc[y - 1].values, index=pub)
Out[30]:
0 1
2015-01-01 5 4
2015-01-04 6 7
2015-01-06 8 9
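The intermediate list is not strictly necessary: a DatetimeIndex has its own searchsorted, so the same idea can be sketched as
y = df.index.searchsorted(pd.to_datetime(pub), side='right')
pd.DataFrame(df.values[y - 1], index=pub, columns=df.columns)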
I am using pandas.DataFrame.resample to resample a grouped pandas DataFrame with a timestamp index.
For one of the columns, I would like to resample such that I select the most frequent value. At the moment, I am only having success using NumPy functions like np.max or np.sum.
#generate test dataframe
data = np.random.randint(0,10,(366,2))
index = pd.date_range(start=pd.Timestamp('1-Dec-2012'), periods=366, freq='D')
test = pd.DataFrame(data, index=index)
#generate group array
group = np.random.randint(0,2,(366,))
#define how dictionary for resample
how_dict = {0: np.max, 1: np.min}
#perform grouping and resample
test.groupby(group).resample('48h', how=how_dict)
The previous code works because I have used NumPy functions. However, I am not sure how to resample by the most frequent value. I tried defining a custom function like:
def frequent(x):
    value, counts = np.unique(x, return_counts=True)
    return value[counts.argmax()]
However, if I now do:
how_dict = {0: np.max, 1: frequent}
I get an empty dataframe...
df = test.groupby(group).resample('48h', how=how_dict)
df.shape
Your resample period is too short: when a group is empty for a period, your user function raises a ValueError that is not kindly caught by pandas.
But it works without empty groups, for example with regular groups:
In [8]: test.groupby(np.arange(366) % 2).resample('48h', how=how_dict).head()
Out[8]:
0 1
0 2012-12-01 4 8
2012-12-03 0 3
2012-12-05 9 5
2012-12-07 3 4
2012-12-09 7 3
Or with bigger periods:
In [9]: test.groupby(group).resample('122D', how=how_dict)
Out[9]:
0 1
0 2012-12-02 9 0
2013-04-03 9 0
2013-08-03 9 6
1 2012-12-01 9 3
2013-04-02 9 7
2013-08-02 9 1
EDIT
A workaround is to handle the empty case explicitly:
def frequent(x):
    if len(x) == 0:
        return -1
    value, counts = np.unique(x, return_counts=True)
    return value[counts.argmax()]
Then:
In [11]: test.groupby(group).resample('48h', how=how_dict).head()
Out[11]:
0 1
0 2012-12-01 5 3
2012-12-03 3 4
2012-12-05 NaN -1
2012-12-07 5 0
2012-12-09 1 4
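On pandas 0.18+, where the how= argument has been removed, the same idea can be sketched with .agg (the empty-bin caveat above still applies to the custom function):
test.groupby(group).resample('48h').agg(how_dict)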
I have a pandas dataframe that contains two date columns, a start date and an end date that defines a range. I'd like to be able to collect a total count for all dates across all rows in the dataframe, as defined by these columns.
For example, the table looks like:
index start_date end_date
0 '2015-01-01' '2015-01-17'
1 '2015-01-03' '2015-01-12'
And the result would be a per date aggregate, like:
date count
'2015-01-01' 1
'2015-01-02' 1
'2015-01-03' 2
and so on.
My current approach works but is extremely slow on a big dataframe as I'm looping across the rows, calculating the range and then looping through this. I'm hoping to find a better approach.
Currently I'm doing:
date = pd.date_range(df.start_date.min(), df.end_date.max())
df2 = pd.DataFrame(index=date)
df2['count'] = 0
for index, row in df.iterrows():
    dates = pd.date_range(row['start_date'], row['end_date'])
    for date in dates:
        df2.loc[date, 'count'] += 1
After stacking the relevant columns as suggested by @Sam, just use value_counts:
df[['start_date', 'end_date']].stack().value_counts()
EDIT:
Given that you also want to count the dates between the start and end dates:
start_dates = pd.to_datetime(df.start_date)
end_dates = pd.to_datetime(df.end_date)
>>> pd.Series(dt.date() for group in
        [pd.date_range(start, end) for start, end in zip(start_dates, end_dates)]
        for dt in group).value_counts()
Out[178]:
2015-01-07 2
2015-01-06 2
2015-01-12 2
2015-01-05 2
2015-01-04 2
2015-01-10 2
2015-01-03 2
2015-01-09 2
2015-01-08 2
2015-01-11 2
2015-01-16 1
2015-01-17 1
2015-01-14 1
2015-01-15 1
2015-01-02 1
2015-01-01 1
2015-01-13 1
dtype: int64
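For large frames, a vectorized alternative (a sketch; it assumes whole-day, inclusive start/end ranges as in the question) avoids building a date_range per row: put +1 on each start date and -1 on the day after each end date, then take a cumulative sum over a daily index:
starts = pd.to_datetime(df.start_date)
ends = pd.to_datetime(df.end_date)

# one slot per calendar day, plus a sentinel day after the last end
daily = pd.Series(0, index=pd.date_range(starts.min(), ends.max() + pd.Timedelta(days=1)))
daily = daily.add(starts.value_counts(), fill_value=0)                         # +1 where a range opens
daily = daily.sub((ends + pd.Timedelta(days=1)).value_counts(), fill_value=0)  # -1 the day after it closes
counts = daily.cumsum().iloc[:-1].astype(int)  # running total = ranges covering each day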
I think the solution here is to 'stack' your two date columns, group by the date, and do a count. Play around with the df.stack() function. Here is something I threw together that yields a good solution:
import datetime
df = pd.DataFrame({'Start': [datetime.date(2016, 5, i) for i in range(1, 30)],
                   'End': [datetime.date(2016, 5, i) for i in range(1, 30)]})
df.stack().reset_index()[[0, 'level_1']].groupby(0).count()
I would use the melt() method for that:
In [76]: df
Out[76]:
start_date end_date
index
0 2015-01-01 2015-01-17
1 2015-01-03 2015-01-12
2 2015-01-03 2015-01-17
In [77]: pd.melt(df, value_vars=['start_date','end_date']).groupby('value').size()
Out[77]:
value
2015-01-01 1
2015-01-03 2
2015-01-12 1
2015-01-17 2
dtype: int64