Fastest way to create DataFrame from last available data - python

I had no success searching the forum for answers to this question since it is hard to put into keywords. Keyword suggestions are appreciated so that I can make this question more accessible and others can benefit from it.
The closest question I found doesn't really answer mine.
My problem is the following:
I have one DataFrame that I called ref, and a list of dates called pub. ref is indexed by dates, but those dates mostly differ from the dates in pub (there will be a few matching values). I want to create a new DataFrame that contains all the dates from pub, filled with the "last available data" from ref.
Thus, say ref is:
Dat col1 col2
2015-01-01 5 4
2015-01-02 6 7
2015-01-05 8 9
And pub
2015-01-01
2015-01-04
2015-01-06
I'd like to create a DataFrame like:
Dat col1 col2
2015-01-01 5 4
2015-01-04 6 7
2015-01-06 8 9
Performance is an issue here, so I'm looking for the fastest (or at least a fast) way of doing this.
Thanks in advance.
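(For reference, here is one assumed way to build the example ref above; the question does not show this code, and the answers below refer to ref/df without defining it:)
import pandas as pd

ref = pd.DataFrame({'col1': [5, 6, 8], 'col2': [4, 7, 9]},
                   index=pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-05']))
ref.index.name = 'Dat'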

You can do an outer merge, set the new index to Dat, sort it, forward fill, and then reindex based on the dates in pub.
import datetime as dt

dates = ['2015-01-01', '2015-01-04', '2015-01-06']
pub = pd.DataFrame([dt.datetime.strptime(ts, '%Y-%m-%d').date() for ts in dates],
                   columns=['Dat'])
>>> (ref
.merge(pub, on='Dat', how='outer')
.set_index('Dat')
.sort_index()
.ffill()
.reindex(pub.Dat))
col1 col2
Dat
2015-01-01 5 4
2015-01-04 6 7
2015-01-06 8 9
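Since performance matters, pd.merge_asof (available since pandas 0.19) expresses this "last available row" lookup directly and avoids building the union of dates plus forward fill. This is only a sketch, assuming Dat is a sorted datetime64 key in both frames (reset_index is used because ref is indexed by Dat):
pub2 = pub.copy()
pub2['Dat'] = pd.to_datetime(pub2['Dat'])
result = pd.merge_asof(pub2.sort_values('Dat'),
                       ref.reset_index().sort_values('Dat'),
                       on='Dat', direction='backward')
print(result.set_index('Dat'))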

Use np.searchsorted (assuming import numpy as np) to find the index just after each date (the 'right' option is needed to handle equality properly):
In [27]: pub = ['2015-01-01', '2015-01-04', '2015-01-06']
In [28]: df
Out[28]:
col1 col2
Dat
2015-01-01 5 4
2015-01-02 6 7
2015-01-05 8 9
In [29]: y = np.searchsorted(list(df.index), pub, 'right')
# array([1, 2, 3], dtype=int64)
Then just rebuild:
In [30]: pd.DataFrame(df.iloc[y - 1].values, index=pub)
Out[30]:
0 1
2015-01-01 5 4
2015-01-04 6 7
2015-01-06 8 9
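If df's index is sorted and unique (as it is here), DataFrame.reindex with method='ffill' is another compact way to express the same "last available row" lookup; a small sketch, assuming pub is converted to the same dtype as df's index (e.g. datetime64):
pub_idx = pd.to_datetime(pub)
result = df.reindex(pub_idx, method='ffill')
print(result)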

Related

How to truncate a column in a Pandas time series data frame so as to remove leading and trailing zeros?

I have the following time series df in Pandas:
date value
2015-01-01 0
2015-01-02 0
2015-01-03 0
2015-01-04 3
2015-01-05 0
2015-01-06 4
2015-01-07 0
I would like to remove the leading and trailing zeroes, so as to have the following df:
date value
2015-01-04 3
2015-01-05 0
2015-01-06 4
Simply dropping rows with 0s in them would lead to deleting the 0s in the middle as well, which I don't want.
I thought of writing a forward loop that starts from the first row and continues until the first non-zero value, and a second backward loop that goes from the end and stops at the last non-zero value. But that seems like overkill; is there a more efficient way of doing this?
A general solution (which returns an empty DataFrame if all values are 0): build a mask of non-zero values, take the cumulative sum of the mask and of the reversed mask ([::-1]), test both for non-zero, chain them with bitwise AND, and filter by boolean indexing:
s = df['value'].ne(0)
df = df[s.cumsum().ne(0) & s[::-1].cumsum().ne(0)]
print (df)
date value
3 2015-01-04 3
4 2015-01-05 0
5 2015-01-06 4
If there is always at least one non-zero value, you can convert 0 to missing values and use DataFrame.loc with Series.first_valid_index and
Series.last_valid_index:
s = df['value'].mask(df['value'] == 0)
df = df.loc[s.first_valid_index():s.last_valid_index()]
print (df)
date value
3 2015-01-04 3
4 2015-01-05 0
5 2015-01-06 4
Another idea is to use Series.idxmax or Series.idxmin:
s = df['value'].eq(0)
df = df.loc[s.idxmin():s[::-1].idxmin()]
print (df)
date value
3 2015-01-04 3
4 2015-01-05 0
5 2015-01-06 4
s = df['value'].ne(0)
df = df.loc[s.idxmax():s[::-1].idxmax()]
You can get a list of the indexes where value is greater than 0, and then find the min.
data = [
    ['2015-01-01', 0],
    ['2015-01-02', 0],
    ['2015-01-03', 0],
    ['2015-01-04', 3],
    ['2015-01-05', 0],
    ['2015-01-06', 4]
]
df = pd.DataFrame(data, columns=['date', 'value'])
print(min(df.index[df['value'] > 0].tolist()))
# 3
Then filter the main df like this:
df.iloc[3:]
Or even better:
df.iloc[min(df.index[df['value'] > 0].tolist()):]
And you get:
date value
3 2015-01-04 3
4 2015-01-05 0
5 2015-01-06 4
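To also drop trailing zeros with the same idea, you could slice between the first and last positive positions; a small sketch, assuming the default RangeIndex so that label-based .loc slicing is inclusive of both endpoints:
nz = df.index[df['value'] > 0]
print(df.loc[nz.min():nz.max()])
#          date  value
# 3  2015-01-04      3
# 4  2015-01-05      0
# 5  2015-01-06      4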

pandas group by year, id and do some statistics per id? with multiindex

I have a dataframe that is the child of a merge operation on two dataframes. I end up with a multi-index that looks like (timestamp,id) and for the sake of argument, a single column X.
I would like to do several statistics on X by year, and by ID. Instead of posting all the crazy errors that I am getting trying to blindly solve this problem, I ask instead "how would you do this?"
There is one row of X per id, per period (daily). I want to aggregate to an annual period.
I think you can use groupby with resample and then aggregate (e.g. sum), but you need pandas 0.18.1+:
start = pd.to_datetime('2016-12-28')
rng = pd.date_range(start, periods=10)
df = pd.DataFrame({'timestamp': rng, 'X': range(10),
                   'id': ['a'] * 3 + ['b'] * 3 + ['c'] * 4})
df = df.set_index(['timestamp','id'])
print (df)
X
timestamp id
2016-12-28 a 0
2016-12-29 a 1
2016-12-30 a 2
2016-12-31 b 3
2017-01-01 b 4
2017-01-02 b 5
2017-01-03 c 6
2017-01-04 c 7
2017-01-05 c 8
2017-01-06 c 9
df = df.reset_index(level='id')
print (df.groupby('id').resample('A')['X'].sum())
id timestamp
a 2016-12-31 3
b 2016-12-31 3
2017-12-31 9
c 2017-12-31 30
Name: X, dtype: int32
Another solution is to use get_level_values with groupby:
print (df.X.groupby([df.index.get_level_values('timestamp').year,
                     df.index.get_level_values('id')])
           .sum())
id
2016 a 3
b 3
2017 b 9
c 30
Name: X, dtype: int32
If you want to ensure that the groupings happen together, then you must place all of the groupers in the groupby.
Assuming your timestamp is the outer (first) index level, the following should work.
df.groupby([pd.TimeGrouper('A', level=0), pd.Grouper(level='id')])['X'].sum()
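Note that pd.TimeGrouper has since been deprecated and removed from pandas; on current versions the same grouping can, to my knowledge, be written with a frequency-aware pd.Grouper on the timestamp level:
df.groupby([pd.Grouper(level='timestamp', freq='A'), pd.Grouper(level='id')])['X'].sum()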

Counting dates in a range set by pandas dataframe

I have a pandas dataframe that contains two date columns, a start date and an end date that defines a range. I'd like to be able to collect a total count for all dates across all rows in the dataframe, as defined by these columns.
For example, the table looks like:
index start_date end date
0 '2015-01-01' '2015-01-17'
1 '2015-01-03' '2015-01-12'
And the result would be a per date aggregate, like:
date count
'2015-01-01' 1
'2015-01-02' 1
'2015-01-03' 2
and so on.
My current approach works but is extremely slow on a big dataframe as I'm looping across the rows, calculating the range and then looping through this. I'm hoping to find a better approach.
Currently I'm doing:
date = pd.date_range(min(df.start_date), max(df.end_date))
df2 = pd.DataFrame(index=date)
df2['count'] = 0
for index, row in df.iterrows():
    dates = pd.date_range(row['start_date'], row['end_date'])
    for date in dates:
        df2.loc[date, 'count'] += 1
After stacking the relevant columns as suggested by @Sam, just use value_counts.
df[['start_date', 'end date']].stack().value_counts()
EDIT:
Given that you also want to count the dates between the start and end dates:
start_dates = pd.to_datetime(df.start_date)
end_dates = pd.to_datetime(df.end_date)
>>> pd.Series(dt.date() for group in
              [pd.date_range(start, end) for start, end in zip(start_dates, end_dates)]
              for dt in group).value_counts()
Out[178]:
2015-01-07 2
2015-01-06 2
2015-01-12 2
2015-01-05 2
2015-01-04 2
2015-01-10 2
2015-01-03 2
2015-01-09 2
2015-01-08 2
2015-01-11 2
2015-01-16 1
2015-01-17 1
2015-01-14 1
2015-01-15 1
2015-01-02 1
2015-01-01 1
2015-01-13 1
dtype: int64
I think the solution here is to 'stack' your two date columns, group by the date, and do a count. Play around with the df.stack() function. Here is something I threw together that yields a good solution:
import datetime
df = pd.DataFrame({'Start': [datetime.date(2016, 5, i) for i in range(1, 30)],
                   'End': [datetime.date(2016, 5, i) for i in range(1, 30)]})
df.stack().reset_index()[[0, 'level_1']].groupby(0).count()
I would use the melt() method for that:
In [76]: df
Out[76]:
start_date end_date
index
0 2015-01-01 2015-01-17
1 2015-01-03 2015-01-12
2 2015-01-03 2015-01-17
In [77]: pd.melt(df, value_vars=['start_date','end_date']).groupby('value').size()
Out[77]:
value
2015-01-01 1
2015-01-03 2
2015-01-12 1
2015-01-17 2
dtype: int64
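If what you need is a count for every calendar day covered by the [start, end] ranges (not just the endpoints) and the frame is large, an event-based cumulative sum avoids Python-level loops entirely. This is only a sketch, assuming the columns are named start_date and end date as in the question and parse as dates: add +1 at each start date, -1 on the day after each end date, then cumulative-sum over a daily index.
starts = pd.to_datetime(df['start_date']).value_counts()
ends = (pd.to_datetime(df['end date']) + pd.Timedelta(days=1)).value_counts()
events = starts.sub(ends, fill_value=0).sort_index()
daily = pd.date_range(events.index.min(), events.index.max(), freq='D')
counts = events.reindex(daily, fill_value=0).cumsum()
# counts now holds, for each day, how many ranges include it
# (the final day, one past the last end date, shows 0 and can be dropped)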

Pandas Dataframe Groupby Determine Values in 1 group vs. another group

I have a dataframe as follows:
Date ID
2014-12-31 1
2014-12-31 2
2014-12-31 3
2014-12-31 4
2014-12-31 5
2014-12-31 6
2014-12-31 7
2015-01-01 1
2015-01-01 2
2015-01-01 3
2015-01-01 4
2015-01-01 5
2015-01-02 1
2015-01-02 3
2015-01-02 7
2015-01-02 9
What I would like to do is determine the ID(s) on one date that are exclusive to that date versus the values of another date.
Example 1: The result df would be the exclusive ID(s) in 2014-12-31 vs. the ID(s) in 2015-01-01 and the exclusive ID(s) in 2015-01-01 vs. the ID(s) in 2015-01-02:
2015-01-01 6
2015-01-01 7
2015-01-02 2
2015-01-02 4
2015-01-02 6
I would like to 'choose' how many days 'back' I compare. For instance I can enter a variable daysback=1 and each day would compare to the previous. Or I can enter variable daysback=2 and each day would compare to two days ago. etc.
Outside of df.groupby('Date'), I'm not sure where to go with this. Possibly use of diff()?
I'm assuming that the "Date" in your DataFrame is: 1) a date object and 2) not the index.
If those assumptions are wrong, then that changes things a bit.
import datetime
from datetime import timedelta

def find_unique_ids(df, date, daysback=1):
    date_new = date
    date_old = date - timedelta(days=daysback)
    ids_new = df[df['Date'] == date_new]['ID']
    ids_old = df[df['Date'] == date_old]['ID']
    return df.iloc[ids_new[~ids_new.isin(ids_old)]]

date = datetime.date(2015, 1, 2)
daysback = 1
print(find_unique_ids(df, date, daysback))
Running that produces the following output:
Date ID
7 2015-01-01 1
9 2015-01-01 3
If the Date is your Index field, then you need to modify two lines in the function:
ids_new = df.loc[date_new]['ID']
ids_old = df.loc[date_old]['ID']
Output:
ID
Date
2015-01-01 1
2015-01-01 3
EDIT:
This is kind of dirty, but it should accomplish what you want to do. I added comments inline that explain what is going on. There are probably cleaner and more efficient ways to go about this if this is something that you're going to be running regularly or across massive amounts of data.
def find_unique_ids(df, daysback):
    # Date and ID must both be regular columns (or both index levels) -- no mixing.
    df = df.reset_index()
    # Calculate DateComp by adding our daysback value as a timedelta
    df['DateComp'] = df['Date'].apply(lambda dc: dc + timedelta(days=daysback))
    # Join df back onto itself, SQL-style LEFT OUTER.
    df2 = pd.merge(df, df, left_on=['DateComp', 'ID'], right_on=['Date', 'ID'], how='left')
    # Create series of missing_id values from the right table
    missing_ids = (df2['Date_y'].isnull())
    # Create series of valid DateComp values.
    # DateComp is the "future" date that we're comparing against. Without this
    # step, all records on the last Date value will be flagged as unique IDs.
    valid_dates = df2['DateComp_x'].isin(df['Date'].unique())
    # Use those to find missing IDs and valid dates. Create a new output DataFrame.
    output = df2[(valid_dates) & (missing_ids)][['DateComp_x', 'ID']]
    # Rename columns of output and return
    output.columns = ['Date', 'ID']
    return output
Test output:
Date ID
5 2015-01-01 6
6 2015-01-01 7
8 2015-01-02 2
10 2015-01-02 4
11 2015-01-02 5
EDIT:
missing_ids = df2[df2['Date_y'].isnull()]  # gives the whole necessary DataFrame
Another way, by applying list as the aggregation:
df
Out[146]:
Date Unnamed: 2
0 2014-12-31 1
1 2014-12-31 2
2 2014-12-31 3
3 2014-12-31 4
4 2014-12-31 5
5 2014-12-31 6
6 2014-12-31 7
7 2015-01-01 1
8 2015-01-01 2
9 2015-01-01 3
10 2015-01-01 4
11 2015-01-01 5
12 2015-01-02 1
13 2015-01-02 3
14 2015-01-02 7
15 2015-01-02 9
abbs = df.groupby(['Date'])['Unnamed: 2'].apply(list)
abbs
Out[142]:
Date
2014-12-31 [1, 2, 3, 4, 5, 6, 7]
2015-01-01 [1, 2, 3, 4, 5]
2015-01-02 [1, 3, 7, 9]
Name: Unnamed: 2, dtype: object
abbs.loc['2015-01-01']
Out[143]: [1, 2, 3, 4, 5]
list(set(abbs.loc['2014-12-31']) - set(abbs.loc['2015-01-01']))
Out[145]: [6, 7]
As a function:
def uid(df, date1, date2):
    abbs = df.groupby(['Date'])['Unnamed: 2'].apply(list)
    return list(set(abbs.loc[date1]) - set(abbs.loc[date2]))
uid(df,'2015-01-01','2015-01-02')
Out[162]: [2, 4, 5]
You could write a function and use dates instead of strings :)
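For a parameterized daysback comparison across the whole frame, one possible sketch (the column names 'Date' and 'ID' follow the question; adapt them to your data) builds a set of IDs per date and reports, for each date, the IDs that were present daysback days earlier but are missing now:
def exclusive_ids(df, daysback=1):
    sets = df.groupby('Date')['ID'].apply(set)
    rows = []
    for date, ids in sets.items():
        prev = date - pd.Timedelta(days=daysback)
        if prev in sets.index:
            rows.extend((date, i) for i in sorted(sets[prev] - ids))
    return pd.DataFrame(rows, columns=['Date', 'ID'])
With the sample data and daysback=1 this should return (2015-01-01, 6), (2015-01-01, 7), (2015-01-02, 2), (2015-01-02, 4), (2015-01-02, 5), matching the test output above.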

Convert a column of timestamps into periods in pandas

I have a column of timestamps that need to be converted into period ('Month'). e.g.
1985-12-31 00:00:00 to 1985-12
Pandas has a .to_period() function, but:
pd.DatetimeIndex.to_period only works on a timestamp index, not a column. So you can only have a period index, but not a period column?
and that function only works if timestamps are the only index, i.e. not if timestamps are part of a MultiIndex.
Anyway, how do I use this on an arbitrary pandas column, not just a timestamp index or period index?
I came across this thread today, and after further digging found that pandas 0.15 affords an easier option: using the .dt accessor, you can avoid the step of creating an index and create the column directly. You can use the following to get the same result:
df[1] = df[0].dt.to_period('M')
You're right, you need to do this on DatetimeIndex objects rather than just columns of datetimes. However, this is pretty easy - just wrap it in a DatetimeIndex constructor:
In [11]: df = pd.DataFrame(pd.date_range('2014-01-01', freq='2w', periods=12))
In [12]: df
Out[12]:
0
0 2014-01-05
1 2014-01-19
2 2014-02-02
3 2014-02-16
4 2014-03-02
5 2014-03-16
6 2014-03-30
7 2014-04-13
8 2014-04-27
9 2014-05-11
10 2014-05-25
11 2014-06-08
In [13]: pd.DatetimeIndex(df[0]).to_period('M')
Out[13]:
<class 'pandas.tseries.period.PeriodIndex'>
freq: M
[2014-01, ..., 2014-06]
length: 12
This is a PeriodIndex, but you can make it a column:
In [14]: df[1] = pd.DatetimeIndex(df[0]).to_period('M')
In [15]: df
Out[15]:
0 1
0 2014-01-05 2014-01
1 2014-01-19 2014-01
2 2014-02-02 2014-02
3 2014-02-16 2014-02
4 2014-03-02 2014-03
5 2014-03-16 2014-03
6 2014-03-30 2014-03
7 2014-04-13 2014-04
8 2014-04-27 2014-04
9 2014-05-11 2014-05
10 2014-05-25 2014-05
11 2014-06-08 2014-06
You can do a similar trick if the timestamps are part of a MultiIndex by extracting that "column" and passing it to DatetimeIndex as above, e.g. using df.index.get_level_values:
For example:
df[2] = 2
df.set_index([0, 1], inplace=True)
df.index.get_level_values(0) # returns a DatetimeIndex
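Putting that together, here is a small sketch (continuing the example above) of attaching a monthly period column when the timestamps sit in a MultiIndex level; the 'period' column name is just illustrative:
periods = pd.DatetimeIndex(df.index.get_level_values(0)).to_period('M')
df['period'] = periods  # assigned positionally; lengths match by construction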
