Pandas DataFrame groupby: determine values in one group vs. another group - python

I have a dataframe as follows:
Date ID
2014-12-31 1
2014-12-31 2
2014-12-31 3
2014-12-31 4
2014-12-31 5
2014-12-31 6
2014-12-31 7
2015-01-01 1
2015-01-01 2
2015-01-01 3
2015-01-01 4
2015-01-01 5
2015-01-02 1
2015-01-02 3
2015-01-02 7
2015-01-02 9
What I would like to do is determine the ID(s) on one date that are exclusive to that date versus the values of another date.
Example 1: the result df would be the exclusive ID(s) in 2014-12-31 vs. the ID(s) in 2015-01-01, and the exclusive ID(s) in 2015-01-01 vs. the ID(s) in 2015-01-02:
2015-01-01 6
2015-01-01 7
2015-01-02 2
2015-01-02 4
2015-01-02 5
I would also like to 'choose' how many days 'back' I compare. For instance, with daysback=1 each day is compared to the previous day; with daysback=2 each day is compared to two days earlier, and so on.
Outside of df.groupby('Date'), I'm not sure where to go with this. Possibly use of diff()?

I'm assuming that the "Date" in your DataFrame is: 1) a date object and 2) not the index.
If those assumptions are wrong, then that changes things a bit.
import datetime
from datetime import timedelta

def find_unique_ids(df, date, daysback=1):
    date_new = date
    date_old = date - timedelta(days=daysback)
    ids_new = df[df['Date'] == date_new]['ID']
    ids_old = df[df['Date'] == date_old]['ID']
    return ids_new[~ids_new.isin(ids_old)]

date = datetime.date(2015, 1, 2)
daysback = 1
print(find_unique_ids(df, date, daysback))
Running that produces the following output:
14    7
15    9
Name: ID, dtype: int64
If the Date is your Index field, then you need to modify two lines in the function:
ids_new = df.loc[date_new, 'ID']
ids_old = df.loc[date_old, 'ID']
Output:
Date
2015-01-02    7
2015-01-02    9
Name: ID, dtype: int64
EDIT:
This is kind of dirty, but it should accomplish what you want to do. I added comments inline that explain what is going on. There are probably cleaner and more efficient ways to go about this if this is something that you're going to be running regularly or across massive amounts of data.
def find_unique_ids(df, daysback):
    # Date and ID both need to be regular fields (or both index levels) -- no mix/match.
    df = df.reset_index()
    # Calculate DateComp by adding our daysback value as a timedelta
    df['DateComp'] = df['Date'].apply(lambda dc: dc + timedelta(days=daysback))
    # Join df back onto itself, SQL-style LEFT OUTER.
    df2 = pd.merge(df, df, left_on=['DateComp', 'ID'], right_on=['Date', 'ID'], how='left')
    # Boolean series of missing ID values from the right table
    missing_ids = df2['Date_y'].isnull()
    # Boolean series of valid DateComp values.
    # DateComp is the "future" date that we're comparing against. Without this
    # step, all records on the last Date value would be flagged as unique IDs.
    valid_dates = df2['DateComp_x'].isin(df['Date'].unique())
    # Use those to find missing IDs on valid dates. Create a new output DataFrame.
    output = df2[valid_dates & missing_ids][['DateComp_x', 'ID']]
    # Rename columns of output and return
    output.columns = ['Date', 'ID']
    return output
Test output:
Date ID
5 2015-01-01 6
6 2015-01-01 7
8 2015-01-02 2
10 2015-01-02 4
11 2015-01-02 5
EDIT:
missing_ids = df2[df2['Date_y'].isnull()]  # gives the whole necessary DataFrame (the rows themselves, not just a boolean mask)
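If the self-merge above is more machinery than you need, a set-based sketch of the same idea follows (hedged: it assumes Date holds datetime.date values, the ID column is named ID, and daysback means "compare each date against the date daysback days earlier, labelling the result with the later date"):
import pandas as pd
from datetime import timedelta

def exclusive_ids(df, daysback=1):
    # set of IDs seen on each date
    ids_by_date = df.groupby('Date')['ID'].apply(set).to_dict()
    rows = []
    for date, ids in ids_by_date.items():
        earlier = ids_by_date.get(date - timedelta(days=daysback))
        if earlier is not None:
            # IDs present `daysback` days ago that are gone on `date`
            rows += [(date, i) for i in sorted(earlier - ids)]
    return pd.DataFrame(rows, columns=['Date', 'ID']).sort_values(['Date', 'ID'])
On the sample data this should reproduce the five rows in the test output above (6 and 7 for 2015-01-01; 2, 4 and 5 for 2015-01-02).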

Another way is to aggregate the IDs into lists with apply(list):
df
Out[146]:
Date Unnamed: 2
0 2014-12-31 1
1 2014-12-31 2
2 2014-12-31 3
3 2014-12-31 4
4 2014-12-31 5
5 2014-12-31 6
6 2014-12-31 7
7 2015-01-01 1
8 2015-01-01 2
9 2015-01-01 3
10 2015-01-01 4
11 2015-01-01 5
12 2015-01-02 1
13 2015-01-02 3
14 2015-01-02 7
15 2015-01-02 9
abbs = df.groupby(['Date'])['Unnamed: 2'].apply(list)
abbs
Out[142]:
Date
2014-12-31 [1, 2, 3, 4, 5, 6, 7]
2015-01-01 [1, 2, 3, 4, 5]
2015-01-02 [1, 3, 7, 9]
Name: Unnamed: 2, dtype: object
abbs.loc['2015-01-01']
Out[143]: [1, 2, 3, 4, 5]
list(set(abbs.loc['2014-12-31']) - set(abbs.loc['2015-01-01']))
Out[145]: [6, 7]
As a function:
def uid(df, date1, date2):
    abbs = df.groupby(['Date'])['Unnamed: 2'].apply(list)
    return list(set(abbs.loc[date1]) - set(abbs.loc[date2]))
uid(df,'2015-01-01','2015-01-02')
Out[162]: [2, 4, 5]
You could also write the function to take date objects instead of strings :)
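For instance, a hedged sketch of that date-based variant, wired to the daysback idea from the original question (assumptions: the ID column is still named 'Unnamed: 2' and the Date column has been converted to date objects, e.g. via pd.to_datetime(df['Date']).dt.date):
from datetime import date, timedelta

def uid_by_date(df, day, daysback=1):
    # same aggregation as above, but keyed by date objects instead of strings
    ids = df.groupby('Date')['Unnamed: 2'].apply(set)
    return sorted(ids.loc[day - timedelta(days=daysback)] - ids.loc[day])

# uid_by_date(df, date(2015, 1, 2)) should give [2, 4, 5], matching Out[162]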

Related

Filter out values within a 30 day period of the previous entry

I need to pick one value per 30 day period for the entire dataframe. For instance if I have the following dataframe:
df:
Date Value
0 2015-09-25 e
1 2015-11-11 b
2 2015-11-24 c
3 2015-12-02 d
4 2015-12-14 a
5 2016-02-01 b
6 2016-03-23 c
7 2016-05-02 d
8 2016-05-25 a
9 2016-06-15 a
10 2016-06-28 a
I need to pick the first entry, filter out any entry within the next 30 days of that entry, and then proceed along the dataframe. For instance, indexes 0 and 1 should stay since they are at least 30 days apart, but 2 and 3 are within 30 days of 1 so they should be removed. This should continue chronologically until we have one entry per 30 day period:
Date Value
0 2015-09-25 e
1 2015-11-11 b
4 2015-12-14 a
5 2016-02-01 b
6 2016-03-23 c
7 2016-05-02 d
9 2016-06-15 a
The end result should have only 1 entry per 30 day period. Any advice or assistance would be greatly appreciated!
I have tried df.groupby(pd.Grouper(freq='M')).first() but that picks the first entry in each month rather than each entry that is at least 30 days from the previous entry.
I came up with a simple iterative solution which uses the fact that the DF is sorted, but it's fairly slow:
index = df.index.values
dates = df['Date'].tolist()
index_to_keep = []
curr_date = None
for i in range(len(dates)):
    if not curr_date or (dates[i] - curr_date).days > 30:
        index_to_keep.append(index[i])
        curr_date = dates[i]
df_out = df.loc[index_to_keep, :]
Any ideas on how to speed it up?
I think this should be what you are looking for.
You need to convert your Date column to a datetime data structure so it is not interpreted as a string.
Here is what it looks like:
df = pd.DataFrame({'Date': ['2015-09-25', '2015-11-11', '2015-11-24', '2015-12-02', '2015-12-14'],
                   'Value': ['e', 'b', 'c', 'd', 'a']})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df = df.groupby(pd.Grouper(freq='30D')).nth(0)
And here is the result:
Value
Date
2015-09-25 e
2015-10-25 b
2015-11-24 c

Shifting selected months in python

I am struggling to find a solution for the following problem: I have a dataframe which reports quarterly values. Unfortunately, some of the companies report their quarterly numbers a month after the typical release quarter-dates. For this reason, I would like to select these dates and change them to the typical release date. My dataframe looks like this:
# dataframe 1
rng1 = pd.date_range('2014-12-31', periods=5, freq='3M')
df1 = pd.DataFrame({'Date': rng1, 'Company': [1, 1, 1, 1, 1], 'Val': np.random.randn(len(rng1))})
# dataframe 2
rng2 = pd.date_range('2015-01-30', periods=5, freq='3M')
df2 = pd.DataFrame({'Date': rng2, 'Company': [2, 2, 2, 2, 2], 'Val': np.random.randn(len(rng2))})
# Target Dataframe
frames = [df1, df2]
df_fin = pd.concat(frames)
Output:
Date Company Val
0 2014-12-31 1 0.374427
1 2015-03-31 1 0.328239
2 2015-06-30 1 -1.226196
3 2015-09-30 1 -0.153937
4 2015-12-31 1 -0.146096
0 2015-01-31 2 0.283528
1 2015-04-30 2 0.426100
2 2015-07-31 2 -0.044960
3 2015-10-31 2 -1.316574
4 2016-01-31 2 0.353073
So what I would like to do is the following: company 2 reports their numbers a month later. For this reason I would like to change their dates so they align with company 1. This means I would change dates such as 2015-01-31 to 2014-12-31.
Any help is highly appreciated
Thanks in advance
Use pd.merge_asof with direction='nearest' to merge the dataframe df_fin with the reference quarterly dates qDates:
# Reference quarterly dates (typical release dates)
qDates = pd.date_range('2014-12-31', periods=5, freq='Q')
df = pd.merge_asof(
    df_fin.sort_values(by='Date'), pd.Series(qDates, name='Quarter'),
    left_on='Date', right_on='Quarter', direction='nearest')
df = (
    df.sort_values(by=['Company', 'Quarter'])
      .drop(columns='Date')
      .rename(columns={'Quarter': 'Date'})
      .reindex(df_fin.columns, axis=1)
      .reset_index(drop=True)
)
# print(df)
Date Company Val
0 2014-12-31 1 0.146874
1 2015-03-31 1 0.297248
2 2015-06-30 1 1.444860
3 2015-09-30 1 -0.348871
4 2015-12-31 1 -0.093267
5 2014-12-31 2 -0.238166
6 2015-03-31 2 -1.503571
7 2015-06-30 2 0.791149
8 2015-09-30 2 -0.419414
9 2015-12-31 2 -0.598963
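merge_asof also accepts a tolerance argument if you want to avoid snapping dates that sit far from every reference quarter. A hedged variant of the call above (the 45-day window is an assumption, pick whatever slack suits your data; unmatched rows come back with Quarter = NaT):
df = pd.merge_asof(
    df_fin.sort_values(by='Date'), pd.Series(qDates, name='Quarter').to_frame(),
    left_on='Date', right_on='Quarter',
    direction='nearest', tolerance=pd.Timedelta(days=45))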
I hope I get what you mean. You can use pd.DateOffset (or the anchored offsets in pd.offsets) here to add or subtract a number of months from Date for the rows where Company == 2.
For example:
df_fin.loc[df_fin['Company'] == 2,'Date'] = df_fin.loc[df_fin['Company'] == 2,'Date'] - pd.DateOffset(months=1)
df_fin prints:
# df_fin
Date Company Val
0 2014-12-31 1 -0.794092
1 2015-03-31 1 -2.632114
2 2015-06-30 1 -0.176383
3 2015-09-30 1 0.701986
4 2015-12-31 1 -0.447678
0 2014-12-31 2 -0.003322
1 2015-03-30 2 0.475669
2 2015-06-30 2 -1.024190
3 2015-09-30 2 1.241122
4 2015-12-31 2 0.096882
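Note that subtracting a calendar month from a month-end date can land one day short (2015-04-30 minus one month is 2015-03-30 in the output above). If the result has to sit exactly on month ends, one hedged variant combines the month shift with an anchored offset: MonthEnd(0) rolls a date forward to its month end and leaves dates already on a month end untouched.
mask = df_fin['Company'] == 2
df_fin.loc[mask, 'Date'] = (df_fin.loc[mask, 'Date']
                            - pd.DateOffset(months=1)
                            + pd.offsets.MonthEnd(0))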

How to truncate a column in a Pandas time series data frame so as to remove leading and trailing zeros?

I have the following time series df in Pandas:
date value
2015-01-01 0
2015-01-02 0
2015-01-03 0
2015-01-04 3
2015-01-05 0
2015-01-06 4
2015-01-07 0
I would like to remove the leading and trailing zeroes, so as to have the following df:
date value
2015-01-04 3
2015-01-05 0
2015-01-06 4
Simply dropping rows with 0s in them would lead to deleting the 0s in the middle as well, which I don't want.
I thought of writing a forward loop that starts from the first row and continues until the first non-zero value, and a second backwards loop that goes back from the end and stops at the last non-zero value. But that seems like overkill; is there a more efficient way of doing so?
A general solution that also works when all values are 0 (it then returns an empty DataFrame): build a mask of non-zero values, take its cumulative sum both forwards and in reverse (via [::-1]), test each for not equal 0, chain the two conditions with bitwise AND, and filter by boolean indexing:
s = df['value'].ne(0)
df = df[s.cumsum().ne(0) & s[::-1].cumsum().ne(0)]
print (df)
date value
3 2015-01-04 3
4 2015-01-05 0
5 2015-01-06 4
If there is always at least one non-zero value, it is possible to convert the 0s to missing values and use DataFrame.loc with Series.first_valid_index and Series.last_valid_index:
s = df['value'].mask(df['value'] == 0)
df = df.loc[s.first_valid_index():s.last_valid_index()]
print (df)
date value
3 2015-01-04 3
4 2015-01-05 0
5 2015-01-06 4
Another idea is to use Series.idxmax or Series.idxmin:
s = df['value'].eq(0)
df = df.loc[s.idxmin():s[::-1].idxmin()]
print (df)
date value
3 2015-01-04 3
4 2015-01-05 0
5 2015-01-06 4
s = df['value'].ne(0)
df = df.loc[s.idxmax():s[::-1].idxmax()]
You can get a list of the indexes where value is greater than 0, and then find the min.
data = [
    ['2015-01-01', 0],
    ['2015-01-02', 0],
    ['2015-01-03', 0],
    ['2015-01-04', 3],
    ['2015-01-05', 0],
    ['2015-01-06', 4]
]
df = pd.DataFrame(data, columns=['date', 'value'])
print(min(df.index[df['value'] > 0].tolist()))
# 3
Then filter the main df like this:
df.iloc[3:]
Or even better:
df.iloc[min(df.index[df['value'] > 0].tolist()):]
And you get:
date value
3 2015-01-04 3
4 2015-01-05 0
5 2015-01-06 4
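To trim trailing zeros as well with the same idea, take both the min and the max of those indexes and slice with loc, which is inclusive on both ends (a small sketch, assuming at least one non-zero value exists):
nz = df.index[df['value'] > 0]
df_trimmed = df.loc[nz.min():nz.max()]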

Fastest way to create DataFrame from last available data

I had no success looking for answers to this question in the forum since it is hard to put it into keywords. Any keyword suggestions are appreciated so that I can make this question more accessible and others can benefit from it.
The closest question I found doesn't really answer mine.
My problem is the following:
I have one DataFrame that I called ref, and a dates list called pub. ref has dates for indexes but those dates are different (there will be a few matching values) from the dates in pub. I want to create a new DataFrame that contains all the dates from pub but fill it with the "last available data" from ref.
Thus, say ref is:
Dat col1 col2
2015-01-01 5 4
2015-01-02 6 7
2015-01-05 8 9
And pub
2015-01-01
2015-01-04
2015-01-06
I'd like to create a DataFrame like:
Dat col1 col2
2015-01-01 5 4
2015-01-04 6 7
2015-01-06 8 9
Performance matters here, so I'm looking for the fastest (or at least a fast) way of doing this.
Thanks in advance.
You can do an outer merge, set the new index to Dat, sort it, forward fill, and then reindex based on the dates in pub.
import datetime as dt

dates = ['2015-01-01', '2015-01-04', '2015-01-06']
pub = pd.DataFrame([dt.datetime.strptime(ts, '%Y-%m-%d').date() for ts in dates],
                   columns=['Dat'])
>>> (ref
     .merge(pub, on='Dat', how='outer')
     .set_index('Dat')
     .sort_index()
     .ffill()
     .reindex(pub.Dat))
col1 col2
Dat
2015-01-01 5 4
2015-01-04 6 7
2015-01-06 8 9
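Since the goal is literally "last available row at or before each publication date", pd.merge_asof is a hedged alternative that does the lookup directly (assumptions: both Dat columns hold proper datetime values of the same dtype and both frames are sorted by Dat):
out = pd.merge_asof(pub, ref, on='Dat', direction='backward')
# or, equivalently, with a forward-filled reindex:
out = ref.set_index('Dat').reindex(pub['Dat'], method='ffill')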
Use np.searchsorted to find the index just after each date (the 'right' option is needed to handle equality properly):
In [27]: pub = ['2015-01-01', '2015-01-04', '2015-01-06']
In [28]: df
Out[28]:
col1 col2
Dat
2015-01-01 5 4
2015-01-02 6 7
2015-01-05 8 9
In [29]: y = np.searchsorted(list(df.index), pub, 'right')
# array([1, 2, 3], dtype=int64)
Then just rebuild:
In [30]: pd.DataFrame(df.iloc[y - 1].values, index=pub)
Out[30]:
0 1
2015-01-01 5 4
2015-01-04 6 7
2015-01-06 8 9
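The same lookup can also be written against the index itself: Index.searchsorted with side='right', stepped back one position, points at the row holding the last available data (a sketch, assuming df.index and pub are sorted and comparable, e.g. both ISO date strings):
pos = df.index.searchsorted(pub, side='right') - 1
out = pd.DataFrame(df.iloc[pos].to_numpy(), index=pub, columns=df.columns)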

Counting dates in a range set by pandas dataframe

I have a pandas dataframe that contains two date columns, a start date and an end date, that define a range. I'd like to be able to collect a total count for every date covered by these ranges, across all rows in the dataframe.
For example, the table looks like:
index start_date end_date
0 '2015-01-01' '2015-01-17'
1 '2015-01-03' '2015-01-12'
And the result would be a per date aggregate, like:
date count
'2015-01-01' 1
'2015-01-02' 1
'2015-01-03' 2
and so on.
My current approach works but is extremely slow on a big dataframe as I'm looping across the rows, calculating the range and then looping through this. I'm hoping to find a better approach.
Currently I'm doing:
date = pd.date_range(min(df.start_date), max(df.end_date))
df2 = pd.DataFrame(index=date)
df2['count'] = 0
for index, row in df.iterrows():
    dates = pd.date_range(row['start_date'], row['end_date'])
    for date in dates:
        df2.loc[date, 'count'] += 1
After stacking the relevant columns as suggested by @Sam, just use value_counts.
df[['start_date', 'end_date']].stack().value_counts()
EDIT:
Given that you also want to count the dates between the start and end dates:
start_dates = pd.to_datetime(df.start_date)
end_dates = pd.to_datetime(df.end_date)
>>> pd.Series(dt.date() for group in
              [pd.date_range(start, end) for start, end in zip(start_dates, end_dates)]
              for dt in group).value_counts()
Out[178]:
2015-01-07 2
2015-01-06 2
2015-01-12 2
2015-01-05 2
2015-01-04 2
2015-01-10 2
2015-01-03 2
2015-01-09 2
2015-01-08 2
2015-01-11 2
2015-01-16 1
2015-01-17 1
2015-01-14 1
2015-01-15 1
2015-01-02 1
2015-01-01 1
2015-01-13 1
dtype: int64
I think the solution here is to 'stack' your two date columns, group by the date, and do a count. Play around with the df.stack() function. Here is something I threw together that yields a good solution:
import datetime
df = pd.DataFrame({'Start': [datetime.date(2016, 5, i) for i in range(1, 30)],
                   'End': [datetime.date(2016, 5, i) for i in range(1, 30)]})
df.stack().reset_index()[[0, 'level_1']].groupby(0).count()
I would use the melt() method for that:
In [76]: df
Out[76]:
start_date end_date
index
0 2015-01-01 2015-01-17
1 2015-01-03 2015-01-12
2 2015-01-03 2015-01-17
In [77]: pd.melt(df, value_vars=['start_date','end_date']).groupby('value').size()
Out[77]:
value
2015-01-01 1
2015-01-03 2
2015-01-12 1
2015-01-17 2
dtype: int64
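If you need every date inside each range counted (as in the EDIT above) and the frame is large, a vectorized sketch using a +1/-1 ledger and a cumulative sum avoids materialising every intermediate date row by row (assumptions: columns named start_date and end_date, ranges inclusive on both ends, dates parseable by pd.to_datetime):
import pandas as pd

def count_coverage(df):
    starts = pd.to_datetime(df['start_date'])
    ends = pd.to_datetime(df['end_date'])
    # +1 on each start date, -1 on the day after each end date
    opens = pd.Series(1, index=starts).groupby(level=0).sum()
    closes = pd.Series(1, index=ends + pd.Timedelta(days=1)).groupby(level=0).sum()
    ledger = opens.sub(closes, fill_value=0)
    # lay the ledger over the full calendar and accumulate
    full_range = pd.date_range(starts.min(), ends.max() + pd.Timedelta(days=1))
    counts = ledger.reindex(full_range, fill_value=0).cumsum()
    return counts.loc[:ends.max()].astype(int)
On the two example rows this should give 1 for 2015-01-01 and 2015-01-02, 2 for 2015-01-03 through 2015-01-12, and 1 again through 2015-01-17.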
