I'm working with some historical baseball data and trying to get matchup information (batter/pitcher) for previous games.
Example data:
import pandas as pd
data = {'ID': ['A','A','A','A','A','A','B','B','B','B','B'],
'Year' : ['2017-05-01', '2017-06-03', '2017-08-02', '2018-05-30', '2018-07-23', '2018-09-14', '2017-06-01', '2017-08-03', '2018-05-15', '2018-07-23', '2017-05-01'],
'ID2' : [1,2,3,2,2,1,2,2,2,1,1],
'Score 2': [1,4,5,7,5,5,6,1,4,5,6],
'Score 3': [1,4,5,7,5,5,6,1,4,5,6],
'Score 4': [1,4,5,7,5,5,6,1,4,5,6]}
df = pd.DataFrame(data)
lookup_data = {"First_Person" : ['A', 'B'],
"Second_Person" : ['1', '2'],
"Year" : ['2018', '2018']}
lookup_df = pd.DataFrame(lookup_data)
lookup_df has the current matchups; df has the historical data plus the current matchups.
I want to find, for example, for Person A against Person 2, what were the results of any of their matchups on any previous date?
I can do this with:
history_list = []
def get_history(row, df, hist_list):
    # we filter the df to matchups containing both players before the previous date
    # and sum all events in their history
    history = df[(df['ID'] == row['First_Person'])
                 & (df['ID2'] == row['Second_Person'])
                 & (df['Year'] < row['Year'])].sum().iloc[3:]
    # add to a list to keep track of results
    hist_list.append(list(history.values) + [row['Year'] + row['First_Person'] + row['Second_Person']])
and then execute with apply like so:
lookup_df.apply(get_history, df=df, hist_list = history_list, axis=1)
Expected results would be something like:
1st P   Matchup date   2nd P   Historical scores
A       2018-07-23     2       11  11  11
B       2018-05-15     2        7   7   7
But this is pretty slow - the filtering operation takes around 50ms per lookup.
Is there a better way I can approach this problem? This currently would take over 3 hours to run across 250k historical matchups.
You can merge (or map) and then groupby:
lookup_df['Second_Person'] = lookup_df['Second_Person'].astype(int)
merged = (df.merge(lookup_df, left_on=['ID', 'ID2'],
                   right_on=['First_Person', 'Second_Person'], how='left')
            .query('Year_x < Year_y')
            .drop(['Year_x', 'First_Person', 'Second_Person', 'Year_y'], axis=1))
merged.groupby('ID', as_index = False).sum()
ID ID2 Score 2 Score 3 Score 4
0 A 1 1 1 1
1 B 4 7 7 7
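If you also want the result keyed by the full matchup rather than just the first person, a minimal variation of the same merge is to group on the lookup keys instead (a sketch, reusing df and lookup_df and assuming Second_Person has already been cast to int as above; m and history are just illustrative names):
m = df.merge(lookup_df, left_on=['ID', 'ID2'],
             right_on=['First_Person', 'Second_Person'], how='left')
# keep only the historical rows, then sum the score columns per matchup pair
history = (m[m['Year_x'] < m['Year_y']]
           .groupby(['First_Person', 'Second_Person'], as_index=False)
           [['Score 2', 'Score 3', 'Score 4']].sum())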
Related
I have the following dataframe containing scores for a competition as well as a column that counts what number entry for each person.
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Jim', 'John','Jim', 'John','Jim','John','Jim','John','Jim','Jack','Jack','Jack','Jack'],'Score': [10,8,9,3,5,0, 1, 2,3, 4,5,6,8,9]})
df['Entry_No'] = df.groupby(['Name']).cumcount() + 1
df
Then I have another table that stores data on the maximum number of entries that each person can have:
df2 = pd.DataFrame({'Name': ['John', 'Jim', 'Jack'],'Limit': [2,3,1]})
df2
I am trying to drop rows from df where the entry number is greater than that person's Limit in df2, so that my expected output is this:
   Name  Score  Entry_No
0  John     10         1
1   Jim      8         1
2  John      9         2
3   Jim      3         2
4   Jim      0         3
5  Jack      5         1
Any ideas on how to achieve this would be fantastic! Thanks
You can use pandas.merge to create another dataframe and drop rows by your condition:
df3 = pd.merge(df, df2, on="Name", how="left")
df3[df3["Entry_No"] <= df3["Limit"]][df.columns].reset_index(drop=True)
Name Score Entry_No
0 John 10 1
1 Jim 8 1
2 John 9 2
3 Jim 3 2
4 Jim 0 3
5 Jack 5 1
I used how="left" to keep the order of df and reset_index(drop=True) to reset the index of the resulting dataframe.
You could join the 2 dataframes, and then drop with a condition:
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Jim', 'John','Jim', 'John','Jim','John','Jim','John','Jim','Jack','Jack','Jack','Jack'],'Score': [10,8,9,3,5,0, 1, 2,3, 4,5,6,8,9]})
df['Entry_No'] = df.groupby(['Name']).cumcount() + 1
df2 = pd.DataFrame({'Name': ['John', 'Jim', 'Jack'],'Limit': [2,3,1]})
df2 = df2.set_index('Name')
df = df.join(df2, on='Name')
df.drop(df[df.Entry_No>df.Limit].index, inplace = True)
This gives the expected output.
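A small follow-up (just a sketch): after the drop you will probably want to remove the joined Limit column and reset the index so the frame matches the expected output exactly:
df = df.drop(columns='Limit').reset_index(drop=True)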
I want to add a new row when the iteration reaches the row which has 'Total Charges'.
FYI: as shown in the code, column number 1 is where the check has to be performed.
for row in df.itertuples():
    if row[1] == 'Total Charges':
        # insert a new row right after this one
        pass
This is how the data looks; I need to separate it with a new row, right under 'Total Charges'.
Hope I understood you correctly (I used the data example you provided).
Iterate the rows and search for Total Charges. Then use pandas.concat().
import pandas as pd
df = pd.DataFrame({'column1': ['data_row_1', 'data_row_2', 'Total Charges', 'data_row_3', 'data_row_4'], 'column2': range(1, 6)})
for index, row in df.iterrows():
    if row['column1'] == 'Total Charges':
        # split the frame right after the 'Total Charges' row
        df_before = df.iloc[:index + 1]
        df_after = df.iloc[index + 1:]
        new_row = pd.DataFrame({'column1': ['new_data_1'], 'column2': ['new_data_2']})
        # stitch the pieces back together with the new row in between
        new_df = pd.concat([df_before, new_row, df_after], ignore_index=True)
        break
print(new_df)
Output:
column1 column2
0 data_row_1 1
1 data_row_2 2
2 Total Charges 3
3 new_data_1 new_data_2
4 data_row_3 4
5 data_row_4 5
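If you prefer to avoid iterating, a loop-free sketch of the same idea (assuming a default RangeIndex and exactly one 'Total Charges' row) would be:
# locate the 'Total Charges' row, then concatenate the pieces around a new row
pos = df.index[df['column1'] == 'Total Charges'][0]
new_row = pd.DataFrame({'column1': ['new_data_1'], 'column2': ['new_data_2']})
new_df = pd.concat([df.iloc[:pos + 1], new_row, df.iloc[pos + 1:]], ignore_index=True)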
Use:
import pandas as pd
s = list(range(3))
s.append('Total Charges')
s.extend(list(range(3)))
df = pd.DataFrame({'c1': s, 'c2': range(7)})
ind = df[df['c1'] == 'Total Charges'].index
# assign to a fractional index just after the matched row, then sort to slot the new row in
df.loc[ind[0] + .5] = '', ''
df = df.sort_index().reset_index(drop=True)
Output:
I have a dataframe as follows:
dashboard = pd.DataFrame({
'id':[1,1,1,1,1,2,2,3,3,4,4],
'level': [1,2,2.1,2.2,3,3.1,4,1.1,2,3,4],
'cost': [10,6,4,8,9,6,11,23,3,2,12],
'category': ['Original', 'Time', 'Money','Original','Original','Time','Original','Original','Time','Original','Original']
})
I need to get the following table, where, for example, if the final level is 3, the code sums only the previous sub-levels (2.2 and 2.1, excluding 2):
pd.DataFrame({
'id': [1,2,3,4],
'level': [3,4,2,4],
'cost': [12,6,23,0],
'category': ['Time & Money','Time','Time','']
})
You can do it this way
df2 = dashboard.groupby('id')['level'].last().astype(int).reset_index()
df2['cost'] = dashboard.groupby('id').apply(lambda x: x[x['level']>=(x['level'].tail(1)-0.9).sum()]['cost'].sum()-x['cost'].tail(1)).reset_index(drop=True)
df2['category'] = dashboard.groupby('id').apply(lambda x: x[x['level']>=(x['level'].tail(1)-0.9).sum()].groupby('id')['category'].agg(' & '.join)).reset_index(drop=True).replace('Original','', regex=True).str.strip((' & '))
df2
Output (the input & the output you have provided do not match for column 'category'):
id level cost category
0 1 3 12 Money
1 2 4 6 Time
2 3 2 23 Time
3 4 4 0
How can I efficiently find overlapping dates between many date ranges?
I have a pandas dataframe containing information on the daily warehouse stock of many products. There are only records for those dates where stock actually changed.
import pandas as pd
df = pd.DataFrame({'product': ['a', 'a', 'a', 'b', 'b', 'b'],
'stock': [10, 0, 10, 5, 0, 5],
'date': ['2016-01-01', '2016-01-05', '2016-01-15',
'2016-01-01', '2016-01-10', '2016-01-20']})
df['date'] = pd.to_datetime(df['date'])
Out[4]:
date product stock
0 2016-01-01 a 10
1 2016-01-05 a 0
2 2016-01-15 a 10
3 2016-01-01 b 5
4 2016-01-10 b 0
5 2016-01-20 b 5
From this data I want to identify the number of days where stock of all products was 0. In the example this would be 5 days (from 2016-01-10 to 2016-01-14).
I initially tried resampling the dates to create one record for every day and then comparing day by day. This works, but it creates a very large dataframe that I can hardly keep in memory, because my data spans many dates on which stock does not change.
Is there a more memory-efficient way to calculate overlaps other than creating a record for every date and comparing day by day?
Maybe I can somehow create a period representation for the time range implicit in every record and then compare all periods for all products?
Another option could be to first subset only those time periods where a product has zero stock (relatively few) and then apply the resampling only on that subset of the data.
What other, more efficient ways are there?
You can pivot the table using the dates as index and the products as columns, then fill NaNs with the previous values, convert to daily frequency and look for rows with 0's in all columns.
ptable = (df.pivot(index='date', columns='product', values='stock')
.fillna(method='ffill').asfreq('D', method='ffill'))
cond = ptable.apply(lambda x: (x == 0).all(), axis='columns')
print(ptable.index[cond])
DatetimeIndex(['2016-01-10', '2016-01-11', '2016-01-12', '2016-01-13',
'2016-01-14'],
dtype='datetime64[ns]', name=u'date', freq='D')
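If even the daily pivot is too large to hold in memory, a sketch that evaluates the all-zero condition only at the recorded change dates (assuming stock stays constant between records) could look like this:
ptable = df.pivot(index='date', columns='product', values='stock').ffill()
all_zero = ptable.eq(0).all(axis=1)
# number of days each recorded state persists until the next change
# (the last, open-ended state has no end date and is ignored by the sum)
duration = ptable.index.to_series().shift(-1).sub(ptable.index.to_series()).dt.days
total_zero_days = int(duration[all_zero].sum())   # 5 for the example data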
Here, try this. I know it's not the prettiest code, but according to the data provided here it should work:
from datetime import timedelta
import pandas as pd
df = pd.DataFrame({'product': ['a', 'a', 'a', 'b', 'b', 'b'],
'stock': [10, 0, 10, 5, 0, 5],
'date': ['2016-01-01', '2016-01-05', '2016-01-15',
'2016-01-01', '2016-01-10', '2016-01-20']})
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date', ascending=True)
no_stock_dates = []
product_stock = {}
in_flag = False
begin = df['date'][0]
for index, row in df.iterrows():
    current = row['date']
    product_stock[row['product']] = row['stock']
    if current > begin:
        if sum(product_stock.values()) == 0 and not in_flag:
            in_flag = True
            begin = row['date']
        if sum(product_stock.values()) != 0 and in_flag:
            in_flag = False
            no_stock_dates.append((begin, current - timedelta(days=1)))
print(no_stock_dates)
This code should run in O(n*k), where n is the number of rows and k is the number of products.
I usually use value_counts() to get the number of occurrences of a value. However, I now deal with large database tables (which cannot be fully loaded into RAM) and query the data one month at a time.
Is there a way to store the result of value_counts() and merge it with / add it to the next results?
I want to count the number of user actions. Assume the following structure of
user-activity logs:
# month 1
id userId actionType
1 1 a
2 1 c
3 2 a
4 3 a
5 3 b
# month 2
id userId actionType
6 1 b
7 1 b
8 2 a
9 3 c
Using value_counts() on those produces:
# month 1
userId
1 2
2 1
3 2
# month 2
userId
1 2
2 1
3 1
Expected output:
# month 1+2
userId
1 4
2 2
3 3
Up until now, I have only found a method using groupby and sum:
# count users actions and remember them in new column
df1['count'] = df1.groupby(['userId'], sort=False)['id'].transform('count')
# delete not necessary columns
df1 = df1[['userId', 'count']]
# delete not necessary rows
df1 = df1.drop_duplicates(subset=['userId'])
# repeat
df2['count'] = df2.groupby(['userId'], sort=False)['id'].transform('count')
df2 = df2[['userId', 'count']]
df2 = df2.drop_duplicates(subset=['userId'])
# merge and sum up
print(pd.concat([df1, df2]).groupby(['userId'], sort=False).sum())
What is the pythonic / pandas' way of merging the information of several series' (and dataframes) efficiently?
Let me suggest "add" and specify a fill value of 0. This has an advantage over the previously suggested answer in that it will work when the two Dataframes have non-identical sets of unique keys.
# Create frames
df1 = pd.DataFrame(
{'User_id': ['a', 'a', 'b', 'c', 'c', 'd'], 'a': [1, 1, 2, 3, 3, 5]})
df2 = pd.DataFrame(
{'User_id': ['a', 'a', 'b', 'b', 'c', 'c', 'c'], 'a': [1, 1, 2, 2, 3, 3, 4]})
Now add the two sets of value_counts(). The fill_value argument will handle any NaN values that would arise; in this example, the 'd' that appears in df1 but not in df2.
a = df1.User_id.value_counts()
b = df2.User_id.value_counts()
a.add(b,fill_value=0)
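To accumulate over an arbitrary number of monthly chunks instead of just two, a minimal sketch along the same lines (assuming each chunk is queried separately, as in the question, and reusing df1/df2 from above) is to keep a running total:
running = pd.Series(dtype='int64')
for chunk in (df1, df2):   # in practice: one chunk per month pulled from the database
    # align on userId and treat missing users as 0 before adding
    running = running.add(chunk.User_id.value_counts(), fill_value=0)
running = running.astype(int)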
You can sum the series generated by the value_counts method directly:
#create frames
df= pd.DataFrame({'User_id': ['a','a','b','c','c'],'a':[1,1,2,3,3]})
df1= pd.DataFrame({'User_id': ['a','a','b','b','c','c','c'],'a':[1,1,2,2,3,3,4]})
sum the series:
df.User_id.value_counts() + df1.User_id.value_counts()
output:
a 4
b 3
c 5
dtype: int64
This is known as "Split-Apply-Combine". It can be done in one line and three quick steps, using a lambda function, as follows.
1️⃣ paste this into your code:
df['total_for_this_label'] = df.groupby('label', as_index=False)['label'].transform(lambda x: x.count())
2️⃣ replace the three occurrences of label with the name of the column whose values you are counting (case-sensitive)
3️⃣ run print(df.head()) to check it has worked correctly