Python Pandas feature generation as aggregate function

I have a pandas df which is more or less like this:
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
.....
This DF contains a couple of million points. I am now trying to generate some descriptors that incorporate the temporal nature of the data. The idea is that for each line I should create a window of length x going back in the data and count the occurrences of that particular key within the window. I wrote an implementation, but by my estimate the calculation for 23 different windows would run for 32 days. Here is the code:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
There are multiple windows of different lengths. I have the uneasy feeling, however, that iteration is probably not the smartest way to go about this data aggregation. Is there a way to implement it so that it runs faster?

On a toy example data frame, you can achieve about a 7x speedup by using apply() instead of iterrows().
Here's some sample data, expanded a bit from OP to include multiple key values:
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
7 8 94 1
8 9 94 1
9 10 38 1
import pandas as pd
df = pd.read_clipboard()
Based on these data, and the counting criteria defined by OP, we expect the output to be:
key dist window
ID
1 57 1 0
2 22 1 0
3 12 1 0
4 45 1 0
5 94 1 0
6 36 1 0
7 38 1 0
8 94 1 1
9 94 1 2
10 38 1 1
Using OP's approach:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
print('old solution: ')
%timeit features_wind2(df)
old solution:
10 loops, best of 3: 25.6 ms per loop
Using apply():
def compute_window(row):
    # when using apply(), .name gives the row index
    # pandas label indexing is inclusive, so take index - 1 as cut_idx
    cut_idx = row.name - 1
    key = row.key
    # count the number of times key appears in df prior to this row
    return sum(df.loc[:cut_idx, 'key'] == key)
print('new solution: ')
%timeit df['window1'] = df.apply(compute_window, axis='columns')
new solution:
100 loops, best of 3: 3.71 ms per loop
Note that with millions of records this will still take a while, and the relative performance gains will likely be somewhat diminished compared to this small test case.
UPDATE
Here's an even faster solution, using groupby() and cumsum(). I made some sample data that seems roughly in line with the provided example, but with 10 million rows. The computation finishes in well under a second, on average:
# sample data
import numpy as np
import pandas as pd
N = int(1e7)
idx = np.arange(N)
keys = np.random.randint(1,100,size=N)
dists = np.ones(N).astype(int)
df = pd.DataFrame({'ID':idx,'key':keys,'dist':dists})
df = df.set_index('ID')
Now performance testing:
%timeit df['window'] = df.groupby('key').cumsum().subtract(1)
1 loop, best of 3: 755 ms per loop
Here's enough output to show that the computation is working:
dist key window
ID
0 1 83 0
1 1 4 0
2 1 87 0
3 1 66 0
4 1 31 0
5 1 33 0
6 1 1 0
7 1 77 0
8 1 49 0
9 1 49 1
10 1 97 0
11 1 36 0
12 1 19 0
13 1 75 0
14 1 4 1
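As an aside: because dist is always 1 in this data, the per-key cumulative sum minus one is just the number of earlier rows with the same key. Under that assumption, a roughly equivalent sketch using groupby().cumcount(), which numbers each row within its key group starting from zero, would be:
df['window'] = df.groupby('key').cumcount()
Like the cumsum() version, this ignores the 200-row look-back limit from the original question.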
Note: To revert ID from index to column, use df.reset_index() at the end.
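For example, a minimal sketch assuming the frame above:
df = df.reset_index()  # ID moves back from the index to a regular column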

Related

Sample from dataframe with conditions

I have a large dataset and I want to sample from it, but with a condition. What I need is a new dataframe in which the counts of the two values of a boolean target column (0 and 1) are roughly the same.
What I have:
df['target'].value_counts()
0 = 4000
1 = 120000
What I need:
new_df['target'].value_counts()
0 = 4000
1 = 6000
I know I can use df.sample, but I don't know how to apply the condition.
Thanks
Since pandas 1.1.0, you can use groupby.sample if you need the same number of rows for each group:
df.groupby('target').sample(4000)
Demo:
df = pd.DataFrame({'x': [0] * 10 + [1] * 25})
df.groupby('x').sample(5)
x
8 0
6 0
7 0
2 0
9 0
18 1
33 1
24 1
32 1
15 1
If you need to sample conditionally based on the group value, you can do:
df.groupby('target', group_keys=False).apply(
    lambda g: g.sample(4000 if g.name == 0 else 6000)
)
Demo:
df.groupby('x', group_keys=False).apply(
    lambda g: g.sample(4 if g.name == 0 else 6)
)
x
7 0
8 0
2 0
1 0
18 1
12 1
17 1
22 1
30 1
28 1
Assuming the following input and using the values 4/6 instead of 4000/6000:
df = pd.DataFrame({'target': [0,1,1,1,0,1,1,1,0,1,1,1,0,1,1,1]})
You could group by your target and sample to take at most N values per group:
df.groupby('target', group_keys=False).apply(lambda g: g.sample(min(len(g), 6)))
example output:
target
4 0
0 0
8 0
12 0
10 1
14 1
1 1
7 1
11 1
13 1
If you want the same number of rows per group, you can simply use df.groupby('target').sample(n=4).
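A quick sketch on the same toy frame, assuming four rows per group is what you want (the name balanced is just illustrative):
df = pd.DataFrame({'target': [0,1,1,1,0,1,1,1,0,1,1,1,0,1,1,1]})
balanced = df.groupby('target').sample(n=4)  # 4 random rows with target 0 and 4 with target 1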

How to assign running values to each column with for loops in Pandas?

I have two dataframes, both with the same shape.
dfA
2008LG 2007LG 2006LG 2005LG
0 44 65 30 20
1 10 16 56 70
2 65 30 20 122
3 0.0 0.00 679 158
4 0.0 0.00 30 20
dfB
2008Net 2007Net 2006Net 2005Net
0 0 0 0 452
1 0 0 0 365
2 0 0 0 778
3 0 0 0 78
4 0 0 0 60
The calculation logic is: for each row of dfB, start from the last column, 2005Net, compute 2005LG - 2005Net, and assign the result to the neighbouring column, 2006Net.
For example, the first iteration computes 2005LG - 2005Net = 20 - 452 = -432 and assigns -432 to 2006Net; the second iteration then computes 2006LG - 2006Net = 30 - (-432) = 462 and assigns it to 2007Net.
Below is my code, but it is not cutting it. What exactly is wrong here?
import pandas as pd
import numpy as np
from tqdm import tqdm
for index in tqdm(range(dfA.shape[0])):
    for col_index in reversed(range(4)):
        the_value = 0
        the_value = dfA[dfA.columns[col_index]].iloc[index] - dfB[dfB.columns[col_index]].iloc[index]
        dfB[dfB.columns[col_index-1]].iloc[index] = the_value
Try something like this.
for index in reversed(range(1, 4)):
    dfB.iloc[:, index - 1] = dfA.iloc[:, index] - dfB.iloc[:, index]
Because dfB is updated in place, each step picks up the value written by the previous one. This assumes that the columns of dfA and dfB line up and have the same length.
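To make that concrete, here is a hedged sketch that rebuilds the toy frames by hand from the tables in the question and runs the corrected loop:
import pandas as pd

dfA = pd.DataFrame({'2008LG': [44, 10, 65, 0.0, 0.0],
                    '2007LG': [65, 16, 30, 0.00, 0.00],
                    '2006LG': [30, 56, 20, 679, 30],
                    '2005LG': [20, 70, 122, 158, 20]})
dfB = pd.DataFrame({'2008Net': [0, 0, 0, 0, 0],
                    '2007Net': [0, 0, 0, 0, 0],
                    '2006Net': [0, 0, 0, 0, 0],
                    '2005Net': [452, 365, 778, 78, 60]}, dtype=float)

# walk the columns right to left; each result feeds the column to its left
for index in reversed(range(1, dfB.shape[1])):
    dfB.iloc[:, index - 1] = dfA.iloc[:, index] - dfB.iloc[:, index]

# for the first row this follows the worked example:
# 20 - 452 = -432 -> 2006Net, 30 - (-432) = 462 -> 2007Net, 65 - 462 = -397 -> 2008Net
print(dfB)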

Count various entries in a DataFrame

I want to find out how many different devices are in this list.
Is this SQL statement sufficient for that, or do I have to do more?
Unfortunately, with such a large amount of data I do not know which method is right and whether this solution is correct.
Some devices appear more than once, i.e. the number of lines is not equal to the number of devices.
Suggestions in Python or in SQL are welcome.
import pandas as pd
from sqlalchemy import create_engine # database connection
from IPython.display import display
disk_engine = create_engine('sqlite:///gender-train-devices.db')
phones = pd.read_sql_query('SELECT device_id, COUNT(device_id) FROM phone_brand_device_model GROUP BY [device_id]', disk_engine)
print(phones)
The output is:
device_id COUNT(device_id)
0 -9223321966609553846 1
1 -9223067244542181226 1
2 -9223042152723782980 1
3 -9222956879900151005 1
4 -9222896629442493034 1
5 -9222894989445037972 1
6 -9222894319703307262 1
7 -9222754701995937853 1
8 -9222661944218806987 1
9 -9222399302879214035 1
10 -9222352239947207574 1
11 -9222173362545970626 1
12 -9221825537663503111 1
13 -9221768839350705746 1
14 -9221767098072603291 1
15 -9221674814957667064 1
16 -9221639938103564513 1
17 -9221554258551357785 1
18 -9221307795397202665 1
19 -9221086586254644858 1
20 -9221079146476055829 1
21 -9221066489596332354 1
22 -9221046405740900422 1
23 -9221026417907250887 1
24 -9221015678978880842 1
25 -9220961720447724253 1
26 -9220830859283101130 1
27 -9220733369151052329 1
28 -9220727250496861488 1
29 -9220452176650064280 1
... ... ...
186686 9219686542557325817 1
186687 9219842210460037807 1
186688 9219926280825642237 1
186689 9219937375310355234 1
186690 9219958455132520777 1
186691 9220025918063413114 1
186692 9220160557900894171 1
186693 9220562120895859549 1
186694 9220807070557263555 1
186695 9220814716773471568 1
186696 9220880169487906579 1
186697 9220914901466458680 1
186698 9221114774124234731 1
186699 9221149157342105139 1
186700 9221152396628736959 1
186701 9221297143137682579 1
186702 9221586026451102237 1
186703 9221608286127666096 1
186704 9221693095468078153 1
186705 9221768426357971629 1
186706 9221843411551060582 1
186707 9222110179000857683 1
186708 9222172248989688166 1
186709 9222214407720961524 1
186710 9222355582733155698 1
186711 9222539910510672930 1
186712 9222779211060772275 1
186713 9222784289318287993 1
186714 9222849349208140841 1
186715 9223069070668353002 1
[186716 rows x 2 columns]
If you want the number of different devices, you can just query the database:
SELECT COUNT(distinct device_id)
FROM phone_brand_device_model ;
Of course, if you already have the data in a data frame for some other purpose you can count the number of rows there.
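For completeness, a small sketch of running that count directly through the disk_engine connection from the question (the n_devices alias is just an illustrative name):
n_devices = pd.read_sql_query(
    'SELECT COUNT(DISTINCT device_id) AS n_devices FROM phone_brand_device_model',
    disk_engine)
print(n_devices)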
If you already have data in memory as a dataframe, you can use:
df['device_id'].nunique()
Otherwise use Gordon's solution; it should be faster.
If you want to do it in pandas, you can do something like:
len(phones.device_id.unique())

In Pandas, how to operate between columns with maximum performance

I have the following df:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
4 9 2 64 32 343
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200
What I'm trying to do is:
For every 'clienthostid', look for the 'usersidid' with the highest 'LoginDaysSum'; then check whether there is a usersidid that has the highest LoginDaysSum in two or more different clienthostid groups (for instance, usersidid = 9 has the highest LoginDaysSum in clienthostid 1, 2 and 3, in rows 0, 4 and 7 respectively).
In that case, I want to keep the row with the higher LoginDaysSum (in the example it would be the row with 1728); let's call it maxRT.
I then want to calculate the ratio of LoginDaysSumLast7Days between maxRT and each of the other rows (in the example, the rows with index 4 and 7).
If the ratio is below 0.8, I want to drop the row:
index 4- LoginDaysSumLast7Days_ratio = 7/32 < 0.8 //row will drop!
index 7- LoginDaysSumLast7Days_ratio = 7/3 > 0.8 //row will stay!
The same condition is also applied to LoginDaysSumLastMonth.
So for the example the result will be:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200
Now here's the snag: performance is critical.
I tried to implement it using .apply, but not only could I not make it work right, it also ran way too slowly :(
My code so far (forgive me if it's written terribly wrong; I only started working with SQL, Pandas and Python for the first time last week, and everything I learned is from examples I found here ^_^):
df_client_Logindayssum_pairs = df.merge(df.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].max(),df, how='inner', on=['clienthostid', 'LoginDaysSum'])
UsersWithMoreThan1client = df_client_Logindayssum_pairs.groupby(['usersidid'], as_index=False, sort=False)['LoginDaysSum'].count().rename(columns={'LoginDaysSum': 'NumOfClientsPerUesr'})
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.NumOfClientsPerUesr >= 2]
UsersWithMoreThan1client = df_client_Logindayssum_pairs[df_client_Logindayssum_pairs.usersidid.isin(UsersWithMoreThan1Device.loc[:, 'usersidid'])].reset_index(drop=True)
UsersWithMoreThan1client = UsersWithMoreThan1client.sort_values(['clienthostid', 'LoginDaysSum'], ascending=[True, False], inplace=True)
UsersWithMoreThan1client = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSumLast7Days'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.ratio > 0.8]
UsersWithMoreThan1client = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSumLastMonth'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio2')
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.ratio2 > 0.8]
Would very much appreciate any suggestions on how to do it
Thank you
I believe this is what you need:
# Put the index as a regular column
data = data.reset_index()
# Find the greatest LoginDaysSum for each clienthostid
agg1 = data.sort_values(by='LoginDaysSum', ascending=False).groupby(['clienthostid']).first()
# Collect the greatest LoginDaysSum for each usersidid
agg2 = agg1.sort_values(by='LoginDaysSum', ascending=False).groupby('usersidid').first()
# Join both previous aggregations
joined = agg1.set_index('usersidid').join(agg2, rsuffix='_max')
# Compute ratios
joined['LoginDaysSumLast7Days_ratio'] = joined['LoginDaysSumLast7Days_max'] / joined['LoginDaysSumLast7Days']
joined['LoginDaysSumLastMonth_ratio'] = joined['LoginDaysSumLastMonth_max'] / joined['LoginDaysSumLastMonth']
# Select index values that do not meet the required criteria
rem_idx = joined[(joined['LoginDaysSumLast7Days_ratio'] < 0.8) | (joined['LoginDaysSumLastMonth_ratio'] < 0.8)]['index']
# Restore index and remove the selected rows
data = data.set_index('index').drop(rem_idx)
The result in data is:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
index
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200

Pandas: Can pandas groupby filter work on original object?

Starting with this question as a base:
Python Pandas: remove entries based on the number of occurrences
import pandas
data = pandas.DataFrame(
    {'pid' : [1,1,1,2,2,3,3,3],
     'tag' : [23,45,62,24,45,34,25,62],
    })
# pid tag
# 0 1 23
# 1 1 45
# 2 1 62
# 3 2 24
# 4 2 45
# 5 3 34
# 6 3 25
# 7 3 62
g = data.groupby('tag')
g.filter(lambda x: len(x) > 1)  # keeps groups that occur more than once
# pid tag
# 1 1 45
# 2 1 62
# 4 2 45
# 7 3 62
#This would create a new object g:
g = g.filter(lambda x: len(x) > 1) #where g is now a dataframe.
I was wondering: is there a way to filter out 'groups' by deleting
them from the original object g? And would that be faster than creating a new object from the filtered groupby?
There are only so many ways you can solve this problem. My answer includes 4 solutions. I am sure there are other ways; maybe other answers will present a better one.
Solution #1:
data = data.groupby('tag').filter(lambda x: len(x) > 1)
pid tag
1 1 45
2 1 62
4 2 45
7 3 62
Solution #2:
data['count'] = data.groupby(['tag']).transform('count')
data.loc[data['count'] == 2]
pid tag count
1 1 45 2
2 1 62 2
4 2 45 2
7 3 62 2
Solution #3:
If you want to delete the rows instead, you can use .index.tolist() and then drop().
data['count'] = data.groupby(['tag']).transform('count')
data.drop(data[data['count'] != 2].index.tolist())
pid tag count
1 1 45 2
2 1 62 2
4 2 45 2
7 3 62 2
Solution #4:
data['count'] = data.groupby(['tag']).transform('count')
g = data.groupby('count')
data.loc[g.groups[2],('tag','pid')]
tag pid
1 45 1
2 62 1
4 45 2
7 62 3
A couple of options (yours is at the bottom):
This first one is in place and as quick as I could make it. It's a bit quicker than your solution, but not by virtue of dropping rows in place. I can get even better performance with the second option, which does not modify the frame in place.
%%timeit
data = pd.DataFrame(
    {'pid' : [1,1,1,2,2,3,3,3],
     'tag' : [23,45,62,24,45,34,25,62],
    })
mask = ~data.duplicated(subset=['tag'], keep=False)
data.drop(mask[mask].index, inplace=True)
data
1000 loops, best of 3: 1.16 ms per loop
%%timeit
data = pd.DataFrame(
    {'pid' : [1,1,1,2,2,3,3,3],
     'tag' : [23,45,62,24,45,34,25,62],
    })
data = data.loc[data.duplicated(subset=['tag'], keep=False)]
data
1000 loops, best of 3: 719 µs per loop
%%timeit
data = pd.DataFrame(
    {'pid' : [1,1,1,2,2,3,3,3],
     'tag' : [23,45,62,24,45,34,25,62],
    })
g = data.groupby('tag')
g = g.filter(lambda x: len(x) > 1)
g
1000 loops, best of 3: 1.55 ms per loop
