I have a pandas DataFrame which has around 200 million rows and looks like this:
UserID MovieID Rating
1 455 5
2 411 4
1 288 2
2 300 3
2 137 5
1 300 3
...
I want to get top N movies for each user sorted by rating in descending order, so for N=2 the output should look like this:
UserID MovieID Rating
1 455 5
1 300 3
2 137 5
2 411 4
When I try to do it like this, I get a MemoryError from the groupby (I have 8 GB of RAM on my machine):
df.sort_values(by=['Rating'], ascending=False).groupby('UserID').head(2)
Any suggestions?
Quick and dirty answer
Given that the sort works, you may be able to squeak by with the following, which uses a NumPy-based, memory-efficient alternative to the pandas groupby:

import io
import numpy as np
import pandas as pd

d = '''UserID MovieID Rating
1 455 5
2 411 4
3 207 5
1 288 2
3 69 2
2 300 3
3 410 4
3 108 3
2 137 5
3 308 3
1 300 3'''
df = pd.read_csv(io.StringIO(d), sep=r'\s+', index_col='UserID')
df = df.sort_values(['UserID', 'Rating'])

# carefully handle the construction of ix to ensure no copies are made
ix = np.zeros(df.shape[0], np.int8)
np.subtract(df.index.values[1:], df.index.values[:-1], out=ix[:-1])
# the above assumes that UserID is the index of df. If it's just a column, use this instead:
# np.subtract(df['UserID'].values[1:], df['UserID'].values[:-1], out=ix[:-1])
ix[:-1] += ix[1:]
ix[-2:] = 1
ix = ix.view(bool)

print(df.iloc[ix])
Output:
MovieID Rating
UserID
1 300 3
1 455 5
2 411 4
2 137 5
3 410 4
3 207 5
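For readers puzzling over the int8 arithmetic above, the same boundary trick can be sketched with an explicit boolean mask (the ID array below is a made-up stand-in for the sorted UserID index):

```python
import numpy as np

# Hypothetical sorted group IDs, standing in for df.index.values after the sort
ids = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3])

last = np.empty(ids.shape[0], dtype=bool)
last[:-1] = ids[1:] != ids[:-1]  # True where a row is the last of its group
last[-1] = True                  # the final row always closes a group

keep = last.copy()
keep[:-1] |= last[1:]            # also keep the row just before each group's last

# keep now marks the last two rows of every group, i.e. the top 2 per group
# after an ascending sort by rating
```

The int8 version in the answer computes the same mask in place, avoiding the extra temporaries this sketch allocates.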
More memory efficient answer
Instead of a pandas DataFrame, for data this big you should work directly with NumPy arrays (which pandas uses to store data under the hood). If you use an appropriate structured array, you should be able to fit all of your data into a single array roughly of size:

2 * 10**8 rows * (4 + 2 + 1) bytes/row = 1,400,000,000 bytes

or ~1.304 GiB, which means that it (and a couple of temporaries for calculations) should easily fit into your 8 GB of system memory.

Here are some details:
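You can verify the per-row size with NumPy itself; a quick sketch using the dtype defined later in this answer:

```python
import numpy as np

dfdtype = np.dtype([('UserID', np.uint32), ('MovieID', np.uint16), ('Rating', np.uint8)])
print(dfdtype.itemsize)              # 7 bytes per row (numpy packs fields without padding by default)
print(2 * 10**8 * dfdtype.itemsize)  # 1400000000 bytes total
```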
The trickiest part will be initializing the structured array. You may be able to get away with manually initializing the array and then copying the data over:
dfdtype = np.dtype([('UserID', np.uint32), ('MovieID', np.uint16), ('Rating', np.uint8)])

arr = np.empty(df.shape[0], dtype=dfdtype)
arr['UserID'] = df.index.values
for n in dfdtype.names[1:]:
    arr[n] = df[n].values
If the above causes an out-of-memory error, then from the start of your program you'll have to build and populate a structured array instead of a DataFrame:

arr = np.empty(rowcount, dtype=dfdtype)
...
# adapt the code you use to populate the df and put it here
...
Once you have arr, here's how you'd do the groupby you're aiming for:
arr.sort(order=['UserID', 'Rating'])
ix = np.zeros(arr.shape[0], np.int8)
np.subtract(arr['UserID'][1:], arr['UserID'][:-1], out=ix[:-1])
ix[:-1] += ix[1:]
ix[-2:] = 1
ix = ix.view(bool)
print(arr[ix])
The above size calculation and dtype assume that no UserID is larger than 4,294,967,295, no MovieID is larger than 65,535, and no Rating is larger than 255. This means that the columns of your DataFrame can be (np.uint32, np.uint16, np.uint8) without losing any data.
If you want to keep working with pandas, you can divide your data into batches of, say, 10K rows at a time. You can split the data after loading the source data into the DataFrame, or, even better, load the data in parts.

You can save the results of each iteration (batch) into a dictionary, keeping only the number of movies you're interested in:

{userID: {MovieID_1: score1, MovieID_2: s2, ... MovieID_N: sN}, ...}

and update the nested dictionary on each iteration, keeping only the best N movies per user.

This way you'll be able to analyze data much larger than your computer's memory.
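As a rough sketch of that idea (the inline CSV and the tiny chunk size of 2 are just stand-ins for your real source file and a realistic batch size):

```python
import io
import pandas as pd

# Stand-in for the real 200M-row source file
csv = io.StringIO('''UserID,MovieID,Rating
1,455,5
2,411,4
1,288,2
2,300,3
2,137,5
1,300,3''')

N = 2
top = {}  # userID -> list of (rating, movieID), at most N entries, best first
for chunk in pd.read_csv(csv, chunksize=2):
    for uid, mid, r in chunk.itertuples(index=False):
        best = top.setdefault(uid, [])
        best.append((r, mid))
        best.sort(reverse=True)
        del best[N:]  # trim to the best N so memory stays bounded
```

Only the running top-N per user is ever held in memory, regardless of how large the source file is.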
Related
I have a DataFrame in Python pandas like below:

ID   TG
111  0
222  0
333  1
444  1
555  0
...  ...

The above DataFrame has 5,000,000 rows, with:

99.40% -> 0
0.60%  -> 1
And I need to randomly select a sample of this data so that 5% of the TG column is '1'. As a result, I need a DataFrame where 5% of the observations are '1' and the rest (95% '0') are randomly selected.
For example, I need 200,000 observations from my dataset where 5% are 1 and the rest are 0.
How can I do that in Python pandas?
I'm sure there is a more performant way, but maybe this works using .sample? Based on a dataset of 5,000 rows:

zeros = df.query("TG.eq(0)")
n_ones = int(round(0.05 * len(zeros)))
ones = df.query("TG.ne(0)").sample(n=n_ones)
df = pd.concat([ones, zeros]).reset_index(drop=True)
print(df["TG"].value_counts())

0    4719
1     236

(Note that n_ones is 5% of the zeros alone, so the ones end up as roughly 4.76% of the combined frame; use len(zeros) * 5 / 95 if you need exactly 5%.)
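If you need to hit an exact sample size and class share, a variant is to sample each class separately; a sketch on synthetic data (the column name TG matches the question, everything else here is made up):

```python
import numpy as np
import pandas as pd

# Synthetic frame with a rare positive class, standing in for the real data
rng = np.random.default_rng(0)
df = pd.DataFrame({'TG': (rng.random(100_000) < 0.01).astype(int)})

target, share_ones = 10_000, 0.05
n_ones = int(target * share_ones)   # 500
n_zeros = target - n_ones           # 9500

sample = pd.concat([
    df[df['TG'].eq(1)].sample(n=n_ones, random_state=0),
    df[df['TG'].eq(0)].sample(n=n_zeros, random_state=0),
]).sample(frac=1, random_state=0)   # shuffle the combined frame
```

Because each class is sampled with an explicit n, the result has exactly the requested size and share.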
I have a lot of data that I'm trying to do some basic machine learning on, kind of like the Titanic example that predicts whether a passenger survived or died (I learned this in an intro Python class) based on factors like their gender, age, fare class...
What I'm trying to predict is whether a screw fails depending on how it was made (referred to as Lot). The engineers just listed how many times a failure occurred. Here's how it's formatted.
Lot  Failed?
100  3
110  0
120  1
130  4
The values in the cells are the number of occurrences, so for example:
Lot 100 had three screws that failed
Lot 110 had 0 screws that failed
Lot 120 had one screw that failed
Lot 130 had four screws that failed
I plan on doing a logistic regression using scikit-learn, but first I need each row to be listed as a failure or not. What I'd like to see is a row for every observation, and have them listed as either a 0 (did not occur) or 1 (did occur). Here's what it'd look like after
Lot  Failed?
100  1
100  1
100  1
110  0
120  1
130  1
130  1
130  1
130  1
Here's what I've tried and what I've gotten
df = pd.DataFrame({
    'Lot': ['100', '110', '120', '130'],
    'Failed?': [3, 0, 1, 4]
})
df.loc[df.index.repeat(df['Failed?'])].reset_index(drop=True)
When I do this it repeats the rows but keeps the same values in the Failed? column.
Lot  Failed?
100  3
100  3
100  3
110  0
120  1
130  4
130  4
130  4
130  4
Any ideas? Thank you!
You can use Index.repeat with reindex, but first you need to set aside the rows whose count is 0, since repeating them 0 times would drop them:

s = df[df['Failed?'].eq(0)]  # save the rows with value 0, since repeat would exclude them
df = df.reindex(df.index.repeat(df['Failed?']))  # repeat each row by its 'Failed?' count
df['Failed?'] = 1  # set all remaining values to 1
df = pd.concat([df, s]).sort_index()  # bring back the saved 0 rows and restore order
df

# The above code as a one-liner:
(pd.concat([df.reindex(df.index.repeat(df['Failed?'])).assign(**{'Failed?': 1}),
            df[df['Failed?'].eq(0)]])
   .sort_index())
Out[1]:
Lot Failed?
0 100 1
0 100 1
0 100 1
1 110 0
2 120 1
3 130 1
3 130 1
3 130 1
3 130 1
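A small variant of the same idea, as a sketch: clip the repeat counts to a minimum of 1 so the zero-failure lots survive the repeat, then binarize the column afterwards:

```python
import pandas as pd

df = pd.DataFrame({'Lot': ['100', '110', '120', '130'], 'Failed?': [3, 0, 1, 4]})

reps = df['Failed?'].clip(lower=1)                 # 0 -> 1 so the row is kept exactly once
out = df.loc[df.index.repeat(reps)].reset_index(drop=True)
out['Failed?'] = (out['Failed?'] > 0).astype(int)  # binarize: any failure -> 1
```

This avoids the concat/sort round-trip at the cost of a clipped helper column.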
The below will give you failure-or-not directly, but I suppose you are better served by the other answer:

df.loc[df['Failed?'] > 0, 'Failed?'] = 1

Just as a comment: this is a bit of a strange data transformation; you might want to just keep a numerical target variable.
I need to create a dataframe containing the Manhattan distance between two dataframes with the same columns, where the index of each dataframe supplies the row and column labels of the result. For example, let's say I have these two dataframes:
x_train :
index a b c
11 2 5 7
23 4 2 0
312 2 2 2
x_test :
index a b c
22 1 1 1
30 2 0 0
so the columns match but the sizes and indexes do not. The expected dataframe would look like this:
dist_dataframe:
index 11 23 312
22 11 5 3
30 12 4 4
and what I have right now is this:
def manhattan_distance(a, b):
    return sum(abs(e1 - e2) for e1, e2 in zip(a, b))

def calc_distance(X_test, X_train):
    dist_dataframe = pd.DataFrame(index=X_test.index, columns=X_train.index)
    for i in X_train.index:
        for j in X_test.index:
            dist_dataframe.loc[i,j] = manhattan_distance(X_train.loc[[i]], X_test.loc[[j]])
    return dist_dataframe
what I get from the code I have is this dataframe:
dist_dataframe:
index
index 11 23 312
22 NaN NaN NaN
30 NaN NaN NaN
I get the right dataframe size, except that it has two header rows named index left over from the creation of the new dataframe, and I also get an error no matter what I do in the Manhattan-distance line. Can anyone help me out here, please?
Problem in your code
There is a very small problem in your code, i.e. the index order used when accessing values in dist_dataframe. The frame was created with index=X_test.index and columns=X_train.index, so instead of dist_dataframe.loc[i,j] you should reverse the order of i and j and make it dist_dataframe.loc[j,i]. (Also note that X_train.loc[[i]] returns a one-row DataFrame, and zipping two DataFrames iterates their column labels rather than their values; pass the rows as Series, i.e. X_train.loc[i] and X_test.loc[j], so the distance is computed over the numbers.)
More efficient solution
That will work fine, but since you are a new contributor, I would also like to point out the efficiency of your code. Always try to replace explicit loops with pandas built-in functions; since they are written in C, they are much faster. Here is a more efficient solution:
def manhattan_distance(a, b):
    return sum(abs(e1 - e2) for e1, e2 in zip(a, b))

def xtrain_distance(row):
    distances = {}
    for i, each in x_train.iterrows():
        distances[i] = manhattan_distance(each, row)
    return distances

result = x_test.apply(xtrain_distance, axis=1)
# converting into a dataframe
pd.DataFrame(dict(result)).transpose()
It produces the same output on your example, with no visible time difference. But when run on a larger size (the same data scaled up 20 times, i.e. 60 x_train samples and 40 x_test samples), here is the time difference:

Your solution took: 929 ms
This solution took: 207 ms

It got ~4x faster just by eliminating one for loop. It can be made more efficient still, but for the sake of demonstration I have used this solution.
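For completeness, the remaining loop can be eliminated entirely with NumPy broadcasting: one subtraction computes every test/train pair at once. A sketch using the frames from the question:

```python
import numpy as np
import pandas as pd

x_train = pd.DataFrame([[2, 5, 7], [4, 2, 0], [2, 2, 2]],
                       index=[11, 23, 312], columns=['a', 'b', 'c'])
x_test = pd.DataFrame([[1, 1, 1], [2, 0, 0]],
                      index=[22, 30], columns=['a', 'b', 'c'])

# Shape (n_test, 1, n_cols) minus shape (1, n_train, n_cols) broadcasts to
# (n_test, n_train, n_cols); summing |diff| over the last axis gives L1 distances
diff = x_test.to_numpy()[:, None, :] - x_train.to_numpy()[None, :, :]
dist = pd.DataFrame(np.abs(diff).sum(axis=2),
                    index=x_test.index, columns=x_train.index)
```

The intermediate array is n_test x n_train x n_cols, so this trades memory for speed; for very large frames a chunked version of the same broadcast would be needed.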
My dataframe
userID storeID rating
0 1 662 3.6
1 2 665 3.4
2 3 678 4.0
3 4 500 3.1
4 5 421 2.9
n_users = df.userID.unique().shape[0]
n_stores = df.storeID.unique().shape[0]
I have 2 problems. If I want to build my training dataset like this:
ratings = np.zeros((n_users, n_stores))
for row in df.itertuples():
ratings[row[1]-1, row[2]-1] = row[3]
I get an IndexError like this:
IndexError: index 850 is out of bounds for axis 1 with size 786
From what I can tell, you're trying to make a 2-dimensional array of floats, each representing a rating, indexed by the user ID in the first axis and the store ID in the second axis.
You're creating an array of shape (n_users, n_stores), where n_users and n_stores are the number of unique users and stores respectively. When indexing this array,
for row in df.itertuples():
ratings[row[1]-1, row[2]-1] = row[3]
you're using the user/store ID directly (shifted by 1) as an index. This only works if you know that all user/store IDs range from 1 to the number of unique users/stores, without any gaps in between. For example, given the snippet of dataframe you have shown, there are 5 unique users and 5 unique stores, but even if I make a 5 by 5 array, I won't be able to index the second axis (store ID) directly, since the values of store ID are [662, 665, 678, 500, 421], but it can only be indexed by [0, 1, 2, 3, 4].
The IndexError that you get is happening in axis 1 (i.e. the second axis, the one for the store IDs) with index value 850. That means that your store numbers are not contiguous from 1 to 786 (the number of unique store IDs), but rather they are just "individual" integers with gaps in between, since there is a store with ID 850.
What you're looking for is more like a dictionary: an arbitrary mapping between keys and values, in which the indices (keys) don't have to be contiguous, like for an array. Specifically, I think whatever you're trying to do will be much easier by getting a ratings series indexed by a MultiIndex of userID and storeID:
>>> indexed_df = df.set_index(['userID', 'storeID'])
>>> indexed_df
rating
userID storeID
1 662 3.6
2 665 3.4
3 678 4.0
4 500 3.1
5 421 2.9
>>> ratings = indexed_df['rating']
>>> ratings
userID storeID
1 662 3.6
2 665 3.4
3 678 4.0
4 500 3.1
5 421 2.9
Name: rating, dtype: float64
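A quick sketch of why this shape is convenient: lookups work per (user, store) pair, and missing pairs raise a KeyError instead of occupying space in a dense matrix (data taken from the question):

```python
import pandas as pd

df = pd.DataFrame({'userID': [1, 2, 3],
                   'storeID': [662, 665, 678],
                   'rating': [3.6, 3.4, 4.0]})
ratings = df.set_index(['userID', 'storeID'])['rating']

print(ratings.loc[(1, 662)])  # 3.6
```

Only the pairs that actually occur are stored, so sparse, non-contiguous IDs cost nothing.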
In my DataFrame I have one column with numeric values, let's say distance. I want to find out which group (range) of distances has the biggest number of records (rows).
Doing a simple:
df.distance.value_counts() returns:
74 1
90 1
94 1
893 1
889 1
885 1
877 1
833 1
122 1
545 1
What I want to achieve is something like buckets from histogram, so I am expecting output like this:
900 4 #all values < 900 and > 850
100 3
150 1
550 1
850 1
The one approach I've figured out so far, though I don't think it is the best or most optimal one, is to find the max and min values, divide the range by my step (50 in this case), and then loop over all the values, assigning each to the appropriate group.
Is there any other, better approach for that?
I'd suggest doing the following, assuming your value column is labeled val:

import numpy as np
df['bin'] = df['val'].apply(lambda x: 50 * np.floor(x / 50))

Then the counts per bin are given by:

df.groupby('bin')['val'].count()
Thanks to EdChum's suggestion, and based on this example, I've figured out that the best way (at least for me) is to do something like this:

import numpy as np
import pandas as pd

step = 50
# ...
max_val = df.distance.max()
bins = list(range(0, int(np.ceil(max_val / step)) * step + step, step))
clusters = pd.cut(df.distance, bins, labels=bins[1:])
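To get from the clusters to the requested counts per bucket, value_counts finishes the job; a sketch using the ten distances listed in the question:

```python
import numpy as np
import pandas as pd

distance = pd.Series([74, 90, 94, 122, 545, 833, 877, 885, 889, 893])
step = 50

max_val = distance.max()
bins = list(range(0, int(np.ceil(max_val / step)) * step + step, step))
clusters = pd.cut(distance, bins, labels=bins[1:])

counts = clusters.value_counts()  # sorted by frequency, biggest bucket first
```

Each label is the upper edge of its bucket, so the (850, 900] bucket appears as 900 with 4 records, matching the expected output above.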