Grouping numerical values in pandas - python

In my Dataframe I have one column with numeric values, let say - distance. I want to find out which group of distance (range) have the biggest number of records (rows).
Doing simple:
df.distance.count_values() returns:
74 1
90 1
94 1
893 1
889 1
885 1
877 1
833 1
122 1
545 1
What I want to achieve is something like buckets from histogram, so I am expecting output like this:
900 4 #all values < 900 and > 850
100 3
150 1
550 1
850 1
The one approach I've figured out so far, but I don't think is the best and most optimal one is just find max and min values, divide by my step (50 in this case) and then do loop checking all the values and assigning to appropriate group.
Is there any other, better approach for that?

I'd suggest doing the following, assuming your value column is labeled val
import numpy as np
df['bin'] = df['val'].apply(lambda x: 50*np.floor(x/50))
The result is the following:
df.groupby('bin')['val'].count()

Thanks to EdChum suggestion and based on this example I've figured out, the best way (at least for me) is to do something like this:
import numpy as np
step = 50
#...
max_val = df.distance.max()
bins = list(range(0,int(np.ceil(max_val/step))*step+step,step))
clusters = pd.cut(df.distance,bins,labels=bins[1:])

Related

How to randomly select a sample of data according to specified proportions in Python Pandas?

I have DataFrame in Python Pandas like below:
ID
TG
111
0
222
0
333
1
444
1
555
0
...
...
Above DataFrame has 5 000 000 rows, with:
99.40 % -> 0
0.60% -> 1
And I need to randomly select sample of this data, so as to have 5% of '1' in column TG.
So as a result I need to have DataFrame with observations where 5% are '1', and rest (95% of '0') randomly selected.
For example I need 200 000 observations from my dataset where 5% will be 1 and rest 0
How can I do that in Python Pandas ?
I'm sure there is some more performant way but maybe this works using .sample? Based on a dataset of 5_000 rows.
zeros = df.query("TG.eq(0)")
frac = int(round(.05 * len(zeros), 0))
ones = df.query("TG.ne(0)").sample(n=frac)
df = pd.concat([ones, zeros]).reset_index(drop=True)
print(df["TG"].value_counts())
0 4719
1 236

Group by a category

I have done KMeans clusters and now I need to analyse each individual cluster. For example look at cluster 1 and see what clients are on it and make conclusions.
dfRFM['idcluster'] = num_cluster
dfRFM.head()
idcliente Recencia Frecuencia Monetario idcluster
1 3 251 44 -90.11 0
2 8 1011 44 87786.44 2
6 88 537 36 8589.57 0
7 98 505 2 -179.00 0
9 156 11 15 35259.50 0
How do I group so I only see results from lets say idcluster 0 and sort by lets say "Monetario". Thanks!
To filter a dataframe, the most common way is to use df[df[colname] == val] Then you can use df.sort_values()
In your case, that would look like this:
dfRFM_id0 = dfRFM[dfRFM['idcluster']==0].sort_values('Monetario')
The way this filtering works is that dfRFM['idcluster']==0 returns a series of True/False based on if it is, well, true or false. So then we have a sort of dfRFM[(True,False,True,True...)], and so the dataframe returns only the rows where we have a True. That is, filtering/selecting the data where the condition is true.
edit: add 'the way this works...'
I think you actually just need to filter your DF!
df_new = dfRFM[dfRFM.idcluster == 0]
and then sort by Montario
df_new = df_new.sort_values(by = 'Monetario')
Group by is really best for when you're wanting to look at the cluster as a whole - for example, if you wanted to see the average values for Recencia, Frecuencia, and Monetario for all of Group 0.

Math opeartion across Dataframes

I have two dataframes, let's call them df A and df B
A =
0 1
123 798
456 845
789 932
B =
0 1
321 593
546 603
937 205
Now I would like to multiply them, but also with an expression, as in A-1/B^2 for each of them
AB =
0 1
123-1/(321^2) 798-1/(593^2)
456-1/(546^2) 845-1/603^2)
789-1/(937^2) 932-1/(205^2)
Now, I have figured I could loop through each row and each column and try some sort of
A[i][j]-1/(B[i][j]^2)
But when it goes up to a 1000x1000 matrix, it would take quite some time.
Is there any operation for pandas or numpy that allows these sort of cross matrix operations? Not just multiplying one matrix by the other, but rather doing a math opeartion between them.
Maybe calculate the divider at first for a new df B ?

creating new dataframe with manhattan distance in python

I need to create a dataframe containing the manhattan distance between two dataframes with the same columns, and I need the indexes of each dataframe to be the index and column name, so for example lets say I have these two dataframes:
x_train :
index a b c
11 2 5 7
23 4 2 0
312 2 2 2
x_test :
index a b c
22 1 1 1
30 2 0 0
so the columns match but the size and indexes do not, the expected dataframe would look like this:
dist_dataframe:
index 11 23 312
22 11 5 3
30 12 4 4
and what I have right now is this:
def manhattan_distance(a, b):
return sum(abs(e1-e2) for e1, e2 in zip(a,b))
def calc_distance(X_test,X_train):
dist_dataframe = pd.DataFrame(index=X_test.index,columns = X_train.index)
for i in X_train.index:
for j in X_test.index:
dist_dataframe.loc[i,j]=manhattan_distance(X_train.loc[[i]],X_test.loc[[j]])
return dist_dataframe
what I get from the code I have is this dataframe:
dist_dataframe:
index
index 11 23 312
22 NaN NaN NaN
30 NaN NaN NaN
I get the right dataframe size except that it has 2 rows called indexes that I get from the creation of the new dataframe, and also I get an error no matter what I do in the manhattan calculation line, can anyone help me out here please?
Problem in your code
There is a very small problem in your code, i.e. accessing values in dist_dataframe. So,instead of dist_dataframe.loc[i,j], you should reverse the order of i and j and make it like dist_dataframe.loc[j,i]
More efficient solution
It will work fine but since you are a new contributor, I would also like to point out the efficiency of your code. Always try to replace loops with pandas in-built functions. Since they are written in C, it makes them much faster. So here is a more efficient solution:
def manhattan_distance(a, b):
return sum(abs(e1-e2) for e1, e2 in zip(a,b))
def xtrain_distance(row):
distances = {}
for i,each in x_train.iterrows():
distances[i] = manhattan_distance(each,row)
return distances
result = x_test.apply(xtrain_distance, axis=1)
# converting into dataframe
pd.DataFrame(dict(result)).transpose()
It also produces same output and on your example and you can't see any time difference. But when run on a larger size (same data scaled over 20 times), i.e. 60 x_train samples and 40 x_test samples, here is the time difference:
Your solution took: 929 ms
This solution took: 207 ms
It got 4x faster just by eliminating one for loop. Note that, it can be made more efficient but for the sake of demonstration, I have used this solution.

Finding top N values for each group, 200 million rows

I have a pandas DataFrame which has around 200 million rows and looks like this:
UserID MovieID Rating
1 455 5
2 411 4
1 288 2
2 300 3
2 137 5
1 300 3
...
I want to get top N movies for each user sorted by rating in descending order, so for N=2 the output should look like this:
UserID MovieID Rating
1 455 5
1 300 3
2 137 5
2 411 4
When I try to do it like this, I get a 'memory error' caused by the 'groupby' (I have 8gb of RAM on my machine)
df.sort_values(by=['rating']).groupby('userID').head(2)
Any suggestions?
Quick and dirty answer
Given that the sort works, you may be able to squeak by with the following, which uses a Numpy-based memory efficient alternative to the Pandas groupby:
import pandas as pd
d = '''UserID MovieID Rating
1 455 5
2 411 4
3 207 5
1 288 2
3 69 2
2 300 3
3 410 4
3 108 3
2 137 5
3 308 3
1 300 3'''
df = pd.read_csv(pd.compat.StringIO(d), sep='\s+', index_col='UserID')
df = df.sort_values(['UserID', 'Rating'])
# carefully handle the construction of ix to ensure no copies are made
ix = np.zeros(df.shape[0], np.int8)
np.subtract(df.index.values[1:], df.index.values[:-1], out=ix[:-1])
# the above assumes that UserID is the index of df. If it's just a column, use this instead
#np.subtract(df['UserID'].values[1:], df['UserID'].values[:-1], out=ix[:-1])
ix[:-1] += ix[1:]
ix[-2:] = 1
ix = ix.view(np.bool)
print(df.iloc[ix])
Output:
MovieID Rating
UserID
1 300 3
1 455 5
2 411 4
2 137 5
3 410 4
3 207 5
More memory efficient answer
Instead of a Pandas dataframe, for stuff this big you should just work with Numpy arrays (which Pandas uses for storing data under the hood). If you use an appropriate structured array, you should be able to fit all of your data into a single array roughly of size:
2 * 10**8 * (4 + 2 + 1)
1,400,000,000 bytes
or ~1.304 GB
which means that it (and a couple of temporaries for calculations) should easily fit into your 8 GB system memory.
Here's some details:
The trickiest part will be initializing the structured array. You may be able to get away with manually initializing the array and then copying the data over:
dfdtype = np.dtype([('UserID', np.uint32), ('MovieID', np.uint16), ('Rating', np.uint8)])
arr = np.empty(df.shape[0], dtype=dfdtype)
arr['UserID'] = df.index.values
for n in dfdtype.names[1:]:
arr[n] = df[n].values
If the above causes an out of memory error, from the start of your program you'll have to build and populate a structured array instead of a dataframe:
arr = np.empty(rowcount, dtype=dfdtype)
...
adapt the code you use to populate the df and put it here
...
Once you have arr, here's how you'd do the groupby you're aiming for:
arr.sort(order=['UserID', 'Rating'])
ix = np.zeros(arr.shape[0], np.int8)
np.subtract(arr['UserID'][1:], arr['UserID'][:-1], out=ix[:-1])
ix[:-1] += ix[1:]
ix[-2:] = 1
ix = ix.view(np.bool)
print(arr[ix])
The above size calculation and dtype assumes that no UserID is larger than 4,294,967,295, no MovieID is larger than 65535, and no rating is larger than 255. This means that the columns of your dataframe can be (np.uint32, np.uint16, np.uint8) without loosing any data.
If you want to keep working with pandas, you can divide your data into batches - 10K rows at a time, for example. You can split the data either after loading the source data to the DF, or even better, load the data in parts.
You can save the results of each iteration (batch) into a dictionary keeping only the number of movies you're interested with:
{userID: {MovieID_1: score1, MovieID_2: s2, ... MovieID_N: sN}, ...}
and update the nested dictionary on each iteration, keeping only the best N movies per user.
This way you'll be able to analyze data much larger than your computer's memory

Categories

Resources