Random-shuffling python dataframe rows between categories - python

I am aware of the numpy.random.permutation method to conveniently shuffle the rows in a dataframe.
I want, however, to shuffle the values of one column such that, after the shuffling, identical values of that column are still associated with identical values of a second column. For instance:
sid tid cluster_id coherence
0 484 367 0 (-0.7602504647007313-0.12366326038519604j)
1 485 367 0 (-0.7602504647007313-0.12366326038519604j)
2 227 2 1 (0.8285282150429198+0.007917196582272277j)
3 228 2 1 (0.8285282150429198+0.007917196582272277j)
4 488 245 2 (-0.5247187752391191+0.03756613687159624j)
5 489 245 2 (-0.5247187752391191+0.03756613687159624j)
6 76 504 3 (-0.5017704895797781-0.17508351848297674j)
7 59 545 3 (-0.37153924345882344-0.08026706090664427j)
I want to shuffle the values in the "coherence" column.
Right now, rows with identical tids also have identical coherence values.
This should still hold after the shuffling - but the coherence values should be assigned to new tids.
Hence a coherence value that was previously associated with a tid X will be
associated with a new tid Y, and all rows with this new tid Y should share that same coherence value.

Since I'm too lazy to reproduce your dataframe, I'll just use a toy case. What you want is to shuffle within a groupby:
import numpy as np
import pandas as pd
df = pd.DataFrame({'tid': [1, 1, 1, 2, 2, 2], 'others': [1, 2, 3, 4, 5, 6], 'coherence': [1, 2, 3, 4, 5, 6]})
df['coherence'] = df.groupby('tid').coherence.transform(np.random.permutation)
UPDATE
Okay, so I understood you wrong the first time: the previous answer shuffles within each tid group, but you want to shuffle the groups themselves. groupby is still the solution, just shuffle the groups first:
import random
import pandas as pd
df = pd.DataFrame({'tid': [1, 1, 2, 2, 3, 3, 4, 4], 'val': [1, 2, 3, 4, 5, 6, 7, 8], 'coherence': [1, 1, 2, 2, 3, 3, 4, 4]})
# one small frame of (tid, coherence) per tid group
groups = [g for _, g in df[['tid', 'coherence']].groupby('tid')]
random.shuffle(groups)
# reassemble in the shuffled order; reset_index lines the rows back up with df
df[['tid', 'coherence']] = pd.concat(groups).reset_index(drop=True)
I hope this does it.
UPDATE
What you want is not clear from your question at all. Here's your solution:
import numpy as np
import pandas as pd
df = pd.DataFrame({'tid': [1, 1, 2, 2, 3, 3, 4, 4], 'val': [1, 2, 3, 4, 5, 6, 7, 8], 'coherence': [1, 1, 2, 2, 3, 3, 4, 4]})
# one (tid, coherence) pair per tid
tmp = df[['tid', 'coherence']].drop_duplicates()
# permute the coherence values across the tids
tmp['coherence'] = np.random.permutation(tmp.coherence)
pd.merge(df, tmp, how='left', on='tid')
coherence_x is the old one and coherence_y is the new one.
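A small follow-up sketch (assuming the merge above): keep only the shuffled values and restore the original column name:
result = pd.merge(df, tmp, how='left', on='tid')
# drop the old values and rename the shuffled column back to 'coherence'
result = result.drop(columns='coherence_x').rename(columns={'coherence_y': 'coherence'})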

Related

Group by a category

I have run KMeans clustering and now I need to analyse each individual cluster - for example, look at cluster 1, see which clients are in it, and draw conclusions.
dfRFM['idcluster'] = num_cluster
dfRFM.head()
idcliente Recencia Frecuencia Monetario idcluster
1 3 251 44 -90.11 0
2 8 1011 44 87786.44 2
6 88 537 36 8589.57 0
7 98 505 2 -179.00 0
9 156 11 15 35259.50 0
How do I group so that I only see results from, let's say, idcluster 0, sorted by, let's say, "Monetario"? Thanks!
To filter a dataframe, the most common way is to use df[df[colname] == val]. Then you can use df.sort_values().
In your case, that would look like this:
dfRFM_id0 = dfRFM[dfRFM['idcluster']==0].sort_values('Monetario')
The way this filtering works is that dfRFM['idcluster'] == 0 returns a Series of True/False values, one per row. So we effectively have dfRFM[(True, False, True, True, ...)], and the dataframe returns only the rows where the value is True - that is, it selects the data where the condition holds.
edit: add 'the way this works...'
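For instance, a minimal sketch of that masking step on a tiny made-up frame:
import pandas as pd
df_demo = pd.DataFrame({'idcluster': [0, 2, 0, 0], 'Monetario': [-90.11, 87786.44, 8589.57, -179.00]})
mask = df_demo['idcluster'] == 0  # Series of True/False, one entry per row
print(df_demo[mask])              # keeps only the rows where the mask is True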
I think you actually just need to filter your DF!
df_new = dfRFM[dfRFM.idcluster == 0]
and then sort by Monetario
df_new = df_new.sort_values(by='Monetario')
groupby is really best for when you want to look at a cluster as a whole - for example, if you wanted to see the average values of Recencia, Frecuencia, and Monetario for all of group 0.
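For example, a quick sketch of that aggregate view (assuming the dfRFM columns shown above):
# average Recencia, Frecuencia and Monetario per cluster
print(dfRFM.groupby('idcluster')[['Recencia', 'Frecuencia', 'Monetario']].mean())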

Correlation analysis with multiple data in a single cell

I have a dataset with some rows containing singular answers and others having multiple answers. Like so:
year length Animation
0 1971 121 1,2,3
1 1939 71 1,3
2 1941 7 0,2
3 1996 70 1,2,0
4 1975 71 3,2,0
With the single-answer rows I managed to create a heatmap using df.corr(), but I can't figure out the best approach for the rows with multiple answers.
I could split them and add an additional row for each answer, like:
year length Animation
0 1971 121 1
1 1971 121 2
2 1971 121 3
3 1939 71 1
4 1939 71 3 ...
and then do the exact same df.corr(), or add additional Animation_01, Animation_02 ... columns, but there must be a smarter way to work around this issue?
EDIT: Actual data snippet
You should compute a frequency table between the two categorical variables using pd.crosstab() and perform subsequent analyses based on this table. A correlation like df.corr() is NOT mathematically meaningful when one of x and y is categorical, no matter whether it is encoded as a number or not.
N.B.1 If x is categorical but y is numerical, there are two options to describe the linkage between them:
Group y into quantiles (bins) and treat it as categorical
Perform a linear regression of y against one-hot encoded dummy variables of x
Option 2 is more precise in general but the statistics is beyond the scope of this question. This post will focus on the case of two categorical variables.
N.B.2 For sparse matrix output please see this post.
Sample Solution
Data & Preprocessing
import pandas as pd
import io
import matplotlib.pyplot as plt
from seaborn import heatmap
df = pd.read_csv(io.StringIO("""
year length Animation
0 1971 121 1,2,3
1 1939 71 1,3
2 1941 7 0,2
3 1996 70 1,2,0
4 1975 71 3,2,0
"""), sep=r"\s{2,}", engine="python")
# convert string to list
df["Animation"] = df["Animation"].str.split(',')
# expand list column into new rows
df = df.explode("Animation")
# (optional)
df["Animation"] = df["Animation"].astype(int)
Frequency Table
Note: grouping of length is ignored for simplicity (a sketch of how to bin it follows the table below)
ct = pd.crosstab(df["Animation"], df["length"])
print(ct)
# Out[65]:
# length 7 70 71 121
# Animation
# 0 1 1 1 0
# 1 0 1 1 1
# 2 1 1 1 1
# 3 0 0 2 1
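If you did want to bin length rather than ignore the grouping (see the note above), a rough sketch - the choice of 3 bins is arbitrary:
# bin the numeric variable first, then cross-tabulate it against the categorical one
ct_binned = pd.crosstab(df["Animation"], pd.cut(df["length"], bins=3))
print(ct_binned)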
Visualization
ax = heatmap(ct, cmap="viridis",
             yticklabels=df["Animation"].drop_duplicates().sort_values(),
             xticklabels=df["length"].drop_duplicates().sort_values())
ax.set_title("Title", fontsize=20)
plt.show()
Example Analysis
Based on the frequency table, you can ask questions about the distribution of y given a certain (subset of) x value(s), or vice versa. This describes the linkage between two categorical variables better than a correlation would, since categorical variables have no order.
For example,
Q: What length does Animation=3 produce?
A: 66.7% chance to give 71
33.3% chance to give 121
otherwise unobserved
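A small sketch of how such conditional distributions can be read off the table (using the ct computed above):
# normalise each row of the frequency table so it sums to 1
cond = ct.div(ct.sum(axis=1), axis=0)
# the row for Animation=3 reproduces the 66.7% / 33.3% split quoted above
print(cond.loc[3])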
You want to break Animation (or Preferred_positions in your data snippet) up into a series of one-hot columns, one for every unique string in the original column. Every column will have values of either zero or one, with one marking the rows where that string appeared in the original column.
First, you need to get all the unique substrings in Preferred_positions (see this answer for how to deal with a column of lists).
positions = df.Preferred_positions.str.split(',').explode().unique()
Then you can create the positions columns in a loop based on whether the given position is in Preferred_positions for each row.
for position in positions:
    # 1 if this position appears in the row's comma-separated string, else 0
    df[position] = df.Preferred_positions.apply(
        lambda x: 1 if position in x.split(',') else 0
    )
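As a more concise alternative (a sketch, not part of the original answer), Series.str.get_dummies can build the same one-hot columns straight from the comma-separated strings:
positions_onehot = df.Preferred_positions.str.get_dummies(sep=',')
df = df.join(positions_onehot)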

index out of bounds with itertuples : IndexError: index 850 is out of bounds for axis 1 with size 786

My dataframe
userID storeID rating
0 1 662 3.6
1 2 665 3.4
2 3 678 4.0
3 4 500 3.1
4 5 421 2.9
n_users = df.userID.unique().shape[0]
n_stores = df.storeID.unique().shape[0]
I have 2 problems. If I want to build my training dataset like this:
ratings = np.zeros((n_users, n_stores))
for row in df.itertuples():
    ratings[row[1]-1, row[2]-1] = row[3]
I get an IndexError like this:
IndexError: index 850 is out of bounds for axis 1 with size 786
From what I can tell, you're trying to make a 2-dimensional array of floats, each representing a rating, indexed by the user ID in the first axis and the store ID in the second axis.
You're creating an array of shape (n_users, n_stores), where n_users and n_stores are the number of unique users and stores respectively. When indexing this array,
for row in df.itertuples():
    ratings[row[1]-1, row[2]-1] = row[3]
you're using the user/store ID directly (shifted by 1) as an index. This only works if you know that all user/store IDs range from 1 to the number of unique users/stores, without any gaps in between. For example, given the snippet of dataframe you have shown, there are 5 unique users and 5 unique stores, but even if I make a 5 by 5 array, I won't be able to index the second axis (store ID) directly, since the values of store ID are [662, 665, 678, 500, 421], but it can only be indexed by [0, 1, 2, 3, 4].
The IndexError that you get is happening in axis 1 (i.e. the second axis, the one for the store IDs) with index value 850. That means that your store numbers are not contiguous from 1 to 786 (the number of unique store IDs), but rather they are just "individual" integers with gaps in between, since there is a store with ID 850.
What you're looking for is more like a dictionary: an arbitrary mapping between keys and values, in which the indices (keys) don't have to be contiguous, like for an array. Specifically, I think whatever you're trying to do will be much easier by getting a ratings series indexed by a MultiIndex of userID and storeID:
>>> indexed_df = df.set_index(['userID', 'storeID'])
>>> indexed_df
rating
userID storeID
1 662 3.6
2 665 3.4
3 678 4.0
4 500 3.1
5 421 2.9
>>> ratings = indexed_df['rating']
>>> ratings
userID storeID
1 662 3.6
2 665 3.4
3 678 4.0
4 500 3.1
5 421 2.9
Name: rating, dtype: float64
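A quick usage note on that series (values taken from the snippet above):
# single rating for a (userID, storeID) pair
print(ratings.loc[(1, 662)])  # 3.6
# all ratings for one user
print(ratings.loc[1])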

Finding top N values for each group, 200 million rows

I have a pandas DataFrame which has around 200 million rows and looks like this:
UserID MovieID Rating
1 455 5
2 411 4
1 288 2
2 300 3
2 137 5
1 300 3
...
I want to get top N movies for each user sorted by rating in descending order, so for N=2 the output should look like this:
UserID MovieID Rating
1 455 5
1 300 3
2 137 5
2 411 4
When I try to do it like this, I get a MemoryError caused by the groupby (I have 8 GB of RAM on my machine):
df.sort_values(by=['rating']).groupby('userID').head(2)
Any suggestions?
Quick and dirty answer
Given that the sort works, you may be able to squeak by with the following, which uses a NumPy-based, memory-efficient alternative to the Pandas groupby:
import io
import numpy as np
import pandas as pd
d = '''UserID MovieID Rating
1 455 5
2 411 4
3 207 5
1 288 2
3 69 2
2 300 3
3 410 4
3 108 3
2 137 5
3 308 3
1 300 3'''
df = pd.read_csv(io.StringIO(d), sep=r'\s+', index_col='UserID')
df = df.sort_values(['UserID', 'Rating'])
# carefully handle the construction of ix to ensure no copies are made
ix = np.zeros(df.shape[0], np.int8)
np.subtract(df.index.values[1:], df.index.values[:-1], out=ix[:-1])
# the above assumes that UserID is the index of df. If it's just a column, use this instead
#np.subtract(df['UserID'].values[1:], df['UserID'].values[:-1], out=ix[:-1])
ix[:-1] += ix[1:]
ix[-2:] = 1
ix = ix.view(bool)
print(df.iloc[ix])
Output:
MovieID Rating
UserID
1 300 3
1 455 5
2 411 4
2 137 5
3 410 4
3 207 5
More memory efficient answer
Instead of a Pandas dataframe, for stuff this big you should just work with Numpy arrays (which Pandas uses for storing data under the hood). If you use an appropriate structured array, you should be able to fit all of your data into a single array roughly of size:
2 * 10**8 rows * (4 + 2 + 1) bytes per row
= 1,400,000,000 bytes
≈ 1.304 GiB
which means that it (and a couple of temporaries for calculations) should easily fit into your 8 GB system memory.
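A quick sanity check of that arithmetic (a sketch using the structured dtype defined just below):
import numpy as np
dfdtype = np.dtype([('UserID', np.uint32), ('MovieID', np.uint16), ('Rating', np.uint8)])
print(dfdtype.itemsize * 2 * 10**8 / 2**30)  # ~1.304 GiB for 200 million rows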
Here are some details:
The trickiest part will be initializing the structured array. You may be able to get away with manually initializing the array and then copying the data over:
dfdtype = np.dtype([('UserID', np.uint32), ('MovieID', np.uint16), ('Rating', np.uint8)])
arr = np.empty(df.shape[0], dtype=dfdtype)
arr['UserID'] = df.index.values
for n in dfdtype.names[1:]:
    arr[n] = df[n].values
If the above causes an out of memory error, from the start of your program you'll have to build and populate a structured array instead of a dataframe:
arr = np.empty(rowcount, dtype=dfdtype)
# ...
# adapt the code you use to populate the df and put it here
# ...
Once you have arr, here's how you'd do the groupby you're aiming for:
arr.sort(order=['UserID', 'Rating'])
ix = np.zeros(arr.shape[0], np.int8)
np.subtract(arr['UserID'][1:], arr['UserID'][:-1], out=ix[:-1])
ix[:-1] += ix[1:]
ix[-2:] = 1
ix = ix.view(bool)
print(arr[ix])
The above size calculation and dtype assume that no UserID is larger than 4,294,967,295, no MovieID is larger than 65535, and no rating is larger than 255. This means that the columns of your dataframe can be (np.uint32, np.uint16, np.uint8) without losing any data.
If you want to keep working with pandas, you can divide your data into batches - 10K rows at a time, for example. You can split the data either after loading the source data to the DF, or even better, load the data in parts.
You can save the results of each iteration (batch) into a dictionary keeping only the number of movies you're interested with:
{userID: {MovieID_1: score1, MovieID_2: s2, ... MovieID_N: sN}, ...}
and update the nested dictionary on each iteration, keeping only the best N movies per user.
This way you'll be able to analyze data much larger than your computer's memory.
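A rough sketch of that idea (the file name 'ratings.csv', the chunk size and N below are placeholders; it keeps a small heap of the best N ratings per user, which amounts to the nested-dictionary approach described above):
import heapq
import pandas as pd

N = 2
best = {}  # UserID -> heap of (Rating, MovieID) pairs, at most N entries
for chunk in pd.read_csv('ratings.csv', chunksize=100_000):
    for user, movie, rating in zip(chunk['UserID'], chunk['MovieID'], chunk['Rating']):
        heap = best.setdefault(user, [])
        if len(heap) < N:
            heapq.heappush(heap, (rating, movie))
        else:
            # pushes the new pair and pops the smallest, so only the N highest ratings survive
            heapq.heappushpop(heap, (rating, movie))
# best[user] now holds that user's top N (rating, MovieID) pairs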

Grouping numerical values in pandas

In my dataframe I have one column with numeric values, let's say distance. I want to find out which distance range (group) has the biggest number of records (rows).
Doing a simple
df.distance.value_counts()
returns:
74 1
90 1
94 1
893 1
889 1
885 1
877 1
833 1
122 1
545 1
What I want to achieve is something like the buckets of a histogram, so I am expecting output like this:
900 4 #all values < 900 and > 850
100 3
150 1
550 1
850 1
The one approach I've figured out so far - though I don't think it's the best or most optimal one - is to find the max and min values, divide the range by my step (50 in this case), and then loop over all the values, assigning each to the appropriate group.
Is there any other, better approach for that?
I'd suggest doing the following, assuming your value column is labeled val
import numpy as np
df['bin'] = df['val'].apply(lambda x: 50*np.floor(x/50))
You can then count the rows per bin with:
df.groupby('bin')['val'].count()
Thanks to EdChum's suggestion and based on this example, I've figured out that the best way (at least for me) is to do something like this:
import numpy as np
import pandas as pd
step = 50
# ...
max_val = df.distance.max()
# bin edges 0, 50, 100, ... up to just past the maximum distance
bins = list(range(0, int(np.ceil(max_val / step)) * step + step, step))
# label each bucket by its upper edge
clusters = pd.cut(df.distance, bins, labels=bins[1:])
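A short follow-up sketch: to get the bucket counts asked for in the question, count the cut result:
print(clusters.value_counts().sort_index())  # number of rows per 50-unit distance bucket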
