Correlation analysis with multiple data in a single cell - python

I have a dataset with some rows containing singular answers and others having multiple answers. Like so:
year length Animation
0 1971 121 1,2,3
1 1939 71 1,3
2 1941 7 0,2
3 1996 70 1,2,0
4 1975 71 3,2,0
With the singular answers I managed to create a heatmap using df.corr(), but I can't figure out the best approach for the rows with multiple answers.
I could split them and add additional columns for each answer like:
year length Animation
0 1971 121 1
1 1971 121 2
2 1971 121 3
3 1939 71 1
4 1939 71 3 ...
and then do the exact same df.corr(), or add additional Animation_01, Animation_02 ... columns, but there must be a smarter way to work around this issue?
EDIT: Actual data snippet

You should compute a frequency table between two categorical variables using pd.crosstab() and perform subsequent analyses based on this table. A correlation such as df.corr() between x and y is NOT mathematically meaningful when one of them is categorical, no matter whether it is encoded as a number or not.
N.B.1 If x is categorical but y is numerical, there are two options to describe the linkage between them:
Group y into quantiles (bins) and treat it as categorical
Perform a linear regression of y against one-hot encoded dummy variables of x
Option 2 is more precise in general, but the statistics are beyond the scope of this question. A minimal sketch of option 1 is given below; the rest of this post focuses on the case of two categorical variables.
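For reference, here is a small sketch of option 1 with hypothetical columns x (categorical) and y (numerical); the binning scheme (quartiles) is just an illustrative choice:
import pandas as pd
df_xy = pd.DataFrame({
    "x": ["a", "a", "b", "b", "c", "c"],
    "y": [1.2, 3.4, 2.2, 5.1, 0.3, 4.8],
})
# bin the numeric y into quartiles so it can be treated as categorical
df_xy["y_bin"] = pd.qcut(df_xy["y"], q=4)
# then cross-tabulate exactly as in the two-categorical case below
print(pd.crosstab(df_xy["x"], df_xy["y_bin"]))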
N.B.2 For sparse matrix output please see this post.
Sample Solution
Data & Preprocessing
import pandas as pd
import io
import matplotlib.pyplot as plt
from seaborn import heatmap
df = pd.read_csv(io.StringIO("""
year length Animation
0 1971 121 1,2,3
1 1939 71 1,3
2 1941 7 0,2
3 1996 70 1,2,0
4 1975 71 3,2,0
"""), sep=r"\s{2,}", engine="python")
# convert string to list
df["Animation"] = df["Animation"].str.split(',')
# expand list column into new rows
df = df.explode("Animation")
# (optional)
df["Animation"] = df["Animation"].astype(int)
Frequency Table
Note: grouping of length is ignored for simplicity
ct = pd.crosstab(df["Animation"], df["length"])
print(ct)
# Out[65]:
# length     7   70  71  121
# Animation
# 0          1    1   1    0
# 1          0    1   1    1
# 2          1    1   1    1
# 3          0    0   2    1
Visualization
ax = heatmap(ct, cmap="viridis",
             yticklabels=df["Animation"].drop_duplicates().sort_values(),
             xticklabels=df["length"].drop_duplicates().sort_values())
ax.set_title("Title", fontsize=20)
plt.show()
Example Analysis
Based on the frequency table, you can ask questions about the distribution of y given a certain (subset of) x value(s), or vice versa. This should better describe the linkage between two categorical variables, as the categorical variables have no order.
For example,
Q: What length does Animation=3 produce?
A: 66.7% chance to give 71
33.3% chance to give 121
otherwise unobserved
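Such conditional distributions can be read directly off a row-normalized version of the frequency table, e.g.:
# distribution of length conditional on each Animation value
cond = ct.div(ct.sum(axis=1), axis=0)
print(cond.loc[3])
# length
# 7      0.000000
# 70     0.000000
# 71     0.666667
# 121    0.333333
# Name: 3, dtype: float64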

You want to break Animation (or Preferred_positions in your data snippet) up into a series of one-hot columns, one one-hot column for every unique string in the original column. Every column will have values of either zero or one, with one corresponding to rows where that string appeared in the original column.
First, you need to get all the unique substrings in Preferred_positions (see this answer for how to deal with a column of lists).
positions = df.Preferred_positions.str.split(',').explode().unique()
Then you can create the positions columns in a loop based on whether the given position is in Preferred_positions for each row.
for position in positions:
    df[position] = df.Preferred_positions.apply(
        lambda x: 1 if position in x else 0
    )
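As a side note (a sketch, assuming the entries are plain comma-separated strings), pandas can also build all of the one-hot columns in a single vectorized call with Series.str.get_dummies:
# one column per unique position, 1 where that position occurs in the row
dummies = df.Preferred_positions.str.get_dummies(sep=',')
df = df.join(dummies)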

Related

creating new dataframe with manhattan distance in python

I need to create a dataframe containing the Manhattan distance between two dataframes with the same columns, and I need the indexes of each dataframe to be the index and column names. For example, let's say I have these two dataframes:
x_train :
index a b c
11 2 5 7
23 4 2 0
312 2 2 2
x_test :
index a b c
22 1 1 1
30 2 0 0
So the columns match but the sizes and indexes do not. The expected dataframe would look like this:
dist_dataframe:
index 11 23 312
22 11 5 3
30 12 4 4
and what I have right now is this:
def manhattan_distance(a, b):
    return sum(abs(e1-e2) for e1, e2 in zip(a,b))

def calc_distance(X_test, X_train):
    dist_dataframe = pd.DataFrame(index=X_test.index, columns=X_train.index)
    for i in X_train.index:
        for j in X_test.index:
            dist_dataframe.loc[i,j] = manhattan_distance(X_train.loc[[i]], X_test.loc[[j]])
    return dist_dataframe
what I get from the code I have is this dataframe:
dist_dataframe:
index
index 11 23 312
22 NaN NaN NaN
30 NaN NaN NaN
I get the right dataframe size, except that it has two header rows called index coming from the creation of the new dataframe, and I also get an error no matter what I do in the Manhattan calculation line. Can anyone help me out here, please?
Problem in your code
There is a very small problem in your code, i.e. how you access values in dist_dataframe: since the frame was created with index=X_test.index and columns=X_train.index, you should reverse the order of i and j and write dist_dataframe.loc[j,i]. Also, pass the rows as Series (X_train.loc[i] and X_test.loc[j], single brackets) so that zip iterates over the values rather than over the column labels.
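A sketch of calc_distance with those two changes applied:
def calc_distance(X_test, X_train):
    dist_dataframe = pd.DataFrame(index=X_test.index, columns=X_train.index)
    for i in X_train.index:
        for j in X_test.index:
            # rows passed as Series, and indexed as [test row, train column]
            dist_dataframe.loc[j, i] = manhattan_distance(X_train.loc[i], X_test.loc[j])
    return dist_dataframe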
More efficient solution
With those fixes it will work, but since you are a new contributor I would also like to point out the efficiency of your code. Always try to replace loops with pandas' built-in functions; since they are implemented in C, they are much faster. So here is a more efficient solution:
def manhattan_distance(a, b):
    return sum(abs(e1-e2) for e1, e2 in zip(a,b))

def xtrain_distance(row):
    distances = {}
    for i, each in x_train.iterrows():
        distances[i] = manhattan_distance(each, row)
    return distances

result = x_test.apply(xtrain_distance, axis=1)
# converting into a dataframe
pd.DataFrame(dict(result)).transpose()
It produces the same output, and on your example you won't see any time difference. But when run on a larger size (the same data scaled over 20 times), i.e. 60 x_train samples and 40 x_test samples, here is the time difference:
Your solution took: 929 ms
This solution took: 207 ms
It got about 4x faster just by eliminating one for loop. Note that it can be made even more efficient, but for the sake of demonstration I have used this solution.
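For instance (a sketch, assuming SciPy is available), the whole distance matrix can be computed in one vectorized call with scipy.spatial.distance.cdist and the 'cityblock' (Manhattan) metric:
import pandas as pd
from scipy.spatial.distance import cdist

# pairwise Manhattan distances between all test and train rows at once
dist_dataframe = pd.DataFrame(
    cdist(x_test.values, x_train.values, metric='cityblock'),
    index=x_test.index,
    columns=x_train.index,
)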

Finding top N values for each group, 200 million rows

I have a pandas DataFrame which has around 200 million rows and looks like this:
UserID MovieID Rating
1 455 5
2 411 4
1 288 2
2 300 3
2 137 5
1 300 3
...
I want to get top N movies for each user sorted by rating in descending order, so for N=2 the output should look like this:
UserID MovieID Rating
1 455 5
1 300 3
2 137 5
2 411 4
When I try to do it like this, I get a 'memory error' caused by the 'groupby' (I have 8gb of RAM on my machine)
df.sort_values(by=['rating']).groupby('userID').head(2)
Any suggestions?
Quick and dirty answer
Given that the sort works, you may be able to squeak by with the following, which uses a Numpy-based memory efficient alternative to the Pandas groupby:
import io
import numpy as np
import pandas as pd
d = '''UserID MovieID Rating
1 455 5
2 411 4
3 207 5
1 288 2
3 69 2
2 300 3
3 410 4
3 108 3
2 137 5
3 308 3
1 300 3'''
df = pd.read_csv(io.StringIO(d), sep=r'\s+', index_col='UserID')
df = df.sort_values(['UserID', 'Rating'])
# carefully handle the construction of ix to ensure no copies are made
ix = np.zeros(df.shape[0], np.int8)
np.subtract(df.index.values[1:], df.index.values[:-1], out=ix[:-1])
# the above assumes that UserID is the index of df. If it's just a column, use this instead
#np.subtract(df['UserID'].values[1:], df['UserID'].values[:-1], out=ix[:-1])
ix[:-1] += ix[1:]
ix[-2:] = 1
ix = ix.view(bool)
print(df.iloc[ix])
Output:
        MovieID  Rating
UserID
1           300       3
1           455       5
2           411       4
2           137       5
3           410       4
3           207       5
More memory efficient answer
Instead of a Pandas dataframe, for stuff this big you should just work with Numpy arrays (which Pandas uses for storing data under the hood). If you use an appropriate structured array, you should be able to fit all of your data into a single array roughly of size:
2 * 10**8 * (4 + 2 + 1) = 1,400,000,000 bytes, or ~1.304 GB
which means that it (and a couple of temporaries for calculations) should easily fit into your 8 GB system memory.
Here are some details:
The trickiest part will be initializing the structured array. You may be able to get away with manually initializing the array and then copying the data over:
dfdtype = np.dtype([('UserID', np.uint32), ('MovieID', np.uint16), ('Rating', np.uint8)])
arr = np.empty(df.shape[0], dtype=dfdtype)
arr['UserID'] = df.index.values
for n in dfdtype.names[1:]:
    arr[n] = df[n].values
If the above causes an out of memory error, from the start of your program you'll have to build and populate a structured array instead of a dataframe:
arr = np.empty(rowcount, dtype=dfdtype)
...
adapt the code you use to populate the df and put it here
...
Once you have arr, here's how you'd do the groupby you're aiming for:
arr.sort(order=['UserID', 'Rating'])
ix = np.zeros(arr.shape[0], np.int8)
np.subtract(arr['UserID'][1:], arr['UserID'][:-1], out=ix[:-1])
ix[:-1] += ix[1:]
ix[-2:] = 1
ix = ix.view(bool)
print(arr[ix])
The above size calculation and dtype assume that no UserID is larger than 4,294,967,295, no MovieID is larger than 65535, and no rating is larger than 255. This means that the columns of your dataframe can be (np.uint32, np.uint16, np.uint8) without losing any data.
If you want to keep working with pandas, you can divide your data into batches - 10K rows at a time, for example. You can split the data either after loading the source data to the DF, or even better, load the data in parts.
You can save the results of each iteration (batch) into a dictionary keeping only the number of movies you're interested with:
{userID: {MovieID_1: score1, MovieID_2: s2, ... MovieID_N: sN}, ...}
and update the nested dictionary on each iteration, keeping only the best N movies per user.
This way you'll be able to analyze data much larger than your computer's memory.
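A minimal sketch of that idea, assuming the data is read from a hypothetical ratings.csv; it keeps a small DataFrame of per-user winners instead of the nested dictionary, which is one possible variant:
import pandas as pd

N = 2
top_n = None  # running top-N rows per user seen so far

for chunk in pd.read_csv('ratings.csv', chunksize=100_000):
    combined = chunk if top_n is None else pd.concat([top_n, chunk])
    # keep only the N highest-rated rows per user after each batch
    top_n = (combined.sort_values('Rating', ascending=False)
                     .groupby('UserID')
                     .head(N))

print(top_n.sort_values(['UserID', 'Rating'], ascending=[True, False]))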

Pandas DataFrame with Function: Columns Varying

Given the following DataFrame:
import pandas as pd
import numpy as np
d=pd.DataFrame({' Label':['a','a','b','b'],'Count1':[10,20,30,40],'Count2':[20,45,10,35],
'Count3':[40,30,np.nan,22],'Nobs1':[30,30,70,70],'Nobs2':[65,65,45,45],
'Nobs3':[70,70,22,32]})
d
Label Count1 Count2 Count3 Nobs1 Nobs2 Nobs3
0 a 10 20 40.0 30 65 70
1 a 20 45 30.0 30 65 70
2 b 30 10 NaN 70 45 22
3 b 40 35 22.0 70 45 32
I would like to apply the z test for proportions on each combination of column groups (1 and 2, 1 and 3, 2 and 3) per row. By column group, I mean, for example, "Count1" and "Nobs1".
For example, one such test would be:
count = np.array([10, 20]) #from first row of Count1 and Count2, respectively
nobs = np.array([30, 65]) #from first row of Nobs1 and Nobs2, respectively
pv = proportions_ztest(count=count,nobs=nobs,value=0,alternative='two-sided')[1] #this returns just the p-value, which is of interest
pv
0.80265091465415639
I would want the result (pv) to go into a new column (first row) called "p_1_2" or something logical that corresponds to its respective columns.
In summary, here are the challenges I'm facing:
How to apply this per row.
...for each paired combination, mentioned above.
...where the column names and number of pairs of "Count" and "Nobs" columns may vary (assuming that there will always be a "Nobs" column for each "Count" column).
Related to 3: For example, I might have a column called "18-24" and another called "18-24_Nobs".
Thanks in advance!
For 1) and 2), here is one test; additional tests can be coded similarly or within an additional loop:
from statsmodels.stats.proportion import proportions_ztest

for i, row in d.iterrows():
    d.loc[i, 'test'] = proportions_ztest(count=row['Count1':'Count2'].values,
                                         nobs=row['Nobs1':'Nobs2'].values,
                                         value=0, alternative='two-sided')[1]
For 3) it should be possible to handle those cases with pure Python inside the loop, e.g. by discovering the Count/Nobs column pairs from the column names (see the sketch below).
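A rough sketch of that idea, assuming each "Count" column has a matching "Nobs" column (the name-matching would need adapting for pairs like "18-24"/"18-24_Nobs"):
from itertools import combinations
from statsmodels.stats.proportion import proportions_ztest

count_cols = [c for c in d.columns if c.startswith('Count')]
for c1, c2 in combinations(count_cols, 2):
    n1, n2 = c1.replace('Count', 'Nobs'), c2.replace('Count', 'Nobs')
    p_col = 'p_{}_{}'.format(c1[-1], c2[-1])  # e.g. p_1_2
    d[p_col] = d.apply(
        lambda row: proportions_ztest(count=row[[c1, c2]].astype(float).values,
                                      nobs=row[[n1, n2]].astype(float).values,
                                      value=0, alternative='two-sided')[1],
        axis=1)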

pandas histogram/barplot on categorical index and axis

I have this series:
data:
0 17
1 25
2 10
3 60
4 0
5 20
6 300
7 50
8 10
9 80
10 100
11 65
12 125
13 50
14 100
15 150
Name: 1, dtype: int64
I wanted to plot a histogram with variable bin sizes, so I made this:
filter_values = [0,25,50,60,75,100,150,200,250,300,350]
out = pd.cut(data, bins=filter_values)
counts = pd.value_counts(out)
print(counts)
My problem is that when I use counts.plot(kind="hist"), I don't get the right labels on the x axis. I only get them by using a bar graph instead, counts.plot(kind="bar"), but then I can't get the right order.
I tried to use xticks=counts.index.values[0] but it raises an error, and xticks=filter_values gives an odd figure shape because the numbers go far beyond what the plot understands the bins to be.
I also tried counts.hist(), data.hist(), and counts.plot.hist without success.
I don't know how to correctly plot the categorical data from counts (its index is a pandas CategoricalIndex), so I don't know which approach to take: whether there is a way to plot variable bins directly with data.hist(), data.plot(kind="hist") or data.plot.hist(), or whether I am right to build counts, and in that case how to represent it correctly (with good labels on the x axis and in the right order, not a descending one as in the bar graph).
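One possible sketch of the counts-based approach: keep the bar plot, but sort counts by its interval index instead of by frequency, so the bins appear in order and carry their interval labels:
import matplotlib.pyplot as plt

# order the bars by bin (the CategoricalIndex of intervals), not by count
counts.sort_index().plot(kind="bar")
plt.show()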

Generate multiple 2D arrays from column data of a pandas dataframe and store them in a data structure

I have several versions of this type of DataFrame.
My idea was to structure each individual value column in a 2D mesh/array for each time step. These 2D arrays should be sequenced by increasing TIME values and stored as a separate dataset (in pandas or numpy??) per variable.
This way I could call the variable and load all the TIME instances of it. If I plot these consecutive time steps, it should give me a temporal representation of the 2D space (a moving image of the 2D space in time) for each variable.
INDEX TIME ELEM var1 var2 var3 ....
0 0 h1 0.555 0.97 1.555
1 0 t5 0 0.8 1.2
2 0 y7 1 7 1
...
300 15 h1 0.6 0.477 0
301 15 t5 0.9 0.777 1
302 15 y7 0.555 0.897 5
...
800 23 h1 20 7 2
801 23 t5 5 7 5
802 23 y7 0.1 3 55
...
1010 58 h1 9 0.7 11
1011 58 t5 10 977 6
1012 58 y7 5 71 52
...
Hierarchically what i want to achieve is essentially this data structure, where each variable is stored in a 2D array :
Full dataset (dataframe)
Sub-dataset version (dataframe)
Time instance (dataframe)
var1 (2D array)
var2 (2D array)
var3 (2D array)
My first idea was to do a groupby on TIME and ELEM, but I don't think this is the way to go. I have also looked into the melt function, but that doesn't seem to cut it either.
Logically, I think I first need to slice the data per unique TIME value, then reshape each variable column of that slice based on the element code into a 2D matrix/array as discussed here. Lastly, each 2D matrix/array should be added to a data structure as discussed above. How could I make this work?
My understanding is that ideally the data structure should be in pandas for increased efficiency of operations and broadcasting. Is there a better way? I have looked into panels, but they are not so clear to me yet.
This is what I've got so far:
import pandas as pd
import numpy as np
# read the csv file
b = pd.read_csv('D:/myfile.csv', skipinitialspace=False, skiprows=0)
# remove possible empty spaces from the headers
b.rename(columns=lambda x:x.strip(), inplace=True)
# extract unique times and variable names
times = b.TIME.unique()
#make an empty list
listt = []
# for each time instance
for i in range(len(times)):
    # generate the sub-dataset for each unique time
    foo = b.loc[b.TIME == times[i]]
    # re-extract the column names for each iteration
    colnames = b.columns.unique().tolist()
    # for each column name in the dataset
    for k in range(len(colnames)):
        # reshape and assign the reshaped arrays inside the colnames list
        colnames[k] = np.reshape(foo[colnames[k]], (-1, 51))
    # append each TIME instance with its respective structured data to the list
    listt.append(colnames)
# convert the generated list to a 4D panel
mypanel = pd.Panel4D(listt)
This way my indexes are numerical, so I am not able to keep the actual unique TIME values and the column names, which is not optimal. I have the feeling this can be done better and in a more efficient way, I just don't know how.
Suggestions are welcome.... :-)
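As a sketch of one possible alternative (Panel/Panel4D has been removed from recent pandas versions), the same hierarchy can be kept in a nested dict keyed by the actual TIME values and variable names, reusing the DataFrame b read above; the reshape width of 51 is taken from the code above and is an assumption about the mesh dimensions:
import pandas as pd

# value columns = everything except the TIME and ELEM keys
value_cols = [c for c in b.columns if c not in ('TIME', 'ELEM')]

meshes = {
    t: {col: g[col].to_numpy().reshape(-1, 51) for col in value_cols}
    for t, g in b.groupby('TIME')
}
# e.g. meshes[15]['var1'] is the 2D array of var1 at TIME == 15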
