I have a small dataset, for example:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,4,5,6,7,8,9,10], 'b': [11,22,11,22,33,11,22,44,11,22]})
df
I want to find out the co-occurrence of column b values for column a.
What I tried :
df_co = pd.get_dummies(df.a).groupby(df.b).apply(max)
df_co
But this is not a co-occurrence matrix. So I also tried this:
df_co.T.dot(df_co)
which gives me a 10×10 matrix indexed by the values of a.
Is this a correct method to calculate the co-occurrence matrix?
You can use df.pivot with a dummy column to represent count = 1:
df.assign(v=1).pivot(index='a', columns='b').fillna(0)
v
b 11 22 33 44
a
1 1.0 0.0 0.0 0.0
2 0.0 1.0 0.0 0.0
3 1.0 0.0 0.0 0.0
4 0.0 1.0 0.0 0.0
5 0.0 0.0 1.0 0.0
6 1.0 0.0 0.0 0.0
7 0.0 1.0 0.0 0.0
8 0.0 0.0 0.0 1.0
9 1.0 0.0 0.0 0.0
10 0.0 1.0 0.0 0.0
Or, as @Quang Hoang suggested, try pd.crosstab.
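A minimal sketch: crosstab counts each (a, b) pair directly, so no dummy column is needed.

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'b': [11, 22, 11, 22, 33, 11, 22, 44, 11, 22]})

# rows = values of a, columns = values of b, entries = co-occurrence counts
co = pd.crosstab(df.a, df.b)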
I have this data frame structure:
   AU01_r  AU02_r  AU04_r  AU05_r  AU06_r  AU07_r  AU09_r  AU10_r  AU12_r  AU14_r  AU15_r  AU17_r  AU20_r  AU23_r  AU25_r  AU26_r  AU45_r  Segment
0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0        1
1     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0        1
2     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0        1
3     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0        1
4     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0        1
where every 7500 records share the same segment, i.e. the segments cover the following record ranges:
[[1, 7500], [7501, 15000], [15001, 22500], [22501, 30000], [30001, 37500], [37501, 45000], [45001, 52500], [52501, 60000], [60001, 67500], [67501, 75000], [75001, 82626]]
Currently I have code, using pandas, that calculates the mean for each segment:
df_AU_r_2_mean = df_AU_r_2.groupby(['Segment']).mean()
           AU01_r    AU02_r    AU04_r    AU05_r    AU06_r    AU07_r    AU09_r    AU10_r    AU12_r    AU14_r    AU15_r    AU17_r    AU20_r    AU23_r    AU25_r    AU26_r    AU45_r  WD  CF
Segment
1        0.192525  0.156520  0.888929  0.049577  0.092363  0.609992  0.039349  0.385985  0.242643  0.395441  0.456475  0.504961  0.253471  0.074785  0.509816  0.307315  0.093600   1   1
2        0.190215  0.155545  1.027495  0.144367  0.121984  0.872449  0.103985  0.582804  0.311179  0.685669  0.358625  0.605624  0.182963  0.187416  0.530021  0.521449  0.158552   1   0
3        0.187849  0.114435  1.028465  0.110275  0.045937  0.755899  0.088371  0.395693  0.128856  0.376444  0.491379  0.528315  0.245704  0.086708  0.483681  0.442268  0.173515   1   0
But I need to enhance it so that I can calculate the mean/sem/std of each AU column for every 1500 records (i.e. divide each segment into smaller parts).
Can this be done with pandas dataframe transformations?
First, add a new column with an incremental id. This will be used to create your new, smaller segments.
df.insert(0, 'id', range(1, 1 + len(df)))
After that, create a new column that marks each block of 1500 rows:
# id // 1500 alone would flip one row too early (id 1500 would land in segment 2),
# so shift down one row to keep each multiple of 1500 in the preceding segment
df["new_Segment"] = pd.to_numeric(df.id // 1500).shift(fill_value=0).add(1)
Now you can do the calculations based on the new segment column:
df_AU_r_2_mean = df_AU_r_2.groupby(['new_Segment']).mean()
At the end, the dataframe will be:
id A B C Segment new_Segment
1 x x x 1 1
2 x x x 1 1
..
1500 x x x 1 1
1501 x x x 1 2
..
7500 x x x 1 5
7501 x x x 2 6
..
For adding a per-segment statistic as a new column aligned with the original rows, use transform:
df["A_mean"] = df.groupby('new_Segment')['A'].transform('mean')
I have a matrix of the form :
movie_id 1 2 3 ... 1494 1497 1500
user_id
1600 1.0 0.0 1.0 ... 0.0 0.0 1.0
1601 1.0 0.0 0.0 ... 1.0 0.0 0.0
1602 0.0 0.0 0.0 ... 0.0 1.0 1.0
1603 0.0 0.0 1.0 ... 0.0 0.0 0.0
1604 1.0 0.0 0.0 ... 1.0 0.0 0.0
...
As you can see, although there are 1500 movies in my dataset, some movies are missing because of the preprocessing my data went through.
What I want is to add all the columns (movie_ids) that weren't recorded and fill them with 0 (I don't know exactly which movie_ids are missing). So, for example, I want a new matrix of the form:
movie_id 1 2 3 ... 1494 1495 1496 1497 1498 1499 1500
user_id
1600 1.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1601 1.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1602 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 1.0
1603 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1604 1.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0
...
Use DataFrame.reindex along axis=1 with fill_value=0 to conform the dataframe columns to a new index range:
df = df.reindex(range(df.columns.min(), df.columns.max() + 1), axis=1, fill_value=0)
Result:
movie_id    1    2    3  ...  1498  1499  1500
user_id
1600      1.0  0.0  1.0  ...     0     0   1.0
1601      1.0  0.0  0.0  ...     0     0   0.0
1602      0.0  0.0  0.0  ...     0     0   1.0
1603      0.0  0.0  1.0  ...     0     0   0.0
1604      1.0  0.0  0.0  ...     0     0   0.0
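Note that range(df.columns.min(), df.columns.max() + 1) only recovers the full range because movie_id 1 and movie_id 1500 both happen to be present. If either endpoint could be missing too, pass the known range explicitly; a sketch, assuming the ids should run 1 through 1500:

# conform the columns to the full known id range, filling missing movies with 0
df = df.reindex(range(1, 1501), axis=1, fill_value=0)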
Alternatively, iterate over the full id range. I assume the variable name of the matrix is matrix:
n_movies = 1500
movie_ids = matrix.columns
# iterate over all possible ids
for movie_id in range(1, n_movies + 1):
    # if the movie was never recorded, create a column filled with zeros
    if movie_id not in movie_ids:
        matrix[movie_id] = 0
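One caveat with the loop: columns created this way are appended on the right, so you may want to restore ascending order afterwards:

# put the movie_id columns back into ascending numeric order
matrix = matrix.sort_index(axis=1)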
My dataframe's columns are not in numeric sequence. len(df.columns) shows the data has 3586 columns. How can I re-order the columns into numeric order?
ID V1 V10 V100 V1000 V1001 V1002 ... V990 V991 V992 V993 V994
A 1 9.0 2.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
B 1 1.2 0.1 3.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
C 2 8.6 8.0 2.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
D 3 0.0 2.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0
E 4 7.8 6.6 3.0 0.0 0.0 0.0 4.0 0.0 0.0 0.0 0.0
I used df = df.reindex(sorted(df.columns), axis=1) (based on this question: Re-ordering columns in pandas dataframe based on column name), but it still doesn't work: the sort is lexicographic, so e.g. V10 comes before V2.
Thank you.
First, get all columns that do not match the pattern V + number by filtering with str.contains. Then sort the remaining columns (obtained with Index.difference) by their numeric part, concatenate the two lists, and pass the result to DataFrame.reindex. This puts all non-matching columns first, followed by the V + number columns in numeric order:
L1 = df.columns[~df.columns.str.contains(r'^V\d+$')].tolist()
L2 = sorted(df.columns.difference(L1), key=lambda x: float(x[1:]))
df = df.reindex(L1 + L2, axis=1)
print(df)
ID V1 V10 V100 V990 V991 V992 V993 V994 V1000 V1001 V1002
A 1 9.0 2.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
B 1 1.2 0.1 3.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
C 2 8.6 8.0 2.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
D 3 0.0 2.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
E 4 7.8 6.6 3.0 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
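An alternative sketch: extract the numeric part of each column name with a regex and sort on it. The column names here are assumed to match the example; non-matching columns such as ID get -1 and so stay in front, though their relative order is not guaranteed.

# numeric part of each 'V<number>' column name; NaN for others like 'ID'
nums = df.columns.str.extract(r'^V(\d+)$', expand=False).astype(float)
df = df.iloc[:, nums.fillna(-1).argsort()]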
I have a pandas DataFrame that tells me the monthly sales of items in shops.
df.head():
ID month sold
0 150983 0 1.0
1 56520 0 13.0
2 56520 1 7.0
3 56520 2 13.0
4 56520 3 8.0
I want to remove all IDs for which there were no sales in the last month, i.e. month == 33 & sold == 0. Doing the following
unwanted_df = df[((df['month'] == 33) & (df['sold'] == 0.0))]
I get just 46 rows, which is far too few. But never mind, I would like to have the data in a different format anyway. A pivoted version of the above table is just what I want:
pivoted_df = df.pivot(index='month', columns = 'ID', values = 'sold').fillna(0)
pivoted_df.head()
ID 0 2 3 5 6 7 8 10 11 12 ... 214182 214185 214187 214190 214191 214192 214193 214195 214197 214199
month
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Question. How to remove columns with the value 0 in the last row in pivoted_df?
You can do this with one line:
pivoted_df = pivoted_df.drop(pivoted_df.columns[pivoted_df.iloc[-1, :] == 0], axis=1)
I want to remove all IDs where there were no sales last month
You can first calculate the IDs satisfying your condition:
id_selected = df.loc[(df['month'] == 33) & (df['sold'] == 0), 'ID']
Then filter these from your dataframe via a Boolean mask:
df = df[~df['ID'].isin(id_selected)]
Finally, use pd.pivot_table with your filtered dataframe.
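For example, a minimal sketch reusing the arguments from the pivot in the question (fill_value=0 replaces the separate .fillna(0) step):

pivoted_df = pd.pivot_table(df, index='month', columns='ID',
                            values='sold', fill_value=0)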
I have the following numpy matrix:
0 1 2 3 4 5 6 7 8 9
0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 5.0 0.0 9.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 7.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 5.0 0.0 0.0 0.0 0.0 0.0 0.0 6.0 0.0 0.0
8 2.0 0.0 0.0 0.0 3.0 0.0 6.0 0.0 8.0 0.0
9 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I want to calculate the non-zero values average of every row and column separately. So my result should be something like this:
average_rows = [1.0,7.0,2.0,5.0,0.0,4.0,0.0,5.5,4.75,1.0,0.0]
average_cols = [3.5,1.0,4.33333,0.0,4.33333,0.0,4.0,6.0,6.5,0.0]
I can't figure out how to iterate over them, and I keep getting TypeError: unhashable type.
I'm also not sure iterating is the best solution. I tried grabbing each column with R[:,i] and summing it with sum(R[:,i]), but I keep getting the same error.
It is better to use a 2-D np.array instead of np.matrix.
import numpy as np
data = np.array([[1, 2, 0], [0, 0, 1], [0, 2, 4]], dtype='float')
data[data == 0] = np.nan
# replace all zeroes with `nan`'s to skip them
# [[ 1. 2. nan]
# [ nan nan 1.]
# [ nan 2. 4.]]
np.nanmean(data, axis=0)
# array([ 1. , 2. , 2.5])
np.nanmean(data, axis=1)
# array([ 1.5, 1. , 3. ])
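Note that np.nanmean emits a RuntimeWarning and returns nan for rows or columns that are entirely zero, whereas the expected output above uses 0.0 for those. A sketch that avoids nan altogether, assuming R is the original 2-D float array:

def nonzero_mean(R, axis):
    # mean of the non-zero entries along `axis`; 0.0 where everything is zero
    sums = R.sum(axis=axis)
    counts = (R != 0).sum(axis=axis)
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts != 0)

average_rows = nonzero_mean(R, axis=1)
average_cols = nonzero_mean(R, axis=0)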