Pivoting a One-Hot-Encoded Dataframe - python

I have a pandas dataframe that looks like this:
genres.head()
Drama Comedy Action Crime Romance Thriller Adventure Horror Mystery Fantasy ... History Music War Documentary Sport Musical Western Film-Noir News number_of_genres
tconst
tt0111161 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
tt0468569 1 0 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 3
tt1375666 0 0 1 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 3
tt0137523 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
tt0110912 1 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 2
I want to be able to get a table where the rows are the genres, the columns are the number of labels for a given movie and the values are the counts. In other words, I want this:
number_of_genres 1 2 3 totals
Drama 451 1481 3574 5506
Comedy 333 1108 2248 3689
Action 9 230 1971 2210
Crime 1 284 1687 1972
Romance 1 646 1156 1803
Thriller 22 449 1153 1624
Adventure 1 98 1454 1553
Horror 137 324 765 1226
Mystery 0 108 792 900
Fantasy 1 74 642 717
Sci-Fi 0 129 551 680
Biography 0 95 532 627
Family 0 60 452 512
Animation 0 6 431 437
History 0 32 314 346
Music 1 87 223 311
War 0 90 162 252
Documentary 70 82 78 230
Sport 0 78 142 220
Musical 0 13 131 144
Western 19 44 57 120
Film-Noir 0 11 50 61
News 0 1 2 3
Total 1046 5530 18567 25143
What is the most Pythonic way of getting that table? I solved the problem with the following code, but was wondering if there's a better way:
import numpy as np
import pandas as pd

genres['number_of_genres'] = genres.sum(axis=1)
pivots = []
# Build a one-row pivot per genre column (every column except number_of_genres)
for column in genres.columns[0:-1]:
    column = pd.DataFrame(genres[column])
    columns = column.join(genres.number_of_genres)
    pivot = pd.pivot_table(columns, values=columns.columns[0], columns='number_of_genres', aggfunc=np.sum)
    pivots.append(pivot)
# Stack the per-genre rows, then add the row and column totals
pivots_df = pd.concat(pivots)
pivots_df['totals'] = pivots_df.sum(axis=1)
pivots_df.loc['Total'] = pivots_df.sum()
[EDIT]: Added jupyter output that should be compatible with pd.read_clipboard(). If I can format the output better, please let me know how I can do so.

Maybe I'm missing something but doesn't this work for you?
agg = df.groupby('number_of_genres').agg('sum').T
agg['totals'] = agg.sum(axis=1)
Edit: Solution via pivot_table
agg = df.pivot_table(columns='number_of_genres', aggfunc='sum')
agg['totals'] = agg.sum(axis=1)
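As a rough illustration of the groupby approach above, here is a minimal sketch on a made-up four-movie frame (the movie labels m1..m4 and the three genre columns are assumptions for the example, not the asker's data). It also adds the bottom 'Total' row that the question asks for:

```python
import pandas as pd

# Toy one-hot frame; the index labels and values are invented for illustration.
genres = pd.DataFrame(
    {"Drama": [1, 1, 0, 1], "Comedy": [0, 1, 1, 0], "Action": [0, 1, 1, 0]},
    index=["m1", "m2", "m3", "m4"],
)
genres["number_of_genres"] = genres.sum(axis=1)

# Sum each genre column within every number_of_genres group,
# then transpose so that genres become the rows.
table = genres.groupby("number_of_genres").sum().T
table["totals"] = table.sum(axis=1)   # per-genre totals (last column)
table.loc["Total"] = table.sum()      # per-column totals (last row)
print(table)
```

Because the columns are 0/1 flags, summing within each group is the same as counting the movies that carry that genre.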

Related

Renaming a number of columns using for loop (python)

The dataframe below has a number of columns, but the column names are random numbers.
daily1=
0 1 2 3 4 5 6 7 8 9 ... 11 12 13 14 15 16 17 18 19 20
0 0 0 0 0 0 0 4 0 0 0 ... 640 777 674 842 786 865 809 674 679 852
1 0 0 0 0 0 0 0 0 0 0 ... 108 29 74 102 82 62 83 68 30 61
2 rows × 244 columns
I would like to put the column names in numerical order (from 0 to 243).
I tried
for i, n in zip(daily1.columns, range(244)):
asd=daily1.rename(columns={i:n})
asd
but no output was shown...
Ideal output is
0 1 2 3 4 5 6 7 8 9 ... 234 235 236 237 238 239 240 241 242 243
0 0 0 0 0 0 0 4 0 0 0 ... 640 777 674 842 786 865 809 674 679 852
1 0 0 0 0 0 0 0 0 0 0 ... 108 29 74 102 82 62 83 68 30 61
Could I get some advice, guys? Thank you.
If you want to reorder the columns, you can try this:
columns = sorted(list(df.columns), reverse=False)
df = df[columns]
If you just want to rename the columns, then you can try:
df.columns = [i for i in range(df.shape[1])]
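The two options can be compared side by side in a small sketch. The three-column frame below is invented for the example. Note that DataFrame.rename returns a new frame rather than modifying daily1, so a loop that calls daily1.rename(...) once per column keeps only the last single-column rename.

```python
import pandas as pd

# Small stand-in for daily1 with out-of-order integer column labels (made up).
daily1 = pd.DataFrame([[0, 4, 640], [0, 0, 108]], columns=[7, 3, 11])

# Option 1: keep the labels, but put the columns in ascending label order.
reordered = daily1[sorted(daily1.columns)]

# Option 2: keep the column order, but relabel the columns 0..n-1.
renamed = daily1.copy()
renamed.columns = range(renamed.shape[1])

print(reordered)
print(renamed)
```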

Reshape data to new format for object detection

I have a data set in a dataframe in this format:
0--Parade/0_Parade_marchingband_1_849.jpg
2
449 330 122 149 0 0 0 0 0 0
0--Parade/0_Parade_Parade_0_904.jpg
1
361 98 263 339 0 0 0 0 0 0
0--Parade/0_Parade_marchingband_1_799.jpg
45
78 221 7 8 0 0 0 0 0
78 238 14 17 2 0 0 0 0 0
3 232 11 15 2 0 0 0 2 0
20 215 12 16 2 0 0 0 2 0
0--Parade/0_Parade_marchingband_1_117.jpg
23
69 359 50 36 1 0 0 0 0 1
227 382 56 43 1 0 1 0 0 1
296 305 44 26 1 0 0 0 0 1
353 280 40 36 2 0 0 0 2 1
885 377 63 41 1 0 0 0 0 1
819 391 34 43 2 0 0 0 1 0
727 342 37 31 2 0 0 0 0 1
598 246 33 29 2 0 0 0 0 1
740 308 45 33 1 0 0 0 2 1
0--Parade/0_Parade_marchingband_1_778.jpg
35
27 226 33 36 1 0 0 0 2 0
63 95 16 19 2 0 0 0 0 0
64 63 17 18 2 0 0 0 0 0
88 13 16 15 2 0 0 0 1 0
231 1 13 13 2 0 0 0 1 0
263 122 14 20 2 0 0 0 0 0
367 68 15 23 2 0 0 0 0 0
198 98 15 18 2 0 0 0 0 0
293 161 52 59 1 0 0 0 1 0
412 36 14 20 2 0 0 0 1 0
Can anyone tell me how to put these in a dataframe where the first column contains the .jpg paths and the following columns contain the count and the coordinates, with every coordinate row matched to its .jpg path?
E.g.
Column1 column2 column3
0--Parade/0_Parade_marchingband_1_849.jpg | 2 | 449 330 122 149 0 0 0 0 0 0
0--Parade/0_Parade_Parade_0_904.jpg | 1 | 361 98 263 339 0 0 0 0 0 0
0--Parade/0_Parade_marchingband_1_799.jpg | 45 | 78 221 7 8 0 0 0 0 0
| | 78 238 14 17 2 0 0 0 0 0
| | 3 232 11 15 2 0 0 0 2 0
| | 20 215 12 16 2 0 0 0 2 0
I have tried this:
count1=0
count2=0
dict1 = {}
dict2 = {}
dict3 = {}
for i in data[0]:
    if (i.find('.jpg') == -1):
        dict1[count1] = i
        count1+=1
    else:
        dict2[count2] = i
        count2+=1
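One possible way to finish this, sketched under the assumption that the raw lines are available as a plain list of strings (the names lines, records and boxes are hypothetical). A path line starts a new image, the next line is its count, and every following line up to the next path is a coordinate row for that image. The sketch treats the next .jpg line as the delimiter instead of trusting the count, since the counts in the sample above do not always match the number of rows shown.

```python
import pandas as pd

# A few lines copied from the sample in the question.
lines = [
    "0--Parade/0_Parade_marchingband_1_849.jpg",
    "2",
    "449 330 122 149 0 0 0 0 0 0",
    "0--Parade/0_Parade_Parade_0_904.jpg",
    "1",
    "361 98 263 339 0 0 0 0 0 0",
    "0--Parade/0_Parade_marchingband_1_799.jpg",
    "45",
    "78 238 14 17 2 0 0 0 0 0",
    "3 232 11 15 2 0 0 0 2 0",
]

records = []
path = count = None
for line in lines:
    line = line.strip()
    if not line:
        continue                      # skip blank lines
    if line.endswith(".jpg"):
        path, count = line, None      # a new image starts here
    elif count is None:
        count = int(line)             # first line after a path is the count
    else:
        # every other line is one coordinate row for the current image
        records.append({"path": path, "count": count, "box": line})

boxes = pd.DataFrame(records)
print(boxes)
```

This repeats the path on every coordinate row instead of leaving it blank, which usually makes later grouping or filtering by image easier.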

Pandas Groupby, MultiIndex, Multiple Columns

I just worked on creating some columns using .transform() to count some entries.
I used this reference.
For example:
userID deviceName POWER_DOWN USER LOW_RSSI NONE CMD_SUCCESS
0 24 IR_00 85 0 39 0 0
1 24 IR_00 85 0 39 0 0
2 24 IR_00 85 0 39 0 0
3 24 IR_00 85 0 39 0 0
4 25 BED_08 0 109 78 0 0
5 25 BED_08 0 109 78 0 0
6 25 BED_08 0 109 78 0 0
7 24 IR_00 85 0 39 0 0
8 23 IR_09 2 0 0 0 0
9 23 V33_17 3 0 2 0 134
10 23 V33_17 3 0 2 0 134
11 23 V33_17 3 0 2 0 134
12 23 V33_17 3 0 2 0 134
I want to group them by userID and deviceName,
so that it would look like:
   userID deviceName  POWER_DOWN  USER  LOW_RSSI  NONE  CMD_SUCCESS
0      23      IR_09           2     0         0     0            0
1             V33_17           3     0         2     0          134
2      24      IR_00          85     0        39     0            0
3      25     BED_08           0   109        78     0            0
I also want them to be sorted by userID, and maybe to make userID and deviceName a MultiIndex.
I tried df = df.groupby(['userID', 'deviceName'])
but it returned a
<pandas.core.groupby.DataFrameGroupBy object at 0x00000249BBB13DD8>,
not a dataframe.
By the way, I'm sorry, I don't know how to copy a Jupyter notebook's In and Out cells.
I believe you need drop_duplicates with sort_values:
df1 = df.drop_duplicates(['userID', 'deviceName']).sort_values('userID')
print (df1)
userID deviceName POWER_DOWN USER LOW_RSSI NONE CMD_SUCCESS
8 23 IR_09 2 0 0 0 0
9 23 V33_17 3 0 2 0 134
0 24 IR_00 85 0 39 0 0
4 25 BED_08 0 109 78 0 0
If you want to create a MultiIndex, add set_index:
df1 = (df.drop_duplicates(['userID', 'deviceName'])
         .sort_values('userID')
         .set_index(['userID', 'deviceName']))
print (df1)
                   POWER_DOWN  USER  LOW_RSSI  NONE  CMD_SUCCESS
userID deviceName
23     IR_09                2     0         0     0            0
       V33_17               3     0         2     0          134
24     IR_00               85     0        39     0            0
25     BED_08               0   109        78     0            0
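As a side note on the groupby attempt in the question: groupby alone only builds a GroupBy object, and a dataframe comes back once an aggregation is applied. The sketch below uses a cut-down version of the sample data (one value column only) and .first() as the aggregation. This is an alternative to the drop_duplicates answer, not a restatement of it, and the two agree here only because the rows within each group are identical.

```python
import pandas as pd

# Reduced version of the sample frame: duplicate rows per (userID, deviceName).
df = pd.DataFrame({
    "userID": [24, 24, 25, 23, 23],
    "deviceName": ["IR_00", "IR_00", "BED_08", "IR_09", "V33_17"],
    "POWER_DOWN": [85, 85, 0, 2, 3],
})

# Aggregating turns the GroupBy object into a dataframe; the group keys
# become a sorted MultiIndex of (userID, deviceName).
grouped = df.groupby(["userID", "deviceName"]).first()
print(grouped)
```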

How to count occurrences that appear in 2 or more dataframe columns?

Here is the data for my problem below. This is a data set of movie reviews. One line = one review by a reviewer.
bigdataframe
Out[43]:
movie id movietitle releasedate \
0 1 Toy Story (1995) 01-Jan-1995
1 4 Get Shorty (1995) 01-Jan-1995
2 5 Copycat (1995) 01-Jan-1995
3 7 Twelve Monkeys (1995) 01-Jan-1995
4 8 Babe (1995) 01-Jan-1995
5 9 Dead Man Walking (1995) 01-Jan-1995
6 11 Seven (Se7en) (1995) 01-Jan-1995
7 12 Usual Suspects, The (1995) 14-Aug-1995
8 15 Mr. Holland's Opus (1995) 29-Jan-1996
9 17 From Dusk Till Dawn (1996) 05-Feb-1996
10 19 Antonia's Line (1995) 01-Jan-1995
11 21 Muppet Treasure Island (1996) 16-Feb-1996
12 22 Braveheart (1995) 16-Feb-1996
13 23 Taxi Driver (1976) 16-Feb-1996
14 24 Rumble in the Bronx (1995) 23-Feb-1996
15 25 Birdcage, The (1996) 08-Mar-1996
16 28 Apollo 13 (1995) 01-Jan-1995
17 30 Belle de jour (1967) 01-Jan-1967
18 31 Crimson Tide (1995) 01-Jan-1995
19 32 Crumb (1994) 01-Jan-1994
20 42 Clerks (1994) 01-Jan-1994
21 44 Dolores Claiborne (1994) 01-Jan-1994
22 45 Eat Drink Man Woman (1994) 01-Jan-1994
23 47 Ed Wood (1994) 01-Jan-1994
24 48 Hoop Dreams (1994) 01-Jan-1994
25 49 I.Q. (1994) 01-Jan-1994
26 50 Star Wars (1977) 01-Jan-1977
27 54 Outbreak (1995) 01-Jan-1995
28 55 Professional, The (1994) 01-Jan-1994
29 56 Pulp Fiction (1994) 01-Jan-1994
... ... ...
99970 332 Kiss the Girls (1997) 01-Jan-1997
99971 334 U Turn (1997) 01-Jan-1997
99972 338 Bean (1997) 01-Jan-1997
99973 346 Jackie Brown (1997) 01-Jan-1997
99974 682 I Know What You Did Last Summer (1997) 17-Oct-1997
99975 873 Picture Perfect (1997) 01-Aug-1997
99976 877 Excess Baggage (1997) 01-Jan-1997
99977 886 Life Less Ordinary, A (1997) 01-Jan-1997
99978 1527 Senseless (1998) 09-Jan-1998
99979 272 Good Will Hunting (1997) 01-Jan-1997
99980 288 Scream (1996) 20-Dec-1996
99981 294 Liar Liar (1997) 21-Mar-1997
99982 300 Air Force One (1997) 01-Jan-1997
99983 310 Rainmaker, The (1997) 01-Jan-1997
99984 313 Titanic (1997) 01-Jan-1997
99985 322 Murder at 1600 (1997) 18-Apr-1997
99986 328 Conspiracy Theory (1997) 08-Aug-1997
99987 333 Game, The (1997) 01-Jan-1997
99988 338 Bean (1997) 01-Jan-1997
99989 346 Jackie Brown (1997) 01-Jan-1997
99990 354 Wedding Singer, The (1998) 13-Feb-1998
99991 362 Blues Brothers 2000 (1998) 06-Feb-1998
99992 683 Rocket Man (1997) 01-Jan-1997
99993 689 Jackal, The (1997) 01-Jan-1997
99994 690 Seven Years in Tibet (1997) 01-Jan-1997
99995 748 Saint, The (1997) 14-Mar-1997
99996 751 Tomorrow Never Dies (1997) 01-Jan-1997
99997 879 Peacemaker, The (1997) 01-Jan-1997
99998 894 Home Alone 3 (1997) 01-Jan-1997
99999 901 Mr. Magoo (1997) 25-Dec-1997
videoreleasedate IMDb URL \
0 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2...
1 NaN http://us.imdb.com/M/title-exact?Get%20Shorty%...
2 NaN http://us.imdb.com/M/title-exact?Copycat%20(1995)
3 NaN http://us.imdb.com/M/title-exact?Twelve%20Monk...
4 NaN http://us.imdb.com/M/title-exact?Babe%20(1995)
5 NaN http://us.imdb.com/M/title-exact?Dead%20Man%20...
6 NaN http://us.imdb.com/M/title-exact?Se7en%20(1995)
7 NaN http://us.imdb.com/M/title-exact?Usual%20Suspe...
8 NaN http://us.imdb.com/M/title-exact?Mr.%20Holland...
9 NaN http://us.imdb.com/M/title-exact?From%20Dusk%2...
10 NaN http://us.imdb.com/M/title-exact?Antonia%20(1995)
11 NaN http://us.imdb.com/M/title-exact?Muppet%20Trea...
12 NaN http://us.imdb.com/M/title-exact?Braveheart%20...
13 NaN http://us.imdb.com/M/title-exact?Taxi%20Driver...
14 NaN http://us.imdb.com/M/title-exact?Hong%20Faan%2...
15 NaN http://us.imdb.com/M/title-exact?Birdcage,%20T...
16 NaN http://us.imdb.com/M/title-exact?Apollo%2013%2...
17 NaN http://us.imdb.com/M/title-exact?Belle%20de%20...
18 NaN http://us.imdb.com/M/title-exact?Crimson%20Tid...
19 NaN http://us.imdb.com/M/title-exact?Crumb%20(1994)
20 NaN http://us.imdb.com/M/title-exact?Clerks%20(1994)
21 NaN http://us.imdb.com/M/title-exact?Dolores%20Cla...
22 NaN http://us.imdb.com/M/title-exact?Yinshi%20Nan%...
23 NaN http://us.imdb.com/M/title-exact?Ed%20Wood%20(...
24 NaN http://us.imdb.com/M/title-exact?Hoop%20Dreams...
25 NaN http://us.imdb.com/M/title-exact?I.Q.%20(1994)
26 NaN http://us.imdb.com/M/title-exact?Star%20Wars%2...
27 NaN http://us.imdb.com/M/title-exact?Outbreak%20(1...
28 NaN http://us.imdb.com/Title?L%E9on+(1994)
29 NaN http://us.imdb.com/M/title-exact?Pulp%20Fictio...
... ...
99970 NaN http://us.imdb.com/M/title-exact?Kiss+the+Girl...
99971 NaN http://us.imdb.com/Title?U+Turn+(1997)
99972 NaN http://us.imdb.com/M/title-exact?Bean+(1997)
99973 NaN http://us.imdb.com/M/title-exact?imdb-title-11...
99974 NaN http://us.imdb.com/M/title-exact?I+Know+What+Y...
99975 NaN http://us.imdb.com/M/title-exact?Picture+Perfe...
99976 NaN http://us.imdb.com/M/title-exact?Excess+Baggag...
99977 NaN http://us.imdb.com/M/title-exact?Life+Less+Ord...
99978 NaN http://us.imdb.com/M/title-exact?imdb-title-12...
99979 NaN http://us.imdb.com/M/title-exact?imdb-title-11...
99980 NaN http://us.imdb.com/M/title-exact?Scream%20(1996)
99981 NaN http://us.imdb.com/Title?Liar+Liar+(1997)
99982 NaN http://us.imdb.com/M/title-exact?Air+Force+One...
99983 NaN http://us.imdb.com/M/title-exact?Rainmaker,+Th...
99984 NaN http://us.imdb.com/M/title-exact?imdb-title-12...
99985 NaN http://us.imdb.com/M/title-exact?Murder%20at%2...
99986 NaN http://us.imdb.com/M/title-exact?Conspiracy+Th...
99987 NaN http://us.imdb.com/M/title-exact?Game%2C+The+(...
99988 NaN http://us.imdb.com/M/title-exact?Bean+(1997)
99989 NaN http://us.imdb.com/M/title-exact?imdb-title-11...
99990 NaN http://us.imdb.com/M/title-exact?Wedding+Singe...
99991 NaN http://us.imdb.com/M/title-exact?Blues+Brother...
99992 NaN http://us.imdb.com/M/title-exact?Rocket+Man+(1...
99993 NaN http://us.imdb.com/M/title-exact?Jackal%2C+The...
99994 NaN http://us.imdb.com/M/title-exact?Seven+Years+i...
99995 NaN http://us.imdb.com/M/title-exact?Saint%2C%20Th...
99996 NaN http://us.imdb.com/M/title-exact?imdb-title-12...
99997 NaN http://us.imdb.com/M/title-exact?Peacemaker%2C...
99998 NaN http://us.imdb.com/M/title-exact?imdb-title-11...
99999 NaN http://us.imdb.com/M/title-exact?imdb-title-11...
unknown Action Adventure Animation Childrens ... Western \
0 0 0 0 1 1 ... 0
1 0 1 0 0 0 ... 0
2 0 0 0 0 0 ... 0
3 0 0 0 0 0 ... 0
4 0 0 0 0 1 ... 0
5 0 0 0 0 0 ... 0
6 0 0 0 0 0 ... 0
7 0 0 0 0 0 ... 0
8 0 0 0 0 0 ... 0
9 0 1 0 0 0 ... 0
10 0 0 0 0 0 ... 0
11 0 1 1 0 0 ... 0
12 0 1 0 0 0 ... 0
13 0 0 0 0 0 ... 0
14 0 1 1 0 0 ... 0
15 0 0 0 0 0 ... 0
16 0 1 0 0 0 ... 0
17 0 0 0 0 0 ... 0
18 0 0 0 0 0 ... 0
19 0 0 0 0 0 ... 0
20 0 0 0 0 0 ... 0
21 0 0 0 0 0 ... 0
22 0 0 0 0 0 ... 0
23 0 0 0 0 0 ... 0
24 0 0 0 0 0 ... 0
25 0 0 0 0 0 ... 0
26 0 1 1 0 0 ... 0
27 0 1 0 0 0 ... 0
28 0 0 0 0 0 ... 0
29 0 0 0 0 0 ... 0
... ... ... ... ... ... ...
99970 0 0 0 0 0 ... 0
99971 0 1 0 0 0 ... 0
99972 0 0 0 0 0 ... 0
99973 0 0 0 0 0 ... 0
99974 0 0 0 0 0 ... 0
99975 0 0 0 0 0 ... 0
99976 0 0 1 0 0 ... 0
99977 0 0 0 0 0 ... 0
99978 0 0 0 0 0 ... 0
99979 0 0 0 0 0 ... 0
99980 0 0 0 0 0 ... 0
99981 0 0 0 0 0 ... 0
99982 0 1 0 0 0 ... 0
99983 0 0 0 0 0 ... 0
99984 0 1 0 0 0 ... 0
99985 0 0 0 0 0 ... 0
99986 0 1 0 0 0 ... 0
99987 0 0 0 0 0 ... 0
99988 0 0 0 0 0 ... 0
99989 0 0 0 0 0 ... 0
99990 0 0 0 0 0 ... 0
99991 0 1 0 0 0 ... 0
99992 0 0 0 0 0 ... 0
99993 0 1 0 0 0 ... 0
99994 0 0 0 0 0 ... 0
99995 0 1 0 0 0 ... 0
99996 0 1 0 0 0 ... 0
99997 0 1 0 0 0 ... 0
99998 0 0 0 0 1 ... 0
99999 0 0 0 0 0 ... 0
The genres are Action, Adventure, Animation, Children's ... Western. There are around 20 genres, but the dataframe doesn't print them all out. How can I figure out which reviews classified their movies in at least 2 genres? This means the movie was said to belong to two genres, such as action and drama.
Since each of the genres is in its own dataframe column, I am a bit confused about how to do this. If there were a single genre column I would simply use groupby, because it would work well with the genres and their counts.
Any insight would help!
Edit: As an example, you can see that movie "0", Toy Story, was classified as Animation and Children's because it has a 1 in both columns.
Essentially, you are only interested in rows for which the sum of the genre columns is greater than 1.
Across all numeric columns this can be achieved with df = df[df.sum(axis=1, numeric_only=True) > 1]; numeric_only=True skips the non-numeric columns (older pandas versions skipped them automatically, while pandas 2.0 and later may raise an error without it).
The real issue here is how to sum only the genre columns (because the movie id column also seems to be numeric).
If you have an external list of genres you can use it, i.e. df = df[df[['Horror', 'Comedy']].sum(axis=1) > 1].
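A minimal sketch of that idea, using three rows and only three of the genre columns from the printout above (the variable names reviews, genre_cols and multi_genre are made up for the example). The comparison is written as >= 2, which is the same as > 1 for whole-number flags:

```python
import pandas as pd

# First three rows of the printout, restricted to three displayed genre columns.
reviews = pd.DataFrame({
    "movie id": [1, 4, 5],
    "movietitle": ["Toy Story (1995)", "Get Shorty (1995)", "Copycat (1995)"],
    "Action": [0, 1, 0],
    "Animation": [1, 0, 0],
    "Childrens": [1, 0, 0],
})

# Explicit list of genre columns, so that "movie id" is not part of the sum.
genre_cols = ["Action", "Animation", "Childrens"]

genre_count = reviews[genre_cols].sum(axis=1)   # genres flagged per review
multi_genre = reviews[genre_count >= 2]         # reviews with at least 2 genres
print(multi_genre)
```

On the full frame, genre_cols would need to list all of the genre columns, not just these three.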

How to change this dataframe with python in order to use collaborative filtering

Here is my original data:
(image not preserved)
As you can see, the cust_id column identifies the customer for each purchase record. The second column is the product name, and the third is the number they bought each time.
I want to get this kind of data:
(image not preserved)
The result shows which products each customer bought and how many. If they never bought a product, the value is None. I think this is a sparse matrix.
I have tried many ways and still can't work it out.
Maybe pandas? NumPy?
There is a problem with duplicates; I added a last row with the same cust_id and prd_id values to demonstrate it.
print (df)
cust_id prd_id prd_number
8 462 40 1
9 462 46 3
10 462 59 1
11 462 63 13
12 462 67 1
13 462 82 12
14 462 88 1
15 462 163 3
16 463 68 1
17 463 90 1
18 463 159 2
16 464 93 11
20 464 94 8
21 464 96 1
22 464 142 4
23 465 50 1
24 465 50 5
Then you need to group by the cust_id and prd_id columns and aggregate with a function like mean() or sum(). Finally, unstack, replacing NaN with 0:
print (df.groupby(['cust_id', 'prd_id'])['prd_number'].sum().unstack(fill_value=0))
prd_id 40 46 50 59 63 67 68 82 88 90 93 94 96 142 \
cust_id
462 1 3 0 1 13 1 0 12 1 0 0 0 0 0
463 0 0 0 0 0 0 1 0 0 1 0 0 0 0
464 0 0 0 0 0 0 0 0 0 0 11 8 1 4
465 0 0 6 0 0 0 0 0 0 0 0 0 0 0
prd_id 159 163
cust_id
462 0 3
463 2 0
464 0 0
465 0 0
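The same customer-by-product matrix can also be built with pivot_table, which does the grouping, aggregation and zero-filling in one call. This is a sketch on a handful of rows taken from the sample above, offered as an alternative spelling, not as the answer's own method:

```python
import pandas as pd

# A few rows from the sample, including the duplicated (465, 50) pair.
df = pd.DataFrame({
    "cust_id": [462, 462, 463, 465, 465],
    "prd_id": [40, 46, 68, 50, 50],
    "prd_number": [1, 3, 1, 1, 5],
})

# Rows are customers, columns are products; duplicate pairs are summed
# and combinations that never occur are filled with 0.
matrix = df.pivot_table(
    index="cust_id",
    columns="prd_id",
    values="prd_number",
    aggfunc="sum",
    fill_value=0,
)
print(matrix)
```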
