I have this irregular list:
['6', '20553737100', '6', '20431084172', '25200.00', '4536.00', 'PEN', '09', 'EG01', '124', '2022-06-20', '29735.43', ['POLO MANGA LARGA T L', '600.00', '16.90', '19.942', '1825.20', '10140.00', '18.00'], ['POLO MANGA LARGA T M', '600.00', '16.90', '19.942', '1825.20', '10140.00', '18.00'], ['LENTE LUNA CLARA TSG-100 ANTIEMPAÑO SIMPLE', '800.00', '2.65', '3.127', '381.60', '2120.00', '18.00'], ['LENTE LUNA OSCURA TSG-100 ANTIEMPAÑO C/CORDON', '800.00', '3.50', '4.13', '504.00', '2800.00', '18.00']]
I would like it to look like this in a dataframe:
0 6 20553737100 6 20431084172 25200 4536 PEN 09 EG01 124 2022-06-20 29735.43 POLO MANGA LARGA T L 600 16.9 19.942 10140 1825.2 18
1 6 20553737100 6 20431084172 25200 4536 PEN 09 EG01 124 2022-06-20 29735.43 POLO MANGA LARGA T M 600 16.9 19.942 10140 1825.2 18
2 6 20553737100 6 20431084172 25200 4536 PEN 09 EG01 124 2022-06-20 29735.43 LENTE LUNA CLARA TSG 800 2.65 3.127 2120 381.6 18
3 6 20553737100 6 20431084172 25200 4536 PEN 09 EG01 124 2022-06-20 29735.43 LENTE LUNA OSCURA TSG 800 3.5 4.13 2800 504 18
Sorry for the incomplete question a moment ago; apparently my list of lists is more complex than I thought.
You can separate the input list into two parts to construct two dataframes, which are then concatenated:
import pandas as pd

lst = ['6', '20553737100', '6', '20431084172', '25200.00', '4536.00', 'PEN', '09', 'EG01', '124', '2022-06-20', '29735.43', ['POLO MANGA LARGA T L', '600.00', '16.90', '19.942', '1825.20', '10140.00', '18.00'], ['POLO MANGA LARGA T M', '600.00', '16.90', '19.942', '1825.20', '10140.00', '18.00'], ['LENTE LUNA CLARA TSG-100 ANTIEMPAÑO SIMPLE', '800.00', '2.65', '3.127', '381.60', '2120.00', '18.00'], ['LENTE LUNA OSCURA TSG-100 ANTIEMPAÑO C/CORDON', '800.00', '3.50', '4.13', '504.00', '2800.00', '18.00']]
df = pd.concat([pd.DataFrame({f'column_{i}': v for i, v in enumerate(lst[:12], 1)}, index=[0]),
                pd.DataFrame(columns=[f'column_{i}' for i in range(13, 13 + len(lst[12]))],
                             data=lst[12:])], axis=1).ffill()
print(df)
column_1 column_2 column_3 column_4 column_5 column_6 column_7 \
0 6 20553737100 6 20431084172 25200.00 4536.00 PEN
1 6 20553737100 6 20431084172 25200.00 4536.00 PEN
2 6 20553737100 6 20431084172 25200.00 4536.00 PEN
3 6 20553737100 6 20431084172 25200.00 4536.00 PEN
column_8 column_9 column_10 column_11 column_12 \
0 09 EG01 124 2022-06-20 29735.43
1 09 EG01 124 2022-06-20 29735.43
2 09 EG01 124 2022-06-20 29735.43
3 09 EG01 124 2022-06-20 29735.43
column_13 column_14 column_15 \
0 POLO MANGA LARGA T L 600.00 16.90
1 POLO MANGA LARGA T M 600.00 16.90
2 LENTE LUNA CLARA TSG-100 ANTIEMPAÑO SIMPLE 800.00 2.65
3 LENTE LUNA OSCURA TSG-100 ANTIEMPAÑO C/CORDON 800.00 3.50
column_16 column_17 column_18 column_19
0 19.942 1825.20 10140.00 18.00
1 19.942 1825.20 10140.00 18.00
2 3.127 381.60 2120.00 18.00
3 4.13 504.00 2800.00 18.00
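Since every value in the source list is a string, all of the resulting columns are object dtype. If you need real numbers afterwards, here is a possible follow-up sketch that converts only the columns that parse fully as numeric (note it would also turn a code like '09' into 9, so skip such columns if they must stay text):
# df from above; convert columns whose every value parses as a number
for col in df.columns:
    converted = pd.to_numeric(df[col], errors='coerce')
    if converted.notna().all():
        df[col] = converted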
Using a cross merge:
l = [['a'],['b'],['c'],['d'],[[5],[9],[7],[4]]]
(pd.DataFrame(l[:4]).T.merge(pd.DataFrame(l[4]), how='cross')
   .set_axis([f'column_{i+1}' for i in range(5)], axis=1)
)
Output:
column_1 column_2 column_3 column_4 column_5
0 a b c d 5
1 a b c d 9
2 a b c d 7
3 a b c d 4
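Note that merge with how='cross' requires pandas 1.2 or newer.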
Another idea using a custom function to unnest the original list:
def unnest(l):
    if len(l) == 1:
        return l[0]
    return [unnest(x) for x in l]
pd.DataFrame([unnest(l)]).explode(4, ignore_index=True)
Or, a programmatic variant to avoid having to specify the columns to explode:
(pd.DataFrame([unnest(l)])
   .pipe(lambda d: d.explode(list(d.columns[d.iloc[0].str.len().gt(1)]), ignore_index=True))
)
Output:
0 1 2 3 4
0 a b c d 5
1 a b c d 9
2 a b c d 7
3 a b c d 4
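Here d.iloc[0].str.len() works because str.len reports the length of list values as well as strings, so gt(1) flags the list-valued columns to explode. With real data whose scalar strings are longer than one character, a stricter test such as d.iloc[0].apply(lambda x: isinstance(x, list)) would be safer.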
I was trying the suggested solutions step by step, and that's how I found this solution:
df = (pd.DataFrame(filas[:12])
        .T
        .merge(pd.DataFrame(filas[12:16]), how='cross')
        .set_axis([f'column_{i+1}' for i in range(19)], axis=1))
So the result is as follows:
column_1 column_2 column_3 column_4 column_5 column_6 column_7 column_8 ... column_12 column_13 column_14 column_15 column_16 column_17 column_18 column_19
0 6 20553737100 6 20431084172 25200.00 4536.00 PEN 09 ... 2022-06-20 POLO MANGA LARGA T L 600.00 16.90 19.942 10140.00 1825.20 18.00
1 6 20553737100 6 20431084172 25200.00 4536.00 PEN 09 ... 2022-06-20 POLO MANGA LARGA T M 600.00 16.90 19.942 10140.00 1825.20 18.00
2 6 20553737100 6 20431084172 25200.00 4536.00 PEN 09 ... 2022-06-20 LENTE LUNA CLARA TSG-100 ANTIEMPAÑO SIMPLE 800.00 2.65 3.127 2120.00 381.60 18.00
3 6 20553737100 6 20431084172 25200.00 4536.00 PEN 09 ... 2022-06-20 LENTE LUNA OSCURA TSG-100 ANTIEMPAÑO C/CORDON 800.00 3.50 4.13 2800.00 504.00 18.00
Thanks for everything
I have a dataset like this:
import pandas as pd
df = pd.read_csv("music.csv")
df
    name        date      singer                       language  phase
1   Yes or No   02.01.20  Benjamin Smith               en        1
2   Parabens    01.06.21  Rafael Galvao;Simon Murphy   pt;en     2
3   Love        12.11.20  Michaela Condell             en        1
4   Paz         11.07.19  Ana Perez; Eduarda Pinto     es;pt     3
5   Stop        12.01.21  Michael Conway;Gabriel Lee   en;en     1
6   Shalom      18.06.21  Shimon Cohen                 hebr      1
7   Habibi      22.12.19  Fuad Khoury                  ar        3
8   viva        01.08.21  Veronica Barnes              en        1
9   Buznanna    23.09.20  Kurt Azzopardi               mt        1
10  Frieden     21.05.21  Gabriel Meier                dt        1
11  Uruguay     11.04.21  Julio Ramirez                es        1
12  Beautiful   17.03.21  Cameron Armstrong            en        3
13  Holiday     19.06.20  Bianca Watson                en        3
14  Kiwi        21.10.20  Lachlan McNamara             en        1
15  Amore       01.12.20  Vasco Grimaldi               it        1
16  La vie      28.04.20  Victor Dubois                fr        3
17  Yom         21.02.20  Ori Azerad; Naeem al-Hindi   hebr;ar   2
18  Elefthería  15.06.19  Nikolaos Gekas               gr        1
This table is not in 1NF. I would like to convert it into a pd.DataFrame that satisfies 1NF.
How can I do that?
I did this, but it does not seem to work:
import pandas as pd
import numpy as np
df = pd.read_csv("music.csv")
lens = list(map(len, df[['singer', 'language']].values))
res = pd.DataFrame({'name': np.repeat(df['name'], lens),
                    'singer': np.concatenate(df['singer'].values),
                    'language': np.concatenate(df['language'].values)})
print(res)
It should satisfy only 1NF, not 3NF and so on.
Split language and singer on the separator and use DataFrame.explode:
df['language'] = df['language'].str.split(';')
df['singer'] = df['singer'].str.split(';')
df.explode(['language', 'singer'])
Id  name        date      singer             language  phase
1   Yes or No   02.01.20  Benjamin Smith     en        1
2   Parabens    01.06.21  Rafael Galvao      pt        2
2   Parabens    01.06.21  Simon Murphy       en        2
3   Love        12.11.20  Michaela Condell   en        1
4   Paz         11.07.19  Ana Perez          es        3
4   Paz         11.07.19  Eduarda Pinto      pt        3
5   Stop        12.01.21  Michael Conway     en        1
5   Stop        12.01.21  Gabriel Lee        en        1
6   Shalom      18.06.21  Shimon Cohen       hebr      1
7   Habibi      22.12.19  Fuad Khoury        ar        3
8   viva        01.08.21  Veronica Barnes    en        1
9   Buznanna    23.09.20  Kurt Azzopardi     mt        1
10  Frieden     21.05.21  Gabriel Meier      dt        1
11  Uruguay     11.04.21  Julio Ramirez      es        1
12  Beautiful   17.03.21  Cameron Armstrong  en        3
13  Holiday     19.06.20  Bianca Watson      en        3
14  Kiwi        21.10.20  Lachlan McNamara   en        1
15  Amore       01.12.20  Vasco Grimaldi     it        1
16  La vie      28.04.20  Victor Dubois      fr        3
17  Yom         21.02.20  Ori Azerad         hebr      2
17  Yom         21.02.20  Naeem al-Hindi     ar        2
18  Elefthería  15.06.19  Nikolaos Gekas     gr        1
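Note that exploding several columns at once requires pandas 1.3+. Also, some cells have a space after the semicolon (e.g. 'Ana Perez; Eduarda Pinto'), which leaves a leading space on the second value after splitting on ';'. A small sketch that splits on the separator plus any surrounding whitespace instead:
# a multi-character pattern is treated as a regex, so surrounding spaces are swallowed
df['language'] = df['language'].str.split(r'\s*;\s*')
df['singer'] = df['singer'].str.split(r'\s*;\s*')
df.explode(['language', 'singer'], ignore_index=True)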
I have dataframe similar to this one:
name hobby date country 5 10 15 20 ...
Toby Guitar 2020-01-19 Brazil 0.1245 0.2543 0.7763 0.2264
Linda Cooking 2020-03-05 Italy 0.5411 0.2213 NaN 0.3342
Ben Diving 2020-04-02 USA 0.8843 0.2333 0.4486 0.2122
...
I want to take the int columns, duplicate them, and put the int as the value of the new column, something like this:
name hobby date country 5 5 10 10 15 15 20 20...
Toby Guitar 2020-01-19 Brazil 0.1245 5 0.2543 10 0.7763 15 0.2264 20
Linda Cooking 2020-03-05 Italy 0.5411 5 0.2213 10 NaN 15 0.3342 20
Ben Diving 2020-04-02 USA 0.8843 5 0.2333 10 0.4486 15 0.2122 20
...
I'm not sure how to tackle this and am looking for ideas.
Here is a solution you can try out:
digits_ = pd.DataFrame(
{col: [int(col)] * len(df) for col in df.columns if col.isdigit()}
)
pd.concat([df, digits_], axis=1)
name hobby date country 5 ... 20 5 10 15 20
0 Toby Guitar 2020-01-19 Brazil 0.1245 ... 0.2264 5 10 15 20
1 Linda Cooking 2020-03-05 Italy 0.5411 ... 0.3342 5 10 15 20
2 Ben Diving 2020-04-02 USA 0.8843 ... 0.2122 5 10 15 20
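Note that the duplicated columns end up appended on the right rather than interleaved next to their originals as in the desired output. If the interleaved order matters, one possible sketch (assuming the digit columns all come after the label columns) rebuilds the frame column by column:
import pandas as pd

label_cols = [c for c in df.columns if not str(c).isdigit()]
digit_cols = [c for c in df.columns if str(c).isdigit()]

pieces = [df[label_cols]]
for c in digit_cols:
    pieces.append(df[c])                                      # original values
    pieces.append(pd.Series(int(c), index=df.index, name=c))  # constant copy
out = pd.concat(pieces, axis=1)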
I'm not sure if it is the best way to organise data with duplicated column names. I would recommend stacking (melting) it into long format.
df.melt(id_vars=["name", "hobby", "date", "country"])
Result
name hobby date country variable value
0 Toby Guitar 2020-01-19 Brazil 5 0.1245
1 Linda Cooking 2020-03-05 Italy 5 0.5411
2 Ben Diving 2020-04-02 USA 5 0.8843
3 Toby Guitar 2020-01-19 Brazil 10 0.2543
4 Linda Cooking 2020-03-05 Italy 10 0.2213
5 Ben Diving 2020-04-02 USA 10 0.2333
6 Toby Guitar 2020-01-19 Brazil 15 0.7763
7 Linda Cooking 2020-03-05 Italy 15 Nan
8 Ben Diving 2020-04-02 USA 15 0.4486
9 Toby Guitar 2020-01-19 Brazil 20 0.2264
10 Linda Cooking 2020-03-05 Italy 20 0.3342
11 Ben Diving 2020-04-02 USA 20 0.2122
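If you go this route, note that the variable column holds the former column labels; casting it to int makes sorting and plotting behave numerically (assuming the melted frame was assigned to res):
res['variable'] = res['variable'].astype(int)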
You could use the pandas insert(...) function combined with a for loop:
import numpy as np
import pandas as pd
df = pd.DataFrame([['Toby', 'Guitar', '2020-01-19', 'Brazil', 0.1245, 0.2543, 0.7763, 0.2264],
                   ['Linda', 'Cooking', '2020-03-05', 'Italy', 0.5411, 0.2213, np.nan, 0.3342],
                   ['Ben', 'Diving', '2020-04-02', 'USA', 0.8843, 0.2333, 0.4486, 0.2122]],
                  columns=['name', 'hobby', 'date', 'country', 5, 10, 15, 20])
start_col = 4
for i in range(0, len(df.columns) - start_col):
    dcol = df.columns[start_col + i*2]  # digit column name to duplicate
    # the final True is allow_duplicates, needed because the column name already exists
    df.insert(start_col + i*2 + 1, dcol, [dcol]*len(df.index), True)
Results:
    name    hobby        date country       5  ...  10      15  15      20  20
0   Toby   Guitar  2020-01-19  Brazil  0.1245  ...  10  0.7763  15  0.2264  20
1  Linda  Cooking  2020-03-05   Italy  0.5411  ...  10     NaN  15  0.3342  20
2    Ben   Diving  2020-04-02     USA  0.8843  ...  10  0.4486  15  0.2122  20
[3 rows x 12 columns]
I assumed that all your columns from the 5th on are digit columns, but if not you could add an if condition in the for loop to guard against this:
start_col = 4
for i in range(0, len(df.columns) - start_col):
    dcol = df.columns[start_col + i*2]  # digit column name to duplicate
    if type(dcol) is int:
        df.insert(start_col + i*2 + 1, dcol, [dcol]*len(df.index), True)
I have a three-month sales data set. I need to get the total sales count week-wise, grouped by agent, and I also want the daily standard deviation per agent in a separate table.
Agent District Agent_type Date Device
12 abc br 01/02/2020 4233
12 abc br 01/02/2020 4123
12 abc br 03/02/2020 4314
12 abc br 05/02/2020 4134
12 abc br 19/02/2020 5341
12 abc br 19/02/2020 52141
12 abc br 19/02/2020 12141
12 abc br 26/02/2020 4224
12 abc br 28/02/2020 9563
12 abc br 05/03/2020 0953
12 abc br 10/03/2020 1212
12 abc br 15/03/2020 4309
12 abc br 02/03/2020 4200
12 abc br 30/03/2020 4299
12 abc br 01/04/2020 4211
12 abc br 10/04/2020 2200
12 abc br 19/04/2020 3300
12 abc br 29/04/2020 3222
12 abc br 29/04/2020 32222
12 abc br 29/04/2020 4212
12 abc br 29/04/2020 20922
12 abc br 29/04/2020 67822
13 aaa ae 15/02/2020 22222
13 aaa ae 29/02/2020 42132
13 aaa ae 10/02/2020 89022
13 aaa ae 28/02/2020 31111
13 aaa ae 28/02/2020 31132
13 aaa ae 28/02/2020 31867
13 aaa ae 14/02/2020 91122
output
Agent District Agent_type 1st_week_feb 2nd_week_feb 3rd_week_feb ..... 4th_week_apr
12 abc br count count count count
13 aaa ae count count count count
2nd output - daily std by agent
Agent tot_sale daily_std
12 22 2.40
13 7 1.34
You can use:
#convert values to datetimes
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
#get weeks starting at 1
week = df['Date'].dt.isocalendar().week
df['Week'] = (week - week.min() + 1)
#lowercase months
df['Month'] = df['Date'].dt.strftime('%b').str.lower()
print (df)
Agent Date Device Week Month
0 12 2020-02-01 4233 1 feb
1 12 2020-02-01 4123 1 feb
2 12 2020-02-03 4314 2 feb
3 12 2020-02-05 4134 2 feb
4 12 2020-02-19 5341 4 feb
5 12 2020-02-26 4224 5 feb
6 12 2020-02-28 9563 5 feb
7 12 2020-03-05 953 6 mar
8 12 2020-03-10 1212 7 mar
9 12 2020-03-15 4309 7 mar
10 12 2020-03-02 4200 6 mar
11 12 2020-03-30 4299 10 mar
12 12 2020-04-01 4211 10 apr
13 12 2020-04-10 2200 11 apr
14 12 2020-04-19 3300 12 apr
15 12 2020-04-29 3222 14 apr
16 13 2020-02-15 22222 3 feb
17 13 2020-02-29 42132 5 feb
18 13 2020-03-10 89022 7 mar
19 13 2020-03-28 31111 9 mar
20 13 2020-04-14 91122 12 apr
#if you need to count rows, use crosstab
df1 = pd.crosstab(df['Agent'], [df['Week'], df['Month']])
df1.columns = df1.columns.map(lambda x: f'{x[0]}_week_{x[1]}')
print (df1)
1_week_feb 2_week_feb 3_week_feb 4_week_feb 5_week_feb 6_week_mar \
Agent
12 2 2 0 1 2 2
13 0 0 1 0 1 0
7_week_mar 9_week_mar 10_week_apr 10_week_mar 11_week_apr \
Agent
12 2 0 1 1 1
13 1 1 0 0 0
12_week_apr 14_week_apr
Agent
12 1 1
13 1 0
#if you need to sum the Device column, use pivot_table
df2 = df.pivot_table(index='Agent',
columns=['Week', 'Month'],
values='Device',
aggfunc='sum',
fill_value=0)
df2.columns = df2.columns.map(lambda x: f'{x[0]}_week_{x[1]}')
print (df2)
1_week_feb 2_week_feb 3_week_feb 4_week_feb 5_week_feb 6_week_mar \
Agent
12 8356 8448 0 5341 13787 5153
13 0 0 22222 0 42132 0
7_week_mar 9_week_mar 10_week_apr 10_week_mar 11_week_apr \
Agent
12 5521 0 4211 4299 2200
13 89022 31111 0 0 0
12_week_apr 14_week_apr
Agent
12 3300 3222
13 91122 0
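The question also asks for a second table with the daily standard deviation per agent. A possible sketch, assuming tot_sale means the total number of sale rows per agent and daily_std the std of the per-day counts:
#count sales per agent per calendar day, then aggregate per agent
daily = df.groupby(['Agent', df['Date'].dt.date]).size()
df3 = daily.groupby(level='Agent').agg(tot_sale='sum', daily_std='std').reset_index()
print (df3)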
EDIT: Thank you @Henry Yik for pointing out another way to count weeks, by day of the month:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['Week'] = (df["Date"].dt.day-1)//7+1
df['Month'] = df['Date'].dt.strftime('%b').str.lower()
print (df)
Agent Date Device Week Month
0 12 2020-02-01 4233 1 feb
1 12 2020-02-01 4123 1 feb
2 12 2020-02-03 4314 1 feb
3 12 2020-02-05 4134 1 feb
4 12 2020-02-19 5341 3 feb
5 12 2020-02-26 4224 4 feb
6 12 2020-02-28 9563 4 feb
7 12 2020-03-05 953 1 mar
8 12 2020-03-10 1212 2 mar
9 12 2020-03-15 4309 3 mar
10 12 2020-03-02 4200 1 mar
11 12 2020-03-30 4299 5 mar
12 12 2020-04-01 4211 1 apr
13 12 2020-04-10 2200 2 apr
14 12 2020-04-19 3300 3 apr
15 12 2020-04-29 3222 5 apr
16 13 2020-02-15 22222 3 feb
17 13 2020-02-29 42132 5 feb
18 13 2020-03-10 89022 2 mar
19 13 2020-03-28 31111 4 mar
20 13 2020-04-14 91122 2 apr
Assuming that the Date column has been converted to datetime, you can do your task in the following one-liner:
df.groupby(['Agent', pd.Grouper(key='Date', freq='W-MON', closed='left',
            label='left')]).count().unstack(level=1, fill_value=0)
For your data sample the result is:
Device
Date 2020-01-27 2020-02-03 2020-02-17 2020-02-24 2020-03-02 2020-03-09 2020-03-30 2020-04-06 2020-04-13 2020-04-27 2020-02-10 2020-03-23
Agent
12 2 2 1 2 2 2 2 1 1 1 0 0
13 0 0 0 1 0 1 0 0 1 0 1 1
The column name is from the date of a Monday "opening" the week.
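If you prefer flat column labels instead of the resulting (Device, date) MultiIndex, a small follow-up sketch (assuming, as above, Device is the only counted column; the week_of_ prefix is an arbitrary choice):
out = df.groupby(['Agent', pd.Grouper(key='Date', freq='W-MON', closed='left',
                  label='left')]).count().unstack(level=1, fill_value=0)
out.columns = [f'week_of_{ts.date()}' for _, ts in out.columns]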
Here is my dataframe; yes, it is quite large.
bigdataframe
Out[2]:
movie id movietitle releasedate \
0 1 Toy Story (1995) 01-Jan-1995
1 4 Get Shorty (1995) 01-Jan-1995
2 5 Copycat (1995) 01-Jan-1995
3 7 Twelve Monkeys (1995) 01-Jan-1995
4 8 Babe (1995) 01-Jan-1995
5 9 Dead Man Walking (1995) 01-Jan-1995
6 11 Seven (Se7en) (1995) 01-Jan-1995
7 12 Usual Suspects, The (1995) 14-Aug-1995
8 15 Mr. Holland's Opus (1995) 29-Jan-1996
9 17 From Dusk Till Dawn (1996) 05-Feb-1996
10 19 Antonia's Line (1995) 01-Jan-1995
11 21 Muppet Treasure Island (1996) 16-Feb-1996
12 22 Braveheart (1995) 16-Feb-1996
13 23 Taxi Driver (1976) 16-Feb-1996
14 24 Rumble in the Bronx (1995) 23-Feb-1996
15 25 Birdcage, The (1996) 08-Mar-1996
16 28 Apollo 13 (1995) 01-Jan-1995
17 30 Belle de jour (1967) 01-Jan-1967
18 31 Crimson Tide (1995) 01-Jan-1995
19 32 Crumb (1994) 01-Jan-1994
20 42 Clerks (1994) 01-Jan-1994
21 44 Dolores Claiborne (1994) 01-Jan-1994
22 45 Eat Drink Man Woman (1994) 01-Jan-1994
23 47 Ed Wood (1994) 01-Jan-1994
24 48 Hoop Dreams (1994) 01-Jan-1994
25 49 I.Q. (1994) 01-Jan-1994
26 50 Star Wars (1977) 01-Jan-1977
27 54 Outbreak (1995) 01-Jan-1995
28 55 Professional, The (1994) 01-Jan-1994
29 56 Pulp Fiction (1994) 01-Jan-1994
... ... ...
99970 332 Kiss the Girls (1997) 01-Jan-1997
99971 334 U Turn (1997) 01-Jan-1997
99972 338 Bean (1997) 01-Jan-1997
99973 346 Jackie Brown (1997) 01-Jan-1997
99974 682 I Know What You Did Last Summer (1997) 17-Oct-1997
99975 873 Picture Perfect (1997) 01-Aug-1997
99976 877 Excess Baggage (1997) 01-Jan-1997
99977 886 Life Less Ordinary, A (1997) 01-Jan-1997
99978 1527 Senseless (1998) 09-Jan-1998
99979 272 Good Will Hunting (1997) 01-Jan-1997
99980 288 Scream (1996) 20-Dec-1996
99981 294 Liar Liar (1997) 21-Mar-1997
99982 300 Air Force One (1997) 01-Jan-1997
99983 310 Rainmaker, The (1997) 01-Jan-1997
99984 313 Titanic (1997) 01-Jan-1997
99985 322 Murder at 1600 (1997) 18-Apr-1997
99986 328 Conspiracy Theory (1997) 08-Aug-1997
99987 333 Game, The (1997) 01-Jan-1997
99988 338 Bean (1997) 01-Jan-1997
99989 346 Jackie Brown (1997) 01-Jan-1997
99990 354 Wedding Singer, The (1998) 13-Feb-1998
99991 362 Blues Brothers 2000 (1998) 06-Feb-1998
99992 683 Rocket Man (1997) 01-Jan-1997
99993 689 Jackal, The (1997) 01-Jan-1997
99994 690 Seven Years in Tibet (1997) 01-Jan-1997
99995 748 Saint, The (1997) 14-Mar-1997
99996 751 Tomorrow Never Dies (1997) 01-Jan-1997
99997 879 Peacemaker, The (1997) 01-Jan-1997
99998 894 Home Alone 3 (1997) 01-Jan-1997
99999 901 Mr. Magoo (1997) 25-Dec-1997
videoreleasedate IMDb URL \
0 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2...
1 NaN http://us.imdb.com/M/title-exact?Get%20Shorty%...
2 NaN http://us.imdb.com/M/title-exact?Copycat%20(1995)
3 NaN http://us.imdb.com/M/title-exact?Twelve%20Monk...
4 NaN http://us.imdb.com/M/title-exact?Babe%20(1995)
5 NaN http://us.imdb.com/M/title-exact?Dead%20Man%20...
6 NaN http://us.imdb.com/M/title-exact?Se7en%20(1995)
7 NaN http://us.imdb.com/M/title-exact?Usual%20Suspe...
8 NaN http://us.imdb.com/M/title-exact?Mr.%20Holland...
9 NaN http://us.imdb.com/M/title-exact?From%20Dusk%2...
10 NaN http://us.imdb.com/M/title-exact?Antonia%20(1995)
11 NaN http://us.imdb.com/M/title-exact?Muppet%20Trea...
12 NaN http://us.imdb.com/M/title-exact?Braveheart%20...
13 NaN http://us.imdb.com/M/title-exact?Taxi%20Driver...
14 NaN http://us.imdb.com/M/title-exact?Hong%20Faan%2...
15 NaN http://us.imdb.com/M/title-exact?Birdcage,%20T...
16 NaN http://us.imdb.com/M/title-exact?Apollo%2013%2...
17 NaN http://us.imdb.com/M/title-exact?Belle%20de%20...
18 NaN http://us.imdb.com/M/title-exact?Crimson%20Tid...
19 NaN http://us.imdb.com/M/title-exact?Crumb%20(1994)
20 NaN http://us.imdb.com/M/title-exact?Clerks%20(1994)
21 NaN http://us.imdb.com/M/title-exact?Dolores%20Cla...
22 NaN http://us.imdb.com/M/title-exact?Yinshi%20Nan%...
23 NaN http://us.imdb.com/M/title-exact?Ed%20Wood%20(...
24 NaN http://us.imdb.com/M/title-exact?Hoop%20Dreams...
25 NaN http://us.imdb.com/M/title-exact?I.Q.%20(1994)
26 NaN http://us.imdb.com/M/title-exact?Star%20Wars%2...
27 NaN http://us.imdb.com/M/title-exact?Outbreak%20(1...
28 NaN http://us.imdb.com/Title?L%E9on+(1994)
29 NaN http://us.imdb.com/M/title-exact?Pulp%20Fictio...
... ...
99970 NaN http://us.imdb.com/M/title-exact?Kiss+the+Girl...
99971 NaN http://us.imdb.com/Title?U+Turn+(1997)
99972 NaN http://us.imdb.com/M/title-exact?Bean+(1997)
99973 NaN http://us.imdb.com/M/title-exact?imdb-title-11...
99974 NaN http://us.imdb.com/M/title-exact?I+Know+What+Y...
99975 NaN http://us.imdb.com/M/title-exact?Picture+Perfe...
99976 NaN http://us.imdb.com/M/title-exact?Excess+Baggag...
99977 NaN http://us.imdb.com/M/title-exact?Life+Less+Ord...
99978 NaN http://us.imdb.com/M/title-exact?imdb-title-12...
99979 NaN http://us.imdb.com/M/title-exact?imdb-title-11...
99980 NaN http://us.imdb.com/M/title-exact?Scream%20(1996)
99981 NaN http://us.imdb.com/Title?Liar+Liar+(1997)
99982 NaN http://us.imdb.com/M/title-exact?Air+Force+One...
99983 NaN http://us.imdb.com/M/title-exact?Rainmaker,+Th...
99984 NaN http://us.imdb.com/M/title-exact?imdb-title-12...
99985 NaN http://us.imdb.com/M/title-exact?Murder%20at%2...
99986 NaN http://us.imdb.com/M/title-exact?Conspiracy+Th...
99987 NaN http://us.imdb.com/M/title-exact?Game%2C+The+(...
99988 NaN http://us.imdb.com/M/title-exact?Bean+(1997)
99989 NaN http://us.imdb.com/M/title-exact?imdb-title-11...
99990 NaN http://us.imdb.com/M/title-exact?Wedding+Singe...
99991 NaN http://us.imdb.com/M/title-exact?Blues+Brother...
99992 NaN http://us.imdb.com/M/title-exact?Rocket+Man+(1...
99993 NaN http://us.imdb.com/M/title-exact?Jackal%2C+The...
99994 NaN http://us.imdb.com/M/title-exact?Seven+Years+i...
99995 NaN http://us.imdb.com/M/title-exact?Saint%2C%20Th...
99996 NaN http://us.imdb.com/M/title-exact?imdb-title-12...
99997 NaN http://us.imdb.com/M/title-exact?Peacemaker%2C...
99998 NaN http://us.imdb.com/M/title-exact?imdb-title-11...
99999 NaN http://us.imdb.com/M/title-exact?imdb-title-11...
unknown Action Adventure Animation Childrens ... Western \
0 0 0 0 1 1 ... 0
1 0 1 0 0 0 ... 0
2 0 0 0 0 0 ... 0
3 0 0 0 0 0 ... 0
4 0 0 0 0 1 ... 0
5 0 0 0 0 0 ... 0
6 0 0 0 0 0 ... 0
7 0 0 0 0 0 ... 0
8 0 0 0 0 0 ... 0
9 0 1 0 0 0 ... 0
10 0 0 0 0 0 ... 0
11 0 1 1 0 0 ... 0
12 0 1 0 0 0 ... 0
13 0 0 0 0 0 ... 0
14 0 1 1 0 0 ... 0
15 0 0 0 0 0 ... 0
16 0 1 0 0 0 ... 0
17 0 0 0 0 0 ... 0
18 0 0 0 0 0 ... 0
19 0 0 0 0 0 ... 0
20 0 0 0 0 0 ... 0
21 0 0 0 0 0 ... 0
22 0 0 0 0 0 ... 0
23 0 0 0 0 0 ... 0
24 0 0 0 0 0 ... 0
25 0 0 0 0 0 ... 0
26 0 1 1 0 0 ... 0
27 0 1 0 0 0 ... 0
28 0 0 0 0 0 ... 0
29 0 0 0 0 0 ... 0
... ... ... ... ... ... ...
99970 0 0 0 0 0 ... 0
99971 0 1 0 0 0 ... 0
99972 0 0 0 0 0 ... 0
99973 0 0 0 0 0 ... 0
99974 0 0 0 0 0 ... 0
99975 0 0 0 0 0 ... 0
99976 0 0 1 0 0 ... 0
99977 0 0 0 0 0 ... 0
99978 0 0 0 0 0 ... 0
99979 0 0 0 0 0 ... 0
99980 0 0 0 0 0 ... 0
99981 0 0 0 0 0 ... 0
99982 0 1 0 0 0 ... 0
99983 0 0 0 0 0 ... 0
99984 0 1 0 0 0 ... 0
99985 0 0 0 0 0 ... 0
99986 0 1 0 0 0 ... 0
99987 0 0 0 0 0 ... 0
99988 0 0 0 0 0 ... 0
99989 0 0 0 0 0 ... 0
99990 0 0 0 0 0 ... 0
99991 0 1 0 0 0 ... 0
99992 0 0 0 0 0 ... 0
99993 0 1 0 0 0 ... 0
99994 0 0 0 0 0 ... 0
99995 0 1 0 0 0 ... 0
99996 0 1 0 0 0 ... 0
99997 0 1 0 0 0 ... 0
99998 0 0 0 0 1 ... 0
99999 0 0 0 0 0 ... 0
user id rating timestamp age gender occupation zipcode state \
0 308 4 887736532 60 M retired 95076 CA
1 308 5 887737890 60 M retired 95076 CA
2 308 4 887739608 60 M retired 95076 CA
3 308 4 887738847 60 M retired 95076 CA
4 308 5 887736696 60 M retired 95076 CA
5 308 4 887737194 60 M retired 95076 CA
6 308 5 887737837 60 M retired 95076 CA
7 308 5 887737243 60 M retired 95076 CA
8 308 3 887739426 60 M retired 95076 CA
9 308 4 887739056 60 M retired 95076 CA
10 308 3 887737383 60 M retired 95076 CA
11 308 3 887740729 60 M retired 95076 CA
12 308 4 887737647 60 M retired 95076 CA
13 308 5 887737293 60 M retired 95076 CA
14 308 4 887738057 60 M retired 95076 CA
15 308 4 887740649 60 M retired 95076 CA
16 308 3 887737036 60 M retired 95076 CA
17 308 4 887738933 60 M retired 95076 CA
18 308 3 887739472 60 M retired 95076 CA
19 308 5 887737432 60 M retired 95076 CA
20 308 4 887738191 60 M retired 95076 CA
21 308 4 887740451 60 M retired 95076 CA
22 308 4 887736843 60 M retired 95076 CA
23 308 4 887738933 60 M retired 95076 CA
24 308 4 887736880 60 M retired 95076 CA
25 308 3 887740833 60 M retired 95076 CA
26 308 5 887737431 60 M retired 95076 CA
27 308 2 887740254 60 M retired 95076 CA
28 308 3 887738760 60 M retired 95076 CA
29 308 5 887736924 60 M retired 95076 CA
... ... ... ... ... ... ... ...
99970 631 3 888465180 18 F student 38866 MS
99971 631 2 888464941 18 F student 38866 MS
99972 631 2 888465299 18 F student 38866 MS
99973 631 4 888465004 18 F student 38866 MS
99974 631 2 888465247 18 F student 38866 MS
99975 631 2 888465084 18 F student 38866 MS
99976 631 2 888465131 18 F student 38866 MS
99977 631 4 888465216 18 F student 38866 MS
99978 631 2 888465351 18 F student 38866 MS
99979 729 4 893286638 19 M student 56567 MN
99980 729 2 893286261 19 M student 56567 MN
99981 729 2 893286338 19 M student 56567 MN
99982 729 4 893286638 19 M student 56567 MN
99983 729 3 893286204 19 M student 56567 MN
99984 729 3 893286638 19 M student 56567 MN
99985 729 4 893286637 19 M student 56567 MN
99986 729 3 893286638 19 M student 56567 MN
99987 729 4 893286638 19 M student 56567 MN
99988 729 1 893286373 19 M student 56567 MN
99989 729 1 893286168 19 M student 56567 MN
99990 729 5 893286637 19 M student 56567 MN
99991 729 4 893286637 19 M student 56567 MN
99992 729 2 893286511 19 M student 56567 MN
99993 729 4 893286638 19 M student 56567 MN
99994 729 2 893286149 19 M student 56567 MN
99995 729 4 893286638 19 M student 56567 MN
99996 729 3 893286338 19 M student 56567 MN
99997 729 3 893286299 19 M student 56567 MN
99998 729 1 893286511 19 M student 56567 MN
99999 729 1 893286491 19 M student 56567 MN
State1
0 CA
1 CA
2 CA
3 CA
4 CA
5 CA
6 CA
7 CA
8 CA
9 CA
10 CA
11 CA
12 CA
13 CA
14 CA
15 CA
16 CA
17 CA
18 CA
19 CA
20 CA
21 CA
22 CA
23 CA
24 CA
25 CA
26 CA
27 CA
28 CA
29 CA
...
99970 MS
99971 MS
99972 MS
99973 MS
99974 MS
99975 MS
99976 MS
99977 MS
99978 MS
99979 MN
99980 MN
99981 MN
99982 MN
99983 MN
99984 MN
99985 MN
99986 MN
99987 MN
99988 MN
99989 MN
99990 MN
99991 MN
99992 MN
99993 MN
99994 MN
99995 MN
99996 MN
99997 MN
99998 MN
99999 MN
All of the genres are: ['Action', 'Adventure', 'Animation', 'Childrens', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'FilmNoir', 'Horror', 'Musical', 'Mystery', 'Romance', 'SciFi', 'Thriller', 'War', 'Western']
How would I be able to figure out which genre had the highest average review, and which had the lowest average review? Should I group by rating and then aggregate all of the corresponding genres?
df = bigdataframe[['Action', 'Adventure','Animation', 'Childrens', 'Comedy',
'Crime','Documentary', 'Drama', 'Fantasy', 'FilmNoir',
'Horror', 'Musical', 'Mystery',
'Romance','SciFi', 'Thriller', 'War', 'Western','rating']]
gp = df.groupby('rating')
result = gp.agg(['mean'])
result gives me this:
Action Adventure Animation Childrens Comedy Crime \
mean mean mean mean mean mean
rating
1 0.253191 0.131588 0.030442 0.093944 0.372995 0.068249
2 0.286192 0.150308 0.032806 0.084521 0.339138 0.073351
3 0.267232 0.143710 0.037502 0.081709 0.322380 0.073899
4 0.246708 0.129806 0.036051 0.064728 0.284485 0.082958
5 0.240696 0.136928 0.037545 0.057403 0.246403 0.092590
Documentary Drama Fantasy FilmNoir Horror Musical \
mean mean mean mean mean mean
rating
1 0.009656 0.289034 0.018331 0.007365 0.082324 0.046645
2 0.005101 0.320756 0.019349 0.008531 0.071592 0.050484
3 0.006042 0.363861 0.016983 0.013520 0.055738 0.052238
4 0.007842 0.427459 0.011207 0.019430 0.047112 0.047609
5 0.009858 0.471534 0.008301 0.026414 0.041366 0.049526
Mystery Romance SciFi Thriller War Western
mean mean mean mean mean mean
rating
1 0.041735 0.154173 0.118494 0.203764 0.060065 0.011620
2 0.046262 0.177397 0.133597 0.229903 0.067018 0.015743
3 0.048112 0.186443 0.121422 0.224277 0.074415 0.019893
4 0.056563 0.201381 0.125154 0.222772 0.097589 0.019606
5 0.057780 0.215037 0.137446 0.203387 0.137446 0.018584
I think you need idxmin and idxmax; also, a new DataFrame is not necessary, you can use bigdataframe and filter the columns in []:
genres = ['Action', 'Adventure','Animation', 'Childrens', 'Comedy', 'Crime','Documentary', 'Drama', 'Fantasy', 'FilmNoir', 'Horror', 'Musical', 'Mystery', 'Romance','SciFi', 'Thriller', 'War', 'Western']
df1 = bigdataframe.groupby('rating')[genres].mean()
print (df1)
Action Adventure Animation Childrens Comedy Crime \
rating
1 0.253191 0.131588 0.030442 0.093944 0.372995 0.068249
2 0.286192 0.150308 0.032806 0.084521 0.339138 0.073351
3 0.267232 0.143710 0.037502 0.081709 0.322380 0.073899
4 0.246708 0.129806 0.036051 0.064728 0.284485 0.082958
5 0.240696 0.136928 0.037545 0.057403 0.246403 0.092590
Documentary Drama Fantasy FilmNoir Horror Musical \
rating
1 0.009656 0.289034 0.018331 0.007365 0.082324 0.046645
2 0.005101 0.320756 0.019349 0.008531 0.071592 0.050484
3 0.006042 0.363861 0.016983 0.013520 0.055738 0.052238
4 0.007842 0.427459 0.011207 0.019430 0.047112 0.047609
5 0.009858 0.471534 0.008301 0.026414 0.041366 0.049526
Mystery Romance SciFi Thriller War Western
rating
1 0.041735 0.154173 0.118494 0.203764 0.060065 0.011620
2 0.046262 0.177397 0.133597 0.229903 0.067018 0.015743
3 0.048112 0.186443 0.121422 0.224277 0.074415 0.019893
4 0.056563 0.201381 0.125154 0.222772 0.097589 0.019606
5 0.057780 0.215037 0.137446 0.203387 0.137446 0.018584
mingen = df1.idxmin(axis=1).reset_index(name='Genre')
print (mingen)
rating Genre
0 1 FilmNoir
1 2 Documentary
2 3 Documentary
3 4 Documentary
4 5 Fantasy
maxgen = df1.idxmax(axis=1).reset_index(name='Genre')
print (maxgen)
rating Genre
0 1 Comedy
1 2 Comedy
2 3 Drama
3 4 Drama
4 5 Drama
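Note that grouping by rating gives the share of each genre within each rating level, not the average rating per genre. If you literally want the genres with the highest and lowest average review, one possible sketch using the one-hot genre flags and the genres list above:
#average rating over the rows where each genre flag is 1
avg_by_genre = pd.Series({g: bigdataframe.loc[bigdataframe[g] == 1, 'rating'].mean()
                          for g in genres})
print (avg_by_genre.idxmax(), avg_by_genre.idxmin())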
I have three data frames (A, B, C) of the following type:
2006 2007 2008 2009 2010 2011
Age
0 1556 1623 5943 6133 6111 6345
1 5707 5838 0355 6049 2366 5828
2 5616 5770 5899 6080 6137 6303
3 5564 5593 8129 9388 1341 6215
4 5702 5598 7030 8576 9827 2007
I would like to get them into one dataframe with the following design (multi-index):
A B C
Year Age
2006 0 1556 3532 23
1 5707 4352 53
2 5616 2533 67
...
2011 3 6215 4255 55
4 9827 3333 50
Any suggestions?
Cheers, Mike
You can use concat with unstack:
df1 = pd.DataFrame({
'2010': [6111, 2366, 6137, 1341, 9827],
'2007': [1623, 5838, 5770, 5593, 5598],
'2008': [5943, 355, 5899, 8129, 7030],
'2011': [6345, 5828, 6303, 6215, 2007],
'2006': [1556, 5707, 5616, 5564, 5702],
'2009': [6133, 6049, 6080, 9388, 8576]})
print (df1)
2006 2007 2008 2009 2010 2011
0 1556 1623 5943 6133 6111 6345
1 5707 5838 355 6049 2366 5828
2 5616 5770 5899 6080 6137 6303
3 5564 5593 8129 9388 1341 6215
4 5702 5598 7030 8576 9827 2007
df2 = df1*2
df3 = df1*3
print (pd.concat([df1.unstack(),df2.unstack(),df3.unstack()], axis=1, keys=list('ABC'))
.rename_axis(('Year','Age')))
A B C
Year Age
2006 0 1556 3112 4668
1 5707 11414 17121
2 5616 11232 16848
3 5564 11128 16692
4 5702 11404 17106
2007 0 1623 3246 4869
1 5838 11676 17514
...
Or use concat with stack, but then swaplevel with sort_index is necessary, and finally rename the level names with rename_axis:
print (pd.concat([df1,df2,df3], axis=1, keys=list('ABC'))
.stack()
.swaplevel(0,1)
.sort_index()
.rename_axis(('Year','Age')))
A B C
Year Age
2006 0 1556 3112 4668
1 5707 11414 17121
2 5616 11232 16848
3 5564 11128 16692
4 5702 11404 17106
2007 0 1623 3246 4869
1 5838 11676 17514
2 5770 11540 17310
3 5593 11186 16779
4 5598 11196 16794
2008 0 5943 11886 17829
1 355 710 1065
2 5899 11798 17697
...
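One caveat for either variant: the Year level comes from the column labels, which are strings here ('2006', ...). A possible follow-up sketch to convert that level to integers, assuming the result was assigned to res:
res.index = res.index.set_levels(res.index.levels[0].astype(int), level='Year')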