I have a dataset like this:
import pandas as pd
df = pd.read_csv("music.csv")
df
    name        date      singer                      language  phase
1   Yes or No   02.01.20  Benjamin Smith              en        1
2   Parabens    01.06.21  Rafael Galvao;Simon Murphy  pt;en     2
3   Love        12.11.20  Michaela Condell            en        1
4   Paz         11.07.19  Ana Perez; Eduarda Pinto    es;pt     3
5   Stop        12.01.21  Michael Conway;Gabriel Lee  en;en     1
6   Shalom      18.06.21  Shimon Cohen                hebr      1
7   Habibi      22.12.19  Fuad Khoury                 ar        3
8   viva        01.08.21  Veronica Barnes             en        1
9   Buznanna    23.09.20  Kurt Azzopardi              mt        1
10  Frieden     21.05.21  Gabriel Meier               dt        1
11  Uruguay     11.04.21  Julio Ramirez               es        1
12  Beautiful   17.03.21  Cameron Armstrong           en        3
13  Holiday     19.06.20  Bianca Watson               en        3
14  Kiwi        21.10.20  Lachlan McNamara            en        1
15  Amore       01.12.20  Vasco Grimaldi              it        1
16  La vie      28.04.20  Victor Dubois               fr        3
17  Yom         21.02.20  Ori Azerad; Naeem al-Hindi  hebr;ar   2
18  Elefthería  15.06.19  Nikolaos Gekas              gr        1
This table is not in 1NF. I would like to convert it into a pd.DataFrame that satisfies 1NF. How can I do that?
I tried this, but it doesn't seem to work:
import pandas as pd
import numpy as np
df = pd.read_csv("music.csv")
lens = list(map(len, df[['singer', 'language']].values))
res = pd.DataFrame({'name': np.repeat(df['name'], lens),
                    'singer': np.concatenate(df['singer'].values),
                    'language': np.concatenate(df['language'].values)})
print(res)
It only needs to satisfy 1NF, not 3NF and so on.
Split singer and language on the ';' separator, then use DataFrame.explode:
df['language'] = df['language'].str.split(';')
df['singer'] = df['singer'].str.split(';')
df.explode(['language', 'singer'])
Id  name        date      singer             language  phase
1   Yes or No   02.01.20  Benjamin Smith     en        1
2   Parabens    01.06.21  Rafael Galvao      pt        2
2   Parabens    01.06.21  Simon Murphy       en        2
3   Love        12.11.20  Michaela Condell   en        1
4   Paz         11.07.19  Ana Perez          es        3
4   Paz         11.07.19  Eduarda Pinto      pt        3
5   Stop        12.01.21  Michael Conway     en        1
5   Stop        12.01.21  Gabriel Lee        en        1
6   Shalom      18.06.21  Shimon Cohen       hebr      1
7   Habibi      22.12.19  Fuad Khoury        ar        3
8   viva        01.08.21  Veronica Barnes    en        1
9   Buznanna    23.09.20  Kurt Azzopardi     mt        1
10  Frieden     21.05.21  Gabriel Meier      dt        1
11  Uruguay     11.04.21  Julio Ramirez      es        1
12  Beautiful   17.03.21  Cameron Armstrong  en        3
13  Holiday     19.06.20  Bianca Watson      en        3
14  Kiwi        21.10.20  Lachlan McNamara   en        1
15  Amore       01.12.20  Vasco Grimaldi     it        1
16  La vie      28.04.20  Victor Dubois      fr        3
17  Yom         21.02.20  Ori Azerad         hebr      2
17  Yom         21.02.20  Naeem al-Hindi     ar        2
18  Elefthería  15.06.19  Nikolaos Gekas     gr        1
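For completeness, here is a self-contained version of the above (exploding several list columns at once requires pandas 1.3 or newer; the .strip() handles the stray spaces after ';' in rows 4 and 17, and ignore_index is an optional extra that renumbers the result):
import pandas as pd

df = pd.read_csv("music.csv")

# turn each ';'-separated cell into a list of atomic values
for col in ['singer', 'language']:
    df[col] = df[col].str.split(';').apply(lambda parts: [p.strip() for p in parts])

# one output row per (singer, language) pair -> every cell is atomic (1NF)
res = df.explode(['singer', 'language'], ignore_index=True)
print(res)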
I need to do a forward fillna() on a pandas DataFrame in a specific manner. Let me explain.
I have a DataFrame with three columns, city, Age, and Value (sorted by ['city', 'Age']).
    city  Age  Value
0   NY    30   NaN
1   NY    35   12AA
2   NY    40   NaN
3   NY    45   NaN
4   NY    50   15AA
5   NY    55   NaN
6   LA    25   NaN
7   LA    30   NaN
8   LA    35   14DD
9   LA    40   NaN
10  LA    45   12AA
11  LA    50   NaN
12  LA    55   NaN
13  DC    35   NaN
What I need to do is fill the NaN values in the forward direction (replace each NaN with the previous non-NaN value). The only twist is that the forward fill should reset whenever the city changes. The following table shows the desired output.
    city  Age  Value
0   NY    30   NaN
1   NY    35   12AA
2   NY    40   12AA
3   NY    45   12AA
4   NY    50   15AA
5   NY    55   15AA
6   LA    25   NaN
7   LA    30   NaN
8   LA    35   14DD
9   LA    40   14DD
10  LA    45   12AA
11  LA    50   12AA
12  LA    55   12AA
13  DC    35   NaN
How can I do this kind of forwarding fillna in pandas that reset based on the city column?
Let's try groupby ffill:
df['Value'] = df.groupby('city')['Value'].ffill()
print(df)
Output:
city Age Value
0 NY 30 NaN
1 NY 35 12AA
2 NY 40 12AA
3 NY 45 12AA
4 NY 50 15AA
5 NY 55 15AA
6 LA 25 NaN
7 LA 30 NaN
8 LA 35 14DD
9 LA 40 14DD
10 LA 45 12AA
11 LA 50 12AA
12 LA 55 12AA
13 DC 35 NaN
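For reference, a minimal reproduction of the frame above (values transcribed from the question) makes the reset behaviour easy to verify; the leading NaNs of NY, LA, and DC stay NaN because there is no earlier value inside their group to propagate:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'city':  ['NY'] * 6 + ['LA'] * 7 + ['DC'],
    'Age':   [30, 35, 40, 45, 50, 55, 25, 30, 35, 40, 45, 50, 55, 35],
    'Value': [np.nan, '12AA', np.nan, np.nan, '15AA', np.nan,
              np.nan, np.nan, '14DD', np.nan, '12AA', np.nan, np.nan, np.nan],
})

# forward-fill within each city; the fill never crosses a group boundary
df['Value'] = df.groupby('city')['Value'].ffill()
print(df)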
I have a DataFrame similar to this one:
name hobby date country 5 10 15 20 ...
Toby Guitar 2020-01-19 Brazil 0.1245 0.2543 0.7763 0.2264
Linda Cooking 2020-03-05 Italy 0.5411 0.2213 NaN 0.3342
Ben Diving 2020-04-02 USA 0.8843 0.2333 0.4486 0.2122
...
I want to take the int columns, duplicate them, and put the int as the value of each duplicated column, something like this:
name hobby date country 5 5 10 10 15 15 20 20 ...
Toby Guitar 2020-01-19 Brazil 0.1245 5 0.2543 10 0.7763 15 0.2264 20
Linda Cooking 2020-03-05 Italy 0.5411 5 0.2213 10 NaN 15 0.3342 20
Ben Diving 2020-04-02 USA 0.8843 5 0.2333 10 0.4486 15 0.2122 20
...
I'm not sure how to tackle this and am looking for ideas.
Here is a solution you can try out,
digits_ = pd.DataFrame(
    {col: [int(col)] * len(df) for col in df.columns if col.isdigit()}
)
pd.concat([df, digits_], axis=1)
name hobby date country 5 ... 20 5 10 15 20
0 Toby Guitar 2020-01-19 Brazil 0.1245 ... 0.2264 5 10 15 20
1 Linda Cooking 2020-03-05 Italy 0.5411 ... 0.3342 5 10 15 20
2 Ben Diving 2020-04-02 USA 0.8843 ... 0.2122 5 10 15 20
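Note that the concat above appends the duplicated columns at the end rather than next to their originals, as the desired output shows. If the interleaved order matters, one way (a sketch, assuming the first four columns are the identifier columns) is to concatenate the column pairs in order:
id_cols = ['name', 'hobby', 'date', 'country']
digit_cols = [c for c in df.columns if c.isdigit()]

# place each constant column directly after the column it annotates
pieces = [df[id_cols]]
for col in digit_cols:
    pieces.append(df[[col]])
    pieces.append(digits_[[col]])
out = pd.concat(pieces, axis=1)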
I'm not sure if it is the best way to organise data with duplicated column names. I would recommend stacking (melting) it into long format.
df.melt(id_vars=["name", "hobby", "date", "country"])
Result
name hobby date country variable value
0 Toby Guitar 2020-01-19 Brazil 5 0.1245
1 Linda Cooking 2020-03-05 Italy 5 0.5411
2 Ben Diving 2020-04-02 USA 5 0.8843
3 Toby Guitar 2020-01-19 Brazil 10 0.2543
4 Linda Cooking 2020-03-05 Italy 10 0.2213
5 Ben Diving 2020-04-02 USA 10 0.2333
6 Toby Guitar 2020-01-19 Brazil 15 0.7763
7 Linda Cooking 2020-03-05 Italy 15 NaN
8 Ben Diving 2020-04-02 USA 15 0.4486
9 Toby Guitar 2020-01-19 Brazil 20 0.2264
10 Linda Cooking 2020-03-05 Italy 20 0.3342
11 Ben Diving 2020-04-02 USA 20 0.2122
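If the wide layout is needed again later, pivot reverses the melt (a sketch, assuming the four id columns uniquely identify each row, as they do in the sample):
long = df.melt(id_vars=["name", "hobby", "date", "country"])
wide = (long.pivot(index=["name", "hobby", "date", "country"],
                   columns="variable", values="value")
            .reset_index())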
You could use the pandas insert(...) function combined with a for loop:
import numpy as np
import pandas as pd
df = pd.DataFrame([['Toby', 'Guitar', '2020-01-19', 'Brazil', 0.1245, 0.2543, 0.7763, 0.2264],
                   ['Linda', 'Cooking', '2020-03-05', 'Italy', 0.5411, 0.2213, np.nan, 0.3342],
                   ['Ben', 'Diving', '2020-04-02', 'USA', 0.8843, 0.2333, 0.4486, 0.2122]],
                  columns=['name', 'hobby', 'date', 'country', 5, 10, 15, 20])
start_col = 4
for i in range(0, len(df.columns) - start_col):
    dcol = df.columns[start_col + i*2]  # digit col name to duplicate
    df.insert(start_col + i*2 + 1, dcol, [dcol] * len(df.index), True)
Results:
name hobby date country 5 ... 10 15 15 20 20
0 Toby Guitar 2020-01-19 Brazil 0.1245 ... 10 0.7763 15 0.2264 20
1 Linda Cooking 2020-03-05 Italy 0.5411 ... 10 NaN 15 0.3342 20
2 Ben Diving 2020-04-02 USA 0.8843 ... 10 0.4486 15 0.2122 20
[3 rows x 12 columns]
I assumed that all your columns from the 5th onward are numeric; if not, you can add an if condition inside the for loop to guard against that:
start_col = 4
for i in range(0, len(df.columns) - start_col):
    dcol = df.columns[start_col + i*2]  # digit col name to duplicate
    if type(dcol) is int:
        df.insert(start_col + i*2 + 1, dcol, [dcol] * len(df.index), True)
Hope you can help me with this.
The df looks like this:
region AMER
country Brazil Canada Columbia Mexico United States
metro Rio de Janeiro Sao Paulo Toronto Bogota Mexico City Monterrey Atlanta Boston Chicago Culpeper Dallas Denver Houston Los Angeles Miami New York Philadelphia Seattle Silicon Valley Washington D.C.
ID
321321 2 1 1 13 15 29 1 2 1 11 6 15 3 2 14 3
23213 3
231 2 2 3 1 5 6 3 3 4 3 3 4
23213 4 1 1 1 4 1 2 27 1
21321 4 2 2 1 14 3 2 4 2
12321 1 2 1 1 1 1 10
123213 2 45 5 1
12321 1
123 1 3 2
For every row (ID), I want to get the count of metro and country columns that contain data per region, and store that count in a new column.
Regards,
RJ
You may want to try:
df['new'] = df.sum(level=0, axis=1)
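Note that sum(level=...) has since been deprecated in favor of an equivalent groupby call. Also, if the goal is the number of metro columns that actually contain a value per row, rather than their sum, counting non-null cells may be closer to the ask; a minimal sketch:
# number of non-empty metro cells per ID (row-wise non-null count)
df['new'] = df.notna().sum(axis=1)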
As I am a newbie to deeper DataFrame operations, I would like to ask how to find, e.g., the lowest campaign ID per customerid in this kind of DataFrame. As I've learned, iterating over a DataFrame should be avoided.
orderid customerid campaignid orderdate city state zipcode paymenttype totalprice numorderlines numunits
0 1002854 45978 2141 2009-10-13 NEWTON MA 02459 VI 190.00 3 3
1 1002855 125381 2173 2009-10-13 NEW ROCHELLE NY 10804 VI 10.00 1 1
2 1002856 103122 2141 2011-06-02 MIAMI FL 33137 AE 35.22 2 2
3 1002857 130980 2173 2009-10-14 E RUTHERFORD NJ 07073 AE 10.00 1 1
4 1002886 48553 2141 2010-11-19 BALTIMORE MD 21218 VI 10.00 1 1
5 1002887 106150 2173 2009-10-15 ROWAYTON CT 06853 AE 10.00 1 1
6 1002888 27805 2173 2009-10-15 INDIANAPOLIS IN 46240 VI 10.00 1 1
7 1002889 24546 2173 2009-10-15 PLEASANTVILLE NY 10570 MC 10.00 1 1
8 1002890 43783 2173 2009-10-15 EAST STROUDSBURG PA 18301 DB 29.68 2 2
9 1003004 15688 2173 2009-10-15 ROUND LAKE PARK IL 60073 DB 19.68 1 1
10 1003044 130970 2141 2010-11-22 BLOOMFIELD NJ 07003 AE 10.00 1 1
11 1003045 40048 2173 2010-11-22 SPRINGFIELD IL 62704 MC 10.00 1 1
12 1003046 21927 2141 2010-11-22 WACO TX 76710 MC 17.50 1 1
13 1003075 130971 2141 2010-11-22 FAIRFIELD NJ 07004 MC 59.80 1 4
14 1003076 7117 2141 2010-11-22 BROOKLYN NY 11228 AE 22.50 1 1
Try the following:
df.groupby('customerid')['campaignid'].min()
This groups the rows by the unique values of customerid and subsequently finds the minimum value per group for a given column using ['column_name'].min().
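If the minimum should appear next to every order row rather than as a per-customer Series, transform broadcasts it back (a sketch; the column name min_campaignid is illustrative):
df['min_campaignid'] = df.groupby('customerid')['campaignid'].transform('min')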
Having grouped data, I want to drop from the results groups that contain only a single observation with the value below a certain threshold.
Initial data:
df = pd.DataFrame(data={'Province': ['ON','QC','BC','AL','AL','MN','ON'],
                        'City': ['Toronto','Montreal','Vancouver','Calgary','Edmonton','Winnipeg','Windsor'],
                        'Sales': [13,6,16,8,4,3,1]})
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
Now grouping the data:
df.groupby(['Province', 'City']).sum()
Sales
Province City
AL Calgary 8
Edmonton 4
BC Vancouver 16
MN Winnipeg 3
ON Toronto 13
Windsor 1
QC Montreal 6
Now the part I can't figure out is how to drop provinces with only one city (or, more generally, N observations) whose total sales are less than 10. The expected output should be:
Sales
Province City
AL Calgary 8
Edmonton 4
BC Vancouver 16
ON Toronto 13
Windsor 1
I.e., MN/Winnipeg and QC/Montreal are gone from the results. Ideally, they wouldn't be completely gone but combined into a new group called 'Other', but that may be material for another question.
You can do it this way:
In [188]: df
Out[188]:
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
In [189]: g = df.groupby(['Province', 'City']).sum().reset_index()
In [190]: g
Out[190]:
Province City Sales
0 AL Calgary 8
1 AL Edmonton 4
2 BC Vancouver 16
3 MN Winnipeg 3
4 ON Toronto 13
5 ON Windsor 1
6 QC Montreal 6
Now we will create a mask for those 'provinces with more than one city':
In [191]: mask = g.groupby('Province').City.transform('count') > 1
In [192]: mask
Out[192]:
0 True
1 True
2 False
3 False
4 True
5 True
6 False
dtype: bool
And cities with total sales greater than or equal to 10 survive regardless:
In [193]: g[(mask) | (g.Sales >= 10)]
Out[193]:
Province City Sales
0 AL Calgary 8
1 AL Edmonton 4
2 BC Vancouver 16
4 ON Toronto 13
5 ON Windsor 1
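The same filtering also works without reset_index, directly on the MultiIndexed result; a sketch using transform on the Province level:
g2 = df.groupby(['Province', 'City']).sum()

# count cities per province and broadcast the count back to every row
mask = g2.groupby(level='Province')['Sales'].transform('count') > 1
g2[mask | (g2['Sales'] >= 10)]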
I wasn't satisfied with any of the answers given, so I kept chipping away at this until I figured out the following solution:
In [72]: df
Out[72]:
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
In [73]: df.groupby(['Province', 'City']).sum().groupby(level=0).filter(lambda x: len(x)>1 or x.Sales > 10)
Out[73]:
Sales
Province City
AL Calgary 8
Edmonton 4
BC Vancouver 16
ON Toronto 13
Windsor 1
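One caveat about the lambda above: for a single-city province, x.Sales > 10 is a one-element Series rather than a scalar, and the filter appears to work only because pandas reduces that one-element result to a boolean internally. Making the reduction explicit is clearer and more robust (a sketch):
df.groupby(['Province', 'City']).sum().groupby(level=0).filter(
    lambda x: len(x) > 1 or x['Sales'].sum() > 10
)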