Simplifying categorical variables with python/pandas

I'm working with an airbnb dataset on Kaggle:
https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings
and want to simplify the values for the language column into 2 groupings - english and non-english.
For instance:
users.language.value_counts()
en 15011
zh 101
fr 99
de 53
es 53
ko 43
ru 21
it 20
ja 19
pt 14
sv 11
no 6
da 5
nl 4
el 2
pl 2
tr 2
cs 1
fi 1
is 1
hu 1
Name: language, dtype: int64
And the result I want is:
users.language.value_counts()
english 15011
non-english 459
Name: language, dtype: int64
This is sort of the solution I want:
def language_groupings():
    for i in users:
        if users.language != 'en':
            replace(users.language.str, 'non-english')
        else:
            replace(users.language.str, 'english')
    return users

users['language'] = users.apply(lambda row: language_groupings)
Except there's obviously something wrong with this as it returns an empty series when I run value_counts on the column.

Try this:
users.language = np.where( users.language !='en', 'non-english', 'english' )

Is that what you want?
In [181]: x
Out[181]:
val
en 15011
zh 101
fr 99
de 53
es 53
ko 43
ru 21
it 20
ja 19
pt 14
sv 11
no 6
da 5
nl 4
el 2
pl 2
tr 2
cs 1
fi 1
is 1
hu 1
In [182]: x.groupby(x.index == 'en').sum()
Out[182]:
val
False 459
True 15011
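The np.where approach as a self-contained sketch, using a small invented frame in place of the Kaggle data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the users frame (values invented for illustration)
users = pd.DataFrame({'language': ['en', 'en', 'zh', 'fr', 'en', 'de']})

# Vectorised relabelling: anything that is not 'en' becomes 'non-english'
users['language'] = np.where(users['language'] != 'en', 'non-english', 'english')

counts = users['language'].value_counts()
print(counts)
# english        3
# non-english    3
```

Unlike the row-by-row apply in the question, np.where evaluates the whole column at once and assigns the result back in a single step.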

Related

Get the number of involved singers in a phase

I have a dataset like this
import pandas as pd
df = pd.read_csv("music.csv")
df
   name        date      singer                        language  phase
1  Yes or No   02.01.20  Benjamin Smith                en        1
2  Parabens    01.06.21  Rafael Galvao;Simon Murphy    pt;en     2
3  Love        12.11.20  Michaela Condell              en        1
4  Paz         11.07.19  Ana Perez; Eduarda Pinto      es;pt     3
5  Stop        12.01.21  Michael Conway;Gabriel Lee    en;en     1
6  Shalom      18.06.21  Shimon Cohen                  hebr      1
7  Habibi      22.12.19  Fuad Khoury                   ar        3
8  viva        01.08.21  Veronica Barnes               en        1
9  Buznanna    23.09.20  Kurt Azzopardi                mt        1
10 Frieden     21.05.21  Gabriel Meier                 dt        1
11 Uruguay     11.04.21  Julio Ramirez                 es        1
12 Beautiful   17.03.21  Cameron Armstrong             en        3
13 Holiday     19.06.20  Bianca Watson                 en        3
14 Kiwi        21.10.20  Lachlan McNamara              en        1
15 Amore       01.12.20  Vasco Grimaldi                it        1
16 La vie      28.04.20  Victor Dubois                 fr        3
17 Yom         21.02.20  Ori Azerad; Naeem al-Hindi    hebr;ar   2
18 Elefthería  15.06.19  Nikolaos Gekas                gr        1
I convert it to 1NF:
import pandas as pd

df = pd.read_csv("music.csv")
df['language'] = df['language'].str.split(';')
df['singer'] = df['singer'].str.split(';')
# explode returns a new frame, so assign it back
df = df.explode(['language', 'singer'])
df
And I create a dataframe. Now I would like to find out which phase has the most singers involved.
I used this:
df = df.groupby('singer')
df['phase'].value_counts().idxmax()
But I could not get a solution. The dataframe has 42 observations, so some singers occur more than once.
Source: convert data to 1NF
You do not need to split/explode, you can directly count the number of ; per row and add 1:
df['singer'].str.count(';').add(1).groupby(df['phase']).sum()
If you want the classical split/explode:
(df.assign(singer=df['singer'].str.split(';'))
   .explode('singer')
   .groupby('phase')['singer'].count()
)
output:
phase
1 12
2 4
3 6
Name: singer, dtype: int64
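To go from those per-phase counts to the phase with the most singers (the original question), idxmax on the result gives it directly. A minimal sketch, with rows invented to mimic the music frame:

```python
import pandas as pd

# Minimal stand-in for the music frame (rows and phases invented for illustration)
df = pd.DataFrame({
    'singer': ['Benjamin Smith', 'Rafael Galvao;Simon Murphy', 'Michaela Condell',
               'Ana Perez;Eduarda Pinto', 'Michael Conway;Gabriel Lee'],
    'phase': [1, 2, 1, 3, 1],
})

# Singers per phase without exploding: count ';' per row and add 1
per_phase = df['singer'].str.count(';').add(1).groupby(df['phase']).sum()

# Phase with the most singers involved
busiest = per_phase.idxmax()
print(per_phase)
print(busiest)  # phase 1 in this toy frame
```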

I want to create a new column territory based on the city column

Data Frame :
city Temperature
0 Chandigarh 15
1 Delhi 22
2 Kanpur 20
3 Chennai 26
4 Manali -2
0 Bengalaru 24
1 Coimbatore 35
2 Srirangam 36
3 Pondicherry 39
I need to create another column in the data frame, which contains a boolean value for each city to indicate whether it's a union territory or not. Chandigarh, Pondicherry and Delhi are the only 3 union territories here.
I have written below code
import numpy as np
conditions = [df3['city'] == 'Chandigarh',df3['city'] == 'Pondicherry',df3['city'] == 'Delhi']
values =[1,1,1]
df3['territory'] = np.select(conditions, values)
Is there any easier or efficient way that I can write?
You can use isin:
union_terrs = ["Chandigarh", "Pondicherry", "Delhi"]
df3["territory"] = df3["city"].isin(union_terrs).astype(int)
which checks each entry in the city column and gives True if it is in union_terrs, otherwise False. The astype then converts True/False to 1/0, to get
city Temperature territory
0 Chandigarh 15 1
1 Delhi 22 1
2 Kanpur 20 0
3 Chennai 26 0
4 Manali -2 0
0 Bengalaru 24 0
1 Coimbatore 35 0
2 Srirangam 36 0
3 Pondicherry 39 1
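As a runnable sketch, on a frame trimmed to a few of the rows from the question:

```python
import pandas as pd

# Trimmed stand-in for df3 (values copied from the question)
df3 = pd.DataFrame({
    'city': ['Chandigarh', 'Delhi', 'Kanpur', 'Chennai', 'Pondicherry'],
    'Temperature': [15, 22, 20, 26, 39],
})

union_terrs = ['Chandigarh', 'Pondicherry', 'Delhi']
# isin builds a boolean mask; astype(int) maps True/False to 1/0
df3['territory'] = df3['city'].isin(union_terrs).astype(int)
print(df3)
```

Compared to np.select, this needs no per-city condition list, so adding a territory is just one more string in union_terrs.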

Using re and matching, how can I search and get certain data from a text file?

I have used re and matching to extract certain data from a text file.
But I have issues trying to get specific data using a similar technique and keep getting stuck, so I'm posting the code I used to get the lines I require. Details are at the end of the code below.
Thank you in advance!
Data from text file:
-------------------------------------------------------------------------------------------------------------------------------------
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 1 2 3 4 5 6 7
SU MO TU WE TH FR SA SU MO TU WE TH FR SA SU MO TU WE TH FR SA SU MO TU WE TH FR SA SU MO TU WE TH FR SA SU MO
121 192 175 158 168 BLK NO. 101 DYS OFF 17
CVGORD X X X X AVPDSMORD X X X X GRBDSMORD X X X PIALEXORD X X X X CHALEXORD X X CRD. 72.00 BLK. 58.31
121= 0910/1255/0901; 192= 0810/1915/1536; 175= 0750/1218/0931; 158= 0730/1240/1359; 168= 0758/1239/1638; TAFB 245.09 C/O 0.0
Code (edited: I originally forgot to include myDict[key]):
with open(filename, 'r') as f:
    count = 0
    for line in f:
        matchObj = re.match(dashes1, line)
        if matchObj:
            count += 1
            strcount = str(count)
            data = ['', '', '', '']
            f.readline()
            f.readline()
            data[0] = f.readline()
            data[1] = f.readline()
            key = "myData" + strcount
            myDict[key] = data

for key in myDict:
    print(key, '->', myDict[key])
My output is:
myData1 -> [' 121 192 175 158 168 BLK NO. 101 DYS OFF 17\n', ' CVGORD X X X X AVPDSMORD X X X X GRBDSMORD X X X PIALEXORD X X X X CHALEXORD X X CRD. 72.00 BLK. 58.31\n', '', '']
I want to get the data after BLK NO. that is 101, data after DYS OFF which is 17, and so on for CRD. value of 72.00 and BLK. value of 58.31.
I don't want to print BLK NO., DYS OFF, CRD. nor BLK. just the values after them.
I have tried the same method using re and matching but I get stuck.
Thank you for the help in advance!
I would keep things sane and simple, and just use re.findall here, after reading the entire content into a string:
inp = """-------------------------------------------------------------------------------------------------------------------------------------
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 1 2 3 4 5 6 7
SU MO TU WE TH FR SA SU MO TU WE TH FR SA SU MO TU WE TH FR SA SU MO TU WE TH FR SA SU MO TU WE TH FR SA SU MO
121 192 175 158 168 BLK NO. 101 DYS OFF 17
CVGORD X X X X AVPDSMORD X X X X GRBDSMORD X X X PIALEXORD X X X X CHALEXORD X X CRD. 72.00 BLK. 58.31
121= 0910/1255/0901; 192= 0810/1915/1536; 175= 0750/1218/0931; 158= 0730/1240/1359; 168= 0758/1239/1638; TAFB 245.09 C/O 0.0"""
keys = [r"BLK NO\.", "DYS OFF", r"CRD\.", r"BLK\.", "TAFB", "C/O"]
regex = "(" + "|".join(keys) + ")"
matches = re.findall(regex + r'\s+(\d+(?:\.\d+)?)', inp)
print(matches)
This prints:
[('BLK NO.', '101'), ('DYS OFF', '17'), ('CRD.', '72.00'), ('BLK.', '58.31'),
('TAFB', '245.09'), ('C/O', '0.0')]
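Since re.findall with two capture groups returns (label, value) tuples, they can be fed straight into dict() when a mapping is easier to consume. A sketch reusing the same regex idea on a one-line stand-in for the file:

```python
import re

# One-line stand-in for the relevant file content (fields copied from the question)
inp = "121 192 175 158 168 BLK NO. 101 DYS OFF 17 CRD. 72.00 BLK. 58.31 TAFB 245.09 C/O 0.0"

keys = [r"BLK NO\.", "DYS OFF", r"CRD\.", r"BLK\.", "TAFB", "C/O"]
regex = "(" + "|".join(keys) + r")\s+(\d+(?:\.\d+)?)"

# findall yields (label, value) pairs; dict() turns them into a lookup table
values = dict(re.findall(regex, inp))
print(values['BLK NO.'])  # '101'
print(values['CRD.'])     # '72.00'
```

Note that r"BLK NO\." is listed before r"BLK\." in the alternation, so the longer label wins where both could match.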

Populate a dataframe column based on a column of other dataframe

I have a dataframe with the population of a region and I want to populate a column of another dataframe with the same distribution.
The first dataframe looks like this:
Municipio Population Population5000
0 Lisboa 3184984 1291
1 Porto 2597191 1053
2 Braga 924351 375
3 Setúbal 880765 357
4 Aveiro 814456 330
5 Faro 569714 231
6 Leiria 560484 227
7 Coimbra 541166 219
8 Santarém 454947 184
9 Viseu 378784 154
10 Viana do Castelo 252952 103
11 Vila Real 214490 87
12 Castelo Branco 196989 80
13 Évora 174490 71
14 Guarda 167359 68
15 Beja 158702 64
16 Bragança 140385 57
17 Portalegre 120585 49
18 Total 12332794 5000
Basically, the second dataframe has 5000 rows and I want to create a column with names corresponding to the Municipios from the first df.
My problem is that I don't know how to populate the column with the same occurrence distribution as in the first dataframe.
The final result would be something like this:
Municipio
0 Porto
1 Porto
2 Lisboa
3 Évora
4 Lisboa
5 Aveiro
...
4996 Viseu
4997 Lisboa
4998 Porto
4999 Guarda
5000 Beja
Can someone help me?
I would use a simple comprehension to build a list of size 5000 with as many elements with a town name as the value of Population5000, and optionally shuffle it if you want a random order:
lst = [m
       for m, n in df.loc[:len(df)-2, ['Municipio', 'Population5000']].to_numpy()
       for i in range(n)]
random.shuffle(lst)
result = pd.Series(1, index=lst, name='Municipio')
Initialized with random.seed(0), it gives:
Setúbal 1
Santarém 1
Lisboa 1
Setúbal 1
Aveiro 1
..
Santarém 1
Porto 1
Lisboa 1
Faro 1
Aveiro 1
Name: Municipio, Length: 5000, dtype: int64
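The same repetition can be done without a Python-level loop: np.repeat expands each Municipio name by its Population5000 count. A sketch on a trimmed frame with invented small counts, assuming the 'Total' row has already been dropped:

```python
import numpy as np
import pandas as pd

# Trimmed stand-in for the first frame, 'Total' row already excluded
# (counts shrunk for illustration)
muni = pd.DataFrame({
    'Municipio': ['Lisboa', 'Porto', 'Braga'],
    'Population5000': [3, 2, 1],
})

# Repeat each name as many times as its Population5000 count, then shuffle
names = np.repeat(muni['Municipio'].to_numpy(), muni['Population5000'].to_numpy())
rng = np.random.default_rng(0)
rng.shuffle(names)

second = pd.DataFrame({'Municipio': names})
print(second['Municipio'].value_counts())
```

The resulting value_counts reproduces the Population5000 column exactly, which is what the simple zip/map answer below cannot do (a dict maps each key to a single value, so every Municipio appears once).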
You could just do a simple map:
map = dict(zip(DF1['Population5000'], DF1['Municipio']))
DF2['Municipo'] = DF2['Population5000'].map(map)
Or just change the Population5000 column name in the map to whatever the column in DF2 containing your population values is called:
map = dict(zip(municipios['Population5000'], municipios['Municipio']))
df['Municipio'] = municipios['Population5000'].map(map)
I tried this as suggested by Amen_90, but the Municipio column of the second dataframe only gets populated with 1 instance of every Municipio, when I wanted the same value_counts as in the "Population5000" column of my first dataframe.
df["Municipio"].value_counts()
Beja 1
Aveiro 1
Bragança 1
Vila Real 1
Porto 1
Santarém 1
Coimbra 1
Guarda 1
Leiria 1
Castelo Branco 1
Viseu 1
Total 1
Faro 1
Portalegre 1
Braga 1
Évora 1
Setúbal 1
Viana do Castelo 1
Lisboa 1
Name: Municipio, dtype: int64

query pandas cols with loop

I have the following df:
country sport score
0 ita swim 15
1 fr run 25
2 ger golf 37
3 ita run 17
4 fr golf 58
5 fr run 35
I am interested in some elements of the categories only:
ctr = ['ita','fr']
sprt= ['run','golf']
I was hoping something like this would extract them:
df[(df['country']== x for x in ctr)&(df['sport']== x for x in sprt)]
but while it doesn't throw any error, it returns an empty result.
Any suggestions?
I also tried:
df[(df['country']== {x for x in ctr})&(df['sport']== {x for x in sprt})]
EDIT:
The reason I want to use a loop is that I am actually interested in the top 3 scores of each combination, which I hoped to concat:
df1 = pd.concat(df[(df['country']== x for x in ctr)&(df['sport']== x for x in sprt)].sort_values(by=['score'],ascending=False).head(3))
Use Series.isin twice to check membership:
df1 = df[(df['country'].isin(ctr))&(df['sport'].isin(sprt))]
print (df1)
country sport score
1 fr run 25
3 ita run 17
4 fr golf 58
5 fr run 35
df2 = df1.sort_values('score', ascending=False).groupby(['country','sport']).head(3)
print (df2)
country sport score
4 fr golf 58
5 fr run 35
1 fr run 25
3 ita run 17
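An alternative for the top-3 step: SeriesGroupBy.nlargest picks the 3 highest scores within each (country, sport) pair without sorting the whole frame first. A sketch on the frame from the question:

```python
import pandas as pd

# Frame copied from the question
df = pd.DataFrame({
    'country': ['ita', 'fr', 'ger', 'ita', 'fr', 'fr'],
    'sport': ['swim', 'run', 'golf', 'run', 'golf', 'run'],
    'score': [15, 25, 37, 17, 58, 35],
})
ctr = ['ita', 'fr']
sprt = ['run', 'golf']

df1 = df[df['country'].isin(ctr) & df['sport'].isin(sprt)]
# Top 3 scores per (country, sport) pair, already sorted descending
top3 = df1.groupby(['country', 'sport'])['score'].nlargest(3)
print(top3)
```

The result is a Series with a (country, sport, original index) MultiIndex, so each group's rows stay identifiable.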
