I have the following 3 data frames:
dfSpa = pd.read_csv(
"sentences and translations/SpanishSentences.csv", sep=',')
print(dfSpa.head())
dfEng = pd.read_csv(
'sentences and translations/EngTranslations.csv', sep=',')
print(dfEng.head())
dfIndex = pd.read_csv(
'sentences and translations/SpaSentencesThatHaveEngTranslations.csv', sep=',')
print(dfIndex.head())
These print the following:
0 1 2
0 2482 spa Tengo que irme a dormir.
1 2487 spa Ahora, Muiriel tiene 20 años.
2 2493 spa Simplemente no sé qué decir...
3 2495 spa Yo estaba en las montañas.
4 2497 spa No sé si tengo tiempo.
0 1 2
0 1277 eng I have to go to sleep.
1 1282 eng Muiriel is 20 now.
2 1287 eng This is never going to end.
3 1288 eng I just don't know what to say.
4 1290 eng I was in the mountains.
0 1
0 2482 1277
1 2487 1282
2 2493 1288
3 2493 693485
4 2495 1290
Column 0 in dfIndex refers to a Spanish sentence in dfSpa, and column 1 refers to the English translation in dfEng that goes with it. dfSpa has more rows than the other two data frames, so some sentences have no English translation. dfIndex is also longer than dfEng because some Spanish sentences map to more than one translation, as with 2493 in dfIndex.head() above.
I am trying to create another data frame that simply has the Spanish sentence in one column and the corresponding English translation in the other column. How could I get this done?
dfIndex.merge(
dfSpa[[0,2]], on=0)[[1,2]].rename(columns={2: "Spa"}).merge(
dfEng, left_on=1, right_on=0).rename(columns={2: "Eng"})[['Spa', 'Eng']]
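A step-by-step equivalent of the one-liner above may be easier to follow. This is only a sketch: it assumes the columns really are labelled with the integers 0, 1, 2 (e.g. the files were read with header=None), and the intermediate names spa, both and result are illustrative, not part of the original answer.

# attach the Spanish text to each index pair, keep only the translation id and the text
spa = (dfIndex.merge(dfSpa[[0, 2]], on=0)[[1, 2]]
              .rename(columns={2: "Spa"}))
# attach the English text via the translation id held in column 1
both = (spa.merge(dfEng[[0, 2]], left_on=1, right_on=0)
           .rename(columns={2: "Eng"}))
result = both[["Spa", "Eng"]]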
You could try:
df_n = pd.DataFrame()
df_n['A'] = [dfSpa.iloc[x].values for x in dfSpa.loc[:, 0]]
df_n['B'] = [dfEng.iloc[x].values for x in dfEng.loc[:, 0]]
and then remove duplicated rows using:
df_n = df_n.drop_duplicates(subset = ['A'])
It would be easier to check if you had sample dfs.
I have a dataset like this
import pandas as pd
df = pd.read_csv("music.csv")
df
    name        date      singer                      language  phase
1   Yes or No   02.01.20  Benjamin Smith              en        1
2   Parabens    01.06.21  Rafael Galvao;Simon Murphy  pt;en     2
3   Love        12.11.20  Michaela Condell            en        1
4   Paz         11.07.19  Ana Perez; Eduarda Pinto    es;pt     3
5   Stop        12.01.21  Michael Conway;Gabriel Lee  en;en     1
6   Shalom      18.06.21  Shimon Cohen                hebr      1
7   Habibi      22.12.19  Fuad Khoury                 ar        3
8   viva        01.08.21  Veronica Barnes             en        1
9   Buznanna    23.09.20  Kurt Azzopardi              mt        1
10  Frieden     21.05.21  Gabriel Meier               dt        1
11  Uruguay     11.04.21  Julio Ramirez               es        1
12  Beautiful   17.03.21  Cameron Armstrong           en        3
13  Holiday     19.06.20  Bianca Watson               en        3
14  Kiwi        21.10.20  Lachlan McNamara            en        1
15  Amore       01.12.20  Vasco Grimaldi              it        1
16  La vie      28.04.20  Victor Dubois               fr        3
17  Yom         21.02.20  Ori Azerad; Naeem al-Hindi  hebr;ar   2
18  Elefthería  15.06.19  Nikolaos Gekas              gr        1
I convert it to 1NF:
import pandas as pd
import numpy as np
df = pd.read_csv("music.csv")
df['language'] = df['language'].str.split(';')
df['singer'] = df['singer'].str.split(';')
df = df.explode(['language', 'singer'])  # assign back, otherwise the exploded result is discarded
d = pd.DataFrame(df)
d
And I create a dataframe. Now I would like to find out which phase has the most singers involved.
I used this:
df = df.group.by('singer')
df['phase'].value_counts().idxmax()
But I could not get a solution.
The dataframe has 42 observations, so some singers occur again.
Source: convert data to 1NF
You do not need to split/explode; you can directly count the number of ';' per row, add 1, and sum per phase:
df['singer'].str.count(';').add(1).groupby(df['phase']).sum()
If you want the classical split/explode:
(df.assign(singer=df['singer'].str.split(';'))
.explode('singer')
.groupby('phase')['singer'].count()
)
output:
phase
1 12
2 4
3 6
Name: singer, dtype: int64
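If only the phase label itself is wanted rather than the full count table, chaining idxmax() on either result should return it; a small sketch, not part of the original answer:
df['singer'].str.count(';').add(1).groupby(df['phase']).sum().idxmax()
# -> 1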
So I have this df (a table coming from a PDF transformation), for example:
    ElementRow  ElementColumn  ElementPage  ElementText         X1    Y1    X2    Y2
1   50          0              1            Emergency Contacts  917   8793  2191  8878
2   51          0              1            Contact             1093  1320  1451  1388
3   51          2              1            Relationship        2444  1320  3026  1388
4   51          7              1            Work Phone          3329  1320  3898  1388
5   51          9              1            Home Phone          4260  1320  4857  1388
6   51          10             1            Cell Phone          5176  1320  5684  1388
7   51          12             1            Priority Phone      6143  1320  6495  1388
8   51          14             1            Contact Address     6542  1320  7300  1388
9   51          17             1            City                7939  1320  7300  1388
10  51          18             1            State               8808  1320  8137  1388
11  51          21             1            Zip                 9134  1320  9294  1388
12  52          0              1            Silvia Smith        1093  1458  1973  1526
13  52          2              1            Mother              2444  1458  2783  1526
13  52          7              1            (123) 456-78910     5176  1458  4979  1526
14  52          10             1            Austin              7939  1458  8406  1526
15  52          15             1            Texas               8808  1458  8961  1526
16  52          20             1            76063               9134  1458  9421  1526
17  52          2              1            1234 Parkside Ct    6542  1458  9421  1526
18  53          0              1            Naomi Smith         1093  2350  1973  1526
19  53          2              1            Aunt                2444  2350  2783  1526
20  53          7              1            (123) 456-78910     5176  2350  4979  1526
21  53          10             1            Austin              7939  2350  8406  1526
22  53          15             1            Texas               8808  2350  8961  1526
23  53          20             1            76063               9134  2350  9421  1526
24  53          2              1            3456 Parkside Ct    6542  2350  9421  1526
25  54          40             1            End Employee Line   6542  2350  9421  1526
25  55          0              1            Emergency Contacts  917   8793  2350  8878
I'm trying to separate each record into rows, using the ElementRow column as a reference, keep the headers from the first rows, and then iterate through the remaining rows. The X1 column indicates which header each value belongs under. I would like the data to look like this:
   Contact       Relationship  Work Phone  Cell Phone       Priority  ContactAddress    City    State  Zip
1  Silvia Smith  Mother                    (123) 456-78910            1234 Parkside Ct  Austin  Texas  76063
2  Naomi Smith   Aunt                      (123) 456-78910            3456 Parkside Ct  Austin  Texas  76063
Things I tried:
Taking the rows between the two markers: I tried to slice using the first and the last index, but it raised this error:
emergStartIndex = df.index[df['ElementText'] == 'Emergency Contacts']
emergLastIndex = df.index[df['ElementText'] == 'End Employee Line']
emerRows_between = df.iloc[emergStartIndex:emergLastIndex]
TypeError: cannot do positional indexing on RangeIndex with these indexers [Int64Index([...
It does work with this numpy trick:
emerRows_between = df.iloc[np.r_[1:54,55:107]]
emerRows_between
but when I tried to substitute the indexes, it showed this:
emerRows_between = df.iloc[np.r_[emergStartIndex:emergLastIndex]]
emerRows_between
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I tried iterating row by row like this, but at some point the df reaches the end and I get an index-out-of-bounds error.
emergencyContactRow1 = df['ElementText','X1'].iloc[emergStartIndex+1].reset_index(drop=True)
emergencyContactRow2 = df['ElementText','X1'].iloc[emergStartIndex+2].reset_index(drop=True)
emergencyContactRow3 = df['ElementText','X1'].iloc[emergStartIndex+3].reset_index(drop=True)
emergencyContactRow4 = df['ElementText','X1'].iloc[emergStartIndex+4].reset_index(drop=True)
emergencyContactRow5 = df['ElementText','X1'].iloc[emergStartIndex+5].reset_index(drop=True)
emergencyContactRow6 = df['ElementText','X1'].iloc[emergStartIndex+6].reset_index(drop=True)
emergencyContactRow7 = df['ElementText','X1'].iloc[emergStartIndex+7].reset_index(drop=True)
emergencyContactRow8 = df['ElementText','X1'].iloc[emergStartIndex+8].reset_index(drop=True)
emergencyContactRow9 = df['ElementText','X1'].iloc[emergStartIndex+9].reset_index(drop=True)
emergencyContactRow10 = df['ElementText','X1'].iloc[emergStartIndex+10].reset_index(drop=True)
frameEmergContact1 = [emergencyContactRow1, emergencyContactRow2, emergencyContactRow3, emergencyContactRow4, emergencyContactRow5, emergencyContactRow6, emergencyContactRow7, emergencyContactRow8, emergencyContactRow9, emergencyContactRow10]
df_emergContact1= pd.concat(frameEmergContact1 , axis=1)
df_emergContact1.columns = range(df_emergContact1.shape[1])
So how can I make this code dynamic, avoid the index-out-of-bounds errors, and keep my headers, taking as a reference only the first row after the Emergency Contacts row? I know I haven't used the X1 column yet, but first I have to work out how to iterate through those multiple index ranges.
Each span from an Emergency Contacts index to an End Employee Line index belongs to one person (one employee) in the whole dataframe, so after capturing all those values the idea is to also keep a counter variable to see how many times data is captured between those two markers.
It's a bit ugly, but this should do it. Basically you don't need the first row or the last two rows, so if you get rid of those and then pivot the X1 and ElementText columns you will be pretty close. Then it's a matter of getting rid of null values and promoting the first row to the header.
df = df.iloc[1:-2][['ElementText', 'X1', 'ElementRow']].pivot(columns='X1', values='ElementText')
df = pd.DataFrame([x[~pd.isnull(x)] for x in df.values.T]).T
df.columns = df.iloc[0]
df = df[1:]
Split the dataframe into chunks whenever "Emergency Contacts" appears in column "ElementText"
Parse each chunk into the required format
Append to the output
import numpy as np

list_of_df = np.array_split(data, data[data["ElementText"] == "Emergency Contacts"].index)
output = pd.DataFrame()
for frame in list_of_df:
    df = frame[~frame["ElementText"].isin(["Emergency Contacts", "End Employee Line"])].dropna()
    if df.shape[0] > 0:
        temp = pd.DataFrame(df.groupby("X1")["ElementText"].apply(list).tolist()).T
        temp.columns = temp.iloc[0]
        temp = temp.drop(0)
        output = pd.concat([output, temp], ignore_index=True)  # DataFrame.append was removed in pandas 2.x
>>> output
0 Contact Relationship Work Phone ... City State Zip
0 Silvia Smith Mother None ... Austin Texas 76063
1 Naomi Smith Aunt None ... Austin Texas 76063
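As an aside on the TypeError shown in the question: df.index[mask] returns an Index object, not a scalar, so it cannot be used directly as a slice bound. If each block has a single marker row, taking the first element is a minimal fix; a sketch under that assumption, not part of either answer above:
start = df.index[df['ElementText'] == 'Emergency Contacts'][0]
end = df.index[df['ElementText'] == 'End Employee Line'][0]
emerRows_between = df.loc[start:end]  # label-based slice, inclusive of both ends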
I have two data frames. (These are simple examples; my real data has nearly 3,000 rows.)
>df
player position nation Mins
Messi FW ARG 3302
Ronaldo FW POR 3029
Van Dijk DF NED 500
Mane FW SEN 3088
Alena MF SPA 1592
>df2
player position
Alena CM
Ronaldo ST
Mane LW
Van Dijk CB
Messi ST
What I'm trying to do is replace the position data in df with the position data from df2, matching on the player column.
I've tried sorting both frames by the player column and then just creating a new position column with df['pos2'] = df2['position'],
but it ends up slightly wrong in some places in the resulting column.
Which is why I'm looking to do it based on matching a column.
Merge your dataframes on the player column:
>>> df1.drop(columns='position').merge(df2, on='player')
player nation Mins position
0 Messi ARG 3302 ST
1 Ronaldo POR 3029 ST
2 Van Dijk NED 500 CB
3 Mane SEN 3088 LW
4 Alena SPA 1592 CM
Maybe you want to keep history:
>>> df1.merge(df2, on='player', suffixes=('_old', '_new'))
player position_old nation Mins position_new
0 Messi FW ARG 3302 ST
1 Ronaldo FW POR 3029 ST
2 Van Dijk DF NED 500 CB
3 Mane FW SEN 3088 LW
4 Alena MF SPA 1592 CM
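An alternative sketch that avoids the merge, assuming each player appears only once in df2: build a lookup Series indexed by player and map it onto df's player column (not part of the original answer):
lookup = df2.set_index('player')['position']   # player -> new position
df['position'] = df['player'].map(lookup)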
I have a dataframe with the population of a region, and I want to populate a column of another dataframe with the same distribution.
The first dataframe looks like this:
Municipio Population Population5000
0 Lisboa 3184984 1291
1 Porto 2597191 1053
2 Braga 924351 375
3 Setúbal 880765 357
4 Aveiro 814456 330
5 Faro 569714 231
6 Leiria 560484 227
7 Coimbra 541166 219
8 Santarém 454947 184
9 Viseu 378784 154
10 Viana do Castelo 252952 103
11 Vila Real 214490 87
12 Castelo Branco 196989 80
13 Évora 174490 71
14 Guarda 167359 68
15 Beja 158702 64
16 Bragança 140385 57
17 Portalegre 120585 49
18 Total 12332794 5000
Basically, the second dataframe has 5000 rows and I want to create a column with names corresponding to the Municipios from the first df.
My problem is that I don't know how to populate the column with the same occurrence distribution as in the first dataframe.
The final result would be something like this:
Municipio
0 Porto
1 Porto
2 Lisboa
3 Évora
4 Lisboa
5 Aveiro
...
4996 Viseu
4997 Lisboa
4998 Porto
4999 Guarda
5000 Beja
Can someone help me?
I would use a simple comprehension to build a list of size 5000, repeating each town name as many times as its Population5000 value, and optionally shuffle it if you want a random order:
import random

lst = [m for m, n in df.loc[:len(df)-2,
                            ['Municipio', 'Population5000']].to_numpy()
       for i in range(n)]
random.shuffle(lst)
result = pd.Series(1, index=lst, name='Municipio')
Initialized with random.seed(0), it gives:
Setúbal 1
Santarém 1
Lisboa 1
Setúbal 1
Aveiro 1
..
Santarém 1
Porto 1
Lisboa 1
Faro 1
Aveiro 1
Name: Municipio, Length: 5000, dtype: int64
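An equivalent vectorized sketch using Series.repeat instead of the comprehension, assuming the Total row is the last row and should be excluded (names are illustrative):
towns = df.iloc[:-1]                                  # drop the Total row
municipio_col = (towns['Municipio']
                 .repeat(towns['Population5000'])     # one entry per inhabitant bucket, 5000 in total
                 .sample(frac=1, random_state=0)      # shuffle
                 .reset_index(drop=True))
The second dataframe's column can then be filled with municipio_col.values, assuming it also has 5000 rows.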
You could just do a simple map:
map = dict(zip(DF1['Population5000'], DF1['Municipio']))
DF2['Municipio'] = DF2['Population5000'].map(map)
or just change the Population5000 column name in the map (DF2) to whatever the column containing your population values is called.
map = dict(zip(municipios['Population5000'], municipios['Municipio']))
df['Municipio'] = municipios['Population5000'].map(map)
I tried this as suggested by Amen_90, and the Municipio column in the second dataframe only gets populated with 1 instance of every Municipio, when I wanted it to have the same value_counts as the "Population5000" column in my first dataframe.
df["Municipio"].value_counts()
Beja 1
Aveiro 1
Bragança 1
Vila Real 1
Porto 1
Santarém 1
Coimbra 1
Guarda 1
Leiria 1
Castelo Branco 1
Viseu 1
Total 1
Faro 1
Portalegre 1
Braga 1
Évora 1
Setúbal 1
Viana do Castelo 1
Lisboa 1
Name: Municipio, dtype: int64
I'm working with an airbnb dataset on Kaggle:
https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings
and I want to simplify the values in the language column into two groupings: english and non-english.
For instance:
users.language.value_counts()
en 15011
zh 101
fr 99
de 53
es 53
ko 43
ru 21
it 20
ja 19
pt 14
sv 11
no 6
da 5
nl 4
el 2
pl 2
tr 2
cs 1
fi 1
is 1
hu 1
Name: language, dtype: int64
And the result I want it is:
users.language.value_counts()
english 15011
non-english 459
Name: language, dtype: int64
This is sort of the solution I want:
def language_groupings():
for i in users:
if users.language !='en':
replace(users.language.str, 'non-english')
else:
replace(users.language.str, 'english')
return users
users['language'] = users.apply(lambda row: language_groupings)
Except there's obviously something wrong with this as it returns an empty series when I run value_counts on the column.
Try this:
import numpy as np
users.language = np.where(users.language != 'en', 'non-english', 'english')
is that what you want?
In [181]: x
Out[181]:
val
en 15011
zh 101
fr 99
de 53
es 53
ko 43
ru 21
it 20
ja 19
pt 14
sv 11
no 6
da 5
nl 4
el 2
pl 2
tr 2
cs 1
fi 1
is 1
hu 1
In [182]: x.groupby(x.index == 'en').sum()
Out[182]:
val
False 459
True 15011
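And to reproduce exactly the two labelled rows the question asked for, a small sketch applied to the original column (before it has been overwritten by the np.where approach above):
users.language.map(lambda x: 'english' if x == 'en' else 'non-english').value_counts()
# english        15011
# non-english      459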