I have a dataframe with the population of each region, and I want to populate a column of another dataframe with the same distribution.
The first dataframe looks like this:
Municipio Population Population5000
0 Lisboa 3184984 1291
1 Porto 2597191 1053
2 Braga 924351 375
3 Setúbal 880765 357
4 Aveiro 814456 330
5 Faro 569714 231
6 Leiria 560484 227
7 Coimbra 541166 219
8 Santarém 454947 184
9 Viseu 378784 154
10 Viana do Castelo 252952 103
11 Vila Real 214490 87
12 Castelo Branco 196989 80
13 Évora 174490 71
14 Guarda 167359 68
15 Beja 158702 64
16 Bragança 140385 57
17 Portalegre 120585 49
18 Total 12332794 5000
Basically, the second dataframe has 5000 rows, and I want to create a column whose names correspond to the Municipios from the first df.
My problem is that I don't know how to populate the column with the same occurrence distribution as in the first dataframe.
The final result would be something like this:
Municipio
0 Porto
1 Porto
2 Lisboa
3 Évora
4 Lisboa
5 Aveiro
...
4995 Viseu
4996 Lisboa
4997 Porto
4998 Guarda
4999 Beja
Can someone help me?
I would use a simple comprehension to build a list of size 5000, with as many copies of each town name as its Population5000 value, and optionally shuffle it if you want a random order:
import random
import pandas as pd

# .loc[:len(df)-2] keeps rows 0..17, i.e. everything except the 'Total' row
lst = [m for m, n in df.loc[:len(df)-2,
                            ['Municipio', 'Population5000']].to_numpy()
       for i in range(n)]
random.shuffle(lst)
result = pd.Series(1, index=lst, name='Municipio')
Initialized with random.seed(0), it gives:
Setúbal 1
Santarém 1
Lisboa 1
Setúbal 1
Aveiro 1
..
Santarém 1
Porto 1
Lisboa 1
Faro 1
Aveiro 1
Name: Municipio, Length: 5000, dtype: int64
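An equivalent, more vectorized sketch (assuming the same df layout with the 'Total' row last; df2 is a hypothetical name for the 5000-row second dataframe):
import pandas as pd

towns = df.iloc[:-1]  # drop the 'Total' row
municipio = (towns['Municipio']
             .repeat(towns['Population5000'])  # one occurrence per planned row
             .sample(frac=1, random_state=0)   # shuffle
             .reset_index(drop=True))
df2['Municipio'] = municipio.to_numpy()  # 5000 values, per the 'Total' row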
You could just do a simple map:
mapping = dict(zip(DF1['Population5000'], DF1['Municipio']))
DF2['Municipio'] = DF2['Population5000'].map(mapping)
Or just change the 'Population5000' column name on the DF2 side to whatever the column containing your population values is called:
mapping = dict(zip(municipios['Population5000'], municipios['Municipio']))
df['Municipio'] = municipios['Population5000'].map(mapping)
I tried this as suggested by Amen_90, but the Municipio column of the second dataframe only gets populated with one instance of every Municipio, when I wanted the same value_counts as in the "Population5000" column of my first dataframe.
df["Municipio"].value_counts()
Beja 1
Aveiro 1
Bragança 1
Vila Real 1
Porto 1
Santarém 1
Coimbra 1
Guarda 1
Leiria 1
Castelo Branco 1
Viseu 1
Total 1
Faro 1
Portalegre 1
Braga 1
Évora 1
Setúbal 1
Viana do Castelo 1
Lisboa 1
Name: Municipio, dtype: int64
So I have this df (a table coming from a PDF transformation), for example:
    ElementRow  ElementColumn  ElementPage         ElementText    X1    Y1    X2    Y2
1           50              0            1  Emergency Contacts   917  8793  2191  8878
2           51              0            1             Contact  1093  1320  1451  1388
3           51              2            1        Relationship  2444  1320  3026  1388
4           51              7            1          Work Phone  3329  1320  3898  1388
5           51              9            1          Home Phone  4260  1320  4857  1388
6           51             10            1          Cell Phone  5176  1320  5684  1388
7           51             12            1      Priority Phone  6143  1320  6495  1388
8           51             14            1     Contact Address  6542  1320  7300  1388
9           51             17            1                City  7939  1320  7300  1388
10          51             18            1               State  8808  1320  8137  1388
11          51             21            1                 Zip  9134  1320  9294  1388
12          52              0            1        Silvia Smith  1093  1458  1973  1526
13          52              2            1              Mother  2444  1458  2783  1526
13          52              7            1     (123) 456-78910  5176  1458  4979  1526
14          52             10            1              Austin  7939  1458  8406  1526
15          52             15            1               Texas  8808  1458  8961  1526
16          52             20            1               76063  9134  1458  9421  1526
17          52              2            1    1234 Parkside Ct  6542  1458  9421  1526
18          53              0            1         Naomi Smith  1093  2350  1973  1526
19          53              2            1                Aunt  2444  2350  2783  1526
20          53              7            1     (123) 456-78910  5176  2350  4979  1526
21          53             10            1              Austin  7939  2350  8406  1526
22          53             15            1               Texas  8808  2350  8961  1526
23          53             20            1               76063  9134  2350  9421  1526
24          53              2            1    3456 Parkside Ct  6542  2350  9421  1526
25          54             40            1   End Employee Line  6542  2350  9421  1526
25          55              0            1  Emergency Contacts   917  8793  2350  8878
I'm trying to separate each record into its own row, taking the ElementRow column as a reference, keeping the headers from the first rows, and then iterating through the rows after them. The X1 column indicates which header each value belongs under. I would like to have the data laid out like this:
   Contact       Relationship  Work Phone  Cell Phone       Priority  ContactAddress    City    State  Zip
1  Silvia Smith  Mother                    (123) 456-78910            1234 Parkside Ct  Austin  Texas  76063
2  Naomi Smith   Aunt                      (123) 456-78910            3456 Parkside Ct  Austin  Texas  76063
Things I tried:
To take the rows in between by iterating through the columns. I tried to slice using the first index and the last index, but it showed this error:
emergStartIndex = df.index[df['ElementText'] == 'Emergency Contacts']
emergLastIndex = df.index[df['ElementText'] == 'End Employee Line']
emerRows_between = df.iloc[emergStartIndex:emergLastIndex]
TypeError: cannot do positional indexing on RangeIndex with these indexers [Int64Index([...
This way works with this NumPy trick:
emerRows_between = df.iloc[np.r_[1:54,55:107]]
emerRows_between
but when I tried to substitute in the indexes, it showed this:
emerRows_between = df.iloc[np.r_[emergStartIndex:emergLastIndex]]
emerRows_between
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
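A minimal sketch of a fix for both errors (assuming only the first block is wanted and a default RangeIndex): the boolean masks return pandas Index objects rather than plain integers, so take a scalar position from each before slicing:
start = df.index[df['ElementText'] == 'Emergency Contacts'][0]
stop = df.index[df['ElementText'] == 'End Employee Line'][0]
emerRows_between = df.iloc[start:stop]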
I tried iterating row by row like this, but at some point the df reaches the end and I receive an index-out-of-bounds error.
emergencyContactRow1 = df[['ElementText','X1']].iloc[emergStartIndex+1].reset_index(drop=True)
emergencyContactRow2 = df[['ElementText','X1']].iloc[emergStartIndex+2].reset_index(drop=True)
emergencyContactRow3 = df[['ElementText','X1']].iloc[emergStartIndex+3].reset_index(drop=True)
emergencyContactRow4 = df[['ElementText','X1']].iloc[emergStartIndex+4].reset_index(drop=True)
emergencyContactRow5 = df[['ElementText','X1']].iloc[emergStartIndex+5].reset_index(drop=True)
emergencyContactRow6 = df[['ElementText','X1']].iloc[emergStartIndex+6].reset_index(drop=True)
emergencyContactRow7 = df[['ElementText','X1']].iloc[emergStartIndex+7].reset_index(drop=True)
emergencyContactRow8 = df[['ElementText','X1']].iloc[emergStartIndex+8].reset_index(drop=True)
emergencyContactRow9 = df[['ElementText','X1']].iloc[emergStartIndex+9].reset_index(drop=True)
emergencyContactRow10 = df[['ElementText','X1']].iloc[emergStartIndex+10].reset_index(drop=True)

frameEmergContact1 = [emergencyContactRow1, emergencyContactRow2, emergencyContactRow3,
                      emergencyContactRow4, emergencyContactRow5, emergencyContactRow6,
                      emergencyContactRow7, emergencyContactRow8, emergencyContactRow9,
                      emergencyContactRow10]
df_emergContact1 = pd.concat(frameEmergContact1, axis=1)
df_emergContact1.columns = range(df_emergContact1.shape[1])
So how do I make this code dynamic, or how do I avoid the index-out-of-bounds errors and keep my headers, taking as a reference only the first row after the 'Emergency Contacts' row? I know I haven't tried to use the X1 column yet, but first I have to resolve how to iterate through those multiple indexes.
Each span from an 'Emergency Contacts' index to an 'End Employee Line' index belongs to one person (one employee) in the whole dataframe, so the idea, after capturing all those values, is to also keep a counter variable to see how many times data is captured between those two indexes.
It's a bit ugly, but this should do it. Basically, you don't need the first row or the last two rows, so if you get rid of those and then pivot the X1 and ElementText columns, you will be pretty close. Then it's a matter of getting rid of null values and promoting the first row to the header.
df = df.iloc[1:-2][['ElementText','X1','ElementRow']].pivot(columns='X1', values='ElementText')
df = pd.DataFrame([x[~pd.isnull(x)] for x in df.values.T]).T
df.columns = df.iloc[0]
df = df[1:]
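Note that this assumes a single employee record between the marker rows; with several employees (as in the sample above), the chunked approach in the next answer generalizes better. Optionally (an extra cleanup, not in the original), reset the index and drop the lingering columns-axis name:
df = df.reset_index(drop=True).rename_axis(None, axis=1)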
1. Split the dataframe into chunks whenever "Emergency Contacts" appears in column "ElementText"
2. Parse each chunk into the required format
3. Append to the output
import numpy as np
import pandas as pd

# split at every index where ElementText == "Emergency Contacts"
list_of_df = np.array_split(data, data[data["ElementText"] == "Emergency Contacts"].index)

frames = []
for frame in list_of_df:
    df = frame[~frame["ElementText"].isin(["Emergency Contacts", "End Employee Line"])].dropna()
    if df.shape[0] > 0:
        # one column per X1 position; the first list element is the header row
        temp = pd.DataFrame(df.groupby("X1")["ElementText"].apply(list).tolist()).T
        temp.columns = temp.iloc[0]
        temp = temp.drop(0)
        frames.append(temp)

# DataFrame.append is deprecated, so collect the chunks and concat once
output = pd.concat(frames, ignore_index=True)
>>> output
0 Contact Relationship Work Phone ... City State Zip
0 Silvia Smith Mother None ... Austin Texas 76063
1 Naomi Smith Aunt None ... Austin Texas 76063
Data frame:
city Temperature
0 Chandigarh 15
1 Delhi 22
2 Kanpur 20
3 Chennai 26
4 Manali -2
0 Bengalaru 24
1 Coimbatore 35
2 Srirangam 36
3 Pondicherry 39
I need to create another column in the data frame which contains a boolean value for each city to indicate whether it's a union territory or not. Chandigarh, Pondicherry and Delhi are the only 3 union territories here.
I have written the code below:
import numpy as np

conditions = [df3['city'] == 'Chandigarh', df3['city'] == 'Pondicherry', df3['city'] == 'Delhi']
values = [1, 1, 1]
df3['territory'] = np.select(conditions, values)
Is there any easier or efficient way that I can write?
You can use isin:
union_terrs = ["Chandigarh", "Pondicherry", "Delhi"]
df3["territory"] = df3["city"].isin(union_terrs).astype(int)
which checks each entry in the city column and, if it is in union_terrs, gives True, and otherwise False. The astype converts True/False to 1/0,
to get
city Temperature territory
0 Chandigarh 15 1
1 Delhi 22 1
2 Kanpur 20 0
3 Chennai 26 0
4 Manali -2 0
0 Bengalaru 24 0
1 Coimbatore 35 0
2 Srirangam 36 0
3 Pondicherry 39 1
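If a True/False column is all you need, you can stop before the astype (a minor variation on the above):
df3["territory"] = df3["city"].isin(union_terrs)  # boolean dtype instead of 1/0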
I have two dataframes of differing index lengths that look like this:
df_1:
State Month Total Time ... N columns
AL 4 1000
AL 5 500
.
.
.
VA 11 750
VA 12 1500
df_2:
State Month ... N columns
AL 4
AL 5
.
.
.
VA 11
VA 12
I would like to add a Total Time column to df_2 that uses the values from the Total Time column of df_1 when the State and Month values are the same between dataframes. Ultimately, I would end up with:
df_2:
State Month Total Time ... N columns
AL 4 1000
AL 5 500
.
.
.
VA 11 750
VA 12 1500
I want to do this conditionally since the index lengths are not the same. I have tried this so far:
def f(row):
    if (row['State'] == row['State']) & (row['Month'] == row['Month']):
        val = x for x in df_1["Total Time"]
        return val

df_2['Total Time'] = df_2.apply(f, axis=1)
This did not work. What method would you use to accomplish this?
Any help is appreciated!
You can do this:
Consider my sample dataframes:
In [2327]: df_1
Out[2327]:
State Month Total Time
0 AL 2 1000
1 AB 4 500
2 BC 1 600
In [2328]: df_2
Out[2328]:
State Month
0 AL 2
1 AB 5
In [2329]: df_2 = pd.merge(df_2, df_1, on=['State', 'Month'], how='left')
In [2330]: df_2
Out[2330]:
State Month Total Time
0 AL 2 1000.0
1 AB 5 NaN
As mentioned in the other answer, pd.merge() is how you would join two dataframes and extract a column. The issue is that merging solely on 'State' and 'Month' would result in every permutation becoming a new row (every AL-4 in df_1 would join with every other AL-4 in df_2).
With your example, there'd be:
df_1
State Month Total Time df_1 col...
0 AL 4 1000 6
1 AL 4 500 7
2 VA 12 750 8
3 VA 12 1500 9
df_2
State Month df_2 col...
0 AL 4 1
1 AL 4 2
2 VA 12 3
3 VA 12 4
df_1_cols_to_use = ['State', 'Month', 'Total Time']
# note the selection of the column to use from df_1. We only want the column
# we're merging on, plus the column(s) we want to bring in, in this case 'Total Time'
new_df = pd.merge(df_2, df_1[df_1_cols_to_use], on=['State', 'Month'], how='left')
new_df:
State Month df_2 col... Total Time
0 AL 4 1 1000
1 AL 4 1 500
2 AL 4 2 1000
3 AL 4 2 500
4 VA 12 3 750
5 VA 12 3 1500
6 VA 12 4 750
7 VA 12 4 1500
You mention these have differing index lengths. Based on the parameters of the question, it's not possible to determine which value of Total Time would match up with a given row in df_2. If there are three AL-4 entries in df_2, do they each get 1000, 500, or some combination? That info would be needed. Without it, this is the best guess at getting all possibilities.
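If instead each df_2 row should receive exactly one value, one option (an assumption about the intended semantics, not something the question specifies) is to deduplicate df_1 on the join keys first:
# keep only the first Total Time seen per State/Month pair (assumed tie-break)
df_1_unique = df_1[df_1_cols_to_use].drop_duplicates(subset=['State', 'Month'])
new_df = pd.merge(df_2, df_1_unique, on=['State', 'Month'], how='left')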
Hi, I am trying to assign certain values to columns of a dataframe.
# Count the number of title counts
full.groupby(['Sex', 'Title']).Title.count()
Sex Title
female Dona 1
Dr 1
Lady 1
Miss 260
Mlle 2
Mme 1
Mrs 197
Ms 2
the Countess 1
male Capt 1
Col 4
Don 1
Dr 7
Jonkheer 1
Major 2
Master 61
Mr 757
Rev 8
Sir 1
Name: Title, dtype: int64
The tail of my dataframe looks as follows:
Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket Title
413 NaN NaN S 8.0500 Spector, Mr. Woolf 0 1305 3 male 0 NaN A.5. 3236 Mr
414 39.0 C105 C 108.9000 Oliva y Ocana, Dona. Fermina 0 1306 1 female 0 NaN PC 17758 Dona
415 38.5 NaN S 7.2500 Saether, Mr. Simon Sivertsen 0 1307 3 male 0 NaN SOTON/O.Q. 3101262 Mr
416 NaN NaN S 8.0500 Ware, Mr. Frederick 0 1308 3 male 0 NaN 359309 Mr
417 NaN NaN C 22.3583 Peter, Master. Michael J 1 1309 3 male 1 NaN 2668 Master
The name of my dataframe is full, and I want to change the names in Title.
Here is the code I wrote:
# Create a variable rare_title to modify the names of Title
rare_title = ['Dona', "Lady", "the Countess", "Capt", "Col", "Don", "Dr", "Major", "Rev", "Sir", "Jonkheer"]
# Also reassign Mlle, Ms, and Mme accordingly
full[full.Title == "Mlle"].Title = "Miss"
full[full.Title == "Ms"].Title = "Miss"
full[full.Title == "Mme"].Title = "Mrs"
full[full.Title.isin(rare_title)].Title = "Rare Title"
I also tried the following code in pandas:
full.loc[full['Title'] == "Mlle", ['Sex', 'Title']] = "Miss"
Still the dataframe is not changed. Any help is appreciated.
Use loc-based indexing and set the matching row values. Your attempts use chained indexing (full[mask].Title = ...), which assigns to a temporary copy, so the original frame is never modified:
miss = ['Mlle', 'Ms']
rare_title = ['Dona', "Lady", ...]
df.loc[df.Title.isin(miss), 'Title'] = 'Miss'
df.loc[df.Title == 'Mme', 'Title'] = 'Mrs'
df.loc[df.Title.isin(rare_title), 'Title'] = 'Rare Title'
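An equivalent single-pass sketch using Series.replace with a mapping dict (same lists as above; Mme is mapped to Mrs per the question):
mapping = {'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'}
mapping.update({t: 'Rare Title' for t in rare_title})
df['Title'] = df['Title'].replace(mapping)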
I'm working with an Airbnb dataset on Kaggle:
https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings
and want to simplify the values in the language column into two groupings: english and non-english.
For instance:
users.language.value_counts()
en 15011
zh 101
fr 99
de 53
es 53
ko 43
ru 21
it 20
ja 19
pt 14
sv 11
no 6
da 5
nl 4
el 2
pl 2
tr 2
cs 1
fi 1
is 1
hu 1
Name: language, dtype: int64
And the result I want is:
users.language.value_counts()
english 15011
non-english 459
Name: language, dtype: int64
This is sort of the solution I want:
def language_groupings():
    for i in users:
        if users.language != 'en':
            replace(users.language.str, 'non-english')
        else:
            replace(users.language.str, 'english')
    return users

users['language'] = users.apply(lambda row: language_groupings)
Except there's obviously something wrong with this as it returns an empty series when I run value_counts on the column.
Try this:
import numpy as np

users.language = np.where(users.language != 'en', 'non-english', 'english')
Is that what you want?
In [181]: x
Out[181]:
val
en 15011
zh 101
fr 99
de 53
es 53
ko 43
ru 21
it 20
ja 19
pt 14
sv 11
no 6
da 5
nl 4
el 2
pl 2
tr 2
cs 1
fi 1
is 1
hu 1
In [182]: x.groupby(x.index == 'en').sum()
Out[182]:
val
False 459
True 15011
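To label the two groups instead of True/False (a small sketch, assuming the same x frame as above):
import numpy as np

x.groupby(np.where(x.index == 'en', 'english', 'non-english'))['val'].sum()
# english        15011
# non-english      459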