pandas: replace NaNs with the mode of another column based on a second column - python

I have a pandas dataframe with two columns, city and country. Both city and country contain missing values. Consider this data frame:
import numpy as np
import pandas as pd

temp = pd.DataFrame({"country": ["country A", "country A", "country A", "country A", "country B", "country B", "country B", "country B", "country C", "country C", "country C", "country C"],
                     "city": ["city 1", "city 2", np.nan, "city 2", "city 3", "city 3", np.nan, "city 4", "city 5", np.nan, np.nan, "city 6"]})
I now want to fill the NaNs in the city column with the mode of that country's cities in the remaining data frame, e.g. for country A: city 1 is mentioned once and city 2 twice, so fill the city column at index 2 with city 2, and so on.
I have done
cities = [city for city in temp["country"].value_counts().index]
modes = temp.groupby(["country"]).agg(pd.Series.mode)
dict_locations = modes.to_dict(orient="index")

new_dict_locations = {}
for k in dict_locations.keys():
    new_dict_locations[k] = dict_locations[k]["city"]
Now, having each country and its corresponding city mode, I face two issues:
First: country C is bimodal - its key maps to two entries. I want the key to refer to each entry with equal probability. The real data set has multiple modes, so in general this would be a list of len > 2.
Second: I'm stuck replacing the NaNs in city with the value corresponding to the same row's country in new_dict_locations. In pseudo-code: go through the column 'city'; if you find a missing value at position temp[i, city], take the value of 'country' in that row (-> 'country_tmp'); use 'country_tmp' as the key into the dictionary 'new_dict_locations'; if the entry at key 'country_tmp' is a list, randomly select one item from it; take the resulting value (-> 'city_tmp') and fill the missing cell temp[i, city] with 'city_tmp'.
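In Python, that pseudo-code would look roughly like the following sketch (assuming new_dict_locations maps each country to either a single city or an array of modal cities):
for i in temp.index:
    if pd.isna(temp.at[i, "city"]):
        country_tmp = temp.at[i, "country"]           # the row's country
        city_tmp = new_dict_locations[country_tmp]    # its mode(s)
        if isinstance(city_tmp, (list, np.ndarray)):  # several modes?
            city_tmp = np.random.choice(city_tmp)     # pick one at random
        temp.at[i, "city"] = city_tmp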
I've tried different combinations of .fillna() and .replace(), and read this and other questions, to no avail.* Can someone give me a pointer?
Many thanks in advance.
(Note: the referenced question replaces values in one cell according to a dict; my reference values are, however, in a different column.)
** EDIT **
Executing temp["city"].fillna(temp['country'], inplace=True) and temp.replace({'city': dict_locations}) gives me an error: TypeError: unhashable type: 'dict'. [For the original data set the error is TypeError: unhashable type: 'numpy.ndarray', but I cannot reproduce it with an example - if someone knows where the difference comes from, I'd be happy to hear their thoughts.]

Try map with the dict new_dict_locations to create a new series s, then map again on s with np.random.choice to pick a value from each array. Finally, use s to fillna:
s = (temp.country.map(new_dict_locations)
         .map(lambda x: np.random.choice(x) if isinstance(x, np.ndarray) else x))
temp['city'] = temp.city.fillna(s)
Out[247]:
      country    city
0   country A  city 1
1   country A  city 2
2   country A  city 2
3   country A  city 2
4   country B  city 3
5   country B  city 3
6   country B  city 3
7   country B  city 4
8   country C  city 5
9   country C  city 6
10  country C  city 5
11  country C  city 6
Note: I thought the two maps might be combined into one by using a dict comprehension. However, doing so loses the randomness.
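To illustrate why (a sketch, not a replacement): a dict comprehension calls np.random.choice once per country, so every NaN row of a bimodal country would receive the same fixed city instead of an independent draw per row:
# one draw per country key - the randomness is frozen into the dict
fixed = {k: (np.random.choice(v) if isinstance(v, np.ndarray) else v)
         for k, v in new_dict_locations.items()}
s = temp.country.map(fixed)  # every NaN of 'country C' now gets the same city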

def get_mode(d):
    # where a country has several modes, pick one uniformly at random
    # (np.random.choice is uniform by default; an explicit p=[0.5, ...]
    # would only be valid for exactly two modes)
    for k, v in d.items():
        if isinstance(v, np.ndarray) and len(v) > 1:
            d[k] = np.random.choice(v)
    return d
The dictionary below is the one that will be used for filling:
new_dict_locations=get_mode(new_dict_locations)
keys=list(new_dict_locations.keys())
values=list(new_dict_locations.values())
# Filling happens here
temp.city=temp.city.fillna(temp.country).replace(keys, values)
This will give the desired output:
      country    city
0   country A  city 1
1   country A  city 2
2   country A  city 2
3   country A  city 2
4   country B  city 3
5   country B  city 3
6   country B  city 3
7   country B  city 4
8   country C  city 5
9   country C  city 5
10  country C  city 5
11  country C  city 6

Related

LEFT ON Case When in Pandas

In SQL I can do something like JOIN ON CASE WHEN; is there a way to do this in Pandas?
disease = [
{"City":"CH","Case_Recorded":5300,"Recovered":2839,"Deaths":2461},
{"City":"NY","Case_Recorded":1311,"Recovered":521,"Deaths":790},
{"City":"TX","Case_Recorded":1991,"Recovered":1413,"Deaths":578},
{"City":"AT","Case_Recorded":3381,"Recovered":3112,"Deaths":269},
{"City":"TX","Case_Recorded":3991,"Recovered":2810,"Deaths":1311},
{"City":"LA","Case_Recorded":2647,"Recovered":2344,"Deaths":303},
{"City":"LA","Case_Recorded":4410,"Recovered":3344,"Deaths":1066}
]
region = {"North": ["AT"], "West":["TX","LA"]}
So what I have are two dummy dicts that I have already converted to dataframes. The first holds the city names with the case counts, and I'm trying to figure out which region each city belongs to:
Region|City
North|AT
West|TX
West|LA
None|NY
None|CH
What I thought of in SQL was LEFT JOIN ON CASE WHEN: if the result is null when joining with the North region, then join with the West region. But if there are 15 or 30 regions in some country, I think that would be a problem.
Use:
#get City without duplicates
df1 = pd.DataFrame(disease)[['City']].drop_duplicates()
#create DataFrame from region dictionary
region = {"North": ["AT"], "West":["TX","LA"]}
df2 = pd.DataFrame([(k, x) for k, v in region.items() for x in v],
                   columns=['Region','City'])
#append not matched cities to df2
out = pd.concat([df2, df1[~df1['City'].isin(df2['City'])]])
print (out)
  Region City
0  North   AT
1   West   TX
2   West   LA
0    NaN   CH
1    NaN   NY
If order is not important:
out = df2.merge(df1, how = 'right')
print (out)
  Region City
0    NaN   CH
1    NaN   NY
2   West   TX
3  North   AT
4   West   LA
I'm sorry, I'm not exactly sure what your expected result is - can you elaborate? If your expected result is just getting each city's region, there is no need for conditional joining. For example, you can transform the city-region table into one row per city and region and directly join it with the main df:
disease = [
{"City":"CH","Case_Recorded":5300,"Recovered":2839,"Deaths":2461},
{"City":"NY","Case_Recorded":1311,"Recovered":521,"Deaths":790},
{"City":"TX","Case_Recorded":1991,"Recovered":1413,"Deaths":578},
{"City":"AT","Case_Recorded":3381,"Recovered":3112,"Deaths":269},
{"City":"TX","Case_Recorded":3991,"Recovered":2810,"Deaths":1311},
{"City":"LA","Case_Recorded":2647,"Recovered":2344,"Deaths":303},
{"City":"LA","Case_Recorded":4410,"Recovered":3344,"Deaths":1066}
]
region = [
{'City':'AT','Region':"North"},
{'City':'TX','Region':"West"},
{'City':'LA','Region':"West"}
]
df = pd.DataFrame(disease)
df_reg = pd.DataFrame(region)
df.merge(df_reg, on='City', how='left')

Get the index of a particular element (a sentence string) in a DataFrame column, subject to a condition

I have a table with numbers in the column 'Para'. I have to find the index of a particular sentence in the column 'Country_Title', such that the value in column 'Para' is 2.
Main DataFrame 'df_countries' is shown below:
Index | Sequence | Para | Country_Title
0     | 5        | 4    | India is seventh largest country
1     | 6        | 6    | Australia is a continent country
2     | 7        | 2    | Canada is the 2nd largest country
3     | 9        | 3    | UAE is a country in Western Asia
4     | 10       | 2    | China is a country in East Asia
5     | 11       | 1    | Germany is in Central Europe
6     | 13       | 2    | Russia is the largest country
7     | 14       | 3    | Capital city of China is Beijing
Suppose my keyword is China, and I want to get the index of the sentence containing 'China', but only the one where 'Para' = 2.
Consider the rows at index 4 and 7: both mention China in Country_Title, but I want the index of the one with 'Para' = 2, i.e., the result must be index = 4.
My Approach:
I derived another DataFrame 'df_para2_countries' from the above table, as shown below:
Index | Para | Country_Title
2     | 2    | Canada is the 2nd largest country
4     | 2    | China is a country in East Asia
6     | 2    | Russia is the largest country
Now I store the country title as:
c = list(df_para2_countries['Country_Title'])
I used a for loop to go through the elements of 'c' and find the index of a particular country in the table 'df_countries':
for i in c:
    if 'China' in i:
        print(i)
        ind = df_para2_countries.loc[df_para2_countries['Country_Title'] = i]
        print(ind)
The line assigning 'ind' gives an error. I want to get the index, but this doesn't work.
Please suggest how I can approach this.
You need two equals signs (==) in your condition.
If you need only the 'index', that is, the value from your first column called Index, you can turn the index of the rows returned by .loc[] into a list and take its first value, for instance:
ind = df_para2_countries.loc[df_para2_countries['Country_Title'] == i].index.to_list()[0]
Hope it works :)
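As an alternative (a sketch assuming df_countries is the original frame from the question), the loop can be avoided entirely by filtering on both conditions at once:
# rows where Para == 2 and the sentence mentions the keyword
mask = (df_countries['Para'] == 2) & \
       df_countries['Country_Title'].str.contains('China')
ind = df_countries.index[mask].to_list()[0]  # -> 4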

Get Max Sum value of a column in Pandas

I have a csv like this:
Country    Values  Address
USA        1       AnyAddress
USA        2       AnyAddress
Brazil     1       AnyAddress
UK         3       AnyAddress
Australia  0       AnyAddress
Australia  0       AnyAddress
I need to group the data by Country and sum Values, then return a string with the country and the maximum summed value - in this case USA, which is lexicographically greater than UK. The output should look like this:
"Country: USA, Value: 3"
When I use groupby in pandas I am not able to build the string with the country name and value. How can I do that?
try:
max_values = df.groupby('Country').sum().reset_index().max().values
your_string = f"Country: {max_values[0]}, Value: {max_values[1]}"
Output:
>>> print(your_string)
Country: USA, Value: 3
You can do:
df.groupby("Country", as_index=False)["Values"].sum()\
.sort_values(["Values", "Country"], ascending=False).iloc[0]
Outputs:
Country USA
Values 3
Name: 3, dtype: object
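A variant of the same idea (a sketch against the question's data): sum per country, sort the index in descending order so a tie resolves to the lexicographically greatest country, then read off idxmax:
s = df.groupby('Country')['Values'].sum().sort_index(ascending=False)
your_string = f"Country: {s.idxmax()}, Value: {s.max()}"
# -> "Country: USA, Value: 3"  (USA wins the tie with UK)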

Speeding up Pandas .apply function. Counting rows and doing operation with them

I have a huge database of migratory movements, and I wrote some scripts to get useful info from it, but they are really, really slow. I'm not a professional coder, as you will see, and I was wondering how to make this data gathering more efficient.
To start, the initial CSV database is structured as follows:
1 row = 1 person
Age Sex City_start City_destination ...
Person 1
Person 2
.....
The final database structure:
Balance_2004 Balance_2005 ....
City1
City2
....
To calculate this balance per city and year, I created a function that filters the initial database to count how many rows have a specific city in City_destination (INs) and how many in City_start (OUTs), then a simple subtraction gives the balance as INs - OUTs:
# idb = initial database
# city = pre-existing in final database
def get_balance(city, df):
    ins = df.City_start[df.City_start == city].count()
    outs = df.City_destination[df.City_destination == city].count()
    balance = ins - outs
    return balance
Then with this function I used pandas apply to populate the final database as:
# fdb = final database
fdb['Balance_2004'] = idb['City_start'].apply(get_balance, df=idb)
This works well; the end result is what I need. In total I'm using 42 apply calls to get more specific data (balance per sex, per age group, ...), but to give an idea of how slow this is: I started the script (with all 42 functions) 45 minutes ago and it is still running.
Is there any way to do this in a less time-consuming way?
Thanks in advance
It might make sense to do this calculation only once, by grouping by the cities:
def get_balance_all_cities(df):
    df_diff = pd.DataFrame([df.groupby(["City_start"])["Name"].count(),
                            df.groupby(["City_destination"])["Name"].count()]).T
    df_diff.columns = "start", "end"
    df_diff[df_diff.isna()] = 0
    return df_diff.start - df_diff.end
Here are some examples for how it works:
>>> df = pd.DataFrame([("Person 1", "Chicago", "Chicago"), ("Person 2", "New York", "Chicago"), ("Person 3", "Houston", "New York")], columns=["Name", "City_start", "City_destination"])
>>> df
Name City_start City_destination
0 Person 1 Chicago Chicago
1 Person 2 New York Chicago
2 Person 3 Houston New York
>>> ins = df.groupby(["City_start"])["Name"].count()
City_start
Chicago 1
Houston 1
New York 1
Name: Name, dtype: int64
>>> outs = df.groupby(["City_destination"])["Name"].count()
City_destination
Chicago 2
New York 1
Name: Name, dtype: int64
>>> df_diff = pd.DataFrame([ins, outs]).T
>>> df_diff.columns = "start", "end"
>>> df_diff[df_diff.isna()] = 0
>>> balance = df_diff.start - df_diff.end
Chicago -1.0
Houston 1.0
New York 0.0
dtype: float64
The work-around at the end handles cities that appear in only one of the two columns (no one moves in, or no one moves out).
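A compact variant of the same computation (a sketch under the same column names) uses value_counts and a fill_value-aware subtraction, which makes the manual NaN fix-up unnecessary:
ins = df['City_start'].value_counts()
outs = df['City_destination'].value_counts()
# sub aligns the two Series on their city index; fill_value=0 covers
# cities that appear in only one of the columns
balance = ins.sub(outs, fill_value=0)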
I believe you need to aggregate by city and year with DataFrameGroupBy.size, reshape with unstack, then subtract with sub and, if necessary, convert to integers:
idb = pd.DataFrame([("a", "Chicago", "Chicago", 2018),
                    ("b", "New York", "Chicago", 2018),
                    ("c", "New York", "Chicago", 2017),
                    ("d", "Houston", "LA", 2018)],
                   columns=["Name", "City_start", "City_destination", 'year'])
print (idb)
Name City_start City_destination year
0 a Chicago Chicago 2018
1 b New York Chicago 2018
2 c New York Chicago 2017
3 d Houston LA 2018
a1 = idb.groupby(["City_start", 'year']).size().unstack(fill_value=0)
a2 = idb.groupby(["City_destination", 'year']).size().unstack(fill_value=0)
idb = a1.sub(a2, fill_value=0).astype(int).add_prefix('Balance_')
print (idb)
year Balance_2017 Balance_2018
Chicago -1 -1
Houston 0 1
LA 0 -1
New York 1 1

Update missing values in a column using pandas

I have a dataframe df with two of the columns being 'city' and 'zip_code':
df = pd.DataFrame({'city': ['Cambridge','Washington','Miami','Cambridge','Miami','Washington'],
                   'zip_code': ['12345','67891','23457','','','']})
As shown above, a particular city has a zip code in one row, but the zip code is missing for the same city in some other row. I want to fill those missing values based on the zip_code of that city in another row. Basically, wherever there is a missing zip_code, check the zip_code for that city in other rows and, if found, fill in the value; if not found, fill in 'NA'.
How do I accomplish this task using pandas?
You can go for:
import numpy as np
df['zip_code'] = df.replace(r'', np.nan).groupby('city')['zip_code'].fillna(method='ffill').fillna(method='bfill')
>>> df
city zip_code
0 Cambridge 12345
1 Washington 67891
2 Miami 23457
3 Cambridge 12345
4 Miami 23457
5 Washington 67891
You can check the string length using str.len and, for those rows, filter the main df to the rows with valid zip codes, set the index to those, and call map on the 'city' column, which performs the lookup and fills the values:
In [255]:
df.loc[df['zip_code'].str.len() == 0, 'zip_code'] = df['city'].map(df[df['zip_code'].str.len() == 5].set_index('city')['zip_code'])
df
Out[255]:
city zip_code
0 Cambridge 12345
1 Washington 67891
2 Miami 23457
3 Cambridge 12345
4 Miami 23457
5 Washington 67891
If your real data has lots of repeating values then you'll need to additionally call drop_duplicates first:
df.loc[df['zip_code'].str.len() == 0, 'zip_code'] = df['city'].map(df[df['zip_code'].str.len() == 5].drop_duplicates(subset='city').set_index('city')['zip_code'])
The reason you need to do this is that it will raise an error if there are duplicate index entries.
My suggestion would be to first create a dictionary that maps each city to its zip code. You can create this dictionary from the DataFrame, and then use it to fill in all the missing zip code values.
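A minimal sketch of that suggestion, assuming df is the frame from the question (empty strings mark the missing zip codes):
import numpy as np

# build the city -> zip_code lookup from rows that already have a zip code
known = df[df['zip_code'] != '']
city_to_zip = dict(zip(known['city'], known['zip_code']))

# replace empty strings with NaN, then fill via the lookup;
# cities never seen with a zip code stay NaN
df['zip_code'] = df['zip_code'].replace('', np.nan)
df['zip_code'] = df['zip_code'].fillna(df['city'].map(city_to_zip))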
