Update missing values in a column using pandas - python

I have a dataframe df with two of the columns being 'city' and 'zip_code':
import pandas as pd
df = pd.DataFrame({'city': ['Cambridge', 'Washington', 'Miami',
                            'Cambridge', 'Miami', 'Washington'],
                   'zip_code': ['12345', '67891', '23457', '', '', '']})
As shown above, a city has its zip code in one of the rows, but the zip_code is missing for the same city in other rows. I want to fill those missing values based on the zip_code of that city in another row. Basically, wherever there is a missing zip_code, it checks the zip_code for that city in other rows, and if found, fills in the value. If not found, it fills 'NA'.
How do I accomplish this task using pandas?

You can go for:
import numpy as np
# Replace empty strings with NaN, then fill within each city group
# (transform keeps both fills inside the group, so values never leak
# across cities):
df['zip_code'] = (df.replace('', np.nan)
                    .groupby('city')['zip_code']
                    .transform(lambda s: s.ffill().bfill()))
>>> df
city zip_code
0 Cambridge 12345
1 Washington 67891
2 Miami 23457
3 Cambridge 12345
4 Miami 23457
5 Washington 67891

You can check the string length using str.len to find the rows with missing zip codes. Then filter the main df to the rows with valid zip_codes, set the index to 'city', and call map on the 'city' column, which performs the lookup and fills in those values:
In [255]:
df.loc[df['zip_code'].str.len() == 0, 'zip_code'] = df['city'].map(df[df['zip_code'].str.len() == 5].set_index('city')['zip_code'])
df
Out[255]:
city zip_code
0 Cambridge 12345
1 Washington 67891
2 Miami 23457
3 Cambridge 12345
4 Miami 23457
5 Washington 67891
If your real data has lots of repeating values then you'll need to call drop_duplicates first:
df.loc[df['zip_code'].str.len() == 0, 'zip_code'] = df['city'].map(df[df['zip_code'].str.len() == 5].drop_duplicates(subset='city').set_index('city')['zip_code'])
This is necessary because map raises an error when the lookup Series has duplicate index entries.
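A small runnable illustration (not from the original answer) of why duplicate index entries are a problem:
import pandas as pd

dup = pd.DataFrame({'city': ['Cambridge', 'Cambridge'],
                    'zip_code': ['12345', '12345']})
lookup = dup.set_index('city')['zip_code']  # index now has duplicates
try:
    pd.Series(['Cambridge']).map(lookup)
except Exception as e:  # InvalidIndexError in recent pandas versions
    print(type(e).__name__, e)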

My suggestion would be to first create a dictionary that maps from the city to the zip code; you can build this dictionary from the rows of the DataFrame that already have a zip code. Then use that dictionary to fill in all missing zip code values, as in the sketch below.
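A minimal sketch of that approach, assuming the df from the question (city_to_zip is a name introduced here for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'city': ['Cambridge', 'Washington', 'Miami',
                            'Cambridge', 'Miami', 'Washington'],
                   'zip_code': ['12345', '67891', '23457', '', '', '']})

# Build the city -> zip_code mapping from the rows that have a value
city_to_zip = df[df['zip_code'] != ''].set_index('city')['zip_code'].to_dict()

# Fill the blanks from the dictionary; a city never seen with a zip code
# stays NaN (append .fillna('NA') if you want the literal string 'NA')
df['zip_code'] = df['zip_code'].replace('', np.nan)
df['zip_code'] = df['zip_code'].fillna(df['city'].map(city_to_zip))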

Related

Split a column in Python pandas

I'm sorry if I can't explain the issue properly, since I don't really understand it that much myself. I'm starting to learn Python, and to practice I try to redo projects from my day-to-day job in Python. Right now I'm stuck on a project and would like some help or guidance. I have a dataframe that looks like this:
Index Country Name IDs
0 USA John PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39
--------------------------------------------
1 UK Jane PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40
(I apologize for not creating a proper table in this post, since the separator of the ids is a |.) But you get the idea: every person has 4 IDs, and they are all in the same "cell" of the dataframe, each ID separated from its value by a pipe. I need to split those IDs from their values and put them in separate columns, so I get something like this:
index  Country  Name  PERSID  SSO      STARTDATE  WAVE
0      USA      John  12345   John123  20210101   WAVE39
1      UK       Jane  25478   Jane123  20210101   WAVE40
Now, adding to the complexity of the table itself, I have another issue: the order of the IDs won't be the same for everyone, and some people will be missing some of the IDs.
I honestly have no idea where to begin. The first thing I thought of was to split the IDs column by spaces and then split the result of that by pipes, create a dictionary, convert it to a dataframe, and then join it to my original dataframe using the index.
But as I said, my knowledge of Python is quite pathetic, so that failed catastrophically. I only got to the first step of that plan with Client_ids = df.IDs.str.split(), which returns a series with the IDs separated from each other, like ['PERSID|12345', 'SSO|John123', 'STARTDATE|20210101', 'WAVE|Wave39'], but I can't find a way to split it again because I keep getting an error saying that the list object doesn't have attribute 'split'.
How should I approach this? What alternatives do I have?
Thank you in advance for any help or recommendation.
You have a few options to consider to do this. Here's how I would do it.
I will split the values in IDs by \n and |, then build a dictionary with a key:value pair for each | split, then join it back to the dataframe and drop the IDs and temp columns.
import pandas as pd

df = pd.DataFrame([
    ["USA", "John", """PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
    ["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],
    ["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])
print("Original DataFrame:")
print(df)

# Split on newlines and pipes, then pair alternating tokens as key:value
df['temp'] = df['IDs'].str.split(r'\n|\|', regex=True).apply(lambda x: {k: v for k, v in zip(x[::2], x[1::2])})
# Expand the dicts into columns, aligned on the original index
df = df.join(pd.DataFrame(df['temp'].values.tolist(), index=df.index))
df = df.drop(columns=['IDs', 'temp'])
print("Updated DataFrame:")
print(df)
With this approach, it does not matter if a row is missing some of the IDs; the dictionary keys line everything up, and absent keys simply come through as NaN.
The output of this will be:
Original DataFrame:
Country Name IDs
0 USA John PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39
1 UK Jane PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40
2 CA Jill PERSID|12345
STARTDATE|20210201
WAVE|WAVE41
Updated DataFrame:
Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40
2 CA Jill 12345 NaN 20210201 WAVE41
Note that Jill did not have a SSO value. It set the value to NaN by default.
First, generate your dataframe:
df1 = pd.DataFrame([
    ["USA", "John", """PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
    ["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""]], columns=['Country', 'Name', 'IDs'])
Then split the last column using a lambda (applied to df1, not df):
df2 = pd.DataFrame(list(df1.apply(lambda r: {p: q for p, q in [x.split("|") for x in r.IDs.split()]}, axis=1).values))
Lastly concat the dataframes together.
df = pd.concat([df1, df2], axis=1)
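For the two sample rows above, df then gains the PERSID, SSO, STARTDATE and WAVE columns alongside the originals; note that the IDs column is still present, so drop it afterwards (df = df.drop(columns=['IDs'])) if you no longer need it.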
Quick solution (positional: it assumes every row has all four IDs, in this fixed order):
remove_word = ["PERSID", "SSO", "STARTDATE", "WAVE"]
# Strip the 'KEY|' prefixes, then pick each value by its position
pattern = r'(?:' + '|'.join(remove_word) + r')\|'
for i, col in enumerate(remove_word):
    df[col] = df.IDs.str.replace(pattern, '', regex=True).str.split('\n').str[i]
Use regex named capture groups with pandas.Series.str.extract:
def ng(x):
    # one optional '(?:FIELD|(?P<FIELD>value))?' chunk per field; the value runs to end of line
    return rf'(?:{x}\|(?P<{x}>[^\n]+))?\n?'

fields = ['PERSID', 'SSO', 'STARTDATE', 'WAVE']
pat = ''.join(map(ng, fields))
df.drop('IDs', axis=1).join(df['IDs'].str.extract(pat))
Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40
2 CA Jill 12345 NaN 20210201 WAVE41
Setup
Credit to @JoeFerndz for the sample df.
NOTE: this sample has missing values in some 'IDs'.
df = pd.DataFrame([
    ["USA", "John", """PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
    ["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],
    ["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])

Pandas duplicated rows with missing values

Hello, I have a dataframe that contains duplicates:
df = pd.DataFrame({'id': [1, 1, 1],
                   'name': ['Hamburg', 'Hamburg', 'Hamburg'],
                   'country': ['Germany', 'Germany', None],
                   'state': [None, None, 'Hamburg']})
Removing the duplicates with df.drop_duplicates() returns:
   id     name  country    state
0   1  Hamburg  Germany     None
2   1  Hamburg     None  Hamburg
How can I configure drop_duplicates such that only one row is left that contains all the information?
In case no single row has all the information at once, you can use groupby and first, but first fill None with np.nan so every hole is treated as missing (first takes the first non-null value per group):
import numpy as np
print(df.fillna(value=np.nan).groupby('id').first())
name country state
id
1 Hamburg Germany Hamburg
In your very special case, here's my proposal:
import pandas

df = pandas.DataFrame({'id': [1, 1, 1, 2, 2],
                       'name': ['Hamburg', 'Hamburg', 'Hamburg', 'Paris', 'Paris'],
                       'country': ['Germany', 'Germany', None, None, 'France'],
                       'state': [None, None, 'Hamburg', 'Paris', None]})

df_result = pandas.DataFrame()
for uid in df['id'].unique().tolist():
    # Work on one id at a time so the fills cannot cross id boundaries
    df_subset = df[df['id'] == uid].copy(deep=True)
    df_subset.sort_values(by=['id', 'name', 'country', 'state'], inplace=True)
    df_subset.bfill(inplace=True)
    df_subset.ffill(inplace=True)
    df_subset.drop_duplicates(inplace=True)
    # DataFrame.append was removed in pandas 2.0; use concat instead
    df_result = pandas.concat([df_result, df_subset])
df = df_result
Out[18]:
id name country state
0 1 Hamburg Germany Hamburg
4 2 Paris France Paris
Subsetting the records by id prevents ffill and bfill from pulling values across adjacent rows that belong to different ids; a loop-free sketch of the same idea follows below.
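A loop-free variant (my own sketch, not part of the original answer), grouping by id so the fills stay inside each id group:
import pandas as pd

filled = (df.sort_values(['id', 'name', 'country', 'state'])
            .groupby('id', group_keys=False)
            .apply(lambda g: g.bfill().ffill())  # fill only within each id group
            .drop_duplicates())
print(filled)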

Drop duplicate rows in a dataframe of particular column

I have a dataframe like the following:
Districtname pincode
0 central delhi 110001
1 central delhi 110002
2 central delhi 110003
3 central delhi 110004
4 central delhi 110005
How can I drop rows based on the column Districtname and keep only the first unique value?
The output I want:
Districtname pincode
0 central delhi 110001
Duplicate rows can be dropped using pandas.DataFrame.drop_duplicates(), which defaults to keeping the first occurrence. In your case DataFrame.drop_duplicates(subset='Districtname') should work. If you would like to update the same DataFrame in place, DataFrame.drop_duplicates(subset='Districtname', inplace=True) will do the job. Docs: https://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.DataFrame.drop_duplicates.html
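A minimal runnable sketch on the sample data from the question:
import pandas as pd

df = pd.DataFrame({'Districtname': ['central delhi'] * 5,
                   'pincode': [110001, 110002, 110003, 110004, 110005]})
print(df.drop_duplicates(subset='Districtname'))
#     Districtname  pincode
# 0  central delhi   110001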
Use drop_duplicates with inplace=True:
df.drop_duplicates('Districtname',inplace=True)

Create a new dataframe column by comparing two other columns in different dataframes [duplicate]

This question already has answers here:
Mapping columns from one dataframe to another to create a new column [duplicate]
(2 answers)
Closed 4 years ago.
I have a DataFrame which contains Alpha 2 country codes (UK, ES, SL etc) and I need these to be the country names. I created a second data frame that has all the Alpha 2 country codes in one column and the corresponding names in another.
I'm trying to compare these two columns and then use the index to create the new column. However, I am struggling to do this without using a loop, and I feel like there must be a more efficient way than looping.
I have tried using a for loop, iterating over:
cube_data = pd.DataFrame({'Country Code':['UK','ES','SL']})
alpha2 = pd.DataFrame({'Code':['ES','GH','UK','SL'],
'Name':['Spain','Ghana','United Kingdom','Sierra Leone']})
cube_data
Country Code
0 UK
1 ES
2 SL
alpha2
Code Name
0 ES Spain
1 GH Ghana
2 UK United Kingdom
3 SL Sierra Leone
I have used a for loop to iterate through the columns: when the code from cube_data is found in alpha2['Code'], the index is used to build a new series that has alpha2['Name'] at the position corresponding to cube_data.
end result is:
cube_data
Country Code Name
0 UK United Kingdom
1 ES Spain
2 SL Sierra Leone
Surely there is a better way to do this without looping? I have had a look at series.isin() and series.map(), but these do not seem to provide the result I need.
Can this be done without a loop?
You can use pandas merge:
df = alpha2.merge(cube_data, left_on='Code', right_on='Country Code', how='inner').drop('Code', axis=1)
merge works like an SQL join: here we merge alpha2 with cube_data. We use the column 'Code' from alpha2 and 'Country Code' from cube_data to join the two dataframes, with 'inner' logic, meaning that only values present in both dataframes are kept. Finally, we drop the column 'Code' from alpha2, which contains the same values as 'Country Code'.
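For the sample frames above, this should yield something like (row order can vary by pandas version):
             Name Country Code
0           Spain           ES
1  United Kingdom           UK
2    Sierra Leone           SL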
Use map after converting alpha2 to a mappable object.
First we make our map:
>> country_map = alpha2.set_index('Code')['Name'].to_dict()
>> # country_map = dict(alpha2[['Code', 'Name']].values)
>> # country_map = alpha2.set_index('Code')['Name']
>> print(country_map)
{'ES': 'Spain', 'UK': 'United Kingdom', 'GH': 'Ghana', 'SL': 'Sierra Leone'}
Then we map it on the Country Code column:
>> cube_data['Country'] = cube_data['Country Code'].map(country_map)
>> print(cube_data)
Country Code Country
0 UK United Kingdom
1 ES Spain
2 SL Sierra Leone
Have you looked into the pycountry module?
I've changed your 'UK' alpha_2 to 'GB'.
import pandas as pd
import pycountry
cube_data = pd.DataFrame({'Country Code':['GB','ES','SL']})
for alpha2_code in cube_data['Country Code']:
    c = pycountry.countries.get(alpha_2=alpha2_code)
    print(c.name)
output:
United Kingdom
Spain
Sierra Leone
Using a lambda to create the new column:
df = cube_data
df['Name'] = df['Country Code'].apply(lambda x: pycountry.countries.get(alpha_2=x).name)
print(df)
output:
  Country Code            Name
0           GB  United Kingdom
1           ES           Spain
2           SL    Sierra Leone

Fill pandas dataframe rows from values in another dataframe rows

I have two pandas dataframes as given below:
df1
Name City Postal_Code State
James Phoenix 85003 AZ
John Scottsdale 85259 AZ
Jeff Phoenix 85003 AZ
Jane Scottsdale 85259 AZ
df2
Postal_Code Income Category
85003 41038 Two
85259 104631 Four
I would like to insert two columns, Income and Category, to df1 by capturing the values for Income and Category from df2 corresponding to the postal_code for each row in df1.
The closest question that I could find in SO was this - Fill DataFrame row values based on another dataframe row's values pandas. But, the pd.merge solution does not solve the problem for me. Specifically, I used
pd.merge(df1,df2,on='postal_code',how='outer')
All I got was nan values in the two new columns. Not sure whether this is because the number of rows in df1 and df2 differ. Any suggestions to solve this problem?
You just have the wrong how; use 'inner' instead, which matches only where keys exist in both dataframes. Casting both key columns to the same dtype first also matters, since a string postal code will never match an integer one:
df1.Postal_Code = df1.Postal_Code.astype(int)
df2.Postal_Code = df2.Postal_Code.astype(int)
df1.merge(df2,on='Postal_Code',how='inner')
Name City Postal_Code State Income Category
0 James Phoenix 85003 AZ 41038 Two
1 Jeff Phoenix 85003 AZ 41038 Two
2 John Scottsdale 85259 AZ 104631 Four
3 Jane Scottsdale 85259 AZ 104631 Four
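As an aside, a small sketch (my own illustration, not from the answer) of why the dtype cast matters: the string '85003' never matches the integer 85003. Older pandas versions silently produced all-NaN columns on such a merge; recent versions raise a ValueError instead.
import pandas as pd

a = pd.DataFrame({'Postal_Code': ['85003'], 'Name': ['James']})
b = pd.DataFrame({'Postal_Code': [85003], 'Income': [41038]})
try:
    print(a.merge(b, on='Postal_Code', how='outer'))
except ValueError as e:
    print('merge failed:', e)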
