Drop duplicate rows in a dataframe based on a particular column - Python

I have a dataframe like the following:
Districtname pincode
0 central delhi 110001
1 central delhi 110002
2 central delhi 110003
3 central delhi 110004
4 central delhi 110005
How can I drop duplicate rows based on the column Districtname and keep only the first occurrence?
The output I want:
Districtname pincode
0 central delhi 110001

Duplicate rows can be dropped using pandas.DataFrame.drop_duplicates(), which defaults to keeping the first occurrence. In your case DataFrame.drop_duplicates(subset="Districtname") should work. If you would like to update the same DataFrame in place, DataFrame.drop_duplicates(subset="Districtname", inplace=True) will do the job. Docs: https://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.DataFrame.drop_duplicates.html

Use drop_duplicates with inplace=True:
df.drop_duplicates('Districtname', inplace=True)
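As a quick, self-contained sketch (rebuilding the sample frame from the question is my assumption, since the original construction isn't shown):
import pandas as pd

df = pd.DataFrame({'Districtname': ['central delhi'] * 5,
                   'pincode': [110001, 110002, 110003, 110004, 110005]})

# keep='first' is the default, so only the first row per district survives
result = df.drop_duplicates(subset='Districtname')
print(result)
#     Districtname  pincode
# 0  central delhi   110001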

Related

Compare different df's row by row and return changes

Every month I collect data that contains details of employees to be stored in our database.
I need to compare the data stored from the previous month with the newly received data and, for each row where any of the columns changed, return that row in a new dataframe.
I would also need to know which columns in each row of this new returned dataframe changed when the comparison happened.
There are also some important details to mention:
Each column can also contain blank values in any of the dataframes;
The dataframes have the same column names but not necessarily the same data type;
The dataframes do not have the same number of rows necessarily;
If a row does not find its Index match, it should not be returned in the new dataframe;
The rows of the dataframes can be matched by a column named "Index"
So, for example, we would have this dataframe (which is just a slice of the real one as it has 63 columns):
df1:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack ax#i.com 01-01-2021
2 HR 7000 O'Donnel ay#i.com
3 MKT $7600 Maria d 30-06-2021
4 I'T 8000 Peter az#i.com 14-07-2021
df2:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack ax#i.com 01-01-2021
2 HR 7000 O'Donnel ay#i.com 01-01-2021
3 MKT 7600 Maria dy#i.com 30-06-2021
4 IT 8000 Peter az#i.com 14-07-2021
5 IT 9000 John NOT PROVIDED
6 IT 9900 John NOT PROVIDED
df3:
Index Department Salary Manager Email Start_Date
2 HR 7000 O'Donnel ay#i.com 01-01-2021
3 MKT 7600 Maria dy#i.com 30-06-2021
4 IT 8000 Peter az#i.com 14-07-2021
The differences in this example are:
Start date added in row of Index 2
Salary format corrected and email corrected for row Index 3
Department format corrected for row Index 4
What would be the best way to do this comparison?
I am not sure there is an easy way to identify what changed in each field, but even returning a dataframe with the rows that had at least one change would be helpful.
Thank you for the support!
I think compare could do the trick: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html
But first you would need to align the rows between old and new dataframe via the index:
new_df_to_compare=new_df.loc[old_df.index]
If the datatypes don't match, you would also need to align them:
new_df_to_compare = new_df_to_compare.astype(old_df.dtypes.to_dict())
Then compare should work just like this:
difference_df = old_df.compare(new_df_to_compare)
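For illustration, a minimal end-to-end sketch under the assumptions above (the small old_df/new_df frames are made up; in practice you would first set the "Index" column as the real index, e.g. df.set_index('Index')):
import pandas as pd

old_df = pd.DataFrame({'Index': [1, 2, 3],
                       'Salary': [6000, 7000, 7600],
                       'Manager': ['Jack', "O'Donnel", 'Maria']}).set_index('Index')
new_df = pd.DataFrame({'Index': [2, 3, 4],
                       'Salary': [7000.0, 7800.0, 8000.0],
                       'Manager': ["O'Donnel", 'Maria', 'Peter']}).set_index('Index')

# keep only rows whose Index exists in both frames (rule 4 in the question)
common = old_df.index.intersection(new_df.index)
new_df_to_compare = new_df.loc[common].astype(old_df.dtypes.to_dict())

# one row per changed record, with self/other sub-columns per changed field
difference_df = old_df.loc[common].compare(new_df_to_compare)
print(difference_df)
#       Salary
#         self  other
# Index
# 3       7600   7800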

How to aggregate rows in a pandas dataframe

I have a dataframe, shown in image 1. It is a sample of pubs in London, UK (3337 pubs/rows), and the geometry is at the LSOA level. In some LSOAs there is more than one pub. I want my dataframe to summarise the number of pubs in every LSOA. I already have the information by using
psdf['lsoa11nm'].value_counts()
prints out:
City of London 001F 103
City of London 001G 40
Westminster 013B 36
Westminster 018A 36
Westminster 013E 30
...
Lambeth 005A 1
Croydon 043C 1
Hackney 002E 1
Merton 022D 1
Bexley 008B 1
Name: lsoa11nm, Length: 1630, dtype: int64
I can't use this as a new dataframe because it is a Series with the LSOA name as the index and a single column of counts, as opposed to two columns where one would be lsoa11nm and the other the pub count.
Does anyone know how to group the dataframe so that there is only one row for every LSOA, saying how many pubs are in it?
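A minimal sketch of one common approach (assuming psdf is the pubs dataframe from the question; the pub_count column name is my choice):
# one row per LSOA, with the key as a regular column instead of the index
pub_counts = (psdf.groupby('lsoa11nm')
                  .size()
                  .reset_index(name='pub_count'))
Equivalently, psdf['lsoa11nm'].value_counts().rename_axis('lsoa11nm').reset_index(name='pub_count') turns the Series you already have into the same two-column frame.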

Pandas duplicated rows with missing values

Hello, I have a dataframe that contains duplicates.
df = pd.DataFrame({'id': [1, 1, 1],
                   'name': ['Hamburg', 'Hamburg', 'Hamburg'],
                   'country': ['Germany', 'Germany', None],
                   'state': [None, None, 'Hamburg']})
Removing the duplicates with df.drop_duplicates() returns:
   id     name  country    state
0   1  Hamburg  Germany     None
2   1  Hamburg     None  Hamburg
How can I configure drop_duplicates such that only one row is left, containing all the information?
In the case where no single row has all the information at once, you can use groupby and first; but first fill None with np.nan so pandas treats it as a missing value:
import numpy as np

print(df.fillna(value=np.nan).groupby('id').first())
name country state
id
1 Hamburg Germany Hamburg
In your very special case, here's my proposal:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                   'name': ['Hamburg', 'Hamburg', 'Hamburg', 'Paris', 'Paris'],
                   'country': ['Germany', 'Germany', None, None, 'France'],
                   'state': [None, None, 'Hamburg', 'Paris', None]})

df_result = pd.DataFrame()
for id_ in df['id'].unique().tolist():
    df_subset = df[df['id'] == id_].copy(deep=True)
    df_subset.sort_values(by=['id', 'name', 'country', 'state'], inplace=True)
    df_subset.bfill(inplace=True)   # fill missing values from the rows below
    df_subset.ffill(inplace=True)   # then from the rows above
    df_subset.drop_duplicates(inplace=True)
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
    df_result = pd.concat([df_result, df_subset])
df = df_result
Out[18]:
id name country state
0 1 Hamburg Germany Hamburg
4 2 Paris France Paris
Subsetting the records per id prevents ffill and bfill from pulling values across adjacent rows that belong to different ids.
Regards
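Note that the groupby/first approach from the first answer also handles the multi-id frame directly, since first() takes the first non-missing value per column within each group — a minimal sketch:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                   'name': ['Hamburg', 'Hamburg', 'Hamburg', 'Paris', 'Paris'],
                   'country': ['Germany', 'Germany', None, None, 'France'],
                   'state': [None, None, 'Hamburg', 'Paris', None]})

print(df.groupby('id').first())
#        name  country    state
# id
# 1   Hamburg  Germany  Hamburg
# 2     Paris   France    Paris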

How to use a pivot table with non-numeric values?

I am working with the pivot function in pandas:
My input table is:
POI_Entity_ID State
ADD_Q319_143936 Rajasthan
Polyline-Kot-2089 New Delhi
Q111267412 Rajasthan
EL_Q113_32573 Rajasthan
RCE_UDZ_10979 New Delhi
I want my output as:
State counts of POI_Entity_ID
Rajasthan 3
New Delhi 2
You can count the number of rows in each group using pandas.Series.value_counts().
In your case it would look something like:
d = {'State': ['Rajasthan', 'New Delhi', 'Rajasthan', 'Rajasthan', 'New Delhi'],
     'POI_Entity_ID': ['ADD_Q319_143936', 'Polyline-Kot-2089', 'Q111267412',
                       'EL_Q113_32573', 'RCE_UDZ_10979']}
df = pd.DataFrame(data=d)
df['State'].value_counts()
The last line produces the following table:
Rajasthan 3
New Delhi 2
You can also use pandas.DataFrame.groupby() combined with the count() method:
df.groupby('State').count()
yields the following table:
POI_Entity_ID
State
New Delhi 2
Rajasthan 3
You can use a pivot table with 'count' as the aggregator function, keeping 'State' as your index.
d = {'POI_Entity_ID': ['ADD_Q319_143936', 'Polyline-Kot-2089', 'Q111267412',
                       'EL_Q113_32573', 'RCE_UDZ_10979'],
     'State': ['Rajasthan', 'New Delhi', 'Rajasthan', 'Rajasthan', 'New Delhi']}
df = pd.DataFrame(data=d)
pivotdf = pd.pivot_table(data=df, index='State', values='POI_Entity_ID', aggfunc='count')
gives you a table like:
POI_Entity_ID
State
New Delhi 2
Rajasthan 3
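If you want exactly the two-column layout from the question (State plus a count column), a small follow-up sketch (the column name passed to reset_index is my choice):
counts = (df.groupby('State')['POI_Entity_ID']
            .count()
            .reset_index(name='counts of POI_Entity_ID'))
print(counts)
#        State  counts of POI_Entity_ID
# 0  New Delhi                        2
# 1  Rajasthan                        3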

Update missing values in a column using pandas

I have a dataframe df with two of the columns being 'city' and 'zip_code':
df = pd.DataFrame({'city': ['Cambridge', 'Washington', 'Miami', 'Cambridge', 'Miami',
                            'Washington'],
                   'zip_code': ['12345', '67891', '23457', '', '', '']})
As shown above, a particular city has a zip code in one of the rows, but the zip_code is missing for the same city in some other row. I want to fill those missing values based on the zip_code value of that city in another row. Basically, wherever there is a missing zip_code, it checks the zip_code for that city in the other rows and, if found, fills in that value. If not found, it fills 'NA'.
How do I accomplish this task using pandas?
You can go for:
import numpy as np

# fillna(method='ffill'/'bfill') is deprecated, so call ffill()/bfill() directly
df['zip_code'] = df.replace('', np.nan).groupby('city')['zip_code'].ffill().bfill()
>>> df
city zip_code
0 Cambridge 12345
1 Washington 67891
2 Miami 23457
3 Cambridge 12345
4 Miami 23457
5 Washington 67891
You can check the string length using str.len; for those rows, filter the main df to the rows with valid zip_codes, set the index to 'city', and call map on the 'city' column, which will perform the lookup and fill those values:
In [255]:
df.loc[df['zip_code'].str.len() == 0, 'zip_code'] = df['city'].map(df[df['zip_code'].str.len() == 5].set_index('city')['zip_code'])
df
Out[255]:
city zip_code
0 Cambridge 12345
1 Washington 67891
2 Miami 23457
3 Cambridge 12345
4 Miami 23457
5 Washington 67891
If your real data has lots of repeating values then you'll need to additionally call drop_duplicates first:
df.loc[df['zip_code'].str.len() == 0, 'zip_code'] = df['city'].map(df[df['zip_code'].str.len() == 5].drop_duplicates(subset='city').set_index('city')['zip_code'])
The reason you need to do this is that map will raise an error if there are duplicate index entries.
My suggestion would be to first create a dictionary that maps from the city to the zip code. You can build this dictionary from the rows of the DataFrame that already have a zip code, and then use it to fill in all the missing zip code values.
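A minimal sketch of that dictionary approach (assuming the df built in the question, where missing zip codes are empty strings; the 'NA' fallback follows the question's requirement):
# build a city -> zip_code mapping from the rows that do have a code
known = df[df['zip_code'] != ''].drop_duplicates(subset='city')
city_to_zip = dict(zip(known['city'], known['zip_code']))

# fill blanks from the dictionary; cities with no known code get 'NA'
df['zip_code'] = [z if z else city_to_zip.get(c, 'NA')
                  for c, z in zip(df['city'], df['zip_code'])]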
