How to remove duplicate rows based on partial strings in Python - python

If I have a dataframe as follows, in which 01 and 02, 03 and 04, and 05 and 06 are the same cities:
id city
01 New York City
02 New York
03 Tokyo City
04 Tokyo
05 Shanghai City
06 Shanghai
07 Beijing City
08 Paris
09 Berlin
How can I drop the duplicate cities and get the following dataframe? Thanks.
id city
01 New York
02 Tokyo
03 Shanghai
04 Beijing City
05 Paris
06 Berlin

Replace the 'City' part with an empty string and apply groupby, keeping the first row:
import pandas as pd
df = pd.DataFrame({'id':[1,2,3,4],'city':['New York City','New York','Tokyo City','Tokyo']})
df looks like this:
city id
0 New York City 1
1 New York 2
2 Tokyo City 3
3 Tokyo 4
Apply the replace and groupby to get the first row in each group:
df.city=df.city.str.replace('City','').str.strip()
df.groupby('city').first().sort_values('id')
Output:
city id
New York 1
Tokyo 3
Or use drop_duplicates on a subset of columns (thanks @JR ibkr):
df.drop_duplicates(subset='city')

This is much easier in pandas now with drop_duplicates and the keep parameter.
import pandas as pd

# dataset
df = pd.DataFrame({'id':[1,2,3,4],'city':['New York City','New York','Tokyo City','Tokyo']})
# replace values
df.city = df.city.str.replace('City','').str.strip()
# drop duplicates (answer to the original question)
df.drop_duplicates(subset=['city'])
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
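The keep parameter mentioned above controls which occurrence survives; as a small sketch (standard pandas behaviour, not shown in the original answer), keep='first' is the default and keep='last' retains the last occurrence instead:
# using the df built above, after stripping 'City'
df.drop_duplicates(subset=['city'], keep='first')  # keeps ids 1 and 3
df.drop_duplicates(subset=['city'], keep='last')   # keeps ids 2 and 4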

Related

Why am I getting \n\n in my Excel output in Python?

In my Excel file, the only thing I see in the mileage column is the actual value, so for example on the first line I see "35 kmpl". Why is it putting \n\n before every value when I get Python to print my file?
I was expecting just the kmpl values in that column, but instead I'm getting \n\n in front of them in Python.
Here is my code; I didn't get an error message, so I can't include one:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
bikes = pd.read_csv('/content/bikes sorted final.csv')
print(bikes)
which returns:
model_name model_year kms_driven owner
0 Bajaj Avenger Cruise 220 2017 2017 17000 Km first owner
1 Royal Enfield Classic 350cc 2016 2016 50000 Km first owner
2 Hyosung GT250R 2012 2012 14795 Km first owner
3 KTM Duke 200cc 2012 2012 24561 Km third owner
4 Bajaj Pulsar 180cc 2016 2016 19718 Km first owner
...
location mileage power price
0 hyderabad \n\n 35 kmpl 19 bhp 63500
1 hyderabad \n\n 35 kmpl 19.80 bhp 115000
2 hyderabad \n\n 30 kmpl 28 bhp 300000
3 bangalore \n\n 35 kmpl 25 bhp 63400
4 bangalore \n\n 65 kmpl 17 bhp 55000
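As a minimal sketch, assuming the \n\n is literal whitespace stored in the CSV cells, stripping it after reading should clean the column:
# strip leading/trailing whitespace (including the embedded newlines) from the mileage column
bikes['mileage'] = bikes['mileage'].str.strip()
print(bikes['mileage'].head())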

Replace values in pandas dataframe with blank space that start with a string value

I have a large pandas dataframe (15 million rows) and I want to replace any value that starts with 'College' with a blank. I know I could do this with a for loop or np.where, but this takes way too long on my large dataframe. I also want to create a 'combined_id' column that combines the student name and the college, skipping rows that don't have a proper college name. What is the fastest way to do this?
original:
id1 id2 college_name student combined id
0 01 01 Stanford haley id/haley_Stanford
1 01 02 College12 josh id/josh_College12
2 01 03 Harvard jake id/jake_Harvard
2 01 05 UPenn emily id/emily_UPenn
2 01 00 College10 sarah id/sarah_College10
desired:
id1 id2 college_name student combined id
0 01 01 Stanford haley id/haley_Stanford
1 01 02 josh
2 01 03 Harvard jake id/jake_Harvard
2 01 05 UPenn emily id/emily_UPenn
2 01 00 sarah
Use boolean indexing:
m = df['college_name'].str.startswith('College')
df.loc[m, 'college_name'] = ''
df.loc[m, 'combined id'] = ''
Or, if "combined id" does not exist yet, you can create it with numpy.where:
df['combined id'] = np.where(m, '', 'id/'+df['student']+'_'+df['college_name'])
Here's a way to get from original to desired in your question:
df.loc[df.college_name.str.startswith("College"), ['college_name', 'combined_id']] = ''
Output:
id1 id2 college_name student combined_id
0 1 1 Stanford haley id/haley_Stanford
1 1 2 josh
2 1 3 Harvard jake id/jake_Harvard
2 1 5 UPenn emily id/emily_UPenn
2 1 0 sarah
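A self-contained sketch of the boolean-indexing approach above, with the example data re-typed from the question (the ids are illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id1': ['01', '01', '01', '01', '01'],
    'id2': ['01', '02', '03', '05', '00'],
    'college_name': ['Stanford', 'College12', 'Harvard', 'UPenn', 'College10'],
    'student': ['haley', 'josh', 'jake', 'emily', 'sarah'],
})

# mask of rows whose college name starts with 'College'
m = df['college_name'].str.startswith('College')
df.loc[m, 'college_name'] = ''
# build the combined id only for rows with a proper college name
df['combined id'] = np.where(m, '', 'id/' + df['student'] + '_' + df['college_name'])
print(df)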

Assign the value of another column to the empty cells of a specific column

I have the following DataFrame in pandas:
code city district
01 London Westminster
03 Madrid NaN
04 Rome Trevi
07 Berlin NaN
08 Barcelona Badalona
For the district column, if the row value is NaN, I want to assign it the same value it has in its city attribute. Example:
code city district
01 London Westminster
03 Madrid Madrid
04 Rome Trevi
07 Berlin Berlin
08 Barcelona Badalona
You should be able to use fillna() and pass the other column name:
df['district'] = df['district'].fillna(df['city'])
If the value is not null (which I suggest keeping as null as a best practice) but an empty string, you can evaluate a condition instead:
df['district'] = np.where(df['district'] == '',df['city'],df['district'])
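A self-contained sketch of the fillna approach, with the example data re-typed from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'code': ['01', '03', '04', '07', '08'],
    'city': ['London', 'Madrid', 'Rome', 'Berlin', 'Barcelona'],
    'district': ['Westminster', np.nan, 'Trevi', np.nan, 'Badalona'],
})

# fill missing districts with the value from the city column
df['district'] = df['district'].fillna(df['city'])
print(df)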

Using a library to replace column values in Python

I am trying to replace the FIPS code with state abbreviations using the us library. This is how I can get the value for each individual state:
fips_name = us.states.mapping('fips', 'name')
fips_name['20']
Out[31]: 'Kansas'
Assuming that fips_name is a dictionary of fips -> state names, you can use the .map method of a pandas.Series (column):
df["state_names"] = df["fips"].map(fips_name)
Update with a working example:
import pandas as pd
import us
df = pd.DataFrame({"fips": ["01", "01", "08", "09", "10", "06"]})
fips_to_name = us.states.mapping("fips", "name")
df["states"] = df["fips"].map(fips_to_name)
print(df)
fips states
0 01 Alabama
1 01 Alabama
2 08 Colorado
3 09 Connecticut
4 10 Delaware
5 06 California
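Since the question asks for state abbreviations rather than full names, the same pattern should work with the 'abbr' field of the us library (assuming your installed version exposes it, which recent releases do):
# map FIPS codes to two-letter state abbreviations instead of names
fips_to_abbr = us.states.mapping("fips", "abbr")
df["state_abbr"] = df["fips"].map(fips_to_abbr)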

Pandas Groupby Percentage of total

df["% Sales"] = df["Jan"]/df["Q1"]
q1_sales = df.groupby(["City"])["Jan","Feb","Mar", "Q1"].sum()
ql_sales.head()
Jan Feb Mar Q1
City
Los Angeles 44 40 54 138
I want the code to get the percentage of sales for the quarter. It should look like the output below, where each month is divided by the total sales of the quarter.
Jan Feb Mar
City
Los Angeles 31.9% 29% 39.1%
Try div:
q1_sales[['Jan','Feb','Mar']].div(q1_sales['Q1']*0.01, axis='rows')
Output:
Jan Feb Mar
City
Los Angeles 31.884058 28.985507 39.130435
Use:
new_df=q1_sales[q1_sales.columns.difference(['Q1'])]
new_df=(new_df.T/new_df.sum(axis=1)*100).T
print(new_df)
Feb Jan Mar
Los Angeles 28.985507 31.884058 39.130435
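If the output should look exactly like the desired table (e.g. '31.9%'), a small follow-up sketch is to round and append a percent sign:
pct = q1_sales[['Jan','Feb','Mar']].div(q1_sales['Q1'], axis='rows') * 100
pct = pct.round(1).astype(str) + '%'
print(pct)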
