populate a dataframe column based on a list - python

I have a dataframe
vehicle_make vehicle_model vehicle_year
Toyota Corolla 2016
Hyundai Sonata 2016
Cadillac DTS 2006
Toyota Prius 2014
Kia Optima 2015
I want to add a new column 'vehicle_make_category' which is populated based on lists I have:
luxury=['Bentley',
'Maserati',
'Hummer',
'Porsche',
'Lexus']
non_luxury=['Saab',
'Mazda',
'Dodge',
'Volkswagen',
'Kia',
'Chevrolet',
'Hyundai',
'Ford',
'Nissan',
'Honda',
'Toyota'
]
How can I accomplish this? I have tried using
df['vehicle_make_category']=np.where(df['vehicle_make']=i for i in luxury, 'luxury')
but it doesn't work...

Simply
df["vehicle_make_category"] = None
df.loc[df["vehicle_make"].isin(luxury), "vehicle_make_category"] = "luxury"
df.loc[df["vehicle_make"].isin(non_luxury), "vehicle_make_category"] = "non_luxury"

Use isin, and also give np.where a second value that fills the gaps for rows where the condition does not evaluate to true:
df['vehicle_make_category'] = np.where(df.vehicle_make.isin(luxury),'luxury','non-luxury')
vehicle_make vehicle_model vehicle_year vehicle_make_category
0 Toyota Corolla 2016 non-luxury
1 Hyundai Sonata 2016 non-luxury
2 Cadillac DTS 2006 non-luxury
3 Toyota Prius 2014 non-luxury
4 Kia Optima 2015 non-luxury
Using np.select, we can create a conditions list and assign values based on which condition is true:
conditions = [df.vehicle_make.isin(luxury),df.vehicle_make.isin(non_luxury)]
df['vehicle_make_category'] = np.select(conditions,['luxury','non-luxury'],default='no-category')
vehicle_make vehicle_model vehicle_year vehicle_make_category
0 Toyota Corolla 2016 non-luxury
1 Hyundai Sonata 2016 non-luxury
2 Cadillac DTS 2006 no-category
3 Toyota Prius 2014 non-luxury
4 Kia Optima 2015 non-luxury

You can use df.join (or, as below, pd.concat).
You'll have to make a new dataframe identifying luxury/non-luxury.
veh = ['toyota', 'hyundai', 'cadillac']
yr = [2016, 2016, 2016]
# recreating your lux/non layout
n_lux = [veh[0], veh[1]]
lux = [veh[2]]
# then making a new column
b = ['non' if v in n_lux else 'lux' for v in veh]
A = pd.DataFrame(np.array([veh, yr]).T)
B = pd.DataFrame(np.array([veh, b]).T)
pd.concat([A, B], axis=1, keys=[0])

You can create the column via list comprehension:
df['vehicle_make_category'] = [
    'luxury' if row.vehicle_make in luxury
    else 'non_luxury'
    for _, row in df.iterrows()
]
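If the dataframe is large, iterating the column directly avoids the per-row overhead of iterrows; a minimal equivalent sketch:
df['vehicle_make_category'] = [
    'luxury' if make in luxury else 'non_luxury'
    for make in df['vehicle_make']
]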

You can create a lookup_df from the lists for non_luxury and luxury.
lookup_df = pd.DataFrame({
    'vehicle_make': luxury + non_luxury,
    'vehicle_make_category': (["luxury"] * len(luxury)) + (["non_luxury"] * len(non_luxury))
})
Then left join it onto the original df that you have.
df.merge(lookup_df, how='left', on='vehicle_make')
Output:
vehicle_make vehicle_model vehicle_year vehicle_make_category
0 Toyota Corolla 2016 non_luxury
1 Hyundai Sonata 2016 non_luxury
2 Cadillac DTS 2006 NaN
3 Toyota Prius 2014 non_luxury
4 Kia Optima 2015 non_luxury
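Equivalently, a dict lookup with Series.map avoids the merge entirely; unmatched makes (like Cadillac) again come out as NaN. A minimal sketch:
category_map = {make: 'luxury' for make in luxury}
category_map.update({make: 'non_luxury' for make in non_luxury})
df['vehicle_make_category'] = df['vehicle_make'].map(category_map)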

Related

Pandas where function

I'm using the Pandas where function, trying to find the percentage for each state:
filter1 = df['state']=='California'
filter2 = df['state']=='Texas'
filter3 = df['state']=='Florida'
df['percentage']= df['total'].where(filter1)/df['total'].where(filter1).sum()
The output is
Year state total percentage
2014 California 914198.0 0.134925
2014 Florida 766441.0 NaN
2014 Texas 1045274.0 NaN
2015 California 874642.0 0.129087
2015 Florida 878760.0 NaN
How do I apply the other two filters there too?
Don't use where but groupby.transform:
df['percentage'] = df['total'].div(df.groupby('state')['total'].transform('sum'))
Output:
Year state total percentage
0 2014 California 914198.0 0.511056
1 2014 Florida 766441.0 0.465865
2 2014 Texas 1045274.0 1.000000
3 2015 California 874642.0 0.488944
4 2015 Florida 878760.0 0.534135
You can also combine multiple filters in pandas with df.loc[filter1 | filter2 | filter3] to apply them together (these particular filters are mutually exclusive, so they must be joined with | rather than &)!
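For completeness, a sketch that keeps the three explicit filters but lets np.select pick the matching state's share per row (assuming the df and filters from the question; rows matching no filter would get np.select's default of 0):
import numpy as np

filters = [filter1, filter2, filter3]
shares = [df['total'] / df.loc[f, 'total'].sum() for f in filters]
df['percentage'] = np.select(filters, shares)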

How can I use groupby with multiple values in a column in pandas?

I have a dataframe like the following,
import pandas as pd

data = {
    'brand': ['Mercedes', 'Renault', 'Ford', 'Mercedes', 'Mercedes', 'Mercedes', 'Renault'],
    'model': ['X', 'Y', 'Z', 'X', 'X', 'X', 'Q'],
    'year': [2011, 2010, 2009, 2010, 2012, 2020, 2011],
    'price': [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None]
}
df = pd.DataFrame(data)
print(df)
brand model year price
0 Mercedes X 2011 NaN
1 Renault Y 2010 1000.4
2 Ford Z 2009 2000.3
3 Mercedes X 2010 1000.0
4 Mercedes X 2012 1100.3
5 Mercedes X 2020 3000.5
6 Renault Q 2011 NaN
And here is another case to test your solution,
data = {
    'brand': ['Mercedes', 'Mercedes', 'Mercedes', 'Mercedes', 'Mercedes'],
    'model': ['X', 'X', 'X', 'X', 'X'],
    'year': [2017, 2018, 2018, 2019, 2019],
    'price': [None, None, None, 1000.0, 1200.50]
}
Expected output,
brand model year price
0 Mercedes X 2017 NaN
1 Mercedes X 2018 1100.25
2 Mercedes X 2018 1100.25
3 Mercedes X 2019 1000.00
4 Mercedes X 2019 1200.50
I want to fill the missing values with the average of the observations containing year-1, year and year+1 and also same brand and model. For instance, Mercedes X model has a null price in 2011. When I look at the data,
2011 - 1 = 2010
2011 + 1 = 2012
The 4th observation -> Mercedes,X,2010,1000.0
The 5th observation -> Mercedes,X,2012,1100.3
The mean -> (1000.0 + 1100.3) / 2 = 1050.15
I've tried something as follows,
for c_key, _ in df.groupby(['brand', 'model', 'year']):
    fc = (
        (df['brand'] == c_key[0])
        & (df['model'] == c_key[1])
        & (df['year'].isin([c_key[2] + 1, c_key[2], c_key[2] - 1]))
    )
    sc = (
        (df['brand'] == c_key[0])
        & (df['model'] == c_key[1])
        & (df['year'] == c_key[2])
        & (df['price'].isnull())
    )
    mean_val = df[fc]['price'].mean()
    df.loc[sc, 'price'] = mean_val

print(df)
brand model year price
0 Mercedes X 2011 1050.15
1 Renault Y 2010 1000.40
2 Ford Z 2009 2000.30
3 Mercedes X 2010 1000.00
4 Mercedes X 2012 1100.30
5 Mercedes X 2020 3000.50
6 Renault Q 2011 NaN
But this solution takes a long time for 90,000 rows and 27 columns, so is there a more efficient solution? For instance, can I use groupby with the values year-1, year, year+1, brand and model?
Thanks in advance.
I think actually a more efficient way would be to sort by Brand and then Year, and then use interpolate:
df = df.sort_values(['brand', 'year']).groupby('brand').apply(lambda g: g.interpolate(limit_area='inside'))
Output:
>>> df
brand model year price
0 Mercedes X 2011 1050.15
1 Renault Y 2010 1000.40
2 Ford Z 2009 2000.30
3 Mercedes X 2010 1000.00
4 Mercedes X 2012 1100.30
5 Mercedes X 2020 3000.50
6 Renault Q 2011 1000.40
That also handles all the columns.
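One caveat: this interpolation is linear in row order and only grouped by brand, so it can blend prices across different models and across gaps wider than one year. If the model should constrain it too, a variant sketched under that assumption:
df = (df.sort_values(['brand', 'model', 'year'])
        .groupby(['brand', 'model'], group_keys=False)
        .apply(lambda g: g.interpolate(limit_area='inside')))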
Based on the solution of #richardec, but with an addition to correct the price if the next year's price is known. Not sure if it is faster than your original solution though.
# Make an interpolated average
df_out = df.sort_values(['brand', 'year']).groupby('brand').apply(lambda g: g.interpolate(limit_area='inside'))

# Make an average per brand/year/model
df1 = df.sort_values(['brand', 'year']).groupby(['brand', 'year', 'model']).mean().reset_index()

# Check if the next line has the same brand and model. If so, take the next average price when the price is NaN
mask1 = df1["model"] == df1["model"].shift(-1)
mask2 = df1["brand"] == df1["brand"].shift(-1)
mask3 = df1["price"].isna()
df1["priceCorr"] = np.where(mask1 & mask2 & mask3, df1["price"].shift(-1), df1["price"])

# Merge everything together
df_out = df_out.merge(df1[["brand", "year", "model", "priceCorr"]], on=["brand", "year", "model"])
df_out["price"] = np.where(df_out["price"].isna(), df_out["priceCorr"], df_out["price"])
Here goes a solution that looks simpler:
Sort values in the original dataframe:
df = df.sort_values(["brand", "model", "year"])
Group by "brand" and "model", and store the groups in a variable (to calculate only once):
groups = df.groupby(["brand", "model"])
Fill NaN values using the average of the previous and next rows (Important: this assumes that you have data for consecutive years, meaning that if you're missing data for 2015 you know the values for 2014 and 2016. If you have no data for consecutive years, null values will remain null).
df["price"] = df["price"].fillna((groups["price"].ffill(limit=1) + groups["price"].bfill(limit=1)) / 2)
Resulting code:
df = df.sort_values(["brand", "model", "year"])
groups = df.groupby(["brand", "model"])
df["price"] = df["price"].fillna((groups["price"].ffill(limit=1) + groups["price"].bfill(limit=1)) / 2)
print(df)
Output:
brand model year price
2 Ford Z 2009 2000.30
3 Mercedes X 2010 1000.00
0 Mercedes X 2011 1050.15
4 Mercedes X 2012 1100.30
5 Mercedes X 2020 3000.50
6 Renault Q 2011 NaN
1 Renault Y 2010 1000.40
def fill_it(x):
    return df[(df.brand == df.iat[x, 0]) & (df.model == df.iat[x, 1]) & ((df.year == df.iat[x, 2] - 1) | (df.year == df.iat[x, 2] + 1))].price.mean()

df = df.apply(lambda x: x.fillna(fill_it(x.name)), axis=1)
df
Output 1:
brand model year price
0 Mercedes X 2011 1050.15
1 Renault Y 2010 1000.40
2 Ford Z 2009 2000.30
3 Mercedes X 2010 1000.00
4 Mercedes X 2012 1100.30
5 Mercedes X 2020 3000.50
6 Renault Q 2011 NaN
Output 2:
brand model year price
0 Mercedes X 2017 NaN
1 Mercedes X 2018 1100.25
2 Mercedes X 2018 1100.25
3 Mercedes X 2019 1000.00
4 Mercedes X 2019 1200.50
This is 3x faster:
df.loc[df.price.isna(), 'price'] = df[df.price.isna()].apply(lambda x: x.fillna(fill_it(x.name)), axis=1)
I tried another approach, using a rolling window, and it is much faster (on the dataframe with 70k rows it runs in 200 ms). The outputs are still as you wanted them.
df.year = pd.to_datetime(df.year, format='%Y')
df.sort_values('year', inplace=True)
df.groupby(['brand', 'model']).apply(lambda x: x.fillna(x.rolling('1095D',on='year', center=True).mean())).sort_index()
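Since this converts year to datetime in place for the rolling window, a small follow-up restores plain integer years afterwards (assuming you want them back):
df.year = df.year.dt.year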
This is not a pretty solution, but from your description, I believe it would work and be really fast. It's just a lot of conditions inside an np.where on a sorted data frame.
import pandas as pd
import numpy as np
data = pd.DataFrame({
    'brand': ['Mercedes', 'Renault', 'Ford', 'Mercedes', 'Mercedes', 'Mercedes', 'Renault'],
    'model': ['X', 'Y', 'Z', 'X', 'X', 'X', 'Q'],
    'year': [2011, 2010, 2009, 2010, 2012, 2020, 2011],
    'price': [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None]
})
data = data.sort_values(by=['brand', 'model', 'year'])
data['adjusted_price'] = np.where(data['price'].isnull() &
                                  (data['brand'] == data['brand'].shift(1)) & (data['brand'] == data['brand'].shift(-1)) &
                                  (data['model'] == data['model'].shift(1)) & (data['model'] == data['model'].shift(-1)) &
                                  (data['year'] == (data['year'].shift(1) + 1)) & (data['year'] == (data['year'].shift(-1) - 1)),
                                  (data['price'].shift(1) + data['price'].shift(-1)) / 2,
                                  data['price'])
data['price'] = data['adjusted_price']
data = data.drop(['adjusted_price'], axis=1)

Why can't I update a different column at each index I know with the loc function?

Datatable:
ARAÇ VEHICLE_YEAR NUM_PASSENGERS
0 CHEVROLET 2017 NaN
1 NISSAN 2017 NaN
2 HYUNDAI 2017 1.0
3 DODGE 2017 NaN
I want to update more than one index, and more than one column at those indexes, with the loc function.
But when I use the loc function, it writes both new values into both rows:
listcolumns = ['VEHICLE_YEAR', 'NUM_PASSENGERS']
listnewvalue = [16000, 28000]
indexlister = [0, 1]
data.loc[indexlister , listcolumns] = listnewvalue
As you can see in the output below, only row 0's 'VEHICLE_YEAR' should become 16000 and only row 1's 'NUM_PASSENGERS' should become 28000. But both columns changed in both rows.
How can I control this and change only the columns and indexes I want? Or do you have a different method? Thank you very much.
output:
ARAÇ VEHICLE_YEAR NUM_PASSENGERS
0 CHEVROLET 16000 28000.0
1 NISSAN 16000 28000.0
In the printout below, I left the other fields empty so that the new entries stand out. For example, I want to assign the value 2005 to index 0 of the column 'VEHICLE_YEAR' and 2005 to index 1 of the column 'NUM_PASSENGERS'.
The output I want is as follows:
ARAÇ VEHICLE_YEAR NUM_PASSENGERS
0 CHEVROLET 2005 NaN
1 NISSAN NaN 2005
2 HYUNDAI NaN NaN
The list you're setting the values with needs to correspond to the number of rows and the number of columns you've selected with loc. If it receives a single flat list, it will assign those values to every selected row across the chosen columns.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'ARAC': ['CHEVROLET', 'NISSAN', 'HYUNDAI', 'DODGE'],
    'VEHICLE_YEAR': [2017, 2017, 2017, 2017],
    'NUM_PASSENGERS': [np.nan, np.nan, 1.0, np.nan]
})
ARAC NUM_PASSENGERS VEHICLE_YEAR
0 CHEVROLET NaN 2017
1 NISSAN NaN 2017
2 HYUNDAI 1.0 2017
3 DODGE NaN 2017
df.loc[[0, 2], ['NUM_PASSENGERS', 'VEHICLE_YEAR']] = [[1000, 2014], [3000, 2015]]
ARAC NUM_PASSENGERS VEHICLE_YEAR
0 CHEVROLET 1000.0 2014
1 NISSAN NaN 2017
2 HYUNDAI 3000.0 2015
3 DODGE NaN 2017
If you only want to change the values in the NUM_PASSENGERS column, select only that and give it a single list/array, the same length as your row indices.
df.loc[[0,1,3], ['NUM_PASSENGERS']] = [10, 20, 30]
ARAC NUM_PASSENGERS VEHICLE_YEAR
0 CHEVROLET 10.0 2014
1 NISSAN 20.0 2017
2 HYUNDAI 3000.0 2015
3 DODGE 30.0 2017
The docs might be helpful too. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc
If this didn't answer your question, please provide your expected output.
I solved the problem as follows.
I could not describe the problem exactly (I am still working on that), but when I changed the code this way, it worked, and now I can change the row and column value I want to the value I want.
listcolumns = ['VEHICLE_YEAR', 'NUM_PASSENGERS']
listnewvalue = [16000, 28000]
indexlister = [0, 1]
for count in range(len(indexlister)):
    df.loc[indexlister[count], listcolumns[count]] = listnewvalue[count]
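Since each of these updates touches a single cell, df.at (fast scalar access) is an equivalent alternative; a minimal sketch with the same lists:
for idx, col, val in zip(indexlister, listcolumns, listnewvalue):
    df.at[idx, col] = val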

Why is fillna not working as expected with mode?

I am working on a car-sale data set having columns: 'car', 'price', 'body', 'mileage', 'engV', 'engType', 'registration', 'year', 'model', 'drive'.
The columns 'drive' and 'engType' have NaN (missing) values. I want to calculate the mode of, say, 'drive' per group of ['car', 'model'], and then replace the NaN values in each group with that group's mode.
I have tried these methods:
For numeric data:
carsale['engV2'] = (carsale.groupby(['car','body','model']))['engV'].transform(lambda x: x.fillna(x.median()))
This is working fine, filling/replacing data accurately.
For categorical data:
carsale['driveT'] = (carsale.groupby(['car','model']))['drive'].transform(lambda x: x.fillna(x.mode()))
carsale['driveT'] = (carsale.groupby(['car','model']))['drive'].transform(lambda x: x.fillna(pd.Series.mode(x)))
Both are giving the same results.
Here is the full code:
# carsale['price2'] = (carsale.groupby(['car','model','year']))['price'].transform(lambda x: x.fillna(x.median()))
# carsale['engV2'] = (carsale.groupby(['car','body','model']))['engV'].transform(lambda x: x.fillna(x.median()))
# carsale['mileage2'] = (carsale.groupby(['car','model','year']))['mileage'].transform(lambda x: x.fillna(x.median()))
# mode = carsale.filter(['car','drive']).mode()
# carsale[['test1','test2']] = carsale[['car','engType']].fillna(carsale.mode().iloc[0])
carsale.groupby(['car', 'model'])['engType'].apply(pd.Series.mode)
# carsale.apply()
# carsale
# carsale['engType2'] = carsale.groupby('car').engType.transform(lambda x: x.fillna(x.mode()))
carsale['driveT'] = carsale.groupby(['car', 'model'])['drive'].transform(lambda x: x.fillna(x.mode()))
carsale['driveT'] = carsale.groupby(['car', 'model'])['drive'].transform(lambda x: x.fillna(pd.Series.mode(x)))
# carsale[carsale.car == 'Mercedes-Benz'].sort_values(['body','engType','model','mileage']).tail(50)
# carsale[carsale.engV.isnull()]
# carsale.sort_values(['car','body','model'])
carsale
Both methods above give the same result: they just copy the values of the original 'drive' column into the new column driveT. If some index has NaN in 'drive', driveT shows the same NaN, and likewise for the other values.
But for numeric data, applying the median this way fills in the correct value.
So the transform is not actually calculating the mode per ['car', 'model'] group; it behaves as if it were taking the mode of the single values in 'drive'. Yet if you run this command
carsale.groupby(['car','model'])['engType'].apply(pd.Series.mode)
it correctly calculates the mode per (car, model) group.
Can anyone help in this matter?
My approach was to:
Use .groupby() to create a look-up dataframe that contains the mode of the drive feature for each car/model combo.
Write a method that looks up the mode in this dataframe and returns it for a given car/model, when that car/model's value in drive is null.
However, it turned out there were two key corner cases specific to the OP's dataset that needed to be handled:
When a particular car/model combo has no mode (because all entries in the drive column for this combo were NaN).
When a particular car brand has no mode.
Below are the steps I followed, beginning with an example extended from the first several rows of the sample dataframe in the question:
carsale = pd.DataFrame({
    'car': ['Ford', 'Mercedes-Benz', 'Mercedes-Benz', 'Mercedes-Benz', 'Mercedes-Benz', 'Nissan', 'Honda', 'Renault', 'Mercedes-Benz', 'Mercedes-Benz', 'Toyota', 'Toyota', 'Ferrari'],
    'price': [15500.000, 20500.000, 35000.000, 17800.000, 33000.000, 16600.000, 6500.000, 10500.000, 21500.000, 21500.000, 1280.000, 2005.00, 300000.000],
    'body': ['crossover', 'sedan', 'other', 'van', 'vagon', 'crossover', 'sedan', 'vagon', 'sedan', 'sedan', 'compact', 'compact', 'sport'],
    'mileage': [68.0, 173.0, 135.0, 162.0, 91.0, 83.0, 199.0, 185.0, 146.0, 146.0, 200.0, 134, 123.0],
    'engType': ['Gas', 'Gas', 'Petrol', 'Diesel', np.nan, 'Petrol', 'Petrol', 'Diesel', 'Gas', 'Gas', 'Hybrid', 'Gas', 'Gas'],
    'registration': ['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes'],
    'year': [2010, 2011, 2008, 2012, 2013, 2013, 2003, 2011, 2012, 2012, 2009, 2003, 1988],
    'model': ['Kuga', 'E-Class', 'CL 550', 'B 180', 'E-Class', 'X-Trail', 'Accord', 'Megane', 'E-Class', 'E-Class', 'Prius', 'Corolla', 'Testarossa'],
    'drive': ['full', 'rear', 'rear', 'front', np.nan, 'full', 'front', 'front', 'rear', np.nan, np.nan, 'front', np.nan],
})
carsale
car price body mileage engType registration year model drive
0 Ford 15500.0 crossover 68.0 Gas yes 2010 Kuga full
1 Mercedes-Benz 20500.0 sedan 173.0 Gas yes 2011 E-Class rear
2 Mercedes-Benz 35000.0 other 135.0 Petrol yes 2008 CL 550 rear
3 Mercedes-Benz 17800.0 van 162.0 Diesel yes 2012 B 180 front
4 Mercedes-Benz 33000.0 vagon 91.0 NaN yes 2013 E-Class NaN
5 Nissan 16600.0 crossover 83.0 Petrol yes 2013 X-Trail full
6 Honda 6500.0 sedan 199.0 Petrol yes 2003 Accord front
7 Renault 10500.0 vagon 185.0 Diesel yes 2011 Megane front
8 Mercedes-Benz 21500.0 sedan 146.0 Gas yes 2012 E-Class rear
9 Mercedes-Benz 21500.0 sedan 146.0 Gas yes 2012 E-Class NaN
10 Toyota 1280.0 compact 200.0 Hybrid yes 2009 Prius NaN
11 Toyota 2005.0 compact 134.0 Gas yes 2003 Corolla front
12 Ferrari 300000.0 sport 123.0 Gas yes 1988 Testarossa NaN
Create a dataframe that shows the mode of the drive feature for each car/model combination.
If a car/model combo has no mode (such as the row with Toyota Prius), I fill with the mode of that particular car brand (Toyota).
However, if the car brand, itself, (such as Ferrari here in my example) has no mode, I fill with the dataset's mode for the drive feature.
def get_drive_mode(x):
    brand = x.name[0]
    if x.count() > 0:
        return x.mode()  # Return mode for a brand/model if the mode exists.
    elif carsale.groupby(['car'])['drive'].count()[brand] > 0:
        # Return mode of brand if this particular brand/model combo has no mode,
        # but the brand itself has a mode for the 'drive' feature.
        brand_mode = carsale.groupby(['car'])['drive'].apply(lambda x: x.mode())[brand]
        return brand_mode
    else:
        return carsale['drive'].mode()  # Otherwise return the dataset's mode for the 'drive' feature.

drive_modes = carsale.groupby(['car', 'model'])['drive'].apply(get_drive_mode).reset_index().drop('level_2', axis=1)
drive_modes.rename(columns={'drive': 'drive_mode'}, inplace=True)
drive_modes
car model drive_mode
0 Ferrari Testarossa front
1 Ford Kuga full
2 Honda Accord front
3 Mercedes-Benz B 180 front
4 Mercedes-Benz CL 550 rear
5 Mercedes-Benz E-Class rear
6 Nissan X-Trail full
7 Renault Megane front
8 Toyota Corolla front
9 Toyota Prius front
Write a method that looks up the drive mode value for a given car/model in a given row if that row's value for drive is NaN:
def fill_with_mode(x):
    if pd.isnull(x['drive']):
        return drive_modes[(drive_modes['car'] == x['car']) & (drive_modes['model'] == x['model'])]['drive_mode'].values[0]
    else:
        return x['drive']
Apply the above method to the rows in the carsale dataframe in order to create the driveT feature:
carsale['driveT'] = carsale.apply(fill_with_mode, axis=1)
del(drive_modes)
Which results in the following dataframe:
carsale
car price body mileage engType registration year model drive driveT
0 Ford 15500.0 crossover 68.0 Gas yes 2010 Kuga full full
1 Mercedes-Benz 20500.0 sedan 173.0 Gas yes 2011 E-Class rear rear
2 Mercedes-Benz 35000.0 other 135.0 Petrol yes 2008 CL 550 rear rear
3 Mercedes-Benz 17800.0 van 162.0 Diesel yes 2012 B 180 front front
4 Mercedes-Benz 33000.0 vagon 91.0 NaN yes 2013 E-Class NaN rear
5 Nissan 16600.0 crossover 83.0 Petrol yes 2013 X-Trail full full
6 Honda 6500.0 sedan 199.0 Petrol yes 2003 Accord front front
7 Renault 10500.0 vagon 185.0 Diesel yes 2011 Megane front front
8 Mercedes-Benz 21500.0 sedan 146.0 Gas yes 2012 E-Class rear rear
9 Mercedes-Benz 21500.0 sedan 146.0 Gas yes 2012 E-Class NaN rear
10 Toyota 1280.0 compact 200.0 Hybrid yes 2009 Prius NaN front
11 Toyota 2005.0 compact 134.0 Gas yes 2003 Corolla front front
12 Ferrari 300000.0 sport 123.0 Gas yes 1988 Testarossa NaN front
Notice that in rows 4 and 9 of the driveT column, the NaN value that was in the drive column has been replaced by the string rear, which as we would expect, is the mode of drive for a Mercedes E-Class.
Also, in row 10, since there is no mode for the Toyota Prius car/model combo, we fill with the mode for the Toyota brand, which is front.
Finally, in row 12, since there is no mode for the Ferrari car brand, we fill with the mode of the entire dataset's drive column, which is also front.
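As an aside, the likely reason the original transform attempts left the data unchanged: Series.mode() returns a Series (there can be ties), and fillna aligns that Series by index rather than treating it as a scalar. A sketch of the scalar fix, assuming groups with no non-null values should be left alone:
carsale['driveT'] = carsale.groupby(['car', 'model'])['drive'].transform(
    lambda x: x.fillna(x.mode().iat[0]) if x.count() > 0 else x
)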

How to rename Pandas column names and prepend 0's to records?

This is just an example DataFrame I made up, together with the output I would like to see. There are two things I am trying to achieve here.
Replace periods (.) in the column names with underscores (_). I can do this individually, but I want to do it in a loop, say for 40-50 column names.
Check whether car.Mile has 6 digits on each record; if not, prepend 0's.
car.Model car.Color car.Year car.Mile
0 AUDI RED 2015 14000
1 BUIC WHITE 2015 9000
2 PORS BLUE 2016 7000
3 HONDA BLACK 2015 100000
OUTPUT
car_Model car_Color car_Year car_Mile
0 AUDI RED 2015 014000
1 BUIC WHITE 2015 009000
2 PORS BLUE 2016 007000
3 HONDA BLACK 2015 100000
You can use str.replace for replacing the periods (pass regex=False so the . is treated literally). Then convert column car_Mile to string with astype and finally apply zfill:
df.columns = df.columns.str.replace('.', '_', regex=False)
df['car_Mile'] = df['car_Mile'].astype(str).apply(lambda x: x.zfill(6))
print(df)
car_Model car_Color car_Year car_Mile
0 AUDI RED 2015 014000
1 BUIC WHITE 2015 009000
2 PORS BLUE 2016 007000
3 HONDA BLACK 2015 100000
Or:
df.columns = df.columns.str.replace('.', '_', regex=False)
df['car_Mile'] = df['car_Mile'].astype(str).apply(lambda x: '{0:0>6}'.format(x))
print(df)
car_Model car_Color car_Year car_Mile
0 AUDI RED 2015 014000
1 BUIC WHITE 2015 009000
2 PORS BLUE 2016 007000
3 HONDA BLACK 2015 100000
EDIT:
Thank you EdChum for the improvement - apply is not necessary; better is to use str.zfill:
df.columns = df.columns.str.replace('.', '_', regex=False)
df['car_Mile'] = df['car_Mile'].astype(str).str.zfill(6)
print(df)
car_Model car_Color car_Year car_Mile
0 AUDI RED 2015 014000
1 BUIC WHITE 2015 009000
2 PORS BLUE 2016 007000
3 HONDA BLACK 2015 100000
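An alternative for the renaming step that sidesteps regex semantics entirely is to pass a callable to rename; a minimal sketch on the same df:
df = df.rename(columns=lambda c: c.replace('.', '_'))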
