In my Excel file, the only thing I see in the mileage column is the actual value; for example, on the first line I see "35 kmpl". Why does Python print \n\n before every value when I read my file?
I was expecting just the kmpl values in that column, but instead I'm getting \n\n in front of them in Python.
Here is my code; I didn't get an error message, so I can't include one:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
bikes = pd.read_csv('/content/bikes sorted final.csv')
print(bikes)
which returns:
model_name model_year kms_driven owner
0 Bajaj Avenger Cruise 220 2017 2017 17000 Km first owner
1 Royal Enfield Classic 350cc 2016 2016 50000 Km first owner
2 Hyosung GT250R 2012 2012 14795 Km first owner
3 KTM Duke 200cc 2012 2012 24561 Km third owner
4 Bajaj Pulsar 180cc 2016 2016 19718 Km first owner
...
location mileage power price
0 hyderabad \n\n 35 kmpl 19 bhp 63500
1 hyderabad \n\n 35 kmpl 19.80 bhp 115000
2 hyderabad \n\n 30 kmpl 28 bhp 300000
3 bangalore \n\n 35 kmpl 25 bhp 63400
4 bangalore \n\n 65 kmpl 17 bhp 55000
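The newline characters are in the CSV field itself; pandas is showing them while Excel hides them. A minimal cleanup sketch, assuming the mileage column is read in as strings:
import pandas as pd

bikes = pd.read_csv('/content/bikes sorted final.csv')
# str.strip() removes leading/trailing whitespace, including a real '\n\n' prefix
bikes['mileage'] = bikes['mileage'].str.strip()
print(bikes)
If strip() does not remove them, the cells may contain the literal two characters backslash-n rather than real newlines, in which case something like bikes['mileage'].str.replace(r'\n', '', regex=False) would apply instead.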
My dataset looks as follows:
Country  Year  Value
Ireland  2010      9
Ireland  2011     11
Ireland  2012     14
Ireland  2013     17
Ireland  2014     20
France   2011     15
France   2012     19
France   2013     21
France   2014     28
Germany  2008     17
Germany  2009     20
Germany  2010     19
Germany  2011     24
Germany  2012     27
Germany  2013     32
My goal is to create a new dataset which tells me the % increase from the first year of available data for a given country, compared to the most recent, which would look roughly as follows:
Country  % increase
Ireland         122
France           87
Germany          88
In essence, for each country in my dataset, I need my code to locate the smallest and largest value of Year, take the corresponding entries in the Value column, and calculate the % increase.
I can do this manually, but I have a lot of countries in my dataset and am looking for a more elegant way. I am trying to troubleshoot my code for this, but I have not had much luck so far.
My code looks as follows at present:
df_1["Min_value"] = df.loc[df["Year"].min(),"Value"].iloc[0]
df_1["Max_value"] = df.loc[df["Year"].max(),"Value"].iloc[0]
df_1["% increase"] = ((df_1["Max_value"]-df_1["Min_value"])/df_1["Min_value"])*100
This returns an error:
AttributeError: 'numpy.float64' object has no attribute 'iloc'
In addition to this it also has the issue that I cannot figure out a way to have the code to run individually for each country within my dataset, so this is another challenge which I am not entirely sure how to address.
Could I potentially go down the route of defining a particular function which could then be applied to each country?
You can group by Country and aggregate min/max for both Year and Value, then calculate percentage change between min and max of the Value.
pct_df = df.groupby(['Country']).agg(['min', 'max'])['Value'] \
           .apply(lambda x: x.pct_change().round(2) * 100, axis=1) \
           .drop('min', axis=1) \
           .rename(columns={'max': '% increase'}) \
           .reset_index()
print(pct_df)
The output:
Country % increase
0 France 87.0
1 Germany 88.0
2 Ireland 122.0
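Worth noting: the snippet above takes the min/max of Value itself, which matches the expected output only because each country's values rise over time. A sketch that literally pairs the earliest and latest Year with their corresponding Values (same column names as the question):
import pandas as pd

# Sort by Year so that .first()/.last() see each country's rows in year order
ordered = df.sort_values('Year')
grouped = ordered.groupby('Country')['Value']
pct_df = (((grouped.last() - grouped.first()) / grouped.first())
          .round(2).mul(100).rename('% increase').reset_index())
print(pct_df)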
I am working on extracting raw data from various sources. After some processing, I ended up with a dataframe that looks like this:
data
0 ₹ 16,50,000\n2014 - 49,000 km\nJaguar XF 2.2\nJAN 16
1 ₹ 23,60,000\n2017 - 28,000 km\nMercedes-Benz CLA 200 CDI Style, 2017, Diesel\nNOV 26
2 ₹ 26,00,000\n2016 - 44,000 km\nMercedes Benz C-Class Progressive C 220d, 2016, Diesel\nJAN 03
I want to split this raw dataframe into columns in the order the raw data occurs: Price, Year, Mileage, Name, Date.
I have tried df.data.str.split('-', expand=True) with various delimiter options in sequence, along with some lambda functions, but haven't had much success.
I need assistance splitting this data into the relevant columns.
Expected output:
price year mileage name date
16,50,000 2014 49000 Jaguar 2.2 XF Luxury Jan-17
23,60,000 2017 28000 CLA CDI Style Nov-26
26,00,000 2016 44000 Mercedes C-Class C220d Jan-03
Try splitting on '\n' first, then on '-':
df[["Price", "Year-Mileage", "Name", "Date"]] = df.data.str.split('\n', expand=True)
df[["Year", "Mileage"]] = df["Year-Mileage"].str.split('-', expand=True)
df.drop(columns=["data", "Year-Mileage"], inplace=True)
print(df)
Price Name Date Year Mileage
0 ₹ 16,50,000 Jaguar XF 2.2 JAN 16 2014 49,000 km
2 ₹ 26,00,000 Mercedes Benz C-Class Progressive C 220d, 2016, Diesel JAN 03 2016 44,000 km
1 ₹ 23,60,000 Mercedes-Benz CLA 200 CDI Style, 2017, Diesel NOV 26 2017 28,000 km
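To get from there to the expected output, a further cleanup sketch; the specific replacements are my assumption of what turning " 49,000 km" into "49000" implies:
df["Price"] = df["Price"].str.replace("₹", "", regex=False).str.strip()
df["Year"] = df["Year"].str.strip()
# Drop the unit and the thousands separator from e.g. " 49,000 km"
df["Mileage"] = (df["Mileage"]
                 .str.replace("km", "", regex=False)
                 .str.replace(",", "", regex=False)
                 .str.strip())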
I want to replace NA values with the mean of that column taken over the other years for the same country.
Note: to replace NA values in the Canada rows, I want to use only the mean of the Canada data, not the mean of the whole dataset, of course.
Here's a sample dataframe filled with random numbers, with some NAs placed the way I find them in my dataframe:
Country  Inhabitants  Year  Area  Cats  Dogs
Canada    38 000 000  2021     4    32    21
Canada    37 000 000  2020     4    NA    21
Canada    36 000 000  2019     3    32    21
Canada            NA  2018     2    32    21
Canada    34 000 000  2017    NA    32    21
Canada    35 000 000  2016     3    32    NA
Brazil   212 000 000  2021     5    32    21
Brazil   211 000 000  2020     4    NA    21
Brazil   210 000 000  2019    NA    32    21
Brazil   209 000 000  2018     4    32    21
Brazil            NA  2017     2    32    21
Brazil   207 000 000  2016     4    32    NA
What's the easiest way with pandas to replace those NAs with the mean values from the other years? And is it possible to write code that goes through every NA and replaces them all (Inhabitants, Area, Cats, Dogs) at once?
Note: this example is based on the additional data source you linked in the comments.
To replace the NA values in multiple columns with mean(), you can combine these three methods:
fillna() (iterating per column; axis should be 0, which is the default for fillna())
groupby()
transform()
Create a data frame from your example:
df = pd.read_excel('https://happiness-report.s3.amazonaws.com/2021/DataPanelWHR2021C2.xls')
Country name  year  Life Ladder  Log GDP per capita  Social support  Healthy life expectancy at birth  Freedom to make life choices  Generosity  Perceptions of corruption  Positive affect  Negative affect
Canada        2005      7.41805             10.6518        0.961552                              71.3                      0.957306     0.25623                   0.502681         0.838544         0.233278
Canada        2007      7.48175             10.7392             nan                             71.66                      0.930341    0.249479                   0.405608         0.871604          0.25681
Canada        2008       7.4856             10.7384        0.938707                             71.84                      0.926315    0.261585                   0.369588          0.89022         0.202175
Canada        2009      7.48782             10.6972        0.942845                             72.02                      0.915058    0.246217                   0.412622         0.867433         0.247633
Canada        2010      7.65035             10.7165        0.953765                              72.2                      0.933949    0.230451                    0.41266         0.878868         0.233113
Call fillna(), filling each column from its per-country mean:
df = df.fillna(df.groupby('Country name').transform('mean'))
Check your result for Canada:
df[df['Country name'] == 'Canada']
Country name  year  Life Ladder  Log GDP per capita  Social support  Healthy life expectancy at birth  Freedom to make life choices  Generosity  Perceptions of corruption  Positive affect  Negative affect
Canada        2005      7.41805             10.6518        0.961552                              71.3                      0.957306     0.25623                   0.502681         0.838544         0.233278
Canada        2007      7.48175             10.7392         0.93547                             71.66                      0.930341    0.249479                   0.405608         0.871604          0.25681
Canada        2008       7.4856             10.7384        0.938707                             71.84                      0.926315    0.261585                   0.369588          0.89022         0.202175
Canada        2009      7.48782             10.6972        0.942845                             72.02                      0.915058    0.246217                   0.412622         0.867433         0.247633
Canada        2010      7.65035             10.7165        0.953765                              72.2                      0.933949    0.230451                    0.41266         0.878868         0.233113
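Applied to the original Canada/Brazil sample, the same pattern would look like this; a minimal sketch, with the frame rebuilt by hand from the question's table:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Country': ['Canada'] * 6 + ['Brazil'] * 6,
    'Inhabitants': [38_000_000, 37_000_000, 36_000_000, np.nan, 34_000_000, 35_000_000,
                    212_000_000, 211_000_000, 210_000_000, 209_000_000, np.nan, 207_000_000],
    'Year': list(range(2021, 2015, -1)) * 2,
    'Area': [4, 4, 3, 2, np.nan, 3, 5, 4, np.nan, 4, 2, 4],
    'Cats': [32, np.nan, 32, 32, 32, 32, 32, np.nan, 32, 32, 32, 32],
    'Dogs': [21, 21, 21, 21, 21, np.nan, 21, 21, 21, 21, 21, np.nan],
})
# Every NA is filled with the mean of its own column, computed within the same country
df = df.fillna(df.groupby('Country').transform('mean'))
print(df)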
This also works, though note it fills with the overall column mean rather than the per-country mean:
In [2]:
df = pd.read_excel('DataPanelWHR2021C2.xls')
In [3]:
# Check for number of null values in df
df.isnull().sum()
Out [3]:
Country name 0
year 0
Life Ladder 0
Log GDP per capita 36
Social support 13
Healthy life expectancy at birth 55
Freedom to make life choices 32
Generosity 89
Perceptions of corruption 110
Positive affect 22
Negative affect 16
dtype: int64
SOLUTION
In [4]:
# Replace any NULL values with the overall mean of their column;
# numeric_only=True skips the non-numeric 'Country name' column
df.fillna(df.mean(numeric_only=True), inplace=True)
In [5]:
# 2nd check for number of null values
df.isnull().sum()
Out [5]: No more NULL values
Country name 0
year 0
Life Ladder 0
Log GDP per capita 0
Social support 0
Healthy life expectancy at birth 0
Freedom to make life choices 0
Generosity 0
Perceptions of corruption 0
Positive affect 0
Negative affect 0
dtype: int64
Data Frame:
Unnamed: 0 date target insult tweet year
0 1 2014-10-09 thomas-frieden fool Can you believe this fool, Dr. Thomas Frieden ... 2014
1 2 2014-10-09 thomas-frieden DOPE Can you believe this fool, Dr. Thomas Frieden ... 2014
2 3 2015-06-16 politicians all talk and no action Big time in U.S. today - MAKE AMERICA GREAT AG... 2015
3 4 2015-06-24 ben-cardin It's politicians like Cardin that have destroy... Politician #SenatorCardin didn't like that I s... 2015
4 5 2015-06-24 neil-young total hypocrite For the nonbeliever, here is a photo of #Neily... 2015
I want to filter the data frame down to only the rows for particular years (2020 and 2021 in my full data) using string search and match methods.
# year is an integer column, so convert to str before using .str methods;
# substitute '2020|2021' for the years you need
df_filtered = df.loc[df.year.astype(str).str.contains('2014|2015', regex=True)]
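If year stays numeric, a simpler equivalent (my suggestion, not part of the original) avoids the string conversion entirely:
df_filtered = df[df['year'].isin([2014, 2015])]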
I have a data frame whose headers look like this:
Time Peter_Price, Peter_variable 1, Peter_variable 2, Maria_Price, Maria_variable 1, Maria_variable 3, John_price, ...
2017 12 985685466 Street 1 12 4984984984 Street 2
2018 10 985785466 Street 3 78 4984974184 Street 8
2019 12 985685466 Street 1 12 4984984984 Street 2
2020 12 985685466 Street 1 12 4984984984 Street 2
2021 12 985685466 Street 1 12 4984984984 Street 2
What would be the best multi-index for comparing variables by group later, e.g. which person has the highest variable 3, or the trend of variable 3 across people?
I think what I need is something like this, but I am open to other suggestions (this is my first approach with a multi-index):
Peter Maria John
Price, variable 1, variable 2, Price, variable 1, variable 3, Price,...
Time
You can try this:
Create data
import pandas as pd
import numpy as np
import itertools
people = ["Peter", "Maria"]
vars = ["Price", "variable 1", "variable 2"]
columns = ["_".join(x) for x in itertools.product(people, vars)]
df = (pd.DataFrame(np.random.rand(10, 6), columns=columns)
        .assign(time=np.arange(2012, 2022)))
print(df.head())
Peter_Price Peter_variable 1 Peter_variable 2 Maria_Price Maria_variable 1 Maria_variable 2 time
0 0.542336 0.201243 0.616050 0.313119 0.652847 0.928497 2012
1 0.587392 0.143169 0.594997 0.553803 0.249188 0.076633 2013
2 0.447318 0.410310 0.443391 0.947064 0.476262 0.230092 2014
3 0.285560 0.018005 0.869387 0.165836 0.399670 0.307120 2015
4 0.422084 0.414453 0.626180 0.658528 0.286265 0.404369 2016
Snippet to try
new_df = df.set_index("time")
new_df.columns = new_df.columns.str.split("_", expand=True)
print(new_df.head())
Peter Maria
Price variable 1 variable 2 Price variable 1 variable 2
time
2012 0.542336 0.201243 0.616050 0.313119 0.652847 0.928497
2013 0.587392 0.143169 0.594997 0.553803 0.249188 0.076633
2014 0.447318 0.410310 0.443391 0.947064 0.476262 0.230092
2015 0.285560 0.018005 0.869387 0.165836 0.399670 0.307120
2016 0.422084 0.414453 0.626180 0.658528 0.286265 0.404369
Then you can use the xs method to select specific variables for an individual-level analysis. Subsetting to only "variable 2":
>>> new_df.xs("variable 2", level=1, axis=1)
Peter Maria
time
2012 0.616050 0.928497
2013 0.594997 0.076633
2014 0.443391 0.230092
2015 0.869387 0.307120
2016 0.626180 0.404369
2017 0.443827 0.544415
2018 0.425426 0.176707
2019 0.454269 0.414625
2020 0.863477 0.322609
2021 0.902759 0.821789
Example analysis: For each year, who has the higher "Price"
>>> new_df.xs("Price", level=1, axis=1).idxmax(axis=1)
time
2012 Peter
2013 Peter
2014 Maria
2015 Peter
2016 Maria
2017 Peter
2018 Maria
2019 Peter
2020 Maria
2021 Peter
dtype: object
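For the trend half of the question, a small sketch on the same new_df: a cross-section combined with pct_change() gives each person's year-over-year change.
# Year-over-year relative change of "variable 1" for every person
trend = new_df.xs("variable 1", level=1, axis=1).pct_change()
print(trend.head())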
Try:
df = df.set_index('Time')
df.columns = pd.MultiIndex.from_tuples([tuple(x.split('_')) for x in df.columns])
Output:
     Peter                           Maria
     Price variable 1 variable 2    Price variable 1 variable 3
Time
2017    12  985685466   Street 1       12 4984984984   Street 2
2018    10  985785466   Street 3       78 4984974184   Street 8
2019    12  985685466   Street 1       12 4984984984   Street 2
2020    12  985685466   Street 1       12 4984984984   Street 2
2021    12  985685466   Street 1       12 4984984984   Street 2