Splitting single text column into multiple columns Pandas - python

I am working on extracting raw data from various sources. After some processing, I ended up with a dataframe that looks like this:
data
0 ₹ 16,50,000\n2014 - 49,000 km\nJaguar XF 2.2\nJAN 16
1 ₹ 23,60,000\n2017 - 28,000 km\nMercedes-Benz CLA 200 CDI Style, 2017, Diesel\nNOV 26
2 ₹ 26,00,000\n2016 - 44,000 km\nMercedes Benz C-Class Progressive C 220d, 2016, Diesel\nJAN 03
I want to split this raw dataframe into relevant columns, in the order in which the raw data occurs: Price, Year, Mileage, Name, Date.
I have tried df.data.split('-', expand=True), applying other delimiters in sequence along with some lambda functions, but without much success.
I need help splitting this data into the relevant columns.
Expected output:
price year mileage name date
16,50,000 2014 49000 Jaguar 2.2 XF Luxury Jan-17
23,60,000 2017 28000 CLA CDI Style Nov-26
26,00,000 2016 44000 Mercedes C-Class C220d Jan-03

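Try splitting on '\n' first, then on '-':
# Split each record into its four lines.
df[["Price", "Year-Mileage", "Name", "Date"]] = df.data.str.split('\n', expand=True)
# Split the second line into year and mileage.
df[["Year", "Mileage"]] = df["Year-Mileage"].str.split('-', expand=True)
# Drop the raw column and the intermediate column.
df.drop(columns=["data", "Year-Mileage"], inplace=True)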
print(df)
Price Name Date Year Mileage
0 ₹ 16,50,000 Jaguar XF 2.2 JAN 16 2014 49,000 km
2 ₹ 26,00,000 Mercedes Benz C-Class Progressive C 220d, 2016, Diesel JAN 03 2016 44,000 km
1 ₹ 23,60,000 Mercedes-Benz CLA 200 CDI Style, 2017, Diesel NOV 26 2017 28,000 km
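The split above leaves the currency symbol, the thousands separators and the "km" unit inside the columns. As a rough sketch, assuming every row follows the four-line layout shown in the question and the separator is a real newline character, the fields could also be extracted and converted in one pass with Series.str.extract. The regular expression and the lower-case column names are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({"data": [
    "₹ 16,50,000\n2014 - 49,000 km\nJaguar XF 2.2\nJAN 16",
    "₹ 23,60,000\n2017 - 28,000 km\nMercedes-Benz CLA 200 CDI Style, 2017, Diesel\nNOV 26",
    "₹ 26,00,000\n2016 - 44,000 km\nMercedes Benz C-Class Progressive C 220d, 2016, Diesel\nJAN 03",
]})

# One named group per target column (assumed layout: price, year - mileage, name, date).
pattern = (
    r"₹\s*(?P<price>[\d,]+)\n"
    r"(?P<year>\d{4})\s*-\s*(?P<mileage>[\d,]+)\s*km\n"
    r"(?P<name>[^\n]+)\n"
    r"(?P<date>[^\n]+)"
)
out = df["data"].str.extract(pattern)

# Remove thousands separators and convert the numeric columns.
for col in ["price", "mileage"]:
    out[col] = out[col].str.replace(",", "", regex=False).astype(int)
out["year"] = out["year"].astype(int)
print(out)
```

Rows that do not match the assumed layout come back as NaN, so the integer conversion would fail on them and they would need separate handling.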

Related

Extract values from dataset to perform functions- multiple countries within dataset

My dataset looks as follows:
Country  Year  Value
Ireland  2010  9
Ireland  2011  11
Ireland  2012  14
Ireland  2013  17
Ireland  2014  20
France   2011  15
France   2012  19
France   2013  21
France   2014  28
Germany  2008  17
Germany  2009  20
Germany  2010  19
Germany  2011  24
Germany  2012  27
Germany  2013  32
My goal is to create a new dataset which tells me the % increase from the first year of available data for a given country, compared to the most recent, which would look roughly as follows:
Country  % increase
Ireland  122
France   87
Germany  88
In essence, I need my code for each country in my dataset, to locate the smallest and largest value for year, then take the corresponding values within the value column and calculate the % increase.
I can do this manually, however I have a lot of countries in my dataset and am looking for a more elegant way to do it. I am trying to troubleshoot my code for this however I am not having much luck as of yet.
My code looks as follows at present:
df_1["Min_value"] = df.loc[df["Year"].min(),"Value"].iloc[0]
df_1["Max_value"] = df.loc[df["Year"].max(),"Value"].iloc[0]
df_1["% increase"] = ((df_1["Max_value"]-df_1["Min_value"])/df_1["Min_value"])*100
This returns an error:
AttributeError: 'numpy.float64' object has no attribute 'iloc'
In addition, I cannot figure out how to make the code run separately for each country in my dataset, which is another challenge I am not sure how to address.
Could I potentially go down the route of defining a particular function which could then be applied to each country?
You can group by Country and aggregate min/max for both Year and Value, then calculate the percentage change between the min and max of Value. Note that this compares the smallest and largest Value, which matches the first and last year only when the values rise over time.
pct_df = df.groupby(['Country']).agg(['min', 'max'])['Value']\
    .apply(lambda x: x.pct_change().round(2) * 100, axis=1)\
    .drop('min', axis=1).rename(columns={'max':'% increase'}).reset_index()
print(pct_df)
The output:
Country % increase
0 France 87.0
1 Germany 88.0
2 Ireland 122.0
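A sketch that follows the question's wording more literally, by comparing the Value at the earliest year with the Value at the latest year for each country, could look like this. It assumes the columns are named Country, Year and Value as in the question; the names ends and result are only illustrative.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "Country": ["Ireland"] * 5 + ["France"] * 4 + ["Germany"] * 6,
    "Year": [2010, 2011, 2012, 2013, 2014,
             2011, 2012, 2013, 2014,
             2008, 2009, 2010, 2011, 2012, 2013],
    "Value": [9, 11, 14, 17, 20,
              15, 19, 21, 28,
              17, 20, 19, 24, 27, 32],
})

# After sorting by Year, 'first' and 'last' are the values at the
# earliest and latest year of each country.
ends = (
    df.sort_values("Year")
      .groupby("Country")["Value"]
      .agg(["first", "last"])
)
ends["% increase"] = ((ends["last"] - ends["first"]) / ends["first"] * 100).round()
result = ends[["% increase"]].reset_index()
print(result)
```

For this sample the figures agree with the min/max approach, because each country's lowest and highest values happen to fall in its first and last year.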

Why am I getting \n\n in my Excel output in Python?

In my Excel file, the only thing I see in the mileage column is the actual value; for example, the first row shows "35 kmpl". Why does \n\n appear before every value when I print the file in Python?
I was expecting just the kmpl values in that column, but instead I get \n\n in front of them in Python.
Here is my code. I didn't get an error message, so I can't include one:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
bikes = pd.read_csv('/content/bikes sorted final.csv')
print(bikes)
which returns:
model_name model_year kms_driven owner
0 Bajaj Avenger Cruise 220 2017 2017 17000 Km first owner
1 Royal Enfield Classic 350cc 2016 2016 50000 Km first owner
2 Hyosung GT250R 2012 2012 14795 Km first owner
3 KTM Duke 200cc 2012 2012 24561 Km third owner
4 Bajaj Pulsar 180cc 2016 2016 19718 Km first owner
...
location mileage power price
0 hyderabad \n\n 35 kmpl 19 bhp 63500
1 hyderabad \n\n 35 kmpl 19.80 bhp 115000
2 hyderabad \n\n 30 kmpl 28 bhp 300000
3 bangalore \n\n 35 kmpl 25 bhp 63400
4 bangalore \n\n 65 kmpl 17 bhp 55000
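The printed output suggests that the mileage cells in the CSV begin with two newline characters, which a spreadsheet cell usually does not display. A minimal sketch, assuming that is the cause and using a small stand-in frame rather than the real file, strips the surrounding whitespace with Series.str.strip:

```python
import pandas as pd

# Stand-in for two columns of the real file (assumed content).
bikes = pd.DataFrame({
    "location": ["hyderabad", "bangalore"],
    "mileage": ["\n\n 35 kmpl", "\n\n 65 kmpl"],
})

# str.strip() removes leading and trailing whitespace, including newlines.
bikes["mileage"] = bikes["mileage"].str.strip()
print(bikes)
```

The same call could be applied to the mileage column right after pd.read_csv.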

Sum every nth row, create a new column, and apply the same value to every 3 rows in that column

I have a dataframe as such
Date Search Volume
Jan 2004 80,000
Feb 2004 90,000
Mar 2004 100,000
Apr 2004 40,000
May 2004 60,000
Jun 2004 50,000
I wish to have an output like this:
Date Search Volume Total Quarter
Jan 2004 80,000 270,000 2004Q1
Feb 2004 90,000 270,000 2004Q1
Mar 2004 100,000 270,000 2004Q1
Apr 2004 40,000 150,000 2004Q2
May 2004 60,000 150,000 2004Q2
Jun 2004 50,000 150,000 2004Q2
...
...
Aug 2022 50,000 100,000 2022Q3
Sep 2022 10,000 100,000 2022Q3
Oct 2022 40,000 100,000 2022Q3
So what I'm trying to do is sum every 3 rows (quarter) and create a new column called total, and apply the sum to every row that belongs to that quarter. The other column should be Quarter, which represents the quarter that the month belongs to.
I have tried this:
N = 3
keyvolume=keyvol.groupby(keyvol.index // 3).sum()
but this only returns the sums. I am not sure how to apply each sum to the 3 rows of its quarter, and I don't know how to generate the Quarter column.
Appreciate your help.
First convert the Search Volume column to numeric with Series.str.replace and a cast to integers or floats. Then convert the dates to quarters with to_datetime and Series.dt.to_period, and build the new column with GroupBy.transform, summing per quarter:
def func(df):
    # Remove thousands separators, then cast to integers.
    df['Search Volume'] = df['Search Volume'].str.replace(',', '', regex=False).astype(int)
    # Map each month to its quarter, e.g. Jan 2004 -> 2004Q1.
    q = pd.to_datetime(df['Date']).dt.to_period('Q')
    # Broadcast the quarterly sum back to every row of that quarter.
    df['Total'] = df['Search Volume'].groupby(q).transform('sum')
    df['Quarter'] = q
    return df
out = func(df)
print(out)
Date Search Volume Total Quarter
0 Jan 2004 80000 270000 2004Q1
1 Feb 2004 90000 270000 2004Q1
2 Mar 2004 100000 270000 2004Q1
3 Apr 2004 40000 150000 2004Q2
4 May 2004 60000 150000 2004Q2
5 Jun 2004 50000 150000 2004Q2
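If the rows are already in calendar order with a default 0..n-1 index and every quarter is complete, the positional grouping from the question can also be kept and combined with transform. This is only a sketch under those assumptions; grouping by the actual quarter is more robust when months are missing.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Date": ["Jan 2004", "Feb 2004", "Mar 2004", "Apr 2004", "May 2004", "Jun 2004"],
    "Search Volume": ["80,000", "90,000", "100,000", "40,000", "60,000", "50,000"],
})

df["Search Volume"] = df["Search Volume"].str.replace(",", "", regex=False).astype(int)
# Rows 0-2 form block 0, rows 3-5 form block 1, and so on.
df["Total"] = df["Search Volume"].groupby(df.index // 3).transform("sum")
print(df)
```

Unlike .sum(), transform("sum") returns one value per original row, which is why the result can be assigned directly as a new column.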

Create Average of Monthly Data to Quarterly Data in Panel Data using Pandas

I've got the following dataset:
Date Country Specie Monthly Average
Apr 2015 BR co 5.840000
Apr 2015 BR no2 7.553704
Apr 2015 BR o3 15.561667
Apr 2015 BR pm10 16.283333
Apr 2015 BR pm25 51.633333
... ... ... ... ...
The data covers 10 countries and several emissions (Specie), with monthly values from 2015 to 2021. I want to convert it into quarterly averages (the average of the corresponding months in each quarter), in the following form as an example:
Date Country Specie Quarterly Average
2015 Q1 BR co 6.840000
2015 Q1 BR no2 9.553704
2015 Q1 BR o3 17.561667
2015 Q1 BR pm10 18.283333
2015 Q1 BR pm25 55.633333
... ... ... ... ...
How would it be possible to do this in Python with pandas?
I also have another question: if I want to spread Specie across separate columns holding the corresponding values, how can I obtain the following structure?
Date Country co Average no2 Average o3 Average ...
2015 Q1 BR 6.840000 9.553704 17.561667
2015 Q2 BR 8.840000 10.553704 18.561667
First, create a Quarter column from the original Date column:
df['Quarter'] = pd.to_datetime(df['Date']).dt.to_period('Q')
Then group by the Country, Specie and Quarter columns, calculate the mean of the Monthly Average column in each group, and rename the result column to Quarterly Average:
df_ = (
    df.groupby(['Country', 'Specie', 'Quarter'], as_index=False)['Monthly Average']
      .mean()
      .rename(columns={'Monthly Average': 'Quarterly Average'})
)
To obtain the wide structure with one column per Specie, pandas.pivot_table() is what you want.
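To make the pivot_table hint concrete, here is a small sketch. The sample values are made up for illustration, and the Date strings are assumed to follow the "Apr 2015" pattern from the question.

```python
import pandas as pd

# Made-up monthly values for two species in one country.
df = pd.DataFrame({
    "Date": ["Apr 2015", "May 2015", "Apr 2015", "May 2015"],
    "Country": ["BR", "BR", "BR", "BR"],
    "Specie": ["co", "co", "no2", "no2"],
    "Monthly Average": [5.0, 7.0, 8.0, 10.0],
})

df["Quarter"] = pd.to_datetime(df["Date"], format="%b %Y").dt.to_period("Q")

# One row per (Quarter, Country), one column per Specie, averaged within the quarter.
wide = df.pivot_table(
    index=["Quarter", "Country"],
    columns="Specie",
    values="Monthly Average",
    aggfunc="mean",
).add_suffix(" Average").reset_index()
wide.columns.name = None  # drop the leftover 'Specie' label on the column axis
print(wide)
```

Because pivot_table aggregates on its own, it can be applied to the monthly data directly, without the intermediate groupby.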

Sort Plots by year

I have this data frame and want to draw 3 plots, one per year, with Unspsc Desc on the x-axis and Total_Price on the y-axis. For example, the first plot would be specific to the year 2018 and contain only the Unspsc Desc and Total_Price values for 2018.
Material Total_Price Year_Purchase
Gasket 50,000 2018
Washer 6,000 2019
Bolts 7,000 2019
Nut 3,000 2020
Gasket 25,000 2019
Gasket 2500 2020
Washer 33500 2018
Nuts 7000 2019
The code I was using
dw.groupby(['Unspsc Desc', 'Total_Price']).Year_Purchase.sort_values().plot.bar()
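One possible approach, sketched under the assumption that the columns are named Material, Total_Price and Year_Purchase as in the sample table (the question's code refers to 'Unspsc Desc' instead): convert the prices to numbers, total them per year and material, then draw one bar chart per year. The helper name plot_by_year is hypothetical, and it needs matplotlib to be installed.

```python
import pandas as pd

# Sample rows copied from the question.
dw = pd.DataFrame({
    "Material": ["Gasket", "Washer", "Bolts", "Nut", "Gasket", "Gasket", "Washer", "Nuts"],
    "Total_Price": ["50,000", "6,000", "7,000", "3,000", "25,000", "2500", "33500", "7000"],
    "Year_Purchase": [2018, 2019, 2019, 2020, 2019, 2020, 2018, 2019],
})

# Prices mix "50,000" and "2500" styles, so normalise them before casting.
dw["Total_Price"] = dw["Total_Price"].astype(str).str.replace(",", "", regex=False).astype(int)

# Total price per material within each year.
totals = dw.groupby(["Year_Purchase", "Material"])["Total_Price"].sum()

def plot_by_year(totals):
    """Draw one bar chart per year (hypothetical helper; requires matplotlib)."""
    import matplotlib.pyplot as plt
    years = totals.index.get_level_values("Year_Purchase").unique()
    fig, axes = plt.subplots(1, len(years), figsize=(5 * len(years), 4), squeeze=False)
    for ax, year in zip(axes[0], years):
        # totals.loc[year] is a Series indexed by Material for that year.
        totals.loc[year].plot.bar(ax=ax, title=str(year))
        ax.set_ylabel("Total_Price")
    fig.tight_layout()
    return fig
```

Calling plot_by_year(totals) returns a figure with one panel per year; grouping first keeps each panel limited to the materials purchased in that year.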
