Replace multiple columns with years for one column - python

I'm working with worldbank data and I'm trying to create some graphs representing time, but the data I have now looks like this:
Since I don't think there's a way to convert these to datetime, I think the only option is to replace all the year columns with a single column called 'Year', holding the current column names as values, with the current values in a separate column.
Is there any nice function in Python that allows that or would I have to iterate through the entire dataframe?
Edit to include some code:
df2 = pd.DataFrame({'Country Name': ['Aruba', 'Afghanistan', 'Angola'],
'Country Code': ['ABW', 'AFG', 'AGO'],
'1960':[65.66, 32.29, 33.25],
'1961': [66.07, 32.74, 33.57],
'1962': [66.44, 33.18, 33.91],
'1963': [66.79, 33.62, 34.27],
'1964': [66.11, 34.06, 34.65],
'1965': [67.44, 34.49, 35.03]}).set_index('Country Name')

You can transpose the dataframe so that the year columns become rows, then expose the resulting index as a 'Year' column and use it in the plots.
Something like this:
import pandas as pd
from matplotlib import pyplot as plt
df1 = pd.DataFrame({'Country Name' : ['Aruba', 'Afghanistan', 'Angola'],
'Country Code' : ['ABW', 'AFG', 'AGO'],
'1960' : [65.66, 32.29, 33.25],
'1961' : [66.07, 32.74, 33.57],
'1962' : [66.44, 33.18, 33.91],
'1963' : [66.79, 33.62, 34.27],
'1964' : [66.11, 34.06, 34.65],
'1965' : [67.44, 34.49, 35.03]})
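df2 = df1.transpose()              # year columns (and the two label columns) become rows
df2.columns = df1['Country Name']  # use the country names as column labels
df2 = df2[2:].copy()               # drop the 'Country Name' and 'Country Code' rows
df2['Year'] = df2.index.values     # expose the year labels as a regular column
# label each line so that plt.legend() has entries to show
plt.plot(df2['Year'], df2['Aruba'], label='Aruba')
plt.plot(df2['Year'], df2['Afghanistan'], label='Afghanistan')
plt.plot(df2['Year'], df2['Angola'], label='Angola')
plt.legend()
plt.show()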
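If the goal is the long layout described in the question (one 'Year' column plus one value column), pandas' melt is another option worth trying. This is only a sketch on the sample data from the question; the column name 'Value' is made up here for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({'Country Name': ['Aruba', 'Afghanistan', 'Angola'],
                   'Country Code': ['ABW', 'AFG', 'AGO'],
                   '1960': [65.66, 32.29, 33.25],
                   '1961': [66.07, 32.74, 33.57],
                   '1962': [66.44, 33.18, 33.91],
                   '1963': [66.79, 33.62, 34.27],
                   '1964': [66.11, 34.06, 34.65],
                   '1965': [67.44, 34.49, 35.03]})

# Keep the identifier columns and stack every year column into rows.
# 'Value' is a hypothetical name for the measurement column.
long_df = df.melt(id_vars=['Country Name', 'Country Code'],
                  var_name='Year', value_name='Value')

# The year labels start out as strings; cast them so they sort and plot as numbers.
long_df['Year'] = long_df['Year'].astype(int)

print(long_df.head())
```

From here each country can be plotted by filtering long_df on 'Country Name', without iterating over the original dataframe.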

Related

Different questions about pandas pivot tables

Here's my df:
df=pd.DataFrame(
{
'Color': ['red','blue','red','red','green','red','yellow'],
'Type': ['Oil', 'Aluminium', 'Oil', 'Oil', 'Cement Paint', 'Synthetic Rubber', 'Emulsion'],
'Finish' : ['Satin', 'Matte', 'Matte', 'Satin', 'Semi-gloss', 'Satin', 'Satin'],
'Use' : ['Interior', 'Exterior', 'Interior', 'Interior', 'Exterior', 'Exterior', 'Exterior'],
'Price' : [55, 75, 60, 60, 55, 75, 50]
}
)
I want to create a pivot table that will output 'Color', the color count, the percentage (or weight) of each color count, and finally a total row showing the total color count next to 100%. Additionally, I'd like to add a header with today's date in the following format (02 - Nov).
Here is my current pivot, with approximate inputs:
today=datetime.date.today()
today_format=today.strftime("%d-%b")
pivot_table=pd.pivot_table(
data=df,
index='Color',
aggfunc={'Color':'count'}
)
df['Color'].value_counts(
normalize=True
).mul(100).round(1).astype(str) + '%'
Is there a way to add more information to the pivot, such as a header, a total row and an extra column? Or should I just convert the pivot back to a DF and edit it from there?
The main difficulty I'm finding is that, since I'm handling string data, when I use aggfunc='sum' it actually concatenates the strings. And if I try to add margins=True, margins_name='Total count', I get the following error:
if isinstance(aggfunc[k], str):
KeyError: 'Type'
The desired table output would look something like this:
Updated Answer
Thanks to a great suggestion by Rabinzel, we can also have today's date as a column header:
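import numpy as np
from datetime import datetime
# Note: this relies on pandas < 2.0, where value_counts().reset_index()
# produces the columns 'index' and 'Color'.
df = (df['Color'].value_counts().reset_index().pivot_table(index = ['index'], aggfunc = np.sum, margins=True, margins_name='Total')
.assign(perc = lambda x: x['Color']/x.iloc[:-1]['Color'].sum() * 100)
.rename(columns = {'Color' : 'Color Count',
'perc' : '%'}))
# '%#d' (day without a leading zero) is Windows-specific; on Linux/macOS use '%-d'.
new_cols = pd.MultiIndex.from_product([[datetime.today().strftime('%#d-%b')], df.columns])
df.columns = new_cols
df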
             2-Nov
       Color Count           %
index
blue             1   14.285714
green            1   14.285714
red              4   57.142857
yellow           1   14.285714
Total            7  100.000000
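For what it's worth, the same table can probably be built without pivot_table at all, which sidesteps the column names that value_counts().reset_index() produces in different pandas versions. The following is a minimal sketch, not the answer's method: it uses only the 'Color' column from the question, and the names counts, summary and header are made up for illustration.

```python
from datetime import datetime

import pandas as pd

# Only the 'Color' column from the question is needed here
df = pd.DataFrame({'Color': ['red', 'blue', 'red', 'red', 'green', 'red', 'yellow']})

# Count each color and sort alphabetically for a stable row order
counts = df['Color'].value_counts().sort_index()

summary = pd.DataFrame({
    'Color Count': counts,
    '%': (counts / counts.sum() * 100).round(1),
})

# Append the total row, then restore the integer dtype of the count column
summary.loc['Total'] = [int(counts.sum()), 100.0]
summary['Color Count'] = summary['Color Count'].astype(int)
summary.index.name = 'Color'

# Put today's date above the columns, e.g. '02-Nov' (zero-padded, portable format)
header = datetime.today().strftime('%d-%b')
summary.columns = pd.MultiIndex.from_product([[header], summary.columns])

print(summary)
```

The percentages are rounded to one decimal here; drop the .round(1) call to keep full precision.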

Finding the Corresponding Max Value in a Data Frame

I have the following code
import pandas as pd
df = {'sport' : ['football', 'hockey', 'baseball', 'basketball', 'nan'], 'league': ['NFL', 'NHL', 'MLB', 'NBA', 'NaN'], 'number': [1,2,3,4,'']}
df = pd.DataFrame(df)
df
I'd like to print out the sport with the highest number
I've tried the following code:
highestnumberedsport = df[df['number'] == df['number'].max()]
However, this gives me the entire row. I'd like to only print out the value in the sport column.
Update: I've tried the following based on suggestion, but this still does not output the desired string of 'basketball' in this instance:
df.loc[df['number'] == df['number'].max(), 'sport']
How do I print out only the sport that has the highest value?
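One possible approach, sketched under the assumption that the empty string in 'number' should be treated as missing: convert the column to numbers first, then look up the 'sport' value at the position of the maximum. The names numbers and top_sport are made up for illustration.

```python
import pandas as pd

# Data copied from the question; note the '' in the 'number' column
df = pd.DataFrame({'sport': ['football', 'hockey', 'baseball', 'basketball', 'nan'],
                   'league': ['NFL', 'NHL', 'MLB', 'NBA', 'NaN'],
                   'number': [1, 2, 3, 4, '']})

# Mixed ints and strings cannot be compared, so coerce non-numeric entries to NaN
numbers = pd.to_numeric(df['number'], errors='coerce')

# idxmax skips NaN and returns the index label of the largest value;
# .loc with a single label and column returns a scalar rather than a Series
top_sport = df.loc[numbers.idxmax(), 'sport']

print(top_sport)  # basketball
```

If several rows could share the maximum, df.loc[numbers == numbers.max(), 'sport'].tolist() would return all of them instead.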

Extract the country zip code based on the full country code - DataFrame Python

My data contains place information as a full post code, for example CZ25145. I would like to create a new column holding only the country prefix, here CZ. How can I do this?
I have this:
import pandas as pd
df = pd.DataFrame({
'CODE_LOAD_PLACE' : ['PL43100', 'CZ25905', 'DE29333', 'DE29384', 'SK92832']
},)
I would like to get it like below:
df = pd.DataFrame({
'CODE_LOAD_PLACE' : ['PL43100', 'CZ25905', 'DE29333', 'DE29384', 'SK92832'],
'COUNTRY_LOAD_PLACE' : ['PL', 'CZ', 'DE', 'DE', 'SK']
},)
I tried using .factorize and .groupby, but without success.
Use .str and select the first 2 characters:
df["COUNTRY_LOAD_PLACE"] = df["CODE_LOAD_PLACE"].str[:2]
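If the prefix is not guaranteed to be exactly two characters (an assumption that goes beyond the question), a regular expression that captures the leading letters might be a safer variant. A minimal sketch on the data from the question:

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    'CODE_LOAD_PLACE': ['PL43100', 'CZ25905', 'DE29333', 'DE29384', 'SK92832']
})

# Capture the run of letters at the start of each code;
# expand=False returns a Series instead of a one-column DataFrame
df['COUNTRY_LOAD_PLACE'] = df['CODE_LOAD_PLACE'].str.extract(r'^([A-Za-z]+)', expand=False)

print(df)
```

For codes that always start with two letters, this gives the same result as slicing with .str[:2].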

Pandas create multiple dataframe based on group from another dataframe

I have a pandas dataframe
df=pd.DataFrame({'Name':['Jhon','Andy','Jenny','Joan','Paul','Rosa'],
'Position':['Programmer','Designer','Programmer','Designer','Analyst','Analyst']})
I want to create several other dataframes based on the Position, and name each one "Job_as_" followed by the position.
Expected output would be
Job_as_Programmer=['Jhon','Jenny']
Job_as_Designer=['Andy','Joan']
Use pandas.DataFrame.groupby with pandas.Series.add_prefix:
df2 = df.groupby("Position")["Name"].apply(list)
df2.add_prefix("Job_as_").to_dict()
Output:
{'Job_as_Analyst': ['Paul', 'Rosa'],
'Job_as_Designer': ['Andy', 'Joan'],
'Job_as_Programmer': ['Jhon', 'Jenny']}
You could create a dictionary:
{"Job_as_"+ x : df.loc[df.Position==x, "Name"].to_list() for x in df.Position.unique()}
Output
{
'Job_as_Programmer': ['Jhon', 'Jenny'],
'Job_as_Designer': ['Andy', 'Joan'],
'Job_as_Analyst': ['Paul', 'Rosa']
}
You could also just use groupby, as below:
import pandas as pd
df=pd.DataFrame({'Name':['Jhon','Andy','Jenny','Joan','Paul','Rosa'],
'Position':['Programmer','Designer','Programmer','Designer','Analyst','Analyst']})
newDf = df.groupby(["Position" , "Name"]).first()
newDf #To Print Table
Output:
Position   Name
Analyst    Paul
           Rosa
Designer   Andy
           Joan
Programmer Jenny
           Jhon
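If actual dataframes (rather than lists of names) are wanted per position, iterating over the groupby object and collecting the pieces in a dictionary is one way to get them. This is just a sketch; the name frames is made up, and a dictionary is used instead of separately named variables.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({'Name': ['Jhon', 'Andy', 'Jenny', 'Joan', 'Paul', 'Rosa'],
                   'Position': ['Programmer', 'Designer', 'Programmer',
                                'Designer', 'Analyst', 'Analyst']})

# Iterating a groupby yields (key, sub-dataframe) pairs, one per position
frames = {
    f'Job_as_{position}': group.reset_index(drop=True)
    for position, group in df.groupby('Position')
}

print(frames['Job_as_Programmer'])
```

Each value in frames is a regular dataframe, so frames['Job_as_Designer']['Name'].tolist() gives back the list of names if that is all that is needed.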

Operations within DataFrameGroupBy

I am trying to understand how to apply a function within a 'groupby', that is, to each of the groups in a dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Stock' : ['apple', 'ford', 'google', 'samsung','walmart', 'kroger'],
'Sector' : ['tech', 'auto', 'tech', 'tech','retail', 'retail'],
'Price': np.random.randn(6),
'Signal' : np.random.randn(6)}, columns= ['Stock','Sector','Price','Signal'])
dfg = df.groupby(['Sector'],as_index=False)
type(dfg)
pandas.core.groupby.DataFrameGroupBy
I want to get the sum ( Price * (1/Signal) ) group by 'Sector'.
i.e. The resulting output should look like
Sector | Value
auto   |  0.744944
retail | -0.572164053
tech   | -1.454632
I can get the results by creating separate data frames, but I was looking for a way to operate within each of the grouped (sector) frames.
I can find mean or sum of Price
dfg.agg({'Price' : [np.mean, np.sum] }).head(2)
but not get sum ( Price * (1/Signal) ), which is what I need.
Thanks,
You provided random data, so there is no way we can get the exact number that you got. But based on what you just described, I think the following will do:
In [121]:
(df.Price/df.Signal).groupby(df.Sector).sum()
Out[121]:
Sector
auto -1.693373
retail -5.137694
tech -0.984826
dtype: float64
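To get the two-column Sector / Value layout from the question, one option is to compute the ratio as a temporary column and then group on 'Sector'. The sketch below uses fixed, made-up prices and signals (not the random data from the question) so the result can be checked by hand; the column name 'Value' follows the desired output in the question.

```python
import pandas as pd

# Made-up, fixed numbers so the sums are easy to verify by hand
df = pd.DataFrame({
    'Stock': ['apple', 'ford', 'google', 'samsung', 'walmart', 'kroger'],
    'Sector': ['tech', 'auto', 'tech', 'tech', 'retail', 'retail'],
    'Price': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'Signal': [2.0, 4.0, 1.0, 8.0, 5.0, 3.0],
})

# Compute Price * (1 / Signal) row by row, then sum it within each sector;
# as_index=False keeps 'Sector' as a regular column in the result
result = (
    df.assign(Value=df['Price'] / df['Signal'])
      .groupby('Sector', as_index=False)['Value']
      .sum()
)

print(result)
# auto: 2/4 = 0.5, retail: 5/5 + 6/3 = 3.0, tech: 1/2 + 3/1 + 4/8 = 4.0
```

Doing the arithmetic before grouping keeps everything vectorized, which is usually faster than applying a Python function to each group.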
