Different questions about pandas pivot tables - python

Here's my df:
df=pd.DataFrame(
{
'Color': ['red','blue','red','red','green','red','yellow'],
'Type': ['Oil', 'Aluminium', 'Oil', 'Oil', 'Cement Paint', 'Synthetic Rubber', 'Emulsion'],
'Finish' : ['Satin', 'Matte', 'Matte', 'Satin', 'Semi-gloss', 'Satin', 'Satin'],
'Use' : ['Interior', 'Exterior', 'Interior', 'Interior', 'Exterior', 'Exterior', 'Exterior'],
'Price' : [55, 75, 60, 60, 55, 75, 50]
}
)
I want to create a pivot table that outputs 'Color', the color count, the percentage (weight) of each color count, and finally a total row showing the total color count next to 100%. Additionally, I'd like to add a header with today's date in the following format (02 - Nov).
Here is my current pivot with my approximate attempt:
today = datetime.date.today()
today_format = today.strftime("%d-%b")
pivot_table = pd.pivot_table(
    data=df,
    index='Color',
    aggfunc={'Color': 'count'}
)
df['Color'].value_counts(
    normalize=True
).mul(100).round(1).astype(str) + '%'
Is there a way to add more information to the pivot as a header, total and extra column? Or should I just convert the pivot back to a DataFrame and edit it from there?
The main difficulty I'm finding is that since I'm handling string data, when I use aggfunc='sum' it actually concatenates the strings. And if I try to add margins=True, margins_name='Total count', I get the following error:
if isinstance(aggfunc[k], str):
KeyError: 'Type'
The desired table output would look something like this:

Updated Answer
Thanks to a great suggestion by Rabinzel, we can also have today's date as a column header as well:
df = (df['Color'].value_counts().reset_index()
        .pivot_table(index=['index'], aggfunc=np.sum, margins=True, margins_name='Total')
        .assign(perc=lambda x: x['Color'] / x.iloc[:-1]['Color'].sum() * 100)
        .rename(columns={'Color': 'Color Count',
                         'perc': '%'}))
# '%#d' drops the leading zero on Windows; use '%-d' on Linux/macOS
new_cols = pd.MultiIndex.from_product([[datetime.today().strftime('%#d-%b')], df.columns])
df.columns = new_cols
df
2-Nov
Color Count %
index
blue 1 14.285714
green 1 14.285714
red 4 57.142857
yellow 1 14.285714
Total 7 100.000000
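For reference, the same table can also be built without pivot_table at all. A minimal sketch (the column names 'Color Count' and '%' are chosen here to match the desired output):

```python
import pandas as pd

df = pd.DataFrame({'Color': ['red', 'blue', 'red', 'red', 'green', 'red', 'yellow']})

# count per color and its share of the total
counts = df['Color'].value_counts()
out = pd.DataFrame({'Color Count': counts,
                    '%': counts / counts.sum() * 100})
# append the total row: total count next to 100%
out.loc['Total'] = [counts.sum(), 100.0]
print(out)
```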

Related

Group By Distinct in Pandas

I have a script like this in pandas:
dfmi['Time'] = pd.to_datetime(dfmi['Time'], format='%H:%M:%S')
dfmi['hours'] = dfmi['Time'].dt.hour
sum_dh = dfmi.groupby(['Date','hours']).agg({'Amount': 'sum', 'Price':'sum'})
dfdhsum = pd.DataFrame(sum_dh)
dfdhsum.columns = ['Amount', 'Gas Sales']
dfdhsum
And the output:
I want a distinct group-by sum, with the final result looking like this:
What would the pandas code for this be?
I don't understand exactly what you want, but this instruction will sum hours, Amount and Gas Sales for each date:
dfmi.groupby("Date").agg({'hours': 'sum', 'Amount': 'sum', 'Gas Sales': 'sum'})
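Since dfmi's real contents aren't shown, here is the same aggregation as a runnable sketch with made-up data standing in for it:

```python
import pandas as pd

# hypothetical sample data standing in for dfmi (column names assumed from the question)
dfmi = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-01', '2021-01-02'],
    'hours': [1, 2, 1],
    'Amount': [10, 20, 30],
    'Price': [100, 200, 300],
})

# one row per date, summing the numeric columns
sum_d = dfmi.groupby('Date').agg({'Amount': 'sum', 'Price': 'sum'})
print(sum_d)
```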

Add a new column containing the difference between EACH TWO ROWS of another column of a data frame

I would like to get the difference between each 2 rows of the column Duration and then fill the values into a new column difference, or print it.
So basically I want: row(1)-row(2)=difference1, row(3)-row(4)=difference2, row(5)-row(6)=difference3 ....
Example of a code:
data = {'Profession':['Teacher', 'Banker', 'Teacher', 'Judge','lawyer','Teacher'], 'Gender':['Male','Male', 'Female', 'Male','Male','Female'],'Size':['M','M','L','S','S','M'],'Duration':['5','6','2','3','4','7']}
data2={'Profession':['Doctor', 'Scientist', 'Scientist', 'Banker','Judge','Scientist'], 'Gender':['Male','Male', 'Female','Female','Male','Male'],'Size':['L','M','L','M','L','L'],'Duration':['1','2','9','10','1','17']}
data3 = {'Profession':['Banker', 'Banker', 'Doctor', 'Doctor','lawyer','Teacher'], 'Gender':['Male','Male', 'Female', 'Female','Female','Male'],'Size':['S','M','S','M','L','S'],'Duration':['15','8','5','2','11','10']}
data4={'Profession':['Judge', 'Judge', 'Scientist', 'Banker','Judge','Scientist'], 'Gender':['Female','Female', 'Female','Female','Female','Female'],'Size':['M','S','L','S','M','S'],'Duration':['1','2','9','10','1','17']}
df= pd.DataFrame(data)
df2=pd.DataFrame(data2)
df3=pd.DataFrame(data3)
df4=pd.DataFrame(data4)
DATA=pd.concat([df,df2,df3,df4])
DATA.groupby(['Profession','Size','Gender']).agg('sum')
D=DATA.reset_index()
D['difference']=D['Duration'].diff(-1)
I tried using diff(-1) but it's not exactly what I'm looking for. Any ideas?
Is that what you wanted?
D["Neighbour"]=D["Duration"].shift(-1)
# fill empty lines with 0
D["Neighbour"] = D["Neighbour"].fillna(0)
# convert columns "Neighbour" and "Duration" to numeric
D["Neighbour"] = pd.to_numeric(D["Neighbour"])
D["Duration"] = pd.to_numeric(D["Duration"])
# get difference
D["difference"]=D["Duration"] - D["Neighbour"]
# remove "Neighbour" column
D = D.drop(columns=["Neighbour"], axis=1)
# blank out every second row so only one difference per pair remains
D.loc[1::2,"difference"] = None
# print D
D
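The shift-and-subtract steps above can also be collapsed into one vectorized expression. A sketch assuming the same Duration column of strings:

```python
import pandas as pd

D = pd.DataFrame({'Duration': ['5', '6', '2', '3', '4', '7']})
dur = pd.to_numeric(D['Duration'])

# diff(-1) computes row[i] - row[i+1]; slicing [::2] keeps one value per pair,
# and index alignment leaves NaN on the odd rows automatically
D['difference'] = dur.diff(-1)[::2]
print(D)
```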

Adding a column to a pandas dataframe with values from columns of another dataframe, depending on a key from a dictionary

I have the following two dataframes:
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon'], ["blala Amazon", '04/28/2017', 'Amazon'], ['blabla Netflix', '06/28/2017', 'Netflix']], columns=['text', 'date', 'keyword'])
df2 = pd.DataFrame([['01/28/2017', '3.4', '10.2'], ['02/30/2017', '3.7', '10.5'], ['03/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAssets.1', 'ReturnOnAssets.2'])
(perhaps it's clearer in the screenshots here: https://imgur.com/a/YNrWpR2)
The df2 is much larger than shown here - it contains columns for 100 companies. So for example, for the 10th company, the column names are: ReturnOnAssets.10, etc.
I have created a dictionary which maps the company names to the column names:
stocks = {'Microsoft':'','Apple' :'1', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Johnson & Johnson':'5',
'JPMorgan' :'6', 'Alphabet': '7'}
and so on.
Now, what I am trying to achieve is adding a column "ReturnOnAssets" from d2 to d1, but for a specific company and for a specific date. So looking at df1, the first tweet (i.e. "text") contains a keyword "Amazon" and it was posted on 04/28/2017. I now need to go to df2 to the relevant column name for Amazon (i.e. "ReturnOnAssets.2") and fetch the value for the specified date.
So what I expect looks like this:
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon', **'10.5'**], ["blala Amazon", '04/28/2017', 'Amazon', 'x'], ['blabla Netflix', '06/28/2017', 'Netflix', 'x']], columns=['text', 'date', 'keyword', 'ReturnOnAssets'])
By x I mean values which where not included in the example df1 and df2.
I am fairly new to pandas and I can't wrap my head around it. I tried:
keyword = df1['keyword']
txt = 'ReturnOnAssets.'+ stocks[keyword]
df1['ReturnOnAssets'] = df2[txt]
But I don't know how to fetch the relevant date, and this also gives me an error: "'Series' objects are mutable, thus they cannot be hashed", which probably comes from the fact that I cannot just add a whole column of keywords to the text string.
I don't know how to achieve the operation I need to do, so I would appreciate help.
It can probably be shortened, and you can add if statements to deal with missing values.
import pandas as pd
import numpy as np

df1 = pd.DataFrame([["blala Amazon", '05/28/2017', 'Amazon'], ["blala Facebook", '04/28/2017', 'Facebook'], ['blabla Netflix', '06/28/2017', 'Netflix']], columns=['text', 'dates', 'keyword'])
df2 = pd.DataFrame([['06/28/2017', '3.4', '10.2'], ['05/28/2017', '3.7', '10.5'], ['04/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAsset.1', 'ReturnOnAsset.2'])

# creating a bigger df2 to cover all the way to Netflix
for i in range(9):
    df2['ReturnOnAsset.' + str(i)] = np.random.randint(1, 1000, df1.shape[0])

stocks = {'Microsoft': '0', 'Apple': '1', 'Amazon': '2', 'Facebook': '3',
          'Berkshire Hathaway': '4', 'Johnson & Johnson': '5',
          'JPMorgan': '6', 'Alphabet': '7', 'Netflix': '8'}

# new column in which to store the values
df1['ReturnOnAsset'] = np.nan
for index, row in df1.iterrows():
    colname = 'ReturnOnAsset.' + stocks[row['keyword']]
    # .loc avoids chained assignment; .squeeze() extracts the matching scalar
    df1.loc[index, 'ReturnOnAsset'] = df2.loc[df2['dates'] == row['dates'], colname].squeeze()
Next time please give us correct test data; I modified your dates and dictionary so the first and second columns match (the Netflix and Amazon values).
This code will work if and only if all dates from df1 are in df2 (note that the column is named date in df1 and dates in df2):
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon'], ["blala Amazon", '04/28/2017', 'Amazon'], ['blabla Netflix', '02/30/2017', 'Netflix']], columns=['text', 'date', 'keyword'])
df2 = pd.DataFrame([['04/28/2017', '3.4', '10.2'], ['02/30/2017', '3.7', '10.5'], ['03/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAssets.1', 'ReturnOnAssets.2'])
stocks = {'Microsoft':'','Apple' :'5', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Netflix':'1',
'JPMorgan' :'6', 'Alphabet': '7'}
df1["ReturnOnAssets"] = [
    df2["ReturnOnAssets." + stocks[df1["keyword"][index]]][
        df2.index[df2["dates"] == df1["date"][index]][0]
    ]
    for index in range(len(df1))
]
df1
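An alternative that avoids per-row lookups entirely is to reshape df2 to long form with melt and then merge. A sketch using trimmed versions of the frames above (dates adjusted so both rows have a match):

```python
import pandas as pd

df1 = pd.DataFrame([['blala Amazon', '02/30/2017', 'Amazon'],
                    ['blabla Netflix', '03/28/2017', 'Netflix']],
                   columns=['text', 'date', 'keyword'])
df2 = pd.DataFrame([['02/30/2017', '3.7', '10.5'],
                    ['03/28/2017', '6.0', '10.9']],
                   columns=['dates', 'ReturnOnAssets.1', 'ReturnOnAssets.2'])
stocks = {'Netflix': '1', 'Amazon': '2'}

# one row per (date, company column) instead of one column per company
long = (df2.melt(id_vars='dates', var_name='col', value_name='ReturnOnAssets')
           .rename(columns={'dates': 'date'}))
# map each keyword to its df2 column name, then join on date + column
df1['col'] = 'ReturnOnAssets.' + df1['keyword'].map(stocks)
out = df1.merge(long, on=['date', 'col'], how='left')
print(out[['text', 'date', 'keyword', 'ReturnOnAssets']])
```

Rows whose date or keyword has no match in df2 simply come back as NaN, which also covers the missing-value case mentioned above.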

Replace multiple year columns with one 'Year' column

I'm working with worldbank data and I'm trying to create some graphs representing time, but the data I have now looks like this:
As I don't think there's a way to convert it to a datetime, I think the only option is to replace all these year columns with a single column called 'Year', with the current column names as its values and the current cell values in a separate column.
Is there a nice function in Python that allows that, or would I have to iterate through the entire dataframe?
Edit to include some code:
df2 = pd.DataFrame({'Country Name': ['Aruba', 'Afghanistan', 'Angola'],
                    'Country Code': ['ABW', 'AFG', 'AGO'],
                    '1960': [65.66, 32.29, 33.25],
                    '1961': [66.07, 32.74, 33.57],
                    '1962': [66.44, 33.18, 33.91],
                    '1963': [66.79, 33.62, 34.27],
                    '1964': [66.11, 34.06, 34.65],
                    '1965': [67.44, 34.49, 35.03]}).set_index('Country Name')
You can try taking the transpose of the dataframe: the year values become rows, which you can then rename as 'Year' and use in the plots.
You can try something like this :
import pandas as pd
from matplotlib import pyplot as plt
df1 = pd.DataFrame({'Country Name': ['Aruba', 'Afghanistan', 'Angola'],
                    'Country Code': ['ABW', 'AFG', 'AGO'],
                    '1960': [65.66, 32.29, 33.25],
                    '1961': [66.07, 32.74, 33.57],
                    '1962': [66.44, 33.18, 33.91],
                    '1963': [66.79, 33.62, 34.27],
                    '1964': [66.11, 34.06, 34.65],
                    '1965': [67.44, 34.49, 35.03]})
df2 = df1.transpose()
df2.columns = df1['Country Name']
df2 = df2[2:]
df2['Year'] = df2.index.values
plt.plot(df2['Year'], df2['Aruba'])
plt.plot(df2['Year'], df2['Afghanistan'])
plt.plot(df2['Year'], df2['Angola'])
plt.legend()
plt.show()
Output: a line plot of Aruba, Afghanistan and Angola by year.
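The wide-to-long reshape the question asks for is also exactly what pd.melt does. A sketch with a trimmed version of the sample frame:

```python
import pandas as pd

df2 = pd.DataFrame({'Country Name': ['Aruba', 'Afghanistan', 'Angola'],
                    'Country Code': ['ABW', 'AFG', 'AGO'],
                    '1960': [65.66, 32.29, 33.25],
                    '1961': [66.07, 32.74, 33.57]})

# melt the year columns into a single 'Year' column,
# with the old cell values in a 'Value' column
long = df2.melt(id_vars=['Country Name', 'Country Code'],
                var_name='Year', value_name='Value')
print(long)
```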

Operations within DataFrameGroupBy

I am trying to understand how to apply a function within a groupby, i.e. to each of the groups of a dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Stock': ['apple', 'ford', 'google', 'samsung', 'walmart', 'kroger'],
                   'Sector': ['tech', 'auto', 'tech', 'tech', 'retail', 'retail'],
                   'Price': np.random.randn(6),
                   'Signal': np.random.randn(6)}, columns=['Stock', 'Sector', 'Price', 'Signal'])
dfg = df.groupby(['Sector'],as_index=False)
type(dfg)
pandas.core.groupby.DataFrameGroupBy
I want to get the sum(Price * (1/Signal)) grouped by 'Sector'.
i.e. The resulting output should look like
Sector | Value
auto   |  0.744944
retail | -0.572164053
tech   | -1.454632
I can get the results by creating separate data frames, but I was looking for a way to operate within each of the grouped (sector) frames.
I can find mean or sum of Price
dfg.agg({'Price' : [np.mean, np.sum] }).head(2)
but not get sum ( Price * (1/Signal) ), which is what I need.
Thanks,
You provided random data, so there is no way we can get the exact number that you got. But based on what you just described, I think the following will do:
In [121]:
(df.Price/df.Signal).groupby(df.Sector).sum()
Out[121]:
Sector
auto -1.693373
retail -5.137694
tech -0.984826
dtype: float64
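The same per-group computation can also be phrased with apply instead of pre-computing the ratio. A small sketch with fixed numbers so the result is reproducible:

```python
import pandas as pd

df = pd.DataFrame({'Sector': ['tech', 'auto', 'tech'],
                   'Price': [1.0, 2.0, 3.0],
                   'Signal': [2.0, 4.0, 6.0]})

# sum(Price * (1/Signal)) evaluated inside each Sector group
result = df.groupby('Sector')[['Price', 'Signal']].apply(
    lambda g: (g['Price'] / g['Signal']).sum())
print(result)
```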
