Python pandas pivot_table margins keyError - python

Consider the following dataframe:
test = pd.DataFrame({'A': [datetime.datetime.now(), datetime.datetime.now()], 'B': [1, 2]})
If I use pivot_table like below then everything is fine:
test.pivot_table(index = 'A', aggfunc = {'B': 'mean'}, margins = True)
However, if I do the following, I can't set margins = True (throws the error KeyError: 'A'):
test.pivot_table(index = test['A'], aggfunc = {'B': 'mean'}, margins = True)
I am really confused. Say I need to do something like the following AND need to set margins = True. Is that impossible?
test.pivot_table(index = test['A'].dt.year, aggfunc = {'B': 'mean'}, margins = True)

Try:
test['Ax']=test['A'].dt.year
test.pivot_table(index = 'Ax' , aggfunc = 'mean', values='B', margins = True)
Outputs:
        B
Ax
2020  1.5
All   1.5
Explanation: if you don't pass values, it defaults to df.columns (all columns of the dataframe you're pivoting). https://github.com/pandas-dev/pandas/blob/v0.25.3/pandas/core/reshape/pivot.py#L87
So by passing no values you were effectively passing ALL columns as values while providing an aggregation function for just one of them, which is where the KeyError came from.
The source seems to be at odds with the documentation here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html
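If you would rather not add a helper column to the original dataframe, the same idea can be sketched with assign, which returns a copy. This is only an illustration: the fixed dates and the column name Ax are assumptions chosen so the result is reproducible.

```python
import datetime

import pandas as pd

# Fixed dates (an assumption) instead of datetime.now(), so the result is stable.
test = pd.DataFrame({
    'A': [datetime.datetime(2020, 1, 1), datetime.datetime(2020, 6, 1)],
    'B': [1, 2],
})

# assign() returns a copy with the extra 'Ax' column; 'test' itself is unchanged.
table = test.assign(Ax=test['A'].dt.year).pivot_table(
    index='Ax', values='B', aggfunc='mean', margins=True
)
print(table)
```

Passing values='B' explicitly is what keeps the datetime column out of the aggregation.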

Related

pandas value_counts(): directly compare two instances

I have run .value_counts() on two DataFrames (on a similar column) and would like to compare the two results.
I also tried converting the resulting Series to DataFrames (.to_frame('counts'), as suggested in this thread), but it doesn't help.
first = df1['company'].value_counts()
second = df2['company'].value_counts()
I tried to merge, but I think the main problem is that the company name is not a column but the index (?). Is there a way to resolve this, or a different way to get the comparison?
GOAL: The end goal is to be able to see which companies occur more in df2 than in df1, and the value_counts() themselves (or the difference between them).
You might use collections.Counter's ability to subtract, as follows:
import collections
import pandas as pd
df1 = pd.DataFrame({'company':['A','A','A','B','B','C','Y']})
df2 = pd.DataFrame({'company':['A','B','B','C','C','C','Z']})
c1 = collections.Counter(df1['company'])
c2 = collections.Counter(df2['company'])
c1.subtract(c2)
print(c1)
gives output
Counter({'A': 2, 'Y': 1, 'B': 0, 'Z': -1, 'C': -2})
Explanation: a positive value means more instances are in df1, zero means the counts are equal, and a negative value means more instances are in df2.
Use this code:
df2['x'] = '2'
df1['x'] = '1'
df = pd.concat([df1[['company', 'x']], df2[['company', 'x']]])
df = pd.pivot_table(df, index='company', columns='x', aggfunc='size', fill_value=0).reset_index()
Then filter df for the rows you are interested in.
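A pandas-only alternative, sketched here with the same sample data as the Counter answer: since value_counts() returns a Series indexed by company, the two results can be subtracted directly and pandas aligns them on that index. The names diff and more_in_df2 are just illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'company': ['A', 'A', 'A', 'B', 'B', 'C', 'Y']})
df2 = pd.DataFrame({'company': ['A', 'B', 'B', 'C', 'C', 'C', 'Z']})

first = df1['company'].value_counts()
second = df2['company'].value_counts()

# Align on the company index; a company missing on one side counts as 0.
diff = second.sub(first, fill_value=0).astype(int)

# Companies that occur more often in df2 than in df1.
more_in_df2 = diff[diff > 0]
print(more_in_df2)
```

Here a positive value means the company occurs more often in df2, which is the opposite sign convention from the Counter example.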

Pandas : Empty dataframe after using concat()

I created an empty DataFrame, drugs:
drugs = pd.DataFrame({'name':[],
'value':[]})
after that, I wanted to add data from another data frame to it using a for loop:
for i in range(227,498):
    value = drug_users[drug_users[drug_users.columns[i]] == 1][drug_users.columns[i]].sum() / 10138
    name = drug_users.columns[i]
    d2 = pd.DataFrame({'name':[name],
                       'value':[value]})
    print(d2)
    pd.concat([drugs, d2], ignore_index = True, axis = 0)
but when I take a sample from drugs, I get the error:
ValueError: a must be greater than 0 unless no samples are taken
The concat function returns a new dataframe instead of changing the existing one. You need to assign the return value, e.g.:
drugs = pd.concat([drugs, d2], ignore_index = True, axis = 0)
You need to assign the return value of the concat function; I'm afraid it is not an in-place operation.
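As a side note, calling concat inside a loop copies the growing frame on every iteration. A common pattern is to collect the rows in a list and build the dataframe once. The sketch below uses a small made-up drug_users frame, and divides by the row count rather than the fixed 10138 from the question; both are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for drug_users: one 0/1 column per drug.
drug_users = pd.DataFrame({'drug_a': [1, 0, 1, 1], 'drug_b': [0, 0, 1, 0]})

rows = []
for name in drug_users.columns:
    # Share of rows where this drug's flag equals 1.
    value = (drug_users[name] == 1).sum() / len(drug_users)
    rows.append({'name': name, 'value': value})

# Build the result once, after the loop.
drugs = pd.DataFrame(rows)
print(drugs)
```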

Pandas fillna with method=None (default value) raises an error

I am writing a function to aid DataFrame merges between two tables. The function creates a mapping key in the first DataFrame using variables in the second DataFrame.
My issue arises when I try to include the .fillna(method=) at the end of the function.
# Import libraries
import pandas as pd
# Create data
data_1 = {"col_1": [1, 2, 3, 4, 5], "col_2": [1, None, 3, None, 5]}
data_2 = {"col_1": [1, 2, 3, 4, 5], "col_3": [1, None, 3, None, 5]}
df = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)
def merge_on_key(df, df2, join_how="left", fill_na=None):
    # Import libraries
    import pandas as pd
    # Code to create mapping key not required for question
    # Merge the two dataframes
    print(fill_na)
    print(type(fill_na))
    df3 = pd.merge(df, df2, how=join_how, on="col_1").fillna(method=fill_na)
    return df3
df3 = merge_on_key(df, df2)
output:
>>> None
>>> <class 'NoneType'>
error message:
ValueError: Must specify a fill 'value' or 'method'
My question is: why does passing fill_na, which is equal to None, fail when None is the default value of fillna's method parameter?
You have to supply either a 'value' or a 'method'. In your call to fillna both of them are None. In short, you're asking pandas to fill the missing (None) values in the dataframe without telling it what to fill them with, so it raises an exception.
Based on the docs, you could either pass a non-empty value:
df3 = pd.merge(df, df2, how=join_how, on="col_1").fillna(value=0, method=fill_na)
or change the method from None (which means "directly substitute the None values in the dataframe by the given value") to one of {'backfill', 'bfill', 'pad', 'ffill'} (each documented in the docs):
df3 = pd.merge(df, df2, how=join_how, on="col_1").fillna(method='backfill')
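One way to keep fill_na optional is to call a fill method only when the caller actually asked for one. The helper below is a hypothetical sketch (fill_missing is not a pandas function). It uses DataFrame.ffill() and DataFrame.bfill() rather than fillna(method=...), because recent pandas versions deprecate the method argument of fillna.

```python
import pandas as pd


def fill_missing(frame, fill_na=None, value=None):
    """Hypothetical helper: fill only when a method or value was requested."""
    if fill_na in ('ffill', 'pad'):
        return frame.ffill()
    if fill_na in ('bfill', 'backfill'):
        return frame.bfill()
    if value is not None:
        return frame.fillna(value)
    # Nothing requested: leave missing values untouched.
    return frame


df = pd.DataFrame({"col_1": [1, 2, 3, 4, 5], "col_2": [1, None, 3, None, 5]})
filled = fill_missing(df, fill_na='ffill')
print(filled)
```

With this shape, merge_on_key could pass its fill_na argument straight through and the default of None would simply skip the fill step.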

Why won't barchart in Pandas stack different values?

Using pandas with Python 3, working in Jupyter.
I've made the graph below using the following code:
temp3 = pd.crosstab(df['Credit_History'], df['Loan_Status'])
temp3.plot(kind = 'bar', stacked = True, color = ['red', 'blue'], grid = False)
print(temp3)
And then tried to do the same, but with divisions for Gender. I wanted to make this:
So I wrote this code:
And made this monstrosity. I'm unfamiliar with pivot tables in pandas, and after reading documentation, am still confused. I'm assuming that aggfunc affects the values given, but not the indices. How can I separate the loan status so that it reads as different colors for 'Y' and 'N'?
Trying a method similar to the methods used for temp3 simply yields a key error:
temp3x = pd.crosstab(df['Credit_History'], df['Loan_Status', 'Gender'])
temp3x.plot(kind = 'bar', stacked = True, color = ['red', 'blue'], grid = False)
print(temp3)
How can I make the 'Y' and 'N' appear separately as they are in the first graph, but for all 4 bars instead of using just 2 bars?
You need to make a new column called loan_status_word and then pivot.
df['loan_status_word'] = df['Loan_Status'].map({0: 'No', 1: 'Yes'})
df.pivot_table(values='Loan_Status',
               index=['Credit_History', 'Gender'],
               columns='loan_status_word',
               aggfunc='size')
Try to format your data such that each item you want in your legend is in a single column.
df = pd.DataFrame(
    [
        [3, 1],
        [4, 1],
        [1, 4],
        [1, 3]
    ],
    pd.MultiIndex.from_product([(1, 0), list('MF')], names=['Credit', 'Gender']),
    pd.Index(['Yes', 'No'], name='Loan Status')
)
df
Then you can plot
df.plot.bar(stacked=True)
Below is the code to achieve the desired result:
temp4=pd.crosstab([df['Credit_History'],df['Gender']],df['Loan_Status'])
temp4.plot(kind='bar',stacked=True,color=['red','blue'],grid=False)
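To see why the crosstab version stacks correctly, it helps to look at the table it produces. The sketch below uses a small made-up sample in place of the loan dataset; the column names match the question, the values are assumptions.

```python
import pandas as pd

# Hypothetical sample standing in for the loan dataset.
df = pd.DataFrame({
    'Credit_History': [1, 1, 1, 0, 0, 1],
    'Gender': ['M', 'F', 'M', 'M', 'F', 'F'],
    'Loan_Status': ['Y', 'Y', 'N', 'N', 'N', 'Y'],
})

# A list of row keys gives a (Credit_History, Gender) MultiIndex;
# each Loan_Status value becomes its own column, i.e. its own bar segment.
temp4 = pd.crosstab([df['Credit_History'], df['Gender']], df['Loan_Status'])
print(temp4)
```

Each row of temp4 becomes one bar and each column one stacked color, so the 'N' and 'Y' columns map onto the two entries of color=['red', 'blue'].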

Pandas DataFrame Add column to index without resetting

How do I add 'd' to the index below without having to reset it first?
from pandas import DataFrame
df = DataFrame( {'a': range(6), 'b': range(6), 'c': range(6)} )
df.set_index(['a','b'], inplace=True)
df['d'] = range(6)
# how do I set index to 'a b d' without having to reset it first?
df.reset_index(['a','b','d'], inplace=True)
df.set_index(['a','b','d'], inplace=True)
df
We added an append option to set_index. Try that.
The command is:
df.set_index(['d'], append=True)
(we don't need to specify ['a', 'b'], as they already are in the index and we're appending to them)
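Put together with the setup from the question, a minimal sketch looks like this. Note that set_index returns a new dataframe unless inplace=True is passed, so the result is assigned back here.

```python
import pandas as pd

df = pd.DataFrame({'a': range(6), 'b': range(6), 'c': range(6)})
df = df.set_index(['a', 'b'])
df['d'] = range(6)

# append=True keeps the existing levels and adds 'd' as a new innermost level.
df = df.set_index('d', append=True)
print(df.index.names)
```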
Your code is not valid, reset_index has no inplace argument in my version of pandas (0.8.1).
The following achieves what you want, though there is probably a more elegant way. You haven't said why you want to avoid reset_index.
df2.index = MultiIndex.from_tuples([(x,y,df2['d'].values[i]) for i,(x,y) in enumerate(df2.index.values)])
HTH
