Using pandas with Python 3, working in Jupyter.
I've made this graph below using the following code:
temp3 = pd.crosstab(df['Credit_History'], df['Loan_Status'])
temp3.plot(kind = 'bar', stacked = True, color = ['red', 'blue'], grid = False)
print(temp3)
And then tried to do the same, but with divisions for Gender. I wanted to make this:
So I wrote this code:
And made this monstrosity. I'm unfamiliar with pivot tables in pandas, and after reading the documentation I'm still confused. I'm assuming that aggfunc affects the values given, but not the indices. How can I separate the loan status so that it reads as different colors for 'Y' and 'N'?
Trying a method similar to the one used for temp3 simply yields a KeyError:
temp3x = pd.crosstab(df['Credit_History'], df['Loan_Status', 'Gender'])
temp3x.plot(kind = 'bar', stacked = True, color = ['red', 'blue'], grid = False)
print(temp3x)
How can I make the 'Y' and 'N' appear separately as they are in the first graph, but for all 4 bars instead of using just 2 bars?
You need to make a new column called loan_status_word and then pivot.
df['loan_status_word'] = df['Loan_Status'].map({0: 'No', 1: 'Yes'})
df.pivot_table(index=['Credit_History', 'Gender'],
               columns='loan_status_word',
               aggfunc='size')
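If Loan_Status really is coded 0/1 (the assumption behind the map above), the pivoted counts can be plotted straight away to get the same stacked look as the first chart; a minimal sketch:
pivoted = df.pivot_table(index=['Credit_History', 'Gender'],
                         columns='loan_status_word',
                         aggfunc='size')
pivoted.plot(kind='bar', stacked=True, color=['red', 'blue'], grid=False)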
Try to format your data such that each item you want in your legend is its own column.
df = pd.DataFrame(
    [
        [3, 1],
        [4, 1],
        [1, 4],
        [1, 3]
    ],
    pd.MultiIndex.from_product([(1, 0), list('MF')], names=['Credit', 'Gender']),
    pd.Index(['Yes', 'No'], name='Loan Status')
)
df
Then you can plot
df.plot.bar(stacked=True)
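Each column of the frame ('Yes'/'No') becomes one colour in the stack, and each (Credit, Gender) pair in the MultiIndex becomes one of the four bars along the x-axis.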
Below is the code to achieve the desired result:
temp4 = pd.crosstab([df['Credit_History'], df['Gender']], df['Loan_Status'])
temp4.plot(kind='bar', stacked=True, color=['red', 'blue'], grid=False)
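As a quick sanity check on hypothetical toy data (not the original dataset), the crosstab gives one row per (Credit_History, Gender) pair and one column per Loan_Status value, which is exactly the shape the stacked bar plot needs:
demo = pd.DataFrame({'Credit_History': [1, 1, 0, 0, 1, 0],
                     'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
                     'Loan_Status': ['Y', 'N', 'N', 'Y', 'Y', 'N']})
print(pd.crosstab([demo['Credit_History'], demo['Gender']], demo['Loan_Status']))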
Related
I am trying to compare two sets of data, each with a long list of categorical variables using Pandas and Matplotlib. I want to get and somehow store the frequency of values for each variable using the value_counts() method for each data set so that I can later compare the two for statistically significant differences in those frequencies.
As of now I just have a function to display the values and counts for each column in a data frame as pie charts, given a list of columns (cat_columns) which is defined outside of the function:
def getCat(data):
    for column in cat_columns:
        plt.figure()
        data[column].value_counts().plot(kind='pie', autopct='%1.1f%%')
        plt.title(f"Distribution of {column} Patients in {dataname}")
        plt.ylabel('')
getCat(df)
Is it possible to append/store the returned values of value_counts() into a new DataFrame object corresponding to each original data set, so I can access and operate on those values later?
TIA!
I think you could first transform your value_counts Series into a DataFrame and then merge it with the original.
Below is an example of how to do that.
test_df = pd.DataFrame(
    {
        'id': [1, 2, 3, 4],
        'eye_colour': ['blue', 'brown', 'brown', 'green'],
        'city': ['Paris', 'Paris', 'Lyon', 'Paris']
    }
)
cat_counts = test_df['eye_colour'].value_counts().to_frame().reset_index()
cat_counts.columns = ['eye_colour', 'eye_color_count']
display(cat_counts)
# merge has no inplace option, so assign the result back to test_df
test_df = test_df.merge(cat_counts, how='left', on='eye_colour')
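If you want to keep the counts for several categorical columns around for later comparison, as in the original question, one option is to collect one counts frame per column in a dict; a sketch using the toy columns above (swap in your own cat_columns list):
counts_by_column = {}
for col in ['eye_colour', 'city']:
    counts = test_df[col].value_counts().rename_axis(col).reset_index(name='count')
    counts_by_column[col] = counts
print(counts_by_column['city'])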
I would like to create a graph that plots multiple lines onto one graph.
Here is an example dataframe (my actual dataframe is much larger):
df = pd.DataFrame({'first': [1, 2, 3], 'second': [1, 1, 5], 'Third' : [8,7,9], 'Person' : ['Ally', 'Bob', 'Jim']})
The lines I want plotted are row-wise, i.e. a line for Ally, a line for Bob and a line for Jim.
You can use the built-in plotting functions once the DataFrame has the right shape. The right shape in this case would be the Person names as columns and the former columns as index. So all you have to do is set Person as the index and transpose:
ax = df.set_index("Person").T.plot()
ax.set_xlabel("My label")
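After the transpose each person is a column, so plot() draws one line per person, with 'first', 'second' and 'Third' along the x-axis.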
First you should set the name as the index, then retrieve the values for each row:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'first': [1, 2, 3], 'second': [1, 1, 5], 'Third': [8, 7, 9], 'Person': ['Ally', 'Bob', 'Jim']})
df = df.set_index('Person')

for person in df.index:
    val = df.loc[person].values
    plt.plot(val, label=person)

plt.legend()
plt.show()
As for how you want to handle the 'first', 'second', 'Third' labels, I'll let you judge for yourself.
The goal is essentially to combine the two dataframes and keep the alphabetical headers from the Tk1P dataframe while integrating the data from the Tk1L dataframe. Unfortunately I am getting an unintended result when trying to merge. Please see the Giphy link below the code for the output screen, which shows both dataframes and the concat result. If anyone has ideas it would be very helpful. Thanks in advance.
import pandas as pd
import xlwings as xw

Tk1D = pd.read_excel('C:\\Users\\Sam\\Desktop\\DF2.xlsx', 1)
Tk1D = Tk1D.dropna()
Tk1D.drop(Tk1D.columns[[0, 1, 10]], inplace=True, axis=1)
#print("Tk1D: ", len(Tk1D), 'X', len(Tk1D.columns))
print('----------------------------------------------------------')
Tk1P = Tk1D.drop(['NT', 'PT'], axis=1)
Tk1P = Tk1P.drop(Tk1P.index[2:10035])
print(Tk1P)
print("Tk1P: ", len(Tk1P), 'X', len(Tk1P.columns))
print('----------------------------------------------------------')
Tk1L = xw.Book('C:\\Users\\Sam\\Desktop\\DF2.xlsx').sheets[1]
Tk1L = Tk1L.range('A2:N2').value
Tk1L = pd.DataFrame([Tk1L])
Tk1L.drop(Tk1L.columns[[0, 1, 10, 11, 12]], inplace=True, axis=1)
print(Tk1L)
print("Tk1L: ", len(Tk1L), 'X', len(Tk1L.columns))
print('----------------------------------------------------------')
TKP = pd.DataFrame(Tk1P.iloc[0]).transpose()
TKP.columns = Tk1P.columns
TKP = pd.concat([Tk1L, TKP], ignore_index=True)
print(TKP)
Giphy Dataframe and Concat Output
From your output it seems that you concatenated the DataFrames along the columns axis.
Try append instead of concat.
Tk1L.append(TKP)
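Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on newer versions the equivalent row-wise call is:
pd.concat([Tk1L, TKP])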
EDIT
Look at the following two lines
TKP.columns = Tk1P.columns
TKP = pd.concat([Tk1L, TKP], ignore_index=True)
You set the column names of TKP equal to the column names of Tk1P in the first line. But in the second line, you append TKP to Tk1L (!). So the following should solve your problem:
TKP.columns = Tk1L.columns
TKP = pd.concat([Tk1L, TKP], ignore_index=True)
I therefore guess that you simply mixed this up.
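As a small illustration of why the column names matter here (toy frames, not your data): concatenating frames whose columns don't match places them side by side and fills the gaps with NaN, while matching names stacks the rows as intended.
import pandas as pd

a = pd.DataFrame([[1, 2]], columns=['x', 'y'])
b = pd.DataFrame([[3, 4]], columns=['u', 'v'])
print(pd.concat([a, b], ignore_index=True))   # four columns, half of them NaN

b.columns = a.columns
print(pd.concat([a, b], ignore_index=True))   # two columns, two rows, as intended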
EDIT TWO
There is another problem in your code.
TKP = pd.DataFrame(Tk1P.iloc[0]).transpose()
The transpose likely messes things up. Tk1P is a 2 x 9 data frame. But when you transpose this, you get a 9 x 2 data frame. So, if you delete the transpose, you should be fine.
EDIT THREE
If you want the alphabetic column names, do
TKP.columns = Tk1P.columns
Consider the following dataframe:
test = pd.DataFrame({'A': [datetime.datetime.now(), datetime.datetime.now()], 'B': [1, 2]})
If I use pivot_table like below then everything is fine:
test.pivot_table(index = 'A', aggfunc = {'B': 'mean'}, margins = True)
However, if I do the following, I can't set margins = True (throws the error KeyError: 'A'):
test.pivot_table(index = test['A'], aggfunc = {'B': 'mean'}, margins = True)
I am really confused. Let's say I need to do something like below AND need to set margins = True. Is that impossible?
test.pivot_table(index = test['A'].dt.year, aggfunc = {'B': 'mean'}, margins = True)
Try:
test['Ax'] = test['A'].dt.year
test.pivot_table(index='Ax', aggfunc='mean', values='B', margins=True)
Outputs:
B
Ax
2020 1.5
All 1.5
Explanation: if you don't pass values, it defaults to df.columns (all columns of the dataframe you're pivoting over). https://github.com/pandas-dev/pandas/blob/v0.25.3/pandas/core/reshape/pivot.py#L87
So technically, by passing no values you were passing ALL columns into values, yet at the same time providing a function for just one of them, which is where the KeyError was coming from.
The source here is oddly out of step with the documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html
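For completeness, a self-contained version of the workaround above (column names taken from the question):
import datetime
import pandas as pd

test = pd.DataFrame({'A': [datetime.datetime.now(), datetime.datetime.now()],
                     'B': [1, 2]})
test['Ax'] = test['A'].dt.year
print(test.pivot_table(index='Ax', values='B', aggfunc='mean', margins=True))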
I am getting really strange results with a pandas DataFrame grouping operation. What I want to do is group by index (my index is non-unique), and then fill null values appropriately. This works in many cases but in some instances I am getting a strange behavior where an empty DataFrame is all that is returned:
df = pd.DataFrame(columns=['sample', 'cooling_rate'],
                  index=['SYd', 'SYd', 'XNa', 'Xna', 'Qza_new', 'Qza_new'],
                  data=[['SYd', 3], ['SYd', 3], ['XNa', 3],
                        ['XNa', 3], ['val1', 'val3'], ['val1', None]])
res = df.groupby(df.index).fillna('1')
#Empty DataFrame
#Columns: []
#Index: []
However, if I change the DataFrame ever so slightly, by renaming the index item 'Qza_new' to 'qza_new':
df = pd.DataFrame(columns=['sample', 'cooling_rate'],
                  index=['SYd', 'SYd', 'XNa', 'Xna', 'qza_new', 'qza_new'],
                  data=[['SYd', 3], ['SYd', 3], ['XNa', 3],
                        ['XNa', 3], ['val1', 'val3'], ['val1', None]])
res = df.groupby(df.index).fillna('1')
# sample cooling_rate
#SYd SYd 3
#SYd SYd 3
#XNa XNa 3
#Xna XNa 3
#qza_new val1 val3
#qza_new val1 1
The result is a properly grouped, filled DataFrame as expected. I can't make any sense of this behavior, and I'm not getting any sort of "error".
With more experimentation, it appears that the key is definitely in my DataFrame index line:
index=['SYd', 'SYd', 'XNa', 'Xna', 'qza_new', 'qza_new'],
It appears that the second to last value has to be earlier in the alphabet than the last value. In other words,
index=['SYd', 'SYd', 'XNa', 'XNa', 'a', 'b']
works and returns a filled in DataFrame, but:
index=['SYd', 'SYd', 'XNa', 'XNa', 'c', 'b']
returns an empty DataFrame. But why?
I suspect I must be missing something obvious, but I have no idea why I'm seeing this behavior.
Update:
This issue appears to be known: https://github.com/pandas-dev/pandas/issues/14955 Hopefully it will be fixed in the next release.
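Until then, a possible workaround (treat this as an untested sketch against the affected pandas version) is to do the fill per group with apply, which sidesteps the groupby.fillna path:
# sketch of a workaround: fill within each index group via apply instead of groupby.fillna
res = df.groupby(df.index, group_keys=False).apply(lambda g: g.fillna('1'))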