Weighted Pie Chart Pandas - python

I'd like to create a weighted pie chart using pandas. Here is a simple example to build off of.
import pandas as pd
data = [['red', 10], ['orange', 15], ['blue', 14], ['red', 8],
['orange', 11], ['blue', 20]]
df = pd.DataFrame(data, columns = ['color', 'weight'])

The simplest solution is to sum the weights for each color into a "total" column and create the pie chart from there.
new_df = df.groupby(['color'])[['weight']].sum()
new_df = new_df.reset_index()
new_df.columns = ['color', 'total']
From there, I prefer to use Plotly Express:
import plotly.express as px
fig = px.pie(new_df, values='total', names='color', title='...')
fig.show()
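If Plotly is not available, the same totals can be inspected (and drawn) with pandas alone. This is a minimal sketch; the names totals and shares are illustrative and not from the answer above, and the commented-out plotting call assumes matplotlib is installed.

```python
import pandas as pd

data = [['red', 10], ['orange', 15], ['blue', 14], ['red', 8],
        ['orange', 11], ['blue', 20]]
df = pd.DataFrame(data, columns=['color', 'weight'])

# Sum the weights per color; the result is a Series indexed by color.
totals = df.groupby('color')['weight'].sum()

# Each slice's share of the pie is its total divided by the grand total.
shares = totals / totals.sum()
print(shares.round(3))

# Assumption: matplotlib is installed. Uncomment to draw the chart.
# totals.plot.pie(autopct='%1.1f%%')
```

The pie chart only needs the per-color totals, so the grouping step is the part that makes the chart "weighted".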

Related

Plotting a stacked column containing a categorical list using Pandas

I have a Pandas dataframe with categorical data stored in a list. I would like to plot a stacked bar plot with col3 on the x-axis and col1 and col2 stacked on top of each other for the y-axis.
Reproducible dataframe structure:
import pandas as pd
import matplotlib.pyplot as plt
d = {'col1': [1, 17, 40],
'col2': [10, 70, 2],
'col3': [['yellow', 'green', 'blue'],
['yellow', 'orange', 'blue'],
['blue', 'green', 'pink']]}
df = pd.DataFrame(data=d)
Use:
df.explode('col3').set_index('col3').plot.bar(stacked=True)
Or:
df1 = (df.explode('col3')
.melt('col3')
.pivot_table(index='col3', columns='variable', values='value', aggfunc='sum'))
df1.plot.bar(stacked=True)
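To see what the second variant feeds into the plot, it can help to print the intermediate table. The sketch below repeats the frame from the question and only inspects the data; the name exploded is illustrative, and no plotting library is needed for this step.

```python
import pandas as pd

d = {'col1': [1, 17, 40],
     'col2': [10, 70, 2],
     'col3': [['yellow', 'green', 'blue'],
              ['yellow', 'orange', 'blue'],
              ['blue', 'green', 'pink']]}
df = pd.DataFrame(data=d)

# explode() gives each list element its own row, repeating col1 and col2.
exploded = df.explode('col3')

# melt() stacks col1/col2 into (variable, value) pairs; pivot_table() then
# sums the values per color, so every color appears exactly once.
df1 = (exploded
       .melt('col3')
       .pivot_table(index='col3', columns='variable', values='value', aggfunc='sum'))
print(df1)
```

The first variant plots the exploded rows as they are, so a color that occurs in several lists gets several bars; the pivoted variant aggregates them into one bar per color.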

What is wrong here in colouring the Excel sheet?

Here I need to colour rows with Age<13 'red' and rows with Age>=13 'green'. But the final 'Report.xlsx' isn't getting coloured. What is wrong here?
import pandas as pd
data = [['tom', 10], ['nick', 12], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
df_styled = df.style.applymap(lambda x: 'background:red' if x < 13 else 'background:green', subset=['Age'])
df_styled.to_excel('Report.xlsx',engine='openpyxl',index=False)

How to make a jointplot in Seaborn with multiple groups or categories?

I am trying to make a jointplot in Seaborn. The goal is to have a scatter plot of all [x,z] values and to have these color-coded by [cat], and to have the distributions for these two categories. Then I also want a scatter and distribution plot of [x,alt_Z], ignoring the alt_Z values that are NaN.
Using Python 3.7
Here is a stand-alone dataset and my goal (made in Excel, so the distributions are not shown).
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import seaborn as sns
col1 = [1,1.5,3.1,3.4,2,-1]
col2 = [1,-3,2,8,2.5,-1.3]
col3 = [4,3,4,0.5,1,0.3]
col4 = [10,12,10,'NaN',13,'NaN']
col5 = ['A','A','A','B','A','B']
df = pd.DataFrame(list(zip(col1, col2, col3, col4, col5)),
columns =['x', 'y', 'z', 'alt_Z', 'cat'])
display(df)
The code below doesn't finish the plot and returns TypeError: The y variable is categorical, but one of ['numeric', 'datetime'] is required. I also don't know how, in the code below, to group by [cat] A & B, so it is shown as red and only the A category is plotted.
df2 = df[['x', 'y', 'z', 'alt_Z', 'cat']]\
.melt(id_vars=['x', 'y'], value_vars=['z', 'alt_Z'])
g = sns.jointplot(data=df2, x='x', y='value', hue='variable',
palette={'z': 'black', 'alt_Z': 'red'})
One problem with the dataframe is that col4 contains integers and the string 'NaN'. Because the values are a mix of integers and strings, pandas makes it a column of objects. Converting it to float creates a proper float column in which the missing entries are real NaN values.
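A quick way to check this is to look at the dtype before and after the conversion. This is a minimal sketch using only the alt_Z values from the question; pd.to_numeric with errors='coerce' is shown as an alternative that is not part of the original answer.

```python
import pandas as pd

alt_z = pd.Series([10, 12, 10, 'NaN', 13, 'NaN'], name='alt_Z')
print(alt_z.dtype)            # object, because integers and strings are mixed

# float('NaN') is a valid conversion, so astype(float) yields real NaN values.
as_float = alt_z.astype(float)
print(as_float.dtype)         # float64

# Alternative: to_numeric turns anything it cannot parse into NaN as well.
coerced = pd.to_numeric(alt_z, errors='coerce')
print(coerced.isna().sum())   # 2
```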
To create the scatter plot, two calls to sns.scatterplot() will do:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
col1 = [1, 1.5, 3.1, 3.4, 2, -1]
col2 = [1, -3, 2, 8, 2.5, -1.3]
col3 = [4, 3, 4, 0.5, 1, 0.3]
col4 = [10, 12, 10, 'NaN', 13, 'NaN']
col5 = ['A', 'A', 'A', 'B', 'A', 'B']
df = pd.DataFrame(list(zip(col1, col2, col3, col4, col5)),
columns=['x', 'y', 'z', 'alt_Z', 'cat'])
df['alt_Z'] = df['alt_Z'].astype(float)
ax = sns.scatterplot(data=df, x='x', y='alt_Z', color='black', label='alt_Z')
sns.scatterplot(data=df, x='x', y='z', hue='cat', ax=ax)
plt.show()
From here, we can create two dataframes: df1 containing x, z and cat, and df2 containing x and alt_Z. Renaming alt_Z to z and filling a cat column with the string alt_Z makes df2 similar to df1.
The jointplot() can then operate on the concatenation of both dataframes:
df1 = df[['x', 'z', 'cat']]
df2 = df[['x', 'alt_Z']].rename(columns={'alt_Z': 'z'}).dropna()
df2['cat'] = 'alt_Z'
# DataFrame.append() was removed in pandas 2.0; pd.concat() does the same job
g = sns.jointplot(data=pd.concat([df1, df2]), x='x', y='z', hue='cat', palette={'alt_Z': 'black', 'A': 'orange', 'B': 'green'})
g.ax_joint.set_xlim(-3, 6) # the default limits are too wide for this reduced test data
plt.show()
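Before plotting, it can be worth checking the combined long-format table. The sketch below covers only the data preparation and leaves seaborn out; the variable name combined is illustrative, and the frame is rebuilt with real NaN values instead of the 'NaN' strings.

```python
import pandas as pd

df = pd.DataFrame({
    'x': [1, 1.5, 3.1, 3.4, 2, -1],
    'z': [4, 3, 4, 0.5, 1, 0.3],
    'alt_Z': [10, 12, 10, float('nan'), 13, float('nan')],
    'cat': ['A', 'A', 'A', 'B', 'A', 'B'],
})

df1 = df[['x', 'z', 'cat']]
# Rows without an alt_Z value are dropped before relabelling.
df2 = df[['x', 'alt_Z']].rename(columns={'alt_Z': 'z'}).dropna()
df2['cat'] = 'alt_Z'

# One row per point, with 'cat' telling the three groups apart.
combined = pd.concat([df1, df2], ignore_index=True)
print(combined['cat'].value_counts())
```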

Find rows in dataframe that contain words that are bigrams/trigrams

This example is for finding bigrams:
Given:
import pandas as pd
data = [['tom', 10], ['jobs', 15], ['phone', 14],['pop', 16], ['they_said', 11], ['this_example', 22],['lights', 14]]
test = pd.DataFrame(data, columns = ['Words', 'Freqeuncy'])
test
I'd like to write a query to only find words that are separated by a "_" such that the returning df would look like this:
data2 = [['they_said', 11], ['this_example', 22]]
test2 = pd.DataFrame(data2, columns = ['Words', 'Freqeuncy'])
test2
I'm wondering why something like this doesn't work: data[data['Words'] == (len> 3)]
To use a function, you need to use apply:
test[test.apply(lambda x: len(x['Words']), axis=1) > 3]
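The comparison in the question fails because len > 3 compares the built-in function itself with a number instead of calling it on each word, which raises a TypeError in Python 3. A vectorised alternative to apply, sketched here with the question's data, is the .str.len() accessor; the names mask and long_words are illustrative.

```python
import pandas as pd

data = [['tom', 10], ['jobs', 15], ['phone', 14], ['pop', 16],
        ['they_said', 11], ['this_example', 22], ['lights', 14]]
test = pd.DataFrame(data, columns=['Words', 'Freqeuncy'])

# Build a boolean mask with one True/False per row, then index with it.
mask = test['Words'].str.len() > 3
long_words = test[mask]
print(long_words)
```

Note that a length filter alone does not isolate bigrams: it also keeps single words such as 'phone'.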
The pandas way of doing it is like this:
import pandas as pd
data = [['tom', 10], ['jobs', 15], ['phone', 14],['pop', 16], ['they_said', 11], ['this_example', 22],['lights', 14]]
test = pd.DataFrame(data, columns = ['Words', 'Freqeuncy'])
test = test[test.Words.str.contains('_')]
test
To do the opposite, you can do:
test = test[~test.Words.str.contains('_')]
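If bigrams and trigrams need to be told apart, counting the separators is one option. This is a sketch, assuming tokens are always joined by a single "_"; the row 'new_york_city' is a made-up example that is not in the original data, and the names n_sep, bigrams and trigrams are illustrative.

```python
import pandas as pd

data = [['tom', 10], ['they_said', 11], ['this_example', 22],
        ['new_york_city', 5]]  # the last row is hypothetical
test = pd.DataFrame(data, columns=['Words', 'Freqeuncy'])

# n tokens joined by "_" contain n - 1 underscores.
n_sep = test['Words'].str.count('_')
bigrams = test[n_sep == 1]
trigrams = test[n_sep == 2]
print(bigrams)
print(trigrams)
```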

How to keep the index when using pd.melt and merge to create a DataFrame for Seaborn and matplotlib

I am trying to draw subplots using two DataFrames (predicted and observed) with exactly the same structure; the first column is the index.
The code below creates a new index when they are concatenated using pd.melt and merge.
As you can see in the figure, the index of the orange line changes from 1-5 to 6-10.
I was wondering if someone could fix the code below to keep the same index for the orange line:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
actual = pd.DataFrame({'a': [5, 8, 9, 6, 7, 2],
'b': [89, 22, 44, 6, 44, 1]})
predicted = pd.DataFrame({'a': [7, 2, 13, 18, 20, 2],
'b': [9, 20, 4, 16, 40, 11]})
# Creating a tidy-dataframe to input under seaborn
merged = pd.concat([pd.melt(actual), pd.melt(predicted)]).reset_index()
merged['category'] = ''
merged.loc[:len(actual)*2,'category'] = 'actual'
merged.loc[len(actual)*2:,'category'] = 'predicted'
g = sns.FacetGrid(merged, col="category", hue="variable")
g.map(plt.plot, "index", "value", alpha=.7)
g.add_legend();
The orange line ('variable' == 'b') doesn't have an index of 0-5 because of how you used melt. If you look at pd.melt(actual), the index doesn't match what you are expecting, IIUC.
Here is how I would rearrange the dataframe:
merged = pd.concat([actual, predicted], keys=['actual', 'predicted'])
merged.index.names = ['category', 'index']
merged = merged.reset_index()
merged = pd.melt(merged, id_vars=['category', 'index'], value_vars=['a', 'b'])
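A quick check of the rearranged frame, using the data from the question, shows that every line now carries the original row positions 0-5. This is a sketch of the data preparation only; the plotting calls stay as in the question.

```python
import pandas as pd

actual = pd.DataFrame({'a': [5, 8, 9, 6, 7, 2],
                       'b': [89, 22, 44, 6, 44, 1]})
predicted = pd.DataFrame({'a': [7, 2, 13, 18, 20, 2],
                          'b': [9, 20, 4, 16, 40, 11]})

# keys= adds an outer index level that records which frame a row came from.
merged = pd.concat([actual, predicted], keys=['actual', 'predicted'])
merged.index.names = ['category', 'index']
merged = merged.reset_index()
merged = pd.melt(merged, id_vars=['category', 'index'], value_vars=['a', 'b'])

# Every (category, variable) pair should span the same index values 0..5.
for key, group in merged.groupby(['category', 'variable']):
    print(key, group['index'].tolist())
```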
Set the ignore_index parameter to False to preserve the index, e.g.:
df = df.melt(var_name='species', value_name='height', ignore_index=False)
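Applied to the frames from the question, that option could look like the sketch below. The helper name tidy is hypothetical, and ignore_index requires pandas 1.1 or later.

```python
import pandas as pd

actual = pd.DataFrame({'a': [5, 8, 9, 6, 7, 2],
                       'b': [89, 22, 44, 6, 44, 1]})
predicted = pd.DataFrame({'a': [7, 2, 13, 18, 20, 2],
                          'b': [9, 20, 4, 16, 40, 11]})

def tidy(frame, label):
    # ignore_index=False keeps the original row labels (0..5) on every
    # melted row; reset_index() then exposes them as an 'index' column.
    out = frame.melt(ignore_index=False).reset_index()
    out['category'] = label
    return out

merged = pd.concat([tidy(actual, 'actual'), tidy(predicted, 'predicted')],
                   ignore_index=True)
print(merged.head())
```

The resulting frame has the index, variable, value and category columns that the FacetGrid call in the question expects.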
