I have a dataframe that has an index (words) and a single column (counts) for some lyrics. I am trying to create a heatmap based on the word counts.
Cuenta
Que 179
La 145
Y 142
Me 113
No 108
I am trying to produce the heatmap like this:
df1 = pd.DataFrame.from_dict([top50]).T
df1.columns = ['Cuenta']
df1.sort_values(['Cuenta'], ascending = False, inplace=True)
result = df1.pivot(index=df1.index, columns='Cuenta', values=df1.Cuenta.count)
sns.heatmap(result, annot=True, fmt="g", cmap='viridis')
plt.show()
But, it keeps throwing 'Index' object has no attribute 'levels'
Any ideas why this isn't working? I tried using the index or words as a separate column and still doesn't work.
The data is one-dimensional. The counts are already present in the one (and only) column of the dataframe. There is no meaningful way to pivot this data.
You would hence directly plot the dataframe as a heatmap.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({"Cuenta": [179,145,142,113,108]},
index=["Que", "La", "Y", "Me", "No"])
sns.heatmap(df, annot=True, fmt="g", cmap='viridis')
plt.show()
If the data to be on the y-axis is a column, and not the index of the dataframe, then use .set_index
df = pd.DataFrame({"Cuenta": [179,145,142,113,108],
"words": ["Que", "La", "Y", "Me", "No"]})
# given a dataframe of two columns, set the column as the index
df.set_index("words", inplace=True)
ax = sns.heatmap(df, annot=True, fmt="g", cmap='viridis')
sns.heatmap will result in an IndexError if passing a pandas.Series.
.value_counts creates a Series
df['column'] and df.column create a Series. Use df[['column']] instead.
# sample data
tips = sns.load_dataset('tips')
# value_counts creates a Series
vc = tips.time.value_counts()
# convert to a DataFrame
vc = vc.to_frame()
# plot
ax = sns.heatmap(data=vc)
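A quick way to check whether an object is ready for sns.heatmap is to look at its number of dimensions. The sketch below uses a made-up word list (not the asker's lyrics) and only pandas; the column name Cuenta is borrowed from the question, everything else is for illustration.

```python
import pandas as pd

# Hypothetical word list, for illustration only
words = pd.Series(["que", "la", "que", "y", "la", "que"])

vc = words.value_counts()            # a Series: one-dimensional
print(type(vc).__name__, vc.ndim)

frame = vc.to_frame(name="Cuenta")   # a one-column DataFrame: two-dimensional
print(type(frame).__name__, frame.shape)

# Selecting with double brackets also keeps a DataFrame
same = frame[["Cuenta"]]
```

Either frame or same is two-dimensional and can be passed to sns.heatmap, whereas vc cannot.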
I have two different DataFrames in Python, one is the actual revenue values and the second one is the values of the prediction with the accumulative per day (index of the rows). Both DataFrames have the same length.
I want to compare them on the same plot, row by row. If I want to plot only one row from each DataFrame, I use this code:
df_actual.loc[71].T.plot(figsize=(14,10), kind='line')
df_preds.loc[71].T.plot(figsize=(14,10), kind='line')
The output is this:
However, the ideal output is to have all the rows for each DataFrame in a grid so I can compare all the results:
I have tried to create a for loop to iterate over each row, but it is not working:
for i in range(20):
    df_actual.loc[i].T.plot(figsize=(14,10), kind='line')
    df_preds.loc[i].T.plot(figsize=(14,10), kind='line')
Is there any way to do this that is not manual? Thanks!
it would be helpful if you provided a sample of your dfs.
assuming both dfs have the same length & assuming you want 2 columns, try this:
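nrows = (len(df_actual) + 1) // 2  # two columns; round up so every row gets an axis
fig, ax = plt.subplots(nrows, 2)
ax = ax.ravel()  # ravel() returns the flattened array; it does not modify ax in place
for i in range(len(df_actual)):
    sns.lineplot(data=df_actual.loc[i].T, ax=ax[i], color="navy")
    sns.lineplot(data=df_preds.loc[i].T, ax=ax[i], color="orange")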
edit:
this works for me (you just have to add your .T):
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df_actual = pd.DataFrame(data=[[1,2,3,4,5], [6,7,8,9,10]], columns = ["col1","col2", "col3", "col4", "col5"])
df_pred = pd.DataFrame(data=[[3,4,5,6,7], [8,9,10,11,12]], columns = ["col1", "col2", "col3", "col4", "col5"])
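nrows = (len(df_actual) + 1) // 2  # two columns; round up so every row gets an axis
fig, ax = plt.subplots(nrows, 2)
ax = ax.ravel()  # ravel() returns the flattened array; it does not modify ax in place
for i in range(len(df_actual)):
    ax[i].plot(df_actual.loc[i], color="navy")
    ax[i].plot(df_pred.loc[i], color="orange")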
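For an odd number of rows, one panel of the grid stays empty. A minimal sketch of how that case could be handled, using made-up data (the column names day0..day5 and the sizes are assumptions, not the asker's frames):

```python
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data: 5 rows (one panel each), 6 columns (the x-axis of each panel)
rng = np.random.default_rng(0)
cols = [f"day{d}" for d in range(6)]
df_actual = pd.DataFrame(rng.uniform(0, 10, size=(5, 6)), columns=cols)
df_preds = pd.DataFrame(rng.uniform(0, 10, size=(5, 6)), columns=cols)

ncols = 2
nrows = math.ceil(len(df_actual) / ncols)   # round up so every row gets a panel
fig, axes = plt.subplots(nrows, ncols, figsize=(14, 10), squeeze=False)
axes = axes.ravel()                          # ravel returns a new flat array

for i, ax in enumerate(axes):
    if i >= len(df_actual):
        ax.set_visible(False)                # hide the unused panel
        continue
    ax.plot(df_actual.iloc[i], color="navy", label="actual")
    ax.plot(df_preds.iloc[i], color="orange", label="prediction")
    ax.set_title(f"row {df_actual.index[i]}")
axes[0].legend()
```

Using .iloc here avoids assuming that the index runs from 0 to n-1; call plt.show() afterwards to display the figure.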
Whenever I plot a dataset to a bar plot, the x axis labels overload with labels. How can I change the datatype of the x axis from the dataframe or how can I display every nth label?
Here is my code:
# Import statements for the packages to be used.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
# Loading the data and having a look at the first few lines
df = pd.read_csv('tmdb-movies.csv')
df.head()
# Replace 0 with NaN (Not a Number)
df['budget'] = df['budget'].replace(0, np.nan)
df['runtime'] = df['runtime'].replace(0, np.nan)
# Drop all rows with null values (NaN)
df.dropna(axis=0, inplace=True)
# Drop all columns not required for investigation
df = df.drop(['id', 'imdb_id', 'revenue', 'cast', 'homepage', 'director', 'tagline',
'keywords', 'overview', 'genres', 'production_companies', 'vote_count', 'vote_average',
'release_date', 'budget_adj', 'revenue_adj'], axis=1)
budget_grp = df.groupby(['budget'])
budget_grp['popularity'].agg(['median', 'mean'])
# Setting mean popularity to variable budget_pop.
budget_pop = budget_grp['popularity'].mean()
# Bar plot with x as budget and y as average popularity.
budget_pop.plot(kind='bar', x='budget', y='popularity', figsize=(20,10), xlabel='Budget in Dollars',
ylabel='Average Popularity', rot=0, legend=True)
I have tried enumerate but don't know where to fit that in. I have also tried creating a function to find nth and I have tried changing my dataframe to integer but they always error.
Follow-up to answer below:
You can use custom xticks:
df = pd.DataFrame(data={'x':np.arange(1,1001,1), 'y':np.random.randint(1,1000,1000)})
ax = df.plot(kind='bar' ,x='x', y='y', figsize=(20,10))
min_value_in_x = 1
max_value_in_x = 1000
x_ticks = np.arange(min_value_in_x, max_value_in_x, 100)
ax.set(xticks=x_ticks, xticklabels=x_ticks)
plt.show()
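Note that a pandas bar plot draws bar i at x position i, regardless of the values in the index, so the tick positions are 0, 1, 2, ... and the labels have to be picked with the same step. A sketch for showing every nth label, with a made-up stand-in for budget_pop (the values and the step of 25 are assumptions):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for the grouped means: 200 budgets -> mean popularity
rng = np.random.default_rng(1)
budget_pop = pd.Series(rng.uniform(0, 5, 200),
                       index=np.arange(1, 201) * 1_000_000, name="popularity")

ax = budget_pop.plot(kind="bar", figsize=(20, 10), rot=0)

step = 25                                   # keep every 25th label
positions = np.arange(len(budget_pop))      # bar i is drawn at x = i
ax.set_xticks(positions[::step])
ax.set_xticklabels(budget_pop.index[::step].astype(str))
ax.set_xlabel("Budget in Dollars")
ax.set_ylabel("Average Popularity")
```

All bars are still drawn; only the ticks and their labels are thinned out.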
I have a multi index dataframe, with the two indices being Sample and Lithology
Sample 20EC-P 20EC-8 20EC-10-1 ... 20EC-43 20EC-45 20EC-54
Lithology Pd Di-Grd Gb ... Hbl Plag Pd Di-Grd Gb
Rb 7.401575 39.055118 6.456693 ... 0.629921 56.535433 11.653543
Ba 24.610102 43.067678 10.716841 ... 1.073115 58.520532 56.946630
Th 3.176471 19.647059 3.647059 ... 0.823529 29.647059 5.294118
I am trying to put it into a seaborn lineplot as such.
spider = sns.lineplot(data = data, hue = data.columns.get_level_values("Lithology"),
style = data.columns.get_level_values("Sample"),
dashes = False, palette = "deep")
The lineplot comes out as
I have two issues. First, I want to format hues by lithology and style by sample. Outside of the lineplot function, I can successfully access sample and lithology using data.columns.get_level_values, but in the lineplot they don't seem to do anything and I haven't figured out another way to access these values. Also, the lineplot reorganizes the x-axis by alphabetical order. I want to force it to keep the same order as the dataframe, but I don't see any way to do this in the documentation.
To use hue= and style=, seaborn prefers its dataframes in long form. pd.melt() will combine all columns and create new columns with the old column names, and a column for the values. The index too needs to be converted to a regular column (with .reset_index()).
Most seaborn functions use order= to set an order on the x-values, but with lineplot the only way is to make the column categorical applying a fixed order.
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
column_tuples = [('20EC-P', 'Pd '), ('20EC-8', 'Di-Grd'), ('20EC-10-1 ', 'Gb'),
('20EC-43', 'Hbl Plag Pd'), ('20EC-45', 'Di-Grd'), ('20EC-54', 'Gb')]
col_index = pd.MultiIndex.from_tuples(column_tuples, names=["Sample", "Lithology"])
data = pd.DataFrame(np.random.uniform(0, 50, size=(3, len(col_index))), columns=col_index, index=['Rb', 'Ba', 'Th'])
data_long = data.melt(ignore_index=False).reset_index()
data_long['index'] = pd.Categorical(data_long['index'], data.index) # make categorical, use order of the original dataframe
ax = sns.lineplot(data=data_long, x='index', y='value',
hue="Lithology", style="Sample", dashes=False, markers=True, palette="deep")
ax.set_xlabel('')
ax.legend(loc='upper left', bbox_to_anchor=(1.01, 1.02))
plt.tight_layout() # fit legend and labels into the figure
plt.show()
The long dataframe looks like:
index Sample Lithology value
0 Rb 20EC-P Pd 6.135005
1 Ba 20EC-P Pd 6.924961
2 Th 20EC-P Pd 44.270570
...
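To see the reshaping on its own, without any plotting, here is a small sketch on a made-up table with the same two column levels (the sample names S1, S2 and the values are placeholders):

```python
import pandas as pd

# Tiny made-up table with two column levels, shaped like the one in the question
cols = pd.MultiIndex.from_tuples(
    [("S1", "Gb"), ("S2", "Pd")], names=["Sample", "Lithology"])
wide = pd.DataFrame([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                    index=["Rb", "Ba", "Th"], columns=cols)

# One row per (element, sample) pair; the old index becomes a column named "index"
long = wide.melt(ignore_index=False).reset_index()

# Fix the category order to the original row order instead of alphabetical
long["index"] = pd.Categorical(long["index"], categories=wide.index)

print(long.columns.tolist())
print(long["index"].cat.categories.tolist())
```

Because the categories keep the order Rb, Ba, Th, a plot with x='index' follows the row order of the original dataframe.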
I have a data frame such as below:
Two categorical variables are impulsivity and treatment and multiple dependent variables (prot_width etc..).
I have managed to produce a boxplot that models the dependent variable by impulsivity and treatment;
sns.boxplot(x='treatment', y='prot_width', hue='impulsivity',
palette=['b','r'], data=data)
sns.despine(offset=10, trim=True)
which produces the graph below;
Now what I want to do is produce the exact same graph but for each dependent variable. I want to loop through each dependent variable column renaming the y-axis.
I have searched for for loops etc. but can't work out how to call the columns and more importantly how to change the y-axis during the loop.
Simply loop through the numeric data columns using DataFrame.columns, which is iterable, and pass the loop variable (here col) as the y argument of boxplot.
for col in data.columns[4:len(data.columns)]:
    sns.boxplot(x='treatment', y=col, hue='impulsivity',
                palette=['b','r'], data=data)
    sns.despine(offset=10, trim=True)
    plt.show()
Alternatively use select_dtypes for all numeric columns:
for col in data.select_dtypes(['float', 'int']).columns:
...
Or even filter to leave out non-numeric columns:
for col in data.filter(regex=r"^(?!(?:subject|protrusion|impulsivity|treatment)$)").columns:
...
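If the regular expression feels fragile, the same selection can be written without one. A sketch on a made-up frame, assuming the four identifier columns are named as in the filter above and that subject happens to be numeric:

```python
import pandas as pd

# Made-up frame with the column layout described in the question
data = pd.DataFrame({
    "subject": [1, 2, 3],
    "protrusion": ["a", "b", "c"],
    "impulsivity": ["high", "low", "high"],
    "treatment": ["x", "y", "x"],
    "prot_width": [0.1, 0.2, 0.3],
    "prot_length": [1.5, 2.5, 3.5],
})

id_cols = ["subject", "protrusion", "impulsivity", "treatment"]

# Option 1: drop the known identifier columns by name (keeps original order)
by_name = [c for c in data.columns if c not in id_cols]

# Option 2: keep numeric columns, then drop numeric identifiers such as subject
by_dtype = data.select_dtypes("number").columns.difference(id_cols, sort=False)
```

Either result can be used as the iterable in the for loop.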
To demonstrate with random data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(9192018)
demo_df = pd.DataFrame({'tool': np.random.choice(['pandas', 'r', 'julia', 'sas', 'stata', 'spss'],500),
'os': np.random.choice(['windows', 'mac', 'linux'],500),
'prot_width': np.random.randn(500)*100,
'prot_length': np.random.uniform(0,1,500),
'prot_lwr': np.random.randint(100, size=500)
}, columns=['tool', 'os', 'prot_width', 'prot_length', 'prot_lwr'])
for col in demo_df.columns[2:len(demo_df.columns)]:
    sns.boxplot(x='tool', y=col, hue='os', palette=['b','r','g'], data=demo_df)
    sns.despine(offset=10, trim=True)
    plt.legend(loc='center', ncol = 3, bbox_to_anchor=(0.5, 1.10))
    plt.show()
    plt.clf()
    plt.close()
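By default the y-axis label is the column name. To rename it inside the loop, one option is a dictionary mapping column names to display names. A sketch with made-up data; the display names below are hypothetical:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "treatment": rng.choice(["control", "drug"], 60),
    "impulsivity": rng.choice(["high", "low"], 60),
    "prot_width": rng.normal(size=60),
    "prot_length": rng.uniform(size=60),
})

# Hypothetical display names for the y-axis
labels = {"prot_width": "Width", "prot_length": "Length"}

axes = {}
for col, label in labels.items():
    fig, ax = plt.subplots()                 # a fresh figure per variable
    sns.boxplot(x="treatment", y=col, hue="impulsivity", data=demo, ax=ax)
    ax.set_ylabel(label)                     # rename the y-axis for this plot
    axes[col] = ax
```

Creating a new figure per iteration keeps the plots from being drawn on top of each other.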
Is there a way to iteratively plot data using seaborn's sns.boxplot() without having the boxplots overlap? (without combining datasets into a single pd.DataFrame())
Background
When comparing datasets of different size or shape, it is often useful to bin them by a shared variable (via pd.cut() and df.groupby(), as shown below) so that they can be compared directly.
Previously, I have iteratively plotted these "binned" data as boxplots on the same axis by looping over separate DataFrames using matplotlib's ax.boxplot() (providing y-axis location values via the positions argument to ensure the boxplots don't overlap).
Example
Below is a simplified example that shows the overlapping plots when using sns.boxplot():
import seaborn as sns
import random
import pandas as pd
import matplotlib.pyplot as plt
# Get the tips dataset and select a subset as an example
tips = sns.load_dataset("tips")
variable_to_bin_by = 'tip'
binned_variable = 'total_bill'
df = tips[[binned_variable, variable_to_bin_by] ]
# Create a second dataframe with different values and shape
df2 = pd.concat( [ df.copy() ] *5 )
# Use pseudo-random numbers to convey that df2 is different to df
scale = [ random.uniform(0,2) for i in range(len(df2[binned_variable])) ]
df2[ binned_variable ] = df2[binned_variable].values * scale * 5
dfs = [ df, df2 ]
# Group the data by a list of bins
bins = [0, 1, 2, 3, 4]
for n, df in enumerate( dfs ):
    gdf = df.groupby( pd.cut(df[variable_to_bin_by].values, bins ) )
    data = [ i[1][binned_variable].values for i in gdf]
    dfs[n] = pd.DataFrame( data, index = bins[:-1])
# Create an axis for both DataFrames to be plotted on
fig, ax = plt.subplots()
# Loop the DataFrames and plot
colors = ['red', 'black']
for n in range(2):
    ax = sns.boxplot( data=dfs[n].T, ax=ax, width=0.2, orient='h',
                      color=colors[n] )
plt.ylabel( variable_to_bin_by )
plt.xlabel( binned_variable )
plt.show()
More detail
I realise the simplified example above could be resolved by combining the DataFrames and providing the hue argument to sns.boxplot().
Updating the index of the DataFrames provided also doesn't help, as the y values from the last DataFrame provided are then used.
Providing the kwargs argument (e.g. kwargs={'positions': dfs[n].T.index}) won't work, as this raises a TypeError:
TypeError: boxplot() got multiple values for keyword argument 'positions'
Setting sns.boxplot()'s dodge argument to True doesn't solve this either.
Funnily enough, the "hack" that I proposed earlier today in this answer could be applied here.
It complicates the code a bit because seaborn expects a long-form dataframe instead of a wide-form to use hue-nesting.
# Get the tips dataset and select a subset as an example
tips = sns.load_dataset("tips")
df = tips[['total_bill', 'tip'] ]
# Group the data by a list of bins
bins = [0, 1, 2, 3, 4]
gdf = df.groupby( pd.cut(df['tip'].values, bins ) )
data = [ i[1]['total_bill'].values for i in gdf]
df = pd.DataFrame( data , index = bins[:-1]).T
dfm = df.melt() # create a long-form dataframe
dfm.loc[:,'dummy'] = 'dummy'
# Create a second, slightly different, DataFrame
dfm2 = dfm.copy()
dfm2.value = dfm.value*2
dfs = [ dfm, dfm2 ]
colors = ['red', 'black']
hue_orders = [['dummy','other'], ['other','dummy']]
# Create an axis for both DataFrames to be plotted on
fig, ax = plt.subplots()
# Loop the DataFrames and plot
for n in range(2):
    ax = sns.boxplot( data=dfs[n], x='value', y='variable', hue='dummy', hue_order=hue_orders[n], ax=ax, width=0.2, orient='h',
                      color=colors[n] )
ax.legend_.remove()
plt.show()
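For comparison, the plain matplotlib approach mentioned in the question (passing positions to ax.boxplot()) can be sketched roughly as follows. The data, the offset of 0.15 and the axis labels are made up for illustration; this is not a seaborn solution.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
bins = [0, 1, 2, 3]                       # left edges of the shared bins
# Two made-up datasets of different sizes, one list of arrays per dataset
datasets = [
    [rng.normal(10 + b, 2, 30) for b in bins],
    [rng.normal(20 + b, 4, 150) for b in bins],
]
colors = ["red", "black"]
offset = 0.15                             # vertical shift so boxes sit side by side

fig, ax = plt.subplots()
results = []
for n, (groups, color) in enumerate(zip(datasets, colors)):
    shift = (n - 0.5) * 2 * offset        # -offset for the first, +offset for the second
    results.append(
        ax.boxplot(groups, positions=np.array(bins) + shift, widths=0.2,
                   vert=False, manage_ticks=False,
                   boxprops={"color": color}, medianprops={"color": color}))
ax.set_yticks(bins)
ax.set_ylabel("tip bin (left edge)")
ax.set_xlabel("total_bill")
```

Because each dataset gets its own positions, the datasets never need to be merged into one DataFrame, at the cost of handling colors and ticks by hand.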