Plotting DataFrame columns as Series sets unexpected arguments - python

Here is an issue when plotting a dataframe:
The dataframe looks like
   i  ii  n  b
0  1   0  3  0
1  4   1  5  0
2  4   0  1  5
3  4   1  2  6
4  6   0  3  0
5  6   1  4  3
(code to create below). I'd like to plot stacked bars for same values of i, and I want bars to belong to groups according to ii. When I select only certain rows of the dataframe, I have issues plotting, forcing me to explicitly convert the dataframe's columns (which are extracted as pandas Series) to lists. (Note that I cannot use pivot, as I have multiple rows for some (i, ii) combinations.)
Why can I not directly pass a Series to matplotlib.pyplot.bar() (code for figure 3)?
Why does using Series affect the width of bars, which cannot be overridden by an explicit argument width?
Is there a way to produce the desired plot in a better way?
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({'i':[1,4,4,4,6,6], 'n':[3,5,1,2,3,4]})
df['ii'] = df.index % 2
df2 = df.set_index(['i', 'ii'])
df2["b"] = df2.groupby(level='i')['n'].cumsum() - df2.n
df2.reset_index(inplace=True)
# This produces expected outcome
plt.figure(1)
plt.clf()
ix = df2[df2.ii==0]
plt.bar(x=list(ix.i), height=ix.n, bottom=list(ix.b))
ix = df2[df2.ii==1]
plt.bar(x=list(ix.i), height=ix.n, bottom=list(ix.b))
plt.show()
plt.figure(2)
plt.clf()
ix = df2[df2.ii==0]
plt.bar(x=ix.i, height=ix.n, bottom=list(ix.b))
ix = df2[df2.ii==1]
# The following line will draw a bar with unexpected width of bar
plt.bar(x=ix.i, height=ix.n, bottom=list(ix.b))
plt.show()
plt.figure(3)
plt.clf()
plt.show()
ix = df2[df2.ii==0]
plt.bar(x=ix.i, height=ix.n, bottom=ix.b)
ix = df2[df2.ii==1]
# The following line will fail
plt.bar(x=ix.i, height=ix.n, bottom=ix.b)
# error:
# TypeError: only size-1 arrays can be converted to Python scalars
# apparently matplotlib tries to set line width
Desired output: (figure not included)

This kind of error typically arises when a function expects a single value but receives an array. In many cases np.vectorize can apply a function that accepts a single element to every element of an array, but that does not seem to be the case here. Why not pass lists, as you did in the first plot?
plt.bar(x=list(ix.i), height=list(ix.n), bottom=list(ix.b))
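A minimal sketch of one way to get the stacked, grouped bars without converting each column by hand. It assumes the goal is one bar segment per row, stacked on b and coloured by ii. Each column is passed as a NumPy array via to_numpy(), so the row labels left over from the boolean selection play no part. Whether those leftover labels are what triggers the error depends on the matplotlib version, so treat that as an assumption. The Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"i": [1, 4, 4, 4, 6, 6], "n": [3, 5, 1, 2, 3, 4]})
df["ii"] = df.index % 2
# Bottom of each segment: running total within the same i, minus its own height
df["b"] = df.groupby("i")["n"].cumsum() - df["n"]

fig, ax = plt.subplots()
for key, grp in df.groupby("ii"):
    # to_numpy() hands matplotlib plain arrays, without the pandas index
    ax.bar(grp["i"].to_numpy(), grp["n"].to_numpy(),
           bottom=grp["b"].to_numpy(), label=f"ii={key}")
ax.legend()
```

Looping over groupby("ii") replaces the repeated df2[df2.ii == k] selections and gives each group its own colour and legend entry.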

Related

Pandas stacked bar plotting with different shapes

I'm currently experimenting with pandas and matplotlib.
I have created a Pandas dataframe which stores data like this:
cmc|coloridentity
1 | G
1 | R
2 | G
3 | G
3 | B
4 | B
What I now want to do is to make a stacked bar plot where I can see how many entries per cmc exist. And I want to do that for all coloridentity and stack them above.
My thoughts so far:
#get all unique values of coloridentity
unique_values = df['coloridentity'].unique()
#Create two dictionaries. One for the number of entries per cost and one
# to store the different costs for each color
color_dict_values = {}
color_dict_index = {}
for u in unique_values:
    temp_df = df['cmc'].loc[df['coloridentity'] == u].value_counts()
    color_dict_values[u] = np.array(temp_df)
    color_dict_index[u] = temp_df.index.to_numpy()
width = 0.4
p1 = plt.bar(color_dict_index['G'], color_dict_values['G'], width, color='g')
p2 = plt.bar(color_dict_index['R'], color_dict_values['R'], width,
             bottom=color_dict_values['G'], color='r')
plt.show()
But this gives me an error, because the array I pass as bottom for the second plot has a different NumPy shape than the values being plotted.
Does anyone know a solution? I thought of adding 0 values so that the shapes are the same, but I don't know if this is the best solution, and if so, what the best way to do it would be.
Working with a fixed index (the range of cmc values) makes things easier. That way, the color_dict_values of each color hold a count for every possible cmc value (zero where there are none).
The color_dict_index isn't needed any more. To fill in the color_dict_values, we iterate through the temporary dataframe with the value_counts.
To plot the bars, the x-axis is now the range of possible cmc values. I added [1:] to each array to skip the zero at the beginning which would look ugly in the plot.
The bottom starts at zero, and gets incremented by the color_dict_values of the color that has just been plotted. (Thanks to numpy, the constant 0 added to an array will be that array.)
In the code I generated some random numbers similar to the format in the question.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
N = 50
df = pd.DataFrame({'cmc': np.random.randint(1, 10, N), 'coloridentity': np.random.choice(['R', 'G'], N)})
# get all unique values of coloridentity
unique_values = df['coloridentity'].unique()
# find the range of all cmc indices
max_cmc = df['cmc'].max()
cmc_range = range(max_cmc + 1)
# dictionary for each coloridentity: array of values of each possible cmc
color_dict_values = {}
for u in unique_values:
    value_counts_df = df['cmc'].loc[df['coloridentity'] == u].value_counts()
    color_dict_values[u] = np.zeros(max_cmc + 1, dtype=int)
    # Series.iteritems() was removed in pandas 2.0; items() is the replacement
    for ind, cnt in value_counts_df.items():
        color_dict_values[u][ind] = cnt
width = 0.4
bottom = 0
for col_id, col in zip(['G', 'R'], ['limegreen', 'crimson']):
    plt.bar(cmc_range[1:], color_dict_values[col_id][1:], bottom=bottom, width=width, color=col)
    bottom += color_dict_values[col_id][1:]
plt.xticks(cmc_range[1:]) # make sure every cmc gets a tick label
plt.tick_params(axis='x', length=0) # hide the tick marks
plt.xlabel('cmc')
plt.ylabel('count')
plt.show()
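As an alternative sketch of the same fixed-index idea, pandas can build the zero-filled count table itself. This assumes pd.crosstab is acceptable in place of the manual dictionaries. It uses the six rows from the question, and the colour list is an illustrative choice matched to the sorted column order B, G, R.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import pandas as pd

df = pd.DataFrame({"cmc": [1, 1, 2, 3, 3, 4],
                   "coloridentity": ["G", "R", "G", "G", "B", "B"]})

# One row per cmc, one column per colour; missing combinations become 0
counts = pd.crosstab(df["cmc"], df["coloridentity"])

# Columns come out sorted (B, G, R), so the colours are listed in that order
ax = counts.plot.bar(stacked=True, width=0.4, rot=0,
                     color=["royalblue", "limegreen", "crimson"])
ax.set_ylabel("count")
```

Because every colour shares the same cmc index, the stacking no longer runs into mismatched array shapes.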

How to plot the data on two y-axis in pandas or matplotlib?

I have a data frame which is a combination of FER (Facial Emotion Recognition) and Mood prediction
Now, the mood prediction dataset has two columns - Question and Prediction. The question represents three values [1 - Activation; 2 - Pleasance; 5- Stress] and the prediction column also has three values [0 - Low; 1 - Medium; 2 - High]. The index consists of timestamps.
I'd like to give a brief explanation about the screenshot 3 below. Let's consider the third row where the question value is 5 and the prediction value is 1. This indicates stress (5) of medium (1) level.
How can I plot the prediction of question values over time? I tried to do it but I am getting just one line for everything.
d = one.groupby('question')
dunk1 = d.get_group(1)
fig, ax1 = plt.subplots(figsize = (20,5))
x = one.index
y = one.prediction
ax1.plot(x,y,'r-')
[Image: plot of my attempted code]
I am looking to get an output that looks something like the following:
[Image: screenshot of the dataset]
You are plotting x and y from the original dataframe, not from the selected group. You should be doing:
d = one.groupby('question')
dunk1 = d.get_group(1)
fig, ax1 = plt.subplots(figsize = (20,5))
x = dunk1.index
y = dunk1.prediction
ax1.plot(x,y,'r-')
Or, to plot all three question groups:
d = one.groupby('question')
fig, ax = plt.subplots(figsize = (20,5))
for k in d.groups.keys():
    group = d.get_group(k)
    ax.plot(group.index, group.prediction)
But understand that this may not get you all the way to the result you want - there may be more filtering or sorting necessary.
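A self-contained sketch of the per-group plot with labelled lines. The dataframe below is a hypothetical stand-in for one (the real data is not shown in the question). The code-to-name mappings (1 Activation, 2 Pleasance, 5 Stress; 0 Low, 1 Medium, 2 High) are taken from the question text.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: a timestamp index with question and prediction columns
idx = pd.date_range("2021-01-01", periods=9, freq="D")
one = pd.DataFrame({"question": [1, 2, 5] * 3,
                    "prediction": [0, 1, 2, 1, 1, 0, 2, 0, 1]}, index=idx)
names = {1: "Activation", 2: "Pleasance", 5: "Stress"}

fig, ax = plt.subplots(figsize=(20, 5))
# Iterating over a groupby yields (key, sub-dataframe) pairs directly
for question, group in one.sort_index().groupby("question"):
    ax.plot(group.index, group["prediction"], marker="o", label=names[question])
ax.set_yticks([0, 1, 2])
ax.set_yticklabels(["Low", "Medium", "High"])
ax.legend()
```

Sorting by the index first keeps each line running left to right in time, which is one of the extra steps real data may need.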

How can I auto-adjust my scatterplot labels without them being overlapped by other labels in python?

I have been working on this for a bit and wanted to see if someone could look at why I can't auto-adjust my scatter-plot labels. While searching for a solution I came across the adjustText library, found here https://github.com/Phlya/adjustText, and it seems like it should work, but I'm trying to find an example that plots from a dataframe. When I tried replicating the adjustText examples, it threw an error. This is my current code:
df["category"] = df["category"].astype(int)
df2 = df.sort_values(by=['count'], ascending=False).head()
ax = df.plot.scatter(x="category", y="count")
a = df2['category']
b = df2['count']
texts = []
for xy in zip(a, b):
    texts.append(plt.text(xy))
adjust_text(texts, arrowprops=dict(arrowstyle="->", color='r', lw=0.5))
plt.title("Count of {column} in {table}".format(**sql_dict))
But then I got this error: TypeError: text() missing 2 required positional arguments: 'y' and 's'. This is what I changed it to in order to print the coordinates; it works, but the labels just overlap.
df["category"] = df["category"].astype(int)
df2 = df.sort_values(by=['count'], ascending=False).head()
ax = df.plot.scatter(x="category", y="count")
a = df2['category']
b = df2['count']
for xy in zip(a, b):
    ax.annotate('(%s, %s)' % xy, xy=xy)
As you can see, my df is built from SQL tables; this particular table holds the length of stay in days against the number of people who stayed that long.
I made the second dataframe above so that only the highest values get labelled on the plot. This is one of my first experiences with graphing visualizations in Python, so any help would be appreciated.
[Image: picture of a graph of overlapping items]
A sample of the data may look like this:
[los_days count]
3 350
1 4000
15 34
and so forth. Thanks so much. Let me know if you need anything else.
Here is an example of the df
category count
0 2 29603
1 4 33980
2 9 21387
3 11 17661
4 18 10618
5 20 8395
6 27 5293
7 29 4121
After some reverse engineering with an example from adjustText library and my own example, I just had to change my for loop to create the labels and it worked fantastically.
labels = ['{}'.format(i) for i in zip(a, b)]
texts = []
for x, y, text in zip(a, b, labels):
    texts.append(ax.text(x, y, text))
adjust_text(texts, force_text=0.05, arrowprops=dict(arrowstyle="-|>",
                                                    color='r', alpha=0.5))
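The fix above comes down to passing x, y and the label string to text() as three separate arguments. A self-contained sketch using the sample df from the question follows. The try/except around the adjustText import is an addition for illustration, so the example still runs (with unadjusted labels) where the library is not installed.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import pandas as pd

df = pd.DataFrame({"category": [2, 4, 9, 11, 18, 20, 27, 29],
                   "count": [29603, 33980, 21387, 17661, 10618, 8395, 5293, 4121]})
top = df.sort_values(by="count", ascending=False).head()  # label only the 5 largest

ax = df.plot.scatter(x="category", y="count")
# text() needs x, y and the string separately, not one (x, y) tuple
texts = [ax.text(x, y, f"({x}, {y})")
         for x, y in zip(top["category"], top["count"])]

try:
    from adjustText import adjust_text
    adjust_text(texts, ax=ax, arrowprops=dict(arrowstyle="->", color="r", lw=0.5))
except ImportError:
    pass  # adjustText not installed: labels stay at their data coordinates
```

Building the label with an f-string from the unpacked values keeps it readable even when the values are NumPy integers.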

Seaborn countplot does not accept both x and y arguments in Python

I want to plot the frequencies of a variable y by a variable x and to that effect I am using the seaborn.countplot() method. However I am getting an error message.
For a reproducible example see below:
surveys_species_by_plot_sample
plot_id taxa
0 1 Bird
1 1 Rabbit
2 1 Rodent
3 2 Bird
4 2 Rabbit
5 2 Reptile
6 2 Rodent
7 3 Bird
8 3 Rabbit
9 3 Rodent
The intent now is to plot the number of taxa (Birds, Rabbits etc) by plot_id. I am using the following command:
sn.countplot(x = "plot_id", y = "taxa", data = surveys_species_by_plot_sample,
palette = sn.color_palette(palette = ["SteelBlue" , "Salmon"], n_colors = 4))
I am getting the following error message:
TypeError: Cannot pass values for both `x` and `y`
I do not understand, since the documentation states that both x and y variables can be passed to the function:
Parameters: x, y, hue : names of variables in data or vector data,
optional Inputs for plotting long-form data. See examples for
interpretation.
data : DataFrame, array, or list of arrays, optional Dataset for
plotting. If x and y are absent, this is interpreted as wide-form.
Otherwise it is expected to be long-form.
Your advice will be appreciated.
Reading the examples in the documentation, you can pass only x or y, not both. Passing x or y changes the orientation of your chart. You might need a different chart type.
If you want to use both x and y in your plot, you can't use countplot; use sn.barplot instead :)
sn.barplot(x = "plot_id", y = "taxa", data = surveys_species_by_plot_sample,
palette = sn.color_palette(palette = ["SteelBlue" , "Salmon"], n_colors = 4))
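Note that barplot aggregates a numeric variable (the mean by default) rather than counting rows, so it may not give frequencies here. A hedged sketch of counting taxa per plot_id with plain pandas follows, using the sample rows from the question. On the seaborn side, countplot also accepts a hue argument, so sn.countplot(x="plot_id", hue="taxa", data=...) is the closest equivalent.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import pandas as pd

surveys_species_by_plot_sample = pd.DataFrame({
    "plot_id": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    "taxa": ["Bird", "Rabbit", "Rodent", "Bird", "Rabbit",
             "Reptile", "Rodent", "Bird", "Rabbit", "Rodent"],
})

# Count rows per (plot_id, taxa) pair, then spread taxa into columns
counts = (surveys_species_by_plot_sample
          .groupby(["plot_id", "taxa"]).size()
          .unstack(fill_value=0))

ax = counts.plot.bar(rot=0)  # one group of bars per plot_id, one bar per taxa
ax.set_ylabel("count")
```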

Separating out pandas series for pyplot

I currently have a set of series in pandas and each series is composed of two data sets. I need to separate out the two data sets into lists while retaining the series information, ie. the time and intensity data for 58V.
My current code looks like:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
xl = pd.ExcelFile("TEST_ATD.xlsx")
df = xl.parse("Sheet1")
series = xl.parse("Sheet1")
voltages = []
for item in df:
    if "V" in item:
        voltages.append(item)
data_list = []
for value in voltages:
    print(df[value])
How do I select a particular data set from the series and extract it into a list? Calling print(df[value]) returns my data sets, an example of which looks like:
Name: 58V, dtype: int64
0.000 0
0.180 1
0.360 1.2
0.540 1.5
0.720 1.2
..
35.277 0
35.457 0
35.637 0
NaN 0
Ultimately I plan to plot these data sets into a line graph with pyplot.
~~~ UPDATE ~~~
using
for value in voltages:
    intensity=[]
    for row in series[value].tolist():
        intensity.append(row)
    time=range(0,len(intensity))
    pc_intensity = []
    for item in intensity:
        pc_intensity.append((100/max(intensity)*item))
    plt.plot(time, pc_intensity)
    axes = plt.gca()
    axes.set_ylim([0,100])
    plt.title(value)
    plt.ylabel('Intensity')
    plt.xlabel('Time')
    plt.savefig(value +'.png')
    plt.clf()
    print(value)
I am able to get the plots of the first 8 data series (using an arbitrary x-axis); however, for anything past the 8th series my plots are empty. I have experimented and found this to be due to some of the series being different lengths. I'm confused as to why this would affect the plots, as the x-axis is directly derived from the length of the data set it is being plotted against.
I am not sure what you are trying to achieve, but I'll take a guess:
df = pd.DataFrame({'A': range(1, 10), 'B': range(1, 10), 'C': range(1, 10), 'D': range(1, 10), 'E': [1,1,1,2,2,2,2,3,4]})
for col in df.columns:
    print(df[col].values.tolist())
This prints every column of your dataframe as a list.
If you are just trying to plot something, why not just use
df.plot()
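One possible cause of the empty plots is NaN padding: columns of different lengths read from one sheet are filled with NaN, and Python's built-in max() can return NaN depending on where the NaN sits. The source does not confirm this, so it is an assumption. A minimal sketch that drops the NaN entries per column before normalising follows, using a small hypothetical dataframe in place of the Excel file.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sheet: the shorter "60V" column is NaN-padded
df = pd.DataFrame({"58V": [0.0, 1.0, 1.2, 1.5, 1.2, 0.0],
                   "60V": [0.0, 2.0, 4.0, 1.0, np.nan, np.nan]},
                  index=[0.0, 0.18, 0.36, 0.54, 0.72, 0.90])

percent = {}
for name in [c for c in df.columns if "V" in c]:
    s = df[name].dropna()              # keep only the rows this series really has
    percent[name] = 100 * s / s.max()  # Series.max() skips NaN by default
    fig, ax = plt.subplots()
    ax.plot(s.index, percent[name])    # the index supplies the time axis
    ax.set_ylim(0, 100)
    ax.set_title(name)
    ax.set_xlabel("Time")
    ax.set_ylabel("Intensity")
    plt.close(fig)                     # call fig.savefig(name + ".png") before this to keep the plot
```

Working on the Series directly also keeps the time values from the index, instead of an arbitrary range for the x-axis.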
