Plotting dataframe where headers are 24h timestamps - python

I have a CSV which looks something like:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Red 2151.0 1966.0 1889.0 2148.0 2112.0 2351.0 1813.0 2008.0 1841.0 1901.0 2373.0 2643.0 2322.0 1901.0 1849.0 2132.0 1877.0 1963.0 1861.0 1973.0 2468.0 3434.0 3159.0 3413.0
Blue 2122.0 2059.0 2274.0 2647.0 2136.0 2271.0 2107.0 2192.0 2403.0 2148.0 2008.0 2111.0 2061.0 2196.0 2165.0 2354.0 1931.0 2195.0 1985.0 2025.0 2463.0 2943.0 3302.0 3424.0
I need to plot this data as a scatter plot, where Red/Blue is on the X-axis, and time on the Y-axis. The columns 1 to 24 are hourly timestamps. I am confused because we usually have one timestamp per row, but in my situation I have to plot each row against every timestamp.
I then have multiple such CSVs, and will need to plot all of them to one graph.
So my question is, what's the best way to plot all values of Red/Blue for each given timestamp?
When I try to do:
x = sorted_df.index
y = list(sorted_df.columns)
plt.scatter(x, y)
plt.show()
I get a ValueError: x and y must be the same size - and this is my main source of confusion, because x and y will never be the same length. In the above example, x will always have 2 entries and y will always have 24!
Any help is much appreciated!

Maybe you could transpose (.T, short for .transpose) your dataframe, iterate its (now) columns and add the scatter plot of each column ("Red", "Blue", etc.) in the same plot:
# Assuming that your index is ["Red", "Blue", ...]
# Assuming that your columns are [1, 2, ...]
sorted_df = sorted_df.T
# Now your index is [1, 2, ...] and columns are ["Red", "Blue", ...]
for column in sorted_df.columns:
    # x: [1, 2, ...] (always the same)
    # y: Values for each column (first "Red", then "Blue", and so on)
    plt.scatter(sorted_df.index, sorted_df[column])
# Display plot
plt.show()
This is the result I get:
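For the case of several CSVs on one graph, one option is to reshape each frame into long format (one row per colour and hour) before plotting, so that x and y always have the same length. The following is only a minimal sketch under assumptions: `wide` is a small hypothetical stand-in for one loaded CSV, and the column names `colour`, `hour` and `value` are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for one CSV: rows are colours, columns are hours.
wide = pd.DataFrame(
    [[2151.0, 1966.0, 1889.0], [2122.0, 2059.0, 2274.0]],
    index=["Red", "Blue"],
    columns=[1, 2, 3],
)

# Long format: one row per (colour, hour) pair, so x and y have equal length.
long_df = (
    wide.rename_axis("colour")
    .reset_index()
    .melt(id_vars="colour", var_name="hour", value_name="value")
)
long_df["hour"] = long_df["hour"].astype(int)

# One scatter series per colour, all on the same axes.
fig, ax = plt.subplots()
for colour, group in long_df.groupby("colour"):
    ax.scatter(group["hour"], group["value"], label=colour)
ax.set_xlabel("hour")
ax.set_ylabel("value")
ax.legend()
```

Frames from further CSVs could be given an extra column naming their source file, combined with pd.concat, and plotted in the same loop.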

Related

Multiple datasets and fit line

I have different datasets:
Df1
X Y
1 1
2 5
3 14
4 36
5 90
Df2
X Y
1 1
2 5
3 21
4 38
5 67
Df3
X Y
1 1
2 5
3 10
4 50
5 78
I would like to determine a line which fits this data and plot all data in one chart (like a regression).
On the x axis I have the time; on the y axis I have the frequency of an event that occurs.
Any help on the approach on how to determine the line and plot the results keeping the different legends (would be ok with seaborn or matplotlib) would be helpful.
What I have done so far is plotting the three lines as follows:
plot_df = pd.DataFrame(list(zip(dataset_list, x_lists, y_lists)),
                       columns=['Dataset', 'X', 'Y']).set_index('Dataset', inplace=False)
plot_df = plot_df.apply(pd.Series.explode).reset_index()  # this step should transpose the resulting df and explode the values
# plot
fig, ax = plt.subplots(figsize=(10,8))
for name, group in plot_df.groupby('Dataset'):
    group.plot(x="X", y="Y", ax=ax, label=name)
Please note that the three lists at the beginning contain information on the three different df.
I recommend using linregress from scipy.stats, as this gives very readable code. You just need to add the fitting logic to your loop:
from scipy.stats import linregress
for name, group in plot_df.groupby('Dataset'):
    group.plot(x="X", y="Y", ax=ax, label=name)
    # fit a line to the data (explode leaves object dtype, so cast to float first)
    x = group.X.astype(float)
    y = group.Y.astype(float)
    fit = linregress(x, y)
    ax.plot(x, x * fit.slope + fit.intercept, label=f'{name} fit')
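If a single trend line over all three datasets is wanted as well, the same call can be applied to the pooled points. This is a minimal sketch, assuming the three frames from the question are rebuilt by hand; the names `frames`, `pooled`, `fits` and `pooled_fit` are illustrative only.

```python
import pandas as pd
from scipy.stats import linregress

# The three datasets from the question, rebuilt by hand.
frames = {
    "Df1": pd.DataFrame({"X": [1, 2, 3, 4, 5], "Y": [1, 5, 14, 36, 90]}),
    "Df2": pd.DataFrame({"X": [1, 2, 3, 4, 5], "Y": [1, 5, 21, 38, 67]}),
    "Df3": pd.DataFrame({"X": [1, 2, 3, 4, 5], "Y": [1, 5, 10, 50, 78]}),
}

# Combine them into one long frame with a 'Dataset' column.
pooled = pd.concat(
    [frame.assign(Dataset=name) for name, frame in frames.items()],
    ignore_index=True,
)

# One straight-line fit per dataset, plus one fit over all points together.
fits = {
    name: linregress(group["X"], group["Y"])
    for name, group in pooled.groupby("Dataset")
}
pooled_fit = linregress(pooled["X"], pooled["Y"])

for name, fit in fits.items():
    print(f"{name}: slope={fit.slope:.2f}, intercept={fit.intercept:.2f}")
print(f"pooled: slope={pooled_fit.slope:.2f}, intercept={pooled_fit.intercept:.2f}")
```

The pooled line can be drawn with ax.plot in the same way as the per-dataset fits. Note that a straight line is only a rough description if the frequencies grow faster than linearly.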

How to plot specific colors for a range of values in python dataframe?

I have a df that looks like below:
S.No Date A
0 12/07/03 76
1 12/07/13 1
2 12/07/23 32
3 12/08/03 12
4 12/08/04 22
5 12/08/05 11
I want to have a plot where the Y axis is A and the X axis is the Date; the problem is with the color. I want all the occurrences of 76 in red, 32 in blue and all other values of A in green. Is this possible?
Yes, you can do so:
# define the color according to the values of df['A']
colors = np.select((df['A'].eq(76), df['A'].eq(32)), ('r','b'), 'g')
# pass the color to plt.scatter
plt.scatter(x=df['Date'],y=df['A'], c=colors)
Output:
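To check the colour assignment independently of the plot, the mapping can be run on the sample frame from the question. A minimal sketch, assuming the dates are kept as plain strings (no date parsing is attempted here); the name `conditions` is illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question; dates are left as strings.
df = pd.DataFrame({
    "Date": ["12/07/03", "12/07/13", "12/07/23", "12/08/03", "12/08/04", "12/08/05"],
    "A": [76, 1, 32, 12, 22, 11],
})

# First matching condition wins; rows matching none get the default colour.
conditions = [df["A"].eq(76).to_numpy(), df["A"].eq(32).to_numpy()]
colors = np.select(conditions, ["r", "b"], default="g")

print(colors.tolist())  # ['r', 'g', 'b', 'g', 'g', 'g']
```

The resulting array has one entry per row, so it can be passed straight to the c argument of plt.scatter.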

pandas plotting skipping xtick labels

I am trying to display two pandas Series objects together, which works, except that not all the labels are displayed.
I am trying to plot the two Series together like this:
plt.figure()
sns.set_style('ticks')
ts86['Gene'].value_counts().plot(kind='area')
l97['Gene'].value_counts().plot(kind='area')
sns.despine(offset=10)
But only one of the indexes is displayed.
Here are the two Series that I have:
one
TIIIh 25
TET2-2 24
IDH2 15
TIIIa 14
TIIIb 12
TIIIj 11
TIIIp 9
p53-1 9
SF3B1 8
TIIIe 8
KRAS-1 7
TIIIo 6
TIIId 6
TET2-1 6
GATA1 5
p53-3 5
HRAS 5
NRAS-2 4
IDH1 4
TIIIq 4
JAK2 4
TIIIc 4
TIIIf 3
TIIIg 3
TIIIm 3
KRAS-2 3
p53-2 3
TIIIk 3
TIIIn 2
DNMT3a 1
and
two
p53-1 17
p53-2 2
NRAS-2 2
p53-3 1
KRAS-2 1
Your output graph shows value_counts of 2 dataframes but obviously the index orders are no longer the same, so there is no way to show xticks at this point (e.g. highest count in df1 is TIIIh while that of df2 is p53-1 and you are trying to plot them together by preserving the order).
Let's simply merge df1 and df2 first (I named TIIIh and so on as id for merge key):
combi = pd.merge(ts86, l97, on='id', how='left')
combi = combi.set_index('id')
And then, plot each column and show all xticks:
ax = combi['Gene_x'].plot(kind='area', figsize=(10, 3))
combi['Gene_y'].plot(kind='area', figsize=(10, 3))
ax.set_xticks(range(combi.shape[0]))
ax.set_xticklabels(combi.index, rotation=90)
Now you get this:
Hope this helps.
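As an alternative to the merge, the two value_counts results can be aligned on their shared gene labels with pd.concat(axis=1). This is a sketch under assumptions, not the answer's method: `one` and `two` hold only a few of the counts listed in the question, and missing genes are filled with 0.

```python
import pandas as pd

# A few of the counts from the question (subset chosen for brevity).
one = pd.Series({"TIIIh": 25, "TET2-2": 24, "p53-1": 9, "p53-2": 3}, name="one")
two = pd.Series({"p53-1": 17, "p53-2": 2}, name="two")

# Align both series on the gene labels; genes absent from one series become 0.
combi = pd.concat([one, two], axis=1).fillna(0)

# Plot both columns on one axes and label every position on the x-axis.
ax = combi.plot(kind="area", stacked=False, figsize=(10, 3))
ax.set_xticks(range(len(combi)))
ax.set_xticklabels(combi.index, rotation=90)
```

Because both columns share one index, every tick position corresponds to the same gene in both areas.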

pandas histogram/barplot on categorical index and axis

I have this series:
data:
0 17
1 25
2 10
3 60
4 0
5 20
6 300
7 50
8 10
9 80
10 100
11 65
12 125
13 50
14 100
15 150
Name: 1, dtype: int64
I wanted to plot a histogram with variable bin sizes, so I wrote this:
filter_values = [0,25,50,60,75,100,150,200,250,300,350]
out = pd.cut(data["1"], bins=filter_values)  # use data[1] if the column label is the integer 1
counts = out.value_counts()
print(counts)
My problem is that when I use counts.plot(kind="hist"), I don't get the right labels on the x-axis. I only get them by using a bar graph instead, counts.plot(kind="bar"), but then I can't get the right order.
I tried xticks=counts.index.values[0], but it raises an error, and xticks=filter_values gives an oddly shaped figure, as the numbers go far beyond what the plot understands the bins to be.
I also tried counts.hist(), data.hist(), and counts.plot.hist without success.
I don't know how to correctly plot the categorical data from counts (its index is a pandas categorical index), so I don't know which approach to take: whether there is a way to plot variable bins directly with data.hist(), data.plot(kind="hist") or data.plot.hist(), or whether I am right to build counts, and if so how to represent it correctly (with good labels on the x-axis and in the right order, not the descending one of the bar graph).
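One possible approach, offered as a sketch rather than a definitive answer, is to keep the counts approach and sort by the interval index before drawing the bar chart. It assumes `data` is a plain Series holding the values listed above.

```python
import pandas as pd

# Values from the question, as a plain Series.
data = pd.Series([17, 25, 10, 60, 0, 20, 300, 50, 10, 80, 100, 65, 125, 50, 100, 150])
filter_values = [0, 25, 50, 60, 75, 100, 150, 200, 250, 300, 350]

# Bin the values, count per bin, then order the bins from low to high.
out = pd.cut(data, bins=filter_values)
counts = out.value_counts().sort_index()

# A bar chart keeps the interval labels on the x-axis, now in bin order.
ax = counts.plot(kind="bar")
ax.set_xlabel("bin")
ax.set_ylabel("count")
```

With the default settings of pd.cut, the lowest edge is exclusive, so the value 0 falls outside the first bin; passing include_lowest=True would include it.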

Grouping boxplots in seaborn when input is a DataFrame

I intend to plot multiple columns in a pandas dataframe, all grouped by another column, using groupby inside seaborn.boxplot. There is a nice answer for a similar problem in matplotlib (matplotlib: Group boxplots), but given that seaborn.boxplot comes with a groupby option, I thought it could be much easier to do this in seaborn.
Here we go with a reproducible example that fails:
import seaborn as sns
import pandas as pd
df = pd.DataFrame([[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
                   [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
                  columns=['a1', 'a2', 'a3', 'a4', 'b'])
# display(df)
a1 a2 a3 a4 b
0 2 4 5 6 1
1 4 5 6 7 2
2 5 4 5 5 1
3 10 4 7 8 2
4 9 3 4 6 2
5 3 3 4 4 1
#Plotting by seaborn
sns.boxplot(df[['a1','a2', 'a3', 'a4']], groupby=df.b)
What I get is something that completely ignores groupby option:
Whereas if I do this with one column it works thanks to another SO question Seaborn groupby pandas Series :
sns.boxplot(df.a1, groupby=df.b)
So I would like to get all my columns in one plot (all columns come in a similar scale).
EDIT:
The above SO question was edited and now includes a 'not clean' answer to this problem, but it would be nice if someone has a better idea for this problem.
As the other answers note, the boxplot function is limited to plotting a single "layer" of boxplots, and the groupby parameter only has an effect when the input is a Series and you have a second variable you want to use to bin the observations into each box.
However, you can accomplish what I think you're hoping for with the factorplot function (renamed catplot in seaborn 0.9), using kind="box". But you'll first have to "melt" the sample dataframe into what is called long-form or "tidy" format, where each column is a variable and each row is an observation:
df_long = pd.melt(df, "b", var_name="a", value_name="c")
Then it's very simple to plot:
sns.catplot(x="a", hue="b", y="c", data=df_long, kind="box")
You can use boxplot directly (I imagine that was not possible when the question was asked, but with seaborn version > 0.6 it is).
As explained by @mwaskom, you have to "melt" the sample dataframe into its "long-form" where each column is a variable and each row is an observation:
df_long = pd.melt(df, "b", var_name="a", value_name="c")
# display(df_long.head())
b a c
0 1 a1 2
1 2 a1 4
2 1 a1 5
3 2 a1 10
4 2 a1 9
Then you just plot it:
sns.boxplot(x="a", hue="b", y="c", data=df_long)
Seaborn's groupby function takes Series not DataFrames, that's why it's not working.
As a workaround, you can do this:
fig, ax = plt.subplots(1,2, sharey=True)
for i, grp in enumerate(df.filter(regex="a").groupby(by=df.b)):
    sns.boxplot(grp[1], ax=ax[i])
It gives:
Note that df.filter(regex="a") is equivalent to df[['a1','a2', 'a3', 'a4']]
a1 a2 a3 a4
0 2 4 5 6
1 4 5 6 7
2 5 4 5 5
3 10 4 7 8
4 9 3 4 6
5 3 3 4 4
Hope this helps
It isn't really any better than the answer you linked, but I think the way to achieve this in seaborn is using the FacetGrid feature, as the groupby parameter is only defined for Series passed to the boxplot function.
Here's some code - the pd.melt is necessary because (as best I can tell) the facet mapping can only take individual columns as parameters, so the data need to be turned into a 'long' format.
g = sns.FacetGrid(pd.melt(df, id_vars='b'), col='b')
g.map(sns.boxplot, 'value', 'variable')
It's not adding a lot to this conversation, but after struggling with this for longer than warranted (the actual clusters are unusable), I thought I would add my implementation as another example. It's got a superimposed scatterplot (because of how annoying my dataset is), shows melt using indices, and some aesthetic tweaks. I hope this is useful for someone.
Here it is without using column headers (I saw a different thread that wanted to know how to do this using indices):
combined_array: ndarray = np.concatenate([dbscan_output.data, dbscan_output.labels.reshape(-1, 1)], axis=1)
cluster_data_df: DataFrame = DataFrame(combined_array)
If you want to use labelled columns:
column_names: List[str] = list(outcome_variable_names)
column_names.append('cluster')
# set_axis returns a new frame (the inplace argument was removed in pandas 2.0)
cluster_data_df = cluster_data_df.set_axis(column_names, axis='columns')
graph_data: DataFrame = pd.melt(
    frame=cluster_data_df,
    id_vars=['cluster'],
    # value_vars is an optional param - by default it uses columns except the id vars, but I've included it as an example
    # value_vars=['outcome_var_1', 'outcome_var_2', 'outcome_var_3', 'outcome_var_4', 'outcome_var_5', 'outcome_var_6']
    var_name='psychometric_test',
    value_name='standard deviations from the mean'
)
The resulting dataframe (rows = sample_n x variable_n (in my case 1626 x 6 = 9756)):
index  cluster  psychometric_test  standard deviations from the mean
0      0.0      outcome_var_1      -1.276182
1      0.0      outcome_var_1      -1.118813
2      0.0      outcome_var_1      -1.276182
...
9754   0.0      outcome_var_6      0.892548
9755   0.0      outcome_var_6      1.420480
If you want to use indices with melt:
graph_data: DataFrame = pd.melt(
    frame=cluster_data_df,
    id_vars=cluster_data_df.columns[-1],
    # value_vars=cluster_data_df.columns[:-1],
    var_name='psychometric_test',
    value_name='standard deviations from the mean'
)
And here's the graphing code:
(Done with column headings - just note that y-axis=value_name, x-axis = var_name, hue = id_vars):
# plot graph grouped by cluster
sns.set_theme(style="ticks")
sns.set_context("notebook", font_scale=1.2)
sns.set_style("white")
fig = plt.figure(figsize=(10, 10))
# create boxplot
ax = sns.boxplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster', showfliers=False,
                 data=graph_data)
# set box alpha (recent matplotlib keeps the boxes in ax.patches, older versions in ax.artists):
for patch in list(ax.patches) or list(ax.artists):
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .2))
# create scatterplot
ax = sns.stripplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster', data=graph_data,
                   dodge=True, alpha=.25, zorder=1)
# customise legend:
cluster_n: int = dbscan_output.n_clusters
## create list with legend text
i = 0
cluster_info: Dict[int, int] = dbscan_output.cluster_sizes  # custom method
legend_labels: List[str] = []
while i < cluster_n:
    label: str = f"cluster {i+1}, n = {cluster_info[i]}"
    legend_labels.append(label)
    i += 1
if -1 in cluster_info.keys():
    cluster_n += 1
    label: str = f"Unclustered, n = {cluster_info[-1]}"
    legend_labels.insert(0, label)
## fetch existing handles and legends (each tuple will have 2*cluster number -> 1 for each boxplot cluster, 1 for each scatterplot cluster, so I will remove the first half)
handles, labels = ax.get_legend_handles_labels()
index: int = int(cluster_n*(-1))
labels = legend_labels
plt.legend(handles[index:], labels[0:])
plt.xticks(rotation=45)
plt.show()
Just a note: Most of my time was spent debugging the melt function. I predominantly got the error "*only integer scalar arrays can be converted to a scalar index with 1D numpy indices array*". My output required me to concatenate my outcome variable value table and the clusters (DBSCAN), and I'd put extra square brackets around the cluster array in the concat method. So I had a column where each value was an invisible List[int], rather than a plain int. It's pretty niche, but maybe it'll help someone.
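A quick shape check before melting can catch this kind of problem early. The sketch below does not reproduce the exact error; it only shows the intended shapes, with `data` and `labels` as small hypothetical stand-ins for the outcome table and the DBSCAN labels.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the outcome table and the cluster labels.
data = np.array([[-1.28, 0.89], [-1.12, 1.42], [0.35, -0.47]])
labels = np.array([0, 0, -1])

# Reshape the 1-D labels into a single column so both inputs are 2-D
# with the same number of rows, then join them side by side.
combined = np.concatenate([data, labels.reshape(-1, 1)], axis=1)
cluster_data_df = pd.DataFrame(
    combined, columns=["outcome_var_1", "outcome_var_2", "cluster"]
)

# Every cell of the cluster column should be a plain scalar, not a list.
all_scalar = bool(cluster_data_df["cluster"].map(np.isscalar).all())

graph_data = pd.melt(
    cluster_data_df,
    id_vars=["cluster"],
    var_name="psychometric_test",
    value_name="standard deviations from the mean",
)
print(combined.shape, all_scalar, graph_data.shape)
```

If the combined array is not 2-D with one column more than the outcome table, or the scalar check is False, the concatenation step is the place to look.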