I am trying to plot several different things in scatter plots by having several subplots and iterating over the remaining categories, but the plots only display the first iteration without throwing any error. To clarify, here is an example of what the data actually look like:
          a kind state property   T
0  0.905618    I   dry    prop1  10
1  0.050311    I   wet    prop1  20
2  0.933696   II   dry    prop1  30
3  0.114824  III   wet    prop1  40
4  0.942719   IV   dry    prop1  50
5  0.276627   II   wet    prop2  10
6  0.612303  III   dry    prop2  20
7  0.803451   IV   wet    prop2  30
8  0.257816   II   dry    prop2  40
9  0.122468   IV   wet    prop2  50
And this is how I generated the example:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec
kinds = ['I','II','III','IV']
states = ['dry','wet']
props = ['prop1','prop2']
T = [10,20,30,40,50]
a = np.random.rand(10)
k = ['I','I','II','III','IV','II','III','IV','II','IV']
s = ['dry','wet','dry','wet','dry','wet','dry','wet','dry','wet']
p = ['prop1','prop1','prop1','prop1','prop1','prop2','prop2','prop2','prop2','prop2']
t = [10,20,30,40,50,10,20,30,40,50]
df = pd.DataFrame(index=range(10),columns=['a','kind','state','property','T'])
df['a']=a
df['kind']=k
df['state']=s
df['property']=p
df['T']=t
print(df)
Next, I am going to generate 2 rows and 2 columns of subplots, to display variabilities in property1 and property2 in wet and dry states. So I basically slice my dataframe into several smaller ones like this:
first = df[(df['state']=='dry')&(df['property']=='prop1')]
second = df[(df['state']=='wet')&(df['property']=='prop1')]
third = df[(df['state']=='dry')&(df['property']=='prop2')]
fourth = df[(df['state']=='wet')&(df['property']=='prop2')]
dfs = [first,second,third,fourth]
In each of these subplots, which correspond to particular laboratory conditions, I want to plot the values of a versus T for the different kinds of samples. To distinguish between the kinds of samples, I assign different colours and markers to them. So here is my plotting script:
fig = plt.figure(figsize=(8,8.5))
gs = gridspec.GridSpec(2,2, hspace=0.4, wspace=0.3)
colours = ['r','b','g','gold']
symbols = ['v','v','^','^']
titles=['dry 1','wet 1','dry 2','wet 2']
for no, df in enumerate(dfs):
    ax = fig.add_subplot(gs[no])
    for i, r in enumerate(kinds):
        #print i, r
        df = df[df['kind']==r]
        c = colours[i]
        m = symbols[i]
        plt.scatter(df['T'],df['a'],c=c,s=50.0, marker=m, edgecolor='k')
    ax = plt.xlabel('T')
    ax = plt.xticks(T)
    ax = plt.ylabel('A')
    ax = plt.title(titles[no],fontsize=12,alpha=0.75)
plt.show()
But the result only plots the first iteration, in this case kind I in red triangles. If I remove this first item from the iterating lists, it only plots the first variable (kind II in blue triangles).
What am I doing wrong?
The figure looks like this, but I would like to have each subplot accordingly populated with red and blue and green and gold markers.
(Please note this happens with my real data as well, so the problem should not be in the way I generate the example.)
Your problem lies in the inner for loop. By writing df = df[df['kind']==r], you overwrite df with the subset filtered for kind I. In the next iteration, where you filter for II, the already-filtered df contains no matching rows, so nothing is plotted. You get no error message because plotting an empty selection is still valid code. If you rewrite the relevant piece of code like this:
for no, df in enumerate(dfs):
    ax = fig.add_subplot(gs[no])
    for i, r in enumerate(kinds):
        #print i, r
        df2 = df[df['kind']==r]
        c = colours[i]
        m = symbols[i]
        plt.scatter(df2['T'],df2['a'],c=c,s=50.0, marker=m, edgecolor='k')
    ax = plt.xlabel('T')
    ax = plt.xticks(T)
    ax = plt.ylabel('A')
    ax = plt.title(titles[no],fontsize=12,alpha=0.75)
It should work just fine. Tested on Python 3.5.
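As an alternative that avoids rebinding df altogether, the slicing and the inner filter can both be expressed with groupby. The following is only a sketch under assumptions: the data frame is rebuilt with a seeded random generator, the style dictionary is a hypothetical name introduced here, and the Agg backend is selected so that it runs without a display.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

# Stand-in for the question's data (values of `a` are random).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.random(10),
    "kind": ["I", "I", "II", "III", "IV", "II", "III", "IV", "II", "IV"],
    "state": ["dry", "wet"] * 5,
    "property": ["prop1"] * 5 + ["prop2"] * 5,
    "T": [10, 20, 30, 40, 50] * 2,
})

# Hypothetical lookup: kind -> (colour, marker), same choices as in the question.
style = {"I": ("r", "v"), "II": ("b", "v"), "III": ("g", "^"), "IV": ("gold", "^")}

fig, axes = plt.subplots(2, 2, figsize=(8, 8.5))
# One panel per (property, state) pair; groupby hands out subsets
# without ever overwriting `df`.
for ax, ((prop, state), panel) in zip(axes.flat, df.groupby(["property", "state"])):
    for kind, sub in panel.groupby("kind"):
        colour, marker = style[kind]
        ax.scatter(sub["T"], sub["a"], c=colour, marker=marker,
                   s=50, edgecolor="k", label=kind)
    ax.set(xlabel="T", ylabel="a", title=f"{state} {prop}")
    ax.legend()
```

Each panel then contains one scatter collection per kind that actually occurs in that subset.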
I have a dataframe like so:
df = pd.DataFrame({"idx":[1,2,3]*2,"a":[1]*3+[2]*3,'b':[3]*3+[4]*3,'grp':[4]*3+[5]*3})
df = df.set_index("idx")
df
     a  b  grp
idx
1    1  3    4
2    1  3    4
3    1  3    4
1    2  4    5
2    2  4    5
3    2  4    5
and I would like to plot the values of a and b as a function of idx, making one subplot per column and one line per group.
I managed to do this by creating the axes separately and iterating over the groups, as proposed here. But I would like to use the subplots parameter of the plot function to avoid looping.
I tried solutions like
df.groupby("grp").plot(subplots=True)
But this plots the groups in different subplots, and removing the groupby does not produce the two separate lines shown in the example.
Is it possible? Also, is it better to iterate and use matplotlib's plot, or to use the pandas plot function?
IIUC, you can do something like this:
axs = df.set_index('grp', append=True)\
        .stack()\
        .unstack('grp')\
        .rename_axis(['idx','title'])\
        .reset_index('title').groupby('title').plot()
[v.set_title(f'{i}') for i, v in axs.items()]
Output:
It may be easier to simply loop and plot:
fig, ax = plt.subplots(1,2, figsize=(10,5))
ax = iter(ax)
for n, g in df.set_index('grp', append=True)\
             .stack()\
             .unstack('grp')\
             .rename_axis(['idx','title'])\
             .reset_index('title').groupby('title'):
    g.plot(ax=next(ax), title=f'{n}')
Output:
If I understood your question correctly, you can access the columns and rows of a pandas dataframe directly. An example could look like this:
import numpy as np
import matplotlib.pyplot as plt
x = np.array(df.index)  # idx is the index after set_index, not a column
a = np.array(df['a'])
b = np.array(df['b'])
plt.subplot(1,2,1)#(121) will also work
#fill in title etc. for the first plot under here
plt.plot(x,a)
plt.subplot(1,2,2)
#fill in title etc. for the second plot under here
plt.plot(x,b)
plt.show()
edit: Sorry now changed for subplot.
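For the layout asked about in this thread (one subplot per column, one line per group), a pivot per column is another option. This is a sketch, not the only way to do it: value_cols and wide are names introduced here for illustration, and the Agg backend is assumed so that it runs without a display.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

# The data frame from the question.
df = pd.DataFrame({"idx": [1, 2, 3] * 2, "a": [1] * 3 + [2] * 3,
                   "b": [3] * 3 + [4] * 3, "grp": [4] * 3 + [5] * 3})
df = df.set_index("idx")

value_cols = ["a", "b"]  # columns that each get their own subplot
fig, axes = plt.subplots(1, len(value_cols), figsize=(10, 5), sharex=True)
for ax, col in zip(axes, value_cols):
    # Rows: idx, columns: grp, so pandas draws one line per group.
    wide = df.reset_index().pivot(index="idx", columns="grp", values=col)
    wide.plot(ax=ax, title=col)
```

The loop runs over columns rather than groups, so the number of iterations stays small even with many groups.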
I would like to show all columns of a pandas dataframe as heatmaps in one window. Each column is mapped onto a self-made 24x20 square matrix that holds the 480 values of one cycle, and this is repeated for every cycle. The challenging point is that I want to show missing data with a special colour that lies outside the colour range of the colormap cmap ='coolwarm'.
I already tried df = df.replace([np.inf, -np.inf], np.nan) to convert every inf to nan, followed by df = df.replace(0,np.nan) before calling sns.heatmap(df, vmin=-1, vmax=+1, cmap ='coolwarm'). After that I can recognise missing values by their white colour, since nan/inf cells appear white with cmap ='coolwarm' in the interval [vmin=-1, vmax=+1]. However, this has 2 problems:
First, any 0 in the dataset is also shown in white like missing data, so you can't distinguish between inf/nan and 0 in the columns. Second, you can't differentiate between nan and inf values either!
I also tried passing mask=df.isnull() to sns.heatmap(), so that data is not shown in cells whose mask value is True, but according to this answer (GH375) it again covers 0. I'm not sure whether the answer mentioned here by @Scotty1-, which adds a marker and interpolates the values with newdf = newdf.interpolate(), is the right solution for my case.
Is it a good idea to filter missing data by subsetting:
import math
df = df[df['a'].apply(lambda x: math.isnan(x))]
df = df[df['a'] == float('inf')]
My script follows. In the for-loop I couldn't get the proper output: in each cycle it plots each parameter 3 times in one window. For example, it plots A on the left, then plots A again under the names B and C in the middle and on the right. Then B is plotted 3 times instead of once in the middle, and finally C is plotted 3 times, appearing in the middle and on the left as well as on the right.
import numpy as np
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt
#extract the parameters and put them in lists based on id_set
df = pd.read_csv(r'D:\SOF.TXT', header=None)  # raw string so the backslash is kept literally
id_set = df[df.index % 4 == 0].astype('int').values
a = df[df.index % 4 == 1].values
b = df[df.index % 4 == 2].values
c = df[df.index % 4 == 3].values
data = {'A': a[:,0], 'B': b[:,0], 'C': c[:,0] }
#main_data contains all the data
main_data = pd.DataFrame(data, columns=['A','B','C'], index = id_set[:,0])
#the next loop creates all plots; change the number of cycles as needed
cycles = int(len(main_data)/480)
print(cycles)
for i in main_data:
    try:
        os.mkdir(i)
    except:
        pass
    min_val = main_data[i].min()
    min_nor = -1
    max_val = main_data[i].max()
    max_nor = 1
    for cycle in range(1): #to iterate through all cycles, replace range(1) by ====> range(int(len(main_data)/480))
        count = '{:04}'.format(cycle)
        j = cycle * 480
        ordered_data = mkdf(main_data.iloc[j:j+480][i])
        csv = print_df(ordered_data)
        #Write .csv files containing the matrix of each parameter, named by cycle
        csv.to_csv(f'{i}/{i}{count}.csv', header=None, index=None)
        if 'C' in i:
            min_nor = -40
            max_nor = 150
            #Applying normalization for C between [-40,+150]
            new_value = normalize(main_data.iloc[j:j+480][i].values, min_val, max_val, -40, 150)
            n_cbar_kws = {"ticks":[-40,150,-20,0,25,50,75,100,125]}
        else:
            #Applying normalization for A,B between [-1,+1]
            new_value = normalize(main_data.iloc[j:j+480][i].values, min_val, max_val, -1, 1)
            n_cbar_kws = {"ticks":[-1.0,-0.75,-0.50,-0.25,0.00,0.25,0.50,0.75,1.0]}
        Sections = mkdf(new_value)
        df = print_df(Sections)
        #Plotting parameters by using HeatMap
        plt.figure()
        sns.heatmap(df, vmin=min_nor, vmax=max_nor, cmap ='coolwarm', cbar_kws=n_cbar_kws)
        plt.title(i, fontsize=12, color='black', loc='left', style='italic')
        plt.axis('off')
        #Save .PNG images containing the HeatMap plot of each parameter, named by cycle
        plt.savefig(f'{i}/{i}{count}.png')
        #plotting all columns ['A','B','C'] in-one-window side by side
        fig, axes = plt.subplots(nrows=1, ncols=3 , figsize=(20,10))
        plt.subplot(131)
        sns.heatmap(df, vmin=-1, vmax=1, cmap ="coolwarm", cbar=True , cbar_kws={"ticks":[-1.0,-0.75,-0.5,-0.25,0.00,0.25,0.5,0.75,1.0]})
        fig.axes[-1].set_ylabel('[MPa]', size=20) #cbar_kws={'label': 'Celsius'}
        plt.title('A', fontsize=12, color='black', loc='left', style='italic')
        plt.axis('off')
        plt.subplot(132)
        sns.heatmap(df, vmin=-1, vmax=1, cmap ="coolwarm", cbar=True , cbar_kws={"ticks":[-1.0,-0.75,-0.5,-0.25,0.00,0.25,0.5,0.75,1.0]})
        fig.axes[-1].set_ylabel('[MPa]', size=20) #cbar_kws={'label': 'Celsius'}
        #sns.despine(left=True)
        plt.title('B', fontsize=12, color='black', loc='left', style='italic')
        plt.axis('off')
        plt.subplot(133)
        sns.heatmap(df, vmin=-40, vmax=150, cmap ="coolwarm" , cbar=True , cbar_kws={"ticks":[-40,150,-20,0,25,50,75,100,125]})
        fig.axes[-1].set_ylabel('[°C]', size=20) #cbar_kws={'label': 'Celsius'}
        #sns.despine(left=True)
        plt.title('C', fontsize=12, color='black', loc='left', style='italic')
        plt.axis('off')
        plt.suptitle(f'Analysis of data in cycle Nr.: {count}', color='yellow', backgroundcolor='black', fontsize=48, fontweight='bold')
        plt.subplots_adjust(top=0.7, bottom=0.3, left=0.05, right=0.95, hspace=0.2, wspace=0.2)
        #plt.subplot_tool()
        plt.savefig(f'{i}/{i}{i}{count}.png')
        plt.show()
My data frame looks like the following:
           A         B           C
0   2.291171 -2.689658 -344.047912
10  2.176816 -4.381186 -335.936524
20  2.291171 -2.589725 -342.544885
30  2.176597 -6.360999    0.000000
40  2.577268 -1.993412 -344.326376
50  9.844076 -2.690917 -346.125859
60  2.061782 -2.889378 -346.375655
Below is a sample of my dataset from the .TXT file.
In case you want to test with missing values, change the last 3 values at the end of the text file to nan/inf, save it, and run it:
7590 7590
0 nan
7.19025828418 nan
-1738.000075 inf
I'd like to visualise a large pandas dataframe with 3 columns, columns=['A','B','C'], as heatmaps in one window. This dataframe has two types of values: strings (nan or inf) and floats.
I want the heatmap to show missing-data cells inside the square matrix in fixed colours, e.g. nan in black and inf in silver or gray, and the rest of the dataframe as a normal heatmap, with the floats on the scale of cmap ='coolwarm'.
Here is an image of the desired output when there is no nan/inf in the dataset:
I look forward to hearing from anyone who has dealt with these issues.
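One possible way to give nan and inf cells their own fixed colours while keeping 0 on the normal scale is to draw three layers per panel: the finite values with coolwarm, then the nan cells, then the inf cells. The sketch below uses plain matplotlib imshow rather than sns.heatmap, and it rests on several assumptions: heatmap_with_missing is a hypothetical helper, the data are random stand-ins, and a plain reshape(24, 20) replaces the question's own mkdf/print_df mapping, which is not shown. Each panel also receives its own matrix, so the three panels do not repeat the same data.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def heatmap_with_missing(ax, values, vmin, vmax,
                         nan_colour="black", inf_colour="silver"):
    """Draw finite cells with 'coolwarm'; paint NaN and +/-inf cells in fixed colours."""
    values = np.asarray(values, dtype=float)
    nan_mask = np.isnan(values)
    inf_mask = np.isinf(values)
    # Layer 1: finite values only (masked cells are left transparent).
    finite = np.ma.masked_where(nan_mask | inf_mask, values)
    image = ax.imshow(finite, cmap="coolwarm", vmin=vmin, vmax=vmax, aspect="auto")
    # Layers 2 and 3: a single-colour overlay wherever the mask is True.
    flat = np.ones(values.shape)
    for mask, colour in ((nan_mask, nan_colour), (inf_mask, inf_colour)):
        ax.imshow(np.ma.masked_where(~mask, flat), cmap=ListedColormap([colour]),
                  vmin=0, vmax=1, aspect="auto")
    ax.set_axis_off()
    return image

# Random stand-in data: one 480-value cycle per column, with its colour range.
rng = np.random.default_rng(0)
ranges = {"A": (-1, 1), "B": (-1, 1), "C": (-40, 150)}

fig, axes = plt.subplots(1, 3, figsize=(20, 10))
images = {}
for ax, (name, (low, high)) in zip(axes, ranges.items()):
    cycle = rng.uniform(low, high, 480)
    cycle[0] = 0.0        # a genuine zero, which must stay on the colour scale
    cycle[-2] = np.nan    # a missing value
    cycle[-1] = np.inf    # an infinite value
    matrix = cycle.reshape(24, 20)  # assumed layout; the question uses its own mapping
    images[name] = heatmap_with_missing(ax, matrix, low, high)
    ax.set_title(name, loc="left")
    fig.colorbar(images[name], ax=ax)
```

Because the masks are computed with np.isnan and np.isinf separately, zeros are never treated as missing and the two kinds of missing value stay distinguishable.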
In Python pandas I have created a dataframe with one value for each year and two subclasses, i.e., one metric per parameter triplet:
import pandas, requests, numpy
import matplotlib.pyplot as plt
df
Metric Tag_1 Tag_2 year
0 5770832 FOOBAR1 name1 2008
1 7526436 FOOBAR1 xyz 2008
2 33972652 FOOBAR1 name1 2009
3 17491416 FOOBAR1 xyz 2009
...
16 6602920 baznar2 name1 2008
17 6608 baznar2 xyz 2008
...
30 142102944 baznar2 name1 2015
31 0 baznar2 xyz 2015
I would like to produce a bar plot with Metric as y-values over x=(year,Tag_1,Tag_2), sorted primarily by year and secondarily by Tag_1, with the bars coloured according to Tag_1. Something like
(2008,FOOBAR,name1) --> 5770832 *RED*
(2008,baznar2,name1) --> 6602920 *BLUE*
(2008,FOOBAR,xyz) --> 7526436 *RED*
(2008,baznar2,xyz) --> ... *BLUE*
(2008,FOOBAR,name1) --> ... *RED*
I tried starting with a grouping of columns like
df.plot.bar(x=['year','Tag_1','Tag_2'])
but have not found a way to separate the selections into two bar sets next to each other.
This should get you on your way:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('path_to_file.csv')
# Group by the desired columns
new_df = df.groupby(['year', 'Tag_1', 'Tag_2']).sum()
# Sort by Metric (ascending by default; pass ascending=False for descending)
new_df.sort_values('Metric', inplace=True)
# Helper function generating an alternating sequence of 'r' and 'b' colors
def get_color(i):
    if i%2 == 0:
        return 'r'
    else:
        return 'b'
colors = [get_color(j) for j in range(new_df.shape[0])]
# Make the plot
fig, ax = plt.subplots()
ind = np.arange(new_df.shape[0])
width = 0.65
a = ax.barh(ind, new_df.Metric, width, color = colors) # plot a vals
ax.set_yticks(ind) # bars are centered on ind (matplotlib >= 2.0)
ax.set_yticklabels(new_df.index.values) # set them to the names
fig.tight_layout()
plt.show()
you can also do it this way:
fig, ax = plt.subplots()
df.groupby(['year', 'Tag_1', 'Tag_2']).sum().plot.barh(color=['r','b'], ax=ax)
fig.tight_layout()
plt.show()
PS: if you don't like scientific notation, you can get rid of it:
ax.get_xaxis().get_major_formatter().set_scientific(False)
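To colour the bars by Tag_1 itself rather than by position, the colour can be looked up per row after sorting. This is a sketch under assumptions: the frame below holds only the six rows that are fully visible in the question, palette is a hypothetical mapping chosen here, and the Agg backend is used so that it runs without a display.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

# Subset of the rows shown in the question.
df = pd.DataFrame({
    "Metric": [5770832, 7526436, 6602920, 6608, 33972652, 17491416],
    "Tag_1": ["FOOBAR1", "FOOBAR1", "baznar2", "baznar2", "FOOBAR1", "FOOBAR1"],
    "Tag_2": ["name1", "xyz", "name1", "xyz", "name1", "xyz"],
    "year": [2008, 2008, 2008, 2008, 2009, 2009],
})

palette = {"FOOBAR1": "red", "baznar2": "blue"}  # hypothetical colour choice

# Sort primarily by year, then Tag_1, then Tag_2; reorder the keys to taste.
ordered = df.sort_values(["year", "Tag_1", "Tag_2"]).reset_index(drop=True)
labels = [f"({y}, {t1}, {t2})"
          for y, t1, t2 in zip(ordered["year"], ordered["Tag_1"], ordered["Tag_2"])]
colours = ordered["Tag_1"].map(palette).tolist()  # one colour per bar

fig, ax = plt.subplots(figsize=(8, 4))
positions = range(len(ordered))
ax.bar(positions, ordered["Metric"], color=colours)
ax.set_xticks(positions)
ax.set_xticklabels(labels, rotation=45, ha="right")
ax.set_ylabel("Metric")
fig.tight_layout()
```

Since the colour comes from the row's own Tag_1 value, it stays correct whatever sort order is chosen.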
I have a dataframe called df that looks like this:
Qname X Y Magnitude
Bob 5 19 10
Tom 6 20 20
Jim 3 30 30
I would like to make a visual text plot of the data: each Qname should be drawn on a figure at its coordinates (X, Y), with the text size s given by Magnitude.
I have tried:
fig = plt.figure()
ax = fig.add_axes((0,0,1,1))
X = df.X
Y = df.Y
S = df.Magnitude
Name = df.Qname
ax.text(X, Y, Name, size=S, color='red', rotation=0, alpha=1.0, ha='center', va='center')
fig.show()
However nothing is showing up on my plot. Any help is greatly appreciated.
This should get you started. Matplotlib does not handle the text placement for you so you will probably need to play around with this.
import pandas as pd
import matplotlib.pyplot as plt
# replace this with your existing code to read the dataframe
df = pd.read_clipboard()
plt.scatter(df.X, df.Y, s=df.Magnitude)
# annotate the plot
# unfortunately you have to iterate over your points
# see http://stackoverflow.com/q/5147112/553404
# for better annotation options
for idx, row in df.iterrows():
    plt.annotate(row['Qname'], xy=(row['X'], row['Y']))
plt.show()
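If the goal is the names themselves drawn at size Magnitude, without markers, a sketch along the following lines may help. It assumes that ax.text is called once per row with scalar coordinates, and it sets the axis limits by hand because text artists do not take part in autoscaling. The pad value is an arbitrary choice made here.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

# The data frame from the question.
df = pd.DataFrame({"Qname": ["Bob", "Tom", "Jim"], "X": [5, 6, 3],
                   "Y": [19, 20, 30], "Magnitude": [10, 20, 30]})

fig, ax = plt.subplots()
for row in df.itertuples(index=False):
    # One text artist per row; the font size comes from Magnitude.
    ax.text(row.X, row.Y, row.Qname, fontsize=row.Magnitude,
            color="red", ha="center", va="center")

# Text does not update the data limits, so set them explicitly.
pad = 1  # arbitrary margin around the outermost points
ax.set_xlim(df["X"].min() - pad, df["X"].max() + pad)
ax.set_ylim(df["Y"].min() - pad, df["Y"].max() + pad)
```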
I am making a series of bar plots of data with two categorical variables and one numeric. What I have is below, but what I would love to do is facet by one of the categorical variables, as with facet_wrap in ggplot. I have a somewhat working example, but I get the wrong plot type (lines and not bars) and I subset the data in a loop, which can't be the best way.
## first try--plain vanilla
import pandas as pd
import numpy as np
N = 100
## generate toy data
ind = np.random.choice(['a','b','c'], N)
cty = np.random.choice(['x','y','z'], N)
jobs = np.random.randint(low=1,high=250,size=N)
## prep data frame
df_city = pd.DataFrame({'industry':ind,'city':cty,'jobs':jobs})
df_city_grouped = df_city.groupby(['city','industry']).jobs.sum().unstack()
df_city_grouped.plot(kind='bar',stacked=True,figsize=(9, 6))
This gives something like this:
  city industry  jobs
0    z        b   180
1    z        c   121
2    x        a    33
3    z        a   121
4    z        c   236
However, what I would like to see is something like this:
## R code
library(plyr)
df_city<-read.csv('/home/aksel/Downloads/mockcity.csv',sep='\t')
## summarize
df_city_grouped <- ddply(df_city, .(city,industry), summarise, jobstot = sum(jobs))
## plot
ggplot(df_city_grouped, aes(x=industry, y=jobstot)) +
geom_bar(stat='identity') +
facet_wrap(~city)
The closest I get with matplotlib is something like this:
import matplotlib.pyplot as plt
cols =df_city.city.value_counts().shape[0]
fig, axes = plt.subplots(1, cols, figsize=(8, 8))
for x, city in enumerate(df_city.city.value_counts().index.values):
    data = df_city[(df_city['city'] == city)]
    data = data.groupby(['industry']).jobs.sum()
    axes[x].plot(data)
So two questions:
1. Can I do bar plots (they plot lines as shown here) using the AxesSubplot object and end up with something along the lines of the facet_wrap example from ggplot?
2. In loops generating charts such as this attempt, I subset the data in each iteration. I can't imagine that is the 'proper' way to do this type of faceting?
Second example here: http://pandas-docs.github.io/pandas-docs-travis/visualization.html#bar-plots
Anyway, you can always do that by hand, as you did yourself.
EDIT:
BTW, you can always use rpy2 in python, so you can do all the same things as in R.
Also, have a look at this: https://pandas.pydata.org/pandas-docs/version/0.14.1/rplot.html
I am not sure, but it should be helpful for creating plots over many panels, though might require further reading.
@tcasell suggested the bar call in the loop. Here is a working, if not elegant, example.
## second try--facet by city
import matplotlib.pyplot as plt
N = 100
industry = ['a','b','c']
city = ['x','y','z']
ind = np.random.choice(industry, N)
cty = np.random.choice(city, N)
jobs = np.random.randint(low=1,high=250,size=N)
df_city =pd.DataFrame({'industry':ind,'city':cty,'jobs':jobs})
## how many panels do we need?
cols =df_city.city.value_counts().shape[0]
fig, axes = plt.subplots(1, cols, figsize=(8, 8))
for x, city in enumerate(df_city.city.value_counts().index.values):
    data = df_city[(df_city['city'] == city)]
    data = data.groupby(['industry']).jobs.sum()
    print (data)
    print(type(data.index))
    left= [k[0] for k in enumerate(data)]
    right= [k[1] for k in enumerate(data)]
    axes[x].bar(left,right,label="%s" % (city))
    axes[x].set_xticks(left, minor=False)
    axes[x].set_xticklabels(data.index.values)
    axes[x].legend(loc='best')
    axes[x].grid(True)
fig.suptitle('Employment By Industry By City', fontsize=20)
The Seaborn library, which is built on Matplotlib, has flexible and powerful options for facet plots, and it even uses terminology similar to R's. Scroll down on this page for multiple examples.
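Without any extra library, the pandas plot method can also produce one bar panel per city through its subplots and layout parameters, which avoids the manual subsetting loop. The following is a sketch under assumptions: the toy data are regenerated with a seeded generator, totals is a name introduced here, and the Agg backend is used so that it runs without a display.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

# Toy data in the same shape as the question's df_city.
rng = np.random.default_rng(0)
N = 100
df_city = pd.DataFrame({
    "industry": rng.choice(["a", "b", "c"], N),
    "city": rng.choice(["x", "y", "z"], N),
    "jobs": rng.integers(1, 250, N),
})

# Rows: industry, columns: city, so each column becomes one panel.
totals = df_city.groupby(["industry", "city"])["jobs"].sum().unstack("city")
axes = totals.plot(kind="bar", subplots=True, layout=(1, totals.shape[1]),
                   sharey=True, legend=False, figsize=(9, 4))
```

Each panel is titled with its city, and sharey=True keeps the job totals comparable across panels, much like the facet_wrap default in ggplot.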