I am having an issue plotting two dataframes. One has 20711 entries, the other has 20710 entries. I am using plot(x, y) to plot like this:
import pandas as pd
import csv
import matplotlib.pyplot as plt
fig1 = plt.figure(figsize= (10,10))
ax = fig1.add_subplot(111)
ax.plot(X, Y)
Both are dataframes that were read from a CSV and have this structure:
print(X)
0 -2.343060
1 -2.445431
2 -2.335754
3 -2.478535
4 -2.527026
print(Y)
0 0.026940
1 -0.075431
2 0.024246
3 -0.118535
4 -0.167026
5 -0.145475
I keep getting this error:
ValueError: x and y must have same first dimension
How do I fix it so that it ignores the last entry?
Well if you can just ditch the last value of Y then the following should work, assuming you have the index in your dataframe too, that is, your csv looks like this:
0,-2.343060
1,-2.445431
2,-2.335754
3,-2.478535
4,-2.527026
and you loaded it like X=pandas.read_csv('x.csv'), then
ax.plot(X.to_numpy().T[1], Y.to_numpy().T[1][:-1])
should work (as_matrix() was removed in pandas 1.0; to_numpy() is its replacement).
As you mentioned in your comment, the overlap varies, so slice Y to the length of X instead:
ax.plot(X.to_numpy().T[1], Y.to_numpy().T[1][:len(X)])
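If you don't know in advance which of the two is longer, one option is to trim both to the shorter length before plotting. This is only a sketch: the two Series below are hypothetical stand-ins built from the values printed in the question, and x_trim / y_trim are names made up for the example.

```python
import pandas as pd

# Hypothetical stand-ins for the two columns read from the CSV files.
X = pd.Series([-2.343060, -2.445431, -2.335754, -2.478535, -2.527026])
Y = pd.Series([0.026940, -0.075431, 0.024246, -0.118535, -0.167026, -0.145475])

# Trim both to the length of the shorter one, whichever that is.
n = min(len(X), len(Y))
x_trim = X.iloc[:n].to_numpy()
y_trim = Y.iloc[:n].to_numpy()

# Both arrays now have the same first dimension,
# so ax.plot(x_trim, y_trim) would no longer raise the ValueError.
print(len(x_trim), len(y_trim))
```

Note that this silently drops the trailing values, so it is only appropriate if the extra entries really are safe to ignore.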
I have a dataframe like so:
df = pd.DataFrame({"idx":[1,2,3]*2,"a":[1]*3+[2]*3,'b':[3]*3+[4]*3,'grp':[4]*3+[5]*3})
df = df.set_index("idx")
df
a b grp
idx
1 1 3 4
2 1 3 4
3 1 3 4
1 2 4 5
2 2 4 5
3 2 4 5
and I would like to plot the values of a and b as a function of idx, making one subplot per column and one line per group.
I managed to do this by creating the axes separately and iterating over the groups, as proposed here. But I would like to use the subplots parameter of the plot function to avoid looping.
I tried solutions like
df.groupby("grp").plot(subplots=True)
But it plots the groups in different subplots, and removing the groupby does not produce the two separate lines as in the example.
Is it possible? Also, is it better to iterate and use matplotlib's plot, or to use the pandas plot function?
IIUC, you can do something like this:
axs = df.set_index('grp', append=True)\
.stack()\
.unstack('grp')\
.rename_axis(['idx','title'])\
.reset_index('title').groupby('title').plot()
[v.set_title(f'{i}') for i, v in axs.items()]
Output:
Maybe easier to simply loop and plot:
fig, ax = plt.subplots(1,2, figsize=(10,5))
ax = iter(ax)
for n, g in df.set_index('grp', append=True)\
             .stack()\
             .unstack('grp')\
             .rename_axis(['idx','title'])\
             .reset_index('title').groupby('title'):
    g.plot(ax=next(ax), title=f'{n}')
Output:
If I understood your question correctly, you can access the columns and rows of a pandas dataframe directly. An example:
import numpy as np
import matplotlib.pyplot as plt
x = np.array(df.index)  # idx is the index after set_index("idx"), so df['idx'] would raise a KeyError
a = np.array(df['a'])
b = np.array(df['b'])
plt.subplot(1,2,1)  # plt.subplot(121) will also work
# fill in the title etc. for the first plot here
plt.plot(x,a)
plt.subplot(1,2,2)
# fill in the title etc. for the second plot here
plt.plot(x,b)
plt.show()
edit: Sorry, now changed to use subplot.
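For the layout asked about in the question (one subplot per column, one line per group), a short loop over the value columns with a pivot is another option. This is a sketch rather than the only way to do it: value_cols and wide are names made up here, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# The dataframe from the question.
df = pd.DataFrame({"idx": [1, 2, 3] * 2, "a": [1] * 3 + [2] * 3,
                   "b": [3] * 3 + [4] * 3, "grp": [4] * 3 + [5] * 3})
df = df.set_index("idx")

value_cols = ["a", "b"]  # one subplot per value column
fig, axes = plt.subplots(1, len(value_cols), figsize=(10, 5))
for ax, col in zip(axes, value_cols):
    # Rows become idx, columns become the groups, so each group is one line.
    wide = df.reset_index().pivot(index="idx", columns="grp", values=col)
    wide.plot(ax=ax, title=col)
```

The loop is over columns only, not over groups, which keeps it short even when there are many groups.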
(This is a self-answered post to help others shorten their answers to plotly questions by not having to explain how plotly best handles data of long and wide format)
I'd like to build a plotly figure based on a pandas dataframe in as few lines as possible. I know you can do that using plotly.express, but this fails for what I would call a standard pandas dataframe; an index describing row order, and column names describing the names of a value in a dataframe:
Sample dataframe:
a b c
0 100.000000 100.000000 100.000000
1 98.493705 99.421400 101.651437
2 96.067026 98.992487 102.917373
3 95.200286 98.313601 102.822664
4 96.691675 97.674699 102.378682
An attempt:
fig=px.line(x=df.index, y = df.columns)
This raises an error:
ValueError: All arguments should have the same length. The length of argument y is 3, whereas the length of previous arguments ['x'] is 100
Here you've tried to use a pandas dataframe of a wide format as a source for px.line.
And plotly.express is designed to be used with dataframes of a long format, often referred to as tidy data (and please take a look at that. No one explains it better than Wickham). Many, particularly those injured by years of battling with Excel, often find it easier to organize data in a wide format. So what's the difference?
Wide format:
data is presented with each different data variable in a separate column
each column has only one data type
missing values are often represented by np.nan
works best with plotly.graph_objects (go)
lines are often added to a figure using fig.add_traces()
colors are normally assigned to each trace
Example:
a b c
0 -1.085631 0.997345 0.282978
1 -2.591925 0.418745 1.934415
2 -5.018605 -0.010167 3.200351
3 -5.885345 -0.689054 3.105642
4 -4.393955 -1.327956 2.661660
5 -4.828307 0.877975 4.848446
6 -3.824253 1.264161 5.585815
7 -2.333521 0.328327 6.761644
8 -3.587401 -0.309424 7.668749
9 -5.016082 -0.449493 6.806994
Long format:
data is presented with one column containing all the values and another column listing the context of the value
missing values are simply not included in the dataset.
works best with plotly.express (px)
colors are set by a default color cycle and are assigned to each unique variable
Example:
id variable value
0 0 a -1.085631
1 1 a -2.591925
2 2 a -5.018605
3 3 a -5.885345
4 4 a -4.393955
... ... ... ...
295 95 c -4.259035
296 96 c -5.333802
297 97 c -6.211415
298 98 c -4.335615
299 99 c -3.515854
How to go from wide to long?
df = pd.melt(df, id_vars='id', value_vars=df.columns[:-1])
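To see what that melt call does to the shape of the data, here is a small sketch on a 5-row frame built the same way as the sample data in this post. The names wide and long_df are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Small wide-format frame: one column per variable, plus an id column.
np.random.seed(123)
wide = pd.DataFrame(np.random.randn(5, 3), columns=["a", "b", "c"]).cumsum()
wide["id"] = wide.index

# Long format: one row per (id, variable) pair.
long_df = pd.melt(wide, id_vars="id", value_vars=["a", "b", "c"])

print(wide.shape)             # 5 rows, 4 columns
print(long_df.shape)          # 15 rows, 3 columns
print(list(long_df.columns))  # id, variable, value
```

Every value column of the wide frame ends up stacked in the value column of the long frame, with the original column name recorded in variable.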
The two snippets below will produce the very same plot:
How to use px to plot long data?
fig = px.line(df, x='id', y='value', color='variable')
How to use go to plot wide data?
colors = px.colors.qualitative.Plotly
fig = go.Figure()
fig.add_traces(go.Scatter(x=df['id'], y = df['a'], mode = 'lines', line=dict(color=colors[0])))
fig.add_traces(go.Scatter(x=df['id'], y = df['b'], mode = 'lines', line=dict(color=colors[1])))
fig.add_traces(go.Scatter(x=df['id'], y = df['c'], mode = 'lines', line=dict(color=colors[2])))
fig.show()
By the looks of it, go is more complicated and offers perhaps more flexibility? Well, yes. And no. You can easily build a figure using px and add any go object you'd like!
Complete go snippet:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
# dataframe of a wide format
np.random.seed(123)
X = np.random.randn(100,3)
df=pd.DataFrame(X, columns=['a','b','c'])
df=df.cumsum()
df['id']=df.index
# plotly.graph_objects
colors = px.colors.qualitative.Plotly
fig = go.Figure()
fig.add_traces(go.Scatter(x=df['id'], y = df['a'], mode = 'lines', line=dict(color=colors[0])))
fig.add_traces(go.Scatter(x=df['id'], y = df['b'], mode = 'lines', line=dict(color=colors[1])))
fig.add_traces(go.Scatter(x=df['id'], y = df['c'], mode = 'lines', line=dict(color=colors[2])))
fig.show()
Complete px snippet:
import numpy as np
import pandas as pd
import plotly.express as px
from plotly.offline import iplot
# dataframe of a wide format
np.random.seed(123)
X = np.random.randn(100,3)
df=pd.DataFrame(X, columns=['a','b','c'])
df=df.cumsum()
df['id']=df.index
# dataframe of a long format
df = pd.melt(df, id_vars='id', value_vars=df.columns[:-1])
# plotly express
fig = px.line(df, x='id', y='value', color='variable')
fig.show()
I'm going to add this as an answer so it is on record.
First of all, thank you @vestland for this. It's a question that comes up over and over, so it's good to have it addressed, and it should make it easier to flag duplicate questions.
Plotly Express now accepts wide-form and mixed-form data
as you can check in this post.
You can change the pandas plotting backend to use plotly:
import pandas as pd
pd.options.plotting.backend = "plotly"
Then, to get a fig all you need to write is:
fig = df.plot()
fig.show() displays the above image.
I am VERY new to the world of python/pandas/matplotlib, but I have been using it recently to create box and whisker plots. I was curious how to create a box and whisker plot for each sheet using a specific column of data, i.e. I have 17 sheets, and I have column called HMB and DV on each sheet. I want to plot 17 data sets on a Box and Whisker for HMB and another 17 data sets on the DV plot. Below is what I have so far.
I can open the file, and get all the sheets into list_dfs, but then don't know where to go from there. I was going to try and manually slice each set (as I started below before coming here for help), but when I have more data in the future, I don't want to have to do that by hand. Any help would be greatly appreciated!
import pandas as pd
import numpy as np
import xlrd
import matplotlib.pyplot as plt
%matplotlib inline
from pandas import ExcelWriter
from pandas import ExcelFile
from pandas import DataFrame
excel_file = 'Project File Merger.xlsm'
list_dfs = []
xls = xlrd.open_workbook(excel_file,on_demand=True)
for sheet_name in xls.sheet_names():
    df = pd.read_excel(excel_file,sheet_name)
    list_dfs.append(df)
d_psppm = {}
for i, sheet_name in enumerate(xls.sheet_names()):
    df = pd.read_excel(excel_file,sheet_name)
    d_psppm["PSPPM" + str(i)] = df.loc[:,['PSPPM']]
values_list = list(d_psppm.values())
print(values_list[:])
A sample output looks like below, for 17 list entries, but with different number of rows for each.
PSPPM
0 0.246769
1 0.599589
2 0.082420
3 0.250000
4 0.205140
5 0.850000,
PSPPM
0 0.500887
1 0.475255
2 0.472711
3 0.412953
4 0.415883
5 0.703716,...
The next thing I want to do is create a box and whisker plot, 1 plot with 17 box and whiskers. I am not sure how to get the dictionary to plot with the values and indices as the name. I have tried to dig, and figure out how to convert the dictionary to a list and then plot each element in the list, but have had no luck.
Thanks for the help!
I agree with @Alex that forming your columns into a new DataFrame and then plotting from that would be a good approach. However, if you're going to use the dict, then it should look something like this. Depending on the version of Python you're using, the dictionary may be unordered (insertion order is only guaranteed from Python 3.7), so if the ordering on the plot is important to you, you might want to create a list of dictionary keys in the order you want and iterate over that instead:
import matplotlib.pyplot as plt
import numpy as np
#colours = []#list of colours here, if you want
#markers = []#list of markers here, if you want
fig, ax = plt.subplots()
for idx, k in enumerate(d_psppm, 1):
    data = d_psppm[k]
    jitter = np.random.normal(0, 0.1, data.shape[0]) + idx
    ax.scatter(jitter,
               data,
               s=25,#size of the marker
               c="r",#colour, could be from colours
               alpha=0.35,#opacity, 1 being solid
               marker="^",#or ref. to markers, e.g. markers[idx]
               edgecolors="none"#removes black border
               )
As per Alex's suggestion, you could use the data to create a seaborn boxplot and overlay a swarmplot to show the data (depends on how many rows each has whether this is practical).
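For the box-and-whisker plot itself, matplotlib's boxplot accepts a list of arrays of different lengths, so the dictionary can be passed more or less directly. A minimal sketch, assuming a dictionary shaped like d_psppm in the question; the two entries below are hypothetical stand-ins (values taken from the sample output, lengths made up), and the Agg backend is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the dict of single-column DataFrames, one per sheet.
d_psppm = {
    "PSPPM0": pd.DataFrame({"PSPPM": [0.246769, 0.599589, 0.082420,
                                      0.250000, 0.205140, 0.850000]}),
    "PSPPM1": pd.DataFrame({"PSPPM": [0.500887, 0.475255, 0.472711, 0.412953]}),
}

labels = list(d_psppm)  # fix the plotting order explicitly
data = [d_psppm[k]["PSPPM"].dropna().to_numpy() for k in labels]

fig, ax = plt.subplots()
result = ax.boxplot(data)  # one box per array, at positions 1..N
ax.set_xticks(range(1, len(labels) + 1))
ax.set_xticklabels(labels)
```

The same loop would work for the HMB and DV columns by swapping the column name.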
I am extremely new to coding, so I appreciate any help I can get.
I have a large data file that I want to create multiple plots for where the first column is the x axis for all of them. The code would ideally then iterate through all the columns with each respectively being the new y axis. I included my code for the individual plots, but want to create a loop to do it for all the columns.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
X = df[:,0]
col_1= df[:,1]
plt.plot(X,col_1)
plt.show()
col_2= df[:,2]
plt.plot(X,col_2)
plt.show()
Pandas will iterate over all the columns for you. Simply place the x column in the index and then call plot on your dataframe. Pandas uses the index as the x-axis, so there is no need to use matplotlib directly. Here is some fake data with a plot:
df = pd.DataFrame(np.random.rand(10,5), columns=['x', 'y1', 'y2', 'y3', 'y4'])
df = df.sort_values('x')
x y1 y2 y3 y4
9 0.262202 0.417279 0.075722 0.547804 0.599150
5 0.314894 0.611873 0.880390 0.282140 0.513770
8 0.406541 0.933734 0.879495 0.500626 0.527526
2 0.407636 0.550611 0.646449 0.635693 0.807088
1 0.437580 0.194937 0.501611 0.949575 0.409130
4 0.497347 0.443345 0.658259 0.457635 0.851847
3 0.500726 0.569175 0.304910 0.151071 0.678991
6 0.547433 0.512125 0.539995 0.701858 0.358552
0 0.783461 0.649381 0.320577 0.107062 0.840443
7 0.793702 0.951807 0.938635 0.526010 0.098321
df.set_index('x').plot(subplots=True)
You could loop through each column, plotting it on its own subplot, like so (this uses df.iloc, assuming df is a pandas DataFrame; the df[:,0] form from the question only works on a NumPy array):
import matplotlib.pyplot as plt
fig, ax = plt.subplots(df.shape[1]-1, sharex=True)
for i in range(df.shape[1]-1):
    ax[i].plot(df.iloc[:,0], df.iloc[:,i+1])
plt.show()
edit
I just realized your example was displaying 1 plot at a time. You could accomplish that like this:
import matplotlib.pyplot as plt
for i in range(df.shape[1]-1):
    plt.plot(df.iloc[:,0], df.iloc[:,i+1])
    plt.show()
    plt.close()
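If the data is in a DataFrame, looping over column names instead of integer positions may read a little more clearly. A sketch under that assumption; the frame below is random fake data, and x_name / y_names are names invented for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Fake data: first column is x, the remaining columns are the y series.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 4)), columns=["x", "y1", "y2", "y3"])
df = df.sort_values("x")

x_name = df.columns[0]
y_names = list(df.columns[1:])

# One subplot per y column, all sharing the same x-axis.
fig, axes = plt.subplots(len(y_names), sharex=True)
for ax, name in zip(axes, y_names):
    ax.plot(df[x_name], df[name])
    ax.set_ylabel(name)
```

Labelling each subplot with the column name makes it easier to tell the plots apart once there are many columns.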
I am a newbie to matplotlib. I am trying to plot a step function and am having some trouble. Right now I am able to read from the file and plot it as shown below. But the graph at the top is not in steps and the one below is not a proper step. I saw examples that plot a step function by giving x & y values. I am not sure how to do it by reading from a file though. Can someone help me?
from pylab import plotfile, show, gca
import matplotlib.pyplot as plt
import matplotlib.cbook as cbook
fname = cbook.get_sample_data('sample.csv', asfileobj=False)
plotfile(fname, cols=(0,1), delimiter=' ')
plotfile(fname, cols=(0,2), newfig=False, delimiter=' ')
plt.show()
Sample inputs(3 columns):
27023927 3 0
27023938 2 0
27023949 3 0
27023961 2 0
27023972 3 0
27023984 2 0
27023995 3 0
27024007 2 0
27024008 2 1
27024018 3 1
27024030 2 1
27024031 2 0
27024041 3 0
27024053 2 0
27024054 2 1
27024098 2 0
Note: I have made the y-axis1 values as 3 & 2 so that this graph can occur in the top and another y-axis2 values 0 & 1 so that it comes in the bottom as shown below
Waveform as it looks now
Essentially your resolution is too low: in the lower plot, each step (except the last one) occurs over 1 unit in x, while the spacing between steps is about an order of magnitude larger. This gives the appearance of steps, but if you zoom in you will see that the vertical lines have a finite gradient (true steps change with an infinite gradient).
This is the same problem for both the top and bottom plots. We can easily remedy this by using the step function. You will generally find it easier to import the data, in this example I use the powerful numpy genfromtxt. This loads the data as an array data:
import numpy as np
import matplotlib.pylab as plt
data = np.genfromtxt('test.csv', delimiter=" ")
ax1 = plt.subplot(2,1,1)
ax1.step(data[:,0], data[:,1])
ax2 = plt.subplot(2,1,2)
ax2.step(data[:,0], data[:,2])
plt.show()
If you are new to Python, there are two things worth mentioning. First, we use two subplots (ax1 and ax2) to plot the data rather than plotting on the same axes (this means you wouldn't need to offset the values to separate them spatially). Second, we access the elements of the array through [], which takes [row, column], with : meaning all rows and the index i selecting the ith column.
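As a quick check of how that array indexing behaves, here is a tiny sketch using three rows copied from the sample input. The variable names are made up for the illustration.

```python
import numpy as np

# A few rows from the sample input: timestamp, signal 1, signal 2.
data = np.array([
    [27023927, 3, 0],
    [27023938, 2, 0],
    [27024008, 2, 1],
])

first_column = data[:, 0]  # every row, column 0 (the x values)
second_row = data[1, :]    # row 1, every column
single_value = data[2, 2]  # row 2, column 2

print(first_column)
print(second_row)
print(single_value)
```

So data[:,1] and data[:,2] in the step calls are the two signal columns, each plotted against the timestamps in data[:,0].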
I would propose to load the data to a numpy array
import numpy as np
data = np.loadtxt('sample.csv')
And then plot it:
# first point
ax = [data[0,0]]
ay = [data[0,1]]
for i in range(1, data.shape[0]):
    if ay[-1] != data[i,1]: # if y value has changed
        # add current x and old y
        ax.append(data[i,0])
        ay.append(ay[-1])
    # add current x and current y
    ax.append(data[i,0])
    ay.append(data[i,1])
import matplotlib.pyplot as plt
plt.plot(ax,ay)
plt.show()
Where my solution differs from yours is that I plot two points for every change in y. The two points produce the 90 degree bend. I only plot the first curve; change [?,1] to [?,2] for the second one.
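The same "two points per change" idea can also be written without a loop using np.repeat. This is a sketch of a variant, not the code above: it duplicates every point rather than only the ones where y changes, which traces the same line. x_step and y_step are names made up here.

```python
import numpy as np

# Three points from the sample input (timestamp, first signal).
x = np.array([27023927, 27023938, 27023949])
y = np.array([3, 2, 3])

# Hold each y value until the next x, then jump:
# every x is doubled (minus the first copy), every y is doubled (minus the last copy).
x_step = np.repeat(x, 2)[1:]
y_step = np.repeat(y, 2)[:-1]

print(x_step)
print(y_step)
```

Plotting x_step against y_step with plt.plot then gives horizontal segments joined by vertical jumps.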
Thanks for the suggestions. I was able to plot it after some research and here is my code,
import csv
import datetime
import matplotlib.pyplot as plt
import numpy as np
import dateutil.relativedelta as rd
import bisect
import scipy as sp
fname = "output.csv"
portfolio_list = []
x = []
a = []
b = []
portfolio = csv.DictReader(open(fname, "r"))
portfolio_list.extend(portfolio)
for data in portfolio_list:
    x.append(data['i'])
    a.append(data['a'])
    b.append(data['b'])
stepList = [0, 1,2,3]
fig = plt.figure(figsize=(20, 10))
ax = fig.add_subplot(111)
plt.step(x, a, 'g', where='post')
plt.step(x, b, 'r', where='post')
plt.show()
and got the image (not shown here).
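One thing to watch with csv.DictReader is that every field comes back as a string, and matplotlib may then treat the values as category labels rather than numbers. A minimal sketch of converting while reading, assuming integer columns named i, a and b as in the code above; the in-memory sample stands in for output.csv.

```python
import csv
import io

# In-memory stand-in for output.csv, with the assumed header i,a,b.
sample = io.StringIO("i,a,b\n27023927,3,0\n27023938,2,0\n27024008,2,1\n")

x, a, b = [], [], []
for row in csv.DictReader(sample):
    # Convert each field from str to int before storing it.
    x.append(int(row["i"]))
    a.append(int(row["a"]))
    b.append(int(row["b"]))

print(x, a, b)
```

With numeric lists, plt.step(x, a, where='post') places the steps at their true x positions.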