This seems like a very simple thing but I can't make it work. I have a pandas DataFrame like this http://prntscr.com/ko8lyd and I now want to plot one column on the X-axis and another column on the Y-axis. Here is what I tried:
import matplotlib.pyplot as plt
x = ATR_7
y = Vysledek
plt.scatter(x,y)
plt.show()
This is the error I am getting:
<ipython-input-116-5ead5868ec87> in <module>()
1 import matplotlib.pyplot as plt
----> 2 x = ATR_7
3 y = Vysledek
4 plt.scatter(x,y)
5 plt.show()
Where am I going wrong?
You just need:
df.plot.scatter('ATR_7','Vysledek')
Where df is the name of your dataframe. There's no need to use matplotlib.
You are trying to use undefined variables. ATR_7 is a name of a column inside your dataframe, it is not known to the rest of the world.
Try something like:
df.plot.scatter(x='ATR_7', y='Vysledek')
assuming your dataframe name is df
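The error can be reproduced in a few lines. The sketch below assumes a small made-up DataFrame (the values are illustrative, not the asker's data) and shows that a bare name such as ATR_7 is looked up as a variable, while the column itself is only reachable through df:

```python
import pandas as pd

# Hypothetical stand-in for the asker's data; the values are made up.
df = pd.DataFrame({"ATR_7": [0.5, 1.2, 0.9], "Vysledek": [10, -4, 7]})

try:
    x = ATR_7  # a bare name: Python looks for a variable, not a column
except NameError as exc:
    error_message = str(exc)  # the message names the undefined variable

# Columns live inside the DataFrame, so they are looked up by label.
x = df["ATR_7"]
y = df["Vysledek"]
```

Once x and y hold the column values, they can be passed to plt.scatter(x, y) as in the question.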
If you want to use matplotlib, you need to pass the actual x and y values (for example as lists built from the columns) to plt.scatter:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
%matplotlib inline
x = list(df['ATR_7']) # set x axis by creating a list
y = list(df['Vysledek']) # set y axis by creating a list
plt.scatter(x,y)
It seems there were two issues in your code. First, the column names were not in quotes, so Python treats them as variable names rather than as strings (column names are strings). Second, matplotlib's scatter takes arrays of values as input, not just column names, so the easiest way to plot columns by name is to use the pandas plotting functions.
First, let's load modules and create the data
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
d = {'ATR_7' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
'Vysledek' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
Then, you can either use pandas plotting as in
x = 'ATR_7'
y = 'Vysledek'
df.plot.scatter(x,y)
Or plain-old matplotlib plotting as in
x = df['ATR_7']
y = df['Vysledek']
plt.scatter(x,y)
Scatter does not know which data to use. You need to provide it with the data.
x = "ATR_7"
y = "Vysledek"
plt.scatter(x,y, data=df)
under the assumption that df is your dataframe and has columns named "ATR_7" and "Vysledek".
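As a rough check that passing values and passing labels with data= are equivalent, the sketch below draws the same made-up DataFrame both ways and compares the plotted points. The sample values and the Agg backend (chosen so it runs without a display) are assumptions for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up sample data with the column names from the question.
df = pd.DataFrame({"ATR_7": [1.0, 2.0, 3.0], "Vysledek": [2.0, 4.0, 6.0]})

fig, (ax_values, ax_labels) = plt.subplots(1, 2)

# Option 1: pass the column values themselves.
pts_values = ax_values.scatter(df["ATR_7"], df["Vysledek"])

# Option 2: pass column labels and let matplotlib look them up in `data`.
pts_labels = ax_labels.scatter("ATR_7", "Vysledek", data=df)

# Both calls should place the markers at the same coordinates.
same_points = np.allclose(
    np.asarray(pts_values.get_offsets()),
    np.asarray(pts_labels.get_offsets()),
)
```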
I'm trying to do a line plot with one line per column. My dataset looks like this:
I'm using this code, but it's giving me the following error:
ValueError: Wrong number of items passed 3, placement implies 27
plot_x = 'bill__effective_due_date'
plot_y = ['RR_bucket1_perc', 'RR_bucket7_perc', 'RR_bucket14_perc']
ax = sns.pointplot(x=plot_x, y=plot_y, data=df_rollrates_plot, marker="o", palette=sns.color_palette("coolwarm"))
display(ax.figure)
Maybe it's a silly question, but I'm new to Python so I'm not sure how to do this. This is my expected output:
Thanks!!
You can plot the dataframe as follows (edit: I updated the code below to make bill__effective_due_date the index of the dataframe):
import matplotlib.pyplot as plt  # needed for plt.grid() below
import seaborn as sns
import pandas as pd
rr1 = [20,10,2,10,2,5]
rr7 = [17,8,2,8,2,4]
rr14 = [12,5,2,5,2,3]
x = ['Nov-1','Nov2','Nov-3','Nov-4','Nov-5','Nov-6']
df_rollrates_plot = pd.DataFrame({'RR_bucket1_perc':rr1,
'RR_bucket7_perc':rr7,
'RR_bucket14_perc':rr14})
df_rollrates_plot.index = x
df_rollrates_plot.index.name = 'bill__effective_due_date'
sns.lineplot(data=df_rollrates_plot)
plt.grid()
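If seaborn is not available, a similar one-line-per-column plot can be sketched with the pandas plotting wrapper alone. This is an alternative to the seaborn call above, not a replacement for it; the data are the sample values from this answer and the Agg backend is an assumption so the snippet runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

df_rollrates_plot = pd.DataFrame(
    {
        "RR_bucket1_perc": [20, 10, 2, 10, 2, 5],
        "RR_bucket7_perc": [17, 8, 2, 8, 2, 4],
        "RR_bucket14_perc": [12, 5, 2, 5, 2, 3],
    },
    index=pd.Index(
        ["Nov-1", "Nov-2", "Nov-3", "Nov-4", "Nov-5", "Nov-6"],
        name="bill__effective_due_date",
    ),
)

# With the date column as the index, DataFrame.plot draws one line per column.
ax = df_rollrates_plot.plot(marker="o")
ax.grid(True)

line_labels = [line.get_label() for line in ax.get_lines()]
```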
Your data is in the wrong shape to take advantage of the hue parameter in seaborn's lineplot. You need to stack it so that the columns become categorical values.
import pandas as pd
import seaborn as sns
rr1 = [20,10,2,10,2,5]
rr7 = [17,8,2,8,2,4]
rr14 = [12,5,2,5,2,3]
x = ['Nov-1','Nov2','Nov-3','Nov-4','Nov-5','Nov-6']
df = pd.DataFrame({'bill_effective_due_date':x,
'RR_bucket1_perc':rr1,
'RR_bucket7_perc':rr7,
'RR_bucket14_perc':rr14})
# This is where you are reshaping your data to make it work like you want
df = df.set_index('bill_effective_due_date').stack().reset_index()
df.columns=['bill_effective_due_date','roll_rates_perc','roll_rates']
sns.lineplot(data=df, x='bill_effective_due_date',y='roll_rates', hue='roll_rates_perc', marker='o')
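The same wide-to-long reshape can also be sketched with DataFrame.melt instead of set_index/stack/reset_index. The snippet below uses a shortened copy of the sample data and only shows the reshaping step; note that melt orders the rows by column name first, whereas stack orders them by date:

```python
import pandas as pd

# Shortened copy of the sample data from the answer above.
wide = pd.DataFrame({
    "bill_effective_due_date": ["Nov-1", "Nov-2", "Nov-3"],
    "RR_bucket1_perc": [20, 10, 2],
    "RR_bucket7_perc": [17, 8, 2],
    "RR_bucket14_perc": [12, 5, 2],
})

# Each RR_* column becomes a category in 'roll_rates_perc';
# its numbers go into 'roll_rates'.
long = wide.melt(
    id_vars="bill_effective_due_date",
    var_name="roll_rates_perc",
    value_name="roll_rates",
)
```

The resulting frame has the same three columns as the stacked one, so it can be passed to sns.lineplot with the same x, y and hue arguments.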
I have a pandas dataframe df for which I plot a multi-histogram as follows:
df.hist(bins=20)
This gives me a result that looks like this (yes, this example is ugly since there is only one data point per histogram, sorry):
I have a subplot for each numerical column of my dataframe.
Now I want all my histograms to have an X-axis between 0 and 1. I saw that the hist() function takes an ax parameter, but I cannot manage to make it work.
How is it possible to do that?
EDIT:
Here is a minimal example:
import pandas as pd
import matplotlib.pyplot as plt
myArray = [(0,0,0,0,0.5,0,0,0,1),(0,0,0,0,0.5,0,0,0,1)]
myColumns = ['col1','col2','col3','co4','col5','col6','col7','col8','col9']
df = pd.DataFrame(myArray,columns=myColumns)
print(df)
df.hist(bins=20)
plt.show()
Here is a solution that works, but for sure is not ideal:
import pandas as pd
import matplotlib.pyplot as plt
myArray = [(0,0,0,0,0.5,0,0,0,1),(0,0,0,0,0.5,0,0,0,1)]
myColumns = ['col1','col2','col3','co4','col5','col6','col7','col8','col9']
df = pd.DataFrame(myArray,columns=myColumns)
print(df)
ax = df.hist(bins=20)
for x in ax:  # each row of the grid of subplots
    for y in x:  # each Axes in that row
        y.set_xlim(0,1)
plt.show()
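A slightly shorter variant of the same idea is sketched below. It relies on df.hist returning a 2D NumPy array of Axes, so the nested loop can be replaced by iterating over .flat. The Agg backend is an assumption so the snippet runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

rows = [(0, 0, 0, 0, 0.5, 0, 0, 0, 1), (0, 0, 0, 0, 0.5, 0, 0, 0, 1)]
columns = ["col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9"]
df = pd.DataFrame(rows, columns=columns)

axes = df.hist(bins=20)  # 2D array of Axes, one subplot per column

# .flat walks over every Axes in the grid, whatever its shape.
for ax in axes.flat:
    ax.set_xlim(0, 1)

x_limits = [ax.get_xlim() for ax in axes.flat]
```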
I have a data file in the form of
Col0 Col1 Col2
2015 1 4
2016 2 3
The data is float, and I use numpy loadtxt to make an ndarray. However, I need to skip the label rows and columns to have an array of the data. How can I make the ndarray out of the data while reading the labels too?
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt("data.csv", skiprows=1)
# I need to skip the first row in reading the data but still get the labels.
x= data[:,0]
a= data[:,1]
b= data[:,2]
plt.xlabel(COL0) # Reading the COL0 value from the file.
plt.ylabel(COL1) # Reading the COL1 value from the file.
plt.plot(x,a)
NOTE: The labels (column titles) are unknown in the script. The script should be generic to work with any input file of the same structure.
With genfromtxt it is possible to get the names in a tuple. You can query on name, and you can get the names out into a variable using dtype.names[n], where n is an index.
import numpy as np
import matplotlib.pyplot as plt
data = np.genfromtxt('data.csv', names=True)
x = data[data.dtype.names[0]] # In this case this equals data['Col0'].
a = data[data.dtype.names[1]]
b = data[data.dtype.names[2]]
plt.figure()
plt.plot(x, a)
plt.xlabel(data.dtype.names[0])
plt.ylabel(data.dtype.names[1])
plt.show()
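To try this without a file on disk, the sketch below feeds the sample rows from the question to genfromtxt through an in-memory buffer. The io.StringIO stand-in for data.csv is an assumption for illustration:

```python
import io
import numpy as np

# In-memory stand-in for data.csv, using the sample rows from the question.
text = io.StringIO("Col0 Col1 Col2\n2015 1 4\n2016 2 3\n")

# names=True reads the first line as field names of a structured array.
data = np.genfromtxt(text, names=True)

names = data.dtype.names  # tuple of column labels read from the header
x = data[names[0]]        # first column, looked up by its label
a = data[names[1]]        # second column
```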
This is not really an answer to the actual question, but I feel you might be interested in knowing how to do the same with pandas instead of numpy.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("data.csv", sep=r"\s+")  # whitespace-separated; delim_whitespace is deprecated in recent pandas
df.set_index(df.columns[0]).plot()
plt.show()
would result in
As can be seen, there is no need to know any column name and the plot is labeled automatically.
Of course the data can then also be used to be plotted with matplotlib:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("data.csv", sep=r"\s+")  # whitespace-separated; delim_whitespace is deprecated in recent pandas
x = df[df.columns[0]]
a = df[df.columns[1]]
b = df[df.columns[2]]
plt.figure()
plt.plot(x, a)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
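The pandas route can be checked the same way without a file. The sketch below assumes the sample rows from the question, held in an in-memory buffer, and shows that the labels come straight from df.columns:

```python
import io
import pandas as pd

# In-memory stand-in for data.csv, using the sample rows from the question.
text = io.StringIO("Col0 Col1 Col2\n2015 1 4\n2016 2 3\n")

# sep=r"\s+" splits on any run of whitespace.
df = pd.read_csv(text, sep=r"\s+")

x_label = df.columns[0]          # label of the first column, read from the file
indexed = df.set_index(x_label)  # first column becomes the x axis when plotting
```

Calling indexed.plot() would then draw one line per remaining column, labelled from the header.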
I'm relatively new with numpy and pandas (I'm an experimental physicist so I've been using ROOT for years...).
A common plot in ROOT is a 2D scatter plot that, given a list of x- and y-values, makes a "heatmap"-type plot of one variable versus the other.
How is this best accomplished with numpy and Pandas? I'm trying to use the Dataframe.plot() function, but I'm struggling to even create the Dataframe.
import numpy as np
import pandas as pd
x = np.random.randn(1,5)
y = np.sin(x)
df = pd.DataFrame(d)
First off, this dataframe has shape (1,2), but I would like it to have shape (5,2).
If I can get the dataframe the right shape, I'm sure I can figure out the DataFrame.plot() function to draw what I want.
There are a number of ways to create DataFrames. Given 1-dimensional column vectors, you can create a DataFrame by passing it a dict whose keys are column names and whose values are the 1-dimensional column vectors:
import numpy as np
import pandas as pd
x = np.random.randn(5)
y = np.sin(x)
df = pd.DataFrame({'x':x, 'y':y})
df.plot('x', 'y', kind='scatter')
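The shape problem in the question comes from randn(1,5) returning a 2D array with one row. A minimal sketch of the difference, assuming you already have such an array and want to flatten it before building the DataFrame:

```python
import numpy as np
import pandas as pd

x_2d = np.random.randn(1, 5)  # shape (1, 5): one row, five columns
x = x_2d.ravel()              # shape (5,): a flat 1-dimensional vector
y = np.sin(x)

# Each 1-dimensional vector becomes one column of length 5.
df = pd.DataFrame({"x": x, "y": y})
```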
As a complement, you can also add columns as pandas Series, but the DataFrame must already have been created.
import numpy as np
import pandas as pd
x = np.linspace(0,2*np.pi)
y = np.sin(x)
#df = pd.DataFrame()
#df['X'] = pd.Series(x)
#df['Y'] = pd.Series(y)
# You can MIX
df = pd.DataFrame({'X':x})
df['Y'] = pd.Series(y)
df.plot('X', 'Y', kind='scatter')
This is another way that might help
import numpy as np
import pandas as pd
x = np.linspace(0,2*np.pi)
y = np.sin(x)
df = pd.DataFrame(data=np.column_stack((x,y)),columns=['X','Y'])
Also, I find the examples from Karlijn (DataCamp) very helpful:
import numpy as np
import pandas as pd
TAB = np.array([['' ,'Col1','Col2'],
['Row1' , 1 , 2 ],
['Row2' , 3 , 4 ],
['Row3' , 5 , 6 ]])
dados = TAB[1:,1:]
linhas = TAB[1:,0]
colunas = TAB[0,1:]
DF = pd.DataFrame(
data=dados,
index=linhas,
columns=colunas
)
print('\nDataFrame:', DF)
In order to do what you want, I wouldn't use the DataFrame plotting methods. I'm also a former experimental physicist, and based on experience with ROOT I think that the Python analog you want is best accomplished using matplotlib. In matplotlib.pyplot there is a method, hist2d(), which will give you the kind of heat map you're looking for.
As for creating the dataframe, an easy way to do it is:
df=pd.DataFrame({'x':x, 'y':y})
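A minimal hist2d sketch follows. The data are randomly generated for illustration, and the bin count, seed and Agg backend are arbitrary assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: y follows sin(x) with a little noise.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = np.sin(x) + rng.normal(scale=0.1, size=1000)

fig, ax = plt.subplots()

# hist2d bins the (x, y) pairs and colours each bin by its count.
counts, xedges, yedges, image = ax.hist2d(x, y, bins=40)
fig.colorbar(image, ax=ax, label="entries per bin")
ax.set_xlabel("x")
ax.set_ylabel("y")
```

If the values already sit in a DataFrame, the columns can be passed directly, for example ax.hist2d(df['x'], df['y'], bins=40).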
I want to plot a dataframe where each data point is represented not as a point but as a vertical line from the zero axis, like:
df['A'].plot(style='xxx')
where xxx is the style I need.
Also, ideally I would like to be able to color each bar based on the values in another column in my dataframe.
Note that my x-axis values are numbers and are not equally spaced.
The pandas plotting tools are convenient wrappers to matplotlib. There is no way I know of to get the functionality you want directly via pandas.
You can get it in a few lines of matplotlib. Most of the code is to do the colour mapping:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.colors as colors
import matplotlib.cm as cmx
#make the dataframe
a = np.random.rand(100)
b = np.random.ranf(100)
df = pd.DataFrame({'a': a, 'b': b})
# do the colour mapping
c_norm = colors.Normalize(vmin=min(df.b), vmax=max(df.b))
scalar_map = cmx.ScalarMappable(norm=c_norm, cmap=plt.get_cmap('jet'))
color_vals = [scalar_map.to_rgba(val) for val in df.b]
# make the plot
plt.vlines(df.index, np.zeros_like(df.a), df.a, colors=color_vals)
I've used the DataFrame index for the x axis values but there is no reason that you could not use irregularly spaced x values.
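For irregularly spaced x values, the same approach can be sketched by passing a column instead of the index. The column name "x", the random data, the viridis colormap and the Agg backend are assumptions for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.colors import Normalize

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": np.sort(rng.uniform(0, 10, size=50)),  # irregularly spaced positions
    "a": rng.random(50),                        # line heights
    "b": rng.random(50),                        # values that drive the colour
})

# Map column b onto the 0..1 range, then onto RGBA colours.
norm = Normalize(vmin=df["b"].min(), vmax=df["b"].max())
color_vals = plt.get_cmap("viridis")(norm(df["b"].to_numpy()))

fig, ax = plt.subplots()
# One vertical line per row, from 0 up to the value in column a.
lines = ax.vlines(df["x"].to_numpy(), 0, df["a"].to_numpy(), colors=color_vals)
```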