Creating Pandas Dataframe between two Numpy arrays, then draw scatter plot


I'm relatively new to numpy and pandas (I'm an experimental physicist, so I've been using ROOT for years...).
A common plot in ROOT is a 2D scatter plot that, given a list of x- and y-values, draws a "heatmap"-type scatter plot of one variable versus the other.
How is this best accomplished with numpy and pandas? I'm trying to use the DataFrame.plot() function, but I'm struggling to even create the DataFrame.
import numpy as np
import pandas as pd
x = np.random.randn(1,5)
y = np.sin(x)
df = pd.DataFrame(d)
First off, this dataframe has shape (1,2), but I would like it to have shape (5,2).
If I can get the dataframe the right shape, I'm sure I can figure out the DataFrame.plot() function to draw what I want.

There are a number of ways to create DataFrames. Given 1-dimensional column vectors, you can create a DataFrame by passing it a dict whose keys are column names and whose values are the 1-dimensional column vectors:
import numpy as np
import pandas as pd
x = np.random.randn(5)
y = np.sin(x)
df = pd.DataFrame({'x':x, 'y':y})
df.plot('x', 'y', kind='scatter')
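If the arrays already exist with shape (1, 5), as in the question, one option is to flatten them before building the DataFrame. This is a minimal sketch; the name x_2d is only illustrative:

```python
import numpy as np
import pandas as pd

# The question's array has shape (1, 5): one row, five columns.
x_2d = np.random.randn(1, 5)

# ravel() flattens it to a 1-dimensional array of length 5.
x = x_2d.ravel()
y = np.sin(x)

# Each dict value becomes one column, so the frame has 5 rows and 2 columns.
df = pd.DataFrame({'x': x, 'y': y})
print(df.shape)
```

With 1-dimensional inputs the resulting frame has the (5, 2) shape the question asks for.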

As a complement, you can also add columns as pandas Series, but the DataFrame must already exist:
import numpy as np
import pandas as pd
x = np.linspace(0,2*np.pi)
y = np.sin(x)
#df = pd.DataFrame()
#df['X'] = pd.Series(x)
#df['Y'] = pd.Series(y)
# You can MIX
df = pd.DataFrame({'X':x})
df['Y'] = pd.Series(y)
df.plot('X', 'Y', kind='scatter')
This is another way that might help
import numpy as np
import pandas as pd
x = np.linspace(0,2*np.pi)
y = np.sin(x)
df = pd.DataFrame(data=np.column_stack((x,y)),columns=['X','Y'])
I also find the examples from karlijn (DataCamp) very helpful:
import numpy as np
import pandas as pd
TAB = np.array([['',     'Col1', 'Col2'],
                ['Row1', 1,      2],
                ['Row2', 3,      4],
                ['Row3', 5,      6]])
dados = TAB[1:, 1:]    # data values
linhas = TAB[1:, 0]    # row labels
colunas = TAB[0, 1:]   # column labels
DF = pd.DataFrame(
    data=dados,
    index=linhas,
    columns=colunas
)
print('\nDataFrame:', DF)

In order to do what you want, I wouldn't use the DataFrame plotting methods. I'm also a former experimental physicist, and based on experience with ROOT I think that the Python analog you want is best accomplished using matplotlib. In matplotlib.pyplot there is a method, hist2d(), which will give you the kind of heat map you're looking for.
As for creating the dataframe, an easy way to do it is:
df=pd.DataFrame({'x':x, 'y':y})
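A minimal sketch of the hist2d() approach, assuming synthetic data (the noise term, sample size, and bin count are arbitrary choices for illustration, not from the question):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: y follows sin(x) with a little Gaussian noise (illustrative only).
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
y = np.sin(x) + 0.1 * rng.standard_normal(10_000)
df = pd.DataFrame({'x': x, 'y': y})

# hist2d bins the points on a 50 x 50 grid and colours each cell by its count.
fig, ax = plt.subplots()
counts, xedges, yedges, image = ax.hist2d(df['x'], df['y'], bins=50)
fig.colorbar(image, ax=ax, label='entries per bin')
ax.set_xlabel('x')
ax.set_ylabel('y')
```

The returned counts array holds the per-bin entries, which is roughly what a ROOT TH2 would store.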

Related

Can I take a table from excel and plot a histogram in python?

I have two tables, one 10 by 110 and one 35 by 110, and both contain random numbers from an exponential distribution function provided by my professor. The assignment is to prove the central limit theorem in statistics.
What I thought to try is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# importing data
df1 = pd.read_excel(r'C:\Users\Henry\Desktop\n10.xlsx')
df2 = pd.read_excel(r'C:\Users\Henry\Desktop\n30.xlsx')
df1avg = pd.read_excel(r'C:\Users\Henry\Desktop\n10avg.xlsx')
df2avg = pd.read_excel(r'C:\Users\Henry\Desktop\n30avg.xlsx')
# plotting n10 histogram
plt.hist(df1, bins=34)
plt.hist(df1avg, bins=11)
# plotting n30 histogram
plt.hist(df2, bins=63)
plt.hist(df2avg, bins=11)
Is that OK, or do I need to reshape the tables into a single column, and if so, what is the most efficient way to do that?

I suspect that you will want to flatten your dataframe first, as illustrated below.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
N = np.random.exponential(1, [40, 5])
df = pd.DataFrame(N) # convert to dataframe
bin_edges = np.linspace(0,6,30)
plt.figure()
plt.hist(df, bins = bin_edges, density = True)
plt.xlabel('Value')
plt.ylabel('Probability density')
The multiple (5) colours of bars per bin show the separate histograms for each column of the data frame.
Fortunately, this is not hard to adjust. You can convert the data frame to a numpy array and flatten it:
plt.hist(df.to_numpy().flatten(), bins = bin_edges, density = True)
plt.ylabel('Probability density')
plt.xlabel('Value')
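For the central limit theorem part, a hedged sketch of histogramming sample means rather than raw values. It assumes each row of the table is one sample of size 10 and uses synthetic exponential data in place of the Excel files; the orientation of the real tables may differ:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the n = 10 table: 110 samples (rows) of 10 draws each (assumed layout).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.exponential(1.0, size=(110, 10)))

# One mean per row, i.e. one mean per sample.
sample_means = df.mean(axis=1)

fig, ax = plt.subplots()
heights, bin_edges, _ = ax.hist(sample_means, bins=11, density=True)
ax.set_xlabel('Sample mean')
ax.set_ylabel('Probability density')
```

If the samples are stored as columns instead, df.mean(axis=0) gives the corresponding means.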

Python Change axis on Multi Histogram plot

I have a pandas dataframe df for which I plot a multi-histogram as follows:
df.hist(bins=20)
This gives me a result that looks like this (yes, this example is ugly since there is only one data point per histogram, sorry):
I have a subplot for each numerical column of my dataframe.
Now I want all my histograms to have an X-axis between 0 and 1. I saw that the hist() function takes an ax parameter, but I cannot manage to make it work.
How is it possible to do that?
EDIT:
Here is a minimal example:
import pandas as pd
import matplotlib.pyplot as plt
myArray = [(0,0,0,0,0.5,0,0,0,1),(0,0,0,0,0.5,0,0,0,1)]
myColumns = ['col1','col2','col3','co4','col5','col6','col7','col8','col9']
df = pd.DataFrame(myArray,columns=myColumns)
print(df)
df.hist(bins=20)
plt.show()

Here is a solution that works, but for sure is not ideal:
import pandas as pd
import matplotlib.pyplot as plt
myArray = [(0,0,0,0,0.5,0,0,0,1),(0,0,0,0,0.5,0,0,0,1)]
myColumns = ['col1','col2','col3','co4','col5','col6','col7','col8','col9']
df = pd.DataFrame(myArray,columns=myColumns)
print(df)
ax = df.hist(bins=20)
for x in ax:
    for y in x:
        y.set_xlim(0,1)
plt.show()
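The same idea can be sketched with a single loop, assuming df.hist() returns a 2-D numpy array of Axes (as it does for this nine-column example):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import pandas as pd

my_array = [(0, 0, 0, 0, 0.5, 0, 0, 0, 1), (0, 0, 0, 0, 0.5, 0, 0, 0, 1)]
my_columns = ['col1', 'col2', 'col3', 'co4', 'col5', 'col6', 'col7', 'col8', 'col9']
df = pd.DataFrame(my_array, columns=my_columns)

# df.hist() returns a grid of Axes; ravel() flattens it so one loop covers every subplot.
axes = df.hist(bins=20)
for ax in axes.ravel():
    ax.set_xlim(0, 1)
```

Unused cells in the grid, if any, are also Axes objects, so setting their limits is harmless.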

Big dataset contour plot using pyplot and pandas

I have a massive data sample and need to visualize it. Using pandas, I can create a dataframe with the relevant variables: three arrays of length 20 million.
These are the x, y geometrical coordinates and the z value at that (x,y) point.
I need a "heatmap" of z at each (x,y) point, but no pyplot function works with this many points.
What is the best way to go about it?

Dummy data
Tested with 200,000 rows
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
df=pd.DataFrame(np.random.rand(200000,2), columns=['X','Y'])
df['Z'] = df.X + df.Y*2  # vectorized; same values as a row-wise apply
Code
Create bin intervals with pd.cut, group the dataframe by them, and take the mean of the Z column, so that there is one mean Z value for every (X, Y) bin pair. Finally, draw the scatter plot:
binsX = pd.cut(df.X, np.arange(0,1,0.001))
binsY = pd.cut(df.Y, np.arange(0,1,0.001))
binned = df.groupby([binsX,binsY])['Z'].mean().reset_index()
binned.X = binned.X.apply(lambda x: x.mid)
binned.Y = binned.Y.apply(lambda y: y.mid)
plt.scatter(binned.X, binned.Y, c=binned.Z, s=0.01)
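As an alternative sketch (not the method above), the per-bin mean of Z can also be computed with two calls to numpy.histogram2d: one weighted by Z and one unweighted. The 100 x 100 grid and the synthetic data are assumptions for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Dummy data in the unit square, mirroring Z = X + 2*Y from above.
rng = np.random.default_rng(0)
n = 200_000
x = rng.random(n)
y = rng.random(n)
z = x + 2 * y

bins = 100
extent = [[0, 1], [0, 1]]

# Sum of z per bin, and number of points per bin.
sum_z, xedges, yedges = np.histogram2d(x, y, bins=bins, range=extent, weights=z)
counts, _, _ = np.histogram2d(x, y, bins=bins, range=extent)

# Mean of z per bin; empty bins become NaN instead of raising a warning.
with np.errstate(invalid='ignore', divide='ignore'):
    mean_z = sum_z / counts

# histogram2d puts x along the first axis, so transpose for plotting.
fig, ax = plt.subplots()
mesh = ax.pcolormesh(xedges, yedges, mean_z.T)
fig.colorbar(mesh, ax=ax, label='mean Z')
```

This avoids creating one Python object per bin, which may matter at 20 million rows; memory use is set by the grid size rather than the number of points.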

Pandas+seaborn faceting with multidimensional dataframes

In Python pandas, I need to do a facet grid from a multidimensional DataFrame.
In columns a and b I hold scalar values, which represent conditions of an experiment.
In columns x and y, instead, I have two numpy arrays. Column x is the x-axis of the data and column y is the value of a function f(x) at those points.
Both x and y have the same number of elements.
I would now like to make a facet grid with rows and columns specifying the conditions, and in every cell of the grid, plot the value of column y vs column x.
This could be a minimal working example:
import pandas as pd
d = [0]*4 # initialize a list with 4 elements
d[0] = {'x':[1,2,3],'y':[4,5,6],'a':1,'b':2} # then fill these elements
d[1] = {'x':[3,1,5],'y':[6,5,1],'a':0,'b':3}
d[2] = {'x':[3,1,5],'y':[6,5,1],'a':1,'b':3}
d[3] = {'x':[3,1,5],'y':[6,5,1],'a':0,'b':2}
pd.DataFrame(d) # create the pandas dataframe
How can I use already existing faceting functions to address the issue of plotting y vs x grouped by the conditions a and b?
Since I need to apply this function to general datasets with different column names, I would like to avoid resorting to hard-coded solutions, and rather see whether it is possible to extend the seaborn FacetGrid function to this kind of problem.

I think the best way to go is to split the nested arrays first and then create a facet grid with seaborn.
Thanks to this post (Split nested array values from Pandas Dataframe cell over multiple rows) I was able to split the nested array in your dataframe:
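# df is the DataFrame from the question, i.e. df = pd.DataFrame(d)
unnested_lst = []
for col in df.columns:
    # expand each list into columns, then stack them into rows
    unnested_lst.append(df[col].apply(pd.Series).stack())
# forward-fill the scalar columns a and b down the new rows
result = pd.concat(unnested_lst, axis=1, keys=df.columns).ffill()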
Then you can make the facet grid with this code:
import matplotlib.pyplot as plt
import seaborn as sbn
fg = sbn.FacetGrid(result, row='b', col='a')
fg.map(plt.scatter, "x", "y", color='blue')

You need a long-form frame to be able to use FacetGrid, so your best bet is to explode the lists, then recombine and apply:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
d = [0]*4
d[0] = {'x':[1,2,3],'y':[4,5,6],'a':1,'b':2} # then fill these elements
d[1] = {'x':[3,1,5],'y':[6,5,1],'a':0,'b':3}
d[2] = {'x':[3,1,5],'y':[6,5,1],'a':1,'b':3}
d[3] = {'x':[3,1,5],'y':[6,5,1],'a':0,'b':2}
df = pd.DataFrame(d)
df.set_index(['a','b'], inplace=True, drop=True)
x_long = pd.melt(df['x'].apply(pd.Series).reset_index(),
id_vars=['a', 'b'], value_name='x')
y_long = pd.melt(df['y'].apply(pd.Series).reset_index(),
id_vars=['a', 'b'], value_name='y')
long_df = pd.merge(x_long, y_long).drop('variable', axis='columns')
grid = sns.FacetGrid(long_df, row='a', col='b')
grid.map(plt.scatter, 'x', 'y')
plt.show()

I believe the best, shortest and most comprehensible solution is to define a purpose-built lambda function. It takes as input the mapping variables specified in the FacetGrid.map call and extracts the stored arrays with .values[0], since each (a, b) combination is unique and every facet therefore holds exactly one row.
import pandas as pd
d = [0]*4 # initialize a list with 4 elements
d[0] = {'x':[1,2,3],'y':[4,5,6],'a':1,'b':2} # then fill these elements
d[1] = {'x':[3,1,5],'y':[6,5,1],'a':0,'b':3}
d[2] = {'x':[3,1,5],'y':[6,5,1],'a':1,'b':3}
d[3] = {'x':[3,1,5],'y':[6,5,1],'a':0,'b':2}
df = pd.DataFrame(d) # create the pandas dataframe
import seaborn as sns
import matplotlib.pyplot as plt
grid = sns.FacetGrid(df,row='a',col='b')
grid.map(lambda _x,_y,**kwargs : plt.scatter(_x.values[0],_y.values[0]),'x','y')
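Another way to get the long-form frame, sketched here under the assumption that pandas 1.3 or newer is available (multi-column explode was added in that release), is DataFrame.explode:

```python
import pandas as pd

d = [
    {'x': [1, 2, 3], 'y': [4, 5, 6], 'a': 1, 'b': 2},
    {'x': [3, 1, 5], 'y': [6, 5, 1], 'a': 0, 'b': 3},
    {'x': [3, 1, 5], 'y': [6, 5, 1], 'a': 1, 'b': 3},
    {'x': [3, 1, 5], 'y': [6, 5, 1], 'a': 0, 'b': 2},
]
df = pd.DataFrame(d)

# Explode both list columns together: one output row per (x, y) pair,
# with the scalar columns a and b repeated alongside.
long_df = df.explode(['x', 'y'], ignore_index=True)

# Exploded columns keep object dtype, so convert them to numbers for plotting.
long_df = long_df.astype({'x': float, 'y': float})
print(long_df.head())
```

The resulting long_df can then be passed to sns.FacetGrid(long_df, row='a', col='b') and mapped with plt.scatter as in the answers above. No column names other than the ones being exploded need to be hard-coded.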

Plot a pandas dataframe with vertical lines

I want to plot a dataframe where each data point is represented not as a point but as a vertical line from the zero axis, like:
df['A'].plot(style='xxx')
where xxx is the style I need.
Ideally, I would also like to be able to color each bar based on the values in another column of my dataframe.
Note that my x-axis values are numbers and are not equally spaced.

The pandas plotting tools are convenient wrappers to matplotlib. There is no way I know of to get the functionality you want directly via pandas.
You can get it in a few lines of matplotlib. Most of the code is to do the colour mapping:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.colors as colors
import matplotlib.cm as cmx
#make the dataframe
a = np.random.rand(100)
b = np.random.ranf(100)
df = pd.DataFrame({'a': a, 'b': b})
# do the colour mapping
c_norm = colors.Normalize(vmin=min(df.b), vmax=max(df.b))
scalar_map = cmx.ScalarMappable(norm=c_norm, cmap=plt.get_cmap('jet'))
color_vals = [scalar_map.to_rgba(val) for val in df.b]
# make the plot
plt.vlines(df.index, np.zeros_like(df.a), df.a, colors=color_vals)
I've used the DataFrame index for the x axis values but there is no reason that you could not use irregularly spaced x values.
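A hedged sketch of the irregularly spaced case, assuming the x positions live in a column named x (a hypothetical name) and using a colormap call in place of the ScalarMappable:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.colors import Normalize

# Illustrative data: 'x' holds irregularly spaced positions, 'a' the line
# heights, and 'b' the values that drive the colour (names are assumptions).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'x': np.sort(rng.uniform(0, 10, 50)),
    'a': rng.random(50),
    'b': rng.random(50),
})

# Map column b onto [0, 1], then onto RGBA colours.
norm = Normalize(vmin=df['b'].min(), vmax=df['b'].max())
cmap = plt.get_cmap('viridis')
line_colors = cmap(norm(df['b'].to_numpy()))

# One vertical line per row, from 0 up to the value in column a.
fig, ax = plt.subplots()
lines = ax.vlines(df['x'], 0, df['a'], colors=line_colors)
```

Calling the colormap on the normalized array yields one RGBA row per data point, which vlines accepts directly.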
