I have a massive data sample that I need to visualize. Using pandas, I can create a dataframe with the relevant variables: 3 arrays of length 20 million.
These are the x, y geometrical coordinates and the z value at each (x, y) point.
I need a "heatmap" of z at each (x, y) point, but no pyplot function works with numbers this big.
What is the best way to go about it?
Dummy data
Tested with 200,000 rows
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
df=pd.DataFrame(np.random.rand(200000,2), columns=['X','Y'])
df['Z'] = df.X + df.Y*2  # vectorized column arithmetic, much faster than a row-wise apply
Code
Create bin intervals, then group the dataframe by them and take the mean of the Z column, which gives a mean Z for every (X, Y) bin pair. Finally, draw a scatter plot of the binned values.
# 1001 edges give 1000 bins covering all of [0, 1]; np.arange(0,1,0.001) would stop at 0.999
edges = np.linspace(0, 1, 1001)
binsX = pd.cut(df.X, edges)
binsY = pd.cut(df.Y, edges)
binned = df.groupby([binsX,binsY])['Z'].mean().reset_index()
binned.X = binned.X.apply(lambda x: x.mid)
binned.Y = binned.Y.apply(lambda y: y.mid)
plt.scatter(binned.X, binned.Y, c=binned.Z, s=0.01)
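For 20 million rows, the same bin-and-average idea can also be done directly in numpy, which avoids the groupby on interval categories. The following is only a sketch under assumptions: the data is synthetic, x and y are assumed to lie in [0, 1], and the bin count of 200 is a hypothetical choice. It divides a z-weighted 2D histogram by the plain count histogram to get the mean z per bin, then draws the grid with pcolormesh.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the real data (assumption: x and y lie in [0, 1]).
rng = np.random.default_rng(0)
n = 200_000
x = rng.random(n)
y = rng.random(n)
z = x + 2 * y

bins = 200                   # hypothetical resolution; tune to the figure size
extent = [[0, 1], [0, 1]]    # assumed data range

# Sum of z in each bin, and number of points in each bin.
z_sum, x_edges, y_edges = np.histogram2d(x, y, bins=bins, range=extent, weights=z)
counts, _, _ = np.histogram2d(x, y, bins=bins, range=extent)

# Mean z per bin; empty bins give 0/0 = NaN and are masked out below.
with np.errstate(invalid="ignore", divide="ignore"):
    z_mean = z_sum / counts

fig, ax = plt.subplots()
# histogram2d indexes its result as [x_bin, y_bin]; pcolormesh wants
# rows = y and columns = x, hence the transpose.
mesh = ax.pcolormesh(x_edges, y_edges, np.ma.masked_invalid(z_mean.T), shading="flat")
fig.colorbar(mesh, ax=ax, label="mean Z")
# plt.show()  # uncomment when running interactively
```

The plot then contains one cell per bin, so its cost depends on the bin count rather than on the number of rows.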
Related
I have a question. I have point cloud data that I need to read and plot. If anyone can help me, I would be very thankful. I am using Python (pandas, matplotlib, ...). I have all the X, Y, Z values but don't know how to plot them to get a 3D plot. The values are taken from point cloud data, which has 170 rows and 254 combinations of x, y, z, I, N values.
https://datalore.jetbrains.com/notebook/n9MPhjVrtrIoU1buWmQuDh/MT7MrS1buzmbD7VSDqhGqu/
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
import pandas as pd
df1 = pd.read_csv('cloud.txt',delimiter='\t')
pd.set_option('display.max_columns', None)
df1 = df1.apply (pd.to_numeric, errors='coerce')
#cloud.dropna()
df1.fillna(0,axis=0,inplace=True)
df2=df1.iloc[:,:-1]
df2.head(170)
kolone=[]
i=1
while i<6:
    kolone.append(i)
    i=i+1
display(kolone)
c=[]
columns=kolone*224
c=c+columns
df2.columns=c
display(df2)
# Reading the points: column 1 is the x value, column 2 is the y value and
# column 3 is the z value. Columns 4 and 5 are intensity and noise values and
# they are not important for this.
# The first row is replaced by a numbering of the columns: adding
# the values 1,2,3,4,5 for the x,y,z,I,N values.
x=df2[1]
y=df2[2]
z=df2[3]
r=[]
i=1
while i<225:
    r.append(i)
    i=i+1
#print(r)
x.columns=r
display(x)
#Reading x coordinates--224 values of x
i=1
p=[]
while i<225:
    p.append(i)
    i=i+1
#print(p)
y.columns=p
display(y)
#Reading y coordinates--224 values of y
i=1
q=[]
while i<225:
    q.append(i)
    i=i+1
#print(q)
z.columns=q
display(z)
#Reading z coordinates--224 values of z
It is a bit upsetting that you haven't tried anything at all yet. The documentation page for matplotlib's 3D scatter plot includes a complete example.
There is no point in going to all that trouble to assign column names. Indeed, there is really no point in using pandas at all for this; you could read the CSV directly into a numpy array. However, assuming you have a dataframe with unnamed columns, it's still pretty easy.
In this code, I create a 50x3 array of random integers, then I pull the columns as lists and pass them to scatter. You ought to be able to adapt this to your own code.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data = np.random.randint( 256, size=(50,3))
df = pd.DataFrame(data)
x = df[0].tolist()
y = df[1].tolist()
z = df[2].tolist()
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter( x, y, z )
plt.show()
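To connect this to the layout in the question, here is a rough sketch of the numpy route. It assumes each row of the file holds repeated groups of five values (x, y, z, I, N), and it uses random numbers in place of the file. The 170 rows and 224 groups are taken from the question's text and code, not from the actual data. For the real file, something like np.genfromtxt with delimiter='\t' would presumably produce the wide array, but that depends on how the file is laid out.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for the file contents (assumption): every row holds repeated
# groups of five values x, y, z, I, N.
n_rows, n_groups = 170, 224
rng = np.random.default_rng(0)
wide = rng.random((n_rows, n_groups * 5))

# View each row as (group, field), then pull out the first three fields
# and flatten them into one long list of points.
points = wide.reshape(n_rows, n_groups, 5)
x = points[:, :, 0].ravel()
y = points[:, :, 1].ravel()
z = points[:, :, 2].ravel()

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(x, y, z, s=1)
# plt.show()  # uncomment when running interactively
```

The reshape replaces all of the column renaming: field 0 of every group is x, field 1 is y and field 2 is z.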
I have a CSV file which has been generated and altered into its current form:
Quick snapshot of sample Data
I want to plot a graph with X along the x axis as normal and, on the y axis, the frequency of 'True' values (i.e. 1s), so that I can visualise the relationship between time and the frequency of the event occurring.
Thus far I have attempted a melt and value_counts, but they seem to give absolute counts rather than counts relative to the X value. I understand the data will likely also need sorting before plotting, but I'm not sure of the best way to go about this.
Many thanks for any help.
You can either plot a histogram, which would probably work, or you can group by 'x' and sum 'y', as in the code below:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = np.array([[1988,1988,1988,1989,1990,1990,1991,1991], [0,1,1,0,1,1,0,0]])
df = pd.DataFrame(data.T, columns = ['x', 'y'])
df=df.groupby(['x']).sum()
print(df)
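Since the question asks for a frequency relative to each x value, a small variation on the groupby is to take the mean of 'y' as well as the sum: for 0/1 data the mean is the fraction of rows at that x that are 1. A minimal sketch using the same sample values (the column names true_count and true_fraction are made up for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'x': [1988, 1988, 1988, 1989, 1990, 1990, 1991, 1991],
    'y': [0, 1, 1, 0, 1, 1, 0, 0],
})

# Per x value: number of 1s (sum) and fraction of rows that are 1 (mean).
summary = df.groupby('x')['y'].agg(true_count='sum', true_fraction='mean')
print(summary)

# groupby sorts by x, so the bars come out in chronological order.
ax = summary['true_fraction'].plot(kind='bar')
ax.set_ylabel('fraction of 1s')
# plt.show()  # uncomment when running interactively
```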
I have a list of case and control samples along with the information about what characteristics are present or absent in each of them. A dataframe including the information can be generated by Pandas:
import pandas as pd
df={'Patient':[True,True,False],'Control':[False,True,False]} # Presence/absence data for three genes for each sample
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
I need to visualize this data as a dot plot/scatter plot in which both the x and y axes are categorical and presence/absence is coded by different shapes. Something like the following:
Patient|   x      x      -
Control|   -      x      -
       _____________________
         GeneA  GeneB  GeneC
I am new to Matplotlib/seaborn and can plot simple line plots and scatter plots, but searching online I could not find any instructions or a plot similar to what I need here.
A quick way would be:
import pandas as pd
import matplotlib.pyplot as plt
df={'Patient':[1,1,0],'Control':[0,1,0]} # Presence/absence data for three genes for each sample
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
heatmap = plt.imshow(df)
plt.xticks(range(len(df.columns.values)), df.columns.values)
plt.yticks(range(len(df.index)), df.index)
cbar = plt.colorbar(mappable=heatmap, ticks=[0, 1], orientation='vertical')
# vertically oriented colorbar
cbar.ax.set_yticklabels(['Absent', 'Present'])
Thanks to @DEEPAK SURANA for adding labels to the colorbar.
I searched the pyplot documentation and could not find a scatter or dot plot exactly like you described. Here is my take on creating a plot that illustrates what you want. The True records are blue and the False records are red.
# creating dataframe and extra column because index is not numeric
import pandas as pd
df={'Patient':[True,True,False],
'Control':[False,True,False]}
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
df['level'] = [i for i in range(0, len(df))]
print(df)
# plotting the data
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,6))
for idx, gene in enumerate(df.columns[:-1]):
    df_gene = df[[gene, 'level']]
    cList = ['blue' if x == True else 'red' for x in df[gene]]
    for inr_idx, lv in enumerate(df['level']):
        ax.scatter(x=idx, y=lv, c=cList[inr_idx], s=20)
fig.tight_layout()
plt.yticks([i for i in range(len(df.index))], list(df.index))
plt.xticks([i for i in range(len(df.columns)-1)], list(df.columns[:-1]))
plt.show()
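The question asked for presence/absence to be coded by shape rather than colour. One possible variation, sketched here with the same three-gene data, is to split the grid cells into two groups and call scatter once per marker. The marker choices 'x' and '_' are just an assumption meant to mimic the ASCII sketch in the question.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
    {'GeneA': [True, False], 'GeneB': [True, True], 'GeneC': [False, False]},
    index=['Patient', 'Control'],
)

# One (column position, row position, flag) triple per cell of the table.
cells = [
    (col_pos, row_pos, bool(df.iat[row_pos, col_pos]))
    for row_pos in range(df.shape[0])
    for col_pos in range(df.shape[1])
]
present = [(cx, cy) for cx, cy, flag in cells if flag]
absent = [(cx, cy) for cx, cy, flag in cells if not flag]

fig, ax = plt.subplots()
# One scatter call per marker shape.
if present:
    ax.scatter(*zip(*present), marker='x', color='black', label='Present')
if absent:
    ax.scatter(*zip(*absent), marker='_', color='black', label='Absent')

# Put the categorical labels on the integer grid positions.
ax.set_xticks(range(df.shape[1]))
ax.set_xticklabels(df.columns)
ax.set_yticks(range(df.shape[0]))
ax.set_yticklabels(df.index)
ax.margins(0.3)
ax.invert_yaxis()  # first row (Patient) at the top, as in the sketch
ax.legend()
# plt.show()  # uncomment when running interactively
```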
Something like this might work
import pandas as pd
import numpy as np
from matplotlib.ticker import FixedLocator
df={'Patient':[1,1,0],'Control':[0,1,0]} # Presence/absence data for three genes for each sample
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
plot = df.T.plot()
loc = FixedLocator([0,1,2])
plot.xaxis.set_major_locator(loc)
plot.xaxis.set_ticklabels(df.columns)
Look at https://matplotlib.org/examples/pylab_examples/major_minor_demo1.html
and https://matplotlib.org/api/ticker_api.html
I think you have to convert the boolean values to zeros and ones to make it work, with something like df.astype(int).
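A minimal sketch of that conversion, starting from the boolean dataframe in the question (df_int is a name introduced here for illustration):

```python
import pandas as pd

# Boolean presence/absence data, as in the question.
df = pd.DataFrame(
    {'Patient': [True, True, False], 'Control': [False, True, False]}
).transpose()
df.columns = ['GeneA', 'GeneB', 'GeneC']

# True becomes 1 and False becomes 0; the labels are unchanged.
df_int = df.astype(int)
print(df_int)
```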
I'm relatively new with numpy and pandas (I'm an experimental physicist so I've been using ROOT for years...).
A common plot in ROOT is a 2D scatter plot that, given a list of x and y values, draws a "heatmap"-type scatter plot of one variable versus the other.
How is this best accomplished with numpy and pandas? I'm trying to use the DataFrame.plot() function, but I'm struggling to even create the DataFrame.
import numpy as np
import pandas as pd
x = np.random.randn(1,5)
y = np.sin(x)
df = pd.DataFrame(d)
First off, this dataframe has shape (1,2), but I would like it to have shape (5,2).
If I can get the dataframe the right shape, I'm sure I can figure out the DataFrame.plot() function to draw what I want.
There are a number of ways to create DataFrames. Given 1-dimensional column vectors, you can create a DataFrame by passing it a dict whose keys are column names and whose values are the 1-dimensional column vectors:
import numpy as np
import pandas as pd
x = np.random.randn(5)
y = np.sin(x)
df = pd.DataFrame({'x':x, 'y':y})
df.plot('x', 'y', kind='scatter')
To complement this: you can also add columns as pandas Series, but the DataFrame must already exist.
import numpy as np
import pandas as pd
x = np.linspace(0,2*np.pi)
y = np.sin(x)
#df = pd.DataFrame()
#df['X'] = pd.Series(x)
#df['Y'] = pd.Series(y)
# You can MIX
df = pd.DataFrame({'X':x})
df['Y'] = pd.Series(y)
df.plot('X', 'Y', kind='scatter')
This is another way that might help
import numpy as np
import pandas as pd
x = np.linspace(0,2*np.pi)
y = np.sin(x)
df = pd.DataFrame(data=np.column_stack((x,y)),columns=['X','Y'])
I also find the examples from karlijn (DataCamp) very helpful:
import numpy as np
import pandas as pd
TAB = np.array([['' ,'Col1','Col2'],
['Row1' , 1 , 2 ],
['Row2' , 3 , 4 ],
['Row3' , 5 , 6 ]])
dados = TAB[1:,1:]
linhas = TAB[1:,0]
colunas = TAB[0,1:]
DF = pd.DataFrame(
data=dados,
index=linhas,
columns=colunas
)
print('\nDataFrame:', DF)
In order to do what you want, I wouldn't use the DataFrame plotting methods. I'm also a former experimental physicist, and based on experience with ROOT I think that the Python analog you want is best accomplished using matplotlib. In matplotlib.pyplot there is a method, hist2d(), which will give you the kind of heat map you're looking for.
As for creating the dataframe, an easy way to do it is:
df=pd.DataFrame({'x':x, 'y':y})
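Putting the two pieces together, a short sketch of hist2d on such a dataframe might look like this. The sample size, the 50 bins and the noise added to y are assumptions made only so that the histogram has something to show.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
# Assumption: a little noise on y so the points do not all lie on one curve.
y = np.sin(x) + 0.1 * rng.standard_normal(10_000)
df = pd.DataFrame({'x': x, 'y': y})

fig, ax = plt.subplots()
# hist2d returns the bin counts, both sets of bin edges, and the drawn image.
counts, x_edges, y_edges, image = ax.hist2d(df['x'], df['y'], bins=50)
fig.colorbar(image, ax=ax, label='entries per bin')
# plt.show()  # uncomment when running interactively
```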
I want to plot a dataframe where each data point is represented not as a point but as a vertical line from the zero axis, like:
df['A'].plot(style='xxx')
where xxx is the style I need.
Ideally, I would also like to be able to color each bar based on the values in another column of my dataframe.
To be precise, my x-axis values are numbers and are not equally spaced.
The pandas plotting tools are convenient wrappers to matplotlib. There is no way I know of to get the functionality you want directly via pandas.
You can get it in a few lines of matplotlib. Most of the code is to do the colour mapping:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.colors as colors
import matplotlib.cm as cmx
#make the dataframe
a = np.random.rand(100)
b = np.random.rand(100)
df = pd.DataFrame({'a': a, 'b': b})
# do the colour mapping
c_norm = colors.Normalize(vmin=min(df.b), vmax=max(df.b))
scalar_map = cmx.ScalarMappable(norm=c_norm, cmap=plt.get_cmap('jet'))
color_vals = [scalar_map.to_rgba(val) for val in df.b]
# make the plot
plt.vlines(df.index, np.zeros_like(df.a), df.a, colors=color_vals)
I've used the DataFrame index for the x axis values but there is no reason that you could not use irregularly spaced x values.
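As a sketch of that last point, the same plot with an explicit, irregularly spaced x column could look like this. The column name 'x', the 30 random positions and the 'viridis' colormap are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'x': np.sort(rng.uniform(0, 10, 30)),  # irregularly spaced x positions
    'a': rng.random(30),                   # line heights
    'b': rng.random(30),                   # values that drive the colour
})

# Map column b onto colours.
norm = Normalize(vmin=df['b'].min(), vmax=df['b'].max())
cmap = plt.get_cmap('viridis')
line_colors = cmap(norm(df['b'].to_numpy()))

fig, ax = plt.subplots()
# One vertical line per row, from 0 up to the value in column a.
lines = ax.vlines(df['x'], 0, df['a'], colors=line_colors)
fig.colorbar(ScalarMappable(norm=norm, cmap=cmap), ax=ax, label='b')
# plt.show()  # uncomment when running interactively
```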