Trying to resolve problem in pandas-python - python

I have point cloud data that I need to read and plot. I am using Python (pandas, matplotlib, ...) and I have all the X, Y, Z values, but I don't know how to plot them all to get a 3D plot. The values are taken from point cloud data, which has 170 rows and 254 combinations of x, y, z, I, N values. If anyone can help me, I would be very thankful.
https://datalore.jetbrains.com/notebook/n9MPhjVrtrIoU1buWmQuDh/MT7MrS1buzmbD7VSDqhGqu/
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
import pandas as pd
df1 = pd.read_csv('cloud.txt',delimiter='\t')
pd.set_option('display.max_columns', None)
df1 = df1.apply (pd.to_numeric, errors='coerce')
#cloud.dropna()
df1.fillna(0,axis=0,inplace=True)
df2=df1.iloc[:,:-1]
df2.head(170)
kolone=[]
i=1
while i<6:
    kolone.append(i)
    i=i+1
display(kolone)
c=[]
columns=kolone*224
c=c+columns
df2.columns=c
display(df2)
# Reading the points: column 1 is the x value, column 2 is the y value and
# column 3 is the z value. Columns 4 and 5 are intensity and noise values and
# they are not important for this.
# First row is exchanged with numerisation of columns: adding
# values 1,2,3,4,5 or x,y,z,I,N values.
x=df2[1]
y=df2[2]
z=df2[3]
r=[]
i=1
while i<225:
    r.append(i)
    i=i+1
#print(r)
x.columns=r
display(x)
#Reading x coordinates--224 values of x
i=1
p=[]
while i<225:
    p.append(i)
    i=i+1
#print(p)
y.columns=p
display(y)
#Reading y coordinates--224 values of y
i=1
q=[]
while i<225:
    q.append(i)
    i=i+1
#print(q)
z.columns=q
display(z)
#Reading z coordinates--224 values of z

It is a bit upsetting that you haven't tried anything at all yet. The documentation page for matplotlib's 3D scatter plot includes a complete example.
There is no point in going to all that trouble to assign column names. Indeed, there is really no point in using pandas at all for this; you could read the CSV directly into a numpy array. However, assuming you have a dataframe with unnamed columns, it's still pretty easy.
In this code, I create a 50x3 array of random integers, then I pull the columns as lists and pass them to scatter. You ought to be able to adapt this to your own code.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data = np.random.randint( 256, size=(50,3))
df = pd.DataFrame(data)
x = df[0].tolist()
y = df[1].tolist()
z = df[2].tolist()
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter( x, y, z )
plt.show()
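Since the file in the question appears to store many points per row as repeating x, y, z, I, N groups, one way to adapt the example is to reshape the wide table so that each point gets its own row. This is only a minimal sketch under that assumption: `split_point_columns` and `plot_points` are hypothetical helper names, and the small `wide` array stands in for the real file.

```python
import numpy as np

def split_point_columns(table, fields=5):
    """Reshape a wide (rows, n_points * fields) table into (n_total_points, fields)."""
    table = np.asarray(table, dtype=float)
    usable = (table.shape[1] // fields) * fields  # ignore a trailing partial group
    return table[:, :usable].reshape(-1, fields)

def plot_points(points):
    """Scatter the first three columns (assumed to be x, y, z) in 3D."""
    import matplotlib.pyplot as plt  # imported here so the reshaping works without it
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')
    ax.scatter(points[:, 0], points[:, 1], points[:, 2], s=2)
    return ax

# Stand-in data (assumption): 2 rows, each holding 2 points as x, y, z, I, N
wide = np.array([[1, 2, 3, 9, 9, 4, 5, 6, 9, 9],
                 [7, 8, 9, 0, 0, 1, 1, 1, 0, 0]])
points = split_point_columns(wide)
```

With the real data, something like `plot_points(split_point_columns(df2.to_numpy()))` followed by `plt.show()` should display the cloud, provided the columns really do repeat in that order.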

Related

Big dataset contour plot using pyplot and pandas

I have a massive data sample and need to visualize it. Using pandas, I can create a dataframe with relevant variables- 3 arrays of length 20Million.
These are x,y geometrical coordinates and z value on that (x,y) point.
I need a "heatmap" of z at each (x,y) point. But no pyplot function works with numbers this big.
What is the best way to go about it?
Dummy data
Tested with 200,000 rows
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
df=pd.DataFrame(np.random.rand(200000,2), columns=['X','Y'])
df['Z']=df.X+df.Y*2
Code
Create bin intervals for X and Y, group the dataframe by those bins and take the mean of Z, so that there is one mean Z value for every (X, Y) bin pair to plot. Finally, draw a scatter plot of the bin midpoints coloured by mean Z.
binsX = pd.cut(df.X, np.arange(0,1,0.001))
binsY = pd.cut(df.Y, np.arange(0,1,0.001))
binned = df.groupby([binsX,binsY])['Z'].mean().reset_index()
binned.X = binned.X.apply(lambda x: x.mid)
binned.Y = binned.Y.apply(lambda y: y.mid)
plt.scatter(binned.X, binned.Y, c=binned.Z, s=0.01)
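As an alternative to the groupby above (not the answerer's method), the same per-bin mean can be computed with `np.histogram2d`, which avoids building pandas interval objects and may scale better to tens of millions of rows. The sketch below assumes x, y and z are plain 1D arrays; `binned_mean` is a hypothetical helper name and the random data is only a stand-in.

```python
import numpy as np

def binned_mean(x, y, z, bins=100):
    """Mean of z in each (x, y) cell; cells with no points become NaN."""
    # Sum of z per cell, using z as the histogram weights
    sums, x_edges, y_edges = np.histogram2d(x, y, bins=bins, weights=z)
    # Number of points per cell, on the same edges
    counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    with np.errstate(invalid='ignore', divide='ignore'):
        means = sums / counts
    return means, x_edges, y_edges

# Stand-in data mirroring the dummy data above
rng = np.random.default_rng(0)
x = rng.random(200_000)
y = rng.random(200_000)
z = x + 2 * y
means, x_edges, y_edges = binned_mean(x, y, z, bins=50)
```

The result can then be drawn with `plt.pcolormesh(x_edges, y_edges, means.T)`; the transpose is needed because `histogram2d` puts x along the first axis.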

Plotting data with categorical x and y axes in python

I have a list of case and control samples along with the information about what characteristics are present or absent in each of them. A dataframe including the information can be generated by Pandas:
import pandas as pd
df={'Patient':[True,True,False],'Control':[False,True,False]} # Presence/absence data for three genes for each sample
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
I need to visualize this data as a dotplot/scatterplot in which both the x and y axes are categorical and presence/absence is coded by different shapes. Something like the following:
Patient| x x -
Control| - x -
__________________
GeneA GeneB GeneC
I am new to Matplotlib/seaborn and I can plot simple line plots and scatter plots. But searching online I could not find any instructions or plot similar to what I need here.
A quick way would be:
import pandas as pd
import matplotlib.pyplot as plt
df={'Patient':[1,1,0],'Control':[0,1,0]} # Presence/absence data for three genes for each sample
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
heatmap = plt.imshow(df)
plt.xticks(range(len(df.columns.values)), df.columns.values)
plt.yticks(range(len(df.index)), df.index)
cbar = plt.colorbar(mappable=heatmap, ticks=[0, 1], orientation='vertical')
# vertically oriented colorbar
cbar.ax.set_yticklabels(['Absent', 'Present'])
Thanks to @DEEPAK SURANA for adding labels to the colorbar.
I searched the pyplot documentation and could not find a scatter or dot plot exactly like you described. Here is my take on creating a plot that illustrates what you want. The True records are blue and the False records are red.
# creating dataframe and extra column because index is not numeric
import pandas as pd
df={'Patient':[True,True,False],
'Control':[False,True,False]}
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
df['level'] = [i for i in range(0, len(df))]
print(df)
# plotting the data
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,6))
for idx, gene in enumerate(df.columns[:-1]):
    df_gene = df[[gene, 'level']]
    cList = ['blue' if x == True else 'red' for x in df[gene]]
    for inr_idx, lv in enumerate(df['level']):
        ax.scatter(x=idx, y=lv, c=cList[inr_idx], s=20)
fig.tight_layout()
plt.yticks([i for i in range(len(df.index))], list(df.index))
plt.xticks([i for i in range(len(df.columns)-1)], list(df.columns[:-1]))
plt.show()
Something like this might work
import pandas as pd
import numpy as np
from matplotlib.ticker import FixedLocator
df={'Patient':[1,1,0],'Control':[0,1,0]} # Presence/absence data for three genes for each sample
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
plot = df.T.plot()
loc = FixedLocator([0,1,2])
plot.xaxis.set_major_locator(loc)
plot.xaxis.set_ticklabels(df.columns)
look at https://matplotlib.org/examples/pylab_examples/major_minor_demo1.html
and https://matplotlib.org/api/ticker_api.html
I think you have to convert the boolean values to zeros and ones to make it work. Something like df.astype(int)
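None of the answers above codes presence/absence by shape, which is what the question asks for. One possible approach is to split the table into present and absent cells and draw each group with its own marker. This is a sketch rather than a definitive solution: `presence_coordinates` and `plot_presence` are hypothetical helper names, and the marker choices ('x' and '_') are assumptions taken from the ASCII mock-up in the question.

```python
import numpy as np

def presence_coordinates(matrix):
    """Return (x, y) positions of True cells and of False cells in a 2D boolean table."""
    matrix = np.asarray(matrix, dtype=bool)
    rows, cols = np.nonzero(matrix)           # cells marked present
    rows_abs, cols_abs = np.nonzero(~matrix)  # cells marked absent
    return (cols, rows), (cols_abs, rows_abs)

def plot_presence(matrix, row_labels, col_labels):
    """Draw present cells as crosses and absent cells as dashes on categorical axes."""
    import matplotlib.pyplot as plt
    present, absent = presence_coordinates(matrix)
    fig, ax = plt.subplots()
    ax.scatter(*present, marker='x', color='black', label='Present')
    ax.scatter(*absent, marker='_', color='grey', label='Absent')
    ax.set_xticks(range(len(col_labels)))
    ax.set_xticklabels(col_labels)
    ax.set_yticks(range(len(row_labels)))
    ax.set_yticklabels(row_labels)
    ax.legend()
    return ax

table = [[True, True, False],    # Patient
         [False, True, False]]   # Control
present, absent = presence_coordinates(table)
```

With the dataframe from the question, `plot_presence(df.to_numpy(), df.index, df.columns)` followed by `plt.show()` would be the intended usage.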

How to get labels by numpy loadtxt?

I have a data file in the form of
Col0 Col1 Col2
2015 1 4
2016 2 3
The data is float, and I use numpy loadtxt to make an ndarray. However, I need to skip the label rows and columns to have an array of the data. How can I make the ndarray out of the data while reading the labels too?
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt("data.csv", skiprows=1)
# I need to skip the first row in reading the data but still get the labels.
x= data[:,0]
a= data[:,1]
b= data[:,2]
plt.xlabel(COL0) # Reading the COL0 value from the file.
plt.ylabel(COL1) # Reading the COL1 value from the file.
plt.plot(x,a)
NOTE: The labels (column titles) are unknown in the script. The script should be generic to work with any input file of the same structure.
With genfromtxt it is possible to get the names in a tuple. You can query on name, and you can get the names out into a variable using dtype.names[n], where n is an index.
import numpy as np
import matplotlib.pyplot as plt
data = np.genfromtxt('data.csv', names=True)
x = data[data.dtype.names[0]] # In this case this equals data['Col0'].
a = data[data.dtype.names[1]]
b = data[data.dtype.names[2]]
plt.figure()
plt.plot(x, a)
plt.xlabel(data.dtype.names[0])
plt.ylabel(data.dtype.names[1])
plt.show()
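To see what `names=True` returns without needing a file on disk, the sample from the question can be fed in through `io.StringIO`. This is just an illustrative sketch; the in-memory `sample` stands in for data.csv.

```python
import io
import numpy as np

# In-memory copy of the sample file from the question
sample = io.StringIO("Col0 Col1 Col2\n2015 1 4\n2016 2 3\n")
data = np.genfromtxt(sample, names=True)

labels = data.dtype.names   # column titles taken from the header row
x = data[labels[0]]         # values of the first column
a = data[labels[1]]         # values of the second column
```

Because the labels come from `data.dtype.names`, the script does not need to know the column titles in advance.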
This is not really an answer to the actual question, but I feel you might be interested in knowing how to do the same with pandas instead of numpy.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("data.csv", sep=r"\s+")  # whitespace-separated columns
df.set_index(df.columns[0]).plot()
plt.show()
would result in
As can be seen, there is no need to know any column name and the plot is labeled automatically.
Of course, the data can then also be plotted with matplotlib directly:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("data.csv", sep=r"\s+")  # whitespace-separated columns
x = df[df.columns[0]]
a = df[df.columns[1]]
b = df[df.columns[2]]
plt.figure()
plt.plot(x, a)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()

Plot a pandas dataframe with vertical lines

I want to plot a dataframe where each data point is not represented as a point but a vertical line from the zero axis like :
df['A'].plot(style='xxx')
where xxx is the style I need.
Also, ideally I would like to be able to color each bar based on the values in another column in my dataframe.
To be precise, my x-axis values are numbers and are not equally spaced.
The pandas plotting tools are convenient wrappers to matplotlib. There is no way I know of to get the functionality you want directly via pandas.
You can get it in a few lines of matplotlib. Most of the code is to do the colour mapping:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.colors as colors
import matplotlib.cm as cmx
#make the dataframe
a = np.random.rand(100)
b = np.random.rand(100)
df = pd.DataFrame({'a': a, 'b': b})
# do the colour mapping
c_norm = colors.Normalize(vmin=min(df.b), vmax=max(df.b))
scalar_map = cmx.ScalarMappable(norm=c_norm, cmap=plt.get_cmap('jet'))
color_vals = [scalar_map.to_rgba(val) for val in df.b]
# make the plot
plt.vlines(df.index, np.zeros_like(df.a), df.a, colors=color_vals)
I've used the DataFrame index for the x axis values but there is no reason that you could not use irregularly spaced x values.
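To illustrate the point about irregular spacing, here is a sketch that passes unevenly spaced x values to `vlines` and colours each line from a second array. The helper names `normalise` and `plot_vlines`, the 'viridis' colormap and the sample arrays are all assumptions made for the example.

```python
import numpy as np

def normalise(values):
    """Scale values linearly to the range [0, 1] (all zeros if they are constant)."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)
    return (values - values.min()) / span

def plot_vlines(x, heights, colour_by, cmap_name='viridis'):
    """Draw one vertical line from 0 to each height, coloured by colour_by."""
    import matplotlib.pyplot as plt
    cmap = plt.get_cmap(cmap_name)
    fig, ax = plt.subplots()
    ax.vlines(x, 0, heights, colors=cmap(normalise(colour_by)))
    return ax

# Stand-in data with unevenly spaced x positions
x = np.array([0.0, 0.5, 2.0, 2.1, 5.0])
heights = np.array([1.0, 3.0, 2.0, 4.0, 1.5])
colour_by = np.array([2.0, 4.0, 6.0, 4.0, 2.0])
scaled = normalise(colour_by)
```

Calling `plot_vlines(x, heights, colour_by)` and then `plt.show()` would draw the lines at the given x positions.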

Step plot by reading from file

I am a newbie to matplotlib. I am trying to plot a step function and am having some trouble. Right now I am able to read from the file and plot it as shown below, but the graph at the top is not in steps and the one below is not a proper step. I have seen examples that plot a step function from given x and y values, but I am not sure how to do it when reading from a file. Can someone help me?
from pylab import plotfile, show, gca
import matplotlib.pyplot as plt
import matplotlib.cbook as cbook
fname = cbook.get_sample_data('sample.csv', asfileobj=False)
plotfile(fname, cols=(0,1), delimiter=' ')
plotfile(fname, cols=(0,2), newfig=False, delimiter=' ')
plt.show()
Sample inputs(3 columns):
27023927 3 0
27023938 2 0
27023949 3 0
27023961 2 0
27023972 3 0
27023984 2 0
27023995 3 0
27024007 2 0
27024008 2 1
27024018 3 1
27024030 2 1
27024031 2 0
27024041 3 0
27024053 2 0
27024054 2 1
27024098 2 0
Note: I have made the y-axis1 values as 3 & 2 so that this graph can occur in the top and another y-axis2 values 0 & 1 so that it comes in the bottom as shown below
Waveform as it looks now
Essentially your resolution is too low. In the lower plot the steps (except the last one) occur over 1 unit in x, while the intervals between steps are about an order of magnitude larger. This gives the appearance of steps, but if you zoom in you will see that the vertical lines have a finite gradient (true steps change with an infinite gradient).
This is the same problem for both the top and bottom plots. We can easily remedy this by using the step function. You will generally find it easier to import the data first; in this example I use numpy's genfromtxt, which loads the data as an array called data:
import numpy as np
import matplotlib.pyplot as plt
data = np.genfromtxt('test.csv', delimiter=" ")
ax1 = plt.subplot(2,1,1)
ax1.step(data[:,0], data[:,1])
ax2 = plt.subplot(2,1,2)
ax2.step(data[:,0], data[:,2])
plt.show()
If you are new to Python, two things may be worth mentioning. First, we use two subplots (ax1 and ax2) to plot the data rather than plotting on the same axes, which means you don't need to offset the values to separate them spatially. Second, we access the elements of the array with [], which takes [row, column], with : meaning all rows and an index i selecting the ith column.
I would propose to load the data to a numpy array
import numpy as np
data = np.loadtxt('sample.csv')
And then plot it:
# first point
ax = [data[0,0]]
ay = [data[0,1]]
for i in range(1, data.shape[0]):
    if ay[-1] != data[i,1]: # if y value has changed
        # add current x and old y
        ax.append(data[i,0])
        ay.append(ay[-1])
    # add current x and current y
    ax.append(data[i,0])
    ay.append(data[i,1])
import matplotlib.pyplot as plt
plt.plot(ax,ay)
plt.show()
Where my solution differs from yours is that I plot two points for every change in y. The two points produce the 90 degree bend. I only plot the first curve. Change [?,1] to [?,2] for the second one.
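The same corner points can also be built without an explicit loop by repeating each x and y value. This is a sketch of that idea, not the answerer's code; `step_vertices` is a hypothetical helper name. Unlike the loop above, it adds a corner at every sample, even where y does not change, which draws the same line.

```python
import numpy as np

def step_vertices(x, y):
    """Return vertices that hold each y value until the next x, then jump."""
    x = np.asarray(x)
    y = np.asarray(y)
    xs = np.repeat(x, 2)[1:]    # x0, x1, x1, x2, x2, ...
    ys = np.repeat(y, 2)[:-1]   # y0, y0, y1, y1, y2, ...
    return xs, ys

xs, ys = step_vertices([0, 1, 2], [3, 2, 3])
```

Passing `xs` and `ys` to `plt.plot` then draws the staircase; for the second curve, the third column of the data would be used as `y`.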
Thanks for the suggestions. I was able to plot it after some research and here is my code,
import csv
import datetime
import matplotlib.pyplot as plt
import numpy as np
import dateutil.relativedelta as rd
import bisect
import scipy as sp
fname = "output.csv"
portfolio_list = []
x = []
a = []
b = []
portfolio = csv.DictReader(open(fname, "r"))
portfolio_list.extend(portfolio)
for data in portfolio_list:
    # csv yields strings, so convert to numbers before plotting
    x.append(float(data['i']))
    a.append(float(data['a']))
    b.append(float(data['b']))
stepList = [0, 1,2,3]
fig = plt.figure(figsize=(20, 10))
ax = fig.add_subplot(111)
plt.step(x, a, 'g', where='post')
plt.step(x, b, 'r', where='post')
plt.show()
and got the image like,
