Plot with matplotlib from .csv file containing duplicate column names - python

I am plotting lines using the combined ID1 and ID2 columns. In the .csv file, the ID1 and ID2 numbers can be repeated at some point. A new line should start directly after a row where ID2 = 0. I want the program to recognize the sample data below as 2 separate lines.
ID1 ID2 x y
1 2 1 1
1 2 2 2
1 2 3 3
1 2 4 4
1 0 5 5
...
1 2 1 3
1 2 2 5
1 2 3 7
Right now, my program would plot this data as a continuous line in the same color. I need a new line in a different color, but I can't figure out how to filter the data to start a new line even when the ID1 and ID2 values are duplicates. The program needs to see the '0' in the ID2 column as a signal to start a new line. Any ideas would be very helpful.

One option is to find the indices of the zeros and loop over them to create individual DataFrames to plot.
u = u"""ID1 ID2 x y
1 2 1 1
1 2 2 2
1 2 3 3
1 2 4 4
1 0 5 5
1 2 1 3
1 2 2 5
1 2 3 7
1 0 1 3
1 2 2 4
1 2 3 2
1 2 4 1"""
import io
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# sep=r"\s+" splits on any whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv(io.StringIO(u), sep=r"\s+")
fig, ax = plt.subplots()
inx = list(np.where(df["ID2"].values==0)[0]+1)
inx = [0] + inx + [len(df)]
for i in range(len(inx)-1):
    dff = df.iloc[inx[i]:inx[i+1],:]
    dff.plot(x="x", y="y", ax=ax, label="Label {}".format(i))
plt.show()

One way you could do it is to use cumsum and seaborn plotting with hue:
temp_df = df.assign(line_no=df.ID2.eq(0).cumsum()).query('ID2 != 0')
import seaborn as sns
_ = sns.pointplot(x='x',y='y', hue='line_no',data=temp_df)
Or with matplotlib:
fig,ax = plt.subplots()
for i in temp_df.line_no.unique():
    # @i refers to the Python variable i inside the query string
    x=temp_df.query('line_no == @i')['x']
    y=temp_df.query('line_no == @i')['y']
    ax.plot(x,y)
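The two answers above can also be combined into a single groupby. The following is only a sketch under assumptions: the helper names `segment_ids` and `plot_segments` are made up for illustration, and the row with ID2 = 0 is assumed to be the last point of the line it closes (as in the first answer), not a point to be dropped.

```python
import io
import pandas as pd

SAMPLE = """ID1 ID2 x y
1 2 1 1
1 2 2 2
1 2 3 3
1 2 4 4
1 0 5 5
1 2 1 3
1 2 2 5
1 2 3 7
1 0 1 3
1 2 2 4
1 2 3 2
1 2 4 1"""


def segment_ids(df):
    """Label each row with a running line number (hypothetical helper).

    A row with ID2 == 0 ends a line, so the counter increases on the
    row that FOLLOWS a zero; hence the shift before the cumulative sum.
    """
    ends = df["ID2"].eq(0)
    return ends.shift(fill_value=False).cumsum().rename("segment")


def plot_segments(df, ax):
    """Draw one line per segment on an existing matplotlib Axes."""
    for seg, part in df.groupby(segment_ids(df)):
        # each ax.plot call picks the next color from the color cycle
        ax.plot(part["x"], part["y"], label="Line {}".format(seg))
    ax.legend()
    return ax


df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+")
sizes = df.groupby(segment_ids(df)).size().tolist()
print(sizes)
```

To draw the figure, create an Axes with `fig, ax = plt.subplots()`, call `plot_segments(df, ax)`, and then `plt.show()`.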

Related

Need pandas code that converts a df with multiple samples to run a box plot

I am writing a script to make box plots from some RNA-Seq data.
The pseudo-code
1. Select a row based on gene name
2. make a column for each type of cell
3. make box plot
I have 1 and 3 down
df2 = df[df[0].str.match("TCAP")]
????
import plotly.express as px
fig = px.box(df,x="CellType",y = "Expression",title = "GENE")
fig.show()
The code needs to convert the following tables
Gene Celltype-1_#1 Celltype-1_#2 Celltype-1_#3 Celltype-2_#1 Celltype-2_#2 Celltype-2_#3
A 1 1 1 3 3 3
B 5 5 5 4 4 4
to this, using: df2 = df[df[0].str.match("TCAP")]
Gene Celltype-1_#1 Celltype-1_#2 Celltype-1_#3 Celltype-2_#1 Celltype-2_#2 Celltype-2_#3
A 1 1 1 3 3 3
Then I need code to turn it into this
Gene CellType Expression
A 1 1
A 1 1
A 1 1
A 2 3
A 2 3
A 2 3
You could use the DataFrame.stack method for this kind of transformation.
# need to have an index to make stack work
df = df.set_index('Gene')
# stack returns a series here
df = df.stack().to_frame().reset_index()
# At this point we have:
# Gene level_1 0
# 0 A Celltype-1_#1 1
# 1 A Celltype-1_#2 1
# 2 A Celltype-1_#3 1
# 3 A Celltype-2_#1 3
# 4 A Celltype-2_#2 3
# 5 A Celltype-2_#3 3
# 6 B Celltype-1_#1 5
# 7 B Celltype-1_#2 5
# 8 B Celltype-1_#3 5
# 9 B Celltype-2_#1 4
# 10 B Celltype-2_#2 4
# 11 B Celltype-2_#3 4
df.columns = ['Gene', 'Celltype', 'Expression']
# optionally rename values in celltype column
df['Celltype'] = df['Celltype'].apply(lambda t: t[9:10])
# now you can select by Gene or any other columns and pass to Plotly:
print(df[df['Gene'] == 'A'])
# Gene Celltype Expression
# 0 A 1 1
# 1 A 1 1
# 2 A 1 1
# 3 A 2 3
# 4 A 2 3
# 5 A 2 3
Note that by stacking the whole dataframe upfront, it's now simple to select multiple genes at once and pass them to Plotly together:
df_many = df[df['Gene'].isin(['A', 'B'])]
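As an alternative to stack, the same wide-to-long reshaping can be sketched with DataFrame.melt. This is a swapped-in technique, not the method from the answer above, and it assumes that every sample column is named like `Celltype-<n>_#<replicate>`, as in the question's table.

```python
import pandas as pd

# the wide table from the question
wide = pd.DataFrame({
    "Gene": ["A", "B"],
    "Celltype-1_#1": [1, 5],
    "Celltype-1_#2": [1, 5],
    "Celltype-1_#3": [1, 5],
    "Celltype-2_#1": [3, 4],
    "Celltype-2_#2": [3, 4],
    "Celltype-2_#3": [3, 4],
})

# one row per (gene, sample column) pair
long = wide.melt(id_vars="Gene", var_name="Sample", value_name="Expression")

# pull the cell type number out of the column name
# (assumed pattern: Celltype-<n>_#<replicate>)
long["CellType"] = long["Sample"].str.extract(r"Celltype-(\d+)_#\d+", expand=False)

gene_a = long.loc[long["Gene"] == "A", ["Gene", "CellType", "Expression"]]
print(gene_a.to_string(index=False))
```

The resulting `gene_a` frame has the `CellType` and `Expression` columns that the `px.box` call in the question expects.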

ggplot multiple lines in python, output only contains the first line

I am trying to use ggplot to plot this DataFrame:
import pandas as pd
from ggplot import *
df = pd.DataFrame()
df['x'] = [1,2,3,4,5,6]
df['y'] = [1,6,7,2,3,6]
df['id'] = ['a','a','b','b','c','c']
I get the output
x y id
0 1 1 a
1 2 6 a
2 3 7 b
3 4 2 b
4 5 3 c
5 6 6 c
I wish to plot 3 segments with different colors distinguished by 'id'.
ggplot(df,aes(x='x',y='y',colour='id')) + geom_line()
The output contains only the first segment 'a'.
What is wrong with my code?

Multi indexing line plot

After slicing, I have a multi header Dataframe with two levels, indexed by date, obtained like this:
df = df.iloc[:, df.columns.get_level_values(1).isin({'a','b'})]
Date  one     two
      a   b   a   b
2     2   3   3   3
3     2   3   3   3
4     2   3   3   3
5     2   3   3   3
6     2   3   3   3
7     2   3   3   3
What I would like to do is plot this DataFrame as a line plot with Date on the x-axis, the same color for each level-0 header, and solid/dashed lines for the level-1 headers.
I have tried unstacking, i.e.
df.unstack(level=0).plot(kind='line')
but with no success. The plot as it is now shows Date on the x-axis but treats each combination of level-0 and level-1 headers as a new entry.
Here is a picture of the plot obtained:
What we would like to implement is a two-level legend (color / line style).
Code Example:
import numpy as np
import pandas as pd
A = np.random.rand(4,4)
C = pd.DataFrame(A, index=range(4), columns=[np.array(['A','A','B','B']), np.array(['a','b','a','b'])])
C.plot(kind='line')
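One way this could be approached, as a rough sketch only: plot each column separately and derive the color from the level-0 label and the line style from the level-1 label. The helper names `two_level_styles` and `plot_two_level` are made up for illustration, and the colors `C0`, `C1`, ... are matplotlib's default color-cycle names.

```python
import numpy as np
import pandas as pd


def two_level_styles(columns):
    """Map each (level0, level1) column to a (color, linestyle) pair.

    Hypothetical helper: the color follows level 0, the line style level 1.
    """
    level0 = list(columns.get_level_values(0).unique())
    level1 = list(columns.get_level_values(1).unique())
    linestyles = ["-", "--", ":", "-."]
    styles = {}
    for top, sub in columns:
        color = "C{}".format(level0.index(top) % 10)
        linestyle = linestyles[level1.index(sub) % len(linestyles)]
        styles[(top, sub)] = (color, linestyle)
    return styles


def plot_two_level(df, ax):
    """Draw every column of df on an existing matplotlib Axes."""
    for col, (color, linestyle) in two_level_styles(df.columns).items():
        ax.plot(df.index, df[col], color=color, linestyle=linestyle,
                label="{} / {}".format(*col))
    ax.legend()
    return ax


A = np.random.rand(4, 4)
C = pd.DataFrame(A, index=range(4),
                 columns=[np.array(['A', 'A', 'B', 'B']),
                          np.array(['a', 'b', 'a', 'b'])])
styles = two_level_styles(C.columns)
print(styles)
```

With `fig, ax = plt.subplots()`, calling `plot_two_level(C, ax)` draws both 'A' columns in one color and both 'B' columns in another, with 'a' solid and 'b' dashed. This produces a single flat legend, not a true two-level one.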

Update in pandas on specific columns

I want to update values in one pandas data frame based on the values in another dataframe, but I want to specify which column to update by (i.e., which column should be the “key” for looking up matching rows). Right now it seems to treat the first column as the key. Is there a way to pass it a specific column name?
Example:
import pandas as pd
import numpy as np
df_a = pd.DataFrame()
df_a['x'] = range(5)
df_a['y'] = range(4, -1, -1)
df_a['z'] = np.random.rand(5)
df_b = pd.DataFrame()
df_b['x'] = range(5)
df_b['y'] = range(5)
df_b['z'] = range(5)
print('df_b:')
print(df_b.head())
print('\nold df_a:')
print(df_a.head(10))
df_a.update(df_b)
print('\nnew df_a:')
print(df_a.head())
Out:
df_b:
x y z
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
old df_a:
x y z
0 0 4 0.333648
1 1 3 0.683656
2 2 2 0.605688
3 3 1 0.816556
4 4 0 0.360798
new df_a:
x y z
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
As you can see, it replaced y and z in df_a with the respective columns in df_b based on matches of x between df_a and df_b.
What if I wanted to keep y the same? What if I wanted it to replace based on y and not x? Also, what if there are multiple columns on which I’d like to do the replacement? (In the real problem, I have to update a dataset with a new dataset, matching on two or three columns and taking the values from a fourth column.)
Basically, I want to do some sort of a merge-replace action, where I specify which columns I am merging/replacing on and which column should be replaced.
Hope this makes things clearer. If this cannot be accomplished with update in pandas, I am wondering if there is another way (short of writing a separate function with for loops for it).
This is my current solution, but it seems somewhat inelegant:
df_merge = df_a.merge(df_b, on='y', how='left', suffixes=('_a', '_b'))
print(df_merge.head())
df_merge['x'] = df_merge.x_b
df_merge['z'] = df_merge.z_b
df_update = df_a.copy()
df_update.update(df_merge)
print(df_update)
Out:
x_a y z_a x_b z_b
0 0 0 0.505949 0 0
1 1 1 0.231265 1 1
2 2 2 0.241109 2 2
3 3 3 0.579765 NaN NaN
4 4 4 0.172409 NaN NaN
x y z
0 0 0 0.000000
1 1 1 1.000000
2 2 2 2.000000
3 3 3 0.579765
4 4 4 0.172409
5 5 5 0.893562
6 6 6 0.638034
7 7 7 0.940911
8 8 8 0.998453
9 9 9 0.965866
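A possible approach, sketched under assumptions: DataFrame.update aligns rows on the index, not on any particular column, so moving the key columns into the index of both frames lets update match on them. The helper name `update_on` is made up for illustration, and the sketch assumes the key combinations are unique in both frames.

```python
import pandas as pd


def update_on(target, source, keys, columns):
    """Return a copy of `target` with `columns` overwritten from `source`
    wherever the values in `keys` match (hypothetical helper).

    Assumes each key combination occurs at most once per frame.
    The result gets a fresh 0..n-1 index.
    """
    indexed = target.set_index(keys)
    # update() aligns on the index, which now holds the key columns
    indexed.update(source.set_index(keys)[columns])
    return indexed.reset_index()[list(target.columns)]


df_a = pd.DataFrame({
    "x": range(5),
    "y": range(4, -1, -1),
    "z": [0.1, 0.2, 0.3, 0.4, 0.5],
})
df_b = pd.DataFrame({
    "x": [7, 8, 9],
    "y": [0, 1, 2],
    "z": [10.0, 11.0, 12.0],
})

# match rows on y, replace only z; x is left untouched
result = update_on(df_a, df_b, keys=["y"], columns=["z"])
print(result)
```

Passing several names in `keys` matches on a combination of columns, and `columns` limits which values are replaced.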

Pandas: Pivot table without sorting index and columns

I'm trying to pivot data in a way so that the index and columns of the resulting table aren't automatically sorted. An example of the data might be:
X Y Z
1 1 1
3 1 2
2 1 3
4 1 4
1 2 5
3 2 6
2 2 7
4 2 8
The data is interpreted as an X, Y and Z axis. The pivoted result should look like this:
X 1 3 2 4
Y
1 1 2 3 4
2 5 6 7 8
Instead the result looks like this, where the index and columns are sorted, and the data accordingly:
X 1 2 3 4
Y
1 1 3 2 4
2 5 7 6 8
At this point I have lost information about the order in which the measurements were taken. For example say that I would plot the row at Y=1, with X as the X axis and the data value on the Y axis.
This would result in the figures in this picture. On the right is how I would like the data to be plotted. Does anyone have an idea how to prevent pandas from sorting the index and columns when pivoting a table?
One alternative is to restore the order afterwards. Since the ordering is based on the X values relative to the Y values, you can restore the order of your X columns with something like this:
import pandas as pd
# using your sample data
df = pd.read_clipboard()
# keyword arguments are required by recent pandas versions
df = df.pivot(index='Y', columns='X', values='Z')
df
X 1 2 3 4
Y
1 1 3 2 4
2 5 7 6 8
# re-order your X columns by the values of first Y, for instance
df = df[df.T[1].values]
df
X 1 3 2 4
Y
1 1 2 3 4
2 5 6 7 8
Not the best approach, but it will achieve what you want.
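The reordering above works for this sample because the Z values in the first row happen to match the X labels. A sketch that does not depend on the Z values, assuming the desired order is the order of first appearance in the original data, is to reindex the pivoted table with the unique X and Y values of the unpivoted frame:

```python
import pandas as pd

# the sample data from the question
raw = pd.DataFrame({
    "X": [1, 3, 2, 4, 1, 3, 2, 4],
    "Y": [1, 1, 1, 1, 2, 2, 2, 2],
    "Z": [1, 2, 3, 4, 5, 6, 7, 8],
})

# pivot sorts both the index and the columns
pivoted = raw.pivot(index="Y", columns="X", values="Z")

# Series.unique() keeps the order of first appearance,
# so reindexing with it restores the measurement order
ordered = pivoted.reindex(index=raw["Y"].unique(), columns=raw["X"].unique())
print(ordered)
```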
