I would like to make a circos-like plot to visualize SNPs only (with multiple tracks for SNPs attributes). It could be done either with python, R or I am happy to consider other languages.
So far, I have taken a look at the circlize R package.
However, I get the error "Range of the sector ('C') cannot be 0" when initializing the circos plot.
I believe this error arises because I have discrete data (SNPs) instead of data for all positions: gene C has only a single position, so its sector spans a zero range. It may also be related to some data points being repeated.
I have simplified my data below and show the code that I have tried so far:
Sample Gene Pos read_depth Freq
1 A 20394 43 99
1 B 56902 24 99
2 A 20394 50 99
2 B 56902 73 99
3 A 20394 67 50
3 B 56902 20 99
3 C 2100394 21 50
install.packages("circlize")
library(circlize)
data <- read.table("test_circos.txt", sep='\t', header=TRUE)
circos.par("track.height" = 0.1)
circos.initialize(factors = data$Gene, x = data$Pos)  # this call throws the error
I would like to know whether it is possible to get a circos-like plot where each of my data points (7 in my example) is plotted as an individual point on a discrete axis, without any other positions being drawn.
If it is of interest to anyone, I decided to do as follows:
Number the data points within each category (= 'Gene') in a new column 'Number' (a pandas sketch follows the table):
Sample Gene Pos depth Freq Number
1 A 20394 43 99 1
1 B 56902 24 99 1
2 A 20394 50 99 2
2 B 56902 73 99 2
3 A 20394 67 50 3
3 B 56902 20 99 3
3 C 2100394 21 50 1
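One way to produce this numbering (a minimal pandas sketch, assuming the same tab-separated test_circos.txt as in the R snippet above):
import pandas as pd

data = pd.read_csv("test_circos.txt", sep='\t')       # tab-separated SNP table
data['Number'] = data.groupby('Gene').cumcount() + 1  # running count per gene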
Design the circos config file as follows (the header is not included in the real config file):
chr - ID LABEL START END COLOUR
chr - A A 0 3 chr1
chr - B B 0 3 chr2
chr - C C 0 1 chr3
This means that each gene has a length equal to the number of SNPs identified in it, so each bp of a gene represents one line (= one SNP) in my SNP file.
I can then use circos as normal.
In the end, I chose circos because it seemed the best documented, and therefore the easiest to learn, while also appearing the most flexible.
I have a df with numbers:
import pandas as pd

numbers = pd.DataFrame(columns=['number'], data=[
    50,
    65,
    75,
    85,
    90
])
and a df with ranges (lookup table):
ranges = pd.DataFrame(
    columns=['range','range_min','range_max'],
    data=[
        ['A',90,100],
        ['B',85,95],
        ['C',70,80]
    ]
)
I want to determine what range (in the second table) a value (in the first table) falls in. Please note that ranges overlap and limits are inclusive.
Also please note that while the dataframe above has 3 ranges, it is generated dynamically and could have anywhere from 2 to 7 ranges.
Desired result:
numbers = pd.DataFrame(columns=['number','detected_range'], data=[
    [50,'out_of_range'],
    [65,'out_of_range'],
    [75,'C'],
    [85,'B'],
    [90,'overlap']  # could be A or B
])
I solved this with a for loop, but it doesn't scale well to the big dataset I am using, and the code is long and inelegant. See below:
numbers['detected_range'] = np.nan
for i, row1 in numbers.iterrows():
    for j, row2 in ranges.iterrows():
        if row1.number >= row2.range_min and row1.number <= row2.range_max:
            numbers.loc[i, 'detected_range'] = row2['range']
        elif (other cases...):
            ...and so on...
How could I do this?
You can use a few numpy vectorized operations to generate masks, and use them to select your labels:
import numpy as np
a = numbers['number'].values # numpy array of numbers
r = ranges.set_index('range') # dataframe of min/max with labels as index
m1 = (a >= r['range_min'].values[:,None]).T # is number above each min
m2 = (a <= r['range_max'].values[:,None]).T # is number at or below each max (limits are inclusive)
m3 = (m1 & m2) # combine both conditions above
# NB. the two operations could be done without the intermediate variables m1/m2
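# e.g.: m3 = ((a >= r['range_min'].values[:,None]) & (a <= r['range_max'].values[:,None])).T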
m4 = m3.sum(1) # how many matches?
# 0 -> out_of_range
# 2 -> overlap
# 1 -> get column name
# now we select the label according to the conditions
numbers['detected_range'] = np.select([m4==0, m4>=2],  # out_of_range and overlap (two or more matches)
                                      ['out_of_range', 'overlap'],
                                      # otherwise get column name
                                      default=np.take(r.index, m3.argmax(1))
                                      )
Output:
number detected_range
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
Edit: it works with any number of intervals in ranges. Example output with an extra row ['D', 50, 51]:
number detected_range
0 50 D
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
Pandas IntervalIndex fits in here; however, since your ranges overlap, a for loop is the approach I'll use (for unique, non-overlapping intervals, pd.get_indexer is a fast approach):
intervals = pd.IntervalIndex.from_arrays(ranges.range_min,
                                         ranges.range_max,
                                         closed='both')
box = []
for num in numbers.number:
    bools = intervals.contains(num)
    if bools.sum() == 1:
        box.append(ranges.range[bools].item())
    elif bools.sum() > 1:
        box.append('overlap')
    else:
        box.append('out_of_range')

numbers.assign(detected_range=box)
number detected_range
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
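For comparison, if the intervals were unique and non-overlapping, the pd.get_indexer route mentioned above could look like this (a sketch under that assumption, reusing intervals and ranges from the code above):
import numpy as np

idx = intervals.get_indexer(numbers.number)  # -1 where no interval contains the number
numbers['detected_range'] = np.where(idx == -1, 'out_of_range',
                                     ranges['range'].to_numpy()[idx])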
First, explode the ranges (note the + 1: range() excludes its end point, and the limits are inclusive):
df1 = ranges.assign(col1=ranges.apply(lambda ss: range(ss.range_min, ss.range_max + 1), axis=1)).explode('col1')
df1
range range_min range_max col1
0 A 90 100 90
0 A 90 100 91
0 A 90 100 92
0 A 90 100 93
0 A 90 100 94
0 A 90 100 95
0 A 90 100 96
0 A 90 100 97
0 A 90 100 98
0 A 90 100 99
0 A 90 100 100
1 B 85 95 85
1 B 85 95 86
1 B 85 95 87
1 B 85 95 88
1 B 85 95 89
1 B 85 95 90
... (rows for the rest of B and for C omitted)
Second, determine which range each of the numbers in the first df falls into:
def function1(x):
    df11 = df1.loc[df1.col1 == x]
    if len(df11) == 0:
        return 'out_of_range'
    if len(df11) > 1:
        return 'overlap'
    return df11.iloc[0, 0]  # the 'range' label

numbers.assign(col2=numbers.number.map(function1))
number col2
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
The logic is simple and clear.
I have the following dataset, for 36 fragments in total (36 rows × 3 columns):
Fragment lower upper
0 1 1 5
1 2 2 5
2 3 3 5
3 4 2 5
4 5 1 5
5 6 1 5
I've calculated these lower and upper bounds from this dataset (966 rows × 2 columns):
Fragment Confidence Value
0 33 4
1 26 4
2 23 3
3 16 2
4 36 3
which contains multiple instances of each fragment and an associated Confidence Value.
The confidence values are data from a Likert scale, i.e. 1-5. I want to create an error bar plot where the y-axis has each fragment (1-36) and the x-axis shows the range/std/mean (?) of the confidence values for each fragment.
I've tried the following, but it's not exactly what I want; I think using the lower and upper bounds isn't the best idea, and maybe I need the std/range instead...
import pandas as pd
import matplotlib.pyplot as plt

# confpd is the second dataset from above
meanconfs = confpd.groupby('Fragment', as_index=False)['Confidence Value'].mean()
minconfs = confpd.groupby('Fragment', as_index=False)['Confidence Value'].min()
maxconfs = confpd.groupby('Fragment', as_index=False)['Confidence Value'].max()

data_dict = {}
data_dict['Fragment'] = [str(i) for i in range(1, 37)]  # labels '1'..'36'
data_dict['lower'] = minconfs['Confidence Value']
data_dict['upper'] = maxconfs['Confidence Value']
dataset = pd.DataFrame(data_dict)
# dataset is the first dataset I show above

for lower, upper, y in zip(dataset['lower'], dataset['upper'], range(len(dataset))):
    plt.plot((lower, upper), (y, y), 'o-', color='orange')
plt.yticks(range(len(dataset)), list(dataset['Fragment']))
The result of this code is not what I want.
Any help is greatly appreciated!!
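For what it's worth, here is a minimal sketch of the mean ± standard deviation variant hinted at above (assuming confpd as defined earlier; one possible reading of the goal, not a confirmed solution):
import matplotlib.pyplot as plt

# mean and standard deviation of the confidence values per fragment
stats = confpd.groupby('Fragment')['Confidence Value'].agg(['mean', 'std'])

# horizontal error bars: one marker per fragment at the mean, +/- one std
plt.errorbar(stats['mean'], stats.index, xerr=stats['std'],
             fmt='o', color='orange', capsize=3)
plt.xlabel('Confidence Value (Likert 1-5)')
plt.ylabel('Fragment')
plt.show()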
I have a CSV file like the one below (after sorting the dataframe by iy):
iy,u
1,80
1,90
1,70
1,50
1,60
2,20
2,30
2,35
2,15
2,25
I'm trying to compute the mean and the fluctuation when iy are equal. For example, for the CSV above, what I want is something like this:
iy,u,U,u'
1,80,70,10
1,90,70,20
1,70,70,0
1,50,70,-20
1,60,70,-10
2,20,25,-5
2,30,25,5
2,35,25,10
2,15,25,-10
2,25,25,0
Where U is the average of u when iy are equal, and u' is simply u-U, the fluctuation. I know that there's a function called groupby.mean() in pandas, but I don't want to group the dataframe, just take the mean, put the values in a new column, and then calculate the fluctuation.
How can I proceed?
Use groupby with transform to calculate the mean for each group and assign it to a new column 'U', then use plain column arithmetic to subtract the two columns:
df['U'] = df.groupby('iy')['u'].transform('mean')  # group mean, aligned to the original rows
df["u'"] = df['u'] - df['U']
df
Output:
iy u U u'
0 1 80 70 10
1 1 90 70 20
2 1 70 70 0
3 1 50 70 -20
4 1 60 70 -10
5 2 20 25 -5
6 2 30 25 5
7 2 35 25 10
8 2 15 25 -10
9 2 25 25 0
You could get fancy and do it in one line (eval names the new column u_prime, since u' is not a valid identifier inside eval):
df.assign(U=df.groupby('iy')['u'].transform('mean')).eval("u_prime = u - U")
I have a Pandas DataFrame with the following hypothetical data:
ID Time X-coord Y-coord
0 1 5 68 5
1 2 8 72 78
2 3 1 15 23
3 4 4 81 59
4 5 9 78 99
5 6 12 55 12
6 7 5 85 14
7 8 7 58 17
8 9 13 91 47
9 10 10 29 87
For each row (or ID), I want to find the ID with the closest proximity in time and space (X & Y) within this dataframe. Bonus: Time should have priority over XY.
Ideally, in the end I would like to have a new column called "Closest_ID" containing the most proximal ID within the dataframe.
I'm having trouble coming up with a function for this.
I would really appreciate any help or hint that points me in the right direction!
Thanks a lot!
Let's denote df as our dataframe. Then you can do something like:
import numpy as np
from sklearn.metrics import pairwise_distances

space_vals = df[['X-coord', 'Y-coord']]
time_vals = df[['Time']]  # 2-D input, as pairwise_distances expects
space_distance = pairwise_distances(space_vals)
time_distance = pairwise_distances(time_vals)

# mask the zero self-distances so a row is never its own nearest neighbour
space_distance[space_distance == 0] = 1e9  # arbitrary large number
time_distance[time_distance == 0] = 1e9  # again

closest_space_id = np.argmin(space_distance, axis=0)
closest_time_id = np.argmin(time_distance, axis=0)
Then, you can store the last 2 results in 2 columns, or somehow decide which one is closer.
Note: this code hasn't been checked, and it might have a few bugs...
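For example, to keep the time-based neighbour (a small sketch, given that time has priority; closest_time_id holds row positions, which are mapped back to IDs here):
df['Closest_ID'] = df['ID'].to_numpy()[closest_time_id]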
I have a pandas dataframe like this:
Index High Low MA(5)-MA(20)
0 100 90 -1
1 101 91 -2
2 102 92 +1
3 99 88 +2
I want to get the maximum of the highs when MA(5) - MA(20) is positive, and the minimum of the lows when it is negative.
The thing is that I want only the local maxima and minima, not the global ones, so the running maximum and minimum have to be reset each time the sign of MA(5) - MA(20) flips.
I do not want to use a for loop since they are really slow in Python.
Any help?
You can use np.sign to get the sign of the last column, group by consecutive runs of the same sign, and use np.where to assign values accordingly:
import numpy as np

v = np.sign(df['MA(5)-MA(20)']) < 1  # True where the difference is negative (or zero)
g = df.groupby(v.ne(v.shift()).cumsum())  # one group per consecutive run of the same sign
df['Maxima/Minima'] = np.where(
    v, g.Low.transform('min'), g.High.transform('max')
)
df
Index High Low MA(5)-MA(20) Maxima/Minima
0 0 100 90 -1 90
1 1 101 91 -2 90
2 2 102 92 1 102
3 3 99 88 2 102
You'll notice that rows are assigned the local minima/maxima values according to their sign.
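If a single value per sign run is preferred over the per-row column, a small follow-up sketch (reusing v from above):
run_id = v.ne(v.shift()).cumsum()  # same run labels as the groupby above
extrema = df.groupby(run_id)['Maxima/Minima'].first()  # one extremum per run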
Is this what you need?
v = df['MA(5)-MA(20)'].gt(0)
v = v.ne(v.shift()).cumsum()  # new group id each time the sign flips
df.groupby(v).High.transform('max').mask(df['MA(5)-MA(20)'] < 0, df.groupby(v).Low.transform('min'))
0 90
1 90
2 102
3 102
Name: High, dtype: int64