Graphically displaying BLAST alignments from local source - python

I have an issue that I am trying to work through. I have a large dataset of about 25,000 genes that seem to be the product of domain shuffling or gene fusions. I would like to view these alignments in PDF format based on BLAST outfmt 6 output.
I have a BLAST output file for each of these genes, with one query sequence (the recombinogenic gene) and a varying number of subject genes, with the following columns:
qseqid sseqid evalue qstart qend qlen sstart send slen length
I was hoping to parse the files through some code to produce images like the attached file, using the following example BLAST output file:
Cluster_1___Hsap10003 Cluster_2___Hsap00200 1e-30 5 100 300 10 105 240 95
Cluster_1___Hsap10003 Cluster_2___Hsap00200 1e-10 200 230 300 205 235 30 95
Cluster_1___Hsap10003 Cluster_3___Aver00900 1e-20 5 100 300 10 105 125 100
Cluster_1___Hsap10003 Cluster_3___Atha00809 1e-20 5 110 300 5 115 120 105
Cluster_1___Hsap10003 Cluster_4___Ecol00002 1e-10 70 170 300 205 235 30 95
Cluster_1___Hsap10003 Cluster_4___Ecol00003 1e-30 75 175 300 10 105 240 95
Cluster_1___Hsap10003 Cluster_4___Sfle00009 1e-10 80 180 300 205 235 30 95
Cluster_1___Hsap10003 Cluster_5___Spom00010 1e-30 160 260 300 10 105 240 95
Cluster_1___Hsap10003 Cluster_5___Scer01566 1e-10 170 270 300 205 235 30 95
Cluster_1___Hsap10003 Cluster_5___Afla00888 1e-30 175 275 300 10 105 240 95
I am looking for the query sequence to be drawn as a thick coloured bar, and the aligned section of each subject as a thick coloured bar, with thin black lines showing the rest of the gene length (one subject per line, showing all aligned sections against the query).
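For reference, here is a minimal matplotlib sketch of the kind of parsing and drawing I have in mind (the filename, the colours and the per-HSP backbone drawing are my own assumptions, not a tested solution):
import matplotlib.pyplot as plt

COLS = ["qseqid", "sseqid", "evalue", "qstart", "qend",
        "qlen", "sstart", "send", "slen", "length"]
INT_COLS = ["qstart", "qend", "qlen", "sstart", "send", "slen", "length"]

# Parse the whitespace-delimited outfmt 6 file into a list of dicts.
hits = []
with open("blast_hits.tsv") as fh:  # filename is an assumption
    for line in fh:
        rec = dict(zip(COLS, line.split()))
        for k in INT_COLS:
            rec[k] = int(rec[k])
        hits.append(rec)

subjects = list(dict.fromkeys(rec["sseqid"] for rec in hits))  # keep input order
qlen, qseqid = hits[0]["qlen"], hits[0]["qseqid"]

fig, ax = plt.subplots(figsize=(8, 0.4 * (len(subjects) + 2)))
ax.barh(len(subjects), qlen, height=0.6, color="black")  # thick query bar on top

for rec in hits:
    y = len(subjects) - 1 - subjects.index(rec["sseqid"])
    # Thin black line for the full subject length, shifted into query
    # coordinates (drawn once per HSP in this sketch).
    offset = rec["qstart"] - rec["sstart"]
    ax.plot([offset, offset + rec["slen"]], [y, y], color="black", lw=0.8)
    # Thick coloured bar over the aligned region of the query.
    ax.barh(y, rec["qend"] - rec["qstart"], left=rec["qstart"],
            height=0.5, color=plt.cm.tab10(y % 10))

ax.set_yticks(range(len(subjects) + 1))
ax.set_yticklabels(list(reversed(subjects)) + [qseqid], fontsize=7)
ax.set_xlabel("query position")
fig.tight_layout()
fig.savefig("alignments.pdf")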
Does anyone know of any software or GitHub code that might do something like this?
Thanks so much!

data.dropna() doesn't work for my data.csv file and I still get data with NaN elements

I'm studying pandas in Python.
I'm trying to remove NaN elements from my data.csv file with data.dropna(), but it isn't removing them.
import pandas as pd
data = pd.read_csv('data.csv')
new_data = data.dropna()
print(new_data)
This is data.csv content.
Duration Date Pulse Maxpulse Calories
60 '2020/12/01' 110 130 409.1
60 '2020/12/02' 117 145 479.0
60 '2020/12/03' 103 135 340.0
45 '2020/12/04' 109 175 282.4
45 '2020/12/05' 117 148 406.0
60 '2020/12/06' 102 127 300.0
60 '2020/12/07' 110 136 374.0
450 '2020/12/08' 104 134 253.3
30 '2020/12/09' 109 133 195.1
60 '2020/12/10' 98 124 269.0
60 '2020/12/11' 103 147 329.3
60 '2020/12/12' 100 120 250.7
60 '2020/12/12' 100 120 250.7
60 '2020/12/13' 106 128 345.3
60 '2020/12/14' 104 132 379.3
60 '2020/12/15' 98 123 275.0
60 '2020/12/16' 98 120 215.2
60 '2020/12/17' 100 120 300.0
45 '2020/12/18' 90 112 NaN
60 '2020/12/19' 103 123 323.0
45 '2020/12/20' 97 125 243.0
60 '2020/12/21' 108 131 364.2
45 NaN 100 119 282.0
60 '2020/12/23' 130 101 300.0
45 '2020/12/24' 105 132 246.0
60 '2020/12/25' 102 126 334.5
60 2020/12/26 100 120 250.0
60 '2020/12/27' 92 118 241.0
60 '2020/12/28' 103 132 NaN
60 '2020/12/29' 100 132 280.0
60 '2020/12/30' 102 129 380.3
60 '2020/12/31' 92 115 243.0
My guess is that data.csv is written incorrectly?
The data.csv file is written incorrectly; to fix it, you need to add commas.
Corrected format: data.csv
Duration,Date,Pulse,Maxpulse,Calories
60,2020/12/01,110,130,409.1
60,2020/12/02,117,145,479.0
60,2020/12/03,103,135,340.0
45,2020/12/04,109,175,282.4
45,2020/12/05,117,148,406.0
60,2020/12/06,102,127,300.0
60,2020/12/07,110,136,374.0
450,2020/12/08,104,134,253.3
30,2020/12/09,109,133,195.1
60,2020/12/10,98,124,269.0
60,2020/12/11,103,147,329.3
60,2020/12/12,100,120,250.7
60,2020/12/12,100,120,250.7
60,2020/12/13,106,128,345.3
60,2020/12/14,104,132,379.3
60,2020/12/15,98,123,275.0
60,2020/12/16,98,120,215.2
60,2020/12/17,100,120,300.0
45,2020/12/18,90,112,
60,2020/12/19,103,123,323.0
45,2020/12/20,97,125,243.0
60,2020/12/21,108,131,364.2
45,,100,119,282.0
60,2020/12/23,130,101,300.0
45,2020/12/24,105,132,246.0
60,2020/12/25,102,126,334.5
60,2020/12/26,100,120,250.0
60,2020/12/27,92,118,241.0
60,2020/12/28,103,132,
60,2020/12/29,100,132,280.0
60,2020/12/30,102,129,380.3
60,2020/12/31,92,115,243.0
TL;DR:
Try this:
new_data = data.fillna(pd.NA).dropna()
or:
import numpy as np
new_data = data.fillna(np.nan).dropna()
Is that the real csv file? I don't think so.
There isn't any specification for missing values in the CSV format. In my experience, missing values in CSV are represented by nothing between two separators (if the separator is a comma, it looks like ,,).
From pandas doc, the pandas.read_csv contains an argument na_values:
na_values : scalar, str, list-like, or dict, optional
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
If your csv file contains 'NaN', pandas is able to infer it and read it as NaN, but you can pass the parameter as you need.
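For example, a minimal sketch (the extra token "missing" here is made up purely for illustration):
import pandas as pd

# Treat the literal string "missing" as NaN, in addition to pandas' defaults.
data = pd.read_csv('data.csv', na_values=['missing'])
new_data = data.dropna()
print(new_data)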
Also, to inspect the type of a single cell, you can use (consider i the row number and j the column number):
type(df.iloc[i,j])
Compare with:
type(np.NaN) # numpy NaN
float
type(pd.NA) # pandas NaN
pandas._libs.missing.NAType

regex for data preparation and processing afterwards in python

I have a quite big file of data, which is not in a really good state for further processing. So I want to regex the best out of it and process the data in pandas for further analysis.
The Data-Information segment repeats itself within the file and contains the necessary information.
My approach so far with the regex was to get some header information out of it. What I'm missing right now are the three sections of data points. I only need the header from Points to the last data point. How could I capture these sections into one or multiple groups?
^(?:Data-Information.*)
(?:\nName:\t+)(?P<Name>.+)
(?:\nSample:\t+)(?P<Sample>.+)
((?:\r?\n.+)+)
(?:\nSystem:\t+)(?P<System>.+)
(?:\r?\n(?!Data-Information).*)*
Sample file
Data-Information
Name: Polymer A
Sample: Sunday till Monday
User: SUD
Count Segments: 5
Application: RHEOSTAR
Tool: CP
Date/Time: 24.10.2021; 13:37
System: CP25
Constants:
- Csr [min/s]: 2,5421
- Css [Pa/mNm]: 2,54679
Section: 1
Number measuring points: 0
Time limit: 2 measuring points, drop
Duration 30 s
Measurement profile:
Temperature T[-1] = 25 °C
Section: 2
Number measuring points: 30
Time limit: 30 measuring points
Duration 2 s
Points Time Viscosity Shear rate Shear stress Momentum Status
[s] [Pa·s] [1/s] [Pa] [mNm] []
1 62 10,93 100 1.090 4,45 TGC,Dy_
2 64 11,05 100 1.100 4,5 TGC,Dy_
3 66 11,07 100 1.110 4,51 TGC,Dy_
4 68 11,05 100 1.100 4,5 TGC,Dy_
5 70 10,99 100 1.100 4,47 TGC,Dy_
6 72 10,92 100 1.090 4,44 TGC,Dy_
Section: 3
Number measuring points: 0
Time limit: 2 measuring points, drop
Duration 60 s
Section: 4
Number measuring points: 30
Time limit: 30 measuring points
Duration 2 s
Points Time Viscosity Shear rate Shear stress Momentum Status
[s] [Pa·s] [1/s] [Pa] [mNm] []
*** 1 *** 242 -6,334E+6 -0,0000115 72,7 0,296 TGC,Dy_
2 244 63,94 10,3 661 2,69 TGC,Dy_
3 246 35,56 20,7 736 2,99 TGC,Dy_
4 248 25,25 31 784 3,19 TGC,Dy_
5 250 19,82 41,4 820 3,34 TGC,Dy_
Section: 5
Number measuring points: 300
Time limit: 300 measuring points
Duration 1 s
Points Time Viscosity Shear rate Shear stress Momentum Status
[s] [Pa·s] [1/s] [Pa] [mNm] []
1 301 4,142 300 1.240 5,06 TGC,Dy_
2 302 4,139 300 1.240 5,05 TGC,Dy_
3 303 4,138 300 1.240 5,05 TGC,Dy_
4 304 4,141 300 1.240 5,06 TGC,Dy_
5 305 4,156 300 1.250 5,07 TGC,Dy_
6 306 4,153 300 1.250 5,07 TGC,Dy_
Data-Information
Name: Polymer B
Sample: Monday till Tuesday
User: SUD
Count Segments: 5
Application: RHEOSTAR
Tool: CP
Date/Time: 24.10.2021; 13:37
System: CP25
Constants:
- Csr [min/s]: 2,5421
- Css [Pa/mNm]: 2,54679
Section: 1
Number measuring points: 0
Time limit: 2 measuring points, drop
Duration 30 s
Measurement profile:
Temperature T[-1] = 25 °C
Section: 2
Number measuring points: 30
Time limit: 30 measuring points
Duration 2 s
Points Time Viscosity Shear rate Shear stress Momentum Status
[s] [Pa·s] [1/s] [Pa] [mNm] []
1 62 10,93 100 1.090 4,45 TGC,Dy_
2 64 11,05 100 1.100 4,5 TGC,Dy_
3 66 11,07 100 1.110 4,51 TGC,Dy_
4 68 11,05 100 1.100 4,5 TGC,Dy_
5 70 10,99 100 1.100 4,47 TGC,Dy_
6 72 10,92 100 1.090 4,44 TGC,Dy_
Section: 3
Number measuring points: 0
Time limit: 2 measuring points, drop
Duration 60 s
Section: 4
Number measuring points: 30
Time limit: 30 measuring points
Duration 2 s
Points Time Viscosity Shear rate Shear stress Momentum Status
[s] [Pa·s] [1/s] [Pa] [mNm] []
*** 1 *** 242 -6,334E+6 -0,0000115 72,7 0,296 TGC,Dy_
2 244 63,94 10,3 661 2,69 TGC,Dy_
3 246 35,56 20,7 736 2,99 TGC,Dy_
4 248 25,25 31 784 3,19 TGC,Dy_
5 250 19,82 41,4 820 3,34 TGC,Dy_
Section: 5
Number measuring points: 300
Time limit: 300 measuring points
Duration 1 s
Points Time Viscosity Shear rate Shear stress Momentum Status
[s] [Pa·s] [1/s] [Pa] [mNm] []
1 301 4,142 300 1.240 5,06 TGC,Dy_
2 302 4,139 300 1.240 5,05 TGC,Dy_
3 303 4,138 300 1.240 5,05 TGC,Dy_
4 304 4,141 300 1.240 5,06 TGC,Dy_
5 305 4,156 300 1.250 5,07 TGC,Dy_
6 306 4,153 300 1.250 5,07 TGC,Dy_
One option is to do it in 2 steps.
First get all the Data-Information parts using a pattern that starts with Data-Information and matches all following lines that do not start with Data-Information.
^Data-Information(?:\n(?!Data-Information$).*)*
Regex demo for Data-Information
Then, for every part, you can match the line that starts with Points, and then match all following lines that contain at least one character (no empty lines):
^Points\b.*(?:\n.+)+
Regex demo for Points
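Putting the two steps together in Python, a minimal sketch (the filename is an assumption):
import re

with open('rheometer.txt') as fh:  # filename is an assumption
    text = fh.read()

# Step 1: grab each Data-Information block.
block_re = re.compile(r'^Data-Information(?:\n(?!Data-Information$).*)*', re.M)
# Step 2: within a block, a Points header plus all following non-empty lines.
points_re = re.compile(r'^Points\b.*(?:\n.+)+', re.M)

for block in block_re.finditer(text):
    for table in points_re.finditer(block.group()):
        print(table.group().splitlines()[0])
From there, each table.group() could be handed to pandas, for example via pd.read_csv(io.StringIO(...), sep=r'\s+', decimal=','), skipping the units line as needed.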

Convert PDF to text: Adobe Reader vs. Python libraries

I have a PDF which I try to convert to text for further processing.
The structure of the PDF is stable but tricky, as it also contains elements and graphs that sometimes serve as a background for the text written at that position. Therefore, I'd like to extract as much text as possible.
I first tried the Adobe Reader function to save the PDF as text, which gives good results but doesn't allow this process to be fully automated. At least, I don't know a way to interact with Adobe Reader through the command line or a script.
Therefore, I tried some Python libraries designed for this purpose, but it seems they convert the PDF to text differently. I tried PDFMiner, PyPDF2 and pdftotext. None of these libraries gives me the same result as Adobe Reader.
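For reference, the pdfminer call I used looks roughly like this (a sketch; the filename is an assumption):
from pdfminer.high_level import extract_text  # pdfminer.six

text = extract_text('report.pdf')  # filename is an assumption
print(text[:500])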
The PDF looks like the following (a little cropped due to sensitive data which isn't relevant):
Adobe extracts the following text:
OCT 15° (4.3 mm) ART (25) Q: 34 [HR]
ILMILM200μm200μm 04590135180225270315360
TMPTSNSNASNITITMP
1000 800 600 400 200 0
Position [°]
CC
7.7 (APS)
G227(12%) T206(54%) TS226(20%) TI304(38%) N203(5%) NS213(6%)
NI276(12%) Segmentationunconfirmed! Classification MRW Within
Normal Limits
OCT ART (100) Q: 31 [HS]
ILMILMRNFLRNFL200μm200μm 111 04590135180225270315360
300 240 180 120 60 0
TMP TS NS NAS NI TI TMP
Position [°]
CC
7.7 (APS)
Classification RNFLT Outside Normal Limits
G78<1% T62(15%) TS103(5%) TI134(10%) N65(7%) NS77(3%) NI73(3%)
Segmentationunconfirmed! RNFL Thickness (3.5 mm) [μm]
WithinNormalLimits(>5%) Borderline(<5%)OutsideNormalLimits(<1%)
While, for example, PDFminer extracts:
Average Thickness [�m]
Vol [mm�]
8.26
200 �m 200 �m
OCT 20.0� (5.6 mm) ART (21) Q: 25 [HS]
267
1.42
321
0.50
335
0.53
299
1.59
Center:
Central Min:
Central Max:
222 �m
221 �m
314 �m
Circle Diameters: 1, 3, 6 mm ETDRS
292
1.55
331
0.52
272
0.21
326
0.51
271
1.44
ILMILM
BMBM
200 �m 200 �m
Which is a lot different. Is there any reason for that, and do you know of any Python library that has the same ability as Adobe Reader to convert PDF to text?
Not necessarily an explanation as to why Adobe Reader extracts the text from a PDF differently from some Python libraries, but I have achieved a really good result with tika.
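The call itself is short; a minimal sketch (the filename is an assumption, and the tika package needs a Java runtime):
from tika import parser  # pip install tika; requires a Java runtime

parsed = parser.from_file('report.pdf')  # filename is an assumption
print(parsed['content'])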
This is what tika extracted:
OCT 15� (4.2 mm) ART (26) Q: 31 [HR]
NITSTMP NAS TMPTINSM in
im u
m R
im W
id th
[ �
m ]
1000 800 600 400 200
0
Position [�]
36031527022518013590450
ILMILM
RNFLRNFL
200 �m200 �m
OCT ART (100) Q: 27 [HS]
NITSTMP NAS TMPTINS
R N
F L T
h ickn
e ss (3
.5 m
m ) [�
m ]
300 240 180 120 60 0
Position [�]
36031527022518013590450
40
G 240
(10%)
T 239
(70%)
TS 213 (9%)
TI 285
(22%)
N 230 (5%)
NS 209 (3%)
NI 283 (9%)
CC 7.7 (APS)
Segmentation unconfirmed!
Classification MRW
Borderline
G 78
<1%
T 58
(8%)
TS 91
(2%)
TI 124 (6%)
N 64
(8%)
NS 110
(43%)
NI 71
(4%)
CC 7.7 (APS)
Segmentation unconfirmed!
Classification RNFLT
Outside Normal Limits
Within Normal Limits (>5%)
Borderline (<5%) Outside Normal Limits (<1%)
Reference database: European Descent (2014)

Nested loop to replace rows in dataframe

I'm trying to write a for loop that takes each row in a dataframe and compares it to the rows in a second dataframe.
If the row in the second dataframe:
isn't in the first dataframe already
has a higher value in the total_points column
has a lower cost than the available budget (row_budget)
then I want to remove the row from the first dataframe and add the row from the second dataframe in its place.
Example data:
df
code team_name total_points now_cost
78 93284 BHA 38 50
395 173514 WAT 42 50
342 20452 SOU 66 50
92 17761 BUR 97 50
427 18073 WHU 99 50
69 61933 BHA 115 50
130 116594 CHE 116 50
pos_pool
code team_name total_points now_cost
438 90585 WOL 120 50
281 67089 NEW 131 50
419 37096 WHU 143 50
200 97032 LIV 208 65
209 110979 LIV 231 115
My expected output for the first three loops should be:
df
code team_name total_points now_cost
92 17761 BUR 97 50
427 18073 WHU 99 50
69 61933 BHA 115 50
130 116594 CHE 116 50
438 90585 WOL 120 50
281 67089 NEW 131 50
419 37096 WHU 143 50
Here is the nested for loop that I've tried:
for index, row in df.iterrows():
    budget = squad['budget']
    team_limits = squad['team_limits']
    pos_pool = players_1920.loc[players_1920['position'] == row['position']].sort_values('total_points', ascending=False)
    row_budget = row.now_cost + 1000 - budget
    for index2, row2 in pos_pool.iterrows():
        if (row2 not in df) and (row2.total_points > row.total_points) and (row2.now_cost <= row_budget):
            team_limits[row.team_name] += 1
            team_limits[row2.team_name] -= 1
            budget += row.now_cost - row2.now_cost
            df = df.append(row2)
            df = df.drop(row)
        else:
            pass
return df
At the moment I am iterating through the first dataframe, but nothing seems to happen with the second.
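For reference, here is a sketch of the replacement logic I'm aiming for, checking membership on the code column; the starting budget and the squad/team-limit bookkeeping are simplified assumptions:
import pandas as pd

budget = 0  # assumed starting budget, purely for illustration
df = df.sort_values('total_points').reset_index(drop=True)

for i, row in df.iterrows():
    row_budget = budget + row.now_cost
    candidates = pos_pool[
        (~pos_pool['code'].isin(df['code']))          # not already in df
        & (pos_pool['total_points'] > row.total_points)
        & (pos_pool['now_cost'] <= row_budget)
    ]
    if not candidates.empty:
        best = candidates.sort_values('total_points', ascending=False).iloc[0]
        budget += row.now_cost - best.now_cost
        df.loc[i] = best  # replace the weaker row in place
On the example data, this replaces the three lowest-scoring rows with WOL 120, NEW 131 and WHU 143, matching the expected output.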

Python barbs wrong direction

There is probably a really simple answer to this, and I'm only asking as a last resort as I usually find my answers by searching, but I can't figure this out. Basically, I'm plotting some wind barbs in Python, but they are pointing in the wrong direction and I don't know why.
The data is imported from a file and put into lists. I found on another Stack Overflow post how to set the U and V components for barbs using np.sin and np.cos, which results in the correct wind speed, but the direction is wrong. I'm basically plotting a very simple tephigram or Skew-T.
# Program to read in radiosonde data from a file named "raob.dat"
# Import numpy since we are going to use numpy arrays and the loadtxt
# function.
import numpy as np
import matplotlib.pyplot as plt
# Open the file for reading and store the file handle as "f"
# The filename is 'raob.dat'
f=open('data.dat')
# Read the data from the file handle f. np.loadtxt() is useful for reading
# simply-formatted text files.
datain=np.loadtxt(f)
# Close the file.
f.close()
# We can copy the different columns into
# pressure, temperature and dewpoint temperature
# Note that the colon means consider all elements in that dimension.
# and remember indices start from zero
p=datain[:,0]
temp=datain[:,1]
temp_dew=datain[:,2]
wind_dir=datain[:,3]
wind_spd=datain[:,4]
print('Pressure/hPa: ', p)
print('Temperature/C: ', temp)
print('Dewpoint temperature: ', temp_dew)
print('Wind Direction/Deg: ', wind_dir)
print('Wind Speed/kts: ', wind_spd)
# for the barb vectors. This is the bit I think is causing the problem
u=wind_spd*np.sin(wind_dir)
v=wind_spd*np.cos(wind_dir)
#change units
#p=p/10
#temp=temp/10
#temp_dew=temp_dew/10
#plot graphs
fig1=plt.figure()
x1=temp
x2=temp_dew
y1=p
y2=p
x=np.linspace(50,50,len(y1))
#print x
plt.plot(x1,y1,'r',label='Temp')
plt.plot(x2,y2,'g',label='Dew Point Temp')
plt.legend(loc=3,fontsize='x-small')
plt.gca().invert_yaxis()
#fig2=plt.figure()
plt.barbs(x,y1,u,v)
plt.yticks(y1)
plt.grid(axis='y')
plt.show()
The barbs should all point in roughly the same direction, as you can see from the wind directions in degrees in the data.
Any help is appreciated. Thank you.
Here is the data that is used:
996 25.2 24.9 290 12
963.2 24.5 22.6 315 42
930.4 23.8 20.1 325 43
929 23.8 20 325 43
925 23.4 19.6 325 43
900 22 17 325 43
898.6 21.9 17 325 43
867.6 20.1 16.5 320 41
850 19 16.2 320 44
807.9 16.8 14 320 43
779.4 15.2 12.4 320 44
752 13.7 10.9 325 43
725.5 12.2 9.3 320 44
700 10.6 7.8 325 45
649.7 7 4.9 315 44
603.2 3.4 1.9 325 49
563 0 -0.8 325 50
559.6 -0.2 -1 325 50
500 -3.5 -4.9 335 52
499.3 -3.5 -5 330 54
491 -4.1 -5.5 332 52
480.3 -5 -6.4 335 50
427.2 -9.7 -11 330 45
413 -11.1 -12.3 335 43
400 -12.7 -14.4 340 42
363.9 -16.9 -19.2 350 37
300 -26.3 -30.2 325 40
250 -36.7 -41.2 330 35
200 -49.9 0 335 0
150 -66.6 0 0 10
100 -83.5 0 0 30
Liam
# for the barb vectors. This is the bit I think is causing the problem
u=wind_spd*np.sin(wind_dir)
v=wind_spd*np.cos(wind_dir)
Instead try:
u=wind_spd*np.sin((np.pi/180)*wind_dir)
v=wind_spd*np.cos((np.pi/180)*wind_dir)
(http://tornado.sfsu.edu/geosciences/classes/m430/Wind/WindDirection.html)
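Equivalently, numpy's conversion helper makes the intent explicit:
import numpy as np

# np.sin and np.cos expect radians, so convert the degree values first.
u = wind_spd * np.sin(np.deg2rad(wind_dir))
v = wind_spd * np.cos(np.deg2rad(wind_dir))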
