Break line in Genome Diagram biopython - python

I'm using GenomeDiagram to display genomic information. I would like to separate the feature name from its location with a line break, so I do something like this:
gdFeature.add_feature(
    feat,
    color=red,
    sigil="ARROW",
    name=feat.qualifiers['product'][0].replace(" ", "_") + "\n" +
         str(feat.location.start) + " - " + str(feat.location.end),
    label_position="middle",
    label_angle=0,
    label=True)
gdFeature is an instance of the Feature class (http://biopython.org/DIST/docs/api/Bio.Graphics.GenomeDiagram._Feature.Feature-class.html).
The problem is that when I save the picture in PDF format, I get a black square instead of a line break.
That's not really what I want. Is there a way to do this?
Thanks

I don't see a direct way. Biopython uses ReportLab to generate the PDF. In GenomeDiagram, the drawString function is called to write the name/label. That is, when you pass a label like "First Line\nSecond Line", ReportLab generates PDF code that resembles this (numbers are made up):
BT /F1 48 Tf 1 0 0 1 210 400 Tm (First\nSecond)Tj ET
But that is not how two lines are printed in a PDF. You actually need two separate text operations, each at its own position, something like:
BT /F1 48 Tf 1 0 0 1 210 400 Tm (First)Tj ET
BT /F1 48 Tf 1 0 0 1 210 352 Tm (Second)Tj ET
And AFAIK, Biopython doesn't implement this.
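To illustrate the idea of one text operation per line, here is a minimal sketch. The helper split_label is hypothetical (it is not part of Biopython or ReportLab), and the coordinates and the 48-point line spacing are made-up values like the ones above. It only computes where each line would go; with a ReportLab canvas, each tuple could then be passed to canvas.drawString(x, y, text).

```python
def split_label(text, x, y, leading):
    """Return one (x, y, line) tuple per line of a multi-line label.

    `leading` is the distance between baselines, in points. PDF y
    coordinates grow upwards, so each later line gets a smaller y.
    """
    return [(x, y - i * leading, line)
            for i, line in enumerate(text.split("\n"))]


# Made-up position and spacing, for illustration only
placements = split_label("First\nSecond", x=210, y=400, leading=48)
for px, py, line in placements:
    # With a ReportLab canvas this would be: canvas.drawString(px, py, line)
    print(px, py, line)
```

Using this inside GenomeDiagram would still require changing how the drawer emits labels, so treat it as a sketch of the approach rather than a drop-in fix.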

Related

Python how to read a latex generated pdf with equations

Consider the following article
https://arxiv.org/pdf/2101.05907.pdf
It's a typically formatted academic paper with only two pictures in the PDF file.
The following code was used to extract the text and equations from the paper:
# Related code explanation: https://stackoverflow.com/questions/45470964/python-extracting-text-from-webpage-pdf
import io
import requests
url = "https://arxiv.org/pdf/2101.05907.pdf"  # the article linked above
r = requests.get(url)
f = io.BytesIO(r.content)
# Related code explanation: https://stackoverflow.com/questions/45795089/how-can-i-read-pdf-in-python
import PyPDF2
fileReader = PyPDF2.PdfFileReader(f)
# Related code explanation: https://automatetheboringstuff.com/chapter13/
print(fileReader.getPage(0).extractText())
However, the result was not quite correct:
Bohmpotentialforthetimedependentharmonicoscillator
FranciscoSoto-Eguibar
1
,FelipeA.Asenjo
2
,SergioA.Hojman
3
andH
´
ectorM.
Moya-Cessa
1
1
InstitutoNacionaldeAstrof´
´
OpticayElectr´onica,CalleLuisEnriqueErroNo.1,SantaMar´Tonanzintla,
Puebla,72840,Mexico.
2
FacultaddeIngenier´yCiencias,UniversidadAdolfoIb´aŸnez,Santiago7491169,Chile.
3
DepartamentodeCiencias,FacultaddeArtesLiberales,UniversidadAdolfoIb´aŸnez,Santiago7491169,Chile.
DepartamentodeF´FacultaddeCiencias,UniversidaddeChile,Santiago7800003,Chile.
CentrodeRecursosEducativosAvanzados,CREA,Santiago7500018,Chile.
Abstract.
IntheMadelung-Bohmapproachtoquantummechanics,weconsidera(timedependent)phasethatdependsquadrati-
callyonpositionandshowthatitleadstoaBohmpotentialthatcorrespondstoatimedependentharmonicoscillator,providedthe
timedependentterminthephaseobeysanErmakovequation.
Introduction
Harmonicoscillatorsarethebuildingblocksinseveralbranchesofphysics,fromclassicalmechanicstoquantum
mechanicalsystems.Inparticular,forquantummechanicalsystems,wavefunctionshavebeenreconstructedasisthe
caseforquantizedincavities[1]andforion-laserinteractions[2].Extensionsfromsingleharmonicoscillators
totimedependentharmonicoscillatorsmaybefoundinshortcutstoadiabaticity[3],quantizedpropagatingin
dielectricmedia[4],Casimire
ect[5]andion-laserinteractions[6],wherethetimedependenceisnecessaryinorder
totraptheion.
Timedependentharmonicoscillatorshavebeenextensivelystudiedandseveralinvariantshavebeenobtained[7,8,9,
10,11].Alsoalgebraicmethodstoobtaintheevolutionoperatorhavebeenshown[12].Theyhavebeensolvedunder
variousscenariossuchastimedependentmass[12,13,14],timedependentfrequency[15,11]andapplicationsof
invariantmethodshavebeenstudiedindi
erentregimes[16].Suchinvariantsmaybeusedtocontrolquantumnoise
[17]andtostudythepropagationoflightinwaveguidearrays[18,19].Harmonicoscillatorsmaybeusedinmore
generalsystemssuchaswaveguidearrays[20,21,22].
Inthiscontribution,weuseanoperatorapproachtosolvetheone-dimensionalSchr
¨
odingerequationintheBohm-
Madelungformalismofquantummechanics.ThisformalismhasbeenusedtosolvetheSchr
¨
odingerequationfor
di
erentsystemsbytakingtheadvantageoftheirnon-vanishingBohmpotentials[23,24,25,26].Alongthiswork,
weshowthatatimedependentharmonicoscillatormaybeobtainedbychoosingapositiondependentquadratictime
dependentphaseandaGaussianamplitudeforthewavefunction.Wesolvetheprobabilityequationbyusingoperator
techniques.Asanexamplewegivearationalfunctionoftimeforthetimedependentfrequencyandshowthatthe
Bohmpotentialhasdi
erentbehaviorforthatfunctionalitybecauseanauxiliaryfunctionneededinthescheme,
namelythefunctionsthatsolvestheErmakovequation,presentstwodi
erentsolutions.
One-dimensionalMadelung-Bohmapproach
ThemainequationinquantummechanicsistheSchrodingerequation,thatinonedimensionandforapotential
V
(
x
;
t
)
iswrittenas(forsimplicity,weset
}
=
1)
i
#
(
x
;
t
)
#
t
=
1
2
m
#
2
(
x
;
t
)
#
x
2
+
V
(
x
;
t
)
(
x
;
t
)
(1)
arXiv:2101.05907v1 [quant-ph] 14 Jan 2021
As shown:
The spacing disappeared, for example in the title, which resulted in meaningless strings.
The LaTeX equations were wrong, and it got worse on the second page.
How can I fix this and correctly extract the text and equations from a PDF file that was generated from LaTeX?

Extract a string between other two in Python

I am trying to extract the comments from an FDF file (a PDF comment file). In practice, this means extracting a string that lies between two others. I did the following:
I open the fdf file with the following commands:
import re
import os
os.chdir("currentworkingdirectory")
archcom = open("comentarios.fdf", "r")
cadena = archcom.read()
With the file open, I create a string called cadena with all the info I need. For example:
cadena = "\n215 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<</W 3.0>>\nendobj\n219 0 obj\n<</W 3.0>>\nendobj\ntrailer\n<</Root 1 0 R>>\n%%EOF\n"
I try to extract the needed info with the following line:
a = re.findall(r"nendobj(.*?)W 3\.0",cadena)
Trying to get:
a = "n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<<"
But I got:
a = []
The problem is in the line a = re.findall(r"nendobj(.*?)W 3\.0",cadena) but I can't see where. I have tried many combinations with no success.
I appreciate any comment.
Regards
It seems to me that there are 2 problems:
a) You are looking for nendobj, but the n is actually part of the line break \n. For the same reason you won't get a leading n in the output: there is no literal n there.
b) Since the text you're looking for spans several newlines, you need the re.DOTALL flag so that . also matches \n.
Final code:
a = re.findall(r"endobj(.*?)W 3\.0", cadena, re.DOTALL)
Also note that there will be a second result, as confirmed by Regex101.
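The effect of the flag can be checked directly on the string from the question. This is a small sketch; the variable names no_flag and matches are only for illustration.

```python
import re

cadena = ("\n215 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n216 0 obj\n<</D[2.0 2.0]/S/D>>"
          "\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<</W 3.0>>"
          "\nendobj\n219 0 obj\n<</W 3.0>>\nendobj\ntrailer\n<</Root 1 0 R>>\n%%EOF\n")

pattern = r"endobj(.*?)W 3\.0"

# Without re.DOTALL, "." stops at the newline right after "endobj"
no_flag = re.findall(pattern, cadena)

# With re.DOTALL, "." also matches "\n", so the lazy group can span lines
matches = re.findall(pattern, cadena, re.DOTALL)

print(no_flag)
for m in matches:
    print(repr(m))
```

The first result runs from object 216 up to the "<</" that precedes the first "W 3.0"; the second result is the short span around object 219.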

Pickle persistence

I started using Python recently and I'm trying to write a program that stores data using pickle. However, I would like my file to look something like this:
CODE | PIECE | PRICE
line one 1 1 1,00
line two 2 2 2,00
Consider the 1 right below CODE, the 1 right below PIECE and the 1,00 right below PRICE, and so on until it gets to 50.
Here's the question: is there any way to do this using pickle? Like:
import pickle

columns = int(input('Number of columns : '))  # Which would be 3 (code, piece and price)
data = []
for i in range(columns):
    raw = input('Enter data ' + str(i) + ' : ')
    data.append(raw)
file = open('file.dat', 'wb')
pickle.dump(data, file)
file.close()
Obviously, it cannot be done using input, so is there some way to do this?

Removing Rows and Columns in .dat file using python

I'm wondering if there's a simple method for deleting particular rows and columns in python. Apologies if this is a trivial question.
To give some context, I'm currently writing a script to automate a series of Linux commands (specifically CIAO Chandra telescope analysis commands), part of which saves the output of a certain command to a .dat file. At present the output includes some rows and columns which I don't want in there...
E.g. the data currently looks like:
Data for Table Block HISTOGRAM
--------------------------------------------------------------------------------
ROW CELL RCENTRE RHALFWIDTH AREA COUNTS SUR_BRI
1 1 1.016260150 1.016260150 12.9783552105 0 0
2 1 3.048780450 1.016260150 38.9350656315 1.0 0.02568378873336
3 1 5.081300750 1.016260150 64.8917760526 1.0 0.01541027324001
4 1 7.113821050 1.016260150 90.8484864736 1.0 0.01100733802858
5 1 9.146341350 1.016260150 116.8051968946 0 0
6 1
-------------------------------------
-------------------------------------
I want to remove the first few rows, which contain "Data for Table Block HISTOGRAM" and the dashes, and also the first two columns, which begin with "ROW" and "CELL".
Thanks in advance
Assuming your separator is a tab, or that you don't need to keep exactly the same spacing between columns in the output, and assuming the interesting lines begin with a digit (as shown in your example), you could write something like this:
def cutFile(fname, firstWantedColumn):
    """Keep wanted lines and columns: detect lines beginning with a digit and keep the columns from firstWantedColumn on"""
    f = open(fname, "r")
    lf = f.readlines()
    f.close()
    txtOut = ""
    for l in lf:
        if l[0].isdigit():  # detect if first char is a digit
            txtOut += "\t".join(l.split()[firstWantedColumn:]) + "\n"
    # write the result to e.g. "myFile_cutted.dat" (opened in write mode)
    g = open(".".join(fname.split(".")[:-1]) + "_cutted." + fname.split(".")[-1], "w")
    g.write(txtOut)
    g.close()

cutFile("myFile.dat", 2)
Edit: this is a brute-force solution; maybe you were looking for a more advanced one-liner, but I'm not sure one exists.

Python - Parsing Conundrum

I have searched high and low for a resolution to this situation, and tested a few different methods, but I haven't had any luck thus far. Basically, I have a file with data in the following format that I need to convert into a CSV:
(previously known as CyberWay Pte Ltd)
0 2019
01.com
0 1975
1 TRAVEL.COM
0 228
1&1 Internet
97 606
1&1 Internet AG
0 1347
1-800-HOSTING
0 8
1Velocity
0 28
1st Class Internet Solutions
0 375
2iC Systems
0 192
I've tried using re.sub and replacing the whitespace between the numbers on every other line with a comma, but haven't had any success so far. I admit that I normally parse from CSVs, so raw text has been a bit of a challenge for me. I would need to maintain the string formats that are above each respective set of numbers.
I'd prefer the CSV to be formatted as such:
foo bar
0,8
foo bar
0,9
foo bar
0,10
foo bar
0,11
There's about 50,000 entries, so manually editing this would take an obscene amount of time.
If anyone has any suggestions, I'd be most grateful.
Thank you very much.
If you just want to replace whitespace with comma, you can just do:
line = ','.join(line.split())
You'll have to do this only on every other line, but from your question it sounds like you already figured out how to work with every other line.
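For completeness, here is a minimal sketch of applying that join only to every other line. It assumes the file strictly alternates between a name line and a numbers line, as in the sample; the function name commas_on_number_lines is made up for this example.

```python
def commas_on_number_lines(lines):
    """Leave name lines alone; turn whitespace into commas on the line after each."""
    out = []
    for index, line in enumerate(lines):
        line = line.strip()
        if index % 2 == 1:
            # Second line of each pair holds the numbers, e.g. "0 1975"
            line = ",".join(line.split())
        out.append(line)
    return out


# A few lines taken from the sample data in the question
sample = ["01.com", "0 1975", "1 TRAVEL.COM", "0 228"]
converted = commas_on_number_lines(sample)
print("\n".join(converted))
```

Checking the line index rather than the line content matters here, because some names (such as "1 TRAVEL.COM") also start with a digit.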
If I have correctly understood your requirement, you need a strip() on all lines and a split on whitespace on the even lines (counting lines from 1):
import re
fp = open("csv.txt", "r")
while True:
    line = fp.readline()
    if '' == line:
        break
    line = line.strip()
    # the next line holds the two numbers
    fields = re.split(r"\s+", fp.readline().strip())
    print("\"%s\",%s,%s" % (line, fields[0], fields[1]))
fp.close()
The output is a CSV (you might need to escape quotes if they occur in your input):
"Content of odd line",Number1,Number2
I do not understand the 'foo,bar' you place as header on your example's odd lines, though.
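If quotes or commas can occur in the names, one option is to let the standard csv module do the escaping. This is a sketch under the same assumption of strictly alternating name and number lines; pairs_to_csv is a made-up helper name and the second sample name is invented to show the quoting. Unlike the output above, csv.writer only quotes a field when it needs to.

```python
import csv
import io


def pairs_to_csv(lines):
    """Turn alternating name / "n1 n2" lines into CSV text with proper quoting."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # zip(it, it) over one iterator yields consecutive (name, numbers) pairs
    it = iter(lines)
    for name, numbers in zip(it, it):
        writer.writerow([name.strip()] + numbers.split())
    return buffer.getvalue()


# The second name is invented to show how quotes and commas get escaped
sample = ['1&1 Internet', '97 606', 'Say "hi", Ltd', '0 8']
csv_text = pairs_to_csv(sample)
print(csv_text)
```

For a real file, the list could be replaced by the open file object, since iterating over a file also yields one line at a time.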
