Python: how to read a LaTeX-generated PDF with equations

Consider the following article:
https://arxiv.org/pdf/2101.05907.pdf
It is a typically formatted academic paper, and the PDF file contains only two figures.
I used the following code to extract the text and equations from the paper:
# Related code explanation: https://stackoverflow.com/questions/45470964/python-extracting-text-from-webpage-pdf
import io
import requests
url = "https://arxiv.org/pdf/2101.05907.pdf"  # the article linked above
r = requests.get(url)
f = io.BytesIO(r.content)
# Related code explanation: https://stackoverflow.com/questions/45795089/how-can-i-read-pdf-in-python
import PyPDF2
fileReader = PyPDF2.PdfFileReader(f)
# Related code explanation: https://automatetheboringstuff.com/chapter13/
print(fileReader.getPage(0).extractText())
However, the result was not quite correct:
Bohmpotentialforthetimedependentharmonicoscillator
FranciscoSoto-Eguibar
1
,FelipeA.Asenjo
2
,SergioA.Hojman
3
andH
´
ectorM.
Moya-Cessa
1
1
InstitutoNacionaldeAstrof´
´
OpticayElectr´onica,CalleLuisEnriqueErroNo.1,SantaMar´Tonanzintla,
Puebla,72840,Mexico.
2
FacultaddeIngenier´yCiencias,UniversidadAdolfoIb´aŸnez,Santiago7491169,Chile.
3
DepartamentodeCiencias,FacultaddeArtesLiberales,UniversidadAdolfoIb´aŸnez,Santiago7491169,Chile.
DepartamentodeF´FacultaddeCiencias,UniversidaddeChile,Santiago7800003,Chile.
CentrodeRecursosEducativosAvanzados,CREA,Santiago7500018,Chile.
Abstract.
IntheMadelung-Bohmapproachtoquantummechanics,weconsidera(timedependent)phasethatdependsquadrati-
callyonpositionandshowthatitleadstoaBohmpotentialthatcorrespondstoatimedependentharmonicoscillator,providedthe
timedependentterminthephaseobeysanErmakovequation.
Introduction
Harmonicoscillatorsarethebuildingblocksinseveralbranchesofphysics,fromclassicalmechanicstoquantum
mechanicalsystems.Inparticular,forquantummechanicalsystems,wavefunctionshavebeenreconstructedasisthe
caseforquantizedincavities[1]andforion-laserinteractions[2].Extensionsfromsingleharmonicoscillators
totimedependentharmonicoscillatorsmaybefoundinshortcutstoadiabaticity[3],quantizedpropagatingin
dielectricmedia[4],Casimire
ect[5]andion-laserinteractions[6],wherethetimedependenceisnecessaryinorder
totraptheion.
Timedependentharmonicoscillatorshavebeenextensivelystudiedandseveralinvariantshavebeenobtained[7,8,9,
10,11].Alsoalgebraicmethodstoobtaintheevolutionoperatorhavebeenshown[12].Theyhavebeensolvedunder
variousscenariossuchastimedependentmass[12,13,14],timedependentfrequency[15,11]andapplicationsof
invariantmethodshavebeenstudiedindi
erentregimes[16].Suchinvariantsmaybeusedtocontrolquantumnoise
[17]andtostudythepropagationoflightinwaveguidearrays[18,19].Harmonicoscillatorsmaybeusedinmore
generalsystemssuchaswaveguidearrays[20,21,22].
Inthiscontribution,weuseanoperatorapproachtosolvetheone-dimensionalSchr
¨
odingerequationintheBohm-
Madelungformalismofquantummechanics.ThisformalismhasbeenusedtosolvetheSchr
¨
odingerequationfor
di
erentsystemsbytakingtheadvantageoftheirnon-vanishingBohmpotentials[23,24,25,26].Alongthiswork,
weshowthatatimedependentharmonicoscillatormaybeobtainedbychoosingapositiondependentquadratictime
dependentphaseandaGaussianamplitudeforthewavefunction.Wesolvetheprobabilityequationbyusingoperator
techniques.Asanexamplewegivearationalfunctionoftimeforthetimedependentfrequencyandshowthatthe
Bohmpotentialhasdi
erentbehaviorforthatfunctionalitybecauseanauxiliaryfunctionneededinthescheme,
namelythefunctionsthatsolvestheErmakovequation,presentstwodi
erentsolutions.
One-dimensionalMadelung-Bohmapproach
ThemainequationinquantummechanicsistheSchrodingerequation,thatinonedimensionandforapotential
V
(
x
;
t
)
iswrittenas(forsimplicity,weset
}
=
1)
i
#
(
x
;
t
)
#
t
=
1
2
m
#
2
(
x
;
t
)
#
x
2
+
V
(
x
;
t
)
(
x
;
t
)
(1)
arXiv:2101.05907v1 [quant-ph] 14 Jan 2021
As shown:
The spacing disappeared (in the title, for example), which left meaningless run-together strings.
The LaTeX equations came out wrong, and it got worse on the second page.
How can I fix this and correctly extract the text and equations from a PDF file that was generated from LaTeX?

Related

Extract a string between other two in Python

I am trying to extract the comments from an FDF file (a PDF comment file). In practice, this means extracting a string that sits between two other strings. I did the following:
I open the FDF file with the following commands:
import re
import os
os.chdir("currentworkingdirectory")
archcom =open("comentarios.fdf", "r")
cadena = archcom.read()
With the opened file, I create a string called cadena with all the info I need. For example:
cadena = "\n215 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<</W 3.0>>\nendobj\n219 0 obj\n<</W 3.0>>\nendobj\ntrailer\n<</Root 1 0 R>>\n%%EOF\n"
I try to extract the needed info with the following line:
a = re.findall(r"nendobj(.*?)W 3\.0",cadena)
Trying to get:
a = "n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<<"
But I got:
a = []
The problem is in the line a = re.findall(r"nendobj(.*?)W 3\.0",cadena) but I can't see where. I have tried many combinations with no success.
I appreciate any comment.
Regards
It seems to me that there are 2 problems:
a) You are looking for nendobj, but the n is actually part of the line break \n. So you also won't get a leading n in the output, because the string contains no literal n at that position.
b) Since the text you're looking for spans several newlines, you need the re.DOTALL flag so that . also matches \n.
Final code:
a = re.findall(r"endobj(.*?)W 3\.0", cadena, re.DOTALL)
Also note that there will be a second result, as confirmed by Regex101.
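To see the effect of the flag, here is a small self-contained check that uses the string from the question. The names without_flag and with_flag are only illustrative.

```python
import re

# The example string from the question, split across lines for readability.
cadena = (
    "\n215 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj"
    "\n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj"
    "\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj"
    "\n218 0 obj\n<</W 3.0>>\nendobj"
    "\n219 0 obj\n<</W 3.0>>\nendobj\ntrailer\n<</Root 1 0 R>>\n%%EOF\n"
)

# Without re.DOTALL, "." stops at every newline, so nothing can match.
without_flag = re.findall(r"endobj(.*?)W 3\.0", cadena)

# With re.DOTALL, "." also matches "\n" and the lazy group can span lines.
with_flag = re.findall(r"endobj(.*?)W 3\.0", cadena, re.DOTALL)

print(len(without_flag), len(with_flag))
for piece in with_flag:
    print(repr(piece))
```

The first result runs from object 216 up to the first W 3.0; the second result is the short stretch between the next endobj and the second W 3.0.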

Find the first pattern before the specific substring in python

In Python 3.6.5, say I have a string, read from file, like this:
# comments
newmtl material_0_2_8
Kd 1 1 1
Ka 0 0 0
Ks 0.4 0.4 0.4
Ke 0 0 0
Ns 10
illum 2
map_Kd ../images/texture0.png
newmtl material_1_24
Kd 1 1 1
Ka 0 0 0
Ks 0.4 0.4 0.4
Ke 0 0 0
Ns 10
illum 2
newmtl material_20_1_8
Kd 1 1 1
Ka 0 0 0
Ks 0.4 0.4 0.4
Ke 0 0 0
Ns 10
illum 2
d 1.0
map_Kd ../images/texture0.jpg
... and so on ...
I'm looping over the textures and I need to get the corresponding material code for each one.
I want to retrieve the material_* substring that corresponds to a given texture*, whose name I know.
So for example, if I have texture0.jpg, I want to return material_20_1_8; if I have texture0.png then I want to have material_0_2_8.
How can I do it along these lines?
f = open('path/to/file', "r")
if f.mode == 'r':
    contents = f.read()  # contains the string shown above
    for texture in textures:  # textures is the list of the texture names
        material_code = ?
Or any other way, if you think you know a better one.
Try this:
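mapping = {}
with open('input.txt', 'r') as fin:
    for line in fin:
        line = line.strip()  # also safe when the last line has no trailing newline
        if line.startswith('newmtl'):
            # remember the most recent material name
            material = line[len('newmtl '):]
        elif line.startswith('map_Kd'):
            # keep only the file name part of the texture path
            filename = line.split('/')[-1]
            mapping[filename] = material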
Then mapping is a dict with the relations you want:
{'texture0.jpg': 'material_20_1_8', 'texture0.png': 'material_0_2_8'}
Iteratively:
import re
textures = ('texture0.jpg', 'texture0.png')
with open('input.txt') as f:
    pat = re.compile(r'\bmaterial_\S+')
    for line in f:
        line = line.strip()
        m = pat.search(line)
        if m:
            material = m.group()
        elif line.endswith(textures):
            print(line.split('/')[-1], material)
The output:
texture0.png material_0_2_8
texture0.jpg material_20_1_8
Those who like regular expressions may prefer this approach for its compactness and efficiency.
re.findall() returns a sequence of the matched groups (the parts of the regexp enclosed in parentheses) for all matches of the regular expression in the input data. The regular expression finds each "newmtl" line together with the "map_Kd" line that follows it within the same material block, and extracts the value parts of those lines using regex groups. The negative lookahead (?!^newmtl ) keeps a match from running into the next block, so a material without a map_Kd line (material_1_24 in the sample) is skipped instead of being paired with the next material's texture. The values are then swapped to build the needed dictionary with a dictionary comprehension.
I like this solution because it is compact and efficient. Note that I added just one (admittedly multiline) expression and one import to the original example. If you can read regular expressions, it is also quite readable.
import re
f = open('path/to/file', "r")
if f.mode == 'r':
    contents = f.read()  # contains the string shown above
    materials = {
        filename: material for material, filename in
        re.findall(r'^newmtl (material_\S+)$(?:(?!^newmtl ).)*?^map_Kd \.\./images/(.+?)$',
                   contents, re.MULTILINE | re.DOTALL)
    }
    for texture in textures:  # textures is the list of the texture names
        material_code = materials[texture]
The regular expression in this example works with the given data. If you need to be stricter or more permissive about whitespace or other variability in the source data, it may need further tuning.
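One caveat with a lazy .*? under re.DOTALL is that a material block with no map_Kd line can get paired with the texture of a later block. A minimal sketch that guards against this with a negative lookahead, checked against an abbreviated copy of the sample data; SAMPLE, PATTERN and materials_by_texture are illustrative names, not part of the original answers.

```python
import re

# Abbreviated copy of the sample from the question; material_1_24 has no map_Kd line.
SAMPLE = """\
# comments
newmtl material_0_2_8
Kd 1 1 1
illum 2
map_Kd ../images/texture0.png
newmtl material_1_24
Kd 1 1 1
illum 2
newmtl material_20_1_8
Kd 1 1 1
d 1.0
map_Kd ../images/texture0.jpg
"""

PATTERN = re.compile(
    r'^newmtl (material_\S+)$'      # the material name line
    r'(?:(?!^newmtl ).)*?'          # anything, but never past the next newmtl line
    r'^map_Kd \.\./images/(.+?)$',  # the texture line of the same block
    re.MULTILINE | re.DOTALL,
)


def materials_by_texture(contents):
    """Map texture file name -> material name for blocks that have a texture."""
    return {filename: material for material, filename in PATTERN.findall(contents)}


materials = materials_by_texture(SAMPLE)
print(materials)
```

With this pattern, material_1_24 simply does not appear in the result, because no match can start at its newmtl line.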

Python read part of a pdf page

I'm trying to read a pdf file where each page is divided into 3x3 blocks of information of the form
A | B | C
D | E | F
G | H | I
Each of the entries is broken into multiple lines. A simplified example of one entry is this card. But then there would be similar entries in the other 8 slots.
I've looked at pdfminer and pypdf2. I haven't found pdfminer overly useful, but pypdf2 has given me something close.
import PyPDF2
def getPDFContent(path):
    content = ""
    # open() replaces the Python 2-only file() builtin
    with open(path, "rb") as p:
        pdf = PyPDF2.PdfFileReader(p)
        numPages = pdf.getNumPages()
        for i in range(numPages):
            content += pdf.getPage(i).extractText() + "\n"
    # collapse non-breaking spaces and runs of whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content
However, this only reads the file line by line. I'd like a solution where I can read only a portion of the page, so that I could read A, then B, then C, and so on. Also, the answer here works fairly well, but the order of the columns routinely gets distorted and I've only gotten it to read line by line.
I assume the PDF files in question are generated PDFs rather than scanned (as in the example you gave), given that you're using pdfminer and pypdf2. If you know the size of the columns and rows in inches you can use minecart (full disclosure: I wrote minecart). Example code:
import minecart
# minecart units are 1/72 inch, measured from bottom-left of the page
ROW_BORDERS = (
    72 * 7,  # Top row starts 7 inches from the bottom of the page
    72 * 5,  # Middle row starts 5 inches from the bottom of the page
    72 * 3,  # Bottom row starts 3 inches from the bottom of the page
    72 * 1,  # Bottom row ends 1 inch from the bottom of the page
)  # listed top to bottom so that BOXES is ordered properly
COLUMN_BORDERS = (
    72 * 2,  # First col starts 2 inches from the left of the page
    72 * 4,  # Second col starts 4 inches from the left of the page
    72 * 6,  # Third col starts 6 inches from the left of the page
    72 * 8,  # Third col ends 8 inches from the left of the page
)
BOXES = [
    (left, bot, right, top)
    for top, bot in zip(ROW_BORDERS, ROW_BORDERS[1:])
    for left, right in zip(COLUMN_BORDERS, COLUMN_BORDERS[1:])
]
def extract_output(page):
    """
    Reads the text from page and splits it into the 9 cells.
    Returns a list with 9 entries:
    [A, B, C, D, E, F, G, H, I]
    Each item in the list contains a string with all of the
    text found in the cell.
    """
    res = []
    for box in BOXES:
        strings = list(page.letterings.iter_in_bbox(box))
        # We sort from top-to-bottom and then from left-to-right, based
        # on the strings' top left corner
        strings.sort(key=lambda x: (-x.bbox[3], x.bbox[0]))
        res.append(" ".join(strings).replace(u"\xa0", " ").strip())
    return res
content = []
doc = minecart.Document(open("path/to/pdf-doc.pdf", 'rb'))
for page in doc.iter_pages():
    content.append(extract_output(page))
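The order in which the borders are listed decides both the order of the boxes and whether each box ends up with left < right and bottom < top. A small sketch, independent of minecart, that builds the nine boxes in reading order whatever order the borders are given in. It assumes the (left, bottom, right, top) convention used above, with coordinates measured from the bottom-left of the page; grid_boxes is a hypothetical helper name.

```python
def grid_boxes(row_borders, column_borders):
    """Return (left, bottom, right, top) boxes in reading order.

    Borders may be given in any order; rows are emitted top to bottom
    and, within each row, columns left to right.
    """
    rows = sorted(row_borders, reverse=True)  # highest y first = top row first
    cols = sorted(column_borders)             # smallest x first = left column first
    return [
        (left, bottom, right, top)
        for top, bottom in zip(rows, rows[1:])
        for left, right in zip(cols, cols[1:])
    ]


# Same borders as in the answer, in points (1/72 inch).
boxes = grid_boxes(
    [72 * 1, 72 * 3, 72 * 5, 72 * 7],
    [72 * 2, 72 * 4, 72 * 6, 72 * 8],
)
for name, box in zip("ABCDEFGHI", boxes):
    print(name, box)
```

The first box printed is cell A (top-left) and the last is cell I (bottom-right), which matches the layout in the question.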

python csv delimiter doesn't work properly

I am trying to write Python code to extract the DV/DL values from the input. Here is the truncated input:
A V E R A G E S O V E R 50000 S T E P S
NSTEP = 50000 TIME(PS) = 300.000 TEMP(K) = 300.05 PRESS = -70.0
Etot = -89575.9555 EKtot = 23331.1725 EPtot = -112907.1281
BOND = 759.8213 ANGLE = 2120.6039 DIHED = 4231.4019
1-4 NB = 940.8403 1-4 EEL = 12588.1950 VDWAALS = 13690.9435
EELEC = -147238.9339 EHBOND = 0.0000 RESTRAINT = 0.0000
DV/DL = 13.0462
EKCMT = 10212.3016 VIRIAL = 10891.5181 VOLUME = 416404.8626
Density = 0.9411
Ewald error estimate: 0.6036E-04
R M S F L U C T U A T I O N S
NSTEP = 50000 TIME(PS) = 300.000 TEMP(K) = 1.49 PRESS = 129.9
Etot = 727.7890 EKtot = 115.7534 EPtot = 718.8344
BOND = 23.1328 ANGLE = 36.1180 DIHED = 19.9971
1-4 NB = 12.7636 1-4 EEL = 37.3848 VDWAALS = 145.7213
EELEC = 739.4128 EHBOND = 0.0000 RESTRAINT = 0.0000
DV/DL = 3.7510
EKCMT = 76.6138 VIRIAL = 1195.5824 VOLUME = 43181.7604
Density = 0.0891
Ewald error estimate: 0.4462E-04
Basically, the full input contains many DV/DL entries (not shown in the truncated input above) and we only want the last two. So we read all of them into a list, keep the last two, and write those two values into a CSV file. The desired output is
13.0462, 3.7510
However, the following script (Python 2.7) produces the output below. Could anyone explain why? Thanks.
13.0462""3.7510""
Here is the script:
import os
import csv
DVDL=[]
filename="input.out"
file=open(filename,'r')
with open("out.csv",'wb') as outfile: # define output name
    line=file.readlines()
    for a in line:
        if ' DV/DL =' in a:
            DVDL.append(line[line.index(a)].split(' ')[1]) # Extract DVDL number
    print DVDL[-2:] # We only need the last two DVDL
    yeeha="".join(str(a) for a in DVDL[-2:])
    print yeeha
    writer = csv.writer(outfile, delimiter=',',lineterminator='\n')#Output the list into a csv file called "outfile"
    writer.writerows(yeeha)
As the commenter who proposed an approach has not had the chance to outline some code for this, here's how I'd suggest doing it (edited to allow optionally signed floating point numbers with optional exponents, as suggested by an answer to Python regular expression that matches floating point numbers):
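import re
# optionally signed float with an optional exponent, e.g. 13.0462 or -1.5e-3
pat = re.compile(r"DV/DL += +([+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)")
values = []
with open("input.out", "r") as infile:
    for line in infile:
        m = pat.search(line)
        if m:
            values.append(m.group(1))
# write only the last two values, separated by a comma
with open("out.csv", "w") as outfile:
    outfile.write(",".join(values[-2:]))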
Having run this script:
$ cat out.csv
13.0462,3.7510
I haven't used the csv module in this case because it isn't really necessary for a simple output file like this. However, adding the following lines to the script will use csv to write the same data into out1.csv:
import csv
writer = csv.writer(open("out1.csv","w"))
writer.writerow(values[-2:])
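For reference, csv's writerow() writes a single row from a sequence of fields, whereas writerows() expects a sequence of rows, so the two values should go to writerow() as a list. Below is a self-contained Python 3 sketch of the same idea that runs on an inline excerpt instead of input.out; SAMPLE, last_two_dvdl and as_csv_line are illustrative names, and the excerpt uses the spacing shown in the question.

```python
import csv
import io
import re

# Short excerpt of the input from the question.
SAMPLE = """\
A V E R A G E S O V E R 50000 S T E P S
EELEC = -147238.9339 EHBOND = 0.0000 RESTRAINT = 0.0000
DV/DL = 13.0462
R M S F L U C T U A T I O N S
EELEC = 739.4128 EHBOND = 0.0000 RESTRAINT = 0.0000
DV/DL = 3.7510
"""

# Same float pattern as above, with non-capturing inner groups.
PAT = re.compile(r"DV/DL += +([+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)")


def last_two_dvdl(text):
    """Return the last two DV/DL values found in text, as strings."""
    values = [m.group(1) for m in PAT.finditer(text)]
    return values[-2:]


def as_csv_line(fields):
    """Format one CSV row in memory; fields is a list of strings."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(fields)
    return buf.getvalue()


values = last_two_dvdl(SAMPLE)
print(as_csv_line(values), end="")
```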

Break line in Genome Diagram biopython

I’m using Genome Diagram to display genomic information. I would like to separate the feature name and its location with a line break, so I do something like this:
gdFeature.add_feature(
    feat,
    color=red,
    sigil="ARROW",
    name=feat.qualifiers['product'][0].replace(" ", "_") + "\n" +
    str(feat.location.start) + " - " + str(feat.location.end),
    label_position="middle",
    label_angle=0,
    label=True)
gdFeature is an instance of the Feature class (http://biopython.org/DIST/docs/api/Bio.Graphics.GenomeDiagram._Feature.Feature-class.html)
The problem is that when I save my picture in PDF format, I get a black square instead of a line break:
example here
That’s not really what I want. Is there a way to get a real line break?
Thanks
I don't see a direct path. Biopython uses ReportLab to generate the PDF. In GenomeDiagram, the function drawString is called to write the name/label (function is here). That is, when you pass a label like "First\nSecond", ReportLab generates PDF code that resembles the following (numbers are made up):
BT /F1 48 Tf 1 0 0 1 210 400 Tm (First\nSecond)Tj ET
But that is not the way to print two lines in a PDF. You actually have to have two different lines of code, something like:
BT /F1 48 Tf 1 0 0 1 210 400 Tm (First)Tj ET
BT /F1 48 Tf 1 0 0 1 210 400 Tm (Second)Tj ET
And AFAIK, Biopython doesn't have it implemented.
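To illustrate what "two different lines of code" means at the PDF level, here is a rough sketch that splits a label on "\n" and builds one text object per line, moving each line down by a fixed amount. It is not part of Biopython or ReportLab: text_ops is a hypothetical helper, the font name F1 and the coordinates follow the made-up numbers above, and the line spacing of 60 units is an assumption.

```python
def text_ops(label, x, y, font="F1", size=48, leading=60):
    """Build one PDF text object per line of label, stepping y down by leading."""
    ops = []
    for i, line in enumerate(label.split("\n")):
        # Backslashes and parentheses must be escaped inside a PDF string literal.
        escaped = line.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        ops.append(
            "BT /{} {} Tf 1 0 0 1 {} {} Tm ({})Tj ET".format(
                font, size, x, y - i * leading, escaped
            )
        )
    return ops


for op in text_ops("First\nSecond", 210, 400):
    print(op)
```

Each printed line is a separate text object with its own position, so the second line lands below the first instead of on top of it.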
