HTML parsing using BeautifulSoup gives structure different to website - Python

When I view this link https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm the text is displayed in a clear way. However, when I try to parse the page using BeautifulSoup, the output doesn't look the same - it is all messed up. Here is the code:
import urllib.request
from bs4 import BeautifulSoup
request = urllib.request.Request('https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm')
htm = urllib.request.urlopen(request).read()
soup = BeautifulSoup(htm,'html.parser')
text = soup.get_text()
print(text)
The desired output would look like this:
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Traders in Financial Futures - Futures Only Positions as of June 16, 2015
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Dealer : Asset Manager/ : Leveraged : Other : Nonreportable :
Intermediary : Institutional : Funds : Reportables : Positions :
Long : Short : Spreading: Long : Short : Spreading: Long : Short : Spreading: Long : Short : Spreading: Long : Short :
-----------------------------------------------------------------------------------------------------------------------------------------------------------
DOW JONES UBS EXCESS RETURN - CHICAGO BOARD OF TRADE ($100 X INDEX)
CFTC Code #221602 Open Interest is 19,721
Positions
97 2,934 0 8,941 1,574 973 6,490 11,975 1,694 1,372 539 0 154 32
Changes from: June 9, 2015 Total Change is: 3,505
48 0 0 2,013 1,141 70 447 1,369 923 -64 0 0 68 2
Percent of Open Interest Represented by Each Category of Trader
0.5 14.9 0.0 45.3 8.0 4.9 32.9 60.7 8.6 7.0 2.7 0.0 0.8 0.2
Number of Traders in Each Category Total Traders: 31
. . 0 5 . . 6 9 . 5 . 0
-----------------------------------------------------------------------------------------------------------------------------------------------------------
After viewing the page source it is not clear to me how a new line is distinguished in the style - which is where I think the problem comes from.
Is there some type of structure I need to specify in the BeautifulSoup function? I'm very lost here, so any help is much appreciated.
FWIW I have tried installing the html2text module and had no luck on Anaconda using !conda config --append channels conda-forge and !conda install html2text
Cheers
EDIT: I've figured it out - it was a simple oversight on my part.
request = urllib.request.Request('https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm')
htm = urllib.request.urlopen(request).read()
# the page is encoded as windows-1252, not utf-8
htm = htm.decode('windows-1252')
htm = htm.replace('\n', '').replace('\r', '')
# each section of the report sits in its own <pre> block
htm = htm.split('</pre><pre>')
cleaned = []
for i in htm:
    i = BeautifulSoup(i, 'html.parser').get_text()
    cleaned.append(i)
with open('trouble.txt', 'w') as f:
    for line in cleaned:
        f.write('%s\n' % line)
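A slightly simpler variant of the same fix - a sketch, assuming the report keeps its <pre> blocks - lets BeautifulSoup collect the <pre> elements instead of splitting the raw markup by hand; the helper name pre_blocks is my own:

```python
from bs4 import BeautifulSoup

def pre_blocks(html):
    """Return the text of each <pre> block, preserving its fixed-width layout."""
    soup = BeautifulSoup(html, 'html.parser')
    return [pre.get_text() for pre in soup.find_all('pre')]

# 'html' would be the windows-1252-decoded page source from above
sample = '<pre>line one\nline two</pre><pre>next block</pre>'
print(pre_blocks(sample))
```

Because get_text() is called per block, the internal whitespace of each fixed-width table survives intact.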

Related

Python how to read a latex generated pdf with equations

Consider the following article:
https://arxiv.org/pdf/2101.05907.pdf
It's a typically formatted academic paper with only two pictures in the PDF file.
The following code was used to extract the text and equations from the paper:
#Related code explanation: https://stackoverflow.com/questions/45470964/python-extracting-text-from-webpage-pdf
import io
import requests
r = requests.get(url)
f = io.BytesIO(r.content)
#Related code explanation: https://stackoverflow.com/questions/45795089/how-can-i-read-pdf-in-python
import PyPDF2
fileReader = PyPDF2.PdfFileReader(f)
#Related code explanation: https://automatetheboringstuff.com/chapter13/
print(fileReader.getPage(0).extractText())
However, the result was not quite correct
Bohmpotentialforthetimedependentharmonicoscillator
FranciscoSoto-Eguibar
1
,FelipeA.Asenjo
2
,SergioA.Hojman
3
andH
´
ectorM.
Moya-Cessa
1
(...)
Abstract.
IntheMadelung-Bohmapproachtoquantummechanics,weconsidera(timedependent)phasethatdependsquadrati-
callyonpositionandshowthatitleadstoaBohmpotentialthatcorrespondstoatimedependentharmonicoscillator,providedthe
timedependentterminthephaseobeysanErmakovequation.
(...)
ThemainequationinquantummechanicsistheSchrodingerequation,thatinonedimensionandforapotential
V
(
x
;
t
)
iswrittenas(forsimplicity,weset
}
=
1)
i
#
(
x
;
t
)
(...)
(1)
arXiv:2101.05907v1 [quant-ph] 14 Jan 2021
As shown:
The spacing disappeared (in the title, for example), leaving meaningless strings.
The LaTeX equations came out wrong, and it got worse on the second page.
How can I fix this and extract the text and equations correctly from a PDF file that was generated from LaTeX?
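One way forward - a sketch, assuming pdfminer.six and requests are installed; the helper name space_ratio is my own - is to use an extractor that performs layout analysis, since PyPDF2's old extractText() discards positioning and therefore word spacing. A quick heuristic can also flag pages whose spacing was lost:

```python
import io

def space_ratio(text):
    """Fraction of characters that are spaces; near zero suggests lost word spacing."""
    return text.count(' ') / max(len(text), 1)

if __name__ == '__main__':
    # pdfminer.six's layout analysis usually recovers the spacing that
    # PyPDF2's extractText() loses (this part needs network access)
    import requests
    from pdfminer.high_level import extract_text
    r = requests.get('https://arxiv.org/pdf/2101.05907.pdf')
    page0 = extract_text(io.BytesIO(r.content), page_numbers=[0])
    print(page0[:200])
    print('space ratio:', space_ratio(page0))
```

Equations set in LaTeX will still come out as a linearized glyph stream with any extractor; only the prose spacing is reliably recoverable.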

HTML parsing gives wrong result

I am using BeautifulSoup in Python to get some data from a table on a website. The soup object looks wrong. My code looks like this:
import requests
from bs4 import BeautifulSoup

url = r'http://www.the-numbers.com/movie/budgets/all'
source_code = requests.get(url)
text = source_code.text
soup = BeautifulSoup(text, "lxml")
When I look at the tags in soup, the result looks wrong. I think I found the part that causes the problem. The original source code of that part looks like this:
<tr><td class="data">81</td>
<td>5/7/2010</td>
<td><b>Iron Man 2</td>
<td class="data">$170,000,000</td>
<td class="data">$312,128,345</td>
<td class="data">$623,256,345</td>
<tr>
But printing out that part in soup it becomes:
<tr><td class="data">81</td>
<td>5/7/2010</td>
<td><b><a href="/movie">/ I r o n - M a n - 2 # t a b = s u m m a r y "
> I r o n M a n 2 / a > / t d >
t d c l a s s = " d a t a " > $ 1 7 0 , 0 0 0 , 0 0 0 / t d >
t d c l a s s = " d a t a " > $ 3 1 2 , 1 2 8 , 3 4 5 / t d >
t d c l a s s = " d a t a " > $ 6 2 3 , 2 5 6 , 3 4 5 / t d >
t r >
Looks like there is an added quotation mark, which caused BeautifulSoup to stop recognizing any tags after that.
How can I fix it? I tried Python's html.parser and lxml; they gave the same result.
After trying out a bunch of things, this is what I found. I tried cutting out the problematic part of the HTML, but the result showed the problem at the same location, so I thought it might be a length issue. I don't know if there is some kind of limit to how much HTML BeautifulSoup can parse.
I found this similar question: BeautifulSoup, where are you putting my HTML?
Installing and using 'html5lib' parser worked.
soup = BeautifulSoup(text, "html5lib")
The new result shows all the rest of the tags in soup.
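One of html5lib's advantages is browser-grade error recovery. A small sketch (the markup is adapted from the snippet above; it assumes html5lib is installed) shows it repairing the unclosed <b> tag the way a browser would:

```python
from bs4 import BeautifulSoup

# note the unclosed <b> in the third cell, as in the site's source
broken = ('<table><tr><td class="data">81</td><td>5/7/2010</td>'
          '<td><b>Iron Man 2</td><td class="data">$170,000,000</td></tr></table>')

soup = BeautifulSoup(broken, "html5lib")
cells = [td.get_text() for td in soup.find_all("td")]
print(cells)
```

The stricter parsers may bail out or mis-nest on such input, whereas html5lib follows the HTML5 error-recovery rules, so every cell remains reachable.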

No Output When Running BeautifulSoup Python Code

I was recently trying out the following Python code using BeautifulSoup from this question, which seems to have worked for the question-asker.
import urllib2
import bs4
import string
from bs4 import BeautifulSoup

badwords = set([
    'cup', 'cups',
    'clove', 'cloves',
    'tsp', 'teaspoon', 'teaspoons',
    'tbsp', 'tablespoon', 'tablespoons',
    'minced'
])

def cleanIngred(s):
    # remove leading and trailing whitespace
    s = s.strip()
    # remove numbers and punctuation at the ends of the string
    s = s.strip(string.digits + string.punctuation)
    # remove unwanted words
    return ' '.join(word for word in s.split() if not word in badwords)

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [cleanIngred(s.getText()) for s in ingreds.findAll('li')]
    fname = 'PorkRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__ == "__main__":
    main()
I can't get it to work in my case though for some reason. I receive the error:
AttributeError Traceback (most recent call last)
<ipython-input-4-55411b0c5016> in <module>()
41
42 if __name__=="__main__":
---> 43 main()
<ipython-input-4-55411b0c5016> in main()
31 url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
32 data = urllib2.urlopen(url).read()
---> 33 bs = BeautifulSoup.BeautifulSoup(data)
34
35 ingreds = bs.find('div', {'class': 'ingredients'})
AttributeError: type object 'BeautifulSoup' has no attribute 'BeautifulSoup'
I suspect this is because I'm using bs4 and not the old BeautifulSoup module. I tried replacing the line bs = BeautifulSoup.BeautifulSoup(data) with bs = bs4.BeautifulSoup(data) and no longer receive an error, but I get no output. Are there too many possible causes for this to guess?
The original code used BeautifulSoup version 3:
import BeautifulSoup
You switched to BeautifulSoup version 4, but also switched the style of the import:
from bs4 import BeautifulSoup
Either remove that line; you already have the correct import earlier in your file:
import bs4
and then use:
bs = bs4.BeautifulSoup(data)
or change that latter line to:
bs = BeautifulSoup(data)
(and remove the import bs4 line).
You may also want to review the Porting code to BS4 section of the BeautifulSoup documentation, so you can make any other necessary changes upgrading the code you found to get the best out of BeautifulSoup version 4.
The script otherwise works just fine, but it writes to a new file, PorkRecipe.txt; it doesn't produce output on stdout.
The contents of the file after fixing the bs4.BeautifulSoup reference:
READY IN 4+ hrs
Slow Cooker Pork Chops II
Amazing Pork Tenderloin in the Slow Cooker
Jerre's Black Bean and Pork Slow Cooker Chili
Slow Cooker Pulled Pork
Slow Cooker Sauerkraut Pork Loin
Slow Cooker Texas Pulled Pork
Oven-Fried Pork Chops
Pork Chops for the Slow Cooker
Tangy Slow Cooker Pork Roast
Types of Cooking Oil
Garlic: Fresh Vs. Powdered
All about Paprika
Types of Salt
olive oil
chicken broth
garlic,
paprika
garlic powder
poultry seasoning
dried oregano
dried basil
thick cut boneless pork chops
salt and pepper to taste
PREP 10 mins
COOK 4 hrs
READY IN 4 hrs 10 mins
In a large bowl, whisk together the olive oil, chicken broth, garlic, paprika, garlic powder, poultry seasoning, oregano, and basil. Pour into the slow cooker. Cut small slits in each pork chop with the tip of a knife, and season lightly with salt and pepper. Place pork chops into the slow cooker, cover, and cook on High for 4 hours. Baste periodically with the sauce
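For reference, a sketch of the same script ported to Python 3 (the allrecipes URL and the 'ingredients' class come from the original question and may have changed since):

```python
import string
import urllib.request
from bs4 import BeautifulSoup

badwords = set([
    'cup', 'cups',
    'clove', 'cloves',
    'tsp', 'teaspoon', 'teaspoons',
    'tbsp', 'tablespoon', 'tablespoons',
    'minced'
])

def cleanIngred(s):
    # strip whitespace, then digits/punctuation at the ends, then unwanted words
    s = s.strip()
    s = s.strip(string.digits + string.punctuation)
    return ' '.join(word for word in s.split() if word not in badwords)

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib.request.urlopen(url).read()
    bs = BeautifulSoup(data, 'html.parser')
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [cleanIngred(li.get_text()) for li in ingreds.find_all('li')]
    with open('PorkRecipe.txt', 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__ == "__main__":
    main()
```

Note the BS4-style method names (get_text, find_all) in place of the BS3 getText/findAll aliases, per the porting guide mentioned above.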

Break line in Genome Diagram biopython

I’m using GenomeDiagram to display genomic information. I would like to separate the feature name and its location with a line break, so I do something like this:
gdFeature.add_feature(
    feat,
    color=red,
    sigil="ARROW",
    name=feat.qualifiers['product'][0].replace(" ", "_") + "\n" +
         str(feat.location.start) + " - " + str(feat.location.end),
    label_position="middle",
    label_angle=0,
    label=True)
gdFeature is an instance of Feature class (http://biopython.org/DIST/docs/api/Bio.Graphics.GenomeDiagram._Feature.Feature-class.html )
The problem is that when I save my picture in PDF format, I get a black square instead of a line break:
example here
That's not really what I want. Is there a way to do this?
Thanks
I don't see a direct path. Biopython uses ReportLab to generate the PDF. In GenomeDiagram the function drawString is called to write the name/label (function is here). I.e. when you pass a label like "First Line\nSecond Line", ReportLab generates PDF code that resembles (numbers are made up):
BT /F1 48 Tf 1 0 0 1 210 400 Tm (First\nSecond)Tj ET
But that is not the way to print two lines in a PDF. You actually need two separate lines of code, something like:
BT /F1 48 Tf 1 0 0 1 210 400 Tm (First)Tj ET
BT /F1 48 Tf 1 0 0 1 210 400 Tm (Second)Tj ET
And AFAIK, Biopython doesn't have that implemented.

Importing data in SPSS syntax incl. 'value labels' and 'var labels'

I am trying to set up a standard workflow to efficiently import data from the Dutch National Bureau of Statistics (http://statline.cbs.nl), published in SPSS syntax, into R and/or Python, so I can do analyses, load it into our database, etc.
The good news is that they have standardized a lot of different output formats, amongst others an .sps syntax file. In essence, this is a fixed-width data file with extra information contained in the header and the footer. The file looks as shown below. I prefer this format over plain .csv because it contains more data and should make it easier to import large amounts of data in a consistent manner.
The bad news is that I can't find a working library in Python and/or R that can deal with .sps SPSS syntax files. Most libraries work with the binary .sav or .por formats.
I am not looking for a full working SPSS clone, but for something that will parse the data correctly using the metadata in the keywords 'DATA LIST' (the length of each column), 'VAR LABELS' (the column headers) and 'VALUE LABELS' (extra data that should be joined/replaced during the import).
I'm sure a Python/R library could be written to parse and process all this info efficiently, but I am not fluent/experienced enough in either language to do it myself.
Any suggestions or hints would be helpful.
SET DECIMAL = DOT.
TITLE "Gezondheidsmonitor; regio, 2012, bevolking van 19 jaar of ouder".
DATA LIST RECORDS = 1
/1 Key0 1 - 5 (A)
Key1 7 - 7 (A)
Key2 9 - 14 (A)
Key3 16 - 23 (A)
Key4 25 - 28 (A)
Key5 30 - 33 (A)
Key6 35 - 38 (A)
Key7 40 - 43 (A).
BEGIN DATA
80200 1 GM1680 2012JJ00 . . . .
80200 1 GM0738 2012JJ00 13.2 . . 21.2
80200 1 GM0358 2012JJ00 . . . .
80200 1 GM0197 2012JJ00 13.7 . . 10.8
80200 1 GM0059 2012JJ00 12.4 . . 16.5
80200 1 GM0482 2012JJ00 13.3 . . 14.1
80200 1 GM0613 2012JJ00 11.6 . . 16.2
80200 1 GM0361 2012JJ00 17.0 9.6 17.1 14.9
80200 1 GM0141 2012JJ00 . . . .
80200 1 GM0034 2012JJ00 14.3 18.7 22.5 18.3
80200 1 GM0484 2012JJ00 9.7 . . 15.5
(...)
80200 3 GM0642 2012JJ00 15.6 . . 19.6
80200 3 GM0193 2012JJ00 . . . .
END DATA.
VAR LABELS
Key0 "Leeftijd"/
Key1 "Cijfersoort"/
Key2 "Regio's"/
Key3 "Perioden"/
Key4 "Mantelzorger"/
Key5 "Zwaar belaste mantelzorgers"/
Key6 "Uren mantelzorg per week"/
Key7 "Ernstig overgewicht".
VALUE LABELS
Key0 "80200" "65 jaar of ouder"/
Key1 "1" "Percentages"
"2" "Ondergrens"
"3" "Bovengrens"/
Key2 "GM1680" "Aa en Hunze"
"GM0738" "Aalburg"
"GM0358" "Aalsmeer"
"GM0197" "Aalten"
(...)
"GM1896" "Zwartewaterland"
"GM0642" "Zwijndrecht"
"GM0193" "Zwolle"/
Key3 "2012JJ00" "2012".
LIST /CASES TO 10.
SAVE /OUTFILE "Gezondheidsmonitor__regio,_2012,_bevolking_van_19_jaar_of_ouder.SAV".
Some sample code to get you started - sorry, not the best Python programmer here, so any improvements are welcome.
Still to be added: a method to load the labels and create a list of dicts for the VALUE LABELS.
spss_keys = []
data = []
num_records = None
begin_data_step = False
end_data_step = False

with open('Bevolking_per_maand__100214211711.sps', 'r') as f:
    for l in f:
        # TITLE line: the title is the text between the double quotes
        if l.find('TITLE') != -1:
            start_pos = l.find('"') + 1
            end_pos = l.find('"', start_pos + 1)
            title = l[start_pos:end_pos]
            print("title:", title)
        # DATA LIST RECORDS = n
        if l.find('DATA LIST') != -1:
            start_pos = l.find('=') + 1
            num_records = l[start_pos:].strip()
            print("number of records =", num_records)
        # column definitions: name, start position, end position, type
        if (num_records == '1' and l.find('Key') != -1
                and not begin_data_step and not end_data_step):
            spss_keys.append([l[15:22].strip(), int(l[23:29].strip()),
                              int(l[32:36].strip()), l[37:].strip()])
        if l.find('END DATA.') != -1:
            end_data_step = True
        # slice each data line using the fixed-width positions from DATA LIST
        if begin_data_step and not end_data_step:
            values = []
            for key in spss_keys:
                values.append(l[key[1] - 1:key[2]])
            data.append(values)
        if l.find('BEGIN DATA') != -1:
            begin_data_step = True

# data now holds one list of raw column values per record; more to follow
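As an alternative sketch (assuming pandas is available): once the DATA LIST positions are known, pandas.read_fwf can slice the block between BEGIN DATA and END DATA directly. The colspecs below are the 1-based "start - end" pairs from the file converted to 0-based half-open intervals; the two-record block is taken from the sample above:

```python
import io
import pandas as pd

# column positions from the DATA LIST section (Key0 1-5, Key1 7-7, ...)
colspecs = [(0, 5), (6, 7), (8, 14), (15, 23), (24, 28),
            (29, 33), (34, 38), (39, 43)]
names = ['Key0', 'Key1', 'Key2', 'Key3', 'Key4', 'Key5', 'Key6', 'Key7']

# two sample records from between BEGIN DATA and END DATA
block = ('80200 1 GM1680 2012JJ00    .    .    .    .\n'
         '80200 1 GM0738 2012JJ00 13.2    .    . 21.2\n')

# SPSS uses '.' for missing values, so map it to NaN on the way in
df = pd.read_fwf(io.StringIO(block), colspecs=colspecs,
                 names=names, na_values='.')
print(df)
```

From there the VAR LABELS and VALUE LABELS sections can be applied with a rename and a per-column map/replace.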
From my point of view I would not bother with the SPSS file option, but choose the HTML version and scrape it. The tables look nicely formatted with classes, which should make scraping/parsing the HTML much easier.
Another question to answer: are you going to download the files manually, or would you like to do that automatically as well?
