HTML parsing gives wrong result - Python

I am using BeautifulSoup in Python to get some data from a table on a website, but the resulting soup object looks wrong. My code looks like this:
import requests
from bs4 import BeautifulSoup

url = 'http://www.the-numbers.com/movie/budgets/all'
source_code = requests.get(url)
text = source_code.text
soup = BeautifulSoup(text, "lxml")
When I looked at the tags in soup, the result looked wrong, and I think I found the part that causes the problem. The original source code of that part looks like this:
<tr><td class="data">81</td>
<td>5/7/2010</td>
<td><b>Iron Man 2</td>
<td class="data">$170,000,000</td>
<td class="data">$312,128,345</td>
<td class="data">$623,256,345</td>
<tr>
But when I print that part from soup, it becomes:
<tr><td class="data">81</td>
<td>5/7/2010</td>
<td><b><a href="/movie">/ I r o n - M a n - 2 # t a b = s u m m a r y "
> I r o n M a n 2 / a > / t d >
t d c l a s s = " d a t a " > $ 1 7 0 , 0 0 0 , 0 0 0 / t d >
t d c l a s s = " d a t a " > $ 3 1 2 , 1 2 8 , 3 4 5 / t d >
t d c l a s s = " d a t a " > $ 6 2 3 , 2 5 6 , 3 4 5 / t d >
t r >
It looks like a quotation mark was added, which caused BeautifulSoup to stop recognizing any tags after that point.
How can I fix it? I tried Python's built-in html.parser and lxml; they both gave the same result.

After trying out a bunch of things, this is what I found. I tried cutting out the problematic part of the HTML, but the problem then showed up at the same location in the output, so I thought it might be a length issue. I don't know whether BeautifulSoup imposes some kind of limit on the length of the HTML it can parse.
I found this similar question: BeautifulSoup, where are you putting my HTML?
Installing and using the 'html5lib' parser worked:
soup = BeautifulSoup(text, "html5lib")
The new result shows all of the remaining tags in soup.
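For reference, here is a minimal sketch of why html5lib helps (an illustrative hand-made fragment, not the live page; it assumes the beautifulsoup4 and html5lib packages are installed). html5lib follows the browser's HTML5 error-recovery rules, so an unclosed <b> inside a table cell, like the one on the page above, does not derail the rest of the parse:

```python
from bs4 import BeautifulSoup

# Hand-made fragment with the same flaw as the site: the <b> is never closed.
broken = ('<table><tr><td class="data">81</td>'
          '<td><b>Iron Man 2</td>'
          '<td class="data">$170,000,000</td></tr></table>')

soup = BeautifulSoup(broken, "html5lib")
cells = [td.get_text() for td in soup.find_all("td")]
print(cells)  # all three cells survive despite the unclosed <b>
```

The truncation from the question only reproduces against the full page, so this fragment just illustrates the recovery behavior, not the original failure.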

How to loop one by one when executing full script

Below is part of my Python automation code. Inside the kcauto() function there is a for loop. How can I make the script take the first value from the list, run kcauto(), and then on the next pass of the outer loop take the second value from the list, and so on?
My code:
nnlist = [
    '3789',
    '4567'
]

def kcauto():
    ano = '031191'
    print(ano)
    code = '12'
    print(code)
    date = '06-Feb-2022'
    print(date)
    url2 = 'https://www.myweb&nn='
    for nn in nnlist:
        print(url2 + nn)

for tn in nnlist:
    kcauto()
    print(tn)
My output:
031191
12
06-Feb-2022
https://www.myweb&nn=3789
https://www.myweb&nn=4567
3789
031191
12
06-Feb-2022
https://www.myweb&nn=3789
https://www.myweb&nn=4567
4567
But the required output is:
031191
12
06-Feb-2022
https://www.myweb&nn=3789
3789
031191
12
06-Feb-2022
https://www.myweb&nn=4567
4567
You have two loops going on here: one outside the kcauto() function and one inside it. You must remove one of them in order to fix the doubled URL printout.
Something like this might work:
nnlist = ['3789', '4567']

def kcauto(items):
    for nn in items:
        ano = '031191'
        print(ano)
        code = '12'
        print(code)
        date = '06-Feb-2022'
        print(date)
        url2 = 'https://www.myweb&nn='
        print(url2 + nn)
        print(nn + "\n")

kcauto(nnlist)
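An alternative design (a sketch using the poster's placeholder constants and stub URL) is to keep the outer loop and pass a single list item into the function per call, so each invocation prints exactly one URL; returning the lines as well makes the function easy to check:

```python
nnlist = ['3789', '4567']

def kcauto(nn):
    # Placeholder constants and stub URL taken from the question.
    lines = ['031191', '12', '06-Feb-2022', 'https://www.myweb&nn=' + nn, nn]
    for line in lines:
        print(line)
    return lines  # also returned, so the output is easy to test

for nn in nnlist:
    kcauto(nn)
```

This prints exactly the "required output" from the question: one header block and one URL per list item.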

Python: how to read a LaTeX-generated PDF with equations

Consider the following article:
https://arxiv.org/pdf/2101.05907.pdf
It's a typically formatted academic paper with only two pictures in the PDF file. The following code was used to extract the text and equations from the paper:
# Related code explanation: https://stackoverflow.com/questions/45470964/python-extracting-text-from-webpage-pdf
import io
import requests

url = 'https://arxiv.org/pdf/2101.05907.pdf'
r = requests.get(url)
f = io.BytesIO(r.content)

# Related code explanation: https://stackoverflow.com/questions/45795089/how-can-i-read-pdf-in-python
import PyPDF2
fileReader = PyPDF2.PdfFileReader(f)

# Related code explanation: https://automatetheboringstuff.com/chapter13/
print(fileReader.getPage(0).extractText())
However, the result was not quite correct:
Bohmpotentialforthetimedependentharmonicoscillator
FranciscoSoto-Eguibar 1 ,FelipeA.Asenjo 2 ,SergioA.Hojman 3 andH´ectorM.Moya-Cessa 1
1 InstitutoNacionaldeAstrof´ ´ OpticayElectr´onica,CalleLuisEnriqueErroNo.1,SantaMar´Tonanzintla,Puebla,72840,Mexico.
2 FacultaddeIngenier´yCiencias,UniversidadAdolfoIb´aŸnez,Santiago7491169,Chile.
3 DepartamentodeCiencias,FacultaddeArtesLiberales,UniversidadAdolfoIb´aŸnez,Santiago7491169,Chile. DepartamentodeF´FacultaddeCiencias,UniversidaddeChile,Santiago7800003,Chile. CentrodeRecursosEducativosAvanzados,CREA,Santiago7500018,Chile.
Abstract. IntheMadelung-Bohmapproachtoquantummechanics,weconsidera(timedependent)phasethatdependsquadraticallyonpositionandshowthatitleadstoaBohmpotentialthatcorrespondstoatimedependentharmonicoscillator,providedthetimedependentterminthephaseobeysanErmakovequation.
Introduction
Harmonicoscillatorsarethebuildingblocksinseveralbranchesofphysics,fromclassicalmechanicstoquantummechanicalsystems.Inparticular,forquantummechanicalsystems,wavefunctionshavebeenreconstructedasisthecaseforquantizedincavities[1]andforion-laserinteractions[2].Extensionsfromsingleharmonicoscillatorstotimedependentharmonicoscillatorsmaybefoundinshortcutstoadiabaticity[3],quantizedpropagatingindielectricmedia[4],Casimireect[5]andion-laserinteractions[6],wherethetimedependenceisnecessaryinordertotraptheion.
Timedependentharmonicoscillatorshavebeenextensivelystudiedandseveralinvariantshavebeenobtained[7,8,9,10,11].Alsoalgebraicmethodstoobtaintheevolutionoperatorhavebeenshown[12].Theyhavebeensolvedundervariousscenariossuchastimedependentmass[12,13,14],timedependentfrequency[15,11]andapplicationsofinvariantmethodshavebeenstudiedindierentregimes[16].Suchinvariantsmaybeusedtocontrolquantumnoise[17]andtostudythepropagationoflightinwaveguidearrays[18,19].Harmonicoscillatorsmaybeusedinmoregeneralsystemssuchaswaveguidearrays[20,21,22].
Inthiscontribution,weuseanoperatorapproachtosolvetheone-dimensionalSchr¨odingerequationintheBohm-Madelungformalismofquantummechanics.ThisformalismhasbeenusedtosolvetheSchr¨odingerequationfordierentsystemsbytakingtheadvantageoftheirnon-vanishingBohmpotentials[23,24,25,26].Alongthiswork,weshowthatatimedependentharmonicoscillatormaybeobtainedbychoosingapositiondependentquadratictimedependentphaseandaGaussianamplitudeforthewavefunction.Wesolvetheprobabilityequationbyusingoperatortechniques.AsanexamplewegivearationalfunctionoftimeforthetimedependentfrequencyandshowthattheBohmpotentialhasdierentbehaviorforthatfunctionalitybecauseanauxiliaryfunctionneededinthescheme,namelythefunctionsthatsolvestheErmakovequation,presentstwodierentsolutions.
One-dimensionalMadelung-Bohmapproach
ThemainequationinquantummechanicsistheSchrodingerequation,thatinonedimensionandforapotential V ( x ; t ) iswrittenas(forsimplicity,weset } = 1) i # ( x ; t ) # t = 1 2 m # 2 ( x ; t ) # x 2 + V ( x ; t ) ( x ; t ) (1)
arXiv:2101.05907v1 [quant-ph] 14 Jan 2021
As shown:
The spacing, such as in the title, disappeared, leaving meaningless run-together strings.
The LaTeX equations came out wrong, and it gets worse on the second page.
How can I fix this and extract text and equations correctly from a PDF file that was generated from LaTeX?

Extract a string between two others in Python

I am trying to extract the comments from an fdf file (a PDF comment file). In practice, this amounts to extracting a string between two others. I did the following:
I open the fdf file with these commands:
import re
import os

os.chdir("currentworkingdirectory")
archcom = open("comentarios.fdf", "r")
cadena = archcom.read()
From the opened file I get a string called cadena with all the info I need. For example:
cadena = "\n215 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<</W 3.0>>\nendobj\n219 0 obj\n<</W 3.0>>\nendobj\ntrailer\n<</Root 1 0 R>>\n%%EOF\n"
I try to extract the needed info with the following line:
a = re.findall(r"nendobj(.*?)W 3\.0",cadena)
Trying to get:
a = "n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<<"
But I got:
a = []
The problem is in the line a = re.findall(r"nendobj(.*?)W 3\.0",cadena), but I don't see where. I have tried many combinations with no success.
I appreciate any comments.
It seems to me that there are two problems:
a) You are looking for nendobj, but the n is actually part of the line break \n. Consequently you will not get a leading n in the output either, because there is no n.
b) Since the text you are looking for crosses some newlines, you need the re.DOTALL flag.
Final code:
a = re.findall(r"endobj(.*?)W 3\.0", cadena, re.DOTALL)
Also note that there will be a second result, as confirmed on Regex101.
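Both points can be checked in a self-contained way (a sketch using a shortened version of the cadena string from the question):

```python
import re

# Shortened sample in the same shape as the question's cadena string.
cadena = ("\n215 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj"
          "\n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj"
          "\n218 0 obj\n<</W 3.0>>\nendobj"
          "\n219 0 obj\n<</W 3.0>>\nendobj\n")

# Without re.DOTALL, '.' does not match '\n', so nothing is found.
assert re.findall(r"endobj(.*?)W 3\.0", cadena) == []

# With re.DOTALL the lazy '.*?' spans the newlines; findall yields one
# match per following 'W 3.0', hence two results here.
matches = re.findall(r"endobj(.*?)W 3\.0", cadena, re.DOTALL)
print(len(matches))  # 2
```

Searching for "endobj" (without the n) also removes the stray leading n that the question's expected output shows.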

Read lines from GitHub user content file

I'm trying to read a text file (a proxy list) from GitHub user content. The code should return a random line, but it doesn't work as expected.
My code:
res = reqs.get('https://raw.githubusercontent.com/clarketm/proxy-list/master/proxy-list-raw.txt', headers={'User-Agent':'Mozilla/5.0'})
proxies = []
for lines in res.text:
proxies = ''.join(lines)
print proxies
return proxies
Here is what I get:
.
2
1
:
8
0
8
0
[one character per line, continuing like this for the rest of the string]
Here is what is expected:
178.217.106.245:8080
186.192.98.250:8080
If a random line could be returned, that would be even better.
The result is a string, and iterating over a string iterates over its letters, not its lines.
You'll have to split the string on newlines and iterate over that:
for lines in res.text.split('\n'):
    ...
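Putting this together with the random-line requirement (a sketch: the sample text stands in for res.text, and splitlines() plus random.choice are standard library):

```python
import random

# Stand-in for res.text as fetched from the raw GitHub URL.
text = "178.217.106.245:8080\n186.192.98.250:8080\n192.162.62.197:59246\n"

proxies = text.splitlines()     # splits on newlines, ignores the trailing one
print(proxies)

proxy = random.choice(proxies)  # one random proxy line
print(proxy)
```

splitlines() is slightly more robust than split('\n') here because it also handles '\r\n' endings and does not leave an empty final element.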

HTML parsing using Beautiful Soup gives structure different from website

When I view this link https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm the text is displayed clearly. However, when I try to parse the page using Beautiful Soup, the output doesn't look the same: it is all messed up. Here is the code:
import urllib.request
from bs4 import BeautifulSoup
request = urllib.request.Request('https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm')
htm = urllib.request.urlopen(request).read()
soup = BeautifulSoup(htm,'html.parser')
text = soup.get_text()
print(text)
The desired output would look like this:
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Traders in Financial Futures - Futures Only Positions as of June 16, 2015
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Dealer : Asset Manager/ : Leveraged : Other : Nonreportable :
Intermediary : Institutional : Funds : Reportables : Positions :
Long : Short : Spreading: Long : Short : Spreading: Long : Short : Spreading: Long : Short : Spreading: Long : Short :
-----------------------------------------------------------------------------------------------------------------------------------------------------------
DOW JONES UBS EXCESS RETURN - CHICAGO BOARD OF TRADE ($100 X INDEX)
CFTC Code #221602 Open Interest is 19,721
Positions
97 2,934 0 8,941 1,574 973 6,490 11,975 1,694 1,372 539 0 154 32
Changes from: June 9, 2015 Total Change is: 3,505
48 0 0 2,013 1,141 70 447 1,369 923 -64 0 0 68 2
Percent of Open Interest Represented by Each Category of Trader
0.5 14.9 0.0 45.3 8.0 4.9 32.9 60.7 8.6 7.0 2.7 0.0 0.8 0.2
Number of Traders in Each Category Total Traders: 31
. . 0 5 . . 6 9 . 5 . 0
-----------------------------------------------------------------------------------------------------------------------------------------------------------
After viewing the page source, it is not clear to me how a new line is distinguished in this style, which is where I think the problem comes from.
Is there some type of structure I need to specify in the BeautifulSoup call? I'm very lost here, so any help is much appreciated.
For what it's worth, I tried installing the html2text module and had no luck on Anaconda using !conda config --append channels conda-forge and !conda install html2text.
EDIT: I've figured it out:
request = urllib.request.Request('https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm')
htm = urllib.request.urlopen(request).read()
htm = htm.decode('windows-1252')
htm = htm.replace('\n', '').replace('\r', '')
htm = htm.split('</pre><pre>')

cleaned = []
for i in htm:
    i = BeautifulSoup(i, 'html.parser').get_text()
    cleaned.append(i)

with open('trouble.txt', 'w') as f:
    for line in cleaned:
        f.write('%s\n' % line)
