Urllib2 returning HTML with newlines and tabs - python

I want to scrape the HTML from some website and then send it off to BeautifulSoup for parsing. The problem is that the HTML returned by urllib2.urlopen() contains newlines (\n) and tabs (\t) as well as having single quotes and other characters escaped. When I try to build a BeautifulSoup object with this HTML, I get an error.
b = BeautifulSoup(src)
gives this error.
My code:
def get_page_source(url):
    """
    Retrieves the HTML source code for url.
    """
    try:
        return urllib2.urlopen(url)
    except:
        return ""
def retrieve_links(url):
    """
    Use the BeautifulSoup module to efficiently grab all links from the source
    code retrieved by get_page_source.
    """
    src = get_page_source(url)
    b = BeautifulSoup(src)
    .
    .
    .
How can I solve this problem?
EDIT
import urllib2
link = "http://www.techcrunch.com/"
src = urllib2.urlopen(link).read()
f = open('out.txt', 'w')
f.write(src)
f.close()
gives this output.

The problem is that the HTML you are parsing contains embedded JavaScript code (the BeautifulSoup error complains about line 130, which is in the middle of embedded JavaScript), and the JavaScript contains embedded HTML.
Line 130, notice the <a> tag:
adNode += "<a href='http://t.aol.com?ncid=...
It's a Matryoshka doll of HTML and JavaScript, and Python's built-in parser can't handle it.
You can follow the instructions for installing a parser, given by BeautifulSoup itself in the error message you posted:
Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help.
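As a rough sketch of what "use Beautiful Soup with that parser" can look like in code: the parser name is passed as the second argument to BeautifulSoup. The helper name make_soup, the order in which parsers are tried, and the sample markup are my own assumptions for illustration, not part of the question.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Build a soup with the first parser in `parsers` that is installed.

    make_soup and the parser order are illustrative choices, not a bs4 API.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            continue
    raise RuntimeError("no usable HTML parser found")


# Made-up sample: JavaScript that itself contains an <a> tag, plus a real link.
html = """<script>adNode += "<a href='http://example.invalid/ad'>ad</a>";</script>
<p><a href="/real">real link</a></p>"""

soup = make_soup(html)
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
```

Naming the parser explicitly also keeps the result the same from one machine to the next, since otherwise Beautiful Soup picks whichever parser happens to be installed.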

Related

scraping inside script tag with beautifulsoup

I'm scraping data from an e-commerce site and I need the model number of each laptop. The div tags contain no model numbers, but I found the model number inside a script tag as "productCode". For this example it's:
"productCode":"MGND3TU/A"
How can I extract the "productCode" value? I couldn't work it out from other posts.
Edit: I found 'productCode' inside a script tag, but I don't know how to get it. You can check the page source.
Since the JSON is hidden in the <head>, it can be parsed, but with some custom logic.
Unfortunately the script tag assigns the JSON to a window variable, so we'll need to remove that before we can parse it.
Get url
Get all <script>
Check if PRODUCT_DETAIL_APP_INITIAL_STAT exists in the string (valid JSON)
Remove the prefix (hardcoded)
Find the index of the next key (hardcoded)
Remove after the suffix
Try to parse to json
Print json['product']['productCode'] if it exists
import json
import requests
from bs4 import BeautifulSoup
reqs = requests.get("https://www.trendyol.com/apple/macbook-air-13-m1-8gb-256gb-ssd-altin-p-67940132")
soup = BeautifulSoup(reqs.text, 'html.parser')
for sc in soup.findAll('script'):
    # Only the script that assigns the initial product state is of interest
    if len(sc.contents) > 0 and "PRODUCT_DETAIL_APP_INITIAL_STAT" in sc.contents[0]:
        # Drop the hardcoded 44-character prefix (the window variable assignment)
        withoutBegin = sc.contents[0][44:]
        # Cut just before the next window assignment (hardcoded key)
        endIndex = withoutBegin.find('window.TYPageName=') - 1
        withoutEnd = withoutBegin[:endIndex]
        try:
            j = json.loads(withoutEnd)
            if j['product']['productCode']:
                print(j['product']['productCode'])
        except Exception as e:
            print("Unable to parse JSON")
            continue
Output:
MGND3TU/A
In this case BeautifulSoup is not needed, because the response can be searched directly with a regex:
json.loads(re.search(r"window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=({.*}});window", r).group(1))
Example
import requests, re, json
r = requests.get('https://www.trendyol.com/apple/macbook-air-13-m1-8gb-256gb-ssd-altin-p-67940132').text
json_data = json.loads(re.search(r"window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=({.*}});window", r).group(1))
json_data['product']['productCode']
Output
MGND3TU/A
That's because those tags are generated using JavaScript. When you send a request to that URL, the response you get back contains the information (technically JSON) that a JS script uses to build the DOM for you.
To see what the returned response actually is, either print the value of r.text (r is what requests.get() returns) or use "view page source" in the browser (not the "inspect element" panel).
To solve it, you can either use something that can render JS, just like your browser, for example Selenium. The requests module cannot render JS; it only sends requests and receives responses.
Or manually extract that JSON text from the returned text (using a regex or similar), then create a Python dictionary from it.
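One way to do the manual extraction without hardcoded offsets or a greedy regex is to find the assignment and let the json module decode exactly one value from that position. This is only a sketch: the helper name extract_state and the sample string are made up for illustration, and the marker is taken from the regex shown earlier, so the real page may differ.

```python
import json

# Marker copied from the regex above; assumed to precede the JSON on the page.
MARKER = "window.__PRODUCT_DETAIL_APP_INITIAL_STATE__="


def extract_state(page_text, marker=MARKER):
    """Return the JSON value assigned right after `marker`, or None if absent."""
    start = page_text.find(marker)
    if start == -1:
        return None
    pos = start + len(marker)
    # raw_decode does not skip leading whitespace, so skip it by hand.
    while pos < len(page_text) and page_text[pos].isspace():
        pos += 1
    # raw_decode parses one JSON value and ignores whatever follows it.
    obj, _end = json.JSONDecoder().raw_decode(page_text, pos)
    return obj


# Made-up stand-in for the text that requests.get(...).text would return.
sample = (
    '<script>window.__PRODUCT_DETAIL_APP_INITIAL_STATE__='
    '{"product":{"productCode":"MGND3TU/A"}};'
    'window.TYPageName="product_detail";</script>'
)

state = extract_state(sample)
print(state["product"]["productCode"])  # prints MGND3TU/A
```

Because decoding stops at the end of the first JSON value, nothing after the closing brace has to be trimmed first.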

BeautifulSoup strips incomplete </tr> tags from html files

I'm trying to remove all script tags from HTML files using BeautifulSoup. The problem is that in some cases the HTML files have no opening tags for table rows (there are only </tr> tags at the end of the row), and BeautifulSoup seems to remove them because they are unmatched. This messes up the formatting of the table. Is there any other way to remove these script tags without messing up the formatting?
import os
from pathlib import Path
from bs4 import BeautifulSoup
root_dir = os.path.join(Path().absolute(),'newfolder\\')
for path in Path(root_dir).iterdir():
    if path.is_file():
        htmlfile = open(path, encoding="utf-8").read()
        soup = BeautifulSoup(htmlfile)
        to_be_removed = soup.find_all("script")
        for x in to_be_removed:
            x.extract()
        html = soup.prettify("utf-8")
        with open(path,"wb") as file:
            file.write(html)
This happens because of the parser BeautifulSoup uses to read your HTML document, at soup = BeautifulSoup(htmlfile).
When no parser is named, BeautifulSoup picks the best one installed, which is lxml if it is available, and the parser makes assumptions such as that your HTML is valid. When that is not the case, I would suggest looking into a more lenient parser; the documentation is a great source of information for this.
To get specific, you could try a different parser such as html.parser (or html5lib, which the documentation describes as the most lenient), and finally try writing the output with formatter=None, which leaves strings unmodified on output:
html = soup.prettify(formatter=None)
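If every parser still repairs the markup, another option is to avoid parsing altogether and cut the script blocks out of the raw text, which leaves the unmatched </tr> tags exactly where they were. This is a different technique from the one above, not something BeautifulSoup does, and only a sketch: the pattern assumes each script element ends at the first </script> that follows it, and the sample string is made up.

```python
import re

# Matches an opening <script ...> tag, its body, and the closing tag.
# DOTALL lets the body span several lines; IGNORECASE covers <SCRIPT>.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts(html_text):
    """Remove script elements and leave every other character untouched."""
    return SCRIPT_RE.sub("", html_text)


# Made-up sample with rows that only have closing </tr> tags.
sample = "<table><td>a</td></tr><script>var x = '<b>';</script><td>b</td></tr></table>"
print(strip_scripts(sample))
# prints <table><td>a</td></tr><td>b</td></tr></table>
```

In the loop from the question this would replace the soup steps: read the file as text, pass it through strip_scripts, and write the returned string back.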

Can not find element with requests/BeautifulSoup

I'm writing a web scraper with requests and BeautifulSoup, and there's an element in the DOM I can't find.
Here's what I do:
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.decitre.fr/rechercher/result/?q=victor+hugo&search-scope=3')
soup = BeautifulSoup(r.text)
The element I can't find is the "old-price" (the one that is struck through), which I can see when I inspect the DOM with a browser dev tool.
soup.find_all(class_='old-price') # returns [], no matter if I specify "span"
Moreover I can't see the 'old-price' string in the soup or the result of the request:
'old-price' in soup.text # False
'old-price' in r.text # False
I can't see it when I get the source with wget either.
I can get its div parent, but can't find price children inside it:
commands = soup.find_all(class_='product_commande')
commands[0].find_all('old-price') # []
So I have no idea what's going on. What am I missing?
Am I misusing requests/BeautifulSoup? (I'm not sure whether r.text returns the full HTML.)
Is that part of the HTML generated by JavaScript? If so, how can I tell, and is there a way to get the complete HTML?
Many thanks.
In my case I was passing invalid HTML into Beautiful Soup which was causing it to ignore everything after the invalid tag at the start of the document:
<!--?xml version="1.0" encoding="iso-8859-1"?-->
Note that I am also using Ghost.py. Here is how I removed the tag.
#remove invalid xml tag
ghostContent = ghost.content
invalidCode = '<!--?xml version="1.0" encoding="iso-8859-1"?-->'
if ghostContent.startswith(invalidCode):
    ghostContent = ghostContent[len(invalidCode):]
doc = BeautifulSoup(ghostContent)
#test to see if we can find text
if 'Application Search Results' in doc.text:
    print 'YES!'

get all links from html even with show more link

I am using python and beautifulsoup for html parsing.
I am using the following code :
from BeautifulSoup import BeautifulSoup
import urllib2
import re
url = "http://www.wikipathways.org//index.php?query=signal+transduction+pathway&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1&ids=&codes=&type=query"
main_url = urllib2.urlopen(url)
content = main_url.read()
soup = BeautifulSoup(content)
for a in soup.findAll('a',href=True):
    print a[href]
but I am not getting output links like:
http://www.wikipathways.org/index.php/Pathway:WP26
Also, importantly, there are 107 pathways, but I won't get all the links because the other links depend on the "show links" control at the bottom of the page.
So, how can I get all the links (107 links) from that URL?
Your problem is line 8, content = url.read(). You're not actually reading the webpage, you're actually just doing nothing (If anything, you should be getting an error).
main_url is what you want to read, so change line 8 to:
content = main_url.read()
You also have another error, print a[href]. href should be a string, so it should be:
print a['href']
I would suggest using lxml; it's faster and better for parsing HTML, and worth investing the time to learn.
from lxml.html import parse
dom = parse('http://www.wikipathways.org//index.php?query=signal+transduction+pathway&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1&ids=&codes=&type=query').getroot()
links = dom.cssselect('a')
That should get you going.
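If installing lxml is not an option, the same idea of collecting the href of every <a> tag can be sketched with the standard library's html.parser. This is a swapped-in technique, not lxml or BeautifulSoup; the class name LinkCollector and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; keep non-empty hrefs only.
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


# Made-up sample; in practice this would be the downloaded page text.
sample = (
    '<ul><li><a href="/index.php/Pathway:WP26">WP26</a></li>'
    '<li><a name="top">no href here</a></li></ul>'
)

collector = LinkCollector()
collector.feed(sample)
print(collector.links)  # prints ['/index.php/Pathway:WP26']
```

Anchors without an href are skipped, which mirrors the href=True filter in the question's findAll call.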

python url regexp

I have a regexp and I want to append its output to my URL,
for example:
url = 'blabla.com'
r = re.findall(r'<p>(.*?</a>))
r output - /any_string/on/any/server/
but I don't know how to make a GET request with the regexp output:
blabla.com/any_string/on/any/server/
Don't use regex to parse html. Use a real parser.
I suggest using the lxml.html parser. lxml supports xpath, which is a very powerful way of querying structured documents. There's a ready-to-use make_links_absolute() method that does what you ask. It's also very fast.
As an example, in this question's page HTML source code (the one you're reading now) there's this part:
<li><a id="nav-tags" href="/tags">Tags</a></li>
The xpath expression //a[@id='nav-tags']/@href means: "Get me the href attribute of all <a> tags with id attribute equal to nav-tags". Let's use it:
from lxml import html
url = 'http://stackoverflow.com/questions/3423822/python-url-regexp'
doc = html.parse(url).getroot()
doc.make_links_absolute()
links = doc.xpath("//a[@id='nav-tags']/@href")
print links
The result:
['http://stackoverflow.com/tags']
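For the original goal of turning an extracted path into a URL you can request, the standard library's urljoin does the same kind of resolution on a single string. A minimal sketch, assuming the base URL carries a scheme (the question's 'blabla.com' has none, so "http://" is added here) and using a made-up list in place of the extracted matches:

```python
from urllib.parse import urljoin

# Placeholder host from the question; the scheme is an assumption.
base_url = "http://blabla.com"
# Stand-in for the paths extracted from the page.
matches = ["/any_string/on/any/server/"]

# urljoin resolves each path against the base, as a browser would.
full_urls = [urljoin(base_url, path) for path in matches]
print(full_urls)  # prints ['http://blabla.com/any_string/on/any/server/']
```

Each resulting string can then be passed to whatever you use to send the GET request.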
Just get beautiful soup:
http://www.crummy.com/software/BeautifulSoup/documentation.html#Parsing+a+Document
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
soup.findAll('p')
