Python broken links after BeautifulSoup - python

I'm new to Python. I just installed it on Windows and am trying out HTML scraping.
Here's my test code:
from bs4 import BeautifulSoup
html = 'text <a href="Transfert.php?Filename=myfile_x86&version=5&param=13">Download</a> text'
print(html)
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))
This code returns the link, but it comes out broken:
Transfert.php?Filename=myfile_x86&version=5¶m=13
How can I fix it?

You are feeding the parser invalid HTML. The correct way to include & in a URL inside an HTML attribute is to escape it as &amp;.
Simply change & to &amp;:
html = 'text <a href="Transfert.php?Filename=myfile_x86&amp;version=5&amp;param=13">Download</a> text'
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))
Output:
Transfert.php?Filename=myfile_x86&version=5&param=13
The reason it works with html5lib and lxml is that some parsers handle broken HTML better than others. As Goyo mentioned in the comments, you can't prevent other people from writing broken HTML :)
This is a great answer to your question that explains it in detail: https://stackoverflow.com/a/26073147/4796844.
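To see the two halves of this in isolation, here is a minimal sketch that uses only the standard library's html module, not BeautifulSoup itself. It takes the URL from the question: html.escape produces the attribute-safe form, and html.unescape shows how a lenient decoder reads the unescaped &para as the ¶ entity. Whether a given parser decodes attribute values this way depends on the parser and its version, so treat this as an illustration rather than a description of html.parser internals.

```python
import html

# URL from the question, with raw ampersands.
url = "Transfert.php?Filename=myfile_x86&version=5&param=13"

# Escape it before placing it inside an HTML attribute: & becomes &amp;
escaped = html.escape(url, quote=True)
anchor = '<a href="{}">Download</a>'.format(escaped)

# A lenient decoder treats "&para" as the pilcrow entity even without ";"
misread = html.unescape(url)

# Decoding the properly escaped form gives the original URL back
round_trip = html.unescape(escaped)

print(anchor)
print(misread)
print(round_trip)
```

If the HTML is generated by your own code, escaping every attribute value this way avoids the problem regardless of which parser reads it later.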

Related

Python Webscraping: Problems parsing chinese characters with beautiful soup/requests

I am scraping a Chinese website. Usually there is no problem parsing the Chinese characters, which I use to find specific URLs with the pattern function within bs4.
However, for this particular Chinese website the soup cannot be parsed properly.
Below is the code I use to set up the soup:
start = f'http://www.shuichan.cc/news_list.asp?action=&c_id=93&s_id=210&page={1}'
r = requests.get(start)
soup = bs(r.content, "html.parser")
An example of the printed soup is the following:
[image: current soup]
Note: I had to add a picture because Stack Overflow thought the text was spam :)
The above should have looked like the following:
[image: proper soup]
I wonder if I have to specify some kind of encoding in the request, or perhaps something in the soup, but so far I have not found anything that works.
Thanks in advance!
I don't know Chinese. Does this give the desired results?
import requests
from bs4 import BeautifulSoup as bs
start = f'http://www.shuichan.cc/news_list.asp?action=&c_id=93&s_id=210&page={1}'
r = requests.get(start)
soup = bs(r.content.decode('GBK', 'ignore'), "html.parser")
print(soup)
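As a small illustration of why the explicit decode matters, the sketch below uses a made-up byte string rather than the real page: the same GBK bytes read with the wrong codec come out as mojibake, while decoding them as GBK recovers the text. The sample string and the variable names are assumptions for the example.

```python
# Hypothetical sample: Chinese text as a GBK-encoded server might send it.
original = "水产新闻"
raw = original.encode("gbk")

# Decoding with the wrong codec yields unreadable characters.
garbled = raw.decode("latin-1")

# Decoding as GBK recovers the text; "ignore" drops any undecodable bytes.
decoded = raw.decode("gbk", "ignore")

print(repr(garbled))
print(decoded)
```

With requests, setting r.encoding = 'gbk' before reading r.text is another way to get the same effect.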

How to sift through specific items from a webpage using conditional statement

I've made a scraper in Python. It runs smoothly. Now I would like to accept or discard specific links from that page, for example only links containing "mobiles", but even with a conditional statement I can't get it to work. I hope someone can help me correct my mistake.
import requests
from bs4 import BeautifulSoup
def SpecificItem():
    url = 'https://www.flipkart.com/'
    Process = requests.get(url)
    soup = BeautifulSoup(Process.text, "lxml")
    for link in soup.findAll('div',class_='')[0].findAll('a'):
        if "mobiles" not in link:
            print(link.get('href'))
SpecificItem()
On the other hand, if I do the same thing using the lxml library with XPath, it works.
import requests
from lxml import html
def SpecificItem():
    url = 'https://www.flipkart.com/'
    Process = requests.get(url)
    tree = html.fromstring(Process.text)
    links = tree.xpath('//div[@class=""]//a/@href')
    for link in links:
        if "mobiles" not in link:
            print(link)
SpecificItem()
So at this point I think the code needs to be somewhat different with the BeautifulSoup library to serve the same purpose.
The root of your problem is that your if condition works differently between BeautifulSoup and lxml. With lxml, the XPath returns href strings, so the in test is a substring check. With BeautifulSoup, link is a Tag object, so if "mobiles" not in link: does not check whether "mobiles" is in the href field; for a Tag, in tests membership among the tag's child nodes (link.contents). Explicitly using the href field does the trick:
import requests
from bs4 import BeautifulSoup
def SpecificItem():
    url = 'https://www.flipkart.com/'
    Process = requests.get(url)
    soup = BeautifulSoup(Process.text, "lxml")
    for link in soup.findAll('div',class_='')[0].findAll('a'):
        # Compare against the href string; skip <a> tags that have no href
        href = link.get('href')
        if href and "mobiles" not in href:
            print(href)
SpecificItem()
That prints out a bunch of links and none of them include "mobiles".
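The same idea, filtering on the href string rather than on the tag object, can be sketched without third-party libraries. The example below uses the standard library's html.parser on a made-up snippet; HrefCollector and SAMPLE are hypothetical names for illustration, not part of BeautifulSoup or of the page in question.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


# Made-up markup: the second link has "mobiles" in its text but not its href.
SAMPLE = (
    '<div><a href="/mobiles/phones">Phones</a>'
    '<a href="/laptops">mobiles</a>'
    '<a name="top">no href</a></div>'
)

collector = HrefCollector()
collector.feed(SAMPLE)
collector.close()

# Substring tests on plain strings behave the same way as in the lxml version.
without_mobiles = [h for h in collector.hrefs if "mobiles" not in h]
with_mobiles = [h for h in collector.hrefs if "mobiles" in h]

print(without_mobiles)
print(with_mobiles)
```

Because the comparison is made on strings, the link text plays no part in the result, and anchors without an href are skipped instead of raising an error.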

BeautifulSoup scraping: I'm confused

I'm trying to scrape this site, and I want to check all of the anchor tags.
I have imported beautifulsoup 4.3.2 and here is my code:
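from urllib.request import urlopen  # Python 3; on Python 2 urlopen lives in urllib2
from bs4 import BeautifulSoup
url = """http://www.civicinfo.bc.ca/bids?pn=1"""
Html = urlopen(url).read()
Soup = BeautifulSoup(Html, 'html.parser')
Content = Soup.find_all('a')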
My problem is that Content is always empty (i.e. Content = []). Does anyone have any ideas?
According to the documentation, html.parser is not very lenient before certain versions of Python (2.7.3 and 3.2.2), so you're likely looking at some malformed HTML.
What you want to do works if you use lxml instead of html.parser.
From the documentation:
That said, there are things you can do to speed up Beautiful Soup. If
you’re not using lxml as the underlying parser, my advice is to start.
Beautiful Soup parses documents significantly faster using lxml than
using html.parser or html5lib.
So the relevant code would be:
Soup = BeautifulSoup(Html, 'lxml')
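If you are not sure which parsers are installed, one possible approach is to check before choosing. The helper below is a hypothetical sketch (pick_parser is not a Beautiful Soup function). It assumes the parser names "lxml", "html5lib" and "html.parser" that Beautiful Soup accepts, and that the first two correspond to importable modules of the same name.

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first preferred parser that is installed, else the built-in one."""
    for name in preferred:
        # find_spec returns None when the module cannot be imported
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


print(pick_parser())
```

It could then be used as Soup = BeautifulSoup(Html, pick_parser()), which keeps the script working on machines where lxml is missing.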

Beautiful Soup doesn't 'get' full webpage

I am using BeautifulSoup to parse a bunch of links from this page but it wasn't extracting all the links I wanted it to. To try and figure out why, I downloaded the html to "web_page.html" and ran
soup = BeautifulSoup(open("web_page.html"))
print(soup.get_text())
I noticed that it doesn't print the whole web page. It ends at Brackley. I looked at the HTML to see if something weird was happening at 'Brackley' but couldn't find anything. Also, if I move another link to Brackley's place, it prints that one and not Brackley. It seems like it will only read an HTML file up to a certain size?
I'm not sure how you fetched the page and the links. Here is what I did, and I got all the links, starting from "Canada" and ending with "Taloyoak, HAM":
from bs4 import BeautifulSoup
import requests
url = 'http://www12.statcan.gc.ca/census-recensement/2006/dp-pd/tbt/Geo-index-eng.cfm?TABID=5&LANG=E&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=0&GID=0&GK=0&GRP=1&PID=99015&PRID=0&PTYPE=88971,97154&S=0&SHOWALL=0&SUB=0&Temporal=2006&THEME=70&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0'
response = requests.get(url)
soup = BeautifulSoup(response.content)
print([a.text for a in soup.select('div.span-8 ol li a')])
Prints:
[
u'Canada',
u'Newfoundland and Labrador / Terre-Neuve-et-Labrador',
...
u'Gjoa Haven, HAM',
u'Taloyoak, HAM'
]
FYI, div.span-8 ol li a is a CSS Selector.
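Try using different parsers. You are not specifying one, so Beautiful Soup picks the best parser it finds installed, which is the built-in html.parser when neither lxml nor html5lib is available. Try installing and naming lxml or html5lib explicitly.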
For more info: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

How to use CSS selectors to retrieve specific links lying in some class using BeautifulSoup?

I am new to Python and am learning it for scraping. I am using BeautifulSoup to collect links (i.e. the href of 'a' tags). I am trying to collect the links under the "UPCOMING EVENTS" tab of the site http://allevents.in/lahore/. I used Firebug to inspect the element and get the CSS path, but this code returns nothing. I am looking for a fix, and also for suggestions on how to choose proper CSS selectors to retrieve the desired links from any site. I wrote this piece of code:
from bs4 import BeautifulSoup
import requests
url = "http://allevents.in/lahore/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
for link in soup.select( 'html body div.non-overlay.gray-trans-back div.container div.row div.span8 div#eh-1748056798.events-horizontal div.eh-container.row ul.eh-slider li.h-item div.h-meta div.title a[href]'):
    print(link.get('href'))
The page is not the most friendly in the use of classes and markup, but even so your CSS selector is too specific to be useful here.
If you want Upcoming Events, take just the first <div class="events-horizontal">, then grab the links inside its <div class="title"><a href="..."></a></div> elements, that is, the links on the titles:
upcoming_events_div = soup.select_one('div.events-horizontal')
for link in upcoming_events_div.select('div.title a[href]'):
    print(link['href'])
Note that you should not use r.text; use r.content and leave decoding to Unicode to BeautifulSoup. See Encoding issue of a character in utf-8
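The approach of scoping to one container first and then selecting inside it can be tried on a small sample without fetching the page. The sketch below is an illustration only: it uses the standard library's xml.etree.ElementTree with its limited XPath support instead of BeautifulSoup's CSS selectors, and the markup in SAMPLE is made up. ElementTree only accepts well-formed markup and matches the class attribute as an exact string, so it is not a substitute for a real HTML parser.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the page: two event blocks.
SAMPLE = """
<body>
  <div class="events-horizontal">
    <div class="title"><a href="/event/1">One</a></div>
    <div class="title"><a href="/event/2">Two</a></div>
  </div>
  <div class="events-horizontal">
    <div class="title"><a href="/event/3">Three</a></div>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)

# Step 1: narrow down to the first matching container only.
first_block = root.find(".//div[@class='events-horizontal']")

# Step 2: a short, relative query inside that container.
links = [
    a.get("href")
    for a in first_block.findall(".//div[@class='title']/a[@href]")
]

print(links)
```

Two short queries like these tend to survive small layout changes better than one long path copied from a browser inspector, which breaks as soon as any intermediate element or generated id changes.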
import bs4, requests
res = requests.get("http://allevents.in/lahore/")
soup = bs4.BeautifulSoup(res.text)
for link in soup.select('a[property="schema:url"]'):
    print(link.get('href'))
This code will work fine!!
