I decided to look at a website's source code and picked a class, "expanded". I wanted to print out all of its contents with this code:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.quora.com/How-can-I-write-a-bot-using-Python")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.find_all(class_='expanded'))
but it simply prints out:
[]
Please help me figure out what's wrong.
I already saw this thread and tried following what the answer said, but it did not help, since this error appears in the terminal:
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
When searching for a class value, you should pass it in like this:
soup.find_all(attrs={"class":"expanded"})
That being said, I don't see anything in the source code of that site with a class called "expanded". The closest thing I could find was class='ui_qtext_expanded'. If that is what you are trying to find, you need to include the whole string.
soup.find_all(attrs={"class":"ui_qtext_expanded"})
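Putting that together with the code from the question, here's a minimal sketch (ui_qtext_expanded is my guess at the class you are actually after, and note that Quora may serve different markup to scripts than to browsers):
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.quora.com/How-can-I-write-a-bot-using-Python")
soup = BeautifulSoup(page.content, 'html.parser')

# attrs={"class": ...} and the class_ keyword argument are equivalent.
for tag in soup.find_all(attrs={"class": "ui_qtext_expanded"}):
    print(tag.get_text())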
I am a Python programmer. I want to extract all of the table data from the link below with the BeautifulSoup library.
This is the link: https://finance.yahoo.com/quote/GC%3DF/history?p=GC%3DF
You'll want to look into web scraping tutorials.
Here's one to get you started: https://realpython.com/python-web-scraping-practical-introduction/
This kind of thing can get a little complicated with complex markup, and I'd say the link in the question qualifies as slightly complex markup. Basically, you want to find the container div with the classes "Pb(10px) Ovx(a) W(100%)", or the table container with a data-test attribute of "historical-prices", and drill down to the markup data you need from there.
HOWEVER, if you insist on using the BeautifulSoup library, here's a tutorial for that: https://realpython.com/beautiful-soup-web-scraper-python/
Scroll down to step 3: "Parse HTML Code With Beautiful Soup"
Install the library: python -m pip install beautifulsoup4
Then, use the following code to scrape the page:
import requests
from bs4 import BeautifulSoup
URL = "https://finance.yahoo.com/quote/GC%3DF/history?p=GC%3DF"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
Then, find the table container with the data-test attribute of "historical-prices" that I mentioned earlier:
results = soup.find(attrs={"data-test" : "historical-prices"})
Thanks to this other StackOverflow post for this info on the attrs parameter: Extracting an attribute value with beautifulsoup
From there, you'll want to drill down. I'm not entirely sure how to do this step properly, as I've never done this in Python before, but there are multiple ways to go about it. My preferred way would be to use the find method or the findAll method on the initial result:
result_set = results.find("tbody", recursive=False).findAll("tr")
Alternatively, you may be able to use the deprecated findChildren method; note that it returns a list, so the second drill-down has to loop over it:
tbody_list = results.findChildren("tbody", recursive=False)
result_set2 = [tr for tbody in tbody_list for tr in tbody.findChildren("tr", recursive=False)]
You may require a results set loop for each drill-down. The page you mentioned doesn't make things easy, mind you. You'll have to drill down multiple times to find the proper tr elements. Of course, the above code is only example code, not properly tested.
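To make the drill-down concrete, here's a minimal, untested sketch that assumes results is the table element found above and that the body rows use standard tr/td markup:
# Untested sketch: `results` is the table found via the data-test attribute.
for row in results.find("tbody").find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    print(cells)  # one list of column values per row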
Venturing into the world of Python. I've done the Codecademy course and trawled through Stack Overflow and YouTube, but I'm hitting an issue I can't solve.
I'm attempting to do a simple print of a table located on Wikipedia. Failing miserably at writing my own code, I decided to use a tutorial example and build off it. However, this isn't working and I haven't the foggiest idea why.
This is the code, with the appropriate link included. My end result is an empty list "[]". I'm using PyCharm 2017.2, beautifulsoup 4.6.0, requests 2.18.4 and Python 3.6.2. Any advice appreciated. For reference, the tutorial website is here.
import requests
from bs4 import BeautifulSoup
WIKI_URL = "https://en.wikipedia.org/wiki/List_of_volcanoes_by_elevation"
req = requests.get(WIKI_URL)
soup = BeautifulSoup(req.content, 'lxml')
table_classes = {"class": ["sortable", "plainrowheaders"]}
wikitables = soup.findAll("table", table_classes)
print(wikitables)
You can accomplish that using regular expressions.
You get the site content with requests.get(WIKI_URL).content.
Look at the source code of the site to see how Wikipedia presents tables in HTML.
Find a regular expression that can capture a whole table (something like <table>(?P<table>.*?)</table> with re.DOTALL). What this does is grab anything between the <table> and </table> tokens. The Python re module documentation is a good reference; take a look at re.findall().
Now you are left with the table data. You can use regular expressions again to get the data for each row, then run a regex on each row to get the columns. re.findall() is the key again.
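A minimal sketch of that approach (the exact patterns are assumptions; real Wikipedia <table> tags carry attributes, so the patterns allow for them):
import re
import requests

WIKI_URL = "https://en.wikipedia.org/wiki/List_of_volcanoes_by_elevation"
html = requests.get(WIKI_URL).content.decode("utf-8")

# re.DOTALL lets '.' match newlines so multi-line tables are captured;
# the non-greedy *? stops at the first closing tag.
tables = re.findall(r"<table[^>]*>(.*?)</table>", html, re.DOTALL)
for table in tables:
    rows = re.findall(r"<tr[^>]*>(.*?)</tr>", table, re.DOTALL)
    print(len(rows))
Bear in mind that regular expressions trip over nested tables, which is one reason parser-based approaches are usually preferred.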
I'm trying to use BeautifulSoup from bs4/Python 3 to extract CData. However, whenever I search for it using the following, it returns an empty result. Can anyone point out what I'm doing wrong?
from bs4 import BeautifulSoup,CData
txt = '''<foobar>We have
<![CDATA[some data here]]>
and more.
</foobar>'''
soup = BeautifulSoup(txt)
for cd in soup.findAll(text=True):
    if isinstance(cd, CData):
        print('CData contents: %r' % cd)
The problem appears to be that the default parser doesn't parse CDATA properly. If you specify the correct parser, the CDATA shows up:
soup = BeautifulSoup(txt, 'html.parser')
For more information on parsers, see the docs.
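For completeness, here is the snippet from the question with that one-line fix applied:
from bs4 import BeautifulSoup, CData

txt = '''<foobar>We have
<![CDATA[some data here]]>
and more.
</foobar>'''

# html.parser keeps the CDATA section as a CData node instead of dropping it.
soup = BeautifulSoup(txt, 'html.parser')
for cd in soup.findAll(text=True):
    if isinstance(cd, CData):
        print('CData contents: %r' % cd)  # CData contents: 'some data here'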
I got onto this by using the diagnose function, which the docs recommend:
If you have questions about Beautiful Soup, or run into problems, send mail to the discussion group. If your problem involves parsing an HTML document, be sure to mention what the diagnose() function says about that document.
Using the diagnose() function gives you output of how the different parsers see your html, which enables you to choose the right parser for your use case.
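For example, running it on the snippet from the question looks like this (the output lists bs4/Python version info, then the tree each installed parser builds):
from bs4.diagnose import diagnose

txt = '''<foobar>We have
<![CDATA[some data here]]>
and more.
</foobar>'''

# Prints how html.parser, lxml, html5lib, etc. each parse the input,
# so you can see which one preserves the CDATA section.
diagnose(txt)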
I am new to Python and I am trying to pull XML files from a website and load them into a database. I have been using the Beautiful Soup module but I cannot pull in the specific XML file that I want.
In the website source code it looks as follows:
<a href="/MarketReports/ReportName.XML">ReportName.XML</a>
The following shows the code I have in Python. This brings back everything with an 'href' attribute, whereas I want to filter the files on the 'Report I want name dddddddd'. I have tried using regular expressions such as 'href=\s\w+', but to no avail, as it returns None. Any help is appreciated.
from bs4 import BeautifulSoup
import urllib
import re
webpage=("http://www.example.com")
response=urllib.urlopen(webpage).read()
soup=BeautifulSoup(response)
for link in soup.find_all('a'):
    print(link.get('href'))
When I use findall('href') in Python, it pulls back the entire string, but I want to filter out just the XML part. I have tried variations of the code such as findall('href\MarketReports') and findall('href\w+'), but this returns "None" when I run the code.
I'm not entirely clear exactly what you're looking for, but if I understand correctly, you only want to get ReportName.XML, in which case it would be:
find('a').text
If you're looking for "/MarketReports/ReportName.XML", then it would be:
find('a').attrs['href']
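If you need to filter the links rather than grab the first one, find_all() also accepts a compiled regular expression for an attribute value. A small sketch, using a made-up href based on the question:
import re
from bs4 import BeautifulSoup

html = '<a href="/MarketReports/ReportName.XML">ReportName.XML</a>'
soup = BeautifulSoup(html, "html.parser")

# Keep only anchors whose href matches the report path pattern.
for link in soup.find_all("a", href=re.compile(r"MarketReports/.*\.XML")):
    print(link.attrs['href'])  # /MarketReports/ReportName.XML
    print(link.text)           # ReportName.XML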
I used the following code and it was able to find the reports as I needed them. The Google presentation was a great help, along with jdotjdot's input:
http://www.youtube.com/watch?v=kWyoYtvJpe4
The code that I used to find my XML was
import re
import urllib
webpage=("http://www.example.com")
response=urllib.urlopen(webpage).read()
print(re.findall(r"Report I want\w+[.]XML", response))
I'm using Python and BeautifulSoup to parse HTML pages. Unfortunately, for some pages (> 400K) BeautifulSoup truncates the HTML content.
I use the following code to get the set of "div"s:
findSet = SoupStrainer('div')
set = BeautifulSoup(htmlSource, parseOnlyThese=findSet)
for it in set:
    print(it)
At a certain point, the output looks like:
correct string, correct string, incomplete/truncated string ("So, I")
although the htmlSource contains the string "So, I am bored", and many others. Also, I would like to mention that when I prettify() the tree, I see the HTML source truncated.
Do you have an idea how can I fix this issue?
Thanks!
Try using lxml.html. It is a faster, better HTML parser, and it deals with broken HTML better than the latest BeautifulSoup. It works fine for your example page, parsing the entire page.
import lxml.html
doc = lxml.html.parse('http://voinici.ceata.org/~sana/test.html')
print(len(doc.findall('//div')))
The code above returns 131 divs.
I found a solution to this problem using BeautifulSoup at beautifulsoup-where-are-you-putting-my-html, which I think is easier than lxml.
The only thing you need to do is to install:
pip install html5lib
and add it as a parameter to BeautifulSoup:
soup = BeautifulSoup(html, 'html5lib')
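In full, a minimal version of the test (reusing the example page URL from the other answer):
import requests
from bs4 import BeautifulSoup

# html5lib builds the tree the way a browser would, which avoids the
# truncation seen with lenient parsers on large or broken pages.
html = requests.get('http://voinici.ceata.org/~sana/test.html').content
soup = BeautifulSoup(html, 'html5lib')
print(len(soup.find_all('div')))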