Suitable JavaScript parser to be used with urlopen - Python

I am trying the following:
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
url = 'http://search.wcad.org/Property-Detail?PropertyQuickRefID=R000017&PartyQuickRefID=O0532572'
soup = BeautifulSoup(urlopen(url).read())
print soup
The print statement displays a very complicated text structure, and it is difficult to extract values from it. What is a better way to extract fields like "Legal Description"?

You don't need to parse JavaScript to get the "Legal Description" value; you need to parse HTML, and BeautifulSoup's HTML parser can do the job. Locate the td element containing the 'Legal Description' text and then get the next td sibling:
soup.find("td", text="Legal Description").find_next_sibling("td").get_text()
Note: you are using BeautifulSoup version 3, which is outdated and no longer maintained. Switch to version 4:
pip install beautifulsoup4
And change your import from:
from BeautifulSoup import BeautifulSoup
to:
from bs4 import BeautifulSoup
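Putting the pieces together, a minimal sketch of the whole lookup (that the label cell is followed by a value cell is an assumption about the page's table layout at the time of the question):
from urllib.request import urlopen  # urllib2.urlopen on Python 2
from bs4 import BeautifulSoup
url = ('http://search.wcad.org/Property-Detail'
       '?PropertyQuickRefID=R000017&PartyQuickRefID=O0532572')
soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
# locate the label cell, then read the value in the next cell
label = soup.find('td', text='Legal Description')
if label is not None:
    print(label.find_next_sibling('td').get_text(strip=True))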

Though you can do this with urllib2, I would recommend using requests.
The id is unique for each field, so you can get the text directly by finding the element using id.
import requests
from bs4 import BeautifulSoup
url = "http://search.wcad.org/Property-Detail?PropertyQuickRefID=R000017&PartyQuickRefID=O0532572"
html = requests.get(url)
soup = BeautifulSoup(html.text, "lxml")
text = soup.find("td", id="dnn_ctr1460_View_tdGILegalDescription").get_text()
print(text)
NOTE: I've used BeautifulSoup version 4. To install it, use this command - pip install beautifulsoup4.

Related

How to select <a> tags and scrape the href value?

I am having trouble getting the hyperlinks for the tennis matches listed on a webpage. How do I fix the code below so that it prints them?
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.betexplorer.com/results/tennis/?year=2022&month=11&day=02")
webpage = response.content
soup = BeautifulSoup(webpage, "html.parser")
print(soup.findAll('a href'))
In newer code, avoid the old syntax findAll(); use find_all() or select() with CSS selectors instead. For more, take a minute to check the docs.
Select your elements more specifically and use a set comprehension to avoid duplicates:
set('https://www.betexplorer.com'+a.get('href') for a in soup.select('a[href^="/tennis"]:has(strong)'))
Example
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.betexplorer.com/results/tennis/?year=2022&month=11&day=02')
soup = BeautifulSoup(r.text, 'html.parser')
set('https://www.betexplorer.com'+a.get('href') for a in soup.select('a[href^="/tennis"]:has(strong)'))
Output
{'https://www.betexplorer.com/tennis/itf-men-singles/m15-new-delhi-2/sinha-nitin-kumar-vardhan-vishnu/tOasQaJm/',
'https://www.betexplorer.com/tennis/itf-women-doubles/w25-jerusalem/mushika-mao-mushika-mio-cohen-sapir-nagornaia-sofiia/xbNOHTEH/',
'https://www.betexplorer.com/tennis/itf-men-singles/m25-jakarta-2/barki-nathan-anthony-sun-fajing/zy2r8bp0/',
'https://www.betexplorer.com/tennis/itf-women-singles/w15-solarino/margherita-marcon-abbagnato-anastasia/lpq2YX4d/',
'https://www.betexplorer.com/tennis/itf-women-singles/w60-sydney/lee-ya-hsuan-namigata-junri/CEQrNPIG/',
'https://www.betexplorer.com/tennis/itf-men-doubles/m15-sharm-elsheikh-16/echeverria-john-marrero-curbelo-ivan-ianin-nikita-jasper-lai/nsGbyqiT/',...}
Change the last line to
print([a['href'] for a in soup.findAll('a')])
See a full tutorial here: https://pythonprogramminglanguage.com/get-links-from-webpage/

Python HTML scraping cannot find attribute I know exists?

I am using the lxml and requests modules, and I am just trying to parse the article from a website. I tried using find_all from BeautifulSoup but still came up empty.
from lxml import html
import requests
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
tree = html.fromstring(page.content)
article = tree.xpath('//div[@class="article"]/text()')
Once I print article, I get a list of ['\n','\n','\n','\n','\n'], rather than the body of the article. Where exactly am I going wrong?
I would use bs4 and the class name in a CSS selector with select_one:
import requests
from bs4 import BeautifulSoup as bs
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
soup = bs(page.content, 'lxml')
print(soup.select_one('.article').text)
If you use
article = tree.xpath('//div[@class="article"]//text()')
you get a list that still contains all the \n entries, but also the text, which you can clean up with re.sub or conditional logic.
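For example, a hedged sketch of that cleanup, joining the text nodes and collapsing whitespace with re.sub (how aggressively you normalise is up to you):
import re
import requests
from lxml import html
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
tree = html.fromstring(page.content)
parts = tree.xpath('//div[@class="article"]//text()')
# join the fragments and collapse runs of whitespace/newlines
article = re.sub(r'\s+', ' ', ' '.join(parts)).strip()
print(article)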

How to get a fragment of the code from an external HTML website with Python

I want to get a fragment from an HTML website with Python.
For example, from the URL http://steven-universe.wikia.com/wiki/Steven_Universe_Wiki I want to get the text in the "Next Episode" box as a string. How can I get it?
First of all, install the latest version of BeautifulSoup (beautifulsoup4) and the requests library.
from bs4 import BeautifulSoup
import requests
url = "http://steven-universe.wikia.com/wiki/Steven_Universe_Wiki"
con = requests.get(url).content
soup = BeautifulSoup(con, "html.parser")
# find the "next episode" link by its href and read its text
text = soup.find("a", href="/wiki/Gem_Harvest").text
print(text)
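If the href changes from episode to episode, a more robust (but still hedged) approach is to locate the box by its visible label; the 'Next Episode' text and the surrounding markup are assumptions about the wiki page, so check the page source and adjust:
import requests
from bs4 import BeautifulSoup
url = 'http://steven-universe.wikia.com/wiki/Steven_Universe_Wiki'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# find the text node that says "Next Episode", then read its enclosing element
label = soup.find(string=lambda s: s and 'Next Episode' in s)
if label is not None:
    print(label.find_parent().get_text(' ', strip=True))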

BeautifulSoup and Large html

I was trying to scrape a number of large Wikipedia pages like this one.
Unfortunately, BeautifulSoup is not able to work with such large content, and it truncates the page.
I found a solution to this problem using BeautifulSoup at beautifulsoup-where-are-you-putting-my-html, because I think it is easier than lxml.
The only thing you need to do is to install:
pip install html5lib
and add it as a parameter to BeautifulSoup:
soup = BeautifulSoup(htmlContent, 'html5lib')
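For example, a small sketch against the Wikipedia talk page from the question (requests is used here for the download, which is an assumption about your setup):
import requests
from bs4 import BeautifulSoup
r = requests.get('https://en.wikipedia.org/wiki/Talk:Game_theory')
# html5lib is slower than the default parser but does not truncate large pages
soup = BeautifulSoup(r.content, 'html5lib')
print(len(soup.find_all('a')))  # quick sanity check that the whole page parsed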
However, if you prefer, you can also use lxml as follows:
import lxml.html
doc = lxml.html.parse('https://en.wikipedia.org/wiki/Talk:Game_theory')
I suggest you get the html content and then pass it to BS:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://en.wikipedia.org/wiki/Talk:Game_theory')
if r.ok:
    soup = BeautifulSoup(r.content)
    # get the div with links at the bottom of the page
    links_div = soup.find('div', id='catlinks')
    for a in links_div.find_all('a'):
        print a.text
else:
    print r.status_code

How to find and extract links from a webpage?

I have a website, e.g. http://site.com
I would like to fetch the main page and extract only the links that match a regular expression, e.g. .*somepage.*
The format of the links in the HTML code can be:
<a href="http://site.com/my-somepage">url</a>
<a href="/my-somepage.html">url</a>
<a href="my-somepage.htm">url</a>
I need the output format:
http://site.com/my-somepage
http://site.com/my-somepage.html
http://site.com/my-somepage.htm
The output URL must always contain the domain name.
What is a fast Python solution for this?
You could use lxml.html:
from lxml import html
url = "http://site.com"
doc = html.parse(url).getroot() # download & parse webpage
doc.make_links_absolute(url)
for element, attribute, link, _ in doc.iterlinks():
    if (attribute == 'href' and element.tag == 'a' and
            'somepage' in link):  # or e.g., re.search('somepage', link)
        print(link)
Or the same using beautifulsoup4:
import re
try:
    from urllib2 import urlopen
    from urlparse import urljoin
except ImportError:  # Python 3
    from urllib.parse import urljoin
    from urllib.request import urlopen
from bs4 import BeautifulSoup, SoupStrainer # pip install beautifulsoup4
url = "http://site.com"
only_links = SoupStrainer('a', href=re.compile('somepage'))
soup = BeautifulSoup(urlopen(url), parse_only=only_links)
urls = [urljoin(url, a['href']) for a in soup(only_links)]
print("\n".join(urls))
Use an HTML Parsing module, like BeautifulSoup.
Some code (only some):
from bs4 import BeautifulSoup
import re
html = '''<a href="http://site.com/my-somepage">url</a>
<a href="/my-somepage.html">url</a>
<a href="my-somepage.htm">url</a>'''
soup = BeautifulSoup(html)
links = soup.find_all('a', {'href': re.compile('.*somepage.*')})
for link in links:
    print link['href']
Output:
http://site.com/my-somepage
/my-somepage.html
my-somepage.htm
You should be able to get the format you want from this much data...
Scrapy is the simplest way to do what you want. There is actually a built-in link-extracting mechanism.
Let me know if you need help with writing the spider to crawl links.
Please, also see:
How do I use the Python Scrapy module to list all the URLs from my website?
Scrapy tutorial
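As a rough illustration (not a drop-in solution; the class name, allow pattern, and callback are placeholders for your site), a CrawlSpider with a LinkExtractor rule might look like this:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class SomepageSpider(CrawlSpider):
    name = 'somepage'
    allowed_domains = ['site.com']
    start_urls = ['http://site.com']
    # follow only links whose URL matches "somepage"
    rules = [Rule(LinkExtractor(allow=r'somepage'), callback='parse_link')]
    def parse_link(self, response):
        # LinkExtractor yields absolute URLs, so the domain is always included
        yield {'url': response.url}
You could then run it with scrapy runspider somepage_spider.py -o links.json to collect the matching URLs.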
