I am working on a small project of my own and trying to wrap my mind around web scraping.
I am using Python 2 and the BeautifulSoup module (but I have tried other modules as well, experimenting with the re module among others).
Briefly, given the website http://www.bankofcanada.ca/rates/exchange/daily-closing-past-five-day/ I would like to gather the information about the exchange rates for each currency, but with more flexible code.
Here is my example:
import urllib2
from bs4 import BeautifulSoup
import string
import re
myurl = 'http://www.bankofcanada.ca/rates/exchange/daily-closing-past-five-day/'
soup = BeautifulSoup(urllib2.urlopen(myurl).read(), "lxml")
dataTables = soup.find_all('td')
brandNewList = []
for x in dataTables:
    text = x.get_text().strip()
    brandNewList.append(text)
    #print text

for index, item in enumerate(brandNewList):
    if item == "U.S. dollar (close)":
        for item in brandNewList[index:index + 6]:
            print item
It displays:
$ python crawler.py
U.S. dollar (close)
1.4530
1.4557
1.4559
1.4490
1.4279
So, as you may see, I can display the data corresponding to each currency by scraping the 'td' tags; I could get even more specific by using 'th' tags in combination with the 'td' tags.
But what if I don't really want to specify the exact string "U.S. dollar (close)": how can I make the script more adaptable to different websites?
For example, I would like to enter only "US"/"us" as an argument from the terminal and have the script give me back the values corresponding to the US dollar, independently of how the column is named on different websites.
Also, I am kind of a beginner in Python, so can you please show me a neater way of re-writing my web crawler? It feels like I have written it in a kind of "dumb" way, mostly :)
how can I make the script more adaptable to different websites?
Different sites have really different markup, so it is close to impossible to make a universal and reliable locating mechanism in your case. Depending on how many sites you want to scrape, you may just loop over the different site-specific locating functions with an EAFP approach until one of them successfully gets the currency rate.
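A rough sketch of that idea, assuming two hypothetical locator functions (the second site's markup is invented purely for illustration):
import urllib2
from bs4 import BeautifulSoup

def usd_from_bankofcanada(soup):
    # Works for the Bank of Canada table from the question
    label = soup.find("td", text="U.S. dollar (close)")
    return [td.get_text() for td in label.find_next_siblings("td")]

def usd_from_other_site(soup):
    # Hypothetical layout: currency code in a th cell, rates in the sibling td cells
    label = soup.find("th", text="USD")
    return [td.get_text() for td in label.find_next_siblings("td")]

def get_usd_rates(url, locators=(usd_from_bankofcanada, usd_from_other_site)):
    soup = BeautifulSoup(urllib2.urlopen(url).read(), "lxml")
    for locate in locators:
        try:
            return locate(soup)  # EAFP: just try the locator...
        except AttributeError:   # ...and fall through if the markup did not match
            continue
    return None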
Note that some resources provide public or private APIs and you don't really need to scrape them.
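For example, the Bank of Canada publishes its exchange-rate series through a JSON API; the endpoint and series name below are written from memory, so check their documentation before relying on them:
import json
import urllib2

# Assumed endpoint: Bank of Canada "Valet" API, USD/CAD series, last 5 observations
url = "https://www.bankofcanada.ca/valet/observations/FXUSDCAD/json?recent=5"
data = json.load(urllib2.urlopen(url))
for obs in data["observations"]:
    print obs["d"], obs["FXUSDCAD"]["v"]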
By the way, you can improve your code by locating the U.S. dollar (close) label and getting the following td siblings:
us_dollar_label = soup.find("td", text="U.S. dollar (close)")
rates = [td.get_text() for td in us_dollar_label.find_next_siblings("td")]
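And if you want the currency to come from the terminal as "US"/"us", one possibility is a loose, case-insensitive pattern built from the argument; the dot-tolerant pattern below is only a heuristic I made up, not something that will work on every site:
import re
import sys
import urllib2
from bs4 import BeautifulSoup

myurl = 'http://www.bankofcanada.ca/rates/exchange/daily-closing-past-five-day/'
query = sys.argv[1]  # e.g. "US" or "us"
# Allow an optional "." and whitespace after each letter so "US" also matches "U.S. dollar (close)"
pattern = re.compile(r"\.?\s*".join(re.escape(c) for c in query), re.IGNORECASE)

soup = BeautifulSoup(urllib2.urlopen(myurl).read(), "lxml")
label = soup.find("td", text=pattern)
if label is not None:
    for td in label.find_next_siblings("td"):
        print td.get_text()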
I'm trying to web scrape a site that is badly designed, and I am trying to gather the prices of items. The only thing the pages have in common is that the prices all start with a "£", so I thought that if I searched through all the HTML content and returned all strings with a "£" attached, it would work.
I am not quite sure how to go about this. Any help is greatly appreciated.
Kind regards
If you just want to pull out the prices with the '£' prefix, then you can try something like this.
import re
html = """
cost of living is £2,232
bottle of milk costs £1 and it goes up to £1.05 a year later...
"""
print(re.findall(r"£\S+", html))
Output:
['£2,232', '£1', '£1.05']
If you want to extract the item name along with the price, then the regexp will need to be modified. The BeautifulSoup Python library can be used to extract info even from malformed HTML sites.
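A minimal sketch of that BeautifulSoup route, assuming the item name and price sit in span elements with classes like item-name and item-price (those class names are invented; substitute whatever the real site uses):
from bs4 import BeautifulSoup

html = """
<div class="product"><span class="item-name">Bottle of milk</span><span class="item-price">£1.05</span></div>
<div class="product"><span class="item-name">Loaf of bread</span><span class="item-price">£1.20</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
for product in soup.find_all("div", class_="product"):
    name = product.find("span", class_="item-name").get_text(strip=True)
    price = product.find("span", class_="item-price").get_text(strip=True)
    print(name, price)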
I keep running into an issue when I scrape data with lxml using XPath. I want to scrape the Dow price, but when I print it out in Python it says Element span at 0x448d6c0. I know that must be a reference to a block of memory, but I just want the price. How can I print the price instead of the place in memory it is?
from lxml import html
import requests
page = requests.get('https://markets.businessinsider.com/index/realtime-chart/dow_jones')
content = html.fromstring(page.content)

#This will create a list of price elements:
prices = content.xpath('//*[@id="site"]/div/div[3]/div/div[3]/div[2]/div/table/tbody/tr[1]/th[1]/div/div/div/span')
print(prices)
You're getting lxml Element objects back, which print as their type and memory address. To get the price out of each one, read its .text attribute.
Additionally, I would highly recommend changing your XPath, since it's a literal location in the document tree and subject to change.
prices = content.xpath("//div[#id='site']//div[#class='price']//span[#class='push-data ']")
prices_holder = [i.text for i in prices]
prices_holder
['25,389.06',
'25,374.60',
'7,251.60',
'2,813.60',
'22,674.50',
'12,738.80',
'3,500.58',
'1.1669',
'111.7250',
'1.3119',
'1,219.58',
'15.43',
'6,162.55',
'67.55']
Also of note, you will only get the values at load. If you want the prices as they change, you'd likely need to use Selenium.
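A rough Selenium sketch of that (it assumes the same class-based XPath as above and a working chromedriver; the fixed sleep is just a crude stand-in for a proper WebDriverWait):
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://markets.businessinsider.com/index/realtime-chart/dow_jones")
time.sleep(5)  # crude wait for the page's scripts to populate the prices

elements = driver.find_elements(By.XPATH, "//div[@class='price']//span[@class='push-data ']")
print([el.text for el in elements])

driver.quit()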
The variable prices is a list containing lxml element objects. You need to read the element's text attribute to extract the value.
print(prices[0].text)
'25,396.03'
I am trying to extract the contents of a specific tag on a webpage by using lxml, namely on Indeed.com.
Example page: link
I am trying to extract the company name and position name. Chrome shows that the company name is located at
"//*[#id='job-content']/tbody/tr/td[1]/div/span[1]"
and the position name is located at
"//*[#id='job-content']/tbody/tr/td[1]/div/b/font"
This bit of code tries to extract those values from a locally saved and parsed copy of the page:
import lxml.html as h
xslt_root = h.parse("Temp/IndeedPosition.html")
company = xslt_root.xpath("//*[#id='job-content']/tbody/tr/td[1]/div/span[1]/text()")
position = xslt_root.xpath("//*[#id='job-content']/tbody/tr/td[1]/div/b/font/text()")
print(company)
print(position)
However, the print commands return empty lists, meaning nothing was extracted!
What is going on? Am I using the right tags? I don't think these are dynamically generated since the page loads normally with javascript disabled.
I would really appreciate any help with getting those two values extracted.
Try it like this:
company = xslt_root.xpath("//div[#data-tn-component='jobHeader']/span[#class='company']/text()")
position = xslt_root.xpath("//div[#data-tn-component='jobHeader']/b[#class='jobtitle']//text()")
['The Habitat Company']
['Janitor-A (Scattered Sites)']
Once we have the //div[@data-tn-component='jobHeader'] path, things become pretty straightforward:
select the text of the child span, /span[@class='company']/text(), to get the company name;
/b[@class='jobtitle']//text() is a bit more convoluted, since the job title is embedded in a font tag, but we can just select any descendant text using //text() to get the position.
An alternative is to select the b or font node and use text_content() to get the text (recursively, if needed), e.g.
xslt_root.xpath("//div[#data-tn-component='jobHeader']/b[#class='jobtitle']")[0].text_content()
Despite your assumption, it seems that the content on the page is loaded dynamically and is thus not present in the HTML at load time.
This means you can't access the elements from your downloaded HTML file (if you do not believe me, try to look for job-content in the actual file on your computer, which will only contain placeholders and descriptors).
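You can check that yourself with a couple of lines (using the same file path as in your snippet):
with open("Temp/IndeedPosition.html") as f:
    raw = f.read()

# If the page were static, the job-content container would be in the saved HTML
print("job-content" in raw)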
It seems you would have to use technologies like Selenium to perform this task.
Again, I want to stress that whatever you are doing here (automatically) is a violation of indeed.com's Terms and Conditions, so I would suggest not going too far with this anyway.
from lxml import html
import requests
import time
#Gets prices
page = requests.get('https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=hi')
tree = html.fromstring(page.content)
price = tree.xpath('//h2[@data-attribute="Hi Guess the Food - What’s the Food Brand in the Picture"]/text()')
print(price)
This only returns []
When looking into page.content, it shows the amazon anti bot stuff. How can I bypass this?
One piece of general advice when you're trying to scrape something from a website: take a look at the returned content first, in this case page.content, before trying anything else. You're wrongly assuming that Amazon will nicely let you fetch their data, when they don't.
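A small sketch of that first step, simply dumping what Amazon actually sent back so you can open it in a browser and see the anti-bot page for yourself:
import requests

page = requests.get('https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=hi')
print(page.status_code)

# Save the raw response and open this file in a browser to inspect it
with open("amazon_response.html", "wb") as f:
    f.write(page.content)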
I think urllib2 is better, and the XPath could be:
price = c.xpath('//div[@class="s-item-container"]//h2')[0]
print price.text
After all, a long string like that title may contain strange characters.
I haven't used RegEx before, and everyone seems to agree that it's bad for web scraping and HTML in particular, but I'm not really sure how to solve my little challenge without it.
I have a small Python scraper that opens 24 different webpages. In each webpage, there are links to other webpages. I want to make a simple solution that gets the links that I need, and even though the webpages are somewhat similar, the links that I want are not.
The only common thing between the urls seems to be a specific string: 'uge' or 'Uge' ('uge' means week in Danish, and the week number changes every week, duh). It's not like the urls have a common ID or something like that which I could use to target the correct ones each time.
I figure it would be possible to use RegEx to go through the webpage and find all urls that have 'uge' or 'Uge' in them, and then open them. But is there a way to do that using BS? And if I do it using RegEx, what would a possible solution look like?
For example, here are two of the urls I want to grab in different webpages:
http://www.domstol.dk/KobenhavnsByret/retslister/Pages/Uge45-Tvangsauktioner.aspx
http://www.domstol.dk/esbjerg/retslister/Pages/Straffesageruge32.aspx
This should work... The RegEx uge\d\d? tells it to find "uge" followed by a digit, and possibly another one.
import re

for item in listofurls:
    l = re.findall(r"uge\d\d?", item, re.IGNORECASE)
    if l:
        print item  # just do whatever you want to do when it finds it
Yes, you can do this with BeautifulSoup.
import re
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html_string)
# To find just 'Uge##' or 'uge##', as specified in the question:
urls = [el["href"] for el in soup.findAll("a", href=re.compile(r"[Uu]ge\d+"))]
# To find without regard to case at all:
urls = [el["href"] for el in soup.findAll("a", href=re.compile(r"(?i)uge\d+"))]
Or just use a simple for loop:
list_of_urls = ["""LIST GOES HERE"""]
for url in list_of_urls:
    if 'uge' in url.lower():
        pass  # Code to execute
The regex would look something like: uge\d\d