BeautifulSoup could not get everything - python

Two weeks ago, I could read everything in the source code of this URL: http://camelcamelcamel.com/Jaybird-Sport-Wireless-Bluetooth-Headphones/product/B013HSW4SM?active=price_amazon
However, when I run the same code again today, the historical prices no longer appear in the soup. Do you know how to fix this problem?
Here's my Python code (it worked well before):
from bs4 import BeautifulSoup
from urllib2 import urlopen
import datetime
import re

url = 'http://camelcamelcamel.com/Jaybird-Sport-Wireless-Bluetooth-Headphones/product/B013HSW4SM?active=price_amazon'
soup = BeautifulSoup(urlopen(url), 'html.parser')

# Highest, lowest and average prices from the history table
lst = soup.find_all('tbody')
for tbody in lst:
    trs = tbody.find_all('tr')
    for elem in trs:
        tr_class = elem.get('class')
        if tr_class is not None:
            if tr_class[0] == 'highest_price' or tr_class[0] == 'lowest_price':
                tds = elem.find_all('td')
                td_label = tds[0].get_text().split(' ')[0]
                td_price = tds[1].get_text()
                td_date = tds[2].get_text()
                print td_label, td_price, td_date
            else:
                tds = elem.find_all('td')
                td_label = tds[0].get_text().split(' ')[0]
                if td_label == 'Average':
                    td_price = tds[1].get_text()
                    print td_label, td_price

# Tracking start date from the "smalltext grey" paragraph
ps = soup.find_all('p')
for p in ps:
    p_class = p.get('class')
    if p_class is not None and len(p_class) == 2 and p_class[0] == 'smalltext' and p_class[1] == 'grey':
        p_text = p.get_text()
        m = re.search(r'since([\w\d,\s]+)\.', p_text)
        if m:
            date = m.group(1)
            dt = datetime.datetime.strptime(date, ' %b %d, %Y')
            print datetime.date.strftime(dt, '%Y-%m-%d')
            break

From reading the source code, it seems the historical price data is now loaded via JavaScript. You'll need to find a way to emulate a real browser to get it. Personally, I use Selenium for these kinds of tasks.
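For example, here is a minimal sketch using Selenium with headless Chrome (it assumes chromedriver is installed and on your PATH; the rest is just your existing parsing applied to the rendered page):

from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument('--headless')          # run Chrome without opening a window
driver = webdriver.Chrome(options=options)  # assumes chromedriver is on PATH

url = 'http://camelcamelcamel.com/Jaybird-Sport-Wireless-Bluetooth-Headphones/product/B013HSW4SM?active=price_amazon'
driver.get(url)

# driver.page_source contains the page after JavaScript has run,
# so the history tables should be present when you parse it
soup = BeautifulSoup(driver.page_source, 'html.parser')
for tbody in soup.find_all('tbody'):
    print(tbody.get_text(' ', strip=True))

driver.quit()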

I am not really sure about the solution, but you should generally avoid that much list indexing and those nested find_all calls. The position or number of elements changes much more easily than things like classes, ids and so on, so I would recommend using CSS selectors instead.
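For example, a sketch of the same extraction with CSS selectors instead of positional indexing (it assumes the page still marks the rows with the highest_price and lowest_price classes):

# Select the rows of interest by class rather than by position in find_all() results
for row in soup.select('tbody tr.highest_price, tbody tr.lowest_price'):
    cells = [td.get_text(strip=True) for td in row.select('td')]
    print(cells)

# The "Average" row has no class, so match it by its label text instead
for row in soup.select('tbody tr'):
    cells = row.select('td')
    if cells and cells[0].get_text(strip=True).startswith('Average'):
        print(cells[0].get_text(strip=True), cells[1].get_text(strip=True))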


Alternatives to Python beautiful soup

I wrote a few lines to get data from a financial data website.
It simply uses Beautiful Soup to parse and requests to fetch the pages.
Are there any simpler or sleeker ways of getting the same result?
I'm just after a discussion to see what others have come up with.
from pandas import DataFrame
import bs4
import requests

def get_webpage():
    symbols = ('ULVR', 'AZN', 'HSBC')
    for ii in symbols:
        url = 'https://uk.finance.yahoo.com/quote/' + ii + '.L/history?p=' + ii + '.L'
        response = requests.get(url)
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        rows = soup.find_all('tr')
        data = [[td.getText() for td in rows[i].find_all('td')] for i in range(len(rows))]
        # Column layout of each row:
        # [-7:] Date
        # [-6:] Open
        # [-5:] High
        # [-4:] Low
        # [-3:] Close
        # [-2:] Adj Close
        # [-1:] Volume
        data = DataFrame(data)
        print(ii, data)

if __name__ == "__main__":
    get_webpage()
Any thoughts?
You can try the read_html() method:
import pandas as pd

symbols = ('ULVR', 'AZN', 'HSBC')
df = [pd.read_html('https://uk.finance.yahoo.com/quote/' + ii + '.L/history?p=' + ii + '.L') for ii in symbols]
df1 = df[0][0]
df2 = df[1][0]
df3 = df[2][0]
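If you would rather not juggle df1/df2/df3 by hand, a dict keyed by symbol may be more convenient. A sketch along the same lines:

import pandas as pd

symbols = ('ULVR', 'AZN', 'HSBC')
# read_html() returns a list of tables per page; [0] keeps the first (price history) table
tables = {ii: pd.read_html('https://uk.finance.yahoo.com/quote/' + ii + '.L/history?p=' + ii + '.L')[0]
          for ii in symbols}
print(tables['ULVR'].head())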
As it's just the entire table that I want, it seems easier to use pandas.read_html, especially as I have no need to scrape anything in particular apart from the entire table.
There is some helpful information on this site for guidance: https://pbpython.com/pandas-html-table.html
By just using import pandas as pd I get the result I am after.
import pandas as pd

def get_table():
    symbols = ('ULVR', 'AZN', 'HSBC')
    position = 0
    for ii in symbols:
        table = [pd.read_html('https://uk.finance.yahoo.com/quote/' + ii + '.L/history?p=' + ii + '.L')]
        print(symbols[position])
        print(table, '\n')
        position += 1

if __name__ == "__main__":
    get_table()

How to get all emails from a page individually

I am trying to get all emails from a specific page and separate them into individual variables, or even better a dictionary. Here is some code:
import requests
import re
import json
from bs4 import BeautifulSoup

page = "http://www.example.net"
info = requests.get(page)
if info.status_code == 200:
    print("Page accessed")
else:
    print("Error accessing page")
code = info.content
soup = BeautifulSoup(code, 'lxml')
allEmails = soup.find_all("a", href=re.compile(r"^mailto:"))
print(allEmails)
sep = ","
allEmailsStr = str(allEmails)
print(type(allEmails))
print(type(allEmailsStr))
j = allEmailsStr.split(sep, 1)[0]
print(j)
Excuse the poor variable names; I put this together quickly so it would run by itself. The output from the example website would be something like
[k, kolyma, location, balkans]
So if I ran the program it would return only
[k
But if I wanted it to return every email on there individually, how would I do that?
To get just the email strings you can try:
emails = []
for email_link in allEmails:
    emails.append(email_link.get("href").replace('mailto:', ''))
print(emails)
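If you want the dictionary you mentioned, one option is to key it on the visible link text (a sketch; what the keys look like depends on the anchor text on your page):

# Map the visible link text to the address behind the mailto: link
email_dict = {}
for email_link in allEmails:
    address = email_link.get("href").replace('mailto:', '')
    email_dict[email_link.get_text(strip=True)] = address
print(email_dict)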
Based on your expected output, you can use the unwrap function of BeautifulSoup:
allEmails = soup.find_all("a", href=re.compile(r"^mailto:"))
for Email in allEmails:
    print(Email.unwrap())  # this will print the whole element along with the tag
    # k

extracting multiple data from table row in BS4

In the code below I am trying to extract the IP addresses and ports from the table on http://free-proxy-list.net using BeautifulSoup.
But every time I get the whole row, which is useless because I can't separate the IP addresses from their ports.
How can I get the IP and port separately?
Here is my code:
def get_proxy(self):
    response = requests.get(self.url)
    soup = bs(response.content, 'html.parser')
    data_list = [tr for tr in soup.select('tr') if tr.td]
    for i in data_list:
        print(i.text)
In your code, instead of i.text you could use i.getText(' ,') (or another separator of your choice).
That will give you comma-separated IPs and ports.
Moreover, for convenience, you could load the proxy list into a DataFrame as well.
Make the following changes/additions to your code:
import pandas as pd

soup = bs(response.content, 'html.parser')
data_list = [tr for tr in soup.select('tr') if tr.td]
data_list2 = [tr.getText(' ,') for tr in soup.select('tr') if tr.td]
#for i in data_list:
#    print(i.text)
df = pd.DataFrame(data_list2, columns=['proxy_list'])
df_proxyList = df['proxy_list'].str.split(',', expand=True)[0:300]
df_proxyList will then hold the IP and port in separate columns (along with a few garbage columns from the remaining fields).
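To keep only the first two columns and give them names, something like this could follow (a sketch; it assumes the IP and port are the first two fields of each row):

# Keep just the first two split fields and label them
df_proxyList = df_proxyList[[0, 1]]
df_proxyList.columns = ['ip', 'port']
print(df_proxyList.head())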
Try this. I had to add the isnumeric() condition to make sure that the code doesn't pick up data from another table that is present on the same page.
from bs4 import BeautifulSoup as bs
import requests
from collections import defaultdict

def get_proxy(url):
    response = requests.get(url)
    soup = bs(response.content, 'html.parser')
    mapping = defaultdict()
    for tr in soup.select('tr'):
        if len(list(tr)) == 8:
            ip_val = str(list(tr)[0].text)
            port_val = str(list(tr)[1].text)
            if port_val.isnumeric():
                mapping[ip_val] = port_val
    for items in mapping.keys():
        print("IP:", items)
        print("PORT:", mapping[items])

if __name__ == '__main__':
    url = "http://free-proxy-list.net"
    get_proxy(url)

Excluding 'duplicated' scraped URLs in Python app?

I've never used Python before, so excuse my lack of knowledge, but I'm trying to scrape a XenForo forum for all of the threads. So far so good, except that it's picking up multiple URLs for each page of the same thread. I've posted some data below to explain what I mean.
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-9
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-10
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-11
Ideally, what I would want to scrape is just one of these:
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
Here is my script:
from bs4 import BeautifulSoup
import requests

def get_source(url):
    return requests.get(url).content

def is_forum_link(self):
    return self.find('special string') != -1

def fetch_all_links_with_word(url, word):
    source = get_source(url)
    soup = BeautifulSoup(source, 'lxml')
    return soup.select("a[href*=" + word + "]")

main_url = "http://example.com/forum/"

forumLinks = fetch_all_links_with_word(main_url, "forums")
forums = []
for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        forums.append(link.attrs['href'])
print('Fetched ' + str(len(forums)) + ' forums')

threads = {}
for link in forums:
    threadLinks = fetch_all_links_with_word(main_url + link, "threads")
    for threadLink in threadLinks:
        print(link + ': ' + threadLink.attrs['href'])
        threads[link] = threadLink
print('Fetched ' + str(len(threads)) + ' threads')
This solution assumes that what should be removed from the URL to check for uniqueness is always going to be "/page-#...". If that is not the case, this solution will not work.
Instead of using a list to store your URLs, you can use a set, which will only add unique values. Then, before adding a URL to the set, remove the last instance of "page" and anything after it, if it is in the format "/page-#" where # is any number.
forums = set()
for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        url = link.attrs['href']
        position = url.rfind('/page-')
        if position > 0 and url[position + 6:position + 7].isdigit():
            url = url[:position + 1]
        forums.add(url)
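Since the duplicates in your sample output are thread URLs, the same normalization can be applied where the thread links are collected. A sketch along the same lines (the /page-# assumption above still applies):

threads = {}
for link in forums:
    threadLinks = fetch_all_links_with_word(main_url + link, "threads")
    seen = set()
    for threadLink in threadLinks:
        url = threadLink.attrs['href']
        position = url.rfind('/page-')
        if position > 0 and url[position + 6:position + 7].isdigit():
            url = url[:position + 1]  # strip the trailing page-N segment
        if url not in seen:
            seen.add(url)
            print(link + ': ' + url)
    threads[link] = seen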

Parallel web requests in Python (request.get and BeautifulSoup)

I have a simple Python script that loops through a dictionary whose keys are URLs. I have to extract some info from each link and store it in another dictionary. The code below is the first part of the function, and it seems to work as expected.
But it opens only one link at a time, and I think I could improve the run time if I did this in parallel. Do you have any suggestions for how I can achieve this in a simple way in Python?
def updater(local):
    links = myItems['links']
    for link in links.keys():
        page = requests.get(link)
        soup = BeautifulSoup(page.content, 'html.parser')
        newsoup = soup.find("div", {"id": "overviewQuickstatsBenchmarkDiv"})
        rows = newsoup.findAll('tr')[1]
        counter = 0
        date = ""
        for td in rows.findAll('td'):
            counter += 1
            if td.contents[0] == 'Date':
                date = td.text.replace("Date", "")
            elif counter == 2:
                pass
            elif counter == 3:
                price = re.findall(r"\d+\.\d+", td.string)[0]
Here's what I tried using multiprocessing (but I cannot get any result, and the code seems to run without doing anything):
def read(url):
    result = {'link': url, 'data': requests.get(url)}
    print "Reading: " + url
    return result

def updater(local):
    links = myItems['links']
    pool = Pool(processes=5)
    results = pool.map(read, links.keys())
    for link in links.keys():
        # need to read the results and store data into a dictionary
        pass
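One simple way to do the parallel part, as a rough sketch, is a thread pool from multiprocessing.dummy (the same Pool API you already tried, but with threads, which is usually enough because the work is I/O-bound; myItems and the parsing step are assumed to be the ones from the code above):

from multiprocessing.dummy import Pool  # thread pool with the same API as multiprocessing.Pool
import requests
from bs4 import BeautifulSoup

def read(url):
    # Fetch one page; threads overlap well here because most time is spent waiting on the network
    return {'link': url, 'html': requests.get(url).content}

def updater(local):
    links = myItems['links']  # assumed to be the same dict used in the code above
    pool = Pool(5)
    results = pool.map(read, links.keys())
    pool.close()
    pool.join()

    data = {}
    for result in results:
        soup = BeautifulSoup(result['html'], 'html.parser')
        # ... same per-page parsing as in the sequential version ...
        data[result['link']] = soup.find("div", {"id": "overviewQuickstatsBenchmarkDiv"})
    return data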
