BeautifulSoup doesn't find all elements - Python

I'm trying to parse a Taobao shop page and get information about the goods (photo, text, and link) with BeautifulSoup.find, but it doesn't find all classes.
import requests
from bs4 import BeautifulSoup

url = 'https://xuanniwen.world.tmall.com/category-1268767539.htm?search=y&catName=%BC%D0%BF%CB#bd&view_op=citations_histogram'

def get_html(url):
    r = requests.get(url)
    return r.text

html = get_html(url)
soup = BeautifulSoup(html, 'lxml')
z = soup.find("div", {"class": "J_TItems"})
z is empty (None), but, for example:
z = soup.find("div", {"class": "skin-box-bd"})
len(z)
Out[196]: 3
works fine.
Why doesn't this approach work? What should I do to get all the information about the goods? I am using Python 2.7.

It looks like the items you want to parse are built dynamically by JavaScript; that's why soup.text.find("J_TItems") returns -1, i.e. there's no "J_TItems" at all in the downloaded HTML. What you can do is use selenium with a JS interpreter; for headless browsing you can use PhantomJS like this:
from bs4 import BeautifulSoup
from selenium import webdriver
url='https://xuanniwen.world.tmall.com/category-1268767539.htm?search=y&catName=%BC%D0%BF%CB#bd&view_op=citations_histogram'
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'html5lib') # I'd also recommend using html5lib
JTitems = soup.find("div", attrs={"class":"J_TItems"})
Note the items you want are inside each row defined by <div class="item4line1">, and there are 5 of them (you may only want three of them, because the other two are not really inside the main search; filtering them out should not be difficult, a simple rows = rows[2:] does the trick):
rows = JTitems.findAll("div", attrs={"class":"item4line1"})
>>> len(rows)
5
Now notice each "Good" you mention in the question is inside a <dl class="item">, so you need to get them all in a for loop:
Goods = []
for row in rows:
    for item in row.findAll("dl", attrs={"class":"item"}):
        Goods.append(item)
All that's left is to get the "photo, text and link" you mentioned, and this can easily be done by accessing each item in the Goods list; by inspecting the HTML you can see how to get each piece of information. For example, for the picture URL a simple one-liner would be:
>>> Goods[0].find("dt", class_='photo').a.img["src"]
'//img.alicdn.com/bao/uploaded/i3/TB19Fl1SpXXXXbsaXXXXXXXXXXX_!!0-item_pic.jpg_180x180.jpg'
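Building on that, here is a minimal sketch for pulling all three fields out of every item. The photo selector comes straight from the one-liner above; the title/link part (a dd with class "detail" wrapping an a tag) is an assumption about the page markup that you should verify by inspection:
goods_info = []
for item in Goods:
    # picture URL, exactly as in the one-liner above
    photo = item.find("dt", class_="photo").a.img["src"]
    # "detail" is an assumed class name for the block holding the title and link - verify by inspection
    detail = item.find("dd", class_="detail")
    link = detail.a["href"] if detail and detail.a else None   # product link (may be protocol-relative)
    text = detail.a.get_text(strip=True) if detail and detail.a else None  # product title
    goods_info.append({"photo": photo, "link": link, "text": text})

for g in goods_info[:3]:
    print(g)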

Related

BeautifulSoup/Python Problems Parsing Websites

I'm sure this may have been asked in the past, but I am attempting to parse a website (and hopefully automate it to parse multiple websites at once eventually), and it's not working properly. I may be having issues grabbing the appropriate tags, but essentially I want to go to this website, pull all of the items from the lists it builds (ideally with hrefs intact, or in a separate document), and stick them into a file I can share in an easy-to-read format. So far this is my code:
url = "http://catalog.apu.edu/academics/college-liberal-arts-sciences/math-physics-statistics/applied-mathematics-bs/" `
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())
results = soup.find_all('div', class_"tab_content")
for element in results:
title_elem = element.find('h1')
h2_elem = element.find('h2')
h3_elem = element.find('h3')
href_elem = element.find('href')
if None in (title_elem, h2_elem, h3_elem, href_elem):
continue
print(title_elem.text.strip())
print(h2_elem.text.strip())
print(h3_elem.text.strip())
print(href_elem.text.strip())
print()
I even attempted to write this for a table, but I get the same type of output: a bunch of empty elements:
for table in soup.find_all('table'):
    for subtable in table.find_all('table'):
        print(subtable)
Does anyone have any insight as to why this may be the case? If possible, I would also not be opposed to regex parsing, but the main goal here is to go into this site (and hopefully others like it), take the entire tables/lists/descriptions of the individual programs for each major, and write the information into an easy-to-read file.
I took a similar approach in that I also chose to combine bs4 with pandas, but I tested for the presence of the hyperlink class.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

url = 'http://catalog.apu.edu/academics/college-liberal-arts-sciences/math-physics-statistics/applied-mathematics-bs/'
r = requests.get(url)
soup = bs(r.content, 'lxml')

for table in soup.select('.sc_courselist'):
    tbl = pd.read_html(str(table))[0]
    links_column = ['http://catalog.apu.edu' + i.select_one('.bubblelink')['href'] if i.select_one('.bubblelink') is not None else '' for i in table.select('td:nth-of-type(1)')]
    tbl['Links'] = links_column
    print(tbl)
With BeautifulSoup, an alternative to find/find_all is select_one/select. The latter two apply CSS selectors, with select_one returning the first match for the selector passed in and select returning a list of all matches. "." is a class selector, meaning it selects elements with the specified class, e.g. sc_courselist or bubblelink. bubblelink is the class of the elements with the desired hrefs. These are within the first column of each table, which is selected using td:nth-of-type(1).
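As a quick self-contained illustration of the difference (the class names are just the ones used above):
from bs4 import BeautifulSoup

html = '<table class="sc_courselist"><tr><td><a class="bubblelink" href="/x">X</a></td></tr></table>'
soup = BeautifulSoup(html, 'lxml')

# find/find_all filter by tag name and attributes
first = soup.find('a', class_='bubblelink')
# select_one/select take CSS selectors; '.' selects by class, td:nth-of-type(1) picks the first column
same = soup.select_one('a.bubblelink')
cells = soup.select('.sc_courselist td:nth-of-type(1)')

print(first['href'], same['href'], len(cells))  # /x /x 1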

How to get nested href in python?

GOAL
(I need to repeat this search hundreds of times):
1. Search (e.g. "WP_000177210.1") in "https://www.ncbi.nlm.nih.gov/ipg/"
(i.e. https://www.ncbi.nlm.nih.gov/ipg/?term=WP_000177210.1)
2. Select the first record in the second column "CDS Region in Nucleotide" of the table
(i.e. " NC_011415.1 1997353-1998831 (-)", https://www.ncbi.nlm.nih.gov/nuccore/NC_011415.1?from=1997353&to=1998831&strand=2)
3. Select "FASTA" under the name of this sequence
4. Get the fasta sequence
(i.e. ">NC_011415.1:c1998831-1997353 Escherichia coli SE11, complete sequence
ATGACTTTATGGATTAACGGTGACTGGATAACGGGCCAGGGCGCATCGCGTGTGAAGCGTAATCCGGTAT
CGGGCGAG.....").
CODE
1. Search (e.g. "WP_000177210.1") in "https://www.ncbi.nlm.nih.gov/ipg/"
import requests
from bs4 import BeautifulSoup

url = "https://www.ncbi.nlm.nih.gov/ipg/"
r = requests.get(url, params={"term": "WP_000177210.1"})  # i.e. ?term=WP_000177210.1
if r.status_code == requests.codes.ok:
    soup = BeautifulSoup(r.text, "lxml")
2. Select the first record in the second column "CDS Region in Nucleotide" of the table (In this case " NC_011415.1 1997353-1998831 (-)") (i.e. https://www.ncbi.nlm.nih.gov/nuccore/NC_011415.1?from=1997353&to=1998831&strand=2)
# try 1 (wrong)
## I tried this first, but it seemed like it only accessed the first level of the href?!
for a in soup.find_all('a', href=True):
    if a['href'][:8] == "/nuccore":
        print("Found the URL:", a['href'])

# try 2 (not sure how to access the nested href)
## According to the labels I saw in the Developer Tools, I think I need to get the href from the following nested structure. However, it didn't work.
soup.select("html div #maincontent div div div #ph-ipg div table tbody tr td a")
I am stuck at this step right now...
PS
This is my first time dealing with HTML, and it's also my first time asking a question here, so I might not have phrased the problem very well. If there's anything wrong, please let me know.
Without using NCBI's REST API:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

# Opens a Firefox browser for scraping purposes
browser = webdriver.Firefox(executable_path=r'your\path\geckodriver.exe')  # Put your own path here

# Load the page completely (with all of the JS)
browser.get('https://www.ncbi.nlm.nih.gov/ipg/?term=WP_000177210.1')

# Delay turning the page into a soup in order to collect the newly fetched data
time.sleep(3)

# Create the soup
soup = BeautifulSoup(browser.page_source, "html.parser")

# Get all the links, keeping ones that include '/nuccore' and filtering out the bare '/nuccore' one
links = [a['href'] for a in soup.find_all('a', href=True)
         if '/nuccore' in a['href'] and not a['href'] == '/nuccore']
Note:
You'll need the package selenium
You'll need to install GeckoDriver
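Since the goal is to repeat the search hundreds of times, a minimal sketch (the list of accession IDs is a placeholder) that reuses a single browser instance instead of launching one per query:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

accessions = ["WP_000177210.1"]  # placeholder - put your hundreds of IDs here

browser = webdriver.Firefox(executable_path=r'your\path\geckodriver.exe')  # put your own path here
results = {}
try:
    for acc in accessions:
        browser.get('https://www.ncbi.nlm.nih.gov/ipg/?term=' + acc)
        time.sleep(3)  # crude wait; selenium's WebDriverWait with an expected condition is more robust
        soup = BeautifulSoup(browser.page_source, "html.parser")
        links = [a['href'] for a in soup.find_all('a', href=True)
                 if '/nuccore' in a['href'] and a['href'] != '/nuccore']
        results[acc] = links  # e.g. '/nuccore/NC_011415.1?from=1997353&to=1998831&strand=2'
finally:
    browser.quit()

print(results)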

Empty result while using bs4 to parse a site

I want to parse the price information on Bitmex using bs4.
(The site URL is 'https://www.bitmex.com/app/trade/XBTUSD'.)
So I wrote the code like this:
from bs4 import BeautifulSoup
import requests

url = 'https://www.bitmex.com/app/trade/XBTUSD'
bitmex = requests.get(url)
if bitmex.status_code == 200:
    print("connected...")
else:
    print("Error...")

bitmex_html = bitmex.text
soup = BeautifulSoup(bitmex_html, 'lxml')
price = soup.find_all("span", {"class": "price"})
print(price)
And the result is like this:
connected...
[]
Why did '[]' pop up? And to get the price text, like '6065.5', what should I do?
The text I want to parse is
<span class="price">6065.5</span>
and the selector is
content > div > div.tickerBar.overflown > div > span.instruments.tickerBarSection > span:nth-child(1) > span.price
I've just started studying Python, so this question may seem odd to a pro... sorry.
You were pretty close. Give the following a try and see if it's more like what you wanted. Perhaps the format you are seeing or retrieving is not quite what you expect. Hope this is helpful.
from bs4 import BeautifulSoup
import requests
import sys
import json

url = 'https://www.bitmex.com/app/trade/XBTUSD'
bitmex = requests.get(url)
if bitmex.status_code == 200:
    print("connected...")
else:
    print("Error...")
    sys.exit(1)

bitmex_html = bitmex.text
soup = BeautifulSoup(bitmex_html, 'lxml')

# extract the json text from the returned page
price = soup.find_all("script", {"id": "initialData"})
price = price.pop()

# parse json text
d = json.loads(price.text)

# pull out the order book and then each price listed in the order book
order_book = d['orderBook']
prices = [v['price'] for v in order_book]
print(prices)
Example output:
connected...
[6045, 6044.5, 6044, 6043.5, 6043, 6042.5, 6042, 6041.5, 6041, 6040.5, 6040, 6039.5, 6039, 6038.5, 6038, 6037.5, 6037, 6036.5, 6036, 6035.5, 6035, 6034.5, 6034, 6033.5, 6033, 6032.5, 6032, 6031.5, 6031, 6030.5, 6030, 6029.5, 6029, 6028.5, 6028, 6027.5, 6027, 6026.5, 6026, 6025.5, 6025, 6024.5, 6024, 6023.5, 6023, 6022.5, 6022, 6021.5, 6021, 6020.5]
Your problem is that the page doesn't contain those span elements in the first place. If you check the response tab in your browser developer tools (press F12 in Firefox) you can see that the page is composed of script tags with code written in JavaScript that creates the elements dynamically when executed.
Since BeautifulSoup can't execute JavaScript, you can't extract the elements directly with it. You have two alternatives:
Use something like selenium, which allows you to drive a browser from Python - that means the JavaScript will be executed because you're using a real browser - however, performance suffers.
Read the JavaScript code, understand it, and write Python code to simulate it. This is usually harder, but luckily for you it seems very simple for the page you want:
import json
import requests
import lxml.html

r = requests.get('https://www.bitmex.com/app/trade/XBTUSD')
doc = lxml.html.fromstring(r.text)
data = json.loads(doc.xpath("//script[@id='initialData']/text()")[0])
As you can see, the data is in JSON format inside the page. After loading it into the data variable you can use it to access the information you want:
for row in data['orderBook']:
    print(row['symbol'], row['price'], row['side'])
Will print:
('XBTUSD', 6051.5, 'Sell')
('XBTUSD', 6051, 'Sell')
('XBTUSD', 6050.5, 'Sell')
('XBTUSD', 6050, 'Sell')
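If you'd rather stay with BeautifulSoup (which the question already imports), the same embedded-JSON trick looks roughly like this; as above, it assumes the page ships its initial state in a script tag with id "initialData":
import json
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.bitmex.com/app/trade/XBTUSD')
soup = BeautifulSoup(r.text, 'lxml')

# the page embeds its initial state as JSON inside <script id="initialData">
data = json.loads(soup.find('script', id='initialData').text)

for row in data['orderBook'][:5]:
    print(row['symbol'], row['price'], row['side'])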

Getting output 0 even though there are 25 elements with the same class

Image of the HTML
Link to the page
I am trying to see how many elements of this class there are on this page, but the output is 0. I have been using BeautifulSoup for a while but have never seen such an error.
from bs4 import BeautifulSoup
import requests
result = requests.get("https://www.holonis.com/motivationquotes")
c = result.content
soup = BeautifulSoup(c)
samples = soup.findAll("div", {"class": "ng-scope"})
print(len(samples))
Output
0
and I want the correct output, which should be at least 25.
This is a "dynamic" Angular-based page which needs a Javascript engine or a browser to be fully loaded. To put it differently - the HTML source code you see in the browser developer tools is not the same as you would see in the result.content - the latter is a non-rendered initial HTML of the page which does not contain the desired data.
You can use things like selenium to have the page rendered and loaded and then HTML-parse it, but why not make a direct request to the site's API:
import requests

result = requests.get("https://www.holonis.com/api/v2/activities/motivationquotes/all?limit=15&page=0")
data = result.json()

for post in data["items"]:
    print(post["body"]["description"])
Post descriptions are retrieved and printed for example purposes only - the post dictionaries contain all the other relevant post data that is displayed on the web page.
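Since the endpoint takes limit and page query parameters, here is a minimal sketch for walking several pages; the upper page bound and the empty-page stop condition are assumptions:
import requests

base = "https://www.holonis.com/api/v2/activities/motivationquotes/all"
posts = []
for page in range(0, 10):  # assumed upper bound; we break early when a page comes back empty
    r = requests.get(base, params={"limit": 15, "page": page})
    items = r.json().get("items", [])
    if not items:
        break
    posts.extend(items)

print(len(posts))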
Basically, result.content does not contain any divs with the ng-scope class. As stated in one of the comments, the HTML you are trying to get is added there by the JavaScript running in the browser.
I recommend the requests-html package, created by the author of the very popular requests.
You can try to use the code below to build on that.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.holonis.com/motivationquotes')
r.html.render()
To see how many ng-scope classes are there just do this:
>>> len(r.html.find('.ng-scope'))
302
I assume you want to scrape all the hrefs from the a tags that are children of the divs shown in your image. You can obtain them this way:
from itertools import chain

divs = r.html.find('[ng-if="!isVideo"]')
link_sets = (div.absolute_links for div in divs)
>>> list(set(chain.from_iterable(link_sets)))
['https://www.holonis.com/motivationquotes/o/byiqe-ydm',
 'https://www.holonis.com/motivationquotes/o/rkhv0uq9f',
 'https://www.holonis.com/motivationquotes/o/ry7ra2ycg',
 ...
 'https://www.holonis.com/motivationquotes/o/sydzfwgcz',
 'https://www.holonis.com/motivationquotes/o/s1eidcdqf']
There's nothing wrong with BeautifulSoup; in fact, the result of your GET request does not contain any ng-scope text.
You can see the output here:
>>> from bs4 import BeautifulSoup
>>> import requests
>>>
>>> result = requests.get("https://www.holonis.com/motivationquotes")
>>> c = result.content
>>>
>>> print(c)
**Verify the output yourself**
You only have the ng-cloak class, as you can see from the regex result:
import re
regex = re.compile('ng.*')
samples = soup.findAll("div", {"class": regex})
samples
#[<div class="ng-cloak slideRoute" data-ui-view="main" fixed="400" main-content="" ng-class="'{{$state.$current.animateClass}}'"></div>]
To get the content of that webpage, it is wise to either use their API or a browser simulator like selenium. That webpage loads its content using lazy loading: when you scroll down you see more content. The webpage expands its content through pagination, like https://www.holonis.com/excellenceaddiction/1. However, you can give this a go. I've created this script to parse the content displayed within the first three pages. You can always change the page range to satisfy your requirement.
from selenium import webdriver

URL = "https://www.holonis.com/excellenceaddiction/{}"

driver = webdriver.Chrome()  # If necessary, define the path

for link in [URL.format(page) for page in range(1, 4)]:
    driver.get(link)
    for items in driver.find_elements_by_css_selector('.tile-content .tile-content-text'):
        print(items.text)

driver.quit()
Btw, the above script parses the description of each post.

Beautiful Soup - Blank screen for a long time without any output

I am quite new to Python and am working on a scraping-based project where I am supposed to extract all the content from links containing a particular search term and place it in a CSV file. As a first step, I wrote this code to extract all the links from a website based on a search term entered. I only get a blank screen as output, and I am unable to find my mistake.
import urllib
import mechanize
from bs4 import BeautifulSoup
import datetime

def searchAP(searchterm):
    newlinks = []
    browser = mechanize.Browser()
    browser.set_handle_robots(False)
    browser.addheaders = [('User-agent', 'Firefox')]
    text = ""
    start = 0
    while "There were no matches for your search" not in text:
        url = "http://www.marketing-interactive.com/" + "?s=" + searchterm
        text = urllib.urlopen(url).read()
        soup = BeautifulSoup(text, "lxml")
        results = soup.findAll('a')
        for r in results:
            if "rel=bookmark" in r['href']:
                newlinks.append("http://www.marketing-interactive.com" + str(r["href"]))
        start += 10
    return newlinks

print searchAP("digital marketing")
You made four mistakes:
You are defining start but you never use it. (Nor can you, as far as I can see on http://www.marketing-interactive.com/?s=something. There is no URL-based pagination.) So you endlessly loop over the first set of results.
"There were no matches for your search" is not the no-results string returned by that site. So it would go on forever anyway.
You are appending the link, which already includes http://www.marketing-interactive.com, to http://www.marketing-interactive.com. So you would end up with http://www.marketing-interactive.comhttp://www.marketing-interactive.com/astro-launches-digital-marketing-arm-blaze-digital/
Concerning the rel=bookmark selection: arif's solution is the proper way to go. But if you really want to do it this way, you'd need something like this:
for r in results:
    if r.attrs.get('rel') and r.attrs['rel'][0] == 'bookmark':
        newlinks.append(r["href"])
This first checks whether rel exists and then checks whether its first value is "bookmark", because r['href'] simply does not contain the rel. That's not how BeautifulSoup structures things.
To scrape this specific site you can do two things:
You could do something with Selenium or something else that supports Javascript and press that "Load more" button. But this is quite a hassle.
You can use this loophole: http://www.marketing-interactive.com/wp-content/themes/MI/library/inc/loop_handler.php?pageNumber=1&postType=search&searchValue=digital+marketing
This is the url that feeds the list. It has pagination, so you can easily loop over all results.
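Building on that loophole, a minimal sketch that loops over the paginated feed and collects the bookmark links; it assumes the endpoint returns ordinary HTML fragments containing the same rel=bookmark anchors and simply comes back empty once you run past the last page:
import requests
from bs4 import BeautifulSoup

def search_all_pages(search_value, max_pages=50):  # max_pages is an arbitrary safety cap
    base = ("http://www.marketing-interactive.com/wp-content/themes/MI/library/inc/"
            "loop_handler.php?pageNumber={}&postType=search&searchValue={}")
    links = []
    for page in range(1, max_pages + 1):
        response = requests.get(base.format(page, search_value))
        soup = BeautifulSoup(response.text, "lxml")
        found = [a['href'] for a in soup.findAll('a', {'rel': 'bookmark'})]
        if not found:  # assumed stop condition: an empty page means we are past the last result
            break
        links.extend(found)
    return links

print(search_all_pages("digital+marketing"))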
The following script extracts all the links from the web page based on the given search key, but it does not explore beyond the first page. The code can easily be modified to get all results from multiple pages by manipulating the page number in the URL (as described by Rutger de Knijf in the other answer).
from pprint import pprint
import requests
from BeautifulSoup import BeautifulSoup

def get_url_for_search_key(search_key):
    base_url = 'http://www.marketing-interactive.com/'
    response = requests.get(base_url + '?s=' + search_key)
    soup = BeautifulSoup(response.content)
    return [url['href'] for url in soup.findAll('a', {'rel': 'bookmark'})]
Usage:
pprint(get_url_for_search_key('digital marketing'))
Output:
[u'http://www.marketing-interactive.com/astro-launches-digital-marketing-arm-blaze-digital/',
u'http://www.marketing-interactive.com/singapore-polytechnic-on-the-hunt-for-digital-marketing-agency/',
u'http://www.marketing-interactive.com/how-to-get-your-bosses-on-board-your-digital-marketing-plan/',
u'http://www.marketing-interactive.com/digital-marketing-institute-launches-brand-refresh/',
u'http://www.marketing-interactive.com/entropia-highlights-the-7-original-sins-of-digital-marketing/',
u'http://www.marketing-interactive.com/features/futurist-right-mindset-digital-marketing/',
u'http://www.marketing-interactive.com/lenovo-brings-board-new-digital-marketing-head/',
u'http://www.marketing-interactive.com/video/discussing-digital-marketing-indonesia-video/',
u'http://www.marketing-interactive.com/ubs-melvin-kwek-joins-credit-suisse-as-apac-digital-marketing-lead/',
u'http://www.marketing-interactive.com/linkedins-top-10-digital-marketing-predictions-2017/']
Hope this is what you wanted as the first step for your project.
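Since the original goal was to end up with a CSV file, here is a minimal sketch (assuming one link per row is enough) for writing the collected links out with the standard csv module:
import csv

links = get_url_for_search_key('digital marketing')

with open('search_links.csv', 'wb') as f:  # 'wb' because the snippets above are Python 2
    writer = csv.writer(f)
    writer.writerow(['link'])
    for link in links:
        writer.writerow([link])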
