Using regular expression in Beautiful Soup scraping - python

I have a list of URLs and I'm trying to use regex to scrap info from each URL. This is my code (well, at least the relevant part):
for url in sammy_urls:
soup = BeautifulSoup(urlopen(url).read()).find("div",{"id":"page"})
addy = soup.find("p","addy").em.encode_contents()
extracted_entities = re.match(r'"\$(\d+)\. ([^,]+), ([\d-]+)', addy).groups()
price = extracted_entities[0]
location = extracted_entities[1]
phone = extracted_entities[2]
if soup.find("p","addy").em.a:
website = soup.find("p", "addy").em.a.encode_contents()
else:
website = ""
When I pull a couple of the URLs and practice the regex equation, the extracted entities and the price location phone website come up fine, but run into trouble when I put it into this larger loop, being feed real URLs.
Did I input the regex incorrectly? (the error message is ''NoneType' object has no attribute 'groups'' so that is my guess).
My 'addy' seems to be what I want... (prints
"$10. 2109 W. Chicago Ave., 773-772-0406, "'theoldoaktap.com
"$9. 3619 North Ave., 773-772-8435, "'cemitaspuebla.com
and so on).

Combining html/xml with regular expressions has a tendency to turn bad.
Why not use bs4 to find the 'a' elements in the div you're interested in and get the 'href' attribute from the element.
see also retrieve links from web page using python and BeautifulSoup

Related

Select css tags with randomized letters at the end

I am currently learning web scraping with python. I'm reading Web scraping with Python by Ryan Mitchell.
I am stuck at Crawling Sites Through Search. For example, reuters search given in the book works perfectly but when I try to find it by myself, as I will do in the future, I get this link.
Whilst in the second link it is working for a human, I cannot figure out how to scrape it due to weird class names like this class="media-story-card__body__3tRWy"
The first link gives me simple names, like this class="search-result-content" that I can scrape.
I've encountered the same problem on other sites too. How would I go about scraping it or finding a link with normal names in the future?
Here's my code example:
from bs4 import BeautifulSoup
import requests
from rich.pretty import pprint
text = "hello"
url = f"https://www.reuters.com/site-search/?query={text}"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
results = soup.select("div.media-story-card__body__3tRWy")
for result in results:
pprint(result)
pprint("###############")
You might resort to a prefix attribute value selector, like
div[class^="media-story-card__body__"]
This assumes that the class is the only one ( or at least notationally the first ). However, the idea can be extended to checking for a substring.

scraping data from web page but missing content

I am downloading the verb conjugations to aid my learning. However one thing I can't seem to get from this web page is the english translation near the top of the page.
The code I have is below. When I print results_eng it prints the section I want but there is no english translation, what am I missing?
import requests
from bs4 import BeautifulSoup
URL = 'https://conjugator.reverso.net/conjugation-portuguese-verb-ser.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results_eng = soup.find(id='list-translations')
eng = results_eng.find_all('p', class_='context_term')
In a normal website, you should be able to find the text in a paragraph witht the function get_text(), but in this case this is a search, wich means it's probably pulling the data from a database and its not in the paragraph itself. At least that's what I can come up with, since I tried to use that function and I got an empty string in return. Can't you try another website and see what happens?
p.d: I'm a beginner, sorry if I'm guessing wrong

Equivalent regular expression to extract link using Beautiful Soup

I am trying to randomly explore Webscraping through python.I have link of google search results page. I used url lib to extract all the links which are present in the GOOGLE SEARCH RESULT PAGE. From that parsed page of google I am extracting all possible anchor tags with the help of Beautiful Soup library. So now I have lots of links. Among those I want to pick selected links which matches my required pattern.
Example I want to pick all such lines:
This is one of the many links that got parsed. But I want to narrow down the result of the links which are like this
/url?q=http://avadl.uploadt.com/DL4/Film/&sa=U&ved=0ahUKEwiYwOKe1r7hAhWUf30KHcHUBkMQFggUMAA&usg=AOvVaw39cIJ0T8_CAQMY8EkSWZJl
And among such picks I need to extract only this part
http://avadl.uploadt.com/DL4/Film/
I tried this and this
possible_websites.append(re.findall('/url?q=(\S+)',links))
possible_websites.append(re.findall('/url?q=(\S+^&)',links))
Here's my code
soup = BeautifulSoup(webpage, 'html.parser')
tags = soup('a')
possible_websites=[]
for tag in tags:
links = tag.get('href', None)
possible_websites.append(re.findall('/url?q=(\S+)',links))
I want to use regular expression to extract the required text part. I am using Beautiful soup module to extract the HTML data. In short this is much of a reguar expression problem.
It’s not regex, but I’d use urllib:
from urllib.parse import parse_qs, urlparse
url = urlparse('/url?q=http://avadl.uploadt.com/DL4/Film/&sa=U&ved=0ahUKEwiYwOKe1r7hAhWUf30KHcHUBkMQFggUMAA&usg=AOvVaw39cIJ0T8_CAQMY8EkSWZJl')
qs = parse_qs(url.query)
print(qs['q'][0])
If you really need a regex, use q=(.*/)& otherwise go with Ry-'s answer, i.e.:
import re
u = "/url?q=http://avadl.uploadt.com/DL4/Film/&sa=U&ved=0ahUKEwiYwOKe1r7hAhWUf30KHcHUBkMQFggUMAA&usg=AOvVaw39cIJ0T8_CAQMY8EkSWZJl"
m = re.findall("q=(.*/)&", u)
if m:
print(m[0])
# http://avadl.uploadt.com/DL4/Film/
Demo

xpath how to format path

I would like to get #src value '/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg' from webpage
from lxml import html
import requests
URL = 'http://systemsklep.pl/pol_m_Kategorie_Deskorolka_Deski-281.html'
session = requests.session()
page = session.get(URL)
HTMLn = html.fromstring(page.content)
print HTMLn.xpath('//html/body/div[1]/div/div/div[3]/div[19]/div/a[2]/div/div/img/#src')[0]
but I can't. No matter how I format xpath, i tdooesnt work.
In the spirit of #pmuntima's answer, if you already know it's the 14th sourced image, but want to stay with lxml, then you can:
print HTMLn.xpath('//img/#data-src')[14]
To get that particular image. It similarly reports:
/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg
If you want to do your indexing in XPath (possibly more efficient in very large result sets), then:
print HTMLn.xpath('(//img/#data-src)[14]')[0]
It's a little bit uglier, given the need to parenthesize in the XPath, and then to index out the first element of the list that .xpath always returns.
Still, as discussed in the comments above, strictly numerical indexing is generally a fragile scraping pattern.
Update: So why is the XPath given by browser inspect tools not leading to the right element? Because the content seen by a browser, after a dynamic JavaScript-based update process, is different from the content seen by your request. Your request is not running JS, and is doing no such updates. Different content, different address needed--if the address is static and fragile, at any rate.
Part of the updates here seem to be taking src URIs, which initially point to an "I'm loading!" gif, and replacing them with the "real" src values, which are found in the data-src attribute to begin.
So you need two changes:
a stronger way to address the content you want (a way that doesn't break when you move from browser inspect to program fetch) and
to fetch the URIs you want from data-src not src, because in your program fetch, the JS has not done its load-and-switch trick the way it did in the browser.
If you know text associated with the target image, that can be the trick. E.g.:
search_phrase = 'DECK SANTA CRUZ STAR WARS EMPIRE STRIKES BACK POSTER'
path = '//img[contains(#alt, "{}")]/#data-src'.format(search_phrase)
print HTMLn.xpath(path)[0]
This works because the alt attribute contains the target text. You look for images that have the search phrase contained in their alt attributes, then fetch the corresponding data-src values.
I used a combination of requests and beautiful soup libraries. They both are wonderful and I would recommend them for scraping and parsing/extracting HTML. If you have a complex scraping job, scrapy is really good.
So for your specific example, I can do
from bs4 import BeautifulSoup
import requests
URL = 'http://systemsklep.pl/pol_m_Kategorie_Deskorolka_Deski-281.html'
r = requests.get(URL)
soup = BeautifulSoup(r.text, "html.parser")
specific_element = soup.find_all('a', class_="product-icon")[14]
res = specific_element.find('img')["data-src"]
print(res)
It will print out
/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg

find_all() not finding any results when using Beautiful Soup + Requests

I'm experimenting with using BeautifulSoup and Requests for the first time, and am trying to learn by scraping some information from a news site. The aim of the project is to just be able to read news highlights from terminal, so I need to effectively scrape and parse article titles and article body text.
I am still at the stage of getting the titles, but I simply am not storing any data when I try to use the find_all() function. Below is my code:
from bs4 import BeautifulSoup
from time import strftime
import requests
date = strftime("%Y/%m/%d")
url = "http://www.thedailybeast.com/cheat-sheets/" + date + "/cheat-sheet.html"
result = requests.get(url)
c = result.content
soup = BeautifulSoup(c, "lxml")
titles = soup.find_all('h1 class="title multiline"')
print titles
Any thoughts? If anyone also has any advice / tips to improve what I currently have or the approach I'm taking, I'm always looking to get better so please do tell!
Cheers
You are putting everything here in quotes:
titles = soup.find_all('h1 class="title multiline"')
which makes BeautifulSoup search for h1 class="title multiline" elements.
Instead, use:
titles = soup.find_all("h1", class_="title multiline")
Or, with a CSS selector:
titles = soup.select("h1.title.multiline")
Actually, because of the dynamic nature of the page, to get all of the titles, you have to approach it differently:
import json
results = json.loads(soup.find('div', {'data-pageraillist': True})['data-pageraillist'])
for result in results:
print result["title"]
Prints:
Hillary Email ‘Born Classified’
North Korean Internet Goes Down
Kid-Porn Cops Go to Gene Simmons’s Home
Baylor Player Convicted of Rape After Coverup
U.S. Calls In Aussie Wildfire Experts
Markets’ 2015 Gains Wiped Out
Black Lives Matters Unveils Platform
Sheriff Won’t Push Jenner Crash Charge
Tear Gas Used on Migrants Near Macedonia
Franzen Considered Adopting Iraqi Orphan
You're very close, but find_all only searches the tags, it's not like a generic search function.
Hence if you want to filter by tag and attribute like class, then do this:
soup.find_all('h1', {'class' : 'multiline'})

Categories

Resources