Scraping data not in view page source using Python Scrapy

I want to scrape the e-mail addresses from this link:
https://threebestrated.ca/children-dentists-in-airdrie-ab
but the output is empty because the addresses are not present in the page source.
This is the code:
import scrapy

class BooksSpider(scrapy.Spider):
    name = "3bestrated"
    allowed_domains = ['threebestrated.ca']
    start_urls = ["https://threebestrated.ca/children-dentists-in-airdrie-ab"]

    def parse(self, response):
        emails = response.xpath("//a[contains(@href, 'mailto:')]/text()").getall()
        yield {
            "a": emails,
        }

The e-mail addresses are encoded in a certain way to prevent naive scraping. Here is one such encoded e-mail address:
<p>
<a href="/cdn-cgi/l/email-protection#3851565e57784b515d4a4a595c5d564c5954165b59074b4d5a525d5b4c056a5d494d5d4b4c1d0a084c504a574d5f501d0a086c504a5d5d7a5d4b4c6a594c5d5c165b59">
<i class="fa fa-envelope-o"></i>
<span class="__cf_email__" data-cfemail="70191e161f3003191502021114151e04111c5e1311">[email protected]</span>
</a>
</p>
The encoded value is then decoded client-side by a JavaScript script.
So, your options are:

- Reverse-engineer the decoding script
- Use some kind of JavaScript runtime to execute the decoding script

If you're going to use a JavaScript runtime, you might as well use Selenium to begin with (there seems to be a scrapy-selenium middleware that you could use if you want to stick with Scrapy).
EDIT - I've reverse-engineered it for fun:
def deobfuscate(string, start_index):
    def extract_hex(string, index):
        substring = string[index: index + 2]
        return int(substring, 16)
    # the first hex pair is the XOR key; every following pair is a key-XORed character
    key = extract_hex(string, start_index)
    for index in range(start_index + 2, len(string), 2):
        yield chr(extract_hex(string, index) ^ key)

def process_tag(tag):
    url_fragment = "/cdn-cgi/l/email-protection#"
    href = tag["href"]
    start_index = href.find(url_fragment)
    if start_index > -1:
        return "".join(deobfuscate(href, start_index + len(url_fragment)))
    return None

def main():
    import requests
    from bs4 import BeautifulSoup as Soup
    from urllib.parse import unquote

    url = "https://threebestrated.ca/children-dentists-in-airdrie-ab"
    response = requests.get(url)
    response.raise_for_status()
    soup = Soup(response.content, "html.parser")

    print("E-Mail Addresses from <a> tags:")
    for email in map(unquote, filter(None, map(process_tag, soup.find_all("a", href=True)))):
        print(email)

    cf_elem_attr = "data-cfemail"
    print("\nE-Mail Addresses from tags where \"{}\" attribute is present:".format(cf_elem_attr))
    for tag in soup.find_all(attrs={cf_elem_attr: True}):
        email = unquote("".join(deobfuscate(tag[cf_elem_attr], 0)))
        print(email)

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
E-Mail Addresses from <a> tags:
info@sierradental.ca?subject=Request through ThreeBestRated.ca
reviews@threebestrated.ca?subject=My Review for Dr. Amin Salmasi in Airdrie
info@mainstreetdentalairdrie.ca?subject=Request through ThreeBestRated.ca
reviews@threebestrated.ca?subject=My Review for Dr. James Yue in Airdrie
friends@toothpals.ca?subject=Request through ThreeBestRated.ca
reviews@threebestrated.ca?subject=My Review for Dr. Christine Bell in Airdrie
support@threebestrated.ca

E-Mail Addresses from tags where "data-cfemail" attribute is present:
info@sierradental.ca
friends@toothpals.ca
support@threebestrated.ca
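If you would rather keep everything inside the original Scrapy spider, the same XOR decoding can be applied to the data-cfemail attributes directly in parse. Here is a minimal sketch reusing the deobfuscate generator above (untested against the live site):

import scrapy

def deobfuscate(string, start_index):
    # first hex pair is the XOR key, the remaining pairs are the XORed characters
    def extract_hex(string, index):
        return int(string[index: index + 2], 16)
    key = extract_hex(string, start_index)
    for index in range(start_index + 2, len(string), 2):
        yield chr(extract_hex(string, index) ^ key)

class BooksSpider(scrapy.Spider):
    name = "3bestrated"
    allowed_domains = ['threebestrated.ca']
    start_urls = ["https://threebestrated.ca/children-dentists-in-airdrie-ab"]

    def parse(self, response):
        # grab every element carrying a data-cfemail attribute and decode it
        encoded = response.xpath("//*[@data-cfemail]/@data-cfemail").getall()
        yield {"emails": ["".join(deobfuscate(e, 0)) for e in encoded]}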

Related

Python web scraping script does not find element by css selector

I'm trying to get this web scraper to fetch the current electricity price from this website. The site is in Finnish, but the price is right under "Hinta nyt": https://sahko.tk/
Here's my code:
import requests
from bs4 import BeautifulSoup

url = "https://sahko.tk/"
element_selector = ""

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

elements = soup.find_all(element_selector)
if len(elements) == 0:
    print("No element found with selector '%s'" % element_selector)
else:
    element_text = elements[0].text
    print(element_text)
I left element_selector empty because whatever I tried just did not work. I'm not even sure I'm on the right track.
The data you see is embedded inside a <script> tag on that page. To parse the current price you can use the following example:
import re
import json
import requests
url = "https://sahko.tk/"
data = requests.get(url).text
data = re.search(r"function prices_today\(\)\{var t= (.*?});", data).group(1)
data = json.loads(data)
print("Hinta nyt", data["now"], "snt/kWh")
Prints:
Hinta nyt 33.27 snt/kWh
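Note that re.search returns None if the page markup changes, which would make .group(1) raise an AttributeError; a slightly more defensive variant of the same idea:

import re
import json
import requests

url = "https://sahko.tk/"
html = requests.get(url).text

# the price data is embedded in a <script> as the argument of prices_today()
m = re.search(r"function prices_today\(\)\{var t= (.*?});", html)
if m is None:
    raise SystemExit("Price data not found - the page markup may have changed")

data = json.loads(m.group(1))
print("Hinta nyt", data["now"], "snt/kWh")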

How do I crawl and scrape this specific website and save the data in a text file using Python?

I'm doing a project that implements Word2Vec on a Bengali-language web corpus to find contextually similar words, and as a prerequisite I am trying to crawl certain news and blog sites and then scrape the links to build a data corpus. I'm using Google Colab in my Chrome browser, as of now.
Here's my Python code for crawling... (I did take help from the internet for code snippets; I have only recently learnt all of this)
import requests
import urllib.parse
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import colorama
from urllib.request import urlopen
from urllib.request import Request

# init the colorama module
colorama.init()
GREEN = colorama.Fore.GREEN
GRAY = colorama.Fore.LIGHTBLACK_EX
RESET = colorama.Fore.RESET
YELLOW = colorama.Fore.YELLOW

# initialize the sets of unique links
internal_urls = set()  # set of all internal links
external_urls = set()  # set of all external links
old_internals = set()  # keeps track of the internal links collected before the current page
def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

def get_all_website_links(url):
    """
    Returns all URLs found on `url` that belong to the same website.
    """
    global old_internals
    try:
        urls = set()
        # domain name of the URL without the protocol
        domain_name = urlparse(url).netloc
        user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
        req = Request(url, headers={'User-Agent': user_agent})
        article = urlopen(req).read()
        soup = BeautifulSoup(article, "lxml")
        old_internals = internal_urls.copy()  # copies the old set of internal links
        for a_tag in soup.findAll("a"):  # links under <a> tags
            href = a_tag.attrs.get("href")
            if href == "" or href is None:
                # empty href attribute
                continue
            # join the URL if it's relative (not an absolute link)
            href = urljoin(url, href)
            parsed_href = urlparse(href)
            # remove URL GET parameters, URL fragments, etc.
            href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
            if not is_valid(href):
                # not a valid URL
                continue
            if href in internal_urls:
                # already in the set
                continue
            if domain_name not in href:
                # external link
                if href not in external_urls:
                    print(f"{GRAY}[!] External link: {href}{RESET} \n")
                    external_urls.add(href)
                continue
            print(f"{GREEN}[*] Internal link: {href}{RESET} \n")
            urls.add(href)
            internal_urls.add(href)
        # I could definitely have done this as a function
        # instead of writing the whole code again, but well...
        # (I will change it)
        for link_tag in soup.findAll("link"):  # links under <link> tags
            href = link_tag.attrs.get("href")
            if href == "" or href is None:
                # empty href attribute
                continue
            # join the URL if it's relative (not an absolute link)
            href = urljoin(url, href)
            parsed_href = urlparse(href)
            # remove URL GET parameters, URL fragments, etc.
            href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
            if not is_valid(href):
                # not a valid URL
                continue
            if href in internal_urls:
                # already in the set
                continue
            if domain_name not in href:
                # external link
                if href not in external_urls:
                    print(f"{GRAY}[!] External link: {href}{RESET} \n")
                    external_urls.add(href)
                continue
            print(f"{GREEN}[*] Internal link: {href}{RESET} \n")
            urls.add(href)
            internal_urls.add(href)
        return urls
    except Exception as e:
        # If the link to be added is problematic, just return the old set of
        # internal links. The function was returning an error and stopped
        # crawling because of certain internal links midway when max count was
        # large, so...
        print("\n")
        print(e)
        print("\nNone returned\n")
        # print(internal_urls, "\n\n")
        return old_internals
# number of urls visited so far will be stored here
total_urls_visited = 0

def crawl(url, max_urls=30):
    """
    Crawls a web page and extracts all links.
    You'll find all links in the `external_urls` and `internal_urls` global set variables.
    params:
        max_urls (int): number of max urls to crawl, default is 30.
    """
    global total_urls_visited
    total_urls_visited += 1
    # print(url)
    print(f"{YELLOW}[*] Crawling: {url}{RESET} \n")
    links = get_all_website_links(url)
    loop = links.copy()  # since returning old internal links may change the loop size
    for link in loop:
        if total_urls_visited > max_urls:
            break
        crawl(link, max_urls)
def extract_name(link_url):  # builds the output file name from the URL
    link_name = link_url[link_url.index(":") + 3:]  # skips the "https://" part :)
    link_name = link_name.replace('/', '_')
    link_name = link_name.replace('.', '_')
    link_name = link_name.replace(' ', '_')
    link_name = link_name.replace('-', '_')
    return link_name + ".txt"

def fileWrite(fname, lst):
    a_file = open(fname, "wb")
    for element in lst:
        if len(element) == 0:
            continue
        a_file.write(element.encode() + "\n".encode())
    a_file.close()
# Runtime
if __name__ == "__main__":
    max_urls = 30  # maximum number of URLs to crawl
    # Arbitrary list of links of Bengali sites
    web_links = ["https://www.anandabazar.com/",
                 "https://www.prothomalo.com/",
                 "https://www.littlemag.org/2019/05/blog-post_60.html"]
    # Index of the web link in the list
    index = 1
    crawl(web_links[index], max_urls)
    fname = extract_name(web_links[index])
    fileWrite(fname, internal_urls)
    print("[+] Total Internal links:", len(internal_urls))
    print("[+] Total External links:", len(external_urls))
    print("[+] Total URLs:", len(external_urls) + len(internal_urls))
    print("[+] Total crawled URLs:", max_urls)
My code works without any issues for the first two sites [indices 0, 1], which I presume is because I can even copy the text manually when I visit those sites in my Chrome browser.
But the site at index=2, i.e.
https://www.littlemag.org/2019/05/blog-post_60.html,
doesn't work at all, and I can't copy or select anything in the browser either. How do I work around this problem and crawl links on this site's domain?
The same issue shows up in my web scraping code...
import bs4 as bs
import urllib.request
from urllib.request import Request, urlopen

web_links = ["https://www.anandabazar.com/",
             "https://www.prothomalo.com/",
             "https://www.littlemag.org/2019/05/blog-post_60.html"]

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
req = Request(web_links[2], headers={'User-Agent': user_agent})
article = urlopen(req).read()
# print(article)
parsed_article = bs.BeautifulSoup(article, 'lxml')

# I read that the main content of articles on blogs and news sites is stored in the <p> tag
# (I don't remember everything I studied in HTML); please feel free to let me know
# if I should include something else too.
paragraphs = parsed_article.find_all('p')

article_text = ""
for p in paragraphs:
    article_text += " " + p.text

print(article_text)
I can't fetch or scrape any data from the Bengali article on this site, https://www.littlemag.org/2019/05/blog-post_60.html, and print it in the Colab console. What should I change in the two scripts to work around this and include data from these non-copyable, non-selectable sites?
Update:
Thank you Andrej Kesely. My problem with scraping the site is solved, but I would like to know if there is a way to scrape the headings within that page using your code?
for content in soup.find_all([re.compile('^h[1-6]$'), 'p']):
    print(content.text)
This doesn't work for me in this case.
Also, this piece of code,
import requests
from bs4 import BeautifulSoup
url = "https://www.littlemag.org/2019/05/blog-post_60.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
print(soup.select_one(".post-body").get_text(strip=True, separator="\n"))
is not working for https://www.littlemag.org/, which is the homepage of the site we are dealing with.
It gives the error AttributeError: 'NoneType' object has no attribute 'get_text'.
What could be the reason, and how can I fetch the content along with the headings from the homepage as well, https://www.littlemag.org/ ?
To get the post text from this site you can use the following example:
import requests
from bs4 import BeautifulSoup
url = "https://www.littlemag.org/2019/05/blog-post_60.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
print(soup.select_one(".post-body").get_text(strip=True, separator="\n"))
Prints:
● ছবিতে - বাঙালির পাতের চির নবীন শুক্তো।
■ পদ্মপুরাণে বেহুলার বিয়ের নিরামিষ খাবারের মধ্যে শুক্তোর উল্লেখ পাওয়া যায়। ভারতচন্দ্রের অন্নদামঙ্গলেও বাইশ রকমের নিরামিষ পদের মধ্যে শুক্তুনিকে পাওয়া যায়।
মঙ্গলকাব্য ও বৈষ্ণবসাহিত্যে এই রান্নাটির বহুবার উল্লেখ পাওয়া যায়। কিন্তু বর্তমানে 'শুক্তো' বলতে যেমন উচ্ছে, করলা, পল্‌তা, নিম, সিম, বেগুন প্রভৃতি সবজির তিক্ত ব্যঞ্জনকে বোঝায়, প্রাচীনকালে তা ছিল না। একালের শুক্তোকে সেকালে 'তিতো' বলা হত।
সেকালে 'শুক্তা' রান্না করা হত- বেগুন, কাঁচা কুমড়ো, কাঁচকলা, মোচা এই সবজিগুলি গুঁড়ো বা বাটা মসলা অথবা বেসনের সঙ্গে বেশ ভালো করে মেখে বা নেড়ে নিয়ে ঘন 'পিঠালি' মিশিয়ে রান্না করা হত। পরে হিং, জিরা ও মেথি দিয়ে ঘিয়ে সাঁতলিয়ে নামাতে হত।
কিন্তু 'চৈতন্যচরিতামৃতে' সুকুতা, শুকুতা বা সুক্তা বলতে একধরণের শুকনো পাতাকে বলা হয়েছে। এটি ছিল আম-নাশক। সম্ভবত এটি ছিল শুকনো তিতো পাটপাতা। রাঘব পণ্ডিত মহাপ্রভুর জন্য নীলাচলে যেসব জিনিস নিয়ে গিয়েছিলেন তার মধ্যে এই দ্রব্যটিও ছিল।
আবার 'সুকুতা' বলতে সেই সময় শুকনো শাকের ব্যঞ্জনকেও বোঝাত।
বাঙালির চিরকালের পরিচয় ‘ভেতো বাঙালি’। অর্থাৎ যাদের প্রধান খাদ্য হলো ভাত। প্রাচীনকালে গরিব বাঙালির মুখে শোনা যেত দুঃখের কাঁদুনী, ‘হাড়িত ভাত নাহি নিতি আবেশী’ (চর্যাপদ)। মানে ‘ঘরে ভাত নেই তবু অতিথির আসা যাওয়ার কমতি নেই’। তবে ধনী-নির্ধন সব বাঙালির প্রিয় খাদ্য গরম ভাতে গাওয়া ঘি। যারা দিন আনে দিন খায়, তাঁদের চরম প্রাপ্তি হলো — পান্তা ভাতে বাইগন পোড়া। পণ্ডিতরা বলেন, প্রকৃত বাঙালির মনমতো খাবার ছিল কলাপাতায় ‘ওগ্গারা ভত্তা গাইক ঘিত্তা’, অর্থাৎ গাওয়া ঘি আর ফেনা ভাত। দুধ আর সরু চাল মিশিয়ে পায়েস বড়মানুষের প্রিয় খাদ্য।
...
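Regarding the follow-up: select_one returns None when nothing matches the selector, which is why the homepage raises AttributeError; the homepage apparently does not expose a .post-body element directly. One hedged workaround (a sketch, not verified against the live site) is to collect the individual post links from the homepage and apply the working .post-body selector to each post, pulling headings and paragraphs from inside it. If the h1-h6 search still finds nothing, the visible "headings" are probably styled <p> or <b> elements rather than real heading tags.

import re
import requests
from bs4 import BeautifulSoup

home = "https://www.littlemag.org/"
soup = BeautifulSoup(requests.get(home).content, "html.parser")

# collect links that look like individual blog posts, e.g. .../2019/05/blog-post_60.html
post_links = {a["href"] for a in soup.find_all("a", href=True)
              if re.search(r"littlemag\.org/\d{4}/\d{2}/.+\.html$", a["href"])}

for link in sorted(post_links):
    post = BeautifulSoup(requests.get(link).content, "html.parser")
    body = post.select_one(".post-body")
    if body is None:  # guard against pages without a post body
        continue
    print(link)
    # headings and paragraphs inside the post body, in document order
    for tag in body.find_all(["h1", "h2", "h3", "h4", "h5", "h6", "p"]):
        text = tag.get_text(strip=True)
        if text:
            print(text)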

Scraping each element from website with BeautifulSoup

I wrote code for scraping a real estate website. This is the link:
https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/
From this page I can only get the location, size and price of each apartment, but is it possible to write code that will go to the page of each apartment and scrape values from it, since it contains much more info? Check this link:
https://www.nekretnine.rs/stambeni-objekti/stanovi/arena-bulevar-arsenija-carnojevica-97m-2-lode-energoprojekt/NkvJK0Ou5tV/
I have posted my code below. I noticed that the URL changes when I click on a specific listing. For example:
arena-bulevar-arsenija-carnojevica-97m-2-lode-energoprojekt/NkvJK0Ou5tV/
I thought about creating a for loop, but there is no way to know how the URL changes because it has some ID at the end:
NkvJK0Ou5tV
This is the code that I have:
from bs4 import BeautifulSoup
import requests

website = "https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/"
soup = requests.get(website).text
my_html = BeautifulSoup(soup, 'lxml')

lokacija = my_html.find_all('p', class_='offer-location text-truncate')
ukupna_kvadratura = my_html.find_all('p', class_='offer-price offer-price--invert')
ukupna_cena = my_html.find_all('div', class_='d-flex justify-content-between w-100')
ukupni_opis = my_html.find_all('div', class_='mt-1 mb-1 mt-lg-0 mb-lg-0 d-md-block offer-meta-info offer-adress')

for lok, kvadratura, cena_stana, sumarno in zip(lokacija, ukupna_kvadratura, ukupna_cena, ukupni_opis):
    lok = lok.text.split(',')[0]  # location
    kv = kvadratura.span.text.split(' ')[0]  # size
    jed = kvadratura.span.text.split(' ')[1]  # unit of measure
    cena = cena_stana.span.text  # price
    sumarno = sumarno.text
    datum = sumarno.split('|')[0].strip()
    status = sumarno.split('|')[1].strip()
    opis = sumarno.split('|')[2].strip()
    print(lok, kv, jed, cena, datum, status, opis)
You can get the href from the div with class="placeholder-preview-box ratio-4-3".
From there you can build the URL of each listing.
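A minimal sketch of that idea (the class name comes from the tip above; the assumption that each preview box wraps an <a> tag pointing to the listing's detail page is mine):

from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

website = "https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/"
my_html = BeautifulSoup(requests.get(website).text, 'lxml')

# collect the detail-page links from each listing's preview box
detail_urls = []
for box in my_html.find_all('div', class_='placeholder-preview-box ratio-4-3'):
    a = box.find('a', href=True)
    if a:
        detail_urls.append(urljoin(website, a['href']))  # handles relative hrefs

# visit each detail page and scrape whatever extra fields are needed from it
for url in detail_urls:
    detail_html = BeautifulSoup(requests.get(url).text, 'lxml')
    title = detail_html.title.get_text(strip=True) if detail_html.title else ''
    print(url, title)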
You can iterate over the links provided by the pagination at the bottom of the page:
from bs4 import BeautifulSoup as soup
import requests

d = soup(requests.get('https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/').text, 'html.parser')

def scrape_page(page):
    return [{'title': i.h2.get_text(strip=True),
             'loc': i.p.get_text(strip=True),
             'price': i.find('p', {'class': 'offer-price'}).get_text(strip=True)}
            for i in page.find_all('div', {'class': 'row offer'})]

result = [scrape_page(d)]
while d.find('a', {'class': 'pagination-arrow arrow-right'}):
    d = soup(requests.get(f'https://www.nekretnine.rs{d.find("a", {"class": "pagination-arrow arrow-right"})["href"]}').text, 'html.parser')
    result.append(scrape_page(d))
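result holds one list of dicts per page; if a single flat list of listings is easier to work with, it can be flattened afterwards:

# flatten the per-page lists into a single list of listing dicts
listings = [listing for page in result for listing in page]
print(len(listings), 'listings scraped')
print(listings[:3])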

Certain content not loading when scraping a site with Beautiful Soup

I'm trying to scrape the ratings off recipes on NYT Cooking but having issues getting the content I need. When I look at the source on the NYT page, I see the following:
<div class="ratings-rating">
<span class="ratings-header ratings-content">194 ratings</span>
<div class="ratings-stars-wrap">
<div class="ratings-stars ratings-content four-star-rating avg-rating">
The content I'm trying to pull out is 194 ratings and four-star-rating. However, when I pull in the page source via Beautiful Soup I only see this:
<div class="ratings-rating">
<span class="ratings-header ratings-content"><%= header %></span>
<div class="ratings-stars-wrap">
<div class="ratings-stars ratings-content <%= ratingClass %> <%= state %>">
The code I'm using is:
url = 'https://cooking.nytimes.com/recipes/1020049-lemony-chicken-soup-with-fennel-and-dill'
r = get(url, headers = headers, timeout=15)
page_soup = soup(r.text,'html.parser')
Any thoughts why that information isn't pulling through?
Try using the code below:
import requests
import lxml
from lxml import html
import re
url = "https://cooking.nytimes.com/recipes/1019706-spiced-roasted-cauliflower-with-feta-and-garlic?action=click&module=Recirculation%20Band%20Recipe%20Card&region=More%20recipes%20from%20Alison%20Roman&pgType=recipedetails&rank=1"
r = requests.get(url)
tree = html.fromstring(r.content)
t = tree.xpath('/html/body/script[14]')[0]
# look for value for bootstrap.recipe.avg_rating
m = re.search("bootstrap.recipe.avg_rating = ", t.text)
colon = re.search(";", t.text[m.end()::])
rating = t.text[m.end():m.end()+colon.start()]
print(rating)
# look for value for bootstrap.recipe.num_ratings =
n = re.search("bootstrap.recipe.num_ratings = ", t.text)
colon2 = re.search(";", t.text[n.end()::])
star = t.text[n.end():n.end()+colon2.start()]
print(star)
It is much easier to use attribute = value selectors to grab it from the span with class ratings-metadata:
import requests
from bs4 import BeautifulSoup
data = requests.get('https://cooking.nytimes.com/recipes/1020049-lemony-chicken-soup-with-fennel-and-dill')
soup = BeautifulSoup(data.content, 'lxml')
rating = soup.select_one('[itemprop=ratingValue]').text
ratingCount = soup.select_one('[itemprop=ratingCount]').text
print(rating, ratingCount)
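select_one returns None when the selector matches nothing (for example if the markup changes or the values are injected client-side), so a slightly more defensive sketch of the same approach might look like this:

import requests
from bs4 import BeautifulSoup

data = requests.get('https://cooking.nytimes.com/recipes/1020049-lemony-chicken-soup-with-fennel-and-dill')
soup = BeautifulSoup(data.content, 'lxml')

rating = soup.select_one('[itemprop=ratingValue]')
rating_count = soup.select_one('[itemprop=ratingCount]')
if rating is None or rating_count is None:
    # markup changed or the values are rendered by JavaScript
    print('rating markup not found')
else:
    print(rating.text, rating_count.text)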

How can I add the main domain to sub-links, so that the links can be opened?

This is my code, which produces a list of news links from an HTML page. The links contain only the resource path and parameters; I want to include the main domain name so that each link is usable.
import requests
from bs4 import BeautifulSoup

def get_cric_info_articles():
    cricinfo_article_link = "http://www.espncricinfo.com/ci/content/story/news.html"
    r = requests.get(cricinfo_article_link)
    cricinfo_article_html = r.text
    soup = BeautifulSoup(cricinfo_article_html, "html.parser")
    # print(soup.prettify())
    cric_info_items = soup.find_all("h2", {"class": "story-title"})
    cricinfo_article_dict = {}
    for div in cric_info_items:
        cricinfo_article_dict[div.find('a').string] = div.find('a')['href']
    return cricinfo_article_dict

print(get_cric_info_articles())
What I'm getting: {'Bell-Drummond leads MCC in curtain-raiser': '/ci/content/story/1135157.html', 'Scotland pick Brad Wheal, Chris Sole for World Cup qualifiers': '/scotland/content/story/1135152.html', 'Newlands working to be water independent': '/southafrica/content/story/1135120.html'}
I'm trying to attach '/ci/content/story/1135157.html' to http://www.espncricinfo.com/ so the final link will be http://www.espncricinfo.com/ci/content/story/1135157.html. How can I do this? Sorry for the long post.
Changes I made:
for div in cric_info_items:
    a = div.find('a')['href']
    b = 'http://www.espncricinfo.com/'
    c = urljoin(b, a)
    cricinfo_article_dict[div.find('a').string] = c
You can use the urllib.parse module for this:
from urllib.parse import urljoin
urljoin('http://www.espncricinfo.com/', '/ci/content/story/1135157.html')
Hope it helps.
...
# if the protocol is not specified in the link, assume it's relative
for div in cric_info_items:
    url = div.find('a')['href']
    if "://" not in url:
        url = cricinfo_article_link + url
    cricinfo_article_dict[div.find('a').string] = url
...
or, using dict comprehension:
return {
    div.find('a').string: ("" if "://" in div.find('a')['href'] else cricinfo_article_link) + div.find('a')['href']
    for div in soup.find_all("h2", {"class": "story-title"})
}
Upd: a potential edge case is links starting with //, e.g. //google.com/?q=foo. This type of link is sometimes used for resources (CSS and JavaScript) and is not common for external links. However, you might want to handle this as well.
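For what it's worth, urljoin from the other answer already handles that edge case, as well as absolute paths; a quick illustration:

from urllib.parse import urljoin

# a protocol-relative link inherits the scheme of the base URL
print(urljoin('http://www.espncricinfo.com/', '//google.com/?q=foo'))
# http://google.com/?q=foo

# an absolute path is resolved against the domain, not the full page URL
print(urljoin('http://www.espncricinfo.com/ci/content/story/news.html', '/ci/content/story/1135157.html'))
# http://www.espncricinfo.com/ci/content/story/1135157.html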
