How to scrape src from img html in python

I'm trying to scrape the src of one particular img, but the code I found returns many img src values and not the one I want. I can't figure out what I am doing wrong. I am scraping this TripAdvisor page: "https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html"
So this is the HTML snippet I'm trying to extract from:
<div class="restaurants-detail-overview-cards-LocationOverviewCard__cardColumn--2ALwF"><h6>Placering og kontaktoplysninger</h6><span><div><span data-test-target="staticMapSnapshot" class=""><img class="restaurants-detail-overview-cards-LocationOverviewCard__mapImage--22-Al" src="https://trip-raster.citymaps.io/staticmap?scale=1&zoom=15&size=347x137&language=da&center=55.687988,12.596316&markers=icon:http%3A%2F%2Fc1.tacdn.com%2F%2Fimg2%2Fmaps%2Ficons%2Fcomponent_map_pins_v1%2FR_Pin_Small.png|55.68799,12.596316"></span></div></span>
I want the code to return this substring of the src:
55.68799,12.596316
I have tried:
import pandas as pd
pd.options.display.max_colwidth = 200
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
import re
web_url = "https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html"
url = urlopen(web_url)
url_html = url.read()
soup = bs(url_html, 'lxml')
soup.find_all('img')
for link in soup.find_all('img'):
    print(link.get('src'))
The output is along these lines, but it does not include the src that I need:
https://static.tacdn.com/img2/branding/rebrand/TA_logo_secondary.svg
https://static.tacdn.com/img2/branding/rebrand/TA_logo_primary.svg
https://static.tacdn.com/img2/branding/rebrand/TA_logo_secondary.svg



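You can do this with just requests and re. Only the coordinates part of the src varies with the location, and the same coordinates appear in the page source as a "coords" value, so you can extract them with a regex and rebuild the src around them.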
import requests, re
p = re.compile(r'"coords":"(.*?)"')
r = requests.get('https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html')
coords = p.findall(r.text)[1]
src = f'https://trip-raster.citymaps.io/staticmap?scale=1&zoom=15&size=347x137&language=da&center={coords}&markers=icon:http://c1.tacdn.com//img2/maps/icons/component_map_pins_v1/R_Pin_Small.png|{coords}'
print(src)
print(coords)
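If you already have the src string in hand (for example, copied from the HTML snippet in the question), the coordinates can also be pulled out with the standard library alone. The following is a minimal sketch, not part of the original answer: the helper name coords_from_src is made up for illustration, and it assumes the coordinates always follow the last "|" in the markers parameter, as they do in the question's URL.

```python
from urllib.parse import urlsplit, parse_qs


def coords_from_src(src):
    """Return the 'lat,lon' text after the last '|' in the markers parameter, or None."""
    query = parse_qs(urlsplit(src).query)
    # parse_qs also decodes percent-escapes such as %3A and %2F
    markers = query.get("markers", [""])[0]
    if "|" not in markers:
        return None
    return markers.rsplit("|", 1)[1]


# src copied from the img tag shown in the question
src = (
    "https://trip-raster.citymaps.io/staticmap?scale=1&zoom=15&size=347x137"
    "&language=da&center=55.687988,12.596316"
    "&markers=icon:http%3A%2F%2Fc1.tacdn.com%2F%2Fimg2%2Fmaps%2Ficons"
    "%2Fcomponent_map_pins_v1%2FR_Pin_Small.png|55.68799,12.596316"
)
print(coords_from_src(src))  # prints 55.68799,12.596316
```

Parsing the query string instead of splitting the whole URL keeps the helper from tripping over a "|" that might appear elsewhere in the address.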

Selenium is a workaround. I tested it and it works like a charm. Here you are:
from selenium import webdriver
driver = webdriver.Chrome('chromedriver.exe')
driver.get("https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html")
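# select every element that has a src attribute
# (Selenium 3 API; in Selenium 4 use driver.find_elements(By.XPATH, ...))
links = driver.find_elements_by_xpath("//*[@src]")
urls = []
for link in links:
    url = link.get_attribute('src')
    if '|' in url:
        urls.append(url.split('|')[1]) # saves in a list only the numbers you want i.e. 55.68799,12.596316
        print(url)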
print(urls)
Result of above
['55.68799,12.596316']
If you haven't used Selenium before, you can download a ChromeDriver here: https://chromedriver.storage.googleapis.com/index.html?path=2.46/
or here:
https://sites.google.com/a/chromium.org/chromedriver/downloads

Related

How do I webscrape nested lists in Python?

Link of website: https://www.zivame.com/rosaline-chromaticity-knit-cotton-top-florida-key.html?trksrc=category&trkid=search&trkorder=relevance
What I want to scrape: Short sleeves style, Relaxed fit for comfort
(Basically the bullet points under Description)
This is the code I'm using currently:
from selenium import webdriver
import re
from bs4 import BeautifulSoup
import requests
result = requests.get("https://www.zivame.com/rosaline-chromaticity-knit-cotton-top-florida-key.html?trksrc=category&trkid=search&trkorder=relevance")
soup = BeautifulSoup(result.text, 'lxml')
page = soup.find('div', id="product-page")
description = page.find('div', id="product-basicdetail")
point1 = description.find('div', id="ff-rm text-size pd-b5")
print(point1)
The data is embedded in the page as JSON-like data inside a script tag, so you can scrape it directly from the page source.
import requests
from lxml import html
r = requests.get('https://www.zivame.com/rosaline-chromaticity-knit-cotton-top-florida-key.html?trksrc=category&trkid=search&trkorder=relevance')
source_page = html.fromstring(r.text)
json_value = source_page.xpath("//script[contains(.,'window.__product=')]/text()")[0]
json_value = json_value.split("{features:{values:[{list:[")[1].split("]}],count:1}}},modelMetaData:")[0]
print(json_value.split(','))
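Splitting on "," works only as long as no bullet point itself contains a comma. A hedged alternative, assuming the extracted fragment is a comma-separated run of double-quoted strings (which is what the split above appears to produce), is to wrap it in brackets and let json parse it. The script text below is a made-up stand-in for the real page, and extract_features is a hypothetical helper name.

```python
import json

# Markers taken from the answer above; they depend on the page's current layout.
START = "{features:{values:[{list:["
END = "]}],count:1}}},modelMetaData:"


def extract_features(script_text):
    """Cut out the quoted bullet points and parse them as a JSON array."""
    inner = script_text.split(START)[1].split(END)[0]
    return json.loads("[" + inner + "]")


# Made-up script content that mimics the structure the answer relies on.
sample_script = (
    'window.__product={name:"top",description:'
    '{features:{values:[{list:["Short sleeves style","Relaxed fit, for comfort"]}],count:1}}},'
    'modelMetaData:{}}'
)

features = extract_features(sample_script)
print(features)
```

Because json.loads respects the quotes, the comma inside the second made-up bullet does not split it in two.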

Beautiful Soup and Selenium cannot scrape website contents

So I am trying to scrape the contents of a webpage. Initially I tried to use BeautifulSoup, but I was unable to grab the contents because they are loaded dynamically.
After reading around, I tried Selenium based on people's suggestions, but I'm still unable to grab the contents. The scraped content is the same as with BeautifulSoup.
Is it just not possible to scrape the contents of this webpage? (e.g. https://odb.org/TW/2021/08/11/accessible-to-all)
import datetime as d
import requests
from bs4 import BeautifulSoup as bs
# BeautifulSoup Implementation
def devo_scrap():
    full_date = d.date.today()
    string_date = str(full_date)
    format_date = string_date[0:4] + '/' + string_date[5:7] + '/' + string_date[8:]
    url = "https://odb.org/" + format_date
    r = requests.get(url)
    soup = bs(r.content, 'lxml')
    return soup
print(devo_scrap())
The above is the BeautifulSoup implementation. Does anyone have any suggestions? Is it just not possible to scrape? Thanks in advance.
(Updated with Selenium Implementation)
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import datetime as d
PATH = ''  # <chrome driver path>
driver = webdriver.Chrome(PATH)
full_date = d.date.today()
string_date = str(full_date)
format_date = string_date[0:4] + '/' + string_date[5:7] + '/' + string_date[8:]
url = "https://odb.org/" + format_date
content = driver.get(url)
print(content)
The content (html) grabbed with selenium is the same as with BeautifulSoup.
You can simply do:
source = driver.page_source
to get the page source using Selenium (driver.get() itself returns None, so its return value is not the page content). Then parse that source with BeautifulSoup as usual:
source = BeautifulSoup(source,"lxml")
Complete code with some improvements:
from selenium import webdriver
from datetime import datetime
import time
from bs4 import BeautifulSoup
now = datetime.today()
format_date= now.strftime("%Y/%m/%d")
driver = webdriver.<>(executable_path=r'<>')
url = "https://odb.org/" + format_date
driver.get(url)
time.sleep(10)
# To load page completely.
content=BeautifulSoup(driver.page_source,"lxml")
print(content)
# Title :
print(content.find("h1",class_="devo-title").text)
# Content :
print(content.find("article",class_="content").text)
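The strftime call above replaces the manual string slicing from the question. As a small, network-free sketch of just that step (devotional_url is a hypothetical helper name, and the fixed date is only for illustration):

```python
from datetime import date

BASE_URL = "https://odb.org/"


def devotional_url(day):
    """Build the dated URL from a date object using strftime."""
    return BASE_URL + day.strftime("%Y/%m/%d")


def devotional_url_sliced(day):
    """The question's slicing approach, kept here only for comparison."""
    s = str(day)  # ISO format, e.g. '2021-08-11'
    return BASE_URL + s[0:4] + "/" + s[5:7] + "/" + s[8:]


example_day = date(2021, 8, 11)
url = devotional_url(example_day)
print(url)  # https://odb.org/2021/08/11
```

Both helpers produce the same string; strftime simply states the intended format in one place.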
The data comes from an API call that returns a list containing a single dictionary. Some of the values in the dictionary are HTML, so you will need an HTML parser to extract the info. You might choose to do this based on the associated keys. For now, to demo the contents, I have used a simple test of whether the value is a string that starts with "<". You should consider whether something more robust is required.
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://api.experience.odb.org/devotionals/v2?site_id=1&status=publish&country=TW&on=08-11-2021', headers = {'User-Agent':'Mozilla/5.0'})
data = r.json()[0]
for k, v in data.items():
    print(k + ' :')
    if isinstance(v, str):
        if v.startswith('<'):
            # value looks like HTML: parse it and print only the text
            soup = bs(v, 'html.parser')
            print(soup.get_text(' '))
        else:
            print(v)
    else:
        print(v)
    print()
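One way to make the "is this value HTML?" check a little more explicit is to isolate it in a helper. The sketch below uses only the standard library's html.parser rather than BeautifulSoup, and runs on a made-up dictionary rather than the real API response; the names TextCollector and to_text are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


def to_text(value):
    """Strip tags from strings that look like HTML; return anything else unchanged."""
    if isinstance(value, str) and value.lstrip().startswith("<"):
        collector = TextCollector()
        collector.feed(value)
        collector.close()
        return " ".join(collector.parts)
    return value


# Made-up stand-in for the single dictionary returned by the API.
sample = {"title": "Example title", "content": "<p>Hello <b>world</b></p>", "id": 7}
cleaned = {key: to_text(value) for key, value in sample.items()}
print(cleaned)
```

The startswith("<") test is still a heuristic; selecting fields by key, as the answer suggests, would be the stricter option.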

unable to Webscrape dropdown item [Python][beautifulsoup]

I am new to web scraping, and I am scraping this website: https://www.valueresearchonline.com/funds/22/uti-mastershare-fund-regular-plan/
From it, I want to scrape this text: Regular Plan
I wrote the code below based on what I see in Inspect Element.
Code:
import requests
from bs4 import BeautifulSoup
import csv
import sys
url = 'https://www.valueresearchonline.com/funds/newsnapshot.asp?schemecode=22'
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
regular_direct = soup.find('span',class_="filter-option pull-left").text
print(regular_direct)
I get None when printing and I don't know why. The code in Inspect Element and in View Page Source is also different: in View Page Source, this span and class are not there.
Why am I getting None? Can anyone please tell me how I can get that text, and why the Inspect Element code and the View Page Source code are different?
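You need to change the selector, because the HTML source that requests downloads is different from what you see in Inspect Element, which shows the page after the browser has run its JavaScript.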
import requests
from bs4 import BeautifulSoup
import csv
import sys
url = 'https://www.valueresearchonline.com/funds/newsnapshot.asp?schemecode=22'
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
regular_direct = soup.find("select", {"id":"select-plan"}).find("option",{"selected":"selected"}).get_text(strip=True)
print(regular_direct)
Output:
Regular plan
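The same lookup (the selected option inside the select with id "select-plan") can be tried offline on a hand-written HTML fragment. This is only a sketch, not the real page: it uses the standard library's html.parser so that it runs without extra packages, and the SelectedOptionParser class and the fragment's second option are assumptions for illustration.

```python
from html.parser import HTMLParser


class SelectedOptionParser(HTMLParser):
    """Finds the text of the selected <option> inside a given <select>."""

    def __init__(self, select_id):
        super().__init__()
        self.select_id = select_id
        self.in_select = False
        self.in_selected_option = False
        self.selected_text = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" and attrs.get("id") == self.select_id:
            self.in_select = True
        elif tag == "option" and self.in_select and "selected" in attrs:
            self.in_selected_option = True

    def handle_endtag(self, tag):
        if tag == "select":
            self.in_select = False
        elif tag == "option":
            self.in_selected_option = False

    def handle_data(self, data):
        if self.in_selected_option and data.strip():
            self.selected_text = data.strip()


# Hand-written fragment mimicking the downloaded (pre-JavaScript) source.
fragment = """
<select id="select-plan">
  <option value="1" selected="selected">Regular plan</option>
  <option value="2">Direct plan</option>
</select>
"""

parser = SelectedOptionParser("select-plan")
parser.feed(fragment)
print(parser.selected_text)  # Regular plan
```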

Python: How do I get <div> background-image url value

Currently I'm using the code below to get all img tags on the page.
Can I somehow access the background-image url value?
import random
import urllib.request
import requests
from bs4 import BeautifulSoup
#1. Reading pages
source = requests.get("https://www.dummie-website/photos_all").text
soup = BeautifulSoup(source,'lxml')
#2. Getting every 'src' value from all img tags
match = [x['src'] for x in soup.findAll('img', {'class': ''})]
for i in match:
    print(i)

Extracting count of link Images

I am trying to find the number of images (extensions .jpg, .png, .jpeg) linked in a page using Python. I can use any library, such as BeautifulSoup, but how do I do it?
I am using the following code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('HTMLS%5C110k_Source.htm'), "html.parser")
img_links = len(soup.find_all('.jpg'))
print("Number of Images : ", img_links)
But all in vain.
You can try to use lxml.html as below:
from lxml import html
with open('HTMLS%5C110k_Source.htm', 'r') as f:
    source = html.fromstring(f.read())
# count img elements whose src contains one of the wanted extensions
print(len(source.xpath('//img[contains(@src, ".jpg") or contains(@src, ".jpeg") or contains(@src, ".png")]')))
This is as easy as writing a loop if you read the docs
import bs4
import requests
url = 'somefoobar.net'
page = requests.get(url).text
soup = bs4.BeautifulSoup(page, 'lxml')
images = soup.findAll('img')
# file extensions to count
file_types = ('jpg', 'jpeg', 'png')
# loop through all img elements found and store the urls with matching extensions
urls = [x['src'] for x in images if x.get('src', '').split('.')[-1] in file_types]
print(urls)
print(len(urls))
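Splitting on "." misses a src that carries a query string or an upper-case extension. A hedged sketch of a stricter check, using only the standard library on a made-up list of src values (is_image_src and the sample list are hypothetical, not taken from the page in the question):

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def is_image_src(src):
    """True if the path part of src ends in one of the wanted extensions."""
    path = urlsplit(src).path  # drops any ?query and #fragment
    return PurePosixPath(path).suffix.lower() in IMAGE_EXTENSIONS


# Made-up src values of the kind an img tag might carry.
srcs = [
    "a.jpg",
    "/img/b.PNG?x=1",
    "c.gif",
    "https://example.com/d.jpeg#frag",
    "noext",
]

count = sum(1 for src in srcs if is_image_src(src))
print("Number of Images :", count)  # Number of Images : 3
```

The same predicate could be applied to the src values collected by either answer above before counting them.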
