BeautifulSoup not extracting all pages - Python

I'm trying to practice some web scraping for a school project, but I can't figure out why my script isn't pulling all the listings for a particular region. Would appreciate any help! I've been trying to figure it out for hours!
(For simplicity, I'm just sharing one small sub-section of a page I'm trying to scrape. I'm hoping that once I figure out what's wrong here, I can apply it to other regions.)
(You might need to create an account and log in to see prices before scraping.)
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://condos.ca')
def get_page(page):
    url = f'https://condos.ca/toronto/condos-for-sale?size_range=300%2C999999999&property_type=Condo%20Apt%2CComm%20Element%20Condo%2CLeasehold%20Condo&mode=Sold&end_date_unix=exact%2C2011&sublocality_id=22&page={page}'
    driver.get(url)
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'lxml')
    return soup
prices = []
location = []
for page in range(5):
    soup = get_page(page)
    for tag in soup.find_all('div', class_='styles___AskingPrice-sc-54qk44-4 styles___ClosePrice-sc-54qk44-5 dHPUdq hwkkXU'):
        prices.append(tag.get_text())
    for tag in soup.find_all('address', class_='styles___Address-sc-54qk44-13 gTwVlm'):
        location.append(tag.get_text())
For some reason, I'm only getting an output with 48 records, when it should be around 146.
Thanks!
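Two likely culprits: the listings are rendered by JavaScript, so page_source can be captured before all the cards exist, and those hashed styled-components class names (styles___AskingPrice-sc-54qk44-4 ...) change whenever the site is rebuilt, so an exact-match find_all silently misses any redesigned cards. Below is a minimal sketch that waits for the cards and matches on a stable substring of the class names instead; the substring selectors are assumptions based on the class names in your code:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
base = ('https://condos.ca/toronto/condos-for-sale?size_range=300%2C999999999'
        '&property_type=Condo%20Apt%2CComm%20Element%20Condo%2CLeasehold%20Condo'
        '&mode=Sold&end_date_unix=exact%2C2011&sublocality_id=22')
prices, location = [], []
for page in range(1, 6):   # listing pages are usually 1-indexed, not 0-indexed
    driver.get(f'{base}&page={page}')
    # block until at least one price element has actually been rendered
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'div[class*="AskingPrice"]')))
    soup = BeautifulSoup(driver.page_source, 'lxml')
    # match on the stable substring of the hashed class names
    for tag in soup.select('div[class*="AskingPrice"]'):
        prices.append(tag.get_text())
    for tag in soup.select('address[class*="Address"]'):
        location.append(tag.get_text())
print(len(prices), len(location))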


BeautifulSoup object looks totally different from what I see in Chrome

This is my first attempt at scraping a React-based dynamic website: the Booking.com search results page. I want to collect the current price of specific hotels under the same conditions.
This site used to be easy to scrape with simple CSS selectors, but they have since changed the markup: as far as I can see in the Chrome dev tools, every element I want is now described by a "data-testid" attribute and a series of seemingly random class names. The code I wrote before no longer works, and I need to rewrite it.
Yesterday I learned from another question that, in this case, what I see in the Chrome developer tools can differ from the HTML contents of the soup object. So I tried printing the whole soup object first to check the actual CSS, then selecting elements using those CSS classes. I also made sure to use Selenium to capture the JS-rendered data.
At first this looked good; however, the returned soup object was totally different from what I see. For instance, the request URL should return a hotel called "cup of tea ensemble" at the top of the list, with the price for 4 adults for 1 night from 2022-12-22, as specified in the URL params. But in the soup object that hotel does not come first, and most of the parameters I added to the URL are ignored.
Does this usually happen when scraping React-based websites? If so, how can I avoid it and collect the data as I see it in the browser?
I am not sure if this helps, but I am attaching the code I currently use. Thank you for reading, and I appreciate any advice!
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
booking_url = 'https://www.booking.com/searchresults.ja.html?lang=ja&ss=Takayama&dest_id=6411914&dest_type=hotel&checkin=2022-12-22&checkout=2022-12-23&group_adults=4&no_rooms=1&group_children=0&sb_travel_purpose=leisure'
#booking_url = 'https://www.google.co.jp/'
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(executable_path='./chromedriver', chrome_options=options)
driver.get(booking_url)
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, 'html.parser')
print(soup)
The code below produces exactly the output that the browser displays:
import time
from bs4 import BeautifulSoup
import pandas as pd
from selenium.webdriver.chrome.service import Service
from selenium import webdriver
from selenium.webdriver.common.by import By
webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
driver.get('https://www.booking.com/searchresults.ja.html?label=gen173nr-1FCAQoggJCDWhvdGVsXzY0MTE5MTRIFVgEaFCIAQGYARW4ARfIAQzYAQHoAQH4AQOIAgGoAgO4AuiOk5wGwAIB0gIkZjkzMzFmMzQtZDk1My00OTNiLThlYmYtOGFhZWM5YTM2OTIx2AIF4AIB&aid=304142&lang=ja&dest_id=6411914&dest_type=hotel&checkin=2022-12-22&checkout=2022-12-23&group_adults=4&no_rooms=1&group_children=0&sb_travel_purpose=leisure&offset=0')
#driver.maximize_window()
time.sleep(5)
soup = BeautifulSoup(driver.page_source,"lxml")
for u in soup.select('div[data-testid="property-card"]'):
    title = u.select_one('div[class="fcab3ed991 a23c043802"]').get_text(strip=True)
    print(title)
    #price = u.select_one('span[class="fcab3ed991 bd73d13072"]').get_text(strip=True)
    #print(price)
Output:
cup of tea ensemble
FAV HOTEL TAKAYAMA
ワットホテル&スパ飛騨高山
飛騨高山温泉 高山グリーンホテル
岡田旅館 和楽亭
ザ・町家ホテル高山
旅館あすなろ
IORI STAY
風屋
Tabist 風雪
飛騨高山 本陣平野屋 花兆庵
Thanyaporn Hotel
旅館むら山
cup of tea
つゆくさ
旅館 一の松
龍リゾート&スパ
飛騨高山の宿 本陣平野屋 別館
飛騨牛専門 旅館 清龍
Tomato Takayama Station
民宿 和屋
ビヨンドホテル 高山 セカンド
旅館 岐山
Utatei
ビヨンドホテル高山1s
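For what it's worth, the two visible differences between the failing snippet and the working one are that the first ran Chrome headless and parsed page_source immediately, while the second ran a visible browser and slept 5 seconds before parsing, giving the React app time to render. If you need to stay headless, here is a sketch that waits explicitly for the property cards instead of using a fixed sleep; the data-testid selector is taken from the working code above, and --headless=new assumes a recent Chrome:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

booking_url = 'https://www.booking.com/searchresults.ja.html?lang=ja&ss=Takayama&dest_id=6411914&dest_type=hotel&checkin=2022-12-22&checkout=2022-12-23&group_adults=4&no_rooms=1&group_children=0&sb_travel_purpose=leisure'
options = Options()
options.add_argument('--headless=new')          # newer headless mode renders closer to a real browser
options.add_argument('--window-size=1920,1080')
driver = webdriver.Chrome(service=Service('./chromedriver'), options=options)
driver.get(booking_url)
# wait until the JS-rendered result cards exist instead of sleeping blindly
WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div[data-testid="property-card"]')))
soup = BeautifulSoup(driver.page_source, 'lxml')
for card in soup.select('div[data-testid="property-card"]'):
    title_tag = card.select_one('div[class="fcab3ed991 a23c043802"]')
    if title_tag:
        print(title_tag.get_text(strip=True))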

I'm trying to scrape the duration of TikTok videos but I'm getting 'None'

I want to scrape the duration of TikTok videos for an upcoming project, but my code isn't working.
import requests
from bs4 import BeautifulSoup
content = requests.get('https://vm.tiktok.com/ZMFFKmx3K/').text
soup = BeautifulSoup(content, 'lxml')
data = soup.find('div', class_="tiktok-1g3unbt-DivSeekBarTimeContainer e123m2eu1")
print(data)
Using an example TikTok video.
I would think this should work; could anyone help?
If you turn off JavaScript and then inspect the element in the Chrome devtools, you will see the value is a placeholder like 00/000; when JS is on and the video is playing, the duration counts up until the video finishes. So the real duration value of that element depends on JS, and you have to use an automation tool such as Selenium to grab that dynamic value. How much of the duration you scrape depends on time.sleep() if you are using Selenium; if time.sleep() is longer than the video length, it will show None and raise a TypeError.
Example:
import time
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
url ='https://vm.tiktok.com/ZMFFKmx3K/'
driver.get(url)
driver.maximize_window()
time.sleep(25)
soup = BeautifulSoup(driver.page_source,"lxml")
data = soup.find('div', class_="tiktok-1g3unbt-DivSeekBarTimeContainer e123m2eu1")
print(data.text)
Output:
00:25/00:28
The associated ID is likely randomized. Try using a regex to get the element by a class ending in 'TimeContainer' plus some other ID:
import requests
from bs4 import BeautifulSoup
import re
content = requests.get('https://vm.tiktok.com/ZMFFKmx3K/').text
soup = BeautifulSoup(content, 'lxml')
data = soup.find('div', {'class': re.compile(r'TimeContainer.*$')})
print(data)
Your next issue is that the page loads before the video, so you'll get 0/0 for the time. Try Selenium instead so you can add timed waits for loading.
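Combining the two answers above, Selenium for the JS-rendered value and a regex so the hashed part of the class name doesn't matter, a minimal sketch:
import re
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service('./chromedriver'))  # your chromedriver path
driver.get('https://vm.tiktok.com/ZMFFKmx3K/')
time.sleep(10)   # crude wait for the player to start; tune as needed
soup = BeautifulSoup(driver.page_source, 'lxml')
# match any div whose class mentions TimeContainer, ignoring the hashed prefix
data = soup.find('div', {'class': re.compile(r'TimeContainer')})
print(data.text if data else 'duration element not found')
driver.quit()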

BeautifulSoup Google Trends

I am new to the programming world and have just finished learning the very basics of Python. I have just started practicing web crawling and have already run into a problem. I have written a very simple script using BeautifulSoup.
from bs4 import BeautifulSoup
from urllib.request import urlopen
response = urlopen('https://trends.google.com/trends/?geo=US/')
soup = BeautifulSoup(response, 'html.parser')
for anchor in soup.select(".list-item-title"):
    print(anchor)
I want to retrieve the names of the recently trending stories; however, the code above is not functioning as it's supposed to and returns a blank.
I would be grateful if someone could point out the error. Thank you!
Google Trends (the URL above) is dynamic, meaning the data is generated by JavaScript, and BeautifulSoup can't render JavaScript. So you need an automation tool such as Selenium along with BeautifulSoup. Please just run the code.
Script:
from bs4 import BeautifulSoup
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
url = 'https://trends.google.com/trends/?geo=US/'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
time.sleep(8)
driver.get(url)
time.sleep(10)
soup = BeautifulSoup(driver.page_source, 'html.parser')
#driver.close()
for anchor in soup.select(".list-item-title"):
    print(anchor.text)
Output:
Meta
Lupus
Liga Europy
Chelsea
Prova do lider
Masks in Kenya
UANL
Jussie Smollett
ישי ריבו
Winter storm warning
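As an aside, if all you need is the list of trending searches, the unofficial pytrends package (pip install pytrends) can fetch them without driving a browser. A rough sketch; since pytrends scrapes unofficial endpoints, its API can break without notice:
from pytrends.request import TrendReq

pytrends = TrendReq(hl='en-US', tz=360)
# returns a pandas DataFrame of current trending searches for the given region
trending = pytrends.trending_searches(pn='united_states')
print(trending.head(10))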

Download a search result from Twitter using webscraping

I am new to Python and even newer to web scraping, so my question might be very basic.
I am trying to use web scraping to download some results from Twitter searches. I have already worked out how the search URLs are constructed, so I am accessing those URLs directly.
I expect most of my searches to return no results, and in those cases that is the information I would like to extract. I am going to use an example search that returns no results. The URL would be:
https://twitter.com/search?q=%22John%20Doe%22%20stackexchange%20trial&src=typed_query
That returns a page with the message 'No results for "John Doe" stackexchange trial', and that is the text I am trying to extract. But there is something in my code which is not working.
The code I am trying is the following:
import os
os.chdir('my.dir')

from bs4 import BeautifulSoup
import urllib.request
urlpage = "https://twitter.com/search?q=%22John%20Doe%22%20stackexchange%20trial&src=typed_query"
page = urllib.request.urlopen(urlpage)
soup = BeautifulSoup(page, 'html.parser')
results = soup.find("main", class_="css-1dbjc4n r-16y2uox r-1wbh5a2")
text_element = results.find("span", class_="css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0")
text = text_element.text
print(text)
I believe the problem is in how I define "results": it is not finding what I want.
I got to that version of the code by analogy with code that does actually work, which is the following:
urlpage="https://stackexchange.com/"
page = urllib.request.urlopen(urlpage)
soup = BeautifulSoup(page, 'html.parser')
results = soup.find(id='content')
print(results.prettify())
title = results.find('h3', class_='title')
print(title.text)
Thank you very much in advance for all your help!
Edit: so apparently BeautifulSoup doesn't work for this (I'm not sure why; I think it has to do with the way Twitter loads its elements), and I had to use Selenium.
Here is code that does the job:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import time
url = r'https://twitter.com/search?q=%22John%20Doe%22%20stackexchange%20trial&src=typed_query'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(10)
element=driver.find_element_by_xpath("//span[contains(text(),'No results')]")
print(element.text)
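One note on that working code: it already imports WebDriverWait but never uses it, and find_element_by_xpath was removed in Selenium 4. A version of the same idea that waits for the message instead of sleeping a fixed 10 seconds, assuming the 'No results' text still appears in a span:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://twitter.com/search?q=%22John%20Doe%22%20stackexchange%20trial&src=typed_query'
driver = webdriver.Chrome()
driver.get(url)
# wait up to 15 seconds for the 'No results' message to render
element = WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.XPATH, "//span[contains(text(),'No results')]")))
print(element.text)
driver.quit()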

How to scrape URLs from a website with Python BeautifulSoup?

I was trying to scrape some URLs from a particular page. I used BeautifulSoup for scraping those links, but I'm not able to get them. I'm attaching the code I have used. Specifically, I want to scrape the URLs from the class "fxs_aheadline_tiny".
import requests
from bs4 import BeautifulSoup
url = 'https://www.fxstreet.com/news?q=&hPP=17&idx=FxsIndexPro&p=0&dFR%5BTags%5D%5B0%5D=EURUSD'
r1 = requests.get(url)
coverpage = r1.content
soup1 = BeautifulSoup(coverpage, 'html.parser')
coverpage_news = soup1.find_all('h4', class_='fxs_aheadline_tiny')
print(coverpage_news)
Thank you
I would use Selenium.
Please, try this code:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
# open driver
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://www.fxstreet.com/news?q=&hPP=17&idx=FxsIndexPro&p=0&dFR%5BTags%5D%5B0%5D=EURUSD')
# Use ChroPath to identify the xpath for the 'page hits'
pagehits = driver.find_element_by_xpath("//div[@class='ais-hits']")
# search for all a tags
links = pagehits.find_elements_by_tag_name("a")
# For each link, get the href
for link in links:
    print(link.get_attribute('href'))
It does exactly what you want: it pulls out all the URLs/links on your search page (which includes the links to the author pages).
You could even consider automating the browser and paging through the search results; a rough sketch follows below, and see the Selenium documentation: https://selenium-python.readthedocs.io/
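Continuing with the driver from the snippet above; the next-button selector is a guess based on Algolia's default widget classes and would need to be verified against the page markup in devtools:
import time

all_links = []
for _ in range(3):   # first three result pages
    pagehits = driver.find_element_by_xpath("//div[@class='ais-hits']")
    all_links += [a.get_attribute('href') for a in pagehits.find_elements_by_tag_name('a')]
    # 'ais-pagination--item__next' is an assumption; check the real class on the page
    next_page = driver.find_elements_by_css_selector('li.ais-pagination--item__next a')
    if not next_page:
        break
    next_page[0].click()
    time.sleep(3)    # crude wait for the next batch of results to render
print(len(all_links))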
Hope this helps
