BeautifulSoup struggling to scrape the Listings Detail Page - python

I'm still a rookie in the Python world. I'm trying to build a scraper that will be useful in my daily work routine, but I'm stuck at a particular point:
My goal is to scrape a real estate website. I'm using BeautifulSoup, and I manage to get the parameters on the list pages without problems. But when I enter the listing details page, I can't scrape any data.
My code:
from bs4 import BeautifulSoup
import requests

url = "https://timetochoose.co.ao/?search-listings=true"
headers = {'User-Agent': 'whatever'}
response = requests.get(url, headers=headers)
print(response)
data = response.text
print(data)
soup = BeautifulSoup(data, 'html.parser')
anuncios = soup.find_all("div", {"class": "grid-listing-info"})
for anuncio in anuncios:
    titles = anuncio.find("a", {"class": "listing-link"}).text
    location = anuncio.find("p", {"class": "location muted marB0"}).text
    link = anuncio.find("a", {"class": "listing-link"}).get("href")
    anuncios_response = requests.get(link)  # note: no headers on this request
    anuncios_data = anuncios_response.text
    anuncios_soup = BeautifulSoup(anuncios_data, 'html.parser')
    conteudo = anuncios_soup.find("div", {"id": "listing-content"}).text
    print("Título", titles, "\nLocalização", location, "\nLink", link, "\nConteudo", conteudo)
Example: I'm not getting anything in the "conteudo" variable. I've tried to get different data from the details page, like the price or the number of rooms, but it always fails and I just get "None".
I've been searching for an answer since yesterday afternoon, but I can't see where I'm failing. I manage to get the parameters on the upper pages without problems, but when I reach the listing details page level, it just fails.
If someone could just point out what I'm doing wrong, I would be grateful. Thanks in advance for the time you took to read my question.

To get the correct page you need to set the User-Agent HTTP header on every request. Your code sends it for the listings page but not for the second requests.get(link) that fetches the details page, which is why those requests come back without the expected content.
For example:
import requests
from bs4 import BeautifulSoup

main_url = 'https://timetochoose.co.ao/?search-listings=true'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}

def print_info(url):
    soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
    print(soup.select_one('#listing-content').get_text(strip=True, separator='\n'))

soup = BeautifulSoup(requests.get(main_url, headers=headers).content, 'html.parser')

for a in soup.select('a.listing-featured-image'):
    print(a['href'])
    print_info(a['href'])
    print('-' * 80)
Prints:
https://timetochoose.co.ao/listings/loja-rua-rei-katiavala-luanda/
Avenida brasil , Rua katiavala
Maculusso
Loja com 90 metros quadrados
2 andares
1 wc
Frente a estrada
Arrendamento  mensal 500.000 kz Negociável
--------------------------------------------------------------------------------
https://timetochoose.co.ao/listings/apertamento-t3-rua-cabral-montcada-maianga/
Apartamento T3 maianga
1  suíte com varanda
2 quartos com varanda
1 wc
1 sala comum grande
1 cozinha
Tanque de  agua
Predio limpo
Arrendamento 350.000  akz Negociável
--------------------------------------------------------------------------------
...and so on.
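A small variant, as a sketch: a requests.Session carries its headers on every request it makes, which removes the risk of forgetting them on the detail-page fetch (same URLs and selectors as above):
import requests
from bs4 import BeautifulSoup

main_url = 'https://timetochoose.co.ao/?search-listings=true'

with requests.Session() as s:
    # The session sends these headers on every request, so the detail-page
    # fetch can't silently go out without the User-Agent.
    s.headers['User-Agent'] = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'
    soup = BeautifulSoup(s.get(main_url).content, 'html.parser')
    for a in soup.select('a.listing-featured-image'):
        print(a['href'])
        detail = BeautifulSoup(s.get(a['href']).content, 'html.parser')
        print(detail.select_one('#listing-content').get_text(strip=True, separator='\n'))
        print('-' * 80)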

Related

HTML problems with tags and classes in a simple little scrape with BeautifulSoup

I'm new and am trying to get BeautifulSoup to work. I'm having HTML problems with retrieving classes and tags. I'm getting closer, but there's something I'm doing wrong: I'm using the wrong tags and classes to scrape the title, time, link, and text of a news item.
I would like to scrape all the titles in the vertical list, then scrape the date, title, link, and content of each item.
Can you help me with the right HTML classes and tags, please?
I'm not getting any errors, but the Python console stays empty:
>>>
Code
import requests
from bs4 import BeautifulSoup

site = requests.get('url')
beautify = BeautifulSoup(site.content, 'html5lib')
news = beautify.find_all('div', {'class', '$00'})
arti = []
for each in news:
    time = each.find('span', {'class', 'hh serif'}).text
    title = each.find('span', {'class', 'title'}).text
    link = each.a.get('href')
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html5lib')
    content = soup.find('div', class_="read__content").text.strip()
    print(" ")
    print(time)
    print(title)
    print(link)
    print(" ")
    print(content)
    print(" ")
Here is a solution you can give a try:
import requests
from bs4 import BeautifulSoup

# mock a browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}

site = requests.get('https://www.tuttomercatoweb.com/atalanta/', headers=headers)
soup = BeautifulSoup(site.content, 'html.parser')

news = soup.find_all('div', attrs={"class": "tcc-list-news"})

for each in news:
    for div in each.find_all("div"):
        print("-- Time ", div.find('span', attrs={'class': 'hh serif'}).text)
        print("-- Href ", div.find("a")['href'])
        print("-- Text ", " ".join([span.text for span in div.select("a > span")]))
        print('-' * 30)
Prints:
-- Time 11:36
-- Href https://www.tuttomercatoweb.com/atalanta/?action=read&idtmw=1661241
-- Text focus Serie A, punti nel 2022: Juve prima, ma un solo punto in più rispetto a Milan e Napoli
------------------------------
-- Time 11:24
-- Href https://www.tuttomercatoweb.com/atalanta/?action=read&idtmw=1661233
-- Text focus Serie A, chi più in forma? Le ultime 5 gare: Sassuolo e Juve in vetta, crisi Venezia
------------------------------
-- Time 11:15
-- Href https://www.tuttomercatoweb.com/atalanta/?action=read&idtmw=1661229
-- Text Le pagelle di Cissé: come nelle migliori favole. Dalla seconda categoria al gol in serie A
------------------------------
...
...
EDIT:
Why are headers required here? See:
How to use Python requests to fake a browser visit a.k.a and generate User Agent?
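In short, the site serves a different (or empty) page to the default python-requests User-Agent. A quick way to see this, as a sketch (the exact sizes and status codes will vary):
import requests

url = 'https://www.tuttomercatoweb.com/atalanta/'

# The default User-Agent is "python-requests/x.y.z", which many sites block
# or serve a stripped-down page to; a browser-like string usually passes.
plain = requests.get(url)
faked = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

print(plain.status_code, len(plain.content))  # often a 403 or a short body
print(faked.status_code, len(faked.content))  # the full page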

Loop duplicating results

I'm writing code to web-scrape the Transfermarkt website, but I'm having some issues with it.
The code returned an error that was fixed through this topic: Loop thru multiple URLs in Python - InvalidSchema("No connection adapters were found for {!r}".format
After this fix, other problems came up.
First: the code is duplicating the results in the data frame.
Second: the code is taking only the last element of each URL. In fact, what I want is to get all the agency URLs in pagina = range(1) and then scrape all the players in each agency, through the URLs scraped in the first part.
P.S.: pagina = range(1) will become range(1, 40); it's the number of pages I will scrape to get all the agencies' links.
Can anyone give me a hand with these issues?
Thanks!
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from requests.sessions import default_headers

nome = []
posicao = []
nacionalidade = []
idade = []
clube = []
contrato = []
valor = []

tf = "http://www.transfermarkt.com.br"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'
}
pagina = range(1, 5)

def main(url):
    with requests.Session() as req:
        links = []
        for lea in pagina:
            print(f"Extraindo links da página {lea}")
            r = req.get(url.format(lea), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            link = [f"{tf}{item.next_element.get('href')}" for item in soup.findAll(
                "td", class_="hauptlink")]
            links.extend(link)
            print(f"Collected {len(links)} Links")
            time.sleep(1)
            for url in links:
                r = requests.get(url, headers=headers)
                r.status_code
                soup = BeautifulSoup(r.text, 'html.parser')
                player_info = soup.find_all('tr', class_=['odd', 'even'])
                for info in player_info:
                    player = info.find_all("td")
                    vall = info.find('td', {'class': 'zentriert hauptlink'})
                    nome.append(player[2].text)
                    posicao.append(player[3].text)
                    nacionalidade.append(player[4].img['alt'])
                    idade.append(player[5].text)
                    clube.append(player[6].img['alt'])
                    contrato.append(player[7].text)
                    valor.append(vall)
                time.sleep(1)
    df = pd.DataFrame(
        {"NOME": nome,
         "POSICAO": posicao,
         "NACIONALIDADE": nacionalidade,
         "IDADE": idade,
         "CLUBE": clube,
         "CONTRATO": contrato,
         "VALOR": valor}
    )
    print(df)
    df
    #df.to_csv('MBB.csv', index=False)

main("https://www.transfermarkt.com.br/berater/beraterfirmenuebersicht/berater?ajax=yw1&page={}")

Scraping all tables from a webpage using python bs4

I want to use BeautifulSoup to get all the tables at this link https://www.investing.com/indices/indices-futures; following that, I want to get the titles in the Index column and the links of those titles.
I want what's in the first column only.
So for example:
title          href
Dow Jones      /indices/us-30-futures
S&P 500        /indices/us-spx-500-futures
...
Mini DAX       /indices/mini-dax-futures
...
VSTOXX Mini    /indices/vstoxx-mini
I use the following code:
import requests
from bs4 import BeautifulSoup

url = "https://www.investing.com/indices/indices-futures"
req = requests.get(url, headers=urlheader)
soup = BeautifulSoup(req.content, "lxml")

table = soup.find('div', id="cross_rates_container")
for a in table.find_all('a', href=True):
    print(a['title'], a['href'])
I can see the table variable, but I can't seem to access the title (which contains the index name) and href (which contains the link).
What's wrong with it, and how can I get all the tables' entries at once?
You can iterate over <td> elements and get the <a> link under them.
For example:
import requests
from bs4 import BeautifulSoup

url = 'https://www.investing.com/indices/indices-futures'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

print('{:<30} {}'.format('Title', 'URL'))
for a in soup.select('td.plusIconTd > a'):
    print('{:<30} {}'.format(a.text, 'https://www.investing.com' + a['href']))
Prints:
Title                          URL
Dow Jones                      https://www.investing.com/indices/us-30-futures
S&P 500                        https://www.investing.com/indices/us-spx-500-futures
Nasdaq                         https://www.investing.com/indices/nq-100-futures
SmallCap 2000                  https://www.investing.com/indices/smallcap-2000-futures
S&P 500 VIX                    https://www.investing.com/indices/us-spx-vix-futures
DAX                            https://www.investing.com/indices/germany-30-futures
CAC 40                         https://www.investing.com/indices/france-40-futures
FTSE 100                       https://www.investing.com/indices/uk-100-futures
Euro Stoxx 50                  https://www.investing.com/indices/eu-stocks-50-futures
FTSE MIB                       https://www.investing.com/indices/italy-40-futures
SMI                            https://www.investing.com/indices/switzerland-20-futures
IBEX 35                        https://www.investing.com/indices/spain-35-futures
ATX                            https://www.investing.com/indices/austria-20-futures
WIG20                          https://www.investing.com/indices/poland-20-futures
AEX                            https://www.investing.com/indices/netherlands-25-futures
BUX                            https://www.investing.com/indices/hungary-14-futures
RTS                            https://www.investing.com/indices/rts-cash-settled-futures
... and so on.
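If only the table text were needed (not the links), pandas.read_html could pull every table on the page wholesale; it keeps only cell text and drops the href attributes, which is why the <a>-based loop above is used for the links. A sketch:
import pandas as pd
import requests

url = 'https://www.investing.com/indices/indices-futures'
headers = {'User-Agent': 'Mozilla/5.0'}

# read_html parses every <table> on the page into a list of DataFrames,
# but it keeps only the cell text, so the links behind the names are lost.
tables = pd.read_html(requests.get(url, headers=headers).text)
print(len(tables))
print(tables[0].head())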
EDIT: (screenshot of the matching <td> elements)

Can't find and process text taken out of HTML

I'm trying to search a webpage for the "Spanish" content but can't get it at all.
This is the code I have so far:
from bs4 import BeautifulSoup
import requests
import re
url = 'http://www.autotaskstatus.net/'
r = requests.get(url)
estado = r.status_code
r = r.content
soup = BeautifulSoup(r, "html.parser")
data = soup.find_all('span', attrs={'class':'name'})[1]
pais = 'Spanish'
data.get_text()
print(data.text)
I have the "pais" var there so it can be replaced by an input, letting the user search for the country they want.
The only data I get with a 1 there is "Limited Release", but if I go with a 0 I can't filter the results at all.
I have been searching all over the Internet and couldn't find anyone with this same problem, so I can't find a solution.
I am using Python 3.6.
Edit: since people seemed to find this unclear, I'll explain it now.
What I have on the page is (just a part):
<div data-component-id="fp5s6cp13l47"
class="component-inner-container status-green "
data-component-status="operational"
data-js-hook="">
<span class="name">
Concord
</span>
<span class="tooltip-base tool" title="https://concord.centrastage.net">?</span>
<span class="component-status">
Operational
</span>
So "Spanish" appears the same way "Concord" does here, and what I want to take out is the "Spanish" entry (and later on its "Operational" status), which will be in a var so it can later be changed to any country there.
You can get the Spanish server status using this approach:
from bs4 import BeautifulSoup
import requests

URL = 'http://www.autotaskstatus.net/'

with requests.session() as s:
    s.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0'}
    r = s.get(URL)

soup = BeautifulSoup(r.content, "html.parser")
data = soup.find_all('div', attrs={'class': 'component-inner-container'})
pais = 'Spanish'

print([d.find('span', {'class': 'name'}).text.strip() + ' - ' + d.find('span', {'class': 'component-status'}).text.strip() for d in data if pais in d.text])
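The same logic unrolled, as a sketch, in case the one-line comprehension is hard to follow: iterate over the container divs so the name and its sibling component-status come from the same block, and filter on the country text.
# Sketch: the list comprehension above, written out step by step.
for d in data:
    if pais not in d.text:
        continue  # skip containers that don't mention the country
    name = d.find('span', {'class': 'name'}).text.strip()
    status = d.find('span', {'class': 'component-status'}).text.strip()
    print(name, '-', status)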

Strip away HTML tags from extracted links

I have the following code to extract certain links from a webpage:
from bs4 import BeautifulSoup
import urllib2, sys
import re

def tonaton():
    site = "http://tonaton.com/en/job-vacancies-in-ghana"
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    invalid_tag = ('h2')
    soup = BeautifulSoup(jobpass)
    print soup.find_all('h2')
The links are contained in the 'h2' tags so I get the links as follows:
<h2>cashiers </h2>
<h2>Cake baker</h2>
<h2>Automobile Technician</h2>
<h2>Marketing Officer</h2>
But I'm interested in getting rid of all the 'h2' tags so that I have links only in this manner:
cashiers
Cake baker
Automobile Technician
Marketing Officer
I therefore updated my code to look like this:
def tonaton():
    site = "http://tonaton.com/en/job-vacancies-in-ghana"
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    invalid_tag = ('h2')
    soup = BeautifulSoup(jobpass)
    jobs = soup.find_all('h2')
    for tag in invalid_tag:
        for match in jobs(tag):
            match.replaceWithChildren()
    print jobs
But I couldn't get it to work, even though I thought that was the best logic I could come up with. I'm a newbie though, so I know there is something better that could be done.
Any help will be gratefully appreciated.
Thanks
You could navigate to the next element of each <h2> tag:
for h2 in soup.find_all('h2'):
    n = h2.next_element
    if n.name == 'a': print n
It yields:
Financial Administrator
House help
Office Manager
...
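A simpler route, as a sketch: if only the link text is wanted, get_text() strips the tags directly, with no replaceWithChildren() pass (same Python 2 / urllib2 setup as in the question):
# Sketch: get_text() returns the tag's text with all markup removed.
for h2 in soup.find_all('h2'):
    print h2.get_text().strip()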
