'str' object has no attribute 'text' | BeautifulSoup - python

When running the following code I get an error:
'str' object has no attribute 'text'
import requests
import pandas as pd
from bs4 import BeautifulSoup

baseURL = 'https://www.olx.pl/nieruchomosci/mieszkania/sprzedaz/pomorskie/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0'}

offer_links = []
for x in range(1, 2):
    r = requests.get(f'https://www.olx.pl/nieruchomosci/mieszkania/sprzedaz/pomorskie/?page={x}')
    soup = BeautifulSoup(r.content, 'lxml')
    offer_list = soup.find_all('div', class_='space rel')
    for item in offer_list:
        for link in item.find_all('a', href=True):
            offer_links.append(link['href'])
print(offer_links)
for link in offer_links:
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    nazwa = soup.find('h1', class_='css-1oarkq2-Text eu5v0x0').text.strip()
    szczegoly = soup.find('p', class_='css-xl6fe0-Text eu5v0x0').text.strip()
    cenazam2 = []
    poziom = []
    umeblowane = []
    rynek = []
    zabudowa = []
    powi = []
    for i in range(0, 7):
        p = szczegoly[i].text.strip()
        if "Cena" in p:
            cenazam2.append(p)
        elif "Poziom" in p or "SSD" in p:
            poziom.append(p)
        elif "Umeblowane" in p:
            umeblowane.append(p)
        elif "Rynek" in p:
            rynek.append(p)
        elif "zabudowy" in p:
            zabudowa.append(p)
        elif "Powierzchnia" in p:
            powi.append(p)

oferty = {'Nazwa': nazwa, 'Cena': cenazam2, 'Poziom': poziom, 'Umeblowane': umeblowane, 'Rynek': rynek, 'Zabudowa': zabudowa, 'Powierzchnia': powi}
dataset = pd.DataFrame.from_dict(oferty, orient='index')
dataset = dataset.transpose()
dataset
I need output like in the attached example. The code works for one test link, but I would like to automate it so it collects the data for a selected number of links.

szczegoly is a plain string, not a list of elements. You assign it here:
szczegoly = soup.find('p', class_='css-xl6fe0-Text eu5v0x0').text.strip()
Indexing it with szczegoly[i] gives you a one-character str, so you can no longer access the Tag attributes (.text) here:
p = szczegoly[i].text.strip()
You have to decide where you want to strip the values. Either keep a list of Tag objects and extract the text when you use them:
szczegoly = soup.find_all('p', class_='css-xl6fe0-Text eu5v0x0')
...
p = szczegoly[i].text.strip()
or strip everything up front and index plain strings:
szczegoly = [t.text.strip() for t in soup.find_all('p', class_='css-xl6fe0-Text eu5v0x0')]
...
p = szczegoly[i]
Both should work, as long as the element/index exists (note find_all instead of find, so that szczegoly really is a list).
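For example, a minimal sketch of the first variant dropped into the question's loop (the class name is copied from the question and may have changed on olx.pl since):

szczegoly = soup.find_all('p', class_='css-xl6fe0-Text eu5v0x0')  # list of Tag objects
for tag in szczegoly[:7]:            # guard against pages with fewer rows
    p = tag.text.strip()             # .text is valid here: tag is a Tag, not a str
    if "Cena" in p:
        cenazam2.append(p)
    elif "Poziom" in p or "SSD" in p:
        poziom.append(p)
    # ...remaining elif branches exactly as in the question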

Related

Loop duplicating results

I'm writing code to web-scrape the Transfermarkt website, but I'm having some issues with it.
The code originally returned an error that was fixed through this topic: Loop thru multiple URLs in Python - InvalidSchema("No connection adapters were found for {!r}".format
After this fix, other problems came up.
First: the code is duplicating the results in the data frame.
Second: the code is taking only the last element of each URL. What I actually want is to get all the agency URLs in pagina = range(1) and then scrape all the players in each agency, through the URLs scraped in the first part.
P.S.: pagina = range(1) will become range(1, 40); it's the number of pages I will scrape to get all the agencies' links.
Can anyone give me a hand with these issues?
Thanks!
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from requests.sessions import default_headers

nome = []
posicao = []
nacionalidade = []
idade = []
clube = []
contrato = []
valor = []

tf = "http://www.transfermarkt.com.br"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'
}
pagina = range(1, 5)

def main(url):
    with requests.Session() as req:
        links = []
        for lea in pagina:
            print(f"Extraindo links da página {lea}")
            r = req.get(url.format(lea), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            link = [f"{tf}{item.next_element.get('href')}" for item in soup.findAll(
                "td", class_="hauptlink")]
            links.extend(link)
            print(f"Collected {len(links)} Links")
            time.sleep(1)
            for url in links:
                r = requests.get(url, headers=headers)
                r.status_code
                soup = BeautifulSoup(r.text, 'html.parser')
                player_info = soup.find_all('tr', class_=['odd', 'even'])
                for info in player_info:
                    player = info.find_all("td")
                    vall = info.find('td', {'class': 'zentriert hauptlink'})
                    nome.append(player[2].text)
                    posicao.append(player[3].text)
                    nacionalidade.append(player[4].img['alt'])
                    idade.append(player[5].text)
                    clube.append(player[6].img['alt'])
                    contrato.append(player[7].text)
                    valor.append(vall)
                time.sleep(1)
        df = pd.DataFrame(
            {"NOME": nome,
             "POSICAO": posicao,
             "NACIONALIDADE": nacionalidade,
             "IDADE": idade,
             "CLUBE": clube,
             "CONTRATO": contrato,
             "VALOR": valor}
        )
        print(df)
        df
        # df.to_csv('MBB.csv', index=False)

main("https://www.transfermarkt.com.br/berater/beraterfirmenuebersicht/berater?ajax=yw1&page={}")

Getting a specific value

Hey, I want to get the last href value in the navigation bar, which is this part: "/cat/cremes-solaires-indice-30", but I got the error string index out of range. Thank you for helping me :)
import requests
from bs4 import BeautifulSoup as soup

PAGE_URL = 'https://www.e.leclerc/fp/sunissime-bb-fluide-protecteur-anti-age-global-spf30-40ml-3508240006457'

def product():
    response = requests.get(PAGE_URL, headers=HEADERS)
    soupe = soup(response.content, 'html5lib')
    sauce = soupe.find_all("div", class_="product-breadcrumb")
    for x in sauce:
        a = x.select('[href]')
        print(type(a))
        for i in a:
            u = i.get("href")
            print(u[-2])
You can find the main p tag and call the find_all method on it for the a tags; it returns a list of tags, and you can extract the href from the last one:
from bs4 import BeautifulSoup
import requests
headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"}
req = requests.get('https://www.e.leclerc/fp/sunissime-bb-fluide-protecteur-anti-age-global-spf30-40ml-3508240006457',headers=headers)
soup = BeautifulSoup(req.text, 'html.parser')
soup.find("p",class_="wrapper d-inline-block").find_all("a")[-1]['href']
Output:
'/cat/cremes-solaires-indice-30'
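If the page is served without that markup (for example, when it is rendered differently for bots), find returns None and the chained call raises AttributeError. A slightly more defensive version of the same lookup might look like:

wrapper = soup.find("p", class_="wrapper d-inline-block")
if wrapper is not None:
    anchors = wrapper.find_all("a")
    if anchors:
        print(anchors[-1]['href'])  # expected: '/cat/cremes-solaires-indice-30'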
The object 'u' is not a list or any other container, so every time the loop repeats, the variable u is overwritten. You can either declare a list or make the variable global.
Method 1:
import requests
from bs4 import BeautifulSoup as soup

PAGE_URL = 'https://www.e.leclerc/fp/sunissime-bb-fluide-protecteur-anti-age-global-spf30-40ml-3508240006457'

def product():
    response = requests.get(PAGE_URL, headers=HEADERS)
    soupe = soup(response.content, 'html5lib')
    sauce = soupe.find_all("div", class_="product-breadcrumb")
    for x in sauce:
        a = x.select('[href]')
        print(type(a))
        for i in a:
            global u
            u = i.get("href")
            print(u)
Method 2:
import requests
from bs4 import BeautifulSoup as soup

PAGE_URL = 'https://www.e.leclerc/fp/sunissime-bb-fluide-protecteur-anti-age-global-spf30-40ml-3508240006457'

def product():
    response = requests.get(PAGE_URL, headers=HEADERS)
    soupe = soup(response.content, 'html5lib')
    sauce = soupe.find_all("div", class_="product-breadcrumb")
    for x in sauce:
        a = x.select('[href]')
        u = []
        print(type(a))
        for i in a:
            u.append(i.get("href"))
        print(u[-1])
P.S.: Your code is missing a few other things as well (for example, HEADERS is never defined and product() is never called).
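Concretely, to actually run either method you would at least need something like this (a sketch; the User-Agent string here is an arbitrary placeholder):

import requests

HEADERS = {'User-Agent': 'Mozilla/5.0'}  # placeholder UA; the site may require a realistic one

product()  # the question's code defines product() but never calls it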

python requests module find urls within the urls already found

This script returns a list of URLs found on the web page.
import requests
from bs4 import BeautifulSoup as BS
from bs4 import Comment

with requests.session() as r:
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0'}
    r = requests.get('https://ctflearn.com', verify=False, headers=headers)
    response = r.text
    soup = BS(response, 'html.parser')
    tags = soup.find_all('a')
    for tag in tags:
        links = tag.get('href')
        if links[0] == '/':
            appended_link = 'https://ctflearn.com' + links
            print(appended_link)
        elif links[0] == '#':
            pass
        else:
            print(links)
However, what I am interested in is to visit these web pages and find the links within these pages as well. I know it is possible by using a for loop, but I don't know how to implement it.
Thanks for the help.
You could simply use two lists (to_visit and visited) and check whether a URL is already in one of those lists before you add it to to_visit.
import requests
from bs4 import BeautifulSoup as BS
from bs4 import Comment

with requests.session() as r:
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0'}
    to_visit = ['https://ctflearn.com']
    visited = []
    while len(to_visit) > 0:
        url = to_visit.pop(0)
        visited.append(url)
        print('visited: "{0}"'.format(url))
        r = requests.get(url, verify=True, headers=headers)
        response = r.text
        soup = BS(response, 'html.parser')
        tags = soup.find_all('a')
        for tag in tags:
            links = tag.get('href')
            if links == None:
                continue
            elif links[0] == '/':
                appended_link = 'https://ctflearn.com' + links
                if appended_link not in visited and appended_link not in to_visit:
                    to_visit.append(appended_link)
            elif links[0] == '#':
                pass
            else:
                if links not in visited and links not in to_visit:
                    to_visit.append(links)
At some point, though, you will run into a problem, because you will find and try to request something that is not a URL. That is why I would recommend using a validator:
import validators
validators.url(url) # returns True if "url" is a valid url
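For example, the check could slot into the loop above like this (a sketch; validators is a third-party package, installed with pip install validators):

import validators

for tag in tags:
    href = tag.get('href')
    if href is None or href.startswith('#'):
        continue
    candidate = 'https://ctflearn.com' + href if href.startswith('/') else href
    # only queue strings that actually parse as URLs
    if validators.url(candidate) and candidate not in visited and candidate not in to_visit:
        to_visit.append(candidate)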

Unable to crawl multiple URLs

I have two functions which crawl the webpage, look for a particular class, and find the href tag inside it.
url="https://www.poynter.org/ifcn-covid-19-misinformation/page/220/"
def url_parse(site):
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(site,headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page)
return soup
def article_link(URL):
try:
soup=url_parse(URL)
for i in soup.find_all("a", class_="button entry-content__button entry-content__button--smaller"):
link=i['href']
except:
pass
return link
data['article_source']=""
for i, rows in data.iterrows():
rows['article_source']= article_link(rows['url'])
Issue
The functions url_parse and article_link work fine, but when I use article_link to update the cells in the data frame, it stops working after 1000-1500 URLs. I suspect my IP address may be getting blocked, but I don't understand how to solve it because there is no error message.
Expectation
The function article_link should parse every URL inside the data frame.
import requests
from bs4 import BeautifulSoup
from concurrent.futures.thread import ThreadPoolExecutor

url = "https://www.poynter.org/ifcn-covid-19-misinformation/page/{}/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}

def main(url, num):
    with requests.Session() as req:
        print(f"Extracting Page# {num}")
        r = req.get(url.format(num), headers=headers)
        soup = BeautifulSoup(r.content, 'html.parser')
        links = [item.get("href") for item in soup.findAll(
            "a", class_="button entry-content__button entry-content__button--smaller")]
        return links

with ThreadPoolExecutor(max_workers=50) as executor:
    futures = [executor.submit(main, url, num) for num in range(1, 238)]
    for future in futures:
        print(future.result())
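If you then want one flat list of links instead of one list per page, you could gather the futures' results like this (a sketch; the itertools.chain flattening is an addition):

from itertools import chain

with ThreadPoolExecutor(max_workers=50) as executor:
    futures = [executor.submit(main, url, num) for num in range(1, 238)]
    all_links = list(chain.from_iterable(future.result() for future in futures))
print(f"Collected {len(all_links)} links in total")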

How to extract links from elements?

I am trying to extract the links of every individual member, but I am not getting any output:
from bs4 import BeautifulSoup
import requests

r = requests.get('https://www.asklaila.com/search/Delhi-NCR/-/doctors/')
soup = BeautifulSoup(r.text, 'lxml')

for link in soup.find_all('h2', class_='resultTitle'):
    link1 = link.find('a')
    print(link1['href'])
You need to request the URL with a header parameter (more details). Here resultContent is the class of the div containing the top doctors in Delhi-NCR results, and cardWrap is the class of each doctor card's div:
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Custom user agent'}
r = requests.get('https://www.asklaila.com/search/Delhi-NCR/-/doctors/', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

resultContentArray = soup.find('div', {'class': 'resultContent'}).find_all("div", {'class': 'cardWrap'})
for rr in resultContentArray:
    title = rr.find('h2', {'class': 'resultTitle'})
    link = rr.find("a", href=True)
    if link is not None:
        print(link['href'])
Output:
https://www.asklaila.com/category/Delhi-NCR/-/doctors/doctor/?category=176
https://www.asklaila.com/search/Delhi-NCR/greater-kailash-1/doctors/
https://www.asklaila.com/search/Delhi-NCR/-/maternity-hospital/
https://www.asklaila.com/Delhi-NCR/
https://www.asklaila.com/listing/Delhi-NCR/madangir/dr-vp-kaushik/0Vm4m7jP/
https://www.asklaila.com/listing/Delhi-NCR/sector-19/dr-arvind-garg/1BEtXFWP/
https://www.asklaila.com/listing/Delhi-NCR/indira-puram/dr-sanjay-garg/kUUpPPzH/
https://www.asklaila.com/listing/Delhi-NCR/new-friends-colony/dr-rk-caroli/GK5X4dSI/
https://www.asklaila.com/listing/Delhi-NCR/vasant-vihar/dr-sourabh-nagpal/0v1s6pGr/
https://www.asklaila.com/listing/Delhi-NCR/ncr/care24/0bbotWCf/
https://www.asklaila.com/listing/Delhi-NCR/soami-nagar-north/sudaksh-physiotherapy-psychology-orthopaedic-psychiatry-clinic-/kJxps7Dn/
https://www.asklaila.com/listing/Delhi-NCR/vaishali-sector-3/dr-sb-singh/00PPdXnM/
https://www.asklaila.com/listing/Delhi-NCR/kaushambi/dr-uma-kant-gupta/0ivP1mJ6/
https://www.asklaila.com/listing/Delhi-NCR/vaishali-sector-4/dr-kanwal-deep/09eZqT9k/
https://www.asklaila.com/listing/Delhi-NCR/east-of-kailash/dr-harbhajan-singh/ngDklERb/
https://www.asklaila.com/listing/Delhi-NCR/uttam-nagar/dr-bb-jindal/0Z8u07oQ/
https://www.asklaila.com/listing/Delhi-NCR/greater-kailash-part-1/dr-raman-kapoor/kNFPgYfZ/
https://www.asklaila.com/listing/Delhi-NCR/dwarka-sector-7/dr-pankaj-n-surange/NpIBzM4K/
https://www.asklaila.com/listing/Delhi-NCR/vaishali-sector-3/dr-ritu-gupta/19IoQ4A7/
https://www.asklaila.com/listing/Delhi-NCR/vaishali-sector-5/dr-mala-bhattacharjee/ywTzyamp/
https://www.asklaila.com/listing/Delhi-NCR/vasundhara/dr-mohit-jindal/vN9FiMAd/
https://www.asklaila.com/listing/Delhi-NCR/janakpuri/dr-ravi-manocha/1Qe4iuK1/
https://www.asklaila.com/listing/Delhi-NCR/vikas-marg/sparsh/08ZpsI85/
https://www.asklaila.com/listing/Delhi-NCR/kamla-nagar/dr-deepak-guha/ETn71X1r/
https://www.asklaila.com/search/Delhi-NCR/-/doctors/20
Use:
html.parser
custom header User-agent
soup.select feature
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
r = requests.get('https://www.asklaila.com/search/Delhi-NCR/-/doctors/', headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')

for link in soup.select('h2[class="resultTitle"] > a'):
    print(link['href'])
The output:
https://www.asklaila.com/listing/Delhi-NCR/madangir/dr-vp-kaushik/0Vm4m7jP/
https://www.asklaila.com/listing/Delhi-NCR/sector-19/dr-arvind-garg/1BEtXFWP/
https://www.asklaila.com/listing/Delhi-NCR/indira-puram/dr-sanjay-garg/kUUpPPzH/
https://www.asklaila.com/listing/Delhi-NCR/new-friends-colony/dr-rk-caroli/GK5X4dSI/
https://www.asklaila.com/listing/Delhi-NCR/vasant-vihar/dr-sourabh-nagpal/0v1s6pGr/
https://www.asklaila.com/listing/Delhi-NCR/ncr/care24/0bbotWCf/
https://www.asklaila.com/listing/Delhi-NCR/soami-nagar-north/sudaksh-physiotherapy-psychology-orthopaedic-psychiatry-clinic-/kJxps7Dn/
https://www.asklaila.com/listing/Delhi-NCR/vaishali-sector-3/dr-sb-singh/00PPdXnM/
https://www.asklaila.com/listing/Delhi-NCR/kaushambi/dr-uma-kant-gupta/0ivP1mJ6/
https://www.asklaila.com/listing/Delhi-NCR/vaishali-sector-4/dr-kanwal-deep/09eZqT9k/
https://www.asklaila.com/listing/Delhi-NCR/east-of-kailash/dr-harbhajan-singh/ngDklERb/
https://www.asklaila.com/listing/Delhi-NCR/uttam-nagar/dr-bb-jindal/0Z8u07oQ/
https://www.asklaila.com/listing/Delhi-NCR/greater-kailash-part-1/dr-raman-kapoor/kNFPgYfZ/
https://www.asklaila.com/listing/Delhi-NCR/dwarka-sector-7/dr-pankaj-n-surange/NpIBzM4K/
https://www.asklaila.com/listing/Delhi-NCR/vaishali-sector-3/dr-ritu-gupta/19IoQ4A7/
https://www.asklaila.com/listing/Delhi-NCR/vaishali-sector-5/dr-mala-bhattacharjee/ywTzyamp/
https://www.asklaila.com/listing/Delhi-NCR/vasundhara/dr-mohit-jindal/vN9FiMAd/
https://www.asklaila.com/listing/Delhi-NCR/janakpuri/dr-ravi-manocha/1Qe4iuK1/
https://www.asklaila.com/listing/Delhi-NCR/vikas-marg/sparsh/08ZpsI85/
https://www.asklaila.com/listing/Delhi-NCR/sector-40/dr-amit-yadav/1ik21lZw/
Using SoupStrainer:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('https://www.asklaila.com/search/Delhi-NCR/-/doctors/')

for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
There are twenty correct member links to retrieve. A concise way is to use a CSS selector for the parent class with a child combinator to get the a tag within it:
from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.asklaila.com/search/Delhi-NCR/-/doctors/',headers= {'User-Agent' : 'Mozilla/5.0'})
soup = BeautifulSoup(r.content,'lxml')
links = [item['href'] for item in soup.select('.resultTitle > a')]
print(links)
The server looks for a User-Agent header to prevent users from scraping the content; you could set the request headers as a workaround:
from bs4 import BeautifulSoup
import requests

headers = dict()
headers['User-Agent'] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0"

r = requests.get('https://www.asklaila.com/search/Delhi-NCR/-/doctors/', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
# with open('h.html','w') as w:
#     w.write(soup.text)
for link in soup.find_all('h2', class_='resultTitle'):
    link1 = link.find('a')
    print(link1['href'])
Should give you
https://www.asklaila.com/listing/Delhi-NCR/madangir/dr-vp-kaushik/0Vm4m7jP/
https://www.asklaila.com/listing/Delhi-NCR/sector-19/dr-arvind-garg/1BEtXFWP/
https://www.asklaila.com/listing/Delhi-NCR/indira-puram/dr-sanjay-garg/kUUpPPzH/
https://www.asklaila.com/listing/Delhi-NCR/new-friends-colony/dr-rk-caroli/GK5X4dSI/
https://www.asklaila.com/listing/Delhi-NCR/vasant-vihar/dr-sourabh-nagpal/0v1s6pGr/
https://www.asklaila.com/listing/Delhi-NCR/ncr/care24/0bbotWCf/
https://www.asklaila.com/listing/Delhi-NCR/soami-nagar-north/sudaksh-physiotherapy-psychology-orthopaedic-psychiatry-clinic-/kJxps7Dn/
https://www.asklaila.com/listing/Delhi-NCR/vaishali-sector-3/dr-sb-singh/00PPdXnM/
https://www.asklaila.com/listing/Delhi-NCR/vaishali-sector-4/dr-kanwal-deep/09eZqT9k/
https://www.asklaila.com/listing/Delhi-NCR/kaushambi/dr-uma-kant-gupta/0ivP1mJ6/
https://www.asklaila.com/listing/Delhi-NCR/east-of-kailash/dr-harbhajan-singh/ngDklERb/
https://www.asklaila.com/listing/Delhi-NCR/uttam-nagar/dr-bb-jindal/0Z8u07oQ/
https://www.asklaila.com/listing/Delhi-NCR/greater-kailash-part-1/dr-raman-kapoor/kNFPgYfZ/
https://www.asklaila.com/listing/Delhi-NCR/dwarka-sector-7/dr-pankaj-n-surange/NpIBzM4K/
https://www.asklaila.com/listing/Delhi-NCR/vaishali-sector-3/dr-ritu-gupta/19IoQ4A7/
https://www.asklaila.com/listing/Delhi-NCR/vaishali-sector-5/dr-mala-bhattacharjee/ywTzyamp/
https://www.asklaila.com/listing/Delhi-NCR/vasundhara/dr-mohit-jindal/vN9FiMAd/
https://www.asklaila.com/listing/Delhi-NCR/janakpuri/dr-ravi-manocha/1Qe4iuK1/
https://www.asklaila.com/listing/Delhi-NCR/vikas-marg/sparsh/08ZpsI85/
https://www.asklaila.com/listing/Delhi-NCR/kamla-nagar/dr-deepak-guha/ETn71X1r/
