I am trying to scrape the name of every favorite on the profile page of a user of our choice, but with this code I get the error "ResultSet object has no attribute 'find_all'". If I try to use find instead, I get the opposite error telling me to use find_all. I'm a beginner and I don't know what to do. (To test the code, you can use the username "Kineta"; she's an administrator, so anyone can access her profile page.)
Thanks for your help.
from bs4 import BeautifulSoup
import requests
usr_name = str(input('the user you are searching for '))
html_text = requests.get('https://myanimelist.net/profile/'+usr_name)
soup = BeautifulSoup(html_text.text, 'lxml')
favs = soup.find_all('div', class_='fav-slide-outer')
favs_title = favs.find_all('span', class_='title fs10')
print(favs_title)
Your program throws an exception because you are trying to call .find_all() on a ResultSet (favs_title = favs.find_all(...)); a ResultSet is a plain list of tags, and it doesn't have a .find_all() method.
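For example, a minimal sketch of the loop-based fix, reusing the tag and class names from your own code (I haven't re-verified them against the current page markup):

from bs4 import BeautifulSoup
import requests

usr_name = input('the user you are searching for ')
html_text = requests.get('https://myanimelist.net/profile/' + usr_name)
soup = BeautifulSoup(html_text.text, 'lxml')

favs = soup.find_all('div', class_='fav-slide-outer')  # ResultSet: a list of tags
for fav in favs:                                       # search inside each tag individually
    for title in fav.find_all('span', class_='title fs10'):
        print(title.text)

Instead of the loop, you can also use a CSS selector and select all required elements directly: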
import requests
from bs4 import BeautifulSoup
url = "https://myanimelist.net/profile/Kineta"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for t in soup.select(".fav-slide .title"):
    print(t.text)
Prints:
Kono Oto Tomare!
Yuukoku no Moriarty
Kaze ga Tsuyoku Fuiteiru
ACCA: 13-ku Kansatsu-ka
Fukigen na Mononokean
Kakuriyo no Yadomeshi
Shirokuma Cafe
Fruits Basket
Akatsuki no Yona
Colette wa Shinu Koto ni Shita
Okobore Hime to Entaku no Kishi
Meteor Methuselah
Inu x Boku SS
Vampire Juujikai
Mirako, Yuuta
Forger, Loid
Osaki, Kaname
Miyazumi, Tatsuru
Takaoka, Tetsuki
Okamoto, Souma
Shirota, Tsukasa
Archiviste, Noé
Fang, Li Ren
Fukuroi, Michiru
Sakurayashiki, Kaoru
James Moriarty, Albert
Souma, Kyou
Hades
Yona
Son, Hak
Mashima, Taichi
Ootomo, Jin
Collabel, Yuca
Masuda, Toshiki
Furukawa, Makoto
Satou, Takuya
Midorikawa, Hikaru
Miki, Shinichiro
Hino, Satoshi
Hosoya, Yoshimasa
Kimura, Ryouhei
Ono, Daisuke
KENN
Yoshino, Hiroyuki
Toriumi, Kousuke
Toyonaga, Toshiyuki
Ooishi, Masayoshi
Shirodaira, Kyou
Hakusensha
EDIT: To get Anime/Manga/Character favorites:
import requests
from bs4 import BeautifulSoup
url = "https://myanimelist.net/profile/Kineta"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
anime_favorites = [t.text for t in soup.select("#anime_favorites .title")]
manga_favorites = [t.text for t in soup.select("#manga_favorites .title")]
char_favorites = [t.text for t in soup.select("#character_favorites .title")]
print("Anime Favorites")
print("-" * 80)
print(*anime_favorites, sep="\n")
print()
print("Manga Favorites")
print("-" * 80)
print(*manga_favorites, sep="\n")
print()
print("Character Favorites")
print("-" * 80)
print(*char_favorites, sep="\n")
Prints:
Anime Favorites
--------------------------------------------------------------------------------
Kono Oto Tomare!
Yuukoku no Moriarty
Kaze ga Tsuyoku Fuiteiru
ACCA: 13-ku Kansatsu-ka
Fukigen na Mononokean
Kakuriyo no Yadomeshi
Shirokuma Cafe
Manga Favorites
--------------------------------------------------------------------------------
Fruits Basket
Akatsuki no Yona
Colette wa Shinu Koto ni Shita
Okobore Hime to Entaku no Kishi
Meteor Methuselah
Inu x Boku SS
Vampire Juujikai
Character Favorites
--------------------------------------------------------------------------------
Mirako, Yuuta
Forger, Loid
Osaki, Kaname
Miyazumi, Tatsuru
Takaoka, Tetsuki
Okamoto, Souma
Shirota, Tsukasa
Archiviste, Noé
Fang, Li Ren
Fukuroi, Michiru
Sakurayashiki, Kaoru
James Moriarty, Albert
Souma, Kyou
Hades
Yona
Son, Hak
Mashima, Taichi
Ootomo, Jin
Collabel, Yuca
find and find_all work, you just need to use them correctly. You can't call them on lists (like the favs variable in your example). You can always iterate through the list with a for loop and use find or find_all on each element.
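For example, a minimal sketch of that pattern (it assumes the fav-slide-outer class from the question still matches the page):

from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get('https://myanimelist.net/profile/Kineta').text, 'lxml')

favs = soup.find_all('div', class_='fav-slide-outer')  # a list of tags
for fav in favs:
    span = fav.find('span')  # find/find_all work on each individual tag
    if span:                 # guard in case a div has no span
        print(span.text)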
I preferred making it a bit easier, but you can choose whichever way you prefer, as I am not sure mine is more efficient:
from bs4 import BeautifulSoup
import requests
usr_name = str(input('the user you are searching for '))
html_text = requests.get('https://myanimelist.net/profile/'+usr_name)
soup = BeautifulSoup(html_text.text, 'lxml')
favs = soup.find_all('div', class_='fav-slide-outer')
for fav in favs:
    tag = fav.span
    print(tag.text)
If you need more info on how to use bs4 functions correctly, I suggest looking through their docs here.
I looked at the page a bit and changed the code; this way you should get all the results you need:
from bs4 import BeautifulSoup
import requests
usr_name = str(input('the user you are searching for '))
html_text = requests.get('https://myanimelist.net/profile/'+usr_name)
soup = BeautifulSoup(html_text.text, 'lxml')
favs = soup.find_all('li', class_='btn-fav')
for fav in favs:
    tag = fav.span
    print(tag.text)
I think the problem here is not really the code, but how you searched for your results and how the site is structured.
Related
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import sys
query_txt = input("Enter the search term to crawl: ")
path = "C:\Temp\chromedriver_240\chromedriver.exe"
driver = webdriver.Chrome(path)
driver.get("https://www.naver.com")
time.sleep(2)
driver.find_element_by_id("query").send_keys(query_txt)
driver.find_element_by_id("search_btn").click()
driver.find_element_by_link_text("블로그 더보기").click()  # "블로그 더보기" means "View more blogs"
full_html = driver.page_source
soup = BeautifulSoup(full_html, 'html.parser')
content_list = soup.find('ul', id='elThumbnailResultArea')
print(content_list)
content = content_list.find('a','sh_blog_title _sp_each_url _sp_each_title' ).get_text()
print(content)
for i in content_list:
    con = i.find('a', class_='sh_blog_title _sp_each_url _sp_each_title').get_text()
    print(con)
    print('\n')
I typed this code while following an online course, but inside the loop it always errors. This line:

con = i.find('a', class_='sh_blog_title _sp_each_url _sp_each_title').get_text()

shows the error 'find() takes no keyword arguments'.
The problem is that iterating over content_list (a single tag) yields its children, including plain text nodes (NavigableString); calling .find() on a text node invokes Python's str.find(), which takes no keyword arguments, hence the error. To get all the <a> tags, use .find_all(); .find() only ever returns one tag (if there is any):
import requests
from bs4 import BeautifulSoup
url = 'https://search.naver.com/search.naver?query=tree&where=post&sm=tab_nmr&nso='
full_html = requests.get(url).content
soup = BeautifulSoup(full_html, 'html.parser')
content_list = soup.find_all('a', class_='sh_blog_title _sp_each_url _sp_each_title' )
for i in content_list:
    print(i.text)
    print('\n')
Prints:
[2017/공학설계 입문] Romantic Tree
장충동/Banyan Tree Club & Spa/Club Members Restaurant
2020-06-27 Joshua Tree National Park Camping(조슈아트리...
[결혼준비/D-102] 웨딩밴드 '누니주얼리 - like a tree'
Book Club - Magic Tree House # 1 : Dinosaur Before Dark...
비밀 정원, 조슈아 트리 국립공원(Joshua Tree National Park)
그뤼너씨 TEA TREE 티트리 라인 3종리뷰
Number of Nodes in the Sub-Tree With the Same Label
태국의 100년 넘은 Giant tree
[부산 기장 카페] 오션뷰 뷰맛집카페 : 씨앤트리 sea&tree
Use .find('a', attrs={"class": "<class name>"}) instead. Reference: BeautifulSoup docs.
These two links will definitely help you.
Understand the Find() function in Beautiful Soup
Find on beautiful soup in loop returns TypeError
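For illustration, a self-contained sketch of the attrs form next to the keyword form (the HTML snippet below is made up for the example):

from bs4 import BeautifulSoup

html = '<div><a class="sh_blog_title">first</a><a class="sh_blog_title">second</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# keyword form (note the trailing underscore, since class is a reserved word)
print(soup.find('a', class_='sh_blog_title').get_text())            # first

# equivalent attrs-dict form
print(soup.find('a', attrs={'class': 'sh_blog_title'}).get_text())  # first

# .find_all() to collect every match
print([a.get_text() for a in soup.find_all('a', attrs={'class': 'sh_blog_title'})])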
from bs4 import BeautifulSoup
import requests
def kijiji():
    source = requests.get('https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274').text
    soup = BeautifulSoup(source, 'lxml')
    b = soup.find('div', class_='price')
    for link in soup.find_all('a', class_='title'):
        a = link.get('href')
        fulllink = 'http://kijiji.ca' + a
        print(fulllink)
        b = soup.find('div', class_='price')
        print(b.prettify())

kijiji()
The goal is to list all the different kinds of items sold on Kijiji and pair each one up with a price. But I can't seem to find any way to advance what Beautiful Soup finds for the price class, so I'm stuck with the first price. find_all doesn't work either, as it just prints out the whole blob instead of grouping each price with its item.
If you have BeautifulSoup 4.7.1 or above, you can use the following CSS selector with select(), which is much faster.
Code:
import requests
from bs4 import BeautifulSoup
res=requests.get("https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274").text
soup=BeautifulSoup(res,'html.parser')
for item in soup.select('.info-container'):
    fulllink = 'http://kijiji.ca' + item.find_next('a', class_='title')['href']
    print(fulllink)
    price = item.select_one('.price').text.strip()
    print(price)
Or, to use find_all(), use the code block below:
import requests
from bs4 import BeautifulSoup
res=requests.get("https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274").text
soup=BeautifulSoup(res,'html.parser')
for item in soup.find_all('div', class_='info-container'):
    fulllink = 'http://kijiji.ca' + item.find_next('a', class_='title')['href']
    print(fulllink)
    price = item.find_next(class_='price').text.strip()
    print(price)
Congratulations on finding the answer. I'll give you another solution, for reference only.
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc
def kijiji():
    url = 'https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274'
    source = requests.get(url).text
    doc = SimplifiedDoc(source)
    infos = doc.getElements('div', attr='class', value='info-container')
    for info in infos:
        price = info.select('div.price>text()')
        a = info.select('a.title')
        link = doc.absoluteUrl(url, a.href)
        title = a.text
        print(price)
        print(link)
        print(title)

kijiji()
Result:
$310.00
https://www.kijiji.ca/v-mens-shoes/markham-york-region/jordan-4-oreo-2015/1485391828
Jordan 4 Oreo (2015)
$560.00
https://www.kijiji.ca/v-mens-shoes/markham-york-region/yeezy-boost-350-yecheil-reflectives/1486296645
Yeezy Boost 350 Yecheil Reflectives
...
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
from bs4 import BeautifulSoup
import requests

def kijiji():
    source = requests.get('https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274').text
    soup = BeautifulSoup(source, 'lxml')
    b = soup.find('div', class_='price')
    for link in soup.find_all('a', class_='title'):
        a = link.get('href')
        fulllink = 'http://kijiji.ca' + a
        print(fulllink)
        print(b.prettify())
        b = b.find_next('div', class_='price')

kijiji()
I was stuck on this for an hour; as soon as I posted this on Stack Overflow, I immediately came up with an idea. Messy code, but it works!
I have a problem with scraping some elements from a page:
https://tuning-tec.com/mercedes_w164_ml_mklasa_0507_black_led_seq_lpmed0-5789i
code:
import requests
from bs4 import BeautifulSoup
URL="https://tuning-tec.com/mercedes_w164_ml_mklasa_0507_black_led_seq_lpmed0-5789i"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
title=soup.find(class_="product_cart_title").text
price=soup.find(class_="icon_main_block_price_a")
number=soup.find(class_="product_cart_info").findAll('tr')[1].findAll('td')[1]
description=soup.find(id="tab_a")
print(description)
The problem is when I want to get to tab_a. It's a problem because the div

<div align="left" class="product_cart_info" id="charlong_id">
</div>

is empty. How can I get it? I think it's about JS; maybe there is some delay when the page loads?
As stated in the comments, the info is loaded via JavaScript, so BeautifulSoup doesn't see it. But if you look at the Chrome/Firefox network tab, you can see where the page is making its requests:
import re
import requests
from bs4 import BeautifulSoup
url = 'https://tuning-tec.com/mercedes_w164_ml_mklasa_0507_black_led_seq_lpmed0-5789i'
ajax_url = 'https://tuning-tec.com/_template/_show_normal/_show_charlong.php?itemId={}'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
print(soup.select_one('.product_cart_title').get_text(strip=True))
print(soup.select_one('.icon_main_block_price_a').get_text(strip=True))
print(soup.select_one('td:contains("Symbol") ~ td').get_text(strip=True))
item_id = re.findall(r"ajax_update_stat\('(\d+)'\)", soup.text)[0]
soup2 = BeautifulSoup(requests.get(ajax_url.format(item_id)).content, 'html.parser')
print()
# just print some info:
for tr in soup2.select('tr'):
    print(re.sub(r' {2,}', ' ', tr.select_one('td').get_text(strip=True, separator=' ')))
Prints:
MERCEDES W164 ML M-KLASA 05-07 BLACK LED SEQ
1788.62 PLN
LPMED0
PL
Opis
Lampy soczewkowe ze światłem pozycyjnym LED. Z dynamicznym kierunkowskazem. 100% nowe, w komplecie (lewa i prawa). Homologacja: norma E13 - dopuszczone do ruchu.
Szczegóły
Światła pozycyjne: DIODY Kierunkowskaz: DIODY Światła mijania: H9 w zestawie Światła drogowe: H1 w zestawie Regulacja: elektryczna (silniczek znajduje się w komplecie).
LED TUBE LIGHT Dynamic Turn Signal >>
A little change to get the description; I don't know if it's working, have a look at the following code:
import re
import requests
from bs4 import BeautifulSoup
url = 'https://tuning-tec.com/mercedes_w164_ml_mklasa_0507_black_led_seq_lpmed0-5789i'
ajax_url = 'https://tuning-tec.com/_template/_show_normal/_show_charlong.php?itemId={}'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
def unwrapElements(soup, elementsToFind):
    elements = soup.find_all(elementsToFind)
    for element in elements:
        element.unwrap()
print(soup.select_one('.product_cart_title').get_text(strip=True))
print(soup.select_one('.icon_main_block_price_a').get_text(strip=True))
print(soup.select_one('td:contains("Symbol") ~ td').get_text(strip=True))
item_id = re.findall(r"ajax_update_stat\('(\d+)'\)", soup.text)[0]
soup2 = BeautifulSoup(requests.get(ajax_url.format(item_id)).content, 'html.parser')
description=soup2.findAll('tr')[2].findAll('td')[1]
description.append(soup2.findAll('tr')[4].findAll('td')[1])
unwrapElements(description, "td")
unwrapElements(description, "font")
unwrapElements(description, "span")
print(description)
I need just these elements of the description, in English. Will that be OK? And anyway, thanks for the help!! The only thing I don't know is why it didn't remove everything.
Here's what I tried:
import requests
from bs4 import BeautifulSoup

website_url = "https://en.wikipedia.org/wiki/List_of_Texas_Rangers_seasons"
html = requests.get(website_url).text
soup = BeautifulSoup(html, 'html.parser')  # parse the downloaded HTML, not the URL string

# Selecting the table
table_classes = {"class": "wikitable plainrowheaders"}
rel_table = soup.find_all('table', table_classes)
I am not sure how to proceed further. I inspected the elements, and it appears that the title and href are both dynamic, with a year field in them. The page also contains a table for the Washington Senators. I would appreciate any help on this! Thank you!
from bs4 import BeautifulSoup
import requests
url = 'https://en.wikipedia.org/wiki/List_of_Texas_Rangers_seasons'
r = requests.get(url)
soup = BeautifulSoup(r.text,'lxml')
# method 1
for row in soup.select('table.plainrowheaders tr')[14:]:
    for cell in row.select('td'):
        print(cell.text.strip(), end=' ')
    print()

# method 2
for row in soup.select('table.plainrowheaders tr')[14:]:
    print(row.get_text(strip=True, separator=' '))
I'm unfamiliar with HTML and web scraping with Beautiful Soup. I'm trying to retrieve job titles, salaries, locations, and company names from various Indeed job postings. This is my code so far:
URL = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"
import urllib2
import bs4
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen(URL).read())
resultcol = soup.find_all(id = 'resultsCol')
company = soup.findAll('span', attrs={"class":"company"})
jobs = (soup.find_all({'class': " row result"}))
Though I have the commands to find jobs and companies, I can't get the contents. I'm aware there's a .contents attribute, but none of my variables so far have that attribute. Thanks!
First I search for the div that holds all of one job's elements, and then I search for the elements inside this div:
import urllib2
from bs4 import BeautifulSoup
URL = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"
soup = BeautifulSoup(urllib2.urlopen(URL).read(), 'html.parser')
results = soup.find_all('div', attrs={'data-tn-component': 'organicJob'})
for x in results:
    company = x.find('span', attrs={"itemprop": "name"})
    print 'company:', company.text.strip()

    job = x.find('a', attrs={'data-tn-element': "jobTitle"})
    print 'job:', job.text.strip()

    salary = x.find('nobr')
    if salary:
        print 'salary:', salary.text.strip()
    print '----------'
Updated #furas' example for Python 3:
import urllib.request
from bs4 import BeautifulSoup
URL = "https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"
soup = BeautifulSoup(urllib.request.urlopen(URL).read(), 'html.parser')
results = soup.find_all('div', attrs={'data-tn-component': 'organicJob'})
for x in results:
    company = x.find('span', attrs={"class": "company"})
    if company:
        print('company:', company.text.strip())

    job = x.find('a', attrs={'data-tn-element': "jobTitle"})
    if job:
        print('job:', job.text.strip())

    salary = x.find('nobr')
    if salary:
        print('salary:', salary.text.strip())

    print('----------')
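For what it's worth, a sketch of the same loop using the requests library instead of urllib (assuming Indeed still serves this markup, which may well have changed):

import requests
from bs4 import BeautifulSoup

URL = "https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"
soup = BeautifulSoup(requests.get(URL).content, 'html.parser')

# same organicJob containers as above; guard each lookup since the markup may differ
for x in soup.find_all('div', attrs={'data-tn-component': 'organicJob'}):
    job = x.find('a', attrs={'data-tn-element': 'jobTitle'})
    company = x.find('span', attrs={'class': 'company'})
    if job and company:
        print(company.text.strip(), '-', job.text.strip())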