I'm trying to create a Python script that fetches the links to the pages of different actors from imdb.com, then parses the first three movie links on each actor's page, and finally scrapes the names of the director and writer of those movies. There are around 1,000 names on the list; I'm okay with the first three names for this example.
I can scrape the links of the different actors and their first three movie links in one go:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://www.imdb.com/list/ls058011111/'
base = 'https://www.imdb.com/'

def get_actor_list(s):
    res = s.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")
    for name_links in soup.select(".mode-detail")[:3]:
        name = name_links.select_one("h3 > a").get_text(strip=True)
        item_link = urljoin(base, name_links.select_one("h3 > a").get("href"))
        yield from get_movie_links(s, name, item_link)

def get_movie_links(s, name, link):
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    item_links = [urljoin(base, item.get("href")) for item in soup.select(".filmo-category-section .filmo-row > b > a[href]")[:3]]
    yield name, item_links

if __name__ == '__main__':
    with requests.Session() as s:
        for elem in get_actor_list(s):
            print(elem)
The result I get:
('Robert De Niro', ['https://www.imdb.com/title/tt4075436/', 'https://www.imdb.com/title/tt3143812/', 'https://www.imdb.com/title/tt5537002/'])
('Jack Nicholson', ['https://www.imdb.com/title/tt1341188/', 'https://www.imdb.com/title/tt1356864/', 'https://www.imdb.com/title/tt0825232/'])
('Marlon Brando', ['https://www.imdb.com/title/tt10905860/', 'https://www.imdb.com/title/tt0442674/', 'https://www.imdb.com/title/tt1667880/'])
I can even parse the names of the directors and writers of those linked movies if I feed the links individually to the following function:
def get_content(s, url):
    res = s.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")
    director = soup.select_one("h4:contains('Director') ~ a")
    director = director.get_text(strip=True) if director else None
    writer = soup.select_one("h4:contains('Writer') ~ a").get_text(strip=True)
    print(director, writer)
However, I would like to restructure the script, merging those functions so that it produces the following (final) output:
('Robert De Niro', [Jonathan Jakubowicz, Jonathan Jakubowicz, None, Anthony Thorne, Martin Scorsese, David Grann])
('Jack Nicholson', [James L. Brooks, James L. Brooks, Casey Affleck, Casey Affleck, Rob Reiner, Justin Zackham])
('Marlon Brando', [Bob Bendetson, Bob Bendetson, Peter Mitchell, Rubin Mario Puzo, Paul Hunter, Paul Hunter])
How can I merge the above functions in the right way to get that final output?
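One possible way to merge them, sketched below without testing against the live site: let get_movie_links call get_content for each movie link and yield the collected director/writer names instead of the raw URLs, while get_content returns the names rather than printing them. The other functions and selectors stay as in the question.

def get_movie_links(s, name, link):
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    item_links = [urljoin(base, item.get("href")) for item in soup.select(".filmo-category-section .filmo-row > b > a[href]")[:3]]
    people = []
    for movie_link in item_links:
        people.extend(get_content(s, movie_link))  # flatten the (director, writer) pairs
    yield name, people

def get_content(s, url):
    res = s.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")
    director = soup.select_one("h4:contains('Director') ~ a")
    writer = soup.select_one("h4:contains('Writer') ~ a")
    # guard both lookups, since some title pages lack one of the credits
    return (director.get_text(strip=True) if director else None,
            writer.get_text(strip=True) if writer else None)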
This is a Wikipedia article containing a list of articles about notable computer scientists. I have to write a script that collects the following info for each one of them:
Their full name
The number of awards they have
The universities they've attended
I've already written the following code to gather the links to each article.
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_computer_scientists"
response = requests.get(URL)
soup = BeautifulSoup(response.content, 'html.parser')
lines = soup.find(id="mw-content-text").find_all("li")
valid_links = []
for line in lines:
    link = line.find("a")
    if link['href'].find("/wiki/") == -1:
        continue
    if link.text == "Lists portal":
        break
    valid_links.append("https://en.wikipedia.org" + link['href'])
It's also pretty easy to get their full name (it's just the title of each article). However, I'm having trouble writing a script that can get 2 & 3 correctly for each one.
What I have so far is the following:
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
scientist_name = soup.find(id="firstHeading").string
soup.find(id="mw-content-text").find("table", class_="infobox biography vcard")
scientist_education = "PLACEHOLDER"
scientist_awards = "PLACEHOLDER"
You can try with the following code:
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_computer_scientists"
response = requests.get(URL)
soup = BeautifulSoup(response.content, 'html.parser')
lines = soup.find(id="mw-content-text").find_all("li")
valid_links = []
for line in lines:
    link = line.find("a")
    if link['href'].find("/wiki/") == -1:
        continue
    if link.text == "Lists portal":
        break
    valid_links.append("https://en.wikipedia.org" + link['href'])

for url in valid_links:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    name = soup.find(id="firstHeading").string
    edu = soup.find(lambda tag: len(tag.find_all()) == 0 and "Institutions" in tag.text)
    edux = [i.text.strip() for i in edu.find_next_siblings("td")] if edu else []
    awards = soup.find(lambda tag: len(tag.find_all()) == 0 and "Awards" in tag.text)
    awardsx = [i.text.strip() for i in awards.find_next_siblings("td")] if awards else []
    res = {"name": name, "education": edux, "awards": awardsx}
    print(res)
It returns the following output:
{'name': 'Atta ur Rehman Khan', 'education': ['Ajman University King Saud University University of Malaya Sohar University COMSATS University Air University (Pakistan Air Force) Qurtuba University'], 'awards': []}
{'name': 'Wil van der Aalst', 'education': ['RWTH Aachen University'], 'awards': []}
{'name': 'Scott Aaronson', 'education': ['University of Texas at Austin\nMassachusetts Institute of Technology\nInstitute for Advanced Study\nUniversity of Waterloo'], 'awards': ['Alan T. Waterman Award\nPECASE\nTomassoni–Chisesi Prize\nACM Prize in Computing']}
{'name': 'Rediet Abebe', 'education': ['University of California, BerkeleyHarvard UniversityCornell UniversityUniversity of Cambridge'], 'awards': ['Andrew Carnegie Fellow (2022)Harvard Society of Fellows (2019)MIT Technology Review Innovators Under 35 (2019)']}
....
However, I believe there are better options for crawling this page, such as Scrapy. If you go that route, you could also run your spiders in the cloud using a service like estela.
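For reference, a minimal Scrapy spider sketch (untested; it assumes the same list URL as above and that the relevant infobox rows are labelled "Institutions" and "Awards"):

import scrapy

class ScientistSpider(scrapy.Spider):
    name = "scientists"
    start_urls = ["https://en.wikipedia.org/wiki/List_of_computer_scientists"]

    def parse(self, response):
        # follow every /wiki/ link found inside a list item of the article body
        for href in response.css("#mw-content-text li a::attr(href)").getall():
            if href.startswith("/wiki/"):
                yield response.follow(href, callback=self.parse_scientist)

    def parse_scientist(self, response):
        # read the page heading and the matching infobox rows, if present
        yield {
            "name": response.css("#firstHeading ::text").get(),
            "education": response.xpath("//th[contains(., 'Institutions')]/following-sibling::td//text()").getall(),
            "awards": response.xpath("//th[contains(., 'Awards')]/following-sibling::td//text()").getall(),
        }

Saved as scientists_spider.py, it could be run with: scrapy runspider scientists_spider.py -o scientists.json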
I am trying to scrape the name of every favorite on the profile page of a user of our choice, but with this code I get the error "ResultSet object has no attribute 'find_all'", and if I try to use find instead I get the opposite error asking me to use find_all. I'm a beginner and I don't know what to do. (To test the code you can use the username "Kineta"; she's an administrator, so anyone can access her profile page.)
Thanks for your help.
from bs4 import BeautifulSoup
import requests
usr_name = str(input('the user you are searching for '))
html_text = requests.get('https://myanimelist.net/profile/'+usr_name)
soup = BeautifulSoup(html_text.text, 'lxml')
favs = soup.find_all('div', class_='fav-slide-outer')
favs_title = favs.find_all('span', class_='title fs10')
print(favs_title)
Your program throws an exception because you are trying to call .find_all on a ResultSet (favs_title = favs.find_all(...); a ResultSet doesn't have a .find_all method). Instead, you can use a CSS selector and select all the required elements directly:
import requests
from bs4 import BeautifulSoup

url = "https://myanimelist.net/profile/Kineta"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for t in soup.select(".fav-slide .title"):
    print(t.text)
Prints:
Kono Oto Tomare!
Yuukoku no Moriarty
Kaze ga Tsuyoku Fuiteiru
ACCA: 13-ku Kansatsu-ka
Fukigen na Mononokean
Kakuriyo no Yadomeshi
Shirokuma Cafe
Fruits Basket
Akatsuki no Yona
Colette wa Shinu Koto ni Shita
Okobore Hime to Entaku no Kishi
Meteor Methuselah
Inu x Boku SS
Vampire Juujikai
Mirako, Yuuta
Forger, Loid
Osaki, Kaname
Miyazumi, Tatsuru
Takaoka, Tetsuki
Okamoto, Souma
Shirota, Tsukasa
Archiviste, Noé
Fang, Li Ren
Fukuroi, Michiru
Sakurayashiki, Kaoru
James Moriarty, Albert
Souma, Kyou
Hades
Yona
Son, Hak
Mashima, Taichi
Ootomo, Jin
Collabel, Yuca
Masuda, Toshiki
Furukawa, Makoto
Satou, Takuya
Midorikawa, Hikaru
Miki, Shinichiro
Hino, Satoshi
Hosoya, Yoshimasa
Kimura, Ryouhei
Ono, Daisuke
KENN
Yoshino, Hiroyuki
Toriumi, Kousuke
Toyonaga, Toshiyuki
Ooishi, Masayoshi
Shirodaira, Kyou
Hakusensha
EDIT: To get Anime/Manga/Character favorites:
import requests
from bs4 import BeautifulSoup
url = "https://myanimelist.net/profile/Kineta"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
anime_favorites = [t.text for t in soup.select("#anime_favorites .title")]
manga_favorites = [t.text for t in soup.select("#manga_favorites .title")]
char_favorites = [t.text for t in soup.select("#character_favorites .title")]
print("Anime Favorites")
print("-" * 80)
print(*anime_favorites, sep="\n")
print()
print("Manga Favorites")
print("-" * 80)
print(*manga_favorites, sep="\n")
print()
print("Character Favorites")
print("-" * 80)
print(*char_favorites, sep="\n")
Prints:
Anime Favorites
--------------------------------------------------------------------------------
Kono Oto Tomare!
Yuukoku no Moriarty
Kaze ga Tsuyoku Fuiteiru
ACCA: 13-ku Kansatsu-ka
Fukigen na Mononokean
Kakuriyo no Yadomeshi
Shirokuma Cafe
Manga Favorites
--------------------------------------------------------------------------------
Fruits Basket
Akatsuki no Yona
Colette wa Shinu Koto ni Shita
Okobore Hime to Entaku no Kishi
Meteor Methuselah
Inu x Boku SS
Vampire Juujikai
Character Favorites
--------------------------------------------------------------------------------
Mirako, Yuuta
Forger, Loid
Osaki, Kaname
Miyazumi, Tatsuru
Takaoka, Tetsuki
Okamoto, Souma
Shirota, Tsukasa
Archiviste, Noé
Fang, Li Ren
Fukuroi, Michiru
Sakurayashiki, Kaoru
James Moriarty, Albert
Souma, Kyou
Hades
Yona
Son, Hak
Mashima, Taichi
Ootomo, Jin
Collabel, Yuca
find and find_all do work; you just need to use them correctly. You can't call them on lists (like the favs variable in your example). You can always iterate through the list with a for loop and call find or find_all on each element.
I preferred to keep it a bit simpler, but you can choose whichever way you prefer, as I'm not sure mine is more efficient:
from bs4 import BeautifulSoup
import requests

usr_name = str(input('the user you are searching for '))
html_text = requests.get('https://myanimelist.net/profile/' + usr_name)
soup = BeautifulSoup(html_text.text, 'lxml')
favs = soup.find_all('div', class_='fav-slide-outer')
for fav in favs:
    tag = fav.span
    print(tag.text)
If you need more info on how to use the bs4 functions correctly, I suggest looking through their docs.
I looked at the page a bit and changed the code; this way you should get all the results you need:
from bs4 import BeautifulSoup
import requests

usr_name = str(input('the user you are searching for '))
html_text = requests.get('https://myanimelist.net/profile/' + usr_name)
soup = BeautifulSoup(html_text.text, 'lxml')
favs = soup.find_all('li', class_='btn-fav')
for fav in favs:
    tag = fav.span
    print(tag.text)
I think the problem here is not really the code but how you searched your results and how the site is structured.
I'm trying to get the name, address, and key contacts from a webpage using a Python script. I can get each of them individually just fine. However, what I wish to do is get the name and address as strings and the key contacts in a list, so that I can write them to a CSV file in 6 columns. I can't find any way to include the value of data-cfemail within the list of contacts.
I've tried with:
import requests
from bs4 import BeautifulSoup
link = "https://www.fis.com/fis/companies/details.asp?l=e&filterby=species&specie_id=615&page=1&company_id=160574&country_id="
res = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text,'lxml')
name = soup.select_one("#name").text.strip()
address = soup.select("#description_details tr:contains('Address:') td")[1].text
contacts = [' '.join(item.get_text(strip=True).split()) for item in soup.select("#contacts table tr td")]
print(name,address,contacts)
Current output:
Bahia Grande S.A. - BG Group
Maipú 1252 Piso 8°
['Founder & PresidentMr Guillermo Jacob', 'VP FinanceMr Andres Jacob[email protected]', 'ControllerMr Juan Carlos Peralta[email protected]', 'VP AdmnistrationMs Veronica Vinuela[email protected]', '']
Expected output (as the emails are protected the value of data-cfemail will do):
Bahia Grande S.A. - BG Group
Maipú 1252 Piso 8°
[Founder & President, Mr Guillermo Jacob]
[VP Finance, Mr Andres Jacob,bbdad1dad8d4d9fbd9dad3d2dadcc9dad5dfde95d8d4d695dac9]
[Controller,Mr Juan Carlos Peralta,0b61687b6e796a677f6a4b696a63626a6c796a656f6e25686466256a79]
[VP Admnistration,Ms Veronica Vinuela,87f1f1eee9f2e2ebe6c7e5e6efeee6e0f5e6e9e3e2a9e4e8eaa9e6f5]
You could do it the following way: restrict to the appropriate tds with #contacts td[height], then to the appropriate ids with td.select('#contacts_title, #contacts_name, #contacts_email'), then test in a list comprehension whether the current element contains a cfemail and act accordingly.
from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://www.fis.com/fis/companies/details.asp?l=e&filterby=species&specie_id=615&page=1&company_id=160574&country_id=')
soup = bs(r.content, 'lxml')
name = soup.select_one('#name').text.strip()
address = soup.select_one('#description_details td:contains("Address:") + td div').text
print(name)
print(address)

for td in soup.select('#contacts td[height]'):
    print([i.text.strip().replace('\xa0', ' ') if i.select_one('.__cf_email__') is None else i.select_one('.__cf_email__')['data-cfemail']
           for i in td.select('#contacts_title, #contacts_name, #contacts_email')])
OP's implementation:
contacts = [', '.join([i.text.strip().replace('\xa0',' ') if i.select_one('.__cf_email__') is None else i.select_one('.__cf_email__')['data-cfemail'] for i in td.select('#contacts_title, #contacts_name, #contacts_email')]) for td in soup.select('#contacts td[height]')]
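If you ever need the plain addresses rather than the encoded values, the data-cfemail attribute follows Cloudflare's usual email-obfuscation scheme: the first hex byte is an XOR key applied to the remaining bytes. A minimal decoding sketch, assuming that scheme:

def decode_cfemail(cfemail):
    # first byte is the XOR key; each following byte XORed with it yields one character
    key = int(cfemail[:2], 16)
    return ''.join(chr(int(cfemail[i:i + 2], 16) ^ key)
                   for i in range(2, len(cfemail), 2))

Applying decode_cfemail to the data-cfemail values collected above yields the underlying addresses.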
You can iterate over the table storing the contact information:
import requests
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://www.fis.com/fis/companies/details.asp?l=e&filterby=species&specie_id=615&page=1&company_id=160574&country_id=').text, 'html.parser')
title, address = d.find('div', {'id':'name'}).text, d.find('div', {'id':'description_details'}).tr.div.text
contacts = [[i.find_all('div') for i in b.find_all('td')] for b in d.find('div', {'id':'contacts'}).table.find_all('tr')]
result = [[j.get_text(strip=True) if j.a is None else j.a.span['data-cfemail'] for j in i] for b in contacts for i in b if i]
Output:
'\xa0Bahia Grande S.A. - BG Group' #title
'Maipú 1252 Piso 8°' #address
[['Founder & President', 'Mr\xa0Guillermo\xa0Jacob'], ['VP Finance', 'Mr\xa0Andres\xa0Jacob', 'e6878c87858984a684878e8f87819487888283c885898bc88794'], ['Controller', 'Mr\xa0Juan Carlos\xa0Peralta', '264c45564354474a52476644474e4f474154474842430845494b084754'], ['VP Admnistration', 'Ms\xa0Veronica\xa0Vinuela', 'baccccd3d4cfdfd6dbfad8dbd2d3dbddc8dbd4dedf94d9d5d794dbc8']] #contact info
Total Python 3 beginner here. I can't seem to get just the names of the colleges to print out.
The class is nowhere near the college names, and I can't seem to narrow the find_all down to what I need and print the result to a new CSV file. Any ideas?
import requests
from bs4 import BeautifulSoup
import csv

res = requests.get("https://en.wikipedia.org/wiki/Ivy_League")
soup = BeautifulSoup(res.text, "html.parser")
colleges = soup.find_all("table", class_="wikitable sortable")
for college in colleges:
    first_level = college.find_all("tr")
    print(first_level)
You can use soup.select() to utilize css selectors and be more precise:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://en.wikipedia.org/wiki/Ivy_League")
soup = BeautifulSoup(res.text, "html.parser")
l = soup.select(".mw-parser-output > table:nth-of-type(2) > tbody > tr > td:nth-of-type(1) a")
for each in l:
    print(each.text)
Printed result:
Brown University
Columbia University
Cornell University
Dartmouth College
Harvard University
University of Pennsylvania
Princeton University
Yale University
To put a single column into csv:
import pandas as pd
pd.DataFrame([e.text for e in l]).to_csv("your_csv.csv") # This will include index
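Since the question already imports the csv module, here is a sketch of the same single-column write without pandas (assuming l is the list of anchor tags selected above):

import csv

with open("your_csv.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for e in l:
        writer.writerow([e.text])  # one college name per row, no index column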
With:
colleges = soup.find_all("table", class_ = "wikitable sortable")
you are getting all the tables with this class (there are five), not getting all the colleges in the table. So you can do something like this:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://en.wikipedia.org/wiki/Ivy_League")
soup = BeautifulSoup(res.text, "html.parser")
college_table = soup.find("table", class_="wikitable sortable")
colleges = college_table.find_all("tr")
for college in colleges:
    college_link = college.find('a')
    if college_link is not None:
        college_name = college_link.text
        print(college_name)
EDIT: I added an if to discard the first row, which contains the table header.
I'm scraping a news article with BeautifulSoup, trying to return only the text body of the article itself, not all the additional "noise". Is there any easy way to do this?
import bs4
import requests
url = 'https://www.cnn.com/2018/01/22/us/puerto-rico-privatizing-state-power-authority/index.html'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text,'html.parser')
element = soup.select_one('div.pg-rail-tall__body #body-text').text
print(element)
I'm trying to exclude some of the information returned, such as:
{CNN.VideoPlayer.handleUnmutePlayer = function
handleUnmutePlayer(containerId, dataObj) {'use strict';var
playerInstance,playerPropertyObj,rememberTime,unmuteCTA,unmuteIdSelector
= 'unmute_' +
The noise, as you call it, is the text in the <script>...</script> tags (JavaScript code). You can remove it using .extract() like:
for s in soup.find_all('script'):
    s.extract()
You can use this:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://edition.cnn.com/2018/01/22/us/puerto-rico-privatizing-state-power-authority/index.html')
soup = BeautifulSoup(r.text, 'html.parser')
[x.extract() for x in soup.find_all('script')]  # does the same thing as the for loop above
element = soup.find('div', class_='pg-rail-tall__body')
print(element.text)
Partial Output:
(CNN)Puerto Rico Gov. Ricardo Rosselló announced Monday that the
commonwealth will begin privatizing the Puerto Rico Electric Power
Authority, or PREPA. In comments published on Twitter, the governor
said the assets sale would transform the island's power generation
system into a "modern" and "efficient" one that would be less
expensive for citizens.He said the system operates "deficiently" and
that the improved infrastructure would respond more "agilely" to
natural disasters. The privatization process will begin "in the next
few days" and occur in three phases over the next 18 months, the
governor said.JUST WATCHEDSchool cheers as power returns after 112
daysReplayMore Videos ...MUST WATCHSchool cheers as power returns
after 112 days 00:48San Juan Mayor Carmen Yulin Cruz, known for her
criticisms of the Trump administration's response to Puerto Rico after
Hurricane Maria, spoke out against the move.Cruz, writing on her
official Twitter account, said PREPA's privatization would put the
commonwealth's economic development into "private hands" and that the
power authority will begin to "serve other interests.
Try this:
import bs4
import requests

url = 'https://www.cnn.com/2018/01/22/us/puerto-rico-privatizing-state-power-au$'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
elementd = soup.findAll('div', {'class': 'zn-body__paragraph'})
elementp = soup.findAll('p', {'class': 'zn-body__paragraph'})
for i in elementp:
    print(i.text)
for i in elementd:
    print(i.text)