This is my code:
import requests
from bs4 import BeautifulSoup
url = "https://www.example.com/"
result = requests.get(url)
soup = BeautifulSoup(result.text, "html.parser")
find_by_class = soup.find('div', attrs={"class":"class_name"}).find_all('p')
I want to print the data without the HTML tags, but I can't call get_text() directly on the result of find_all('p').
You could use a for loop, like so:
for i in soup.find('div', attrs={"class":"class_name"}).find_all('p'):
    print(i.get_text())
or, if you want to save that information, put it into a list:
things = []
for i in soup.find('div', attrs={"class":"class_name"}).find_all('p'):
    things.append(i.get_text())
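find_all() returns a list-like ResultSet, which has no get_text() of its own, so the call has to be made on each element. A minimal sketch of the same idea as a list comprehension, using a small inline HTML snippet (made up for illustration) instead of a live request:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the downloaded page.
html = """
<div class="class_name">
  <p>First <b>paragraph</b></p>
  <p>Second paragraph</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find("div", attrs={"class": "class_name"}).find_all("p")

# get_text() is a method of each Tag, not of the ResultSet returned by find_all().
texts = [p.get_text() for p in paragraphs]
print(texts)
```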
I need the slugs of all articles on a page. I used bs4 to get the href of every article link, but some of the links are absolute URLs with a prefix that I don't need, and I want to strip that prefix. I used this code:
import requests
import re
from bs4 import BeautifulSoup
r = requests.get('https://davidventuri.medium.com/')
soup = BeautifulSoup(r.text, 'html.parser')
all_slugs = soup.find_all('a', {'class': 'dn br'})
for i in range(len(all_slugs)):
    slug = all_slugs[i]['href']
    print(slug)
Here is my result of getting hrefs:
/this-is-not-a-real-data-science-degree-d170c660c1cf
/not-a-real-degree-data-science-curriculum-2021-19ba9af2c1d4
/bitcoin-learning-path-9ed73f2f11d9
/your-first-day-of-school-eaf363b19ded
https://medium.com/free-code-camp/an-overview-of-every-data-visualization-course-on-the-internet-9ccf24ea9c9b
https://medium.com/free-code-camp/the-best-data-science-courses-on-the-internet-ranked-by-your-reviews-6dc5b910ea40
https://medium.com/free-code-camp/every-single-machine-learning-course-on-the-internet-ranked-by-your-reviews-3c4a7b8026c0
https://medium.com/free-code-camp/dive-into-deep-learning-with-these-23-online-courses-bf247d289cc0
/how-ai-is-revolutionizing-mental-health-care-a7cec436a1ce
https://medium.com/free-code-camp/i-ranked-all-the-best-data-science-intro-courses-based-on-thousands-of-data-points-db5dc7e3eb8e
Actually I want them as below:
/this-is-not-a-real-data-science-degree-d170c660c1cf
/not-a-real-degree-data-science-curriculum-2021-19ba9af2c1d4
/bitcoin-learning-path-9ed73f2f11d9
/your-first-day-of-school-eaf363b19ded
/an-overview-of-every-data-visualization-course-on-the-internet-9ccf24ea9c9b
/the-best-data-science-courses-on-the-internet-ranked-by-your-reviews-6dc5b910ea40
/every-single-machine-learning-course-on-the-internet-ranked-by-your-reviews-3c4a7b8026c0
/dive-into-deep-learning-with-these-23-online-courses-bf247d289cc0
/how-ai-is-revolutionizing-mental-health-care-a7cec436a1ce
/i-ranked-all-the-best-data-science-intro-courses-based-on-thousands-of-data-points-db5dc7e3eb8e
How can I strip that prefix with a regex or something else?
If the substring to replace is always the same, you can go without regex like this:
slug = a['href'].replace('https://medium.com/free-code-camp','')
Example
import requests
from bs4 import BeautifulSoup
r = requests.get('https://davidventuri.medium.com/')
soup = BeautifulSoup(r.text, 'html.parser')
all_slugs = soup.find_all('a', {'class': 'dn br'})
for a in all_slugs:
    slug = a['href'].replace('https://medium.com/free-code-camp','')
    print(slug)
Output
/this-is-not-a-real-data-science-degree-d170c660c1cf
/not-a-real-degree-data-science-curriculum-2021-19ba9af2c1d4
/bitcoin-learning-path-9ed73f2f11d9
/your-first-day-of-school-eaf363b19ded
/an-overview-of-every-data-visualization-course-on-the-internet-9ccf24ea9c9b
/the-best-data-science-courses-on-the-internet-ranked-by-your-reviews-6dc5b910ea40
/every-single-machine-learning-course-on-the-internet-ranked-by-your-reviews-3c4a7b8026c0
/dive-into-deep-learning-with-these-23-online-courses-bf247d289cc0
/how-ai-is-revolutionizing-mental-health-care-a7cec436a1ce
/i-ranked-all-the-best-data-science-intro-courses-based-on-thousands-of-data-points-db5dc7e3eb8e
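Since the question also asks about regex, here is a hedged sketch that strips the host plus the first path segment from absolute Medium links, so it would not depend on the publication being free-code-camp. The pattern and the helper name strip_publication_prefix are assumptions for illustration, not part of the original answer.

```python
import re

# Matches "https://medium.com/<publication>" at the start of a link (assumed URL shape).
PREFIX = re.compile(r"^https?://medium\.com/[^/]+")

def strip_publication_prefix(href):
    """Return href without the leading host and publication segment."""
    return PREFIX.sub("", href)

hrefs = [
    "/bitcoin-learning-path-9ed73f2f11d9",
    "https://medium.com/free-code-camp/dive-into-deep-learning-with-these-23-online-courses-bf247d289cc0",
]
for href in hrefs:
    print(strip_publication_prefix(href))
```

Relative links pass through unchanged because the pattern is anchored to the start of the string.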
Edit
You can also use split(), which keeps only the last path segment (without the leading slash):
slug = a['href'].split('/')[-1]
Example
import requests
from bs4 import BeautifulSoup
r = requests.get('https://davidventuri.medium.com/')
soup = BeautifulSoup(r.text, 'html.parser')
all_slugs = soup.find_all('a', {'class': 'dn br'})
for a in all_slugs:
    slug = a['href'].split('/')[-1]
    print(slug)
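If a link might carry a query string or a trailing slash, parsing it first is a little more robust. A sketch using the standard library, assuming the slug is always the last non-empty path segment; slug_from_href is a hypothetical helper name, and the leading slash is kept to match the desired output in the question:

```python
from urllib.parse import urlsplit

def slug_from_href(href):
    """Return '/<last path segment>' for a relative or absolute link."""
    path = urlsplit(href).path                      # drops scheme, host and query string
    last_segment = path.rstrip("/").rsplit("/", 1)[-1]
    return "/" + last_segment

print(slug_from_href("/your-first-day-of-school-eaf363b19ded"))
print(slug_from_href("https://medium.com/free-code-camp/the-best-data-science-courses-on-the-internet-ranked-by-your-reviews-6dc5b910ea40"))
```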
Hey guys, I got as far as being able to add each <a> tag to a list. The problem is that I only want the href link added to the links_with_text list, not the entire <a> tag. What am I doing wrong?
from bs4 import BeautifulSoup
from requests import get
import requests
URL = "https://news.ycombinator.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id = 'hnmain')
articles = results.find_all(class_="title")
links_with_text = []
for article in articles:
    link = article.find('a', href=True)
    links_with_text.append(link)
print('\n'.join(map(str, links_with_text)))
This prints the list exactly how I want it, but I just want the href from every <a> tag, not the entire tag. Thank you.
To get all story links from https://news.ycombinator.com, you can use the CSS selector 'a.storylink'.
For example:
from bs4 import BeautifulSoup
import requests
URL = "https://news.ycombinator.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
links_with_text = []
for a in soup.select('a.storylink'):   # <-- find all <a> with class="storylink"
    links_with_text.append(a['href'])  # <-- note the ['href']
print(*links_with_text, sep='\n')
Prints:
https://blog.mozilla.org/futurereleases/2020/06/18/introducing-firefox-private-network-vpns-official-product-the-mozilla-vpn/
https://mxb.dev/blog/the-return-of-the-90s-web/
https://github.blog/2020-06-18-introducing-github-super-linter-one-linter-to-rule-them-all/
https://www.sciencemag.org/news/2018/11/why-536-was-worst-year-be-alive
https://www.strongtowns.org/journal/2020/6/16/do-the-math-small-projects
https://devblogs.nvidia.com/announcing-cuda-on-windows-subsystem-for-linux-2/
https://lwn.net/SubscriberLink/822568/61d29096a4012e06/
https://imil.net/blog/posts/2020/fakecracker-netbsd-as-a-function-based-microvm/
https://jepsen.io/consistency
https://tumblr.beesbuzz.biz/post/621010836277837824/advice-to-young-web-developers
https://archive.org/search.php?query=subject%3A%22The+Navy+Electricity+and+Electronics+Training+Series%22&sort=publicdate
https://googleprojectzero.blogspot.com/2020/06/ff-sandbox-escape-cve-2020-12388.html?m=1
https://apnews.com/1da061ce00eb531291b143ace0eed1c9
https://support.apple.com/library/content/dam/edam/applecare/images/en_US/appleid/android-apple-music-account-payment-none.jpg
https://standpointmag.co.uk/issues/may-june-2020/the-healing-power-of-birdsong/
https://steveblank.com/2020/06/18/the-coming-chip-wars-of-the-21st-century/
https://www.videolan.org/security/sb-vlc3011.html
https://onesignal.com/careers/2023b71d-2f44-4934-a33c-647855816903
https://www.bbc.com/news/world-europe-53006790
https://github.com/efficient/HOPE
https://everytwoyears.org/
https://www.historytoday.com/archive/natural-histories/intelligence-earthworms
https://cr.yp.to/2005-590/powerpc-cwg.pdf
https://quantum.country/
http://www.crystallography.net/cod/
https://parkinsonsnewstoday.com/2020/06/17/tiny-magnetically-powered-implant-may-be-future-of-deep-brain-stimulation/
https://spark.apache.org/releases/spark-release-3-0-0.html
https://arxiv.org/abs/1712.09624
https://www.washingtonpost.com/technology/2020/06/18/data-privacy-law-sherrod-brown/
https://blog.chromium.org/2020/06/improving-chromiums-browser.html
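The key change is indexing the tag with ['href'] instead of appending the tag itself. A small offline sketch of that difference, using made-up HTML that mimics the storylink markup; passing href=True is an extra guard (an assumption about what is wanted) that skips anchors without an href:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class name "storylink" comes from the answer above.
html = """
<table id="hnmain">
  <tr><td class="title"><a class="storylink" href="https://example.com/a">Story A</a></td></tr>
  <tr><td class="title"><a class="storylink" href="https://example.com/b">Story B</a></td></tr>
  <tr><td class="title"><a class="storylink">No link here</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

tags = soup.select("a.storylink")   # whole <a> elements
links = [a["href"] for a in soup.find_all("a", class_="storylink", href=True)]

print(tags[0])   # the full tag, markup included
print(links)     # only the href values
```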
from bs4 import BeautifulSoup
import requests
def kijiji():
    source = requests.get('https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274').text
    soup = BeautifulSoup(source,'lxml')
    b = soup.find('div', class_='price')
    for link in soup.find_all('a',class_ = 'title'):
        a = link.get('href')
        fulllink = 'http://kijiji.ca'+a
        print(fulllink)
        b = soup.find('div', class_='price')
        print(b.prettify())
kijiji()
The goal is to list all the different items sold on Kijiji and pair each one with its price.
But I can't find any way to advance what Beautiful Soup finds with the class price, so I'm stuck with the first price. find_all doesn't work either, as it just prints out the whole blob instead of grouping each price with its item.
If you have Beautiful Soup 4.7.1 or above, you can use the CSS selector method select(), which can be faster.
code:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.select('.info-container'):
    fulllink = 'http://kijiji.ca' + item.find_next('a', class_='title')['href']
    print(fulllink)
    price = item.select_one('.price').text.strip()
    print(price)
Or, to use find_all(), use the code block below:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.find_all('div', class_='info-container'):
    fulllink = 'http://kijiji.ca' + item.find_next('a', class_='title')['href']
    print(fulllink)
    price = item.find_next(class_='price').text.strip()
    print(price)
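Both versions rely on the same idea: iterate over each listing's container and look up the link and price relative to that container, so the two stay paired. An offline sketch with invented HTML (the class names info-container, title and price are taken from the answer; the listing data is made up), using select_one so the search stays inside the container and urljoin to build the absolute URL:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://www.kijiji.ca"

# Hypothetical listings standing in for the real page.
html = """
<div class="info-container">
  <div class="price"> $10.00 </div>
  <a class="title" href="/v-mens-shoes/item-one/1">Item one</a>
</div>
<div class="info-container">
  <div class="price"> $25.50 </div>
  <a class="title" href="/v-mens-shoes/item-two/2">Item two</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
listings = []
for item in soup.select(".info-container"):
    link = item.select_one("a.title")
    price = item.select_one(".price")
    if link is None or price is None:
        continue  # skip containers missing either piece
    listings.append((urljoin(BASE_URL, link["href"]), price.get_text(strip=True)))

for url, price in listings:
    print(price, url)
```

Note that find_next() searches forward through the rest of the document, so it can pick up an element from a later listing if the current one lacks it; select_one() and find() only look at descendants of the container.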
Congratulations on finding the answer. I'll give you another solution for reference only.
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc
def kijiji():
    url = 'https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274'
    source = requests.get(url).text
    doc = SimplifiedDoc(source)
    infos = doc.getElements('div',attr='class',value='info-container')
    for info in infos:
        price = info.select('div.price>text()')
        a = info.select('a.title')
        link = doc.absoluteUrl(url,a.href)
        title = a.text
        print(price)
        print(link)
        print(title)
kijiji()
Result:
$310.00
https://www.kijiji.ca/v-mens-shoes/markham-york-region/jordan-4-oreo-2015/1485391828
Jordan 4 Oreo (2015)
$560.00
https://www.kijiji.ca/v-mens-shoes/markham-york-region/yeezy-boost-350-yecheil-reflectives/1486296645
Yeezy Boost 350 Yecheil Reflectives
...
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
from bs4 import BeautifulSoup
import requests
def kijiji():
    source = requests.get('https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274').text
    soup = BeautifulSoup(source,'lxml')
    b = soup.find('div', class_='price')
    for link in soup.find_all('a',class_ = 'title'):
        a = link.get('href')
        fulllink = 'http://kijiji.ca'+a
        print(fulllink)
        print(b.prettify())
        b = b.find_next('div', class_='price')
kijiji()
I was stuck on this for an hour, and as soon as I posted it on Stack Overflow I came up with an idea. The code is messy, but it works!
I am currently crawling a web page (https://www.klook.com/city/30-kyoto/?p=1) using Python 3.4 and bs4 in order to collect the deeplinks of the respective activities.
I found that the links are located in the html source like this:
<a class="j_activity_item_link" href="/activity/1031-arashiyama-rickshaw-tour-kyoto/" class="j_activity_item_link" data-card-tags="{}" data-sold-out="false" data-price="40.0" data-city-id="30" data-id="1031" data-url-seo="arashiyama-rickshaw-tour-kyoto">
But after several attempts, this href="/activity/1031-arashiyama-rickshaw-tour-kyoto/" never shows up in my output.
Here is my logic so far:
import requests
from bs4 import BeautifulSoup
user_agent = {'User-agent': 'Chrome/43.0.2357'}
for page in range(1,6):
    r = requests.get("https://www.klook.com/city/30-kyoto" + "/?p=" + str(page))
    soup = BeautifulSoup(r.content, "lxml")
    g_data = soup.find_all("a", {"class": "j_activity_item_link"})
    for item in g_data:
        Deeplink = item.find_all("a")
        for t in Deeplink:
            print(t.get("href"))
Output:
Process finished with exit code 0
Could you guys help me out? Any feedback is appreciated.
The "exit code 0" is not an error; it simply indicates that your run finished successfully. According to your example, your list g_data should already contain all of the a tags that you are interested in, so you do not need the second for loop to iterate again and look for nested a tags. As a debugging step, print the length of your lists to make sure they are not empty. See the following:
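import requests
from bs4 import BeautifulSoup
user_agent = {'User-agent': 'Chrome/43.0.2357'}
for page in range(1, 6):
    # pass the User-agent header along with the request
    r = requests.get("https://www.klook.com/city/30-kyoto" + "/?p=" + str(page), headers=user_agent)
    soup = BeautifulSoup(r.content, "lxml")
    g_data = soup.find_all("a", {"class": "j_activity_item_link"})
    print(len(g_data))  # debugging step: confirm the list is not empty
    for item in g_data:
        print(item.get("href"))  # the href is on the matched <a> tag itself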
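One likely reason the original inner loop printed nothing: item.find_all("a") searches the descendants of each <a> tag, and an anchor does not normally contain further anchors, so that inner list is empty. An offline sketch using a trimmed copy of the tag from the question (the link text "Rickshaw tour" is made up for illustration):

```python
from bs4 import BeautifulSoup

html = (
    '<a class="j_activity_item_link" '
    'href="/activity/1031-arashiyama-rickshaw-tour-kyoto/" data-id="1031">'
    "Rickshaw tour</a>"
)

soup = BeautifulSoup(html, "html.parser")
g_data = soup.find_all("a", {"class": "j_activity_item_link"})

# Anchors nested inside the matched anchors: there are none.
nested = [t for item in g_data for t in item.find_all("a")]
# The href attribute of the matched anchors themselves.
hrefs = [item.get("href") for item in g_data]

print(len(g_data), len(nested))
print(hrefs)
```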
You can first find the number of pages of activities, and then use regex with BeautifulSoup:
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
base_url = 'https://www.klook.com/city/30-kyoto/?p={}'
activity_href = re.compile("^/activity/")  # keep only links that start with /activity/
data = soup(urlopen(base_url.format(1)).read(), 'lxml')
page_numbers = [i.text for i in data.find_all('a', {'class': 'p_num'})]
activities = {1: [i['href'] for i in data.find_all('a', {'href': activity_href})]}
for page in page_numbers:
    data = soup(urlopen(base_url.format(page)).read(), 'lxml')
    activities[int(page)] = [i['href'] for i in data.find_all('a', {'href': activity_href})]
print(activities)
Output:
{1: ['/activity/1079-one-day-kimono-rental-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/1079-one-day-kimono-rental-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/675-wifi-device-japan-kyoto/', '/activity/1031-arashiyama-rickshaw-tour-kyoto/', '/activity/657-day-trip-hiroshima-miyajima-kyoto/', '/activity/4774-4G-wifi-kyoto/', '/activity/2826-gionya-kimono-rental-kyoto/', '/activity/1464-kyoto-tower-admission-ticket-kyoto/', '/activity/2249-sagano-romantic-train-ticket-kyoto/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/3532-wifi-device-japan-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/1319-4g-wifi-device-kyoto/', '/activity/1447-wi-ho-japan-wifi-device-kyoto/', '/activity/3826-wifi-device-japan-kyoto/', '/activity/2699-japan-wifi-device-taiwan-kyoto/', '/activity/3652-wifi-device-singapore-kyoto/', '/activity/1122-wi-ho-japan-wifi-device-kyoto/', '/activity/719-japan-docomo-sim-card-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/6241-nanzen-ji-fushimi-inari-taisha-sagano-romantic-train-day-tour/', '/activity/5137-guenpin-fugu-restaurant-kyoto/'], 2: ['/activity/1079-one-day-kimono-rental-kyoto/', 
'/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/6543-arashiyama-golden-pavilion-temple-todaiji-kobe-mosaic-day-tour-kyoto/', '/activity/5198-nanzenji-junsei-restaurant-kyoto/', '/activity/7877-hanami-kimono-rental-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/9915-kyoto-osaka-sightseeing-pass-kyoto-japan/', '/activity/883-geisha-districts-tour-kyoto/', '/activity/1097-gion-kimono-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/9272-4g-data-daijobu-sim-card-kyoto/', '/activity/871-sake-brewery-visit-fushimi-inari-shrine-kyoto/', '/activity/5979-tower-terrace-kyoto/', '/activity/632-kyoto-backstreet-cycling/', '/activity/646-kyoto-afternoon-exploration/', '/activity/640-kyoto-morning-sightseeing/', '/activity/872-arashiyama-bamboo-forest-half-day-tour-kyoto/', '/activity/5272-mukadeya-kyoto/', '/activity/6081-one-night-in-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/5445-kimono-photo-shoot-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/7096-japan-prepaid-sim-card-kyoto/'], 3: ['/activity/1079-one-day-kimono-rental-kyoto/', 
'/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/5271-itoh-dining-kyoto/', '/activity/9094-sagano-sightseeing-carriage-tour-kyoto/', '/activity/8192-japan-sim-card-taiwan-airport-pickup-kyoto/', '/activity/8420-south-korea-wifi-device-kyoto/', '/activity/8644-rock-climbing-at-kyoto-konpirayama-kyoto /', '/activity/9934-3g-4g-wifi-mnl-pick-up-delivery-for-japan-kyoto/', '/activity/8966-donburi-cooking-course-and-nishiki-market-tour-kyoto/', '/activity/9215-arashiyama-kyoto-food-drink-half-day-tour/']}
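The output above contains repeated links, since the same activity can appear more than once on a page. If unique links are wanted (my assumption about the goal, not part of the original answer), one option is to de-duplicate while keeping first-seen order. A sketch on invented HTML, reusing the same re.compile("^/activity/") filter:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page fragment; the hrefs follow the shape shown in the output above.
html = """
<a href="/activity/1079-one-day-kimono-rental-kyoto/">Kimono</a>
<a href="/city/30-kyoto/?p=2">2</a>
<a href="/activity/1777-icoca-ic-card-kyoto/">ICOCA</a>
<a href="/activity/1079-one-day-kimono-rental-kyoto/">Kimono again</a>
"""

soup = BeautifulSoup(html, "html.parser")
activity_href = re.compile("^/activity/")

hrefs = [a["href"] for a in soup.find_all("a", href=activity_href)]
unique_hrefs = list(dict.fromkeys(hrefs))   # dicts preserve insertion order (Python 3.7+)

print(unique_hrefs)
```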