Scraping Issue. Why is no data being scraped? - python

import requests
from bs4 import BeautifulSoup
URL = 'https://www.colonialzone-dr.com/c-dominicanismos-dictionary'
page = requests.get(URL)
print("testing")
soup = BeautifulSoup(page.content, 'html.parser')
words = soup.find_all('p', class_="entry-content")
print(len(words))
for word in words:
    print(word.text)
Nothing is displayed on my console, and the length variable is 0, which means nothing is being scraped.

If you look at the HTML, there is a div that contains all the p tags, so you can select that div by its class and then take the p tags from it to get your output:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.colonialzone-dr.com/c-dominicanismos-dictionary'
page = requests.get(URL)
print("testing")
soup = BeautifulSoup(page.content, 'html.parser')
main_div = soup.find('div', attrs={"class":"entry-content"})
words = main_div.find_all("p")
for word in words:
    print(word.text)
Output:
testing
The slang used in Dominican Republic.
C – ce
*Caballo – person similar to a tigre but a little more decent
*Cabron – a large male goat,also means displeased
*Cacaito – candy
....
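The same selection can be written more compactly with a CSS selector; a minimal sketch of the equivalent one-liner, reusing the soup object from above (select() is standard BeautifulSoup):
# select all <p> tags that are descendants of the div with class "entry-content"
for word in soup.select('div.entry-content p'):
    print(word.text)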

There are no p tags with the class entry-content. Remove that argument and it should work:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.colonialzone-dr.com/c-dominicanismos-dictionary'
page = requests.get(URL)
print("testing")
soup = BeautifulSoup(page.content, 'html.parser')
words = soup.find_all('p')
print(len(words))
for word in words:
    print(word.text)
But if you want the content of the p tags inside the div with class "entry-content", then:
entryContent = soup.find('div', attrs={"class":"entry-content"})
words = entryContent.find_all("p")
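One caveat: find() returns None when nothing matches, so chaining find_all() onto it can raise an AttributeError. A small defensive sketch:
entryContent = soup.find('div', attrs={"class":"entry-content"})
if entryContent is not None:
    words = entryContent.find_all("p")
else:
    words = []  # the div was not found, so there is nothing to scrape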

You are passing class_="entry-content" as an additional argument to find_all, but no p tag has that class, so there is no need to pass it:
words = soup.find_all('p', class_="entry-content")
Try this instead:
words = soup.find_all('p')
It will get all the p content and give you a non-zero length:
print(len(words))
I hope this helps.
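A quick way to see why the original call returned nothing is to ask which tags actually carry the class; a sketch against the same soup object (on this page it should print the div, not any p tags):
# prints the set of tag names that have class="entry-content"
print({tag.name for tag in soup.find_all(class_="entry-content")})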

Related

Trying to remove tags with Python (BeautifulSoup)

This is my code:
import requests
from bs4 import BeautifulSoup
url = "https://www.example.com/"
result = requests.get(url)
soup = BeautifulSoup(result.text, "html.parser")
find_by_class = soup.find('div', attrs={"class":"class_name"}).find_all('p')
I want to print the data without the html tags, but I can't use get_text() after the find_all('p').
You could use a for loop, like so:
for i in soup.find('div', attrs={"class":"class_name"}).find_all('p'):
    print(i.get_text())
or if you want to save that information, put it into an array:
things = []
for i in soup.find('div', attrs={"class":"class_name"}).find_all('p'):
    things.append(i.get_text())
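Equivalently, the same collection can be written as a list comprehension; a minimal sketch:
# collect the text of every <p> inside the div in one expression
things = [i.get_text() for i in soup.find('div', attrs={"class":"class_name"}).find_all('p')]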

How to get just links of articles in list using BeautifulSoup

Hey guys, so I got as far as being able to add the a tags to a list. The problem is I just want the href link to be added to the links_with_text list and not the entire a tag. What am I doing wrong?
from bs4 import BeautifulSoup
from requests import get
import requests
URL = "https://news.ycombinator.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id = 'hnmain')
articles = results.find_all(class_="title")
links_with_text = []
for article in articles:
    link = article.find('a', href=True)
    links_with_text.append(link)
print('\n'.join(map(str, links_with_text)))
This prints exactly how I want the list to print, but I just want the href from every a tag, not the entire tag. Thank you.
To get all links from https://news.ycombinator.com, you can use the CSS selector 'a.storylink'.
For example:
from bs4 import BeautifulSoup
from requests import get
import requests
URL = "https://news.ycombinator.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
links_with_text = []
for a in soup.select('a.storylink'):   # <-- find all <a> with class="storylink"
    links_with_text.append(a['href'])  # <-- note the ['href']
print(*links_with_text, sep='\n')
Prints:
https://blog.mozilla.org/futurereleases/2020/06/18/introducing-firefox-private-network-vpns-official-product-the-mozilla-vpn/
https://mxb.dev/blog/the-return-of-the-90s-web/
https://github.blog/2020-06-18-introducing-github-super-linter-one-linter-to-rule-them-all/
https://www.sciencemag.org/news/2018/11/why-536-was-worst-year-be-alive
https://www.strongtowns.org/journal/2020/6/16/do-the-math-small-projects
https://devblogs.nvidia.com/announcing-cuda-on-windows-subsystem-for-linux-2/
https://lwn.net/SubscriberLink/822568/61d29096a4012e06/
https://imil.net/blog/posts/2020/fakecracker-netbsd-as-a-function-based-microvm/
https://jepsen.io/consistency
https://tumblr.beesbuzz.biz/post/621010836277837824/advice-to-young-web-developers
https://archive.org/search.php?query=subject%3A%22The+Navy+Electricity+and+Electronics+Training+Series%22&sort=publicdate
https://googleprojectzero.blogspot.com/2020/06/ff-sandbox-escape-cve-2020-12388.html?m=1
https://apnews.com/1da061ce00eb531291b143ace0eed1c9
https://support.apple.com/library/content/dam/edam/applecare/images/en_US/appleid/android-apple-music-account-payment-none.jpg
https://standpointmag.co.uk/issues/may-june-2020/the-healing-power-of-birdsong/
https://steveblank.com/2020/06/18/the-coming-chip-wars-of-the-21st-century/
https://www.videolan.org/security/sb-vlc3011.html
https://onesignal.com/careers/2023b71d-2f44-4934-a33c-647855816903
https://www.bbc.com/news/world-europe-53006790
https://github.com/efficient/HOPE
https://everytwoyears.org/
https://www.historytoday.com/archive/natural-histories/intelligence-earthworms
https://cr.yp.to/2005-590/powerpc-cwg.pdf
https://quantum.country/
http://www.crystallography.net/cod/
https://parkinsonsnewstoday.com/2020/06/17/tiny-magnetically-powered-implant-may-be-future-of-deep-brain-stimulation/
https://spark.apache.org/releases/spark-release-3-0-0.html
https://arxiv.org/abs/1712.09624
https://www.washingtonpost.com/technology/2020/06/18/data-privacy-law-sherrod-brown/
https://blog.chromium.org/2020/06/improving-chromiums-browser.html
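If you would rather keep the original find/find_all structure from the question, a minimal change (a sketch reusing the question's articles and links_with_text variables) is to append only the href attribute and skip cells where find() returns None. Note that Hacker News has changed its markup over time, so class names like "storylink" may need re-checking against the live page.
for article in articles:
    link = article.find('a', href=True)
    if link is not None:                     # some "title" cells contain no link
        links_with_text.append(link['href'])  # keep just the URL, not the whole tag
print('\n'.join(links_with_text))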

Python Scraping empty tag

I have a problem with scraping some elements from a page:
https://tuning-tec.com/mercedes_w164_ml_mklasa_0507_black_led_seq_lpmed0-5789i
code:
import requests
from bs4 import BeautifulSoup
URL="https://tuning-tec.com/mercedes_w164_ml_mklasa_0507_black_led_seq_lpmed0-5789i"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(class_="product_cart_title").text
price = soup.find(class_="icon_main_block_price_a")
number = soup.find(class_="product_cart_info").findAll('tr')[1].findAll('td')[1]
description = soup.find(id="tab_a")
print(description)
The problem is when I want to get to tab_a, because this inner div is empty:
<div align="left" class="product_cart_info" id="charlong_id">
</div>
How can I get it? I think it is about JavaScript; maybe there is some delay when the page loads?
As stated in the comments, the info is loaded via JavaScript, so BeautifulSoup doesn't see it. But if you look at the Chrome/Firefox network tab, you can see where the page is making requests:
import re
import requests
from bs4 import BeautifulSoup
url = 'https://tuning-tec.com/mercedes_w164_ml_mklasa_0507_black_led_seq_lpmed0-5789i'
ajax_url = 'https://tuning-tec.com/_template/_show_normal/_show_charlong.php?itemId={}'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
print(soup.select_one('.product_cart_title').get_text(strip=True))
print(soup.select_one('.icon_main_block_price_a').get_text(strip=True))
print(soup.select_one('td:contains("Symbol") ~ td').get_text(strip=True))
item_id = re.findall(r"ajax_update_stat\('(\d+)'\)", soup.text)[0]
soup2 = BeautifulSoup(requests.get(ajax_url.format(item_id)).content, 'html.parser')
print()
# just print some info:
for tr in soup2.select('tr'):
    print(re.sub(r' {2,}', ' ', tr.select_one('td').get_text(strip=True, separator=' ')))
Prints:
MERCEDES W164 ML M-KLASA 05-07 BLACK LED SEQ
1788.62 PLN
LPMED0
PL
Opis
Lampy soczewkowe ze światłem pozycyjnym LED. Z dynamicznym kierunkowskazem. 100% nowe, w komplecie (lewa i prawa). Homologacja: norma E13 - dopuszczone do ruchu.
Szczegóły
Światła pozycyjne: DIODY Kierunkowskaz: DIODY Światła mijania: H9 w zestawie Światła drogowe: H1 w zestawie Regulacja: elektryczna (silniczek znajduje się w komplecie).
LED TUBE LIGHT Dynamic Turn Signal >>
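Since this approach makes two requests to the same host, a requests.Session can reuse the connection; a small sketch on top of the code above, using the same url and ajax_url:
# sketch: one HTTP session for both the page and the AJAX endpoint
with requests.Session() as s:
    soup = BeautifulSoup(s.get(url).content, 'html.parser')
    item_id = re.findall(r"ajax_update_stat\('(\d+)'\)", soup.text)[0]
    soup2 = BeautifulSoup(s.get(ajax_url.format(item_id)).content, 'html.parser')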
A little change to the description part; I don't know if it's working, have a look at the following code:
import re
import requests
from bs4 import BeautifulSoup
url = 'https://tuning-tec.com/mercedes_w164_ml_mklasa_0507_black_led_seq_lpmed0-5789i'
ajax_url = 'https://tuning-tec.com/_template/_show_normal/_show_charlong.php?itemId={}'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
def unwrapElements(soup, elementsToFind):
    elements = soup.find_all(elementsToFind)
    for element in elements:
        element.unwrap()
print(soup.select_one('.product_cart_title').get_text(strip=True))
print(soup.select_one('.icon_main_block_price_a').get_text(strip=True))
print(soup.select_one('td:contains("Symbol") ~ td').get_text(strip=True))
item_id = re.findall(r"ajax_update_stat\('(\d+)'\)", soup.text)[0]
soup2 = BeautifulSoup(requests.get(ajax_url.format(item_id)).content, 'html.parser')
description = soup2.findAll('tr')[2].findAll('td')[1]
description.append(soup2.findAll('tr')[4].findAll('td')[1])
unwrapElements(description, "td")
unwrapElements(description, "font")
unwrapElements(description, "span")
print(description)
I need just these elements of the description, in English. Will that be OK? And anyway, thanks for the help!! Only one thing: I don't know why it didn't remove all of them.
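If the goal is plain text rather than cleaned-up HTML, get_text() may replace the unwrap() calls entirely; a sketch against the same soup2:
# flatten a cell straight to text instead of unwrapping td/font/span one by one
description = soup2.findAll('tr')[2].findAll('td')[1]
print(description.get_text(' ', strip=True))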

How to crawl href - Python & beautifulsoup

I am currently crawling a web page (https://www.klook.com/city/30-kyoto/?p=1) using Python 3.4 and bs4 in order to collect the deeplinks of the respective activities.
I found that the links are located in the html source like this:
<a class="j_activity_item_link" href="/activity/1031-arashiyama-rickshaw-tour-kyoto/" class="j_activity_item_link" data-card-tags="{}" data-sold-out="false" data-price="40.0" data-city-id="30" data-id="1031" data-url-seo="arashiyama-rickshaw-tour-kyoto">
But after several trials, this href="/activity/1031-arashiyama-rickshaw-tour-kyoto/" never shows up.
Here is my logic so far:
import requests
from bs4 import BeautifulSoup
user_agent = {'User-agent': 'Chrome/43.0.2357'}
for page in range(1,6):
    r = requests.get("https://www.klook.com/city/30-kyoto" + "/?p=" + str(page))
    soup = BeautifulSoup(r.content, "lxml")
    g_data = soup.find_all("a", {"class": "j_activity_item_link"})
    for item in g_data:
        Deeplink = item.find_all("a")
        for t in Deeplink:
            print(t.get("href"))
Output:
Process finished with exit code 0
Could you guys help me out? Any feedback is appreciated.
Your "error" of error code 0 simply indicates that everything went ok with your run. According to your example, your list g_data should contain all of the a tags that you are interested in. You should not need the second for loop to again iterate through and find nested a tags. As a debugging step, print the length of your lists to ensure that they are not empty. See the following:
import requests
from bs4 import BeautifulSoup
user_agent = {'User-agent': 'Chrome/43.0.2357'}
for page in range(1,6):
    r = requests.get("https://www.klook.com/city/30-kyoto" + "/?p=" + str(page))
    soup = BeautifulSoup(r.content, "lxml")
    g_data = soup.find_all("a", {"class": "j_activity_item_link"})
    for item in g_data:
        print(item.get("href"))
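One more detail: the user_agent dict is defined but never sent. If the site blocks the default requests User-Agent, pass it explicitly as headers (a sketch; the header value is taken from the question):
r = requests.get("https://www.klook.com/city/30-kyoto" + "/?p=" + str(page), headers=user_agent)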
You can first find the number of pages of activities, and then use regex with BeautifulSoup:
import re
import urllib.request
from bs4 import BeautifulSoup as soup
data = soup(str(urllib.request.urlopen('https://www.klook.com/city/30-kyoto/?p=1').read()), 'lxml')
page_numbers = [i.text for i in data.find_all('a', {'class':'p_num '})]
activities = {1:[i['href'] for i in data.find_all('a', {'href':re.compile("^/activity/")})]}
for page in page_numbers:
    data = soup(str(urllib.request.urlopen('https://www.klook.com/city/30-kyoto/?p={}'.format(page)).read()), 'lxml')
    activities[int(page)] = [i['href'] for i in data.find_all('a', {'href':re.compile("^/activity/")})]
Output:
{1: ['/activity/1079-one-day-kimono-rental-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/1079-one-day-kimono-rental-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/675-wifi-device-japan-kyoto/', '/activity/1031-arashiyama-rickshaw-tour-kyoto/', '/activity/657-day-trip-hiroshima-miyajima-kyoto/', '/activity/4774-4G-wifi-kyoto/', '/activity/2826-gionya-kimono-rental-kyoto/', '/activity/1464-kyoto-tower-admission-ticket-kyoto/', '/activity/2249-sagano-romantic-train-ticket-kyoto/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/3532-wifi-device-japan-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/1319-4g-wifi-device-kyoto/', '/activity/1447-wi-ho-japan-wifi-device-kyoto/', '/activity/3826-wifi-device-japan-kyoto/', '/activity/2699-japan-wifi-device-taiwan-kyoto/', '/activity/3652-wifi-device-singapore-kyoto/', '/activity/1122-wi-ho-japan-wifi-device-kyoto/', '/activity/719-japan-docomo-sim-card-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/6241-nanzen-ji-fushimi-inari-taisha-sagano-romantic-train-day-tour/', '/activity/5137-guenpin-fugu-restaurant-kyoto/'], 2: ['/activity/1079-one-day-kimono-rental-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/6543-arashiyama-golden-pavilion-temple-todaiji-kobe-mosaic-day-tour-kyoto/', '/activity/5198-nanzenji-junsei-restaurant-kyoto/', '/activity/7877-hanami-kimono-rental-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/9915-kyoto-osaka-sightseeing-pass-kyoto-japan/', '/activity/883-geisha-districts-tour-kyoto/', '/activity/1097-gion-kimono-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/9272-4g-data-daijobu-sim-card-kyoto/', '/activity/871-sake-brewery-visit-fushimi-inari-shrine-kyoto/', '/activity/5979-tower-terrace-kyoto/', '/activity/632-kyoto-backstreet-cycling/', '/activity/646-kyoto-afternoon-exploration/', '/activity/640-kyoto-morning-sightseeing/', '/activity/872-arashiyama-bamboo-forest-half-day-tour-kyoto/', 
'/activity/5272-mukadeya-kyoto/', '/activity/6081-one-night-in-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/5445-kimono-photo-shoot-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/7096-japan-prepaid-sim-card-kyoto/'], 3: ['/activity/1079-one-day-kimono-rental-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/5271-itoh-dining-kyoto/', '/activity/9094-sagano-sightseeing-carriage-tour-kyoto/', '/activity/8192-japan-sim-card-taiwan-airport-pickup-kyoto/', '/activity/8420-south-korea-wifi-device-kyoto/', '/activity/8644-rock-climbing-at-kyoto-konpirayama-kyoto /', '/activity/9934-3g-4g-wifi-mnl-pick-up-delivery-for-japan-kyoto/', '/activity/8966-donburi-cooking-course-and-nishiki-market-tour-kyoto/', '/activity/9215-arashiyama-kyoto-food-drink-half-day-tour/']}
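As the output shows, the pages repeat many activities, so the lists contain duplicates. A short sketch to flatten the activities dict into unique links while preserving order:
seen = set()
unique_links = []
for links in activities.values():
    for href in links:
        if href not in seen:  # keep only the first occurrence of each href
            seen.add(href)
            unique_links.append(href)
print(len(unique_links), 'unique links')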

After finding all the links and texts using find_all in Beautiful Soup, how do you grab the one that you need?

My example
from bs4 import BeautifulSoup
import requests
result = requests.get("https://pythonprogramming.net/parsememcparseface/")
c = result.content
soup = BeautifulSoup(c,'lxml')
patch_name = soup.find_all(["a", "p"])
u = soup.get_text()
print(u)
How do I get the text I need for I can store it in a variable for later usage.
This will return a list of a and p tags:
patch_name = soup.find_all(["a", "p"])
You can get all the text from the list:
[tag.get_text() for tag in patch_name]
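From there you can filter the list down to the entry you want and store it in a variable; a sketch, where the substring "Python" is just an example filter:
texts = [tag.get_text() for tag in patch_name]
wanted = [t for t in texts if "Python" in t]  # keep only entries mentioning "Python"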
