Python Scraping empty tag

I have a problem scraping some elements from this page:
https://tuning-tec.com/mercedes_w164_ml_mklasa_0507_black_led_seq_lpmed0-5789i
Code:
import requests
from bs4 import BeautifulSoup
URL="https://tuning-tec.com/mercedes_w164_ml_mklasa_0507_black_led_seq_lpmed0-5789i"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
title=soup.find(class_="product_cart_title").text
price=soup.find(class_="icon_main_block_price_a")
number=soup.find(class_="product_cart_info").findAll('tr')[1].findAll('td')[1]
description=soup.find(id="tab_a")
print(description)
The problem is the element with id tab_a. Inside it, the div
<div align="left" class="product_cart_info" id="charlong_id">
</div>
is empty. How can I get its content?
I think it is filled in by JavaScript. Maybe there is some delay when the page loads?

As stated in the comments, the info is loaded via JavaScript, so BeautifulSoup doesn't see it. But if you look at the Network tab in the Chrome/Firefox developer tools, you can see where the page is making requests:
import re
import requests
from bs4 import BeautifulSoup
url = 'https://tuning-tec.com/mercedes_w164_ml_mklasa_0507_black_led_seq_lpmed0-5789i'
ajax_url = 'https://tuning-tec.com/_template/_show_normal/_show_charlong.php?itemId={}'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
print(soup.select_one('.product_cart_title').get_text(strip=True))
print(soup.select_one('.icon_main_block_price_a').get_text(strip=True))
print(soup.select_one('td:contains("Symbol") ~ td').get_text(strip=True))
item_id = re.findall(r"ajax_update_stat\('(\d+)'\)", soup.text)[0]
soup2 = BeautifulSoup(requests.get(ajax_url.format(item_id)).content, 'html.parser')
print()
# just print some info:
for tr in soup2.select('tr'):
    print(re.sub(r' {2,}', ' ', tr.select_one('td').get_text(strip=True, separator=' ')))
Prints:
MERCEDES W164 ML M-KLASA 05-07 BLACK LED SEQ
1788.62 PLN
LPMED0
PL
Opis
Lampy
soczewkowe ze światłem
pozycyjnym LED. Z dynamicznym
kierunkowskazem. 100% nowe, w komplecie
(lewa i prawa). Homologacja: norma E13 -
dopuszczone do ruchu.
Szczegóły
Światła pozycyjne: DIODY Kierunkowskaz: DIODY Światła
mijania: H9 w
zestawie Światła
drogowe: H1 w
zestawie Regulacja: elektryczna (silniczek znajduje się w
komplecie).
LED TUBE LIGHT Dynamic Turn Signal >>
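
The key step in the answer above is pulling the item id out of the page source with a regular expression and substituting it into the AJAX URL. Here is a minimal offline sketch of just that step. The `sample` string and the id `5789` are hypothetical stand-ins for the real page source, and `extract_item_id` is a helper name introduced here for illustration only.

```python
import re

# URL template taken from the answer above.
AJAX_URL = "https://tuning-tec.com/_template/_show_normal/_show_charlong.php?itemId={}"


def extract_item_id(page_text):
    """Return the numeric id passed to ajax_update_stat('...'), or None if absent."""
    match = re.search(r"ajax_update_stat\('(\d+)'\)", page_text)
    return match.group(1) if match else None


# Hypothetical snippet standing in for the real page source.
sample = "<script>ajax_update_stat('5789');</script>"

item_id = extract_item_id(sample)
if item_id is not None:
    print(AJAX_URL.format(item_id))
else:
    print("No item id found; the page layout may have changed.")
```

Using `re.search` with a `None` check, rather than `re.findall(...)[0]`, avoids an IndexError when the script call is missing from the page.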

I changed the description part a little. I'm not sure whether it works correctly; have a look at the following code:
import re
import requests
from bs4 import BeautifulSoup
url = 'https://tuning-tec.com/mercedes_w164_ml_mklasa_0507_black_led_seq_lpmed0-5789i'
ajax_url = 'https://tuning-tec.com/_template/_show_normal/_show_charlong.php?itemId={}'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
def unwrapElements(soup, elementsToFind):
    elements = soup.find_all(elementsToFind)
    for element in elements:
        element.unwrap()
print(soup.select_one('.product_cart_title').get_text(strip=True))
print(soup.select_one('.icon_main_block_price_a').get_text(strip=True))
print(soup.select_one('td:contains("Symbol") ~ td').get_text(strip=True))
item_id = re.findall(r"ajax_update_stat\('(\d+)'\)", soup.text)[0]
soup2 = BeautifulSoup(requests.get(ajax_url.format(item_id)).content, 'html.parser')
description=soup2.findAll('tr')[2].findAll('td')[1]
description.append(soup2.findAll('tr')[4].findAll('td')[1])
unwrapElements(description, "td")
unwrapElements(description, "font")
unwrapElements(description, "span")
print(description)
I need just these elements of the description, in English. Will this be OK?
Anyway, thanks for the help!
The only thing I don't understand is why it didn't remove all of them.

Related

bs4: splitting text with same class - python

I am web scraping for the first time, and ran into a problem: some classes have the same name.
This is the code:
testlink = 'https://www.ah.nl/producten/product/wi387906/wasa-volkoren'
r = requests.get(testlink)
soup = BeautifulSoup(r.content, 'html.parser')
products = (soup.findAll('dd', class_='product-info-definition-list_value__kspp6'))
And this is the output
[<dd class="product-info-definition-list_value__kspp6">13 g</dd>, <dd class="product-info-definition-list_value__kspp6">20</dd>, <dd class="product-info-definition-list_value__kspp6">Rogge, Glutenbevattende Granen</dd>, <dd class="product-info-definition-list_value__kspp6">Sesamzaad, Melk</dd>]
I need to get the third element (Rogge, Glutenbevattende Granen). I am using this link to test, and eventually want to scrape multiple pages of the website. Any tips?
Thank you!

You can select all of the dd tags with the class product-info-definition-list_value__kspp6 and then use list slicing:
import requests
from bs4 import BeautifulSoup
url='https://www.ah.nl/producten/pasta-rijst-en-wereldkeuken?page={page}'
for page in range(1,11):
    req = requests.get(url.format(page=page))
    soup = BeautifulSoup(req.content, 'html.parser')
    for link in soup.select('div[class="product-card-portrait_content__2xN-b"] a'):
        abs_url = 'https://www.ah.nl' + link.get('href')
        #print(abs_url)
        req2 = requests.get(abs_url)
        soup2 = BeautifulSoup(req2.content, 'html.parser')
        dd = [d.get_text() for d in soup2.select('dd[class="product-info-definition-list_value__kspp6"]')][2:-2]
        print(dd)
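
Whether the slice `[2:-2]` returns the wanted value depends on how many dd elements the page contains. As a small sketch, using only the four values shown in the question's output (the real page may hold more), plain indexing picks the third entry while the slice comes back empty:

```python
# The four dd texts from the question's output; the live page may differ.
values = ["13 g", "20", "Rogge, Glutenbevattende Granen", "Sesamzaad, Melk"]

# Indexing returns the third element directly (guarded against short lists).
third = values[2] if len(values) > 2 else None
print(third)

# With exactly four items, [2:-2] starts and stops at index 2, so it is empty.
print(values[2:-2])
```

If only the third value is needed, indexing with a length check is less sensitive to extra or missing entries than a two-sided slice.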

Scraping Issue. Why no data is being scraped?

import requests
from bs4 import BeautifulSoup
URL = 'https://www.colonialzone-dr.com/c-dominicanismos-dictionary'
page = requests.get(URL)
print("testing")
soup = BeautifulSoup(page.content, 'html.parser')
words = soup.find_all('p', class_="entry-content")
print(len(words))
for word in words:
    print(word.text)
Nothing is displayed on my console, and len(words) returns 0, which means nothing is being scraped.

If you look at the HTML, all of the p tags sit inside a div. Select that div by its class, then take the p tags from it, and you will get your output:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.colonialzone-dr.com/c-dominicanismos-dictionary'
page = requests.get(URL)
print("testing")
soup = BeautifulSoup(page.content, 'html.parser')
main_div = soup.find('div', attrs={"class":"entry-content"})
words=main_div.find_all("p")
for word in words:
    print(word.text)
Output:
testing
The slang used in Dominican Republic.
C – ce
*Caballo – person similar to a tigre but a little more decent
*Cabron – a large male goat,also means displeased
*Cacaito – candy
....

There are no p tags with the class entry-content. Remove the class filter and it should work:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.colonialzone-dr.com/c-dominicanismos-dictionary'
page = requests.get(URL)
print("testing")
soup = BeautifulSoup(page.content, 'html.parser')
words = soup.find_all('p')
print(len(words))
for word in words:
    print(word.text)
But if you want the content of the p tags inside the div with class "entry-content", then:
entryContent=soup.find('div',attrs={"class":"entry-content"})
words=entryContent.find_all("p")

You are passing the extra argument class_="entry-content" to find_all, but there is no need for it:
words = soup.find_all('p', class_="entry-content")
Instead of that, try:
words = soup.find_all('p')
This returns every p tag on the page, and the following line reports how many were found:
print(len(words))
I hope this helps.

Getting only numbers from BeautifulSoup instead of whole div

I am trying to learn Python by creating a small web scraping program to make life easier, but I am having trouble getting only the number when using BS4. I was able to get the price when I scraped an individual ad, but I would like to get all the prices from the page.
Here is my code:
from bs4 import BeautifulSoup
import requests
prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print (result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print ("CLEARING")
price = soup.findAll("div", class_="price")
prices.append(price)
print (prices)
Here is my output
[<div class="price">
$46,999.00
<div class="dealer-logo">
<div class="dealer-logo-image">
<img src="https://i.ebayimg.com/00/s/NjBYMTIw/z/xMQAAOSwi9ZfoW7r/$_69.PNG"/>
</div>
</div>
</div>
Ideally, I would only want the output to be "46,999.00".
I tried with text=True, although this did not work and I would not get any output from it besides an empty list.
Thank you

You need to get the text portion of the tag and then do some regex processing on it.
import re
def get_price_from_div(div_item):
    # Keep only digits and the decimal point, e.g. "$46,999.00" -> "46999.00"
    str_price = re.sub(r'[^0-9.]', '', div_item.text)
    float_price = float(str_price)
    return float_price
Just call this function in your code after you find the divs:
from bs4 import BeautifulSoup
import requests
prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print (result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print ("CLEARING")
price = soup.findAll("div", class_="price")
prices.extend([get_price_from_div(curr_div) for curr_div in price])
print (prices)
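
One thing to watch with the regex approach: if a listing's price text contains no digits, the cleaned string is empty and float() raises ValueError. Below is a hedged sketch of a tolerant variant that works on plain strings, so it can be tried without a network request. `parse_price` is a name introduced here, and the "Please Contact" text is a hypothetical example of a non-numeric listing, not something confirmed by the page.

```python
import re


def parse_price(text):
    """Return the price in text as a float, or None if no number can be parsed."""
    # Drop everything except digits and the decimal point.
    cleaned = re.sub(r"[^0-9.]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        # Empty or malformed strings (for example, no digits at all) end up here.
        return None


print(parse_price("\n$46,999.00\n"))
print(parse_price("Please Contact"))  # hypothetical non-numeric listing text
```

In the answer's code this could be applied as `parse_price(curr_div.text)`, with the None results filtered out afterwards.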

An option without regex is to keep only the tags whose text startswith() a dollar sign $, and then strip that character:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
price_tags = soup.find_all("div", class_="price")
prices = [
    tag.get_text(strip=True)[1:] for tag in price_tags
    if tag.get_text(strip=True).startswith('$')
]
print(prices)
Output:
['48,888.00', '21,999.00', '44,488.00', '5,500.00', '33,000.00', '14,900.00', '1,750.00', '35,600.00', '1,800.00', '25,888.00', '36,888.00', '32,888.00', '30,888.00', '18,888.00', '21,888.00', '29,888.00', '22,888.00', '30,888.00', '17,888.00', '17,888.00', '16,888.00', '22,888.00', '22,888.00', '34,888.00', '31,888.00', '32,888.00', '30,888.00', '21,888.00', '15,888.00', '21,888.00', '28,888.00', '19,888.00', '18,888.00', '30,995.00', '30,995.00', '30,995.00', '19,888.00', '47,995.00', '21,888.00', '46,995.00', '32,888.00', '29,888.00', '26,888.00', '21,888.00']

Web Scraping find not moving on to next item

from bs4 import BeautifulSoup
import requests
def kijiji():
    source = requests.get('https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274').text
    soup = BeautifulSoup(source,'lxml')
    b = soup.find('div', class_='price')
    for link in soup.find_all('a',class_ = 'title'):
        a = link.get('href')
        fulllink = 'http://kijiji.ca'+a
        print(fulllink)
        b = soup.find('div', class_='price')
        print(b.prettify())
kijiji()
The goal is to list all the different items sold on Kijiji and pair each one with its price.
But I can't find a way to advance what Beautiful Soup finds with a class of price, and I'm stuck with the first price. find_all doesn't work either, as it just prints out the whole blob instead of grouping each price with its item.

If you have Beautiful Soup 4.7.1 or above, you can use the CSS selector method select(), which may be faster.
Code:
import requests
from bs4 import BeautifulSoup
res=requests.get("https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274").text
soup=BeautifulSoup(res,'html.parser')
for item in soup.select('.info-container'):
    fulllink = 'http://kijiji.ca' + item.find_next('a', class_='title')['href']
    print(fulllink)
    price=item.select_one('.price').text.strip()
    print(price)
Or, to use find_all(), use the code block below:
import requests
from bs4 import BeautifulSoup
res=requests.get("https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274").text
soup=BeautifulSoup(res,'html.parser')
for item in soup.find_all('div',class_='info-container'):
    fulllink = 'http://kijiji.ca' + item.find_next('a', class_='title')['href']
    print(fulllink)
    price=item.find_next(class_='price').text.strip()
    print(price)
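
Both versions above build the full link by concatenating a fixed host with the href. As an alternative sketch (not part of the original answer), the standard library's `urllib.parse.urljoin` resolves an href against the page URL and also copes with hrefs that are already absolute. The href below is the path from a link shown in the results further down this page; `to_absolute` is a helper name introduced here.

```python
from urllib.parse import urljoin

# Listing page URL from the question.
PAGE_URL = "https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274"


def to_absolute(href):
    """Resolve an href found on PAGE_URL into an absolute URL."""
    return urljoin(PAGE_URL, href)


relative = "/v-mens-shoes/markham-york-region/jordan-4-oreo-2015/1485391828"
print(to_absolute(relative))

# An href that is already absolute is returned unchanged.
print(to_absolute("https://www.kijiji.ca/"))
```

Inside the loop this would replace the concatenation with something like `to_absolute(item.find_next('a', class_='title')['href'])`.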

Congratulations on finding the answer. I'll give you another solution for reference only.
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc
def kijiji():
    url = 'https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274'
    source = requests.get(url).text
    doc = SimplifiedDoc(source)
    infos = doc.getElements('div',attr='class',value='info-container')
    for info in infos:
        price = info.select('div.price>text()')
        a = info.select('a.title')
        link = doc.absoluteUrl(url,a.href)
        title = a.text
        print (price)
        print (link)
        print (title)
kijiji()
Result:
$310.00
https://www.kijiji.ca/v-mens-shoes/markham-york-region/jordan-4-oreo-2015/1485391828
Jordan 4 Oreo (2015)
$560.00
https://www.kijiji.ca/v-mens-shoes/markham-york-region/yeezy-boost-350-yecheil-reflectives/1486296645
Yeezy Boost 350 Yecheil Reflectives
...
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples

from bs4 import BeautifulSoup
import requests
def kijiji():
    source = requests.get('https://www.kijiji.ca/b-mens-shoes/markham-york-region/c15117001l1700274').text
    soup = BeautifulSoup(source,'lxml')
    b = soup.find('div', class_='price')
    for link in soup.find_all('a',class_ = 'title'):
        a = link.get('href')
        fulllink = 'http://kijiji.ca'+a
        print(fulllink)
        print(b.prettify())
        b = b.find_next('div', class_='price')
kijiji()
I was stuck on this for an hour, and as soon as I posted it on Stack I came up with an idea. The code is messy, but it works!

How to crawl href - Python & beautifulsoup

I am currently crawling a web page (https://www.klook.com/city/30-kyoto/?p=1) using Python 3.4 and bs4 in order to collect the deeplinks of the respective activities.
I found that the links are located in the html source like this:
<a class="j_activity_item_link" href="/activity/1031-arashiyama-rickshaw-tour-kyoto/" class="j_activity_item_link" data-card-tags="{}" data-sold-out="false" data-price="40.0" data-city-id="30" data-id="1031" data-url-seo="arashiyama-rickshaw-tour-kyoto">
But after several attempts, this href="/activity/1031-arashiyama-rickshaw-tour-kyoto/" never shows up.
Here is my logic so far:
import requests
from bs4 import BeautifulSoup
user_agent = {'User-agent': 'Chrome/43.0.2357'}
for page in range(1,6):
    r = requests.get("https://www.klook.com/city/30-kyoto" + "/?p=" + str(page))
    soup = BeautifulSoup(r.content, "lxml")
    g_data = soup.find_all("a", {"class": "j_activity_item_link"})
    for item in g_data:
        Deeplink = item.find_all("a")
        for t in Deeplink:
            print(t.get("href"))
Output:
Process finished with exit code 0
Could you guys help me out? Any feedback is appreciated.

The exit code 0 in your output is not an error; it simply indicates that the run finished without problems. According to your example, your list g_data should already contain all of the a tags that you are interested in. You do not need the second for loop to iterate again and find nested a tags. As a debugging step, print the length of your lists to ensure that they are not empty. See the following:
import requests
from bs4 import BeautifulSoup
user_agent = {'User-agent': 'Chrome/43.0.2357'}
for page in range(1,6):
    r = requests.get("https://www.klook.com/city/30-kyoto" + "/?p=" + str(page))
    soup = BeautifulSoup(r.content, "lxml")
    g_data = soup.find_all("a", {"class": "j_activity_item_link"})
    for item in g_data:
        print(item.get("href"))

You can first find the number of pages of activities, and then use regex with BeautifulSoup:
import re
# Python 3: urlopen lives in urllib.request (the original called urllib.urlopen without importing urllib)
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
data = soup(urlopen('https://www.klook.com/city/30-kyoto/?p=1').read(), 'lxml')
page_numbers = [i.text for i in data.find_all('a', {'class':'p_num '})]
# Page 1 is already loaded; collect every href that starts with /activity/
activities = {1:[i['href'] for i in data.find_all('a', {'href':re.compile("^/activity/")})]}
for page in page_numbers:
    data = soup(urlopen('https://www.klook.com/city/30-kyoto/?p={}'.format(page)).read(), 'lxml')
    activities[int(page)] = [i['href'] for i in data.find_all('a', {'href':re.compile("^/activity/")})]
Output:
{1: ['/activity/1079-one-day-kimono-rental-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/1079-one-day-kimono-rental-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/675-wifi-device-japan-kyoto/', '/activity/1031-arashiyama-rickshaw-tour-kyoto/', '/activity/657-day-trip-hiroshima-miyajima-kyoto/', '/activity/4774-4G-wifi-kyoto/', '/activity/2826-gionya-kimono-rental-kyoto/', '/activity/1464-kyoto-tower-admission-ticket-kyoto/', '/activity/2249-sagano-romantic-train-ticket-kyoto/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/3532-wifi-device-japan-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/1319-4g-wifi-device-kyoto/', '/activity/1447-wi-ho-japan-wifi-device-kyoto/', '/activity/3826-wifi-device-japan-kyoto/', '/activity/2699-japan-wifi-device-taiwan-kyoto/', '/activity/3652-wifi-device-singapore-kyoto/', '/activity/1122-wi-ho-japan-wifi-device-kyoto/', '/activity/719-japan-docomo-sim-card-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/6241-nanzen-ji-fushimi-inari-taisha-sagano-romantic-train-day-tour/', '/activity/5137-guenpin-fugu-restaurant-kyoto/'], 2: ['/activity/1079-one-day-kimono-rental-kyoto/', 
'/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/6543-arashiyama-golden-pavilion-temple-todaiji-kobe-mosaic-day-tour-kyoto/', '/activity/5198-nanzenji-junsei-restaurant-kyoto/', '/activity/7877-hanami-kimono-rental-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/9915-kyoto-osaka-sightseeing-pass-kyoto-japan/', '/activity/883-geisha-districts-tour-kyoto/', '/activity/1097-gion-kimono-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/9272-4g-data-daijobu-sim-card-kyoto/', '/activity/871-sake-brewery-visit-fushimi-inari-shrine-kyoto/', '/activity/5979-tower-terrace-kyoto/', '/activity/632-kyoto-backstreet-cycling/', '/activity/646-kyoto-afternoon-exploration/', '/activity/640-kyoto-morning-sightseeing/', '/activity/872-arashiyama-bamboo-forest-half-day-tour-kyoto/', '/activity/5272-mukadeya-kyoto/', '/activity/6081-one-night-in-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/5445-kimono-photo-shoot-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/7096-japan-prepaid-sim-card-kyoto/'], 3: ['/activity/1079-one-day-kimono-rental-kyoto/', 
'/activity/1032-higashiyama-rickshaw-tour-kyoto/', '/activity/6128-kyoto-seaside-day-tour-osaka/', '/activity/1540-hankyu-1-day-tourist-pass-osaka/', '/activity/1777-icoca-ic-card-kyoto/', '/activity/1541-kix-airport-limousine-bus-transfer-kyoto/', '/activity/1753-randen-kyoto-bus-subway-1-day-pass-kyoto/', '/activity/3260-sagano-romantic-train-ticket-kyoto/', '/activity/793-japanese-lzakaya-cooking-course-kyoto/', '/activity/882-nishiki-market-teramachi-street-kyoto/', '/activity/792-morning-bento-cooking-course-kyoto/', '/activity/2918-sushi-class-experience-kyoto/', '/activity/6032-ninja-kyoto-restaurant-labyrinth-kyoto/', '/activity/5215-garden-ryokan-nanzenji-yachiyo-kyoto/', '/activity/5271-itoh-dining-kyoto/', '/activity/9094-sagano-sightseeing-carriage-tour-kyoto/', '/activity/8192-japan-sim-card-taiwan-airport-pickup-kyoto/', '/activity/8420-south-korea-wifi-device-kyoto/', '/activity/8644-rock-climbing-at-kyoto-konpirayama-kyoto /', '/activity/9934-3g-4g-wifi-mnl-pick-up-delivery-for-japan-kyoto/', '/activity/8966-donburi-cooking-course-and-nishiki-market-tour-kyoto/', '/activity/9215-arashiyama-kyoto-food-drink-half-day-tour/']}
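
The output above contains the same activity link several times, both within a page and across pages. If only distinct links are needed, one possible post-processing step (a sketch, not part of the original answer) is to filter with the same `^/activity/` pattern and de-duplicate while keeping first-seen order. The short `hrefs` list is a hand-picked sample; the `/city/30-kyoto/` entry is included only to show the filter at work.

```python
import re

ACTIVITY = re.compile(r"^/activity/")

# Small sample of hrefs; three come from the output above, one is a non-activity link.
hrefs = [
    "/activity/1079-one-day-kimono-rental-kyoto/",
    "/city/30-kyoto/",
    "/activity/1079-one-day-kimono-rental-kyoto/",
    "/activity/1031-arashiyama-rickshaw-tour-kyoto/",
]

# dict.fromkeys keeps insertion order (Python 3.7+), so duplicates are dropped
# without reordering the remaining links.
unique = list(dict.fromkeys(h for h in hrefs if ACTIVITY.match(h)))
print(unique)
```

The same expression could be applied to the concatenated values of the `activities` dictionary to get one de-duplicated list for all pages.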
