How do I webscrape nested lists in Python?

How do I webscrape nested lists in Python? - python

Link of website: https://www.zivame.com/rosaline-chromaticity-knit-cotton-top-florida-key.html?trksrc=category&trkid=search&trkorder=relevance
What I want to scrape: Short sleeves style, Relaxed fit for comfort
(Basically the bullet points under Description)
This is the code I'm using currently:
from selenium import webdriver
import re
from bs4 import BeautifulSoup
import requests
result = requests.get("https://www.zivame.com/rosaline-chromaticity-knit-cotton-top-florida-key.html?trksrc=category&trkid=search&trkorder=relevance")
soup = BeautifulSoup(result.text, 'lxml')
page = soup.find('div', id="product-page")
description = page.find('div', id="product-basicdetail")
point1 = description.find('div', id="ff-rm text-size pd-b5")
print(point1)

The data is coming as JSON data, you can scrape the data from the source page directly.
import requests
from lxml import html
r = requests.get('https://www.zivame.com/rosaline-chromaticity-knit-cotton-top-florida-key.html?trksrc=category&trkid=search&trkorder=relevance')
source_page = html.fromstring(r.text)
json_value = source_page.xpath("//script[contains(.,'window.__product=')]/text()")[0]
json_value = json_value.split("{features:{values:[{list:[")[1].split("]}],count:1}}},modelMetaData:")[0]
print(json_value.split(','))

Related

how can I get the names in this html code by python?

I want to get both of names "Justin Cutroni" and "Krista Seiden" without the tags
this is my html code that I want to get the names by python3:
I used beautifulsoup but I don't know how to get deep in the html codes and get the names.
import requests
from bs4 import BeautifulSoup as bs
web_pages = ["https://maktabkhooneh.org/learn/"]
def find_lessons(web_page):
# Load the webpage content
r = requests.get(web_page)
# Convert to a beautiful soup object
soup = bs(r.content, features="html.parser")
table = soup.select('div[class="course-card__title"]')
data = [x.text.split(';')[-1].strip() for x in table]
return data
find_teachers(web_pages[0])

You are looking at course-card__title, when it appears you want is course-card__teacher. When you're using requests, it's often more useful to look at the real HTML (using wget or curl) rather than the object model, as in your image.
What you have pretty much works with that change:
import requests
from bs4 import BeautifulSoup as bs
web_pages = ["https://maktabkhooneh.org/learn/"]
def find_teachers(web_page):
# Load the webpage content
r = requests.get(web_page)
soup = bs(r.content, features="html.parser")
table = soup.select('div[class="course-card__teacher"]')
return [x.text.strip() for x in table]
print(find_teachers(web_pages[0]))

Trying to scrape Aliexpress

So I am trying to scrape the price of a product on Aliexpress. I tried inspecting the element which looks like
<span class="product-price-value" itemprop="price" data-spm-anchor-id="a2g0o.detail.1000016.i3.fe3c2b54yAsLRn">US $14.43</span>
I'm trying to run the following code
'''
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
url = 'https://www.aliexpress.com/item/32981494236.html?spm=a2g0o.productlist.0.0.44ba26f6M32wxY&algo_pvid=520e41c9-ba26-4aa6-b382-4aa63d014b4b&algo_expid=520e41c9-ba26-4aa6-b382-4aa63d014b4b-22&btsid=0bb0623b16170222520893504e9ae8&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_'
source = urlopen(url).read()
soup = BeautifulSoup(source, 'lxml')
soup.find('span', class_='product-price-value')
'''
but I keep getting a blank output. I must be doing something wrong but these methods seem to work in the tutorials I've seen.

So, what i got. As i understood right, the page what you gave, was recived by scripts, but in origin, it doesn't contain it, just script tags, so i just used split to get it. Here is my code:
from bs4 import BeautifulSoup
import requests
url = 'https://aliexpress.ru/item/1005002281350811.html?spm=a2g0o.productlist.0.0.42d53b59T5ddTM&algo_pvid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5&algo_expid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5-1&btsid=0b8b035c16170960366785062e33c0&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_&sku_id=12000019900010138'
data = requests.get(url)
soup = BeautifulSoup(data.content, features="lxml")
res = soup.findAll("script")
total_value = str(res[-3]).split("totalValue:")[1].split("}")[0].replace("\"", "").replace(".", "").strip()
print(total_value)
It works fine, i tried on few pages from Ali.

BS4 returns [] instead of the wanted HTML tag

I want to parse the given website and scrape the table. To me the code looks right. New to python and web parsing
import requests
from bs4 import BeautifulSoup
response = requests.get('https://delhifightscorona.in/')
doc = BeautifulSoup(response.text, 'lxml-xml')
cases = doc.find_all('div', {"class": "cell"})
print(cases)
doing this returns
[]

Change your parser and the class and there you have it.
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://delhifightscorona.in/').text, 'html.parser').find('div', {"class": "grid-x grid-padding-x small-up-2"})
print(soup.find("h3").getText())
Output:
423,831

You can choose to print only the cases or the total stats with the date.
import requests
from bs4 import BeautifulSoup
response = requests.get('https://delhifightscorona.in/')
doc = BeautifulSoup(response.text, 'html.parser')
stats = doc.find('div', {"class": "cell medium-5"})
print(stats.text) #Print the whole block with dates and the figures
cases = stats.find('h3')
print(cases.text) #Print the cases only

python BeautifulSoup4 - why does result print 5 times?

I'm trying to do some web scraping in python using BeautifulSoup 4.
I am trying to scrape the salary of an public employee. I am doing that successfully but the result is returned 5 times and I cannot figure out why.
Here is the website I am scraping: https://data.richmond.com/salaries/2018/state/university-of-virginia/tony-bennett
Here is my code example:
import requests
from bs4 import BeautifulSoup
source = requests.get(f'https://data.richmond.com/salaries/2018/state/university-of-virginia/tony-bennett')
soup = BeautifulSoup(source.text, 'html.parser')
main_box = soup.find_all('div')
for i in main_box:
try:
x = i.find('div', class_='col-12 col-lg-4 pay')
z = x.find('h2').text
print(z)
except Exception:
pass
And my results are:
$525,000
$525,000
$525,000
$525,000
$525,000
This is the correct salary, but as I said the results print 5 times.
If I go to the page, right click, and 'inspect' I find the class I am looking for, which is 'col-12 col-lg-4 pay' and then within that the 'h2' tag. There is only one 'h2' tag. And print the text of that.
So it seems I am missing something, but what?

I would just get rid of the for loop and use a more specific find query
import requests
from bs4 import BeautifulSoup
source = requests.get(f'https://data.richmond.com/salaries/2018/state/university-of-virginia/tony-bennett')
soup = BeautifulSoup(source.text, 'html.parser')
main_box = soup.find("div", {"class": "pay"})
print(main_box.find('h2').text)

You can also extract this is by using CSS
import requests
import json
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
url = 'https://data.richmond.com/salaries/2018/state/university-of-virginia/tony-bennett'
res = requests.get(url).text
soup = BeautifulSoup(res , 'html.parser')
Value = soup.select('#paytotal')
print(Value[0].text)

Facebook Scraper

I'm trying to scrape Post and images from this facebook profile; https://www.facebook.com/carlostablanteoficial and getting nothing when trying to reach the actual post text with this code:
from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup
html = urlopen("https://www.facebook.com/carlostablanteoficial")
res = BeautifulSoup(html.read(),"html5lib");
resdiv = res.div
post = resdiv.findAll('div', class_='text_exposed_root')
print(post)

This will return many results:
import requests
from bs4 import BeautifulSoup
data = requests.get("https://www.facebook.com/carlostablanteoficial")
soup = BeautifulSoup(data.text, 'html.parser')
for div in soup.find_all('div'):
print(div)
to search for a specific class, change the loop to:
for div in soup.find_all('div', {'class', 'text_exposed_root'}):
print(div)
but when I tried it returned nothing, meaning there is no div with that class on the page

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How do I webscrape nested lists in Python? - python

Related

how can I get the names in this html code by python?

Trying to scrape Aliexpress

BS4 returns [] instead of the wanted HTML tag

python BeautifulSoup4 - why does result print 5 times?

Facebook Scraper

Categories

Resources