I'm learning how to scrape data from websites using BeautifulSoup, and I'm trying to scrape movie links and some data about them from the YTS website, but I'm stuck. I wrote a script that scrapes two movie qualities, but some movies have two or more qualities in the Tech Specs area, and with my approach I have to write separate code for every quality. How can I create a for or while loop to scrape all of them?
import requests
from bs4 import BeautifulSoup
m_r = requests.get('https://yts.mx/movies/suicide-squad-2016')
m_page = BeautifulSoup(m_r.content, 'html.parser')
#------------------ Name, Date, Category ----------------
m_det = m_page.find_all('div', class_='hidden-xs')
m_detail = m_det[4]
m_name = m_detail.contents[1].string
m_date = m_detail.contents[3].string
m_category = m_detail.contents[5].string
print(m_name)
print(m_date)
print(m_category)
#------------------ Download Links ----------------
m_li = m_page.find_all('p', {'class':'hidden-xs hidden-sm'})
m_link = m_li[0]
m_link_720 = m_link.contents[3].get('href')
print(m_link_720)
m_link_1080 = m_link.contents[5].get('href')
print(m_link_1080)
#-------------------- File Size & Language -------------------------
tech_spec = m_page.find_all('div', class_='row')
s_size = tech_spec[6].contents[1].contents[1]
#-----------Convert file size to MB-----------
if 'MB' in s_size:
    s_size = s_size.replace('MB', '').strip()
    print(s_size)
elif 'GB' in s_size:
    s_size = float(s_size.replace('GB', '').strip())
    s_size = s_size * 1024
    print(s_size)
#--------- Small file Language -----------
s_lan = tech_spec[6].contents[5].contents[2].strip()
print(s_lan)
b_size = tech_spec[8].contents[1].contents[1]
#-----------Convert file size to MB-----------
if 'MB' in b_size:
    b_size = b_size.replace('MB', '').strip()
    print(b_size)
elif 'GB' in b_size:
    b_size = float(b_size.replace('GB', '').strip())
    b_size = b_size * 1024
    print(b_size)
#--------- Big file Language -----------
b_lan = tech_spec[8].contents[5].contents[2].strip()
print(b_lan)
This script will get all information for each movie quality:
import requests
from bs4 import BeautifulSoup
url = 'https://yts.mx/movies/suicide-squad-2016'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for tech_quality, tech_info in zip(soup.select('.tech-quality'), soup.select('.tech-spec-info')):
    print('Tech Quality:', tech_quality.get_text(strip=True))
    file_size, resolution, language, rating = [td.get_text(strip=True, separator=' ') for td in tech_info.select('div.row:nth-of-type(1) > div')]
    subtitles, fps, runtime, peers_seeds = [td.get_text(strip=True, separator=' ') for td in tech_info.select('div.row:nth-of-type(2) > div')]
    print('File size:', file_size)
    print('Resolution:', resolution)
    print('Language:', language)
    print('Rating:', rating)
    print('Subtitles:', tech_info.select_one('div.row:nth-of-type(2) > div:nth-of-type(1)').a['href'] if subtitles else '-')
    print('FPS:', fps)
    print('Runtime:', runtime)
    print('Peers/Seeds:', peers_seeds)
    print('-' * 80)
Prints:
Tech Quality: 3D.BLU
File size: 1.88 GB
Resolution: 1920*800
Language: English 2.0
Rating: PG - 13
Subtitles: -
FPS: 23.976 fps
Runtime: 2 hr 3 min
Peers/Seeds: P/S 8 / 35
--------------------------------------------------------------------------------
Tech Quality: 720p.BLU
File size: 999.95 MB
Resolution: 1280*720
Language: English 2.0
Rating: PG - 13
Subtitles: -
FPS: 23.976 fps
Runtime: 2 hr 3 min
Peers/Seeds: P/S 61 / 534
--------------------------------------------------------------------------------
Tech Quality: 1080p.BLU
File size: 2.06 GB
Resolution: 1920*1080
Language: English 2.0
Rating: PG - 13
Subtitles: -
FPS: 23.976 fps
Runtime: 2 hr 3 min
Peers/Seeds: P/S 80 / 640
--------------------------------------------------------------------------------
Tech Quality: 2160p.BLU
File size: 5.82 GB
Resolution: 3840*1600
Language: English 5.1
Rating: PG - 13
Subtitles: -
FPS: 23.976 fps
Runtime: 2 hr 2 min
Peers/Seeds: P/S 49 / 110
--------------------------------------------------------------------------------
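If you'd rather collect the values than print them, the same zip pairing can build a list of dicts. This is a minimal sketch reusing the selectors from the answer above; the field names simply mirror the printed output:

import requests
from bs4 import BeautifulSoup

url = 'https://yts.mx/movies/suicide-squad-2016'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

keys = ['file_size', 'resolution', 'language', 'rating',
        'subtitles', 'fps', 'runtime', 'peers_seeds']
qualities = []
for tech_quality, tech_info in zip(soup.select('.tech-quality'), soup.select('.tech-spec-info')):
    # two rows of four cells each, in the same order as the printed output above
    cells = [td.get_text(strip=True, separator=' ')
             for td in tech_info.select('div.row > div')]
    record = dict(zip(keys, cells))
    record['quality'] = tech_quality.get_text(strip=True)
    qualities.append(record)

for q in qualities:
    print(q['quality'], q['file_size'], q['resolution'])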
I'm trying to scrape a page, but the loop stops while extracting data from some of the items. I just want the text, but it shows:
'NoneType' object has no attribute 'text'
I am confused. This is my code:
from bs4 import BeautifulSoup
import requests
from csv import writer
url = "https://www.jumia.com.ng/computing/"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
lists = soup.find_all('a', class_="core")
for lista in lists:
    laptop_name = lista.find('h3', class_="name").text
    price = lista.find('div', class_="old").text
    best_price = lista.find('div', class_="prc").test
    percentage_discount = lista.find('div', class_="bdg_dsct_sm").test
    info = [laptop_name, price, best_price, percentage_discount]
    print(info)
[<h3 class="name">Hp 15 Intel Celeron N4020 8GB RAM 1TB HDD Windows 10 + Mouse</h3>, <div class="old">₦ 235,000</div>, <div class="prc">₦ 199,000</div>, None]
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_16528/2657122307.py in <module>
1 for lista in lists:
----> 2 laptop_name = lista.find('h3', class_="name").text
3 price = lista.find('div', class_="old").text
4 best_price = lista.find('div', class_="prc").test
5 percentage_discount = lista.find('div', class_="bdg_dsct_sm").test
AttributeError: 'NoneType' object has no attribute 'text'
To get name, price, discount of the products on that page you can try:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.jumia.com.ng/computing/"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
all_data = []
for article in soup.select("article.prd"):
    laptop_name = article.find(class_="name").text
    price = article.find("div", class_="prc").get("data-oprc")
    best_price = article.find("div", class_="prc").text
    percentage_discount = article.find("div", class_="_dsct")
    percentage_discount = (
        percentage_discount.text if percentage_discount else "N/A"
    )
    info = [laptop_name, price, best_price, percentage_discount]
    all_data.append(info)
df = pd.DataFrame(all_data, columns=["Name", "Price 1", "Price 2", "Discount"])
print(df.head().to_markdown(index=False))
Prints:
| Name                                                                   | Price 1   | Price 2   | Discount |
|------------------------------------------------------------------------|-----------|-----------|----------|
| Speedisk 3.0 Pen Flash Drive 64GB - Metal With Micro USB OTG           | ₦ 10,000  | ₦ 2,699   | 73%      |
| Lenovo V15-IGL Intel Celeron 1TB HDD 4GB RAM Win 10                    | ₦ 160,990 | ₦ 130,990 | 19%      |
| Hp 15 Intel Celeron N4020 8GB RAM 1TB HDD Windows 10 + Mouse           | ₦ 235,000 | ₦ 199,000 | 15%      |
| Asus E203NAH Intel Celeron 4GB RAM 128GB EMMC 11.6' Win 10 - Star Grey | ₦ 119,990 | ₦ 99,990  | 17%      |
| Lenovo AMD RYZEN 3 1TB HDD 8GB RAM 2.6 To 3.4ghz Win 10+ 32GB Flash    | ₦ 250,000 | ₦ 218,500 | 13%      |
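The key fix is checking for None before touching .text, as the _dsct lookup above does. If you need that guard in several places, a small helper keeps it readable; this is a minimal sketch (the helper name is ours, not from any library):

def text_or_default(parent, name=None, class_=None, default="N/A"):
    # find() returns None when nothing matches, so guard before reading text
    el = parent.find(name, class_=class_)
    return el.get_text(strip=True) if el else default

# usage inside the loop above:
# percentage_discount = text_or_default(article, "div", class_="_dsct")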
I'm trying to scrape IMDB for a list of the top 1000 movies and get some details about them. However, when I run it, instead of getting the first 50 movies and going to the next page for the next 50, it repeats the loop and makes the same 50 entries 20 times in my database.
# Dataframe template
data = pd.DataFrame(columns=['ID','Title','Genre','Summary'])
#Get page data function
def getPageContent(start=1):
    start = 1
    url = 'https://www.imdb.com/search/title/?title_type=feature&year=1950-01-01,2019-12-31&sort=num_votes,desc&start='+str(start)
    r = requests.get(url)
    bs = bsp(r.text, "lxml")
    return bs

#Run for top 1000
for start in range(1,1001,50):
    getPageContent(start)
    movies = bs.findAll("div", "lister-item-content")
    for movie in movies:
        id = movie.find("span", "lister-item-index").contents[0]
        title = movie.find('a').contents[0]
        genres = movie.find('span', 'genre').contents[0]
        genres = [g.strip() for g in genres.split(',')]
        summary = movie.find("p", "text-muted").find_next_sibling("p").contents
        i = data.shape[0]
        data.loc[i] = [id,title,genres,summary]

#Clean data
# data.ID = [float(re.sub('.','',str(i))) for i in data.ID] #remove . from ID
data.head(51)
0 1. The Shawshank Redemption [Drama] [\nTwo imprisoned men bond over a number of ye...
1 2. The Dark Knight [Action, Crime, Drama] [\nWhen the menace known as the Joker wreaks h...
2 3. Inception [Action, Adventure, Sci-Fi] [\nA thief who steals corporate secrets throug...
3 4. Fight Club [Drama] [\nAn insomniac office worker and a devil-may-...
...
46 47. The Usual Suspects [Crime, Drama, Mystery] [\nA sole survivor tells of the twisty events ...
47 48. The Truman Show [Comedy, Drama] [\nAn insurance salesman discovers his whole l...
48 49. Avengers: Infinity War [Action, Adventure, Sci-Fi] [\nThe Avengers and their allies must be willi...
49 50. Iron Man [Action, Adventure, Sci-Fi] [\nAfter being held captive in an Afghan cave,...
50 1. The Shawshank Redemption [Drama] [\nTwo imprisoned men bond over a number of ye...
Delete the start = 1 line inside the getPageContent function. It overwrites the start argument with 1 on every call.
#Get page data function
def getPageContent(start=1):
    url = 'https://www.imdb.com/search/title/?title_type=feature&year=1950-01-01,2019-12-31&sort=num_votes,desc&start='+str(start)
    r = requests.get(url)
    bs = bsp(r.text, "lxml")
    return bs
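Note the calling loop also discards the function's return value, so bs is never updated. A minimal sketch of the corrected loop, assuming the question's imports (requests, BeautifulSoup as bsp) and the data DataFrame defined above:

for start in range(1, 1001, 50):          # each results page lists 50 movies
    bs = getPageContent(start)            # keep the soup the function returns
    movies = bs.findAll("div", "lister-item-content")
    for movie in movies:
        movie_id = movie.find("span", "lister-item-index").contents[0]
        title = movie.find('a').contents[0]
        genres = [g.strip() for g in movie.find('span', 'genre').contents[0].split(',')]
        summary = movie.find("p", "text-muted").find_next_sibling("p").contents
        data.loc[data.shape[0]] = [movie_id, title, genres, summary]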
I was not able to test this code. See inline comments for what I see as the main issue.
# Dataframe template
data = pd.DataFrame(columns=['ID', 'Title', 'Genre', 'Summary'])

# Get page data function
def getPageContent(start=1):
    start = 1
    url = 'https://www.imdb.com/search/title/?title_type=feature&year=1950-01-01,2019-12-31&sort=num_votes,desc&start=' + str(start)
    r = requests.get(url)
    bs = bsp(r.text, "lxml")
    return bs

# Run for top 1000
# for start in range(1, 1001, 50): # 50 is a
# step value so this gets every 50th movie
# Try 2 loops
start = 0
for group in range(0, 1001, 50):
    for item in range(group, group + 50):
        bs = getPageContent(item)  # assign the returned soup so it can be used below
        movies = bs.findAll("div", "lister-item-content")
        for movie in movies:
            id = movie.find("span", "lister-item-index").contents[0]
            title = movie.find('a').contents[0]
            genres = movie.find('span', 'genre').contents[0]
            genres = [g.strip() for g in genres.split(',')]
            summary = movie.find("p", "text-muted").find_next_sibling("p").contents
            i = data.shape[0]
            data.loc[i] = [id, title, genres, summary]

# Clean data
# data.ID = [float(re.sub('.','',str(i))) for i in data.ID] #remove . from ID
data.head(51)
I am trying to use a regular expression to parse the memory size out of a product name.
Product names: 1) Lg K42 Blu Tim Smartphone 64 gb
2) Xiaomi Smartphone 0.128 gb ram 4 gb. tim quadband - Redmi Note 9 128gb Grigio Tim.
How do I get the '64 gb' using regex in Python? 'gb' may be lower case or caps, and the memory value may contain 2 or 3 digits.
import xlwt
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import re
import time
from bs4 import BeautifulSoup
from datetime import date
class cometmobiles:
    def __init__(self):
        self.url='https://www.comet.it/smartphone-e-telefonia/smartphone-e-cellulari/smartphone'
        self.country='IT'
        self.currency='euro'
        self.VAT='Included'

    def comet(self):
        #try:
        wb = xlwt.Workbook()
        ws = wb.add_sheet('Sheet1',cell_overwrite_ok=True)
        ws.write(0,0,"Product_Url")
        ws.write(0,1,"Product_Manufacturer")
        ws.write(0,2,"Product_Model")
        ws.write(0,3,"Product_memory")
        ws.write(0,4,"Product_Price")
        ws.write(0,5,"Currency")
        ws.write(0,6,"VAT")
        ws.write(0,7,"Shipping")
        ws.write(0,8,"Country")
        ws.write(0,9,"Date")
        wb.save(r"C:\Users\Karthick R\Desktop\VS code\comet.xls")
        driver=webdriver.Chrome()
        today = date.today()
        driver.maximize_window()
        driver.implicitly_wait(30)
        driver.get(self.url)
        wait = WebDriverWait(driver, 20)
        cookies = driver.find_element_by_xpath('//i[@class="btn-close-popup fas fa-times"]')
        cookies.click()
        time.sleep(5)
        print("clicked")
        x = 0
        titles = []
        models = []
        memorys = []
        prices = []
        links =[]
        product_links = []
        while True:
            containers = driver.find_elements_by_css_selector('div[class="col-12 col-sm-6 col-md-4"]')
            i = 1
            for container in containers:
                url = container.find_element_by_css_selector('div[class="sotto-cat__products__item"]')
                urls = url.find_element_by_tag_name('a').get_attribute('href')
                i = i + 1
                product_links.append(urls)
            print(product_links)
            x+=1
            time.sleep(5)
            driver.get(self.url+"?p="+str(x))
            print("next page")
            if driver.current_url == self.url:
                break
        for links in product_links:
            driver.get(links)
            time.sleep(10)
            #product links
            print(driver.current_url)
            source = driver.page_source
            soup = BeautifulSoup(source,'html.parser')
            #title
            title = soup.find('h1',{'class':'scheda-prodotto__info__col-sx__title'}).text
            y = re.search('([^\s]+)',title)
            print(title)
            titles.append(y.group(1))
            #models
            model = re.sub(y.group(1),"",title).strip()
            print(model)
            models.append(model)
            #memory
            memory = re.search('^[0-9]{2,3}+[A-Za-z]',model).strip()
            print(memory)
            memorys.append(memory)
            #price
            price = soup.find('span',{'class':'caption__price'}).text
            print(price)
        i=0
        while i<len(titles):
            ws.write(i+1,0,str(links[i]))
            ws.write(i+1,1,str(titles[i]))
            ws.write(i+1,2,str(models[i]))
            ws.write(i+1,3,str(memorys[i]))
            ws.write(i+1,4,str(prices[i]))
            ws.write(i+1,5,str(self.currency))
            ws.write(i+1,6,str(self.VAT))
            ws.write(i+1,7,str(self.shipping))
            ws.write(i+1,8,str(self.country))
            ws.write(i+1,9,str(date.today()))
            i=i+1
        wb.save(r"C:\Users\Karthick R\Desktop\VS code\comet.xls")
        #except:
        #    pass

comets=cometmobiles()
comets.comet()
You can try this out.
import re
s = 'Lg K42 Blu Tim Smartphone 64 gb Xiaomi Smartphone 0.128 gb ram 4 gb. tim quadband - Redmi Note 9 128Gb Grigio Tim.'
f = re.findall(r'\b\d{2,3}\s*gb\b',s,re.I)
print(f)
['64 gb', '128 gb', '128Gb']
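If you only need the numeric part, a capturing group makes findall return just the digits; this is the same pattern with a group added:

import re

s = 'Lg K42 Blu Tim Smartphone 64 gb Xiaomi Smartphone 0.128 gb ram 4 gb. tim quadband - Redmi Note 9 128Gb Grigio Tim.'
f = re.findall(r'\b(\d{2,3})\s*gb\b', s, re.I)
print(f)  # ['64', '128', '128']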
input_str = "Lg K42 Blu Tim Smartphone 64 gb "
import re
output_value = re.search('[0-9]*.(?=gb)',input_str)
print (output_value[0])
Output : 64
input_str = "Redmi Note 9 128gb Grigio Tim"
Output : 128
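Note this pattern is case-sensitive, so a name containing '128Gb' (as in the question) would not match as written; a variation with re.IGNORECASE and a stripped match covers both cases:

import re

for input_str in ('Lg K42 Blu Tim Smartphone 64 gb ', 'Redmi Note 9 128Gb Grigio Tim'):
    m = re.search(r'[0-9]+\s*(?=gb)', input_str, re.I)
    print(m[0].strip() if m else None)  # 64, then 128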
My problem is, it loops through pages, but it doesn't write anything into my list.
At the end I print len(title) and it is still 0.
from bs4 import BeautifulSoup
import requests
for page in range(20, 200, 20):
    current_page = 'https://auto.bazos.sk/{}/?hledat=kolesa&hlokalita=&humkreis=&cen'.format(page)
    web_req = requests.get(current_page).text
    soup = BeautifulSoup(requests.get(current_page).content, 'html.parser')
    title_data = soup.select('.nadpis')
    title = []
    for each_title in title_data:
        title.append(each_title.text)
    print(current_page)

print(len(title))
Move title out of the loop and there you have it.
import requests
from bs4 import BeautifulSoup
title = []
for page in range(20, 40, 20):
    current_page = 'https://auto.bazos.sk/{}/?hledat=kolesa&hlokalita=&humkreis=&cen'.format(page)
    soup = BeautifulSoup(requests.get(current_page).content, 'html.parser')
    title_data = soup.select('.nadpis')
    for each_title in title_data:
        title.append(each_title.text)
    print(current_page)

print(title)
Output:
['ELEKTRONY SKODA OCTAVIA SCOUT DISKY “PROTEUS” R17', 'Fiat Sedici 1.6, 4x4, r.v 04/2009, 79 kw, slovenské ŠPZ', 'Bmw e46 328ci', '255/50 R19', 'Honda Jazz 1.3', 'Predám 4 ks kolesá', 'Audi A5 3.2 FSI quattro tiptronic S LINE R20 TOP STAV', 'Peugeot 407 combi 1,6 hdi', 'Škoda Superb 2.0TDI 4x4 od 260€ mesačne, bez akontácia', 'Predam elektrony Audi 5x112 R17 a letne pneu', 'ROZPREDÁM MAZDA 3 2.0i 110kW NA NÁHRADNÉ DIELY', 'Predám Astra j Turbo Noblesse bronz', 'ŠKODA KAROQ 1.6 TDI - full výbava', 'VW CHICAGO 5x112 + letné pneu 215/40 R18', 'Fiat 500 SPORT 1.3 multijet 70kw', 'Volvo FL280 - TROJSTRANNÝ SKLÁPAČ + HYDRAULICKÁ RUKA', 'ŠKODA SUPERB COMBI 2.0 TDI 190K 4X4 L&K DSG', 'FORD FOCUS 2.0 TDCI TITANIUM', 'FORD EDGE 2.0 TDCi - 154 kW VIGNALE : 27.000 km', 'R18 5x112 originalne Vw Seat Audi Skoda']
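If you want the collected titles on disk rather than printed, a minimal sketch that writes the title list built above to a CSV (the filename is an arbitrary choice):

import csv

# `title` is the list filled by the loop above
with open('bazos_titles.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title'])
    for t in title:
        writer.writerow([t])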
I'm trying to scrape Autotrader's website to get the stats and names into an Excel file.
I'm stuck at trying to loop through an html 'ul' element without any classes or IDs and organize that info in python list to then append the individual li elements to different fields in my table.
As you can see I'm able to target the title and price elements, but the 'ul' is really tricky... Well... for someone at my skill level.
The specific code I'm struggling with:
for i in range(1, 2):
    response = get('https://www.autotrader.co.uk/car-search?sort=sponsored&seller-type=private&page=' + str(i))
    html_soup = BeautifulSoup(response.text, 'html.parser')
    ad_containers = html_soup.find_all('h2', class_ = 'listing-title title-wrap')
    price_containers = html_soup.find_all('section', class_ = 'price-column')
    for container in ad_containers:
        name = container.find('a', class_ ="js-click-handler listing-fpa-link").text
        names.append(name)
        # Trying to loop through the key specs list and assign each 'li' to a different field in the table
        lis = []
        list_container = container.find('ul', class_='listing-key-specs')
        for li in list_container.find('li'):
            lis.append(li)
        year.append(lis[0])
        body_type.append(lis[1])
        milage.append(lis[2])
        engine.append(lis[3])
        hp.append(lis[4])
        transmission.append(lis[5])
        petrol_type.append(lis[6])
        lis = [] # Clearing the list to get ready for the next set of data
And the error message I get is the following:
Full code here:
from requests import get
from bs4 import BeautifulSoup
import pandas
# from time import sleep, time
# import random
# Create table fields
names = []
prices = []
year = []
body_type = []
milage = []
engine = []
hp = []
transmission = []
petrol_type = []
for i in range(1, 2):
    # Make a get request
    response = get('https://www.autotrader.co.uk/car-search?sort=sponsored&seller-type=private&page=' + str(i))
    # Pause the loop
    # sleep(random.randint(4, 7))

    # Create containers
    html_soup = BeautifulSoup(response.text, 'html.parser')
    ad_containers = html_soup.find_all('h2', class_ = 'listing-title title-wrap')
    price_containers = html_soup.find_all('section', class_ = 'price-column')
    for container in ad_containers:
        name = container.find('a', class_ ="js-click-handler listing-fpa-link").text
        names.append(name)
        # Trying to loop through the key specs list and assign each 'li' to a different field in the table
        lis = []
        list_container = container.find('ul', class_='listing-key-specs')
        for li in list_container.find('li'):
            lis.append(li)
        year.append(lis[0])
        body_type.append(lis[1])
        milage.append(lis[2])
        engine.append(lis[3])
        hp.append(lis[4])
        transmission.append(lis[5])
        petrol_type.append(lis[6])
        lis = [] # Clearing the list to get ready for the next set of data
    for pricteainers in price_containers:
        price = pricteainers.find('div', class_ ='vehicle-price').text
        prices.append(price)

test_df = pandas.DataFrame({'Title': names, 'Price': prices, 'Year': year, 'Body Type': body_type, 'Mileage': milage, 'Engine Size': engine, 'HP': hp, 'Transmission': transmission, 'Petrol Type': petrol_type})
print(test_df.info())
# test_df.to_csv('Autotrader_test.csv')
I followed the advice from David in the other answer's comment area.
Code:
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
pd.set_option('display.width', 1000)
pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
names = []
prices = []
year = []
body_type = []
milage = []
engine = []
hp = []
transmission = []
petrol_type = []
for i in range(1, 2):
    response = get('https://www.autotrader.co.uk/car-search?sort=sponsored&seller-type=private&page=' + str(i))
    html_soup = BeautifulSoup(response.text, 'html.parser')
    outer = html_soup.find_all('article', class_='search-listing')
    for inner in outer:
        lis = []
        names.append(inner.find_all('a', class_ ="js-click-handler listing-fpa-link")[1].text)
        prices.append(inner.find('div', class_='vehicle-price').text)
        for li in inner.find_all('ul', class_='listing-key-specs'):
            for i in li.find_all('li')[-7:]:
                lis.append(i.text)
        year.append(lis[0])
        body_type.append(lis[1])
        milage.append(lis[2])
        engine.append(lis[3])
        hp.append(lis[4])
        transmission.append(lis[5])
        petrol_type.append(lis[6])

test_df = pd.DataFrame.from_dict({'Title': names, 'Price': prices, 'Year': year, 'Body Type': body_type, 'Mileage': milage, 'Engine Size': engine, 'HP': hp, 'Transmission': transmission, 'Petrol Type': petrol_type}, orient='index')
print(test_df.transpose())
Output:
Title Price Year Body Type Mileage Engine Size HP Transmission Petrol Type
0 Citroen C3 1.4 HDi Exclusive 5dr £500 2002 (52 reg) Hatchback 123,065 miles 1.4L 70bhp Manual Diesel
1 Volvo V40 1.6 XS 5dr £585 1999 (V reg) Estate 125,000 miles 1.6L 109bhp Manual Petrol
2 Toyota Yaris 1.3 VVT-i 16v GLS 3dr £700 2000 (W reg) Hatchback 94,000 miles 1.3L 85bhp Automatic Petrol
3 MG Zt-T 2.5 190 + 5dr £750 2002 (52 reg) Estate 95,000 miles 2.5L 188bhp Manual Petrol
4 Volkswagen Golf 1.9 SDI E 5dr £795 2001 (51 reg) Hatchback 153,000 miles 1.9L 68bhp Manual Diesel
5 Volkswagen Polo 1.9 SDI Twist 5dr £820 2005 (05 reg) Hatchback 106,116 miles 1.9L 64bhp Manual Diesel
6 Volkswagen Polo 1.4 S 3dr (a/c) £850 2002 (02 reg) Hatchback 125,640 miles 1.4L 75bhp Manual Petrol
7 KIA Picanto 1.1 LX 5dr £990 2005 (05 reg) Hatchback 109,000 miles 1.1L 64bhp Manual Petrol
8 Vauxhall Corsa 1.2 i 16v SXi 3dr £995 2004 (54 reg) Hatchback 81,114 miles 1.2L 74bhp Manual Petrol
9 Volkswagen Beetle 1.6 3dr £995 2003 (53 reg) Hatchback 128,000 miles 1.6L 102bhp Manual Petrol
The ul is not a child of the h2; it's a sibling. So you will need to make a separate selection, because it's not part of ad_containers.
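For example, a minimal sketch that walks from each h2 in ad_containers to the sibling list (assuming the markup described above, where the specs ul follows the h2 at the same level):

for container in ad_containers:
    # the specs ul is a sibling of the h2, so search forward from the h2
    spec_list = container.find_next_sibling('ul', class_='listing-key-specs')
    if spec_list is not None:
        specs = [li.get_text(strip=True) for li in spec_list.find_all('li')]
        print(specs)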