import requests
from bs4 import BeautifulSoup
import json
import csv
import re
url = "https://www.daraz.pk/catalog/?q=dell&_keyori=ss&from=input&spm=a2a0e.searchlist.search.go.57446b5079XMO8"
page = requests.get(url)
print(page.status_code)
print(page.text)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())
alpha = soup.find_all('script',{'type':'application/ld+json'})
jsonObj = json.loads(alpha[1].text)
Below is the code to find all the relevant product information from the JSON object:
for item in jsonObj['itemListElement']:
    name = item['name']
    price = item['offers']['price']
    currency = item['offers']['priceCurrency']
    availability = item['offers']['availability'].split('/')[-1]
    # split the CamelCase schema.org value (e.g. "InStock") into separate words
    availability = [s for s in re.split("([A-Z][^A-Z]*)", availability) if s]
    availability = ' '.join(availability)
Here is the code to extract the URL from the JSON object:
    url = item['url']
    print('Availability: %s Price: %0.2f %s Name: %s URL: %s' % (availability, float(price), currency, name, url))
Below is the code to extract the data into a CSV:
outfile = open('products.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["name", "type", "price", "priceCurrency", "availability" ])
alpha = soup.find_all('script',{'type':'application/ld+json'})
jsonObj = json.loads(alpha[1].text)
for item in jsonObj['itemListElement']:
    name = item['name']
    type = item['#type']
    url = item['url']
    price = item['offers']['price']
    currency = item['offers']['priceCurrency']
    availability = item['offers']['availability'].split('/')[-1]
The file creates the header, but no data appears in the CSV for the URL.
    writer.writerow([name, type, price, currency, availability, URL ])
outfile.close()
First, you don't include the header there. Not a big deal; it just means the url column would have no header in the first row. To include it:
writer.writerow(["name", "type", "price", "priceCurrency", "availability", "url" ])
Second, you store the string as url, but then reference URL in your writer. URL isn't holding any value; in fact, it should have raised a NameError: name 'URL' is not defined, or something similar.
And since you already use url in your code with url = "https://www.daraz.pk/catalog/?q=dell&_keyori=ss&from=input&spm=a2a0e.searchlist.search.go.57446b5079XMO8", I would also probably change the variable name to something like url_text.
I'd probably also use a variable like type_text, or something other than type, since type is a built-in function in Python.
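For example, once the name type is rebound, the built-in is no longer reachable in that scope:

type = "Product"       # rebinds the name, shadowing the built-in type()
print(type("hello"))   # TypeError: 'str' object is not callable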
But you need to change it to:
writer.writerow([name, type, price, currency, availability, url ])
outfile.close()
Full code:
import requests
from bs4 import BeautifulSoup
import json
import csv
url = "https://www.daraz.pk/catalog/?q=dell&_keyori=ss&from=input&spm=a2a0e.searchlist.search.go.57446b5079XMO8"
page = requests.get(url)
print(page.status_code)
print(page.text)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())
alpha = soup.find_all('script',{'type':'application/ld+json'})
jsonObj = json.loads(alpha[1].text)
outfile = open(r'c:\products.csv', 'w', newline='')
writer = csv.writer(outfile)
writer.writerow(["name", "type", "price", "priceCurrency", "availability", "url"])
for item in jsonObj['itemListElement']:
    name = item['name']
    type_text = item['#type']
    url_text = item['url']
    price = item['offers']['price']
    currency = item['offers']['priceCurrency']
    availability = item['offers']['availability'].split('/')[-1]
    writer.writerow([name, type_text, price, currency, availability, url_text])
outfile.close()
The only thing I could find wrong is that you have a typo in the last line - upper-case URL instead of lower-case url. Changing it made the script work perfectly.
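As a side note, opening the file in a with block is a little safer than pairing open() with an explicit close(), since the file gets closed even if the loop raises; a minimal sketch of the same write:

with open(r'c:\products.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["name", "type", "price", "priceCurrency", "availability", "url"])
    for item in jsonObj['itemListElement']:
        writer.writerow([item['name'], item['#type'],
                         item['offers']['price'], item['offers']['priceCurrency'],
                         item['offers']['availability'].split('/')[-1], item['url']])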
I have been working on web-scraping the infobox information on Wikipedia. This is the code I have been using:
import requests
import csv
from bs4 import BeautifulSoup
URL = ['https://en.wikipedia.org/wiki/Workers_Credit_Union','https://en.wikipedia.org/wiki/San_Diego_County_Credit_Union',
'https://en.wikipedia.org/wiki/USA_Federal_Credit_Union','https://en.wikipedia.org/wiki/Commonwealth_Credit_Union',
'https://en.wikipedia.org/wiki/Center_for_Community_Self-Help','https://en.wikipedia.org/wiki/ESL_Federal_Credit_Union',
'https://en.wikipedia.org/wiki/State_Employees_Credit_Union','https://en.wikipedia.org/wiki/United_Heritage_Credit_Union']
for url in URL:
    headers = []
    rows = []
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table', class_='infobox')
    credit_union_name = soup.find('h1', id="firstHeading")
    header_tags = table.find_all('th')
    headers = [header.text.strip() for header in header_tags]
    data_rows = table.find_all('tr')
    for row in data_rows:
        value = row.find_all('td')
        beautified_value = [dp.text.strip() for dp in value]
        if len(beautified_value) == 0:
            continue
        rows.append(beautified_value)
    rows.append("")
    rows.append([credit_union_name.text.strip()])
    rows.append([url])
    with open(r'credit_unions.csv', 'a+', newline="") as output:
        writer = csv.writer(output)
        writer.writerow(headers)
        writer.writerow(rows)
However, when I checked the csv file, the information is not presented in tabular form: the scraped elements are stored in nested lists instead of a single flat list. I need the scraped information for each URL to be stored in a single list, and each list printed to the csv file in tabular form under the headings. Need help regarding this.
The infoboxes have different structures and labels, so I think the best way to solve this is to use dicts and a csv.DictWriter.
import requests
import csv
from bs4 import BeautifulSoup
URL = ['https://en.wikipedia.org/wiki/Workers_Credit_Union',
'https://en.wikipedia.org/wiki/San_Diego_County_Credit_Union',
'https://en.wikipedia.org/wiki/USA_Federal_Credit_Union',
'https://en.wikipedia.org/wiki/Commonwealth_Credit_Union',
'https://en.wikipedia.org/wiki/Center_for_Community_Self-Help',
'https://en.wikipedia.org/wiki/ESL_Federal_Credit_Union',
'https://en.wikipedia.org/wiki/State_Employees_Credit_Union',
'https://en.wikipedia.org/wiki/United_Heritage_Credit_Union']
csv_headers = set()
csv_rows = []
for url in URL:
    csv_row = {}
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    credit_union_name = soup.find('h1', id="firstHeading")
    table = soup.find('table', class_='infobox')
    data_rows = table.find_all('tr')
    for data_row in data_rows:
        label = data_row.find('th')
        value = data_row.find('td')
        if label is None or value is None:
            continue
        beautified_label = label.text.strip()
        beautified_value = value.text.strip()
        csv_row[beautified_label] = beautified_value
        csv_headers.add(beautified_label)
    csv_row["name"] = credit_union_name.text.strip()
    csv_row["url"] = url
    csv_rows.append(csv_row)
with open(r'credit_unions.csv', 'a+', newline="") as output:
    headers = ["name", "url"]
    headers += sorted(csv_headers)
    writer = csv.DictWriter(output, fieldnames=headers)
    writer.writeheader()
    writer.writerows(csv_rows)
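A nice property of DictWriter here: any fieldname that is missing from a given row's dict is written as an empty cell (the restval parameter, which defaults to ''), so an infobox that lacks a particular label still lines up correctly under the shared header row.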
I wrote some code for data scraping; it works well for some pages, but for others it displays:
KeyError: 'isbn'
Could you please guide me on how I can solve this issue?
Here is my code:
import requests
import re
import json
from bs4 import BeautifulSoup
import csv
import sys
import codecs
def Soup(content):
    soup = BeautifulSoup(content, 'html.parser')
    return soup

def Main(url):
    r = requests.get(url)
    soup = Soup(r.content)
    scripts = soup.findAll("script", type="application/ld+json",
                           text=re.compile("data"))
    prices = [span.text for span in soup.select(
        "p.product-field.price span span") if span.text != "USD"]
    with open("AudioBook/Fiction & Literature/African American.csv", 'a', encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Title", "Writer", "Price", "IMG", "URL", "ISBN"])
        for script, price in zip(scripts, prices):
            script = json.loads(script.text)
            title = script["data"]["name"]
            author = script["data"]["author"][0]["name"]
            img = f'https:{script["data"]["thumbnailUrl"]}'
            isbn = script["data"]["isbn"]
            url = script["data"]["url"]
            writer.writerow([title, author, price, img, url, isbn])

for x in range(1, 10):
    url = ("https://www.kobo.com/ww/en/audiobooks/contemporary-1?pageNumber=" + str(x))
    print("Scraping page " + str(x) + ".....")
    Main(url)
Since audiobooks don't have an ISBN on the listings page, you could prepare for this case with a default value, e.g.:
isbn = script["data"].get("isbn", "")
In this case, if the "isbn" key doesn't exist in script["data"], it will fall back to an empty string.
Alternatively, you could get the book ISBN from the audiobook-specific page (your script["data"]["url"] above), e.g.:
def Main(url):
    r = requests.get(url)
    soup = Soup(r.content)
    scripts = soup.findAll("script", type="application/ld+json",
                           text=re.compile("data"))
    prices = [span.text for span in soup.select(
        "p.product-field.price span span") if span.text != "USD"]
    with open("AudioBook/Fiction & Literature/African American.csv", 'a', encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Title", "Writer", "Price", "IMG", "URL", "ISBN"])
        for script, price in zip(scripts, prices):
            script = json.loads(script.text)
            title = script["data"]["name"]
            author = script["data"]["author"][0]["name"]
            img = f'https:{script["data"]["thumbnailUrl"]}'
            # NEW CODE
            url = script["data"]["url"]
            if "isbn" in script["data"]:
                # ebook listings
                isbn = script["data"]["isbn"]
            else:
                # audiobook listings
                r = requests.get(url)
                inner_soup = Soup(r.content)
                try:
                    inner_script = json.loads(
                        inner_soup.find("script", type="application/ld+json",
                                        text=re.compile("workExample")).text)
                    isbn = inner_script["workExample"]["isbn"]
                except AttributeError:
                    isbn = ""
            # END NEW CODE
            writer.writerow([title, author, price, img, url, isbn])
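The try/except AttributeError is there because inner_soup.find() returns None when the detail page has no matching script tag, and calling .text on None raises AttributeError; in that case the ISBN column is simply left blank.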
So, I'm trying to get some information from a website with Python, specifically from a webshop.
I tried this:
import requests
import csv

my_url = requests.get(MY_URL)
data = my_url.json()
name = data['MainContent'][0]['contents'][0]['productList']['products'][0]['productModel']["displayName"]
price = data['MainContent'][0]['contents'][0]['productList']['products'][0]['priceInfo']['priceItemSale']["gross"]
url = data['MainContent'][0]['contents'][0]['productList']['products'][0]['productModel']["url"]
filename = "test.csv"
f = open(filename, 'w')
csv_writer = csv.writer(f)
headers = "Name, Price, Link\n"
f.write(headers)
f.close()
In this webshop there are a lot of products with this "productModel" attribute, but how can I get them all and write them into a csv?
I want to scrape the name, the price, and the URL from this page into different cells.
Something like:
for mc in data['MainContent']:
    for co in mc['contents']:
        for prod in co['productList']['products']:
            name = prod['productModel']['displayName']
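To get every product into its own row, you could extend that loop to collect the three fields and hand them to csv.writer; a minimal sketch, assuming the JSON structure shown in the question:

import csv

rows = []
for mc in data['MainContent']:
    for co in mc['contents']:
        for prod in co['productList']['products']:
            rows.append([prod['productModel']['displayName'],
                         prod['priceInfo']['priceItemSale']['gross'],
                         prod['productModel']['url']])

with open('test.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'Price', 'Link'])  # header row
    writer.writerows(rows)                      # one product per row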
I have managed to build a very primitive program to scrape vehicle data from pistonheads and print it to a .csv file with the link, make, and model. I am now working on getting the price, which is where I am encountering a problem.
I want to scrape the prices into the fourth column in my .csv file (Price) and to correctly print the price of each vehicle on the website.
At the moment I am only getting it to print the price from one vehicle, repeated again and again next to each vehicle in the .csv file.
I have tried soup.findAll and soup.find_all to see whether parsing through multiple elements would work, but this just created a bigger mess.
Might someone be able to help?
I am also trying to scrape the image src and would like to print that in another column (5) called Images.
import csv ; import requests
from bs4 import BeautifulSoup
outfile = open('pistonheads.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Link", "Make", "Model", "Price"])
url = 'https://www.pistonheads.com/classifieds?Category=used-cars&Page=1&ResultsPerPage=100'
get_url = requests.get(url)
get_text = get_url.text
soup = BeautifulSoup(get_text, 'html.parser')
car_link = soup.find_all('div', 'listing-headline', 'price')
for div in car_link:
    links = div.findAll('a')
    for a in links:
        link = ("https://www.pistonheads.com" + a['href'])
        make = (a['href'].split('/')[-4])
        model = (a['href'].split('/')[-3])
        price = soup.find('span')
        writer.writerow([link, make, model, price])
        print(link, make, model, price)
outfile.close()
You can try this:
import csv, requests, re
from urllib.parse import urlparse
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://www.pistonheads.com/classifieds?Category=used-cars&ResultsPerPage=100').text, 'html.parser')
def extract_details(_s: soup) -> list:
    _link = _s.find('a', {'href': re.compile(r'/classifieds/used-cars/')})['href']
    _, _, make, model, *_ = _link[1:].split('/')
    price, img = _s.find('div', {'class': 'price'}).text, [i['src'] for i in _s.find_all('img')]
    return [_link, make, model, price, 'N/A' if not img else img[0]]

with open('filename.csv', 'w', newline='') as f:
    _listings = [extract_details(i) for i in d.find_all('div', {'class': 'ad-listing'}) if i.find('div', {'class': 'price'})]
    write = csv.writer(f)
    write.writerows([['link', 'make', 'model', 'price', 'img'], *_listings])
The reason is price = soup.find('span').
.find() will grab the first element it finds, and you have it looking in your whole soup object. Where you want it to look is within your a, because that's what you are looping through with for a in links:.
I also added .text, as I am assuming you just want the text, not the whole tag element, i.e. price = a.find('span').text
import csv ; import requests
from bs4 import BeautifulSoup
outfile = open('pistonheads.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Link", "Make", "Model", "Price", 'Images'])
url = 'https://www.pistonheads.com/classifieds?Category=used-cars&Page=1&ResultsPerPage=100'
get_url = requests.get(url)
get_text = get_url.text
soup = BeautifulSoup(get_text, 'html.parser')
car_link = soup.find_all('div', 'listing-headline', 'price')
for div in car_link:
    links = div.findAll('a')
    for a in links:
        link = ("https://www.pistonheads.com" + a['href'])
        make = (a['href'].split('/')[-4])
        model = (a['href'].split('/')[-3])
        price = a.find('span').text
        image_link = a.parent.parent.find('img')['src']
        image = link + image_link
        writer.writerow([link, make, model, price, image])
        print(link, make, model, price, image)
outfile.close()
I am trying to scrape data on car model, price, mileage, location, etc. using BeautifulSoup. However, the returned result only reports data on one random car. I want to be able to collect data on all cars advertised on the site to date. My Python code is below. How can I modify my code to retrieve data such that each day I have information on car model, price, mileage, location, etc.? Example:

Car model        price   mileage   location   date
Toyota Corolla   $4500   22km      Accra      16/02/2018
Nissan Almera    $9500   60km      Tema       16/02/2018
etc.
import requests
from bs4 import BeautifulSoup
import pandas
import csv
from datetime import datetime
for i in range(300):
    url = "https://tonaton.com/en/ads/ghana/cars?".format(i)

r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.prettify())
data = soup.find(class_='item-content')
for tag in data:
    item_title = data.find("a", attrs={"class": "item-title h4"})
    model = item_title.text.encode('utf-8').strip()
    item_meta = data.find("p", attrs={"class": "item-meta"})
    mileage = item_meta.text.encode('utf-8').strip()
    item_location = data.find("p", attrs={"class": "item-location"})
    location = item_location.text.encode('utf-8').strip()
    item_info = data.find("p", attrs={"class": "item-info"})
    price = item_info.text.encode('utf-8').strip()
    with open('example.csv', 'a') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow([model, price, mileage, location, datetime.now()])
First off, this loop:
for i in range(300):
url = "https://tonaton.com/en/ads/ghana/cars?".format(i)
is not doing what I assume you think it is. The string has no {} placeholder, so .format(i) changes nothing: the loop simply re-assigns the same url 300 times and leaves you with the original url you set. You need to wrap all your code in this loop to ensure you are hitting each of the URLs you want (0-299).
Restructure your code (paying attention to indents!) so that the next url is the one being used in the request:
# This will print A LOT of titles
for i in range(300):
    url = "https://tonaton.com/en/ads/ghana/cars?" + str(i)
    print(url)  # Notice how the url changes with each iteration?
    r = requests.get(url)
    soup = bsoup(r.content, "html.parser")
    titles = soup.findAll("a", attrs={"class": "item-title h4"})
    for item in titles:
        currTitle = item.text.encode('utf-8').strip()
        print(currTitle)
This code:
import requests
from bs4 import BeautifulSoup as bsoup

url = "https://tonaton.com/en/ads/ghana/cars?1"
r = requests.get(url)
soup = bsoup(r.content, "html.parser")
titles = soup.findAll("a", attrs={"class": "item-title h4"})
for item in titles:
    print(item.text.encode('utf-8').strip())
Yields (the b'...' prefix marks a bytes object, which is what .encode('utf-8') returns):
b'Hyundai Veloster 2013'
b'Ford Edge 2009'
b'Mercedes-Benz C300 2016'
b'Mazda Demio 2007'
b'Hyundai Santa fe 2005'
# And so on...
The problem is that 1) if you call find(), it will stop after the first match given your params; using findAll() instead dumps all matches into a list, which you can then iterate through and process as needed. And 2) the result you get from a call to find() is only a fragment of the original HTML, so subsequent find() calls on it search only within that fragment and won't see the rest of the page.
import requests
from bs4 import BeautifulSoup as bsoup
import csv
from datetime import datetime

for i in range(300):
    url = "https://tonaton.com/en/ads/ghana/cars?" + str(i)
    r = requests.get(url)
    soup = bsoup(r.content, "html.parser")
    # findAll() grabs every matching element on the page, not just the first
    item_title = soup.findAll("a", attrs={"class": "item-title h4"})
    item_meta = soup.findAll("p", attrs={"class": "item-meta"})
    item_location = soup.findAll("p", attrs={"class": "item-location"})
    item_info = soup.findAll("p", attrs={"class": "item-info"})
    # open in append mode so each page adds rows instead of overwriting the file
    with open('index.csv', 'a', newline='') as csv_file:
        writer = csv.writer(csv_file)
        # walk the four lists in parallel, writing one row per listing
        for title, meta, loc, info in zip(item_title, item_meta, item_location, item_info):
            model = title.text.encode('utf-8').strip()
            mileage = meta.text.encode('utf-8').strip()
            location = loc.text.encode('utf-8').strip()
            price = info.text.encode('utf-8').strip()
            writer.writerow([model, price, mileage, location, datetime.now()])
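One caveat with pairing the four lists via zip(): it stops at the shortest list, so a listing that is missing one of the fields can silently shift the pairing for everything after it. A more robust approach is to select each listing's container element first and pull the four fields out of that container with per-listing find() calls, similar to the extract_details() pattern in the pistonheads answer above.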