Web scraping data from a JSON source - Python

So, I'm trying to get some information from a webshop's website with Python.
I tried this:
import csv
import requests

my_url = requests.get(MY_URL)
data = my_url.json()
name = data['MainContent'][0]['contents'][0]['productList']['products'][0]['productModel']["displayName"]
price = data['MainContent'][0]['contents'][0]['productList']['products'][0]['priceInfo']['priceItemSale']["gross"]
url = data['MainContent'][0]['contents'][0]['productList']['products'][0]['productModel']["url"]
filename = "test.csv"
f = open(filename, 'w', newline='')
csv_writer = csv.writer(f)
csv_writer.writerow(["Name", "Price", "Link"])
f.close()
In this webshop there are a lot of products with this "productModel" attribute, but how can I get them all and write them into a CSV?
I want to scrape the name, the price and the URL of each product on this page into separate cells.

Something like:
for mc in data['MainContent']:
    for co in mc['contents']:
        for prod in co['productList']['products']:
            name = prod['productModel']['displayName']
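A minimal sketch along those lines, assuming the nesting shown above holds for every entry; .get() is used so entries without a productModel are skipped rather than raising a KeyError:

import csv
import requests

my_url = requests.get(MY_URL)  # MY_URL as in the question
data = my_url.json()

with open("test.csv", "w", newline="", encoding="utf-8") as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(["Name", "Price", "Link"])
    for mc in data['MainContent']:
        for co in mc.get('contents', []):
            for prod in co.get('productList', {}).get('products', []):
                model = prod.get('productModel')
                if not model:
                    continue  # not every entry carries a productModel
                csv_writer.writerow([
                    model.get('displayName'),
                    prod.get('priceInfo', {}).get('priceItemSale', {}).get('gross'),
                    model.get('url'),
                ])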

Related

Trying to import all the pages by web scraping but getting only 4 rows

I am new to web scraping. I am trying to scrape a website called https://www.bproperty.com/
Here I am trying to get all the data from all the pages for one specific area. (You will get an idea after going to the link: https://www.bproperty.com/en/dhaka/apartments-for-rent-in-gulshan/) At the bottom of the page you will be able to see the pagination. So I put all the URLs in an array, looped through them, scraped the data, and exported it to a CSV. In the terminal window all the data shows up fine, but when I export it to the CSV I can only see 4 rows.
Here is my code:
from bs4 import BeautifulSoup
import requests
from csv import writer
urls = [
    "https://www.bproperty.com/en/dhaka/apartments-for-rent-in-gulshan/",
    "https://www.bproperty.com/en/dhaka/apartments-for-rent-in-gulshan/page-2/",
    "https://www.bproperty.com/en/dhaka/apartments-for-rent-in-gulshan/page-3/",
    "https://www.bproperty.com/en/dhaka/apartments-for-rent-in-gulshan/page-4/",
    "https://www.bproperty.com/en/dhaka/apartments-for-rent-in-gulshan/page-5/",
    "https://www.bproperty.com/en/dhaka/apartments-for-rent-in-gulshan/page-6/",
    "https://www.bproperty.com/en/dhaka/apartments-for-rent-in-gulshan/page-7/",
    "https://www.bproperty.com/en/dhaka/apartments-for-rent-in-gulshan/page-8/",
    "https://www.bproperty.com/en/dhaka/apartments-for-rent-in-gulshan/page-9/",
    "https://www.bproperty.com/en/dhaka/apartments-for-rent-in-gulshan/page-10/",
    "https://www.bproperty.com/en/dhaka/apartments-for-rent-in-gulshan/page-11/"
]
for u in urls:
    page = requests.get(u)
    soup = BeautifulSoup(page.content, 'html.parser')
    lists = soup.find_all('article', class_="ca2f5674")
    with open("bproperty-gulshan.csv", 'w', encoding="utf8", newline='') as f:
        wrt = writer(f)
        header = ["Title", "Location", "Price", "type", "Beds", "Baths", "Length"]
        wrt.writerow(header)
        for list in lists:
            price = list.find('span', class_="f343d9ce").text.replace("\n", "")
            location = list.find('div', class_="_7afabd84").text.replace("\n", "")
            type = list.find('div', class_="_9a4e3964").text.replace("\n", "")
            title = list.find('h2', class_="_7f17f34f").text.replace("\n", "")
            beds = list.find('span', class_="b6a29bc0").text.replace("\n", "")
            baths = list.find('span', class_="b6a29bc0").text.replace("\n", "")
            length = list.find('span', class_="b6a29bc0").text.replace("\n", "")
            info = [title, location, price, type, beds, baths, length]
            wrt.writerow(info)
            print(info)
Here is my CSV
So actually I want to get the data from all the pages within one script. Is there any way to do this, or is there any way to solve this issue?
list and type are built-in names in Python. Do not use them as variable names.
Using them as variable names masks the built-ins list and type within the scope of the block. So while doing so does not raise a SyntaxError, it is not considered good practice, and I would avoid it.
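A quick illustration of the masking:

lists = [[1, 2], [3, 4]]
for list in lists:  # the loop rebinds the name list, shadowing the built-in
    pass
list((5, 6))  # TypeError: 'list' object is not callable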
Also, since you are opening and closing the file inside the loop, use append mode.
The code below should work fine:
for u in urls:
    page = requests.get(u)
    soup = BeautifulSoup(page.content, 'html.parser')
    lists = soup.find_all('article', class_="ca2f5674")
    with open("bproperty-gulshan.csv", 'a', encoding="utf8", newline='') as f:
        wrt = writer(f)
        header = ["Title", "Location", "Price", "type", "Beds", "Baths", "Length"]
        wrt.writerow(header)
        for list_ in lists:
            price = list_.find('span', class_="f343d9ce").text.replace("\n", "")
            location = list_.find('div', class_="_7afabd84").text.replace("\n", "")
            type_ = list_.find('div', class_="_9a4e3964").text.replace("\n", "")
            title = list_.find('h2', class_="_7f17f34f").text.replace("\n", "")
            beds = list_.find('span', class_="b6a29bc0").text.replace("\n", "")
            baths = list_.find('span', class_="b6a29bc0").text.replace("\n", "")
            length = list_.find('span', class_="b6a29bc0").text.replace("\n", "")
            info = [title, location, price, type_, beds, baths, length]
            wrt.writerow(info)
            print(info)
which gives us the rows from every page. Keep in mind your headers are written every time you open the file, so it's better to open the handle to the file only once and close it when you are done processing the loop.
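A sketch of that restructuring, with the file opened once and the header written a single time. Two assumptions are baked in: the page-N URL pattern from the question holds, so the list is built programmatically; and Beds/Baths/Length are read as the first three spans of the shared class, which is an assumption about their order (the original selectors call find() on the same class three times and so return the same value for all three columns):

from csv import writer

import requests
from bs4 import BeautifulSoup

base = "https://www.bproperty.com/en/dhaka/apartments-for-rent-in-gulshan/"
urls = [base] + [f"{base}page-{n}/" for n in range(2, 12)]  # same pages as the hard-coded list

with open("bproperty-gulshan.csv", 'w', encoding="utf8", newline='') as f:
    wrt = writer(f)
    # header written exactly once
    wrt.writerow(["Title", "Location", "Price", "type", "Beds", "Baths", "Length"])
    for u in urls:
        page = requests.get(u)
        soup = BeautifulSoup(page.content, 'html.parser')
        for article in soup.find_all('article', class_="ca2f5674"):
            spans = article.find_all('span', class_="b6a29bc0")  # assumed order: beds, baths, length
            wrt.writerow([
                article.find('h2', class_="_7f17f34f").text.strip(),
                article.find('div', class_="_7afabd84").text.strip(),
                article.find('span', class_="f343d9ce").text.strip(),
                article.find('div', class_="_9a4e3964").text.strip(),
                spans[0].text.strip() if len(spans) > 0 else "",
                spans[1].text.strip() if len(spans) > 1 else "",
                spans[2].text.strip() if len(spans) > 2 else "",
            ])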

Writing scraped data in rows with python

I have a basic bs4 web scraper. There are no issues getting the data, but when I try to write it to a .csv file I run into problems: I am unable to write my data to more than one column. In the tutorial I loosely follow, he can separate the values with "," easily, but when I open my CSV with Excel there is no separation in either the header or the data. What am I missing?
import requests
from bs4 import BeautifulSoup

url = "myurl"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
items = soup.find_all('a', class_='listing-card')

filename = 'data.csv'
f = open(filename, "w")
header = "name, price\n"
f.write(header)
for item in items:
    title = item.find('span', class_='title').text
    price = item.find('span', class_='price').text
    f.write(title.replace(",", "|") + ',' + price + "\n")
f.close()
Another method.
from simplified_scrapy import SimplifiedDoc, utils, req

url = "myurl"
html = req.get(url)
rows = []
rows.append(['name', 'price'])  # Add header
doc = SimplifiedDoc(html)
items = doc.getElements('a', attr='class', value='listing-card')  # Get all <a> nodes by class
for item in items:
    title = item.getElement('span', value='title').text
    price = item.getElement('span', value='price').text
    rows.append([title, price])
utils.save2csv('data.csv', rows)  # Save to CSV file
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
I have found that the easiest way to get your data into a CSV file is to put the data into a pandas DataFrame and then use the to_csv method to write the file.
Using your example, the code would be as follows:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "myurl"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
items = soup.find_all('a', class_='listing-card')

# Create an empty list to store entries
mylist = []
for item in items:
    title = item.find('span', class_='title').text
    price = item.find('span', class_='price').text
    # Create the dictionary item to be appended to the list
    entry = {'name': title, 'price': price}
    mylist.append(entry)

myDataframe = pd.DataFrame(mylist)
myDataframe.to_csv('CSV_file.csv', index=False)  # index=False drops the row-index column
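Note that to_csv quotes any field that contains the delimiter (the csv.QUOTE_MINIMAL behaviour), so the title.replace(",", "|") workaround from the question is no longer needed.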

Scrape information from multiple URLs listed in a CSV using BeautifulSoup and then export these results to a new CSV file

I have a CSV file with 45k+ rows, each one containing a different path on the same domain (the pages are structurally identical to each other), and every single one is clickable. I managed to use BeautifulSoup to scrape the title and content of each one, and through the print function I was able to validate the scraper. However, when I try to export the information gathered to a new CSV file, I only get the last URL's street name and description, not all of them as I expected.
from bs4 import BeautifulSoup
import requests
import csv

with open('URLs.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        site = requests.get(row['addresses']).text
        soup = BeautifulSoup(site, 'lxml')
        StreetName = soup.find('div', class_='hist-title').text
        Description = soup.find('div', class_='hist-content').text

with open('OutputList.csv', 'w', newline='') as output:
    Header = ['StreetName', 'Description']
    writer = csv.DictWriter(output, fieldnames=Header)
    writer.writeheader()
    writer.writerow({'StreetName': StreetName, 'Description': Description})
How can the output CSV have on each row the street name and description for the respective URL row in the input CSV file?
You need to open both files on the same level and then read and write on each iteration. Something like this:
from bs4 import BeautifulSoup
import requests
import csv

with open('URLs.csv') as a, open('OutputList.csv', 'w', newline='') as b:
    reader = csv.reader(a)
    writer = csv.writer(b, quoting=csv.QUOTE_ALL)
    writer.writerow(['StreetName', 'Description'])
    next(reader, None)  # skip the header row of the input file
    # Assuming url is the first field in the CSV
    for url, *_ in reader:
        r = requests.get(url)
        if r.ok:
            soup = BeautifulSoup(r.text, 'lxml')
            street_name = soup.find('div', class_='hist-title').text.strip()
            description = soup.find('div', class_='hist-content').text.strip()
            writer.writerow([street_name, description])
I hope it helps.

bs4 python extracting value from <span></span> to .csv printing the same result over and over

I have managed to build a very primitive program to scrape vehicle data from PistonHeads and print it to a .csv file with the link, make and model. I am now working on getting the price, which is where I am encountering a problem.
I want to scrape the prices into the fourth column in my .csv file (Price) and to correctly print the price of each vehicle on the website.
I am only getting it to print the price from one vehicle, repeated again and again next to each vehicle in the .csv file.
I have tried soup.findAll and soup.find_all to see whether parsing through multiple elements would work, but this just created a bigger mess.
Might someone be able to help?
I am also trying to scrape the image src and would like to print that in another column (5) called Images.
import csv
import requests
from bs4 import BeautifulSoup

outfile = open('pistonheads.csv', 'w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Link", "Make", "Model", "Price"])

url = 'https://www.pistonheads.com/classifieds?Category=used-cars&Page=1&ResultsPerPage=100'
get_url = requests.get(url)
get_text = get_url.text
soup = BeautifulSoup(get_text, 'html.parser')
car_link = soup.find_all('div', 'listing-headline', 'price')
for div in car_link:
    links = div.findAll('a')
    for a in links:
        link = ("https://www.pistonheads.com" + a['href'])
        make = (a['href'].split('/')[-4])
        model = (a['href'].split('/')[-3])
        price = soup.find('span')
        writer.writerow([link, make, model, price])
        print(link, make, model, price)
outfile.close()
You can try this:
import csv, requests, re
from bs4 import BeautifulSoup as soup

d = soup(requests.get('https://www.pistonheads.com/classifieds?Category=used-cars&ResultsPerPage=100').text, 'html.parser')

def extract_details(_s: soup) -> list:
    _link = _s.find('a', {'href': re.compile(r'/classifieds/used-cars/')})['href']
    _, _, make, model, *_ = _link[1:].split('/')
    price, img = _s.find('div', {'class': 'price'}).text, [i['src'] for i in _s.find_all('img')]
    return [_link, make, model, price, 'N/A' if not img else img[0]]

with open('filename.csv', 'w', newline='') as f:
    _listings = [extract_details(i) for i in d.find_all('div', {'class': 'ad-listing'}) if i.find('div', {'class': 'price'})]
    write = csv.writer(f)
    write.writerows([['link', 'make', 'model', 'price', 'img'], *_listings])
The reason is price = soup.find('span').
.find() will grab the first element it finds, and you have it looking in your whole soup object. But where you want it to look is within your a, because that's what you are looping through with for a in links:.
I also added .text, as I am assuming you just want the text, not the whole tag element, i.e. price = a.find('span').text
import csv
import requests
from bs4 import BeautifulSoup

outfile = open('pistonheads.csv', 'w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Link", "Make", "Model", "Price", "Images"])

url = 'https://www.pistonheads.com/classifieds?Category=used-cars&Page=1&ResultsPerPage=100'
get_url = requests.get(url)
get_text = get_url.text
soup = BeautifulSoup(get_text, 'html.parser')
car_link = soup.find_all('div', 'listing-headline', 'price')
for div in car_link:
    links = div.findAll('a')
    for a in links:
        link = ("https://www.pistonheads.com" + a['href'])
        make = (a['href'].split('/')[-4])
        model = (a['href'].split('/')[-3])
        price = a.find('span').text
        image_link = a.parent.parent.find('img')['src']
        image = link + image_link
        writer.writerow([link, make, model, price, image])
        print(link, make, model, price, image)
outfile.close()

Not able to extract all URLs from JSON script using BeautifulSoup

import requests
from bs4 import BeautifulSoup
import json
import re
url = "https://www.daraz.pk/catalog/?q=dell&_keyori=ss&from=input&spm=a2a0e.searchlist.search.go.57446b5079XMO8"
page = requests.get(url)
print(page.status_code)
print(page.text)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())
alpha = soup.find_all('script',{'type':'application/ld+json'})
jsonObj = json.loads(alpha[1].text)
Below is the code to find all the relevant product information in the JSON object:
for item in jsonObj['itemListElement']:
    name = item['name']
    price = item['offers']['price']
    currency = item['offers']['priceCurrency']
    availability = item['offers']['availability'].split('/')[-1]
    availability = [s for s in re.split("([A-Z][^A-Z]*)", availability) if s]
    availability = ' '.join(availability)
Here is the code to extract the URL from the JSON script (these lines sit inside the same loop):

    url = item['url']
    print('Availability: %s Price: %0.2f %s Name: %s URL: %s' % (availability, float(price), currency, name, url))
Below is the code to extract the data into a CSV:
outfile = open('products.csv', 'w', newline='')
writer = csv.writer(outfile)
writer.writerow(["name", "type", "price", "priceCurrency", "availability"])
alpha = soup.find_all('script', {'type': 'application/ld+json'})
jsonObj = json.loads(alpha[1].text)
for item in jsonObj['itemListElement']:
    name = item['name']
    type = item['#type']
    url = item['url']
    price = item['offers']['price']
    currency = item['offers']['priceCurrency']
    availability = item['offers']['availability'].split('/')[-1]
The file creates the header, but there is no data in the CSV for the URL:

    writer.writerow([name, type, price, currency, availability, URL])
outfile.close()
First, you don't include the header there. Not a big deal, just the first row would have a blank in your header in the url column. To include it:
writer.writerow(["name", "type", "price", "priceCurrency", "availability", "url"])
Second, you store the string as url, but then reference URL in your writer. URL isn't holding any value; in fact, it should have given a NameError: name 'URL' is not defined or something similar.
And since you already use url in your code with url = "https://www.daraz.pk/catalog/?q=dell&_keyori=ss&from=input&spm=a2a0e.searchlist.search.go.57446b5079XMO8", I would also probably change the variable name to something like url_text.
I'd probably also use a variable name like type_text or something other than type, since type is a built-in function in Python.
But you need to change it to:
writer.writerow([name, type, price, currency, availability, url])
outfile.close()
Full code:
import requests
from bs4 import BeautifulSoup
import json
import csv

url = "https://www.daraz.pk/catalog/?q=dell&_keyori=ss&from=input&spm=a2a0e.searchlist.search.go.57446b5079XMO8"
page = requests.get(url)
print(page.status_code)
print(page.text)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())
alpha = soup.find_all('script', {'type': 'application/ld+json'})
jsonObj = json.loads(alpha[1].text)

outfile = open(r'c:\products.csv', 'w', newline='')
writer = csv.writer(outfile)
writer.writerow(["name", "type", "price", "priceCurrency", "availability", "url"])
for item in jsonObj['itemListElement']:
    name = item['name']
    type_text = item['#type']
    url_text = item['url']
    price = item['offers']['price']
    currency = item['offers']['priceCurrency']
    availability = item['offers']['availability'].split('/')[-1]
    writer.writerow([name, type_text, price, currency, availability, url_text])
outfile.close()
The only thing I could find wrong is a typo in the last line: upper-case URL instead of lower-case url. Changing it made the script work perfectly.
