Python 3 BeautifulSoup to scrape county names from a gov.uk webpage

I would be grateful for any help!
I'm trying to scrape the county names on this webpage (https://www.gov.uk/guidance/full-list-of-local-restriction-tiers-by-area) into four corresponding lists: Tier1, Tier2, Tier3, Tier4.
The issue is how I'm navigating the page...
This is how I'm setting my soup.
import requests
from bs4 import BeautifulSoup
url = "https://www.gov.uk/guidance/full-list-of-local-restriction-tiers-by-area"
headers = {...}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
I've tried finding the h2s and then looping through the siblings, find_all_next, etc. but I haven't had any luck.
Endstate
I'm trying to put each of the counties into a CSV that looks like this:
(colour is mapped as follows: Tier 1 = Green, Tier 2 = Yellow, Tier 3 = Amber, Tier 4 = Red)
County, Country, Tier, colour
Isles of Scilly, England, 1, Green
Rutland, England, 3, Amber
etc.
Update: As a bare minimum example of the data to be extracted:
from bs4 import BeautifulSoup
html = '''<div class="govspeak">
<ul>
<li>case detection rates in all age groups</li>
</ul>
<h2 id="tier-1-medium-alert">Tier 1: Medium alert</h2>
<h3 id="south-west">South West</h3>
<ul>
<li>Isles of Scilly</li>
</ul>
<h2 id="tier-2-high-alert">Tier 2: High alert</h2>
<p>No areas are currently in Tier 2.</p>
<h2 id="tier-3-very-high-alert">Tier 3: Very High alert</h2>
<h3 id="east-midlands">East Midlands</h3>
<ul>
<li>Rutland</li>
</ul>
<h3 id="north-west">North West</h3>
<ul>
<li>Liverpool City Region</li>
</ul>
</div>'''
soup = BeautifulSoup(html, "lxml")
h2 = soup.find_all('h2')
# What's the best way to find the related li tags?

The issue is that in the HTML the h2 and ul elements sit side by side in a flat structure. There are many ways to extract the data; one approach is to perform a for loop over every li element:
soup.find('div', {"class": "govspeak"}) - find the parent div (which contains the h2 and li elements).
container.find_all('li') - find all the li elements.
x.fetchPrevious('h2')[0].text.strip() - find the first [0] previous h2 (and strip any whitespace).
if x.fetchPrevious('h2')[0].findParent('div', {"class": "govspeak"}) - filter out any h2 that doesn't appear inside the parent div (as fetchPrevious will literally find the previous h2, wherever it is).
A namedtuple (which I've called CountyTierModel) to store the scraped data as a list of records.
re.search(r"(?<=Tier )\d(?=:)", x.tier) - a regex to fetch the tier number from the h2 title.
Example for scraping data:
from collections import namedtuple
import re
import requests
from bs4 import BeautifulSoup
CountyTierModel = namedtuple('CountyTiers', ['tier', 'county'])
url = "https://www.gov.uk/guidance/full-list-of-local-restriction-tiers-by-area"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
container = soup.find('div', {"class": "govspeak"})
results = [CountyTierModel(x.fetchPrevious('h2')[0].text.strip(), x.text.strip()) for x in container.find_all('li')
           if x.fetchPrevious('h2') and x.fetchPrevious('h2')[0].findParent('div', {"class": "govspeak"})]
# Here you can write your code to convert to CSV & provide the mapping for Country & colour.
for x in results:
    # Regex to extract the number from the h2 title, based on the pattern 'Tier ' + number + ':'
    m = re.search(r"(?<=Tier )\d(?=:)", x.tier)
    print(f"{m.group(0)} - {x.county}")
Output:
1 - Isles of Scilly
3 - Rutland
3 - Liverpool City Region
3 - Bath and North East Somerset
3 - Bristol
3 - Cornwall
3 - Devon, Plymouth and Torbay
3 - Dorset
3 - North Somerset
3 - South Gloucestershire
3 - Wiltshire
3 - Herefordshire
3 - Shropshire, and Telford and Wrekin
3 - Worcestershire
3 - City of York and North Yorkshire
3 - The Humber: East Riding of Yorkshire, Kingston upon Hull/Hull, North East Lincolnshire and North Lincolnshire
3 - South Yorkshire (Barnsley, Doncaster, Rotheram, Sheffield)
3 - West Yorkshire (Bradford, Calderdale, Kirklees, Leeds, Wakefield)
4 - Derby and Derbyshire
4 - Leicester City and Leicestershire
4 - Lincolnshire
4 - Northamptonshire
4 - Nottingham and Nottinghamshire
4 - Bedford, Central Bedfordshire, Luton and Milton Keynes
4 - Cambridgeshire
4 - Essex, Southend-on-Sea and Thurrock
4 - Hertfordshire
4 - Norfolk
4 - Peterborough
4 - Suffolk
4 - All 32 London boroughs plus City of London
4 - North East Combined Authority (this area includes the local authorities of County Durham, Gateshead, South Tyneside and Sunderland)
4 - North of Tyne Combined Authority (this area includes the local authorities of Newcastle-upon-Tyne, North Tyneside and Northumberland)
4 - Tees Valley Combined Authority (this area includes the local authorities of Darlington, Hartlepool, Middlesbrough, Redcar and Cleveland, and Stockton-on-Tees)
4 - Cumbria
4 - Greater Manchester
4 - Lancashire, Blackburn with Darwen, and Blackpool
4 - Warrington and Cheshire Region
4 - Berkshire
4 - Brighton and Hove, East Sussex and West Sussex
4 - Buckinghamshire
4 - Hampshire, Southampton and Portsmouth
4 - Isle of Wight
4 - Kent and Medway
4 - Oxfordshire
4 - Surrey
4 - Bournemouth, Christchurch and Poole
4 - Gloucestershire (Cheltenham, Cotswold, Forest of Dean, Gloucester City, Stroud and Tewkesbury)
4 - Somerset (Mendip, Sedgemoor, Somerset West and Taunton, and South Somerset)
4 - Swindon
4 - Birmingham, Dudley, Sandwell, Walsall and Wolverhampton
4 - Coventry
4 - Solihull
4 - Staffordshire and Stoke-on-Trent
4 - Warwickshire
Note: To keep the question focused, I've only added code for the scraping. Extracting to CSV should be a separate question.
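That said, here is a minimal sketch of the CSV/colour-mapping step flagged in the comment above. It assumes the results list built by the snippet, the tier-to-colour table from the question, and that every area on this page is in England:
import csv
import re

colour_map = {"1": "Green", "2": "Yellow", "3": "Amber", "4": "Red"}  # mapping from the question

with open("tiers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["County", "Country", "Tier", "colour"])
    for x in results:
        tier = re.search(r"(?<=Tier )\d(?=:)", x.tier).group(0)  # same regex as above
        writer.writerow([x.county, "England", tier, colour_map[tier]])  # Country hard-coded (assumption for this page)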

Related

Python BeautifulSoup problem: 'NoneType' object has no attribute 'find_all'

Hi, I'm new to web scraping and I've run into a problem that no other answer has solved for me.
from bs4 import BeautifulSoup
import requests
import csv
from itertools import zip_longest
import re
jobTitle = []
companyName = []
location = []
employmentReq = []
exp = []
startTime = []
links = []
salary = []
experience = []
pageNum = 0
result = requests.get(f"https://wuzzuf.net/search/jobs/?a=hpb&q=web&start={pageNum}")
src = result.content
soup = BeautifulSoup(src, "html.parser")
jobTitles = soup.find_all("h2", {"class": "css-m604qf"})
companyNames = soup.find_all("a", {"class": "css-17s97q8"})
locations = soup.find_all("span", {"class": "css-5wys0k"})
employment = soup.find_all("div", {"class": "css-1lh32fc"})
for i in range(len(jobTitles)):
    jobTitle.append(jobTitles[i].text)
    links.append("https://wuzzuf.net" + jobTitles[i].find("a").attrs['href'])
    companyName.append(companyNames[i].text)
    location.append(locations[i].text)
    employmentReq.append(employment[i].text)
    years = re.sub(r'[^0-9-]', '', soup.find_all("div", {"class": "css-y4udm8"})[i].find_all("div")[1].find_all("span")[0].text)
    experience.append(years)
for link in links:
    result = requests.get(link)
    src = result.content
    soup = BeautifulSoup(src, "html.parser")
    a = soup.find("main")
    print(a)
    b = a.find("section", {"class": "css-3kx5e2"})
    print(b)
    c = b.find_all("div")
    print(c)
    d = c.find_all("span")
    print(d)
    #if sal_span != "Confidential":
    #    salaries = re.sub(r'E.*$', '', sal_span)
    #else:
    #    salaries = sal_span
    #salary.append(salaries)
    #print(salary)
#fileList = [jobTitle, companyName, location, employmentReq, links, experience, salary]
#exported = zip_longest(*fileList)
#with open("D:\T1t4nProject\python\wuzzuf.csv", "w") as excel_sheet:
#    wr = csv.writer(excel_sheet)
#    wr.writerow(["job title", "company name", "location", "full or part time", "links", "Years of experience", "Salary"])
#    wr.writerows(exported)
The error that I get is this:
sal_span = soup.find("main").find("section",{"class":"css-3kx5e2"}).find_all("div")[3].find_all("span")[1].find("span").text
AttributeError: 'NoneType' object has no attribute 'find_all'
I don't know what the problem is; I used the same method for the variable years without getting such errors.
Note that the variable years is taken from another page.
Try to avoid dynamic class names; if that's difficult, keep them in a separate variable so you can quickly get the parser working again when they change. I made a simple example that will get the fields you need from the first 3 pages.
import pandas as pd
import requests
from bs4 import BeautifulSoup
def get_jobs(page_count):
    jobs = []
    for page in range(page_count):
        url = f'https://wuzzuf.net/search/jobs/?a=hpb&q=web&start={page}'
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'lxml')
        job_block_class = 'css-pkv5jc'
        job_type_class = 'css-n2jc4m'
        for job in soup.find_all('div', class_=job_block_class):
            jobs.append({
                'Link': 'https://wuzzuf.net' + job.find('h2').find('a').get('href'),
                'Title': job.find('h2').get_text(),
                'Company': job.find('h2').findNext('div').find('a').get_text().strip(' -'),
                'Location': job.find('h2').findNext('div').findNext('span').get_text(),
                'Type': ', '.join([jt.get_text() for jt in job.findAll('a', class_=job_type_class)]),
                'Experienced': job.find('div', attrs={'class': None}).findNext('span').get_text()[2:]
            })
    return jobs

df = pd.DataFrame(get_jobs(3))
# df.to_csv('filename.csv')
print(df.to_string())
OUTPUT:
Link Title Company Location Type Experienced
0 https://wuzzuf.net/jobs/p/YbxjPYIASBeD-Web-Developer-WEBFLOW-ENGLISH-Giza-Egypt?o=1&l=sp&t=sj&a=web|search-v3|hpb Web Developer WEBFLOW (ENGLISH) Confidential Giza, Egypt Full Time, Part Time 4 - 5 Yrs of Exp
1 https://wuzzuf.net/jobs/p/4JeWRMrh02cL-Web-Design-Instructor-Arabic-Localizer-Giza-Egypt?o=2&l=sp&t=sj&a=web|search-v3|hpb Web Design Instructor Arabic Localizer Mohandessin, Giza, Egypt Part Time, Freelance / Project 3 - 4 Yrs of Exp
2 https://wuzzuf.net/jobs/p/ymGucmZ1g9bu-Web-Developer-Flojics-Alexandria-Egypt?o=3&l=sp&t=sj&a=web|search-v3|hpb Web Developer Flojics Alexandria, Egypt Full Time 0 - 1 Yrs of Exp
3 https://wuzzuf.net/jobs/p/BZgw5FvOYM9n-Web-Developer-Usual-Agency-Cairo-Egypt?o=4&l=sp&t=sj&a=web|search-v3|hpb Web Developer Usual Agency Heliopolis, Cairo, Egypt Full Time 2 - 4 Yrs of Exp
4 https://wuzzuf.net/jobs/p/wy1BWQF9OItu-Junior-Web-Developer-Back-End---Alexandria-Arabic-Localizer-Alexandria-Egypt?o=5&l=sp&t=sj&a=web|search-v3|hpb Junior Web Developer (Back-End) - Alexandria Arabic Localizer San Stefano, Alexandria, Egypt Full Time 0 - 3 Yrs of Exp
5 https://wuzzuf.net/jobs/p/PaP6bc0KD9NB-Front-End-Web-Developer-ReactJS-Cairo-Egypt?o=6&l=sp&t=sj&a=web|search-v3|hpb Front-End Web Developer (React.JS) Confidential New Cairo, Cairo, Egypt Full Time 2+ Yrs of Exp
6 https://wuzzuf.net/jobs/p/VnU8hMLxuERr-Translator-Web-Content-Writer-INTERNET-SOLUTIONS-Giza-Egypt?o=7&l=sp&t=sj&a=web|search-v3|hpb Translator-Web Content Writer INTERNET SOLUTIONS Sheikh Zayed, Giza, Egypt Full Time 2 - 4 Yrs of Exp
7 https://wuzzuf.net/jobs/p/z7kkIx8RUP5V-Web-Programmer-Alexandria-Egypt?o=8&l=sp&t=sj&a=web|search-v3|hpb Web Programmer Confidential Cleopatra, Alexandria, Egypt Full Time 3 - 7 Yrs of Exp
8 https://wuzzuf.net/jobs/p/W43NaUFQdk4H-UI-Web-Developer-Perfect-Presentation-Giza-Egypt?o=9&l=sp&t=sj&a=web|search-v3|hpb UI Web Developer Perfect Presentation 6th of October, Giza, Egypt Full Time 5+ Yrs of Exp
9 https://wuzzuf.net/jobs/p/cuIQdXTtU7Kv-Senior-NetAngular-Web-Developer-Ultimate-Solutions-Egypt-Giza-Egypt?o=10&l=sp&t=sj&a=web|search-v3|hpb Senior .Net/Angular Web Developer Ultimate Solutions Egypt 6th of October, Giza, Egypt Full Time 4+ Yrs of Exp
10 https://wuzzuf.net/jobs/p/2CHzjc7E8zz5-Senior-Java-Web-Developer-FlairsTech-Cairo-Egypt?o=11&l=sp&t=sj&a=web|search-v3|hpb Senior Java Web Developer FlairsTech Maadi, Cairo, Egypt Full Time 3 - 10 Yrs of Exp
11 https://wuzzuf.net/jobs/p/X3gNnVZM3Y3u-UIUX-Web-Designer-Nile-Creations-Cairo-Egypt?o=12&l=sp&t=sj&a=web|search-v3|hpb UI/UX Web Designer Nile Creations Heliopolis, Cairo, Egypt Full Time 3+ Yrs of Exp
12 https://wuzzuf.net/jobs/p/61mX9Z4mLmOc-Web-Developer-Etisal-EG-Cairo-Egypt?o=13&l=sp&t=sj&a=web|search-v3|hpb Web Developer Etisal EG New Cairo, Cairo, Egypt Full Time 3+ Yrs of Exp
13 https://wuzzuf.net/jobs/p/tePwhoJIBK2V-Senior-Web-Developer-Turbo-Giza-Egypt?o=14&l=sp&t=sj&a=web|search-v3|hpb Senior Web Developer Turbo Giza, Egypt Full Time 3 - 6 Yrs of Exp
14 https://wuzzuf.net/jobs/p/q6WxlzYhj8VX-Full-Stack-Web-Developer--E-Commerce-Technical-Manager-Cairo-Egypt?o=15&l=sp&t=sj&a=web|search-v3|hpb Full Stack Web Developer -E Commerce Technical Manager Confidential Nasr City, Cairo, Egypt Full Time 5 - 8 Yrs of Exp
15 https://wuzzuf.net/jobs/p/ezagCZAHRsEB-Senior-Web-Developer-Madar-Soft-Alexandria-Egypt?o=16&l=sp&t=sj&a=web|search-v3|hpb Senior Web Developer Madar Soft Fleming, Alexandria, Egypt Full Time 3+ Yrs of Exp
16 https://wuzzuf.net/jobs/p/FZfxQVvPd354-Web-Developer-Qualify-For-Training-Cairo-Egypt?o=17&l=sp&t=sj&a=web|search-v3|hpb Web Developer Qualify For Training Cairo, Egypt Part Time, Freelance / Project, Work From Home 3 - 7 Yrs of Exp
17 https://wuzzuf.net/jobs/p/j2FQjOw3JOyN-PHP-Web-Developer-Giza-Egypt?o=18&l=sp&t=sj&a=web|search-v3|hpb PHP Web Developer Confidential Dokki, Giza, Egypt Full Time 2+ Yrs of Exp
18 https://wuzzuf.net/jobs/p/dO4oaKV5oTNn-Senior-Web-Developer-Naba-Soft-Cairo-Egypt?o=19&l=sp&t=sj&a=web|search-v3|hpb Senior Web Developer Naba Soft Nasr City, Cairo, Egypt Full Time 5 - 20 Yrs of Exp
19 https://wuzzuf.net/jobs/p/OGbcaevuFODl-Web-Developer-Akhnaton-for-Trading-Distributing-Cairo-Egypt?o=20&l=sp&t=sj&a=web|search-v3|hpb Web Developer Akhnaton for Trading & Distributing Downtown, Cairo, Egypt Full Time 3+ Yrs of Exp
20 https://wuzzuf.net/jobs/p/mRehCpOufJWl-Junior-ASPNET-Web-Developer-Dimensions-Information-Technology-Cairo-Egypt?o=21&l=sp&t=sj&a=web|search-v3|hpb Junior ASP.NET Web Developer Dimensions Information Technology Cairo, Egypt Full Time 4 - 6 Yrs of Exp
21 https://wuzzuf.net/jobs/p/fgFp51ORpe21-Web-Designer-Front-End-Developer---WordPress-INTERNET-SOLUTIONS-Giza-Egypt?o=22&l=sp&t=sj&a=web|search-v3|hpb Web Designer & Front End Developer - WordPress INTERNET SOLUTIONS Sheikh Zayed, Giza, Egypt Full Time 2 - 5 Yrs of Exp
22 https://wuzzuf.net/jobs/p/HmjaKL0NYT8c-Web-Developer-Part-time-Fixed-term-contract-Cairo-Egypt?o=23&l=sp&t=sj&a=web|search-v3|hpb Web Developer ( Part-time, Fixed term contract) Confidential Heliopolis, Cairo, Egypt Part Time, Work From Home 5 - 7 Yrs of Exp
23 https://wuzzuf.net/jobs/p/hPvrjx9oe5CX-Web-Developer-justagain-Cairo-Egypt?o=24&l=sp&t=sj&a=web|search-v3|hpb Web Developer justagain New Cairo, Cairo, Egypt Full Time 1 - 3 Yrs of Exp
24 https://wuzzuf.net/jobs/p/4CDCKYnVUNHt-Web-Development-Team-Leader-Cairo-Egypt?o=25&l=sp&t=sj&a=web|search-v3|hpb Web Development Team Leader Confidential Cairo, Egypt Full Time, Work From Home 7+ Yrs of Exp
25 https://wuzzuf.net/jobs/p/1b6C9mKaEYw5-Digital-Chat-Representative-Web-Chat-Seoudi-Supermarket-Giza-Egypt?o=26&l=sp&t=sj&a=web|search-v3|hpb Digital Chat Representative (Web Chat) Seoudi Supermarket Sheikh Zayed, Giza, Egypt Full Time 1 - 3 Yrs of Exp
26 https://wuzzuf.net/jobs/p/BfFChnjzswPX-Web-Developer-PHP---Open-Source-Flojics-Alexandria-Egypt?o=27&l=sp&t=sj&a=web|search-v3|hpb Web Developer (PHP - Open-Source) Flojics Alexandria, Egypt Full Time, Work From Home 5+ Yrs of Exp
27 https://wuzzuf.net/jobs/p/eVAPCCiIKpxZ-Senior-Web-Developer-Flojics-Cairo-Egypt?o=28&l=sp&t=sj&a=web|search-v3|hpb Senior Web Developer Flojics Cairo, Egypt Full Time, Work From Home 5 - 20 Yrs of Exp
28 https://wuzzuf.net/jobs/p/7aFp3iCHa5kz-Talented-Web-Designer-Egypt-Yellow-Pages-Cairo-Egypt?o=29&l=sp&t=sj&a=web|search-v3|hpb Talented Web Designer Egypt Yellow Pages Maadi, Cairo, Egypt Full Time 2 - 3 Yrs of Exp
29 https://wuzzuf.net/jobs/p/VKHX8zMmaWEm-NET-Web-Developer-Giza-Egypt?o=30&l=sp&t=sj&a=web|search-v3|hpb .NET Web Developer Confidential 6th of October, Giza, Egypt Full Time 2 - 3 Yrs of Exp
30 https://wuzzuf.net/jobs/p/AcqIdMveUs71-Creative-Web-Developer-ParamInfo-Dubai-United-Arab-Emirates?o=31&l=sp&t=sj&a=web|search-v3|hpb Creative Web Developer ParamInfo Dubai, United Arab Emirates Full Time 2 - 10 Yrs of Exp
31 https://wuzzuf.net/jobs/p/EM8J2Kkh0ErO-Senior-PHP-Web-Developer-Arabia-for-Information-Technology-Cairo-Egypt?o=32&l=sp&t=sj&a=web|search-v3|hpb Senior PHP Web Developer Arabia for Information Technology Cairo, Egypt Full Time 5+ Yrs of Exp
32 https://wuzzuf.net/jobs/p/2kBsyMmlCKTU-Front-End-Web-Developer-Peerless-Giza-Egypt?o=33&l=sp&t=sj&a=web|search-v3|hpb Front End Web Developer Peerless Mohandessin, Giza, Egypt Full Time 2 - 4 Yrs of Exp
33 https://wuzzuf.net/jobs/p/M0fjpZUveVl8-Senior-Java-Developer-AllegianceMD-Cairo-Egypt?o=34&l=sp&t=sj&a=web|search-v3|hpb Senior Java Developer AllegianceMD Mokattam, Cairo, Egypt Full Time 3 - 5 Yrs of Exp
34 https://wuzzuf.net/jobs/p/fILb5T14SNpB-Software-Engineer---Kuwait-RDI-Cairo-Egypt?o=35&l=sp&t=sj&a=web|search-v3|hpb Software Engineer - Kuwait RDI Cairo, Egypt Full Time 5+ Yrs of Exp
35 https://wuzzuf.net/jobs/p/wGTJHY1ohsUC-Senior-Front-End-Developer-Ejadtech-Giza-Egypt?o=36&l=sp&t=sj&a=web|search-v3|hpb Senior Front End Developer Ejadtech Dokki, Giza, Egypt Full Time 4+ Yrs of Exp
36 https://wuzzuf.net/jobs/p/PBZM2s2pFVKn-Senior-NET-Developer-ASPNET-WebForms---Hybrid-GET-Group--Egypt-Cairo-Egypt?o=37&l=sp&t=sj&a=web|search-v3|hpb Senior .NET Developer ( ASP.NET WebForms ) - Hybrid GET Group- Egypt Heliopolis, Cairo, Egypt Full Time 3+ Yrs of Exp
37 https://wuzzuf.net/jobs/p/UP8Zyh8D83Py-Senior-PHP-Developer-Riyadh-Saudi-Arabia?o=38&l=sp&t=sj&a=web|search-v3|hpb Senior PHP Developer Confidential Riyadh, Saudi Arabia Full Time 2 - 8 Yrs of Exp
38 https://wuzzuf.net/jobs/p/fBOHU9BwWhJG-Java-Team-Lead-Ejada-Cairo-Cairo-Egypt?o=39&l=sp&t=sj&a=web|search-v3|hpb Java Team Lead Ejada (Cairo) Cairo, Egypt Full Time 8+ Yrs of Exp
39 https://wuzzuf.net/jobs/p/a17VcCnAyamF-Senior-Java-Developer-Ejada-Cairo-Cairo-Egypt?o=40&l=sp&t=sj&a=web|search-v3|hpb Senior Java Developer Ejada (Cairo) Heliopolis, Cairo, Egypt Full Time 4 - 6 Yrs of Exp
40 https://wuzzuf.net/jobs/p/VLWdhbaQDPJ0-Development-Team-Leader-egabi-solutions-Cairo-Egypt?o=41&l=sp&t=sj&a=web|search-v3|hpb Development Team Leader egabi solutions Cairo, Egypt Full Time 6 - 10 Yrs of Exp
41 https://wuzzuf.net/jobs/p/vYqyoutFh9m4-Senior-Graphic-Designer---UIUXWebsite-Giza-Egypt?o=42&l=sp&t=sj&a=web|search-v3|hpb Senior Graphic Designer - UI/UX/Website Confidential Dokki, Giza, Egypt Full Time 3 - 20 Yrs of Exp
42 https://wuzzuf.net/jobs/p/BZ6A5Pej3BJr-Front-End-Developer-Tam-Development-LLC-Riyadh-Saudi-Arabia?o=43&l=sp&t=sj&a=web|search-v3|hpb Front End Developer Tam Development LLC Riyadh, Saudi Arabia Full Time, Work From Home 3+ Yrs of Exp
43 https://wuzzuf.net/jobs/p/Ganw296LvwQS-Senior-Front-End-Developer-Dafater-Cairo-Egypt?o=44&l=sp&t=sj&a=web|search-v3|hpb Senior Front End Developer Dafater Nasr City, Cairo, Egypt Full Time, Work From Home 3+ Yrs of Exp
44 https://wuzzuf.net/jobs/p/CsHvaWU326dE-Senior-Odoo-Developer-aliaict-Cairo-Egypt?o=45&l=sp&t=sj&a=web|search-v3|hpb Senior Odoo Developer aliaict Nasr City, Cairo, Egypt Full Time 3 - 7 Yrs of Exp

beautiful soup find_all() not returning all elements

I am trying to scrape this website using bs4. Using inspect on a particular car ad tile, I figured out what I need to scrape in order to get the title and the link to the car's page.
I am making use of the find_all() function of the bs4 library, but the issue is that it's not scraping the required info for all of the cars. It returns info for only about 21, whereas it's clearly visible on the website that there are about 2410 cars.
The relevant code:
from bs4 import BeautifulSoup as bs
from urllib.request import Request, urlopen
import re
import requests
url = 'https://www.cardekho.com/used-cars+in+bangalore'
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = bs(webpage,"html.parser")
tags = page_soup.find_all("div","title")
print(len(tags))
How do I get info on all of the cars present on the page?
P.S. - Want to point out just one thing: all the cars aren't displayed at once. More car info gets loaded as you scroll down. Could it be because of that? Not sure.
Ok, I've written up a sample code to show you how it can be done. Although the site has a convenient api that we can leverage, the first page is not available through the api; it is embedded in a script tag in the html code, which requires additional processing to extract. After that it is simply a matter of getting the json data from the api, parsing it into python dictionaries and appending the car entries to a list. The link to the api can be found when inspecting network activity in Chrome or Firefox while scrolling the site.
from bs4 import BeautifulSoup
import re
import json
from subprocess import check_output
import requests
import time
from tqdm import tqdm #tqdm is just to implement a progress bar, https://pypi.org/project/tqdm/
cars = [] #create empty list to which we will append the car dicts from the json data
url = 'https://www.cardekho.com/used-cars+in+bangalore'
r = requests.get(url , headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.content.decode('utf-8'),"html.parser")
s = soup.find('script', {"type":"application/ld+json"}).next_sibling #find the section with the json data. It looks for a script tag with application/ld+json type, and takes the next tag, which is the one with the data we need, see page source code
js = 'window = {};\n'+s.text.strip()+';\nprocess.stdout.write(JSON.stringify(window.__INITIAL_STATE__));' #strip the text of unnecessary strings and prepare the json, taken from: https://stackoverflow.com/questions/54991571/extract-json-from-html-script-tag-with-beautifulsoup-in-python/54992015#54992015
with open('temp.js','w') as f: # save the string to a javascript file
    f.write(js)
data_site = json.loads(check_output(['node','temp.js'])) #execute the file with node, which will return the json data that will be loaded with json.loads.
for i in data_site['items']: #iterate over the dict and append all cars to the empty list 'cars'
    cars.append(i)
for page in tqdm(range(20, data_site['total_count'], 20)): #'pagefrom' in the api call is 20, 40, 60, etc., so create a range and loop over it
    r = requests.get(f"https://www.cardekho.com/api/v1/usedcar/search?&cityId=105&connectoid=&lang_code=en&regionId=0&searchstring=used-cars%2Bin%2Bbangalore&pagefrom={page}&sortby=updated_date&sortorder=asc&mink=0&maxk=200000&dealer_id=&regCityNames=&regStateNames=", headers={'User-Agent': 'Mozilla/5.0'})
    data = r.json()
    for i in data['data']['cars']: #iterate over the dict and append all cars to the empty list 'cars'
        cars.append(i)
    time.sleep(5) #wait a few seconds to avoid overloading the site
This will result in cars being a list of dictionaries. The car names can be found in the vid key, and the urls are present in the vlink key.
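For example, a quick sketch that pulls the names and full urls out of the first few entries (assuming the vid and vlink keys described above; vlink is a relative path, so the domain is prepended):
for car in cars[:5]:
    name = car.get('vid')  # car name, per the 'vid' key described above
    link = 'https://www.cardekho.com' + car.get('vlink', '')  # 'vlink' holds a relative url
    print(name, link)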
You can load it into a pandas dataframe to explore the data:
import pandas as pd
df = pd.DataFrame(cars)
df.head() will output (I omitted the images column for readability):
loc
myear
bt
ft
km
it
pi
pn
pu
dvn
ic
ucid
sid
ip
oem
model
vid
city
vlink
p_numeric
webp_image
position
pageNo
centralVariantId
isExpiredModel
modelId
isGenuine
is_ftc
seller_location
utype
views
tmGaadiStore
cls
0
Koramangala
2014
SUV
Diesel
30,000
0
https://images10.gaadicdn.com/usedcar_image/320x240/used_car_2206305_1614944913.jpg
9.9 Lakh
Mahindra XUV500 W6 2WD
13
3019084
9509A09F1673FE2566DF59EC54AAC05B
1
Mahindra
Mahindra XUV500
Mahindra XUV500 2011-2015 W6 2WD
Bangalore
/used-car-details/used-Mahindra-XUV500-2011-2015-W6-2WD-cars-Bangalore_9509A09F1673FE2566DF59EC54AAC05B.htm
990000
https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/used_car_2206305_1614944913.webp
1
1
3822
True
570
0
0
{'address': 'BDA Complex, 100 Feet Rd, 3rd Block, Koramangala 3 Block, Koramangala, Bengaluru, Karnataka 560034, Bangalore', 'lat': 12.931, 'lng': 77.6228}
Dealer
235
False
1
Marathahalli Colony
2017
SUV
Petrol
30,000
0
https://images10.gaadicdn.com/usedcar_image/320x240/used_car_2203506_1614754307.jpeg
7.85 Lakh
Ford Ecosport 1.5 Petrol Trend BSIV
14
3015331
2C0E4C4E543D4792C1C3186B361F718B
1
Ford
Ford Ecosport
Ford Ecosport 2015-2021 1.5 Petrol Trend BSIV
Bangalore
/used-car-details/used-Ford-Ecosport-2015-2021-1.5-Petrol-Trend-BSIV-cars-Bangalore_2C0E4C4E543D4792C1C3186B361F718B.htm
785000
https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/used_car_2203506_1614754307.webp
2
1
6086
True
175
0
0
{'address': '2, Varthur Rd, Ayyappa Layout, Chandra Layout, Marathahalli, Bengaluru, Karnataka 560037, Marathahalli Colony, Bangalore', 'lat': 12.956727624875453, 'lng': 77.70174980163576}
Dealer
495
False
2
Yelahanka
2020
SUV
Diesel
13,969
0
https://images10.gaadicdn.com/usedcar_image/320x240/usedcar_11_276591614316705_1614316747.jpg
41 Lakh
Toyota Fortuner 2.8 4WD AT
12
3007934
BBC13FB62DF6840097AA62DDEA05BB04
1
Toyota
Toyota Fortuner
Toyota Fortuner 2016-2021 2.8 4WD AT
Bangalore
/used-car-details/used-Toyota-Fortuner-2016-2021-2.8-4WD-AT-cars-Bangalore_BBC13FB62DF6840097AA62DDEA05BB04.htm
4100000
https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/usedcar_11_276591614316705_1614316747.webp
3
1
7618
True
364
0
0
{'address': 'Sonnappanahalli Kempegowda Intl Airport Road Jala Uttarahalli Hobli, Yelahanka, Bangalore, Karnataka 560064', 'lat': 13.1518821, 'lng': 77.6220694}
Dealer
516
False
3
Byatarayanapura
2017
Sedans
Diesel
18,000
0
https://images10.gaadicdn.com/usedcar_image/320x240/used_car_2202297_1615013237.jpg
35 Lakh
Mercedes-Benz E-Class E250 CDI Avantgarde
15
3013606
4553943A967049D873712AFFA5F65A56
1
Mercedes-Benz
Mercedes-Benz E-Class
Mercedes-Benz E-Class 2009-2012 E250 CDI Avantgarde
Bangalore
/used-car-details/used-Mercedes-Benz-E-Class-2009-2012-E250-CDI-Avantgarde-cars-Bangalore_4553943A967049D873712AFFA5F65A56.htm
3500000
https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/used_car_2202297_1615013237.webp
4
1
4611
True
674
0
0
{'address': 'NO 19, Near Traffic Signal, Byatanarayanapura, International Airport Road, Byatarayanapura, Bangalore, Karnataka 560085', 'lat': 13.0669588, 'lng': 77.5928756}
Dealer
414
False
4
nan
2015
Sedans
Diesel
80,000
0
https://stimg.cardekho.com/pwa/img/noimage.svg
12.5 Lakh
Skoda Octavia Elegance 2.0 TDI AT
1
3002709
156E5F2317C0A3A3BF8C03FFC35D404C
1
Skoda
Skoda Octavia
Skoda Octavia 2013-2017 Elegance 2.0 TDI AT
Bangalore
/used-car-details/used-Skoda-Octavia-2013-2017-Elegance-2.0-TDI-AT-cars-Bangalore_156E5F2317C0A3A3BF8C03FFC35D404C.htm
1250000
5
1
3092
True
947
0
0
{'lat': 0, 'lng': 0}
Individual
332
False
Or if you wish to explode the dict in seller_location to columns, you can load it with df = pd.json_normalize(cars).
You can save all data to a csv file: df.to_csv('output.csv')

Beautifulsoup doesn't return the whole html seen in inspect

I'm trying to parse the html of a live sport results website, but my code doesn't return every span tag there is on the site. I saw under inspect that all the matches are inside span tags, but my code can't seem to find anything from the website apart from the footer or header. Also tried with the divs; those didn't work either. I'm new to this and kinda lost. This is my code, could someone help me?
I've left the first part of the for loop in for clarity.
#Creating the urls for the different dates
my_url='https://www.livescore.com/en/football/{}'.format(d1)
print(my_url)
today=date.today()-timedelta(days=i)
d1 = today.strftime("%Y-%m-%d/")
#Opening up the connection and grabbing the html
uClient=uReq(my_url)
page_html=uClient.read()
uClient.close()
#HTML parser
page_soup=soup(page_html,"html.parser")
spans=page_soup.findAll("span")
matches=page_soup.findAll("div", {"class":"LiveRow-w0tngo-0 styled__Root-sc-2sc0sh-0 styled__FootballRoot-sc-2sc0sh-4 eAwOMF"})
print(spans)
The page is dynamic and rendered by JS. When you do a request, you are getting the static html response before it's rendered. There are a few things you could do to work with this situation:
1. Use something like Selenium, which simulates browser operations. It'll open a browser, go to the site and let the site render the page. Once the page is rendered, you can THEN get the html of that page, which will have the data. It'll work, but it takes longer to process since it literally simulates the steps you would perform manually. (A rough sketch of this option follows this list.)
2. Use the requests-HTML package, which also allows the page to be rendered (I have not tried this package before as it conflicts with my IDE, Spyder). This would be similar to Selenium, without the browser actually opening. It's essentially the requests package, but with javascript support.
3. See if the data (in the static html response) is embedded in the <script> tags in json format. Sometimes you'll find it there, but it takes a little work to pull it out and conform/manipulate it into a valid json format to be read in with json.loads().
4. Find out if there is an api of some sort (check XHR in the network activity) and fetch the data directly from there.
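For reference, here is a rough, untested sketch of option 1 (Selenium); the chromedriver path, the date URL, and the fixed sleep are assumptions you would need to adapt to your setup:
from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome("C:\Program Files (x86)\chromedriver.exe")  # assumed chromedriver path
driver.get("https://www.livescore.com/en/football/2021-02-16/")  # date URL built the same way as in the question
time.sleep(5)  # crude wait for the JS to render; an explicit wait would be more robust
page_soup = BeautifulSoup(driver.page_source, "html.parser")  # html after rendering
spans = page_soup.find_all("span")
driver.quit()
print(len(spans))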
The best option is always #4 if it's available. Why? Because the data will be consistently structured. Even if the website changes its structure or css (which would change the html you parse), the underlying data feeding into it will rarely change its structure. This site does have an api to access the data:
import requests
import datetime
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'}
dates_list = ['20210214', '20210215', '20210216']
for dateStr in dates_list:
    url = f'https://prod-public-api.livescore.com/v1/api/react/date/soccer/{dateStr}/0.00'
    dateStr_alpha = datetime.datetime.strptime(dateStr, '%Y%m%d').strftime('%B %d')
    response = requests.get(url, headers=headers).json()
    stages = response['Stages']
    for stage in stages:
        location = stage['Cnm']
        stageName = stage['Snm']
        events = stage['Events']
        print('\n\n%s - %s\t%s' %(location, stageName, dateStr_alpha))
        print('*'*50)
        for event in events:
            outcome = event['Eps']
            team1Name = event['T1'][0]['Nm']
            if 'Tr1' in event.keys():
                team1Goals = event['Tr1']
            else:
                team1Goals = '?'
            team2Name = event['T2'][0]['Nm']
            if 'Tr2' in event.keys():
                team2Goals = event['Tr2']
            else:
                team2Goals = '?'
            print('%s\t%s %s - %s %s' %(outcome, team1Name, team1Goals, team2Name, team2Goals))
Output:
England - Premier League February 15
********************************************************************************
FT West Ham United 3 - Sheffield United 0
FT Chelsea 2 - Newcastle United 0
Spain - LaLiga Santander February 15
********************************************************************************
FT Cadiz 0 - Athletic Bilbao 4
Germany - Bundesliga February 15
********************************************************************************
FT Bayern Munich 3 - Arminia Bielefeld 3
Italy - Serie A February 15
********************************************************************************
FT Hellas Verona 2 - Parma Calcio 1913 1
Portugal - Primeira Liga February 15
********************************************************************************
FT Sporting CP 2 - Pacos de Ferreira 0
Belgium - Jupiler League February 15
********************************************************************************
FT Gent 4 - Royal Excel Mouscron 0
Belgium - First Division B February 15
********************************************************************************
FT Westerlo 1 - Lommel 1
Turkey - Super Lig February 15
********************************************************************************
FT Genclerbirligi 0 - Besiktas 3
FT Antalyaspor 1 - Yeni Malatyaspor 1
Brazil - Serie A February 15
********************************************************************************
FT Gremio 1 - Sao Paulo 2
FT Ceara 1 - Fluminense 3
FT Sport Recife 0 - Bragantino 0
Italy - Serie B February 15
********************************************************************************
FT Cosenza 2 - Reggina 2
France - Ligue 2 February 15
********************************************************************************
FT Sochaux 2 - Valenciennes 0
FT Toulouse 3 - AC Ajaccio 0
Spain - LaLiga Smartbank February 15
********************************************************************************
FT Castellon 1 - Fuenlabrada 2
FT Real Oviedo 3 - Lugo 1
...
Uganda - Super League February 16
********************************************************************************
FT Busoga United FC 1 - Bright Stars FC 1
FT Kitara FC 0 - Mbarara City 1
FT Kyetume 2 - Vipers SC 2
FT UPDF FC 0 - Onduparaka FC 1
FT Uganda Police 2 - BUL FC 0
Uruguay - Primera División: Clausura February 16
********************************************************************************
FT Boston River 0 - Montevideo City Torque 3
International - Friendlies Women February 16
********************************************************************************
FT Guatemala 3 - Panama 1
Africa - Africa Cup Of Nations U20: Group C February 16
********************************************************************************
FT Ghana U20 4 - Tanzania U20 0
FT Gambia U20 0 - Morocco U20 1
Brazil - Amazonense: Group A February 16
********************************************************************************
Postp. Manaus FC ? - Penarol AC AM ?
Now assuming you have the correct class to scrape, a simple loop would work:
for i in soup.find_all("div", {"class":"LiveRow-w0tngo-0 styled__Root-sc-2sc0sh-0 styled__FootballRoot-sc-2sc0sh-4 eAwOMF"}):
    print(i)
Or add it into a list:
teams = []
for i in soup.find_all("div", {"class":"LiveRow-w0tngo-0 styled__Root-sc-2sc0sh-0 styled__FootballRoot-sc-2sc0sh-4 eAwOMF"}):
    teams.append(i.text)
print(teams)
If this does not work, run some tests to see if you are actually scraping the correct things, e.g. print a single element. Also, in your code I see that you are printing "spans" and not "matches"; this could also be a problem with your code.
You can also look at this post, which further explains how to do this.

Beautiful soup scraping with selenium

I'm learning how to scrape using Beautiful Soup with Selenium, and I found a website that has multiple tables and table tags (first time dealing with them). I'm trying to scrape the text from each table and append each element to its respective list. First I'm trying to scrape the first table, and the rest I want to do on my own, but I cannot access the tag for some reason.
I also incorporated selenium to access the sites, because when I copy the link to the site onto another tab, the list of tables disappears, for some reason.
My code so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from selenium import webdriver
from selenium.webdriver.support.ui import Select
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
targetSite = "https://www.sdvisualarts.net/sdvan_new/events.php"
driver.get(targetSite)
select_event = Select(driver.find_element_by_name('subs'))
select_event.select_by_value('All')
select_loc = Select(driver.find_element_by_name('loc'))
select_loc.select_by_value("All")
driver.find_element_by_name("submit").click()
targetSite = "https://www.sdvisualarts.net/sdvan_new/viewevents.php"
event_title = []
name = []
address = []
city = []
state = []
zipCode = []
location = []
webSite = []
fee = []
event_dates = []
opening_dates = []
description = []
try:
    page = requests.get(targetSite)
    soup = BeautifulSoup(page.text, 'html.parser')
    items = soup.find_all('table', {"class": "popdetail"})
    for i in items:
        event_title.append(item.find('b', {'class': "text"})).text.strip()
        name.append(item.find('td', {'class': "text"})).text.strip()
        address.append(item.find('td', {'class': "text"})).text.strip()
        city.append(item.find('td', {'class': "text"})).text.strip()
        state.append(item.find('td', {'class': "text"})).text.strip()
        zipCode.append(item.find('td', {'class': "text"})).text.strip()
Can someone let me know if I am doing this correctly? This is my first time dealing with a site whose elements disappear when its URL is copied onto a new tab and/or window.
So far, I am unable to append any information to any of the lists.
One issue is with the for loop.
You have for i in items:, but then you are calling item instead of i.
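A minimal sketch of just that fix (note it also moves .text.strip() inside the append call; in the original it is applied to append()'s return value, which is None):
for i in items:
    # find() can still return None if a tag is missing, so some extra guarding may be needed
    event_title.append(i.find('b', {'class': "text"}).text.strip())
    name.append(i.find('td', {'class': "text"}).text.strip())
    # ...and likewise for the remaining fields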
And secondly, if you are using Selenium to render the page, then you should probably use Selenium to get the html. They also have some tables embedded within tables, so it's not as straightforward as iterating through the <table> tags. What I ended up doing was having pandas read in the tables (which returns a list of dataframes), then iterating through those, as there is a pattern to how the dataframes are constructed.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
targetSite = "https://www.sdvisualarts.net/sdvan_new/events.php"
driver.get(targetSite)
select_event = Select(driver.find_element_by_name('subs'))
select_event.select_by_value('All')
select_loc = Select(driver.find_element_by_name('loc'))
select_loc.select_by_value("All")
driver.find_element_by_name("submit").click()
targetSite = "https://www.sdvisualarts.net/sdvan_new/viewevents.php"
event_title = []
name = []
address = []
city = []
state = []
zipCode = []
location = []
webSite = []
fee = []
event_dates = []
opening_dates = []
description = []
dfs = pd.read_html(driver.page_source)
driver.close()
for idx, table in enumerate(dfs):
    if table.iloc[0,0] == 'Event Title':
        event_title.append(table.iloc[-1,0])
        tempA = dfs[idx+1]
        tempA.index = tempA[0]
        tempB = dfs[idx+4]
        tempB.index = tempB[0]
        tempC = dfs[idx+5]
        tempC.index = tempC[0]
        name.append(tempA.loc['Name',1])
        address.append(tempA.loc['Address',1])
        city.append(tempA.loc['City',1])
        state.append(tempA.loc['State',1])
        zipCode.append(tempA.loc['Zip',1])
        location.append(tempA.loc['Location',1])
        webSite.append(tempA.loc['Web Site',1])
        fee.append(tempB.loc['Fee',1])
        event_dates.append(tempB.loc['Dates',1])
        opening_dates.append(tempB.loc['Opening Days',1])
        description.append(tempC.loc['Event Description',1])
df = pd.DataFrame({'event_title':event_title,
                   'name':name,
                   'address':address,
                   'city':city,
                   'state':state,
                   'zipCode':zipCode,
                   'location':location,
                   'webSite':webSite,
                   'fee':fee,
                   'event_dates':event_dates,
                   'opening_dates':opening_dates,
                   'description':description})
Output:
print (df.to_string())
event_title name address city state zipCode location webSite fee event_dates opening_dates description
0 The San Diego Museum of Art Welcomes a Special... San Diego Museum of Art 1450 El Prado, Balboa Park San Diego CA 92101 Central San Diego https://www.sdmart.org/ NaN Starts On 6-18-2020 Ends On 1-10-2021 Opens virtually on June 18. The work will beco... The San Diego Museum of Art is launching its f...
1 New Exhibit: Miller Dairy Remembered Lemon Grove Historical Society 3185 Olive Street, Treganza Heritage Park Lemon Grove CA 91945 Central San Diego http://www.lghistorical.org Children 12 and under free and must be accompa... Starts On 6-27-2020 Ends On 12-4-2020 Exhibit on view Saturdays 11 am to 2 pm; close... From 1926 there were cows smack in the midst o...
2 Gizmos and Shivelight Distinction Gallery 317 E. Grand Ave Escondido CA 92025 North County Inland http://www.distinctionart.com NaN Starts On 7-14-2020 Ends On 9-5-2020 08/08/20 - 09/05/20 Distinction Gallery is proud to present our so...
3 Virtual Opening - July Exhibitions Vision Art Museum 2825 Dewey Rd. Suite 100 San Diego CA 92106 Central San Diego http://www.visionsartmuseum.org Free Starts On 7-18-2020 Ends On 10-4-2020 NaN Join Visions Art Museum for a virtual exhibiti...
4 Laying it Bare: The Art of Walter Redondo and ... Fresh Paint Gallery 1020-B Prospect Street La Jolla CA 92037 Central San Diego http://freshpaintgallery.com/ NaN Starts On 8-1-2020 Ends On 9-27-2020 Tuesday through Sunday. Mondays closed. A two-person exhibit of new abstract expressio...
5 Online oil painting lessons with Concetta Antico NaN NaN NaN NaN NaN Virtual http://concettaantico.com/live-online-oil-pain... NaN Starts On 8-10-2020 Ends On 8-31-2020 NaN Anyone can learn to paint like the masters! Ov...
6 MOMENTUM: A Creative Industry Symposium Vanguard Culture Via Zoom San Diego California 92101 Virtual https://www.eventbrite.com/e/momentum-a-creati... $10 suggested donation Starts On 8-17-2020 Ends On 9-7-2020 NaN MOMENTUM: A Creative Industry Symposium Monday...
7 Virtual Locals Invitational Show Art & Frames of Coronado 936 ORANGE AVE Coronado CA 92118 0 https://www.artsteps.com/view/5eed0ad62cd0d65b... free Starts On 8-21-2020 Ends On 8-1-2021 NaN Art and Frames of Coronado invites you to our ...
8 HERE & Now R.B. Stevenson Gallery 7661 Girard Avenue, Suite 101 La Jolla California 92037 Central San Diego http://www.rbstevensongallery.com Free Starts On 8-22-2020 Ends On 9-25-2020 Tuesday through Saturday R.B.Stevenson Gallery is pleased to announce t...
9 Art Unites Learning: Normal 2.0 Art Unites NaN San Diego NaN 92116 Central San Diego https://www.facebook.com/events/956878098104971 Free Starts On 8-25-2020 Ends On 8-25-2020 NaN Please join us on Tuesday, August 25th as we: ...
10 Image Quest Sojourn; Visual Journaling for Per... Pamela Underwood Studios Virtual NaN NaN NaN Virtual http://www.pamelaunderwood.com/event/new-onlin... $595.00 Starts On 8-26-2020 Ends On 11-11-2020 NaN Create a personal Image Quest resource journal...
11 Behind The Exhibition: Southern California Con... Oceanside Museum of Art 704 Pier View Way Oceanside California 92054 Virtual https://oma-online.org/events/behind-the-exhib... No fee required. Donations recommended. Starts On 8-27-2020 Ends On 8-27-2020 NaN Join curator Beth Smith and exhibitions manage...
12 Lay it on Thick, a Virtual Art Exhibition San Diego Watercolor Society 2825 Dewey Rd Bldg #202 San Diego California 92106 0 https://www.sdws.org NaN Starts On 8-30-2020 Ends On 9-26-2020 NaN The San Diego Watercolor Society proudly prese...
13 The Forum: Marketing & Branding for Creatives Vanguard Culture Via Zoom San Diego CA 92101 South San Diego http://vanguardculture.com/ $5 suggested donation Starts On 9-1-2020 Ends On 9-1-2020 NaN Attention creative industry professionals! Joi...
14 Write or Die Solo Exhibition You Belong Here 3619 EL CAJON BLVD San Diego CA 92104 Central San Diego http://www.youbelongsd.com/upcoming-events/wri... $10 donation to benefit You Belong Here Starts On 9-4-2020 Ends On 9-6-2020 NaN Write or Die is an immersive installation and ...
15 SDVAN presents Art San Diego at Bread and Salt San Diego Visual Arts Network 1955 Julian Avenue San Digo CA 92113 Central San Diego http://www.sdvisualarts.net and https://www.br... Free Starts On 9-5-2020 Ends On 10-24-2020 NaN We are pleased to announce the four artist rec...
16 The Coming of Treganza Heritage Park Lemon Grove Historical Society 3185 Olive Street Lemon Grove CA 91945 Central San Diego http://www.lghistorical.org Free for all ages Starts On 9-10-2020 Ends On 9-10-2020 The park is open daily, 8 am to 8 pm. Covid 19... Lemon Grove\'s central city park will be renam...
17 Online oil painting course | 4 weeks NaN NaN NaN NaN NaN Virtual http://concettaantico.com/live-online-oil-pain... NaN Starts On 9-14-2020 Ends On 10-5-2020 NaN Over 4 weekly Zoom lessons, learn the techniqu...
18 Online oil painting course | 4 weeks NaN NaN NaN NaN NaN Virtual http://concettaantico.com/live-online-oil-pain... NaN Starts On 10-12-2020 Ends On 11-2-2020 NaN Over 4 weekly Zoom lessons, learn the techniqu...
19 36th Annual Mission Fed ArtWalk Mission Fed ArtWalk Ash Street San Diego California 92101 Central San Diego www.missionfedartwalk.org Free Starts On 11-7-2020 Ends On 11-8-2020 Sat and Sun Nov 7 and 8 Mission Fed ArtWalk returns to San Diego’s Lit...
20 Mingei Pop Up Workshop: My Daruma Doll New Childrens Museum 200 West Island Avenue San Diego California 92101 Central San Diego http://thinkplaycreate.org/ Free with admission Starts On 11-13-2020 Ends On 11-13-2020 NaN Join Mingei International Museum at The New Ch...

Pandas Dataframe

I want to represent the data using a pandas DataFrame, with the column name Product Title, and populate it.
For example:
Product Title
Marvel: Movies Collection
Marvel
Disney Movie, and so on...
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
r= requests.get("http://www.walmart.com/search/?query=marvel&cat_id=4096_530598")
r.content
soup = BeautifulSoup(r.content)
g_data = soup.find_all("div", {"class" : "tile-conent"})
g_price = soup.find_all("div",{"class" : "item-price-container"})
g_star = soup.find_all("div",{"class" : "stars stars-small tile-row"})
for product_title in g_data:
    a_product_title = product_title.find_all("a","js-product-title")
    for text_product_title in a_product_title:
        t = text_product_title.text
        print t
Desired Output-
Product Title :
Marvel Heroes: Collection
Marvel: Guardians Of The Galaxy (Widescreen)
Marvel Complete Giftset (Widescreen)
Marvel's The Avengers (Widescreen)
Marvel Knights: Wolverine Versus Sabretooth - Reborn (Widescreen)
Superheroes Collection: The Incredible Hulk Returns / The Trial Of The Incredible Hulk / How To Draw Comics The Marvel Way (Widescreen)
Marvel: Iron Man & Hulk - Heroes United (Widescreen)
Marvel's The Avengers (DVD + Blu-ray) (Widescreen)
Captain America: The Winter Soldier (Widescreen)
Iron Man 3 (DVD + Digital Copy) (Widescreen)
Thor: The Dark World (Widescreen)
Spider-Man (2-Disc) (Special Edition) (Widescreen)
Elektra / Fantastic Four / Daredevil (Director's Cut) / Fantastic Four 2: Rise Of The Silver Surfer
Spider-Man / Spider-Man 2 / Spider-Man 3 (Widescreen)
Spider-Man 2 (Widescreen)
The Punisher (Extended Cut) (Widescreen)
DC Showcase: Superman / Shazam!: The Return Of The Black Adam
Ultimate Avengers: The Movie (Widescreen)
The Next Avengers: Heroes Of Tomorrow (Widescreen)
Ultimate Avengers 1 & 2 (Blu-ray) (Widescreen)
I tried the append function and join but it didn't work. Is there any specific function for this in a pandas DataFrame?
The desired output should be the outcome of using a pandas DataFrame.
Well this will get you started, this extracts all the titles into a dict (I use a defaultdict for convenience):
In [163]:
from collections import defaultdict
data=defaultdict(list)
for product_title in g_data:
    a_product_title = product_title.find_all("a","js-product-title")
    for text_title in a_product_title:
        data['Product title'].append(text_title.text)
df = pd.DataFrame(data)
df
Out[163]:
Product title
0 Marvel Heroes: Collection
1 Marvel: Guardians Of The Galaxy (Widescreen)
2 Marvel Complete Giftset (Widescreen)
3 Marvel's The Avengers (Widescreen)
4 Marvel Knights: Wolverine Versus Sabretooth - ...
5 Superheroes Collection: The Incredible Hulk Re...
6 Marvel: Iron Man & Hulk - Heroes United (Wides...
7 Marvel's The Avengers (DVD + Blu-ray) (Widescr...
8 Captain America: The Winter Soldier (Widescreen)
9 Iron Man 3 (DVD + Digital Copy) (Widescreen)
10 Thor: The Dark World (Widescreen)
11 Spider-Man (2-Disc) (Special Edition) (Widescr...
12 Elektra / Fantastic Four / Daredevil (Director...
13 Spider-Man / Spider-Man 2 / Spider-Man 3 (Wide...
14 Spider-Man 2 (Widescreen)
15 The Punisher (Extended Cut) (Widescreen)
16 DC Showcase: Superman / Shazam!: The Return Of...
17 Ultimate Avengers: The Movie (Widescreen)
18 The Next Avengers: Heroes Of Tomorrow (Widescr...
19 Ultimate Avengers 1 & 2 (Blu-ray) (Widescreen)
So you can modify this script to add the price and stars as keys to the data dict and then construct the df from the resultant dict; this will be better than appending a row at a time.
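For instance, a hedged sketch of that idea using the g_price divs already collected in the question; it assumes each price container lines up one-to-one with a product tile, which is worth verifying before relying on it:
data = defaultdict(list)
for product_title, price in zip(g_data, g_price):  # assumes tiles and price containers align
    for text_title in product_title.find_all("a", "js-product-title"):
        data['Product title'].append(text_title.text)
        data['Price'].append(price.text.strip())
df = pd.DataFrame(data)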
