I am trying to scrape Craigslist using BeautifulSoup4. All the data displays properly EXCEPT the price. I can't seem to find the right tag to loop through the prices, so the same price is shown for each post.
import requests
from bs4 import BeautifulSoup

source = requests.get('https://washingtondc.craigslist.org/search/nva/sss?query=5%20hp%20boat%20motor&sort=rel').text
soup = BeautifulSoup(source, 'lxml')

for summary in soup.find_all('p', class_='result-info'):
    pricing = soup.find('span', class_='result-price')
    price = pricing
    title = summary.a.text
    url = summary.a['href']
    print(title + '\n' + price.text + '\n' + url + '\n')
(Screenshots in the original post: the relevant Craigslist HTML with the irrelevant markup commented out, the code in Sublime, and the terminal output, where the price is the same for every post. I want the price to vary per listing instead of repeating the same number.)
Thank you
Your script is almost correct. You just need to change the soup object to summary when finding the price, so each lookup is scoped to the current result instead of the whole page:
import requests
from bs4 import BeautifulSoup

source = requests.get('https://washingtondc.craigslist.org/search/nva/sss?query=5%20hp%20boat%20motor&sort=rel').text
soup = BeautifulSoup(source, 'lxml')

for summary in soup.find_all('p', class_='result-info'):
    price = summary.find('span', class_='result-price')
    title = summary.a.text
    url = summary.a['href']
    print(title + '\n' + price.text + '\n' + url + '\n')
Output:
Boat Water Tender - 10 Tri-Hull with Electric Trolling Motor
$629
https://washingtondc.craigslist.org/nva/boa/d/haymarket-boat-water-tender-10-tri-hull/7160572264.html
1987 Boston Whaler Montauk 17
$25450
https://washingtondc.craigslist.org/nva/boa/d/alexandria-1987-boston-whaler-montauk-17/7163033134.html
1971 Westerly Warwick Sailboat
$3900
https://washingtondc.craigslist.org/mld/boa/d/upper-marlboro-1971-westerly-warwick/7170495800.html
Buy or Rent. DC Party Pontoon for Dock Parties or Cruises
$15000
https://washingtondc.craigslist.org/doc/boa/d/washington-buy-or-rent-dc-party-pontoon/7157810378.html
West Marine Zodiac Inflatable Boat SB285 With 5HP Gamefisher (Merc)
$850
https://annapolis.craigslist.org/boa/d/annapolis-west-marine-zodiac-inflatable/7166031908.html
2012 AB aluminum/hypalon inflatable dinghy/2012 Yamaha 6hp four stroke
$3400
https://annapolis.craigslist.org/bpo/d/annapolis-2012-ab-aluminum-hypalon/7157768911.html
RHODES-18’ CENTERBOARD DAYSAILER
$6500
https://annapolis.craigslist.org/boa/d/ocean-view-rhodes-18-centerboard/7148322078.html
Mercury Outboard 7.5 HP
$250
https://baltimore.craigslist.org/bpo/d/middle-river-mercury-outboard-75-hp/7167399866.html
8 hp yamaha 2 stroke
$0
https://baltimore.craigslist.org/bpo/d/8-hp-yamaha-2-stroke/7154103281.html
TRADE 38' BENETEAU IDYLLE 1150
$35000
https://baltimore.craigslist.org/boa/d/middle-river-trade-38-beneteau-idylle/7163761741.html
5-hp Top Tank Mercury
$0
https://baltimore.craigslist.org/bpo/d/5-hp-top-tank-mercury/7154102434.html
5-hp Top Tank Mercury
$0
https://baltimore.craigslist.org/bpo/d/5-hp-top-tank-mercury/7154102744.html
Wanted ur unwanted outboards
$0
https://baltimore.craigslist.org/bpo/d/randallstown-wanted-ur-unwanted/7141349142.html
Grumman Sport Boat
$2250
https://baltimore.craigslist.org/boa/d/baldwin-grumman-sport-boat/7157186381.html
1996 Carver 355 Aft Cabin Motor Yacht
$47000
https://baltimore.craigslist.org/boa/d/middle-river-1996-carver-355-aft-cabin/7156830617.html
Lower unit, long shaft
$50
https://baltimore.craigslist.org/bpo/d/catonsville-lower-unit-long-shaft/7155566763.html
Lower unit, long shaft
$50
https://baltimore.craigslist.org/bpo/d/catonsville-lower-unit-long-shaft/7155565771.html
Lower unit, long shaft
$50
https://baltimore.craigslist.org/bpo/d/catonsville-lower-unit-long-shaft/7155566035.html
Lower unit, long shaft
$50
https://baltimore.craigslist.org/bpo/d/catonsville-lower-unit-long-shaft/7155565301.html
Cape Dory 25 Sailboat for sale or trade
$6500
https://baltimore.craigslist.org/boa/d/reedville-cape-dory-25-sailboat-for/7149227778.html
West Marine HP-V 350
$1200
https://baltimore.craigslist.org/boa/d/pasadena-west-marine-hp-350/7147285666.html
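One caveat worth noting: some posts carry no result-price span at all, in which case summary.find(...) returns None and calling .text on it raises an AttributeError. A minimal guard for that case (the 'N/A' placeholder is just an illustrative choice, not something Craigslist provides):

for summary in soup.find_all('p', class_='result-info'):
    pricing = summary.find('span', class_='result-price')
    # fall back to a placeholder when no price is listed
    price = pricing.text if pricing is not None else 'N/A'
    title = summary.a.text
    url = summary.a['href']
    print(title + '\n' + price + '\n' + url + '\n')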
I am trying to create the variables location, contract items, contract code, and federal aid using regex on the following text:
PAGE 1
BID OPENING DATE 07/25/18 FROM 0.2 MILES WEST OF ICE HOUSE 07/26/18 CONTRACT NUMBER 03-2F1304 ROAD TO 0.015 MILES WEST OF CONTRACT CODE 'A '
LOCATION 03-ED-50-39.5/48.7 DIVISION HIGHWAY ROAD 44 CONTRACT ITEMS
INSTALL SANDTRAPS AND PULLOUTS FEDERAL AID ACNH-P050-(146)E
PAGE 1
BID OPENING DATE 07/25/18 IN EL DORADO COUNTY AT VARIOUS 07/26/18 CONTRACT NUMBER 03-2H6804 LOCATIONS ALONG ROUTES 49 AND 193 CONTRACT CODE 'C ' LOCATION 03-ED-0999-VAR 13 CONTRACT ITEMS
TREE REMOVAL FEDERAL AID NONE
PAGE 1
BID OPENING DATE 07/25/18 IN LOS ANGELES, INGLEWOOD AND 07/26/18 CONTRACT NUMBER 07-296304 CULVER CITY, FROM I-105 TO PORT CONTRACT CODE 'B '
LOCATION 07-LA-405-R21.5/26.3 ROAD UNDERCROSSING 55 CONTRACT ITEMS
ROADWAY SAFETY IMPROVEMENT FEDERAL AID ACIM-405-3(056)E
This text is from one Word file; I'll be looping my code over multiple doc files. The text above contains three location / contract items / contract code / federal aid groups, but when I use regex to create variables, only the first instance of each is captured.
The code I have right now is:
# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword

all_bod = []
all_cn = []
all_location = []
all_fedaid = []
all_contractcode = []
all_contractitems = []
all_file = []

text = """ PAGE 1
BID OPENING DATE 07/25/18 FROM 0.2 MILES WEST OF ICE HOUSE 07/26/18 CONTRACT NUMBER 03-2F1304 ROAD TO 0.015 MILES WEST OF CONTRACT CODE 'A '
LOCATION 03-ED-50-39.5/48.7 DIVISION HIGHWAY ROAD 44 CONTRACT ITEMS
INSTALL SANDTRAPS AND PULLOUTS FEDERAL AID ACNH-P050-(146)E
PAGE 1
BID OPENING DATE 07/25/18 IN EL DORADO COUNTY AT VARIOUS 07/26/18 CONTRACT NUMBER 03-2H6804 LOCATIONS ALONG ROUTES 49 AND 193 CONTRACT CODE 'C ' LOCATION 03-ED-0999-VAR 13 CONTRACT ITEMS
TREE REMOVAL FEDERAL AID NONE
PAGE 1
BID OPENING DATE 07/25/18 IN LOS ANGELES, INGLEWOOD AND 07/26/18 CONTRACT NUMBER 07-296304 CULVER CITY, FROM I-105 TO PORT CONTRACT CODE 'B '
LOCATION 07-LA-405-R21.5/26.3 ROAD UNDERCROSSING 55 CONTRACT ITEMS
ROADWAY SAFETY IMPROVEMENT FEDERAL AID ACIM-405-3(056)E"""

# bid opening date
bod1 = re.search(r'BID OPENING DATE \s+ (\d+/\d+/\d+)', text)
bod2 = re.search(r'BID OPENING DATE\n\n(\d+/\d+/\d+)', text)
if bod1 is not None:
    bod = bod1.group(1)
elif bod2 is not None:
    bod = bod2.group(1)
else:
    bod = 'NA'
all_bod.append(bod)

# creating contract number
cn1 = re.search(r'CONTRACT NUMBER\n+(.*)', text)
cn2 = re.search(r'CONTRACT NUMBER\s+(.........)', text)
if cn1 is not None:
    cn = cn1.group(1)
elif cn2 is not None:
    cn = cn2.group(1)
else:
    cn = 'NA'
all_cn.append(cn)

# location
location1 = re.search(r'LOCATION \s+\S+', text)
location2 = re.search(r'LOCATION \n+\S+', text)
if location1 is not None:
    location = location1.group(0)
elif location2 is not None:
    location = location2.group(0)
else:
    location = 'NA'
all_location.append(location)

# federal aid
fedaid = re.search(r'FEDERAL AID\s+\S+', text)
fedaid = fedaid.group(0)
all_fedaid.append(fedaid)

# contract code
contractcode = re.search(r'CONTRACT CODE\s+\S+', text)
contractcode = contractcode.group(0)
all_contractcode.append(contractcode)

# contract items
contractitems = re.search(r'\d+ CONTRACT ITEMS', text)
contractitems = contractitems.group(0)
all_contractitems.append(contractitems)
This code parses only the first instance of each variable in the text:
contract-number  location            contract-items  contract-code  federal-aid
03-2F1304        03-ED-50-39.5/48.7  44              A              ACNH-P050-(146)E
But I am trying to figure out a way to capture all the instances as separate observations:
contract-number  location              contract-items  contract-code  federal-aid
03-2F1304        03-ED-50-39.5/48.7    44              A              ACNH-P050-(146)E
03-2H6804        03-ED-0999-VAR        13              C              NONE
07-296304        07-LA-405-R21.5/26.3  55              B              ACIM-405-3(056)E
The all_* variables in the code are for looping over multiple Word files - we can ignore that here if we want :).
Any leads would be super helpful. Thanks so much!
import re
import pandas as pd

data = []
df = pd.DataFrame()

regex_contract_number = r"(?:CONTRACT NUMBER\s+(?P<contract_number>\S+?)\s)"
regex_location = r"(?:LOCATION\s+(?P<location>\S+))"
regex_contract_items = r"(?:(?P<contract_items>\d+)\sCONTRACT ITEMS)"
regex_federal_aid = r"(?:FEDERAL AID\s+(?P<federal_aid>\S+?)\s)"
regex_contract_code = r"(?:CONTRACT CODE\s+\'(?P<contract_code>\S+?)\s)"

regexes = [regex_contract_number, regex_location, regex_contract_items, regex_federal_aid, regex_contract_code]

# one column block per field: collect every match, then concat side by side
for regex in regexes:
    for match in re.finditer(regex, text):
        data.append(match.groupdict())
    df = pd.concat([df, pd.DataFrame(data)], axis=1)
    data = []

df
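The key change from the question's code is re.finditer, which walks over every non-overlapping match in the text, while re.search stops at the first one. A small self-contained illustration of the difference:

import re

sample = "CONTRACT NUMBER 03-2F1304 ... CONTRACT NUMBER 03-2H6804 ... CONTRACT NUMBER 07-296304"

# re.search returns only the first match object
print(re.search(r"CONTRACT NUMBER\s+(\S+)", sample).group(1))
# 03-2F1304

# re.finditer yields one match object per occurrence
print([m.group(1) for m in re.finditer(r"CONTRACT NUMBER\s+(\S+)", sample)])
# ['03-2F1304', '03-2H6804', '07-296304']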
Getting the below exception for the line pst_hldr. Also find the error below:
File "/home/PycharmProjects/reditt/redit1.py", line 44, in get_links
    pst_hldr = wait.until(cond.visibility_of_element_located((By.XPATH, ".//*[@class='QBfRw7Rj8UkxybFpX-USO']")))
File "/usr/local/lib/python3.9/dist-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
code:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as cond

def get_links(*keywords):
    urllist = []
    keystring = ''
    for kw in keywords:
        keystring += kw + "%20"
    keystring = keystring.strip()
    driver = webdriver.Chrome("chromedriver")
    driver.get('https://www.reddit.com/search/?q=' + keystring)
    driver.maximize_window()
    wait = WebDriverWait(driver, 10)
    tbody = wait.until(cond.presence_of_element_located(
        (By.XPATH, "//*[@class='_3ozFtOe6WpJEMUtxDOIvtU']//*[@class='q4a8asWOWdfdniAbgNhMh']")))
    tb_bar = tbody.find_element_by_xpath("//*[@class='_3ozFtOe6WpJEMUtxDOIvtU']//*["
                                         "@class='q4a8asWOWdfdniAbgNhMh']//*[@class='M7VDHU4AdgCc6tHaZ-UUy']")
    driver.execute_script("arguments[0].click();", tb_bar)
    print("end of bar")
    k = 0
    for i in range(200):
        newht = i * 500
        driver.execute_script("window.scrollTo(0, " + str(newht) + ");")
        time.sleep(0.1)
    wait = WebDriverWait(driver, 10)
    pst_hldr = wait.until(cond.visibility_of_element_located((By.XPATH, ".//*[@class='QBfRw7Rj8UkxybFpX-USO']")))
    pst_tiles = pst_hldr.find_elements_by_xpath(
        ".//*[@class='_1poyrkZ7g36PawDueRza-J']//*[@class='_2XDITKxlj4y3M99thqyCsO']//*["
        "@class='_1Y6dfr4zLlrygH-FLmr8x-']")
    for tl in pst_tiles:
        ttl = tl.find_element_by_xpath(".//*[@class='y8HYJ-y_lTUHkQIc1mdCq _2INHSNB8V5eaWp4P0rY_mE']//a")
        href = ttl.get_attribute('href')
        print(href)
    driver.close()

get_links('america', 'coronavirus', 'cases')
Those class names look dynamic in nature, so don't rely on them. You can use the below css_selector instead:
div[data-testid='search-results-subnav']+div
In code,
pst_hldr = wait.until(cond.visibility_of_element_located((By.CSS_SELECTOR, "div[data-testid='search-results-subnav']+div")))
However, this exception

raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

is common when you use an explicit wait, which is what WebDriverWait is. So try

driver.find_element_by_css_selector("div[data-testid='search-results-subnav']+div")

and see what exact error you get.
Update 1 :
div[class*='ListingLayout-outerContainer'] div:nth-of-type(2) div:nth-of-type(3)
In code:

pst_hldr = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[class*='ListingLayout-outerContainer'] div:nth-of-type(2) div:nth-of-type(3)")))
pst_tiles = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[data-click-id='body'] span")))
for title in pst_tiles:
    print(title.text)
Update 2 :
driver.implicitly_wait(30)
driver.maximize_window()
driver.get("https://www.reddit.com/search/?q=america%20coronavirus%20cases%20")
wait = WebDriverWait(driver, 20)
pst_hldr = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[class*='ListingLayout-outerContainer'] div:nth-of-type(2) div:nth-of-type(3)")))
pst_tiles = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[data-click-id='body']")))
for title in pst_tiles:
    print(title.get_attribute('href'))
Output :
Timeline suggests a link between Alzheimers, Prion Disease & Covid-19
Vaccines
+2,089 New Cases = 1,243,932 Total Cases in PA; +16 New Deaths = 27,941 Total Deaths in PA Take me out of Latam 🗺✈ is that a good bet?
r/CoronavirusDownunder random daily discussion thread - 13 August,
2021 The New York parade reappears Asian-Americans gather downstairs
to protest Yan Limeng spread rumors about the origin of the virus
Florida becomes epicentre of America’s pandemic as coronavirus cases
surge 50 per cent. 1 in 5 coronavirus cases nationally is found in
Florida Pre-market brief
+1,811 New Cases = 1,241,843 Total Cases in PA; +11 New Deaths = 27,925 Total Deaths in PA Florida becomes epicentre of America’s
pandemic as coronavirus cases surge 50 per cent What A Day:
Reconcilable BIFerences by Sarah Lazarus & Crooked Media (08/11/21)
r/CoronavirusDownunder random daily discussion thread - 12 August,
2021 Hungarian nationalism is not the answer - Slow Boring More
evidence suggests COVID-19 was in US by Christmas 2019 3 Major Reasons
Why I am All in GME Pre-market brief Know anyone who is PRO MASK / VAX
/ LOCKDOWN?
+2,076 New Cases = 1,240,032 Total Cases in PA; +11 New Deaths = 27,914 Total Deaths in PA What A Day: Cuom Over by Sarah Lazarus &
Crooked Media (08/10/21) r/CoronavirusDownunder random daily
discussion thread - 11 August, 2021 Democrats and their accomplices in
the media are working overtime in Texas and Florida to create panic in
an effort to pressure governors to give them back the power to
reinstate mask mandates and lockdowns in their communities. More
evidence suggests COVID-19 was in US by Christmas 2019
+1,280 New Cases = 1,237,956 Total Cases in PA; +1 New Deaths = 27,903 Total Deaths in PA Even if we don't compare the value of animal lives
to human lives, the Holocaust comparison is still valid. AMC Talks
Bitcoin, GameStop With Its Reddit Followers -- 2nd Update
(https://markets.qtrade.ca/news/story?t=iKXJizM-dSY,RB0cWxp8F57oYAx8Odv4T-UoxUxTcyQA&article=651d9d865e21f8f0#651d9d865e21f8f0)
Process finished with exit code 0
This is expected behavior. You are using WebDriverWait, which waits up to a fixed time (which you specify) for an element to become visible. If the element is not found within that time, this exception is thrown. It is Selenium's way of telling you: "Hey, the element did not appear in the time you set."
Read more about this exception here: https://www.educative.io/edpresso/timeoutexception-in-selenium
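If the element is genuinely optional, a common pattern is to catch the exception and decide how to proceed; a minimal sketch, reusing the CSS selector suggested above and assuming the driver from the question's get_links():

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

wait = WebDriverWait(driver, 10)  # wait at most 10 seconds
try:
    pst_hldr = wait.until(EC.visibility_of_element_located(
        (By.CSS_SELECTOR, "div[data-testid='search-results-subnav']+div")))
except TimeoutException:
    # the element never became visible within 10s: log it and move on (or re-raise)
    print("post holder not found; the page layout may have changed")
    pst_hldr = None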
I have been working on this scraper for a while and I think it could be improved but I'm not sure where to go from here.
The initial scraper looks like this and I believe it does everything I need it to do:
import time
from selenium import webdriver

url = "https://matrix.heartlandmls.com/Matrix/Public/Portal.aspx?L=1&k=990316X949Z&p=DE-74613894-421"
h_table = []
driver = webdriver.Firefox()
driver.get(url)
driver.find_element_by_xpath("/html/body/form/div[3]/div/div/div[5]/div[3]/span[2]/div/div/div[2]/div[1]/div/div/div[2]/div[2]/div[1]/span/a").click()
time.sleep(10)
i = 200
while i > 0:
    h_table.append(driver.find_element_by_id("wrapperTable").text)
    driver.find_element_by_xpath("/html/body/form/div[3]/div/div/div[5]/div[2]/div/div[1]/div/div/span/ul/li[2]/a").click()
    time.sleep(10)
    i -= 1
This outputs everything into a table which I can clean up:
['210 Sitter Street\nPleasant Hill, MO 64080\nMLS#:2178982\nMap\n$15,000\nSold\n4Bedrms\n2Full Bath(s)\n0Half Bath(s)\n1,848Sqft\nBuilt in1950\n0.27Acres\nSingle Family\n1 / 10\nThis Home sits on a level, treed, and nice .279 acre sizeable double lot. The property per taxes, is identified as a Single Family Home however it has 2 separate utility meters and 2 living spaces, each with 2 bedrooms and 1 full bath and laundry areas, and was utilized as a Duplex for Rental income for 2 units. This property is a CASH ONLY sale and is being sold "In It\'s Present Condition". Home and detached garage are in need of repair OR would be a candidate for a tear down and complete rebuild on the lot.\nAbout 210 Sitter Street, Pleasant Hill, MO 64080\nDirections:I-70 to 7 Hwy, to Broadway, to Sitter St, to property.\nGeneral Description\nMLS Number\n2178982\nCounty\nCass\nCity\nPleasant Hill\nSub Div\nWalkers & Sitlers\nType\nSingle Family\nFloor Plan Description\nRanch\nBdrms\n4\nBaths Full\n2\nBaths Half\n0\nAge Description\n51-75 Years\nYear Built\n1950\nSqft Main\n1848\nSQFT MAIN SOURCE\nPublic Record\nBelow Grade Finished Sq Ft\n0\nBelow Grade Finished Sq Ft Source\nPublic Record\nSqft\n1848\nLot Size\n12,155\nAcres\n0.27\nSchools E\nPleasant Hill Prim\nSchools M\nPleasant Hill\nSchools H\nPleasant Hill\nSchool District\nPleasant Hill\nLegal Description\nWALKER & SITLERS LOT 47 & 48 BLK 5\nS Terms\nCash\nInterior Features\nFireplace?\nY\nFireplace Description\nLiving Room, Wood Burning\nBasement\nN\nBasement Description\nBlock, Crawl Space\nDining Area Description\nEat-In Kitchen\nUtility Room\nMultiple, Main Level\nInterior Features\nFixer Up\nRooms\nBathroom Full\nLevel 1\n2nd Full Bath\nLevel 1\nMaster Bedroom\nLevel 1\nSecond Bedroom\nLevel 1\nMaster BR- 2nd\nLevel 1\nFourth Bedroom\nLevel 1\nKitchen\nLevel 1\nKitchen- 2nd\nLevel 1\nLiving Room\nLevel 1\nFamily Rm- 2nd\nLevel 1\nExterior / Construction\nGarage/Parking?\nY\nGarage/Parking #\n2\nGarage Description\nDetached, Front Entry\nConstruction\nFrame\nArchitecture\nTraditional\nRoof\nComposition\nLot Description\nCity Limits, City Lot, Level, Treed\nIn Floodplain\nNo\nInside City Limits\nYes\nStreet Maintenance\nPub Maint, Paved\nExterior Features\nFixer Up\nUtility Information\nCentral Air\nY\nHeat\nForced Air Gas\nCool\nCentral Electric, Window Unit(s)\nWater\nCity/Public\nSewer\nCity/Public\nFinancial Information\nS Terms\nCash\nHoa Amount\n$0\nTax\n$1,066\nSpecial Tax\n$0\nTotal Tax\n$1,066\nExclusions\nEntire Property\nType Of Ownership\nPrivate\nWill Sell\nCash\nAssessment & Tax\nAssessment Year\n2019\n2018\n2017\nAssessed Value - Total\n$17,240\n$15,380\n$15,380\nAssessed Value - Land\n$2,400\n$1,920\n$1,920\nAssessed Value - Improved\n$14,840\n$13,460\n$13,460\nYOY Change ($)\n$1,860\n$\nYOY Change (%)\n12%\n0%\nTax Year\n2019\n2018\n2017\nTotal Tax\n$1,178.32\n$1,065.64\n$1,064.30\nYOY Change ($)\n$113\n$1\nYOY Change (%)\n11%\n0%\nNotes for you and your agent\nAdd Note\nMap data ©2020\nTerms of Use\nReport a map error\nMap\n200 ft \nParcel Disclaimer'
However, I had seen some other examples using WebDriverWait, and so far I have been unsuccessful with it, though I think it would greatly speed up the scraper. Here's the code I wrote:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://matrix.heartlandmls.com/Matrix/Public/Portal.aspx?L=1&k=990316X949Z&p=DE-74613894-421"
h_table = []
xpath = '/html/body/form/div[3]/div/div/div[5]/div[2]/div/div[1]/div/div/span/ul/li[2]/a'
driver = webdriver.Firefox()
driver.get(url)
driver.find_element_by_xpath("/html/body/form/div[3]/div/div/div[5]/div[3]/span[2]/div/div/div[2]/div[1]/div/div/div[2]/div[2]/div[1]/span/a").click()
time.sleep(10)
while True:
    button = driver.find_elements_by_xpath(xpath)  # the "next" link, absent on the last page
    if len(button) < 1:
        print('done')
        break
    else:
        h_table.append(driver.find_element_by_id("wrapperTable").text)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, xpath))).click()
This seems to give all the results, but it gives duplicates, and I couldn't stop it without a keyboard interrupt: calling len(h_table) gives 258, where it should be 200.
If the length of your list is the problem, why not use:

if len(h_table) >= 200:
    print("done")
    break
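For context, that check would sit at the top of the question's while loop, so the scrape stops itself once the expected 200 listings have been collected; a minimal sketch under that assumption, reusing driver and xpath from the question:

while True:
    # stop once the 200 expected listings have been collected
    if len(h_table) >= 200:
        print("done")
        break
    h_table.append(driver.find_element_by_id("wrapperTable").text)
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, xpath))).click()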
I'm trying to scrape Autotrader's website to get an excel of the stats and names.
I'm stuck at trying to loop through an html 'ul' element without any classes or IDs and organize that info in a Python list, to then append the individual li elements to different fields in my table.
As you can see I'm able to target the title and price elements, but the 'ul' is really tricky... Well... for someone at my skill level.
The specific code I'm struggling with:
for i in range(1, 2):
    response = get('https://www.autotrader.co.uk/car-search?sort=sponsored&seller-type=private&page=' + str(i))
    html_soup = BeautifulSoup(response.text, 'html.parser')
    ad_containers = html_soup.find_all('h2', class_='listing-title title-wrap')
    price_containers = html_soup.find_all('section', class_='price-column')
    for container in ad_containers:
        name = container.find('a', class_="js-click-handler listing-fpa-link").text
        names.append(name)
        # Trying to loop through the key specs list and assign each 'li' to a different field in the table
        lis = []
        list_container = container.find('ul', class_='listing-key-specs')
        for li in list_container.find('li'):
            lis.append(li)
        year.append(lis[0])
        body_type.append(lis[1])
        milage.append(lis[2])
        engine.append(lis[3])
        hp.append(lis[4])
        transmission.append(lis[5])
        petrol_type.append(lis[6])
        lis = []  # Clearing the list to get ready for the next set of data
And the error message I get was posted as a screenshot (not reproduced here).
Full code here:
from requests import get
from bs4 import BeautifulSoup
import pandas
# from time import sleep, time
# import random

# Create table fields
names = []
prices = []
year = []
body_type = []
milage = []
engine = []
hp = []
transmission = []
petrol_type = []

for i in range(1, 2):
    # Make a get request
    response = get('https://www.autotrader.co.uk/car-search?sort=sponsored&seller-type=private&page=' + str(i))
    # Pause the loop
    # sleep(random.randint(4, 7))
    # Create containers
    html_soup = BeautifulSoup(response.text, 'html.parser')
    ad_containers = html_soup.find_all('h2', class_='listing-title title-wrap')
    price_containers = html_soup.find_all('section', class_='price-column')
    for container in ad_containers:
        name = container.find('a', class_="js-click-handler listing-fpa-link").text
        names.append(name)
        # Trying to loop through the key specs list and assign each 'li' to a different field in the table
        lis = []
        list_container = container.find('ul', class_='listing-key-specs')
        for li in list_container.find('li'):
            lis.append(li)
        year.append(lis[0])
        body_type.append(lis[1])
        milage.append(lis[2])
        engine.append(lis[3])
        hp.append(lis[4])
        transmission.append(lis[5])
        petrol_type.append(lis[6])
        lis = []  # Clearing the list to get ready for the next set of data
    for pricteainers in price_containers:
        price = pricteainers.find('div', class_='vehicle-price').text
        prices.append(price)

test_df = pandas.DataFrame({'Title': names, 'Price': prices, 'Year': year, 'Body Type': body_type, 'Mileage': milage, 'Engine Size': engine, 'HP': hp, 'Transmission': transmission, 'Petrol Type': petrol_type})
print(test_df.info())
# test_df.to_csv('Autotrader_test.csv')
I followed the advice from David in the other answer's comment area.
Code:
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
pd.set_option('display.width', 1000)
pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
names = []
prices = []
year = []
body_type = []
milage = []
engine = []
hp = []
transmission = []
petrol_type = []
for i in range(1, 2):
response = get('https://www.autotrader.co.uk/car-search?sort=sponsored&seller-type=private&page=' + str(i))
html_soup = BeautifulSoup(response.text, 'html.parser')
outer = html_soup.find_all('article', class_='search-listing')
for inner in outer:
lis = []
names.append(inner.find_all('a', class_ ="js-click-handler listing-fpa-link")[1].text)
prices.append(inner.find('div', class_='vehicle-price').text)
for li in inner.find_all('ul', class_='listing-key-specs'):
for i in li.find_all('li')[-7:]:
lis.append(i.text)
year.append(lis[0])
body_type.append(lis[1])
milage.append(lis[2])
engine.append(lis[3])
hp.append(lis[4])
transmission.append(lis[5])
petrol_type.append(lis[6])
test_df = pd.DataFrame.from_dict({'Title': names, 'Price': prices, 'Year': year, 'Body Type': body_type, 'Mileage': milage, 'Engine Size': engine, 'HP': hp, 'Transmission': transmission, 'Petrol Type': petrol_type}, orient='index')
print(test_df.transpose())
Output:
Title Price Year Body Type Mileage Engine Size HP Transmission Petrol Type
0 Citroen C3 1.4 HDi Exclusive 5dr £500 2002 (52 reg) Hatchback 123,065 miles 1.4L 70bhp Manual Diesel
1 Volvo V40 1.6 XS 5dr £585 1999 (V reg) Estate 125,000 miles 1.6L 109bhp Manual Petrol
2 Toyota Yaris 1.3 VVT-i 16v GLS 3dr £700 2000 (W reg) Hatchback 94,000 miles 1.3L 85bhp Automatic Petrol
3 MG Zt-T 2.5 190 + 5dr £750 2002 (52 reg) Estate 95,000 miles 2.5L 188bhp Manual Petrol
4 Volkswagen Golf 1.9 SDI E 5dr £795 2001 (51 reg) Hatchback 153,000 miles 1.9L 68bhp Manual Diesel
5 Volkswagen Polo 1.9 SDI Twist 5dr £820 2005 (05 reg) Hatchback 106,116 miles 1.9L 64bhp Manual Diesel
6 Volkswagen Polo 1.4 S 3dr (a/c) £850 2002 (02 reg) Hatchback 125,640 miles 1.4L 75bhp Manual Petrol
7 KIA Picanto 1.1 LX 5dr £990 2005 (05 reg) Hatchback 109,000 miles 1.1L 64bhp Manual Petrol
8 Vauxhall Corsa 1.2 i 16v SXi 3dr £995 2004 (54 reg) Hatchback 81,114 miles 1.2L 74bhp Manual Petrol
9 Volkswagen Beetle 1.6 3dr £995 2003 (53 reg) Hatchback 128,000 miles 1.6L 102bhp Manual Petrol
The ul is not a child of the h2. It's a sibling.
So you will need to make a separate selection because it's not part of the ad_containers.
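A minimal sketch of what that separate selection could look like, using BeautifulSoup's sibling navigation (class names copied from the question, not re-verified against the live page):

for container in ad_containers:
    name = container.find('a', class_="js-click-handler listing-fpa-link").text
    # the key-specs ul sits next to the h2, not inside it, so walk to the sibling
    list_container = container.find_next_sibling('ul', class_='listing-key-specs')
    if list_container is not None:
        lis = [li.text for li in list_container.find_all('li')]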
URL: http://www.imdb.com/chart/?ref_=nv_ch_cht_2
I want to print the top box office list from the above site (all the movies' rank, title, weekend, gross and weeks, in that order).
Example output:
Rank: 1
Title: Godzilla
Weekend: $93.2M
Gross: $93.2M
Weeks: 1

Rank: 2
Title: Neighbours
This is just a simple way to extract those entities with BeautifulSoup:
from bs4 import BeautifulSoup
import urllib2

url = "http://www.imdb.com/chart/?ref_=nv_ch_cht_2"
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data, 'html.parser')

rows = page.findAll("tr", {'class': ['odd', 'even']})
for tr in rows:
    for data in tr.findAll("td", {'class': ['titleColumn', 'weeksColumn', 'ratingColumn']}):
        print data.get_text()
P.S. Arrange the output however you like.
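Note that the snippet above is Python 2 (urllib2 and the print statement). A rough Python 3 equivalent, assuming the same page structure, would be:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.imdb.com/chart/?ref_=nv_ch_cht_2"
page = BeautifulSoup(urlopen(url).read(), 'html.parser')

# same selection logic as above, with Python 3 print()
for tr in page.find_all("tr", {'class': ['odd', 'even']}):
    for td in tr.find_all("td", {'class': ['titleColumn', 'weeksColumn', 'ratingColumn']}):
        print(td.get_text())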
There is no need to scrape anything. See the answer I gave here.
How to scrape data from imdb business page?
The below Python script will give you: 1) the list of Top Box Office movies from IMDb, and 2) the list of cast members for each of them.
from lxml.html import parse

def imdb_bo(no_of_movies=5):
    bo_url = 'http://www.imdb.com/chart/'
    bo_page = parse(bo_url).getroot()
    bo_table = bo_page.cssselect('table.chart')
    bo_total = len(bo_table[0][2])
    if no_of_movies <= bo_total:
        count = no_of_movies
    else:
        count = bo_total
    movies = {}
    for i in range(0, count):
        mo = {}
        mo['url'] = 'http://www.imdb.com' + bo_page.cssselect('td.titleColumn')[i][0].get('href')
        mo['title'] = bo_page.cssselect('td.titleColumn')[i][0].text_content().strip()
        mo['year'] = bo_page.cssselect('td.titleColumn')[i][1].text_content().strip(" ()")
        mo['weekend'] = bo_page.cssselect('td.ratingColumn')[i*2].text_content().strip()
        mo['gross'] = bo_page.cssselect('td.ratingColumn')[(i*2)+1][0].text_content().strip()
        mo['weeks'] = bo_page.cssselect('td.weeksColumn')[i].text_content().strip()
        m_page = parse(mo['url']).getroot()
        m_casttable = m_page.cssselect('table.cast_list')
        flag = 0
        mo['cast'] = []
        for cast in m_casttable[0]:
            if flag == 0:
                flag = 1  # skip the header row of the cast table
            else:
                m_starname = cast[1][0][0].text_content().strip()
                mo['cast'].append(m_starname)
        movies[i] = mo
    return movies

if __name__ == '__main__':
    no_of_movies = raw_input("Enter no. of Box office movies to display:")
    bo_movies = imdb_bo(int(no_of_movies))
    for k, v in bo_movies.iteritems():
        print '#'+str(k+1)+' '+v['title']+' ('+v['year']+')'
        print 'URL: '+v['url']
        print 'Weekend: '+v['weekend']
        print 'Gross: '+v['gross']
        print 'Weeks: '+v['weeks']
        print 'Cast: '+', '.join(v['cast'])
        print '\n'
Output (run in terminal):
parag#parag-innovate:~/python$ python imdb_bo_scraper.py
Enter no. of Box office movies to display:3
#1 Cinderella (2015)
URL: http://www.imdb.com/title/tt1661199?ref_=cht_bo_1
Weekend: $67.88M
Gross: $67.88M
Weeks: 1
Cast: Cate Blanchett, Lily James, Richard Madden, Helena Bonham Carter, Nonso Anozie, Stellan Skarsgård, Sophie McShera, Holliday Grainger, Derek Jacobi, Ben Chaplin, Hayley Atwell, Rob Brydon, Jana Perez, Alex Macqueen, Tom Edden
#2 Run All Night (2015)
URL: http://www.imdb.com/title/tt2199571?ref_=cht_bo_2
Weekend: $11.01M
Gross: $11.01M
Weeks: 1
Cast: Liam Neeson, Ed Harris, Joel Kinnaman, Boyd Holbrook, Bruce McGill, Genesis Rodriguez, Vincent D'Onofrio, Lois Smith, Common, Beau Knapp, Patricia Kalember, Daniel Stewart Sherman, James Martinez, Radivoje Bukvic, Tony Naumovski
#3 Kingsman: The Secret Service (2014)
URL: http://www.imdb.com/title/tt2802144?ref_=cht_bo_3
Weekend: $6.21M
Gross: $107.39M
Weeks: 5
Cast: Adrian Quinton, Colin Firth, Mark Strong, Jonno Davies, Jack Davenport, Alex Nikolov, Samantha Womack, Mark Hamill, Velibor Topic, Sofia Boutella, Samuel L. Jackson, Michael Caine, Taron Egerton, Geoff Bell, Jordan Long