Web scraping through multiple pages doesn't save each result - BeautifulSoup - Python

My problem is that it loops through the pages, but it doesn't write anything into my list.
At the end I print len(title) and it is still 0.
from bs4 import BeautifulSoup
import requests

for page in range(20, 200, 20):
    current_page = 'https://auto.bazos.sk/{}/?hledat=kolesa&hlokalita=&humkreis=&cen'.format(page)
    web_req = requests.get(current_page).text
    soup = BeautifulSoup(requests.get(current_page).content, 'html.parser')
    title_data = soup.select('.nadpis')
    title = []
    for each_title in title_data:
        title.append(each_title.text)
    print(current_page)

print(len(title))

Move title = [] out of the loop and there you have it:
import requests
from bs4 import BeautifulSoup

title = []
for page in range(20, 40, 20):
    current_page = 'https://auto.bazos.sk/{}/?hledat=kolesa&hlokalita=&humkreis=&cen'.format(page)
    soup = BeautifulSoup(requests.get(current_page).content, 'html.parser')
    title_data = soup.select('.nadpis')
    for each_title in title_data:
        title.append(each_title.text)
    print(current_page)

print(title)
Output:
['ELEKTRONY SKODA OCTAVIA SCOUT DISKY “PROTEUS” R17', 'Fiat Sedici 1.6, 4x4, r.v 04/2009, 79 kw, slovenské ŠPZ', 'Bmw e46 328ci', '255/50 R19', 'Honda Jazz 1.3', 'Predám 4 ks kolesá', 'Audi A5 3.2 FSI quattro tiptronic S LINE R20 TOP STAV', 'Peugeot 407 combi 1,6 hdi', 'Škoda Superb 2.0TDI 4x4 od 260€ mesačne, bez akontácia', 'Predam elektrony Audi 5x112 R17 a letne pneu', 'ROZPREDÁM MAZDA 3 2.0i 110kW NA NÁHRADNÉ DIELY', 'Predám Astra j Turbo Noblesse bronz', 'ŠKODA KAROQ 1.6 TDI - full výbava', 'VW CHICAGO 5x112 + letné pneu 215/40 R18', 'Fiat 500 SPORT 1.3 multijet 70kw', 'Volvo FL280 - TROJSTRANNÝ SKLÁPAČ + HYDRAULICKÁ RUKA', 'ŠKODA SUPERB COMBI 2.0 TDI 190K 4X4 L&K DSG', 'FORD FOCUS 2.0 TDCI TITANIUM', 'FORD EDGE 2.0 TDCi - 154 kW VIGNALE : 27.000 km', 'R18 5x112 originalne Vw Seat Audi Skoda']
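Since the goal is to save each result, here is a minimal sketch that persists the collected titles to a CSV file once the loop finishes; the filename titles.csv is just an example:

import csv

import requests
from bs4 import BeautifulSoup

title = []
for page in range(20, 200, 20):
    current_page = 'https://auto.bazos.sk/{}/?hledat=kolesa&hlokalita=&humkreis=&cen'.format(page)
    soup = BeautifulSoup(requests.get(current_page).content, 'html.parser')
    # .nadpis is the listing-title class used in the question
    for each_title in soup.select('.nadpis'):
        title.append(each_title.text)

# One title per row; 'titles.csv' is an arbitrary example filename
with open('titles.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows([t] for t in title)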

Related

How to scrape all links of products using selenium python?

There is a webpage with 42 products. I would like to get the links of all 42 products so I can scrape each one individually, but when I try, I only get 16-20 of them.
I used two approaches:
1. I got the page source with Selenium, then scraped it with BeautifulSoup.
2. I used only the Selenium driver (css_selector, class_name) to get the links.
The link I need to scrape: https://thrivecausemetics.com/collections/all?page=4&sort=ss_days_since_published%253Dasc
My 1st approach:
import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
webpage = "https://thrivecausemetics.com/collections/all?page=4&sort=ss_days_since_published%253Dasc"
driver.get(webpage)
time.sleep(15)
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'lxml')
links = [link['href'] for link in soup.find("ul", class_="grid-list").find_all('a', class_='tile-images')]
print(links)
print(len(links))
My 2nd approach:
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
webpage = "https://thrivecausemetics.com/collections/all?page=4&sort=ss_days_since_published%253Dasc"
driver.get(webpage)
time.sleep(15)
ul_tag = driver.find_element(By.CSS_SELECTOR, "ul.grid-list")
print(ul_tag)
li_tags = ul_tag.find_elements(By.CSS_SELECTOR, "li.grid-item.is-visible")
# print(li_tags)
print(len(li_tags))
Neither approach gets all of the links; with the code above, I only pick up 16 product links.
Any help is appreciated.
That data is being pulled from an API endpoint by JavaScript once the page loads, so requests cannot see it. The way forward is to scrape the actual API endpoint (you can find it in Dev Tools, Network tab). Here is one way to obtain that data:
import requests
import pandas as pd
url = 'https://b7i79y.a.searchspring.io/api/search/search.json?resultsFormat=native&page=1&resultsPerPage=500&sort.ss_days_since_published=asc&siteId=b7i79y'
r = requests.get(url)
df = pd.json_normalize(r.json()['results'])
print(df)
This will display in terminal:
brand collection_id handle id imageUrl intellisuggestData intellisuggestSignature msrp name popularity price product_type_unigram rating ratingCount reviews_total_reviews sku ss_available ss_image_alt ss_inventory_count ss_name_type tags thumbnailImageUrl uid url variant_id variant_mfield_filter_color
0 Bigger Than Beauty Skincare [159254708314, 174020034650, 262184763482, 263320010842] pumpkin-spice-latte-liquid-balm-treatment bed045c1cec90548f830bfa4bc3e2e56 https://cdn.shopify.com/s/files/1/0582/2885/products/PSL_Component_1_medium.jpg?v=1662478574 eJxKMs80t6xkYGAICXM3NDZhMGQwZDBgMLdgSC_KTAEEAAD__1t7Bhw 5a3173ae3360eadabcc446e464c51a6269f0e28ab8d79b2be8b1da2b0f0201da 0 Pumpkin Spice Latte Liquid Balm Lip Treatment™ 10669 26 treatment 4.45424 295 295 TVG134 1 https://cdn.shopify.com/s/files/1/0582/2885/products/PSL_Swatch_New_medium.jpg?v=1662478574 20060 lip treatment [2261, 4522, 50, 800, Benefits:Hydrating, Benefits:Plumping, collection-badge::BACK IN STOCK!, collection::hide-variants, Face, Fill Size:< 1 fl oz, linked::liquid-balm-set, lip plumper, lip plumping, plump, plumper, plumping, recommendation::all-skincare, Skin Concern:Dull and Dry Skin, swatches::show, travel size, Vegan] https://cdn.shopify.com/s/files/1/0582/2885/products/PSL_Component_1_medium.jpg?v=1662478574 4742230212698 https://thrive-causemetics.myshopify.com/products/pumpkin-spice-latte-liquid-balm-treatment [32526428766298] NaN
1 Thrive Causemetics NaN dream-lash-duo 26b794e35fad33ba5496223db9f1bed4 https://cdn.shopify.com/s/files/1/0582/2885/products/Mascara_LashSerum_PDPSets_medium.jpg?v=1659461093 eJxKMs80t6xkYGAICXM3NDZhMGQwYjBgMLdgSC_KTAEEAAD__1uGBh0 12ef5b3a76c62cc8e9d2b0f6f2b2341a3903bbc584f3c347b96f6b9d67f38c05 0 Dream Lash Duo NaN 71 duo NaN NaN NaN NaN 1 https://cdn.shopify.com/s/files/1/0582/2885/products/Mascara_LashSerum_PDPSets_nocopy_medium.jpg?v=1659491015 274 dream lash duo [collection::hide-variants, YBlacklist] https://cdn.shopify.com/s/files/1/0582/2885/products/Mascara_LashSerum_PDPSets_medium.jpg?v=1659461093 6766529675354 https://thrive-causemetics.myshopify.com/products/dream-lash-duo [40035119235162, 40035119267930, 40035119300698] NaN
2 Thrive Causemetics NaN liquid-lash-extensions-lash-serum 096bf1756363b494a31863ae20803818 https://cdn.shopify.com/s/files/1/0582/2885/products/LashSerum_Component_medium.jpg?v=1659566057 eJxKMs80t6xkYGAICXM3MrNgMGQwZjBgMLdgSC_KTAEEAAD__1wOBiY 5fa69eead6ac5da701c5be908298ba006e9490183a18de4e398e7599d0a01eb6 0 Liquid Lash Extensions™ Lash Serum 21949 56 serum 4.075 40 40 TVG268 1 NaN 75132 lash serum [collection-badge::New!] https://cdn.shopify.com/s/files/1/0582/2885/products/LashSerum_Component_medium.jpg?v=1659566057 6729553772634 https://thrive-causemetics.myshopify.com/products/liquid-lash-extensions-lash-serum [39909600854106] NaN
3 Thrive Causemetics [267668095066] brilliant-face-highlighter-skin-perfecting-powder 9ff61df38853620f61d4c39e7363f5a2 https://cdn.shopify.com/s/files/1/0582/2885/products/Brilliant-Face-Highlighter_Component_ToQuyen_medium.jpg?v=1657292791 eJxKMs80t6xkYGAICXM3MjJnMGQwYTBgMLdgSC_KTAEEAAD__1vKBiI 7d651c91af12c272ce4478e268a2764530f5df50b7a4a78817eaeda1251cd85b 0 Brilliant Face Highlighter™ Skin Perfecting Powder 12525 34 highlighter 4.18182 66 66 TVG227 1 https://cdn.shopify.com/s/files/1/0582/2885/products/Brilliant-Face-Highlighter_Component_Shael_medium.jpg?v=1657292793 44920 highlighter [collection-badge::trending, Highlight, Highlighter, Highlighting] https://cdn.shopify.com/s/files/1/0582/2885/products/Brilliant-Face-Highlighter_Component_ToQuyen_medium.jpg?v=1657292791 6729555247194 https://thrive-causemetics.myshopify.com/products/brilliant-face-highlighter-skin-perfecting-powder [39909605703770, 39909605736538, 39909605769306] [gold]
4 Thrive Causemetics NaN brilliant-face-set dab6ca20bb4cf41740cacbbc37fb4f20 https://cdn.shopify.com/s/files/1/0582/2885/products/Highlighter_BEB_Primer_Set_PDP_medium.jpg?v=1657585503 eJxKMs80t6xkYGAICXM3MjJnMGQwZTBgMLdgSC_KTAEEAAD__1vVBiM c0e8eb4abe31f0bd31324450909de60b570dafb73a7e5f6f4c3cb49e93b7a9e4 0 Brilliant Face Set NaN 84 sets NaN NaN NaN NaN 1 https://cdn.shopify.com/s/files/1/0582/2885/products/Highlighter_BEB_Primer_Set_V2_medium.jpg?v=1657585503 3889 brilliant face sets [collection-badge::New!, collection::hide-variants, ST-unpublished] https://cdn.shopify.com/s/files/1/0582/2885/products/Highlighter_BEB_Primer_Set_PDP_medium.jpg?v=1657585503 6765261324378 https://thrive-causemetics.myshopify.com/products/brilliant-face-set [40031327682650, 40031327715418, 40031327748186, 40031327780954, 40031327813722, 40031327846490, 40031327879258, 40031327912026, 40031327944794, 40031327977562, 40031328010330, 40031328043098, 40031328075866, 40031328108634, 40031328141402, 40031328174170, 40031328206938, 40031328239706, 40031328272474, 40031328305242, 40031328338010, 40031328370778, 40031328403546, 40031328436314, 40031328469082, 40031328501850, 40031328534618, 40031328567386, 40031328600154, 40031328632922, 40031328665690, 40031328698458, 40031328731226, 40031328763994, 40031328796762, 40031328829530, 40031328862298, 40031328895066, 40031328927834] NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
73 Thrive Causemetics [27186779, 209907910, 237619142, 333714566, 399044812, 2738815001, 4013293593, 5575639065, 5575770137, 5575802905, 6686801945, 56789368922, 56833736794, 57405112410, 81244520538, 82036260954, 82769608794, 83615547482, 84961689690, 85599420506, 86598778970, 87577591898, 88078483546, 89189417050, 89505661018, 89797230682, 91221393498, 91755741274, 93260021850, 96613498970, 149781872730, 153671794778, 157076848730, 159474384986, 166734495834, 173874675802, 262216908890, 262273564762, 263767261274, 263970160730, 264544747610, 264579055706, 264579121242, 265485779034, 266068197466, 266346889306, 266381164634, 266457251930, 266889232474, 267015094362, 267954094170] triple-threat-color-stick 908b72d51839441d48d27fc251340e88 https://cdn.shopify.com/s/files/1/0582/2885/products/TCCS_Triple_Threat_Color_Stick_Isabella_V2_2db06b39-24da-4e68-8029-dab1f68a985e_medium.jpg?v=1601483873 eJxKMs80t6xkYGAICXM3NDVhMGQwN2EwYDC3YEgvykwBBAAA__9h-AZY 67aceb71cc8a5bfa95da136235f5c18504688f10daed921caa8efe7e75dcfa8d 0 Triple Threat™ Color Stick 35732 36 threat 4.42416 3171 3171 TVG154 1 https://cdn.shopify.com/s/files/1/0582/2885/products/TCCS_Triple_Threat_Color_Stick_Mieko_V2_e92bcbce-3708-4941-a23b-5122e9881820_medium.jpg?v=1601483873 164632 triple threat [Benefits:Hydrating, Benefits:Waterproof, Best Sellers, blush, body, collection-badge::Multi-Use!, Coverage:Buildable, Finish:Dewy, Finish:Shimmer, Formulation:Cream, intl::ca, Lips, Lipstick, recommendation::face, shade-finder::thumbnails, Triple Threat Color Stick, TVG285, TVG286, TVG287, Vegan, YCRF_cheeks] https://cdn.shopify.com/s/files/1/0582/2885/products/TCCS_Triple_Threat_Color_Stick_Isabella_V2_2db06b39-24da-4e68-8029-dab1f68a985e_medium.jpg?v=1601483873 5892103302 https://thrive-causemetics.myshopify.com/products/triple-threat-color-stick [32456620376154, 18635615622, 18635615558, 32456620310618, 32456620408922, 18635615430, 18635615686, 40078997586010, 40078998175834, 40078999191642] [pink, gold, purple, red, peach]
74 Thrive Causemetics [27186779, 209907910, 237619142, 343406086, 383763660, 2738815001, 4013293593, 6686801945, 6845464601, 57475530842, 81244487770, 81951588442, 82036260954, 82769608794, 83476283482, 83810091098, 86001451098, 86594289754, 86765207642, 88078483546, 91221393498, 93260021850, 93929963610, 94846025818, 149781872730, 151323705434, 157076848730, 159474384986, 162671919194, 166112591962, 263766736986, 264805384282, 266185965658, 267195973722] infinity-waterproof-brow-liner 05a0b2ec067e0d40becc91a6d7ff10a9 https://cdn.shopify.com/s/files/1/0582/2885/products/BrowLiner_Component_Christina_medium.jpg?v=1637091941 eJxKMs80t6xkYGAICXM3MLRgMGQwN2UwYDC3YEgvykwBBAAA__9h7QZY c25ffd3a46285a1f90da35783af2fb62d7900453d9fa7fbd7e288f8fd13b9f1d 0 Infinity Waterproof Eyebrow Liner™ 39291 23 liner 4.49396 2235 2235 TVG018 1 https://cdn.shopify.com/s/files/1/0582/2885/products/BrowLiner_Component_Audrey_medium.jpg?v=1637091946 209279 brow liner [Benefits:Waterproof, Coverage:Buildable, default_variant::2, Infinity Waterproof Brow Liner, Ingredients:Shea Butter, intl::ca, recommendation::eyes, shade-finder::thumbnails, Vegan, YCRF_eyes] https://cdn.shopify.com/s/files/1/0582/2885/products/BrowLiner_Component_Christina_medium.jpg?v=1637091941 781737155 https://thrive-causemetics.myshopify.com/products/infinity-waterproof-brow-liner [2199676227, 39591112081498, 2199676163, 35014122444, 39591112343642] [beige, red, brown, black, grey]
75 Thrive Causemetics [27186779, 91101891, 5576228889, 81244487770, 174020034650] gift-card 15c74aab8aa83d300f8c66cdca7c1cb1 https://cdn.shopify.com/s/files/1/0582/2885/products/egift-card_1__2_medium.png?v=1659654650 eJxKMs80t6xkYGAICXM3MLRgMGQwN2MwYDC3YEgvykwBBAAA__9h-AZZ 7ba842c650d93f11c8936dfaf70818667c3b3024b44fc6688ee25874d0ccf019 0 eGift Card NaN 25 card 5 11 11 NaN 1 https://cdn.shopify.com/s/files/1/0582/2885/products/Thrive_PDP_GiftCard_medium.jpg?v=1659654650 -16399 gift card [::hide-dropdown-swatch, collection::hide-variants, Gift Cards, image::no-swap, intl::ca, swag, YBlacklist] https://cdn.shopify.com/s/files/1/0582/2885/products/egift-card_1__2_medium.png?v=1659654650 337553443 https://thrive-causemetics.myshopify.com/products/gift-card [12622098246, 782092871, 12622102150, 782092875] NaN
76 Thrive Causemetics [27186779, 237619142, 343406086, 389141580, 81244520538, 91755741274, 153671794778, 157076848730, 159474384986] jackie 4f5bf8da32c6904051e29c44b49a4516 https://cdn.shopify.com/s/files/1/0582/2885/products/Jackie_Faux_Lashes_1_medium.jpg?v=1582596256 eJxKMs80t6xkYGAICXM3NDdiMGQwN2cwYDC3YEgvykwBBAAA__9iGwZb 3425b4b7cf441485cfbc7cc37da68ae40efbe65e842df16229f0c1c29b172b7d 0 Jackie Faux Lashes™ 150 26 lashes 4.85714 14 14 TVG172 1 https://cdn.shopify.com/s/files/1/0582/2885/products/PDP_lashes_jackie_1024x1024_1_medium.jpg?v=1582596246 827 faux lashes [Faux Lashes, recommendation::eyes, swatches::hide, Vegan, YCRF_eyes] https://cdn.shopify.com/s/files/1/0582/2885/products/Jackie_Faux_Lashes_1_medium.jpg?v=1582596256 334825111 https://thrive-causemetics.myshopify.com/products/jackie [775766255] NaN
77 Thrive Causemetics [27186779, 237619142, 343406086, 389141580, 81244520538, 91755741274, 157076848730, 159474384986] robin cfe488b97e5e61b13c3060260a920885 https://cdn.shopify.com/s/files/1/0582/2885/products/Robin_Faux_Lashes_medium.jpg?v=1582233291 eJxKMs80t6xkYGAICXM3NDdmMGQwt2AwABHpRZkpgAAAAP__YjYGXQ 3b387d1b5edb7855f9d83403ddac5c5559ddac6a8ea440cde30447897896cfa6 0 Robin Faux Lashes™ 130 26 lashes 4.9 10 10 TVG173 1 https://cdn.shopify.com/s/files/1/0582/2885/products/PDP_lashes_robin_1024x1024_7a28a8f4-b602-4049-9480-6eddb8e94944_medium.jpg?v=1582233282 2152 faux lashes [Faux Lashes, recommendation::eyes, swatches::hide, Vegan, YCRF_eyes] https://cdn.shopify.com/s/files/1/0582/2885/products/Robin_Faux_Lashes_medium.jpg?v=1582233291 334825555 https://thrive-causemetics.myshopify.com/products/robin [775768027] NaN
78 rows × 26 columns
The actual XHR request asks for only 12 products (and then keeps asking for more as you scroll the page). I went ahead and asked for 500 products (see the url), to make sure I get them all.
Requests documentation: https://requests.readthedocs.io/en/latest/
Also, the relevant pandas documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
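If you would rather not ask for 500 items in one go, you can page through the same endpoint instead; a sketch, assuming the endpoint keeps honoring the page and resultsPerPage parameters and returns an empty results list once you page past the end:

import requests
import pandas as pd

base = ('https://b7i79y.a.searchspring.io/api/search/search.json'
        '?resultsFormat=native&resultsPerPage=50&sort.ss_days_since_published=asc'
        '&siteId=b7i79y&page={}')

frames = []
page = 1
while True:
    results = requests.get(base.format(page)).json()['results']
    if not results:  # no more products
        break
    frames.append(pd.json_normalize(results))
    page += 1

df = pd.concat(frames, ignore_index=True)
print(len(df))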
EDIT: And here is a solution based on Selenium/chromedriver. The setup is for Linux/Chrome/chromedriver; you can adapt it to your own setup - just observe the imports, and the code after the browser/driver is defined:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import time as t

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")

webdriver_service = Service("chromedriver/chromedriver")  # path to where you saved the chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
actions = ActionChains(browser)
wait = WebDriverWait(browser, 20)

url = 'https://thrivecausemetics.com/collections/all?page=4&sort=ss_days_since_published%253Dasc'
browser.get(url)

try:
    WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.ID, "onetrust-reject-all-handler"))).click()
    print('declined cookies')
except Exception as e:
    print('no cookie button!')

t.sleep(2)

try:
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'div[class="dialog dialog-email"]'))).find_element(By.CSS_SELECTOR, 'div[class="icon close"]').click()
    print('dismissed 10% offer')
except Exception as e:
    print('no 10% offer, damn')

try:
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'div[class="dialog dialog-country"]'))).find_element(By.CSS_SELECTOR, 'div[class="icon close"]').click()
    print('dismissed country popup')
except Exception as e:
    print('no country popup')

products = [x.find_element(By.TAG_NAME, 'a') for x in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'li[class="grid-item is-visible"]'))) if len(x.text) > 3]
print('Total items:', len(products))
for p in products:
    print(p.get_attribute('href'))
    print('______________')
Result printed in terminal:
declined cookies
dismissed 10% offer
dismissed country popup
Total items: 42
https://thrivecausemetics.com/products/brilliant-eye-brightener
______________
https://thrivecausemetics.com/products/liquid-lash-extensions-mascara
______________
https://thrivecausemetics.com/products/waterproof-eyeliner
______________
https://thrivecausemetics.com/products/sheer-strength-hydrating-lip-tint
______________
https://thrivecausemetics.com/products/infinity-waterproof-eyeshadow-stick
______________
https://thrivecausemetics.com/products/triple-threat-color-stick
______________
https://thrivecausemetics.com/products/infinity-waterproof-brow-liner
[...]
For Selenium documentation, please visit https://www.selenium.dev/documentation/
Try this code:
ul_tag = driver.find_elements(By.CSS_SELECTOR, ".grid-list.text-.align- .grid-item.is-visible .tile-heading-lockup a")
print("Total products: ", len(ul_tag))
for product_link in ul_tag:
    print("Product link: ", product_link.get_attribute("href"))
Output:
Total products: 42
Product link: https://thrivecausemetics.com/products/brilliant-eye-brightener
Product link: https://thrivecausemetics.com/products/liquid-lash-extensions-mascara
Product link: https://thrivecausemetics.com/products/waterproof-eyeliner
Product link: https://thrivecausemetics.com/products/sheer-strength-hydrating-lip-tint
Product link: https://thrivecausemetics.com/products/infinity-waterproof-eyeshadow-stick
Product link: https://thrivecausemetics.com/products/triple-threat-color-stick
Product link: https://thrivecausemetics.com/products/infinity-waterproof-brow-liner
Product link: https://thrivecausemetics.com/products/instant-brow-fix-semi-permanent-eyebrow-gel
Product link: https://thrivecausemetics.com/products/liquid-lash-extensions-lash-serum
Product link: https://thrivecausemetics.com/products/buildable-blur-cc-cream-with-spf-35
and so on...
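If the count still comes up short with Selenium alone, it is usually because the grid lazy-loads items as you scroll. Here is a sketch of scrolling until the item count stops growing, written against the same li.grid-item markup (which may have changed since):

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://thrivecausemetics.com/collections/all?page=4&sort=ss_days_since_published%253Dasc")
time.sleep(5)

# Keep scrolling to the bottom until no new grid items appear
seen = 0
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the lazy loader time to fetch the next batch
    items = driver.find_elements(By.CSS_SELECTOR, "li.grid-item")
    if len(items) == seen:
        break
    seen = len(items)

print("Total items after scrolling:", seen)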

Beautiful Soup Craigslist Scraping Pricing is the same

I am trying to scrape Craigslist using BeautifulSoup4. All data shows properly EXCEPT the price: I can't seem to find the right tag to loop through the pricing, so it shows the same price for each post.
import requests
from bs4 import BeautifulSoup

source = requests.get('https://washingtondc.craigslist.org/search/nva/sss?query=5%20hp%20boat%20motor&sort=rel').text
soup = BeautifulSoup(source, 'lxml')

for summary in soup.find_all('p', class_='result-info'):
    pricing = soup.find('span', class_='result-price')
    price = pricing
    title = summary.a.text
    url = summary.a['href']
    print(title + '\n' + price.text + '\n' + url + '\n')
[Screenshots: left, the Craigslist HTML with the code I consider irrelevant commented out; right, the code in Sublime; plus a terminal snippet showing the same price for each post.]
I want the pricing to vary per post instead of repeating the same number.
Thank you
Your script is almost correct. You just need to search within summary instead of soup when looking up the price:
import requests
from bs4 import BeautifulSoup

source = requests.get('https://washingtondc.craigslist.org/search/nva/sss?query=5%20hp%20boat%20motor&sort=rel').text
soup = BeautifulSoup(source, 'lxml')

for summary in soup.find_all('p', class_='result-info'):
    price = summary.find('span', class_='result-price')
    title = summary.a.text
    url = summary.a['href']
    print(title + '\n' + price.text + '\n' + url + '\n')
Output:
Boat Water Tender - 10 Tri-Hull with Electric Trolling Motor
$629
https://washingtondc.craigslist.org/nva/boa/d/haymarket-boat-water-tender-10-tri-hull/7160572264.html
1987 Boston Whaler Montauk 17
$25450
https://washingtondc.craigslist.org/nva/boa/d/alexandria-1987-boston-whaler-montauk-17/7163033134.html
1971 Westerly Warwick Sailboat
$3900
https://washingtondc.craigslist.org/mld/boa/d/upper-marlboro-1971-westerly-warwick/7170495800.html
Buy or Rent. DC Party Pontoon for Dock Parties or Cruises
$15000
https://washingtondc.craigslist.org/doc/boa/d/washington-buy-or-rent-dc-party-pontoon/7157810378.html
West Marine Zodiac Inflatable Boat SB285 With 5HP Gamefisher (Merc)
$850
https://annapolis.craigslist.org/boa/d/annapolis-west-marine-zodiac-inflatable/7166031908.html
2012 AB aluminum/hypalon inflatable dinghy/2012 Yamaha 6hp four stroke
$3400
https://annapolis.craigslist.org/bpo/d/annapolis-2012-ab-aluminum-hypalon/7157768911.html
RHODES-18’ CENTERBOARD DAYSAILER
$6500
https://annapolis.craigslist.org/boa/d/ocean-view-rhodes-18-centerboard/7148322078.html
Mercury Outboard 7.5 HP
$250
https://baltimore.craigslist.org/bpo/d/middle-river-mercury-outboard-75-hp/7167399866.html
8 hp yamaha 2 stroke
$0
https://baltimore.craigslist.org/bpo/d/8-hp-yamaha-2-stroke/7154103281.html
TRADE 38' BENETEAU IDYLLE 1150
$35000
https://baltimore.craigslist.org/boa/d/middle-river-trade-38-beneteau-idylle/7163761741.html
5-hp Top Tank Mercury
$0
https://baltimore.craigslist.org/bpo/d/5-hp-top-tank-mercury/7154102434.html
5-hp Top Tank Mercury
$0
https://baltimore.craigslist.org/bpo/d/5-hp-top-tank-mercury/7154102744.html
Wanted ur unwanted outboards
$0
https://baltimore.craigslist.org/bpo/d/randallstown-wanted-ur-unwanted/7141349142.html
Grumman Sport Boat
$2250
https://baltimore.craigslist.org/boa/d/baldwin-grumman-sport-boat/7157186381.html
1996 Carver 355 Aft Cabin Motor Yacht
$47000
https://baltimore.craigslist.org/boa/d/middle-river-1996-carver-355-aft-cabin/7156830617.html
Lower unit, long shaft
$50
https://baltimore.craigslist.org/bpo/d/catonsville-lower-unit-long-shaft/7155566763.html
Lower unit, long shaft
$50
https://baltimore.craigslist.org/bpo/d/catonsville-lower-unit-long-shaft/7155565771.html
Lower unit, long shaft
$50
https://baltimore.craigslist.org/bpo/d/catonsville-lower-unit-long-shaft/7155566035.html
Lower unit, long shaft
$50
https://baltimore.craigslist.org/bpo/d/catonsville-lower-unit-long-shaft/7155565301.html
Cape Dory 25 Sailboat for sale or trade
$6500
https://baltimore.craigslist.org/boa/d/reedville-cape-dory-25-sailboat-for/7149227778.html
West Marine HP-V 350
$1200
https://baltimore.craigslist.org/boa/d/pasadena-west-marine-hp-350/7147285666.html
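One caveat: if a listing has no result-price span, summary.find() returns None and price.text raises an AttributeError. A defensive sketch of the same loop, with an arbitrary 'N/A' placeholder for unpriced listings:

import requests
from bs4 import BeautifulSoup

source = requests.get('https://washingtondc.craigslist.org/search/nva/sss?query=5%20hp%20boat%20motor&sort=rel').text
soup = BeautifulSoup(source, 'lxml')

for summary in soup.find_all('p', class_='result-info'):
    price = summary.find('span', class_='result-price')
    # Fall back to a placeholder when the listing has no price span
    price_text = price.text if price is not None else 'N/A'
    print(summary.a.text + '\n' + price_text + '\n' + summary.a['href'] + '\n')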

Is there any way to get the cookies and cache of a visited website from Chrome to BeautifulSoup in Python?

I want to scrape weather data from a certain website. The default page layout gives a maximum of 40 results, but when the layout is changed to a simple list it gives 100; the layout resets to default on every visit, which is difficult to handle with Selenium. Is there any way to get the cookies saved in Chrome so they can be used with Beautiful Soup?
import requests
from bs4 import BeautifulSoup
import browser_cookie3

cj = browser_cookie3.load()
s = requests.Session()
url = "https://something.org/titles/2"
i = 1
print(cj)

for c in cj:
    if 'mangadex' in str(c):
        s.cookies.set_cookie(c)

r = s.get(url)
soup = BeautifulSoup(r.content, 'lxml')

for anime in soup.find_all('div', {'class': 'manga-entry col-lg-6 border-bottom pl-0 my-1'}):
    det = anime.find('a', {"class": "ml-1 manga_title text-truncate"})
    anime_name = det.text
    anime_link = det['href']
    stars = anime.select("span")[3].text
    print(anime_name, anime_link, stars, i)
    i = i + 1
Try:
import browser_cookie3
import requests

cj = browser_cookie3.load()
s = requests.Session()
for c in cj:
    if 'sitename' in str(c):
        s.cookies.set_cookie(c)
r = s.get(the_site)
This code uses the browser's cookies in a requests Session. Simply change sitename to the site you want cookies from, and the_site to the URL you want to fetch.
Your new code:
import requests
from bs4 import BeautifulSoup
import browser_cookie3

cj = browser_cookie3.load()
s = requests.Session()
url = "https://something.org/titles/2"
i = 1
print(cj)

for c in cj:
    if 'mangadex' in str(c):
        s.cookies.set_cookie(c)

r = s.get(url)
soup = BeautifulSoup(r.content, 'lxml')

for anime in soup.find_all('div', {'class': 'manga-entry row m-0 border-bottom'}):
    det = anime.find('a', {"class": "ml-1 manga_title text-truncate"})
    anime_name = det.text
    anime_link = det['href']
    stars = anime.select("span")[3].text
    print(anime_name, anime_link, stars, i)
    i = i + 1
prints:
-Hitogatana- /title/540/hitogatana 4 1
-PIQUANT- /title/44134/piquant 5 2
-Rain- /title/37103/rain 4 3
-SINS- /title/1098/sins 4
:radical /title/46819/radical 1 5
:REverSAL /title/3877/reversal 3 6
... /title/52206/ 7
...Curtain. ~Sensei to Kiyoraka ni Dousei~ /title/7829/curtain-sensei-to-kiyoraka-ni-dousei 8
...Junai no Seinen /title/28947/junai-no-seinen 9
...no Onna /title/10162/no-onna 2 10
...Seishunchuu! /title/19186/seishunchuu 11
...Virgin Love /title/28945/virgin-love 12
.flow - Untitled (Doujinshi) /title/27292/flow-untitled-doujinshi 2 13
.gohan /title/50410/gohan 14
.hack//4koma + Gag Senshuken /title/7750/hack-4koma-gag-senshuken 24 15
.hack//Alcor - Hagun no Jokyoku /title/24375/hack-alcor-hagun-no-jokyoku 16
.hack//G.U.+ /title/7757/hack-g-u 1 17
.hack//GnU /title/7758/hack-gnu 18
.hack//Link - Tasogare no Kishidan /title/24374/hack-link-tasogare-no-kishidan 1 19
.hack//Tasogare no Udewa Densetsu /title/5817/hack-tasogare-no-udewa-densetsu 20
.hack//XXXX /title/7759/hack-xxxx 21
.traeH /title/9789/traeh 22
(G) Edition /title/886/g-edition 1 23
(Not) a Househusband /title/22832/not-a-househusband 6 24
(R)estauraNTR /title/37551/r-estaurantr 14 25
[ rain ] 1st Story /title/25587/rain-1st-story 3 26
[another] Xak /title/24881/another-xak 27
[es] ~Eternal Sisters~ /title/4879/es-eternal-sisters 1 28
and so on to 100...
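As an aside, browser_cookie3 also accepts a domain_name filter, which avoids iterating over the full cookie jar. A sketch, assuming Chrome is the browser; the domain here is just an example:

import browser_cookie3
import requests

# Load only Chrome cookies whose domain matches; 'mangadex.org' is an example
cj = browser_cookie3.chrome(domain_name='mangadex.org')

s = requests.Session()
s.cookies.update(cj)  # a requests Session can absorb any CookieJar
r = s.get('https://something.org/titles/2')
print(r.status_code)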

Beautifulsoup + Python HTML UL targeting, creating a list and appending to variables

I'm trying to scrape Autotrader's website to get an Excel file of the stats and names.
I'm stuck trying to loop through an HTML ul element that has no classes or IDs, organize that info in a Python list, and then append the individual li elements to different fields in my table.
As you can see, I'm able to target the title and price elements, but the ul is really tricky... well... for someone at my skill level.
The specific code I'm struggling with:
for i in range(1, 2):
    response = get('https://www.autotrader.co.uk/car-search?sort=sponsored&seller-type=private&page=' + str(i))
    html_soup = BeautifulSoup(response.text, 'html.parser')
    ad_containers = html_soup.find_all('h2', class_ = 'listing-title title-wrap')
    price_containers = html_soup.find_all('section', class_ = 'price-column')
    for container in ad_containers:
        name = container.find('a', class_ ="js-click-handler listing-fpa-link").text
        names.append(name)
        # Trying to loop through the key specs list and assign each 'li' to a different field in the table
        lis = []
        list_container = container.find('ul', class_='listing-key-specs')
        for li in list_container.find('li'):
            lis.append(li)
        year.append(lis[0])
        body_type.append(lis[1])
        milage.append(lis[2])
        engine.append(lis[3])
        hp.append(lis[4])
        transmission.append(lis[5])
        petrol_type.append(lis[6])
        lis = []  # Clearing the list to get ready for the next set of data
And the error message I get is the following: [screenshot of the error not included]
Full code here:
from requests import get
from bs4 import BeautifulSoup
import pandas
# from time import sleep, time
# import random

# Create table fields
names = []
prices = []
year = []
body_type = []
milage = []
engine = []
hp = []
transmission = []
petrol_type = []

for i in range(1, 2):
    # Make a get request
    response = get('https://www.autotrader.co.uk/car-search?sort=sponsored&seller-type=private&page=' + str(i))
    # Pause the loop
    # sleep(random.randint(4, 7))
    # Create containers
    html_soup = BeautifulSoup(response.text, 'html.parser')
    ad_containers = html_soup.find_all('h2', class_ = 'listing-title title-wrap')
    price_containers = html_soup.find_all('section', class_ = 'price-column')
    for container in ad_containers:
        name = container.find('a', class_ ="js-click-handler listing-fpa-link").text
        names.append(name)
        # Trying to loop through the key specs list and assign each 'li' to a different field in the table
        lis = []
        list_container = container.find('ul', class_='listing-key-specs')
        for li in list_container.find('li'):
            lis.append(li)
        year.append(lis[0])
        body_type.append(lis[1])
        milage.append(lis[2])
        engine.append(lis[3])
        hp.append(lis[4])
        transmission.append(lis[5])
        petrol_type.append(lis[6])
        lis = []  # Clearing the list to get ready for the next set of data
    for pricteainers in price_containers:
        price = pricteainers.find('div', class_ ='vehicle-price').text
        prices.append(price)

test_df = pandas.DataFrame({'Title': names, 'Price': prices, 'Year': year, 'Body Type': body_type, 'Mileage': milage, 'Engine Size': engine, 'HP': hp, 'Transmission': transmission, 'Petrol Type': petrol_type})
print(test_df.info())
# test_df.to_csv('Autotrader_test.csv')
I followed the advice from David in the other answer's comment area.
Code:
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

pd.set_option('display.width', 1000)
pd.set_option('display.height', 1000)  # note: 'display.height' was removed in later pandas versions; drop this line if it errors
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

names = []
prices = []
year = []
body_type = []
milage = []
engine = []
hp = []
transmission = []
petrol_type = []

for i in range(1, 2):
    response = get('https://www.autotrader.co.uk/car-search?sort=sponsored&seller-type=private&page=' + str(i))
    html_soup = BeautifulSoup(response.text, 'html.parser')
    outer = html_soup.find_all('article', class_='search-listing')
    for inner in outer:
        lis = []
        names.append(inner.find_all('a', class_ ="js-click-handler listing-fpa-link")[1].text)
        prices.append(inner.find('div', class_='vehicle-price').text)
        for li in inner.find_all('ul', class_='listing-key-specs'):
            for i in li.find_all('li')[-7:]:
                lis.append(i.text)
        year.append(lis[0])
        body_type.append(lis[1])
        milage.append(lis[2])
        engine.append(lis[3])
        hp.append(lis[4])
        transmission.append(lis[5])
        petrol_type.append(lis[6])

test_df = pd.DataFrame.from_dict({'Title': names, 'Price': prices, 'Year': year, 'Body Type': body_type, 'Mileage': milage, 'Engine Size': engine, 'HP': hp, 'Transmission': transmission, 'Petrol Type': petrol_type}, orient='index')
print(test_df.transpose())
Output:
Title Price Year Body Type Mileage Engine Size HP Transmission Petrol Type
0 Citroen C3 1.4 HDi Exclusive 5dr £500 2002 (52 reg) Hatchback 123,065 miles 1.4L 70bhp Manual Diesel
1 Volvo V40 1.6 XS 5dr £585 1999 (V reg) Estate 125,000 miles 1.6L 109bhp Manual Petrol
2 Toyota Yaris 1.3 VVT-i 16v GLS 3dr £700 2000 (W reg) Hatchback 94,000 miles 1.3L 85bhp Automatic Petrol
3 MG Zt-T 2.5 190 + 5dr £750 2002 (52 reg) Estate 95,000 miles 2.5L 188bhp Manual Petrol
4 Volkswagen Golf 1.9 SDI E 5dr £795 2001 (51 reg) Hatchback 153,000 miles 1.9L 68bhp Manual Diesel
5 Volkswagen Polo 1.9 SDI Twist 5dr £820 2005 (05 reg) Hatchback 106,116 miles 1.9L 64bhp Manual Diesel
6 Volkswagen Polo 1.4 S 3dr (a/c) £850 2002 (02 reg) Hatchback 125,640 miles 1.4L 75bhp Manual Petrol
7 KIA Picanto 1.1 LX 5dr £990 2005 (05 reg) Hatchback 109,000 miles 1.1L 64bhp Manual Petrol
8 Vauxhall Corsa 1.2 i 16v SXi 3dr £995 2004 (54 reg) Hatchback 81,114 miles 1.2L 74bhp Manual Petrol
9 Volkswagen Beetle 1.6 3dr £995 2003 (53 reg) Hatchback 128,000 miles 1.6L 102bhp Manual Petrol
The ul is not a child of the h2. It's a sibling.
So you will need to make a separate selection, because it's not part of the ad_containers.
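For example, here is a fragment meant to replace the inner loop in the question's code, stepping sideways from each h2 with find_next_sibling (a standard BeautifulSoup method; the class names come from the question and may have changed since):

for container in ad_containers:
    names.append(container.find('a', class_="js-click-handler listing-fpa-link").text)
    # The key-specs ul is a sibling of the h2, not a child, so search forward from the h2
    list_container = container.find_next_sibling('ul', class_='listing-key-specs')
    if list_container is not None:
        lis = [li.get_text(strip=True) for li in list_container.find_all('li')]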

BeautifulSoup - how to arrange data and write to txt?

New to Python, and I have a simple problem. I am pulling some data from Yahoo Fantasy Baseball into a text file, but my code doesn't work properly:
from bs4 import BeautifulSoup
import urllib2

teams = ("http://baseball.fantasysports.yahoo.com/b1/2282/players?status=A&pos=B&cut_type=33&stat1=S_S_2015&myteam=0&sort=AR&sdir=1")
page = urllib2.urlopen(teams)
soup = BeautifulSoup(page, "html.parser")
players = soup.findAll('div', {'class':'ysf-player-name Nowrap Grid-u Relative Lh-xs Ta-start'})
playersLines = [span.get_text('\t',strip=True) for span in players]

with open('output.txt', 'w') as f:
    for line in playersLines:
        line = playersLines[0]
        output = line.encode('utf-8')
        f.write(output)
The output file contains just one player, repeated 25 times. Any ideas how to get a result like this?
Pedro Álvarez Pit - 1B,3B
Kevin Pillar Tor - OF
Melky Cabrera CWS - OF
etc
Try removing:
line = playersLines[0]
Also, append a newline character to the end of your output to get them to write to separate lines in the output.txt file:
from bs4 import BeautifulSoup
import urllib2

teams = ("http://baseball.fantasysports.yahoo.com/b1/2282/players?status=A&pos=B&cut_type=33&stat1=S_S_2015&myteam=0&sort=AR&sdir=1")
page = urllib2.urlopen(teams)
soup = BeautifulSoup(page, "html.parser")
players = soup.findAll('div', {'class':'ysf-player-name Nowrap Grid-u Relative Lh-xs Ta-start'})
playersLines = [span.get_text('\t',strip=True) for span in players]

with open('output.txt', 'w') as f:
    for line in playersLines:
        output = line.encode('utf-8')
        f.write(output + '\n')
Results:
Pedro Álvarez Pit - 1B,3B
Kevin Pillar Tor - OF
Melky Cabrera CWS - OF
Ryan Howard Phi - 1B
Michael A. Taylor Was - OF
Joe Mauer Min - 1B
Maikel Franco Phi - 3B
Joc Pederson LAD - OF
Yangervis Solarte SD - 1B,2B,3B
César Hernández Phi - 2B,3B,SS
Eddie Rosario Min - 2B,OF
Austin Jackson Sea - OF
Danny Espinosa Was - 1B,2B,3B,SS
Danny Valencia Oak - 1B,3B,OF
Freddy Galvis Phi - 3B,SS
Jimmy Paredes Bal - 2B,3B
Colby Rasmus Hou - OF
Luis Valbuena Hou - 1B,2B,3B
Chris Young NYY - OF
Kevin Kiermaier TB - OF
Steven Souza TB - OF
Jace Peterson Atl - 2B,3B
Juan Lagares NYM - OF
A.J. Pierzynski Atl - C
Khris Davis Mil - OF
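Note that urllib2 is Python 2 only. A rough Python 3 equivalent of the same script, assuming the page is still reachable and the markup unchanged (requests stands in for urllib2):

import requests
from bs4 import BeautifulSoup

teams = ("http://baseball.fantasysports.yahoo.com/b1/2282/players?status=A&pos=B"
         "&cut_type=33&stat1=S_S_2015&myteam=0&sort=AR&sdir=1")
page = requests.get(teams).text
soup = BeautifulSoup(page, "html.parser")
players = soup.find_all('div', {'class': 'ysf-player-name Nowrap Grid-u Relative Lh-xs Ta-start'})
playersLines = [span.get_text('\t', strip=True) for span in players]

# Python 3 strings are unicode; open the file with an explicit encoding instead of encoding each line
with open('output.txt', 'w', encoding='utf-8') as f:
    for line in playersLines:
        f.write(line + '\n')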
