selenium.common.exceptions.TimeoutException: Message - python

Getting the below exception for the line pst_hldr. The full error:
File "/home/PycharmProjects/reditt/redit1.py", line 44, in get_links
pst_hldr = wait.until(cond.visibility_of_element_located((By.XPATH, ".//*[@class='QBfRw7Rj8UkxybFpX-USO']")))
File "/usr/local/lib/python3.9/dist-packages/selenium/webdriver/support/wait.py", line 80, in until
raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Code:
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as cond

def get_links(*keywords):
    urllist = []
    keystring = ''
    for kw in keywords:
        keystring += kw + "%20"
    keystring = keystring.strip()
    driver = webdriver.Chrome("chromedriver")
    driver.get('https://www.reddit.com/search/?q=' + keystring)
    driver.maximize_window()
    wait = WebDriverWait(driver, 10)
    tbody = wait.until(cond.presence_of_element_located(
        (By.XPATH, "//*[@class='_3ozFtOe6WpJEMUtxDOIvtU']//*[@class='q4a8asWOWdfdniAbgNhMh']")))
    tb_bar = tbody.find_element_by_xpath("//*[@class='_3ozFtOe6WpJEMUtxDOIvtU']//*["
                                         "@class='q4a8asWOWdfdniAbgNhMh']//*[@class='M7VDHU4AdgCc6tHaZ-UUy']")
    driver.execute_script("arguments[0].click();", tb_bar)
    print("end of bar")
    k = 0
    for i in range(200):
        newht = i * 500
        driver.execute_script("window.scrollTo(0, " + str(newht) + ");")
        time.sleep(0.1)
    wait = WebDriverWait(driver, 10)
    pst_hldr = wait.until(cond.visibility_of_element_located((By.XPATH, ".//*[@class='QBfRw7Rj8UkxybFpX-USO']")))
    pst_tiles = pst_hldr.find_elements_by_xpath(
        ".//*[@class='_1poyrkZ7g36PawDueRza-J']//*[@class='_2XDITKxlj4y3M99thqyCsO']//*["
        "@class='_1Y6dfr4zLlrygH-FLmr8x-']")
    for tl in pst_tiles:
        ttl = tl.find_element_by_xpath(".//*[@class='y8HYJ-y_lTUHkQIc1mdCq _2INHSNB8V5eaWp4P0rY_mE']//a")
        href = ttl.get_attribute('href')
        print(href)
    driver.close()

get_links('america', 'coronavirus', 'cases')

Those classes look dynamic in nature, so try the code below. You can use this CSS selector:
div[data-testid='search-results-subnav']+div
In code,
pst_hldr = wait.until(cond.visibility_of_element_located((By.CSS_SELECTOR, "div[data-testid='search-results-subnav']+div")))
However, this exception:
raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
is common when you use an explicit wait, i.e. WebDriverWait. So try
driver.find_element_by_css_selector("div[data-testid='search-results-subnav']+div")
and see what exact error you get.
Update 1:
div[class*='ListingLayout-outerContainer'] div:nth-of-type(2) div:nth-of-type(3)
In code:
pst_hldr = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[class*='ListingLayout-outerContainer'] div:nth-of-type(2) div:nth-of-type(3)")))
pst_tiles = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[data-click-id='body'] span")))
for title in pst_tiles:
    print(title.text)
Update 2:
driver.implicitly_wait(30)
driver.maximize_window()
driver.get("https://www.reddit.com/search/?q=america%20coronavirus%20cases%20")
wait = WebDriverWait(driver, 20)
pst_hldr = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[class*='ListingLayout-outerContainer'] div:nth-of-type(2) div:nth-of-type(3)")))
pst_tiles = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[data-click-id='body']")))
for title in pst_tiles:
    print(title.get_attribute('href'))
Output:
Timeline suggests a link between Alzheimers, Prion Disease & Covid-19
Vaccines
+2,089 New Cases = 1,243,932 Total Cases in PA; +16 New Deaths = 27,941 Total Deaths in PA Take me out of Latam 🗺✈ is that a good bet?
r/CoronavirusDownunder random daily discussion thread - 13 August,
2021 The New York parade reappears Asian-Americans gather downstairs
to protest Yan Limeng spread rumors about the origin of the virus
Florida becomes epicentre of America’s pandemic as coronavirus cases
surge 50 per cent. 1 in 5 coronavirus cases nationally is found in
Florida Pre-market brief
+1,811 New Cases = 1,241,843 Total Cases in PA; +11 New Deaths = 27,925 Total Deaths in PA Florida becomes epicentre of America’s
pandemic as coronavirus cases surge 50 per cent What A Day:
Reconcilable BIFerences by Sarah Lazarus & Crooked Media (08/11/21)
r/CoronavirusDownunder random daily discussion thread - 12 August,
2021 Hungarian nationalism is not the answer - Slow Boring More
evidence suggests COVID-19 was in US by Christmas 2019 3 Major Reasons
Why I am All in GME Pre-market brief Know anyone who is PRO MASK / VAX
/ LOCKDOWN?
+2,076 New Cases = 1,240,032 Total Cases in PA; +11 New Deaths = 27,914 Total Deaths in PA What A Day: Cuom Over by Sarah Lazarus &
Crooked Media (08/10/21) r/CoronavirusDownunder random daily
discussion thread - 11 August, 2021 Democrats and their accomplices in
the media are working overtime in Texas and Florida to create panic in
an effort to pressure governors to give them back the power to
reinstate mask mandates and lockdowns in their communities. More
evidence suggests COVID-19 was in US by Christmas 2019
+1,280 New Cases = 1,237,956 Total Cases in PA; +1 New Deaths = 27,903 Total Deaths in PA Even if we don't compare the value of animal lives
to human lives, the Holocaust comparison is still valid. AMC Talks
Bitcoin, GameStop With Its Reddit Followers -- 2nd Update
(https://markets.qtrade.ca/news/story?t=iKXJizM-dSY,RB0cWxp8F57oYAx8Odv4T-UoxUxTcyQA&article=651d9d865e21f8f0#651d9d865e21f8f0)
Process finished with exit code 0

This is expected behavior. You are using WebDriverWait, which waits up to the time you specify for an element to become visible. If the element is not found within that time, this exception is thrown. It is a way of telling you: "Hey, the element did not appear in the time you set."
Read more about this exception here: https://www.educative.io/edpresso/timeoutexception-in-selenium
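If the element may legitimately fail to appear, you can catch the exception rather than letting it crash the script. A minimal sketch (the selector is a placeholder, and driver is assumed to be an existing WebDriver instance):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

wait = WebDriverWait(driver, 10)  # polls up to 10 seconds
try:
    element = wait.until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, "div.some-container"))  # placeholder selector
    )
except TimeoutException:
    element = None  # the element never became visible within the timeout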

Related

Loop scrapes the same page 20 times instead of iterating through range

I'm trying to scrape IMDB for a list of the top 1000 movies and get some details about them. However, when I run it, instead of getting the first 50 movies and going to the next page for the next 50, it repeats the loop and writes the same 50 entries 20 times into my database.
import requests
import pandas as pd
from bs4 import BeautifulSoup as bsp

# Dataframe template
data = pd.DataFrame(columns=['ID','Title','Genre','Summary'])

#Get page data function
def getPageContent(start=1):
    start = 1
    url = 'https://www.imdb.com/search/title/?title_type=feature&year=1950-01-01,2019-12-31&sort=num_votes,desc&start='+str(start)
    r = requests.get(url)
    bs = bsp(r.text, "lxml")
    return bs

#Run for top 1000
for start in range(1,1001,50):
    getPageContent(start)
    movies = bs.findAll("div", "lister-item-content")
    for movie in movies:
        id = movie.find("span", "lister-item-index").contents[0]
        title = movie.find('a').contents[0]
        genres = movie.find('span', 'genre').contents[0]
        genres = [g.strip() for g in genres.split(',')]
        summary = movie.find("p", "text-muted").find_next_sibling("p").contents
        i = data.shape[0]
        data.loc[i] = [id,title,genres,summary]

#Clean data
# data.ID = [float(re.sub('.','',str(i))) for i in data.ID] #remove . from ID
data.head(51)
0 1. The Shawshank Redemption [Drama] [\nTwo imprisoned men bond over a number of ye...
1 2. The Dark Knight [Action, Crime, Drama] [\nWhen the menace known as the Joker wreaks h...
2 3. Inception [Action, Adventure, Sci-Fi] [\nA thief who steals corporate secrets throug...
3 4. Fight Club [Drama] [\nAn insomniac office worker and a devil-may-...
...
46 47. The Usual Suspects [Crime, Drama, Mystery] [\nA sole survivor tells of the twisty events ...
47 48. The Truman Show [Comedy, Drama] [\nAn insurance salesman discovers his whole l...
48 49. Avengers: Infinity War [Action, Adventure, Sci-Fi] [\nThe Avengers and their allies must be willi...
49 50. Iron Man [Action, Adventure, Sci-Fi] [\nAfter being held captive in an Afghan cave,...
50 1. The Shawshank Redemption [Drama] [\nTwo imprisoned men bond over a number of ye...
Delete the 'start' variable inside the 'getPageContent' function; it reassigns 'start = 1' on every call.
#Get page data function
def getPageContent(start=1):
    url = 'https://www.imdb.com/search/title/?title_type=feature&year=1950-01-01,2019-12-31&sort=num_votes,desc&start='+str(start)
    r = requests.get(url)
    bs = bsp(r.text, "lxml")
    return bs
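Note also that the calling loop never captures the function's return value, so the bs it parses is never updated. A small sketch of the corrected loop, assuming the function above:

for start in range(1, 1001, 50):
    bs = getPageContent(start)  # capture the returned soup
    movies = bs.findAll("div", "lister-item-content")
    # ...parse `movies` exactly as before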
I was not able to test this code. See inline comments for what I see as the main issue.
# Dataframe template
data = pd.DataFrame(columns=['ID', 'Title', 'Genre', 'Summary'])

# Get page data function
def getPageContent(start=1):
    start = 1
    url = 'https://www.imdb.com/search/title/?title_type=feature&year=1950-01-01,2019-12-31&sort=num_votes,desc&start=' + str(start)
    r = requests.get(url)
    bs = bsp(r.text, "lxml")
    return bs

# Run for top 1000
# for start in range(1, 1001, 50):  # 50 is a
#     step value so this gets every 50th movie
# Try 2 loops
start = 0
for group in range(0, 1001, 50):
    for item in range(group, group + 50):
        getPageContent(item)
        movies = bs.findAll("div", "lister-item-content")
        for movie in movies:
            id = movie.find("span", "lister-item-index").contents[0]
            title = movie.find('a').contents[0]
            genres = movie.find('span', 'genre').contents[0]
            genres = [g.strip() for g in genres.split(',')]
            summary = movie.find("p", "text-muted").find_next_sibling("p").contents
            i = data.shape[0]
            data.loc[i] = [id, title, genres, summary]

# Clean data
# data.ID = [float(re.sub('.','',str(i))) for i in data.ID] #remove . from ID
data.head(51)

How to scrape all links of products using selenium python?

There is a webpage with 42 products. I would like to get the links of all 42 products so I can scrape them individually, but I only get 16-20 of them.
I used two approaches:
I got the page source using Selenium, then scraped it with BeautifulSoup.
I used only the Selenium driver (css_selector, class_name) to get the links.
The link I need to scrape: https://thrivecausemetics.com/collections/all?page=4&sort=ss_days_since_published%253Dasc
My 1st approach:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
webpage = "https://thrivecausemetics.com/collections/all?page=4&sort=ss_days_since_published%253Dasc"
driver.get(webpage)
time.sleep(15)
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'lxml')
links = [link['href'] for link in soup.find("ul", class_="grid-list").find_all('a', class_='tile-images')]
print(links)
print(len(links))
My 2nd approach:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
webpage = "https://thrivecausemetics.com/collections/all?page=4&sort=ss_days_since_published%253Dasc"
driver.get(webpage)
time.sleep(15)
ul_tag = driver.find_element(By.CSS_SELECTOR, "ul.grid-list")
print(ul_tag)
li_tags = ul_tag.find_elements(By.CSS_SELECTOR, "li.grid-item.is-visible")
# print(li_tags)
print(len(li_tags))
Both approaches are supposed to get all the links, but using the above code I get only 16 product links.
Any help is appreciated.
That data is being pulled from an API endpoint by JavaScript once the page loads, so requests cannot see it in the page HTML. The way forward is to scrape the actual API endpoint (you can find it in Dev tools - Network tab). Here is one way to obtain that data:
import requests
import pandas as pd
url = 'https://b7i79y.a.searchspring.io/api/search/search.json?resultsFormat=native&page=1&resultsPerPage=500&sort.ss_days_since_published=asc&siteId=b7i79y'
r = requests.get(url)
df = pd.json_normalize(r.json()['results'])
print(df)
This will display in terminal:
brand collection_id handle id imageUrl intellisuggestData intellisuggestSignature msrp name popularity price product_type_unigram rating ratingCount reviews_total_reviews sku ss_available ss_image_alt ss_inventory_count ss_name_type tags thumbnailImageUrl uid url variant_id variant_mfield_filter_color
0 Bigger Than Beauty Skincare [159254708314, 174020034650, 262184763482, 263320010842] pumpkin-spice-latte-liquid-balm-treatment bed045c1cec90548f830bfa4bc3e2e56 https://cdn.shopify.com/s/files/1/0582/2885/products/PSL_Component_1_medium.jpg?v=1662478574 eJxKMs80t6xkYGAICXM3NDZhMGQwZDBgMLdgSC_KTAEEAAD__1t7Bhw 5a3173ae3360eadabcc446e464c51a6269f0e28ab8d79b2be8b1da2b0f0201da 0 Pumpkin Spice Latte Liquid Balm Lip Treatmentâ„¢ 10669 26 treatment 4.45424 295 295 TVG134 1 https://cdn.shopify.com/s/files/1/0582/2885/products/PSL_Swatch_New_medium.jpg?v=1662478574 20060 lip treatment [2261, 4522, 50, 800, Benefits:Hydrating, Benefits:Plumping, collection-badge::BACK IN STOCK!, collection::hide-variants, Face, Fill Size:< 1 fl oz, linked::liquid-balm-set, lip plumper, lip plumping, plump, plumper, plumping, recommendation::all-skincare, Skin Concern:Dull and Dry Skin, swatches::show, travel size, Vegan] https://cdn.shopify.com/s/files/1/0582/2885/products/PSL_Component_1_medium.jpg?v=1662478574 4742230212698 https://thrive-causemetics.myshopify.com/products/pumpkin-spice-latte-liquid-balm-treatment [32526428766298] NaN
1 Thrive Causemetics NaN dream-lash-duo 26b794e35fad33ba5496223db9f1bed4 https://cdn.shopify.com/s/files/1/0582/2885/products/Mascara_LashSerum_PDPSets_medium.jpg?v=1659461093 eJxKMs80t6xkYGAICXM3NDZhMGQwYjBgMLdgSC_KTAEEAAD__1uGBh0 12ef5b3a76c62cc8e9d2b0f6f2b2341a3903bbc584f3c347b96f6b9d67f38c05 0 Dream Lash Duo NaN 71 duo NaN NaN NaN NaN 1 https://cdn.shopify.com/s/files/1/0582/2885/products/Mascara_LashSerum_PDPSets_nocopy_medium.jpg?v=1659491015 274 dream lash duo [collection::hide-variants, YBlacklist] https://cdn.shopify.com/s/files/1/0582/2885/products/Mascara_LashSerum_PDPSets_medium.jpg?v=1659461093 6766529675354 https://thrive-causemetics.myshopify.com/products/dream-lash-duo [40035119235162, 40035119267930, 40035119300698] NaN
2 Thrive Causemetics NaN liquid-lash-extensions-lash-serum 096bf1756363b494a31863ae20803818 https://cdn.shopify.com/s/files/1/0582/2885/products/LashSerum_Component_medium.jpg?v=1659566057 eJxKMs80t6xkYGAICXM3MrNgMGQwZjBgMLdgSC_KTAEEAAD__1wOBiY 5fa69eead6ac5da701c5be908298ba006e9490183a18de4e398e7599d0a01eb6 0 Liquid Lash Extensionsâ„¢ Lash Serum 21949 56 serum 4.075 40 40 TVG268 1 NaN 75132 lash serum [collection-badge::New!] https://cdn.shopify.com/s/files/1/0582/2885/products/LashSerum_Component_medium.jpg?v=1659566057 6729553772634 https://thrive-causemetics.myshopify.com/products/liquid-lash-extensions-lash-serum [39909600854106] NaN
3 Thrive Causemetics [267668095066] brilliant-face-highlighter-skin-perfecting-powder 9ff61df38853620f61d4c39e7363f5a2 https://cdn.shopify.com/s/files/1/0582/2885/products/Brilliant-Face-Highlighter_Component_ToQuyen_medium.jpg?v=1657292791 eJxKMs80t6xkYGAICXM3MjJnMGQwYTBgMLdgSC_KTAEEAAD__1vKBiI 7d651c91af12c272ce4478e268a2764530f5df50b7a4a78817eaeda1251cd85b 0 Brilliant Face Highlighterâ„¢ Skin Perfecting Powder 12525 34 highlighter 4.18182 66 66 TVG227 1 https://cdn.shopify.com/s/files/1/0582/2885/products/Brilliant-Face-Highlighter_Component_Shael_medium.jpg?v=1657292793 44920 highlighter [collection-badge::trending, Highlight, Highlighter, Highlighting] https://cdn.shopify.com/s/files/1/0582/2885/products/Brilliant-Face-Highlighter_Component_ToQuyen_medium.jpg?v=1657292791 6729555247194 https://thrive-causemetics.myshopify.com/products/brilliant-face-highlighter-skin-perfecting-powder [39909605703770, 39909605736538, 39909605769306] [gold]
4 Thrive Causemetics NaN brilliant-face-set dab6ca20bb4cf41740cacbbc37fb4f20 https://cdn.shopify.com/s/files/1/0582/2885/products/Highlighter_BEB_Primer_Set_PDP_medium.jpg?v=1657585503 eJxKMs80t6xkYGAICXM3MjJnMGQwZTBgMLdgSC_KTAEEAAD__1vVBiM c0e8eb4abe31f0bd31324450909de60b570dafb73a7e5f6f4c3cb49e93b7a9e4 0 Brilliant Face Set NaN 84 sets NaN NaN NaN NaN 1 https://cdn.shopify.com/s/files/1/0582/2885/products/Highlighter_BEB_Primer_Set_V2_medium.jpg?v=1657585503 3889 brilliant face sets [collection-badge::New!, collection::hide-variants, ST-unpublished] https://cdn.shopify.com/s/files/1/0582/2885/products/Highlighter_BEB_Primer_Set_PDP_medium.jpg?v=1657585503 6765261324378 https://thrive-causemetics.myshopify.com/products/brilliant-face-set [40031327682650, 40031327715418, 40031327748186, 40031327780954, 40031327813722, 40031327846490, 40031327879258, 40031327912026, 40031327944794, 40031327977562, 40031328010330, 40031328043098, 40031328075866, 40031328108634, 40031328141402, 40031328174170, 40031328206938, 40031328239706, 40031328272474, 40031328305242, 40031328338010, 40031328370778, 40031328403546, 40031328436314, 40031328469082, 40031328501850, 40031328534618, 40031328567386, 40031328600154, 40031328632922, 40031328665690, 40031328698458, 40031328731226, 40031328763994, 40031328796762, 40031328829530, 40031328862298, 40031328895066, 40031328927834] NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
73 Thrive Causemetics [27186779, 209907910, 237619142, 333714566, 399044812, 2738815001, 4013293593, 5575639065, 5575770137, 5575802905, 6686801945, 56789368922, 56833736794, 57405112410, 81244520538, 82036260954, 82769608794, 83615547482, 84961689690, 85599420506, 86598778970, 87577591898, 88078483546, 89189417050, 89505661018, 89797230682, 91221393498, 91755741274, 93260021850, 96613498970, 149781872730, 153671794778, 157076848730, 159474384986, 166734495834, 173874675802, 262216908890, 262273564762, 263767261274, 263970160730, 264544747610, 264579055706, 264579121242, 265485779034, 266068197466, 266346889306, 266381164634, 266457251930, 266889232474, 267015094362, 267954094170] triple-threat-color-stick 908b72d51839441d48d27fc251340e88 https://cdn.shopify.com/s/files/1/0582/2885/products/TCCS_Triple_Threat_Color_Stick_Isabella_V2_2db06b39-24da-4e68-8029-dab1f68a985e_medium.jpg?v=1601483873 eJxKMs80t6xkYGAICXM3NDVhMGQwN2EwYDC3YEgvykwBBAAA__9h-AZY 67aceb71cc8a5bfa95da136235f5c18504688f10daed921caa8efe7e75dcfa8d 0 Triple Threatâ„¢ Color Stick 35732 36 threat 4.42416 3171 3171 TVG154 1 https://cdn.shopify.com/s/files/1/0582/2885/products/TCCS_Triple_Threat_Color_Stick_Mieko_V2_e92bcbce-3708-4941-a23b-5122e9881820_medium.jpg?v=1601483873 164632 triple threat [Benefits:Hydrating, Benefits:Waterproof, Best Sellers, blush, body, collection-badge::Multi-Use!, Coverage:Buildable, Finish:Dewy, Finish:Shimmer, Formulation:Cream, intl::ca, Lips, Lipstick, recommendation::face, shade-finder::thumbnails, Triple Threat Color Stick, TVG285, TVG286, TVG287, Vegan, YCRF_cheeks] https://cdn.shopify.com/s/files/1/0582/2885/products/TCCS_Triple_Threat_Color_Stick_Isabella_V2_2db06b39-24da-4e68-8029-dab1f68a985e_medium.jpg?v=1601483873 5892103302 https://thrive-causemetics.myshopify.com/products/triple-threat-color-stick [32456620376154, 18635615622, 18635615558, 32456620310618, 32456620408922, 18635615430, 18635615686, 40078997586010, 40078998175834, 40078999191642] [pink, gold, purple, red, peach]
74 Thrive Causemetics [27186779, 209907910, 237619142, 343406086, 383763660, 2738815001, 4013293593, 6686801945, 6845464601, 57475530842, 81244487770, 81951588442, 82036260954, 82769608794, 83476283482, 83810091098, 86001451098, 86594289754, 86765207642, 88078483546, 91221393498, 93260021850, 93929963610, 94846025818, 149781872730, 151323705434, 157076848730, 159474384986, 162671919194, 166112591962, 263766736986, 264805384282, 266185965658, 267195973722] infinity-waterproof-brow-liner 05a0b2ec067e0d40becc91a6d7ff10a9 https://cdn.shopify.com/s/files/1/0582/2885/products/BrowLiner_Component_Christina_medium.jpg?v=1637091941 eJxKMs80t6xkYGAICXM3MLRgMGQwN2UwYDC3YEgvykwBBAAA__9h7QZY c25ffd3a46285a1f90da35783af2fb62d7900453d9fa7fbd7e288f8fd13b9f1d 0 Infinity Waterproof Eyebrow Linerâ„¢ 39291 23 liner 4.49396 2235 2235 TVG018 1 https://cdn.shopify.com/s/files/1/0582/2885/products/BrowLiner_Component_Audrey_medium.jpg?v=1637091946 209279 brow liner [Benefits:Waterproof, Coverage:Buildable, default_variant::2, Infinity Waterproof Brow Liner, Ingredients:Shea Butter, intl::ca, recommendation::eyes, shade-finder::thumbnails, Vegan, YCRF_eyes] https://cdn.shopify.com/s/files/1/0582/2885/products/BrowLiner_Component_Christina_medium.jpg?v=1637091941 781737155 https://thrive-causemetics.myshopify.com/products/infinity-waterproof-brow-liner [2199676227, 39591112081498, 2199676163, 35014122444, 39591112343642] [beige, red, brown, black, grey]
75 Thrive Causemetics [27186779, 91101891, 5576228889, 81244487770, 174020034650] gift-card 15c74aab8aa83d300f8c66cdca7c1cb1 https://cdn.shopify.com/s/files/1/0582/2885/products/egift-card_1__2_medium.png?v=1659654650 eJxKMs80t6xkYGAICXM3MLRgMGQwN2MwYDC3YEgvykwBBAAA__9h-AZZ 7ba842c650d93f11c8936dfaf70818667c3b3024b44fc6688ee25874d0ccf019 0 eGift Card NaN 25 card 5 11 11 NaN 1 https://cdn.shopify.com/s/files/1/0582/2885/products/Thrive_PDP_GiftCard_medium.jpg?v=1659654650 -16399 gift card [::hide-dropdown-swatch, collection::hide-variants, Gift Cards, image::no-swap, intl::ca, swag, YBlacklist] https://cdn.shopify.com/s/files/1/0582/2885/products/egift-card_1__2_medium.png?v=1659654650 337553443 https://thrive-causemetics.myshopify.com/products/gift-card [12622098246, 782092871, 12622102150, 782092875] NaN
76 Thrive Causemetics [27186779, 237619142, 343406086, 389141580, 81244520538, 91755741274, 153671794778, 157076848730, 159474384986] jackie 4f5bf8da32c6904051e29c44b49a4516 https://cdn.shopify.com/s/files/1/0582/2885/products/Jackie_Faux_Lashes_1_medium.jpg?v=1582596256 eJxKMs80t6xkYGAICXM3NDdiMGQwN2cwYDC3YEgvykwBBAAA__9iGwZb 3425b4b7cf441485cfbc7cc37da68ae40efbe65e842df16229f0c1c29b172b7d 0 Jackie Faux Lashesâ„¢ 150 26 lashes 4.85714 14 14 TVG172 1 https://cdn.shopify.com/s/files/1/0582/2885/products/PDP_lashes_jackie_1024x1024_1_medium.jpg?v=1582596246 827 faux lashes [Faux Lashes, recommendation::eyes, swatches::hide, Vegan, YCRF_eyes] https://cdn.shopify.com/s/files/1/0582/2885/products/Jackie_Faux_Lashes_1_medium.jpg?v=1582596256 334825111 https://thrive-causemetics.myshopify.com/products/jackie [775766255] NaN
77 Thrive Causemetics [27186779, 237619142, 343406086, 389141580, 81244520538, 91755741274, 157076848730, 159474384986] robin cfe488b97e5e61b13c3060260a920885 https://cdn.shopify.com/s/files/1/0582/2885/products/Robin_Faux_Lashes_medium.jpg?v=1582233291 eJxKMs80t6xkYGAICXM3NDdmMGQwt2AwABHpRZkpgAAAAP__YjYGXQ 3b387d1b5edb7855f9d83403ddac5c5559ddac6a8ea440cde30447897896cfa6 0 Robin Faux Lashesâ„¢ 130 26 lashes 4.9 10 10 TVG173 1 https://cdn.shopify.com/s/files/1/0582/2885/products/PDP_lashes_robin_1024x1024_7a28a8f4-b602-4049-9480-6eddb8e94944_medium.jpg?v=1582233282 2152 faux lashes [Faux Lashes, recommendation::eyes, swatches::hide, Vegan, YCRF_eyes] https://cdn.shopify.com/s/files/1/0582/2885/products/Robin_Faux_Lashes_medium.jpg?v=1582233291 334825555 https://thrive-causemetics.myshopify.com/products/robin [775768027] NaN
78 rows × 26 columns
The actual XHR request is asking only for 12 products (and then continues to ask for more products, as you scroll the page). I went ahead and asked for 500 products (see url), to make sure I get them all.
Requests documentation: https://requests.readthedocs.io/en/latest/
Also, pandas relevant documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
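Since the goal is the product links, note that the normalized frame already has a url column (visible in the output above), so pulling the links out is one extra line. A small sketch, assuming the df built above:

links = df['url'].tolist()  # the 'url' column holds one product link per row
print(len(links))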
EDIT: And here is a solution based off Selenium/chromedriver. Setup is for linux/chrome/chromedriver, you can adapt to your own setup - just observe the imports, and the code after defining the browser/driver:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import time as t

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")

webdriver_service = Service("chromedriver/chromedriver")  ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
actions = ActionChains(browser)
wait = WebDriverWait(browser, 20)

url = 'https://thrivecausemetics.com/collections/all?page=4&sort=ss_days_since_published%253Dasc'
browser.get(url)

try:
    WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.ID, "onetrust-reject-all-handler"))).click()
    print('declined cookies')
except Exception as e:
    print('no cookie button!')
t.sleep(2)
try:
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'div[class="dialog dialog-email"]'))).find_element(By.CSS_SELECTOR, 'div[class="icon close"]').click()
    print('dismissed 10% offer')
except Exception as e:
    print('no 10% offer, damn')
try:
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'div[class="dialog dialog-country"]'))).find_element(By.CSS_SELECTOR, 'div[class="icon close"]').click()
    print('dismissed country popup')
except Exception as e:
    print('no country popup')

products = [x.find_element(By.TAG_NAME, 'a') for x in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'li[class="grid-item is-visible"]'))) if len(x.text) > 3]
print('Total items:', len(products))
for p in products:
    print(p.get_attribute('href'))
    print('______________')
Result printed in terminal:
declined cookies
dismissed 10% offer
dismissed country popup
Total items: 42
https://thrivecausemetics.com/products/brilliant-eye-brightener
______________
https://thrivecausemetics.com/products/liquid-lash-extensions-mascara
______________
https://thrivecausemetics.com/products/waterproof-eyeliner
______________
https://thrivecausemetics.com/products/sheer-strength-hydrating-lip-tint
______________
https://thrivecausemetics.com/products/infinity-waterproof-eyeshadow-stick
______________
https://thrivecausemetics.com/products/triple-threat-color-stick
______________
https://thrivecausemetics.com/products/infinity-waterproof-brow-liner
[...]
For Selenium documentation, please visit https://www.selenium.dev/documentation/
Try this code:
ul_tag = driver.find_elements(By.CSS_SELECTOR, ".grid-list.text-.align- .grid-item.is-visible .tile-heading-lockup a")
print("Total products: ", len(ul_tag))
for product_link in ul_tag:
    print("Product link: ", product_link.get_attribute("href"))
Output:
Total products: 42
Product link: https://thrivecausemetics.com/products/brilliant-eye-brightener
Product link: https://thrivecausemetics.com/products/liquid-lash-extensions-mascara
Product link: https://thrivecausemetics.com/products/waterproof-eyeliner
Product link: https://thrivecausemetics.com/products/sheer-strength-hydrating-lip-tint
Product link: https://thrivecausemetics.com/products/infinity-waterproof-eyeshadow-stick
Product link: https://thrivecausemetics.com/products/triple-threat-color-stick
Product link: https://thrivecausemetics.com/products/infinity-waterproof-brow-liner
Product link: https://thrivecausemetics.com/products/instant-brow-fix-semi-permanent-eyebrow-gel
Product link: https://thrivecausemetics.com/products/liquid-lash-extensions-lash-serum
Product link: https://thrivecausemetics.com/products/buildable-blur-cc-cream-with-spf-35
and so on...

BeautifulSoup trying to get text from wrapped divs but empty or "none" is being returned

Here is a picture (sorry) of the HTML that I am trying to parse:
I am using this line:
home_stats = soup.select_one('div', class_='statText:nth-child(1)').text
Thinking that I'd get the 1st child of the class statText and the outcome would be 53%.
But it's not. I get "Loading..." and none of the data that I was trying to use and display.
The full code I have so far:
soup = BeautifulSoup(source, 'lxml')
home_team = soup.find('div', class_='tname-home').a.text
away_team = soup.find('div', class_='tname-away').a.text
home_score = soup.select_one('.current-result .scoreboard:nth-child(1)').text
away_score = soup.select_one('.current-result .scoreboard:nth-child(2)').text
print("The home team is " + home_team, "and they scored " + home_score)
print()
print("The away team is " + away_team, "and they scored " + away_score)
home_stats = soup.select_one('div', class_='statText:nth-child(1)').text
print(home_stats)
Which currently does print the home and away team and the number of goals they scored. But I can't seem to get any of the statistical content from this site.
My output plan is to have:
[home_team] had 53% ball possession and [away_team] had 47% ball possession
However, I would like to remove the "%" symbols from the parse (but that's not essential). My plan is to use these numbers for more stats later on, so the % symbol gets in the way.
Apologies for the noob question - this is the absolute beginning of my Pythonic journey. I have scoured the internet and StackOverflow and just can not find this situation - I also possibly don't know exactly what I am looking for either.
Thanks kindly for your help! May your answer be the one I pick as "correct" ;)
Assuming that this is the website you are trying to scrape, here is the complete code to scrape all the stats:
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

driver = webdriver.Chrome('chromedriver.exe')
driver.get('https://www.scoreboard.com/en/match/SO3Fg7NR/#match-statistics;0')
pg = driver.page_source  # Gets the source code of the page
driver.close()

soup = BeautifulSoup(pg, 'html.parser')  # Creates a soup object
statrows = soup.find_all('div', class_="statTextGroup")  # Finds all the div tags with class statTextGroup -- these div tags contain the stats

# Scrapes the team names
teams = soup.find_all('a', class_="participant-imglink")
teamslst = []
for x in teams:
    team = x.text.strip()
    if team != "":
        teamslst.append(team)

stats_dict = {}
count = 0
for x in statrows:
    txt = x.text
    final_txt = ""
    stat = ""
    alphabet = False
    percentage = False
    # Extracts the numbers from the text
    for c in txt:
        if c in '0123456789':
            final_txt += c
        else:
            if alphabet == False:
                final_txt += "-"
                alphabet = True
            if c != "%":
                stat += c
            else:
                percentage = True
    values = final_txt.split('-')
    # Appends the values to the dictionary
    for x in values:
        if stat in stats_dict.keys():
            if percentage == True:
                stats_dict[stat].append(x + "%")
            else:
                stats_dict[stat].append(int(x))
        else:
            if percentage == True:
                stats_dict[stat] = [x + "%"]
            else:
                stats_dict[stat] = [int(x)]
    count += 1
    if count == 15:
        break

index = [teamslst[0], teamslst[1]]
# Creates a pandas DataFrame out of the dictionary
df = pd.DataFrame(stats_dict, index=index).T
print(df)
Output:
Burnley Southampton
Ball Possession 53% 47%
Goal Attempts 10 5
Shots on Goal 2 1
Shots off Goal 4 2
Blocked Shots 4 2
Free Kicks 11 10
Corner Kicks 8 2
Offsides 2 1
Goalkeeper Saves 0 2
Fouls 8 10
Yellow Cards 1 0
Total Passes 522 480
Tackles 15 12
Attacks 142 105
Dangerous Attacks 44 29
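Since you mentioned wanting the possession numbers without the "%" symbol, here is a small post-processing sketch, assuming the df built above (rows are stats, columns are teams):

# Convert the possession row from strings like '53%' to plain integers
possession = df.loc['Ball Possession'].str.rstrip('%').astype(int)
print(possession)  # Burnley 53, Southampton 47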
Hope that this helps!
P.S.: I actually wrote this code for a different question, but I didn't post it since an answer was already posted. I didn't know it would come in handy now! Anyway, I hope my answer does what you need.

webscraping with selenium to click a button and grab everything

I have been working on this scraper for a while and I think it could be improved but I'm not sure where to go from here.
The initial scraper looks like this and I believe it does everything I need it to do:
from selenium import webdriver
import time

url = "https://matrix.heartlandmls.com/Matrix/Public/Portal.aspx?L=1&k=990316X949Z&p=DE-74613894-421"
h_table = []
driver = webdriver.Firefox()
driver.get(url)
driver.find_element_by_xpath("/html/body/form/div[3]/div/div/div[5]/div[3]/span[2]/div/div/div[2]/div[1]/div/div/div[2]/div[2]/div[1]/span/a").click()
time.sleep(10)
i = 200
while i > 0:
    h_table.append(driver.find_element_by_id("wrapperTable").text)
    driver.find_element_by_xpath("/html/body/form/div[3]/div/div/div[5]/div[2]/div/div[1]/div/div/span/ul/li[2]/a").click()
    time.sleep(10)
    i -= 1
This outputs everything into a table which I can clean up:
['210 Sitter Street\nPleasant Hill, MO 64080\nMLS#:2178982\nMap\n$15,000\nSold\n4Bedrms\n2Full Bath(s)\n0Half Bath(s)\n1,848Sqft\nBuilt in1950\n0.27Acres\nSingle Family\n1 / 10\nThis Home sits on a level, treed, and nice .279 acre sizeable double lot. The property per taxes, is identified as a Single Family Home however it has 2 separate utility meters and 2 living spaces, each with 2 bedrooms and 1 full bath and laundry areas, and was utilized as a Duplex for Rental income for 2 units. This property is a CASH ONLY sale and is being sold "In It\'s Present Condition". Home and detached garage are in need of repair OR would be a candidate for a tear down and complete rebuild on the lot.\nAbout 210 Sitter Street, Pleasant Hill, MO 64080\nDirections:I-70 to 7 Hwy, to Broadway, to Sitter St, to property.\nGeneral Description\nMLS Number\n2178982\nCounty\nCass\nCity\nPleasant Hill\nSub Div\nWalkers & Sitlers\nType\nSingle Family\nFloor Plan Description\nRanch\nBdrms\n4\nBaths Full\n2\nBaths Half\n0\nAge Description\n51-75 Years\nYear Built\n1950\nSqft Main\n1848\nSQFT MAIN SOURCE\nPublic Record\nBelow Grade Finished Sq Ft\n0\nBelow Grade Finished Sq Ft Source\nPublic Record\nSqft\n1848\nLot Size\n12,155\nAcres\n0.27\nSchools E\nPleasant Hill Prim\nSchools M\nPleasant Hill\nSchools H\nPleasant Hill\nSchool District\nPleasant Hill\nLegal Description\nWALKER & SITLERS LOT 47 & 48 BLK 5\nS Terms\nCash\nInterior Features\nFireplace?\nY\nFireplace Description\nLiving Room, Wood Burning\nBasement\nN\nBasement Description\nBlock, Crawl Space\nDining Area Description\nEat-In Kitchen\nUtility Room\nMultiple, Main Level\nInterior Features\nFixer Up\nRooms\nBathroom Full\nLevel 1\n2nd Full Bath\nLevel 1\nMaster Bedroom\nLevel 1\nSecond Bedroom\nLevel 1\nMaster BR- 2nd\nLevel 1\nFourth Bedroom\nLevel 1\nKitchen\nLevel 1\nKitchen- 2nd\nLevel 1\nLiving Room\nLevel 1\nFamily Rm- 2nd\nLevel 1\nExterior / Construction\nGarage/Parking?\nY\nGarage/Parking #\n2\nGarage Description\nDetached, Front Entry\nConstruction\nFrame\nArchitecture\nTraditional\nRoof\nComposition\nLot Description\nCity Limits, City Lot, Level, Treed\nIn Floodplain\nNo\nInside City Limits\nYes\nStreet Maintenance\nPub Maint, Paved\nExterior Features\nFixer Up\nUtility Information\nCentral Air\nY\nHeat\nForced Air Gas\nCool\nCentral Electric, Window Unit(s)\nWater\nCity/Public\nSewer\nCity/Public\nFinancial Information\nS Terms\nCash\nHoa Amount\n$0\nTax\n$1,066\nSpecial Tax\n$0\nTotal Tax\n$1,066\nExclusions\nEntire Property\nType Of Ownership\nPrivate\nWill Sell\nCash\nAssessment & Tax\nAssessment Year\n2019\n2018\n2017\nAssessed Value - Total\n$17,240\n$15,380\n$15,380\nAssessed Value - Land\n$2,400\n$1,920\n$1,920\nAssessed Value - Improved\n$14,840\n$13,460\n$13,460\nYOY Change ($)\n$1,860\n$\nYOY Change (%)\n12%\n0%\nTax Year\n2019\n2018\n2017\nTotal Tax\n$1,178.32\n$1,065.64\n$1,064.30\nYOY Change ($)\n$113\n$1\nYOY Change (%)\n11%\n0%\nNotes for you and your agent\nAdd Note\nMap data ©2020\nTerms of Use\nReport a map error\nMap\n200 ft \nParcel Disclaimer'
However, I had seen some other examples with WebDriverWait; so far I have been unsuccessful with it, but I think it would greatly speed up the scraper. Here's the code I wrote:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

url = "https://matrix.heartlandmls.com/Matrix/Public/Portal.aspx?L=1&k=990316X949Z&p=DE-74613894-421"
h_table = []
xpath = '/html/body/form/div[3]/div/div/div[5]/div[2]/div/div[1]/div/div/span/ul/li[2]/a'
driver = webdriver.Firefox()
driver.get(url)
driver.find_element_by_xpath("/html/body/form/div[3]/div/div/div[5]/div[3]/span[2]/div/div/div[2]/div[1]/div/div/div[2]/div[2]/div[1]/span/a").click()
time.sleep(10)
while True:
    button = driver.find_elements_by_xpath("/html/body/form/div[3]/div/div/div[5]/div[2]/div/div[1]/div/div/span/ul/li[2]/a")
    if len(button) < 1:
        print('done')
        break
    else:
        h_table.append(driver.find_element_by_id("wrapperTable").text)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, 'xpath'))).click()
This seems to give all the results, but it produces duplicates, and I couldn't stop it without a keyboard interrupt.
Calling len(h_table) gives 258, where it should be 200.
If the length of your list is the problem, why not use:
if len(h_table) >= 200:
    print("done")
    break
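A minimal sketch of how that check could sit in the loop from the question; note it passes the xpath variable defined earlier to the wait, not the literal string 'xpath' that the question's code uses:

while True:
    if len(h_table) >= 200:
        print("done")
        break
    h_table.append(driver.find_element_by_id("wrapperTable").text)
    # use the `xpath` variable here, not the string 'xpath'
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, xpath))).click()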

Beautiful Soup Craigslist Scraping Pricing is the same

I am trying to scrape Craigslist using BeautifulSoup4. All data shows properly EXCEPT price. I can't seem to find the right tagging to loop through pricing instead of showing the same price for each post.
import requests
from bs4 import BeautifulSoup
source = requests.get('https://washingtondc.craigslist.org/search/nva/sss?query=5%20hp%20boat%20motor&sort=rel').text
soup = BeautifulSoup(source, 'lxml')
for summary in soup.find_all('p', class_='result-info'):
    pricing = soup.find('span', class_='result-price')
    price = pricing
    title = summary.a.text
    url = summary.a['href']
    print(title + '\n' + price.text + '\n' + url + '\n')
Left: HTML code from Craigslist; the commented-out parts are (in my opinion) irrelevant code. I want pricing to not loop the same number. Right: Sublime screenshot of the code.
Snippet of the code running in the terminal. Pricing is the same for each post.
Thank you
Your script is almost correct. You just need to change the object you search for the price from soup to summary:
import requests
from bs4 import BeautifulSoup
source = requests.get('https://washingtondc.craigslist.org/search/nva/sss?query=5%20hp%20boat%20motor&sort=rel').text
soup = BeautifulSoup(source, 'lxml')
for summary in soup.find_all('p', class_='result-info'):
    price = summary.find('span', class_='result-price')
    title = summary.a.text
    url = summary.a['href']
    print(title + '\n' + price.text + '\n' + url + '\n')
Output:
Boat Water Tender - 10 Tri-Hull with Electric Trolling Motor
$629
https://washingtondc.craigslist.org/nva/boa/d/haymarket-boat-water-tender-10-tri-hull/7160572264.html
1987 Boston Whaler Montauk 17
$25450
https://washingtondc.craigslist.org/nva/boa/d/alexandria-1987-boston-whaler-montauk-17/7163033134.html
1971 Westerly Warwick Sailboat
$3900
https://washingtondc.craigslist.org/mld/boa/d/upper-marlboro-1971-westerly-warwick/7170495800.html
Buy or Rent. DC Party Pontoon for Dock Parties or Cruises
$15000
https://washingtondc.craigslist.org/doc/boa/d/washington-buy-or-rent-dc-party-pontoon/7157810378.html
West Marine Zodiac Inflatable Boat SB285 With 5HP Gamefisher (Merc)
$850
https://annapolis.craigslist.org/boa/d/annapolis-west-marine-zodiac-inflatable/7166031908.html
2012 AB aluminum/hypalon inflatable dinghy/2012 Yamaha 6hp four stroke
$3400
https://annapolis.craigslist.org/bpo/d/annapolis-2012-ab-aluminum-hypalon/7157768911.html
RHODES-18’ CENTERBOARD DAYSAILER
$6500
https://annapolis.craigslist.org/boa/d/ocean-view-rhodes-18-centerboard/7148322078.html
Mercury Outboard 7.5 HP
$250
https://baltimore.craigslist.org/bpo/d/middle-river-mercury-outboard-75-hp/7167399866.html
8 hp yamaha 2 stroke
$0
https://baltimore.craigslist.org/bpo/d/8-hp-yamaha-2-stroke/7154103281.html
TRADE 38' BENETEAU IDYLLE 1150
$35000
https://baltimore.craigslist.org/boa/d/middle-river-trade-38-beneteau-idylle/7163761741.html
5-hp Top Tank Mercury
$0
https://baltimore.craigslist.org/bpo/d/5-hp-top-tank-mercury/7154102434.html
5-hp Top Tank Mercury
$0
https://baltimore.craigslist.org/bpo/d/5-hp-top-tank-mercury/7154102744.html
Wanted ur unwanted outboards
$0
https://baltimore.craigslist.org/bpo/d/randallstown-wanted-ur-unwanted/7141349142.html
Grumman Sport Boat
$2250
https://baltimore.craigslist.org/boa/d/baldwin-grumman-sport-boat/7157186381.html
1996 Carver 355 Aft Cabin Motor Yacht
$47000
https://baltimore.craigslist.org/boa/d/middle-river-1996-carver-355-aft-cabin/7156830617.html
Lower unit, long shaft
$50
https://baltimore.craigslist.org/bpo/d/catonsville-lower-unit-long-shaft/7155566763.html
Lower unit, long shaft
$50
https://baltimore.craigslist.org/bpo/d/catonsville-lower-unit-long-shaft/7155565771.html
Lower unit, long shaft
$50
https://baltimore.craigslist.org/bpo/d/catonsville-lower-unit-long-shaft/7155566035.html
Lower unit, long shaft
$50
https://baltimore.craigslist.org/bpo/d/catonsville-lower-unit-long-shaft/7155565301.html
Cape Dory 25 Sailboat for sale or trade
$6500
https://baltimore.craigslist.org/boa/d/reedville-cape-dory-25-sailboat-for/7149227778.html
West Marine HP-V 350
$1200
https://baltimore.craigslist.org/boa/d/pasadena-west-marine-hp-350/7147285666.html
