I'm trying to scrape a webpage using BeautifulSoup, but findAll() returns an empty list. This is my code:
import requests
from bs4 import BeautifulSoup

URL = "https://elcinema.com/en/index/work/country/eg?page=1"
r = requests.get(URL)
bsObj = BeautifulSoup(r.content, 'html5lib')
recordList = bsObj.findAll('a', attrs={'class': "lazy-loaded "})
print(recordList)
What am I doing wrong?
You need to find img tags with the class lazy-loaded; the a tags don't carry that class, and the trailing space in your "lazy-loaded " string also prevents a match:
import requests
from bs4 import BeautifulSoup
URL = "https://elcinema.com/en/index/work/country/eg?page=1"
r = requests.get(URL)
bsObj = BeautifulSoup(r.content, 'html.parser')
recordList = bsObj.findAll('img', class_="lazy-loaded")
recordList = [i['data-src'] for i in recordList]  # lazy-loaded images keep the real URL in data-src, not src
print(recordList)
Output:
['https://media.elcinema.com/blank_photos/75x75.jpg', 'https://media.elcinema.com/uploads/_75x75_2fe90cb32f2759181f71eb2a9b29f0735f87ac88150a6a8fd3734300f8714369.jpg', 'https://media.elcinema.com/uploads/_75x75_3d90d1ee22c5f455bc4556073eab69cd218446d6134dc0f2694782ee39ccb5bf.jpg', 'https://media.elcinema.com/uploads/_75x75_81f30061ed82645e9ee688642275d76a23ee329344c5ac25c42f22afa35432ff.jpg', 'https://media.elcinema.com/blank_photos/75x75.jpg',.......]
According to the question, it looks like you need to find all a elements that contain an img tag with the specific class lazy-loaded. Follow the code below to get those:
Code:
import requests
from bs4 import BeautifulSoup
URL = "https://elcinema.com/en/index/work/country/eg?page=1"
r = requests.get(URL)
bsObj = BeautifulSoup(r.content, 'html.parser')
outputdata = []
recordList = bsObj.findAll('a')
for record in recordList:
    if record.find("img", {"class": "lazy-loaded"}):
        outputdata.append(record)
print(len(outputdata))
print(outputdata)
Output:
Let me know if you have any questions :)
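A side note, not part of the answers above: if your BeautifulSoup install includes soupsieve (it ships with current bs4 releases and backs soup.select()), the same "anchors containing a lazy-loaded image" filter can be written as one CSS selector:
import requests
from bs4 import BeautifulSoup

URL = "https://elcinema.com/en/index/work/country/eg?page=1"
r = requests.get(URL)
bsObj = BeautifulSoup(r.content, 'html.parser')

# ':has()' selects <a> elements that have a matching descendant;
# it is handled by soupsieve, not by BeautifulSoup itself.
outputdata = bsObj.select('a:has(img.lazy-loaded)')
print(len(outputdata))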
Hey guys, so I got as far as being able to add the a elements to a list. The problem is I just want the href link to be added to the links_with_text list and not the entire a element. What am I doing wrong?
from bs4 import BeautifulSoup
from requests import get
import requests
URL = "https://news.ycombinator.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='hnmain')
articles = results.find_all(class_="title")
links_with_text = []
for article in articles:
    link = article.find('a', href=True)
    links_with_text.append(link)
print('\n'.join(map(str, links_with_text)))
This prints exactly how I want the list to print, but I just want the href from every a element, not the entire element. Thank you
To get all links from https://news.ycombinator.com, you can use the CSS selector 'a.storylink'.
For example:
from bs4 import BeautifulSoup
from requests import get
import requests
URL = "https://news.ycombinator.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
links_with_text = []
for a in soup.select('a.storylink'):   # <-- find all <a> with class="storylink"
    links_with_text.append(a['href'])  # <-- note the ['href']
print(*links_with_text, sep='\n')
Prints:
https://blog.mozilla.org/futurereleases/2020/06/18/introducing-firefox-private-network-vpns-official-product-the-mozilla-vpn/
https://mxb.dev/blog/the-return-of-the-90s-web/
https://github.blog/2020-06-18-introducing-github-super-linter-one-linter-to-rule-them-all/
https://www.sciencemag.org/news/2018/11/why-536-was-worst-year-be-alive
https://www.strongtowns.org/journal/2020/6/16/do-the-math-small-projects
https://devblogs.nvidia.com/announcing-cuda-on-windows-subsystem-for-linux-2/
https://lwn.net/SubscriberLink/822568/61d29096a4012e06/
https://imil.net/blog/posts/2020/fakecracker-netbsd-as-a-function-based-microvm/
https://jepsen.io/consistency
https://tumblr.beesbuzz.biz/post/621010836277837824/advice-to-young-web-developers
https://archive.org/search.php?query=subject%3A%22The+Navy+Electricity+and+Electronics+Training+Series%22&sort=publicdate
https://googleprojectzero.blogspot.com/2020/06/ff-sandbox-escape-cve-2020-12388.html?m=1
https://apnews.com/1da061ce00eb531291b143ace0eed1c9
https://support.apple.com/library/content/dam/edam/applecare/images/en_US/appleid/android-apple-music-account-payment-none.jpg
https://standpointmag.co.uk/issues/may-june-2020/the-healing-power-of-birdsong/
https://steveblank.com/2020/06/18/the-coming-chip-wars-of-the-21st-century/
https://www.videolan.org/security/sb-vlc3011.html
https://onesignal.com/careers/2023b71d-2f44-4934-a33c-647855816903
https://www.bbc.com/news/world-europe-53006790
https://github.com/efficient/HOPE
https://everytwoyears.org/
https://www.historytoday.com/archive/natural-histories/intelligence-earthworms
https://cr.yp.to/2005-590/powerpc-cwg.pdf
https://quantum.country/
http://www.crystallography.net/cod/
https://parkinsonsnewstoday.com/2020/06/17/tiny-magnetically-powered-implant-may-be-future-of-deep-brain-stimulation/
https://spark.apache.org/releases/spark-release-3-0-0.html
https://arxiv.org/abs/1712.09624
https://www.washingtonpost.com/technology/2020/06/18/data-privacy-law-sherrod-brown/
https://blog.chromium.org/2020/06/improving-chromiums-browser.html
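Alternatively, keeping your original find/find_all approach, the minimal change is appending link['href'] instead of the whole tag. Here is a sketch based on your own code; the if link guard is a defensive assumption in case some .title cell contains no anchor:
from bs4 import BeautifulSoup
import requests

URL = "https://news.ycombinator.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

results = soup.find(id='hnmain')
articles = results.find_all(class_="title")

links_with_text = []
for article in articles:
    link = article.find('a', href=True)
    if link:                                  # skip .title cells without an <a href=...>
        links_with_text.append(link['href'])  # just the href string, not the whole tag

print('\n'.join(links_with_text))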
I am currently trying to scrape aviation data from craigslist. I have no problem getting all the info I want except the first image for each post. Here is my link:
https://spokane.craigslist.org/search/avo?hasPic=1
I have been able to get all images thanks to a different post on this site, but I am having trouble figuring out how to get just the first image.
I am using bs4 and requests for this script. Here is what I have so far which gets every image:
from bs4 import BeautifulSoup as bs
import requests
image_url = 'https://images.craigslist.org/{}_300x300.jpg'
r = requests.get('https://spokane.craigslist.org/search/avo?hasPic=1')
soup = bs(r.content, 'lxml')
ids = [item['data-ids'].replace('1:','') for item in soup.select('.result-image[data-ids]', limit = 10)]
images = [image_url.format(j) for i in ids for j in i.split(',')]
print(images)
Any help is greatly appreciated.
Thanks in advance,
inzel
from bs4 import BeautifulSoup
import requests
r = requests.get("https://spokane.craigslist.org/search/avo?hasPic=1")
soup = BeautifulSoup(r.text, 'html.parser')
img = "https://images.craigslist.org/"
imgs = [f"{img}{item.get('data-ids').split(':')[1].split(',')[0]}_300x300.jpg"
        for item in soup.findAll("a", class_="result-image gallery")]
print(imgs)
Output:
['https://images.craigslist.org/00N0N_ci3cbcv5T58_300x300.jpg', 'https://images.craigslist.org/00101_5dLpBXXdDWJ_300x300.jpg', 'https://images.craigslist.org/00n0n_8zVXHONPkTH_300x300.jpg', 'https://images.craigslist.org/00l0l_jiNMe38avtl_300x300.jpg', 'https://images.craigslist.org/00q0q_l4hts9RPOuk_300x300.jpg', 'https://images.craigslist.org/00D0D_ibbWWn7uFCu_300x300.jpg', 'https://images.craigslist.org/00z0z_2ylVbmdVnPr_300x300.jpg', 'https://images.craigslist.org/00Q0Q_ha0o2IJwj4Q_300x300.jpg', 'https://images.craigslist.org/01212_5LoZU43xA7r_300x300.jpg', 'https://images.craigslist.org/00U0U_7CMAu8vAhDi_300x300.jpg', 'https://images.craigslist.org/00m0m_8c7azYhDR1Z_300x300.jpg', 'https://images.craigslist.org/00E0E_7k7cPL7zNnP_300x300.jpg', 'https://images.craigslist.org/00I0I_97AZy8UMt5V_300x300.jpg', 'https://images.craigslist.org/00G0G_iWw8AI8N8Kf_300x300.jpg', 'https://images.craigslist.org/00m0m_9BEEcvD0681_300x300.jpg', 'https://images.craigslist.org/01717_4Ut5FSIdoi3_300x300.jpg', 'https://images.craigslist.org/00h0h_jeAhtDXW2ST_300x300.jpg', 'https://images.craigslist.org/00T0T_hTogH4m9zTH_300x300.jpg', 'https://images.craigslist.org/01212_9x1EFI1CYHE_300x300.jpg', 'https://images.craigslist.org/00H0H_kiXLOtVgReA_300x300.jpg', 'https://images.craigslist.org/00P0P_ad77Eqvf1ul_300x300.jpg', 'https://images.craigslist.org/00909_jyBoTCNGmAJ_300x300.jpg', 'https://images.craigslist.org/00g0g_gFtJlANhi51_300x300.jpg', 'https://images.craigslist.org/00202_3LV7YERBssE_300x300.jpg', 'https://images.craigslist.org/00j0j_3zxT682nE2i_300x300.jpg', 'https://images.craigslist.org/00Y0Y_b6AXcApcSfl_300x300.jpg', 'https://images.craigslist.org/00M0M_6eTHo5E3Ee5_300x300.jpg', 'https://images.craigslist.org/00g0g_hvyvJKUejXY_300x300.jpg', 'https://images.craigslist.org/00I0I_d2WOWXtgQ8s_300x300.jpg', 'https://images.craigslist.org/00s0s_dAwJG0D6uce_300x300.jpg', 'https://images.craigslist.org/00g0g_TC2qvnD3AN_300x300.jpg', 'https://images.craigslist.org/00M0M_Dba39RfEkr_300x300.jpg', 'https://images.craigslist.org/00M0M_31drxF6c9vO_300x300.jpg', 'https://images.craigslist.org/00505_jOjMq3B8y0M_300x300.jpg', 'https://images.craigslist.org/00e0e_ixfV647qwLh_300x300.jpg', 'https://images.craigslist.org/00p0p_i2noTC4cADw_300x300.jpg', 'https://images.craigslist.org/00a0a_kywatxfm6Ud_300x300.jpg', 'https://images.craigslist.org/00808_1ZjIIX8PdaP_300x300.jpg', 'https://images.craigslist.org/01515_blEEDKbbyKD_300x300.jpg', 'https://images.craigslist.org/00b0b_brUn6sUxBzF_300x300.jpg', 'https://images.craigslist.org/00U0U_2ukBvcgvU99_300x300.jpg', 'https://images.craigslist.org/01212_dPTe5ZHM26A_300x300.jpg', 'https://images.craigslist.org/00B0B_1GsE81zVsr0_300x300.jpg', 'https://images.craigslist.org/00N0N_l8SXlBaI8lq_300x300.jpg', 'https://images.craigslist.org/00f0f_82qAzPq7cXd_300x300.jpg', 'https://images.craigslist.org/00w0w_lUrgFG9YOY0_300x300.jpg', 'https://images.craigslist.org/00C0C_kiZpgrFEnO8_300x300.jpg', 'https://images.craigslist.org/00T0T_g7IHvHMx14L_300x300.jpg', 'https://images.craigslist.org/00E0E_bzm9jRXpWVd_300x300.jpg', 'https://images.craigslist.org/00k0k_lOCRF1fgWCF_300x300.jpg', 'https://images.craigslist.org/00y0y_exwReppAi3L_300x300.jpg', 'https://images.craigslist.org/01515_7xyZ605hYcc_300x300.jpg', 'https://images.craigslist.org/00J0J_hqLMLvTCfXk_300x300.jpg', 'https://images.craigslist.org/00505_3P0xQrbeFY4_300x300.jpg', 'https://images.craigslist.org/00r0r_gj6dO6ZHO8L_300x300.jpg', 'https://images.craigslist.org/01717_cIVmzgKCWtP_300x300.jpg', 
'https://images.craigslist.org/00w0w_6O59k6qlZQz_300x300.jpg', 'https://images.craigslist.org/00808_jd43ZthN1uB_300x300.jpg', 'https://images.craigslist.org/00m0m_1GJ41cKvv4Y_300x300.jpg']
That list contains the first image for each post.
You need to find all the a elements with the result-image gallery class, then get their data-ids attribute.
Then split the ids into a list and take the first element with [0].
from bs4 import BeautifulSoup as bs
import requests
image_url = 'https://images.craigslist.org/{}_300x300.jpg'
r = requests.get('https://spokane.craigslist.org/search/avo?hasPic=1')
soup = bs(r.content, 'lxml')
ids = [item.get('data-ids').replace('1:','') for item in soup.findAll("a", {"class": "result-image gallery"}, limit=10)]
images = [image_url.format(i.split(',')[0]) for i in ids]
print(images)
Result:
['https://images.craigslist.org/00N0N_ci3cbcv5T58_300x300.jpg', 'https://images.craigslist.org/00101_5dLpBXXdDWJ_300x300.jpg', 'https://images.craigslist.org/00n0n_8zVXHONPkTH_300x300.jpg', 'https://images.craigslist.org/00l0l_jiNMe38avtl_300x300.jpg', 'https://images.craigslist.org/01212_fULyvfO9Rqz_300x300.jpg', 'https://images.craigslist.org/00D0D_ibbWWn7uFCu_300x300.jpg', 'https://images.craigslist.org/00z0z_2ylVbmdVnPr_300x300.jpg', 'https://images.craigslist.org/00Q0Q_ha0o2IJwj4Q_300x300.jpg', 'https://images.craigslist.org/01212_5LoZU43xA7r_300x300.jpg', 'https://images.craigslist.org/00U0U_7CMAu8vAhDi_300x300.jpg']
Here is a clean and straightforward solution:
from pprint import pprint
import requests
from bs4 import BeautifulSoup
base_image_url = 'https://images.craigslist.org/{}_300x300.jpg'
r = requests.get('https://spokane.craigslist.org/search/avo?hasPic=1')
soup = BeautifulSoup(r.content, 'lxml')
results = []
for elem in soup.find_all("a", attrs={"class": "result-image gallery"})[:2]:
    listing_url = elem.get("href")
    image_urls = []
    image_ids = elem.get("data-ids")
    if image_ids:
        image_urls = [base_image_url.format(curr_id[2:]) for curr_id in image_ids.split(",")]
    results.append((listing_url, image_urls))
pprint(results)
Output:
[('https://spokane.craigslist.org/avo/d/spokane-lightspeed-sierra-headset/7090771925.html',
['https://images.craigslist.org/00N0N_ci3cbcv5T58_300x300.jpg',
'https://images.craigslist.org/00q0q_5ax4n1nCwmI_300x300.jpg',
'https://images.craigslist.org/00202_pAcLlJsaR3_300x300.jpg',
'https://images.craigslist.org/00z0z_kCZGUL6WZZw_300x300.jpg',
'https://images.craigslist.org/00G0G_8A2Xg7Wbe7B_300x300.jpg',
'https://images.craigslist.org/00f0f_cNpP8ZfUXdU_300x300.jpg']),
('https://spokane.craigslist.org/avo/d/spokane-window-mounted-air-conditioner/7090361383.html',
['https://images.craigslist.org/00101_5dLpBXXdDWJ_300x300.jpg',
'https://images.craigslist.org/00I0I_lxNKJsQAT7X_300x300.jpg',
'https://images.craigslist.org/00t0t_3BeBsNO6xH6_300x300.jpg',
'https://images.craigslist.org/00L0L_aPnbejSiXQp_300x300.jpg'])]
Let me know if you have any questions :)
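For reference, the three answers above parse the same attribute in slightly different ways. Judging from the snippets, each data-ids value looks like a comma-separated list of ids, each prefixed with "1:" (that format is inferred from the code above, not documented by Craigslist). A small sketch of the three approaches on one hypothetical string:
# Hypothetical data-ids value, modeled on the ids seen in the outputs above.
data_ids = "1:00N0N_ci3cbcv5T58,1:00q0q_5ax4n1nCwmI,1:00202_pAcLlJsaR3"

# Answer 1: take what follows the first colon, up to the first comma.
first = data_ids.split(':')[1].split(',')[0]

# Answer 2: strip every "1:" prefix, then split on commas.
second = data_ids.replace('1:', '').split(',')[0]

# Answer 3: split on commas, then drop the two-character "1:" prefix.
third = data_ids.split(',')[0][2:]

print(first == second == third, first)  # True 00N0N_ci3cbcv5T58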
I'm scraping from two URLs that have the same DOM structure, and so I'm trying to find a way to scrape both of them at the same time.
The only caveat is that the data scraped from both these pages needs to end up in distinctly named lists.
To explain with example, here is what I've tried:
import os
import requests
from bs4 import BeautifulSoup as bs
urls = ['https://www.basketball-reference.com/leaders/ws_career.html',
'https://www.basketball-reference.com/leaders/ws_per_48_career.html',]
headers = {}  # headers is used below but was never defined in the snippet
ws_list = []
ws48_list = []
categories = [ws_list, ws48_list]

for url in urls:
    response = requests.get(url, headers=headers)
    soup = bs(response.content, 'html.parser')
    section = soup.find('table', class_='stats_table')
    for a in section.find_all('a'):
        player_name = a.text
        for cat_list in categories:
            cat_list.append(player_name)

print(ws48_list)
print(ws_list)
This ends up printing two identical lists, when I was shooting for two lists each unique to its page.
How do I accomplish this? Would it be better practice to code it another way?
Instead of trying to append to already existing lists, just create new ones. Make a function to do the scrape and pass each url in turn to it.
import os
import requests
from bs4 import BeautifulSoup as bs
urls = ['https://www.basketball-reference.com/leaders/ws_career.html',
'https://www.basketball-reference.com/leaders/ws_per_48_career.html',]
def parse_page(url, headers={}):
    response = requests.get(url, headers=headers)
    soup = bs(response.content, 'html.parser')
    section = soup.find('table', class_='stats_table')
    return [a.text for a in section.find_all('a')]

ws_list, ws48_list = [parse_page(url) for url in urls]

print('ws_list = %r' % ws_list)
print('ws48_list = %r' % ws48_list)
Just add them to the appropriate list and the problem is solved:
for i, url in enumerate(urls):
    response = requests.get(url)
    soup = bs(response.content, 'html.parser')
    section = soup.find('table', class_='stats_table')
    for a in section.find_all('a'):
        player_name = a.text
        categories[i].append(player_name)
print(ws48_list)
print(ws_list)
You can use a function to define your scraping logic, then just call it for your urls.
import os
import requests
from bs4 import BeautifulSoup as bs
def scrape(url):
    response = requests.get(url)
    soup = bs(response.content, 'html.parser')
    section = soup.find('table', class_='stats_table')
    names = []
    for a in section.find_all('a'):
        player_name = a.text
        names.append(player_name)
    return names
ws_list = scrape('https://www.basketball-reference.com/leaders/ws_career.html')
ws48_list = scrape('https://www.basketball-reference.com/leaders/ws_per_48_career.html')
print(ws_list)
print(ws48_list)
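A usage note: if more URLs get added later, keeping the results in a dict keyed by URL avoids inventing a new variable name per page. A minimal sketch, assuming the scrape() function defined above:
urls = ['https://www.basketball-reference.com/leaders/ws_career.html',
        'https://www.basketball-reference.com/leaders/ws_per_48_career.html']

# One list of player names per URL, keyed by the URL itself.
names_by_url = {url: scrape(url) for url in urls}

for url, names in names_by_url.items():
    print(url, len(names))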
Here is the URL that I'm using:
http://www.protect-stream.com/PS_DL_xODN4o5HjLuqzEX5fRNuhtobXnvL9SeiyYcPLcqaqqXayD8YaIvg9Qo80hvgj4vCQkY95XB7iqcL4aF1YC8HRg_i_i
On this page, the link that I am looking for appears maybe 5 seconds after the page loads.
After those 5 seconds I see a POST request to:
http://www.protect-stream.com/secur.php
with data like so:
k=2AE_a,LHmb6kSC_c,sZNk4eNixIiPo_c,_c,Gw4ERVdriKuHJlciB1uuy_c,Sr7mOTQVUhVEcMlZeINICKegtzYsseabOlrDb_a,LmiP80NGUvAbK1xhbZGC6OWMtIaNF12f0mYA4O0WxBkmAtz75kpYcrHzxtYt32hCYSp0WjqOQR9bY_a,ofQtw_b,
I don't get where the 'k' value comes from.
Does anyone have an idea how we could get the 'k' value using Python?
This is not going to be trivial. The k parameter value is "hidden" deep inside a script element inside nested iframes. Here is a requests + BeautifulSoup way to get to the k value:
import re
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin
import requests
from bs4 import BeautifulSoup
base_url = "http://www.protect-stream.com"
with requests.Session() as session:
    response = session.get("http://www.protect-stream.com/PS_DL_xODN4o5HjLuqzEX5fRNuhtobXnvL9SeiyYcPLcqaqqXayD8YaIvg9Qo80hvgj4vCQkY95XB7iqcL4aF1YC8HRg_i_i")

    # get the top frame url
    soup = BeautifulSoup(response.content, "html.parser")
    src = soup.select_one('iframe[src^="frame.php"]')["src"]
    frame_url = urljoin(base_url, src)

    # get the nested frame url
    response = session.get(frame_url)
    soup = BeautifulSoup(response.content, "html.parser")
    src = soup.select_one('iframe[src^="w.php"]')["src"]
    frame_url = urljoin(base_url, src)

    # get the frame HTML source and extract the "k" value
    response = session.get(frame_url)
    soup = BeautifulSoup(response.content, "html.parser")
    script = soup.find("script", text=lambda text: text and "k=" in text).get_text(strip=True)
    k_value = re.search(r'var k="(.*?)";', script).group(1)

print(k_value)
Prints:
YjfH9430zztSYgf7ItQJ4grv2cvH3mT7xGwv32rTy2HiB1uuy_c,Sr7mOTQVUhVEcMlZeINICKegtzYsseabOlrDb_a,LmiP80NGUvAbK1xhbZGC6OWMtIaNF12f0mYA4O0WXhmwUC0ipkPRkLQepYHLyF1U0xvsrzHMcK2XBCeY3_a,O_b,
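Since the same fetch-and-parse step repeats for each frame, the chain can also be factored into a helper. This is a sketch under the assumption that the page still nests its frames exactly as above; the selectors will break if the site changes:
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = "http://www.protect-stream.com"
start_url = base_url + "/PS_DL_xODN4o5HjLuqzEX5fRNuhtobXnvL9SeiyYcPLcqaqqXayD8YaIvg9Qo80hvgj4vCQkY95XB7iqcL4aF1YC8HRg_i_i"

def get_soup(session, url):
    # fetch a page and parse it into a BeautifulSoup tree
    return BeautifulSoup(session.get(url).content, "html.parser")

def frame_url(soup, css):
    # resolve the src of the first iframe matching the CSS selector
    return urljoin(base_url, soup.select_one(css)["src"])

with requests.Session() as session:
    soup = get_soup(session, start_url)
    soup = get_soup(session, frame_url(soup, 'iframe[src^="frame.php"]'))
    soup = get_soup(session, frame_url(soup, 'iframe[src^="w.php"]'))
    script = soup.find("script", text=lambda t: t and "k=" in t).get_text(strip=True)
    print(re.search(r'var k="(.*?)";', script).group(1))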