I would like to count how many ads there are on this website: https://www.lastampa.it/?refresh_ce
I am using BeautifulSoup to do this. I would need to extract the info within the following:
<a id="aw0" target="_blank" href="https://googleads.g.doubleclick.net/pcs/click?xai=AKAOjssYz5VxTdwhxCBCrbtSi0dfGqGd25s7Ub6CCjsHLqd__OqfDKLyOWi6bKE3CL4XIJ0xDHy3ey-PGjm3_yVqTe0_IZ1g9AsvZmO1u8gciKpEKYMj1TIvl6KPivBuwgpfUDf8g2EvMyCD5r6tQ8Mx6Oa4G4yZoPYxFRN7ieFo7UbMr8FF2k6FL6R2qegawVLKVB5WHVAbwNQu4rVx4GE8KuxowGjcfecOnagp9uAHY2qiDE55lhdGqmXmuIEAK8UdaIKeRr6aBBVCR40LzY4&sig=Cg0ArKJSzEIRw7NDzCe7&adurl=https://track.adform.net/C/%3Fbn%3D38337867&nm=3&nx=357&ny=-4&mb=2" onfocus="ss('aw0')" onmousedown="st('aw0')" onmouseover="ss('aw0')" onclick="ha('aw0')"><img src="https://tpc.googlesyndication.com/simgad/5262715044200667305" border="0" width="990" height="30" alt="" class="img_ad"></a>
i.e. the parts containing ad information.
The code that I am using is the following:
from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.lastampa.it/?refresh_ce")
soup = BeautifulSoup(r.content, "html.parser")
ads_div = soup.find('div')
if ads_div:
    for link in ads_div.find_all('a'):
        print(link['href'])
It does not scrape any information, presumably because I am targeting the wrong tag/href. How could I get the ad information in order to count how many ads there are on that webpage?
How about using a regular expression to match "googleads" and counting how many matches you get?
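A minimal sketch of that idea, assuming the ad anchors look like the <a id="aw0"> example above; the sample_html string here is a made-up stand-in for the real r.content:

```python
import re
from bs4 import BeautifulSoup

# Made-up stand-in for r.content; the real page would be fetched with requests.
sample_html = """
<body>
  <a id="aw0" href="https://googleads.g.doubleclick.net/pcs/click?xai=abc">ad</a>
  <a href="https://www.lastampa.it/cronaca/">news link</a>
  <a id="aw1" href="https://googleads.g.doubleclick.net/pcs/click?xai=def">ad</a>
</body>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# Keep only anchors whose href matches "googleads".
ad_links = soup.find_all("a", href=re.compile("googleads"))
print(len(ad_links))  # → 2
```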
Recursively searching from the body gives you all the links on the whole page. If you want to search within a specific div, you can supply parameters such as the class or id you want to match as a dictionary.
You can filter the links once you obtain them.
body = soup.find('body')
if body:
    for link in body.find_all('a'):
        if "ad" in link['href']:
            print(link['href'])
Looking at the response I get, I notice there are no ads at all. This could be because the ads are loaded via a script, which means they won't be rendered and requests won't download them. To get around this you can use a webdriver with Selenium. That should do the trick.
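A rough sketch of that Selenium approach; count_ads is a hypothetical helper name, and Selenium plus a matching chromedriver are assumed to be installed (the import is kept inside the function so the rest of the snippet reads without them):

```python
def count_ads(url):
    """Render the page with a real browser, then count googleads anchors."""
    # Selenium and a matching chromedriver must be installed for this to run.
    from selenium import webdriver
    from bs4 import BeautifulSoup
    import re

    driver = webdriver.Chrome()
    try:
        driver.get(url)  # script-injected ads get rendered here
        soup = BeautifulSoup(driver.page_source, "html.parser")
        return len(soup.find_all("a", href=re.compile("googleads")))
    finally:
        driver.quit()
```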
Related
I am working on scraping 10-Q documents from SEC EDGAR.
This is the url link: https://www.sec.gov/Archives/edgar/data/1652044/000165204419000032/goog10-qq32019.htm
If we inspect it, you can find:
I need to extract 1600 Amphitheatre Parkway without using the id. Below is a code snippet that extracts the text using the id attribute; however, I need to use the name attribute.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
page = session.get('https://www.sec.gov/Archives/edgar/data/1652044/000165204419000032/goog10-qq32019.htm')
soup = BeautifulSoup(page.content, 'html.parser')
content = soup.find(id="d92517213e644-wk-Fact-0B11263160365DBABCF89969352EE602")
print(content.text)
Instead of the id attribute, I would like to use the name attribute. However, I am not able to extract the information using it. Please help.
See the HTML information:
How do I use the name attribute instead of the id attribute to extract the contents?
Thanks
You can find elements based on attribute values like this:
soup.find('html_tag', {"attribute": "value"})
So in your case, the name attribute exists on the ix:nonnumeric tag:
content = soup.find('ix:nonnumeric', {"name": "dei:EntityAddressAddressLine1"})
If I were, for example, looking to track the price changes of MIDI keyboards on https://www.gear4music.com/Studio-MIDI-Controllers, I would need to extract all the URLs of the products pictured in the search results and then loop through them, extracting the price info for each product. I can obtain the price data of an individual product by hard-coding its URL, but I cannot find a way to automate getting the URLs of multiple products.
So far I have tried this,
from bs4 import BeautifulSoup
import requests
url = "https://www.gear4music.com/Studio-MIDI- Controllers"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
tags = soup.find_all('a')
for tag in tags:
print(tag.get('href'))
This does produce a list of URLs, but I cannot make out which ones relate specifically to the MIDI keyboards in the search query whose price and product info I want to obtain. Is there a better, more specific way to obtain the URLs of the products only, rather than everything in the HTML file?
There are many ways to obtain the product links. One way could be to select all <a> tags that have a data-g4m-inv= attribute:
import requests
from bs4 import BeautifulSoup
url = "https://www.gear4music.com/Studio-MIDI-Controllers"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for a in soup.select("a[data-g4m-inv]"):
    print("https://www.gear4music.com" + a["href"])
Prints:
https://www.gear4music.com/Recording-and-Computers/SubZero-MiniPad-MIDI-Controller/P6E
https://www.gear4music.com/Recording-and-Computers/SubZero-MiniControl-MIDI-Controller/P6D
https://www.gear4music.com/Keyboards-and-Pianos/SubZero-MiniKey-25-Key-MIDI-Controller/JMR
https://www.gear4music.com/Keyboards-and-Pianos/Nektar-SE25/2XWA
https://www.gear4music.com/Keyboards-and-Pianos/Korg-nanoKONTROL2-USB-MIDI-Controller-Black/G8L
https://www.gear4music.com/Recording-and-Computers/SubZero-ControlKey25-MIDI-Keyboard/221Y
https://www.gear4music.com/Keyboards-and-Pianos/SubZero-CommandKey25-Universal-MIDI-Controller/221X
...
Open the Chrome developer console and look at the div that corresponds to a product. From there, set a variable (say, product) equal to soup.find_all(<that div>) and loop through the results to find the <a> tags that are children of each element (or, alternatively, identify the title class and search that way).
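A sketch of that approach on made-up markup; the g4m-grid-product class name here is an assumption, so check the real class in the developer console:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real listing-page markup.
sample_html = """
<div class="g4m-grid-product">
  <a href="/Keyboards-and-Pianos/Example-Controller/ABC">Example Controller</a>
  <span class="price">£49</span>
</div>
<div class="g4m-grid-product">
  <a href="/Keyboards-and-Pianos/Another-Controller/DEF">Another Controller</a>
  <span class="price">£99</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# Loop over each product container and pull the child <a> tag's href.
products = soup.find_all("div", class_="g4m-grid-product")  # assumed class name
urls = ["https://www.gear4music.com" + div.a["href"] for div in products]
print(urls)
```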
I am scraping an HTML file. Each page has a video on it, and the HTML contains the video id; I want to print that id out.
I know that if I want to print a headline from a div class, I would do this:
from bs4 import BeautifulSoup

with open('yeehaw.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    article = soup.find('div', class_='article')
    headline = article.h2.a.text
    print(headline)
However, the id for the video is found inside a data-id='qe67234' attribute.
I don't know how to access this 'qe67234' and print it out.
Please help, thank you!
Assuming that the data-id attribute is on a div tag:
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup('<div class="_article" data-id="qe67234"></div>', 'html.parser')
results = soup.find_all("div", {"data-id": re.compile(r".*")})
print('output: ', results[0]['data-id'])
# output: qe67234
Assuming that the data-id is on a div:
BeautifulSoup.find returns the found element as a Tag object whose attributes support dictionary-style access. You can therefore navigate it using standard means to get at the text (as you did in your question) as well as at attribute values (as shown in the code below):
soup = BeautifulSoup('<div class="_article" data-id="qe67234">', 'html.parser')
soup.find("div", {"class": "_article"})['data-id']
Note that, oftentimes, video elements require JavaScript for playback, and you might not be able to find the necessary element if the page was fetched with a non-JavaScript client (e.g. Python requests).
If this happens, you have to use tools like PhantomJS + a Selenium browser to get the website with its JavaScript executed before you perform your scraping.
EDIT
If the data-id attribute itself is not constant, you should look into the lxml library as a replacement for BeautifulSoup and use XPath expressions to find the element that you need.
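For instance, with lxml an XPath expression can match any div that carries a data-id attribute, whatever its value; the HTML string here is the same made-up snippet from above:

```python
from lxml import html

doc = html.fromstring('<div class="_article" data-id="qe67234"></div>')
# Select the data-id value of any div that has a data-id attribute.
ids = doc.xpath('//div[@data-id]/@data-id')
print(ids)  # → ['qe67234']
```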
I was trying to apply what others have suggested from here:
Beautiful Soup: Accessing <li> elements from <ul> with no id
But I can't get it to work. It seems the person in that question had a 'parent' h2 header, but the page I am trying to parse does not.
Here is the webpage I am scraping:
https://nvd.nist.gov/
(I think) I located the element I need to manipulate: it's <ul id="latestVulns"> and its child <li> sections.
I basically want to scrape the section that says "Last 20 Scored Vulnerability IDs & Summaries" and, based on what the vulnerabilities are, send an email to the appropriate department of my workplace.
Here is my code so far:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://nvd.nist.gov/')
soup = BeautifulSoup(source.content, 'lxml')
section = soup.find('latestVulns')
print(section)
This code returns None.
I'm at a loss
The first argument of find expects the name of the element, but you are passing in the id.
You can use this to find the tag correctly:
section = soup.find('ul', {'id': 'latestVulns'})
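Once you have the <ul>, its <li> entries can be walked the same way. A small sketch on made-up markup mimicking the NVD list (the real ids and link texts will differ):

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the NVD front-page markup.
sample_html = """
<ul id="latestVulns">
  <li><a href="/vuln/detail/CVE-2021-0001">CVE-2021-0001</a> - example summary</li>
  <li><a href="/vuln/detail/CVE-2021-0002">CVE-2021-0002</a> - another summary</li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")
section = soup.find('ul', {'id': 'latestVulns'})
# Each <li> holds one vulnerability link plus its summary text.
for li in section.find_all('li'):
    print(li.a['href'], li.get_text(strip=True))
```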
I am new to Python and I am learning it for scraping purposes. I am using BeautifulSoup to collect links (i.e. the href of <a> tags). I am trying to collect the links under the "UPCOMING EVENTS" tab of the site http://allevents.in/lahore/. I am using Firebug to inspect the element and get the CSS path, but this code returns nothing. I am looking for a fix, and also for suggestions on how to choose proper CSS selectors to retrieve the desired links from any site. I wrote this piece of code:
from bs4 import BeautifulSoup
import requests

url = "http://allevents.in/lahore/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
for link in soup.select('html body div.non-overlay.gray-trans-back div.container div.row div.span8 div#eh-1748056798.events-horizontal div.eh-container.row ul.eh-slider li.h-item div.h-meta div.title a[href]'):
    print(link.get('href'))
The page is not the most friendly in its use of classes and markup, but even so, your CSS selector is far too specific to be useful here.
If you want Upcoming Events, you want just the first <div class="events-horizontal">, then grab just the <div class="title"><a href="..."></div> tags, i.e. the links on the titles:
upcoming_events_div = soup.select_one('div.events-horizontal')
for link in upcoming_events_div.select('div.title a[href]'):
    print(link['href'])
Note that you should not use r.text; use r.content and leave the decoding to BeautifulSoup. See Encoding issue of a character in utf-8.
import bs4
import requests

res = requests.get("http://allevents.in/lahore/")
soup = bs4.BeautifulSoup(res.text, "html.parser")
for link in soup.select('a[property="schema:url"]'):
    print(link.get('href'))
This code will work fine!!