How to get google search page html code using python? - python

I am trying to extract the HTML of a Google search results page in Python, using the requests module.
import requests
from bs4 import BeautifulSoup
url = "https://www.google.com/search?q=how+to+get+google+search+page+source+code+by+python"
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
print(soup)
search = soup.find_all('div',class_="yuRUbf")
print(search)
But I can't find any element with class_="yuRUbf" in the output, so I think the response is not the source code I see in the browser. How can I get it?
I also tried resp.content, but it didn't work.
I also tried selenium, but it didn't work.

Related

Can't get for loop to work while parsing HTML using Beautiful Soup 4

I'm using the Beautiful Soup documentation to help me understand how to implement it. I'm not too familiar with Python as a whole, so maybe I'm making a syntax error, but I don't believe so. The code below should print out any links from the main Etsy page, but it's not doing that. The documentation states something similar to this, but maybe I'm missing something. Here's my code:
#!/usr/bin/python3
# import library
from bs4 import BeautifulSoup
import requests
import os.path
from os import path
# Request to website and download HTML contents
url='https://www.etsy.com/?utm_source=google&utm_medium=cpc&utm_term=etsy_e&utm_campaign=Search_US_Brand_GGL_ENG_General-Brand_Core_All_Exact&utm_ag=A1&utm_custom1=_k_Cj0KCQiAi8KfBhCuARIsADp-A54MzODz8nRIxO2LnGcB8Ezc3_q40IQk9HygcSzz9fPmPWnrITz8InQaAt5oEALw_wcB_k_&utm_content=go_227553629_16342445429_536666953103_kwd-1818581752_c_&utm_custom2=227553629&gclid=Cj0KCQiAi8KfBhCuARIsADp-A54MzODz8nRIxO2LnGcB8Ezc3_q40IQk9HygcSzz9fPmPWnrITz8InQaAt5oEALw_wcB'
req=requests.get(url)
content=req.text
soup=BeautifulSoup(content, 'html.parser')
for x in soup.head.find_all('a'):
    print(x.get('href'))
The HTML prints if I set it up that way, but I can't get the for loop to work.
If you're trying to get all the links (<a> tags) from the specified URL, search the body rather than the head:
url = 'https://www.etsy.com/?utm_source=google&utm_medium=cpc&utm_term=etsy_e&utm_campaign=Search_US_Brand_GGL_ENG_General-Brand_Core_All_Exact&utm_ag=A1&utm_custom1=_k_Cj0KCQiAi8KfBhCuARIsADp-A54MzODz8nRIxO2LnGcB8Ezc3_q40IQk9HygcSzz9fPmPWnrITz8InQaAt5oEALw_wcB_k_&utm_content=go_227553629_16342445429_536666953103_kwd-1818581752_c_&utm_custom2=227553629&gclid=Cj0KCQiAi8KfBhCuARIsADp-A54MzODz8nRIxO2LnGcB8Ezc3_q40IQk9HygcSzz9fPmPWnrITz8InQaAt5oEALw_wcB'
import requests
from bs4 import BeautifulSoup
with requests.get(url) as r:
    r.raise_for_status()
    soup = BeautifulSoup(r.text, 'lxml')
    if (body := soup.body):
        for a in body.find_all('a', href=True):
            print(a['href'])
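To see why a loop over soup.head prints nothing, here is a minimal offline sketch. The HTML string is made up for illustration and is not Etsy's markup; the point is only that anchor tags live in the body, so a search restricted to the head returns an empty list.

```python
from bs4 import BeautifulSoup

# Hypothetical page, not Etsy's real markup
html = """
<html>
  <head><title>Shop</title><link rel="stylesheet" href="/style.css"></head>
  <body>
    <a href="/cart">Cart</a>
    <a>No href here</a>
    <a href="/help">Help</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# The head contains <title> and <link>, but no <a> elements
head_links = [a.get("href") for a in soup.head.find_all("a")]

# href=True skips anchors that have no href attribute
body_links = [a["href"] for a in soup.body.find_all("a", href=True)]

print(head_links)
print(body_links)
```

If the body search also comes back empty on a real site, the links are probably added by JavaScript after the page loads, which requests does not run.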

Cannot get the data via my scripts but the data is available when I "inspect"

When I inspect "https://dse.bigexam.hk/en/ssp?p=1&band=1&order=name&asc=1" in the browser, I can find the data I want. For example, the page count text "Showing schools 1 to 10 of 143." is there. However, my script returns no data. Can anyone help? Thanks.
from bs4 import BeautifulSoup
import requests
def makeSoup(url):
    response = requests.get(url)
    return BeautifulSoup(response.text, 'lxml')
url = "https://dse.bigexam.hk/en/ssp?p=1&band=1&order=name&asc=1"
soup = makeSoup(url)
pages = soup.find('div', attrs={'class': 'col-sm'})
print(pages)
That's because the data is loaded with Ajax/JavaScript. The requests library doesn't execute scripts, so you'll need something that can run them and give you the resulting DOM.
You can try Selenium.
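A rough sketch of that approach, assuming Selenium 4 and a local Chrome install. The function names fetch_rendered_html and parse_total_schools are made up for illustration, and whether the text actually ends up inside div.col-sm (the selector from the question) is an assumption.

```python
import re


def fetch_rendered_html(url, css_selector, timeout=10):
    """Load url in headless Chrome and return the DOM once css_selector is present."""
    # Imported lazily so the parsing helper below works without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait for the script-generated element instead of sleeping a fixed time
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()


def parse_total_schools(text):
    """Return the total from 'Showing schools 1 to 10 of 143.', or None if absent."""
    match = re.search(r"Showing schools \d+ to \d+ of (\d+)", text)
    return int(match.group(1)) if match else None


# Intended use (needs a browser, so it is not run here):
# html = fetch_rendered_html(
#     "https://dse.bigexam.hk/en/ssp?p=1&band=1&order=name&asc=1", "div.col-sm")
# print(parse_total_schools(html))
```

The returned page_source can also be passed to BeautifulSoup exactly like response.text in the original script.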

How do I log data from a live website using beautiful soup

Hello, I am trying to use Beautiful Soup and requests to log the data coming from an anemometer which updates live every second. The website is here:
http://88.97.23.70:81/
The piece of data I want to scrape is highlighted in purple in the image, taken from inspecting the HTML in my browser.
I have written the code below to try to print out the data, but when I run it, it prints None. I think this means that the soup object doesn't in fact contain the whole HTML page? When I print soup.prettify(), I cannot find the id=js-2-text that I see when inspecting the HTML in my browser. If anyone has any ideas why this might be or how to fix it, I would be most grateful.
from bs4 import BeautifulSoup
import requests
wind_url='http://88.97.23.70:81/'
r = requests.get(wind_url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
print(soup.find(id='js-2-text'))
All the best,
Brendan
The data is loaded from an external URL, so BeautifulSoup doesn't see it in the page HTML. You can try requesting the API URL the page connects to:
import requests
from bs4 import BeautifulSoup
api_url = "http://88.97.23.70:81/cgi-bin/CGI_GetMeasurement.cgi"
data = {"input_id": "1"}
soup = BeautifulSoup(requests.post(api_url, data=data).content, "html.parser")
_, direction, metres_per_second, *_ = soup.csv.text.split(",")
knots = float(metres_per_second) * 1.9438445
print(direction, metres_per_second, knots)
Prints:
210 006.58 12.79049681
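If you want to log readings repeatedly, it may help to separate the parsing from the request. The sketch below is offline and rests on an assumption: the sample payload is invented so that its second and third fields match the printed output above, and the real response may carry more or different fields. parse_measurement is a hypothetical helper name.

```python
KNOTS_PER_METRE_PER_SECOND = 1.9438445  # same factor as in the code above


def parse_measurement(csv_text):
    """Pick direction and wind speed out of the comma-separated payload."""
    _, direction, metres_per_second, *_ = csv_text.split(",")
    speed = float(metres_per_second)  # "006.58" parses to 6.58
    return {
        "direction": direction,
        "speed_ms": speed,
        "speed_knots": speed * KNOTS_PER_METRE_PER_SECOND,
    }


# Hypothetical payload; only fields 2 and 3 are taken from the output above
sample = "1,210,006.58,0"
reading = parse_measurement(sample)
print(reading["direction"], reading["speed_ms"], round(reading["speed_knots"], 2))
```

In a logging loop you would call the API once per interval, pass soup.csv.text to this function, and append the result to a file together with a timestamp.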

requests.get does not return all the information on the web page

I'm trying to use web scraping (Python) to download a video from a website.
In the following screenshot, I can see the div I want to select.
But when I try to extract this content with BeautifulSoup, I get an empty result.
BeautifulSoup is not the problem: response.text (see the following code) does not contain this section, although the divs surrounding it are present in the response.
import requests
from bs4 import BeautifulSoup
url = "https://www.jw.org/nsi/library/bible/nwt/books/genesis/1"
response = requests.get(url)
if response.status_code != 404:
    soup = BeautifulSoup(response.text, "html.parser")
    print(soup.select("div[class^='video-js']"))
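A small offline sketch of the situation described: the same selector finds nothing in the HTML the server sends, but matches once a script has injected the player. Both HTML strings are invented for illustration and are not the site's real markup; count_matches is a hypothetical helper.

```python
from bs4 import BeautifulSoup


def count_matches(html, selector):
    """Return how many elements in html match the CSS selector."""
    return len(BeautifulSoup(html, "html.parser").select(selector))


# Hypothetical markup as delivered by the server: the wrapper exists but is empty
raw_html = '<div class="wrapper"><div id="player-slot"></div></div>'

# Hypothetical markup after JavaScript has run in a browser
rendered_html = (
    '<div class="wrapper"><div id="player-slot">'
    '<div class="video-js vjs-default"></div>'
    '</div></div>'
)

selector = "div[class^='video-js']"
print(count_matches(raw_html, selector))
print(count_matches(rendered_html, selector))
```

Surrounding divs that are present while the inner one is missing usually points to content injected by a script, so the options are a browser-driven tool or finding the request the script makes for the video data.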

BeautifulSoup url scraping

Trying out BeautifulSoup for the first time.
I have this link http://www.mediafire.com/download/alv8dq6k35n4m2k/For+You.zip
I want to catch the direct download url from the download button which is
http://download2110.mediafire.com/niz8p9iu6r9g/alv8dq6k35n4m2k/For+You.zip
What I have tried so far.
import requests
from bs4 import BeautifulSoup
url = "http://www.mediafire.com/download/alv8dq6k35n4m2k/For+You.zip"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.findAll('a')
I thought the last call, findAll('a'), would find all the links on that page, but I could not find the direct download url in my links list.
Am I doing something wrong here? If so, how can I grab that link with BeautifulSoup? When I inspect the element in the Chrome Developer Console, I can see that the link is there.
You can try this to extract the url from the javascript:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.mediafire.com/download/alv8dq6k35n4m2k/For+You.zip")
soup = BeautifulSoup(r.content, "html.parser")
link = soup.find("div", {"class": "download_link"})
# Escape the dot before "zip" and match lazily so only the URL itself is captured
url = re.findall(r"https?://\S+?\.zip", link.text)[0]
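The exact pattern matters when pulling a URL out of script text: an unescaped . matches any character, .* is greedy, and p? makes the p optional. Here is a minimal offline comparison. The sample string is made up to stand in for link.text and is not MediaFire's actual script; only the download URL is taken from the question.

```python
import re

# Hypothetical script text standing in for link.text
sample = (
    'url = "http://download2110.mediafire.com/niz8p9iu6r9g/alv8dq6k35n4m2k/For+You.zip"; '
    'note = "see http://example.com/azix";'
)

# Loose pattern: greedy, unescaped dot, optional "p"
loose = re.findall("http.*.zip?", sample)

# Stricter pattern: stops at whitespace or quotes and requires a literal ".zip"
strict = re.findall(r"https?://[^\s\"']+?\.zip", sample)

print(loose)
print(strict)
```

On this sample the loose pattern runs past the first URL and into the second string, while the stricter one returns only the download link.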
