Python get all the contents from a website to html file - python

Can someone please help? I want to save all the contents of a URL to an HTML file. I also have to send a User-Agent header.

Welcome to SO. When you ask a question, you need to include the code that you have tried; see the site's guidance on how to ask a question properly.
Regarding your question: when you say "I want to transfer all to contents from url to a html file", I am assuming you just want to read the page source and save it in a file.
import requests as r
from bs4 import BeautifulSoup
data = r.get("http://example.com", headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'})
soup = BeautifulSoup(data.text, 'html.parser')
# write() needs a str, so convert the soup explicitly
with open('myfile.html', 'w', encoding='utf-8') as file:
    file.write(str(soup))
Passing soup itself to the file raises TypeError: write() argument must be str, not Tag, which is why it is typecast to a string with str(soup).
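If you only need the raw page source, the same idea can be sketched with the standard library alone. This is a minimal sketch, not the answer's method: it swaps requests and BeautifulSoup for urllib, and the helper names (build_request, decode_body, save_html, download_page) are made up for illustration.

```python
from pathlib import Path
from urllib.request import Request, urlopen

USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that sends a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def decode_body(body, charset=None):
    """Decode raw response bytes, falling back to UTF-8 when no charset is declared."""
    return body.decode(charset or "utf-8", errors="replace")


def save_html(html, path):
    """Write the page source to disk as UTF-8 text."""
    Path(path).write_text(html, encoding="utf-8")


def download_page(url, path):
    """Fetch url with the custom User-Agent and save its source to path."""
    with urlopen(build_request(url)) as response:
        charset = response.headers.get_content_charset()
        html = decode_body(response.read(), charset)
    save_html(html, path)
```

Calling download_page("http://example.com", "myfile.html") would then produce the same kind of file as the code above, without the HTML being re-serialized by a parser.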

Because I don't know which site you need to scrape, I'll describe a few ways.
If the site has a JS frontend and you need to wait for it to load, I recommend the requests_html module, which has a method for rendering content:
from requests_html import HTMLSession
url = "https://some-url.org"
with HTMLSession() as session:
    response = session.get(url)
    response.html.render() # rendering JS code
    content = response.html.html # full content
If the site doesn't use JS for its frontend, then the requests module is a really good choice:
import requests
url = "https://some-url.org"
response = requests.get(url)
content = response.content # html content in bytes()
Otherwise you can use selenium webdriver, but it works rather slowly in Python.
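One rough way to decide between these approaches is to check whether text you can see in the browser is already present in the un-rendered HTML. The sketch below is only an illustration: the function name and both sample pages are made up, and a real page may need a more careful check.

```python
def appears_in_raw_html(raw_html, expected_text):
    """True if text seen in the browser is already present in the un-rendered HTML."""
    return expected_text.casefold() in raw_html.casefold()


# Hypothetical sample pages: one served complete, one filled in later by JavaScript.
static_page = "<html><body><h1>Rankings</h1><td>Alice</td></body></html>"
js_page = '<html><body><div id="app"></div><script src="bundle.js"></script></body></html>'

for name, page in (("static", static_page), ("js", js_page)):
    if appears_in_raw_html(page, "Alice"):
        print(name, "-> plain requests is enough")
    else:
        print(name, "-> content is probably rendered by JavaScript")
```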

Related

Why do I always get empty list thing to select a tag or css property by using Python?

I just started studying Python, requests and BeautifulSoup.
I'm using VSCode and Python version is 3.10.8
I want to get the HTML of the element with id 'taw' on a Google results page, but I can't. The result is always an empty list.
import requests
from bs4 import BeautifulSoup
url = 'https://www.google.com/search?client=safari&rls=en&q=프로그래밍+공부&ie=UTF-8&oe=UTF-8'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
find = soup.select('#taw')
print(find)
Here is the HTML containing the 'taw' element that I am trying to get.
Sorry for using an image instead of code.
The taw element contains Google's ads, and I want to scrape it. I tried other CSS selectors and tags, but the result is always an empty list. I tried soup.find, but I got None.
For various possible reasons, you don't always get the exact same HTML via Python's requests.get as what you see in your browser. Sometimes it's because of blockers or JavaScript loading, but for this specific page and element, it's just that Google formats the response a bit differently based on the source of the request. Try adding some headers:
import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://www.google.com/search?client=safari&rls=en&q=프로그래밍+공부&ie=UTF-8&oe=UTF-8'
response = requests.get(url, headers=headers)
try:
    response.raise_for_status() # just a good habit to check; raises on 4xx/5xx, returns None otherwise
except requests.HTTPError as reqErr:
    print(f'!"{reqErr}" - while getting ', url)
soup = BeautifulSoup(response.content, 'html.parser')
find = soup.select('#taw')
if not find: ## save html to check [IN AN EDITOR, NOT a browser] if expected elements are missing
    hfn = 'x.html'
    with open(hfn, 'wb') as f:
        f.write(response.content)
    print(f'saved html to "{hfn}"')
print(find)
The reqErr and if not find.... parts are just to help understand why in case you don't get the expected results. They're helpful for debugging in general for requests+bs4 scraping attempts.
The printed output I got with the code above was:
[<div id="taw"><div data-ved="2ahUKEwjvjrj04ev7AhV3LDQIHaXeDCkQL3oECAcQAg" id="oFNiHe"></div><div id="tvcap"></div></div>]

Why is BeautifulSoup leaving out parts of a website?

I'm completely new to python and wanted to dip my toes into web scraping. So I tried to scrape the rankings of players in https://www.fencingtimelive.com/events/competitors/F87F9E882BD6467FB9461F68E484B8B3#
But when I try to access the rankings and ratings of each player, it returns None. This is all inside the tbody, so I assume BeautifulSoup isn't able to access it because it's generated by JavaScript, but I'm not sure. Please help ._.
Input:
from bs4 import BeautifulSoup
import requests
URL_USAFencingOctoberNac_2022 = "https://www.fencingtimelive.com/events/competitors/F87F9E882BD6467FB9461F68E484B8B3"
October_Nac_2022 = requests.get(URL_USAFencingOctoberNac_2022)
October_Nac_2022 = BeautifulSoup(October_Nac_2022.text, "html.parser")
tbody = October_Nac_2022.tbody
print(tbody)
Output:
None
In this case the problem is not with BS4 but with your analysis before starting the scraping. The data which you are looking for is not available directly from the request you have made.
To get the data you have to make a request to a different back-end URL, https://www.fencingtimelive.com/events/competitors/data/F87F9E882BD6467FB9461F68E484B8B3?sort=name, which will give you a JSON response.
The code will look something like this:
from requests import get
url = 'https://www.fencingtimelive.com/events/competitors/data/F87F9E882BD6467FB9461F68E484B8B3?sort=name'
# X-Requested-With is sent as its own header, not as part of the User-Agent string
response = get(url, headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0', 'X-Requested-With': 'XMLHttpRequest'})
print(response.json())
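The relationship between the page URL and the back-end URL above can be sketched as a small helper. This only mirrors the one URL pair shown in this thread; whether other pages on the site follow the same pattern is an assumption, and the function name is made up.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def competitors_data_url(page_url, sort="name"):
    """Turn .../competitors/<event id> into .../competitors/data/<event id>?sort=<sort>."""
    parts = urlsplit(page_url)
    # Split the path into everything before the last segment and the event id itself.
    head, _, event_id = parts.path.rstrip("/").rpartition("/")
    data_path = f"{head}/data/{event_id}"
    # The fragment (a trailing "#") is dropped; it is never sent to the server anyway.
    return urlunsplit((parts.scheme, parts.netloc, data_path, urlencode({"sort": sort}), ""))


page = "https://www.fencingtimelive.com/events/competitors/F87F9E882BD6467FB9461F68E484B8B3#"
print(competitors_data_url(page))
```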
If you want to try out BS4, consider the example below, which fetches the blog posts from the linked page:
from requests import get
from bs4 import BeautifulSoup as bs
url = "https://www.zyte.com/blog/"
response = get(url, headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux ...'})
soup = bs(response.content, 'html.parser')
posts = soup.find_all('div', {"class":"oxy-posts"})
print(len(posts))
Note:
Before writing scraping code, analyse the website thoroughly. It will give you an idea of the website's data sources.

Get Bing search results in Python

I am trying to make a chatbot that can get Bing search results using Python. I've tried many websites, but they all use old Python 2 code or Google. I am currently in China and cannot access YouTube, Google, or anything else related to Google (Can't use Azure and Microsoft Docs either). I want the results to be like this:
This is the title
https://this-is-the-link.com
This is the second title
https://this-is-the-second-link.com
Code
import requests
import bs4
import re
import urllib.request
from bs4 import BeautifulSoup
page = urllib.request.urlopen("https://www.bing.com/search?q=programming")
soup = BeautifulSoup(page.read())
links = soup.findAll("a")
for link in links:
    print(link["href"])
And it gives me
/?FORM=Z9FD1
javascript:void(0);
javascript:void(0);
/rewards/dashboard
/rewards/dashboard
javascript:void(0);
/?scope=web&FORM=HDRSC1
/images/search?q=programming&FORM=HDRSC2
/videos/search?q=programming&FORM=HDRSC3
/maps?q=programming&FORM=HDRSC4
/news/search?q=programming&FORM=HDRSC6
/shop?q=programming&FORM=SHOPTB
http://go.microsoft.com/fwlink/?LinkId=521839
http://go.microsoft.com/fwlink/?LinkID=246338
https://go.microsoft.com/fwlink/?linkid=868922
http://go.microsoft.com/fwlink/?LinkID=286759
https://go.microsoft.com/fwlink/?LinkID=617297
Any help would be greatly appreciated (I'm using Python 3.6.9 on Ubuntu)
Actually, the code you've written works properly; the problem is in the HTTP request headers. By default urllib uses Python-urllib/{version} as the User-Agent header value, which makes it easy for a website to recognize your request as automatically generated. To avoid this, use a custom value, which can be achieved by passing a Request object as the first parameter of urlopen():
from urllib.parse import urlencode, urlunparse
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
query = "programming"
url = urlunparse(("https", "www.bing.com", "/search", "", urlencode({"q": query}), ""))
custom_user_agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
req = Request(url, headers={"User-Agent": custom_user_agent})
page = urlopen(req)
# Further code I've left unmodified
soup = BeautifulSoup(page.read())
links = soup.findAll("a")
for link in links:
    print(link["href"])
P.S. Take a look at the comment left by @edd under your question.
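To get output in the "title, then link" shape asked for in the question, the anchors still need filtering: entries such as javascript:void(0); should be dropped and relative paths made absolute. The following is a rough sketch using only the standard library's HTMLParser instead of BeautifulSoup; the class name and the sample HTML are made up, and it does not try to tell real search results apart from navigation links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResultLinkParser(HTMLParser):
    """Collect (title, absolute URL) pairs from <a> tags, skipping script pseudo-links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # collected (title, absolute_url) pairs
        self._href = None    # href of the <a> currently open, if it is kept
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and not href.startswith(("javascript:", "#")):
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = " ".join("".join(self._text).split())
            if title:
                self.links.append((title, self._href))
            self._href = None


# Hypothetical sample standing in for a downloaded results page.
sample = """
<a href="javascript:void(0);">menu</a>
<a href="/images/search?q=programming">Images</a>
<h2><a href="https://example.com/one">First <b>title</b></a></h2>
"""
parser = ResultLinkParser("https://www.bing.com/")
parser.feed(sample)
for title, link in parser.links:
    print(title)
    print(link)
```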

Python Request not getting all data

I'm trying to scrape data from Google translate for educational purpose.
Here is the code
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
#https://translate.google.com/#view=home&op=translate&sl=en&tl=en&text=hello
#tlid-transliteration-content transliteration-content full
class Phonetizer:
    def __init__(self,sentence : str,language_ : str = 'en'):
        self.words=sentence.split()
        self.language=language_
    def get_phoname(self):
        for word in self.words:
            print(word)
            url="https://translate.google.com/#view=home&op=translate&sl="+self.language+"&tl="+self.language+"&text="+word
            print(url)
            req = Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0'})
            webpage = urlopen(req).read()
            f= open("debug.html","w+")
            f.write(webpage.decode("utf-8"))
            f.close()
            #print(webpage)
            bsoup = BeautifulSoup(webpage,'html.parser')
            phonems = bsoup.findAll("div", {"class": "tlid-transliteration-content transliteration-content full"})
            print(phonems)
            #break
The problem is that the HTML I get back contains no element with the class tlid-transliteration-content transliteration-content full.
But using inspect in the browser, I found that the phonemes are inside an element with this class (screenshot):
I have saved the HTML, and it contains no tlid-transliteration-content transliteration-content full. It is also unlike the Google Translate page in the browser: it is incomplete. I have heard that Google blocks crawlers, bots and spiders, and that its systems detect them easily, so I added the extra header, but I still can't access the whole page.
How can I access the whole page and read all the data from the Google Translate page?
Want to contribute on this project?
I have tried this code below :
from requests_html import AsyncHTMLSession
asession = AsyncHTMLSession()
lang = "en"
word = "hello"
url="https://translate.google.com/#view=home&op=translate&sl="+lang+"&tl="+lang+"&text="+word
async def get_url():
    r = await asession.get(url)
    print(r)
    return r
results = asession.run(get_url)
for result in results:
    print(result.html.url)
    print(result.html.find('#tlid-transliteration-content'))
    print(result.html.find('#tlid-transliteration-content transliteration-content full'))
It gives me nothing, till now.
Yes, this happens because some content is generated by JavaScript when the page loads. What you see in the browser is the final DOM, after JavaScript has manipulated it (adding content). To solve this you could use selenium, but it has downsides such as speed and memory usage. A more modern and better way, in my opinion, is to use requests-html, which can replace both bs4 and urllib and has a render method, as mentioned in its documentation.
Here is sample code using requests_html. Keep in mind that what you are trying to print contains non-ASCII characters, so you might run into issues printing it in some editors like Sublime; it ran fine using cmd.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get("https://translate.google.com/#view=home&op=translate&sl=en&tl=en&text=hello")
r.html.render()
css = ".source-input .tlid-transliteration-content"
print(r.html.find(css, first=True).text)
# output: heˈlō,həˈlō
First of all, I would suggest using the Google Translate API instead of scraping the Google page. The API is far easier, hassle-free, and the legal and conventional way of doing this.
However, if you want to fix this, here is the solution.
You are not dealing with Bot detection here. Google's bot detection is so strong it would just open the google re-captcha page and not even show your desired web-page.
The problem here is that the translation results are not returned by the URL you have used. That URL just serves the basic translator page; the results are fetched later by JavaScript and shown after the page has loaded. python-requests does not execute JavaScript, which is why the class doesn't even exist in the page you are accessing.
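One detail worth seeing for yourself: in the URL from the question, everything after the # is a fragment, and fragments are handled by the browser and never sent to the server. A short sketch with the standard library shows what the server actually receives:

```python
from urllib.parse import urldefrag, urlsplit

url = "https://translate.google.com/#view=home&op=translate&sl=en&tl=en&text=hello"

parts = urlsplit(url)
# urldefrag separates the part that goes over the wire from the fragment.
sent_to_server, fragment = urldefrag(url)

print("query string:", repr(parts.query))
print("requested URL:", sent_to_server)
print("fragment kept by the client:", fragment)
```

The query string is empty, so the word to translate never reaches the server in this request; only JavaScript running in a browser reads it from the fragment.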
The solution is to trace the network requests and find which URL the JavaScript uses to fetch the results. Fortunately, I have found the desired URL for this purpose.
If you request https://translate.google.com/translate_a/single?client=webapp&sl=en&tl=fr&hl=en&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&dt=gt&source=bh&ssel=0&tsel=0&kc=1&tk=327718.241137&q=goodmorning, you will get the response of translator as JSON. You can parse the JSON to get the desired results.
Here, too, you can face bot detection, which can straight away return an HTTP 403 error.
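Before calling such an endpoint it helps to take its query string apart and see which parameters it carries. The sketch below only inspects and rewrites the URL quoted above, without sending any request; the helper names are made up, and whether the server accepts a changed q together with the original tk value is an untested assumption.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlsplit, urlunsplit

ENDPOINT = (
    "https://translate.google.com/translate_a/single?client=webapp&sl=en&tl=fr&hl=en"
    "&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&dt=gt"
    "&source=bh&ssel=0&tsel=0&kc=1&tk=327718.241137&q=goodmorning"
)


def query_params(url):
    """Return the query string as a dict of lists (repeated keys are kept)."""
    return parse_qs(urlsplit(url).query)


def replace_param(url, key, value):
    """Return url with every occurrence of key set to value, order preserved."""
    parts = urlsplit(url)
    pairs = [(k, value if k == key else v) for k, v in parse_qsl(parts.query)]
    return urlunsplit(parts._replace(query=urlencode(pairs)))


params = query_params(ENDPOINT)
print(params["sl"], params["tl"], params["q"])
print(len(params["dt"]), "dt values")
print(replace_param(ENDPOINT, "q", "good morning"))
```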
You can also use selenium to process the JavaScript and give you the results. The following changes in your code can fix it using selenium:
from selenium import webdriver
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
#https://translate.google.com/#view=home&op=translate&sl=en&tl=en&text=hello
#tlid-transliteration-content transliteration-content full
class Phonetizer:
    def __init__(self,sentence : str,language_ : str = 'en'):
        self.words=sentence.split()
        self.language=language_
    def get_phoname(self):
        for word in self.words:
            print(word)
            url="https://translate.google.com/#view=home&op=translate&sl="+self.language+"&tl="+self.language+"&text="+word
            print(url)
            #req = Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0'})
            #webpage = urlopen(req).read()
            driver = webdriver.Chrome()
            driver.get(url)
            webpage = driver.page_source
            driver.close()
            f= open("debug.html","w+")
            f.write(webpage) # page_source is already a str, so there is nothing to decode
            f.close()
            #print(webpage)
            bsoup = BeautifulSoup(webpage,'html.parser')
            phonems = bsoup.findAll("div", {"class": "tlid-transliteration-content transliteration-content full"})
            print(phonems)
            #break
You should scrape this page with JavaScript support, since the content you're looking for is "hiding" inside a <script> tag, which urllib does not execute.
I would suggest using Selenium or another equivalent framework.
Take a look here: Web-scraping JavaScript page with Python

How to get the right source code with Python from the URLs using my web crawler?

I'm trying to use python to write a web crawler. I'm using re and requests module. I want to get urls from the first page (it's a forum) and get information from every url.
My problem now is, I already store the URLs in a List. But I can't get further to get the RIGHT source code of these URLs.
Here is my code:
import re
import requests
url = 'http://bbs.skykiwi.com/forum.php?mod=forumdisplay&fid=55&typeid=470&sortid=231&filter=typeid&pageNum=1&page=1'
sourceCode = getsourse(url) # source code of the url page
allLinksinPage = getallLinksinPage(sourceCode) #a List of the urls in current page
for eachLink in allLinksinPage:
    url = 'http://bbs.skykiwi.com/' + eachLink.encode('utf-8')
    html = getsourse(url) #THIS IS WHERE I CAN'T GET THE RIGHT SOURCE CODE
#To get the source code of current url
def getsourse(url):
    header = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 10.0; WOW64; Trident/8.0; Touch)'}
    html = requests.get(url, headers=header)
    return html.text
#To get all the links in current page
def getallLinksinPage(sourceCode):
    bigClasses = re.findall('<th class="new">(.*?)</th>', sourceCode, re.S)
    allLinks = []
    for each in bigClasses:
        everylink = re.findall('</em><a href="(.*?)" onclick', each, re.S)[0]
        allLinks.append(everylink)
    return allLinks
You define your functions after you use them, so your code will raise an error. You should also not be using re to parse HTML; use a parser like BeautifulSoup, as below. Also use urljoin (urllib.parse.urljoin in Python 3, urlparse.urljoin in Python 2) to join the base URL to the links. What you actually want is the hrefs of the anchor tags inside the div with the id threadlist:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin # Python 3; on Python 2 this was: from urlparse import urljoin
url = 'http://bbs.skykiwi.com/forum.php?mod=forumdisplay&fid=55&typeid=470&sortid=231&filter=typeid&pageNum=1&page=1'
def getsourse(url):
    header = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 10.0; WOW64; Trident/8.0; Touch)'}
    html = requests.get(url, headers=header)
    return html.content
#To get all the links in current page
def getallLinksinPage(sourceCode):
    soup = BeautifulSoup(sourceCode, 'html.parser')
    return [a["href"] for a in soup.select("#threadlist a.xst")]
sourceCode = getsourse(url) # source code of the url page
allLinksinPage = getallLinksinPage(sourceCode) #a List of the urls in current page
for eachLink in allLinksinPage:
    url = 'http://bbs.skykiwi.com/'
    html = getsourse(urljoin(url, eachLink))
    print(html)
If you print urljoin(url, eachLink) in the loop you see you get all the correct links for the table and the correct source code returned, below is a snippet of the links returned:
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3177846&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3197510&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3201399&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3170748&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3152747&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3168498&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3176639&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3203657&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3190138&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3140191&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3199154&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3156814&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3203435&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3089967&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3199384&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3173489&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3204107&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231
If you visit the links above in your browser you will see that you get the correct page. Using http://bbs.skykiwi.com/forum.php?mod=viewthread&tid=3187289&extra=page%3D1%26filter%3Dtypeid%26typeid%3D470%26sortid%3D231%26typeid%3D470%26sortid%3D231 from your results, you will see:
Sorry, specified thread does not exist or has been deleted or is being reviewed
[New Zealand day-dimensional network Community Home]
You can clearly see the difference in the URLs. If you wanted yours to work, you would need to do a replace in your regex:
everylink = re.findall('</em><a href="(.*?)" onclick', each.replace("&","%26"), re.S)[0]
But really, don't parse HTML with a regex.
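As a small illustration of the urljoin step, here is a sketch that also decodes HTML entities before joining. It assumes (this is not confirmed in the thread) that hrefs pulled out of the raw HTML by a regex still contain entities such as &amp;, which an HTML parser would have decoded for you; the helper name and the shortened href are made up.

```python
from html import unescape
from urllib.parse import urljoin


def absolute_link(base_url, raw_href):
    """Decode HTML entities in an href taken from raw markup, then resolve it against base_url."""
    return urljoin(base_url, unescape(raw_href))


base = "http://bbs.skykiwi.com/"
# Hypothetical href as it might appear in the page source, before entity decoding.
raw_href = "forum.php?mod=viewthread&amp;tid=3177846"
print(absolute_link(base, raw_href))
```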
