I am trying to extract reader comments from the following page with the code shown below, but the output file test.html does not contain any comments from the page. How do I get this information with Python?
http://www.theglobeandmail.com/opinion/it-doesnt-matter-who-won-the-debate-america-has-already-lost/article32314064/comments/
from bs4 import BeautifulSoup
import urllib
import urllib.request
import urllib.parse
req =urllib.request.Request('http://www.theglobeandmail.com/opinion/it-doesnt-matter-who-won-the-debate-america-has-already-lost/article32314064/comments/')
response = urllib.request.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page, 'html.parser')
f = open('test.html', 'w')
f.write(soup.prettify())
f.close()
Thanks!
The comments are retrieved using an AJAX request, which you can mimic.
You can see there are numerous parameters, but what is below is enough to get a result; I will leave it to you to figure out how you can influence the results:
from json import loads
from urllib.request import urlopen
from urllib.parse import urlencode

# Minimal set of parameters the endpoint needs to return results
data = {"categoryID": "Production",
        "streamID": "32314064",
        "APIKey": "2_oNjjtSC8Qc250slf83cZSd4sbCzOF4cCiqGIBF8__5dWzOJY_MLAoZvds76cHeQD",
        "callback": "foo"}

# POST the form-encoded parameters to the Gigya comments endpoint
r = urlopen("http://comments.us1.gigya.com/comments.getComments",
            data=urlencode(data).encode("utf-8"))

# Parse the JSON response and pull out the list of comment dicts
json_dcts = loads(r.read().decode("utf-8"))["comments"]
print(json_dcts)
That gives you a list of dicts that hold all the comments, upvotes, downvotes, etc. If you want to find the API key, it is in the src url of one of the scripts, src='https://cdns.gigya.com/js/socialize.js?apiKey=2_oNjjtSC8Qc250slf83cZSd4sbCzOF4cCiqGIBF8__5dWzOJY_MLAoZvds76cHeQD'; the streamID is in your original url.
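For example, assuming the usual Gigya field names such as "commentText", "posVotes" and "negVotes" (check the keys in your own output, they may differ), you could loop over the result like this:

# Field names below ("commentText", "posVotes", "negVotes") are assumptions;
# print one dict first to confirm the actual keys for this stream.
for dct in json_dcts:
    text = dct.get("commentText", "")
    ups = dct.get("posVotes", 0)
    downs = dct.get("negVotes", 0)
    print(ups, downs, text[:80])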
import requests
from bs4 import BeautifulSoup as bs
import csv
r = requests.get('https://portal.karandaaz.com.pk/dataset/total-population/1000')
soup = bs(r.text)
table = soup.find_all(class_='ag-header-cell-text')
This gives me an empty result. Any idea how to scrape data from this site? I would appreciate it.
BeautifulSoup can only see what's directly baked into the HTML of a resource at the time it is initially requested. The content you're trying to scrape isn't baked into the page, because normally, when you view this particular page in a browser, the DOM is populated asynchronously using JavaScript. Fortunately, logging your browser's network traffic reveals requests to a REST API, which serves the contents of the table as JSON. The following script makes an HTTP GET request to that API, given a desired "dataset_id" (you can change the key-value pair in the params dict as desired). The response is then dumped into a CSV file:
def main():

    import requests
    import csv

    url = "https://portal.karandaaz.com.pk/api/table"

    params = {
        "dataset_id": "1000"
    }

    response = requests.get(url, params=params)
    response.raise_for_status()

    content = response.json()

    filename = "dataset_{}.csv".format(params["dataset_id"])

    with open(filename, "w", newline="") as file:
        fieldnames = content["data"]["columns"]
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for row in content["data"]["rows"]:
            writer.writerow(dict(zip(fieldnames, row)))

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
The tag you're searching for isn't in the source code, which is why you're returning no data. Is there some reason you expect this to be there? You may be seeing different source code in a browser than you do when pulling it with the requests library.
You can view the code being pulled via:
import requests
from bs4 import BeautifulSoup as bs
import csv
r = requests.get('https://portal.karandaaz.com.pk/dataset/total-population/1000')
soup = bs(r.text, "lxml")
print( soup )
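If you want to double-check that claim, a quick (admittedly crude) test is to search the raw response for the class name; if it is absent, the table is being rendered later by JavaScript:

import requests

r = requests.get('https://portal.karandaaz.com.pk/dataset/total-population/1000')
# False here means the class never appears in the HTML the server sends,
# so BeautifulSoup has nothing to find.
print('ag-header-cell-text' in r.text)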
This is my first time posting, so apologies for any errors. I currently have a file with a list of URLs, and I am trying to create a Python program which will go to each URL, grab the text from the HTML page, and save it in a .txt file. I am currently using BeautifulSoup to scrape these sites, and many of them are throwing errors which I am unsure how to solve. I am looking for a better way to do this; I have posted my code below.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from urllib.request import Request
import datefinder
from dateutil.parser import parse
import json
import re
import random
import time
import scrapy
import requests
import urllib
import os.path
from os import path
#extracts page contents using beautifulSoup
def page_extract(url):
    req = Request(url,
                  headers={'User-Agent': 'Mozilla/5.0'})
    webpage = uReq(req, timeout=5).read()
    page_soup = soup(webpage, "lxml")
    return page_soup

#opens file that contains the links
file1 = open('links.txt', 'r')
lines = file1.readlines()

#for loop that iterates through the list of urls I have
for i in range(0, len(lines)):
    fileName = str(i) + ".txt"
    url = str(lines[i])
    print(i)
    try:
        #if the scraping is successful i would like it to save the text contents in a text file with the text file name
        #being the index
        soup2 = page_extract(url)
        text = soup2.text
        f = open("Politifact Files/" + fileName, "x")
        f.write(str(text))
        f.close()
        print(url)
    except:
        #otherwise save it to another folder which contains all the sites that threw an error
        f = open("Politifact Files Not Completed/" + fileName, "x")
        f.close()
        print("NOT DONE: " + url)
Thanks @Thierry Lathuille and @Dr Pi for your responses. I was able to find a solution to this problem by looking into Python libraries that can scrape the important text off a webpage. I came across one called 'Trafilatura' which is able to accomplish this task. The documentation for this library is at: https://pypi.org/project/trafilatura/.
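As a rough sketch of how that can look (based on Trafilatura's documented fetch_url/extract helpers; the URL is just a placeholder):

import trafilatura

url = "https://www.example.com/some-article"  # placeholder, substitute your own
downloaded = trafilatura.fetch_url(url)       # fetch the raw page
if downloaded is not None:
    text = trafilatura.extract(downloaded)    # pull out the main article text
    print(text)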
I'm trying to get other subset URLs from a main URL. However, as I print to see if I get the content, I noticed that I am only getting the HTML, not the URLs within it.
import urllib.request

file = 'http://example.com'

with urllib.request.urlopen(file) as url:
    collection = url.read().decode('UTF-8')
I think this is what you are looking for.
You can use the Beautiful Soup library for Python; this code should work with Python 3:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_all_urls(url):
    response = urlopen(url)
    url_html = BeautifulSoup(response, 'html.parser')
    for link in url_html.find_all('a'):
        links = str(link.get('href'))
        if links.startswith('http'):
            print(links)
        else:
            print(url + str(links))

get_all_urls('url.com')
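One thing to watch out for: url + links only produces valid links for some relative paths. If that becomes a problem, urllib's urljoin resolves relative hrefs properly; here is the same idea with that one change (a sketch, not part of the original answer):

from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_all_urls(url):
    html = BeautifulSoup(urlopen(url), 'html.parser')
    for link in html.find_all('a'):
        href = link.get('href')
        if href:
            # urljoin resolves relative hrefs against the page URL
            print(urljoin(url, href))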
import urllib.request as urllib2 #To query website
from bs4 import BeautifulSoup #To parse website
import pandas as pd
#specify the url and open
url3 = 'http://www.thatscricket.com/ipl/2014/results/index.html'
req = urllib2.urlopen(url3)
soup = BeautifulSoup(req,"html5lib")
all_tables=soup.find_all('table')
print(all_tables)
If you look at the content of your requested data
content = req.read()
and examine it:
print(content)
you will find, surprisingly, that there is no table in it!
But if you check the page source in a browser, you can see tables in it.
From what I could tell, there is some problem with urllib.request: an escape sequence on the page causes urllib to fetch only part of the page.
So I was able to fix the problem by using requests instead of urllib.
First, install requests:
pip install requests
Then change your code to this:
import requests
from bs4 import BeautifulSoup
url3 = 'http://www.thatscricket.com/ipl/2014/results/index.html'
req = requests.get(url3)
soup = BeautifulSoup(req.content,"html5lib")
all_tables=soup.find_all('table')
print(all_tables)
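Since pandas is already imported in the question, another option (assuming the tables parse cleanly and lxml or html5lib is installed) is to let pandas read them directly into DataFrames:

import requests
import pandas as pd

url3 = 'http://www.thatscricket.com/ipl/2014/results/index.html'
req = requests.get(url3)
# read_html returns one DataFrame per <table> element it finds
tables = pd.read_html(req.text)
print(len(tables))
print(tables[0].head())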
I'm trying to scrape a webpage using BeautifulSoup using the code below:
import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen("http://en.wikipedia.org//wiki//Markov_chain.htm") as url:
    s = url.read()

soup = BeautifulSoup(s)

with open("scraped.txt", "w", encoding="utf-8") as f:
    f.write(soup.get_text())
    f.close()
The problem is that it saves Wikipedia's main page instead of that specific article. Why doesn't the address work, and how should I change it?
The correct url for the page is http://en.wikipedia.org/wiki/Markov_chain:
>>> import urllib.request
>>> from bs4 import BeautifulSoup
>>> url = "http://en.wikipedia.org/wiki/Markov_chain"
>>> soup = BeautifulSoup(urllib.request.urlopen(url))
>>> soup.title
<title>Markov chain - Wikipedia, the free encyclopedia</title>
@alecxe's answer will generate the following warning:
**GuessedAtParserWarning**:
No parser was explicitly specified, so I'm using the best
available HTML parser for this system ("html.parser"). This usually isn't a problem,
but if you run this code on another system, or in a different virtual environment, it
may use a different parser and behave differently. The code that caused this warning
is on line 25 of the file crawl.py.
To get rid of this warning, pass the additional argument 'features="html.parser"' to
the BeautifulSoup constructor.
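In other words, the warning can also be silenced by simply naming the parser in that answer's call, for example:

import urllib.request
from bs4 import BeautifulSoup

url = "http://en.wikipedia.org/wiki/Markov_chain"
# Passing features="html.parser" explicitly avoids GuessedAtParserWarning
soup = BeautifulSoup(urllib.request.urlopen(url), features="html.parser")
print(soup.title)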
Here is a solution without GuessedAtParserWarning using requests:
# crawl.py
import requests
from bs4 import BeautifulSoup
from os import path

url = 'https://www.sap.com/belgique/index.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
file = path.join(path.dirname(__file__), 'downl.txt')

# Either print the title/text or save it to a file
print(soup.title)

# download the text
with open(file, 'w') as f:
    f.write(soup.text)