Scraping in Python shows None value [duplicate]

This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 1 year ago.
import requests
from bs4 import BeautifulSoup as bs
import csv
r = requests.get('https://portal.karandaaz.com.pk/dataset/total-population/1000')
soup = bs(r.text)
table = soup.find_all(class_='ag-header-cell-text')
This gives me a None value. Any idea how to scrape data from this site? I would appreciate it.

BeautifulSoup can only see what's directly baked into the HTML of a resource at the time it is initially requested. The content you're trying to scrape isn't baked into the page, because normally, when you view this particular page in a browser, the DOM is populated asynchronously using JavaScript. Fortunately, logging your browser's network traffic reveals requests to a REST API, which serves the contents of the table as JSON. The following script makes an HTTP GET request to that API, given a desired "dataset_id" (you can change the key-value pair in the params dict as desired). The response is then dumped into a CSV file:
def main():

    import requests
    import csv

    url = "https://portal.karandaaz.com.pk/api/table"
    params = {
        "dataset_id": "1000"
    }

    response = requests.get(url, params=params)
    response.raise_for_status()
    content = response.json()

    filename = "dataset_{}.csv".format(params["dataset_id"])

    with open(filename, "w", newline="") as file:
        fieldnames = content["data"]["columns"]
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for row in content["data"]["rows"]:
            writer.writerow(dict(zip(fieldnames, row)))

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

The tag you're searching for isn't in the source code, which is why you're returning no data. Is there some reason you expect this to be there? You may be seeing different source code in a browser than you do when pulling it with the requests library.
You can view the code being pulled via:
import requests
from bs4 import BeautifulSoup as bs
import csv
r = requests.get('https://portal.karandaaz.com.pk/dataset/total-population/1000')
soup = bs(r.text, "lxml")
print( soup )
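If you just want a quick check rather than reading through the whole dump, here is a minimal sketch that tests whether the class name from the question appears anywhere in the raw HTML (the class name is copied from the original code above):
import requests

r = requests.get('https://portal.karandaaz.com.pk/dataset/total-population/1000')

# False here means the table markup is added later by JavaScript,
# so BeautifulSoup will never see it in the raw response.
print('ag-header-cell-text' in r.text)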

Related

How to create list of urls from csv file to iterate?

I am working on a web-scraping script. It works fine, but now I want to replace the single URL with a CSV file that contains thousands of URLs, like this:
url1
url2
url3
...
urlX
The first lines of my scraping code are basic:
from bs4 import BeautifulSoup
import requests
from csv import writer
url= "HERE THE URL FROM EACH LINE OF THE CSV FILE"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
How can I tell Python to use the URLs from the CSV file? I thought about building a dict, but I don't really know how to do that. Does anyone have a solution? I know it probably seems very simple to you, but it would be very useful for me.
If this is just a list of URLs, you don't really need the csv module. But here is a solution assuming the URL is in column 0 of the file. You want a csv reader, not a writer, and then it's a simple case of iterating the rows and taking action.
from bs4 import BeautifulSoup
import requests
import csv
with open("url-collection.csv", newline="") as fileobj:
for row in csv.reader(fileobj):
# TODO: add try/except to handle errors
url = row[0]
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
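If the file really is just one URL per line, here is a minimal sketch that skips the csv module entirely (the filename url-collection.csv is carried over from the snippet above; adjust it to yours):
from bs4 import BeautifulSoup
import requests

with open("url-collection.csv") as fileobj:
    for line in fileobj:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')
        # ...scrape what is needed for this url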

How to webscrape old school website that uses frames

I am trying to webscrape a government site that uses frameset.
Here is the URL - https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm
I've tried using splinter/selenium
url = "https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm"
browser.visit(url)
time.sleep(10)
full_xpath_frame = '/html/frameset/frameset/frame[2]'
tree = browser.find_by_xpath(full_xpath_frame)
for i in tree:
    print(i.text)
It just returns an empty string.
I've tried using the requests library.
import requests
from lxml import html
url = "https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm"
# get response object
response = requests.get(url)
# get byte string
data = response.content
print(data)
And it returns this
b"<html>\r\n<head>\r\n<meta http-equiv='Content-Type'\r\ncontent='text/html; charset=iso-
8859-1'>\r\n<title>Lake_ County Election Results</title>\r\n</head>\r\n<FRAMESET rows='20%,
*'>\r\n<FRAME src='titlebar.htm' scrolling='no'>\r\n<FRAMESET cols='20%, *'>\r\n<FRAME
src='menu.htm'>\r\n<FRAME src='Lake_ElecSumm_all.htm' name='reports'>\r\n</FRAMESET>
\r\n</FRAMESET>\r\n<body>\r\n</body>\r\n</html>\r\n"
I've also tried using Beautiful Soup and it gave me the same thing. Is there another Python library I can use in order to get the data that's inside the second table?
Thank you for any feedback.
As mentioned, you could go for the frames and their src attributes:
BeautifulSoup(r.text).select('frame')[1].get('src')
or directly to the menu.htm:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/menu.htm')
link_list = ['https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults'+a.get('href') for a in BeautifulSoup(r.text).select('a')]
for link in link_list[:1]:
    r = requests.get(link)
    soup = BeautifulSoup(r.text)
    ###...scrape what is needed
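For completeness, here is a minimal sketch of the first approach: fetch the frameset page, read a frame's src, and resolve it against the page URL with urljoin. The frame index is an assumption taken from the markup shown in the question (index 2 is the 'reports' frame), so adjust it if the page changes:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm'
r = requests.get(base)

# The frameset page only references other documents; grab a frame's src
# (index 2 corresponds to the 'reports' frame in the question's markup).
frame_src = BeautifulSoup(r.text, 'html.parser').select('frame')[2].get('src')
frame_url = urljoin(base, frame_src)

soup = BeautifulSoup(requests.get(frame_url).text, 'html.parser')
# ...scrape what is needed from the frame's own HTML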

Web Scraping - Python - need assistance

This is my first time posting, so apologies if there are any errors. I currently have a file with a list of URLs, and I am trying to create a Python program which will go to each URL, grab the text from the HTML page, and save it in a .txt file. I am currently using BeautifulSoup to scrape these sites, and many of them are throwing errors which I am unsure how to solve. I am looking for a better way to do this; I have posted my code below.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from urllib.request import Request
import datefinder
from dateutil.parser import parse
import json
import re
import random
import time
import scrapy
import requests
import urllib
import os.path
from os import path
#extracts page contents using beautifulSoup
def page_extract(url):
    req = Request(url,
                  headers={'User-Agent': 'Mozilla/5.0'})
    webpage = uReq(req, timeout=5).read()
    page_soup = soup(webpage, "lxml")
    return page_soup

#opens file that contains the links
file1 = open('links.txt', 'r')
lines = file1.readlines()

#for loop that iterates through the list of urls I have
for i in range(0, len(lines)):
    fileName = str(i)+".txt"
    url = str(lines[i])
    print(i)
    try:
        #if the scraping is successful i would like it to save the text contents in a text file
        #with the text file name being the index
        soup2 = page_extract(url)
        text = soup2.text
        f = open("Politifact Files/"+fileName,"x")
        f.write(str(text))
        f.close()
        print(url)
    except:
        #otherwise save it to another folder which contains all the sites that threw an error
        f = open("Politifact Files Not Completed/"+fileName,"x")
        f.close()
        print("NOT DONE: "+url)
Thanks @Thierry Lathuille and @Dr Pi for your responses. I was able to find a solution to this problem by looking into Python libraries that can scrape the important text off of a webpage. I came across one called 'Trafilatura' which is able to accomplish this task. The documentation for this library is at: https://pypi.org/project/trafilatura/.
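For reference, here is a minimal sketch of how Trafilatura can replace the BeautifulSoup step above, using the library's documented fetch_url and extract functions; the folder and file naming just mirror the original script:
import trafilatura

with open('links.txt') as file1:
    urls = [line.strip() for line in file1 if line.strip()]

for i, url in enumerate(urls):
    downloaded = trafilatura.fetch_url(url)      # None if the download fails
    text = trafilatura.extract(downloaded) if downloaded else None
    if text:
        with open("Politifact Files/" + str(i) + ".txt", "w") as f:
            f.write(text)
    else:
        print("NOT DONE: " + url)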

Python Extract Table from URL to csv

Extracting the "2016-Annual" table in http://www.americashealthrankings.org/api/v1/downloads/131 to a csv. The table has 3 fields- STATE, RANK, VALUE. Getting error with the following:
import urllib2
from bs4 import BeautifulSoup
import csv
url = 'http://www.americashealthrankings.org/api/v1/downloads/131'
header = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
table = soup.find('2016-Annual', {'class': 'STATE-RANK-VALUE'})
f = open('output.csv', 'w')
for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) == 3:
        STATE = cells[0].find(text=True)
        RANK = cells[1].find(text=True)
        VALUE = cells[2].find(text=True)
        print write_to_file
        f.write(write_to_file)
f.close()
What am I missing here? I am using Python 2.7.
Your code is wrong: 'http://www.americashealthrankings.org/api/v1/downloads/131' downloads a CSV file directly.
Download the CSV file to your local computer and you can use that file.
#!/usr/bin/env python
# coding:utf-8
'''黄哥Python'''
import urllib2
url = 'http://www.americashealthrankings.org/api/v1/downloads/131'
html = urllib2.urlopen(url).read()
with open('output.csv', 'w') as output:
    output.write(html)
According to the BeautifulSoup docs, you need to pass a string to be parsed on initialization. However, page = urllib2.urlopen(req) returns a file-like response object, not the page markup as a string.
Try using soup = BeautifulSoup(page.read(), 'html.parser') instead.
Also, the variable write_to_file doesn't exist.
If this doesn't solve it, please also post which error you get.
The reason it's not working is that you're pointing to a file that is already a CSV; you can literally load that URL in your browser and it will download in CSV format. The table you're expecting, though, is not at that endpoint; it is at this URL:
http://www.americashealthrankings.org/explore/2016-annual-report
Also, I don't see a class called STATE-RANK-VALUE; I only see th headers called state, rank, and value.
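Once the CSV has been saved locally (for example with the urllib2 snippet above), here is a minimal sketch for reading it back on Python 2.7; no column names are assumed beyond whatever the file's header row contains:
import csv

with open('output.csv') as f:
    reader = csv.reader(f)
    header = next(reader)   # first row: column names
    print(header)
    for row in reader:
        print(row)          # each row is a list of field values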

Python urllib is not extracting reader comments from a website

I am trying to extract reader comments from the following page with the code shown below. But the output html test.html does not contain any comments from the page. How do I get this information with Python?
http://www.theglobeandmail.com/opinion/it-doesnt-matter-who-won-the-debate-america-has-already-lost/article32314064/comments/
from bs4 import BeautifulSoup
import urllib
import urllib.request
import urllib.parse
req = urllib.request.Request('http://www.theglobeandmail.com/opinion/it-doesnt-matter-who-won-the-debate-america-has-already-lost/article32314064/comments/')
response = urllib.request.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page, 'html.parser')
f = open('test.html', 'w')
f.write(soup.prettify())
f.close()
Thanks!
The comments are retrieved using an ajax request, which you can mimic.
You can see there are numerous parameters, but what is below is enough to get a result; I will leave it to you to figure out how you can influence the results:
from json import loads
from urllib.request import urlopen
from urllib.parse import urlencode
data = {"categoryID":"Production",
"streamID":"32314064",
"APIKey":"2_oNjjtSC8Qc250slf83cZSd4sbCzOF4cCiqGIBF8__5dWzOJY_MLAoZvds76cHeQD",
"callback" :"foo",}
r = urlopen("http://comments.us1.gigya.com/comments.getComments", data=urlencode(data).encode("utf-8"))
json_dcts = loads(r.read().decode("utf-8"))["comments"]
print(json_dcts)
That gives you a list of dicts that hold all the comments, upvotes, negvotes, etc. If you want to find the API key, it is in the src URL of one of the page's scripts, src='https://cdns.gigya.com/js/socialize.js?apiKey=2_oNjjtSC8Qc250slf83cZSd4sbCzOF4cCiqGIBF8__5dWzOJY_MLAoZvds76cHeQD', and the streamID is in your original URL.
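Here is a minimal sketch of pulling those two values out of the article page instead of hard-coding them; it assumes the apiKey still appears in a script src on the page, as described above:
import re
from urllib.request import urlopen

article_url = 'http://www.theglobeandmail.com/opinion/it-doesnt-matter-who-won-the-debate-america-has-already-lost/article32314064/comments/'
html = urlopen(article_url).read().decode('utf-8')

# streamID is the numeric article id embedded in the URL itself.
stream_id = re.search(r'article(\d+)', article_url).group(1)

# The Gigya API key shows up in a script src like .../socialize.js?apiKey=...
match = re.search(r'apiKey=([\w.\-]+)', html)
api_key = match.group(1) if match else None

print(stream_id, api_key)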
