Web-scraping URL Construction - python

Consider the URL:
https://en.wikipedia.org/wiki/NGC_2808
When I use this directly as my url in temp = requests.get(url).text, everything works fine.
Now consider the string name = "NGC2808". When I do s = name[:3] + '_' + name[3:] and then url = 'https://en.wikipedia.org/wiki/' + s, the program doesn't work anymore.
Here is the code snippet:
s = name[:3] + '_' + name[3:]
url0 = 'https://en.wikipedia.org/wiki/' + s
url = requests.get(url0).text
soup = BeautifulSoup(url,"lxml")
soup.prettify()
table = soup.find('table',{'class':'infobox'})
tags = table.find_all('tr')
Here is the error:
AttributeError: 'NoneType' object has no attribute 'find_all'
Edit:
The name isn't actually defined explicitly as "NGC2808"; it comes from scanning a .txt file. But print(name) outputs NGC2808. When I provide the name directly, without scanning the file, I get no error. Why is this happening?

Providing a minimal reproducible example and a copy of the error message would have helped greatly here, and might have allowed for more insight into your issue.
Nevertheless, the following works for me:
name = "NGC2808"
s = name[:3] + '_' + name[3:]
url = 'https://en.wikipedia.org/wiki/' + s
temp = requests.get(url).text
print(temp)
Edited due to question changes:
The error you have provided suggests that Beautiful Soup was unable to find a table with class 'infobox' in the document returned by your GET request. Have you checked the URL you are passing to that request, and the content it returns?
As it stands, I am able to get a list of tags (such as you seem to want) with the following:
import requests
from bs4 import BeautifulSoup
import lxml
name = "NGC2808"
s = name[:3] + '_' + name[3:]
url = 'https://en.wikipedia.org/wiki/' + s
temp = requests.get(url).text
soup = BeautifulSoup(temp,"lxml")
soup.prettify()
table = soup.find('table',{'class':'infobox'})
tags = table.find_all('tr')
print(tags)
The way that the line s = name[:3] + '_' + name[3:] is indented is curious and suggests that there is detail missing from the top of your example. It would be useful to have this context, as it could be that whatever logic is involved there results in a malformed URL being passed to your GET request.
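One quick diagnostic (a sketch) is to print the repr of the constructed URL before making the request; repr() exposes stray whitespace, newlines, or control characters that print() hides:
s = name[:3] + '_' + name[3:]
url0 = 'https://en.wikipedia.org/wiki/' + s
# repr() makes hidden whitespace or control characters visible
print(repr(url0))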

If this only happens when reading from a file, then there must be some special (Unicode) or whitespace characters in your name string. If you're using PyCharm, do some debugging; or simply print the name string (just after reading it from the file) using pprint() or repr() to reveal the problem character. Let's take some example code where the normal print function won't show the special character but pprint does:
from bs4 import BeautifulSoup
from pprint import pprint
import requests

# Suppose this is an article id fetched from the file
article_id = "NGC2808 "

# print() does not reveal the trailing special character
print(article_id)

# repr() makes the special character visible
print(repr(article_id))

# pprint() also shows the quoted string, special character included
pprint(article_id)

# Now this code fails to find the infobox table
article_id_mod = article_id[:3] + '_' + article_id[3:]
url = 'https://en.wikipedia.org/wiki/' + article_id_mod
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
table = soup.find('table', {'class': 'infobox'})
if table:
    tags = table.find_all('tr')
    print(tags)
Now, to resolve this, you can do the following:
In case of extra whitespace at the beginning/end of the string, use the strip() method:
article_id = article_id.strip()
If there are special characters, use an appropriate regex, or simply open the file in an editor like VS Code/Sublime/Notepad++ and use the find/replace option.
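For completeness, here is a minimal sketch of the file-reading loop with strip() applied (the file name names.txt is an assumption; the original .txt file is not shown):
# Hypothetical names.txt: one identifier per line, e.g. "NGC2808"
with open('names.txt') as f:
    for line in f:
        name = line.strip()  # drops the trailing newline and any stray whitespace
        s = name[:3] + '_' + name[3:]
        url = 'https://en.wikipedia.org/wiki/' + s
        print(repr(url))     # sanity check before requesting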

I get InvalidURL: URL can't contain control characters when I try to send a request using urllib

I am trying to get a JSON response from the link passed as a parameter to the urllib request, but I get an error saying the URL can't contain control characters.
How can I solve this issue?
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
source = urllib.request.urlopen(start_url).read()
The error I get is:
URL can't contain control characters. '/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq=' (found at least ' ')
If the problem is the whitespace, try replacing it with a percent-encoded space:
url = url.replace(" ", "%20")
Spaces are not allowed in a URL; I removed them and it seems to work now:
import urllib.request
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
url = start_url.replace(" ","")
source = urllib.request.urlopen(url).read()
Solr search strings can get pretty weird. Better to use the quote() method to encode characters before making the request. See the example below:
import urllib.request
from urllib.parse import quote
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
# safe= keeps the URL's structural characters intact; spaces become %20
source = urllib.request.urlopen(quote(start_url, safe=':/?&=')).read()
Better late than never...
You probably already found out by now, but let's get it written down here.
There can't be any space characters in the URL, and there are some: after bundle_fq, after sm_vid_Sectors_fq=, and after dm_field_deadlineTo_fq.
Remove those and you're good to go.
As the error message says, there are control characters in your URL, which doesn't look like a valid one anyway.
You need to encode the control characters inside the URL. In particular, spaces need to be encoded as %20.
Parsing the URL first and then quoting its elements works:
import urllib.request
from urllib.parse import urlparse, quote

def make_safe_url(url: str) -> str:
    """Return a parsed and quoted URL."""
    _url = urlparse(url)
    # Quote the path and query; keep '=' and '&' so the query parameters survive
    url = _url.scheme + "://" + _url.netloc + quote(_url.path) + "?" + quote(_url.query, safe='=&')
    return url
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
start_url = make_safe_url(start_url)
source = urllib.request.urlopen(start_url).read()
The code returns the JSON document despite the double forward slash and the whitespace in the URL.

Use item name stored in old for loop, inside a new for loop

This program I'm working on searches through multiple paths (from a JSON list) on a URL to find one that's not being used (a 404 page).
The problem: I want to print the path when I come across a 404 (i.e., when I find an error div), but I can't figure out a way to do so, since the item name seems unreachable.
### Libraries ###
from bs4 import BeautifulSoup
import grequests
import requests
import json
import time

### User inputs ###
namelist = input('Your namelist: ')
print('---------------------------------------')
result = input('Output file: ')
print('---------------------------------------')

### Scrape ###
names = json.loads(open(namelist + '.json').read())
reqs = (grequests.get('https://steamcommunity.com/id/' + name) for name in names)
resp = grequests.imap(reqs, grequests.Pool(10))
for r in resp:
    soup = BeautifulSoup(r.text, 'lxml')
    findelement = soup.find_all('div', attrs={'class': "error_ctn"})
    if (findelement):
        print(name)
    else:
        print('trying')
I think you can do this by modifying where your for loop is located. I'm not familiar with the libraries you're using, so I've left a comment where you might need to modify the code further, but something along these lines should work:
names = json.loads(open(namelist + '.json').read())
for name in names:
    req = grequests.get('https://steamcommunity.com/id/' + name)
    # May need to modify this line since only passing one req, so are assured of only one response
    resp = grequests.imap(req, grequests.Pool(10))
    # There should only be one response now.
    soup = BeautifulSoup(resp.text, 'lxml')
    findelement = soup.find_all('div', attrs={'class': "error_ctn"})
    if (findelement):
        print(name)
    else:
        print('trying')
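An alternative sketch that keeps the concurrent requests while recovering the name from each response URL (this assumes Steam does not redirect the /id/ pages, so the requested name is still the last path segment of r.url):
import json
from bs4 import BeautifulSoup
import grequests

names = json.loads(open('namelist.json').read())  # hypothetical input file
base = 'https://steamcommunity.com/id/'
reqs = (grequests.get(base + name) for name in names)

# imap yields responses in completion order, not request order, so the name
# has to be recovered from each response rather than from the loop variable.
for r in grequests.imap(reqs, size=10):
    name = r.url.rstrip('/').rsplit('/', 1)[-1]
    soup = BeautifulSoup(r.text, 'lxml')
    if soup.find('div', attrs={'class': 'error_ctn'}):
        print(name)
    else:
        print('trying')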

Extracting follower count from Instagram

I am trying to pull the number of followers for a list of Instagram accounts. I have tried using the "find" method on the response text from Requests; however, the string I am looking for when I inspect the actual Instagram page no longer appears when I print "r" from the code below.
I was able to get this code to run successfully in the past; however, it no longer works.
Webscraping Instagram follower count BeautifulSoup
import requests
user = "espn"
url = 'https://www.instagram.com/' + user
r = requests.get(url).text
start = '"edge_followed_by":{"count":'
end = '},"followed_by_viewer"'
print(r[r.find(start)+len(start):r.rfind(end)])
I receive "-1", which means the substring passed to the find method was not found within the variable "r".
I think it's because of the last ' in start and the first ' in end... this will work:
import requests
import re
user = "espn"
url = 'https://www.instagram.com/' + user
r = requests.get(url).text
followers = re.search('"edge_followed_by":{"count":([0-9]+)}',r).group(1)
print(followers)
'14061730'
I want to suggest an updated solution to this question, as the 2019 answer from Derek Eden above no longer works, as stated in its comments.
The solution was to add r' before the regular expression in the re.search, like so:
follower_count = re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', response).group(1)
This r'' is really important: without it, Python treats the expression as a regular string, which leads to the query not giving any results.
Also, the Instagram page seems to have backslashes in the object we look for (at least in my tests), so the code example I use is the following, in Python 3.10 and working as of July 2022:
# get follower count of instagram profile
import os.path
import requests
import re
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# get instagram follower count
def get_instagram_follower_count(instagram_username):
    url = "https://www.instagram.com/" + instagram_username
    filename = "instagram.html"
    try:
        if not os.path.isfile(filename):
            r = requests.get(url, verify=False)
            print(r.status_code)
            print(r.text)
            response = r.text
            if not r.status_code == 200:
                raise Exception("Error: " + str(r.status_code))
            with open(filename, "w") as f:
                f.write(response)
        else:
            with open(filename, "r") as f:
                response = f.read()
        # print(response)
        follower_count = re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', response).group(1)
        return follower_count
    except Exception as e:
        print(e)
        return 0

print(get_instagram_follower_count('your.instagram.profile'))
The method returns the follower count as expected. Please note that I added a few lines so as not to hammer Instagram's web server and get blocked while testing, by simply saving the response to a file.
This is a slice of the original html content that contains the part we are looking for:
... mRL&s=1\",\"edge_followed_by\":{\"count\":110070},\"fbid\":\"1784 ...
I debugged the regex in RegExr; it seems to work fine at this point in time.
There are many posts about the regex r prefix, like this one.
The documentation of the re package also clearly shows that this is the issue with the code above.
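To see why the r prefix matters here, a small self-contained sketch (the text below simulates the escaped-quote slice of the page source shown above):
import re

# Simulated slice of the page source, with literal backslashes before the quotes
text = r'\"edge_followed_by\":{\"count\":110070}'

# Without r'', Python collapses each \\ to a single backslash before the regex
# engine sees it, so the pattern matches plain quotes -- which the text does
# not contain -> no match.
print(re.search('"edge_followed_by\\":{\\"count\\":([0-9]+)}', text))  # None

# With r'', the double backslashes reach the regex engine and match the
# literal backslash characters in the text.
print(re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', text).group(1))  # 110070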

TypeError Using regex and beautifulsoup

I'm working on some code that reads an html page, parses it with BeautifulSoup, and then uses regex to find some numbers (part of an assignment).
In an earlier assignment I used socket instead of urllib, and I know the error comes from data types (expecting string or bytes), but down the line I'm missing what I need to encode/decode to process the data. The error occurs at my re.findall.
Besides a fix, what is causing the issue? And, I guess more importantly, what are the data type differences? I seem to be missing something that should feel inherent.
Thanks ahead.
#Py3 urllib is urllib.request
import urllib.request
import re
#BeautifulSoup stuff, bs4 in Py3
from bs4 import *

#Raw input is now input() in Py3
#url = 'http://' + input('Enter - ')
url = urllib.request.urlopen('http://python-data.dr-chuck.net/comments_42.html')
html = url.read()

#html.parser is the parser that defaults. Useful most of the time (according to the web)
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the tags specified
tags = soup('span')
for tag in tags:
    print(re.findall('[0-9]+', tag))
So, I've been caught off guard with this before: BeautifulSoup returns objects, which just appear to be strings when you call print.
Just as a sanity check, try this:
import urllib.request
from bs4 import *
url = urllib.request.urlopen('http://python-data.dr-chuck.net/comments_42.html')
soup = BeautifulSoup(url.read(), 'html.parser')
single_tag = soup('span')[0]
print("Type is: \"%s\"; prints as \"%s\"" % (type(single_tag), single_tag))
print("As a string: \"%s\"; prints as \"%s\"" % (type(str(single_tag)), str(single_tag)))
The following should be output:
Type is: "<class 'bs4.element.Tag'>"; prints as "<span class="comments">97</span>"
As a string: "<class 'str'>"; prints as "<span class="comments">97</span>"
So, if you encapsulate "tag" in a str() call before sending it to the regex, that problem should be taken care of.
I've always found that adding sanity print(type(var)) checks when things start to complain about unexpected variable types to be a useful debugging technique!
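Applied to the code from the question, the fix is a one-line change (a sketch; tag.text would also work if only the visible number is wanted):
import re
import urllib.request
from bs4 import BeautifulSoup

url = urllib.request.urlopen('http://python-data.dr-chuck.net/comments_42.html')
soup = BeautifulSoup(url.read(), 'html.parser')

for tag in soup('span'):
    # str(tag) converts the bs4 Tag object to a plain string before the regex runs
    print(re.findall('[0-9]+', str(tag)))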

Remove newline in python with urllib

I am using Python 3.x. While using urllib.request to download a webpage, I am getting a lot of \n characters in between. I have tried to remove them using the methods given in other threads of the forum, but I am not able to do so. I have used the strip() function and the replace() function... but no luck! I am running this code in Eclipse. Here is my code:
import urllib.request

#Downloading entire Web Document
def download_page(a):
    opener = urllib.request.FancyURLopener({})
    try:
        open_url = opener.open(a)
        page = str(open_url.read())
        return page
    except:
        return ""

raw_html = download_page("http://www.zseries.in")
print("Raw HTML = " + raw_html)

#Remove line breaks
raw_html2 = raw_html.replace('\n', '')
print("Raw HTML2 = " + raw_html2)
I am not able to spot the reason for getting so many \n in the raw_html variable.
Your download_page() function corrupts the html (the str() call); that is why you see \n (two characters: \ and n) in the output. Don't use .replace() or similar workarounds; fix the download_page() function instead:
from urllib.request import urlopen

with urlopen("http://www.zseries.in") as response:
    html_content = response.read()
At this point html_content contains a bytes object. To get it as text, you need to know its character encoding e.g., to get it from Content-Type http header:
encoding = response.headers.get_content_charset('utf-8')
html_text = html_content.decode(encoding)
See A good way to get the charset/encoding of an HTTP response in Python.
If the server doesn't pass the charset in the Content-Type header, then there are complex rules for figuring out the character encoding of an html5 document; e.g., it may be specified inside the html document itself: <meta charset="utf-8"> (you would need an html parser to get it).
If you read the html correctly then you shouldn't see literal characters \n in the page.
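Putting those pieces together, a minimal consolidated sketch (the read and the header lookup happen inside the with block, before the response is closed):
from urllib.request import urlopen

with urlopen("http://www.zseries.in") as response:
    # Charset from the Content-Type header, with utf-8 as the fallback
    encoding = response.headers.get_content_charset('utf-8')
    html_text = response.read().decode(encoding)

print(html_text)  # real newlines now, no literal \n sequences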
If you look at the source you've downloaded, the \n escape sequences you're trying to replace() are actually escaped themselves: \\n. Try this instead:
import urllib.request

def download_page(a):
    opener = urllib.request.FancyURLopener({})
    open_url = opener.open(a)
    page = str(open_url.read()).replace('\\n', '')
    return page
I removed the try/except clause because generic except statements without targeting a specific exception (or class of exceptions) are generally bad. If it fails, you have no idea why.
It seems they are literal \n characters, so I suggest you do this:
raw_html2 = raw_html.replace('\\n', '')
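To see where the literal backslash-n comes from, here is a tiny demonstration of what str() does to a bytes object (a sketch with made-up html):
data = b'<html>\n</html>'

page = str(data)
print(page)                  # b'<html>\n</html>' -- the bytes repr; \n here is
                             # two literal characters, backslash and n
print(data.decode('utf-8'))  # decoding gives real text with a real newline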
