Response 404, but URL is reachable from python

I have a web crawler, but currently a 404 error occurs when calling requests.get(url) from the requests module, even though the URL is reachable.
base_url = "https://www.blogger.com/profile/"
site = base_url + blogs_to_visit.pop().rsplit('/', 1)[-1]
r = requests.get(site)
soup = BeautifulSoup(r.content, "html.parser")
# Printing some values for debugging
>>> print site
https://www.blogger.com/profile/01785989747304686024
>>> print r
<Response [404]>
However, if I hardcode the string site for the requests module as the exact same string, the response is 202.
site = "https://www.blogger.com/profile/01785989747304686024"
# Printing some values for debugging
>>> print site
https://www.blogger.com/profile/01785989747304686024
>>> print r
<Response [202]>
What just struck me is that it looks like there is a hidden newline after printing site the first time. Might that be what's causing the problem?
The URLs to visit were earlier stored in a file with:
for link in soup.select("h2 a[href]"):
    blogs.write(link.get("href") + "\n")
and fetched with:
with open("foo") as p:
    return p.readlines()
The question is then: what would be a better way of writing them to the file? If I don't separate them with "\n", for example, all the URLs are glued together as one.

In reference to Getting rid of \n when using .readlines(), perhaps use:
with open("foo") as p:
return p.read().splitlines()
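For comparison, readlines() keeps the trailing newline on each element, while splitlines() drops it. A quick illustration, assuming foo holds one URL per line as written above:
with open("foo") as p:
    print(p.readlines())          # e.g. ['http://example.com/post\n', ...]

with open("foo") as p:
    print(p.read().splitlines())  # e.g. ['http://example.com/post', ...]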

You can use:
r = requests.get(site.strip('\n'))
instead of:
r = requests.get(site)
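Putting the two fixes together: read the stored URLs back without newlines, and strip defensively before requesting. A minimal sketch of the crawl loop, using the names from the question:
import requests
from bs4 import BeautifulSoup

base_url = "https://www.blogger.com/profile/"

with open("foo") as p:
    blogs_to_visit = p.read().splitlines()  # no trailing "\n" on each URL

while blogs_to_visit:
    site = base_url + blogs_to_visit.pop().rsplit('/', 1)[-1]
    r = requests.get(site.strip())  # strip() guards against any stray whitespace
    soup = BeautifulSoup(r.content, "html.parser")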


Web-scraping URL Construction

Consider the URL:
https://en.wikipedia.org/wiki/NGC_2808
When I use this directly as my url in temp = requests.get(url).text, everything works fine.
Now consider the string name = "NGC2808". When I do s = name[:3] + '_' + name[3:] and then url = 'https://en.wikipedia.org/wiki/' + s, the program doesn't work anymore.
This is the code snippet:
s = name[:3] + '_' + name[3:]
url0 = 'https://en.wikipedia.org/wiki/' + s
url = requests.get(url0).text
soup = BeautifulSoup(url,"lxml")
soup.prettify()
table = soup.find('table',{'class':'infobox'})
tags = table.find_all('tr')
Here is the error:
AttributeError: 'NoneType' object has no attribute 'find_all'
Edit :
The name isn't really explicitly defined as "NGC2808" but rather comes from scanning a .txt file. But print(name) results in NGC2808. Now when I provide the name directly, without scanning the file, I get no error. Why is this happening?
Providing a minimal reproducible example and a copy of the error message would have helped greatly here and may have allowed for greater insight on your issue.
Nevertheless, the following works for me:
name = "NGC2808"
s = name[:3] + '_' + name[3:]
url = 'https://en.wikipedia.org/wiki/' + s
temp = requests.get(url).text
print(temp)
Edited due to question changes:
The error you have provided suggests that Beautiful Soup has been unable to find any matching table in the document returned by your GET request. Have you checked the URL you are passing to that request, and also the content returned?
As it stands I am able to get a list of tags (such as you seem to want) with the following:
import requests
from bs4 import BeautifulSoup
import lxml
name = "NGC2808"
s = name[:3] + '_' + name[3:]
url = 'https://en.wikipedia.org/wiki/' + s
temp = requests.get(url).text
soup = BeautifulSoup(temp,"lxml")
soup.prettify()
table = soup.find('table',{'class':'infobox'})
tags = table.find_all('tr')
print(tags)
The way that the line s = name[:3] + '_' + name[3:] is indented is curious and suggests that there is detail missing from the top of your example. It may be useful to have this context, as it could be that whatever logic is involved there results in your passing a malformed URL to your GET request.
If it only happens when reading from a file source, then there must be some special (Unicode) or whitespace characters in your name string. If you're using PyCharm, do some debugging, or simply print the name string (just after reading it from the file) using the pprint() or repr() method to see the problem-causing character. Let's take an example where the normal print function won't show the special character but pprint does...
from bs4 import BeautifulSoup
from pprint import pprint
import requests

# Suppose this is an article id fetched from the file (note the trailing space)
article_id = "NGC2808 "

# print will not show the special character
print(article_id)

# You can reveal the special character using the repr() method
print(repr(article_id))

# pprint shows the quoted representation, exposing the special character
pprint(article_id)

# Now the malformed id builds a bad URL, so no infobox table is found
article_id_mod = article_id[:3] + '_' + article_id[3:]
url = 'https://en.wikipedia.org/wiki/' + article_id_mod
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
table = soup.find('table', {'class': 'infobox'})
if table:
    tags = table.find_all('tr')
    print(tags)
Now, to resolve the same, you can do:
In case of extra whitespace at the beginning/end of the string: use the strip() method
article_id = article_id.strip()
If there are special characters: use an appropriate regex, or simply open the file in an editor like VS Code/Sublime/Notepad++ and utilize the find/replace option.
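A minimal sketch of cleaning each name as it is read from the .txt file (the filename names.txt is just a placeholder):
import re

with open("names.txt") as f:
    # strip() removes surrounding whitespace, including the trailing "\n";
    # the regex then drops any remaining non-alphanumeric character
    names = [re.sub(r'[^A-Za-z0-9]', '', line.strip()) for line in f]

print(names)  # e.g. ['NGC2808']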

Extracting follower count from Instagram

I am trying to pull the number of followers from a list of Instagram accounts. I have tried using the "find" method on the response text from requests; however, the string I am looking for when I inspect the actual Instagram page no longer appears when I print "r" from the code below.
I was able to get this code to run successfully in the past; however, it will no longer run.
Webscraping Instagram follower count BeautifulSoup
import requests
user = "espn"
url = 'https://www.instagram.com/' + user
r = requests.get(url).text
start = '"edge_followed_by":{"count":'
end = '},"followed_by_viewer"'
print(r[r.find(start)+len(start):r.rfind(end)])
I receive a "-1" error, which means the substring from the find method was not found within the variable "r".
I think it's because of the last ' in start and the first ' in end... this will work:
import requests
import re
user = "espn"
url = 'https://www.instagram.com/' + user
r = requests.get(url).text
followers = re.search('"edge_followed_by":{"count":([0-9]+)}',r).group(1)
print(followers)
'14061730'
I want to suggest an updated solution to this question, as Derek Eden's answer above from 2019 does not work anymore, as stated in its comments.
The solution was to add the r prefix before the regular expression in the re.search, like so:
follower_count = re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', response).group(1)
This r'' prefix is really important, as without it Python treats the expression as a regular string, which leads to the query not giving any results.
Also, the Instagram page seems to have backslashes in the object we look for, at least in my tests, so the code example I use is the following, in Python 3.10 and working as of July 2022:
# get follower count of instagram profile
import os.path
import requests
import re
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# get instagram follower count
def get_instagram_follower_count(instagram_username):
    url = "https://www.instagram.com/" + instagram_username
    filename = "instagram.html"
    try:
        if not os.path.isfile(filename):
            r = requests.get(url, verify=False)
            print(r.status_code)
            print(r.text)
            response = r.text
            if not r.status_code == 200:
                raise Exception("Error: " + str(r.status_code))
            with open(filename, "w") as f:
                f.write(response)
        else:
            with open(filename, "r") as f:
                response = f.read()
        # print(response)
        follower_count = re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', response).group(1)
        return follower_count
    except Exception as e:
        print(e)
        return 0

print(get_instagram_follower_count('your.instagram.profile'))
The method returns the follower count as expected. Please note that I added a few lines to avoid hammering Instagram's web server and getting blocked while testing, by simply saving the response to a file.
This is a slice of the original HTML content that contains the part we are looking for:
... mRL&s=1\",\"edge_followed_by\":{\"count\":110070},\"fbid\":\"1784 ...
I debugged the regex in RegExr; it seems to work just fine at this point in time.
There are many posts about the regex r prefix, like this one.
Also, the documentation of the re package clearly shows that this is the issue with the code above.
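As a quick illustration of why the raw-string prefix matters here, a minimal check against the HTML slice quoted above:
import re

# the page contains literal backslashes before the quotes
text = 'mRL&s=1\\",\\"edge_followed_by\\":{\\"count\\":110070},\\"fbid\\":\\"1784'

# without the r prefix, '\\"' collapses to an escaped quote, so the
# literal backslash in the page is never matched
print(re.search('"edge_followed_by\\":{\\"count\\":([0-9]+)}', text))            # None
print(re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', text).group(1))  # 110070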

How do I print JSON results from async.get() on urls in python 2.7?

In Python, using async.get from the requests library, I'd like to print the responses from async.map():
from requests import async
url_list = ['abc.com','xyz.com']
rs = [async.get(u) for u in url_list]
a = async.map(rs)
print a
This gives me the result
[<Response[200]>,<Response[200]>]
I'd like to print the JSON responses obtained from async.map() for the URLs. Thanks.
The way to do it would be to add a .text, but note that async.map(rs) returns a list of responses, so access .text on each one:
a = async.map(rs)
for response in a:
    print response.text
This will give you the JSON within each response.
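If you want the bodies parsed rather than printed as raw text, a minimal sketch using the standard json module (assuming each URL really returns JSON):
import json

a = async.map(rs)
for response in a:
    data = json.loads(response.text)  # parse the JSON body into a Python object
    print data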

blocking api redirect with requests in python 3.4

I am creating a Python program that uses an online thesaurus and returns synonyms. Unfortunately, it will sometimes take a word that is spelled wrong and redirect to a page for a word that is close to it, which is sometimes problematic. How can I stop it from redirecting? I would appreciate any advice. This is the code that applies:
def get_synonym(the_word):
    # return a dictionary of the thesaurus results of the word
    theurl = (the api key for the thesaurus)
    new_word = the_word + "/json"
    theurl = theurl + new_word
    r = requests.get(theurl)
    thewords = r.text  # all the text for the results
    from json import loads
    thewords = loads(thewords)  # make a dictionary of terms
    return thewords  # return dictionary of synonyms for the_word
Use the allow_redirects=False keyword argument:
r = requests.get(url, allow_redirects=False)
By default, requests follows redirects on all methods except HEAD.
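With redirects disabled, a misspelled word comes back as the 3xx response itself, so you can detect it instead of silently landing on a nearby word's page. A minimal sketch, assuming the thesaurus signals corrections with a redirect:
r = requests.get(theurl, allow_redirects=False)
if r.is_redirect:
    # the thesaurus tried to send us to a different (corrected) word
    print("No exact match; redirect target:", r.headers.get("Location"))
else:
    thewords = r.json()  # parse the JSON results directly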

Unwrapping t.co link that links to bit.ly

I'm trying to get a URL that has already been shortened by bit.ly and again by Twitter. I have already tried:
import urllib.request
r = urllib.request.urlopen(url)
r.url
I have also tried libraries such as requests and httplib2.
All these solutions would work if I wanted the final destination of the t.co link; however, I need the intermediate shortener, which I know I can get via a HEAD request, but I can't get Python 3's http.client working in order to get the Location. Any ideas?
>>> c = http.client.HTTPConnection('t.co')
>>> c.request('GET', '/7fGoazTYpc') # or HEAD, but body is empty anyway
>>> r = c.getresponse()
>>> r.getheader('Location')
'http://bit.ly/900913'
requests automatically follows redirects, but it lets you access all URLs via the history attribute.
>>> r = requests.get('http://bit.ly/UG4ECS')
>>> r.url
u'http://www.fontsquirrel.com/fonts/exo'
>>> r.history
(<Response [301]>,)
>>> r.history[0].url
u'http://bit.ly/UG4ECS'
>>>
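The same first hop can also be fetched with requests by turning off redirect handling, a minimal sketch mirroring the http.client example above:
>>> r = requests.head('http://t.co/7fGoazTYpc', allow_redirects=False)
>>> r.headers.get('Location')
'http://bit.ly/900913'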
