I wrote a program with urllib that gets all article titles from a webpage (in this case nytimes.com). There is only one problem: some titles contain a curly apostrophe, which comes out as an ugly "There\xe2\x80\x99s" when printed. So I tried to replace the \xe2\x80\x99 with a ' but it does not seem to work. I think there is a problem with tuples, but I can't create a tuple of my own that reproduces the problem.
import urllib.request
import urllib.parse
import re
url = 'https://www.nytimes.com/'
headers = {}
headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux i686)'
req = urllib.request.Request(url, headers = headers)
resp = urllib.request.urlopen(req)
resp_data = resp.read()
par = re.findall(r'story-heading">(.*?)',str(resp_data))
for n in par:
    print(n[1])
    print(n[1].replace("\xe2\x80\x99", "'"))
I tried to create string variables from the tuple but nothing is working. I know there is another solution to this with BeautifulSoup but I thought I'd try to find my own way.
You have to change this one line:
resp_data = resp.read()
to:
resp_data = resp.read().decode("utf8")
And the work will be done.
Explanation:
The website is using UTF-8 encoding (I'm guessing), so you have to decode the returned bytes into a str, which prints the way you intended.
PS: You can call resp.read().decode() with no argument; decode() then defaults to UTF-8.
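A quick sketch of the difference (the byte string below is a stand-in for what resp.read() returns):
raw = b'There\xe2\x80\x99s'   # bytes, as returned by resp.read()
print(str(raw))               # b'There\xe2\x80\x99s' -- the escape codes leak into the text
print(raw.decode('utf-8'))    # There’s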
You are seeing the repr() of the string, hence the funny characters. If you want, coerce this to a string. See my results:
>>> print repr(n[1])
'There\xe2\x80\x99s'
>>> print str(n[1])
There’s
In Summary: wrap your n[1] in str()
I have been trying to figure out how to use python-requests to send a request whose URL looks like:
http://example.com/api/add.json?name='hello'&data[]='hello'&data[]='world'
Normally I can build a dictionary and do:
data = {'name': 'hello', 'data': 'world'}
response = requests.get('http://example.com/api/add.json', params=data)
That works fine for almost everything I do. However, I have hit the URL structure above, and I am not sure how to produce it in Python without manually building strings. I can do that, but would rather not.
Is there something in the requests library I am missing or some python feature I am unaware of?
Also what do you even call that type of parameter so I can better google it?
All you need to do is put the values in a list and use the array-style name as the key:
data = {'name': 'hello', 'data[]': ['hello', 'world']}
response = requests.get('http://example.com/api/add.json', params=data)
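If you print response.url you can check the result (example.com is just the placeholder from the question, so the request itself returns nothing useful):
import requests

data = {'name': 'hello', 'data[]': ['hello', 'world']}
response = requests.get('http://example.com/api/add.json', params=data)
print(response.url)
# http://example.com/api/add.json?name=hello&data%5B%5D=hello&data%5B%5D=world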
What you are doing is correct. The resulting URL is the same as what you are expecting.
>>> payload = {'name': 'hello', 'data': 'hello'}
>>> r = requests.get("http://example.com/api/params", params=payload)
You can see the resulting URL:
>>> print(r.url)
http://example.com/api/params?name=hello&data=hello
According to the URL format:
In particular, encoding the query string uses the following rules:
- Letters (A–Z and a–z), numbers (0–9) and the characters ., -, ~ and _ are left as-is
- SPACE is encoded as + or %20
- All other characters are encoded as %HH hex representation, with any non-ASCII characters first encoded as UTF-8 (or another specified encoding)
So the [] in data[] will not pass through as-is; it is automatically replaced according to these rules:
If you build a URL like:
http://example.com/api/add.json?name='hello'&data[]='hello'&data[]='world'
the output will be:
>>> payload = {'name': 'hello', "data[]": 'hello','data[]':'world'}
>>> r = requests.get("http://example.com/api/params", params=payload)
>>> r.url
u'http://example.com/api/params?data%5B%5D=world&name=hello'
This is because a dict cannot hold duplicate keys, so the last value wins, and data[] is percent-encoded as data%5B%5D.
If data%5B%5D is not a problem (i.e., the server is able to parse it correctly), then you can go ahead with it.
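If the key really does need to be repeated, note that requests also accepts a list of tuples for params, which sidesteps the duplicate-key problem (a sketch against the same placeholder URL):
# A dict silently keeps only the last duplicate key; a list of tuples does not
payload = [('name', 'hello'), ('data[]', 'hello'), ('data[]', 'world')]
r = requests.get("http://example.com/api/params", params=payload)
print(r.url)
# http://example.com/api/params?name=hello&data%5B%5D=hello&data%5B%5D=world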
One solution, if using the requests module is not compulsory, is the urllib/urllib2 combination:
import urllib
import urllib2

payload = [('name', 'hello'), ('data[]', ('hello', 'world'))]
params = urllib.urlencode(payload, doseq=True)
sampleRequest = urllib2.Request('http://example.com/api/add.json?' + params)
response = urllib2.urlopen(sampleRequest)
It's a little more verbose and uses the doseq(uence) flag to encode the URL parameters, but I used it before I knew about the requests module.
For the requests module, the answer provided by @Tomer should work.
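For Python 3, the same doseq approach moves to urllib.parse and urllib.request (a sketch against the same placeholder endpoint):
from urllib.parse import urlencode
from urllib.request import Request, urlopen

payload = [('name', 'hello'), ('data[]', ('hello', 'world'))]
params = urlencode(payload, doseq=True)   # data[] is repeated once per value
sample_request = Request('http://example.com/api/add.json?' + params)
response = urlopen(sample_request)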
Some API servers expect a JSON array as the value in the URL query string. The requests params argument doesn't create a JSON array as a parameter value.
The way I fixed this on a similar problem was to use urllib.parse.urlencode to encode the query string, append it to the URL, and pass that to requests, e.g.:
from urllib.parse import urlencode

query_str = urlencode(params)
url = base_url + "?" + query_str   # base_url, params and headers are defined elsewhere
response = requests.get(url, params={}, headers=headers)
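If the server literally wants a JSON array as the value, one option (a hedged sketch; the endpoint and parameter names are placeholders) is to serialize the list yourself before encoding:
import json
import requests
from urllib.parse import urlencode

params = {'name': 'hello', 'data': json.dumps(['hello', 'world'])}
url = 'http://example.com/api/add.json?' + urlencode(params)
response = requests.get(url)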
The solution is simply to use the well-known function urlencode:
>>> import urllib.parse
>>> params = {'q': 'Python URL encoding', 'as_sitesearch': 'www.urlencoder.io'}
>>> urllib.parse.urlencode(params)
'q=Python+URL+encoding&as_sitesearch=www.urlencoder.io'
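Note that urlencode encodes spaces as + by default; if the server expects %20 instead, pass quote_via:
>>> urllib.parse.urlencode(params, quote_via=urllib.parse.quote)
'q=Python%20URL%20encoding&as_sitesearch=www.urlencoder.io'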
I use urllib.request and a regex for HTML parsing, but when I write to a JSON file there are double backslashes in the text. How can I replace them with single backslashes?
I have looked at many solutions, but none of them have worked.
import io
import json
import re
from urllib.request import Request, urlopen

headers = {}
headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'
req = Request('https://www.manga-tr.com/manga-list.html', headers=headers)
response = urlopen(req).read()
a = re.findall(r'<b><a[^>]* href="([^"]*)"', str(response))
sub_req = Request('https://www.manga-tr.com/'+a[3], headers=headers)
sub_response = urlopen(sub_req).read()
manga = {}
manga['manga'] = []
manga_subject = re.findall(r'<h3>Tan.xc4.xb1t.xc4.xb1m<.h3>[^<]*.n.t([^<]*).t', str(sub_response))
manga['manga'].append({'msubject': manga_subject})
with io.open('allmanga.json', 'w', encoding='utf-8-sig') as outfile:
    outfile.write(json.dumps(manga, indent=4))
This is my JSON file:
{
    "manga": [
        {
            "msubject": [
                " Minami Ria 16 ya\\xc5\\x9f\\xc4\\xb1ndad\\xc4\\xb1r. \\xc4\\xb0lk erkek arkada\\xc5\\x9f\\xc4\\xb1 sakatani jirou(16) ile yakla\\xc5\\x9f\\xc4\\xb1k 6 ayd\\xc4\\xb1r beraberdir. Herkes taraf\\xc4\\xb1ndan \\xc3\\xa7ifte kumru olarak g\\xc3\\xb6r\\xc3\\xbclmelerine ra\\xc4\\x9fmen ili\\xc5\\x9fkilerinde %1\\'lik bir eksiklik vard\\xc4\\xb1r. Bu eksikli\\xc4\\x9fi tamamlayabilecekler mi?"
            ]
        }
    ]
}
Why Is This Happening?
The error is when str is used to convert a bytes object to a str. This does not do the conversion in the desired way.
a = re.findall(r'<b><a[^>]* href="([^"]*)"',str(response))
# ^^^
For example, if the response is the word "Tanıtım", it would be expressed in UTF-8 as b'Tan\xc4\xb1t\xc4\xb1m'. If you then use str on that, you get:
In [1]: response = b'Tan\xc4\xb1t\xc4\xb1m'
In [2]: str(response)
Out[2]: "b'Tan\\xc4\\xb1t\\xc4\\xb1m'"
If you convert this to JSON, you'll see double backslashes (which are really just ordinary backslashes, encoded as JSON).
In [3]: import json
In [4]: print(json.dumps(str(response)))
"b'Tan\\xc4\\xb1t\\xc4\\xb1m'"
The correct way to convert a bytes object back to a str is by using the decode method, with the appropriate encoding:
In [5]: response.decode('UTF-8')
Out[5]: 'Tanıtım'
Note that the response is not valid UTF-8, unfortunately. The website operators appear to be serving corrupted data.
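The 'replace' error handler used in the fix below substitutes each undecodable byte with U+FFFD rather than raising an error; for example (the trailing \xff is a made-up invalid byte):
In [6]: b'Tan\xc4\xb1t\xc4\xb1m \xff'.decode('UTF-8', 'replace')
Out[6]: 'Tanıtım �'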
Quick Fix
Replace every call to str(response) with response.decode('UTF-8', 'replace') and update the regular expressions to match.
a = re.findall(
    # "r" prefix to string is unnecessary
    '<b><a[^>]* href="([^"]*)"',
    response.decode('UTF-8', 'replace'))
sub_req = Request('https://www.manga-tr.com/'+a[3],
                  headers=headers)
sub_response = urlopen(sub_req).read()
manga = {}
manga['manga'] = []
manga_subject = re.findall(
    # "r" prefix to string is unnecessary
    '<h3>Tanıtım</h3>([^<]*)',
    sub_response.decode('UTF-8', 'replace'))
manga['manga'].append({'msubject': manga_subject})
# io.open is the same as open
with open('allmanga.json', 'w', encoding='utf-8-sig') as fp:
    # json.dumps is unnecessary
    json.dump(manga, fp, indent=4)
Better Fix
Use "Requests"
The Requests library is much easier to use than urlopen. You will have to install it (with pip, apt, dnf, etc., whatever you use); it does not come with Python. It will look like this:
response = requests.get(
'https://www.manga-tr.com/manga-list.html')
And then response.text contains the decoded string, you don't need to decode it yourself. Easier!
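requests chooses the encoding from the Content-Type header; if a site mislabels its pages you can inspect or override it before reading response.text (a sketch):
print(response.encoding)                         # what requests will decode with
response.encoding = response.apparent_encoding   # fall back to charset detection
print(response.text[:100])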
Use BeautifulSoup
The Beautiful Soup library can search through HTML documents, and it is more reliable and easier to use than regular expressions. It also needs to be installed. You might use it like this, for example, to find the summary on a manga page:
soup = BeautifulSoup(response.text, 'html.parser')
subject = soup.find('h3', text='Tanıtım').next_sibling.string
Summary
Here is a Gist containing a more complete example of what the scraper might look like.
Keep in mind that scraping a website can be a bit difficult, just because you might scrape 100 pages and then suddenly discover that something is wrong with your scraper, or you are hitting the website too hard, or something crashes and fails and you need to start over. So scraping well often involves rate-limiting, saving progress and caching responses, and (ideally) parsing robots.txt.
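For the robots.txt part, the standard library already ships a parser; a minimal sketch (using the site from the question):
from urllib.robotparser import RobotFileParser

robots = RobotFileParser('https://www.manga-tr.com/robots.txt')
robots.read()
print(robots.can_fetch('*', 'https://www.manga-tr.com/manga-list.html'))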
But Requests + BeautifulSoup will at least get you started. Again, see the Gist.
I use Python's request library to access (public) ads.txt files:
import requests
r = requests.get('https://www.sicurauto.it/ads.txt')
print(r.text)
This works fine in most cases, but the text from the URL above begins with some strange symbols:
ï»¿google.com, [...]
If I open the URL in my browser, I do not see these three symbols; the text begins with google.com, [...] I am a beginner when it comes to encodings and web protocols ... where might these odd symbols come from?
You need to specify the encoding (via r.encoding) before accessing r.text:
import requests
r = requests.get('https://www.sicurauto.it/ads.txt')
r.encoding = 'utf-8-sig' # specify UTF-8-sig encoding
print(r.text)
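For the curious: the three symbols are the UTF-8 byte order mark (BOM) rendered as ï»¿, and the utf-8-sig codec strips it while decoding. You can verify this on the raw bytes (assuming the file is still served with a BOM):
import requests

r = requests.get('https://www.sicurauto.it/ads.txt')
print(r.content[:3])                        # b'\xef\xbb\xbf' -- the UTF-8 BOM
print(r.content.decode('utf-8-sig')[:50])   # decoded text, BOM stripped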
I'm having problems getting data from an HTTP response. The format unfortunately comes back with '\n' attached to all the key/value pairs. JSON says it must be a str and not "bytes".
I have tried a number of fixes so my list of includes might look weird/redundant. Any suggestions would be appreciated.
#!/usr/bin/env python3
import urllib.request
from urllib.request import urlopen
import json
import requests
url = "http://finance.google.com/finance/info?client=ig&q=NASDAQ,AAPL"
response = urlopen(url)
content = response.read()
print(content)
data = json.loads(content)
info = data[0]
print(info)
#got this far - planning to extract "id:" "22144"
When it comes to making requests in Python, I personally like to use the requests library. I find it easier to use.
import json
import requests
r = requests.get('http://finance.google.com/finance/info?client=ig&q=NASDAQ,AAPL')
json_obj = json.loads(r.text[4:])
print(json_obj[0].get('id'))
The above solution prints: 22144
The response data has a few unnecessary characters at the head, which is why I am only loading the relevant (JSON) portion of the response: r.text[4:]. That prefix is the reason you couldn't load it as JSON initially.
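To illustrate (the guard prefix and body below are a stand-in for what the feed returned at the time, not live data):
raw = '\n// [{"id": "22144", "t": "AAPL"}]'   # hypothetical '\n// ' prefix + JSON body
json_obj = json.loads(raw[4:])                # skip the 4 prefix characters
print(json_obj[0].get('id'))                  # 22144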
The bytes object has a decode() method that converts bytes to a string. Checking the response in the browser, there seem to be some extra characters at the beginning of the string that need to be removed (a line feed followed by two slashes: '\n//'). To skip the first three characters of the string returned by decode(), we add [3:] after the method call.
data = json.loads(content.decode()[3:])
print(data[0]['id'])
The output is exactly what you expect:
22144
JSON says it must be a str and not bytes.
Your content is bytes, so you can do this:
data = json.loads(content.decode())
I am using Python 3.x. While using urllib.request to download a webpage, I am getting a lot of \n in between. I have tried to remove them using the methods given in other threads of the forum, but I am not able to do so. I have used the strip() function and the replace() function... but no luck! I am running this code in Eclipse. Here is my code:
import urllib.request
#Downloading entire Web Document
def download_page(a):
    opener = urllib.request.FancyURLopener({})
    try:
        open_url = opener.open(a)
        page = str(open_url.read())
        return page
    except:
        return ""
raw_html = download_page("http://www.zseries.in")
print("Raw HTML = " + raw_html)
#Remove line breaks
raw_html2 = raw_html.replace('\n', '')
print("Raw HTML2 = " + raw_html2)
I am not able to spot the reason for all the \n in the raw_html variable.
Your download_page() function corrupts the HTML (the str() call); that is why you see \n (two characters, \ and n) in the output. Don't use .replace() or similar workarounds; fix the download_page() function instead:
from urllib.request import urlopen
with urlopen("http://www.zseries.in") as response:
    html_content = response.read()
At this point html_content contains a bytes object. To get it as text, you need to know its character encoding e.g., to get it from Content-Type http header:
encoding = response.headers.get_content_charset('utf-8')
html_text = html_content.decode(encoding)
See A good way to get the charset/encoding of an HTTP response in Python.
If the server doesn't pass the charset in the Content-Type header, there are complex rules for figuring out the character encoding of an html5 document; e.g., it may be specified inside the HTML document itself: <meta charset="utf-8"> (you would need an HTML parser to get it, as in the sketch below).
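A hedged sketch (assuming BeautifulSoup is installed) of pulling the charset out of a meta tag:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'html.parser')
meta = soup.find('meta', charset=True)   # a tag that has a charset attribute
if meta is not None:
    html_text = html_content.decode(meta['charset'])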
If you read the HTML correctly, you shouldn't see the literal characters \n in the page.
If you look at the source you've downloaded, the \n escape sequences you're trying to replace() are actually escaped themselves: \\n. Try this instead:
import urllib.request
def download_page(a):
    opener = urllib.request.FancyURLopener({})
    open_url = opener.open(a)
    page = str(open_url.read()).replace('\\n', '')
    return page
I removed the try/except clause because generic except statements without targeting a specific exception (or class of exceptions) are generally bad. If it fails, you have no idea why.
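If you do want error handling, catch a specific exception instead; a sketch (OSError is the base class of the network errors urllib raises here):
import urllib.request

def download_page(a):
    opener = urllib.request.FancyURLopener({})
    try:
        open_url = opener.open(a)
    except OSError as exc:   # connection problems, bad URLs, etc.
        print('download failed:', exc)
        return ''
    return str(open_url.read()).replace('\\n', '')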
It seems they are literal \n characters (a backslash followed by an n), so I suggest you do this:
raw_html2 = raw_html.replace('\\n', '')