Avoid white space when reading a file - Python

I am trying to read a list of coordinates from a text file and insert each one into a URL.
Here is my code:
with open("coords.txt", "r") as txtFile:
for line in txtFile:
coords = line
url = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=' + coords + '&radius=1&key=' + key
json_obj = urllib2.urlopen(url)
data = json.load(json_obj)
print data['results']
When I run it, I get this error:
Traceback (most recent call last):
File "C:\Users\Vel0city\Desktop\Coding\Python\placeid.py", line 8, in <module>
json_obj = urllib2.urlopen(url)
File "C:\Python27\lib\urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 429, in open
req = meth(req)
File "C:\Python27\lib\urllib2.py", line 1125, in do_request_
raise URLError('no host given')
URLError: <urlopen error no host given>
I am pretty sure this is because each line read from a text file keeps its trailing line break, so when I print out the final URL with the coords concatenated to it, I get this:
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=37.773972,-122.431297
&radius=1&key=
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=37.773972,-122.431297
&radius=1&key=
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=37.773972,-122.431297
&radius=1&key=
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=37.773972,-122.431297&radius=1&key=
How would I remove this line break so it doesn't screw up the URL?

You can use the strip method
coords = line.strip()
strip removes the leading and trailing whitespace from your string.
You can use rstrip or lstrip if you would like to strip whitespace from only one side of the line.
EDIT:
As TemporalWolf mentioned in the comments, the strip method can be used to strip things besides whitespace (which is the default).
For example, line.strip('0') would remove all leading and trailing '0' characters.
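Applied to the loop from the question, a minimal sketch (assuming urllib2, json, and the key variable are set up exactly as in the original code):

import json
import urllib2

with open("coords.txt", "r") as txtFile:
    for line in txtFile:
        coords = line.strip()  # drop the trailing newline and surrounding whitespace
        if not coords:
            continue  # skip blank lines
        url = ('https://maps.googleapis.com/maps/api/place/nearbysearch/json'
               '?location=' + coords + '&radius=1&key=' + key)  # key assumed defined, as in the question
        json_obj = urllib2.urlopen(url)
        data = json.load(json_obj)
        print data['results']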

Related

"json.decoder.JSONDecodeError: Extra data: line 1 column 5287" but there is nothing there

I have a series of .json files. Each file contains tweets based on a different keyword. Each line in every file is a JSON object. I read the files using the following code:
# Get tweets out of JSON file
tweetsFromJSON = []
with open(json_file) as f:
    for line in f:
        json_object = json.loads(line)
        tweet_text = json_object["text"]
        tweetsFromJSON.append(tweet_text)
For every JSON file I have this works flawlessly. But this particular file gives me the following error:
Traceback (most recent call last):
File "C:/Users/alexandros/Dropbox/Development/Sentiment Analysis/lda_analysis.py", line 119, in <module>
lda_analysis('precision_medicine.json', 'precision medicine')
File "C:/Users/alexandros/Dropbox/Development/Sentiment Analysis/lda_analysis.py", line 46, in lda_analysis
json_object = json.loads(line)
File "C:\Users\alexandros\AppData\Local\Programs\Python\Python35-32\lib\json\__init__.py", line 319, in loads
return _default_decoder.decode(s)
File "C:\Users\alexandros\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 342, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 5287 (char 5286)
So I tried removing the first line to see what happens. The error persists and again it's in the exact same position (line 1 column 5287 (char 5286)). I removed another line and it's the same. I'm racking my brain trying to figure out what's wrong. What am I missing?
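An "Extra data" error at a fixed column usually means a line holds more than one JSON value back to back. A minimal diagnostic sketch (not part of the original post) that counts the values per line with json.JSONDecoder.raw_decode, assuming the same file layout as in the question:

import json

decoder = json.JSONDecoder()
with open('precision_medicine.json') as f:  # filename taken from the traceback
    for lineno, line in enumerate(f, 1):
        line = line.strip()
        pos = 0
        values = 0
        while pos < len(line):
            obj, pos = decoder.raw_decode(line, pos)  # parse one value, note where it ends
            values += 1
            while pos < len(line) and line[pos].isspace():
                pos += 1  # skip whitespace between concatenated values
        if values > 1:
            print('line %d holds %d JSON values' % (lineno, values))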

Python script to unzip and print one line of a file

I am trying a simple example of retrieving data from a file and printing only one line of the output. I get a semicolon error around encoded and 'r'.
import gzip
data = gzip.open('pagecounts-20130601-000000.gz', 'r')
encoded=data.read()
print encoded[2]
It gives this error:
Traceback (most recent call last):
File "filter_articles.scpt", line 4, in <module>
encoded=data.read()
File "/usr/lib/python2.7/gzip.py", line 249, in read
self._read(readsize)
File "/usr/lib/python2.7/gzip.py", line 308, in _read
self._add_read_data( uncompress )
File "/usr/lib/python2.7/gzip.py", line 326, in _add_read_data
self.extrabuf = self.extrabuf[offset:] + data
MemoryError
I guess this is because the file is huge and read() was not able to load the content? What would be a better way to print a few lines of the file?
I am assuming that:
You meant to have quotes around the file name in your script.
You actually want the third line (as your post suggests) and not the third character (as your script suggests)
In this case the following should work:
import gzip
data = gzip.open('pagecounts-20130601-000000.gz', 'r')
data.readline()
data.readline()
print data.readline()
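If the whole decompressed stream is too big for data.read(), a sketch that avoids holding it in memory (assuming Python 2.7, as in the traceback): iterate the gzip file object lazily and take only the first few lines with itertools.islice.

import gzip
import itertools

with gzip.open('pagecounts-20130601-000000.gz', 'r') as data:
    # islice pulls lines one at a time, so only a small buffer is decompressed at once
    for line in itertools.islice(data, 3):
        print line.rstrip()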

ValueError: unknown url type

The title pretty much says it all. Here's my code:
from urllib2 import urlopen as getpage
print = getpage("www.radioreference.com/apps/audio/?ctid=5586")
and here's the traceback error I get:
Traceback (most recent call last):
File "C:/Users/**/Dropbox/Dev/ComServ/citetest.py", line 2, in <module>
contents = getpage("www.radioreference.com/apps/audio/?ctid=5586")
File "C:\Python25\lib\urllib2.py", line 121, in urlopen
return _opener.open(url, data)
File "C:\Python25\lib\urllib2.py", line 366, in open
protocol = req.get_type()
File "C:\Python25\lib\urllib2.py", line 241, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: www.radioreference.com/apps/audio/?ctid=5586
My best guess is that urllib can't retrieve data from untidy PHP URLs. If this is the case, is there a workaround? If not, what am I doing wrong?
You should first try to add 'http://' in front of the URL. Also, do not store the result in print, as that rebinds the name print to another (non-callable) object.
So this line should be:
page_contents = getpage("http://www.radioreference.com/apps/audio/?ctid=5586")
This returns a file-like object. To read its contents you need to use file manipulation methods, like this:
for line in page_contents.readlines():
    print line
You need to pass a full URL: i.e. it must begin with http://.
Simply use http://www.radioreference.com/apps/audio/?ctid=5586 and it'll work fine.
In [24]: from urllib2 import urlopen as getpage
In [26]: print getpage("http://www.radioreference.com/apps/audio/?ctid=5586")
<addinfourl at 173987116 whose fp = <socket._fileobject object at 0xa5eb6ac>>
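A defensive variant (not from either answer) that only prepends the scheme when it is missing, using the standard-library urlparse:

from urllib2 import urlopen
from urlparse import urlparse

def open_page(url):
    if not urlparse(url).scheme:
        url = 'http://' + url  # default to http when no scheme is given
    return urlopen(url)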

Why is urllib.urlopen() only working once? - Python

I'm writing a crawler to download static HTML pages using urllib.
The get_page function works for one cycle, but when I try to loop it, it doesn't fetch the content of the next URL I've fed in.
How do I make urllib.urlopen continuously download HTML pages?
If that is not possible, is there any other suggestion for downloading webpages within my Python code?
My code below only returns the HTML for the first website in the seed list:
import urllib

def get_page(url):
    return urllib.urlopen(url).read().decode('utf8')

seed = ['http://www.pmo.gov.sg/content/pmosite/home.html',
        'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for j in seed:
    print "here"
    print get_page(j)
The same crawl "once-only" problem also occurs with urllib2:
import urllib2

def get_page(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read().decode('utf8')

seed = ['http://www.pmo.gov.sg/content/pmosite/home.html',
        'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for j in seed:
    print "here"
    print get_page(j)
Without catching the exception, I'm getting an IOError with urllib:
Traceback (most recent call last):
File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 91, in <module>
print get_page(j)
File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 4, in get_page
return urllib.urlopen(url).read().decode('utf8')
File "/usr/lib/python2.7/urllib.py", line 86, in urlopen
return opener.open(url)
File "/usr/lib/python2.7/urllib.py", line 207, in open
return getattr(self, name)(url)
File "/usr/lib/python2.7/urllib.py", line 462, in open_file
return self.open_local_file(url)
File "/usr/lib/python2.7/urllib.py", line 476, in open_local_file
raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] No such file or directory: 'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html'
Without catching the exception, I'm getting a ValueError with urllib2:
Traceback (most recent call last):
File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 95, in <module>
print get_page(j)
File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 7, in get_page
response = urllib2.urlopen(req)
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 392, in open
protocol = req.get_type()
File "/usr/lib/python2.7/urllib2.py", line 254, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: http://www.pmo.gov.sg/content/pmosite/aboutpmo.html
ANSWERED:
The IOError and ValueError occurred because there was some sort of Unicode byte order mark (BOM). A non-breaking space was found in the second URL. Thanks for all your help and suggestions in solving the problem!!
Your code is choking on .read().decode('utf8'), but you wouldn't see that since you are just swallowing exceptions. urllib works fine "more than once".
import urllib

def get_page(url):
    return urllib.urlopen(url).read()

seeds = ['http://www.pmo.gov.sg/content/pmosite/home.html',
         'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for seed in seeds:
    print 'here'
    print get_page(seed)
Both of your examples work fine for me. The only explanation I can think of for your exact errors is that the second URL string contains some sort of non-printable character (a Unicode BOM, perhaps) that got filtered out when pasting the code here. Try copying the code back from this site into your file, or retyping the entire second string from scratch.
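Given the ANSWERED note above, a sketch of a clean-up step that would have avoided the problem (the byte values below are the UTF-8 encodings of the suspected characters, assumed for a Python 2 byte string):

def clean_url(url):
    # '\xef\xbb\xbf' is a UTF-8 BOM and '\xc2\xa0' a UTF-8 non-breaking space,
    # the two characters suspected in the question's second URL
    return url.replace('\xef\xbb\xbf', '').replace('\xc2\xa0', '').strip()

seeds = [clean_url(s) for s in seeds]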

Error With AlchemyAPI Python SDK

I am trying to use the AlchemyAPI Python 0.7 SDK. However, whenever I run a method within it, e.g. URLGetText(url), I get this error:
nodes = etree.fromstring(result).xpath(xpathQuery)
File "lxml.etree.pyx", line 2743, in lxml.etree.fromstring (src/lxml/lxml.etree.c:52665)
File "parser.pxi", line 1573, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79932)
File "parser.pxi", line 1452, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78774)
File "parser.pxi", line 960, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75389)
File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
File "parser.pxi", line 585, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71955)
lxml.etree.XMLSyntaxError: AttValue: " or ' expected, line 19, column 11
This comes from this area of code:
def GetRequest(self, apiCall, apiPrefix, paramObject):
    endpoint = 'http://' + self._hostPrefix + '.alchemyapi.com/calls/' + apiPrefix + '/' + apiCall
    endpoint += '?apikey=' + self._apiKey + paramObject.getParameterString()
    handle = urllib.urlopen(endpoint)
    result = handle.read()
    handle.close()
    xpathQuery = '/results/status'
    nodes = etree.fromstring(result).xpath(xpathQuery)
    if nodes[0].text != "OK":
        raise Exception, 'Error making API call.'
    return result
Anyone have any ideas about what is going wrong?
Thank you,
Daniel Kershaw
I looked at the Python urllib docs, and found this page:
http://docs.python.org/library/urllib.html#high-level-interface
which contains this warning about the filehandle object returned by urllib.urlopen():
One caveat: the read() method, if the size argument is omitted or negative, may not read until the end of the data stream; there is no good way to determine that the entire stream from a socket has been read in the general case.
I think maybe you should ensure that you obtain the entire contents of the file as a Python string before parsing it with the etree.fromstring() API. Something like:
result = ''
while 1:
    next = handle.read()
    if not next:
        break
    result += next
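A variant of the same loop (not from the original answer) that reads in fixed-size chunks, joins them once at the end, and avoids shadowing the built-in next:

chunks = []
while True:
    chunk = handle.read(8192)  # read until read() returns an empty string at EOF
    if not chunk:
        break
    chunks.append(chunk)
result = ''.join(chunks)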
