Read file object as string in Python

I'm using urllib2 to read in a page. I need to do a quick regex on the source and pull out a few variables, but urllib2 presents the page as a file object rather than a string.
I'm new to Python, so I'm struggling to see how to use a file object to do this. Is there a quick way to convert this into a string?

You can use Python in interactive mode to search for solutions.
If f is your object, you can enter dir(f) to see all its methods and attributes. There's one called read. Enter help(f.read) and it tells you that f.read() is the way to retrieve a string from a file object.
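For instance, a quick interactive session might look like this (the URL here is just a placeholder):
>>> import urllib2
>>> f = urllib2.urlopen('http://example.com/')
>>> dir(f)           # lists the available methods and attributes, including read
>>> help(f.read)     # describes what read() returns
>>> html = f.read()  # the whole page as a string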

From the doc file.read() (my emphasis):
file.read([size])
Read at most size bytes from the file (less if the read hits EOF before obtaining size bytes). If the size argument is negative or omitted, read all data until EOF is reached. The bytes are returned as a string object. An empty string is returned when EOF is encountered immediately. (For certain files, like ttys, it makes sense to continue reading after an EOF is hit.) Note that this method may call the underlying C function fread more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was requested may be returned, even if no size parameter was given.
Be aware that a regexp search on a large string object may not be efficient; consider doing the search line by line by iterating over the file (a file object is its own iterator).
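For example, a minimal line-by-line sketch (the pattern and file name here are only placeholders):
import re

pattern = re.compile(r'href="([^"]+)"')   # placeholder pattern
with open('page.html') as f:              # placeholder file; a file object is its own iterator
    for line in f:
        match = pattern.search(line)
        if match:
            print(match.group(1))
            break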

Michael Foord, aka Voidspace, has an excellent tutorial on urllib2 which you can find here:
urllib2 - The Missing Manual
What you are doing should be pretty straightforward; observe this sample code:
import urllib2
import re
response = urllib2.urlopen("http://www.voidspace.org.uk/python/articles/urllib2.shtml")
html = response.read()
pattern = '(V.+space)'
wordPattern = re.compile(pattern, re.IGNORECASE)
results = wordPattern.search(html)
print results.groups()

Related

Python ijson - parse error: trailing garbage // bz2.decompress()

I have come across an error while parsing json with ijson.
Background:
I have a series (approx. 1000) of large files of Twitter data that are compressed in '.bz2' format. I need to get elements from the files into a pd.DataFrame for further analysis. I have identified the keys I need to get. I am cautious about putting Twitter data up.
Attempt:
I have managed to decompress the files using bz2.decompress with the following code:
import bz2
import ijson

## Code in loop specific for decompressing and parsing -
with open(file, 'rb') as source:
    # Decompress the file
    json_r = bz2.decompress(source.read())
    json_decom = json_r.decode('utf-8')  # decompresses one file at a time rather than a stream

    # Parse the JSON with ijson
    parser = ijson.parse(json_decom)
    for prefix, event, value in parser:
        # Print selected items as part of testing
        if prefix == "created_at":
            print(value)
        if prefix == "text":
            print(value)
        if prefix == "user.id_str":
            print(value)
This gives the following error:
IncompleteJSONError: parse error: trailing garbage
estamp_ms":"1609466366680"} {"created_at":"Fri Jan 01 01:59
(right here) ------^
Two things:
Is my decompression method correct and giving the right type of file for ijson to parse (ijson takes both bytes and str)?
Is it a JSON error? If it is, is it possible to develop some kind of error handler to move on to the next file? If so, any suggestion would be appreciated.
Any assistance would be greatly appreciated.
Thank you, James
To directly answer your two questions:
The decompression method is correct in the sense that it yields JSON data that you then feed to ijson. As you point out, ijson works both with str and bytes inputs (although the latter is preferred); if you were giving ijson some non-JSON input you wouldn't see an error showing JSON data in it.
This is a very common error that is described in ijson's FAQ. It basically means your JSON document has more than one top-level value, which is not standard JSON, but is supported by ijson by using the multiple_values option (see docs for details).
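For example, a minimal sketch reusing the names from your code; the only change is the extra keyword argument:
parser = ijson.parse(json_decom, multiple_values=True)
for prefix, event, value in parser:
    if prefix == "created_at":
        print(value)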
About the code as a whole: while it's working correctly, it could be improved on: the whole point of using ijson is that you can avoid loading the full JSON contents in memory. The code you posted doesn't use this to its advantage though: it first opens the bz-compressed file, reads it as a whole, decompresses that as a whole, (unnecessarily) decodes that as a whole, and then gives the decoded data as input to ijson. If your input file is small, and the decompressed data is also small you won't see any impact, but if your files are big then you'll definitely start noticing it.
A better approach is to stream the data through all the operations so that everything happens incrementally: decompression, no intermediate decoding, and JSON parsing. Something along the lines of:
with bz2.BZ2File(filename, mode='r') as f:
    for prefix, event, value in ijson.parse(f):
        # ...
As the cherry on the cake, if you want to build a DataFrame from that you can use DataFrame's data argument to build the DataFrame directly with the results from the above. data can be an iterable, so you can, for example, make the code above a generator and use it as data. Again, something along the lines of:
def json_input():
    with bz2.BZ2File(filename, mode='r') as f:
        for prefix, event, value in ijson.parse(f):
            # yield your results

df = pandas.DataFrame(data=json_input())
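To fill in that placeholder, a rough sketch might look like the following (the field names come from the question, multiple_values=True copes with the concatenated tweets, and the file name is only a placeholder):
import bz2
import ijson
import pandas

def json_input(filename):
    with bz2.BZ2File(filename, mode='r') as f:
        # each top-level JSON object in the stream arrives as a plain dict
        for tweet in ijson.items(f, '', multiple_values=True):
            yield {
                'created_at': tweet.get('created_at'),
                'text': tweet.get('text'),
                'user_id': tweet.get('user', {}).get('id_str'),
            }

df = pandas.DataFrame(data=json_input('tweets.json.bz2'))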

Convert bytes to a file object in Python

I have a small application that reads local files using:
open(diefile_path, 'r') as csv_file
open(diefile_path, 'r') as file
and also uses linecache module
I need to expand the use to files that are sent from a remote server.
The content that is received from the server is of type bytes.
I couldn't find a lot of information about handling the BytesIO type, and I was wondering if there is a way that I can convert the bytes chunk to a file-like object.
My goal is to use the API is specified above (open,linecache)
I was able to convert the bytes into a string using data.decode("utf-8"),
but I can't use the methods above (open and linecache)
a small example to illustrate
data = b'First line\nSecond line\nThird line\n'
with open(data) as file:
    line = file.readline()
    print(line)
output:
First line
Second line
Third line
can it be done?
open is used to open actual files, returning a file-like object. Here, you already have the data in memory, not in a file, so you can instantiate the file-like object directly.
import io
data = b'First line\nSecond line\nThird line\n'
file = io.StringIO(data.decode())
for line in file:
    print(line.strip())
However, if what you are getting is really just a newline-separated string, you can simply split it into a list directly.
lines = data.decode().strip().split('\n')
The main difference is that the StringIO version is slightly lazier; it has a smaller memory footprint compared to the list, as it splits strings off as requested by the iterator.
The answer above that uses StringIO needs to specify an encoding when decoding, which may cause a wrong conversion.
From the Python documentation, using BytesIO:
from io import BytesIO
f = BytesIO(b"some initial binary data: \x00\x01")
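If you specifically need a text-mode file object but would rather not decode by hand, another option (a sketch; the encoding is an assumption about your data) is to wrap a BytesIO in a TextIOWrapper:
import io

data = b'First line\nSecond line\nThird line\n'
file = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8')  # text-mode view over in-memory bytes
print(file.readline())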

Python - read huge online csv through proxy

I have a huge CSV online and I want to read it line by line without downloading it. But this file is behind a proxy.
I wrote this code:
import requests
import pandas as pd
import io
from requests_ntlm import HttpNtlmAuth

cafile = 'mycert.crt'
proxies = {"http": "http://ipproxy:port", "https": "http://ipproxy:port"}
auth = HttpNtlmAuth('Username', 'Password')
url = 'http://myurl/ressources.csv'
content = requests.get(url, proxies=proxies, auth=auth, verify=cafile).content
csv_read = pd.read_csv(io.StringIO(content.decode('utf-8')))
pattern = 'mypattern'
for row in csv_read:
    if row[0] == pattern:
        print(row)
        break
This code above works, but the line content = requests.get(...) takes so much time because of the size of the CSV file.
So my question is:
Is it possible to read an online CSV line by line through a proxy?
Ideally, I would like to read the first row, check whether it matches my pattern; if yes, break; if not, read the second line, and so on.
Thanks for your help.
You can pass stream=True to requests.get to avoid fetching the entire result immediately. In that case you can access a pseudo-file object through response.raw, you can build your CSV reader based on that (alternatively, the response object has iter_content and iter_lines methods but I don't know how easy it is to feed that to a CSV parser).
However, while the stdlib's csv module simply yields a sequence of lists or dicts and can therefore easily be lazy, pandas returns a DataFrame, which is not lazy; you need to pass some special parameters (such as chunksize), in which case you get one DataFrame per chunk.
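A rough sketch of the pandas route, reusing the names from the question (the chunk size and the first-column test are assumptions, and this assumes the server does not compress the response):
import pandas as pd
import requests

# stream=True defers the download; pandas then reads from the raw socket file
r = requests.get(url, stream=True, proxies=proxies, auth=auth, verify=cafile)
for chunk in pd.read_csv(r.raw, chunksize=10000):
    matches = chunk[chunk.iloc[:, 0] == pattern]
    if not matches.empty:
        print(matches)
        break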
The requests.get call will get you the whole file anyway. You'd need to implement your own HTTP code, down to the socket level, to be able to process the content as it comes in, with a plain HTTP GET method.
The only way of getting partial results and slicing the download is to add HTTP "Range" request headers, if the server providing the file supports them. (requests can let you set these headers.)
Enter requests' advanced usage:
The good news is that requests can do that for you under the hood -
you can set stream=True parameter when calling requests, and it even will let you iterate the contents line by line. Check the documentation on that part.
Here is more or less what requests does under the hood so that you can get your contents line by line:
It will get reasonably sized chunks of your data, but certainly not request one line at a time (think ~80 bytes versus ~100,000 bytes), because otherwise it'd need a new HTTP request for each line, and the overhead for each request is not trivial, even if made over the same TCP connection.
Anyway, as CSV is a text format, neither requests nor any other software could know the size of the lines, let alone the exact size of the "next" line to be read, before setting the range headers accordingly.
So, for this to work, there has to be Python code to:
- accept a request for a "new line" of the CSV: if there are buffered text lines, yield the next line; otherwise make an HTTP request for the next 100KB or so
- concatenate the downloaded data to the remainder of the last downloaded line
- split the downloaded data at the last line feed in the binary data, and save the remainder of the last line
- convert the binary buffer to text (you'd have to take care of multi-byte character boundaries in a multi-byte encoding like UTF-8, but cutting at newlines may save you that)
- yield the next text line
A rough sketch of such a buffered reader is shown below.
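A hypothetical sketch of that buffering logic, using requests' iter_content rather than hand-rolled Range requests (the chunk size and encoding are assumptions):
import requests

def iter_remote_lines(url, chunk_size=100000, encoding='utf-8'):
    r = requests.get(url, stream=True)
    remainder = b''
    for chunk in r.iter_content(chunk_size=chunk_size):
        remainder += chunk
        complete, sep, remainder = remainder.rpartition(b'\n')
        if not sep:
            continue                       # no full line buffered yet
        for line in complete.split(b'\n'):
            yield line.decode(encoding)    # cutting at newlines keeps UTF-8 characters whole
    if remainder:
        yield remainder.decode(encoding)   # last line without a trailing newline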
According to Masklinn's answer, my code now looks like this:
import requests
from requests_ntlm import HttpNtlmAuth

cafile = 'mycert.crt'
proxies = {"http": "http://ipproxy:port", "https": "http://ipproxy:port"}
auth = HttpNtlmAuth('Username', 'Password')
url = 'http://myurl/ressources.csv'
pattern = 'mypattern'
r = requests.get(url, stream=True, proxies=proxies, verify=cafile)
if r.encoding is None:
    r.encoding = 'ISO-8859-1'
for line in r.iter_lines(decode_unicode=True):
    if line.split(';')[0] == pattern:
        print(line)
        break

ValueError Reading large data set with pd.read_json

I am working through a set of code exercises that use a Yelp reviews dataset. At this point in the exercises I am supposed to read in review.json, which has one JSON record per line. I have made a smaller version of the JSON file, with only 100 records, for testing.
I can read the entire test file into a pandas dataframe and examine it.
The complete dataset file, however, has about 6 million lines. The recommendation is to use chunksize and build a json reader. I'm hitting errors, even with my test input.
My code currently looks like this
path = 'file://localhost/Users/.../DSC_Intro/'
filename = path + 'yelp_dataset/review_100.json'
# create a reader to read in chunks
review_reader = pd.read_json(StringIO(filename), lines=True, chunksize=10)
type(review_reader)
The type call returns
pandas.io.json.json.JsonReader
which looks good.
Then I try
for chunk in review_reader:
print(chunk)
as referenced in pandas user guide
and I get an error:
ValueError: Unexpected character found when decoding 'false'
Update - it has been suggested that the issue is caused by embedded (quoted) "\n" characters in the data file; that pandas is seeing the JSON records not as one per line, but as spanning multiple lines.
The error message is VERY opaque, if that's the case. Also, with 6 million lines, how should I tell pd.read_json to ignore "\n" and only look at actual newlines in the data?
Update
It's been suggested that I fix my typo (it was a typo in this post, not a typo in my code) and use a Unix file path instead of a URL (JSON doesn't care: see docs).
When I do this but keep StringIO(), I get a different ValueError.
When I do this but remove StringIO(), the code works.
This seems to be very fragile. :-(
Note The tutorial has an answer key. I've tried that code. The answer key uses
review_reader = pd.read_json(filename, lines=True, chunksize=10)
which throws the TypeError
sequence item 0: expected str instance, bytes found
Adding StringIO() seems to have solved that.
Input Sample JSON record, one per line of the input file.
{"review_id":"Amo5gZBvCuPc_tZNpHwtsA","user_id":"DzZ7piLBF-WsJxqosfJgtA","business_id":"qx6WhZ42eDKmBchZDax4dQ","stars":5.0,"useful":1,"funny":0,"cool":0,"text":"Our family LOVES the food here. Quick, friendly, delicious, and a great restaurant to take kids to. 5 stars!","date":"2017-03-27 01:14:37"}
Firstly, your
path 'file://localhost/Users/.../DSC_Intro/'
is not valid python code. If you try to execute that as such, you will get an invalid syntax error. I assume, then, that this is just showing the value of the path variable. I don't know whether those ellipses are literal or the result of your environment truncating the display of path. I'll assume here that your path is a valid file URL for your system as it doesn't seem germane here to consider an incorrect path.
Either way, yes, read_json can read json from a file URL as you're specifying there (I learned something there) if you read it in one go:
pd.read_json(fname, lines=True)
But if you try to create a reader from this, by specifying
pd.read_json(fname, lines=True, chunksize=...)
then you get
TypeError: sequence item 0: expected str instance, bytes found
Secondly, yes, wrapping your file-like argument with StringIO makes this error go away, but it isn't helping for the reason you might think, and its use is based on a misreading of the pandas docs you point to.
I'll quote a couple of bits from the read_json doc here:
Signature: pd.read_json(
path_or_buf=None, ...
path_or_buf : a valid JSON string or file-like, default: None
The string could be a URL. Valid URL schemes include http, ftp, s3,
gcs, and file. For file URLs, a host is expected. For instance, a local
file could be file://localhost/path/to/table.json
So with read_json, you can either give it an actual string that is valid JSON, or you can give it a file-like object that points to a file that contains JSON.
Notice in the pandas docs that you cite:
In [258]: jsonl = '''
.....: {"a": 1, "b": 2}
.....: {"a": 3, "b": 4}
.....: '''
.....:
is JSON, not a path. When their example then does:
df = pd.read_json(jsonl, lines=True)
it is merely parsing the JSON in the string - no files are involved here.
When it then wants to demonstrate reading from a file in chunks, it does
# reader is an iterator that returns `chunksize` lines each iteration
In [262]: reader = pd.read_json(StringIO(jsonl), lines=True, chunksize=1)
In other words, they are wrapping a JSON string, not a path, by StringIO(). This is just for the purposes of the documented example, so you can see that if you treated the JSON string as if it were being read from a file you can read it in chunks. That's what StringIO() does. So when you wrap the string that describes your file URL in StringIO(), I expect that read_json is then trying to interpret that string as JSON that's being read from a file and parse it. It understandably falls over because it isn't JSON.
This brings us back to why read_json cannot read your file URL in chunks. I don't have an immediate good answer to that. I suspect it lies in the internals of how read_json opens file URLs, or what function underlies this. If you were intent upon, or forced to, do this chunking from a file URL then I suspect you'd be looking at controlling the mode in which the file is opened, or perhaps somehow providing explicit guidance to read_json how to interpret the bytestream it gets. Libraries such as urllib2 may be useful here, I'm not sure.
But let's cut to the best fix here. Why are we trying to specify the path as a file URL? Simply specify your path as an OS path, e.g.
path = '/path/to/my/data/'
and then
filename = path + 'yelp_dataset/review_100.json'
# create a reader to read in chunks
review_reader = pd.read_json(filename, lines=True, chunksize=10)
And I betcha it works as intended! (It does for me, as it always has).
Caveat: Windows doesn't use forward-slash path delimiters, and constructing paths by concatenating strings in the above fashion can be fragile, but usually if you use 'proper' forward-slash delimiters (smile), decent languages internally understand that. It's constructing paths using backslashes that is guaranteed to cause you pain. But just keep an eye on that.
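If you want to sidestep the delimiter question entirely, a small sketch using the standard library's pathlib (the directory names are placeholders):
from pathlib import Path
import pandas as pd

# Path joins components with the right delimiter for the platform
filename = Path('/path/to/my/data') / 'yelp_dataset' / 'review_100.json'
review_reader = pd.read_json(filename, lines=True, chunksize=10)
for chunk in review_reader:
    print(chunk)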

Open a file for RawIOBase python

I need to read in the binary contents of a file for a function, and from this link https://docs.python.org/2/library/io.html, it looks like I should be using a RawIOBase object to read it in. But I can't find anywhere how to open a file for use with RawIOBase. Right now I have tried this to read the binary into a string:
with (open(documentFileName+".bin", "rb")) as binFile:
    document = binFile.RawIOBase.read()
    print document
but that throws the error AttributeError: 'file' object has no attribute 'RawIOBase'
So with no open attribute in RawIOBase, how do I open the file for it to read from?
Don't delve into the implementation details of the io thicket unless you need to code your own peculiar file-like types! In your case,
with open(documentFileName+".bin", "rb") as binFile:
    document = binFile.read()
will be perfectly fine!
Note in passing that I've killed the superfluous parentheses you were using -- "no unneeded pixels!!!" -- but, while important!, that's a side issue to your goal here.
Now, assuming Python 2, document is a str -- an immutable array of bytes. It may be confusing that displaying document shows it as a string of characters, but that's just Py2's confusion between text and byte strings (in Py3, the returned type would be bytes).
If you prefer to work with (e.g.) a mutable array of ints, use e.g.
theints = map(ord, document)
or, for an array of bytes that displays numerically,
import array
thearray = array.array('b', document)
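For completeness, a side note: in Python 3, reading in binary mode returns bytes, which already behaves as an immutable sequence of ints, so the equivalent conversions would be:
theints = list(document)         # mutable list of ints
thebytes = bytearray(document)   # mutable array of bytes that displays numerically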
