I'm currently pulling from an FTP server to run a parser on a group of files. When I download a file in-browser and feed it into the parser, I get no errors, since it's a plain TXT file.
But when I pull the data from the server through Python's ftplib, I get differently formatted data. An example is the extra \x00 and 'b' characters inside my data. It's important that the data keeps its integrity, since I'm working with a COBOL database structure which can't be manipulated in the slightest, else it will ruin the entire store.
I am successfully downloading using the following functions:
data = []

def handle_binary(more_data):
    data.append(str(more_data))

def get_file(filename):
    resp = ftp.retrbinary("RETR {0}".format(filename), callback=handle_binary)
    file = "".join(data)
    # returning the data as well as saving it to inspect
    save = open("save.txt", "w+")
    save.write(file)
    save.close()
    return file
I tried to change my store from:
data.append(str(more_data))
To
data.append(more_data)
As well as changing my join function to be b"" to indicate a byte join, but I got errors following that.
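For reference, here is a minimal all-bytes sketch of the download, not a drop-in fix: the host and login are hypothetical placeholders, and the errors after switching to a b"" join were likely because the save file was still opened in text mode. The idea is to keep every chunk as bytes, join with b"", and write the copy in binary mode so nothing is re-encoded:
import ftplib

ftp = ftplib.FTP("ftp.example.com")  # hypothetical host, stand-in for your real connection
ftp.login()

data = []

def handle_binary(more_data):
    # keep the chunk as raw bytes; str() on bytes bakes the b'...' repr into the text
    data.append(more_data)

def get_file(filename):
    del data[:]  # start fresh for each download
    ftp.retrbinary("RETR {0}".format(filename), callback=handle_binary)
    contents = b"".join(data)  # byte join, matching the byte chunks
    # save a copy to inspect, opened in binary mode so nothing gets re-encoded
    with open("save.txt", "wb") as save:
        save.write(contents)
    return contents
If the parser needs text, decode once at the end (e.g. contents.decode(...) with whatever encoding the COBOL export actually uses) rather than converting each chunk.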
An example of a long string of data that wasn't in the original download:
\x00\x00\x00'b'\x00\n
Edit: Upon comparison, it seems that the FTP-downloaded data has no newlines (which makes sense, given the missed newline in the example above), whereas the browser-downloaded copy does.
Thanks for any help regarding this question.
I have come across an error while parsing JSON with ijson.
Background:
I have a series (approx. 1000) of large files of Twitter data that are compressed in the '.bz2' format. I need to get elements from the files into a pd.DataFrame for further analysis. I have identified the keys I need to get. I am cautious about putting the Twitter data itself up here.
Attempt:
I have managed to decompress the files using bz2.decompress with the following code:
## Code in loop specific for decompressing and parsing -
import bz2
import ijson

with open(file, 'rb') as source:
    # Decompress the file
    json_r = bz2.decompress(source.read())
    json_decom = json_r.decode('utf-8')  # decompresses one file at a time rather than a stream

# Parse the JSON with ijson
parser = ijson.parse(json_decom)
for prefix, event, value in parser:
    # Print selected items as part of testing
    if prefix == "created_at":
        print(value)
    if prefix == "text":
        print(value)
    if prefix == "user.id_str":
        print(value)
This gives the following error:
IncompleteJSONError: parse error: trailing garbage
estamp_ms":"1609466366680"} {"created_at":"Fri Jan 01 01:59
(right here) ------^
Two things:
Is my decompression method correct and giving the right type of file for ijson to parse (ijson takes both bytes and str)?
Is this a JSON error? // If it is a JSON error, is it possible to develop some kind of error handler to move on to the next file? If so, any suggestion would be appreciated.
Any assistance would be greatly appreciated.
Thank you, James
To directly answer your two questions:
The decompression method is correct in the sense that it yields JSON data that you then feed to ijson. As you point out, ijson works both with str and bytes inputs (although the latter is preferred); if you were giving ijson some non-JSON input you wouldn't see an error showing JSON data in it.
This is a very common error that is described in ijson's FAQ. It basically means your JSON document has more than one top-level value, which is not standard JSON, but is supported by ijson by using the multiple_values option (see docs for details).
About the code as a whole: while it works correctly, it could be improved: the whole point of using ijson is that you can avoid loading the full JSON contents into memory. The code you posted doesn't use this to its advantage though: it first opens the bz2-compressed file, reads it as a whole, decompresses that as a whole, (unnecessarily) decodes that as a whole, and then gives the decoded data as input to ijson. If your input files are small, and the decompressed data is also small, you won't see any impact, but if your files are big then you'll definitely start noticing it.
A better approach is to stream the data through all the operations so that everything happens incrementally: streaming decompression, no intermediate decoding, and event-based JSON parsing. Something along the lines of:
with bz2.BZ2File(filename, mode='r') as f:
    for prefix, event, value in ijson.parse(f):
        # ...
As the cherry on the cake, if you want to build a DataFrame from that you can use DataFrame's data argument to build the DataFrame directly with the results from the above. data can be an iterable, so you can, for example, make the code above a generator and use it as data. Again, something along the lines of:
def json_input():
    with bz2.BZ2File(filename, mode='r') as f:
        for prefix, event, value in ijson.parse(f):
            # yield your results

df = pandas.DataFrame(data=json_input())
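To make that concrete, here is one possible way to flesh out the generator, as a sketch only: the field names match the ones printed in the question's test code, the filename is hypothetical, and multiple_values=True is the ijson option mentioned above for streams that contain several top-level JSON objects.
import bz2
import ijson
import pandas as pd

WANTED = {"created_at", "text", "user.id_str"}  # prefixes from the question's tests

def json_input(filename):
    # Stream straight from the compressed file; nothing is loaded whole into memory.
    with bz2.BZ2File(filename, mode='r') as f:
        row = {}
        # multiple_values=True because each file holds many top-level tweet objects
        for prefix, event, value in ijson.parse(f, multiple_values=True):
            if prefix in WANTED:
                row[prefix] = value
            elif prefix == '' and event == 'end_map':
                # a top-level tweet object just ended: emit the row and start a new one
                yield row
                row = {}

df = pd.DataFrame(data=json_input('tweets.json.bz2'))  # hypothetical filename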
I am working a set of code exercises that use a Yelp reviews dataset. At this point in the exercises I am supposed to read in review.json which has one JSON record per line. I have made a smaller version of the JSON file, with only 100 records, for testing.
I can read the entire test file into a pandas dataframe and examine it.
The complete dataset file, however, has about 6 million lines. The recommendation is to use chunksize and build a json reader. I'm hitting errors, even with my test input.
My code currently looks like this
path = 'file://localhost/Users/.../DSC_Intro/'
filename = path + 'yelp_dataset/review_100.json'
# create a reader to read in chunks
review_reader = pd.read_json(StringIO(filename), lines=True, chunksize=10)
type(review_reader)
The type call returns
pandas.io.json.json.JsonReader
which looks good.
Then I try
for chunk in review_reader:
    print(chunk)
as referenced in the pandas user guide
and I get an error:
ValueError: Unexpected character found when decoding 'false'
Update - it has been suggested that the issue is caused by embedded (quoted) "\n" characters in the data file; that is, pandas is seeing the JSON records not as one per line but as spanning multiple lines.
The error message is VERY opaque, if that's the case. Also, with 6 million lines, how should I tell pd.read_json to ignore embedded "\n" and only break records on actual newlines in the data?
Update
It's been suggested that if I fix my typo (it was a typo in this post, not a typo in my code) and use a Unix file path instead of a file URL, things should work (read_json accepts either: see docs).
When I do this but keep StringIO(), I get a different ValueError.
When I do this but remove StringIO(), the code works.
This seems to be very fragile. :-(
Note: The tutorial has an answer key. I've tried that code. The answer key uses
review_reader = pd.read_json(filename, lines=True, chunksize=10)
which throws the TypeError
sequence item 0: expected str instance, bytes found
Adding StringIO() seems to have solved that.
Input Sample JSON record, one per line of the input file.
{"review_id":"Amo5gZBvCuPc_tZNpHwtsA","user_id":"DzZ7piLBF-WsJxqosfJgtA","business_id":"qx6WhZ42eDKmBchZDax4dQ","stars":5.0,"useful":1,"funny":0,"cool":0,"text":"Our family LOVES the food here. Quick, friendly, delicious, and a great restaurant to take kids to. 5 stars!","date":"2017-03-27 01:14:37"}
Firstly, your
path 'file://localhost/Users/.../DSC_Intro/'
is not valid Python code. If you try to execute that as such, you will get an invalid syntax error. I assume, then, that this is just showing the value of the path variable. I don't know whether those ellipses are literal or the result of your environment truncating the display of path. I'll assume here that your path is a valid file URL for your system, as it doesn't seem germane here to consider an incorrect path.
Either way, yes, read_json can read JSON from a file URL like the one you're specifying (I learned something there) if you read it in one go:
pd.read_json(fname, lines=True)
But if you try to create a reader from this, by specifying
pd.read_json(fname, lines=True, chunksize=...)
then you get
TypeError: sequence item 0: expected str instance, bytes found
Secondly, yes, wrapping your path argument with StringIO makes this error go away, but it isn't helping for the reason you might think, and its use is based on a misreading of the pandas docs you point to.
I'll quote a couple of bits from the read_json doc here:
Signature: pd.read_json(
path_or_buf=None, ...
path_or_buf : a valid JSON string or file-like, default: None
The string could be a URL. Valid URL schemes include http, ftp, s3,
gcs, and file. For file URLs, a host is expected. For instance, a local
file could be file://localhost/path/to/table.json
So with read_json, you can either give it an actual string that is valid JSON, or you can give it a file-like object that points to a file that contains JSON.
Notice in the pandas docs that you cite:
In [258]: jsonl = '''
.....: {"a": 1, "b": 2}
.....: {"a": 3, "b": 4}
.....: '''
.....:
is JSON, not a path. When their example then does:
df = pd.read_json(jsonl, lines=True)
it is merely parsing the JSON in the string - no files are involved here.
When it then wants to demonstrate reading from a file in chunks, it does
# reader is an iterator that returns `chunksize` lines each iteration
In [262]: reader = pd.read_json(StringIO(jsonl), lines=True, chunksize=1)
In other words, they are wrapping a JSON string, not a path, by StringIO(). This is just for the purposes of the documented example, so you can see that if you treated the JSON string as if it were being read from a file you can read it in chunks. That's what StringIO() does. So when you wrap the string that describes your file URL in StringIO(), I expect that read_json is then trying to interpret that string as JSON that's being read from a file and parse it. It understandably falls over because it isn't JSON.
This brings us back to why read_json cannot read your file URL in chunks. I don't have an immediate good answer to that. I suspect it lies in the internals of how read_json opens file URLs, or in what function underlies this. If you were intent on, or forced into, doing this chunking from a file URL, then I suspect you'd be looking at controlling the mode in which the file is opened, or perhaps somehow providing explicit guidance to read_json on how to interpret the bytestream it gets. Libraries such as urllib2 may be useful here, I'm not sure.
But let's cut to the best fix here. Why are we trying to specify the path as a file URL? Simply specify your path as an OS path, e.g.
path = '/path/to/my/data/'
and then
filename = path + 'yelp_dataset/review_100.json'
# create a reader to read in chunks
review_reader = pd.read_json(filename, lines=True, chunksize=10)
And I betcha it works as intended! (It does for me, as it always has).
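For completeness, here is a small usage sketch of how the chunked reader is typically consumed (the path is a placeholder; concatenating only makes sense if the combined result fits in memory):
import pandas as pd

filename = '/path/to/my/data/yelp_dataset/review_100.json'  # placeholder path

review_reader = pd.read_json(filename, lines=True, chunksize=10)

# Each chunk is a DataFrame of up to 10 reviews; process them one at a time,
# or stitch them together if the full dataset fits in memory.
reviews = pd.concat(chunk for chunk in review_reader)
print(reviews.shape)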
Caveat: Windows doesn't use forward-slash path delimiters, and constructing paths by concatenating strings in the above fashion can be fragile, but usually if you use 'proper' forward-slash delimiters (smile), decent languages understand that internally. It's constructing paths using backslashes that is guaranteed to cause you pain. But just keep an eye on that.
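One way to sidestep the delimiter question entirely (not part of the original answer, just a common alternative) is to let pathlib assemble the path; pandas accepts path-like objects:
from pathlib import Path
import pandas as pd

# hypothetical base directory; Path handles the separators for the current OS
base = Path('/path/to/my/data')
filename = base / 'yelp_dataset' / 'review_100.json'

review_reader = pd.read_json(filename, lines=True, chunksize=10)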
I am trying to take the html of my website and see if it is the same as what I have on an offline version.
I have been researching this, and all I can find is either parsing or something that deals with only http://
So far I have this:
import urllib

url = "https://www.mywebsite.com/"

onlinepage = urllib.urlopen(url)
print(onlinepage.read())

offlinepage = open("offline.txt", "w+")
print(offlinepage.read())

if onlinepage.read() == offlinepage.read():
    print("same")  # for debugging
else:
    print("different")
This always says that they are the same, even when I put in a different website entirely.
When you first print your online and offline pages with these lines:
print(onlinepage.read())
print(offlinepage.read())
...you have now consumed all of the text in each file object. Subsequent reads on either object will return an empty string. Two empty strings are equal, therefore your if condition will always evaluate to True.
If you were purely working with files, you could seek to the beginning of both files and read again. Since there is no seek method on the file object from urlopen, you'll need to either re-fetch the page with a new urlopen command or, better, save the original text in a variable and use that for your subsequent comparisons:
online = onlinepage.read()
print(online)

offline = offlinepage.read()
print(offline)

...

if online == offline:
    ...
As others have noted, you can't read the request object twice (and can't read the file twice without seeking); once read, the data you got back is no longer available, so you need to store it.
But they missed another problem: You opened the file with mode w+. w+ allows both reading and writing, but, just like mode w, it truncates the file on open. So your local file is always empty when you read it, which means you're both corrupting the local file and never getting a match (unless the online file is empty too).
You need to use mode r+ or a+ to get a read/write handle that doesn't truncate the existing file (r+ requires that the file already exist, a+ does not, but puts the write position at end of file, and on some systems, all writes are put at the end of the file).
So fixing both bugs, you get:
import urllib

url = "https://www.mywebsite.com/"

# Using with statements properly for safe resource cleanup
with urllib.urlopen(url) as onlinepage:
    onlinedata = onlinepage.read()
    print(onlinedata)

with open("offline.txt", "r+") as offlinepage:  # DOES NOT TRUNCATE EXISTING FILE!
    offlinedata = offlinepage.read()
    print(offlinedata)

    if onlinedata == offlinedata:
        print("same")  # for debugging
    else:
        print("different")

    # I assume you want to rewrite the local page, or you wouldn't open with +
    # so this is what you'd do to ensure you replace the existing data correctly
    offlinepage.seek(0)      # Ensure you're seeked to beginning of file for write
    offlinepage.write(onlinedata)
    offlinepage.truncate()   # If online data smaller, don't keep offline extra data
You use .read() twice on each file.
>>> f.read()
'This is the entire file.\n'
>>> f.read()
''
"If the end of the file has been reached, f.read() will return an empty string ("")." (7.2.1 Docs).
Therefore, when two results are compared, they are equal because each is an empty string.
I have a huge HTML file that I have converted to a text file. (The file is the Facebook home page's source.) Assume the text file has a specific keyword in some places of it. For example: "some_keyword: [bla bla]". How would I print all the different bla blas that follow some_keyword?
{id:"1126830890",name:"Hillary Clinton",firstName:"Hillary"}
Imagine there are 50 different names with this format in the page. How would I print all the names that follow "name:", considering the text is very large and the program crashes when you read() it or try to search through its lines?
Sample File:
shortProfiles:{"100000094503825":{id:"100000094503825",name:"Bla blah",firstName:"Blah",vanity:"blah",thumbSrc:"https://scontent-lax3-1.xx.fbcdn.net/v/t1.0-1/c19.0.64.64/p64x64/10354686_10150004552801856_220367501106153455_n.jpg?oh=3b26bb13129d4f9a482d9c4115b9eeb2&oe=5883062B",uri:"https://www.facebook.com/blah",gender:2,i18nGender:16777216,type:"friend",is_friend:true,mThumbSrcSmall:null,mThumbSrcLarge:null,dir:null,searchTokens:["Bla"],alternateName:"",is_nonfriend_messenger_contact:false},"1347968857":
Based on your comment, since you are the person responsible for writing the data to the file, write the data in JSON format and read it back from the file using json.loads(), as:
import json

json_file = open('/path/to/your_file')
json_str = json_file.read()
json_data = json.loads(json_str)

for item in json_data:
    print(item['name'])
Explanation:
Let's say data is the variable storing
{id:"1126830890",name:"Hillary Clinton",firstName:"Hillary"}
which will be changing dynamically within your code, where you are performing the write operation to the file. Instead, append it to a list as:
a = []
for item in page_content:
    # data = some xy logic on HTML file
    a.append(data)
Now write this list to the file using json.dump().
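A minimal sketch of that write step, assuming `a` is the list built above and the path mirrors the read example:
import json

# 'a' is the list of profile dicts accumulated in the loop above
with open('/path/to/your_file', 'w') as f:
    json.dump(a, f)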
I just wanted to throw this out there, even though I agree with all the comments about just dealing with the HTML directly or using Facebook's API (probably the safest way). Open file objects in Python can be used as generators that yield lines without reading the entire file into memory, and the re module can be used to extract information from text.
This can be done like so:
import re
regex = re.compile(r"(?:some_keyword:\s\[)(.*?)\]")
with open("filename.txt", "r") as fp:
for line in fp:
for match in regex.findall(line):
print(match)
Of course this only works if the file is in a "line-based" format, but the end effect is that only the line you are on is loaded into memory at any one time.
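Applied to the sample data in the question, the same idea looks something like this (a sketch: the pattern assumes the lowercase name:"..." layout shown in the sample, and the filename is a placeholder):
import re

# pulls the value out of name:"..." pairs; case-sensitive, so firstName:"..." is not matched
name_re = re.compile(r'name:"(.*?)"')

with open("filename.txt", "r") as fp:
    for line in fp:
        for name in name_re.findall(line):
            print(name)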
Here are the Python 2 docs and the Python 3 docs for the re module.
I cannot find documentation which details the generator capabilities of file objects in Python; it seems to be one of those well-known secrets... Please feel free to edit and remove this paragraph if you know where in the Python docs this is detailed.
I have a database where some of the data is binary (blob datatype in MySQL); these are actually webpages that were scraped and gzipped. Now I want to extract them and write each record into a gzip file, which I'd assume to be doable - after all, they are gzipped data, right?
The question, however, is how I would do that. By searching I could find a million examples of how to write a gzip file from original data, not from already-gzipped data. Writing the gzipped string directly into a file doesn't result in a gzip file, not to mention I got a load of "ordinal not in range" exceptions.
Could you guys help? Thanks in advance. I'm a newbie to Python...
Edit: Here is the method I used:
def store_cache(self, content, news_id):
    if not content:
        return

    # some of the records may contain normal data (not gzipp-ed), hence this try block
    try:
        content = self.gunzip(content)
    except:
        return

    import gzip
    with gzip.open('static/cache/%s' % (self.base36encode(news_id), ), 'wb') as f:
        f.write(content)
        f.close()
This causes an exception:
<type 'exceptions.UnicodeEncodeError'> at /migrate
'ascii' codec can't encode character u'\u1edb' in position 186: ordinal not in range(128)
And this is the innermost traceback:
E:\Python27\lib\gzip.py in write
self.crc = zlib.crc32(data, self.crc) & 0xffffffffL
You said it yourself: extract them and then write them into a gzip file. There is nothing special about writing "from gzipped data": you un-gzip the data to get the original data, and then write the original data as if it were original data (because it is). The documentation shows you how to do these things.
However, gzip is just a compression format, not an archive format. It is not built to handle multiple files, so you must use something else to create a single file from the multiple inputs. Typically this is done by making a tar archive which is then gzipped. You can do this in Python using the tarfile module. Since your data will come from gzip-decompression streams, you will want to use the TarFile.addfile(tarinfo, fileobj) method to add them to the archive. You should be able to use the gzip.GzipFile instance as the fileobj to add this way.
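A minimal sketch of that approach; the `records` iterable, member names, and output path are hypothetical, and each blob is assumed to be one gzipped page from the BLOB column:
import gzip
import io
import tarfile
import time

def archive_records(records, out_path='cache.tar.gz'):
    # records: iterable of (news_id, blob) pairs pulled from the database
    with tarfile.open(out_path, 'w:gz') as tar:
        for news_id, blob in records:
            # un-gzip the blob to recover the original page bytes
            page = gzip.GzipFile(fileobj=io.BytesIO(blob), mode='rb').read()

            # addfile() copies exactly tarinfo.size bytes from the file object,
            # so the decompressed size has to be known up front
            info = tarfile.TarInfo(name='%s.html' % news_id)
            info.size = len(page)
            info.mtime = time.time()
            tar.addfile(info, io.BytesIO(page))
The result is a single .tar.gz containing one member per record, which is the usual way to pack many decompressed inputs into one gzipped file.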