Can't convert base64 images from csv to file in Python

I know this looks like it has been answered before, but I can't seem to find a solution for this issue. I have a CSV file that contains very long Base64-encoded image strings (~5 MB each). I raised the CSV field size limit to the maximum. Each row holds several encoded images in separate columns, followed by a few values that are only a couple of words long. I can read the short values no problem, e.g. with print(row[7]). The image columns, however, won't print their Base64 strings, and when I try to decode them and save them to the filesystem the files end up empty. Any thoughts?
fh = open("~path~/image.png", "wb")   # open the output file in binary mode
x = base64.b64decode(row[1])          # decode the Base64 column
fh.write(x)
fh.close()
Thanks for any help!
EDIT: Works now. CSV splitting in Python behaves a little differently than in Java. The empty values came up because the CSV was saved differently than the exporting tool indicated, leaving rows like ("8",,"data:image/png;base64,IR0BRR....",...). I hadn't caught the empty column before, which is why the output was blank. I had also been prepending the data:image/png part to the string myself, because I assumed Python's string split would break on the comma after base64 the way Java's does. After adjusting for both, the image saves correctly to my filesystem.
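For anyone hitting the same thing, here is a minimal sketch of the adjusted flow. The file name, the column index, and the data-URI prefix handling below are assumptions based on the description above, not the original code:

import base64
import csv
import sys

csv.field_size_limit(sys.maxsize)  # allow very large Base64 fields

with open("images.csv", newline="") as f:      # hypothetical input file
    for i, row in enumerate(csv.reader(f)):
        cell = row[2]                          # hypothetical image column
        if not cell:                           # skip the empty columns described above
            continue
        # Strip a data-URI prefix such as "data:image/png;base64," if present.
        if cell.startswith("data:"):
            cell = cell.split(",", 1)[1]
        with open(f"image_{i}.png", "wb") as fh:
            fh.write(base64.b64decode(cell))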

Related

Edit a few lines of uncompressed PDF in Python

I want to edit a few lines in an uncompressed pdf.
I found a similar problem, but since I need to scan the file a few times to get the exact positions of the lines I want to change, that approach doesn't really fit (and the sheer number of regex matches is more than desired).
The pdf contains utf-8-decodable lines (a few of which I want to edit, bookmark target ids in particular) and a lot of binary blobs (images and so on, I guess).
When I edit the file with Notepad it works fine, but when I do it programmatically (reading it in, changing a few lines, writing it back), images and some formatting go missing, since those parts are never read in in the first place (the ignore option):
with codecs.open("merged-uncompressed.pdf", "r", encoding='ascii', errors='ignore') as f:
I can read the file in with errors="surrogateescape" and wanted to map the lines from the import above, but I don't know if this approach can work.
Does anyone know a way how to deal with this?
Best, Lukas
I was able to solve this:
- read the file as binary
- marked the lines which couldn't be decoded as utf-8
- copied the list line by line to a temporary list (non-decodable lines were copied as a placeholder 'None\n')
- then went back and did the searching on the copied list, so I got the lines I wanted to replace
- replaced those lines in the original binary list (same indices!)
- wrote it back to file
The resulting pdf was a bit corrupted because of whitespace before the target ids of the bookmarks, but recompressing with qpdf fixed it :)
The code is very messy at the moment, so I don't want to publish it right now, but I want to add it to GitHub within the next few weeks.
If anyone needs it: just comment and it will get more priority.
Thanks to anyone who wanted to help :)
Lukas
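For reference, here is a minimal sketch of the steps above. It is not the unpublished code; the file names and the bookmark-matching rule are assumptions:

with open('merged-uncompressed.pdf', 'rb') as f:
    raw_lines = f.read().split(b'\n')   # treat the file as a list of binary lines

# Shadow list: decodable lines as text, everything else as a placeholder.
shadow = []
for line in raw_lines:
    try:
        shadow.append(line.decode('utf-8'))
    except UnicodeDecodeError:
        shadow.append(None)             # placeholder for binary blobs

# Search the text shadow; replace at the same indices in the binary list.
for i, text in enumerate(shadow):
    if text is not None and 'old-bookmark-id' in text:   # hypothetical match rule
        raw_lines[i] = text.replace('old-bookmark-id', 'new-bookmark-id').encode('utf-8')

with open('merged-edited.pdf', 'wb') as f:
    f.write(b'\n'.join(raw_lines))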

Using RegEx and Reading files inside of a .EGG File?

import re
import zipfile

newemail = 'test@gmail.com'

# Read settings.py out of the egg (an egg is just a zip archive)
# and decode the bytes to str so the regex can work on it.
egg = zipfile.ZipFile('C:\\Users\\myname\\Desktop\\TEST\\Tool\\scraper-1.11-py3.6.egg')
file = egg.open('scraping_tool/settings.py')
text = file.read().decode('utf8')
file.close()
egg.close()

# Substitute the email, then encode back to bytes for writing.
emailregex = re.compile(r'[A-Za-z0-9-.]+@[A-Za-z0-9-.]+')
newtext = emailregex.sub(newemail, text)
newtext = newtext.encode('utf8')

# Caution: mode 'w' truncates the archive, so every member other
# than the one written below is lost.
egg = zipfile.ZipFile('C:\\Users\\myname\\Desktop\\TEST\\Tool\\scraper-1.11-py3.6.egg', 'w')
file = egg.open('scraping_tool/settings.py', 'w')
file.write(newtext)
file.close()
egg.close()
I'm a week into programming, so let me know if anything I say doesn't make sense. The objective I'm trying to achieve right now is getting the email out of a .py file inside an egg file.
In the interactive shell I was able to successfully retrieve the text with txt = file.read(), but once I start getting match objects and regex involved I get errors like "can not get strings from byte objects".
I tried reading Stack Overflow questions about the errors, but I'm still too new to decipher what they are talking about and might need it dumbed down a bit more. I understand the zip file is messing up how regex works with strings, but I'm not sure how to fix it.
EDIT: Bonus Question about encoding
When you .read() an entry from a zipfile, you get bytes. There is no automatic detection of whether a zip entry is a binary file or a text file; you have to make that decision yourself.
In order to convert bytes into string, you must decode them, which requires knowledge of the text encoding these files have been saved in. If you don't know for sure, UTF-8 (utf8) and Windows-1252 (cp1252) are common encodings to try. You'll know that you've picked the right encoding when special/accented characters look right in the result:
txt = file.read().decode('utf8')
print(txt)
Once you're working with strings, the "can not get strings from byte objects" error won't occur anymore.
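To sketch the whole round trip under those assumptions: note that opening a zip file in 'w' mode truncates it, so all members are copied to the rewritten archive, not just the edited one. The paths and the member name are the ones from the question:

import re
import zipfile

egg_path = 'C:\\Users\\myname\\Desktop\\TEST\\Tool\\scraper-1.11-py3.6.egg'
member = 'scraping_tool/settings.py'
newemail = 'test@gmail.com'
emailregex = re.compile(r'[A-Za-z0-9-.]+@[A-Za-z0-9-.]+')

# Read every member, patching the one we care about.
items = []
with zipfile.ZipFile(egg_path) as egg:
    for info in egg.infolist():
        data = egg.read(info.filename)
        if info.filename == member:
            text = data.decode('utf8')                            # bytes -> str
            data = emailregex.sub(newemail, text).encode('utf8')  # str -> bytes
        items.append((info, data))

# Rewrite the archive with all members, edited and untouched alike.
with zipfile.ZipFile(egg_path, 'w') as egg:
    for info, data in items:
        egg.writestr(info, data)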

Python csv package - issue with DictReader module

I'm having a curious issue with the csv package in Python 3.7.
I'm importing a csv file and am able to access the whole file as expected, with one exception: the header row, as stored in the "fieldnames" object, appears to have the first column header (the first item in fieldnames) malformed.
This first field always has the format: 'xxx"header"'
where:
xxx are garbage characters that always seem to be the same
header is the correct header text
A screenshot of my table <csv.DictReader> object from my debug window shows the malformed field.
My code to open the file follows. I added the headers[0] = table.fieldnames[0].split('"')[1] line in order to extract the correct header and place it back into fieldnames.
import csv

with self.inputfile.open() as self.inputfid:
    table = csv.DictReader(self.inputfid, delimiter=',')
    headers = table.fieldnames
    headers[0] = table.fieldnames[0].split('"')[1]
(Note: self.inputfile is a pathlib.Path object)
I didn't notice this for a long time because I wasn't using the first column (with the # header) - I've been happily parsing with the rest of the columns for a while on multiple files.
If I look directly at the csv in a text editor, there doesn't appear to be any issue.
Questions:
Does anyone know what the issue is? Is there anything I can try to correct the import issue?
If there isn't a fix, is there a better way to parse the garbage? I realize this could clear up in the future, but I think the split will still work even with just bare double quotes (the header should still be the 2nd item in the split, right?). Is there a better solution?
It looks like your csv file is encoded as utf-8-sig - a version of utf-8 used by some Windows applications, but it's being decoded as cp1252 - another encoding in common use on Windows.
>>> print('"#"'.encode('utf-8-sig').decode('cp1252'))
ï»¿"#"
The "garbage" characters preceding the header are the byte-order-mark that utf-8-sig uses to tell Windows applications that a file is encoded as utf-8 rather than one of the historically more common 8-bit encodings.
To avoid the "garbage", specify utf-8-sig as the encoding when opening your file.
The code in the question could be modified to work like this:
import csv

encoding = 'utf-8-sig'
with self.inputfile.open(encoding=encoding, newline='') as self.inputfid:
    table = csv.DictReader(self.inputfid, delimiter=',')
    headers = table.fieldnames
    ...
If - as seems likely - the encoding of input files may vary, the value of encoding (or a best guess) must be determined using a tool like chardet, as discussed in the comments.
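For example, a best guess could be taken from a sample of the raw bytes, falling back to utf-8-sig; the sample size and the fallback here are choices, not requirements:

import chardet

with self.inputfile.open('rb') as f:
    sample = f.read(4096)   # a few KB is usually enough for a guess
encoding = chardet.detect(sample)['encoding'] or 'utf-8-sig'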

ValueError Reading large data set with pd.read_json

I am working through a set of code exercises that use a Yelp reviews dataset. At this point in the exercises I am supposed to read in review.json, which has one JSON record per line. I have made a smaller version of the JSON file, with only 100 records, for testing.
I can read the entire test file into a pandas dataframe and examine it.
The complete dataset file, however, has about 6 million lines. The recommendation is to use chunksize and build a json reader. I'm hitting errors, even with my test input.
My code currently looks like this
path = 'file://localhost/Users/.../DSC_Intro/'
filename = path + 'yelp_dataset/review_100.json'
# create a reader to read in chunks
review_reader = pd.read_json(StringIO(filename), lines=True, chunksize=10)
type(review_reader)
The type call returns
pandas.io.json.json.JsonReader
which looks good.
Then I try
for chunk in review_reader:
    print(chunk)
as referenced in pandas user guide
and I get an error:
ValueError: Unexpected character found when decoding 'false'
Update - it has been suggested that the issue is caused by embedded (quoted) "\n" characters in the data file, i.e. that pandas is seeing the JSON records not as one per line but as spanning multiple lines.
The error message is VERY opaque, if that's the case. Also, with 6 million lines, how should I tell pd.read_json to ignore quoted "\n" sequences and only break records on actual newlines in the data?
Update
It's been suggested that I fix my typo (it was a typo in this post, not a typo in my code) and use a Unix file path instead of a URL (JSON doesn't care: see docs).
When I do this but keep StringIO(), I get a different ValueError.
When I do this and remove StringIO(), the code works.
This seems very fragile. :-(
Note The tutorial has an answer key. I've tried that code. The answer key uses
review_reader = pd.read_json(filename, lines=True, chunksize=10)
which throws the TypeError
sequence item 0: expected str instance, bytes found
Adding StringIO() seems to have solved that.
Input: a sample JSON record, one per line of the input file:
{"review_id":"Amo5gZBvCuPc_tZNpHwtsA","user_id":"DzZ7piLBF-WsJxqosfJgtA","business_id":"qx6WhZ42eDKmBchZDax4dQ","stars":5.0,"useful":1,"funny":0,"cool":0,"text":"Our family LOVES the food here. Quick, friendly, delicious, and a great restaurant to take kids to. 5 stars!","date":"2017-03-27 01:14:37"}
Firstly, your
path 'file://localhost/Users/.../DSC_Intro/'
is not valid python code. If you try to execute that as such, you will get an invalid syntax error. I assume, then, that this is just showing the value of the path variable. I don't know whether those ellipses are literal or the result of your environment truncating the display of path. I'll assume that your path is a valid file URL for your system, as it doesn't seem germane here to consider an incorrect path.
Either way, yes, read_json can read json from a file URL as you're specifying there (I learned something there) if you read it in one go:
pd.read_json(fname, lines=True)
But if you try to create a reader from this, by specifying
pd.read_json(fname, lines=True, chunksize=...)
then you get
TypeError: sequence item 0: expected str instance, bytes found
Secondly, yes, wrapping your file-like argument with StringIO makes this error go away, but it isn't helping in the way you might think, and its use is based on a misreading of the pandas docs you point to.
I'll quote a couple of bits from the read_json doc here:
Signature: pd.read_json(
path_or_buf=None, ...
path_or_buf : a valid JSON string or file-like, default: None
The string could be a URL. Valid URL schemes include http, ftp, s3,
gcs, and file. For file URLs, a host is expected. For instance, a local
file could be file://localhost/path/to/table.json
So with read_json, you can either give it an actual string that is valid JSON, or you can give it a file-like object that points to a file that contains JSON.
Notice in the pandas docs that you cite:
In [258]: jsonl = '''
.....: {"a": 1, "b": 2}
.....: {"a": 3, "b": 4}
.....: '''
.....:
is JSON, not a path. When their example then does:
df = pd.read_json(jsonl, lines=True)
it is merely parsing the JSON in the string - no files are involved here.
When it then wants to demonstrate reading from a file in chunks, it does
# reader is an iterator that returns `chunksize` lines each iteration
In [262]: reader = pd.read_json(StringIO(jsonl), lines=True, chunksize=1)
In other words, they are wrapping a JSON string, not a path, by StringIO(). This is just for the purposes of the documented example, so you can see that if you treated the JSON string as if it were being read from a file you can read it in chunks. That's what StringIO() does. So when you wrap the string that describes your file URL in StringIO(), I expect that read_json is then trying to interpret that string as JSON that's being read from a file and parse it. It understandably falls over because it isn't JSON.
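To make the distinction concrete, here is a self-contained version of that docs example; the jsonl string is the thing being parsed, and StringIO only makes it readable like a file:

from io import StringIO
import pandas as pd

jsonl = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}'
for chunk in pd.read_json(StringIO(jsonl), lines=True, chunksize=1):
    print(chunk)   # one single-row DataFrame per iteration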
This brings us back to why read_json cannot read your file URL in chunks. I don't have an immediate good answer to that. I suspect it lies in the internals of how read_json opens file URLs, or what function underlies this. If you were intent upon, or forced to, do this chunking from a file URL then I suspect you'd be looking at controlling the mode in which the file is opened, or perhaps somehow providing explicit guidance to read_json how to interpret the bytestream it gets. Libraries such as urllib2 may be useful here, I'm not sure.
But let's cut to the best fix here. Why are we trying to specify the path as a file URL? Simply specify your path as an OS path, e.g.
path = '/path/to/my/data/'
and then
filename = path + 'yelp_dataset/review_100.json'
# create a reader to read in chunks
review_reader = pd.read_json(filename, lines=True, chunksize=10)
And I betcha it works as intended! (It does for me, as it always has).
Caveat: Windows doesn't use forward-slash path delimiters, and constructing paths by concatenating strings as above can be fragile, but usually if you use 'proper' forward-slash delimiters (smile), decent languages internally understand that. It's constructing paths using backslashes that is guaranteed to cause you pain. But just keep an eye on that.
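If you want to sidestep that entirely, pathlib builds paths with the right delimiter for the platform, and read_json accepts path objects directly; the directory below is hypothetical:

from pathlib import Path
import pandas as pd

path = Path('/path/to/my/data')
filename = path / 'yelp_dataset' / 'review_100.json'
review_reader = pd.read_json(filename, lines=True, chunksize=10)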

Reading bytes from wave file python

I'm working on a little audio project, and part of it requires using wave files and flac files. I'm trying to figure out how to read the metadata in each and how to add tags manually, and I'm having trouble figuring out how to read the bytes as they are.
I have been referencing this page and a couple of others to see the full format of a Wave file; however, for some wave files I get discrepancies. I want to be able to see the hexadecimal bytes in order to see what differences are occurring.
Simply using open('fname', 'rb') and read() only returns the bytes as a bytes string. Using struct.unpack has worked for some wave files, but it is limited to printing as strings, ints, or shorts, and I can't see exactly what is going wrong when I use it. Is there any other way I can read this file in hex?
Thanks
I assume that you just want to display the content of a binary file in hexadecimal. First, you do not need Python for that, as some editors do this natively, for example Vim.
Now, assuming you have the bytes you got by reading the file, you can easily turn them into a list of hexadecimal values:
with open('fname', 'rb') as fd:   # open the file in binary mode
    data = fd.read(16)            # read 16 bytes from it
h = [hex(b) for b in data]        # in Python 3, iterating over bytes yields ints
print(h)                          # prints a list of hexadecimal codes of the read bytes
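On Python 3.5 and later there is also the bytes.hex() method, which renders a whole buffer as a single hex string:

with open('fname', 'rb') as fd:
    print(fd.read(16).hex())   # e.g. '52494646...' for a RIFF/WAVE header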
