I am trying to load email messages that I copied into RTF files (as my training data).
I load the directory containing the files using the sklearn function:
sklearn.datasets.load_files
corpus = sklearn.datasets.load_files(<path>, shuffle=False)
When I attempt to print corpus.data, the first 6000 characters or so are \x00\x00\x00\x01Bud1\x00\x00\x10\x00\x00\x00\x08. Then the actual message text is displayed but intertwined are characters such as: \cf0 \expnd0\expndtw0\kerning0\nHey,\\ in the middle of the text.
I do want to mention that some of the text has German characters as well as English.
What could be the problem here?
In the documentation for this function it says:
If you leave encoding equal to None, then the content will be made of bytes instead of Unicode, and you will not be able to use most functions in sklearn.feature_extraction.text.
Without knowing the encoding of your files, you might want to try:
sklearn.datasets.load_files(<path>, shuffle=False, encoding='utf-8')
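If some of the files are not valid UTF-8 (likely, given the mixed German/English text), load_files also accepts a decode_error argument so that undecodable bytes are replaced instead of raising an exception. A minimal sketch, keeping the same <path> placeholder as above:

import sklearn.datasets

# Decode every file as UTF-8; any byte sequence that is not valid UTF-8
# is replaced with U+FFFD instead of raising UnicodeDecodeError.
corpus = sklearn.datasets.load_files(<path>,
                                     shuffle=False,
                                     encoding='utf-8',
                                     decode_error='replace')

Note that this only addresses the decoding; it won't remove the RTF control words from the text itself.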
I'm using Python 2.7.6 on Windows and I'm using the tarfile module to extract a gzipped tar file. The mode option of tarfile.open() is set to "r:gz". After the open call, if I print the contents of the archive via TarFile.list(), I see the following directory in the list:
./静态分析 Part 1.v1/
However, after I call tarfile.extractall(), I don't see the above directory in the extracted list of files; instead I see this:
é™æ€åˆ†æž Part 1.v1/
If I were to extract the archive via 7zip, I see a directory with the same name as the first item above. So, clearly, the extractall() method is screwing up, but I don't know how to fix this.
I learned that tar doesn't retain encoding information as part of the archive and treats filenames as raw byte sequences. So the output I saw from tarfile.extractall() was simply the raw byte sequence that comprised the file's name prior to compression. In order to get the extractall() method to recreate the original filenames, I discovered that you have to manually convert the members of the TarFile object to the appropriate encoding before calling extractall(). In my case, the following did the trick:
import tarfile

modeltar = tarfile.open(zippath, mode="r:gz")

# Re-decode each member's name as UTF-8 before extracting, so the
# extracted paths use the original (Unicode) filenames.
updatedMembers = []
for m in modeltar.getmembers():
    m.name = unicode(m.name, 'utf-8')
    updatedMembers.append(m)

modeltar.extractall(members=updatedMembers, path=dbpath)
The above code is based on this superuser answer: https://superuser.com/a/190786/354642
I'm attempting to automate some ID3 tagging with Mutagen, but whenever I attempt to insert unicode characters I have them replaced by question marks.
The smallest test code that reproduces this is as follows:
# -*- coding: utf-8 -*-  (declaration needed under Python 2 for the non-ASCII literal)
from mutagen.id3 import ID3, TALB

audio = ID3()
audio['TALB'] = TALB(encoding=3, text=u'test祥さtest')
audio.save('test.mp3', v1=2)
When run, test.mp3's album tag shows up as test??test in both my file manager and music player. If I manually enter unicode tags via the file manager the unicode characters display normally without issue.
Things I have already tried in order to fix this problem:
Trying both with and without the u string prefix
Using the alternate Mutagen tagging syntax (audio.add(TALB(encoding=3, text=u'test祥さtest')))
I'm using the v1=2 argument for the save function because leaving it out results in around half the files not having their tags written (with Unicode still being output as question marks), and other values refuse to write ID3 tags for any files.
I'm using Windows 10 64-bit. My Python environments are Anaconda3 (Python 3.4) and Python 2.7; both produce the same problem with the same code.
I think the main problem is with the way you are checking whether the tags are correct. Let me explain.
For me, this code works:
from mutagen.id3 import ID3, TALB
audio = ID3()
audio['TALB'] = TALB(encoding=3, text=u'test祥さtest')
audio.save("test.mp3",v1=0)
Checking the file in a text editor shows the tags correctly written in unicode.
So why can't you see the tags? Likely because mutagen defaults to writing ID3v2.4 tags, which neither Windows File Explorer nor any of the standard Windows media players will read. However, when you added the v1=2 argument you forced mutagen to also write ID3v1 tags. These are readable by File Explorer but unfortunately do not support Unicode, which is why you are seeing the question marks instead. So when you want to use Unicode, it is useful to add v1=0 (as I have done) to prevent any ID3v1 tags from being written and distracting from the main issue of getting the ID3v2 tags working.
So now move to ID3v2.3 instead of ID3v2.4 and see if that helps:
from mutagen.id3 import ID3, TALB
audio = ID3()
audio.update_to_v23()
audio['TALB'] = TALB(encoding=3, text=u'test祥さtest')
audio.save("test.mp3",v1=0,v2_version=3)
Finally, the best way to see what tags are really in the file is to use a dedicated tag editor which comprehensively follows the spec, like Mp3tag. This helps to find out if the problem is how you are writing the tags, or how your player is reading them.
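As a quick sanity check that doesn't depend on File Explorer at all, you can also ask mutagen itself what it reads back. A minimal sketch:

from mutagen.id3 import ID3

tags = ID3("test.mp3")    # re-read the tags we just wrote
print(tags["TALB"].text)  # expected: [u'test祥さtest']

If this prints the right text, the tags are being written correctly and any remaining problem is in whatever is reading them.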
I am having trouble with variable text encoding when opening text files to find a match in the files' contents.
I am writing a script to scan the file system for log files with specific contents in order to copy them to an archive. The names are often changed, so the contents are the only way to identify them. I need to identify *.txt files and find within their contents a string that is unique to these particular log files.
I have the code below that mostly works. The problem is the logs may have their encoding changed if they are opened and edited. In this case, Python won't match the search term to the contents because the contents are garbled when Python uses the wrong encoding to open the file.
import os
import codecs

# Filepaths to search
FILEPATH = "SomeDrive:\\SomeDirs\\"
# Text to match in file names
MATCH_CONDITION = ".txt"
# Text to match in file contents
MATCH_CONTENT = "--------Base Data Details:--------------------"

for root, dirs, files in os.walk(FILEPATH):
    for f in files:
        if MATCH_CONDITION in f:
            print "Searching: " + os.path.join(root, f)

            # ATTEMPT A -
            # matches only text files re-encoded as ANSI,
            # UTF-8, UTF-8 no BOM
            # search_file = open(os.path.join(root, f), 'r')

            # ATTEMPT B -
            # matches text files output from Trimble software
            # "UCS-2 LE w/o BOM", also "UCS-2 Little Endian" -
            # (same file resaved using Windows Notepad)
            search_file = codecs.open(os.path.join(root, f), 'r', 'utf_16_le')

            file_data = search_file.read()
            if MATCH_CONTENT in file_data:
                print "CONTENTS MATCHED: " + f
            search_file.close()
I can open the files in Notepad++, which detects the encoding. Using Python's regular open() does not automatically detect the encoding. I can use codecs.open and specify the encoding to catch a single encoding, but then I have to write excess code to catch the rest. I've read the Python codecs module documentation and it seems to be devoid of any automatic detection.
What options do I have to concisely and robustly search any text file with any encoding?
I've read about the chardet module, which seems good but I really need to avoid installing modules. Anyway, there must be a simpler way to interact with the ancient and venerable text file. Surely as a newb I am making this too complicated, right?
Python 2.7.2, Windows 7 64-bit. Probably not necessary, but here is a sample log file.
EDIT:
As far as I know the files will almost surely be in one of the encodings in the code comments: ANSI, UTF-8, UTF_16_LE (as UCS-2 LE w/o BOM; UCS-2 Little Endian). There is always the potential for someone to find a way around my expectations...
EDIT:
While using an external library is certainly the sound approach, I've taken a stab at writing some amateurish code to guess the encoding and solicited feedback in another question -> Pitfalls in my code for detecting text file encoding with Python?
The chardet package exists for a reason (and was ported from some older Netscape code, for a similar reason): detecting the encoding of an arbitrary text file is tricky.
There are two basic alternatives:
Use some hard-coded rules to determine whether a file has a certain encoding. For example, you could look for the UTF byte-order mark (BOM) at the beginning of the file. This breaks for encodings that overlap significantly in their use of different bytes, or for files that don't happen to use the "marker" bytes that your detection rules expect (a minimal sketch of this approach follows the second alternative).
Take a database of files in known encodings and count up the distributions of different bytes (and byte pairs, triplets, etc.) in each of the encodings. Then, when you have a file of unknown encoding, take a sample of its bytes and see which pattern of byte usage is the best match. This breaks when you have short test files (which makes the frequency estimates inaccurate), or when the usage of the bytes in your test file doesn't match the usage in the file database you used to build up your frequency data.
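As an illustration of the first alternative, here is a minimal BOM sniffer using only the standard library. It deliberately returns None when no marker is present, since many files simply won't carry one:

import codecs

def guess_by_bom(path):
    # Rule-based detection: inspect the first bytes for a known
    # Unicode byte-order mark.
    with open(path, 'rb') as fh:
        head = fh.read(4)
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    if head.startswith(codecs.BOM_UTF16_LE):
        return 'utf-16-le'   # note: a UTF-32-LE BOM would also match here
    if head.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16-be'
    return None  # no marker; fall back to another method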
The reason Notepad++ can do encoding detection (as can web browsers, word processors, etc.) is that these programs all have one or both of these methods built in. Python doesn't build this into its interpreter -- it's a general-purpose programming language, not a text editor -- but that's just what the chardet package does.
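For completeness, using the package takes only a few lines once it is installed (the detect result shown is illustrative, and logpath is a stand-in for one of your files):

import chardet

raw = open(logpath, 'rb').read()
guess = chardet.detect(raw)  # e.g. {'encoding': 'UTF-16', 'confidence': 0.99}
text = raw.decode(guess['encoding'])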
I would say that because you know some things about the text files you're handling, you might be able to take a few shortcuts. For example, are your log files all in one of either encoding A or encoding B? If so, then your decision is much simpler, and probably either the frequency-based or the rule-based approach above would be pretty straightforward to implement on your own (see the sketch below). But if you need to detect arbitrary character sets, I'd highly recommend building on the shoulders of giants.
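Since your edit lists the expected encodings, a hand-rolled fallback loop is short. This is only a sketch: it assumes "ANSI" means cp1252, and it tries cp1252 last because that codec decodes almost any byte sequence and so acts as the catch-all:

CANDIDATE_ENCODINGS = ['utf_8_sig', 'utf_16_le', 'cp1252']

def contents_match(path, needle):
    raw = open(path, 'rb').read()
    for enc in CANDIDATE_ENCODINGS:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue
        if needle in text:
            return True
    return False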
I snagged a Lorem Ipsum generator last week, and I admit, it's pretty cool.
My question: can someone show me a tutorial on how the author of the above script was able to embed the contents of a gzipped file in their code as a string? I keep finding examples of gzipping a regular file, and I'm feeling kind of lost here.
For what it's worth, I have another module that is quite similar (it generates random names, companies, etc), and right now it reads from a couple different text files. I like this approach better; it requires one less sub-directory in my project to place data into, and it also presents a new way of doing things for me.
I'm quite new to streams, IO types, and the like. Feel free to dump the links in my lap. Snippets are always appreciated too.
Assuming you are in a *nix environment, you just need gzip and a Base64 encoder to generate the string. Let's assume your content is in file.txt; for the purposes of this example, I created a file with that specific name containing random bytes.
So you need to compress it first:
$ gzip file.txt
That will generate a file.txt.gz file that you now need to embed into your code. To do that, you need to encode it. A common way to do so is to use Base64 encoding, which can be done with the base64 program:
$ base64 file.txt.gz
H4sICGmHsE8AA2ZpbGUudHh0AAGoAFf/jIMKME+MgnEhgS4vd6SN0zIuVRhsj5fac3Q1EV1EvFJK
fBsw+Ln3ZSX7d5zjBXJR1BUn+b2/S3jHXO9h6KEDx37U7iOvmSf6BMo1gOJEgIsf57yHwUKl7f9+
Beh4kwF+VljN4xjBfdCiXKk0Oc9g/5U/AKR02fRwI+zYlp1ELBVDzFHNsxpjhIT43sBPklXW8L5P
d8Ao3i2tQQPf2JAHRQZYYn3vt0tKg7drVKgAAAA=
Now you have everything you need to use the contents of that file in your Python script:
from cStringIO import StringIO
from base64 import b64decode
from gzip import GzipFile

# this is the variable with your file's contents
gzipped_data = """
H4sICGmHsE8AA2ZpbGUudHh0AAGoAFf/jIMKME+MgnEhgS4vd6SN0zIuVRhsj5fac3Q1EV1EvFJK
fBsw+Ln3ZSX7d5zjBXJR1BUn+b2/S3jHXO9h6KEDx37U7iOvmSf6BMo1gOJEgIsf57yHwUKl7f9+
Beh4kwF+VljN4xjBfdCiXKk0Oc9g/5U/AKR02fRwI+zYlp1ELBVDzFHNsxpjhIT43sBPklXW8L5P
d8Ao3i2tQQPf2JAHRQZYYn3vt0tKg7drVKgAAAA=
"""

# decode the file's content from the string and unzip it
orig_file_desc = GzipFile(mode='r',
                          fileobj=StringIO(b64decode(gzipped_data)))

# read the original file's content into a variable
orig_file_cont = orig_file_desc.read()

# and close the file descriptor
orig_file_desc.close()
Obviously, your program will depend on the base64, gzip, and cStringIO Python modules.
I'm not sure exactly what you're asking, but here's a stab...
The author of lipsum.py has included the compressed data inline in their code as chunks of Base64 encoded text. Base64 is an encoding mechanism for representing binary data using printable ASCII characters. It can be used for including binary data in your Python code. It is more commonly used to include binary data in email attachments...the next time someone sends you a picture or PDF document, take a look at the raw message and you'll see very much the same thing.
Python's base64 module provides routines for converting between base64 and binary representations of data...and once you have the binary representation of the data, it doesn't really matter how you got it, whether by reading it from a file or by decoding a string embedded in your code.
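For instance, a quick round trip through the module (Python 2 syntax, matching the rest of this answer):

import base64

encoded = base64.b64encode('\x00\x01binary\xff')  # printable ASCII text
decoded = base64.b64decode(encoded)               # the original bytes again
assert decoded == '\x00\x01binary\xff'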
Python's gzip module can be used to decompress data. It expects a file-like object...and Python provides the StringIO module to wrap strings in the right set of methods to make them act like files. You can see that in lipsum.py in the following code:
sample_text_file = gzip.GzipFile(mode='rb',
                                 fileobj=StringIO(base64.b64decode(DEFAULT_SAMPLE_COMPRESSED)))
This is creating a StringIO object containing the binary representation of the base64 encoded value stored in DEFAULT_SAMPLE_COMPRESSED.
All the modules mentioned here are described in the documentation for the Python standard library.
In general, I wouldn't recommend including data inline in your code like this unless the data is small and relatively static. Otherwise, package it up as a data file inside your Python package, which makes it easier to edit and track changes.
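A minimal sketch of that approach (the package name and the data path here are hypothetical):

import gzip
import pkgutil
from StringIO import StringIO

# Ship sample.txt.gz as a data file inside your package instead of
# pasting its Base64 form into the source.
raw = pkgutil.get_data('mypackage', 'data/sample.txt.gz')
sample = gzip.GzipFile(fileobj=StringIO(raw)).read()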
Have I answered the right question?
How about this: it zips and encodes a string, prints the encoded result, then decodes and unzips it again.
from StringIO import StringIO
import base64
import gzip

contents = 'The quick brown fox jumps over the lazy dog'

# compress the string into an in-memory "file"
zip_text_file = StringIO()
zipper = gzip.GzipFile(mode='wb', fileobj=zip_text_file)
zipper.write(contents)
zipper.close()

# encode the compressed bytes as printable text
enc_text = base64.b64encode(zip_text_file.getvalue())
print enc_text

# decode and decompress to recover the original string
sample_text_file = gzip.GzipFile(mode='rb',
                                 fileobj=StringIO(base64.b64decode(enc_text)))
DEFAULT_SAMPLE = sample_text_file.read()
sample_text_file.close()
print DEFAULT_SAMPLE
Old question, but I had to do this recently for AWS logs. In Python 3, use BytesIO instead of StringIO:

import base64
import gzip
from io import BytesIO

DEFAULT_SAMPLE_COMPRESSED = "Some base64-encoded and gzip-compressed string"

sample_text_file = gzip.GzipFile(
    mode='rb',
    fileobj=BytesIO(base64.b64decode(DEFAULT_SAMPLE_COMPRESSED))
)
binary_text = sample_text_file.read()  # the decompressed content, still bytes
text = binary_text.decode()  # decode the bytes into a str
I am trying to parse some .txt files. These files serve as containers for a variable number of 'child' files that are set off, or identified, within the container by SGML tags. With Python I can easily separate the child files. However, I am having trouble writing the binary content back out as a binary file (say a GIF or JPG). I am assuming that my problem is that I am reading the original .txt file using open(filename, 'r'), but that seems to be the only option for finding the SGML tags to split the file.
I would appreciate any help to identify some relevant reading material.
I appreciate the suggestions, but I am still struggling with the most basic questions. For example, when I open the file with WordPad and scroll down to the section tagged as a GIF, I see this:
<FILENAME>h65803h6580301.gif
<DESCRIPTION>GRAPHIC
<TEXT>
begin 644 h65803h6580301.gif
M1TE&.#EA(P)I`=4#`("`#,#`P$!`0+^_OW]_?_#P\*"#H.##X-#0T&!#8!`0
M$+"PL"`#('!P<)"0D#`P,%!04#\_/^_O[Y^?GZ^OK]_?WX^/C\_/SV]O;U]?
I can handle finding the section easily enough, but where does the gif file begin? Does the header start with the 644, the blanks after the word begin, or the line beginning with M1TE?
Next, when the file is read into Python, does it do anything to the binary content that has to be undone when it is written back out?
I can find the lines where the graphics begin:
filerefbin = open('myfile.txt', 'rb')
wholeFile = filerefbin.read()

import re
graphicReg = re.compile('<DESCRIPTION>GRAPHIC')
locationGraphics = graphicReg.finditer(wholeFile)

graphicsTags = []
for match in locationGraphics:
    graphicsTags.append(match.span())
I can easily use the same process to get to the word begin, or to identify the filename and get to the end of the filename in the 'first' line. I have also successfully gotten to the end of the embedded gif file. But I can't seem to write out the correct combination of things, so when I double-click on h65803h6580301.gif after it has been isolated and saved, I never get to see the graphic.
Interestingly, when I open the file in rb mode, the line endings appear to still be present, even though they don't seem to have any effect in Notepad. So that is clearly one of my problems; I might need to use readlines and join the lines together after stripping out the \n.
I love this site and I love PYTHON
This was too easy once I read bendin's post. I just had to snip the section that began with the word begin, save it in a .txt file, and then run the following commands:
import uu
uu.decode(r'c:\test2.txt',r'c:\test.gif')
I have to work with some other stuff for the rest of the day, but I will post more here as I look at this more closely. The first thing I need to figure out is how to use something other than a file: since I read the whole .txt file into memory and clipped out the section containing the image, I need to work with the clipped section directly instead of writing it out to test2.txt. I am sure that can be done; it's just a matter of figuring out how.
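For the record, uu.decode accepts file objects as well as filenames, so the in-memory version is short. A sketch, where clipped_section is a hypothetical string holding just the begin 644 ... end block sliced out of wholeFile:

import uu
from StringIO import StringIO

# Decode the uuencoded block entirely in memory...
out_mem = StringIO()
uu.decode(StringIO(clipped_section), out_mem)

# ...then write the decoded bytes out as the actual image.
with open('h65803h6580301.gif', 'wb') as gif_file:
    gif_file.write(out_mem.getvalue())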
What you're looking at isn't "binary", it's uuencoded. Python's standard library includes the module uu, to handle uuencoded data.
The uu module's interface is file-oriented. You can also accomplish the encoding and decoding without any files at all by using Python's codecs module, like this:
import codecs
data = "Let's just pretend that this is binary data, ok?"
uuencode = codecs.getencoder("uu")
data_uu, n = uuencode(data)
uudecode = codecs.getdecoder("uu")
decoded, m = uudecode(data_uu)
print """* The initial input:
%(data)s
* Encoding these %(n)d bytes produces:
%(data_uu)s
* When we decode these %(m)d bytes, we get the original data back:
%(decoded)s""" % globals()
You definitely need to be reading in binary mode if the content includes JPEG images.
As well, Python includes an SGML parser, http://docs.python.org/library/sgmllib.html. There is no example there, but all you need to do is set up do_ methods to handle the SGML tags you are interested in.
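For example, here is a minimal sketch of such a parser for the <FILENAME> tags shown in the question (untested against the real container format, so treat it as a starting point):

import sgmllib

class FilenameGrabber(sgmllib.SGMLParser):
    # Collect the text that immediately follows each <FILENAME> tag.
    def reset(self):
        sgmllib.SGMLParser.reset(self)
        self.filenames = []
        self._grab = False

    def do_filename(self, attributes):
        # sgmllib dispatches <FILENAME> to do_filename (lowercased tag name)
        self._grab = True

    def handle_data(self, data):
        if self._grab:
            self.filenames.append(data.strip())
            self._grab = False

parser = FilenameGrabber()
parser.feed(wholeFile)  # wholeFile as read in the question
parser.close()
print parser.filenames  # e.g. ['h65803h6580301.gif']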
You need to open(filename, 'rb') to open the file in binary mode. Be aware that this will leave you with confusing two-byte line endings on some operating systems.