I am having trouble with variable text encoding when opening text files to find a match in the files' contents.
I am writing a script to scan the file system for log files with specific contents in order to copy them to an archive. The names are often changed, so the contents are the only way to identify them. I need to identify *.txt files and find within their contents a string that is unique to these particular log files.
I have the code below that mostly works. The problem is the logs may have their encoding changed if they are opened and edited. In this case, Python won't match the search term to the contents because the contents are garbled when Python uses the wrong encoding to open the file.
import os
import codecs

# Filepaths to search
FILEPATH = "SomeDrive:\\SomeDirs\\"
# Text to match in file names
MATCH_CONDITION = ".txt"
# Text to match in file contents
MATCH_CONTENT = "--------Base Data Details:--------------------"

for root, dirs, files in os.walk(FILEPATH):
    for f in files:
        if MATCH_CONDITION in f:
            print "Searching: " + os.path.join(root, f)
            # ATTEMPT A -
            # matches only text files re-encoded as ANSI,
            # UTF-8, UTF-8 no BOM
            #search_file = open(os.path.join(root, f), 'r')
            # ATTEMPT B -
            # matches text files output from Trimble software as
            # "UCS-2 LE w/o BOM", also "UCS-2 Little Endian"
            # (same file resaved using Windows Notepad)
            search_file = codecs.open(os.path.join(root, f), 'r', 'utf_16_le')
            file_data = search_file.read()
            if MATCH_CONTENT in file_data:
                print "CONTENTS MATCHED: " + f
            search_file.close()
I can open the files in Notepad++, which detects the encoding. Python's regular open() does not automatically detect the encoding. I can use codecs.open() and specify the encoding to catch a single encoding, but then I have to write excess code to catch the rest. I've read the Python codecs module documentation and it seems to be devoid of any automatic detection.
What options do I have to concisely and robustly search any text file with any encoding?
I've read about the chardet module, which seems good but I really need to avoid installing modules. Anyway, there must be a simpler way to interact with the ancient and venerable text file. Surely as a newb I am making this too complicated, right?
Python 2.7.2, Windows 7 64-bit. Probably not necessary, but here is a sample log file.
EDIT:
As far as I know the files will almost surely be in one of the encodings in the code comments: ANSI, UTF-8, UTF_16_LE (as UCS-2 LE w/o BOM; UCS-2 Little Endian). There is always the potential for someone to find a way around my expectations...
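Given that short list of expected encodings, one shortcut avoids decoding the file at all: since the search string itself is plain ASCII, encode it once per candidate encoding and look for those byte patterns in the raw file contents. A minimal sketch (the helper name is mine, not from the original code; ANSI and UTF-8 produce identical bytes for ASCII text, so 'utf-8' covers both, while 'utf-16-le' covers the UCS-2 LE files):

```python
# Sketch: rather than guessing the file's encoding, encode the
# (ASCII-only) search string in each candidate encoding and search
# the raw bytes directly.
CANDIDATE_ENCODINGS = ('utf-8', 'utf-16-le')

def contains_match(raw_bytes, match_text):
    """Return True if match_text appears in raw_bytes under any
    candidate encoding."""
    return any(match_text.encode(enc) in raw_bytes
               for enc in CANDIDATE_ENCODINGS)
```

Read each file with open(path, 'rb') and pass the bytes straight in; no decode step means no UnicodeDecodeError handling.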
EDIT:
While using an external library is certainly the sound approach, I've taken a chance at writing some amateurish code to guess the encoding and solicited feedback in another question -> Pitfalls in my code for detecting text file encoding with Python?
The chardet package exists for a reason (and was ported from some older Netscape code, for a similar reason): detecting the encoding of an arbitrary text file is tricky.
There are two basic alternatives:
Use some hard-coded rules to determine whether a file has a certain encoding. For example, you could look for the UTF byte-order marker at the beginning of the file. This breaks for encodings that overlap significantly in their use of different bytes, or for files that don't happen to use the "marker" bytes that your detection rules use.
Take a database of files in known encodings and count up the distributions of different bytes (and byte pairs, triplets, etc.) in each of the encodings. Then, when you have a file of unknown encoding, take a sample of its bytes and see which pattern of byte usage is the best match. This breaks when you have short test files (which makes the frequency estimates inaccurate), or when the usage of the bytes in your test file doesn't match the usage in the file database you used to build up your frequency data.
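Of the two, the rule-based BOM check is easy to hand-roll. A minimal sketch (the default fallback is an assumption; BOM-less files still need one of the other strategies):

```python
import codecs

# Map BOM byte prefixes to codec names. Longer BOMs must be tested
# first, or UTF-32-LE (ff fe 00 00) would be mistaken for UTF-16-LE
# (ff fe).
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def sniff_bom(raw_bytes, default='utf-8'):
    """Return the codec named by the file's BOM, or `default` if no
    BOM is present (where this approach simply cannot help)."""
    for bom, name in _BOMS:
        if raw_bytes.startswith(bom):
            return name
    return default
```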
The reason Notepad++ can do encoding detection (as can web browsers, word processors, etc.) is that these programs have one or both of these methods built in. Python doesn't build this into its interpreter -- it's a general-purpose programming language, not a text editor -- but that's just what the chardet package does.
I would say that because you know some things about the text files that you're handling, you might be able to take a few shortcuts. For example, are your log files all in either encoding A or encoding B? If so, then your decision is much simpler, and probably either the frequency-based or the rule-based approach above would be pretty straightforward to implement on your own. But if you need to detect arbitrary character sets, I'd highly recommend building on the shoulders of giants.
Related
I am trying to write a small script in python 3 to sanitise filenames before they are uploaded to a cloud solution. This needs to run the same on unix and windows systems (including macs). Linux and mac allow characters in file and directory names that windows does not, and for this reason files with these characters simply cannot be uploaded, which is why the script is required.
I am utilising os.walk() to scan through the files and directories, but while the regex for my first check ('[\\\\/":<>|*?]') runs without issues on my linux test, it does not work when actually run from windows.
Given for example a file named hello?\This is a file, python will read it as 'hello\uf03f\uf05cThis is a file' and the regex will of course not match. I have tried converting it to bytes then decoding it, encoding and decoding it and using a byte string as the path and decoding all as suggested by various semi-related SO posts, but nothing seems to give me the original characters.
Is anyone able to suggest anything I can do besides adding the sequences to the regex, which would be my last resort if I can't find the real solution?
Example of what I am testing with (invalid files created by mounting drive to linux):
C:\Users\username\Desktop:
shortcut.lnk
text file.txt
|\invalid??.txt
for dirpath, dirnames, filenames in os.walk("C:\\Users\\username\\Desktop"):
    for file in filenames:
        print(file)
Outputs:
'shortcut.lnk'
'text file.txt'
'\uf07c\uf05cinvalid\uf03f\uf03f.txt'
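Those \uf0xx code points sit in Unicode's private use area, each offset by exactly 0xF000 from the forbidden ASCII character it stands for (0xF03F - 0xF000 = 0x3F, i.e. ?); this is the remapping some SMB/NTFS interop layers use to smuggle Windows-invalid characters into names. Assuming that mapping holds for files created through your mount, reversing it is straightforward (the function name is mine):

```python
def unmangle(name):
    """Map private-use code points U+F001..U+F07F back to the ASCII
    characters they encode (the 0xF000-offset trick used by SMB
    interop layers). Other characters pass through unchanged."""
    return ''.join(
        chr(ord(c) - 0xF000) if 0xF001 <= ord(c) <= 0xF07F else c
        for c in name
    )
```

Running the regex over unmangle(file) would then let the existing check fire on the original characters.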
This is my attempt to convert the string in to binary format:
test_str = "this is the code"
res = ''.join(format(ord(i), 'b') for i in test_str)
print("this is the conversion " + str(res))
output:
this is the conversion
11101001101000110100111100111000001101001111001110000011101001101000110010110000011000111101111110010
01100101
How can I do the same for any type of file (e.g. text, video, mp3, etc.) in binary format?
I believe your question is not strictly related to Python but applies to any language, because the underlying problem is: what is a binary file? A binary file is just zeros and ones (often read directly in hex), and to make sense of it you have to know how the structure that was saved (usually a struct) was serialized.
For that reason, you have to know what kind of file you are reading and use a parser that knows the binary structure of that specific format. That is why libraries exist: each library knows how to read and/or write one or more file types. On Linux (for example), the extension is completely irrelevant; what matters is the content of the file.
To address your request more directly, here are some links to help you work out how to read/process <your-file-format> in Python (remember: the extension is not important, the file format is!):
to read .mp3: Read MP3 in Python 3
to read .wmv: Is there a library like pymedia, but more updated?
to read .jpeg: https://www.geeksforgeeks.org/reading-images-in-python/
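To tie this back to the original snippet: any file, whatever its format, can be read as bytes in binary mode and rendered as a bit string. Pad each byte to 8 digits, or the boundaries between bytes are lost, which is exactly what makes the unpadded output above ambiguous. A sketch (the function name is mine):

```python
def file_to_bits(path):
    # 'rb' gives raw bytes for any file type: text, mp3, video, ...
    with open(path, 'rb') as fh:
        data = fh.read()
    # '08b' zero-pads every byte to 8 bits so byte boundaries survive
    return ''.join(format(b, '08b') for b in data)
```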
I was concurrently manipulating a text file (some read/write operations) with multiple processes, and now and then I see runs of special signs like ^@^@^@^@^@^@^@^@^@^@^@^@^@ spread across some lines. What does this suggest, and under what circumstances do these symbols appear? Does it mean some binary content was written, by mistake, where there should be text?
UPDATE
I read through the documentation. Some sources suggest it's due to newline differences between Linux and Windows, while others suggest it's because of big-endian/little-endian issues in a networked environment. The fact is I was running multiple processes on a networked filesystem, all manipulating one common text file. So I guess the encoding format might be the major reason. Can anyone suggest how to avoid this issue? I don't want to edit the files afterwards (like manually doing text substitution); a clean way of producing the right file without any null characters is preferred.
UPDATE2
This is the Python pseudocode that implements my project. The fcntl.lockf call locks the shared file across the multiple machines that run processes on it.
while manipulatedfile size is not 0:
    with open(manipulatedfile, 'r+') as fh:
        fcntl.lockf(fh, fcntl.LOCK_EX)
        all_lines = fh.readlines()
        listing = all_lines[0:50]    # get the first 50 lines
        rest_lines = all_lines[50:]  # get the remaining lines
        fh.seek(0)
        fh.truncate()
        fh.writelines(rest_lines)    # write the remaining lines back to the file
        fcntl.lockf(fh, fcntl.LOCK_UN)
    listing = map(lambda s: s.strip(), listing)
    do_sth(listing)
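One thing worth doing before the LOCK_UN in that loop: flush the stream and fsync the descriptor, so the rewritten contents actually reach the (networked) filesystem before another machine takes the lock. Unlocking while data is still sitting in a buffer is a classic way to end up with torn or partly-zeroed regions. A sketch of the rewrite step with that added (the function name is mine):

```python
import os

def rewrite_and_sync(fh, remaining_lines):
    """Rewrite fh in place and make sure the bytes are committed
    before the caller releases its lock."""
    fh.seek(0)
    fh.truncate()
    fh.writelines(remaining_lines)
    fh.flush()             # push Python's userspace buffer to the OS
    os.fsync(fh.fileno())  # ask the OS to commit it to storage
```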
Thanks
In ASCII, ^@ represents the binary zero (NUL) character.
Data containing ^@ between each visible character has often been incorrectly translated between a wider encoding (UTF-16 or UTF-32, two or four bytes per character) and ASCII (one byte per character), or vice versa.
To remove the ^@ characters, run vi file.txt, type :%s/, press Ctrl+V Ctrl+@, type //g, and hit ↵ Return.
See this detailed article for more information.
These are "file holes" and contain null characters. The null character (NUL) has an ASCII code of 0 and appears as ^@ when viewed in vi or less.
I usually see these when I am nearly out of disk space and processes are trying to write to log files.
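The hole behaviour is easy to reproduce: seeking past the end of a file and writing leaves a gap, and that gap reads back as NUL bytes, which vi and less render as ^@. A self-contained demonstration:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

with open(path, 'wb') as fh:
    fh.seek(5)      # jump past the (empty) end of the file
    fh.write(b'x')  # the skipped region becomes a hole

with open(path, 'rb') as fh:
    data = fh.read()
os.remove(path)

# data == b'\x00\x00\x00\x00\x00x' - five NUL bytes fill the hole
```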
In my django application, I provide a form that allows users to upload a file. The file can be in a variety of formats (Excel, CSV), come from a variety of platforms (Mac, Linux, Windows), and be encoded in a variety of encodings (ASCII, UTF-8).
For the purpose of this question, let's assume that I have a view which is receiving request.FILES['file'], which is an instance of InMemoryUploadedFile, called file. My problem is that InMemoryUploadedFile objects (like file):
Do not support UTF-8 encoding (I see a \xef\xbb\xbf at the beginning of the file, which as I understand it is the UTF-8 byte-order mark, a flag meaning 'this file is UTF-8').
Do not support universal newlines (which probably the majority of the files uploaded to this system will need).
Complicating the issue is that I wish to pass the file in to the python csv module, which does not natively support Unicode. I will happily accept answers that avoid this issue - once I get django playing nice with UTF-8 I'm sure I can bludgeon csv into doing the same. (Similarly, please ignore the requirement to support Excel - I am waiting until CSV works before I tackle parsing Excel files.)
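(A note for later readers on Python 3: the BOM and the universal-newline issue can both be handled by wrapping the uploaded bytes in a text wrapper using the utf-8-sig codec, which strips the \xef\xbb\xbf marker if present, plus newline='' as the csv docs require. A sketch against an in-memory stand-in for the uploaded file:)

```python
import csv
import io

# Stand-in for request.FILES['file']: UTF-8 bytes with a BOM and
# Windows line endings.
raw = io.BytesIO(b'\xef\xbb\xbfname,value\r\nalpha,1\r\n')

# 'utf-8-sig' strips the BOM; newline='' lets the csv module do its
# own (universal) newline handling.
text = io.TextIOWrapper(raw, encoding='utf-8-sig', newline='')
rows = list(csv.reader(text))
# rows == [['name', 'value'], ['alpha', '1']]
```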
I have tried StringIO, mmap, codecs, and a wide variety of other ways of accessing the data in an InMemoryUploadedFile object. Each approach has yielded different errors; none so far has been perfect. This shows some of the code that I feel came the closest:
import csv
import codecs

class CSVParser:
    def __init__(self, file):
        # 'file' is assumed to be an InMemoryUploadedFile object.
        dialect = csv.Sniffer().sniff(codecs.EncodedFile(file, "utf-8").read(1024))
        file.open()  # seek to 0
        self.reader = csv.reader(codecs.EncodedFile(file, "utf-8"),
                                 dialect=dialect)
        try:
            self.field_names = self.reader.next()
        except StopIteration:
            # The file was empty - this is not allowed.
            raise ValueError('Unrecognized format (empty file)')
        if len(self.field_names) <= 1:
            # This probably isn't a CSV file at all.
            # Note that the csv module will (incorrectly) parse ALL files, even
            # binary data. This will catch most such files.
            raise ValueError('Unrecognized format (too few columns)')
    # Additional methods snipped, unrelated to issue
Please note that I haven't spent too much time on the actual parsing algorithm so it may be wildly inefficient, right now I'm more concerned with getting encoding to work as expected.
The problem is that the results are also not encoded, despite being wrapped in the Unicode codecs.EncodedFile file wrapper.
EDIT: It turns out, the above code does in fact work. codecs.EncodedFile(file,"utf-8") is the ticket. It turns out the reason I thought it didn't work was that the terminal I was using does not support UTF-8. Live and learn!
As mentioned above, the code snippet I provided was in fact working as intended - the problem was with my terminal, and not with python encoding.
If your view needs to access a UTF-8 UploadedFile, you can just use utf8_file = codecs.EncodedFile(request.FILES['file_field'],"utf-8") to open a file object in the correct encoding.
I also noticed that, at least for InMemoryUploadedFiles, opening the file through the codecs.EncodedFile wrapper does NOT reset the seek() position of the file descriptor. To return to the beginning of the file (again, this may be InMemoryUploadedFile specific) I just used request.FILES['file_field'].open() to send the seek() position back to 0.
I use the csv.DictReader and it appears to be working well. I attached my code snippet, but it is basically the same as another answer here.
import csv as csv_mod
import codecs
file = request.FILES['file']
dialect = csv_mod.Sniffer().sniff(codecs.EncodedFile(file,"utf-8").read(1024))
file.open()
csv = csv_mod.DictReader( codecs.EncodedFile(file,"utf-8"), dialect=dialect )
For CSV and Excel upload to django, this site may help.
I am trying to parse some .txt files. These files serve as containers for a variable number of 'children' files that are set off or identified within the container with SGML tags. With python I can easily separate the children files. However I am having trouble writing the binary content back out as a binary file (say a gif or jpg). In the simplest case the container might have an embedded html file followed by a graphic that is called by the html. I am assuming that my problem is because I am reading the original .txt file using open(filename,'r'). But that seems the only option to find the sgml tags to split the file.
I would appreciate any help to identify some relevant reading material.
I appreciate the suggestions but I am still struggling with the most basic questions. For example when I open the file with wordpad and scroll down to the section tagged as a gif I see this:
<FILENAME>h65803h6580301.gif
<DESCRIPTION>GRAPHIC
<TEXT>
begin 644 h65803h6580301.gif
M1TE&.#EA(P)I`=4#`("`#,#`P$!`0+^_OW]_?_#P\*"#H.##X-#0T&!#8!`0
M$+"PL"`#('!P<)"0D#`P,%!04#\_/^_O[Y^?GZ^OK]_?WX^/C\_/SV]O;U]?
I can handle finding the section easily enough, but where does the gif file begin? Does the header start with 644, with the blanks after the word begin, or with the line beginning with M1TE?
Next, when the file is read into python does it do anything to the binary code that has to be undone when it is read back out?
I can find the lines where the graphics begin:
filerefbin = file('myfile.txt', 'rb')
wholeFile = filerefbin.read()
import re
graphicReg = re.compile('<DESCRIPTION>GRAPHIC')
locationGraphics = graphicReg.finditer(wholeFile)
graphicsTags = []
for match in locationGraphics:
    graphicsTags.append(match.span())
I can easily use the same process to get to the word begin, or to identify the filename and get to the end of the filename in the 'first' line. I have also successefully gotten to the end of the embedded gif file. But I can't seem to write out the correct combination of things so when I double click on h65803h6580301.gif when it has been isolated and saved I get to see the graphic.
Interestingly, when I open the file in rb, the line endings appear to still be present even though they don't seem to have any effect in Notepad. So that is clearly one of my problems: I might need to readlines() and join the lines together after stripping out the \n.
I love this site and I love PYTHON
This was too easy once I read bendin's post. I just had to snip the section that began with the word begin and save that in a txt file and then run the following command:
import uu
uu.decode(r'c:\test2.txt',r'c:\test.gif')
I have to work with some other stuff for the rest of the day, but I will post more here as I look at this more closely. The first thing I need to discover is how to use something other than a file: since I read the whole .txt file into memory and clipped out the section that has the image, I need to work with the clipped section directly instead of writing it out to test2.txt. I am sure that can be done; it's just a matter of figuring out how.
What you're looking at isn't "binary", it's uuencoded. Python's standard library includes the module uu, to handle uuencoded data.
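To see the format concretely: the 644 on the begin line is just a Unix permission mode, and the data starts on the following line (the one beginning with M). Each body line starts with a length character, chr(32 + number_of_bytes), followed by the payload at 6 bits per character, and the binascii module can decode a single line without any temporary files. For example, the three bytes GIF uuencode to #1TE&:

```python
import binascii

# '#' is chr(32 + 3), so this uuencoded line carries 3 bytes
line = b'#1TE&'
decoded = binascii.a2b_uu(line)
# decoded == b'GIF'
```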
The module uu requires the use of temporary files for encoding and decoding. You can accomplish this without resorting to temporary files by using Python's codecs module like this:
import codecs
data = "Let's just pretend that this is binary data, ok?"
uuencode = codecs.getencoder("uu")
data_uu, n = uuencode(data)
uudecode = codecs.getdecoder("uu")
decoded, m = uudecode(data_uu)
print """* The initial input:
%(data)s
* Encoding these %(n)d bytes produces:
%(data_uu)s
* When we decode these %(m)d bytes, we get the original data back:
%(decoded)s""" % globals()
You definitely need to be reading in binary mode if the content includes JPEG images.
As well, Python includes an SGML parser, http://docs.python.org/library/sgmllib.html .
There is no example there, but all you need to do is set up do_ methods to handle the SGML tags you wish.
You need open(filename, 'rb') to open the file in binary mode. Be aware that on some operating systems this will give you confusing two-byte line endings (\r\n), since binary mode performs no newline translation.