I have a text file that contains some binary data. When I read the file in text mode with Python 3, I get a UnicodeDecodeError (codec can't decode byte...) with the following lines of code:
fo = open('myfile.txt', 'r')
for line in fo:
How can I remove the binary data from my file? A header is printed just before each binary block (in this case it is shown as Data Block). For example, my file looks like the following, where I want to remove the çºí?¼Èדñdí:
myfile.txt:
ABCDEFGH
123456
Data Block 11
çºí?¼Èדñdí
XYZ123
The result I want is for myfile.txt to look like this:
ABCDEFGH
123456
Data Block 11
XYZ123
This is difficult, because "binary" blobs may contain valid characters or character sequences. And if you're using a file that has "text" using multi-byte encoding, forget about it.
If you know the "text" in your file only contains single-byte characters, one approach would be to read the file in as bytes, then use something like
decode('ascii', errors='ignore')
This effectively strips non-ascii characters out of the output, but if you were to do this on your file, you'd get:
ABCDEFGH
123456
Data Block
?d
XYZ123
Note the second to last line -- valid ascii characters were found in the blob and treated as "text".
You may start with a solution like that and fine-tune it (if possible) to meet your needs. Maybe the blobs always occur on lines by themselves, so that if a line has any non-ASCII characters you can throw out the entire line (sketched after the code below). Maybe you can look at the blobs and try to grok whatever structure they have. Maybe you just settle for having random lines of partial characters in there and handle them somehow later. It's application-specific at that point.
Here's the code I used to produce that output from your sample input:
def strip_nonascii(b):
    return b.decode('ascii', errors='ignore')

with open('garbled.txt', 'rb') as f:
    for line in f:
        print(strip_nonascii(line), end='')
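And if you go the route of dropping whole lines instead (as suggested above), a minimal variant might look like this; it's hypothetical and assumes the blobs never share a line with text you want to keep:

def is_ascii_line(b):
    # True if the raw line decodes cleanly as ASCII.
    try:
        b.decode('ascii')
        return True
    except UnicodeDecodeError:
        return False

with open('garbled.txt', 'rb') as f:
    for line in f:
        if is_ascii_line(line):
            print(line.decode('ascii'), end='')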
If you also have a footer after the binary data (just as you have a header), you could try replacing everything between the header and footer with nothing using a regexp.
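A rough sketch of that idea, assuming each blob sits between the 'Data Block N' header line and a hypothetical 'End Block' footer line (swap in whatever markers your file actually uses):

import re

with open('myfile.txt', 'rb') as f:
    raw = f.read()

# Remove everything between the header line and the footer line, keeping both.
# 'End Block' is an assumed footer marker; replace it with your real one.
cleaned = re.sub(rb'(?s)(Data Block \d+\n).*?(End Block\n)', rb'\1\2', raw)

with open('myfile_clean.txt', 'wb') as f:
    f.write(cleaned)

Working on bytes avoids the UnicodeDecodeError entirely, since the file is never decoded.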
I am working with a vcf file. I am trying to extract information from this file, but the file has errors in its format.
In this file there is a column that contains long character strings. The error is that a number of tabs and a newline character are erroneously placed within some rows of this column, so when I try to read in this tab-delimited file, all columns are messed up.
I have an idea how to solve this, but don't know how to execute it in code. The string is DNA, so always has ATCG. Basically, if one could look for a number of tabs and a newline within characters ATCG and remove them, then the file is fixed:
ACTGCTGA\t\t\t\t\nCTGATCGA would become:
ACTGCTGACTGATCGA
So one would need to look into this file, look for [ACTG] followed by tabs or newlines, followed by more [ACTG], and then replace this with nothing. Any idea how to do this?
with open('file.vcf', 'r') as f:
    lines = [l for l in f if not l.startswith('##')]
Here's one way with regex:
First read the file in:
import re
with open('file.vcf', 'r') as file:
    dnafile = file.read()
Then write a new file with the changes:
with open('fileNew.vcf', 'w') as file:
    file.write(re.sub("(?<=[ACTG]{2})((\\t)*(\\n)*)(?=[ACTG]{2})", "", dnafile))
I had a data table I converted to a text file with a tilde ~ at the end of each line. This is how I ended each line, so it is not a delimiter.
I used Linux to fold the data into an 80-byte wrapped text file, which added a line feed at the end of each line.
Example (if I did this at 10 bytes per line):
Original file or table:
abcdefghigklmnop~
1234567890~
New file:
abcdefghig
klmnop~123
4567890~
Linux/Unix, Perl, or even Python responses would help and be appreciated.
I need the new file to look exactly like the original. Sometimes line lengths will be over 80 characters, which is OK.
If your original data was delimited by ~\n (a tilde at the end of each line), and the new format removed those newlines and inserted new ones every 80 bytes, then the reverse is to remove the newlines and replace each ~ with ~\n again:
with open(inputfile, 'r') as inp, open(outputfile, 'w') as outp:
    data = inp.read().replace('\n', '')
    outp.write(data.replace('~', '~\n'))
I'm having a problem understanding why my Python program does what it does when reading (first) lines from files and adding the lines to a list. For some reason the first line needs to be empty or it will not be read correctly. And even when the first line is empty, it's apparently not empty (at least not according to Python).
The thing is, I have two types of files:
First file is in the form:
text:more text
another text:and more
and the second file in the form:
text_file.txt
anothertext_file.txt
Both files are UTF-8 encoded text files. The first line of each file that gets added to a list in my program is "text" and "text_file.txt" respectively, but any code that, for example, tries to say
if something == "text":
...
will not get executed even if the "something" is the same as the "text".
So I'm assuming that my problem is that somewhere in the machine code (or something), my computer writes some invisible code in the beginning of the text file and that makes the first line not what it is. Maybe? I have actually found a solution for the problem simply by adding an empty line and an if clause when reading the file line by line:
if not "." in line:
...
and in the other filetype:
if not ":" in line:
...
Those if clauses work and my program does what it's supposed to (as long as I always add an empty line to the beginning of the file), but I haven't been able to find the real reason why my program behaves this way. Also, I would like to avoid this kind of workaround if there's an easier solution that doesn't involve editing all my files and adding if clauses to my code.
Would appreciate any help understanding what's happening here!
Edit: as you people have been asking for my code, here it is:
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
for line in f:
filelist.append(line.rstrip("\n"))
This does not work properly. Also I tried it like mxds said,
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
lines = f.readlines()
for line in lines:
filelist.append(line.rstrip("\n"))
and this does not work either. The problem only affects the first character of the first line of the files.
Edit2:
It seems the problem is a byte order mark (BOM) at the beginning of my text files. After some quick googling I didn't find a solution for removing it. I'm creating my files with plain Windows Notepad.
Final edit:
Apparently Notepad is not a real text editor. I guess I'll just swap from Notepad to Notepad++ to avoid this problem. However, just in case I still have to handle my files in Notepad: if I open a text file in Notepad and add some text to it, will it add a BOM, or does it only do that when the file is created?
Looks like you've already done the legwork on this, but according to How to make Notepad to save text in UTF-8 without BOM?, the best answer is not to use Notepad (but Notepad++ is ok). :)
Alternatively, you can strip the BOM in Python with:
line = line.decode("utf-8-sig").encode("utf-8")
See https://docs.python.org/3/library/codecs.html:
To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.
...
On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
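Since you're on Python 3 and already pass encoding= to open(), a simpler variant (a minimal sketch reusing the file name from your own snippet) is to let the codec strip the BOM when the file is opened:

filelist = []
# utf-8-sig reads plain UTF-8 unchanged and silently drops a leading BOM.
with open("filename.txt", "r", encoding="utf-8-sig") as f:
    for line in f:
        filelist.append(line.rstrip("\n"))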
A classic approach to reading text files in Python is:
with open(fname, 'r') as f:
    lines = f.readlines()
After which you can process the lines like this:
for line in lines:
    # do something with line...
As other comments have hinted, you may want to make sure this works first. It would help if you post your current code for review.
I just had a similar issue: Python's readlines() reported invalid characters at the start of the first line. I tried all the suggestions I could google, with no luck.
I came up with a simple trick: add a blank line as the first line of the text file, then skip it with
if len(line[i]) > len(line[0]):
    # do things
else:
    # skip it
In my case len(line[0]) is 4, and all other lines are longer than 4.
I know that I can read the entire file into memory and simply replace the offending character in memory then iterate through the stored file, but I don't want to do that because these are MASSIVE text files (often exceeding 4GB).
With that said, I want to iterate line by line through a file (which has been properly encoded as utf-8 using codecs) but I don't want line breaks to occur on the \x0b (\v) character. Unfortunately, there is some binary data that shows up in my file that has the \x0b character. Naturally, this causes a line break which ends up splitting up some lines that I need to keep intact. I'd like to ignore this character when determining where line breaks should occur while iterating through the file.
Is there a parameter or approach that will enable me to do this? I'm ok with writing my own generator to iterate line by line through the file by specifying my own valid line break characters, but I'm not sure if there isn't a simpler approach, and I'm not sure how to do this since I'm using the codecs library to handle encoding.
Here are some (sanitized) sample data:
Record#|EventID|Date| Time-UTC|Level|computer name|param_01|param_02|param_03|param_04|param_05|param_06|source name|event log
84491|682|03/19/2015| 21:59:16.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0xF38058)|RDP-Tcp#12|RogueApp|10.3.98.6|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
90582|682|04/03/2015| 14:42:14.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x35BDF)|RDP-Tcp#5|RogueApp|10.3.98.14|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
90613|682|04/03/2015| 16:26:03.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x35BDF)|RDP-Tcp#9|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºà¨€A਀Aì°†éªá… 걺଀A଀Aé¶é«á… Ö Î„|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
90626|682|04/03/2015| 16:57:35.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x35BDF)|RDP-Tcp#11|RogueApp|10.3.98.14|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91018|682|04/04/2015| 13:56:13.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x100513C)|RDP-Tcp#33|Anonymous|10.3.58.13|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91038|682|04/04/2015| 14:09:19.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x100513C)|RDP-Tcp#39|Anonymous's Mac|192.168.1.18ì°†éªá…°ê±ºæ¸€x渀xì°†éªá… 걺欀x欀xé¶é«á… Ѐ̀|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91064|682|04/04/2015| 15:25:33.000|a-pass|WKS-WINXP32BIT|ACN-Helpdesk|WKS-WINXP32BIT|(0x0,0x11FA916)|RDP-Tcp#43|CONTROLLER|10.3.58.4|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91163|682|04/04/2015| 16:40:19.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#2|Anonymous's Mac|192.168.1.18ì°†éªá…°ê±ºá´€æ®–ᴀ殖찆éªá… 걺㬀殖㬀殖é¶é«á… Ѐ̀|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91204|682|04/04/2015| 18:10:55.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#5|Anonymous's Mac|192.168.1.18ì°†éªá…°ê±ºæ˜€æ˜€ì°†éªá… 걺愀愀é¶é«á… Ѐ̀|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91545|682|04/05/2015| 13:41:58.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#7|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºìˆ€ìˆ€ì°†éªá… 걺대대é¶é«á… Ѐ̀|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91567|682|04/05/2015| 14:42:21.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#9|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºæ €æ €ì°†éªá… 걺愀愀é¶é«á… Ѐ̀|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
92120|682|04/06/2015| 19:06:43.000|a-pass|WKS-WINXP32BIT|ACN-Helpdesk|WKS-WINXP32BIT|(0x0,0x3D6DB)|RDP-Tcp#2|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºç„€ç„€ì°†éªá… 걺眀眀é¶é«á… Ѐ̀|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
It parses everything fine except for the very last row. Yes I know there shouldn't be binary data in a CSV file, but there is. And I have no choice in that matter.
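For what it's worth, here is a minimal sketch of the generator the question mentions: it reads the file in binary chunks, splits only on \n, and decodes each completed line as UTF-8 (the function name and chunk size are arbitrary):

def lines_split_on_newline_only(path, chunk_size=1 << 20):
    buffer = b''
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buffer += chunk
            # Keep any trailing partial line in the buffer for the next chunk.
            *complete, buffer = buffer.split(b'\n')
            for raw in complete:
                yield raw.decode('utf-8') + '\n'
    if buffer:
        yield buffer.decode('utf-8')

Each yielded line keeps any embedded \x0b intact, so splitting the fields on | afterwards is unaffected.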
>>> with open("out.test","wb") as f:
... f.write("a\va\nb\rq")
...
>>> for line in open("out.test","rb"):
... print line.decode("utf8")
...
a♂a
q
seems fine in Python 2.7 ... what kind of encoding does this file have that this won't work?
I am trying to write text to an output file that explicitly shows all of the newline characters (\n, \r, \r\n,). I am using Python 3 and Windows 7. My thought was to do this by converting the strings that I am writing into bytes.
My code looks like this:
file_object = open(r'C:\Users\me\output.txt', 'wb')
for line in lines:
    line = bytes(line, 'UTF-8')
    print('Line: ', line)  # for debugging
    file_object.write(line)
file_object.close()
The print() statement to standard output (my Windows terminal) is as I want it to be. For example, one line looks like so, with the \n character visible.
Line: b'<p class="byline">Foo C. Bar</p>\n'
However, the write() method does not explicitly print any of the newline characters in my output.txt file. Why does write() not explicitly show the newline characters in my output text file, even though I'm writing in bytes mode, but print does explicitly show the newline characters in the windows terminal?
What Python does when writing strings or bytes to text or binary files:
Strings to a text file. Directly written.
Bytes to a text file. write() raises a TypeError, but print() writes the repr.
Strings to a binary file. Throws an exception.
Bytes to a binary file. Directly written.
You say that you get what you're looking for when you print a bytes object to standard out (a text file). That, with the pseudo-table above, suggests you might look into using repr. Specifically, if you're looking for the output b'<p class="byline">Foo C. Bar</p>\n', you're looking for the repr of a bytes object. If line was a str to start with and you don't actually need that b at the beginning, you might instead be looking for the repr of the string, '<p class="byline">Foo C. Bar</p>\n'. If so, you could write it like this:
with open(r'C:\Users\me\output.txt', 'w') as file_object:
    for line in lines:
        file_object.write(repr(line) + '\n')