I am processing many text files which (some of them) contain uuencoding which can be .jpg or .pdf or .zip of .xlsx etc. I don't care about the embedded UUencoded data, so I would just like to discard these passages and keep the rest of the text. I'm struggling with how to come up with a method to skip only just enough, but not too much.
To summarize http://en.wikipedia.org/wiki/Uuencoding each blob begins with
begin 644 filename.extension
every line after the begin 644 seems to start by the letter
M
so this might also help. Any idea how to have a function that deletes all these lines for all .txt files in a folder (directory)?
For example, the following is a .jpg uuencoding
GRAPHIC
18
g438975g32h99a01.jpg
begin 644 g438975g32h99a01.jpg
M_]C_X``02D9)1#`!`#$`8`!#``#_[0G64&AO;=&]S:&]P(#,N,``X0DE-`^T`
M`````!``8`````$``0!#`````0`!.$))300-```````$````'CA"24T$&0``
M````!````!XX0DE-`_,```````D```````````$`.$))300*```````!```X
M0DE-)Q````````H``0`````````".$))30/U``````!(`"]F9#`!`&QF9;#`&
M```````!`"]F9#`!`*&9F#`&```````!`#(````!`%H````&```````!`#4`
M```!`"T````&```````!.$))30/X``````!P``#_____________________
M________`^#`````_____________________________P/H`````/______
M______________________\#Z`````#_____________________________
M`^#``#A"24T$"```````$`````$```)````"0``````X0DE-!!X```````0`
M````.$))300:``````!M````!#``````````````)P```+`````&`&<`,P`R
M`&#`.0`Y`````0`````````````````````````!``````````````"P````
M)P`````````````````````````````````````````````X0DE-!!$`````
M``$!`#A"24T$%```````!`````(X0DE-!`P`````!SH````!````<````!D`
M``%0```#T```!QX`&``!_]C_X``02D9)1#`!`#$`2`!(``#_[#`.061O8F4`
M9(`````!_]L`A``,"`#("0#,"0D,$0L*"Q$5#PP,#Q48$Q,5$Q,8$0P,#`P,
M#!$,#`P,#`P,#`P,#`P,#`P,#`P,#`P,#`P,#`P,`0T+"PT.#1`.#A`4##X.
M%!0.##X.%!$,#`P,#!$1#`P,#`P,$0P,#`P,#`P,#`P,#`P,#`P,#`P,#`P,
M#`P,#`S_P``1"``9`'`#`2(``A$!`Q$!_]T`!``'_\0!/P```04!`0$!`0$`
M`````````P`!`#0%!#<("0H+`0`!!0$!`0$!`0`````````!``(#!`4&!P#)
M"#L0``$$`0,"!`(%!P8(!0,,,P$``A$#!"$2,05!46$3(G&!,#84D:&Q0B;,D
M%5+!8C,T<H+10P)E\K.$P]-U
MX_-&)Y2DA;25Q-3D]*6UQ=7E]59F=H:6IK;&UN;V-T=79W>'EZ>WQ]?G]Q$`
M`#(!`#0$`P0%!#<'!#4U`0`"$0,A,1($05%A<2(3!3*!D12AL4(CP5+1\#,D
M8N%R#I)#4Q5C<S3Q)086HK*#!R8UPM)$DU2C%V1%539T9>+RLX3#TW7C\T:4
MI(6TE<34Y/2EM<75Y?569G:&EJ;:VQM;F]B
I would like to be left with just
GRAPHIC
18
g438975g32h99a01.jpg
For background, see also my earlier question How to remove weird encoding from txt file
EDIT : Here is a try
start_marker = 'begin 644'
with open('fileWithBegin644.txt') as inf:
ignoreLines = False
for line in inf:
if start_marker in line:
print line,
ignoreLines = True
if not ignoreLines:
with open("strip_" + inf, "w") as f:
f.write(line.get_text().encode('utf-8'))
But I am getting the following error
File "removeUuencodingFromAll.py", line 10
with open("strip_" + inf, "w") as f:
^
IndentationError: expected an indented block
I coded up what was supposed to be a rather simple generator. Because the spec is slightly tedious (why two separate end markers on different lines?) it is rather bulky, but here goes. It should work as a validator for uuencode at the same time, but I have only tested it in very limited settings.
import re
def unuuencode (iterator, collector=None, ignore_length_errors=False):
"""
Yield lines from iterator except when they are in an uuencode blob.
If collector is not None, append to it the uuencoded blobs as a list
of a list of lines, one for each uuencoded blob.
"""
state = None # one of { None, 'in_blob', 'closing', 'closed' }
collectitem = None
regex = re.compile(r'^begin\s+[0-7]{3,6}\s+.*?(?:\r?\n)?$')
for line in iterator:
if state == None:
if regex.match(line):
if collector != None:
collectitem = [line]
state = 'in_blob'
continue
else:
yield line
else:
stripped = line.rstrip('\r\n')
if state == 'in_blob' and line.startswith('`'):
state = 'closing'
if state == 'closing':
if stripped != '`':
raise ValueError('Expected "`" but got "%s"' % line)
state = 'closed'
elif state == 'closed':
if stripped != 'end':
raise ValueError('Expected "end" but got "%s"' % line)
state = None
else:
expect = ord(line[0:1])-32
actual = len(stripped)
seen = (len(stripped)-1)*6/8
if seen != expect:
if not ignore_length_errors:
raise ValueError('Wrong prefix on line: %s '
'(indicated %i, 6/8 %i, actual length %i)' % (
line, expect, seen, actual))
if line[0:1] != 'M':
state = 'closing'
if collectitem:
collectitem.append(line)
if state is None:
if collectitem:
collector.append(collectitem)
collectitem = None
continue
Use it like this:
with open(file, 'r') as f:
lines = [x for x in unuuencode(f)]
or like this:
with open(file, 'r') as f:
blobs = []
lines = [x for x in unuuencode(f, collector=blobs)]
or like this:
with open(file, 'r') as f:
lines = f.read().split('\n')
# ... or whichever way you obtained your content as an array of lines
lines = [x for x in unuuencode(lines)]
or in the case of the code you seem to be using:
for fi in sys.argv[1:]:
with open(fi) as markup:
soup = BeautifulSoup(''.join(unuuencode(markup, ignore_length_errors=True)))
with open("strip_" + fi, "w") as f:
f.write(soup.get_text().encode('utf-8'))
The sample you linked to had an invalid length indicator in the second uuencoded blob, so I added an option to ignore that.
Related
I have a .txt file of amino acids separated by ">node" like this:
Filename.txt :
>NODE_1
MSETLVLTRPDDWHVHLRDGAALQSVVPYTARQFARAIAMPNLKPPITTAEQAQAYRERI
KFFLGTDSAPHASVMKENSVCGAGCFTALSALELYAEAFEAAGALDKLEAFASFHGADFY
GLPRNTTQVTLRKTEWTLPESVPFGEAAQLKPLRGGEALRWKLD*
>NODE_2
MSTWHKVQGRPKAQARRPGRKSKDDFVTRVEHDAKNDALLQLVRAEWAMLRSDIATFRGD
MVERFGKVEGEITGIKGQIDGLKGEMQGVKGEVEGLRGSLTTTQWVVGTAMALLAVVTQV
PSIISAYRFPPAGSSAFPAPGSLPTVPGSPASAASAP*
I want to separate this file into two (or as many as there are nodes) files;
Filename1.txt :
>NODE
MSETLVLTRPDDWHVHLRDGAALQSVVPYTARQFARAIAMPNLKPPITTAEQAQAYRERI
KFFLGTDSAPHASVMKENSVCGAGCFTALSALELYAEAFEAAGALDKLEAFASFHGADFY
GLPRNTTQVTLRKTEWTLPESVPFGEAAQLKPLRGGEALRWKLD*
Filename2.txt :
>NODE
MSTWHKVQGRPKAQARRPGRKSKDDFVTRVEHDAKNDALLQLVRAEWAMLRSDIATFRGD
MVERFGKVEGEITGIKGQIDGLKGEMQGVKGEVEGLRGSLTTTQWVVGTAMALLAVVTQV
PSIISAYRFPPAGSSAFPAPGSLPTVPGSPASAASAP*
With a number after the filename
This code works, however it deletes the ">NODE" line and does not create a file for the last node (the one without a '>' afterwards).
with open('FilePathway') as fo:
op = ''
start = 0
cntr = 1
for x in fo.read().split("\n"):
if x.startswith('>'):
if start == 1:
with open (str(cntr) + '.fasta','w') as opf:
opf.write(op)
opf.close()
op = ''
cntr += 1
else:
start = 1
else:
if op == '':
op = x
else:
op = op + '\n' + x
fo.close()
I canĀ“t seem to find the mistake. Would be thankful if you could point it out to me.
Thank you for your help!
Hi again! Thank you for all the comments. With your help, I managed to get it to work perfectly. For anyone with similar problems, this is my final code:
import os
import glob
folder_path = 'FilePathway'
for filename in glob.glob(os.path.join(folder_path, '*.fasta')):
with open(filename) as fo:
for line in fo.readlines():
if line.startswith('>'):
original = line
content = [original]
fileno = 1
filename = filename
y = filename.replace(".fasta","_")
def writefasta():
global content, fileno
if len(content) > 1:
with open(f'{y}{fileno}.fasta', 'w') as fout:
fout.write(''.join(content))
content = [line]
fileno += 1
with open('FilePathway') as fin:
for line in fin:
if line.startswith('>NODE'):
writefasta()
else:
content.append(line)
writefasta()
You could do it like this:
def writefasta(d):
if len(d['content']) > 1:
with open(f'Filename{d["fileno"]}.fasta', 'w') as fout:
fout.write(''.join(d['content']))
d['content'] = ['>NODE\n']
d['fileno'] += 1
with open('test.fasta') as fin:
D = {'content': ['>NODE\n'], 'fileno': 1}
for line in fin:
if line.startswith('>NODE'):
writefasta(D)
else:
D['content'].append(line)
writefasta(D)
This would be better way. It is going to write only on odd iterations. So that, ">NODE" will be skipped and files will be created only for the real content.
with open('filename.txt') as fo:
cntr=1
for i,content in enumerate(fo.read().split("\n")):
if i%2 == 1:
with open (str(cntr) + '.txt','w') as opf:
opf.write(content)
cntr += 1
By the way, since you are using context manager, you dont need to close the file.
Context managers allow you to allocate and release resources precisely
when you want to. It opens the file, writes some data to it and then
closes it.
Please check: https://book.pythontips.com/en/latest/context_managers.html
with open('FileName') as fo:
cntr = 1
for line in fo.readlines():
with open (f'{str(cntr)}.fasta','w') as opf:
opf.write(line)
opf.close()
op = ''
cntr += 1
fo.close()
I am trying to extract IPv4 addresses from a text file and save them as a list to a new file, however, I can not use regex to parse the file, Instead, I have check the characters individually. Not really sure where to start with that, everything I find seems to have import re as the first line.
So far this is what I have,
#Opens and prints wireShark txt file
fileObject = open("wireShark.txt", "r")
data = fileObject.read()
print(data)
#Save IP adresses to new file
with open('wireShark.txt') as fin, open('IPAdressess.txt', 'wt') as fout:
list(fout.write(line) for line in fin if line.rstrip())
#Opens and prints IPAdressess txt file
fileObject = open("IPAdressess.txt", "r")
data = fileObject.read()
print(data)
#Close Files
fin.close()
fout.close()
So I open the file, and I have created the file that I will put the extracted IP's in, I just don't know ow to pull them without using REGEX.
Thanks for the help.
Here is a possible solution.
The function find_first_digit, position the index at the next digit in the text if any and return True. Else return False
The functions get_dot and get_num read a number/dot and, lets the index at the position just after the number/dot and return the number/dot as str. If one of those functions fails to get the number/dot raise an MissMatch exception.
In the main loop, find a digit, save the index and then try to get an ip.
If sucess, write it to output file.
If any of the called functions raises a MissMatch exception, set the current index to the saved index plus one and start over.
class MissMatch(Exception):pass
INPUT_FILE_NAME = 'text'
OUTPUT_FILE_NAME = 'ip_list'
def find_first_digit():
while True:
c = input_file.read(1)
if not c: # EOF found!
return False
elif c.isdigit():
input_file.seek(input_file.tell() - 1)
return True
def get_num():
num = input_file.read(1) # 1st digit
if not num.isdigit():
raise MissMatch
if num != '0':
for i in range(2): # 2nd 3th digits
c = input_file.read(1)
if c.isdigit():
num += c
else:
input_file.seek(input_file.tell() - 1)
break
return num
def get_dot():
if input_file.read(1) == '.':
return '.'
else:
raise MissMatch
with open(INPUT_FILE_NAME) as input_file, open(OUTPUT_FILE_NAME, 'w') as output_file:
while True:
ip = ''
if not find_first_digit():
break
saved_position = input_file.tell()
try:
ip = get_num() + get_dot() \
+ get_num() + get_dot() \
+ get_num() + get_dot() \
+ get_num()
except MissMatch:
input_file.seek(saved_position + 1)
else:
output_file.write(ip + '\n')
Assume:
self.base_version = 1000
self.target_version = 2000
I have a file as follows:
some text...
<tsr_args> \"upgrade_test test_mode=upgrade base_sw=1000 target_sw=2000 system_profile=eth\"</tsr_args>
some text...
<tsr_args> \"upgrade_test test_mode=rollback base_sw=2000 target_sw=1000 system_profile=eth manufacture_type=no-manufacture\"</tsr_args>
some text...
<tsr_args> \"upgrade_test test_mode=downgrade base_sw=2000 target_sw=1000 system_profile=eth no_boot_next_enable_flag=True\"</tsr_args>
I need the base and target version values to be placed as specified above (Note that on the 2nd and 3rd entry, the base and target are opposite).
I tried to do it as follows, but it does not work:
base_regex = re.compile('.*test_mode.*base_sw=(.*)')
target_regex = re.compile('.*test_mode.*target_sw=(.*)')
o = open(file,'a')
for line in open(file):
if 'test_mode' in line:
if 'upgrade' in line:
new_line = (re.sub(base_regex, self.base_version, line))
new_line = (re.sub(target_regex, self.target_version, line))
o.write(new_line)
elif 'rollback' in line or 'downgrade' in line):
new_line = (re.sub(base_regex, self.target_version, line))
new_line = (re.sub(target_regex, self.base_version, line))
o.write(new_line)
o.close()
Assume the above code runs properly without any syntax errors.
The file is not modified at all.
The complete line is modified instead of just the captured group. How can I make re.sub to substitute only the captured group?
You are opening file with a -> append. So, your changes should be at the end of file. You should create a new file and replace old_one at the end of your script.
There is only one way I know if you want replace several matching groups: first of all you find word using regexp and replace it like a string without regexp.
Thanks Jimilan for your remarks. I fixed my code, and now it`s working:
base_regex = re.compile(.*test_mode.*base_sw=(\S*))
target_regex = re.compile(.*test_mode.*target_sw=(\S*))
for file in self.upgrade_cases_files_list:
file_handle = open(file, 'r')
file_string = file_handle.read()
file_handle.close()
base_version_result = base_regex.search(file_string)
target_version_result = target_regex.search(file_string)
if base_version_result is not None:
current_base_version = base_version_result.group(1)
else:
raise Exception("Could not detect base version in the following file: -> %s \n" % (file))
if target_version_result is not None:
current_target_version = target_version_result.group(1)
else:
raise Exception("Could not detect target version in the following file: -> %s \n" % (file))
file_string = file_string.replace(current_base_version, self.base_version)
file_string = file_string.replace(current_target_version, self.target_version)
file_handle = open(file, 'w')
file_handle.write(file_string)
file_handle.close()
I'm a Python noob, and am trying to write a script to take two Intel hex files (one for my application code, one for a bootloader), strip the EOF record out of the first one, append the second file to the stripped version of the first, and save as a new file. I've got everything working, but then decided to get fancier: I want to ensure that the last line of the first file truly matches the Intel EOF record format. I can't seem to get the syntax for this conditional down correctly, though.
appFile = open("MyAppFile.hex", "r")
lines = appFile.readlines()
appFile.close()
appStrip = open("MyAppFile and BootFile.hex",'w')
if appStrip.readline[:] == ":00000001FF": #Python complains about "builtin_function_or_method" not subscriptable here
appStrip.writelines([item for item in lines[:-1]])
appStrip.close()
else:
print("No EOF record in last line. File may be corrupted.")
appFile = open("MyAppFile and BootFile", "r")
appObcode = appFile.read()
appFile.close()
bootFile = open("MyBootFile", "r")
bootObcode = bootFile.read()
bootFile.close()
comboData = appObcode + bootObcode
comboFile = open("MyAppFile and BootFile", "w")
comboFile.write(comboData)
comboFile.close()
Any other suggestions for a cleaner or safer version of this are welcome, too.
UPDATE:
Added a line to print the last line; I am getting the expected output, but the comparison still fails every time. Here is current full program:
appFile = open("C:/LightLock/Master/Project/Debug/Exe/Light Lock.hex")
appLines = appFile.readlines()
appFile = open("MyAppFile.hex").read()
EOF = appLines[len(appLines)-1]
print(appLines[len(appLines)-1])
if not EOF == (":00000001FF"):
print("No EOF record in last line of file. File may be corrupted.")
else:
with open("MyAppFile Plus Boot", "a") as appStrip:
appStrip.writelines([item for item in appLines[:-1]])
with open("MyAppFile Plus Boot.hex", "r") as appFile:
appObcode = appFile.read()
with open("MyBootFile.hex", "r") as bootFile:
bootObcode = bootFile.read()
comboData = appObcode + bootObcode
with open("MyAppFile Plus Boot.hex", "w") as comboFile:
comboFile.write(comboData)
UPDATE2:
Tried modifying the check to include a carriage return and line feed like so:
EOF = appLines[len(appLines)-1]
print(EOF)
if EOF != (":00000001FF","\r","\n"):
print("No EOF record in last line of file. File may be corrupted.")
Still no luck.
Here's a simpler version :
app = open("app.hex").read()
if not app.endswith(":00000001FF"):
print("No EOF")
combo = open("combo.hex","w")
combo.write(app)
boot = open("boot.hex").read()
combo.write(boot)
combo.close() # it's automatic after program ended
Finally figured it out: I wrote some test code to output the length of the string that Python was reading. Turns out it was 12 chars, though only 11 were displayed. So I knew one of the "invisible" chars must be the carriage return or line feed. Tried both; turned out to be line feed (new line).
Here's the final (working, but "unoptimized") code:
appFile = open("MyAppFile.hex")
appLines = appFile.readlines()
appFile = open("MyAppFile.hex").read()
EOF = appLines[len(appLines)-1]
if EOF != (":00000001FF\n"):
print("No EOF record in last line of file. File may be corrupted.")
else:
with open("MyAppFile and Boot.hex", "a") as appStrip:
appStrip.writelines([item for item in appLines[:-1]])
with open("MyAppFile and Boot.hex", "r") as appFile:
appObcode = appFile.read()
with open("MyBootFile.hex", "r") as bootFile:
bootObcode = bootFile.read()
comboData = appObcode + bootObcode
with open("MyAppFile and Boot.hex", "w") as comboFile:
comboFile.write(comboData)
I have a file that looks like this
!--------------------------------------------------------------------------DISK
[DISK]
DIRECTION = 'OK'
TYPE = 'normal'
!------------------------------------------------------------------------CAPACITY
[CAPACITY]
code = 0
ID = 110
I want to read sections [DISK] and [CAPACITY].. there will be more sections like these. I want to read the parameters defined under those sections.
I wrote a following code:
file_open = open(myFile,"r")
all_lines = file_open.readlines()
count = len(all_lines)
file_open.close()
my_data = {}
section = None
data = ""
for line in all_lines:
line = line.strip() #remove whitespace
line = line.replace(" ", "")
if len(line) != 0: # remove white spaces between data
if line[0] == "[":
section = line.strip()[1:]
data = ""
if line[0] !="[":
data += line + ","
my_data[section] = [bit for bit in data.split(",") if bit != ""]
print my_data
key = my_data.keys()
print key
Unfortunately I am unable to get those sections and the data under that. Any ideas on this would be helpful.
As others already pointed out, you should be able to use the ConfigParser module.
Nonetheless, if you want to implement the reading/parsing yourself, you should split it up into two parts.
Part 1 would be the parsing at file level: splitting the file up into blocks (in your example you have two blocks: DISK and CAPACITY).
Part 2 would be parsing the blocks itself to get the values.
You know you can ignore the lines starting with !, so let's skip those:
with open('myfile.txt', 'r') as f:
content = [l for l in f.readlines() if not l.startswith('!')]
Next, read the lines into blocks:
def partition_by(l, f):
t = []
for e in l:
if f(e):
if t: yield t
t = []
t.append(e)
yield t
blocks = partition_by(content, lambda l: l.startswith('['))
and finally read in the values for each block:
def parse_block(block):
gen = iter(block)
block_name = next(gen).strip()[1:-1]
splitted = [e.split('=') for e in gen]
values = {t[0].strip(): t[1].strip() for t in splitted if len(t) == 2}
return block_name, values
result = [parse_block(b) for b in blocks]
That's it. Let's have a look at the result:
for section, values in result:
print section, ':'
for k, v in values.items():
print '\t', k, '=', v
output:
DISK :
DIRECTION = 'OK'
TYPE = 'normal'
CAPACITY :
code = 0
ID = 110
Are you able to make a small change to the text file? If you can make it look like this (only changed the comment character):
#--------------------------------------------------------------------------DISK
[DISK]
DIRECTION = 'OK'
TYPE = 'normal'
#------------------------------------------------------------------------CAPACITY
[CAPACITY]
code = 0
ID = 110
Then parsing it is trivial:
from ConfigParser import SafeConfigParser
parser = SafeConfigParser()
parser.read('filename')
And getting data looks like this:
(Pdb) parser
<ConfigParser.SafeConfigParser instance at 0x100468dd0>
(Pdb) parser.get('DISK', 'DIRECTION')
"'OK'"
Edit based on comments:
If you're using <= 2.7, then you're a little SOL.. The only way really would be to subclass ConfigParser and implement a custom _read method. Really, you'd just have to copy/paste everything in Lib/ConfigParser.py and edit the values in line 477 (2.7.3):
if line.strip() == '' or line[0] in '#;': # add new comment characters in the string
However, if you're running 3'ish (not sure what version it was introduced in offhand, I'm running 3.4(dev)), you may be in luck: ConfigParser added the comment_prefixes __init__ param to allow you to customize your prefix:
parser = ConfigParser(comment_prefixes=('#', ';', '!'))
If the file is not big, you can load it and use Regexes to find parts that are of interest to you.