Shell text to python string - python

I'm writing a little python utility to help move our shell -help documentation to searchable webpages, but I hit a weird block :
output = subprocess.Popen([sys.argv[1], '--help'],stdout=subprocess.PIPE).communicate()[0]
output = output.split('\n')
print output[4]
#NAME
for l in output[4]:
print l
#N
#A
#
#A
#M
#
#M
#E
#
#E
#or when written, n?na?am?me?e
It does this for any heading/subheading in the documentation, which makes it near unusable.
Any tips on getting correct formatting? Where did I screw up?
Thanks

The documentation contains overstruck characters done in the ancient line-printer way: print each character, followed by a backspace (\b aka \x08), followed by the same character again. So "NAME" becomes "N\bNA\bAM\bME\bE". If you can convince the program not to output that way, it would be the best; otherwise, you can clean it up with something like output = re.sub(r'\x08.', '', output)

A common way to mark a character as bold in a terminal is to print the character, followed by a backspace characters, followed by the character itself again (just like you would do it on a mechanical typewriter). Terminal emulators like xterm detect such sequences and turn them into bold characters. Programs shouldn't be printing such sequences if stdout is not a terminal, but if your tool does, you will have to clean up the mess yourself.

Related

Removing a control character using Python

I have a script that processes the output of a command (the aws help cli command).
I step through the output line-by-line and don't start the actual real parsing until I encounter the text "AVAILABLE COMMANDS" at which point I set a flag to true and start further processing on each line.
I've had this working fine - BUT on Ubuntu we encounter a problem which is this :
The CLI highlights the text in a way I have not seen before:
The output is very long, so I've grep'd the particular line in question - see below:
># aws ec2 help | egrep '^A'
>AVAILABLE COMMANDS
># aws ec2 help | egrep '^A' | cat -vet
>A^HAV^HVA^HAI^HIL^HLA^HAB^HBL^HLE^HE C^HCO^HOM^HMM^HMA^HAN^HND^HDS^HS$
What I haven't seen before is that each letter that is highligted is in the format X^HX.
I'd like to apply a simple transformation of the type X^HX --> X (for all a-zA-Z).
What have I tried so far:
well my workaround is this - first I remove control characters like this:
String = re.sub(r'[\x00-\x1f\x7f-\x9f]','',String)
but I still have to search for 'AAVVAAIILLAABBLLEE' which is totally ugly. I considered using a further regex to turn doubles to singles but that will catch true doubles and get messy.
I started writing a function with an iteration across a constructed list of alpha characters to translate as described, and I used hexdump to try to figure out the exact \x code of the control characters in question but could not get it working - I could remove H but not the ^.
I really don't want to use any additional modules because I want to make this available to people without them having to install extras. In conclusion I have a workaround that is quite ugly, but I'm sure someone must know a quick an easy way to do this translation. It's odd that it only seems to show up on Ubuntu.
After looking at this a little further I was able to put in place a solution:
from string import ascii_lowercase
from string import ascii_uppercase
def RemoveUbuntuHighlighting(String):
for Char in ascii_uppercase + ascii_lowercase:
Match = Char + '\x08' + Char
String = re.sub(Match,Char,String)
return(String)
I'm still a little confounded to see characters highlighted in the format (X\x08X), the arrangement does seem to repeat the same information unnecessarily.
The other thing I would advise to anyone not familiar with reading hexcode is that each pair of hexes is swapped around with respect to the order of their appearance.
A much simpler and more reliable fix is to replace a backspace and duplicate of any character.
I have also augmented this to handle underscores using the same mechanism (character, backspace, underscore).
String = re.sub(r'(.)\x08(\1|_)', r'\1', String)
Demo: https://ideone.com/yzwd2V
This highlighting was standard back when output was to a line printer; backspacing and printing the same character again would add pigmentation to produce boldface. (Backspacing and printing an underscore would produce underlining.)
Probably the AWS CLI can be configured to disable this by setting the TERM variable to something like dumb. There is also a utility col which can remove this formatting (try col-b; maybe see also colcrt). Though perhaps really the best solution would be to import the AWS Python code and extract the help message natively.

Python: \b behaves differently from book's description [duplicate]

This question already has an answer here:
Why Python IDLE Print Instead Of Backspace? [duplicate]
(1 answer)
Closed 6 years ago.
I'm reading automate_the_boring_stuff_with_python_2015 and I got to this snippet:
print(positionStr, end='')
print('\b' * len(positionStr), end='', flush=True)
where positionStr is a string defined earlier. I looked at python escape sequences and saw that \b is backspace but for some reason the author says it should erase the printed string
To erase text, print the \b backspace escape character. This special
character erases a character at the end of the current line on the screen. The line at u uses string replication to produce a string with as many \b
characters as the length of the string stored in positionStr, which has the
effect of erasing the positionStr string that was last printed.
this contradicts what I saw in here (table in mid page)
this differs from my results
As you can see I got a bunch of backspace chars, as I guess I should have (I ran a loop in which I printed the said string and then the \b string)
Now, is the book wrong or should I have done something different in order for it to work? Additionally, if this is wrong, is there a way to achieve this goal? (print string and then delete it)
As it can be seen from the picture, I work with python 3.5.3. on Windows 8.1
Not all consoles support the \b character as a deletion character, especially graphical ones.
(same thing happens when you write it to a file, the previous char is not deleted either)
Try your example in a native shell (Windows or Linux would work) and the characters will be properly deleted.
Windows CMD:
>>> print("a\bc")
c
PyScripter (that's what I have):
>>> print("a\bc")
a<strange char>c

Use search with regex to find Korean characters using Python

Using Python 2.7.9 on Windows 8.1 Enterprise 64-bit
I'm using the following code to search for any Korean characters ( http://lcweb2.loc.gov/diglib/codetables/9.3.html )
line = ['x'. 'y', 'z', '쭌', 'a']
if any([re.search("[%s-%s]" % ("\xE3\x84\xB1".decode('utf-8'), "\xEC\xAD\x8C".decode('utf-8')), x) for x in line[3:]]):
print "found character"
When ever I run the script and give it the following character 쭌 the console shows ∞¡î which is a result of IDLE / Command Prompt being unable to show Korean characters I'm guessing.
쭌 is the last character that I was hoping to match in the regex
So is the above search correct at least? I'd prefer to know I at least have the right pattern to search for and spend time trying to make the console show the proper Korean characters.
I've tried in command prompt to do cph 1252 and nothing. It never prints out "found character" so I wouldn't ever know.
If it helps, the script is receiving text from an IRC channel where Korean is usually spoken.
Use Unicode strings (note the "u" prefixes):
import re
line = [u'x', u'y', u'z', u'쭌', u'a']
if any([re.search(u'[\u3131-\ucb4c]', x) for x in line[3:]]):
print "found character"
If you wanted to use the regex library (not to be confused with re), you could do this:
import regex
regex.search(r'\p{IsHangul}', '오소리')
or in a function to detect at least one Hangul character:
import regex
def is_hangul(value):
if regex.search(r'\p{IsHangul}', value):
return True
return False
print(is_hangul('오소리')) # True
print(is_hangul('mushroom')) # False
print(is_hangul('뱀')) # True

Invalid character in identifier

I am working on the letter distribution problem from HP code wars 2012. I keep getting an error message that says "invalid character in identifier". What does this mean and how can it be fixed?
Here is the page with the information.
import string
def text_analyzer(text):
'''The text to be parsed and
the number of occurrences of the letters given back
be. Punctuation marks, and I ignore the EOF
simple. The function is thus very limited.
'''
result = {}
# Processing
for a in string.ascii_lowercase:
result [a] = text.lower (). count (a)
return result
def analysis_result (results):
# I look at the data
keys = analysis.keys ()
values \u200b\u200b= list(analysis.values \u200b\u200b())
values.sort (reverse = True )
# I turn to the dictionary and
# Must avoid that letters will be overwritten
w2 = {}
list = []
for key in keys:
item = w2.get (results [key], 0 )
if item = = 0 :
w2 [analysis results [key]] = [key]
else :
item.append (key)
w2 [analysis results [key]] = item
# We get the keys
keys = list (w2.keys ())
keys.sort (reverse = True )
for key in keys:
list = w2 [key]
liste.sort ()
for a in list:
print (a.upper (), "*" * key)
text = """I have a dream that one day this nation will rise up and live out the true
meaning of its creed: "We hold these truths to be self-evident, that all men
are created equal. "I have a dream that my four little children will one day
live in a nation where they will not be Judged by the color of their skin but
by the content of their character.
# # # """
analysis result = text_analyzer (text)
analysis_results (results)
The error SyntaxError: invalid character in identifier means you have some character in the middle of a variable name, function, etc. that's not a letter, number, or underscore. The actual error message will look something like this:
File "invalchar.py", line 23
values = list(analysis.values ())
^
SyntaxError: invalid character in identifier
That tells you what the actual problem is, so you don't have to guess "where do I have an invalid character"? Well, if you look at that line, you've got a bunch of non-printing garbage characters in there. Take them out, and you'll get past this.
If you want to know what the actual garbage characters are, I copied the offending line from your code and pasted it into a string in a Python interpreter:
>>> s=' values ​​= list(analysis.values ​​())'
>>> s
' values \u200b\u200b= list(analysis.values \u200b\u200b())'
So, that's \u200b, or ZERO WIDTH SPACE. That explains why you can't see it on the page. Most commonly, you get these because you've copied some formatted (not plain-text) code off a site like StackOverflow or a wiki, or out of a PDF file.
If your editor doesn't give you a way to find and fix those characters, just delete and retype the line.
Of course you've also got at least two IndentationErrors from not indenting things, at least one more SyntaxError from stay spaces (like = = instead of ==) or underscores turned into spaces (like analysis results instead of analysis_results).
The question is, how did you get your code into this state? If you're using something like Microsoft Word as a code editor, that's your problem. Use a text editor. If not… well, whatever the root problem is that caused you to end up with these garbage characters, broken indentation, and extra spaces, fix that, before you try to fix your code.
If your keyboard is set to English US (International) rather than English US the double quotation marks don't work. This is why the single quotation marks worked in your case.
Similar to the previous answers, the problem is some character (possibly invisible) that the Python interpreter doesn't recognize. Because this is often due to copy-pasting code, re-typing the line is one option.
But if you don't want to re-type the line, you can paste your code into this tool or something similar (Google "show unicode characters online"), and it will reveal any non-standard characters. For example,
s=' values ​​= list(analysis.values ​​())'
becomes
s=' values U+200B U+200B​​ = list(analysis.values U+200B U+200B ​​())'
You can then delete the non-standard characters from the string.
Carefully see your quotation, is this correct or incorrect! Sometime double quotation doesn’t work properly, it's depend on your keyboard layout.
I got a similar issue. My solution was to change minus character from:
—
to
-
I got that error, when sometimes I type in Chinese language.
When it comes to punctuation marks, you do not notice that you are actually typing the Chinese version, instead of the English version.
The interpreter will give you an error message, but for human eyes, it is hard to notice the difference.
For example, "," in Chinese; and "," in English.
So be careful with your language setting.
Not sure this is right on but when i copied some code form a paper on using pgmpy and pasted it into the editor under Spyder, i kept getting the "invalid character in identifier" error though it didn't look bad to me. The particular line was grade_cpd = TabularCPD(variable='G',\
For no good reason I replaced the ' with " throughout the code and it worked. Not sure why but it did work
A little bit late but I got the same error and I realized that it was because I copied some code from a PDF. Check the difference between these two:
-
−
The first one is from hitting the minus sign on keyboard and the second is from a latex generated PDF.
This error occurs mainly when copy-pasting the code. Try editing/replacing minus(-), bracket({) symbols.
You don't get a good error message in IDLE if you just Run the module. Try typing an import command from within IDLE shell, and you'll get a much more informative error message. I had the same error and that made all the difference.
(And yes, I'd copied the code from an ebook and it was full of invisible "wrong" characters.)
My solution was to switch my Mac keyboard from Unicode to U.S. English.
it is similar for me as well after copying the code from my email.
def update(self, k=1, step = 2):
if self.start.get() and not self.is_paused.get(): U+A0
x_data.append([i for i in range(0,k,1)][-1])
y = [i for i in range(0,k,step)][-1]
There is additional U+A0 character after checking with the tool as recommended by #Jacob Stern.

Extract Text from a Binary File (using Python 2.7 on Windows 7)

I have a binary file of size about 5MB.. which has lots of interspersed text.. and control characters..
This is actually an equivalent of an outlook .pst file for SITATEX Application (from SITA).
The file contains all the TEXT MESSAGES sent and received to and from outside world...(but the text has to be extracted through the binary control characters).. all the text messages are clearly available... with line ending ^M characters... etc.
for example: assume ^# ^X are control characters... \xaa with HEX aa, etc. loads of them around my required text extraction.
^#^#^#^#^#^#^#^#^#^#^#BLLBBCC^X^X^X^X^X^X^X^X^X
^X^X^X
MVT^M
EA1123 TEXT TEXT TEXT^M
END^M
\xaa^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#
^#^#^#^#^#^#^#^#^#^#^#TTBBTT^X^X^X^X^X^X^X^X^X
^X^X^X blah blah blah... of control characters.. and then the message comes..
MVT MESSAGE 2
ED1123
etc.
and so on.. for several messages.
Using Perl.. it is easy to do:
while (<>) {
use regular expression to split messages
m/ /
}
How would one do this in python easily..
How to read the file? binary and text interspersed
Eliminate unnecessary control characters
parse the messages in between two \xaa USEFUL TEXT INFORMATION \xaa (HEX 'aa')
print out the required stuff
Loop through all the lines.. and more files.
In the text file sample... I am interested in seeing.. BLLBBCC... and MVT and EA1123 and so on.
Please assist... If it is going to be very difficult in python.. I will have to think through the logic in perl itself.. as it (perl) doesn't throw lots of errors at me at least for the looping part of binary and text stuff.. and the regex.
Thanks.
Update 02Jan after reading your answers/comments
After going through S.Lott's comments and others... This is where I am at.. and it is working 80% ok.
import fileinput
import sys
import re
strfile = r'C:\Users\' \
r'\Learn\python\mvt\sitatex_test.msgs'
f = open(strfile, 'rb')
contents = f.read() # read whole file in contents
#extract the string between two \xaaU.. multiline pattern match
#with look ahead assertion
#and this is stored in a list with all msgs
msgs = re.findall(r'\xaaU.*?(?=\xaaU)', contents, re.I|re.DOTALL|re.M)
for msg in msgs:
#loop through msgs.. to find the first msg then next and so on.
print "## NEW MESSAGE STARTS HERE ##"
#for each msg split the lines.. to read line by line
# stored as list in msglines
msglines = msg.splitlines()
line = 0
#then process each msgline with a message
for msgline in msglines:
line += 1
#msgline = re.sub(r'[\x00]+', r' ', msgline)
mystr = msgline
print mystr
textstrings = re.findall(r'[\x00\x20-\x7E]+', msgline)
So far so good.. still I am not completely done.. because I need to parse the text line by line and word by word.. to pickup (as an example) the origin address and headers, subject line, message body... by parsing the message through the control characters.
Now I am stuck with... how to print line by line with the control characters converted to \x00\x02.. etc (using the \xHH format).. but leave the normal readable text alone.
For example.. say I have this: assume ^# and ^X are some control characters
line1 = '^#UG^#^#^#^#^#^#^#^#^#^#BLLBBCC^X^X^X^X^X^X^X^X^X' (on the first line).
When I print the line as it is on IDLE.. print line1.. it prints only say the first 2 or 3 characters.. and ignores the rest due to the control characters get choked.
However, when I print with this: print re.findall(r'.*', line1)
['\xaaUG\x02\x05\x00\x04\x00\x00\x00\x05\x00\x00\x00....
x00\x00\x00..BLLBBCC\x00\x00N\x00N\\x00
002 010 180000 DEC 11', '']
It prints nicely with all the control characters converted to \xHH format.. and ascii text intact.. (just as I want it)..with one catch.. the list has two items.. with '' in the end.
What is the explanation for the empty string in the end?
How to avoid it... I just want the line converted nicely to a string (not a list). i.e. one line of binary/text to be converted to a string with \xHH codes.. leave the ASCII TEXT alone.
Is using re.findall(r'.*', line1) is the only easy solution.. to do this conversion.. or are there any other straightforward method.. to convert a '\x00string' to \xHH and TEXT (where it is a printable character or whitespace).
Also.. any other useful comments to get the lines out nicely.
Thanks.
Update 2Jan2011 - Part 2
I have found out that re.findall(r'.+', line1) strips to
['\xaaUG\x02\x05\x00\x04\x00\x00\x00\x05\x00\x00\x00....
x00\x00\x00..BLLBBCC\x00\x00N\x00N\\x00
002 010 180000 DEC 11']
without the extra blank '' item in the list. This finding after numerous trial and errors.
Still I will need assistance to eliminate the list altogether but return just a string.
like this:
'\xaaUG\x02\x05\x00\x04..BLLBBCC..002 010 180000 DEC 11'
Added Info on 05Jan:
#John Machin
1) \xaaU is the delimiter between messages.. In the example.. I may have just left out in the samples. Please see below for one actual message that ends with \xaaU (but left out).
Following text is obtained from repr(msg between r'\xaaU.*?(?=\xaaU)')
I am trying to understand the binary format.. this is a typical message which is sent out
the first 'JJJOWXH' is the sender address.. anything that follows that has 7 alphanumeric is the receiver addresses.. Based on the sender address.. I can know whether this is a 'SND' or 'RCV'.. as the source is 'JJJOWXH'... This msg is a 'SND' as we are 'JJJOWXH'.
The message is addressed to: JJJKLXH.... JJJKRXH.... and so on.
As soon as all the.. \x00000000 finishes..
the sita header and subject starts
In this particular case... "\x00QN\x00HX\x00180001 \x00" this is the header.. and I am only interested all the stuff between \x00.
and the body comes next.. after the final \x00 or any other control character... In this case... it is:
COR\r\nMVT \r\nHX9136/17.BLNZ.JJJ\r\nAD2309/2314 EA0128
BBB\r\nDLRA/CI/0032/0022\r\nSI EET 02:14 HRS\r\n RA / 0032 DUE TO
LATE ARVL ACFT\r\n CI / 0022 OFFLOAD OVERHANG PALLET DUE INADEQUATE
PACKING LEADING TO \r\n SPACE PROBLEM
once the readable text ends... the first control character that appears until the end \xaaU is to be ignored... In above cases.. "SPACE PROBLEM".. is the last one.. then control characters starts... so to be ignored... sometimes the control characters are not there till the next \xaaU.
This is one complete message.
"\xaaU\x1c\x04\x02\x00\x05\x06\x1f\x00\x19\x00\x00\x00\xc4\x9d\xedN\x1a\x00?\x02\x02\x00B\x02\x02\x00E\x02\x07\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00M\x02\xec\x00\xff\xff\x00\x00\x00\x00?\x02M\x02\xec\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00:\x03\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x7f\x00JJJOWXH\x00\x05w\x01x\x01\x00\x01JJJKLXH\x00\x00\x7f\x01\x80\x01\x00\x01JJJKRXH\x00F\x87\x01\x88\x01\x00\x01JJJFFXH\x00\xff\x8f\x01\x90\x01\x00\x01JJJFCXH\x00\xff\x97\x01\x98\x01\x00\x01JJJFAXH\x00\x00\x9f\x01\xa0\x01\x00\x01JJJKPXH\x00\x00\xa7\x01\xa8\x01\x00\x01HAKUOHU\x00\x00\xaf\x01\xb0\x01\x00\x01BBBHRXH\x00\x00\xb7\x01\xb8\x01\x00\x01BBBFFHX\x00\x00\xbf\x01\xc0\x01\x00\x01BBBOMHX\x00\x00\xc7\x01\xc8\x01\x00\x01BBBFMXH\x00\x00\xcf\x01\xd0\x01\x00\x01JJJHBER\x00\x00\xd7\x01\xd8\x01\x00\x01BBBFRUO\x00\x00\xdf\x01\xe0\x01\x00\x01BBBKKHX\x00\x00\xe7\x01\xe8\x01\x00\x01JJJLOTG\x00\x01\xef\x01\xf0\x01\x00\x01JJJLCTG\x00\x00\xf7\x01\xf8\x01\x00\x01HDQOMTG\x005\xff\x01\x00\x02\x00\x01CHACSHX\x00K\x07\x02\x08\x02\x00\x01JJJKZXH\x00F\x0f\x02\x10\x02\x00\x01BBBOMUO\x00
\x17\x02\x18\x02\x00\x01BBBORXH\x00 \x1f\x02
\x02\x00\x01BBBOPXH\x00W'\x02(\x02\x00\x01CHACSHX\x00
/\x020\x02\x00\x01JJJDBXH\x0007\x028\x02\x00010000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00QN\x00HX\x00180001
\x00COR\r\nMVT \r\nHX9136/17.BLNZ.JJJ\r\nAD2309/2314 EA0128
BBB\r\nDLRA/CI/0032/0022\r\nSI EET 02:14 HRS\r\n RA / 0032 DUE TO
LATE ARVL ACFT\r\n CI / 0022 OFFLOAD OVERHANG PALLET DUE INADEQUATE
PACKING LEADING TO \r\n SPACE
PROBLEM\x00D-\xedN\x00\x04\x1a\x00t<\x93\x01x\x00M_\x00"
2) I am not using .+ anymore after the 'repr' is known.
3) each Message is multiline.. and i need to preserve all the control characters to make some sense of this proprietary format.. that is why i needed repr to see it up close.
Hope this explains... This is just 1 message out of 1000s with in the file... and some are 'SND' and some are 'RCV'... and for 'RCV' there will not be '000000'.. and occasionally there are minor exceptions to the rule... but usually that is okay.
Any further suggestions anyone.. I am still working with the file.. to retrieve the text out intact... with sender and receiver addresses.
Thank you.
Python supports regexes too. I don't speak Perl, so I don't know exactly what your Perl code does, but this Python program might help you:
import re
with open('yourfile.pst') as f:
contents = f.read()
textstrings = re.findall(r'[\x20-\x7E]+', contents)
That will get you a list of all strings of one or more ASCII printable characters in the file. That may not be exactly what you want, but you might be able to tweak it from there.
Note that if you're using Python 3, then you have to worry about the distinction between binary and textual data and it becomes a bit more complicated. I'm assuming you're in Python 2.
Q: How to read the file? binary and text interspersed
A: Don't bother, just read it as normal text and you'll be able to keep your binary/text dichotomy (otherwise you won't be able to regex it as easily)
fh = open('/path/to/my/file.ext', 'r')
fh.read()
Just in case you want to read binary later for some reason, you just add a b to the second input of the open:
fh = open('/path/to/my/file.ext', 'rb')
Q: Eliminate unnecessary control characters
A: Use the python re module. Your next question sorta ask how
Q: parse the messages in between two \xaa USEFUL TEXT INFORMATION \xaa (HEX 'aa')
A: re module has a findall function that works as you (mostly) expect.
import re
mytext = '\xaaUseful text that I want to keep\xaa^X^X^X\xaaOther text i like\xaa'
usefultext = re.findall('\xaa([a-zA-Z^!-~0-9 ]+)\xaa', mytext)
Q: print out the required stuff
*A: There's a print function...
print usefultext
Q: Loop through all the lines.. and more files.
fh = open('/some/file.ext','r')
for lines in fh.readlines():
#do stuff
I'll let you figure out the os module to figure out what files exist/how to iterate through them.
You say:
Still I will need assistance to eliminate the list altogether but return just a string. like this
In other words, you have foo = [some_string] and you are doing print foo which as a side does repr(some_string) but encloses it in square brackets which you don't want. So just do print repr(foo[0]).
There seem to be several things unexplained:
You say the useful text is bracketed by \xaaU but in the sample file instead of 2 occurrences of that delimiter there is only \xaa (missingU) near the start, and nothing else.
You say
I have found out that re.findall(r'.+', line1) strips to ...
That in effect is stripping out \n (but not \r!!) -- I thought line breaks would be worth preserving when attempting to recover an email message.
>>> re.findall(r'.+', 'abc\r\ndef\r\n\r\n')
['abc\r', 'def\r', '\r']
What you you done with the \r characters? Have you tested a multi-line message? Have you tested a multi-message file?
One is left to guess who or what is intended to consume your output; you write
I need to parse the text line by line and word by word
but you seem overly concerned with printing the message "legibly" with e.g. \xab instead of gibberish.
It looks like the last 6 or so lines in your latest code (for msgline in msglines: etc etc) should be indented one level.
Is it possible to clarify all of the above?

Categories

Resources