How to find no of paragraphs in a string in python

How to find no of paragraphs in a string in python - python

consider that I have a string say sample which is
"ApacheBench (ab) is a benchmarking tool that can load test servers by sending an arbitrary number of concurrent requests. It was designed for testing Apache installations.
cURL is a command line tool to send and receive data using various protocols like HTTP, HTTPS, FTP, LDAP etc. It comes by default with most of the Linux distributions.
Wrk is a tool that is similar to traditional Apache Benchmark. Configuration and execution are through a command line tool. It has very few but powerful settings, only the essential to generate HTTP load."
There are 3 paragraphs in the above string.How can I get the number of paragraphs using python ?

You seem to have double '\n' as paragraph separators. In this case, the number of paragraphs is:
text.count('\n\n') + 1

If your paragraph ends with a single line break or double line break, you can split the entire text by '\n' and then skip counting any lines with empty text (means empty new lines).
for example,
lines = text.split('\n')
count = 0
if not line.strip() == '':
count += 1

Related

Python - How to read multiple lines from text file as a string and remove all encoding?

I have a list of 77 items. I have placed all 77 items in a text file (one per line).
I am trying to read this into my python script (where I will then compare each item in a list, to another list pulled via API).
Problem: for some reason, 2/77 of the items on the list have encoding, giving me characters of "u00c2" and "u00a2" which means they are not comparing correctly and being missed. I have no idea why these 2/77 have this encoding, but the other 75 are fine, and I don't know how to get rid of the encoding, in python.
Question:
In Python, How can I get rid of the encoding to ensure none of them have any special/weird characters and are just plain text?
Is there a method I can use to do this upon reading the file in?
Here is how I am reading the text file into python:
with open("name_list2.txt", "r") as myfile:
policy_match_list = myfile.readlines()
policy_match_list = [x.strip() for x in policy_match_list]
Note - "policy_match_list" is the list of 77 policies read in from the text file.
Here is how I am comparing my two lists:
for policy_name in policy_match_list:
for us_policy in us_policies:
if policy_name == us_policy["name"]:
print(f"Match #{match} | {policy_name}")
match += 1
Note - "us_policies" is another list of thousands of policies, pulled via API that I am comparing to
Which is resulting in 75/77 expected matches, due to the other 2 policies comparing e.g. "text12 - text" to "text12u00c2-u00a2text" rather than "text12 - text" to "text12 - text"
I hope this makes sense, let me know if I can add any further info
Cheers!

Did you try to open the file while decoding from utf8? because I can't see the file I can't tell this is the problem, but the file might have characters that the default decoding option (which I think is Latin) can't process.
Try doing:
with open("name_list2.txt", "r", encoding="utf-8") as myfile:
Also, you can watch this question about how to treat control characters: Python - how to delete hidden signs from string?
Sorry about not posting it as a comment (as I really don't know if this is the solution), I don't have enough reputation for that.

Certain Unicode characters aren't properly decoded in some cases. In your case, the characters \u00c2 and \u00a2 caused the issue. As of now, I see two fixes:
Try to resolve the encoding by replacing the characters (refer to https://stackoverflow.com/a/56967370)
Copy the text to a new plain text file (if possible) and save it. These extra characters tend to get ignored in that case and consequently removed.

Parsing a CS:GO language file with encoding in Python

This topic is related to the Parsing a CS:GO script file in Python theme, but there is another problem.
I'm working on a content from CS:GO and now i'm trying to make a python tool importing all data from from /scripts/ folder into Python dictionaries.
The next step after parsing data is parsing Language resource file from /resources and making relations between dictionaries and language.
There is an original file for Eng localization:
https://github.com/spec45as/PySteamBot/blob/master/csgo_english.txt
The file format is similar to the previous task, but I have faced with another problems. All language files is in UTF-16-LE encode, i couldn't understand the way of working with encoded files and strings in Python (I'm mostly working with Java)
I have tried to make some solutions, based on open(fileName, encoding='utf-16-le').read(), but i don't know how to work with such encoded strings in pyparsing.
pyparsing.ParseException: Expected quoted string, starting with "
ending with " (at char 0), (line:1, col:1)
Another problem is lines with \"-like expressions, for example:
"musickit_midnightriders_01_desc" "\"HAPPY HOLIDAYS, ****ERS!\"\n -Midnight Riders"
How to parse these symbols if I want to leave these lines as they are?

There are a few new wrinkles to this input file that were not in the original CS:GO example:
embedded \" escaped quotes in some of the value strings
some of the quoted value strings span multiple lines
some of the values end with a trailing environment condition (such as [$WIN32], [$OSX])
embedded comments in the file, marked with '//'
The first two are addressed by modifying the definition of value_qs. Since values are now more fully-featured than keys, I decided to use separate QuotedString definitions for them:
key_qs = QuotedString('"').setName("key_qs")
value_qs = QuotedString('"', escChar='\\', multiline=True).setName("value_qs")
The third requires a bit more refactoring. The use of these qualifying conditions is similar to #IFDEF macros in C - they enable/disable the definition only if the environment matches the condition. Some of these conditions were even boolean expressions:
[!$PS3]
[$WIN32||$X360||$OSX]
[!$X360&&!$PS3]
This could lead to duplicate keys in the definition file, such as in these lines:
"Menu_Dlg_Leaderboards_Lost_Connection" "You must be connected to Xbox LIVE to view Leaderboards. Please check your connection and try again." [$X360]
"Menu_Dlg_Leaderboards_Lost_Connection" "You must be connected to PlayStation®Network and Steam to view Leaderboards. Please check your connection and try again." [$PS3]
"Menu_Dlg_Leaderboards_Lost_Connection" "You must be connected to Steam to view Leaderboards. Please check your connection and try again."
which contain 3 definitions for the key "Menu_Dlg_Leaderboards_Lost_Connection", depending on what environment values were set.
In order to not lose these values when parsing the file, I chose to modify the key at parse time by appending the condition if one was present. This code implements the change:
LBRACK,RBRACK = map(Suppress, "[]")
qualExpr = Word(alphanums+'$!&|')
qualExprCondition = LBRACK + qualExpr + RBRACK
key_value = Group(key_qs + value + Optional(qualExprCondition("qual")))
def addQualifierToKey(tokens):
tt = tokens[0]
if 'qual' in tt:
tt[0] += '/' + tt.pop(-1)
key_value.setParseAction(addQualifierToKey)
So that in the sample above, you would get 3 keys:
Menu_Dlg_Leaderboards_Lost_Connection/$X360
Menu_Dlg_Leaderboards_Lost_Connection/$PS3
Menu_Dlg_Leaderboards_Lost_Connection
Lastly, the handling of comments, probably the easiest. Pyparsing has built-in support for skipping over comments, just like whitespace. You just need to define the expression for the comment, and have the top-level parser ignore it. To support this feature, several common comment forms are pre-defined in pyparsing. In this case, the solution is just to change the final parser defintion to:
parser.ignore(dblSlashComment)
And LASTLY lastly, there is a minor bug in the implementation of QuotedString, in which standard whitespace string literals like \t and \n are not handled, and are just treated as an unnecessarily-escaped 't' or 'n'. So for now, when this line is parsed:
"SFUI_SteamOverlay_Text" "This feature requires Steam Community In-Game to be enabled.\n\nYou might need to restart the game after you enable this feature in Steam:\nSteam -> File -> Settings -> In-Game: Enable Steam Community In-Game\n" [$WIN32]
For the value string you just get:
This feature requires Steam Community In-Game to be enabled.nnYou
might need to restart the game after you enable this feature in
Steam:nSteam -> File -> Settings -> In-Game: Enable Steam Community
In-Gamen
instead of:
This feature requires Steam Community In-Game to be enabled.
You might need to restart the game after you enable this feature in Steam:
Steam -> File -> Settings -> In-Game: Enable Steam Community In-Game
I will have to fix this behavior in the next release of pyparsing.
Here is the final parser code:
from pyparsing import (Suppress, QuotedString, Forward, Group, Dict,
ZeroOrMore, Word, alphanums, Optional, dblSlashComment)
LBRACE,RBRACE = map(Suppress, "{}")
key_qs = QuotedString('"').setName("key_qs")
value_qs = QuotedString('"', escChar='\\', multiline=True).setName("value_qs")
# use this code to convert integer values to ints at parse time
def convert_integers(tokens):
if tokens[0].isdigit():
tokens[0] = int(tokens[0])
value_qs.setParseAction(convert_integers)
LBRACK,RBRACK = map(Suppress, "[]")
qualExpr = Word(alphanums+'$!&|')
qualExprCondition = LBRACK + qualExpr + RBRACK
value = Forward()
key_value = Group(key_qs + value + Optional(qualExprCondition("qual")))
def addQualifierToKey(tokens):
tt = tokens[0]
if 'qual' in tt:
tt[0] += '/' + tt.pop(-1)
key_value.setParseAction(addQualifierToKey)
struct = (LBRACE + Dict(ZeroOrMore(key_value)) + RBRACE).setName("struct")
value <<= (value_qs | struct)
parser = Dict(key_value)
parser.ignore(dblSlashComment)
sample = open('cs_go_sample2.txt').read()
config = parser.parseString(sample)
print (config.keys())
for k in config.lang.keys():
print ('- ' + k)
#~ config.lang.pprint()
print (config.lang.Tokens.StickerKit_comm01_burn_them_all)
print (config.lang.Tokens['SFUI_SteamOverlay_Text/$WIN32'])

Ways to parse tuple data from variable in python

I am working on an API at work that pulls some connection info from our backend, and sets it as a variable. Need some suggestions on how to parse this data to get what I need, when I need to. Below is a sample of the output that is in my variable.
(ArrayOfString){
string[] =
"Starting Up
- AuthCode OK
- Found 4123 Devices
Done
OK",
"007.blahname.com AB Publishing 1.1.1.1 CentOS Linux 5.0
",
"027503-blah test blah 1.1.1.2 NetScaler OS Network Gathering 1.1.1.1 22
",
"028072-;alskdjf; Alpha Group 192.168.19.100 CentOS Linux 5 SSH 2.2.2.2 2022
",
"028072-4alksgjasdfserver Alpha Group 192.168.19.101 CentOS Linux 5 SSH 2.3.4.5 2022
",
Not sure if easily visible, but everything is tab delimited. What I need in the end, is it set up as columns, so I can search for a Device name (column 1), and read the associated IP, port, and connection method(colums 7, 8, and 6 on the 028072 example above.
Any help/ideas on where to start would be helpful.

You could use the CSV module from the standard libary.

You can split by tabs specifically .split('\t') or by whitespace .split(), I believe.

What you've shown us looks like C# source code. If that's what you're actually getting, you'll need to first parse the strings out of that source code, and then you can parse the columns out of those strings.
So, first:
r = re.compile(r'"(.*?)"', re.MULTILINE | re.DOTALL)
lines = r.findall(data)
Next, the first string (the one with a bunch of newlines in it) seems to be some kind of header info that you want to skip over. Also, each line has a newline in it. So, let's fix both of those: (We could have stripped that newline in the regex, but it's just as easy to do it here.)
lines = [line.rstrip('\n') for line in lines[1:]]
Now, each string can be split into columns by tabs, right?
values = [line.split('\t') for line in lines]
That's it.
As an alternative, we could have done StringIO(''.join(lines)) and passed that to csv.reader(sio, delimiter='\t')… and if the parsing were any more complicated than just split, I probably would. But in this case, I think it adds more complexity than it saves.
But there's a problem. If you've copied and pasted correctly, those strings do not have tabs in them, they have spaces. And, since the columns themselves have internal spaces, and there's no quoting or escaping, there's no unambiguous way to split them. You can write some heuristic code that tries to reconstruct the tabs by guessing at tab stops, assuming that any run of 2 or more spaces must be a tab, etc., but that's going to take a good chunk of work to do.

Extract Text from a Binary File (using Python 2.7 on Windows 7)

I have a binary file of size about 5MB.. which has lots of interspersed text.. and control characters..
This is actually an equivalent of an outlook .pst file for SITATEX Application (from SITA).
The file contains all the TEXT MESSAGES sent and received to and from outside world...(but the text has to be extracted through the binary control characters).. all the text messages are clearly available... with line ending ^M characters... etc.
for example: assume ^# ^X are control characters... \xaa with HEX aa, etc. loads of them around my required text extraction.
^#^#^#^#^#^#^#^#^#^#^#BLLBBCC^X^X^X^X^X^X^X^X^X
^X^X^X
MVT^M
EA1123 TEXT TEXT TEXT^M
END^M
\xaa^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#
^#^#^#^#^#^#^#^#^#^#^#TTBBTT^X^X^X^X^X^X^X^X^X
^X^X^X blah blah blah... of control characters.. and then the message comes..
MVT MESSAGE 2
ED1123
etc.
and so on.. for several messages.
Using Perl.. it is easy to do:
while (<>) {
use regular expression to split messages
m/ /
}
How would one do this in python easily..
How to read the file? binary and text interspersed
Eliminate unnecessary control characters
parse the messages in between two \xaa USEFUL TEXT INFORMATION \xaa (HEX 'aa')
print out the required stuff
Loop through all the lines.. and more files.
In the text file sample... I am interested in seeing.. BLLBBCC... and MVT and EA1123 and so on.
Please assist... If it is going to be very difficult in python.. I will have to think through the logic in perl itself.. as it (perl) doesn't throw lots of errors at me at least for the looping part of binary and text stuff.. and the regex.
Thanks.
Update 02Jan after reading your answers/comments
After going through S.Lott's comments and others... This is where I am at.. and it is working 80% ok.
import fileinput
import sys
import re
strfile = r'C:\Users\' \
r'\Learn\python\mvt\sitatex_test.msgs'
f = open(strfile, 'rb')
contents = f.read() # read whole file in contents
#extract the string between two \xaaU.. multiline pattern match
#with look ahead assertion
#and this is stored in a list with all msgs
msgs = re.findall(r'\xaaU.*?(?=\xaaU)', contents, re.I|re.DOTALL|re.M)
for msg in msgs:
#loop through msgs.. to find the first msg then next and so on.
print "## NEW MESSAGE STARTS HERE ##"
#for each msg split the lines.. to read line by line
# stored as list in msglines
msglines = msg.splitlines()
line = 0
#then process each msgline with a message
for msgline in msglines:
line += 1
#msgline = re.sub(r'[\x00]+', r' ', msgline)
mystr = msgline
print mystr
textstrings = re.findall(r'[\x00\x20-\x7E]+', msgline)
So far so good.. still I am not completely done.. because I need to parse the text line by line and word by word.. to pickup (as an example) the origin address and headers, subject line, message body... by parsing the message through the control characters.
Now I am stuck with... how to print line by line with the control characters converted to \x00\x02.. etc (using the \xHH format).. but leave the normal readable text alone.
For example.. say I have this: assume ^# and ^X are some control characters
line1 = '^#UG^#^#^#^#^#^#^#^#^#^#BLLBBCC^X^X^X^X^X^X^X^X^X' (on the first line).
When I print the line as it is on IDLE.. print line1.. it prints only say the first 2 or 3 characters.. and ignores the rest due to the control characters get choked.
However, when I print with this: print re.findall(r'.*', line1)
['\xaaUG\x02\x05\x00\x04\x00\x00\x00\x05\x00\x00\x00....
x00\x00\x00..BLLBBCC\x00\x00N\x00N\\x00
002 010 180000 DEC 11', '']
It prints nicely with all the control characters converted to \xHH format.. and ascii text intact.. (just as I want it)..with one catch.. the list has two items.. with '' in the end.
What is the explanation for the empty string in the end?
How to avoid it... I just want the line converted nicely to a string (not a list). i.e. one line of binary/text to be converted to a string with \xHH codes.. leave the ASCII TEXT alone.
Is using re.findall(r'.*', line1) is the only easy solution.. to do this conversion.. or are there any other straightforward method.. to convert a '\x00string' to \xHH and TEXT (where it is a printable character or whitespace).
Also.. any other useful comments to get the lines out nicely.
Thanks.
Update 2Jan2011 - Part 2
I have found out that re.findall(r'.+', line1) strips to
['\xaaUG\x02\x05\x00\x04\x00\x00\x00\x05\x00\x00\x00....
x00\x00\x00..BLLBBCC\x00\x00N\x00N\\x00
002 010 180000 DEC 11']
without the extra blank '' item in the list. This finding after numerous trial and errors.
Still I will need assistance to eliminate the list altogether but return just a string.
like this:
'\xaaUG\x02\x05\x00\x04..BLLBBCC..002 010 180000 DEC 11'
Added Info on 05Jan:
#John Machin
1) \xaaU is the delimiter between messages.. In the example.. I may have just left out in the samples. Please see below for one actual message that ends with \xaaU (but left out).
Following text is obtained from repr(msg between r'\xaaU.*?(?=\xaaU)')
I am trying to understand the binary format.. this is a typical message which is sent out
the first 'JJJOWXH' is the sender address.. anything that follows that has 7 alphanumeric is the receiver addresses.. Based on the sender address.. I can know whether this is a 'SND' or 'RCV'.. as the source is 'JJJOWXH'... This msg is a 'SND' as we are 'JJJOWXH'.
The message is addressed to: JJJKLXH.... JJJKRXH.... and so on.
As soon as all the.. \x00000000 finishes..
the sita header and subject starts
In this particular case... "\x00QN\x00HX\x00180001 \x00" this is the header.. and I am only interested all the stuff between \x00.
and the body comes next.. after the final \x00 or any other control character... In this case... it is:
COR\r\nMVT \r\nHX9136/17.BLNZ.JJJ\r\nAD2309/2314 EA0128
BBB\r\nDLRA/CI/0032/0022\r\nSI EET 02:14 HRS\r\n RA / 0032 DUE TO
LATE ARVL ACFT\r\n CI / 0022 OFFLOAD OVERHANG PALLET DUE INADEQUATE
PACKING LEADING TO \r\n SPACE PROBLEM
once the readable text ends... the first control character that appears until the end \xaaU is to be ignored... In above cases.. "SPACE PROBLEM".. is the last one.. then control characters starts... so to be ignored... sometimes the control characters are not there till the next \xaaU.
This is one complete message.
"\xaaU\x1c\x04\x02\x00\x05\x06\x1f\x00\x19\x00\x00\x00\xc4\x9d\xedN\x1a\x00?\x02\x02\x00B\x02\x02\x00E\x02\x07\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00M\x02\xec\x00\xff\xff\x00\x00\x00\x00?\x02M\x02\xec\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00:\x03\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x7f\x00JJJOWXH\x00\x05w\x01x\x01\x00\x01JJJKLXH\x00\x00\x7f\x01\x80\x01\x00\x01JJJKRXH\x00F\x87\x01\x88\x01\x00\x01JJJFFXH\x00\xff\x8f\x01\x90\x01\x00\x01JJJFCXH\x00\xff\x97\x01\x98\x01\x00\x01JJJFAXH\x00\x00\x9f\x01\xa0\x01\x00\x01JJJKPXH\x00\x00\xa7\x01\xa8\x01\x00\x01HAKUOHU\x00\x00\xaf\x01\xb0\x01\x00\x01BBBHRXH\x00\x00\xb7\x01\xb8\x01\x00\x01BBBFFHX\x00\x00\xbf\x01\xc0\x01\x00\x01BBBOMHX\x00\x00\xc7\x01\xc8\x01\x00\x01BBBFMXH\x00\x00\xcf\x01\xd0\x01\x00\x01JJJHBER\x00\x00\xd7\x01\xd8\x01\x00\x01BBBFRUO\x00\x00\xdf\x01\xe0\x01\x00\x01BBBKKHX\x00\x00\xe7\x01\xe8\x01\x00\x01JJJLOTG\x00\x01\xef\x01\xf0\x01\x00\x01JJJLCTG\x00\x00\xf7\x01\xf8\x01\x00\x01HDQOMTG\x005\xff\x01\x00\x02\x00\x01CHACSHX\x00K\x07\x02\x08\x02\x00\x01JJJKZXH\x00F\x0f\x02\x10\x02\x00\x01BBBOMUO\x00
\x17\x02\x18\x02\x00\x01BBBORXH\x00 \x1f\x02
\x02\x00\x01BBBOPXH\x00W'\x02(\x02\x00\x01CHACSHX\x00
/\x020\x02\x00\x01JJJDBXH\x0007\x028\x02\x00010000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00QN\x00HX\x00180001
\x00COR\r\nMVT \r\nHX9136/17.BLNZ.JJJ\r\nAD2309/2314 EA0128
BBB\r\nDLRA/CI/0032/0022\r\nSI EET 02:14 HRS\r\n RA / 0032 DUE TO
LATE ARVL ACFT\r\n CI / 0022 OFFLOAD OVERHANG PALLET DUE INADEQUATE
PACKING LEADING TO \r\n SPACE
PROBLEM\x00D-\xedN\x00\x04\x1a\x00t<\x93\x01x\x00M_\x00"
2) I am not using .+ anymore after the 'repr' is known.
3) each Message is multiline.. and i need to preserve all the control characters to make some sense of this proprietary format.. that is why i needed repr to see it up close.
Hope this explains... This is just 1 message out of 1000s with in the file... and some are 'SND' and some are 'RCV'... and for 'RCV' there will not be '000000'.. and occasionally there are minor exceptions to the rule... but usually that is okay.
Any further suggestions anyone.. I am still working with the file.. to retrieve the text out intact... with sender and receiver addresses.
Thank you.

Python supports regexes too. I don't speak Perl, so I don't know exactly what your Perl code does, but this Python program might help you:
import re
with open('yourfile.pst') as f:
contents = f.read()
textstrings = re.findall(r'[\x20-\x7E]+', contents)
That will get you a list of all strings of one or more ASCII printable characters in the file. That may not be exactly what you want, but you might be able to tweak it from there.
Note that if you're using Python 3, then you have to worry about the distinction between binary and textual data and it becomes a bit more complicated. I'm assuming you're in Python 2.

Q: How to read the file? binary and text interspersed
A: Don't bother, just read it as normal text and you'll be able to keep your binary/text dichotomy (otherwise you won't be able to regex it as easily)
fh = open('/path/to/my/file.ext', 'r')
fh.read()
Just in case you want to read binary later for some reason, you just add a b to the second input of the open:
fh = open('/path/to/my/file.ext', 'rb')
Q: Eliminate unnecessary control characters
A: Use the python re module. Your next question sorta ask how
Q: parse the messages in between two \xaa USEFUL TEXT INFORMATION \xaa (HEX 'aa')
A: re module has a findall function that works as you (mostly) expect.
import re
mytext = '\xaaUseful text that I want to keep\xaa^X^X^X\xaaOther text i like\xaa'
usefultext = re.findall('\xaa([a-zA-Z^!-~0-9 ]+)\xaa', mytext)
Q: print out the required stuff
*A: There's a print function...
print usefultext
Q: Loop through all the lines.. and more files.
fh = open('/some/file.ext','r')
for lines in fh.readlines():
#do stuff
I'll let you figure out the os module to figure out what files exist/how to iterate through them.

You say:
Still I will need assistance to eliminate the list altogether but return just a string. like this
In other words, you have foo = [some_string] and you are doing print foo which as a side does repr(some_string) but encloses it in square brackets which you don't want. So just do print repr(foo[0]).
There seem to be several things unexplained:
You say the useful text is bracketed by \xaaU but in the sample file instead of 2 occurrences of that delimiter there is only \xaa (missingU) near the start, and nothing else.
You say
I have found out that re.findall(r'.+', line1) strips to ...
That in effect is stripping out \n (but not \r!!) -- I thought line breaks would be worth preserving when attempting to recover an email message.
>>> re.findall(r'.+', 'abc\r\ndef\r\n\r\n')
['abc\r', 'def\r', '\r']
What you you done with the \r characters? Have you tested a multi-line message? Have you tested a multi-message file?
One is left to guess who or what is intended to consume your output; you write
I need to parse the text line by line and word by word
but you seem overly concerned with printing the message "legibly" with e.g. \xab instead of gibberish.
It looks like the last 6 or so lines in your latest code (for msgline in msglines: etc etc) should be indented one level.
Is it possible to clarify all of the above?

Python help - Parsing Packet Logs

I'm writing a simple program that's going to parse a logfile of a packet dump from wireshark into a more readable form. I'm doing this with python.
Currently I'm stuck on this part:
for i in range(len(linelist)):
if '### SERVER' in linelist[i]:
#do server parsing stuff
packet = linelist[i:find("\n\n", i, len(linelist))]
linelist is a list created using the readlines() method, so every line in the file is an element in the list. I'm iterating through it for all occurances of "### SERVER", then grabbing all lines after it until the next empty line(which signifies the end of the packet). I must be doing something wrong, because not only is find() not working, but I have a feeling there's a better way to grab everything between ### SERVER and the next occurance of a blank line.
Any ideas?

Looking at thefile.readlines() doc:
file.readlines([sizehint])
Read until EOF using readline() and return a list containing the lines thus read. If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read. Objects implementing a file-like interface may choose to ignore sizehint if it cannot be implemented, or cannot be implemented efficiently.
and the file.readline() doc:
file.readline([size])
Read one entire line from the file. A trailing newline character is kept in the string (but may be absent when a file ends with an incomplete line). [6] If the size argument is present and non-negative, it is a maximum byte count (including the trailing newline) and an incomplete line may be returned. An empty string is returned only when EOF is encountered immediately.
A trailing newline character is kept in the string - means that each line in linelist will contain at most one newline. That is why you cannot find a "\n\n" substring in any of the lines - look for a whole blank line (or an empty one at EOF):
if myline in ("\n", ""):
handle_empty_line()
Note: I tried to explain find behavior, but a pythonic solution looks very different from your code snippet.

General idea is:
inpacket = False
packets = []
for line in open("logfile"):
if inpacket:
content += line
if line in ("\n", ""): # empty line
inpacket = False
packets.append(content)
elif '### SERVER' in line:
inpacket = True
content = line
# put here packets.append on eof if needed

This works well with an explicit iterator, also. That way, nested loops can update the iterator's state by consuming lines.
fileIter= iter(theFile)
for x in fileIter:
if "### SERVER" in x:
block = [x]
for y in fileIter:
if len(y.strip()) == 0: # empty line
break
block.append(y)
print block # Or whatever
# elif some other pattern:
This has the pleasant property of finding blocks that are at the tail end of the file, and don't have a blank line terminating them.
Also, this is quite easy to generalize, since there's no explicit state-change variables, you just go into another loop to soak up lines in other kinds of blocks.

best way - use generators
read presentation Generator Tricks for Systems Programmers
This best that I saw about parsing log ;)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to find no of paragraphs in a string in python - python

You seem to have double '\n' as paragraph separators. In this case, the number of paragraphs is: text.count('\n\n') + 1

If your paragraph ends with a single line break or double line break, you can split the entire text by '\n' and then skip counting any lines with empty text (means empty new lines). for example, lines = text.split('\n') count = 0 if not line.strip() == '': count += 1

Related

Python - How to read multiple lines from text file as a string and remove all encoding?

Parsing a CS:GO language file with encoding in Python

Ways to parse tuple data from variable in python

Extract Text from a Binary File (using Python 2.7 on Windows 7)

Python help - Parsing Packet Logs

Categories

Resources