I'm writing a simple program that's going to parse a logfile of a packet dump from wireshark into a more readable form. I'm doing this with python.
Currently I'm stuck on this part:
for i in range(len(linelist)):
    if '### SERVER' in linelist[i]:
        #do server parsing stuff
        packet = linelist[i:find("\n\n", i, len(linelist))]
linelist is a list created using the readlines() method, so every line in the file is an element in the list. I'm iterating through it for all occurrences of "### SERVER", then grabbing all lines after it until the next empty line (which signifies the end of the packet). I must be doing something wrong, because not only is find() not working, but I have a feeling there's a better way to grab everything between ### SERVER and the next occurrence of a blank line.
Any ideas?
Looking at the file.readlines() doc:
file.readlines([sizehint])
Read until EOF using readline() and return a list containing the lines thus read. If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read. Objects implementing a file-like interface may choose to ignore sizehint if it cannot be implemented, or cannot be implemented efficiently.
and the file.readline() doc:
file.readline([size])
Read one entire line from the file. A trailing newline character is kept in the string (but may be absent when a file ends with an incomplete line). If the size argument is present and non-negative, it is a maximum byte count (including the trailing newline) and an incomplete line may be returned. An empty string is returned only when EOF is encountered immediately.
"A trailing newline character is kept in the string" means that each line in linelist contains at most one newline. That is why you cannot find a "\n\n" substring in any of the lines. Look for a whole blank line instead (or an empty string at EOF):
if myline in ("\n", ""):
    handle_empty_line()
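A minimal sketch of that idea applied to a list produced by readlines(): instead of calling find() on the list, walk forward from the "### SERVER" line until a blank line (or the end of the list) is reached. The helper name packet_at and the sample lines are invented for illustration.

```python
def packet_at(linelist, start):
    """Return linelist[start:end], where end is the index of the next blank line."""
    end = start
    # A line from readlines() counts as blank when it is "\n" or only whitespace.
    while end < len(linelist) and linelist[end].strip():
        end += 1
    return linelist[start:end]


# Sample data shaped like the output of readlines() (invented for this example).
linelist = ["noise\n", "### SERVER\n", "data 1\n", "data 2\n", "\n", "### CLIENT\n", "x\n"]

server_packets = [
    packet_at(linelist, i)
    for i, line in enumerate(linelist)
    if "### SERVER" in line
]
print(server_packets)
```

A packet that runs to the end of the list without a closing blank line is still returned, because the loop also stops at len(linelist).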
Note: I tried to explain find behavior, but a pythonic solution looks very different from your code snippet.
General idea is:
inpacket = False
packets = []
for line in open("logfile"):
    if inpacket:
        content += line
        if line in ("\n", ""): # empty line
            inpacket = False
            packets.append(content)
    elif '### SERVER' in line:
        inpacket = True
        content = line
# put here packets.append on eof if needed
This works well with an explicit iterator, also. That way, nested loops can update the iterator's state by consuming lines.
fileIter = iter(theFile)
for x in fileIter:
    if "### SERVER" in x:
        block = [x]
        for y in fileIter:
            if len(y.strip()) == 0: # empty line
                break
            block.append(y)
        print(block) # Or whatever
    # elif some other pattern:
This has the pleasant property of finding blocks that are at the tail end of the file, and don't have a blank line terminating them.
Also, this is quite easy to generalize, since there's no explicit state-change variables, you just go into another loop to soak up lines in other kinds of blocks.
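To illustrate the generalization, here is a small sketch in which two kinds of blocks share one iterator. The "### CLIENT" marker, the helper name read_block, and the sample text are assumptions made for the example; io.StringIO stands in for the real file.

```python
import io


def read_block(first_line, line_iter):
    """Collect first_line plus every following line up to the next blank line."""
    block = [first_line]
    for line in line_iter:  # consumes lines from the shared iterator
        if not line.strip():
            break
        block.append(line)
    return block


# In-memory stand-in for the log file (contents invented for this example).
the_file = io.StringIO("### SERVER\na\nb\n\n### CLIENT\nc\n\n### SERVER\nd\n")

blocks = {"server": [], "client": []}
file_iter = iter(the_file)
for line in file_iter:
    if "### SERVER" in line:
        blocks["server"].append(read_block(line, file_iter))
    elif "### CLIENT" in line:
        blocks["client"].append(read_block(line, file_iter))

print(blocks)
```

Because the inner loop advances the same iterator, the outer loop resumes right after the blank line that ended the block.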
The best way is to use generators.
Read the presentation "Generator Tricks for Systems Programmers".
It is the best thing I have seen about parsing logs ;)
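As a rough sketch of what a generator-based approach could look like for this log format (not taken from the presentation; the function names and sample log are made up), one generator yields packets and a second one consumes them, so nothing is held in memory beyond the current packet:

```python
import io


def packets(lines, marker="### SERVER"):
    """Yield each packet as a list of lines, from a marker line to the next blank line."""
    current = None
    for line in lines:
        if current is not None:
            if line.strip():
                current.append(line)
            else:
                yield current
                current = None
        elif marker in line:
            current = [line]
    if current is not None:  # packet that runs to the end of the file
        yield current


def sizes(packet_iter):
    """Second pipeline stage: yield the number of lines in each packet."""
    for packet in packet_iter:
        yield len(packet)


# In-memory stand-in for the log file (contents invented for this example).
log = io.StringIO("junk\n### SERVER\na\nb\n\njunk\n### SERVER\nc\n")
result = list(sizes(packets(log)))
print(result)
```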
I have a file which currently stores a string eeb39d3e-dd4f-11e8-acf7-a6389e8e7978
which I am trying to pass into as a variable to my subprocess command.
My current code looks like this
with open(logfilnavn, 'r') as t:
    test = t.readlines()
    print(test)
But this prints ['eeb39d3e-dd4f-11e8-acf7-a6389e8e7978\n'] and I don't want the part with ['\n'] to be passed into my command, so I'm trying to remove them by using replace.
with open(logfilnavn, 'r') as t:
    test = t.readlines()
    removestrings = test.replace('[', '').replace('[', '').replace('\\', '').replace("'", '').replace('n', '')
    print(removestrings)
I get an exception saying this:
'list' object has no attribute 'replace'
So how can I replace these with nothing and store the result as a string for my subprocess command?
readlines() returns a list. Try print(test[0].strip())
You can read the whole file and split lines using str.splitlines:
test = t.read().splitlines()
Your test variable is a list, because readlines() returns a list of all lines read.
Since you said the file only contains this one line, you probably wish to work on only the first line that you read. The brackets and quotes are just how Python displays a list, so the only character you actually need to remove from the string is the trailing newline:
removestrings = test[0].rstrip('\n')
Where you went wrong...
file.readlines() in Python returns a list of the lines in the file (a list is Python's array-like collection). Here you are treating the list as a string. You must first select the string inside it, then apply the string-only method.
In this case, however, that alone would not work, because you are trying to change the way the Python interpreter has displayed the list so that you can read it.
Further information...
In code, the value is not the string that was printed: the brackets and quotes are only part of how the list is displayed. The example below works for any number of lines, but it only prints the first element, so you will need to change that if you need the others.
This may be useful...
You could perhaps make the variables globally available, so that other parts of the program can read them before they go out of scope.
More useless stuff
Scope is the word for the part of the program in which the interpreter (what runs the program) considers a variable to be usable; once the variable is out of scope, the interpreter can remove it from memory. In much larger programs, scope also means you only have to worry about variables locally. For example, i is used in a lot of for loops; without scope, every variable in the whole project would need a different name. Scopes are nested, however, so a name declared again in an inner scope is treated as that scope's own variable. An easy way to picture this is to think of scopes as the branches of a tree: the tips of different branches do not touch, and neither do their variables.
solution?
e.g:
with open(logfilnavn, 'r') as file:
    lines = file.readlines()  # creates a list
# a list comprehension that goes through each item and takes off the last character: \n - the newline character
# (this assumes every line, including the last, ends with a newline)
# this will work with any number of lines
strippedLines = [line[:-1] for line in lines]
# or, without that assumption
strippedLines = [line.replace('\n', '') for line in lines]
# you can now print the string stored within the list
print(strippedLines[0])  # this prints the first element in the list
I hope this helped!
You get the error because readlines returns a list object. Since you mentioned in the comment that there is just one line in the file, it's better to use readline() instead and strip the trailing newline:
with open(logfilnavn, 'r') as t:
    line = t.readline().rstrip('\n')  # readline() keeps the newline, so strip it
# line is still available here, after the with block
print(line)
# output,
eeb39d3e-dd4f-11e8-acf7-a6389e8e7978
readlines will return a list of lines, and you can't use replace with a list.
If you really want to use readlines, you should know that it doesn't remove the newline character from the end, you'll have to do it yourself.
lines = [line.rstrip('\n') for line in t.readlines()]
But still, after removing the newline character yourself from the end of each line, you'll have a list of lines. And from the question, it looks like you only have one line, so you can just access the first line with lines[0].
Or you can just leave out readlines, and just use read, it'll read all of the contents from the file. And then just do rstrip.
contents = t.read().rstrip('\n')
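Putting this together for the subprocess use case, a minimal sketch might look like the following. The helper name read_single_value is made up, and "some-command" and "--id" are placeholders rather than a real tool; the temporary file only exists to keep the example self-contained.

```python
import os
import tempfile


def read_single_value(path):
    """Return the file's contents as one string, without surrounding whitespace."""
    with open(path, "r") as handle:
        return handle.read().strip()


# Write a throwaway file so the example does not depend on an existing log.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("eeb39d3e-dd4f-11e8-acf7-a6389e8e7978\n")

value = read_single_value(tmp.name)
os.remove(tmp.name)

# Placeholder command: each argument is a separate list element, so the
# value is passed as a plain string with no brackets, quotes, or newline.
command = ["some-command", "--id", value]
print(command)
```

A list built this way is the form that subprocess.run() accepts as its first argument.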
In the code below readline() will not increment. I've tried using a value, no value and variable in readline(). When not using a value I don't close the file so that it will iterate but that and the other attempts have not worked.
What happens is just the first byte is displayed over and over again.
If I don't use a function and just place the code in the while loop (without 'line' variable in readline()) it works as expected. It will go through the log file and print out the different hex numbers.
i=0
x=1
def mFinder(line):
    rgps=open('c:/code/gps.log', 'r')
    varr=rgps.readline(line)
    varr=varr[12:14].rstrip()
    rgps.close()
    return varr
while x<900:
    val=mFinder(i)
    i+=1
    x+=1
    print val
    print 'this should change'
It appears you have misunderstood what file.readline() does. Passing in an argument does not tell the method to read a specific numbered line.
The documentation tells you what happens instead:
file.readline([size])
Read one entire line from the file. A trailing newline character is kept in the string (but may be absent when a file ends with an incomplete line). If the size argument is present and non-negative, it is a maximum byte count (including the trailing newline) and an incomplete line may be returned.
Note the part about the size argument: you are passing in a maximum byte count, so rgps.readline(1) reads a single byte, not the first line.
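The effect of the size argument is easy to check on a small in-memory file. This sketch uses io.StringIO as a stand-in for the log file (in Python 3 text mode the limit counts characters rather than bytes); the sample text is invented.

```python
import io

log = io.StringIO("first line\nsecond line\n")

one_char = log.readline(1)   # at most one character of the current line
rest = log.readline()        # the remainder of that same line
next_line = log.readline()   # the following line
at_eof = log.readline()      # empty string: nothing left to read

print(repr(one_char), repr(rest), repr(next_line), repr(at_eof))
```

Each call continues from where the previous one stopped, which is why reopening the file on every call always starts again at the first character.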
You need to keep a reference to the file object around until you are done with it, and repeatedly call readline() on it to get successive lines. You can pass the file object to a function call:
def finder(fileobj):
    line = fileobj.readline()
    return line[12:14].rstrip()

with open('c:/code/gps.log') as rgps:
    x = 0
    while x < 900:
        section = finder(rgps)
        print(section)
        # do stuff
        x += 1
You can also loop over files directly, because they are iterators:
for line in openfileobject:
    # do something with line
or use the next() function to get the next line, as long as you don't mix .readline() calls and iteration (including next()). If you combine this with a generator function, you can leave the file object entirely to a separate function that will read lines and produce sections until you are done:
def read_sections():
    with open('c:/code/gps.log') as rgps:
        for line in rgps:
            yield line[12:14].rstrip()

for section in read_sections():
    # do something with `section`.
    print(section)
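If you still want to stop after a fixed number of lines, as the original while loop did with 900, itertools.islice can cap a generator without a manual counter. This is a sketch under assumptions: the generator takes a file object as a parameter here, and the sample lines are invented so that characters 12 and 13 hold a two-digit hex value.

```python
import io
from itertools import islice


def read_sections(fileobj):
    """Yield characters 12-13 of every line, as in the answer above."""
    for line in fileobj:
        yield line[12:14].rstrip()


# Five invented lines; the value of interest sits at index 12 and 13.
sample = io.StringIO("".join("0123456789ab%02x rest\n" % n for n in range(5)))

# Take at most three sections; the rest of the file is never read.
first_three = list(islice(read_sections(sample), 3))
print(first_three)
```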
I'm trying to find common elements in the strings reading from a file. And this is what I wrote:
file = open ("words.txt", 'r')
while 1:
    line = file.readlines()
    if len(line) == 0:
        break
    print line
file.close
def com_Letters(*strings):
    return set.intersection(*map(set,strings))
and the result turns out: ['out\n', 'dog\n', 'pingo\n', 'coconut']
I put com_Letters(line), but the result is empty.
There are two problems, but neither one is with com_Letters.
First, this code guarantees that line will always be an empty list:
while 1:
    line = file.readlines()
    if len(line) == 0:
        break
    print line
The first time through the loop, you call readlines(), which will
Read until EOF using readline() and return a list containing the lines thus read.
If the file is empty, that's an empty list, so you'll break.
Otherwise, you'll print out the list, and go back into the loop. At which point readlines() is going to have nothing left to read, since you already read until EOF, so it's guaranteed to be an empty list. Which means you'll break.
Either way, line ends up empty.
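This behavior can be reproduced with a small in-memory file. The sketch below uses io.StringIO in place of words.txt, with the same four words as in the question.

```python
import io

words = io.StringIO("out\ndog\npingo\ncoconut")

first = words.readlines()    # reads everything up to EOF
second = words.readlines()   # already at EOF, so nothing is left

words.seek(0)                # rewinding is the only way to read the lines again
again = words.readlines()

print(first)
print(second)
```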
It's not clear what you're trying to do with that loop. There's never any good reason to call readlines() repeatedly on the same file. But, even if there were, you'd probably want to accumulate all of the results, rather than just keeping the last (guaranteed-empty) result. Something like this:
line = []
while 1:
    new_line = file.readlines()
    if len(new_line) == 0:
        break
    print new_line
    line += new_line
Anyway, if you fix that problem (e.g., by scrapping the whole loop and just using line = file.readlines()), you're calling com_Letters with a single list of strings. That's not particularly useful; it's just a very convoluted way of calling set. If it's not clear why:
Since there's only one argument (a list of strings), *strings ends up as a one-element tuple of that argument.
map(set, strings) on a single-element tuple just calls set on that element and returns a single-element list.
*map(set, strings) explodes that into one argument, the set.
set.intersection(s) is the same thing as s.intersection(), which just returns s itself.
All of this would be easier to see if you broke up some of those complex expressions and printed the intermediate values. Then you'd know exactly where it first goes wrong, instead of just knowing it's somewhere in a long chain of events.
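For example, the chain can be unpacked one step at a time, with each intermediate value given a name. This sketch reuses the four lines from the question; the variable names are chosen for the example.

```python
lines = ['out\n', 'dog\n', 'pingo\n', 'coconut']


def com_Letters(*strings):
    return set.intersection(*map(set, strings))


# Step by step, for the call com_Letters(lines):
strings = (lines,)                   # *strings holds one element: the whole list
as_sets = list(map(set, strings))    # one set, whose members are entire lines
result = set.intersection(*as_sets)  # intersection of a single set is that set

print(strings)
print(as_sets)
print(result)
```

Printing as_sets makes it obvious that the sets contain whole lines rather than letters, which is where the computation first departs from what was intended.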
A few side notes:
You forgot the () on the file.close, which means you're not actually closing the file. One of the many reasons that with is better is that it means you can't make that mistake.
Use plural names for collections. line sounds like a variable that should have a single line in it, not a variable that should have all of your lines.
The readlines function with no sizehint argument is basically useless. If you're just going to iterate over the lines, you can do that to the file itself. If you really need the lines in a list instead of reading them lazily, list(file) makes your intention clearer—and doesn't mislead you into thinking it might be useful to do repeatedly.
The Pythonic way to check for an empty collection is just if not line:, rather than if len(line) == 0:.
while True is clearer than while 1.
I suggest modifying the function as follows:
def com_Letters(strings):
    return set.intersection(*map(set,strings))
I think the original function treats its argument as a list containing a list of strings (only one argument is passed, in this case a single list), and therefore does not find the intersection.
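A slightly more defensive sketch of the same idea also strips the trailing newline from each line, so that "\n" is never counted as a letter, and handles an empty list. The name common_letters is made up for this example.

```python
def common_letters(words):
    """Return the set of characters that occur in every word."""
    letter_sets = [set(word.strip()) for word in words]
    if not letter_sets:
        # set.intersection needs at least one set to work with.
        return set()
    return set.intersection(*letter_sets)


lines = ['out\n', 'dog\n', 'pingo\n', 'coconut']
common = common_letters(lines)
print(common)
```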
I am working with a network library that returns a generator where you receive an arbitrary amount of text (as a string) with each Next() call. If you simply concatenated the result of every Next() call, the result would look like a standard English text document.
There could be multiple newlines in the string returned from each Next() call, there could be none. The returned string doesn't necessarily end in a newline, i.e. one line of text could be spread across multiple Next() calls.
I am trying to use this data in a 2nd library that needs Next() to return one line of text. It is absolutely critical I do not read in the entire stream; this can be tens of gigabytes of data.
Is there a built-in library to solve this problem? If not, can someone suggest the best way to either write the generator or an alternative way to solve the problem?
Write a generator function that pulls the chunks down and splits them into lines for you. Since you won't know if the last line ended in a newline or not, save it and attach it to the next chunk.
def split_by_lines(text_generator):
    last_line = ""
    try:
        while True:
            # prepend the unfinished line left over from the previous chunk
            chunk = last_line + next(text_generator)
            chunk_by_line = chunk.split('\n')
            last_line = chunk_by_line.pop()
            for line in chunk_by_line:
                yield line
    except StopIteration: # the other end of the pipe is empty
        if last_line:
            yield last_line
        # falling off the end finishes the generator; do not raise StopIteration here
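Here is a self-contained sketch of the same buffering technique together with a fake chunk source, to show how lines that span several chunks are reassembled. The names iter_lines and fake_stream and the sample chunks are invented; the real network generator would take the place of fake_stream().

```python
def iter_lines(chunks):
    """Yield complete lines from an iterable of arbitrary text chunks."""
    pending = ""  # text after the last newline seen so far
    for chunk in chunks:
        pieces = (pending + chunk).split("\n")
        pending = pieces.pop()  # possibly incomplete last line
        for piece in pieces:
            yield piece
    if pending:  # final line without a trailing newline
        yield pending


def fake_stream():
    """Stand-in for the network generator (chunks invented for this example)."""
    yield "first li"
    yield "ne\nsecond line\nthi"
    yield "rd line\n"
    yield "no trailing newline"


lines = list(iter_lines(fake_stream()))
print(lines)
```

Only the current chunk and one partial line are held in memory at any time, so the size of the whole stream does not matter.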
After reading your edit, maybe you could modify the stream object which returns arbitrary amounts of text? For example, in the stream.next() method, there is some way the stream generates a string and yields it when .next() is called. Could you do something like:
def next(self):
    if '\n' in self.remaining:
        # split only at the first newline so that later line breaks are kept
        to_return, self.remaining = self.remaining.split('\n', 1)
        return to_return
    else:
        buffered = self.remaining + self.generate_arbitrary_string()
        while '\n' not in buffered:
            buffered += self.generate_arbitrary_string()
        to_return, self.remaining = buffered.split('\n', 1)
        return to_return
This pseudocode assumes that the stream object generates some arbitrary string with generate_arbitrary_string(). On your first call of next(), the self.remaining string should be empty, so you go to the else branch. There, you concatenate arbitrary strings until you find a newline character, split the concatenated string at the first newline character, return the first half and store the second half in remaining.
On subsequent calls of next(), you first check if self.remaining contains any newline characters. If so, return the first line and store the rest. If not, append a new arbitrary string to self.remaining and continue like above.
I have a file of about 4MB (which I call the big one). It has about 160000 lines in a specific format, and I need to cut it at regular (but not equal) intervals, i.e. at the end of a certain pattern, and write each part into another file.
Basically, I want to copy the information from the big file into many smaller files: as I read the big file, I keep writing into one file, and after a certain pattern occurs I end that file and start writing from that line into another file.
Normally, for a small file, I guess this can be done with file.readline(): read each line, check whether the pattern ends, and if not, write the line to a file; if the pattern ends, change the file name and open a new file, and so on. But how do I do this for this big file?
Thanks in advance.
I didn't mention the file format as I felt it was not necessary; I will mention it if required.
I would first read all of the allegedly-big file in memory as a list of lines:
with open('socalledbig.txt', 'rt') as f:
    lines = f.readlines()
This should take little more than 4MB -- tiny even by the standard of today's phones, much less ordinary computers.
Then, perform whatever processing you need to determine the beginning and ending of each group of lines you want to write out to a smaller files (I'm not sure by your question's text whether such groups can overlap or leave gaps, so I'm offering the most general solution where they're fully allowed to -- this will also cover more constrained use cases, with no real performance penalty, though code might be a tad simpler if the constraints were very rigid).
Say that you put these numbers in lists starts (index from 0 of first line to write, included), ends (index from 0 of first line to NOT write -- may legitimately and innocuously be len(lines) or more), names (filenames to which you want to write), all lists having the same length of course.
Then, lastly:
assert len(starts) == len(ends) == len(names)
for s, e, n in zip(starts, ends, names):
    with open(n, 'wt') as f:
        f.writelines(lines[s:e])
...and that's all you need to do!
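As a quick check of that loop, here is a runnable sketch with hand-picked values. The sample lines, the indices, and the part0.txt/part1.txt names are invented, and the files are written into a temporary directory. Note that an end index beyond len(lines) is harmless, because slicing simply stops at the end of the list.

```python
import os
import tempfile

lines = ["a\n", "b\n", "c\n", "d\n", "e\n"]
starts = [0, 3]
ends = [2, 99]  # 99 is past the end of the list; the slice just stops early

out_dir = tempfile.mkdtemp()
names = [os.path.join(out_dir, "part%d.txt" % i) for i in range(len(starts))]

assert len(starts) == len(ends) == len(names)
for s, e, n in zip(starts, ends, names):
    with open(n, "w") as f:
        f.writelines(lines[s:e])

# Read the results back to see what was written.
with open(names[0]) as f:
    first = f.read()
with open(names[1]) as f:
    second = f.read()
print(repr(first), repr(second))
```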
Edit: the OP seems to be confused by the concept of having these lists, so let me try to give an example: each block written out to a file starts at a line containing 'begin' (included) and ends at the first immediately succeeding line containing 'end' (also included), and the names of the files to be written are to be result0.txt, result1.txt, and so on.
It's an error if the number of "closing ends" differ from that of "opening begins" (and remember, the first immediately succeeding "end" terminates all pending "begins"); no line is allowed to contain both 'begin' and 'end'.
A very arbitrary set of conditions, to be sure, but then, the OP leaves us totally in the dark about the actual specifics of the problem, so what else can we do but guess most wildly?-)
outfile = 0
starts = []
ends = []
names = []
for i, line in enumerate(lines):
    if 'begin' in line:
        if 'end' in line:
            raise ValueError('Both begin and end: %r' % line)
        starts.append(i)
        names.append('result%d.txt' % outfile)
        outfile += 1
    elif 'end' in line:
        ends.append(i + 1)  # remember ends are EXCLUDED, hence the +1
That's it -- the assert about the three lists having identical lengths will take care of checking that the constraints are respected.
As the constraints and specs are changed, so of course will this snippet of code change accordingly -- as long as it fills the three equal-length lists starts, ends, and names, exactly how it does so matters not in the least to the rest of the code.
A 4MB file is very small, it fits in memory for sure. The fastest approach would be to read it all and then iterate over each line searching for the pattern, writing out the line to the appropriate file depending on the pattern (your approach for small files.)
I'm not going to get into the actual code, but pseudo code would do this.
BIGFILE = "filename"
n = 1
SMALLFILE = "smallfile" + n
while (line = readline(BIGFILE)) {
    write(SMALLFILE, line)
    if (line matches pattern) {
        n = n + 1
        SMALLFILE = "smallfile" + n
    }
}
Which is really bad code, but maybe you get the point. I should also have said that it doesn't matter how big your file is since you have to read the file anyway.
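In Python, the same single-pass idea could be sketched roughly as follows. Since the question does not give the file format, the "END" pattern, the smallfileN.txt naming, and the sample contents are all assumptions made for the example; a temporary directory keeps it self-contained.

```python
import os
import re
import tempfile


def split_file(big_path, out_dir, pattern):
    """Copy big_path into numbered files, starting a new file after each matching line."""
    written = []
    out = None
    with open(big_path) as big:
        for line in big:
            if out is None:  # open the next output file lazily
                path = os.path.join(out_dir, "smallfile%d.txt" % (len(written) + 1))
                out = open(path, "w")
                written.append(path)
            out.write(line)
            if pattern.search(line):  # the matching line ends the current file
                out.close()
                out = None
    if out is not None:  # last part had no closing pattern
        out.close()
    return written


# Build a tiny stand-in for the big file.
work_dir = tempfile.mkdtemp()
big_path = os.path.join(work_dir, "big.txt")
with open(big_path, "w") as f:
    f.write("a\nEND\nb\nc\nEND\nd\n")

parts = split_file(big_path, work_dir, re.compile(r"^END$"))
contents = []
for part in parts:
    with open(part) as f:
        contents.append(f.read())
print(contents)
```

Because the file is read line by line, this works the same way whether the input is 4MB or much larger.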