I have a difficult problem, and I know there are many 're' experts in the Python community, so please help. I have a huge log file in a format like this:
[text hello world yadda
lines lines lines
exceptions]
[something i'm not interested in]
[text hello world yadda
lines lines lines
exceptions]
And so on...
So blocks 1 and 3 are the same, and there are many cases like this. My question is: how can I read this file and write only the unique blocks to an output file? A duplicate should be written only once, and sometimes several other blocks sit between two duplicates. I'm currently pattern matching with the code below. It only matches the pattern and does nothing about duplicates.
import re
import sys
from itertools import islice

try:
    if len(sys.argv) != 3:
        sys.exit("You should enter 3 parameters.")
    elif sys.argv[1] == sys.argv[2]:
        sys.exit("The two file names cannot be the same.")
    else:
        file = open(sys.argv[1], "r")
        file1 = open(sys.argv[2],"w")
        java_regex = re.compile(r'[java|javax|org|com]+?[\.|:]+?', re.I) # java
        at_regex = re.compile(r'at\s', re.I) # at
        copy = False # flag that controls whether to copy to the output
        for line in file:
            if re.search(java_regex, line) and not (re.search(r'at\s', line, re.I) or re.search(r'mdcloginid:|webcontainer|c\.h\.i\.h\.p\.u\.e|threadPoolTaskExecutor|caused\sby', line, re.I)):
                # start copying if "java" is in the input
                copy = True
            else:
                if copy and not re.search(at_regex, line):
                    # stop copying if "at" is not in the input
                    copy = False
            if copy:
                file1.write(line)
        file.close()
        file1.close()
except IOError:
    sys.exit("IO error or wrong file name.")
except IndexError:
    sys.exit('\nYou must enter 3 parameters.') # guards against fewer than 3 inputs, which are mandatory
except SystemExit as e: # handles sys.exit()
    sys.exit(e)
I don't mind whether the duplicate removal is added to this code or lives in a separate .py file.
This is an original snippet of the log file:
javax.xml.ws.soap.SOAPFaultException: Uncaught BPEL fault http://schemas.xmlsoap.org/soap/envelope/:Server
at org.apache.axis2.jaxws.marshaller.impl.alt.MethodMarshallerUtils.createSystemException(MethodMarshallerUtils.java:1326) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.marshaller.impl.alt.MethodMarshallerUtils.demarshalFaultResponse(MethodMarshallerUtils.java:1052) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.marshaller.impl.alt.DocLitBareMethodMarshaller.demarshalFaultResponse(DocLitBareMethodMarshaller.java:415) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.getFaultResponse(JAXWSProxyHandler.java:597) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.createResponse(JAXWSProxyHandler.java:537) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.invokeSEIMethod(JAXWSProxyHandler.java:403) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.invoke(JAXWSProxyHandler.java:188) ~[org.apache.axis2.jar:na]
com.hcentive.utils.exception.HCRuntimeException: Unable to Find User Profile:null
at com.hcentive.agent.service.AgentServiceImpl.getAgentByUserProfile(AgentServiceImpl.java:275) ~[agent-service-core-4.0.0.jar:na]
at com.hcentive.agent.service.AgentServiceImpl$$FastClassByCGLIB$$e3caddab.invoke(<generated>) ~[cglib-2.2.jar:na]
at net.sf.cglib.proxy.MethodProxy.invoke(MethodProxy.java:191) ~[cglib-2.2.jar:na]
at org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation.invokeJoinpoint(Cglib2AopProxy.java:689) ~[spring-aop-3.1.2.RELEASE.jar:3.1.2.RELEASE]
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150) ~[spring-aop-3.1.2.RELEASE.jar:3.1.2.RELEASE]
at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:110) ~[spring-tx-3.1.2.RELEASE.jar:3.1.2.RELEASE]
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172) ~[spring-aop-3.1.2.RELEASE.jar:3.1.2.RELEASE]
at org.springframework.security.access.intercept.aopalliance.MethodSecurityInterceptor.invoke(MethodSecurityInterceptor.java:64) ~[spring-security-core-3.1.2.RELEASE.jar:3.1.2.RELEASE]
javax.xml.ws.soap.SOAPFaultException: Uncaught BPEL fault http://schemas.xmlsoap.org/soap/envelope/:Server
at org.apache.axis2.jaxws.marshaller.impl.alt.MethodMarshallerUtils.createSystemException(MethodMarshallerUtils.java:1326) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.marshaller.impl.alt.MethodMarshallerUtils.demarshalFaultResponse(MethodMarshallerUtils.java:1052) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.marshaller.impl.alt.DocLitBareMethodMarshaller.demarshalFaultResponse(DocLitBareMethodMarshaller.java:415) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.getFaultResponse(JAXWSProxyHandler.java:597) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.createResponse(JAXWSProxyHandler.java:537) ~[org.apache.axis2.jar:na]
at org.apache.axis2.jaxws.client.proxy.JAXWSProxyHandler.invokeSEIMethod(JAXWSProxyHandler.java:403) ~[org.apache.axis2.jar:na]
And so on and on....
You can remove duplicate blocks with this:
import re
yourstr = r'''
[text hello world yadda
lines lines lines
exceptions]
[something i'm not interested in]
[text hello world yadda
lines lines lines
exceptions]
'''
pat = re.compile(r'\[([^]]+])(?=.*\[\1)', re.DOTALL)
result = pat.sub('', yourstr)
Note that only the last copy of each block is preserved. If you want to keep the first one instead, reverse the string and use this pattern:
(][^[]+)\[(?=.*\1\[)
Then reverse the string again.
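The reverse-and-substitute variant described above can be sketched as follows. This is only an illustration on the bracketed sample: the helper name keep_first_blocks is made up, and the pattern is the one from the answer with the brackets escaped for readability.

```python
import re

# Same pattern as above, with the brackets escaped. It is applied to the
# reversed text, where every block reads "]...[", so "first" and "last"
# swap roles.
REVERSED_DUP = re.compile(r'(\][^\[]+)\[(?=.*\1\[)', re.DOTALL)

def keep_first_blocks(text):
    """Drop later copies of a bracketed block, keeping the first one."""
    reversed_text = text[::-1]
    cleaned = REVERSED_DUP.sub('', reversed_text)
    return cleaned[::-1]

sample = (
    "[text hello world yadda\nlines lines lines\nexceptions]\n"
    "[something i'm not interested in]\n"
    "[text hello world yadda\nlines lines lines\nexceptions]\n"
)
result = keep_first_blocks(sample)
print(result)
```

Keep in mind that the lookahead rescans the rest of the text for every candidate block, so on a really huge log this is likely to be slow.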
You could hash each block with an algorithm from hashlib and record the hashes in a dictionary that looks like this: {123456789: True}
The value is not important, but a dict lookup is significantly faster than searching a list if it's a big file.
As you come across each block, hash it. If the hash is not in the dictionary yet, store it; if it is already there, ignore the block. This assumes that duplicate blocks are absolutely identical.
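A minimal sketch of the hashing idea, under a few assumptions: the blocks are the bracketed ones from the toy sample (a real stack-trace log would need its own splitter), the helper names split_blocks and unique_blocks are made up, and a set stands in for the dict with dummy values since only membership matters.

```python
import hashlib
import re

def split_blocks(text):
    """Return every '[...]' block in the text (assumes blocks never nest)."""
    return re.findall(r'\[[^\]]*\]', text)

def unique_blocks(blocks):
    """Yield each block the first time its digest is seen."""
    seen = set()  # digests seen so far; membership tests are O(1) on average
    for block in blocks:
        digest = hashlib.sha1(block.encode('utf-8')).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield block

sample = (
    "[text hello world yadda\nlines lines lines\nexceptions]\n"
    "[something i'm not interested in]\n"
    "[text hello world yadda\nlines lines lines\nexceptions]\n"
)
kept = list(unique_blocks(split_blocks(sample)))
print('\n'.join(kept))
```

Storing digests instead of the blocks themselves keeps memory use small when the blocks are long; the first copy of each block is the one that survives.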
I am having a problem creating a script that searches a file for certain patterns. Once a pattern is found in the file, it should print a message.
The problem is that only the first for loop does anything; the message from the second loop is never printed.
import re
f = open('/R1.ios')
V96189 = ['session-limit']
V96197 = ['logging enable']
for pattern in V96189:
    for line in f:
        if pattern in line:
            print('V-96189', 'Not a finding', 'session-limit is set')
for pattern in V96197:
    for line in f:
        if pattern in line:
            print('V-96197', 'Not a finding', 'logging enable is set')
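A likely cause, offered as an assumption since the thread has no answer: a file object is an iterator, so the first "for line in f" consumes it and the second loop finds nothing left to read. Below is a sketch that reads the lines once and checks every pattern against them. The function name check_patterns and the sample lines are made up for illustration.

```python
def check_patterns(lines, checks):
    """Return one message tuple per check whose pattern occurs in any line."""
    messages = []
    for rule_id, pattern, message in checks:
        if any(pattern in line for line in lines):
            messages.append((rule_id, 'Not a finding', message))
    return messages

# Hypothetical stand-in for the contents of /R1.ios.
lines = ["line vty 0 4\n", " session-limit 2\n", "logging enable\n"]

checks = [
    ('V-96189', 'session-limit', 'session-limit is set'),
    ('V-96197', 'logging enable', 'logging enable is set'),
]
found = check_patterns(lines, checks)
for entry in found:
    print(*entry)
```

With the real file you would build the list once, for example with f.readlines() inside a with block. Calling f.seek(0) between the two loops should also work.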
My goal is to create a script that searches for credentials in an input file.
I have found endless examples, even here on Stack Overflow, that show how to search for a set of words in a file:
Example 1
Example 2
However, when I apply them to my script, it returns nothing.
Here is my code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument('-input', dest='input',help="input one or more files",metavar=None)
args = parser.parse_args()

GrabdirectoryFile = open(args.input,"r",encoding='UTF8')
directoryFile = GrabdirectoryFile.read()
HotWords = ['password', 'admin']

def search_for_lines(filename, words_list):
    words_found = 0
    for line_no, line in enumerate(filename):
        if any(word in line for word in words_list):
            print(line_no, ':', line)
            words_found += 1
    return words_found

search_for_lines(directoryFile,HotWords)
I tried following the instructions I found in the 2 links provided above, but no luck.
The code is definitely executed and Python returns no errors.
The file contains many words and also a few 'password' and 'admin' but no line is returned.
Why?
EDIT:
Dear @Kirk Broadhurst, @SIM, @André Schild, @kasperhj, @Garrett Hyde, I tried to follow your link and replaced my code with:
with open(args.input) as openfile:
    for line in openfile:
        for part in line.split():
            if "color=" in part:
                print (part)
but unfortunately it still does not work. The right solution was provided below by @Farhan.K: I had to use readlines() instead of read().
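You are reading the file with file.read(), which returns a single string, but your function expects a list of lines. Iterating over a string yields one character at a time, so no word ever matches. Use file.readlines() instead. As an aside, it is better to open and close files using the with statement.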
Replace
GrabdirectoryFile = open(args.input,"r",encoding='UTF8')
directoryFile = GrabdirectoryFile.read()
with...
GrabdirectoryFile = open(args.input,"r",encoding='UTF8')
directoryFile = GrabdirectoryFile.readlines()
Using a with statement is better:
with open(args.input,"r",encoding='UTF8') as GrabdirectoryFile:
    directoryFile = GrabdirectoryFile.readlines()
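To make the difference concrete, here is a small sketch that passes the open file itself to the function, which iterates line by line without loading everything first. The temporary file and its contents are made up for the demonstration, and line numbers start at 1 here, unlike in the question.

```python
import os
import tempfile

def search_for_lines(lines, words_list):
    """Print and count the lines that contain any of the words."""
    words_found = 0
    for line_no, line in enumerate(lines, start=1):
        if any(word in line for word in words_list):
            print(line_no, ':', line.rstrip('\n'))
            words_found += 1
    return words_found

# Hypothetical sample file standing in for args.input.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write("user=guest\npassword=hunter2\nrole=admin\n")
    path = tmp.name

with open(path, 'r', encoding='utf-8') as handle:
    count = search_for_lines(handle, ['password', 'admin'])
os.remove(path)

# Passing a plain string walks it character by character, so nothing matches.
count_from_string = search_for_lines("password\n", ['password'])
```

Iterating over the file object also avoids holding the whole file in memory, which readlines() does.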
Python 2.7
Download a test file from: www.py4inf.com/code/mbox.txt
Briefly, I need to go through the file line by line, find every line that starts with 'From', and keep only the mail address.
If the condition is true, only the mail address (the result) is written to another file. I wrote code that does this and it works, but it would be better with functions. It broke when I tried to pass parameters: I have trouble with a function that receives one parameter and returns one or two values.
The result is that every line of the input file is copied to the output file, almost like a recursion. Nothing is filtered, and the file is very big.
Finally, can you recommend any page about functions, parameters, parameter passing, passing by reference, and so on? Asking is easy, but I prefer to read and try to understand first, and if I still have a problem, light a candle in the middle of the night!
#li is the input parameter: a line from fileRead (the file object).
#if the condition is true select all lines that start with From:
def funFormat(li):
    if li.startswith('From:'):
        li = li.rstrip('')
        li = li.replace('From: ',' \n')
    return li

fileRead=open('mbox_small.txt','r')
for eachline in fileRead:
    result=funFormat(eachline)
    fileWrite =open('Resul73.txt', 'a')
    fileWrite.write( result )
fileRead.close()
fileWrite.close()
You're opening a file every time you need to write, and closing it just at the end. Maybe that's what's messing it up? Try this and let me know if it works:
#li is the input parameter: a line from fileRead (the file object).
#if the condition is true select all lines that start with From:
def funFormat(li):
    if li.startswith('From:'):
        li = li.rstrip('')
        li = li.replace('From: ',' \n')
        return li
    else:
        return None

fileRead = open('mbox_small.txt', 'r')
fileWrite = open('Resul73.txt', 'a')
for eachline in fileRead:
    result=funFormat (eachline)
    if result:
        fileWrite.write (result)
fileRead.close()
fileWrite.close()
Also, I suggest you read up on with blocks, which will help you work with files more efficiently. As for functions, there are plenty of resources online.
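For what it's worth, here is how the same idea could look with with blocks. Treat it as a sketch: the function names, the temporary directory, and the sample addresses are all made up, and it strips the prefix by slicing rather than with replace().

```python
import os
import tempfile

def extract_address(line):
    """Return the address from a 'From: ' header line, or None otherwise."""
    prefix = 'From: '
    if line.startswith(prefix):
        return line[len(prefix):].strip()
    return None

def copy_addresses(src_path, dst_path):
    """Write one address per line to dst_path; return how many were written."""
    written = 0
    # Both files are closed automatically when the with block ends.
    with open(src_path, 'r') as src, open(dst_path, 'w') as dst:
        for line in src:
            address = extract_address(line)
            if address:
                dst.write(address + '\n')
                written += 1
    return written

# Hypothetical miniature of mbox_small.txt.
with tempfile.TemporaryDirectory() as workdir:
    src_path = os.path.join(workdir, 'mbox_small.txt')
    dst_path = os.path.join(workdir, 'Resul73.txt')
    with open(src_path, 'w') as f:
        f.write("From alice@example.com Sat Jan  5 09:14:16 2008\n"
                "From: alice@example.com\n"
                "Subject: hello\n"
                "From: bob@example.com\n")
    count = copy_addresses(src_path, dst_path)
    with open(dst_path) as f:
        output = f.read()
```

The function returns either a value or None, and the caller decides what to do with it, which is the parameter-passing pattern the question asks about.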
I am writing code that gathers some statistics about ontologies. As input I have a folder of files: some are RDF/XML, some are Turtle or N-Triples (nt).
My problem is that when I try to parse a file using the wrong format, the next attempt fails even if I use the correct format.
Here the test file is in Turtle format. If I first parse it as Turtle, all is fine. But if I first parse it with the wrong format, the first error is understandable (file:///test:1:0: not well-formed (invalid token)), while the error for the second attempt is (Unknown namespace prefix : owl). As I said, when I parse with the correct format first, I don't get the namespace error.
Please help; after 2 days, I'm getting desperate.
import sys
import rdflib

query = 'SELECT DISTINCT ?s ?o WHERE { ?s ?p owl:Ontology . ?s rdfs:comment ?o}'
data = open("test", "r")
g = rdflib.Graph("IOMemory")
try:
    result = g.parse(file=data,format="xml")
    relations = g.query(query)
    print(( " graph has %s statements." % len(g)))
except:
    print "bad1"
    e = sys.exc_info()[1]
    print e
try:
    result = g.parse(file=data,format="turtle")
    relations = g.query(query)
    print(( " graph has %s statements." % len(g)))
except :
    print "bad2"
    e = sys.exc_info()[1]
    print e
The problem is that g.parse first reads part of the file input stream data, only to figure out afterwards that it is not XML. The second call (with the turtle format) then continues reading from the input stream at the point where the previous attempt stopped. The part read by the first parser is lost to the second one.
If your test file is small, the XML parser might have read it all, leaving an "empty" rest. It seems the Turtle parser did not complain; it just read in nothing. Only the query in the next statement failed to find anything owl-like, as the graph is empty. (I have to admit I cannot reproduce this part: the Turtle parser does complain in my case, but maybe I have a different version of rdflib.)
To fix it, reopen the file: either reorganize the code so you have a data = open("test", "r") every time you call result = g.parse(file=data, format="(some format)"), or call data.seek(0) in the except: clause, like:
for format in 'xml','turtle':
    try:
        print 'reading', format
        result = g.parse(data, format=format)
        print 'success'
        break
    except Exception:
        print 'failed'
        data.seek(0)
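The stream-position effect can be seen without rdflib at all. The sketch below is written for Python 3 (the thread's own code is Python 2) and uses a made-up stand-in parser, parse_as, that reads the whole stream before checking it. It does not reflect how rdflib works internally.

```python
import io

def parse_as(stream, expected_prefix):
    """Hypothetical stand-in for a parser: consumes the stream, then validates."""
    text = stream.read()  # moves the stream position to the end
    if not text.startswith(expected_prefix):
        raise ValueError('input does not start with %r' % expected_prefix)
    return text

# A stream that has been read once yields nothing on the second read.
leftover = io.StringIO('abc')
first_read = leftover.read()
second_read = leftover.read()

data = io.StringIO('@prefix owl: <http://www.w3.org/2002/07/owl#> .\n')
attempts = []
parsed = None
for prefix in ('<?xml', '@prefix'):
    try:
        parsed = parse_as(data, prefix)
        attempts.append((prefix, 'success'))
        break
    except ValueError:
        attempts.append((prefix, 'failed'))
        data.seek(0)  # rewind so the next parser sees the whole input
```

Without the seek(0) call, the second attempt would receive an empty string, which mirrors the "empty rest" described above.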
I'm processing a large (120 MB) text file from my Thunderbird IMAP directory and attempting to extract to/from info from the headers using mbox and regex. The process runs for a while until I eventually get an exception: "TypeError: expected string or buffer".
The exception references the fifth line of this code:
PAT_EMAIL = re.compile(r"[0-9A-Za-z._-]+\@[0-9A-Za-z._-]+")
temp_list = []
mymbox = mbox("data.txt")
for email in mymbox.values():
    from_address = PAT_EMAIL.findall(email["from"])
    to_address = PAT_EMAIL.findall(email["to"])
    for item in from_address:
        temp_list.append(item) #items are added to a temporary list where they are sorted then written to file
I've run the code on other (smaller) files, so I'm guessing the issue is my file. The file appears to be just a bunch of text. Can someone point me in the right direction for debugging this?
There can only be one from address (I think!):
In the following:
from_address = PAT_EMAIL.findall(email["from"])
I have a feeling you're trying to duplicate the work of email.message_from_file and email.utils.parseaddr
>>> s = "Jon Clements <jon@example.com>"
>>> from email.utils import parseaddr
>>> parseaddr(s)
('Jon Clements', 'jon@example.com')
So you can use parseaddr(email['from'])[1] to get the email address and use that.
Similarly, you may wish to look at email.utils.getaddresses to handle to and cc addresses...
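One plausible cause of the TypeError, stated here as an assumption: some messages have no From or To header, in which case email["from"] is None and the regex has nothing to search. A sketch that tolerates missing headers follows; the helper name sender_and_recipients and the two sample messages are made up.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr

def sender_and_recipients(msg):
    """Return (from_address, [to_addresses]) without failing on missing headers."""
    # msg.get() falls back to '' instead of None when the header is absent.
    from_address = parseaddr(msg.get('from', ''))[1]
    # get_all() returns every To header; the fallback is an empty list.
    to_addresses = [addr for _name, addr in getaddresses(msg.get_all('to', []))]
    return from_address, to_addresses

complete = message_from_string(
    "From: Jon Clements <jon@example.com>\n"
    "To: a@example.com, B <b@example.com>\n"
    "\n"
    "body\n"
)
no_headers = message_from_string("Subject: no sender\n\nbody\n")

with_headers = sender_and_recipients(complete)
without_headers = sender_and_recipients(no_headers)
```

Messages without the headers come back with an empty address and an empty list, so they can be skipped or counted rather than raising.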
Well, I didn't solve the issue but have worked around it for my own purposes. I inserted a try statement so that the iteration just continues past any TypeError. For every thousand email addresses I'm getting about 8 failures, which will suffice. Thanks for your input!
PAT_EMAIL = re.compile(r"[0-9A-Za-z._-]+\@[0-9A-Za-z._-]+")
temp_list = []
mymbox = mbox("data.txt")
for email in mymbox.values():
    try:
        from_address = PAT_EMAIL.findall(email["from"])
    except(TypeError):
        print "TypeError!"
    try:
        to_address = PAT_EMAIL.findall(email["to"])
    except(TypeError):
        print "TypeError!"
    for item in from_address:
        temp_list.append(item) #items are added to a temporary list where they are sorted then written to file