I'm new to scripting and have been reading up on Python for about 6 weeks. The script below is meant to read a log file and send an alert if one of the keywords defined in srchstring is found. It works as expected and doesn't alert on strings previously found. However, the file it's processing is actively being written to by an application, and the script is too slow on files around 500 MB; under 200 MB it works fine, i.e. finishes within 20 seconds.
Could someone suggest a more efficient way to search a file for strings from a pre-defined list?
import os

srchstring = ["Shutdown", "Disconnecting", "Stopping Event Thread"]
if os.path.isfile(r"\\server\share\logfile.txt"):
    with open(r"\\server\share\logfile.txt", "r") as F:
        for line in F:
            for st in srchstring:
                if st in line:
                    print line,
                    # do some slicing of string to get dd/mm/yy hh:mm:ss:ms
                    # then create a marker file called file_dd/mm/yy hh:mm:ss:ms
                    if os.path.isfile("file_dd/mm/yy hh:mm:ss:ms"):  # marker file already exists
                        print "string previously found - ignoring, continuing search"
                    else:
                        open("file_dd/mm/yy hh:mm:ss:ms", 'a')  # no marker file, create it then send email
                        print "error string found - creating marker file, sending email alert"
else:
    print "file does not exist"
Refactoring the search expression to a precompiled regular expression avoids the (explicit) innermost loop.
import os, re

regex = re.compile(r'Shutdown|Disconnecting|Stopping Event Thread')
if os.path.isfile(r"\\server\share\logfile.txt"):
    # Indentation fixed as per comment
    with open(r"\\server\share\logfile.txt", "r") as F:
        for line in F:
            if regex.search(line):
                # ...
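If you also need to know which keyword matched (the role st played in the question's inner loop), the match object gives it back; a minimal sketch of the loop body, leaving the marker-file logic as in the question:
match = regex.search(line)
if match:
    keyword = match.group(0)  # which keyword hit, e.g. "Shutdown"
    print line,
    # ... marker-file check from the question goes here, unchanged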
I assume here that you use Linux. If you don't, install MinGW on Windows and the solution below will become suitable too.
Just leave the hard part to the most efficient tools available. Filter your data before it reaches the Python script: use the grep command to get the lines containing "Shutdown", "Disconnecting" or "Stopping Event Thread":
grep 'Shutdown\|Disconnecting\|Stopping Event Thread' /server/share/logfile.txt
and redirect the lines to your script
grep 'Shutdown\|Disconnecting\|Stopping Event Thread' /server/share/logfile.txt | python log.py
Edit: Windows solution. You can create a .bat file to make it executable.
findstr /c:"Shutdown" /c:"Disconnecting" /c:"Stopping Event Thread" \\server\share\logfile.txt | python log.py
In 'log.py', read from stdin. It's a file-like object, so no difficulties here:
import sys

for line in sys.stdin:
    print line,
    # do some slicing of string to get dd/mm/yy hh:mm:ss:ms
    # then create a marker file called file_dd/mm/yy hh:mm:ss:ms
    # and so on
This solution reduces the amount of work your script has to do. As Python isn't a fast language, it may speed up the task. I suspect the whole thing could be rewritten purely in bash and be even faster (20+ years of optimization of a C program is not something you compete with easily), but I don't know bash well enough.
Related
I'm using a script from a third party that I can't modify or show (let's call it original.py), which takes a file and produces some calculations. At the end it outputs a result (using the print statement).
Since I have many files, I decided to make a second script that gets all the wanted files and runs each of them through original.py:
1st get list of all files to run
2nd run each file through the original.py
3rd obtain results from each file
I have the 1st and 2nd steps working. However, the end result only saves the calculations from the last file it read.
import sys
import original
import glob
import os

fn = str(sys.argv[1])
for filename in sys.argv[1:]:
    print(filename)

ficheiros = [f for f in glob.glob(fn)]
for ficheiro in ficheiros:
    original.file = bytes(ficheiro, 'utf-8')
    original.function()
To summarize:
Knowing I can't change the original script (which is built around a print statement), how can I obtain the results for each loop iteration? Is there a better way than using a for loop?
The first script can be invoked with python original.py
It requires the file to be changed manually inside the script, on the original.file line.
This script outputs the result in the console and I redirect it with: python original.py > result.txt
At the moment, when I try to run my script, it reads all the correct files in the folder but only returns the results for the last file.
(I tried to reformulate the question; hopefully it's easier to understand.)
The problem is due to a mistake in `ficheiros = [f for f in glob.glob(fn)]`: it's only reading one file, hence only outputting one result.
Thanks for the time.sleep() trick in the comments.
Solved:
I changed the initial part to:
fn = str(sys.argv[1])
ficheiros = []
for filename in sys.argv[1:]:
    ficheiros.append(filename)
    # print(filename)
and now it correctly reads all the files and outputs all the results.
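Incidentally, the loop is equivalent to a one-line slice, since sys.argv[1:] is already a list of the filenames:
ficheiros = sys.argv[1:]  # same result as the append loop above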
Depending on your operating system, there are different ways to take what is printed to the console and append it to a file.
For example, on Linux you could run your script (the one that calls original.py for every file) as python yourfile.py >> outputfile.txt, which will append everything that is printed into outputfile.txt.
The syntax is similar for Windows.
I'm not quite sure what you're asking, but you could try one of these:
Either redirecting all output to a file for later use, by running the script like so: python secondscript.py > outfilename.txt
Or, and this might or might not work for you, redefining the print function to one that outputs the result the way you want, e.g.:
def print(x):
    with open('outfile.txt', 'a') as f:  # 'a' so repeated prints don't overwrite each other
        f.write('example: ' + str(x) + '\n')
If you choose the second option, I recommend saving the old print function (oldprint = print) so you can restore and use the regular print later.
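On Python 3, a cleaner standard-library route (not part of the answer above, offered as a hedged alternative) is contextlib.redirect_stdout, which captures everything printed inside the with block without shadowing print:
import contextlib
import original  # the unmodifiable third-party script from the question

# Everything original.function() prints lands in outfile.txt.
with open('outfile.txt', 'w') as f, contextlib.redirect_stdout(f):
    original.function()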
I don't know if I got exactly what you want: you have a first script named original.py which takes some arguments and returns things in the form of print statements, and you would like to grab those print statements in your script to do things with them?
If so, a solution could be the subprocess module:
Let's say that this is original.py:
print("Hi, I'm original.py")
print("print me!")
And this is main.py:
import subprocess

script_path = "original.py"
print("Executing", script_path)
process = subprocess.Popen(["python3", script_path], stdout=subprocess.PIPE)
for line in process.stdout:
    print(line.decode("utf8"), end="")  # lines already carry a newline
You can easily add more arguments to the Popen call, e.g. ["python3", script_path, "arg1", "arg2"].
Output:
Executing original.py
Hi, I'm original.py
print me!
and you can grab the lines in the main.py to do what you want with them.
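On Python 3.7+, subprocess.run offers a shorter route if you just want the whole output at once rather than streaming it line by line (a hedged variant, not part of the answer above):
import subprocess

# capture_output=True collects stdout/stderr; text=True decodes bytes to str.
result = subprocess.run(["python3", "original.py"],
                        capture_output=True, text=True)
print(result.stdout, end="")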
Is it possible (not necessarily using Python introspection) to print the source code of a script?
I want to execute a short Python script that also prints its own source (so I can see which commands are executed).
The script is something like this:
command1()
#command2()
command3()
print some_variable_that_contain_src
The real application is that I want to run a script from IPython with the run -i magic and have as output the source (i.e. the commands executed). In this way I can check which commands are commented at every execution. Moreover, if executed in a Notebook I leave a trace of which commands have been used.
Solution
Using korylprince's solution, I ended up with this one-liner to be put at the beginning of the script:
with open(__file__) as f: print '\n'.join(f.read().split('\n')[1:])
This will print the script source except the first line (that would be only noise). It's also easy to modify the slicing in order to print a different "slice" of the script.
If you want to print the whole file instead, the one-liner simplifies to:
with open(__file__) as f: print f.read()
As long as you're not doing anything crazy with packages, put this at the top of your script
with open(__file__) as f:
    print f.read()
This will read in the current file and print it out.
For python 3 make sure to use instead
print(f.read())
For the simplest answer:
import my_module

print open(my_module.__file__).read()  # note: __file__ may point to a compiled .pyc rather than the .py source
I also tried using the inspect package.
import inspect
import my_module
source_list = inspect.getsourcelines(my_module)
This gives you a list of strings containing the source code:
for line in source_list[0]:
    print line
This prints out the entire source code in a readable manner.
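If you'd rather have the whole thing as one string, inspect.getsource does the same job without the loop (a minor variant, not from the original answer):
import inspect
import my_module

print inspect.getsource(my_module)  # the module's source as a single string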
I'm a Win7 user.
I accidentally read about redirections (like command1 < infile > outfile) in *nix systems, and then I discovered that something similar can be done in Windows (link). And Python can also do something like this with pipes(?) or stdin/stdout(?).
I do not understand how this works in Windows, so I have a question.
I use a proprietary Windows program (.exe). This program is able to append data to a file.
For simplicity, let's assume that it is the equivalent of something like:
from time import ctime, sleep

while True:
    f = open('textfile.txt', 'a')
    f.write(repr(ctime()) + '\n')
    f.close()
    sleep(100)
The question:
Can I use this file (textfile.txt) as stdin?
I mean that the script (while it runs) should always (not just once) handle all new data, i.e.:
In the "never-ending cycle":
The program (.exe) writes something.
Python script captures the data and processes.
Could you please show how to do this in Python, or maybe in Windows cmd/.bat, or some other way?
This is an insanely cool thing. I want to learn how to do it! :D
If I am reading your question correctly, then you want to pipe output from one command to another.
This is normally done as such:
cmd1 | cmd2
However, you say that your program only writes to files. I would double-check the documentation to see if there isn't a way to get the command to write to stdout instead of a file.
If this is not possible, you can create what is known as a named pipe. It appears as a file on your filesystem, but it is really just a buffer of data that can be written to and read from (the data is a stream and can only be read once), meaning the program reading it will not finish until the program writing to the pipe stops writing and closes the "file". I don't have experience with named pipes on Windows, so you'll need to ask a new question for that. One downside of pipes is that they have a limited buffer size: if no program is reading data from the pipe, then once the buffer is full the writing program won't be able to continue and will just wait indefinitely until a program starts reading from the pipe.
An alternative is that on Unix there is a program called tail which can be set up to continuously monitor a file for changes and output any data as it is appended to the file (with a short delay).
tail --follow=name --retry textfile.txt | mycmd
# wait for data to be appended to the file and pipe new data to mycmd
cmd1 >> textfile.txt  # append output to file
One thing to note about this is that tail won't stop just because the first command has stopped writing to the file. tail will continue to listen for changes on that file forever, or until mycmd stops listening to tail, or until tail is killed (or "sigint-ed").
This question has various answers on how to get a version of tail onto a windows machine.
import sys

sys.stdin = open('textfile.txt', 'r')
for line in sys.stdin:
    process(line)
If the program writes to textfile.txt, you can't redirect that to the stdin of your Python script unless you recompile the program to do so.
If you were to edit the program, you'd need to make it write to stdout rather than to a file on the filesystem. That way you can use the redirection operators to feed it into your Python script (in your case, the | operator).
Assuming you can't do that, you could write a program that polls the text file for changes and consumes only the newly written data, by keeping track of how much it read the last time the file was updated.
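A minimal sketch of that polling approach, assuming the log file only ever grows; follow() and process() are illustrative names, not an established API:
import time

def follow(path, poll_interval=1.0):
    # Yield lines as they are appended to the file at `path`.
    with open(path, 'r') as f:
        f.seek(0, 2)  # 2 == os.SEEK_END: start at the current end of the file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(poll_interval)  # nothing new yet; wait and retry

for line in follow('textfile.txt'):
    process(line)  # process() stands in for your own handler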
When you use < to redirect a file into a Python script, that script receives the file's contents on its stdin stream.
Simply read from sys.stdin to get that data:
import sys

for line in sys.stdin:
    # do something with line
    pass
I've done a fair amount of lurking on SO and a fair amount of searching and reading, but I must also confess to being a relative noob at programming in general. I am trying to learn as I go, and so I have been playing with Python's NLTK. In the script below, I can get everything to work, except it only writes what would be the first screen of a multi-screen output, at least that's how I am thinking about it.
Here's the script:
#! /usr/bin/env python
import nltk
# First we have to open and read the file:
thefile = open('all_no_id.txt')
raw = thefile.read()
# Second we have to process it with nltk functions to do what we want
tokens = nltk.wordpunct_tokenize(raw)
text = nltk.Text(tokens)
# Now we can actually do stuff with it:
concord = text.concordance("cultural")
# Now to save this to a file
fileconcord = open('ccord-cultural.txt', 'w')
fileconcord.writelines(concord)
fileconcord.close()
And here's the beginning of the output file:
Building index...
Displaying 25 of 530 matches:
y . The Baobab Tree : Stories of Cultural Continuity The continuity evident
regardless of ethnicity , and the cultural legacy of Africa as well . This Af
What am I missing here to get the entire 530 matches written to the file?
text.concordance(self, word, width=79, lines=25) seems to have other parameters, as per the manual.
I see no way to extract the size of the concordance index; however, the concordance printing code seems to contain this part: lines = min(lines, len(offsets)). Therefore you can simply pass sys.maxint as the last argument:
concord = text.concordance("cultural", 75, sys.maxint)
Added:
Looking at your original code now, I can't see how it could have worked before. text.concordance does not return anything; it outputs everything to stdout using print. Therefore, the easy option would be redirecting stdout to your file, like this:
import sys
# ...

# Open the file
fileconcord = open('ccord-cultural.txt', 'w')
# Save old stdout stream
tmpout = sys.stdout
# Redirect all "print" calls to that file
sys.stdout = fileconcord
# Init the method
text.concordance("cultural", 200, sys.maxint)
# Reset stdout first, in case you need to print something else
sys.stdout = tmpout
# Then close the file
fileconcord.close()
Another option would be to use the respective classes directly and omit the Text wrapper. Just copy bits from here and combine them with bits from here and you are done.
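For instance, a hedged sketch of that direct route, using nltk.text.ConcordanceIndex on the tokens from the question (the window sizes are arbitrary, and the key argument mirrors Text's case-insensitive lookup; check your NLTK version's API):
from nltk.text import ConcordanceIndex

# `tokens` is the list produced by nltk.wordpunct_tokenize in the question.
ci = ConcordanceIndex(tokens, key=lambda s: s.lower())
with open('ccord-cultural.txt', 'w') as out:
    for offset in ci.offsets('cultural'):
        left = ' '.join(tokens[max(0, offset - 10):offset])
        right = ' '.join(tokens[offset + 1:offset + 11])
        out.write('%s %s %s\n' % (left, tokens[offset], right))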
Update:
I found the thread "write text.concordance output to a file" from the nltk usergroup. It's from 2010, and states:
Documentation for the Text class says: "is intended to support
initial exploration of texts (via the interactive console). ... If you
wish to write a program which makes use of these analyses, then you
should bypass the Text class, and use the appropriate analysis
function or class directly instead."
If nothing has changed in the package since then, this may be the source of your problem.
--- previously ---
I don't see a problem with writing to the file using writelines():
file.writelines(sequence)
Write a sequence of strings to the file. The sequence can be any
iterable object producing strings, typically a list of strings. There
is no return value. (The name is intended to match readlines();
writelines() does not add line separators.)
Note the italicized part: did you examine the output file in different editors? Perhaps the data is there, but not being rendered correctly due to missing end-of-line separators?
Are you sure this part is generating the data you want to output?
concord = text.concordance("cultural")
I'm not familiar with nltk, so I'm just asking as part of eliminating possible sources for the problem.
Everyone's done this: from the shell, you need some details about a text file (more than just ls -l gives you), in particular that file's line count, so:
# > wc -l iris.txt
149 iris.txt
I know that I can access shell utilities from Python, but I am looking for a Python built-in, if there is one.
The crux of my question is getting this information without opening the file (hence my reference to the Unix utility wc -l).
(Is 'sniffing' the correct term for this, i.e., peeking at a file without opening it?)
You can always scan through it quickly, right?
lc = sum(1 for l in open('iris.txt'))
No, I would not call this "sniffing". Sniffing typically refers to looking at data as it passes through, like Ethernet packet capture.
You cannot get the number of lines from a file without opening it. This is because the number of lines in the file is actually the number of newline characters ("\n" on Linux) in the file, and you have to open() and read the file to count them.
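For large files, counting newline bytes in binary chunks is usually faster than line iteration; a minimal sketch (it still opens the file, which, as noted, is unavoidable):
def count_lines(path, chunk_size=1 << 20):
    # Count b'\n' in 1 MiB chunks; avoids constructing line objects.
    count = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b'\n')
    return count

lc = count_lines('iris.txt')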