Finding many strings in directory - python

I have a directory full of files and a set of strings (about 40) that need to be identified. I want to go through all of the files in the directory and print out the names of the files that contain any one of my strings. I found code that works perfectly (Search directory for specific string), but it only works for one string. Whenever I try to add more, it prints out the name of every single file in the directory. Can someone help me tweak the code? I just started programming a few days ago and don't know what to do.
import glob

for file in glob.glob('*.csv'):
    with open(file) as f:
        contents = f.read()
        if 'string' in contents:
            print file
That code was taken from the question I mentioned above. Any help would be appreciated, and any tips on asking the question better would be as well. Thank you!

You can try:
import glob

strings = ['string1', 'string2']

for file in glob.glob('*.csv'):
    with open(file) as f:
        contents = f.read()
        for string in strings:
            if string in contents:
                print file
                break
About asking better questions: link
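For what it's worth, the inner loop and the break can also be collapsed with the built-in any(); here is a minimal sketch of the same idea (the strings list is a placeholder, as above):
import glob

strings = ['string1', 'string2']  # placeholder list; fill in the ~40 strings

for file in glob.glob('*.csv'):
    with open(file) as f:
        contents = f.read()
    # any() is True as soon as one of the strings is found in this file
    if any(s in contents for s in strings):
        print(file)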

I would just use grep:
$ grep -l -f strings.txt *.csv
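For reference, -l makes grep print only the names of the files that contain a match, and -f strings.txt reads the search patterns, one per line, from strings.txt; adding -F would treat them as literal strings rather than regular expressions.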

Python - Regex pattern from a log file

I previously raised a question where I wanted to find the latest file from a set of servers. This was answered thanks to this community.
Now another problem I am facing is that when I try to find a regex pattern match within the latest log file, I get a blank result, i.e. the program cannot find the match even though Notepad++ matches my pattern against the string I want (I just wanted to check whether my pattern was at fault).
I want to search for 3 different patterns within the same file. I have tried the code below, but still get no output:
import glob
import os
import re

paths = [r'\\Server1\Logs\*.log',
         r'\\Server2\Logs\*.log',
         .....
         r'\\Server16\Logs\*.log']

list_of_files = []
for path in paths:
    list_of_files.extend(glob.glob(path))

# Find the latest log file from all the servers:
if list_of_files:
    latest_file = max(list_of_files, key=os.path.getctime)
    f = open(os.path.join(latest_file), "r")
    s = f.read()
    # Search for the required pattern:
    pattern1 = re.search(r"[A-Z\s]{4}\_[A-Z\s]{8}\_[A-Z\s]{4}\_[0-9.]{4}[A-Z\s]{3}\"\]\s\-\s[a-z\s]{8}", s)
    pattern2 = re.search(r"[A-Z]{4}\_[A-Z]{5}\_[A-Z]{4}\_[0-9]{1}\_[0-9]{8}\_[0-9.]{7}[A-Z]{3}\"\]\s\-\s[a-z]{8}", s)
    pattern3 = re.search(r"[A-Z]{4}\_[A-Z]{4}\_[0-9]{1}\_[0-9]{8}\_[0-9.]{7}[A-Z]{3}\"\]\s\-\s[a-z]{8}", s)
    print(pattern1)
    print(pattern2)
    print(pattern3)
    print(latest_file)
else:
    print("No log files found!")
Please note that I have tried re.findall and other re methods as well, but with no success.
I have also tried iterating with for line in s and then searching for the pattern in each line, but again, no success.
Apologies if this has a simple solution that I am not able to grasp but since I am new to the concept of programming itself, any help is really appreciated.
Thank you in advance!
As advised by the community, below is the sample of the log file that I am trying to find my pattern in:
Full line of random data that is not important to me
Another full line of random data that is not important to me
.
.
.
.
.
Yet another full line of random data that is not important to me
Upload of ["\\DATA01-ABC\companyname.projectname.appname\Production\WORD-Outbound\ProcessNameAndFile\WORD_ACTIVITY_FILE_010.DAT"] - complete.
WORD_ACTIVITY_FILE_010.DAT"] - complete    <- this is the part I want to match.
As this log is something that I don't think I can paste on the internet, I have replaced the desired output with random words that hopefully make sense.
The issue was resolved by passing "utf-16" as the encoding when opening the .log file.
Before that, the code was reading the log file as ASCII, which resulted in every character in the log file appearing spaced out.
This fixed the issue for me:
import codecs

f = codecs.open(os.path.join(latest_file), "r", "utf-16")
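For reference, on Python 3 the codecs module is not needed here, because the built-in open() accepts an encoding argument directly:
f = open(os.path.join(latest_file), "r", encoding="utf-16")
s = f.read()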

How to write and replace elements in a text file?

I am a bit lost with this concept, and we have no examples to show us how to replace or search within a file.
The prompt:
Write a File
You will be provided a file path for input I, a file path for output O, a string S, and a string T.
Read the contents of I, replacing each occurrence of S with T and write the resulting information to file O.
You should replace O if it already exists.
The inputs are unknown, this is the only code we are provided with:
import sys
I= sys.argv[1]
O= sys.argv[2]
S= sys.argv[3]
T= sys.argv[4]
The only examples we have been provided are how to read a file, and how to write a simple text element into a file.
My code so far:
file1 = open(I, 'r')
data = file1.read()
I am truly stuck.
You don't replace within the file, per se -- you read the file contents into a string, replace within the string, and then write the result to the output file.
open input file
read into string variable
close file
replace (Python method) all S with T
open output file
write string variable
close file
First you open the input file and read its contents.
Next you create a loop where you check whether the current word is the same as S; if it is, you replace it with T, and finally you write the new string to the output file.
I won't give you any code, because that's not what you need; you need to find out for yourself how to do it.
Good luck
Here is your code:
import sys
I= sys.argv[1]
O= sys.argv[2]
S= sys.argv[3]
T= sys.argv[4]
with open(I, 'r') as file_in:
    text = file_in.read()

text = text.replace(S, T)

with open(O, 'w') as file_out:
    file_out.write(text)
The with construct only makes sure that the file is closed once you're done, so that you don't have to think about it. The rest is straightforward.
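Assuming the script is saved as, say, replace.py (the file name is just an example, not from the original post), it would be run from the command line as:
python replace.py input.txt output.txt oldstring newstring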
You can write a function to do it for you.
import sys

def elReplacer(I,O,S,T):
    with open(O,"w") as outputFile:
        outputFile.write(open(I,"r").read().replace(S,T))

I= sys.argv[1]
O= sys.argv[2]
S= sys.argv[3]
T= sys.argv[4]

elReplacer(I,O,S,T)
This is very late, but it might help someone else, since I just had this exact problem with my coursework. I named the variable "inny" for the input file and "outy" for the one that will be my output later.
Steps I took were:
Open the input file
Read the input file and store its contents in a string
Close the file
Open the file that needs to be overwritten (make sure you open it in write mode)
Use re.sub to replace S with T
Write the result into the file
Close the file
This is the code I used below. I realize this can be shortened but if you want to display your work to the prof you can go with this.
import re
inny=open(I,'r')
results1 = inny.read()
inny.close()
outy=open(O,'w')
results2 = re.sub(S, T, results1)
outy.write(results2)
outy.close()
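One caveat with this version: re.sub treats S as a regular expression, so if S can contain characters like . or *, escaping it keeps the behaviour identical to a plain string replacement. A minimal adjustment:
results2 = re.sub(re.escape(S), T, results1)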

Renaming Files Based on In-File Strings Using Python

I posted the following code in the Windows and Rename tags and thought it might make more sense to ask about this code here. Essentially what I am trying to do is use this to rename files based on a particular text string located in the files (the text string in line.strip() below). I was wondering how I might implement something like this in Python, as this is a rough sketch of how I think it should look but not a complete work. Is there a best way to fill in the gaps here? Any suggestions would be much appreciated.
for file in directory:
    f = fopen(file, 'r')
    line = f.readLine();
    while(line):
        if(line.strip() == '<th style="width: 12em;">Name:</th>'):
            nextline = f.readLine().strip();
            c = nextline.find("</td>")
            name = nextline[4:c]
            os.commandline(rename file to name)
            break
        line = f.readLine()
I think the safest way to move the file is to use the shutil module. To do this, replace
os.commandline(rename file to name)
with
shutil.move(os.path.join(directory,file), os.path.join(directory,name))
You can rename a file with the rename function provided by the os module (first close the file):
f.close()
os.rename(file_name, new_name)
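Putting the suggestions together, a rough sketch of the corrected loop might look like this (the directory path is a placeholder, and the nextline[4:c] slice is copied from the question, so it assumes the value really sits between <td> and </td> on the next line):
import os

directory = 'path/to/files'  # placeholder, not from the original post

for filename in os.listdir(directory):
    filepath = os.path.join(directory, filename)
    with open(filepath, 'r') as f:
        for line in f:
            if line.strip() == '<th style="width: 12em;">Name:</th>':
                nextline = next(f, '').strip()
                c = nextline.find("</td>")
                name = nextline[4:c]
                break
        else:
            continue  # marker line not found, leave this file alone
    # the file is closed here, so it is safe to rename it
    os.rename(filepath, os.path.join(directory, name))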

Rename Files Based on File Content

Using Python, I'm trying to rename a series of .txt files in a directory according to a specific phrase in each given text file. Put differently and more specifically, I have a few hundred text files with arbitrary names but within each file is a unique phrase (something like No. 85-2156). I would like to replace the arbitrary file name with that given phrase for every text file. The phrase is not always on the same line (though it doesn't deviate that much) but it always is in the same format and with the No. prefix.
I've looked at the os module and I understand how
os.listdir
os.path.join
os.rename
could be useful but I don't understand how to combine those functions with intratext manipulation functions like linecache or general line reading functions.
I've thought through many ways of accomplishing this task but it seems like easiest and most efficient way would be to create a loop that finds the unique phrase in a file, assigns it to a variable and use that variable to rename the file before moving to the next file.
This seems like it should be easy, so much so that I feel silly writing this question. I've spent the last few hours reading documentation and parsing through StackOverflow, but it doesn't seem like anyone has quite had this issue before -- or at least they haven't asked about their problem.
Can anyone point me in the right direction?
EDIT 1: When I create the regex pattern using this website, it creates bulky but seemingly workable code:
import re
txt='No. 09-1159'
re1='(No)' # Word 1
re2='(\\.)' # Any Single Character 1
re3='( )' # White Space 1
re4='(\\d)' # Any Single Digit 1
re5='(\\d)' # Any Single Digit 2
re6='(-)' # Any Single Character 2
re7='(\\d)' # Any Single Digit 3
re8='(\\d)' # Any Single Digit 4
re9='(\\d)' # Any Single Digit 5
re10='(\\d)' # Any Single Digit 6
rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8+re9+re10,re.IGNORECASE|re.DOTALL)
m = rg.search(txt)
name = m.group(0)
print name
When I manipulate that to fit the glob.glob structure, and make it like this:
import glob
import os
import re
re1='(No)' # Word 1
re2='(\\.)' # Any Single Character 1
re3='( )' # White Space 1
re4='(\\d)' # Any Single Digit 1
re5='(\\d)' # Any Single Digit 2
re6='(-)' # Any Single Character 2
re7='(\\d)' # Any Single Digit 3
re8='(\\d)' # Any Single Digit 4
re9='(\\d)' # Any Single Digit 5
re10='(\\d)' # Any Single Digit 6
rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8+re9+re10,re.IGNORECASE|re.DOTALL)
for fname in glob.glob("\file\structure\here\*.txt"):
    with open(fname) as f:
        contents = f.read()
    tname = rg.search(contents)
    print tname
Then this prints out the byte location of the pattern -- signifying that the regex pattern is correct. However, when I add the line nname = tname.group(0) after the original tname = rg.search(contents) and change the print statement to reflect the change, it gives me the following error: AttributeError: 'NoneType' object has no attribute 'group'. When I tried copying and pasting #joaquin's code line for line, it came up with the same error. I was going to post this as a comment to the #spatz answer, but I wanted to include so much code that this seemed to be a better way to express the 'new' problem. Thank you all for the help so far.
Edit 2: This is for the #joaquin answer below:
import glob
import os
import re
for fname in glob.glob("/directory/structure/here/*.txt"):
    with open(fname) as f:
        contents = f.read()
    tname = re.search('No\. (\d\d\-\d\d\d\d)', contents)
    nname = tname.group(1)
    print nname
Last Edit: I got it to work using mostly the code as written. What was happening is that some files did not contain that regex pattern, and I had assumed Python would simply skip them. Silly me. So I spent three days learning to write two lines of code (I know the lesson is more than that). I also used the error catching method recommended here. I wish I could mark all of you as the answer, but I bothered #Joaquin the most, so I gave it to him. This was a great learning experience. Thank you all for being so generous with your time. The final code is below.
import os
import re
pat3 = "No\. (\d\d-\d\d)"
ext = '.txt'
mydir = '/directory/files/here'
for arch in os.listdir(mydir):
    archpath = os.path.join(mydir, arch)
    with open(archpath) as f:
        txt = f.read()
    s = re.search(pat3, txt)
    if s is None:
        continue
    name = s.group(1)
    newpath = os.path.join(mydir, name)
    if not os.path.exists(newpath):
        os.rename(archpath, newpath + ext)
    else:
        print '{} already exists, passing'.format(newpath)
Instead of providing you with some code which you will simply copy-paste without understanding, I'd like to walk you through the solution so that you will be able to write it yourself, and more importantly gain enough knowledge to be able to do it alone next time.
The code which does what you need is made up of three main parts:
Getting a list of all filenames you need to iterate
For each file, extract the information you need to generate a new name for the file
Rename the file from its old name to the new one you just generated
Getting a list of filenames
This is best achieved with the glob module. This module allows you to specify shell-like wildcards and it will expand them. This means that in order to get a list of the .txt files in a given directory, you will need to call the function glob.iglob("/path/to/directory/*.txt") and iterate over its result (for filename in ...:).
Generate new name
Once we have our filename, we need to open() it, read its contents using read() and store it in a variable where we can search for what we need. That would look something like this:
with open(filename) as f:
    contents = f.read()
Now that we have the contents, we need to look for the unique phrase. This can be done using regular expressions. Store the new filename you want in a variable, say newfilename.
Rename
Now that we have both the old and the new filenames, we need to simply rename the file, and that is done using os.rename(filename, newfilename).
If you want to move the files to a different directory, use os.rename(filename, os.path.join("/path/to/new/dir", newfilename)). Note that we need os.path.join here to construct the new path for the file from a directory path and newfilename.
There is no checking or protection against failures (checking whether archpath is a file, whether newpath already exists, whether the search is successful, etc...), but this should work:
import os
import re
pat = "No\. (\d\d\-\d\d\d\d)"
mydir = 'mydir'
for arch in os.listdir(mydir):
    archpath = os.path.join(mydir, arch)
    with open(archpath) as f:
        txt = f.read()
    s = re.search(pat, txt)
    name = s.group(1)
    newpath = os.path.join(mydir, name)
    os.rename(archpath, newpath)
Edit: I tested the regex to show how it works:
>>> import re
>>> pat = "No\. (\d\d\-\d\d\d\d)"
>>> txt='nothing here or whatever No. 09-1159 you want, does not matter'
>>> s = re.search(pat, txt)
>>> s.group(1)
'09-1159'
>>>
The regex is very simple:
\. -> a dot
\d -> a decimal digit
\- -> a dash
So, it says: search for the string "No. " followed by 2+4 decimal digits separated by a dash.
The parentheses are to create a group that I can recover with s.group(1) and that contains the code number.
And that is what you get, before and after. The text of the files one.txt, two.txt and three.txt is always the same; only the number changes:
this is the first
file with a number
nothing here or whatever No. 09-1159 you want, does not matter
the number is
Create a backup of your files, then try something like this:
import glob
import os
def your_function_to_dig_out_filename(lines):
    import re
    # i'll let you attempt this yourself

for fn in glob.glob('/path/to/your/dir/*.txt'):
    with open(fn) as f:
        spam = f.readlines()
    new_fn = your_function_to_dig_out_filename(spam)
    if not os.path.exists(new_fn):
        os.rename(fn, new_fn)
    else:
        print '{} already exists, passing'.format(new_fn)

Python: How do I split the file?

I have this txt file which is the output of ls -R on the etc directory of a Linux system. Example file:
etc:
ArchiveSEL
xinetd.d
etc/cmm:
CMM_5085.bin
cmm_sel
storage.cfg
etc/crontabs:
root
etc/pam.d:
ftp
rsh
etc/rc.d:
eth.set.sh
rc.sysinit
etc/rc.d/init.d:
cmm
functions
userScripts
etc/security:
access.conf
console.apps
time.conf
etc/security/console.apps:
kbdrate
etc/ssh:
ssh_host_dsa_key
sshd_config
etc/var:
setUser
snmpd.conf
etc/xinetd.d:
irsh
wu-ftpd
I would like to split it by subdirectory into several files. Example file names would be like this: etc.txt, etcCmm.txt, etcCrontabs.txt, etcPamd.txt, ...
Can someone give me Python code that can do that?
Notice that the subdirectory lines end with ':', but I'm just not smart enough to write the code. Some examples would be appreciated.
Thank you :)
Maybe something like this? re.M makes ^ match at the start of every line, so the pattern can pick out each directory header together with the block of file names that follows it, and the last part just iterates over the matches and creates the files...
import re
data = '<your input data as above>' # or open('data.txt').read()
results = map(lambda m: (m[0], m[1].strip().splitlines()),
              re.findall('^([^\n]+):\n((?:[^\n]+\n)*)\n', data, re.M))

for dirname, files in results:
    f = open(dirname.replace('/', '') + '.txt', 'w')
    for line in files:
        f.write(line + '\n')
    f.close()
You will need to do it line by line. If a line.endswith(":"), then you are in a new subdirectory. From then on, each line is a new entry in your subdirectory, until another line ends with ':'.
From my understanding, you just want to split one text file into several, ambiguously named, text files.
So you'd check whether a line ends with ':'. Then you open a new text file, like etcCmm.txt, and every line that you read from the source text from that point on, you write into etcCmm.txt. When you encounter another line that ends in ':', you close the previously opened file, create a new one, and continue.
I'm leaving a few things for you to do yourself, such as figuring out what to call the text file, reading a file line-by-line, etc.
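A minimal sketch of that line-by-line approach, under the assumption that the listing is in a file called listing.txt and using the same flattened naming as the regex answer above (so etc/cmm becomes etccmm.txt rather than etcCmm.txt):
out = None
with open('listing.txt') as src:
    for line in src:
        line = line.rstrip('\n')
        if line.endswith(':'):
            # new subdirectory header: close the previous output file and start a new one
            if out is not None:
                out.close()
            out = open(line[:-1].replace('/', '') + '.txt', 'w')
        elif line and out is not None:
            out.write(line + '\n')
if out is not None:
    out.close()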
use regexp like '.*:'.
use file.readline().
use loops.
If Python is not a must, you can use this awk one-liner:
awk '/:$/{gsub(/:|\//,"");fn=$0}{print $0 > fn".txt"}' file
Here's what I would do:
Read the file into memory (myfile = open(filename).read() should do).
Then split the file along the delimiters:
import re
myregex = re.compile(r"^(.*):[ \t]*$", re.MULTILINE)
arr = myregex.split(myfile)[1:] # dropping everything before the first directory entry
Then convert the array to a dict, removing unwanted characters along the way:
mydict = dict([(re.sub(r"\W+","",k), v.strip()) for (k,v) in zip(arr[::2], arr[1::2])])
Then write the files:
for name, content in mydict.iteritems():
    output = open(name + ".txt", "w")
    output.write(content)
    output.close()
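As a side note, this answer is written for Python 2; on Python 3 the loop would use mydict.items() instead of mydict.iteritems(), since iteritems() no longer exists there.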
