I'm trying to match network logon usernames across two files. All.txt is a text file of names I'm (or will be) interested in matching. Currently, I'm doing something like this:
def find_files(directory, pattern):
    #directory = raw_input("Enter a directory to search for Userlists: ")
    directory = "c:\\TEST"
    os.chdir(directory)
    for root, dirs, files in os.walk(directory):
        for basename in files:
            if fnmatch.fnmatch(basename, pattern):
                filename = os.path.join(root, basename)
                yield filename

for filename in find_files('a-zA-Z0-9', '*.txt'):
    with open(filename, "r") as file1:
        with open("c:/All.txt", "r") as file2:
            list1 = file1.readlines()[18:]
            list2 = file2.readlines()
            for i in list1:
                for j in list2:
                    if i == j:
I'm new to Python and am wondering whether this is the best and most efficient way of doing this. Even to me as a newbie it seems a little clunky, but with my current coding knowledge it's the best I can come up with at the moment.
Any help and advice would be gratefully received.
You want to read one file into memory first, storing it in a set. Membership testing in a set is very efficient, much more so than looping over the lines of the second file for every line in the first file.
Then you only need to read the second file, and line by line process it and test if lines match.
What file you keep in memory depends on the size of All.txt. If it is < 1000 lines or so, just keep that in memory and compare it to the other files. If All.txt is really large, re-open it for every file1 you process, and read only the first 18 lines of file1 into memory and match those against every line in All.txt, line by line.
To read just 18 lines of a file, use itertools.islice(); files are iterables and islice() is the easiest way to pick a subset of lines to read.
Reading All.txt into memory first:
from itertools import islice

with open("c:/All.txt", "r") as all:
    # storing lines without whitespace to make matching a little more robust
    all_lines = set(line.strip() for line in all)

for filename in find_files('a-zA-Z0-9', '*.txt'):
    with open(filename, "r") as file1:
        for line in islice(file1, 18):
            if line.strip() in all_lines:
                # matched line
If All.txt is large, store those 18 lines of each file in a set first, then re-open All.txt and process it line by line:
for filename in find_files('a-zA-Z0-9', '*.txt'):
    with open(filename, "r") as file1:
        file1_lines = set(line.strip() for line in islice(file1, 18))
    with open("c:/All.txt", "r") as all:
        for line in all:
            if line.strip() in file1_lines:
                # matched line
Note that you do not have to change directories in find_files(); os.walk() is already passed the directory name. The fnmatch module also has a filter() function; use that to loop over the matching files instead of calling fnmatch.fnmatch() on each file individually:
def find_files(directory, pattern):
    for root, dirs, files in os.walk(directory):
        for basename in fnmatch.filter(files, pattern):
            yield os.path.join(root, basename)
Related
I want to search for a string in a number of text files in a folder and its subfolders.
Then all files containing this string should be listed. How can this be done?
The string is just something like "Test". So no special chars. I thought of something like the following in a loop:
open('*', 'r').read().find('Test')
You can use a simple loop and the glob module:
from glob import glob
import os

out = []
for fname in glob('**/*', recursive=True):
    if not os.path.isfile(fname):  # glob also yields directories
        continue
    with open(fname, 'r') as f:
        if any('Test' in line for line in f):
            out.append(fname)
print(out)
If you just want to print:
for fname in glob('**/*', recursive=True):
    with open(fname, 'r') as f:
        if any('Test' in line for line in f):
            print(fname)
I suggest taking a look at glob.glob with recursive set to True; consider the following simple example:
import glob

for fpath in glob.glob("**/*.txt", recursive=True):
    print(fpath)
It prints the paths (relative to the current working directory) of all *.txt files, including those in subdirectories (note the ** in the first argument).
I write folder contents (files with .pdf, .doc and .xls extensions) to a small txt file; every filename gets its own line. Works fine.
Now I want to remove all lines with .pdf files.
I currently use the following code to remove false entries (fail.png in this case):
def clean():
    with open("files.txt", "r") as f:
        lines = f.readlines()
    with open("files.txt", "w") as f:
        for line in lines:
            if line.strip("\n") != "fail.png":
                f.write(line)

clean()
Is it possible to use some sort of "wildcard" (*.pdf) instead of the specific file name?
Or is there a complete other way to solve that?
Thanks a lot
There are multiple options:
You could check whether the line contains the string '.pdf':
if ".pdf" not in line:
    f.write(line)
You could also use a regular expression. This can be useful in other situations where you want to have a more complex pattern matching.
import re

with open("testdata.txt", "w") as f:
    for line in lines:
        if not re.match(r".+\.pdf$", line.strip()):
            f.write(line)
.+ matches one or more of any character
\. matches a literal dot
pdf matches the literal characters 'pdf'
$ anchors the match at the end of the string
Whole code would look like this:
def clean():
    with open("files.txt", "r") as f:
        lines = f.readlines()
    with open("files.txt", "w") as f:
        for line in lines:
            if ".pdf" not in line:
                f.write(line)

clean()
Also, I fixed the indentation: the write-open doesn't have to be nested inside the read-open.
You have lots of options:
Check if the string ends with ".pdf":
if not line.rstrip("\n").endswith(".pdf"):
Use the re module (most general pattern matching):
import re
...
if not re.search(r"\.pdf$", line.rstrip("\n")):
Use the fnmatch module for shell-style pattern matching:
from fnmatch import fnmatch
....
if not fnmatch(line.strip(), "*.pdf"):
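To tie one of these options back into the question's clean() function, here is a minimal end-to-end sketch of the fnmatch variant; the file name files.txt is taken from the question, and the default pattern is just an example:

```python
from fnmatch import fnmatch

def clean(path="files.txt", pattern="*.pdf"):
    # read all lines, then rewrite the file without the matching ones
    with open(path, "r") as f:
        lines = f.readlines()
    with open(path, "w") as f:
        for line in lines:
            if not fnmatch(line.strip(), pattern):
                f.write(line)
```

Because fnmatch takes a shell-style wildcard, the same function also handles patterns like "fail.*" or "*.xls" without any code changes.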
You can easily replace your two functions (writing the folder contents and removing unwanted files) with, for example, the code snippet below:
import os

folder = 'PUT_YOUR_FOLDER_PATH'
extensions = ['.pdf', 'PUT_YOUR_OTHER_EXTENSIONS']
with open('test.txt', 'w') as f:
    for file_name in os.listdir(folder):
        if os.path.isfile(os.path.join(folder, file_name)) and not file_name.endswith(tuple(extensions)):
            f.write("%s\n" % file_name)
It will write all the filenames of your folder to a file; you just need to put the extensions you don't want into the list. Enjoy!
Note: this works for the single folder passed to os.listdir(). To include files from subfolders as well, use a recursive walk.
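A hedged sketch of that recursive variant using os.walk (the folder path, output file, and excluded extensions are all placeholders to fill in):

```python
import os

def list_files(folder, out_path, excluded=('.pdf',)):
    # walk the tree rooted at `folder` and record every file
    # whose extension is not in `excluded`
    with open(out_path, 'w') as out:
        for dirpath, dirnames, filenames in os.walk(folder):
            for name in filenames:
                if not name.endswith(excluded):
                    out.write(os.path.join(dirpath, name) + "\n")
```

Unlike the os.listdir() version, this writes full paths, since the same filename can occur in more than one subfolder.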
All,
I'm trying to read text files that are downloaded every 20 min into a certain folder, check them, manipulate them and move them to another location for further processing. Basically, I want to check each file that comes in, see if it contains a "0.00" value and, if so, delete that particular string. There are two such strings per file.
I managed to manipulate a file with a given name, but now I need to do the same for files with variable names (there is a timestamp included in the title). One file will need to be processed at a time.
This is what I got so far:
import os

path = r"C:\Users\r1.0"
dir = os.listdir(path)

def remove_line(line, stop):
    return any([word in line for word in stop])

stop = ["0.00"]

for file in dir:
    if file.lower().endswith('.txt'):
        with open(file, "r") as f:
            lines = f.readlines()
        with open(file, "w") as f:
            for line in lines:
                if not remove_line(line, stop):
                    f.write(line)
What works are the def-function and the two "with open..." codes. What am I doing wrong here?
Also, can I write a file to another directory using the open() function?
Thanks in advance!
Your code looks mostly fine, though note that your remove_line() approach drops the entire line rather than just removing the "0.00" string from it. You can write to a different folder with open(). This should do the trick for you:
import os

path = r"C:\Users\r1.0"
dir = os.listdir(path)
stop = ["0.00"]

for file in dir:
    if file.lower().endswith('.txt'):
        with open(os.path.join(path, file), "r") as f:
            lines = f.readlines()
        # put the new file in a different location
        newfile = os.path.join("New", "directory", file)
        with open(newfile, "w") as f:
            for line in lines:
                # remove every stop word from the line
                for word in stop:
                    line = line.replace(word, "")
                # regardless of whether the line has changed, write it out
                f.write(line)
I am trying to find all log files on my C:\ drive and then search those log files for a string. If the string is found, the output should be the absolute path of the log file where it was found. Below is what I have done so far:
import os

rootdir = ('C:\\')
for folder, dirs, file in os.walk(rootdir):
    for files in file:
        if files.endswith('.log'):
            fullpath = open(os.path.join(folder, files), 'r')
            for line in fullpath.read():
                if "saurabh" in line:
                    print(os.path.join(folder, files))
Your code is broken at:
for line in fullpath.read():
The statement fullpath.read() will return the entire file as one string, and when you iterate over it, you will be iterating a character at a time. You will never find the string 'saurabh' in a single character.
A file is its own iterator for lines, so just replace this statement with:
for line in fullpath:
Also, for cleanliness, you might want to close the file when you're done, either explicitly or by using a with statement.
Finally, you may want to break when you find a file, rather than printing the same file out multiple times (if there are multiple occurrences of your string):
import os

rootdir = 'C:\\'
for folder, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith('.log'):
            fullpath = os.path.join(folder, file)
            with open(fullpath, 'r') as f:
                for line in f:
                    if "saurabh" in line:
                        print(fullpath)
                        break
Very new to Python and programming in general so apologies if I am missing anything straightforward.
I am trying to iterate through a directory and open the included .txt files and modify them with new content.
import os

def rootdir(x):
    for paths, dirs, files in os.walk(x):
        for filename in files:
            f = open(filename, 'r')
            lines = f.read()
            f.close()
            for line in lines:
                f = open(filename, 'w')
                newline = 'rewritten content here'
                f.write(newline)
                f.close()
    return x

rootdir("/Users/russellculver/documents/testfolder")
This gives me: IOError: [Errno 2] No such file or directory: 'TestText1.rtf'
EDIT: I should clarify there IS a file named 'TestText1.rtf' in the folder specified in the function argument. It is the first one of three text files.
When I try moving where the file is closed / opened as seen below:
import os

def rootdir(x):
    for paths, dirs, files in os.walk(x):
        for filename in files:
            f = open(filename, 'r+')
            lines = f.read()
            for line in lines:
                newline = 'rewritten content here'
                f.write(newline)
                f.close()
    return x

rootdir("/Users/russellculver/documents/testfolder")
It gives me: ValueError: I/O operation on closed file
Thanks for any thoughts in advance.
@mescalinum Okay, so I've made amendments to what I've got based on everyone's assistance (thanks!), but it is still failing to write the text "newline" to any of the .txt files in the specified folder.
import os

x = raw_input("Enter the directory here: ")

def rootdir(x):
    for dirpaths, dirnames, files in os.walk(x):
        for filename in files:
            try:
                with open(os.dirpaths.join(filename, 'w')) as f:
                    f.write("newline")
                return x
            except:
                print "There are no files in the directory or the files cannot be opened!"
                return x
From https://docs.python.org/2/library/os.html#os.walk:
os.walk(top, topdown=True, onerror=None, followlinks=False)
Generate the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory in the tree rooted at directory top (including top itself), it yields a 3-tuple (dirpath, dirnames, filenames).
dirpath is a string, the path to the directory. dirnames is a list of the names of the subdirectories in dirpath (excluding '.' and '..'). filenames is a list of the names of the non-directory files in dirpath. Note that the names in the lists contain no path components. To get a full path (which begins with top) to a file or directory in dirpath, do os.path.join(dirpath, name).
Also, f.close() should be outside for line in lines, otherwise you call it multiple times, and the second time you call it, f is already closed, and it will give that I/O error.
You should avoid explicitly open()ing and close()ing files, like:
f = open(filename, 'w')
f.write(newline)
f.close()
and instead use context managers (i.e. the with statement):
with open(filename, 'w') as f:
    f.write(newline)
which does exactly the same thing, but implicitly closes the file when the body of with is finished.
Here is the code that does as you asked:
import os

def rootdir(x):
    for paths, dirs, files in os.walk(x):
        for filename in files:
            try:
                f = open(os.path.join(paths, filename), 'w')
                f.write('new content here')
                f.close()
            except Exception as e:
                print "Could not open " + filename

rootdir("/Users/xrisk/Desktop")
However, I have a feeling you don't quite understand what's happening here (no offence). First have a look at the documentation of os.walk quoted by @mescalinum: the third tuple element, files, contains only the file names, so you need to join each one with paths to get a full path to the file.
Also, you don't need to read a file first in order to write to it. If you want to append to the file instead of overwriting it, open it with mode 'a'.
In general, when reading/writing a file, you only close it after finishing all the read/writes. Otherwise you will get an exception.
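As a small illustration of the difference between the two write modes (the file name log.txt is just an example):

```python
# 'w' truncates the file; 'a' keeps existing content and appends to it
with open("log.txt", "w") as f:
    f.write("first\n")

with open("log.txt", "a") as f:
    f.write("second\n")

# the file now contains both lines, in order
```

Each with block closes the file as soon as its body finishes, so the append sees the first write already flushed to disk.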
Thanks @mescalinum