finding string in .txt file and delete it - python

I write a folder's contents (files with .pdf, .doc and .xls extensions) into a small txt file. Every filename gets its own line in the txt file. Works fine.
Now I want to remove all lines with the .pdf files.
I currently use the following code to remove false entries (fail.png in this case):
def clean():
    with open("files.txt", "r") as f:
        lines = f.readlines()
    with open("files.txt", "w") as f:
        for line in lines:
            if line.strip("\n") != "fail.png":
                f.write(line)

clean()
Is it possible to use some sort of "wildcard" (*.pdf) instead of the specific file name?
Or is there a completely different way to solve this?
Thanks a lot

There are multiple options:
You could check whether the line contains the string '.pdf':
if ".pdf" not in line:
    f.write(line)
You could also use a regular expression. This can be useful in situations where you need more complex pattern matching.
import re

with open("testdata.txt", "w") as f:
    for line in lines:
        line = line.strip()
        if not re.match(r".+\.pdf$", line):
            f.write(line + "\n")
.+ matches one or more of any character
\. matches a literal dot
pdf matches the literal characters 'pdf'
$ matches at the end of the line
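For instance, a quick sketch with made-up filenames showing which lines the pattern filters out:

```python
import re

# hypothetical filenames to illustrate the filter
lines = ["report.pdf", "notes.txt", "summary.doc", "scan.pdf"]

# keep only the lines that do NOT match the .pdf pattern
kept = [line for line in lines if not re.match(r".+\.pdf$", line)]
# kept is ["notes.txt", "summary.doc"]
```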
Whole code would look like this:
def clean():
    with open("files.txt", "r") as f:
        lines = f.readlines()
    with open("files.txt", "w") as f:
        for line in lines:
            if ".pdf" not in line:
                f.write(line)

clean()
Also, I fixed the indentation: the write-open doesn't have to be nested inside the read-open.

You have lots of options:
Check if the string ends with ".pdf" (strip the trailing newline first):
if not line.rstrip("\n").endswith(".pdf"):
Use the re module for the most general pattern matching (note re.search rather than re.match, since re.match only matches at the start of the string):
import re
...
if not re.search(r"\.pdf$", line):
Use the fnmatch module for shell-style pattern matching:
from fnmatch import fnmatch
...
if not fnmatch(line.rstrip("\n"), "*.pdf"):
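A quick illustration of the fnmatch option (the filenames are made up):

```python
from fnmatch import fnmatch

# shell-style wildcard matching, like *.pdf on the command line
names = ["report.pdf", "notes.txt", "archive.pdf"]
kept = [n for n in names if not fnmatch(n, "*.pdf")]
# kept is ["notes.txt"]
```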

You can easily replace your two functions (writing the folder contents and removing unnecessary files) with a single snippet like the one below:
import os

extensions = ['.pdf', 'PUT_YOUR_OTHER_EXTENSIONS']
folder = 'PUT_YOUR_FOLDER_PATH'

with open('test.txt', 'w') as f:
    for file_name in os.listdir(folder):
        if os.path.isfile(os.path.join(folder, file_name)) and not file_name.endswith(tuple(extensions)):
            f.write("%s\n" % file_name)
It will write all filenames of your folder into the file. You just need to put the extensions that you don't need into the list. Enjoy!
Note: this works for the single folder passed to os.listdir(). To include files from subfolders as well, use a recursive walk.
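For the recursive case, a sketch using os.walk; the sample tree built here uses hypothetical names purely for illustration:

```python
import os

# build a small sample tree to walk over (hypothetical names)
os.makedirs(os.path.join("sample", "sub"), exist_ok=True)
for p in ["sample/a.pdf", "sample/b.txt", "sample/sub/c.pdf", "sample/sub/d.doc"]:
    open(p, "w").close()

extensions = ('.pdf',)  # extensions to skip
listed = []
# os.walk descends into every subfolder, so subdirectories are covered too
for dirpath, dirnames, filenames in os.walk("sample"):
    for file_name in filenames:
        if not file_name.endswith(extensions):
            listed.append(os.path.join(dirpath, file_name))
```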


How to keep lines which contains specific string and remove other lines from .txt file?
Example: I want to keep the line which has the word "hey" and remove the others.
test.txt file:
first line
second one
heyy yo yo
fourth line
Code:
keeplist = ["hey"]
with open("test.txt") as f:
    for line in f:
        for word in keeplist:
It's hard to remove lines from a file in place. It's usually better to write a temporary file with the desired content and then rename it to the original file name.
import os

keeplist = ["hey"]
with open("test.txt") as f, open("test.txt.tmp", "w") as outf:
    for line in f:
        for word in keeplist:
            if word in line:
                outf.write(line)
                break
os.rename("test.txt.tmp", "test.txt")
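The inner loop can also be flattened with any(), which keeps a line as soon as any keep-word appears in it (sample lines taken from the question):

```python
keeplist = ["hey"]
lines = ["first line\n", "second one\n", "heyy yo yo\n", "fourth line\n"]

# keep only the lines containing at least one word from keeplist
kept = [line for line in lines if any(word in line for word in keeplist)]
# kept is ["heyy yo yo\n"]
```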

Read, manipulate and save text file

All,
I'm trying to read text files that are downloaded every 20 minutes into a certain folder, check them, manipulate them and move them to another location for further processing. Basically, I want to check each file that comes in, check whether it contains a "0.00" value and, if so, delete that particular string. There are two such strings per file.
I managed to manipulate a file with a given name, but now I need to do the same for files with variable names (there is a timestamp included in the title). One file will need to be processed at a time.
This is what I got so far:
import os

path = r"C:\Users\r1.0"
dir = os.listdir(path)

def remove_line(line, stop):
    return any([word in line for word in stop])

stop = ["0.00"]

for file in dir:
    if file.lower().endswith('.txt'):
        with open(file, "r") as f:
            lines = f.readlines()
        with open(file, "w") as f:
            for line in lines:
                if not remove_line(line, stop):
                    f.write(line)
The def function and the two "with open(...)" blocks work fine. What am I doing wrong here?
Also, can I write a file to another directory using the open() function?
Thanks in advance!
Your code looks mostly fine, but note that your remove_line approach removes whole lines rather than just the "0.00" string. You can write to a different folder with open(). This should do the trick for you:
import os

path = r"C:\Users\r1.0"
dir = os.listdir(path)
stop = ["0.00"]

for file in dir:
    if file.lower().endswith('.txt'):
        with open(file, "r") as f:
            lines = f.readlines()
        # put the new file in a different location
        newfile = os.path.join("New", "directory", file)
        with open(newfile, "w") as f:
            for line in lines:
                for word in stop:
                    if word in line:  # check if we need to modify the line
                        # this removes the stop word from the line;
                        # str.replace returns a new string, so reassign it
                        line = line.replace(word, "")
                # regardless of whether the line has changed, write it out
                f.write(line)
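To illustrate the replace step on its own (the sample line is made up):

```python
stop = ["0.00"]

line = "value1;0.00;value2\n"
for word in stop:
    if word in line:
        # str.replace returns a new string; the original is unchanged
        line = line.replace(word, "")
# line is now "value1;;value2\n"
```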

regex output to a text file

I'm trying to write a python Script that write regex output (IP Addresses) to a text file. The script will rewrite the file each time it runs. Is there any better way to do it?
import re

pattern = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

with open('test.txt', 'r') as rf:
    content = rf.read()
matches = pattern.findall(content)

open('iters.txt', 'w').close()
for match in matches:
    with open('iters.txt', 'a') as wf:
        wf.write(match + '\n')
I rewrote the code a bit.
I changed the regex to use {3} so you don't have to repeat the same sub-pattern three times.
I added an os.path.exists check to see whether the file already exists. I think this is what you want; if not, just remove that if. If the file already exists, nothing is written.
I combined the two with statements, since it doesn't make much sense to keep reopening the output file to append each new line.
I renamed pattern to ip_pattern just for readability sake, you can change it back if you want.
Code:
import re, os

ip_pattern = re.compile(r'(?:\d{1,3}\.){3}\d{1,3}')

if not os.path.exists("iters.txt"):
    with open('test.txt', 'r') as rf, open('iters.txt', 'a') as wf:
        content = rf.read()
        matches = ip_pattern.findall(content)
        for match in matches:
            wf.write(match + '\n')
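For reference, this is what the pattern extracts from sample text (made up here); note that \d{1,3} does not validate octet ranges, so a string like 999.999.999.999 would also match:

```python
import re

ip_pattern = re.compile(r'(?:\d{1,3}\.){3}\d{1,3}')

text = "host 10.0.0.1 talked to 192.168.1.5 at noon"
matches = ip_pattern.findall(text)
# matches is ['10.0.0.1', '192.168.1.5']
```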

remove first char from each line in a text file

I'm new to Python, and to programming in general.
I want to remove the first character from each line in a text file and write the changes back to the file. For example, I have a file with 36 lines, and the first character of each line is a symbol or a number that I want removed.
I made a little code here, but it doesn't work as expected: it only duplicates whole lines. Any help would be appreciated!
from sys import argv
run, filename = argv
f = open(filename, 'a+')
f.seek(0)
lines = f.readlines()
for line in lines:
    f.write(line[1:])
f.close()
Your code already does remove the first character. I saved exactly your code as both dupy.py and dupy.txt, then ran python dupy.py dupy.txt, and the result is:
from sys import argv
run, filename = argv
f = open(filename, 'a+')
f.seek(0)
lines = f.readlines()
for line in lines:
    f.write(line[1:])
f.close()
rom sys import argv
un, filename = argv
 = open(filename, 'a+')
.seek(0)
ines = f.readlines()
or line in lines:
   f.write(line[1:])
.close()
It's not copying entire lines; it's copying lines with their first character stripped.
But from the initial statement of your problem, it sounds like you want to overwrite the lines, not append new copies. To do that, don't use append mode. Read the file, then write it:
from sys import argv
run, filename = argv
f = open(filename)
lines = f.readlines()
f.close()
f = open(filename, 'w')
for line in lines:
    f.write(line[1:])
f.close()
Or, alternatively, write a new file, then move it on top of the original when you're done:
import os
from sys import argv
run, filename = argv
fin = open(filename)
fout = open(filename + '.tmp', 'w')
lines = fin.readlines()
for line in lines:
    fout.write(line[1:])
fout.close()
fin.close()
os.rename(filename + '.tmp', filename)
(Note that this version will not work as-is on Windows, but it's simpler than the actual cross-platform version; if you need Windows, I can explain how to do this.)
You can make the code a lot simpler, more robust, and more efficient by using with statements, looping directly over the file instead of calling readlines, and using tempfile:
import os
import tempfile
from sys import argv
run, filename = argv
with open(filename) as fin, tempfile.NamedTemporaryFile(delete=False) as fout:
    for line in fin:
        fout.write(line[1:])
os.rename(fout.name, filename)
On most platforms, this guarantees an "atomic write"—when your script finishes, or even if someone pulls the plug in the middle of it running, the file will end up either replaced by the new version, or untouched; there's no way it can end up half-way overwritten into unrecoverable garbage.
Again this version won't work on Windows. Without a whole lot of work, there is no way to implement this "write-temp-and-rename" algorithm on Windows. But you can come close with only a bit of extra work:
with open(filename) as fin, tempfile.NamedTemporaryFile(delete=False) as fout:
    for line in fin:
        fout.write(line[1:])
outname = fout.name
os.remove(filename)
os.rename(outname, filename)
This does prevent you from half-overwriting the file, but it leaves a hole where you may have deleted the original file, and left the new file in a temporary location that you'll have to search for. You can make this a little nicer by putting the file somewhere easier to find (see the NamedTemporaryFile docs to see how). Or renaming the original file to a temporary name, then writing to the original filename, then deleting the original file. Or various other possibilities. But to actually get the same behavior as on other platforms is very difficult.
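One further note: on Python 3.3 and later, os.replace performs an overwriting rename on Windows as well, which closes most of the gap described above. A minimal sketch (the input file here is hypothetical):

```python
import os
import tempfile

# hypothetical input file for the demonstration
filename = "data.txt"
with open(filename, "w") as f:
    f.write("Xfirst\nYsecond\n")

# write the stripped copy to a temp file in the same directory,
# then replace the original in one step
dirname = os.path.dirname(os.path.abspath(filename))
with open(filename) as fin, tempfile.NamedTemporaryFile(
        "w", dir=dirname, delete=False) as fout:
    for line in fin:
        fout.write(line[1:])
os.replace(fout.name, filename)  # overwrites the target, even on Windows
```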
You can either read all lines into memory and then recreate the file:
from sys import argv
run, filename = argv
with open(filename, 'r') as f:
    data = [i[1:] for i in f]
with open(filename, 'w') as f:
    f.writelines(data)  # the lines keep their original '\n' endings
or you can create another file and move the data from the first file to the second line by line. Then you can rename it if you'd like:
import os
from sys import argv
run, filename = argv
new_name = filename + '.tmp'
with open(filename, 'r') as f_in, open(new_name, 'w') as f_out:
    for line in f_in:
        f_out.write(line[1:])
os.rename(new_name, filename)
At its most basic, your problem is that you need to seek back to the beginning of the file after you read its complete contents into the list lines. Since you are making the file shorter, you also need truncate to adjust the official length of the file after you're done. Furthermore, open mode a+ (a is for append) overrides seek and forces all writes to go to the end of the file. So your code should look something like this:
import sys

def main(argv):
    filename = argv[1]
    with open(filename, 'r+') as f:
        lines = f.readlines()
        f.seek(0)
        for line in lines:
            f.write(line[1:])
        f.truncate()

if __name__ == '__main__':
    main(sys.argv)
It is better, when doing something like this, to write the changes to a new file and then rename it over the old file when you're done. This causes the update to happen "atomically" - a concurrent reader sees either the old file or the new one, not some mangled combination of the two. That looks like this:
import os
import sys
import tempfile

def main(argv):
    filename = argv[1]
    with open(filename, 'r') as inf:
        with tempfile.NamedTemporaryFile(dir=".", delete=False) as outf:
            tname = outf.name
            for line in inf:
                outf.write(line[1:])
    os.rename(tname, filename)

if __name__ == '__main__':
    main(sys.argv)
(Note: Atomically replacing a file via rename does not work on Windows; you have to os.remove the old name first. This unfortunately does mean there is a brief window (no pun intended) where a concurrent reader will find that the file does not exist. As far as I know there is no way to avoid this.)
import re

with open(filename, 'r+') as f:
    modified = re.sub('^.', '', f.read(), flags=re.MULTILINE)
    f.seek(0, 0)
    f.write(modified)
    f.truncate()
In the regex pattern:
^ means 'start of string'
^ with the flag re.MULTILINE means 'start of line'
^. means 'exactly one character at the start of a line'
The start of a line is the start of the string or any position after a newline (\n).
So we might fear that some newlines in sequences like \n\n\n\n\n\n\n could match the pattern.
But the dot matches any character EXCEPT a newline, so newlines never match this pattern.
During the read triggered by f.read(), the file pointer moves to the end of the file.
f.seek(0, 0) moves the file pointer back to the beginning of the file.
f.truncate() puts a new EOF (end of file) at the point where the writing stopped. It's necessary, since the modified text is shorter than the original.
Compare what it does with a version of the code without this line.
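A small demonstration of why truncate() matters, using a hypothetical scratch file:

```python
# write a 6-character file, then overwrite it with shorter text
with open("scratch.txt", "w") as f:
    f.write("abcdef")

# overwrite WITHOUT truncate(): the old tail survives
with open("scratch.txt", "r+") as f:
    f.write("xyz")
assert open("scratch.txt").read() == "xyzdef"  # stale "def" remains

# reset and overwrite WITH truncate(): the tail is cut off
with open("scratch.txt", "w") as f:
    f.write("abcdef")
with open("scratch.txt", "r+") as f:
    f.write("xyz")
    f.truncate()
```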
To be honest, I'm really not sure how good or bad an idea it is to nest with open() statements, but you can do something like this:
with open(filename_you_reading_lines_FROM, 'r') as f0:
    with open(filename_you_appending_modified_lines_TO, 'a') as f1:
        for line in f0:
            f1.write(line[1:])
While there was some discussion of best practice and whether it would run on Windows, being new to Python I took the first working example and ran it in my Windows environment (which has cygwin binaries on my Path) to remove the first 3 characters (line numbers, in my sample file):
import os
from sys import argv
run, filename = argv
fin = open(filename)
fout = open(filename + '.tmp', 'w')
lines = fin.readlines()
for line in lines:
    fout.write(line[3:])
fout.close()
fin.close()
I chose not to automatically overwrite since I wanted to be able to eyeball the output.
python c:\bin\remove1st3.py sampleCode.txt

how to pass a list of files to python open() method

I have a list of around 100 files from which I want to read and match one word.
Here's the piece of code I wrote:
import re

y = 'C:\\prova.txt'
var1 = open(y, 'r')
for line in var1:
    if re.match('(.*)version(.*)', line):
        print line
var1.close()
Every time I try to pass a tuple to y, I get this error:
TypeError: coercing to Unicode: need string or buffer, tuple found
(I think open() does not accept a tuple, only strings.)
So how could I get it to work with a list of files?
Thanks in advance!
You are quite correct that open doesn't accept a tuple and needs a string. So you have to iterate over the file names one by one:
import re

for path in paths:
    with open(path) as f:
        for line in f:
            if re.match('(.*)version(.*)', line):
                print line
Here I use paths as the variable that holds the file names; it can be a tuple, a list, or any other object you can iterate over.
Use fileinput.input instead of open.
This module implements a helper class and functions to quickly write a loop over standard input or a list of files
[...] To specify an alternative list of filenames, pass it as the first argument to input(). A single file name is also allowed.
Example:
import fileinput

for line in fileinput.input(list_of_files):
    # etc...
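A fuller sketch of the fileinput approach, collecting matching lines instead of printing them; the file names and contents here are made up for the example:

```python
import fileinput
import os

# create two hypothetical sample files
for name, text in [("a.txt", "version 1.0\nnothing\n"),
                   ("b.txt", "other\nversion 2.0\n")]:
    with open(name, "w") as f:
        f.write(text)

matches = []
for line in fileinput.input(["a.txt", "b.txt"]):
    if "version" in line:
        # fileinput tracks which file and which line the match came from
        matches.append((fileinput.filename(), fileinput.filelineno(), line.strip()))

# clean up the sample files
for name in ("a.txt", "b.txt"):
    os.remove(name)
```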
Just iterate over the tuple. And you don't need a regex here.
y = ('C:\\prova.txt', 'C:\\prova2.txt')

for filename in y:
    with open(filename) as f:
        for line in f:
            if 'version' in line:
                print line
Using the with statement this way also saves you from having to close the files you're working with. They will be closed automatically when the with block is exited.
Something like this:
import re
files = ['a.txt', 'b.txt']
for f in files:
with open(f, 'r') as var1:
for line in var1:
if re.match('(.*)version(.*)', line):
print line
def simple_search(filenames, query):
    for filename in filenames:
        with open(filename) as f:
            for line_num, line in enumerate(f, 1):
                if query in line:
                    print filename, line_num, line.strip()
My added value: (1) it's useless to print the line contents without showing which line in which file matched; (2) this version doesn't double-space the output.
