regex output to a text file - python

I'm trying to write a Python script that writes regex output (IP addresses) to a text file. The script rewrites the file each time it runs. Is there a better way to do it?
import re

pattern = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

with open('test.txt', 'r') as rf:
    content = rf.read()
    matches = pattern.findall(content)

open('iters.txt', 'w').close()
for match in matches:
    with open('iters.txt', 'a') as wf:
        wf.write(match + '\n')

I rewrote the code a bit:
I changed the regex to use {3} so you don't have to repeat the same group so many times.
I added an os.path.exists check to see whether the file already exists. I think this is what you want; if not, just remove that if. If the file already exists, nothing is written.
I combined the two with statements, since it doesn't make much sense to keep reopening the output file just to append one line at a time.
I renamed pattern to ip_pattern for readability's sake; you can change it back if you want.
Code:
import re, os

ip_pattern = re.compile(r'(?:\d{1,3}\.){3}\d{1,3}')

if not os.path.exists("iters.txt"):
    with open('test.txt', 'r') as rf, open('iters.txt', 'a') as wf:
        content = rf.read()
        matches = ip_pattern.findall(content)
        for match in matches:
            wf.write(match + '\n')
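Note that \d{1,3} also matches values such as 999, so the pattern can pick up strings that are not valid IP addresses. If you want to filter those out, one option (a sketch, not part of the answer above; it reuses the test.txt and iters.txt names and validates candidates with the standard-library ipaddress module) is:

import re, ipaddress

ip_pattern = re.compile(r'(?:\d{1,3}\.){3}\d{1,3}')

def valid_ips(text):
    # yield only candidates that ipaddress accepts as real IP addresses
    for candidate in ip_pattern.findall(text):
        try:
            ipaddress.ip_address(candidate)  # raises ValueError for e.g. 999.999.999.999
            yield candidate
        except ValueError:
            pass

with open('test.txt', 'r') as rf, open('iters.txt', 'w') as wf:
    for ip in valid_ips(rf.read()):
        wf.write(ip + '\n')

Here 'w' mode rewrites iters.txt on every run, matching the behaviour described in the question.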

Related

Finding a string in a .txt file and deleting it

I write folder content (files with .pdf, .doc and .xls extensions) to a small txt file. Every filename gets a new line in the txt file. Works fine.
Now I want to remove all lines with the .pdf files.
I currently use the following code to remove false entries (fail.png in this case):
def clean():
    with open("files.txt", "r") as f:
        lines = f.readlines()
    with open("files.txt", "w") as f:
        for line in lines:
            if line.strip("\n") != "fail.png":
                f.write(line)

clean()
Is it possible to use some sort of "wildcard" (*.pdf) instead of the specific file name?
Or is there a completely different way to solve this?
Thanks a lot
There are multiple options:
You could check whether the line contains the string '.pdf':
if not ".pdf" in line.strip("\n"):
    f.write(line)
You could also use a regular expression. This can be useful in other situations where you need more complex pattern matching.
import re

with open("testdata.txt", "w") as f:
    for line in lines:
        if not re.match(r".+\.pdf$", line.strip()):
            f.write(line)
.+ matches one or more of any character
\. matches a literal dot
pdf matches the literal characters 'pdf'
$ matches at the end of the line
The whole code would look like this:
def clean():
    with open("files.txt", "r") as f:
        lines = f.readlines()
    with open("files.txt", "w") as f:
        for line in lines:
            if not ".pdf" in line.strip("\n"):
                f.write(line)

clean()
Also, I fixed the indentation: the write-open doesn't have to be nested inside the read-open's with block.
You have lots of options:
Check whether the (stripped) line ends with ".pdf":
if not line.rstrip().endswith(".pdf"):
Use the re module (the most general pattern matching); note re.search rather than re.match, since the pattern is anchored to the end of the line rather than the start:
import re
...
if not re.search(r"\.pdf$", line):
Use the fnmatch module for shell-style pattern matching:
from fnmatch import fnmatch
....
if not fnmatch(line.rstrip(), "*.pdf"):
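Plugged into the clean() function from the question, the fnmatch variant might look like this (just a sketch, reusing the files.txt name from the question):

from fnmatch import fnmatch

def clean():
    with open("files.txt", "r") as f:
        lines = f.readlines()
    with open("files.txt", "w") as f:
        for line in lines:
            # keep every line that does not match the *.pdf wildcard
            if not fnmatch(line.strip(), "*.pdf"):
                f.write(line)

clean()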
You can easily replace your two steps (writing the folder content and then removing the unwanted files) with, for example, a snippet like the one below:
import os

folder = 'PUT_YOUR_FOLDER_PATH'
extensions = ['.pdf', 'PUT_YOUR_OTHER_EXTENSIONS']

with open('test.txt', 'w') as f:
    for file_name in os.listdir(folder):
        if os.path.isfile(os.path.join(folder, file_name)) and not file_name.endswith(tuple(extensions)):
            f.write("%s\n" % file_name)
It will write all filenames of your folder to a file. You just need to put the extensions you don't want into the list. Enjoy!
Note: this only covers the single folder passed to os.listdir(). To include files from subfolders as well, use a recursive walk; a sketch follows below.
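A minimal sketch of that recursive variant, assuming the same folder placeholder, extension list and output file as above, could use os.walk:

import os

folder = 'PUT_YOUR_FOLDER_PATH'
extensions = ['.pdf', 'PUT_YOUR_OTHER_EXTENSIONS']

with open('test.txt', 'w') as f:
    # os.walk descends into every subfolder of folder
    for root, dirs, files in os.walk(folder):
        for file_name in files:
            if not file_name.endswith(tuple(extensions)):
                f.write("%s\n" % file_name)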

Script to find and replace text in a .csv doesn't work with "=$"

My simple find-and-replace Python script should find the "find_str" text and replace it with a blank. It seems to work for any text I enter except the string "=$", for some reason. Can anyone help with why this might be?
import re

# open your csv and read as a text string
with open('new.csv', 'r') as f:
    my_csv_text = f.read()

find_str = '=$'
replace_str = ' '

# substitute
new_csv_str = re.sub(find_str, replace_str, my_csv_text)

# open new file and save
new_csv_path = './my_new_csv.csv'
with open(new_csv_path, 'w') as f:
    f.write(new_csv_str)
$ is a special character in the regex world.
You have different choices:
Escape the $:
find_str = r'=\$'
Or use plain string methods, since there is no variation in your pattern (no re module needed, really):
new_csv_str = my_csv_text.replace(find_str, replace_str)
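If you would rather keep re but treat the search text literally, another option (a sketch, not part of the answer above; it assumes my_csv_text has been read as in the question) is re.escape, which escapes every regex metacharacter in the string for you:

import re

find_str = '=$'
replace_str = ' '

# re.escape turns '=$' into a pattern that matches the characters literally
new_csv_str = re.sub(re.escape(find_str), replace_str, my_csv_text)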

replace new line in a different file with an underscore (without using with)

I posted a similar question yesterday but didn't get the response I wanted because I wasn't specific enough. Basically, the function takes a .txt file as the argument and returns a string with all \n characters replaced with an '_', so everything ends up on one line. I want to do this without using with. I thought I did this correctly, but when I run it and check the file, nothing has changed. Any pointers?
This is what I did:
def one_line(filename):
    wordfile = open(filename)
    text_str = wordfile.read().replace("\n", "_")
    wordfile.close()
    return text_str

one_line("words.txt")
but to no avail. I open the text file and it remains the same.
The contents of the text file are:
I like to eat
pancakes every day
and the output that's supposed to be shown is:
>>> one_line("words.txt")
'I like to eat_pancakes every day_'
The fileinput module in the Python standard library allows you to do this in one fell swoop.
import fileinput

for line in fileinput.input(filename, inplace=True):
    line = line.replace('\n', '_')
    print(line, end='')
The requirement to avoid a with statement is trivial but rather pointless. Anything which looks like
with open(filename) as handle:
    stuff
can simply be rewritten as
handle = open(filename)
try:
    stuff
finally:
    handle.close()
If you take out the try/finally you have a bug which leaves handle open if an error happens. The purpose of the with context manager for open() is to simplify this common use case.
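Applied to the function from the question (a sketch only, keeping the no-with requirement), that pattern looks like this:

def one_line(filename):
    wordfile = open(filename)
    try:
        # replace newlines and return; the file is closed even if read() fails
        return wordfile.read().replace("\n", "_")
    finally:
        wordfile.close()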
You are missing a step: after you obtain the updated string, you need to write it back to the file. Example below, without using with:
def one_line(filename):
    wordfile = open(filename)
    text_str = wordfile.read().replace("\n", "_")
    wordfile.close()
    return text_str

def write_line(s):
    # Open the file in write mode
    wordfile = open("words.txt", 'w')
    # Write the updated string to the file
    wordfile.write(s)
    # Close the file
    wordfile.close()

s = one_line("words.txt")
write_line(s)
Or using with:
with open("words.txt", 'w') as wordfile:
    # Write the updated string to the file
    wordfile.write(s)
With pathlib you could achieve what you want this way:
from pathlib import Path
path = Path(filename)
contents = path.read_text()
contents = contents.replace("\n", "_")
path.write_text(contents)

Delete comments in text file

I am trying to delete comments starting on new lines in a Python code file using Python code and regular expressions. For example, for this input:
first line
#description
hello my friend
I would like to get this output:
first line
hello my friend
Unfortunately this code didn't work for some reason:
import re

with open(input_file, "r+") as f:
    string = re.sub(re.compile(r'\n#.*'), "", f.read())
    f.seek(0)
    f.write(string)
For some reason the output I get is the same as the input.
1) There is no reason to call re.compile unless you save the result. You can always just use the regular expression text.
2) Seeking to the beginning of the file and writing there may cause problems for you if your replacement text is shorter than your original text. It is easier to re-open the file and write the data.
Here is how I would fix your program:
import re

input_file = 'in.txt'

with open(input_file, "r") as f:
    data = f.read()

data = re.sub(r'\n#.*', "", data)

with open(input_file, "w") as f:
    f.write(data)
It doesn't seem right to start the regular expression with \n, and I don't think you need to use re.compile here.
In addition to that, you have to use the re.M flag to make the search multiline.
This will delete all lines that start with # as well as empty lines:
with open(input_file, "r+") as f:
    text = f.read()
    string = re.sub(r'^(#.*)|(\s*)$', '', text, flags=re.M)
    f.seek(0)
    f.write(string)
    f.truncate()

Delete a specific string (not line) from a text file python

I have a text file with two lines:
<BLAHBLAH>483920349<FOOFOO>
<BLAHBLAH>4493<FOOFOO>
That's the only thing in the text file. Using Python, I want to write to the text file so that I can take away BLAHBLAH and FOOFOO from each line. It seems like a simple task, but after brushing up on my file manipulation I can't seem to find a way to do it.
Help is greatly appreciated :)
Thanks!
If it's a text file as you say, and not HTML/XML/something else, just use replace:
for line in infile.readlines():
    cleaned_line = line.replace("BLAHBLAH", "")
    cleaned_line = cleaned_line.replace("FOOFOO", "")
and write cleaned_line to an output file.
f = open(path_to_file, "r+")
new_text = f.read().replace("<BLAHBLAH>", "").replace("<FOOFOO>", "")
f.seek(0)
f.write(new_text)
f.truncate()
f.close()
Update (saving to another file):
f = open(path_to_input_file, "r")
output = open(path_to_output_file, "w")
output.write(f.read().replace("<BLAHBLAH>","").replace("<FOOFOO>",""))
f.close()
output.close()
Consider the regular expressions module re.
result_text = re.sub(r'<(.|\n)*?>', replacement_text, source_text)
The strings within < and > are identified (brackets included). The match is non-greedy, i.e. it will accept the shortest possible substring. For example, if you have "<1> text <2> more text", a greedy pattern would take in "<1> text <2>", but a non-greedy one takes in "<1>" and "<2>".
And of course, your replacement_text would be '' and source_text would be each line from the file.
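Put together for the two-line file from the question, a sketch (the filenames data.txt and cleaned.txt are assumptions, since the question doesn't name the files) could look like this:

import re

with open('data.txt', 'r') as infile, open('cleaned.txt', 'w') as outfile:
    for source_text in infile:
        # strip every <...> tag from the line, non-greedily
        outfile.write(re.sub(r'<(.|\n)*?>', '', source_text))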
