I am trying to delete comment lines (lines that start with #) from a Python code file, using Python and regular expressions. For example, for this input:
first line
#description
hello my friend
I would like to get this output:
first line
hello my friend
Unfortunately this code didn't work for some reason:
with open(input_file,"r+") as f:
    string = re.sub(re.compile(r'\n#.*'),"",f.read())
    f.seek(0)
    f.write(string)
For some reason the output I get is the same as the input.
1) There is no reason to call re.compile unless you save the result. You can always just use the regular expression text.
2) Seeking to the beginning of the file and writing there may cause problems for you if your replacement text is shorter than your original text. It is easier to re-open the file and write the data.
Here is how I would fix your program:
import re
input_file = 'in.txt'
with open(input_file,"r") as f:
    data = f.read()
data = re.sub(r'\n#.*', "", data)
with open(input_file, "w") as f:
    f.write(data)
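If you do want to update the file in place instead of re-opening it, a minimal sketch would be to seek back to the start and then truncate away the leftover bytes:
import re

input_file = 'in.txt'
with open(input_file, "r+") as f:
    data = f.read()
    f.seek(0)                    # go back to the start of the file
    f.write(re.sub(r'\n#.*', "", data))
    f.truncate()                 # drop any leftover bytes from the old, longer content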
It doesn't seem right to start the regular expression with \n, and I don't think you need to use re.compile here.
In addition to that, you have to use the re.M flag so the search works in multiline mode.
This will delete all lines that start with #, as well as empty lines.
with open(input_file, "r+") as f:
    text = f.read()
    string = re.sub(r'^(#.*)|(\s*)$', '', text, flags=re.M)
    f.seek(0)
    f.write(string)
    f.truncate()
My simple find-and-replace Python script should find the "find_str" text and replace it with an empty string. It seems to work for any text I enter except the string "=$", for some reason. Can anyone help with why this might be?
import re
# open your csv and read as a text string
with open('new.csv', 'r') as f:
    my_csv_text = f.read()
find_str = '=$'
replace_str = ' '
# substitute
new_csv_str = re.sub(find_str, replace_str, my_csv_text)
# open new file and save
new_csv_path = './my_new_csv.csv'
with open(new_csv_path, 'w') as f:
    f.write(new_csv_str)
$ is a special character within the regex world.
You have different choices:
Escape the $:
find_str = r'=\$'
Use simple string functions as you do not have any variation in your pattern (no re module needed, really):
new_csv_str = my_csv_text.replace(find_str, replace_str)
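If you would rather stay with re but avoid escaping by hand, another option is re.escape, which escapes the special characters for you (a small sketch, reusing replace_str and my_csv_text from the question):
import re

find_str = '=$'
replace_str = ' '
# re.escape adds the backslash for you, so the $ is treated literally
new_csv_str = re.sub(re.escape(find_str), replace_str, my_csv_text)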
I have a large txt-file and want to extract all strings with these patterns:
/m/meet_the_crr
/m/commune
/m/hann_2
Here is what I tried:
import re
with open("testfile.txt", "r") as text_file:
contents = text_file.read().replace("\n", "")
print(re.match(r'^\/m\/[a-zA-Z0-9_-]+$', contents))
The result I get is a simple "None". What am I doing wrong here?
You need to keep the line endings (do not remove them) and use re.findall with the re.MULTILINE flag so that you get multiple results back from the bigger text:
# write a demo file
with open("t.txt","w") as f:
f.write("""
/m/meet_the_crr\n
/m/commune\n
/m/hann_2\n\n
# your text looks like this after .read().replace(\"\\n\",\"\")\n
/m/meet_the_crr/m/commune/m/hann_2""")
Program:
import re
regex = r"^\/m\/[a-zA-Z0-9_-]+$"
with open("t.txt","r") as f:
contents = f.read()
found_all = re.findall(regex,contents,re.M)
print(found_all)
print("-")
print(open("t.txt").read())
Output:
['/m/meet_the_crr', '/m/commune', '/m/hann_2']
Filecontent:
/m/meet_the_crr
/m/commune
/m/hann_2
# your text looks like this after .read().replace("\n","")
/m/meet_the_crr/m/commune/m/hann_2
This is essentially what Wiktor Stribiżew told you in his comment, although he also suggested a better pattern: r'^/m/[\w-]+$'
There is nothing logically wrong with your code, and in fact your pattern will match the inputs you describe:
result = re.match(r'^\/m\/[a-zA-Z0-9_-]+$', '/m/meet_the_crr')
if result:
    print(result.groups())  # this line is reached, as there is a match
Since you did not specify any capture groups, you will see () being printed to the console. You could capture the entire input, and then it would be available, e.g.
result = re.match(r'(^\/m\/[a-zA-Z0-9_-]+$)', '/m/meet_the_crr')
if result:
    print(result.groups(1)[0])
/m/meet_the_crr
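Alternatively, even without adding a capture group, the whole match is always available as group 0:
import re

result = re.match(r'^\/m\/[a-zA-Z0-9_-]+$', '/m/meet_the_crr')
if result:
    print(result.group(0))  # prints the whole match: /m/meet_the_crr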
You are reading the whole file into a variable (into memory) using .read(). With .replace("\n", ""), you remove all the newlines from the string. The re.match(r'^\/m\/[a-zA-Z0-9_-]+$', contents) call then tries to match a string that entirely matches the \/m\/[a-zA-Z0-9_-]+ pattern, and that is impossible after all the previous manipulations.
There are at least two ways out. Either remove .replace("\n", "") (so that the newlines are kept) and use re.findall(r'^/m/[\w-]+$', contents, re.M) (the re.M option makes ^ and $ match at the start and end of each line rather than of the whole text), or read the file line by line, check each line with your re.match version, and add the matching lines to the final list.
Example:
import re
with open("testfile.txt", "r") as text_file:
contents = text_file.read()
print(re.findall(r'^/m/[\w-]+$', contents, re.M))
Or
import re
with open("testfile.txt", "r") as text_file:
for line in text_file:
if re.match(r'/m/[\w-]+\s*$', line):
print(line.rstrip())
Note I used \w to make the pattern somewhat shorter, but if you are working in Python 3 and only want to match ASCII letters and digits, also pass the re.ASCII option.
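For example, combining both flags on the findall above would look like:
print(re.findall(r'^/m/[\w-]+$', contents, re.M | re.ASCII))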
Also, / is not a special character in Python regex patterns, so there is no need to escape it.
I'm trying to write a Python script that writes regex output (IP addresses) to a text file. The script will rewrite the file each time it runs. Is there a better way to do it?
import re
pattern = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
with open('test.txt', 'r') as rf:
    content = rf.read()
    matches = pattern.findall(content)
open('iters.txt', 'w').close()
for match in matches:
    with open('iters.txt', 'a') as wf:
        wf.write(match + '\n')
I rewrote the code a bit.
I changed the regex to use {3} so you don't have to repeat the same group so many times.
I added an os.path.exists check to see whether the output file already exists. I think this is what you want; if not, just remove that if. If the file already exists, nothing is written.
I combined the two with's, since you are writing to a file and it doesn't make much sense to keep reopening it just to append one line at a time.
I renamed pattern to ip_pattern just for readability's sake; you can change it back if you want.
Code:
import re, os
ip_pattern = re.compile(r'(?:\d{1,3}\.){3}\d{1,3}')
if not os.path.exists("iters.txt"):
    with open('test.txt', 'r') as rf, open('iters.txt', 'a') as wf:
        content = rf.read()
        matches = ip_pattern.findall(content)
        for match in matches:
            wf.write(match + '\n')
Hello, I'm making a Python program that takes in a file. I want its contents to end up as a single string. My current code is:
with open('myfile.txt') as f:
    title = f.readline().strip()
    content = f.readlines()
The text file (simplified) is:
Title of Document
asdfad
adfadadf
adfadaf
adfadfad
I want to strip the title (which my program does) and then make the rest one string. Right now the output is:
['asdfad\n', 'adfadadf\n', etc...]
and I want:
asdfadadfadadf etc...
I am new to python and I have spent some time trying to figure this out but I can't find a solution that works. Any help would be appreciated!
You can do this:
with open('/tmp/test.txt') as f:
    title = next(f)  # read (and skip past) the title line
    data = ''.join(line.rstrip() for line in f)
Use list.pop(0) to remove the first line from content.
Then str.join(iterable). You'll also need to strip off the newlines.
content.pop(0)
done = "".join([l.strip() for l in content])
print(done)
Another option is to read the entire file, then remove the newlines instead of joining together:
with open('somefile') as fin:
    next(fin, None)  # ignore first line
    one_big_string = fin.read().replace('\n', '')
If you want the rest of the file in a single chunk, just call the read() function:
with open('myfile.txt') as f:
    title = f.readline().strip()
    content = f.read()
This will read the file until EOF is encountered, so content holds the rest of the file as one chunk (newlines included).
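If you also want the newlines removed from that chunk (to match the asdfadadfadadf... output you described), you can combine read() with the replace shown above:
with open('myfile.txt') as f:
    title = f.readline().strip()
    content = f.read().replace('\n', '')  # the rest of the file as one string, newlines removed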
I have a text file with two lines:
<BLAHBLAH>483920349<FOOFOO>
<BLAHBLAH>4493<FOOFOO>
That's the only thing in the text file. Using Python, I want to write to the text file so that I can take away BLAHBLAH and FOOFOO from each line. It seems like a simple task, but after brushing up on my file manipulation I can't seem to find a way to do it.
Help is greatly appreciated :)
Thanks!
If it's a text file as you say, and not HTML/XML/something else, just use replace:
for line in infile.readlines():
    cleaned_line = line.replace("BLAHBLAH","")
    cleaned_line = cleaned_line.replace("FOOFOO","")
and write cleaned_line to an output file.
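Put together, a complete version might look something like this (the file names here are just placeholders):
with open("in.txt") as infile, open("out.txt", "w") as outfile:
    for line in infile:
        cleaned_line = line.replace("BLAHBLAH", "").replace("FOOFOO", "")
        outfile.write(cleaned_line)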
f = open(path_to_file, "r+")   # "r+" keeps the existing contents; "w+" would truncate them before reading
text = f.read().replace("<BLAHBLAH>","").replace("<FOOFOO>","")
f.seek(0)
f.write(text)
f.truncate()
f.close()
Update (saving to another file):
f = open(path_to_input_file, "r")
output = open(path_to_output_file, "w")
output.write(f.read().replace("<BLAHBLAH>","").replace("<FOOFOO>",""))
f.close()
output.close()
Consider the regular expressions module re.
result_text = re.sub(r'<(.|\n)*?>', replacement_text, source_text)
The strings within < and > are identified. The match is non-greedy, i.e. it will accept a substring of the least possible length. For example, given "<1> text <2> more text", a greedy match would take in "<1> text <2>", but a non-greedy match takes in "<1>" and "<2>" separately.
And of course, your replacement_text would be '' and source_text would be each line from the file.
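Applied line by line to this question, a small sketch could be (the file names are placeholders; working per line, the simpler <.*?> is enough):
import re

with open("in.txt") as f, open("out.txt", "w") as out:
    for line in f:
        # remove anything between < and >, non-greedily, within the line
        out.write(re.sub(r'<.*?>', '', line))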