Find and replace full words only? - python

I'm completely new to Python and I'm currently trying to write a small script for finding and replacing different inputs in all files of the same type in a folder. It currently looks like this and works so far:
import glob
import os
z = input('filetype? (i.e. *.txt): ')
x = input('search for?: ')
y = input('replace with?: ')
replacements = {x:y}
lines = []
for filename in glob.glob(z):
    with open(filename, 'r') as infile:
        for line in infile:
            for src, target in replacements.items():
                line = line.replace(src, target)
            lines.append(line)
    with open(filename, 'w') as outfile:
        for line in lines:
            outfile.write(line)
    lines = []
My problem is that I want to replace full words only, so when trying to replace '0' with 'x', '4025' should not become '4x25'. I've tried to incorporate regex, but I couldn't make it work.

Instead of
line.replace(src, target)
try
re.sub(r'\b' + re.escape(src) + r'\b', target, line)
after importing re. \b matches a word boundary.
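A minimal sketch of the word-boundary approach, wrapped in a hypothetical helper called replace_whole_word (the name is an assumption, not part of the question). Keep in mind that \b treats letters, digits and the underscore as word characters, so '0' inside '4025' is left alone.

```python
import re

def replace_whole_word(text, src, target):
    """Replace src only where it stands alone as a whole word (hypothetical helper)."""
    pattern = r'\b' + re.escape(src) + r'\b'
    # A function replacement keeps any backslashes in target literal.
    return re.sub(pattern, lambda m: target, text)

print(replace_whole_word('0 and 4025 and 0', '0', 'x'))
```

In the script from the question, the call would take the place of line.replace(src, target).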

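Just use regex (re) and add ^ and $ patterns in your expression. Note that these anchor the match to the start and end of the line, so only a line that consists of nothing but the search text is replaced.
line = re.sub('^'+src+'$', target, line)
Don't forget to import re.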

Related

Using CSV columns to Search and Replace in a text file

Background
I have a two column CSV file like this:
Find    Replace
is      was
A       one
b       two
etc.
First column is text to find and second is text to replace.
I have second file with some text like this:
"This is A paragraph in a text file." (Please note the case sensitivity)
My requirement:
I want to use that csv file to search and replace in the text file with three conditions:-
whole word replacement.
case sensitive replacement.
Replace all instances of each entry in CSV
Script tried:
import csv
import re
with open('CSV_file.csv', mode='r') as infile:
    reader = csv.reader(infile)
    mydict = {(r'\b' + rows[0] + r'\b'): (r'\b' + rows[1] + r'\b') for rows in reader}  # <-- Requires Attention
    print(mydict)
with open('find.txt') as infile, open(r'resul_out.txt', 'w') as outfile:
    for line in infile:
        for src, target in mydict.items():
            line = re.sub(src, target, line)  # <-- Requires Attention
            # line = line.replace(src, target)
        outfile.write(line)
Description of script
I have loaded my csv into a python dictionary and use regex to find whole words.
Problems
I used r'\b' to make a word boundary for whole-word replacement, but the output shows "\\b" in the dictionary instead of '\b'.
Using the replace function instead gives:
"Thwas was one paragraph in a text file."
Secondly, I don't know how to make the replacement case-sensitive in a regex pattern.
Does anyone know a better solution than this script, or a way to improve it?
Thanks for any help.
Here's a more cumbersome approach (more code) but which is easier to read and does not rely on regular expressions. In fact, given the very simple nature of your CSV control file, I wouldn't normally bother using the csv module at all:-
import csv
with open('temp.csv', newline='') as c:
    reader = csv.DictReader(c, delimiter=' ')
    D = {}
    for row in reader:
        D[row['Find']] = row['Replace']
with open('input.txt', newline='') as infile:
    with open('output.txt', 'w') as outfile:
        for line in infile:
            tokens = line.split()
            for i, t in enumerate(tokens):
                if t in D:
                    tokens[i] = D[t]
            outfile.write(' '.join(tokens)+'\n')
I'd just put pure strings into mydict so it looks like
{'is': 'was', 'A': 'one', ...}
and replace this line:
# line = re.sub(src, target, line) # old
line = re.sub(r'\b' + src + r'\b', target, line) # new
Note that you don't need \b in the replacement pattern. Regarding your other questions,
regular expressions are case-sensitive by default,
changing '\b' to '\\b' is exactly what the r'' does. You can omit the r and write '\\b', but that quickly gets ugly with more complex regexes.
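As a rough sketch of the plain-strings idea, the whole dictionary can also be applied in a single pass, which avoids one replacement being rewritten by a later one. The helper name replace_words and the in-memory mapping are assumptions for illustration; in the real script the mapping would come from the CSV file.

```python
import re

def replace_words(text, mapping):
    """Whole-word, case-sensitive replacement of every key in mapping (hypothetical helper)."""
    # Longer keys first, so a shorter key never shadows a longer one.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in keys) + r')\b')
    # Look the matched word up in the mapping; text is scanned only once.
    return pattern.sub(lambda m: mapping[m.group(0)], text)

mapping = {'is': 'was', 'A': 'one', 'b': 'two'}
print(replace_words('This is A paragraph in a text file.', mapping))
```

Because regular expressions are case-sensitive by default, the lowercase "a" in the sample sentence is left untouched.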

Is there any shortcut in Python to remove all blanks at the end of each line in a file?

I've learned that we can easily remove blank lines in a file or remove blanks from a single string, but how about removing all blanks at the end of each line in a file?
One way would be to process each line of the file, like:
with open(file) as f:
    for line in f:
        stripped = line.strip()  # store the stripped line
Is this the only way to complete the task?
Possibly the ugliest implementation possible, but here's what I just scratched up :0
def strip_str(string):
    last_ind = 0
    split_string = string.split(' ')
    for ind, word in enumerate(split_string):
        if word == '\n':
            return ''.join([split_string[0]] + [ ' {} '.format(x) for x in split_string[1:last_ind]])
        last_ind += 1
Don't know if these count as different ways of accomplishing the task. The first is really just a variation on what you have. The second does the whole file at once, rather than line-by-line.
Map that calls the 'rstrip' method on each line of the file.
import operator
with open(filename) as f:
    # basically the same as (line.rstrip() for line in f)
    for line in map(operator.methodcaller('rstrip'), f):
        pass  # do something with the line
Read the whole file and use re.sub():
import re
with open(filename) as f:
    text = f.read()
text = re.sub(r"\s+(?=\n)", "", text)
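For comparison, here is a small sketch of both variants as functions. The function names and the sample string are assumptions. The whole-text version uses [ \t]+$ with re.MULTILINE rather than \s+, because \s also matches newlines and would therefore swallow blank lines as well.

```python
import re

def strip_trailing_per_line(lines):
    """Line-by-line variant: rstrip() also drops the trailing newline."""
    return [line.rstrip() for line in lines]

def strip_trailing_in_text(text):
    """Whole-text variant: remove spaces and tabs at the end of every line."""
    # With re.MULTILINE, $ matches just before each newline and at the end of the text.
    return re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE)

sample = 'first  \n\n  second\t \nthird'
print(repr(strip_trailing_in_text(sample)))
```

Leading whitespace and empty lines are preserved by the second function; only the trailing blanks on each line go away.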
If you just want to remove spaces, another solution would be:
line.replace(" ", "")
This is good for removing white space, but note that it removes every space in the line, not only the trailing ones.

How do I find a token in a file and then count the number of spaces that precede the token?

Using python, I am trying to search a file for a token, and then count the number of white spaces which precede that token to the start of the line.
So if the file is like this:
<index>
   <scm>
   </scm>
</index>
I want to find the number of spaces which precede <scm>
The solution for a single-line mode:
import itertools
with open('yourfile.txt', 'r') as f:
    txt = f.read()
print(len(list(itertools.takewhile(lambda c: c.isspace(), txt[txt.index('<scm>')-1::-1]))))
The output:
5
txt[txt.index('<scm>')-1::-1] - "reversed" slice from the position of string <scm> to the beginning of the text
itertools.takewhile(func, iterable) - accumulates characters from the input string (iterable) for as long as each character is whitespace (c.isspace()), and stops at the first one that is not
If you meant just the single-line case, this would get you the preceding spaces for that line:
import re
def get_preceeding_spaces(file_name, tag):
    with open(file_name, 'r') as f:
        for line in f.readlines():
            if tag in line:
                prefix = line.split(tag)[0]
                # only count the prefix if it consists purely of whitespace
                if re.fullmatch(r'\s*', prefix):
                    return len(prefix)
print(get_preceeding_spaces('test.html', '<scm>'))
returns for your file:
3
You could use a regular expression. The number of spaces would be:
import re
with open('input.txt') as f_input:
    r = re.search('( +)' + re.escape("<scm>"), f_input.read(), re.S)
print(len(r.groups()[0]))
Which would be 3. Or the number of whitespace characters:
with open('input.txt') as f_input:
    r = re.search(r'(\s+)' + re.escape("<scm>"), f_input.read(), re.S)
print(len(r.groups()[0]))
Which would be 5.
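A regex-free sketch of the same count, for reference. The helper name count_leading_spaces and the in-memory sample (three spaces of indentation, as the answers above report) are assumptions; with a real file you would pass the open file object instead of the list.

```python
def count_leading_spaces(lines, token):
    """Return the number of spaces before token on the first line containing it."""
    for line in lines:
        if token in line:
            prefix = line[:line.index(token)]
            # Only count the prefix when it is made of spaces and nothing else.
            if prefix.strip(' ') == '':
                return len(prefix)
    return None  # token not found, or preceded by something other than spaces

sample = ['<index>\n', '   <scm>\n', '   </scm>\n', '</index>\n']
print(count_leading_spaces(sample, '<scm>'))
```

Working line by line means the newline of the previous line is never counted, which is why this reports spaces only.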

Extracting certain strings from a file using python

I have a file with some lines. Out of those lines I will choose only the lines that start with xxx. The lines that start with xxx follow this pattern:
xxx:(12:"pqrs",223,"rst",-90)
xxx:(23:"abc",111,"def",-80)
I want to extract only the string inside the first pair of double quotes,
i.e., "pqrs" and "abc".
Any help using regex is appreciated.
My code is as follows:
with open("log.txt","r") as f:
         f = f.readlines()
    for line in f:
        line=line.rstrip()
        for phrase in 'xxx:':
            if re.match('^xxx:',line):
                c=line
                break
This code gives me an error.
Your code is wrongly indented. Your f = f.readlines() has 9 spaces in front while for line in f: has 4 spaces. It should look like below.
import re
list_of_prefixes = ["xxx","aaa"]
resulting_list = []
with open("raw.txt","r") as f:
    f = f.readlines()
    for line in f:
        line=line.rstrip()
        for phrase in list_of_prefixes:
            # raw strings keep the backslashes in the pattern intact
            if re.match(phrase + r':\(\d+:"(\w+)',line) is not None:
                resulting_list.append(re.findall(phrase + r':\(\d+:"(\w+)',line)[0])
Well you are heading in the right direction.
If the input is this simple, you can use regex groups.
import re
with open("log.txt","r") as f:
    f = f.readlines()
    for line in f:
        line=line.rstrip()
        m = re.match(r'^xxx:\(\d*:("[^"]*")',line)
        if m is not None:
            print(m.group(1))
All the magic is in the regular expression.
^xxx:\(\d*:("[^"]*") means
Start from the beginning of the line, match on "xxx:(<any number of digits>:"<anything but ">"
and because the sequence "<anything but ">" is enclosed in round brackets it will be available as a group (by calling m.group(1)).
PS: next time make sure to include the exact error you are getting
results = []
with open("log.txt","r") as f:
    f = f.readlines()
    for line in f:
        if line.startswith("xxx"):
            line = line.split(":") # line[2] is what follows the second :
            result = line[2].split(",")[0][1:-1] # will be pqrs
            results.append(result)
You want to look for lines that start with xxx,
then split the line on the :. What you want comes right after the second : (the first one follows xxx), up to the comma. Your result is that string with the quotes removed. There is no need for regex; Python string functions will be fine.
To check if a line starts with xxx do
line.startswith('xxx')
To find the text in first double-quotes do
re.search(r'"(.*?)"', line).group(1)
(as match.group(1) is the first parenthesized subgroup)
So the code will be
import re
with open("file") as f:
    for line in f:
        if line.startswith('xxx'):
            print(re.search(r'"(.*?)"', line).group(1))
re module docs
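Putting the startswith check and the first-quoted-string search together, a minimal sketch could look like this. The helper name first_quoted_strings and the sample lines are assumptions; the 'yyy' line is only there to show that other prefixes are skipped.

```python
import re

# Text between the first pair of double quotes on a line.
FIRST_QUOTED = re.compile(r'"([^"]*)"')

def first_quoted_strings(lines, prefix='xxx:'):
    """Collect the first double-quoted string from each line starting with prefix."""
    results = []
    for line in lines:
        if line.startswith(prefix):
            m = FIRST_QUOTED.search(line)
            if m:  # skip matching lines that contain no quoted string
                results.append(m.group(1))
    return results

sample = [
    'xxx:(12:"pqrs",223,"rst",-90)',
    'yyy:(1:"skip",0,"no",0)',
    'xxx:(23:"abc",111,"def",-80)',
]
print(first_quoted_strings(sample))
```

Checking the match object before calling group(1) avoids an AttributeError on a line that starts with xxx but has no quotes.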

How to replace a character in some specific word in a text file using python

I have a task to replace "O" (capital O) with "0" in a text file using python. One condition is that I have to preserve other words like Over, NATO, etc. I only have to replace words like 9OO with 900, 2OO6 with 2006, and so on. I have tried a lot but have not been successful yet. My code is given below. Any help is appreciated. Thanks in advance.
import re
srcpatt = 'O'
rplpatt = '0'
cre = re.compile(srcpatt)
with open('myfile.txt', 'r') as file:
    content = file.read()
wordlist = re.findall(r'(\d+O|O\d+)',str(content))
print(wordlist)
for word in wordlist:
    subcontent = cre.sub(rplpatt, word)
    newrep = re.compile(word)
    newcontent = newrep.sub(subcontent,content)
with open('myfile.txt', 'w') as file:
    file.write(newcontent)
print('"',srcpatt,'" is successfully replaced by "',rplpatt,'"')
re.sub can take in a replacement function, so we can pare this down pretty nicely:
import re
with open('myfile.txt', 'r') as file:
    content = file.read()
with open('myfile.txt', 'w') as file:
    file.write(re.sub(r'\d+[\dO]+|[\dO]+\d+', lambda m: m.group().replace('O', '0'), content))
import re
srcpatt = 'O'
rplpatt = '0'
cre = re.compile(srcpatt)
reg = r'\b(\d*)O(O*\d*)\b'
with open('input', 'r') as f:
    for line in f:
        # re.search (not re.match) so that matches anywhere in the line are found
        while re.search(reg, line):
            line = re.sub(reg, r'\g<1>0\2', line)
        print(line, end='')
print('"',srcpatt,'" is successfully replaced by "',rplpatt,'"')
You can probably get away with matching just a leading digit followed by O. This won't handle OO7, but it will work nicely with 8O8O, for example, which none of the answers here that match the trailing digits will handle. If you want to cover leading O's as well, you need to use a lookahead match.
re.sub(r'(\d)(O+)', lambda m: m.groups()[0] + '0'*len(m.groups()[1]), content)
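One possible way to cover leading, trailing and interleaved O's at once, sketched under the assumption that a "number-like" word is any whole word made only of digits and capital O's with at least one real digit. The helper name fix_zeros and the sample sentence are made up for illustration.

```python
import re

# A whole word built only from digits and capital O, containing at least one digit.
NUMBER_LIKE = re.compile(r'\b[\dO]*\d[\dO]*\b')

def fix_zeros(text):
    """Replace capital O with 0 inside number-like words only (hypothetical helper)."""
    return NUMBER_LIKE.sub(lambda m: m.group().replace('O', '0'), text)

print(fix_zeros('Over 9OO NATO troops in 2OO6, agent OO7, port 8O8O'))
```

Words such as Over and NATO contain other letters, so the word boundaries keep them from matching. A short word like O3 would be rewritten too, so this is only suitable when such tokens cannot occur in the input.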
