Python - How to match and replace words from a given string?

I have a large list of substrings and one input string. Wherever an item from the list occurs in the input string, it should be replaced by a given value.
I tried the following, but it returns the wrong result:
#!/bin/python
arr=['www.', 'http://', '.com', 'many many many....']
def str_replace(arr, replaceby, original):
    temp = ''
    for n,i in enumerate(arr):
        temp = original.replace(i, replaceby)
    return temp
main ='www.google.com'
main1='www.a.b.c.company.google.co.uk.com'
print(str_replace(arr,'',main))
Output:
www.google
Expected:
google

You are deriving temp from the original every time, so only the last element of arr will be replaced in the temp that is returned. Try this instead:
def str_replace(arr, replaceby, original):
    temp = original
    for n,i in enumerate(arr):
        temp = temp.replace(i, replaceby)
    return temp

You don't even need temp (assuming the above code is the whole function):
def str_replace(search, replace, subject):
    for s in search:
        subject = subject.replace(s, replace)
    return subject
Another (probably more efficient) option is to use regular expressions:
import re
def str_replace(search, replace, subject):
    search = '|'.join(map(re.escape, search))
    return re.sub(search, replace, subject)
Do note that these functions may produce different results if replace contains substrings from search.
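To make that caveat concrete, here is a small sketch in which the loop version re-replaces text it has just inserted, while the regex version does not. The function names replace_sequential and replace_single_pass and the sample inputs are made up for illustration:

```python
import re

def replace_sequential(search, replace, subject):
    # One str.replace() per search item; later items also see text
    # that earlier replacements inserted.
    for s in search:
        subject = subject.replace(s, replace)
    return subject

def replace_single_pass(search, replace, subject):
    # One regex alternation; inserted text is never scanned again.
    pattern = '|'.join(map(re.escape, search))
    return re.sub(pattern, replace, subject)

search = ['x', 'yy']
print(replace_sequential(search, 'y', 'xy'))   # y
print(replace_single_pass(search, 'y', 'xy'))  # yy
```

In the loop version, 'x' becomes 'y', which produces 'yy', and that is then matched by the second search item. The regex version only ever looks at the original text.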

temp = original.replace(i, replaceby)
It should be
temp = temp.replace(i, replaceby)
You're throwing away the previous substitutions.

Simple way :)
arr=['www.', 'http://', '.com', 'many many many....']
main ='http://www.google.com'
for item in arr:
    main = main.replace(item,'')
print(main)

Related

String formatting "in-place"

I have strings (list of str) containing placeholders {} and want to include variable values into those placeholders. One example of such a string could be 'test_variable = {}'.
I need to find the index within the list I want to deal with and replace the {} as described.
Currently, the code looks like this with find_occurrence_in_str_list() being a simple function that returns the first occurrence of the search string in the list:
def find_occurrence_in_str_list(lines, findstr, start_index=0):
    for i in range(start_index, len(lines)):
        if findstr in lines[i]:
            return i
# Examples
variable_value = 10
strlist = ['example = {}', 'test_variable = {}']
# Code in question
index = find_occurrence_in_str_list(strlist, 'test_variable')
strlist[index] = strlist[index].format(variable_value)
This is totally fine. However, since I have a lot of such replacements, better readability, ideally a one-liner, would be preferable to the last two lines. So far I have only come up with this, which calls the search function twice and is not really more readable:
strlist[find_occurrence_in_str_list(strlist, 'test_variable')] = strlist[find_occurrence_in_str_list(strlist, 'test_variable')].format(variable_value)
Is there any way of formatting a string in-place instead of just returning the new string and having to replace it manually?
You could use a combination of str.replace() and your function:
def find_occurrence_in_str_list(lines, findstr, value, start_index=0):
    for i in range(start_index, len(lines)):
        if findstr in lines[i]:
            lines[i] = lines[i].replace('{}', str(value))
    return lines
# Examples
variable_value = 10
strlist = ['example = {}', 'test_variable = {}']
# Code in question
strlist = find_occurrence_in_str_list(strlist, 'test_variable', variable_value)
Note: This will replace every {} in the string
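Since strings are immutable, the closest thing to in-place formatting is a helper that writes the result back into the list. A minimal sketch, assuming a hypothetical helper name format_first_match and assuming that only the first matching line should be formatted:

```python
def format_first_match(lines, findstr, *args, **kwargs):
    """Format the first line containing findstr and store it back in lines.

    Returns the index of the formatted line, or -1 if nothing matched.
    (Hypothetical helper, not part of the original code.)
    """
    for i, line in enumerate(lines):
        if findstr in line:
            lines[i] = line.format(*args, **kwargs)
            return i
    return -1

variable_value = 10
strlist = ['example = {}', 'test_variable = {}']
format_first_match(strlist, 'test_variable', variable_value)
print(strlist)  # ['example = {}', 'test_variable = 10']
```

Each replacement then becomes a single call, and the list is searched only once per call.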

python how can string .replace() only calc the new value if the old value is found in the string

I have a code similar to this:
s = some_template
s = s.replace('<HOLDER1>',data1)
s = s.replace('<HOLDER2>',data2)
s = s.replace('<HOLDER3>',data3)
... #(about 30 similar lines)
Here data1, data2, etc. are often calls to a function or complex expressions that may take a while to compute. For example:
s = some_template
s = s.replace('<HOLDER4>',long_func4(a,b,'some_flag') if c==1 else '')
s = s.replace('<HOLDER5>',long_func5(d,e).replace('.',''))
s = s.replace('<HOLDER6>',self.attr6)
s = s.replace('<HOLDER7>',f'{self.name}_{get_cur_month()}')
... #(about 30 similar lines)
To save runtime, I want str.replace() to compute the new value only if the old value is found in the string. This can be achieved with:
if '<HOLDER1>' in s:
    s = s.replace('<HOLDER1>',data1)
if '<HOLDER2>' in s:
    s = s.replace('<HOLDER2>',data2)
if '<HOLDER3>' in s:
    s = s.replace('<HOLDER3>',data3)
...
But I don't like this solution, because it doubles the number of lines of code, which gets really messy, and it also searches s for the old value twice per placeholder.
Any ideas?
Thanks!
str is immutable. You can't change it in place; you can only create a new instance.
You could do something like:
def replace_many(replacements, s):
    for pattern, replacement in replacements:
        s = s.replace(pattern, replacement)
    return s
without_replacements = 'this_will_be_replaced, will it?'
replacements = [('this_will_be_replaced', 'by_this')]
with_replacements = replace_many(replacements, without_replacements)
You can easily make it lazy:
def replace_many_lazy(replacements, s):
    for pattern, replacement_func in replacements:
        if pattern in s:
            s = s.replace(pattern, replacement_func())
    return s
without_replacements = 'this_will_be_replaced, will it?'
replacements = [('this_will_be_replaced', lambda: 'by_this')]
with_replacements = replace_many_lazy(replacements, without_replacements)
...now you don't do the expensive computation unless necessary.
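If searching the string twice per placeholder is also a concern, a different technique is possible: re.sub() with a function as the replacement, so the template is scanned once and a value is computed only when its placeholder actually matches. This is only a sketch; the name replace_lazy_single_pass, the dict-of-callables layout and the sample data are assumptions, not part of the code above:

```python
import re

def replace_lazy_single_pass(replacements, s):
    """replacements maps each placeholder to a zero-argument callable."""
    if not replacements:
        return s
    # Longest placeholders first, so that a placeholder which is a prefix
    # of another cannot shadow it in the alternation.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    cache = {}

    def repl(match):
        key = match.group(0)
        if key not in cache:
            # computed at most once per placeholder, and only if it occurs
            cache[key] = replacements[key]()
        return cache[key]

    return pattern.sub(repl, s)

calls = []

def expensive():
    calls.append('expensive')
    return 'VALUE'

template = 'a=<HOLDER1>, b=<HOLDER1>'
result = replace_lazy_single_pass(
    {'<HOLDER1>': expensive, '<HOLDER2>': lambda: 'unused'}, template)
print(result)  # a=VALUE, b=VALUE
print(calls)   # ['expensive']
```

Unlike the chained str.replace() calls, text inserted by one replacement is not scanned again for other placeholders, so the two approaches can differ if a replacement value itself contains a placeholder.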

python how to increment vars in regex replacements

I want to replace multiple patterns in a file with regex.
This is my (working) code so far:
import re
with open('test.txt', "r") as fp:
    text = fp.read()
result = re.sub(r'pattern', 'replacement', str)
result2 = re.sub(r'anotherpattern', 'anotherreplacement2', result)
...
with open('results.txt', 'w') as fp:
    fp.write(result_x)
This works, but it seems inelegant to increment the variable names manually on every new line. How can I do this better? I think it must be a for loop, but how?
You do not need the previous result once you have used it. You can store the new result in the same variable:
text = re.sub(r'pattern1', 'replacement1', text) # pass text here, not str: str is the built-in string type
text = re.sub(r'pattern2', 'replacement2', text)
You can also have a list of patterns and replacements and loop through it:
to_replace = [('pattern1', 'replacement1'), ('pattern2', 'replacement2')]
for pattern,replacement in to_replace:
    text = re.sub(pattern, replacement, text)
Or in an even more Pythonic way:
to_replace = [('pattern1', 'replacement1'), ('pattern2', 'replacement2')]
for pr in to_replace:
    text = re.sub(*pr, string=text)
I don't know Python too well, but I think if you want to combine the patterns,
you could do it in a single pass using a callback.
Example:
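# sr1, sr2 and sr3 are assumed to hold the replacement string for each group
def repl(m):
    # a group that did not take part in the match is None, not ''
    if m.group(1) is not None:
        return sr1
    if m.group(2) is not None:
        return sr2
    if m.group(3) is not None:
        return sr3
    return m.group(0)
print(re.sub('(stuff1)|(stuff2)|(stuff3)', repl, text))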
The test could also be done in a loop inside the callback: keep a variable holding the fixed number of patterns and loop over the groups of the match object.
This needs a replacement list with the same size and order as the groups in the regex.
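A minimal sketch of that idea follows. Instead of an explicit loop over the groups it uses m.lastindex, which holds the number of the group that matched. The pattern and replacement lists are hypothetical sample data, and the sketch assumes the individual patterns contain no capturing groups of their own:

```python
import re

# Hypothetical patterns and replacements; both lists must be in the same order.
patterns = ['stuff1', 'stuff2', 'stuff3']
replacements = ['sr1', 'sr2', 'sr3']

# One capturing group per pattern. If a pattern contained its own capturing
# group, the group numbers would shift and this lookup would break.
combined = re.compile('|'.join('({})'.format(p) for p in patterns))

def repl(m):
    # m.lastindex is the 1-based number of the group that matched
    return replacements[m.lastindex - 1]

text = 'stuff2, stuff1 and stuff3'
print(combined.sub(repl, text))  # sr2, sr1 and sr3
```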
How much of a performance increase will this give you?
Doing this in a single pass means the text is scanned once instead of once per pattern, so the saving grows with the number of patterns.
Note that it is almost an error to re-examine the same text over and over again. Imagine searching the Library of Congress one word at a time, starting from the beginning each time. How long would that take?

Python: Split between two characters

Let's say I have a ton of HTML with no newlines. I want to get each element into a list.
input = "<head><title>Example Title</title></head>"
a_list = ["<head>", "<title>Example Title</title>", "</head>"]
Something like that, splitting between each ><.
But I don't know of a way to do that in Python. I can only split at that string, which removes it from the output. I want to keep it and split between the two angle brackets.
How can this be done?
Edit: Preferably, this would be done without adding the characters back in to the ends of each list item.
# initial input
a = "<head><title>Example Title</title></head>"
# split list
b = a.split('><')
# remove extra character from first and last elements
# because the split only removes >< pairs.
b[0] = b[0][1:]
b[-1] = b[-1][:-1]
# initialize new list
a_list = []
# fill new list with formatted elements
for i in range(len(b)):
    a_list.append('<{}>'.format(b[i]))
This will output the given list in python 2.7.2, but it should work in python 3 as well.
You can try this:
import re
a = "<head><title>Example Title</title></head>"
data = re.split("><", a)
new_data = [data[0]+">"]+["<" + i+">" for i in data[1:-1]] + ["<"+data[-1]]
Output:
['<head>', '<title>Example Title</title>', '</head>']
The shortest approach uses re.findall(), shown here on an extended example:
import re
# extended html string
s = "<head><title>Example Title</title></head><body>hello, <b>Python</b></body>"
result = re.findall(r'(<[^>]+>[^<>]+</[^>]+>|<[^>]+>)', s)
print(result)
The output (note that the bare text "hello, " is not captured by this pattern):
['<head>', '<title>Example Title</title>', '</head>', '<body>', '<b>Python</b>', '</body>']
Based on the answers by other people, I made this.
It isn't as clean as I had wanted, but it seems to work. I had originally wanted to not re-add the characters after split.
Here, I got rid of one extra argument by combining the two characters into a string. Anyways,
def split_between(string, chars):
    if len(chars) != 2: raise IndexError("Argument chars must contain two characters.")
    result_list = [chars[1] + line + chars[0] for line in string.split(chars)]
    result_list[0] = result_list[0][1:]
    result_list[-1] = result_list[-1][:-1]
    return result_list
Credit goes to @cforeman and @Ajax1234.
Or even simpler, this:
input = "<head><title>Example Title</title></head>"
print(['<'+elem if elem[0]!='<' else elem for elem in [elem+'>' if elem[-1]!='>' else elem for elem in input.split('><') ]])
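For the preference stated in the question's edit (no re-adding of characters), one more option is to split at the empty position between > and < using lookbehind and lookahead assertions. This is a sketch of a technique not used in the answers above, and it relies on re.split() accepting patterns that match an empty string, which is the case from Python 3.7 onwards:

```python
import re

html = "<head><title>Example Title</title></head>"
# Split at every position that is preceded by '>' and followed by '<'.
# Nothing is consumed, so no characters have to be added back afterwards.
parts = re.split(r'(?<=>)(?=<)', html)
print(parts)  # ['<head>', '<title>Example Title</title>', '</head>']
```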

Matching in Python lists when there are extra characters

I am trying to write Python code to match items from two lists.
One tab-delimited file looks like this:
COPB2
KLMND7
BLCA8
while the other file2 has a long list of similar looking "names", if you will. There should be some identical matches in the file, which I have succeeded in identifying and writing out to a new file. The problem is when there are additional characters at the end of one of the "names". For example, COPB2 from above should match COPB2A in file2, but it does not. Similarly KLMND7 should match KLMND79. Should I use regular expressions? Make them into strings? Any ideas are helpful, thank you!
What I have worked on so far, after the first response seen below:
with open(in_file1, "r") as names:
    for line in names:
        file1_list = [i.strip() for i in line.split()]
        file1_str = str(file1_list)
with open(in_file2, "r") as symbols:
    for line in symbols:
        items = line.split("\t")
        items = str(items)
        matches = items.startswith(file1_str)
        print(matches)
This code returns False when I know there should be some matches.
Use str.startswith(). There is no need for regex if the only difference is trailing characters:
>>> g = "COPB2A"
>>> f = "COPB2"
>>> g.startswith(f)
True
Here is a working piece of code:
file1_list = []
with open(in_file1, "r") as names:
    for line in names:
        line_items = line.split()
        for item in line_items:
            file1_list.append(item)
matches = []
with open(in_file2, "r") as symbols:
    for line in symbols:
        file2_items = line.split()
        for file2_item in file2_items:
            for file1_item in file1_list:
                if file2_item.startswith(file1_item):
                    matches.append(file2_item)
                    print(file2_item)
print(matches)
It may be quite slow for large files. If it's unacceptable, I could try to think about how to optimize it.
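One possible simplification, offered as a sketch rather than a measured optimization: str.startswith() also accepts a tuple of prefixes, which removes the innermost Python loop. The helper name find_prefix_matches and the sample lists (including the non-matching 'XYZ1') are made up for illustration:

```python
def find_prefix_matches(prefixes, candidates):
    """Return the candidates that start with any of the given prefixes."""
    prefix_tuple = tuple(prefixes)  # str.startswith() accepts a tuple
    return [c for c in candidates if c.startswith(prefix_tuple)]

file1_list = ['COPB2', 'KLMND7', 'BLCA8']
file2_list = ['COPB2A', 'KLMND79', 'XYZ1', 'BLCA8']
print(find_prefix_matches(file1_list, file2_list))
# ['COPB2A', 'KLMND79', 'BLCA8']
```

Unlike the nested loops above, each candidate is reported at most once, even if several prefixes match it.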
You might take a look at difflib if you need a more generic solution. Keep in mind it is a big import with lots of overhead so only use it if you really need to. Here is another question that is somewhat similar.
https://stackoverflow.com/questions/1209800/difference-between-two-strings-in-python-php
Assuming you loaded the files into lists X, Y.
## match if a or b is equal to, or a substring of, the other (case-sensitive)
def Match(a, b):
    return a in b or b in a
common_words = {}
for a in X:
    common_words[a] = []
    for b in Y:
        if Match(a, b):
            common_words[a].append(b)
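If you want to use regular expressions for the matching, anchor the pattern at the start of the string. The "^" operator does this; re.match() already anchors at the start, so "^" is optional there. Escape the prefix so that its characters are matched literally.
import re
def MatchRe(a, b):
    # make sure the longer string is in 'a'
    if len(a) < len(b):
        a, b = b, a
    # re.escape() keeps characters such as '.' or '+' in b from acting as regex syntax
    q = re.match(re.escape(b), a)
    if not q:
        return False  # no match
    return True  # access q.group(0) for the matched prefix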
