Replace string from a huge python dictionary

Replace string from a huge python dictionary - python

I have a dictionary like this:
id_dict = {'C1001': 'John','D205': 'Ben','501': 'Rose'}
This dictionary has more than 10000 keys and values. I have to search for the key from a report which has nearly 500 words and replace with values.
I have to process thousands of reports within a few minutes, so speed and memory are really important for me.
This is the code I am using now:
str = "strings in the reports"
for key, value in id_dict.iteritems():
str = str.replace(key, value)
Is there any better solution than this?

Using str.replace in a loop is very inefficient. A few arguments:
when the word is replaced, a new string is allocated and the old one is discarded. If you have a lot of words, it can take ages
str.replace would replace inside of words, probably not what you want: ex: replace "nut" by "eel" changes "donut" to "doeel".
if there are a lot of words in your replacement dictionary, you loop through all of them (using a python loop, rather slow), even if the text doesn't contain any one of them.
I would use re.sub with a replacement function (as a lambda), matching a word-boundary alphanumeric string (letters or digits).
The lambda would lookup in the dictionary and return the word if found, else return the original word, replacing nothing, but since everything is done in the re module, it executes way faster.
import re
id_dict = {'C1001': 'John','D205': 'Ben','501': 'Rose'}
s = "Hello C1001, My name is D205, not X501"
result = re.sub(r"\b(\w+)\b",lambda m : id_dict.get(m.group(1),m.group(1)),s)
print(result)
prints:
Hello John, My name is Ben, not X501
(note that the last word was left unreplaced because it's only a partial match)

Related

Replace Items with regex (kind of)

I am handed a bunch of data and trying to get rid of certain characters. The data contains multiple instances of "^{number}" → "^0", "^1", "^2", etc.
I am trying to set all of these instances to an empty string, "", is there a better way to do this than
string.replace("^0", "").replace("^1", "").replace("^2", "")
I understand you can use a dictionary, but it seems a little overkill considering each item will be replaced with "".

I understand that the digits are always at the end of the string, have a look at the solutions below.
with regex:
import re
text = 'xyz125'
s = re.sub("\d+$",'', text)
print(s)
it should print:
'xyz'
without regex, keep in mind that this solution removes all digits and not only the ones at the end of a string:
text = 'xyz125'
result = ''.join(i for i in text if not i.isdigit())
print(result)
it should print:
'xyz'

Case-sensitive string word replacement from dictionary in Python

I have a dictionary full of key-value pairs where the key is a word I want to search for in a string and the value is what I want to replace it with. It needs to be able to preserve case as well. I'm stumbling over the logic in this circumstance.
I was thinking it would work to split the string up into a list of words, but I'm not sure if this would be the simplest way.
dict = {'my':'your', 'dog':'cat'}
string = 'My dog is named Jeffrey.'
I'd like to substitute the values in for the keys in the string, but maintain case and punctuation.

You may use the re.sub to make a substitution case insensitive. What is really difficult is to know what letter is capital because we may have words with different sizes, so I applied the rule that the first letter in the phrase has to be capital using capitalize method.
import re
dict = {'my':'your', 'dog':'cat'}
inputString = 'My dog is named Jeffrey.'
for key in dict.keys():
inputString = re.sub("(?i)"+key,dict[key],inputString)
inputString = inputString.capitalize()
print(inputString)

Python re.sub() is not replacing every match

I'm using Python 3 and I have two strings: abbcabb and abca. I want to remove every double occurrence of a single character. For example:
abbcabb should give c and abca should give bc.
I've tried the following regex (here):
(.)(.*?)\1
But, it gives wrong output for first string. Also, when I tried another one (here):
(.)(.*?)*?\1
But, this one again gives wrong output. What's going wrong here?
The python code is a print statement:
print(re.sub(r'(.)(.*?)\1', '\g<2>', s)) # s is the string

It can be solved without regular expression, like below
>>>''.join([i for i in s1 if s1.count(i) == 1])
'bc'
>>>''.join([i for i in s if s.count(i) == 1])
'c'

re.sub() doesn't perform overlapping replacements. After it replaces the first match, it starts looking after the end of the match. So when you perform the replacement on
abbcabb
it first replaces abbca with bbc. Then it replaces bb with an empty string. It doesn't go back and look for another match in bbc.
If you want that, you need to write your own loop.
while True:
newS = re.sub(r'(.)(.*?)\1', r'\g<2>', s)
if newS == s:
break
s = newS
print(newS)
DEMO

Regular expressions doesn't seem to be the ideal solution
they don't handle overlapping so it it needs a loop (like in this answer) and it creates strings over and over (performance suffers)
they're overkill here, we just need to count the characters
I like this answer, but using count repeatedly in a list comprehension loops over all elements each time.
It can be solved without regular expression and without O(n**2) complexity, only O(n) using collections.Counter
first count the characters of the string very easily & quickly
then filter the string testing if the count matches using the counter we just created.
like this:
import collections
s = "abbcabb"
cnt = collections.Counter(s)
s = "".join([c for c in s if cnt[c]==1])
(as a bonus, you can change the count to keep characters which have 2, 3, whatever occurrences)

EDIT: based on the comment exchange - if you're just concerned with the parity of the letter counts, then you don't want regex and instead want an approach like #jon's recommendation. (If you don't care about order, then a more performant approach with very long strings might use something like collections.Counter instead.)
My best guess as to what you're trying to match is: "one or more characters - call this subpattern A - followed by a different set of one or more characters - call this subpattern B - followed by subpattern A again".
You can use + as a shortcut for "one or more" (instead of specifying it once and then using * for the rest of the matches), but either way you need to get the subpatterns right. Let's try:
>>> import re
>>> pattern = re.compile(r'(.+?)(.+?)\1')
>>> pattern.sub('\g<2>', 'abbcabbabca')
'bbcbaca'
Hmm. That didn't work. Why? Because with the first pattern not being greedy, our "subpattern A" can just match the first a in the string - it does appear later, after all. So if we use a greedy match, Python will backtrack until it finds as long of a pattern for subpattern A that still allows for the A-B-A pattern to appear:
>>> pattern = re.compile(r'(.+)(.+?)\1')
>>> pattern.sub('\g<2>', 'abbcabbabca')
'cbc'
Looks good to me.

The site explains it well, hover and use the explanation section.
(.)(.*?)\1 Does not remove or match every double occurance. It matches 1 character, followed by anything in the middle sandwiched till that same character is encountered again.
so, for abbcabb the "sandwiched" portion should be bbc between two a
EDIT:
You can try something like this instead without regexes:
string = "abbcabb"
result = []
for i in string:
if i not in result:
result.append(i)
else:
result.remove(i)
print(''.join(result))
Note that this produces the "last" odd occurrence of a string and not first.
For "first" known occurance, you should use a counter as suggested in this answer . Just change the condition to check for odd counts. pseudo code(count[letter] %2 == 1)

I want to split a string by a character on its first occurence, which belongs to a list of characters. How to do this in python?

Basically, I have a list of special characters. I need to split a string by a character if it belongs to this list and exists in the string. Something on the lines of:
def find_char(string):
if string.find("some_char"):
#do xyz with some_char
elif string.find("another_char"):
#do xyz with another_char
else:
return False
and so on. The way I think of doing it is:
def find_char_split(string):
char_list = [",","*",";","/"]
for my_char in char_list:
if string.find(my_char) != -1:
my_strings = string.split(my_char)
break
else:
my_strings = False
return my_strings
Is there a more pythonic way of doing this? Or the above procedure would be fine? Please help, I'm not very proficient in python.
(EDIT): I want it to split on the first occurrence of the character, which is encountered first. That is to say, if the string contains multiple commas, and multiple stars, then I want it to split by the first occurrence of the comma. Please note, if the star comes first, then it will be broken by the star.

I would favor using the re module for this because the expression for splitting on multiple arbitrary characters is very simple:
r'[,*;/]'
The brackets create a character class that matches anything inside of them. The code is like this:
import re
results = re.split(r'[,*;/]', my_string, maxsplit=1)
The maxsplit argument makes it so that the split only occurs once.
If you are doing the same split many times, you can compile the regex and search on that same expression a little bit faster (but see Jon Clements' comment below):
c = re.compile(r'[,*;/]')
results = c.split(my_string)
If this speed up is important (it probably isn't) you can use the compiled version in a function instead of having it re compile every time. Then make a separate function that stores the actual compiled expression:
def split_chars(chars, maxsplit=0, flags=0, string=None):
# see note about the + symbol below
c = re.compile('[{}]+'.format(''.join(chars)), flags=flags)
def f(string, maxsplit=maxsplit):
return c.split(string, maxsplit=maxsplit)
return f if string is None else f(string)
Then:
special_split = split_chars(',*;/', maxsplit=1)
result = special_split(my_string)
But also:
result = split_chars(',*;/', my_string, maxsplit=1)
The purpose of the + character is to treat multiple delimiters as one if that is desired (thank you Jon Clements). If this is not desired, you can just use re.compile('[{}]'.format(''.join(chars))) above. Note that with maxsplit=1, this will not have any effect.
Finally: have a look at this talk for a quick introduction to regular expressions in Python, and this one for a much more information packed journey.

Python, breaking up Strings

I need to make a program in which the user inputs a word and I need to do something to each individual letter in that word. They cannot enter it one letter at a time just one word.
I.E. someone enters "test" how can I make my program know that it is a four letter word and how to break it up, like make my program make four variables each variable set to a different letter. It should also be able to work with bigger and smaller words.
Could I use a for statement? Something like For letter ste that letter to a variable, but what is it was like a 20 character letter how would the program get all the variable names and such?

Do you mean something like this?
>>> s = 'four'
>>> l = list(s)
>>> l
['f', 'o', 'u', 'r']
>>>
Addendum:
Even though that's (apparently) what you think you wanted, it's probably not necessary because it's possible for a string to hold virtually any size of a word -- so a single string variable likesabove should be good enough for your program verses trying to create a bunch of separately named variables for each character. For one thing, it would be difficult to write the rest of the program because you wouldn't to know what valid variable names to use.
The reason it's OK not to have separate variable for each character is because a single string can have any number of characters in it as well as be empty. Python's built-inlen()function will return a count of the number of letters in a string if applied to one, so the result oflen(s)in the above would be4.
Any character in a string can be randomly accessed by indexing it with an integer between0andlen(s)-1inside of square brackets, so to reference the third character you would uses[2]. It's useful to think of the index as the offset or the character from the beginning of the string.
Even so, in Python using indexing is often not needed because you can also iteratively process each character in a string in aforloop without using them as shown in this simple example:
num_vowels = 0
for ch in s:
if ch in 'aeiou':
num_vowels += 1
print 'there are', num_vowels, 'vowel(s) in the string', s
Python also has many other facilities and built-ins that further help when processing strings (and in fact could simplify the above example), which you'll eventually learn as you become more familiar with the language and its many libraries.

When you iterate a string, it returns the individual characters like
for c in thestring:
print(c)
You can use this to put the letters into a list if you really need to, which will retain its order but list(string) is a better choice for that (be aware that unordered types like dict or set do not guarantee any order).

You don't have to do any of those; In Python, you can access characters of a string using square brackets:
>>> word = "word"
>>> print(word[0])
w
>>> print(word[3])
d
>>> print(len(word))
4

You don't want to assign each letter to a separate variable. Then you'd be writing the rest of your program without even being able to know how many variables you have defined! That's an even worse problem than dealing with the whole string at once.
What you instead want to do is have just one variable holding the string, but you can refer to individual characters in it with indexing. Say the string is in s, then s[0] is the first character, s[1] is the second character, etc. And you can find out how far up the numbers go by checking len(s) - 1 (because the indexes start at 0, a length 1 string has maximum index 0, a length 2 string has maximum index 1, etc).
That's much more manageable than figuring out how to generate len(s) variable names, assign them all to a piece of the string, and then know which variables you need to reference.
Strings are immutable though, so you can't assign to s[1] to change the 2nd character. If you need to do that you can instead create a list with e.g. l = list(s). Then l[1] is the second character, and you can assign l[1] = something to change the element in the list. Then when you're done you can get a new string out with s_new = ''.join(l) (join builds a string by joining together a sequence of strings passed as its argument, using the string it was invoked on to the left as a separator between each of the elements in the sequence; in this case we're joining a list of single-character strings using the empty string as a separator, so we just get all the single-character strings joined into a single string).

x = 'test'
counter = 0
while counter < len(x):
print x[counter] # you can change this to do whatever you want to with x[counter]
counter += 1

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Replace string from a huge python dictionary - python

Related

Replace Items with regex (kind of)

Case-sensitive string word replacement from dictionary in Python

Python re.sub() is not replacing every match

I want to split a string by a character on its first occurence, which belongs to a list of characters. How to do this in python?

Python, breaking up Strings

Categories

Resources