Regex matching string python - python

I wanted to match the following string:
strings = iamcool.iplay=ball?end
I want to remove everything from the "." (inclusive) up to, but not including, the "?". In other words, I want to remove .iplay=ball, so that I end up with iamcool?end
This is the regex I have:
print re.sub(r'\.\.*?','', strings)
I am not sure how to stop at the "?"

Use a negated character class, [^?], which matches any character except ?.
>>> re.sub(r'\.[^?]*', '', strings)
'strings = iamcool?end'
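To compare the negated class with a lazy match, here is a minimal sketch. It assumes the input is just the text iamcool.iplay=ball?end; the variable names and the lookahead variant are illustrations, not part of the answer above.

```python
import re

s = "iamcool.iplay=ball?end"  # assumed input, without the "strings = " prefix

# Negated class: consume the dot and everything that is not a "?".
negated = re.sub(r"\.[^?]*", "", s)

# Lazy match with a lookahead: stop right before the first "?".
lazy = re.sub(r"\..*?(?=\?)", "", s)

print(negated)  # iamcool?end
print(lazy)     # iamcool?end

# Without any "?" the two behave differently: the negated class strips
# to the end of the string, while the lookahead version finds no match.
no_q_negated = re.sub(r"\.[^?]*", "", "iamcool.iplay")
no_q_lazy = re.sub(r"\..*?(?=\?)", "", "iamcool.iplay")
print(no_q_negated)  # iamcool
print(no_q_lazy)     # iamcool.iplay
```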

Related

Replacing everything with a backslash till next white space

As part of preprocessing my data, I want to replace anything that starts with a backslash, up to the next space, with an empty string. For example, \fs24 needs to be replaced with an empty string, and so does \qc23424. There could be multiple occurrences of such backslash tags that I want to remove. I have created a "tags to be eradicated" list which I aim to consume in a regular expression to clean the extracted text.
Input String: This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string.
Expected output: This is a string and it contains some texts and tags. which I want to remove from my string.
I am using the regular expression based replace function in Python:
udpated = re.sub(r'/\fs\d+', '')
However, this does not give the desired result. Alternatively, I have built an eradicate list and replace its entries in a loop, from the highest number to the lowest, but this is a performance killer.
Assuming a 'tag' can also occur at the very beginning of your string, and to avoid false positives, maybe you could use:
\s?(?<!\S)\\[a-z\d]+
And replace with nothing.
\s? - Optionally match a whitespace character (if a tag is mid-string and therefore preceded by a space);
(?<!\S) - Assert position is not preceded by a non-whitespace character (to allow a position at the start of your input);
\\ - A literal backslash.
[a-z\d]+ - 1+ (Greedy) Characters as per given class.
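As a rough check of the pattern explained above, the sketch below applies it to the input string from the question. The names TAG, text and cleaned are only illustrative, and the C:\temp case is an assumed example of a backslash that should not be treated as a tag.

```python
import re

# Optional leading whitespace, then a backslash that is not glued to a
# preceding non-whitespace character, then lowercase letters or digits.
TAG = re.compile(r"\s?(?<!\S)\\[a-z\d]+")

text = (r"This is a string \fs24 and it contains some texts and tags "
        r"\qc23424. which I want to remove from my string.")

cleaned = TAG.sub("", text)
print(cleaned)

# A backslash directly after a non-whitespace character is left alone.
untouched = TAG.sub("", r"C:\temp")
print(untouched)
```

Because the optional \s? swallows the space in front of each tag, no double spaces are left behind mid-string. A tag at the very start of the string still leaves the space that followed it.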
First, the / doesn't belong in the regular expression at all.
Second, even though you are using a raw string literal, \ itself has special meaning to the regular expression engine, so you still need to escape it. (Without a raw string literal, you would need '\\\\fs\\d+'.) The \ before f is meant to be used literally; the \ before d is part of the character class matching the digits.
Finally, sub takes three arguments: the pattern, the replacement text, and the string on which to perform the replacement.
>>> re.sub(r'\\fs\d+', '', r"This is a string \fs24 and it contains...")
'This is a string  and it contains...'
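To make the escaping point concrete, here is a small sketch showing that the raw and non-raw spellings describe the same pattern. The variable names are assumptions for illustration.

```python
import re

raw_pattern = r"\\fs\d+"        # raw string: two characters "\\" reach the regex engine
plain_pattern = "\\\\fs\\d+"    # non-raw string: every backslash must be doubled again

# Both literals produce exactly the same pattern text.
same = raw_pattern == plain_pattern
print(same)  # True

text = r"This is a string \fs24 and it contains..."
result = re.sub(raw_pattern, "", text)  # pattern, replacement, input string
print(repr(result))
```

Note that only the tag itself is removed here, so the spaces on both sides of it remain in the result.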
Does that work for you?
re.sub(
r"\\\w+\s*", # a backslash followed by alphanumerics and optional spacing;
'', # replace it with an empty string;
input_string # in your input string
)
>>> re.sub(r"\\\w+\s*", "", r"\fs24 hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", "hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello there")
'there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello \qc23424 there")
'there'
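'\\\\' matches a literal backslash and '\w+' matches the word characters that follow it (letters, digits and underscore), stopping at the first character that is not one of those, such as a space or a period.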
import re
s = r"""This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."""
re.sub(r'\\\w+', '', s)
output:
'This is a string  and it contains some texts and tags . which I want to remove from my string.'
I tried this and it worked fine for me:
def remover(text, state):
    removable = text.split("\\")[1]
    removable = removable.split(" ")[0]
    removable = "\\" + removable + " "
    text = text.replace(removable, "")
    state = True if "\\" in text else False
    return text, state
text = "hello \\I'm new here \\good luck"
state = True
while state:
    text, state = remover(text, state)
print(text)

How to substitute only second occurrence of re.search() group

I need to pad part of the string value with leading zeroes where needed.
T-46-5-В,Г,6-В,Г ---> T-46-005-В,Г,006-В,Г or
T-46-55-В,Г,56-В,Г ---> T-46-055-В,Г,056-В,Г, for example.
I have the regex pattern ^\D-\d{1,2}-([\d,]+)-[а-яА-я,]+,([\d,]+)-[а-яА-я,]+$ that retrieves the 2 separate groups of the string that I must change. The problem is that I can't substitute the changed values back into exactly those groups if the text of my re.search().group() occurs elsewhere in the string.
import re
my_string = "T-46-5-В,Г,6-В,Г"
my_pattern = r"^\D-\d{1,2}-([\d,]+)-[а-яА-я,]+,([\d,]+)-[а-яА-я,]+$"
new_string_parts = ["005", "006"]
new_string = re.sub(re.search(my_pattern, my_string).group(1), new_string_parts[0], my_string)
new_string = re.sub(re.search(my_pattern, my_string).group(2), new_string_parts[1], new_string)
print(new_string)
I get T-4006-005-В,Г,006-В,Г instead of T-46-005-В,Г,006-В,Г because there is another "6" in my_string. How can I solve this?
Thanks for your answers!
Capture the parts you need to keep and use a single re.sub pass with unambiguous backreferences in the replacement part (because they are mixed with numeric string variables):
import re
my_string = "T-46-5-В,Г,6-В,Г"
my_pattern = r"^(\D-\d{1,2}-)[\d,]+(-[а-яёА-ЯЁ,]+,)[\d,]+(-[а-яёА-ЯЁ,]+)$"
new_string_parts = ["005", "006"]
new_string = re.sub(my_pattern, fr"\g<1>{new_string_parts[0]}\g<2>{new_string_parts[1]}\3", my_string)
print(new_string)
# => T-46-005-В,Г,006-В,Г
Note I also added ёЁ to the Russian letter ranges.
The pattern - ^(\D-\d{1,2}-)[\d,]+(-[а-яёА-ЯЁ,]+,)[\d,]+(-[а-яёА-ЯЁ,]+)$ - now contains parentheses around the parts you do not need to change, and \g<1> refers to the string captured with (\D-\d{1,2}-), \g<2> refers to the value captured with (-[а-яёА-ЯЁ,]+,) and \3 - to (-[а-яёА-ЯЁ,]+).
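If the padded values should be computed rather than supplied in a list, one possible variation is to pass a function as the replacement. This is only a sketch: it assumes each numeric part is a plain run of digits (\d+ instead of the original [\d,]+) and that three digits is the target width. The names pattern and pad are hypothetical.

```python
import re

# Five groups: the fixed parts are kept as-is, groups 2 and 4 are the numbers.
pattern = re.compile(
    r"^(\D-\d{1,2}-)(\d+)(-[а-яёА-ЯЁ,]+,)(\d+)(-[а-яёА-ЯЁ,]+)$"
)

def pad(match):
    """Rebuild the string, zero-padding both numeric parts to 3 digits."""
    head, first, middle, second, tail = match.groups()
    return f"{head}{first.zfill(3)}{middle}{second.zfill(3)}{tail}"

padded_short = pattern.sub(pad, "T-46-5-В,Г,6-В,Г")
padded_long = pattern.sub(pad, "T-46-55-В,Г,56-В,Г")
print(padded_short)  # T-46-005-В,Г,006-В,Г
print(padded_long)   # T-46-055-В,Г,056-В,Г
```

With a function there is no replacement template at all, so the question of a digit directly following a backreference does not come up.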

Remove trailing special characters from string

I'm trying to use a regex to clean some data before I insert the items into the database. I haven't been able to solve the issue of removing trailing special characters at the end of my strings.
How do I write this regex to only remove trailing special characters?
import re
strings = ['string01_','str_ing02_^','string03_#_', 'string04_1', 'string05_a_']
for item in strings:
    clean_this = (re.sub(r'([_+!##$?^])', '', item))
    print (clean_this)
outputs this:
string01 # correct
string02 # incorrect because it removes the _ inside the string
string03 # correct
string041 # incorrect because it removes the _ inside the string
string05a # incorrect because it removes the _ inside the string and not just the trailing _
You could also use the special purpose rstrip method of strings
[s.rstrip('_+!##$?^') for s in strings]
# ['string01', 'str_ing02', 'string03', 'string04_1', 'string05_a']
Repeat the character class one or more times (+); otherwise only a single special character is replaced. Then assert the end of the string with $. Note that you don't need the capturing group around the character class:
[_+!##$?^]+$
For example:
import re
strings = ['string01_','str_ing02_^','string03_#_', 'string04_1', 'string05_a_']
for item in strings:
    clean_this = (re.sub(r'[_+!##$?^]+$', '', item))
    print (clean_this)
If you also want to remove whitespace characters at the end you could add \s to the character class:
[_+!##$?^\s]+$
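A short sketch of the whitespace-aware variant follows. It assumes the same list as in the question, with trailing whitespace added to the last item for illustration. The class is written with a single # since repeating a character inside a class changes nothing, and TRAILING and cleaned are illustrative names.

```python
import re

# Inside a character class, +, $, ? and a non-leading ^ are literal characters.
TRAILING = re.compile(r"[_+!#$?^\s]+$")

strings = ['string01_', 'str_ing02_^', 'string03_#_', 'string04_1', 'string05_a_ \n']

cleaned = [TRAILING.sub('', item) for item in strings]
print(cleaned)
# ['string01', 'str_ing02', 'string03', 'string04_1', 'string05_a']
```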
You need the end-of-string anchor $, together with + so that a run of trailing special characters is removed:
clean_this = (re.sub(r'[_+!##$?^]+$', '', item))

Python pattern match a string

I am trying to pattern match a string, so that if it ends in the characters 'std' I split the last 6 characters and append a different prefix.
I am assuming I can do this with regular expressions and re.split, but I am unsure of the correct notation to append a new prefix and take last 6 chars based on the presence of the last 3 chars.
regex = r"([a-zA-Z])"
if re.search(regex, "std"):
match = re.search(regex, "std")
#re.sub(r'\Z', '', varname)
You're confused about how to use regular expressions here. Your code is saying "search the string 'std' for any alphabetic character".
But there is no need to use regexes here anyway. Just use string slicing, and .endswith:
if my_string.endswith('std'):
    new_string = new_prefix + my_string[-6:]
No need for a regex. Just use standard string methods:
if s.endswith('std'):
    s = s[:-6] + new_suffix
But if you had to use a regex, you would substitute the new suffix in:
regex = re.compile(".{3}std$")
s = regex.sub(new_suffix, s)
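For comparison, here is a minimal sketch of the suffix-replacing reading of the question, once with string methods and once with a regex. The function names and the sample value "abcxyzstd" are assumptions for illustration only.

```python
import re

def replace_tail(s, new_suffix):
    """Replace the last 6 characters with new_suffix if s ends in 'std'."""
    if s.endswith("std"):
        return s[:-6] + new_suffix
    return s

def replace_tail_re(s, new_suffix):
    """Regex version: any 3 characters followed by 'std' at the end."""
    # A function is used as the replacement so that backslashes in
    # new_suffix are not interpreted as escapes.
    return re.sub(r".{3}std$", lambda m: new_suffix, s)

with_slicing = replace_tail("abcxyzstd", "_new")
with_regex = replace_tail_re("abcxyzstd", "_new")
print(with_slicing)  # abc_new
print(with_regex)    # abc_new
```

The two versions agree when the string has at least six characters. For shorter strings such as "std" the slicing version still replaces, while the regex version finds no match.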

How to find a non-alphanumeric character and move it to the end of a string in Python

I have the following string:
"string.isnotimportant"
I want to find the dot (it could be any non-alphanumeric character), and move it to the end of the string.
The result should look like:
"stringisnotimportant."
I am looking for a regular expression to do this job.
import re
inp = "string.isnotimportant"
re.sub(r'(\w*)(\W+)(\w*)', r'\1\3\2', inp)
>>> import re
>>> string = "string.isnotimportant"
#I explain a bit about this at the end
>>> regex = r'\w*(\W+)\w*' # the brackets in the regex mean that item, if matched, will be stored as a group
#in order to understand the re module properly, I think your best bet is to read some docs, I will link you at the end of the post
>>> x = re.search(regex, string)
>>> x.groups() #remember the stored group above? well this accesses that group.
#if there were more than one group above, there would be more items in the tuple
('.',)
#here I reassign the variable string to a modified version where the '.' is replaced with ''(nothing).
>>> string = string.replace('.', '')
>>> string += x.groups()[0] # here I basically append a letter to the end of string
The += operator appends a character to the end of a string. Since strings don't have an .append method like lists do, this is a handy feature. x.groups()[0] refers to the first item(only item in this case) of the tuple above.
>>> print(string)
stringisnotimportant.
about the regex:
"\w" Matches any alphanumeric character and the underscore: a through z, A through Z, 0 through 9, and '_'.
"\W" Matches any non-alphanumeric character. Examples for this include '&', '$', '#', etc.
https://developers.google.com/edu/python/regular-expressions?csw=1
http://python.about.com/od/regularexpressions/a/regexprimer.htm
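If the string may contain more than one non-alphanumeric character, a possible generalisation is sketched below. The helper name move_nonword_to_end is hypothetical, and it relies on \W, so the underscore is treated as a word character and stays in place.

```python
import re

def move_nonword_to_end(s):
    """Strip every non-word character and append them, in order, at the end."""
    moved = re.findall(r"\W", s)      # list of the non-word characters found
    kept = re.sub(r"\W", "", s)       # the string with those characters removed
    return kept + "".join(moved)

single = move_nonword_to_end("string.isnotimportant")
several = move_nonword_to_end("a.b-c")
print(single)   # stringisnotimportant.
print(several)  # abc.-
```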
