Regular expression split

Regular expression split - python

I have inputs similar to the following:
TV-12VX
TV-14JW
TV-2JIS
VC-224X
I need to remove everything after the numbers after the dash. The result would be:
TV-12
TV-14
TV-2
VC-224
How would I do this split via regular expressions?

The following code shows how to match strings of the form "TV-" + (some number):
>>> re.match('TV-[0-9]+','TV-12VX').group(0)
'TV-12'
(Note that, because I'm using match, this only works if the string starts with the bit you want to extract.)

I think this regex is appropriate for you: (.+?-\d+?)[a-zA-Z]. You can use it with re.findall, or re.match.
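A minimal sketch of how this pattern could be used with re.match on the sample inputs. The helper name extract is made up for illustration. Note that the trailing [a-zA-Z] is part of the overall match, so the wanted text is in group(1), not group(0), and the pattern does not match at all when no letter follows the digits.

```python
import re

# Pattern from the answer above: lazy prefix, dash, lazy digits, then one letter.
pattern = re.compile(r"(.+?-\d+?)[a-zA-Z]")

def extract(code):
    """Return the captured 'PREFIX-digits' part, or None if nothing matches."""
    m = pattern.match(code)
    return m.group(1) if m else None

for code in ["TV-12VX", "TV-14JW", "TV-2JIS", "VC-224X"]:
    print(code, "->", extract(code))
```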

import re
p = re.match(r'(\w{2}-\d+)', 'TV-12VX')
print(p.group(0))
Outputs
TV-12

You can remove everything after the digits with this:
re.sub(r"^(\w+-\d+).*", r"\1", input)
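As a rough sketch, the substitution above can be applied to all four sample inputs from the question. The helper name strip_suffix is hypothetical, and the pattern assumes every code has the form word characters, a dash, then digits.

```python
import re

# Assumed input shape: <word chars>-<digits><anything else>.
PATTERN = re.compile(r"^(\w+-\d+).*")

def strip_suffix(code):
    """Keep the prefix, the dash and the digits; drop whatever follows."""
    return PATTERN.sub(r"\1", code)

codes = ["TV-12VX", "TV-14JW", "TV-2JIS", "VC-224X"]
results = [strip_suffix(c) for c in codes]
print(results)  # ['TV-12', 'TV-14', 'TV-2', 'VC-224']
```

A string that does not fit the assumed shape is returned unchanged, because re.sub leaves non-matching input alone.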

Related

Remove Characters From A String Until A Specific Format is Reached

So I have the following strings and I have been trying to figure out how to manipulate them in such a way that I get a specific format.
string1-itd_jan2021-internal
string2itd_mar2021-space
string3itd_feb2021-internal
string4-itd_mar2021-moon
string5itd_jun2021-internal
string6-itd_feb2021-apollo
I want to be able to get rid of any of the last string so I am just left with the month and year, like below:
string1-itd_jan2021
string2itd_mar2021
string3itd_feb2021
string4-itd_mar2021
string5itd_jun2021
string6-itd_feb2021
I thought about using string.split on the -, but then realized that this wouldn't work for some strings. I also thought about putting the characters into a list and slicing off a fixed number of them, but the ending varies in length.
Is there a way to do this with regex or any other Python module?

Use str.rsplit with the appropriate maxsplit parameter:
s = s.rsplit("-", 1)[0]
You could also use str.split (even though this is clearly the worse choice):
s = "-".join(s.split("-")[:-1])
Or using regular expressions:
s = re.sub(r'-[^-]*$', '', s)
# "-[^-]*" a "-" followed by any number of non-"-"
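The three variants can be compared side by side. This is only a sketch: the third sample, "nodash", is a made-up edge case that is not in the question, added to show where the approaches differ.

```python
import re

samples = [
    "string1-itd_jan2021-internal",
    "string2itd_mar2021-space",
    "nodash",  # hypothetical input without any "-"
]

# Split once from the right and keep the left part.
by_rsplit = [s.rsplit("-", 1)[0] for s in samples]
# Split on every dash, drop the last piece, and glue the rest back together.
by_split = ["-".join(s.split("-")[:-1]) for s in samples]
# Remove the last dash and everything after it.
by_regex = [re.sub(r"-[^-]*$", "", s) for s in samples]

print(by_rsplit)
print(by_split)
print(by_regex)
```

On input without a dash, rsplit and re.sub return the string unchanged, while the split/join variant returns an empty string.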

With a regex:
import re
re.sub(r'([0-9]{4}).*$', r'\1', s)

Use re.sub like so:
import re
lines = '''string1-itd_jan2021-internal
string2itd_mar2021-space
string3itd_feb2021-internal
string4-itd_mar2021-moon
string5itd_jun2021-internal
string6-itd_feb2021-apollo'''
for old in lines.split('\n'):
    new = re.sub(r'[-][^-]+$', '', old)
    print('\t'.join([old, new]))
Prints:
string1-itd_jan2021-internal string1-itd_jan2021
string2itd_mar2021-space string2itd_mar2021
string3itd_feb2021-internal string3itd_feb2021
string4-itd_mar2021-moon string4-itd_mar2021
string5itd_jun2021-internal string5itd_jun2021
string6-itd_feb2021-apollo string6-itd_feb2021
Explanation:
r'[-][^-]+$' : Literal dash (-), followed by any character other than a dash ([^-]) repeated 1 or more times, followed by the end of the string ($).

Regex in python: combining 2 regex expressions into one

Suppose I have the following list:
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','persons']
I want to remove all elements that contain numbers and all elements that end with a dot.
So I want to delete '35','7,000','10,000','mr.','rev.'
I can do it separately using the following regex:
regex = re.compile('[a-zA-Z\.]')
regex2 = re.compile('[0-9]')
But when I try to combine them I delete either all elements or nothing.
How can I combine two regex correctly?

This should work:
reg = re.compile(r'[a-zA-Z]+\.|[0-9,]+')
Note that your first regex is wrong because it deletes any string with a dot inside it.
To avoid this, I included [a-zA-Z]+\. in the combined regex.
Your second regex is also wrong as it misses a "+" and a ",", which I included in the above solution.
Also, if you assume that elements which end with a dot might contain some numbers the complete solution should be:
reg = re.compile(r'[a-zA-Z0-9]+\.|[0-9,]+')

If you don't need to capture the result, this matches any string with a dot at the end, or any with a number in it.
\.$|\d
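One way this pattern might be used to filter the list from the question is with re.search inside a list comprehension. The names unwanted and kept are illustrative only.

```python
import re

a = ['35', 'years', 'opened', '7,000', 'churches', 'rev.', 'mr.', 'brandt',
     'said', 'adding', 'denomination', 'national', 'goal', 'one', 'church',
     'every', '10,000', 'persons']

# Matches a dot at the end of the string, or a digit anywhere in it.
unwanted = re.compile(r"\.$|\d")

# re.search scans the whole string, so a single hit is enough to drop the element.
kept = [s for s in a if not unwanted.search(s)]
print(kept)
```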

You could use:
(?:[^\d\n]*\d)|.*\.$
See a demo on regex101.com.

Here is a way to do the job:
import re
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','per.sons']
b = []
for s in a:
    if not re.search(r'^(?:[\d,]+|.*\.)$', s):
        b.append(s)
print(b)
Output:
['years', 'opened', 'churches', 'brandt', 'said', 'adding', 'denomination', 'national', 'goal', 'one', 'church', 'every', 'per.sons']

how to match a pattern and add a character to it

I have something like:
GCF_002904975:2.6672e-05):2.6672e-05.
and I would like to add the word '_S' right after any GCF(any number) entry before the next colon.
In other words I would like my text becoming like:
GCF_002904975_S:2.6672e-05):2.6672e-05.
I have repeated pattern like that all along my text.

This can easily be done with the re.sub function. A working example would look like this:
import re
inp_string='(((GCF_001297375:2.6671e-05,GCF_002904975:2.6672e-05)0.924:0.060046136,(GCF_000144955:0.036474926,((GCF_001681075:0.017937143,...'
if __name__ == "__main__":
    outp_string = re.sub(r'GCF_(?P<gfc_number>\d+)\:', r'GCF_\g<gfc_number>_S:', inp_string)
    print(outp_string)
This code gives the following result, which is hopefully what you need:
(((GCF_001297375_S:2.6671e-05,GCF_002904975_S:2.6672e-05)0.924:0.060046136,(GCF_000144955_S:0.036474926,((GCF_001681075_S:0.017937143,...
For more info take a look at the docs:
https://docs.python.org/3/library/re.html

You can use regular expressions with a function substitution. The solution below depends on the numbers always being 9 digits, but could be modified to work with other cases.
import re
test_str = '(((GCF_001297375:2.6671e-05,GCF_002904975:2.6672e-05)0.924:0.060046136,GCF_000144955:0.036474926,((GCF_001681075:0.017937143,...'
new_str = re.sub(r"GCF_\d{9}", lambda x: x.group(0) + "_S", test_str)
print(new_str)
#(((GCF_001297375_S:2.6671e-05,GCF_002904975_S:2.6672e-05)0.924:0.060046136,GCF_000144955_S:0.036474926,((GCF_001681075_S:0.017937143,...

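Why not just do a replace? Shortening your example string to make it easier to read:
"(((GCF_001297375:2.6671e-05,GCF_002904975:2.6672e-05)...".replace(":","_S:")
Note that this inserts _S before every colon, so it only gives the desired result if every colon in the text directly follows a GCF entry. In the full sample string, )0.924:0.060046136 would also be changed.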

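To see how a plain replace and a targeted substitution differ, here is a small sketch on a fragment of the sample string that contains a colon not preceded by a GCF entry. The lookahead pattern is one possible choice, not taken from any of the answers.

```python
import re

tree = "(GCF_001297375:2.6671e-05,GCF_002904975:2.6672e-05)0.924:0.060046136"

# Plain replace touches every colon, including the one after 0.924.
plain = tree.replace(":", "_S:")

# Targeted: only a GCF_<digits> run that is directly followed by a colon.
targeted = re.sub(r"(GCF_\d+)(?=:)", r"\1_S", tree)

print(plain)
print(targeted)
```
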
Complex regex in Python

I am trying to write a generic pattern using regex so that it fetches only particular things from the string. Let's say we have strings like GigabitEthernet0/0/0/0 or FastEthernet0/4 or Ethernet0/0.222. The regex should fetch the first 2 characters and all the numerals. Therefore, the fetched result should be something like Gi0000 or Fa04 or Et00222 depending on the above cases.
x = 'GigabitEthernet0/0/0/2'
m = re.search('([\w+]{2}?)[\\\.(\d+)]{0,}',x)
I don't understand how I should write the regular expression. The values can also be fetched in the form of a list. I wrote a few more patterns, but they didn't help.

With regex, you can use the re.findall function.
>>> import re
>>> s = 'GigabitEthernet0/0/0/0 '
>>> s[:2]+''.join(re.findall(r'\d', s))
'Gi0000'
OR
>>> ''.join(re.findall(r'^..|\d', s))
'Gi0000'
>>> ''.join(re.findall(r'^..|\d', 'Ethernet0/0.222'))
'Et00222'
OR
>>> s = 'GigabitEthernet0/0/0/0 '
>>> s[:2]+''.join([i for i in s if i.isdigit()])
'Gi0000'
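Wrapped in a small function, the ^..|\d variant can be checked against all three interface names from the question. The function name abbreviate is made up for this sketch.

```python
import re

def abbreviate(interface):
    """Return the first two characters followed by every digit, in order."""
    # '^..' can only match at the start; after that only '\d' can match.
    return "".join(re.findall(r"^..|\d", interface))

names = ["GigabitEthernet0/0/0/0", "FastEthernet0/4", "Ethernet0/0.222"]
short = [abbreviate(n) for n in names]
print(short)  # ['Gi0000', 'Fa04', 'Et00222']
```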

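import re
z = "Ethernet0/0.222."
print(z[:2] + "".join(re.findall(r"(\d+)(?=[\d\W]*$)", z)))
You can try this. It ensures that only digits near the end of the string are used: each run of digits must be followed by nothing but digits and non-word characters up to the end of the string.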

Here is another option:
s = 'Ethernet0/0.222'
"".join(re.findall(r'^\w{2}|[\d]+', s))

Python Regular Expression Escape or not

I need to write a regular expression to get all the characters in the list below..
(remove all the characters not in the list)
allow_characters = "#.-_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
I don't know how to do it, should I even use re.match or re.findall or re.sub...?
Thanks a lot in advance.

Don't use regular expressions at all. First convert allow_characters to a set, and then use ''.join() with a generator expression that filters out the unwanted characters. Assuming the string you are transforming is called s:
allow_char_set = set(allow_characters)
s = ''.join(c for c in s if c in allow_char_set)
That being said, here is how this might look with regex:
s = re.sub(r'[^#.\-_a-zA-Z0-9]+', '', s)
You could convert your allow_characters string into this regex, but I think the first solution is significantly more straightforward.
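Edit: As pointed out by DSM in comments, str.translate() is often a very good way to do something like this. In this case it is slightly complicated, but you can still use it like this (Python 2 only: string.maketrans and the two-argument form of str.translate are not available in Python 3):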
import string
allow_characters = "#.-_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
all_characters = string.maketrans('', '')
delete_characters = all_characters.translate(None, allow_characters)
s = s.translate(None, delete_characters)
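On Python 3, where str.translate takes a single mapping argument, one possible equivalent is sketched below. The defaultdict table and the sample string are assumptions made for illustration, and the regex from above is run alongside it as a cross-check.

```python
import re
from collections import defaultdict

allow_characters = "#.-_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

# Allowed characters map to themselves; every other code point falls back
# to None, which str.translate treats as "delete this character".
keep_table = defaultdict(lambda: None, {ord(c): c for c in allow_characters})

s = "hello, world! #tag_1 (ok)"  # made-up sample input
cleaned = s.translate(keep_table)
via_regex = re.sub(r"[^#.\-_a-zA-Z0-9]+", "", s)

print(cleaned)    # helloworld#tag_1ok
print(via_regex)  # helloworld#tag_1ok
```

The defaultdict stores an entry for each distinct disallowed character it encounters, which is usually negligible but worth knowing for very varied input.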
