String split on specific characters - python

I have a string like;
'[abc] [def] [zzz]'
How would I be able to split it into three parts:
abc
def
zzz

You can use re.findall:
>>> from re import findall
>>> findall('\[([^\]]*)\]', '[abc] [def] [zzz]')
['abc', 'def', 'zzz']
>>>
All of the Regex syntax used above is explained in the link, but here is a quick breakdown:
\[ # [
( # The start of a capture group
[^\]]* # Zero or more characters that are not ]
) # The end of the capture group
\] # ]
For those who want a non-Regex solution, you could always use a list comprehension and str.split:
>>> [x[1:-1] for x in '[abc] [def] [zzz]'.split()]
['abc', 'def', 'zzz']
>>>
[1:-1] strips off the square brackets on each end of x.

Another way:
s = '[abc] [def] [zzz]'
s = [i.strip('[]') for i in s.split()]

Related

Python Combining f-string with r-string and curly braces in regex

Given a single word (x); return the possible n-grams that can be found in that word.
You can modify the n-gram value according as you want;
it is in the curly braces in the pat variable.
The default n-gram value is 4.
For example; for the word (x):
x = 'abcdef'
The possible 4-gram are:
['abcd', 'bcde', 'cdef']
def ngram_finder(x):
pat = r'(?=(\S{4}))'
xx = re.findall(pat, x)
return xx
The Question is:
How to combine the f-string with the r-string in the regex expression, using curly braces.
You can use this string to combine the n value into your regexp, using double curly brackets to create a single one in the output:
fr'(?=(\S{{{n}}}))'
The regex needs to have {} to make a quantifier (as you had in your original regex {4}). However f strings use {} to indicate an expression replacement so you need to "escape" the {} required by the regex in the f string. That is done by using {{ and }} which in the output create { and }. So {{{n}}} (where n=4) generates '{' + '4' + '}' = '{4}' as required.
Complete code:
import re
def ngram_finder(x, n):
pat = fr'(?=(\S{{{n}}}))'
return re.findall(pat, x)
x = 'abcdef'
print(ngram_finder(x, 4))
print(ngram_finder(x, 5))
Output:
['abcd', 'bcde', 'cdef']
['abcde', 'bcdef']

Splitting string using different scenarios using regex

I have 2 scenarios so split a string
scenario 1:
"##$hello?? getting good.<li>hii"
I want to be split as 'hello','getting','good.<li>hii (Scenario 1)
'hello','getting','good','li,'hi' (Scenario 2)
Any ideas please??
Something like this should work:
>>> re.split(r"[^\w<>.]+", s) # or re.split(r"[##$? ]+", s)
['', 'hello', 'getting', 'good.<li>hii']
>>> re.split(r"[^\w]+", s)
['', 'hello', 'getting', 'good', 'li', 'hii']
This might be what your looking for \w+ it matches any digit or letter from 1 to n times as many times as possible. Here is a working Java-Script
var value = "##$hello?? getting good.<li>hii";
var matches = value.match(
new RegExp("\\w+", "gi")
);
console.log(matches)
It works by using \w+ which matches word characters as many times as possible. You cound also use [A-Za-b] to match only letters which not numbers. As show here.
var value = "##$hello?? getting good.<li>hii777bloop";
var matches = value.match(
new RegExp("[A-Za-z]+", "gi")
);
console.log(matches)
It matches what are in the brackets 1 to n timeas as many as possible. In this case the range a-z of lower case charactors and the range of A-Z uppder case characters. Hope this is what you want.
For first scenario just use regex to find all words that are contain word characters and <>.:
In [60]: re.findall(r'[\w<>.]+', s)
Out[60]: ['hello', 'getting', 'good.<li>hii']
For second one you need to repleace the repeated characters only if they are not valid english words, you can do this using nltk corpus, and re.sub regex:
In [61]: import nltk
In [62]: english_vocab = set(w.lower() for w in nltk.corpus.words.words())
In [63]: repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
In [64]: [repeat_regexp.sub(r'\1\2\3', word) if word not in english_vocab else word for word in re.findall(r'[^\W]+', s)]
Out[64]: ['hello', 'getting', 'good', 'li', 'hi']
In case you are looking for solution without regex. string.punctuation will give you list of all special characters.
Use this list with list comprehension for achieving your desired result as:
>>> import string
>>> my_string = '##$hello?? getting good.<li>hii'
>>> ''.join([(' ' if s in string.punctuation else s) for s in my_string]).split()
['hello', 'getting', 'good', 'li', 'hii'] # desired output
Explanation: Below is the step by step instruction regarding how it works:
import string # Importing the 'string' module
special_char_string = string.punctuation
# Value of 'special_char_string': '!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
my_string = '##$hello?? getting good.<li>hii'
# Generating list of character in sample string with
# special character replaced with whitespace
my_list = [(' ' if item in special_char_string else item) for item in my_string]
# Join the list to form string
my_string = ''.join(my_list)
# Split it based on space
my_desired_list = my_string.strip().split()
The value of my_desired_list will be:
['hello', 'getting', 'good', 'li', 'hii']

Python regular expression to extract the parenthesis

I have the following unwieldy code to extract out 'ABC' and '(XYZ)' from a string 'ABC(XYZ)'
import re
test_str = 'ABC(XYZ)'
partone = re.sub(r'\([^)]*\)', '', test_str)
parttwo_temp = re.match('.*\((.+)\)', test_str)
parttwo = '(' + parttwo_temp.group(1) + ')'
I was wondering if someone can think of a better regular expression to split up the string. Thanks.
You may use re.findall
>>> import re
>>> test_str = 'ABC(XYZ)'
>>> re.findall(r'\([^()]*\)|[^()]+', test_str)
['ABC', '(XYZ)']
>>> [i for i in re.findall(r'(.*)(\([^()]*\))', test_str)[0]]
['ABC', '(XYZ)']
[i for i in re.split(r'(.*?)(\(.*?\))', test_str) if i]
For this kind of input data, we can replace the ( with space+( and split by space:
>>> s = 'ABC(XYZ)'
>>> s.replace("(", " (").split()
['ABC', '(XYZ)']
This way we are artificially creating a delimiter before every opening parenthesis.

Finding groups of letters, numbers, or symbols

How can I split a string into substrings based on the characters contained in the substrings. For example, given a string "ABC12345..::", I would like to get a list like ['ABC', '12345', '..::']. I know the valid characters for each substring, but I don't know the lengths. So the string could also look like "CC123:....:", in which case I would like to have ['CC', '123', ':....:'] as the result.
By your example you don't seem to have anything to split with (e.g. nothing between C and 1), but what you do have is a well-formed pattern that you can match. So just simply create a pattern that groups the strings you want matched:
>>> import re
>>> s = "ABC12345..::"
>>> re.match('([A-Z]*)([0-9]*)([\.:]*)', s).groups()
('ABC', '12345', '..::')
Alternative, compile the pattern into a reusable regex object and do this:
>>> patt = re.compile('([A-Z]*)([0-9]*)([\.:]*)')
>>> patt.match(s).groups()
('ABC', '12345', '..::')
>>> patt.match("CC123:....:").groups()
('CC', '123', ':....:')
Match each group with the following regex
[0-9]+|[a-zA-Z]+|[.:]+
[0-9]+ any digits repeated any times, or
[a-zA-Z]+ any letters repeated any times, or
[.:]+ any dots or colons repeated any times
This will allow you to match groups in any order, ie: "123...xy::ab..98765PQRS".
import re
print(re.findall( r'[0-9]+|[a-zA-Z]+|[.:]+', "ABC12345..::"))
# => ['ABC', '12345', '..::']
ideone demo
If you want a non-regex approach:
value = 'ABC12345..::'
indexes = [i for i, char in enumerate(value) if char.isdigit()] # Collect indexes of any digits
arr = [ value[:indexes[0]], value[indexes[0]:indexes[-1]+1], value[indexes[-1]+1:] ] # Use splicing to build list
Output:
['ABC', '12345', '..::']
Another string:
value = "CC123:....:"
indexes = [i for i, char in enumerate(value) if char.isdigit()] # Collect indexes of any digits
arr = [ value[:indexes[0]], value[indexes[0]:indexes[-1]+1], value[indexes[-1]+1:] ] # Use splicing to build list
Output:
['CC', '123', ':....:']
EDIT:
Just did a benchmark, metatoaster's method is slightly faster than this :)

python regular expression, pulling all letters out

Is there a better way to pull A and F from this: A13:F20
a="A13:F20"
import re
pattern = re.compile(r'\D+\d+\D+')
matches = re.search(pattern, a)
num = matches.group(0)
print num[0]
print num[len(num)-1]
output
A
F
note: the digits are of unknown length
You don't have to use regular expressions, or re at all. Assuming you want just letters to remain, you could do something like this:
a = "A13:F20"
a = filter(lambda x: x.isalpha(), a)
I'd do it like this:
>>> re.findall(r'[a-z]', a, re.IGNORECASE)
['A', 'F']
Use a simple list comprehension, as a filter and get only the alphabets from the actual string.
print [char for char in input_string if char.isalpha()]
# ['A', 'F']
You could use re.sub:
>>> a="A13.F20"
>>> re.sub(r'[^A-Z]', '', a) # Remove everything apart from A-Z
'AF'
>>> re.sub(r'[A-Z]', '', a) # Remove A-Z
'13.20'
>>>
If you're working with strings that all have the same format, you can just cut out substrings:
a="A13:F20"
print a[0], a[4]
More on python slicing in this answer:
Is there a way to substring a string in Python?

Categories

Resources