String split on specific characters - python

I have a string like:
'[abc] [def] [zzz]'
How would I be able to split it into three parts:
abc
def
zzz

You can use re.findall:
>>> from re import findall
>>> findall(r'\[([^\]]*)\]', '[abc] [def] [zzz]')
['abc', 'def', 'zzz']
>>>
All of the regex syntax used above is explained in the re module documentation, but here is a quick breakdown:
\[ # [
( # The start of a capture group
[^\]]* # Zero or more characters that are not ]
) # The end of the capture group
\] # ]
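The breakdown above can also live inside the pattern itself. The following is a minimal sketch, assuming you are free to compile the pattern with re.VERBOSE; the names BRACKETED and parts are made up for illustration.

```python
import re

# Same pattern as above; re.VERBOSE lets whitespace and comments sit inside the regex.
BRACKETED = re.compile(r"""
    \[          # a literal [
    ( [^\]]* )  # capture group: zero or more characters that are not ]
    \]          # a literal ]
""", re.VERBOSE)

parts = BRACKETED.findall('[abc] [def] [zzz]')
print(parts)  # ['abc', 'def', 'zzz']
```

Whitespace inside a character class is still significant in verbose mode, which is why the class itself is written without spaces.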
For those who want a non-Regex solution, you could always use a list comprehension and str.split:
>>> [x[1:-1] for x in '[abc] [def] [zzz]'.split()]
['abc', 'def', 'zzz']
>>>
[1:-1] strips off the square brackets on each end of x.
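One difference between the two approaches is worth checking before choosing. The sketch below assumes a hypothetical input where one bracketed item contains a space; str.split breaks on that space, while the regex only looks at the brackets.

```python
import re

# Hypothetical input: the first bracketed item contains a space.
text = '[a b] [def] [zzz]'

by_regex = re.findall(r'\[([^\]]*)\]', text)
by_split = [x[1:-1] for x in text.split()]

print(by_regex)  # ['a b', 'def', 'zzz']
print(by_split)  # ['', '', 'def', 'zzz']
```

If the items never contain whitespace, both give the same result and the split version is arguably easier to read.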

Another way:
s = '[abc] [def] [zzz]'
s = [i.strip('[]') for i in s.split()]

Related

Python Combining f-string with r-string and curly braces in regex

Given a single word (x), return the possible n-grams that can be found in that word.
You can change the n-gram size as needed;
it is the number in the curly braces in the pat variable.
The default n-gram size is 4.
For example, for the word (x):
x = 'abcdef'
The possible 4-grams are:
['abcd', 'bcde', 'cdef']
import re
def ngram_finder(x):
    pat = r'(?=(\S{4}))'
    xx = re.findall(pat, x)
    return xx
The question is:
How can I combine an f-string with an r-string in the regex, given that the regex itself uses curly braces?

You can use this string to combine the n value into your regexp, using double curly brackets to create a single one in the output:
fr'(?=(\S{{{n}}}))'
The regex needs to have {} to make a quantifier (as you had in your original regex {4}). However f strings use {} to indicate an expression replacement so you need to "escape" the {} required by the regex in the f string. That is done by using {{ and }} which in the output create { and }. So {{{n}}} (where n=4) generates '{' + '4' + '}' = '{4}' as required.
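To see the escaping in action, here is a minimal sketch that prints the rendered pattern and compares it with plain string concatenation. The value of n and the name pat_concat are only for illustration.

```python
n = 4  # example n-gram size

# {{ and }} render as literal braces; {n} is replaced by the value of n.
pat = fr'(?=(\S{{{n}}}))'
print(pat)  # (?=(\S{4}))

# The same pattern built without an f-string, for comparison.
pat_concat = r'(?=(\S{' + str(n) + r'}))'
print(pat == pat_concat)  # True
```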
Complete code:
import re
def ngram_finder(x, n):
    pat = fr'(?=(\S{{{n}}}))'
    return re.findall(pat, x)
x = 'abcdef'
print(ngram_finder(x, 4))
print(ngram_finder(x, 5))
Output:
['abcd', 'bcde', 'cdef']
['abcde', 'bcdef']
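If the regex is not a requirement, the same n-grams can be produced with slicing. This is a sketch of an alternative technique, not the approach from the answer; ngram_slices is a hypothetical name, and it assumes the input is a single word with no whitespace (the regex version skips windows containing whitespace, this one does not).

```python
def ngram_slices(x, n):
    # Hypothetical helper: take every window of length n, left to right.
    return [x[i:i + n] for i in range(len(x) - n + 1)]

print(ngram_slices('abcdef', 4))  # ['abcd', 'bcde', 'cdef']
print(ngram_slices('abcdef', 5))  # ['abcde', 'bcdef']
```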

Splitting string using different scenarios using regex

I have 2 scenarios to split a string:
"##$hello?? getting good.<li>hii"
I want it to be split as:
'hello','getting','good.<li>hii' (Scenario 1)
'hello','getting','good','li','hi' (Scenario 2)
Any ideas, please?

Something like this should work:
>>> re.split(r"[^\w<>.]+", s) # or re.split(r"[##$? ]+", s)
['', 'hello', 'getting', 'good.<li>hii']
>>> re.split(r"[^\w]+", s)
['', 'hello', 'getting', 'good', 'li', 'hii']
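Both results start with an empty string because the input begins with separator characters. A minimal sketch of one way to drop it, assuming s holds the string from the question (the names parts and words are illustrative):

```python
import re

s = "##$hello?? getting good.<li>hii"

# re.split yields '' before a leading separator; filter out empty pieces.
parts = [p for p in re.split(r"[^\w<>.]+", s) if p]
words = [p for p in re.split(r"\W+", s) if p]

print(parts)  # ['hello', 'getting', 'good.<li>hii']
print(words)  # ['hello', 'getting', 'good', 'li', 'hii']
```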

This might be what you're looking for: \w+ matches one or more word characters (letters, digits, or underscore), as many as possible. Here is a working JavaScript example:
var value = "##$hello?? getting good.<li>hii";
var matches = value.match(
new RegExp("\\w+", "gi")
);
console.log(matches)
It works by using \w+, which matches word characters as many times as possible. You could also use [A-Za-z] to match only letters and not digits, as shown here.
var value = "##$hello?? getting good.<li>hii777bloop";
var matches = value.match(
new RegExp("[A-Za-z]+", "gi")
);
console.log(matches)
It matches whatever is in the brackets one or more times, as many as possible. In this case that is the range a-z of lowercase characters and the range A-Z of uppercase characters. Hope this is what you want.
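Since the question is about Python, here is a rough Python equivalent of the two JavaScript snippets, using re.findall in place of String.match with the g flag. The variable names are illustrative.

```python
import re

value = "##$hello?? getting good.<li>hii777bloop"

# \w+ keeps digits (and underscores) attached to the surrounding letters.
word_runs = re.findall(r"\w+", value)
# [A-Za-z]+ keeps letters only, so the digits act as a separator.
letter_runs = re.findall(r"[A-Za-z]+", value)

print(word_runs)    # ['hello', 'getting', 'good', 'li', 'hii777bloop']
print(letter_runs)  # ['hello', 'getting', 'good', 'li', 'hii', 'bloop']
```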

For the first scenario, just use a regex to find all runs of word characters plus <, > and .:
In [60]: re.findall(r'[\w<>.]+', s)
Out[60]: ['hello', 'getting', 'good.<li>hii']
For the second one you need to replace the repeated characters only if the word is not a valid English word. You can do this using the nltk words corpus and re.sub:
In [61]: import nltk
In [62]: english_vocab = set(w.lower() for w in nltk.corpus.words.words())
In [63]: repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
In [64]: [repeat_regexp.sub(r'\1\2\3', word) if word not in english_vocab else word for word in re.findall(r'[^\W]+', s)]
Out[64]: ['hello', 'getting', 'good', 'li', 'hi']

In case you are looking for a solution without regex: string.punctuation gives you a string of all ASCII punctuation characters.
Use it with a list comprehension to get your desired result:
>>> import string
>>> my_string = '##$hello?? getting good.<li>hii'
>>> ''.join([(' ' if s in string.punctuation else s) for s in my_string]).split()
['hello', 'getting', 'good', 'li', 'hii'] # desired output
Explanation: here is how it works, step by step:
import string # Importing the 'string' module
special_char_string = string.punctuation
# Value of 'special_char_string': '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
my_string = '##$hello?? getting good.<li>hii'
# Generating list of character in sample string with
# special character replaced with whitespace
my_list = [(' ' if item in special_char_string else item) for item in my_string]
# Join the list to form string
my_string = ''.join(my_list)
# Split it based on space
my_desired_list = my_string.strip().split()
The value of my_desired_list will be:
['hello', 'getting', 'good', 'li', 'hii']
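The replace-then-split steps can also be done with a translation table. This is a sketch of an alternative using str.maketrans and str.translate, not part of the original answer; the names table and words are illustrative.

```python
import string

my_string = '##$hello?? getting good.<li>hii'

# Map every punctuation character to a space, then split on whitespace.
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
words = my_string.translate(table).split()

print(words)  # ['hello', 'getting', 'good', 'li', 'hii']
```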

Python regular expression to extract the parenthesis

I have the following unwieldy code to extract out 'ABC' and '(XYZ)' from a string 'ABC(XYZ)'
import re
test_str = 'ABC(XYZ)'
partone = re.sub(r'\([^)]*\)', '', test_str)
parttwo_temp = re.match(r'.*\((.+)\)', test_str)
parttwo = '(' + parttwo_temp.group(1) + ')'
I was wondering if someone can think of a better regular expression to split up the string. Thanks.

You may use re.findall
>>> import re
>>> test_str = 'ABC(XYZ)'
>>> re.findall(r'\([^()]*\)|[^()]+', test_str)
['ABC', '(XYZ)']
>>> [i for i in re.findall(r'(.*)(\([^()]*\))', test_str)[0]]
['ABC', '(XYZ)']

[i for i in re.split(r'(.*?)(\(.*?\))', test_str) if i]

For this kind of input data, we can replace the ( with space+( and split by space:
>>> s = 'ABC(XYZ)'
>>> s.replace("(", " (").split()
['ABC', '(XYZ)']
This way we are artificially creating a delimiter before every opening parenthesis.
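The replace trick only adds a delimiter before each opening parenthesis, so it behaves differently from the findall pattern when text follows a closing parenthesis. A minimal sketch with a hypothetical longer input shows the difference:

```python
import re

# Hypothetical input with text after a closing parenthesis.
s = 'ABC(XYZ)DEF(123)'

by_replace = s.replace("(", " (").split()
by_findall = re.findall(r'\([^()]*\)|[^()]+', s)

print(by_replace)  # ['ABC', '(XYZ)DEF', '(123)']
print(by_findall)  # ['ABC', '(XYZ)', 'DEF', '(123)']
```

For inputs shaped exactly like 'ABC(XYZ)' the two give the same result.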

Finding groups of letters, numbers, or symbols

How can I split a string into substrings based on the characters contained in the substrings? For example, given a string "ABC12345..::", I would like to get a list like ['ABC', '12345', '..::']. I know the valid characters for each substring, but I don't know the lengths. So the string could also look like "CC123:....:", in which case I would like to have ['CC', '123', ':....:'] as the result.

By your example you don't seem to have anything to split with (e.g. nothing between C and 1), but what you do have is a well-formed pattern that you can match. So just simply create a pattern that groups the strings you want matched:
>>> import re
>>> s = "ABC12345..::"
>>> re.match(r'([A-Z]*)([0-9]*)([\.:]*)', s).groups()
('ABC', '12345', '..::')
Alternatively, compile the pattern into a reusable regex object and do this:
>>> patt = re.compile(r'([A-Z]*)([0-9]*)([\.:]*)')
>>> patt.match(s).groups()
('ABC', '12345', '..::')
>>> patt.match("CC123:....:").groups()
('CC', '123', ':....:')
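Because every group uses *, the pattern also matches when a part is missing or when the parts arrive in a different order, and it then returns empty strings rather than failing. A short sketch with hypothetical inputs shows this, along with re.fullmatch as one way to reject strings that do not fit the letters-digits-symbols order:

```python
import re

patt = re.compile(r'([A-Z]*)([0-9]*)([.:]*)')

# Missing parts come back as empty strings.
print(patt.match("123").groups())    # ('', '123', '')
# Wrong order: match() stops after the symbols and ignores the rest.
print(patt.match("..ABC").groups())  # ('', '', '..')
# fullmatch() requires the whole string to fit, so this returns None.
print(patt.fullmatch("..ABC"))       # None
```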

Match each group with the following regex
[0-9]+|[a-zA-Z]+|[.:]+
[0-9]+ one or more digits, or
[a-zA-Z]+ one or more letters, or
[.:]+ one or more dots or colons
This will allow you to match groups in any order, e.g. "123...xy::ab..98765PQRS".
import re
print(re.findall( r'[0-9]+|[a-zA-Z]+|[.:]+', "ABC12345..::"))
# => ['ABC', '12345', '..::']
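The same "runs of one character class, in any order" idea can be sketched without a regex using itertools.groupby. This is an alternative technique, not part of the answer above. The names char_class and split_by_class are hypothetical, and the classifier is an assumption: it treats everything that is not a digit or letter as one 'other' class, and str.isdigit/str.isalpha also accept non-ASCII characters.

```python
from itertools import groupby

def char_class(ch):
    # Hypothetical classifier: digit, letter, or anything else.
    if ch.isdigit():
        return 'digit'
    if ch.isalpha():
        return 'letter'
    return 'other'

def split_by_class(value):
    # groupby yields one group per run of consecutive equal keys.
    return [''.join(group) for _, group in groupby(value, key=char_class)]

print(split_by_class("ABC12345..::"))  # ['ABC', '12345', '..::']
print(split_by_class("CC123:....:"))   # ['CC', '123', ':....:']
```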

If you want a non-regex approach:
value = 'ABC12345..::'
indexes = [i for i, char in enumerate(value) if char.isdigit()] # Collect indexes of any digits
arr = [ value[:indexes[0]], value[indexes[0]:indexes[-1]+1], value[indexes[-1]+1:] ] # Use slicing to build the list
Output:
['ABC', '12345', '..::']
Another string:
value = "CC123:....:"
indexes = [i for i, char in enumerate(value) if char.isdigit()] # Collect indexes of any digits
arr = [ value[:indexes[0]], value[indexes[0]:indexes[-1]+1], value[indexes[-1]+1:] ] # Use slicing to build the list
Output:
['CC', '123', ':....:']
EDIT:
Just did a benchmark, metatoaster's method is slightly faster than this :)

python regular expression, pulling all letters out

Is there a better way to pull A and F from this: A13:F20
a="A13:F20"
import re
pattern = re.compile(r'\D+\d+\D+')
matches = re.search(pattern, a)
num = matches.group(0)
print(num[0])
print(num[-1])
output
A
F
note: the digits are of unknown length

You don't have to use regular expressions, or re at all. Assuming you want just letters to remain, you could do something like this:
a = "A13:F20"
a = ''.join(filter(str.isalpha, a)) # 'AF'; in Python 3, filter returns an iterator, so join it back into a string

I'd do it like this:
>>> re.findall(r'[a-z]', a, re.IGNORECASE)
['A', 'F']

Use a simple list comprehension as a filter to keep only the letters from the original string.
print([char for char in a if char.isalpha()])
# ['A', 'F']

You could use re.sub:
>>> a="A13.F20"
>>> re.sub(r'[^A-Z]', '', a) # Remove everything apart from A-Z
'AF'
>>> re.sub(r'[A-Z]', '', a) # Remove A-Z
'13.20'
>>>
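If the input always has the shape letters, digits, colon, letters, digits, the letters can also be captured directly, which copes with digits of unknown length. This is a minimal sketch under that assumption; the pattern and the names first and last are illustrative, not from the answers above.

```python
import re

a = "A13:F20"

# Assumed shape: letters, digits, ':', letters, digits.
m = re.fullmatch(r'([A-Za-z]+)\d+:([A-Za-z]+)\d+', a)
if m:
    first, last = m.groups()
    print(first, last)  # A F
```

fullmatch returns None for strings that do not fit the assumed shape, so the check on m guards against unexpected input.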

If you're working with strings that all have the same format, you can just cut out substrings:
a="A13:F20"
print(a[0], a[4])
More on python slicing in this answer:
Is there a way to substring a string in Python?
