Split a line with matching regex special characters - python

Currently I want to split a line with all the matching special characters of the regex. As it is hard to explain, here are a few examples:
('.+abcd[0-9]+\.mp3', 'Aabcd09.mp3') -> [ 'A', '09' ]
.+ is a special expression of the regex and this is the match that I want
[0-9]+ is another regex expression and I want what it matches too
('.+\..+_[0-9]+\.mp3', 'A.abcd_09.mp3') -> [ 'A', 'abcd', '09' ]
.+ is the first special expression of the regex, it matches A
.+ is the second special expression of the regex, it matches abcd
[0-9]+ is the third special expression of the regex, it matches 09
Do you know how to achieve this? I didn't find anything.

Looks like you need a so called tokenizer/lexer to parse a regular expression first. It will allow you to split a base regex on sub-expressions. Then just apply these sub-expressions to the original string and print out matches.
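A minimal sketch of that idea, assuming the pattern only uses the constructs from the examples: the dot, bracketed character classes, an optional +, * or ? quantifier, and backslash-escaped literals. The names TOKEN and split_by_specials are hypothetical, and this is not a full regex parser. It wraps each special sub-expression in a capturing group and lets re do the matching:

```python
import re

# Hypothetical mini-tokenizer for a small subset of regex syntax.
TOKEN = re.compile(r"""
    (?P<special>(?:\[[^\]]*\]|\.)[+*?]?)   # a class or a dot, optional quantifier
  | (?P<literal>\\.|[^\\])                 # an escaped char or a plain char
""", re.VERBOSE)

def split_by_specials(pattern, text):
    """Return what each special sub-expression of `pattern` matched in `text`."""
    parts = []
    for m in TOKEN.finditer(pattern):
        if m.lastgroup == "special":
            parts.append("(" + m.group() + ")")  # capture the special part
        else:
            parts.append(m.group())              # keep literals unchanged
    match = re.fullmatch("".join(parts), text)
    return list(match.groups()) if match else []

print(split_by_specials(r'.+abcd[0-9]+\.mp3', 'Aabcd09.mp3'))     # ['A', '09']
print(split_by_specials(r'.+\..+_[0-9]+\.mp3', 'A.abcd_09.mp3'))  # ['A', 'abcd', '09']
```

Patterns that already contain groups, alternation, or counted quantifiers such as {2,3} would need a real parser.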

You can try this:
import re
s = ['Aabcd09.mp3', 'A.abcd_09.mp3']
new_s = [re.findall(r'(?<=^)[a-zA-Z]|(?<=\.)[a-zA-Z]+(?=_)|\d+(?=\.mp3)', i) for i in s]
Output:
[['A', '09'], ['A', 'abcd', '09']]

Related

How to split string at any number followed by a period instead of a fixed delimiter

input:
string="1.Adam-Lee-Dotnet-9191919191-AdamLee#gmail.com-London-UK-Oracle-Banking2.Peter-Smith-Salesforce-9222291910-PeterSmith21#gmail.com-Mumbai-INDIA-Oracle-Engineering3.Harrison-Lu-Java-9223391910-HarrisonLu#gmail.com-Mumbai-INDIA-Samsung-Engineering4.Andrew-Joseph-Javascript-9200091910-AndrewJoseph#gmail.com-Toronto-CANADA-Dell-Engineering5.Larry-Ken-SQL-8880091910-LarryKen#gmail.com-Newyork-USA-HP-Management"
expected output:
[
"1.Adam-Lee-Dotnet-9191919191-AdamLee#gmail.com-London-UK-Oracle-Banking",
"2.Peter-Smith-Salesforce-9222291910-PeterSmith21#gmail.com-Mumbai-INDIA-Oracle-Engineering",
...
]
Attempt: I have tried using a string.split(range(0,5)+"."). What would be the best way to do this?
I don't usually reach for regular expressions first, but this cries out for re.split.
parts = re.split(r'(\d\.)', string)
This does need a bit of post-processing. It creates:
['', '1.', 'Adam-Lee-Dotnet-9191919191-AdamLee#gmail.com-London-UK-Oracle-Banking', '2.', 'Peter-Smith-Salesforce-9222291910-PeterSmith21#gmail.com-Mumbai-INDIA-Oracle-Engineering', ...
So you'll need to combine every other element.
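One possible way to do that post-processing, shown as a sketch on a shortened sample string (the sample and the name records are assumptions for illustration):

```python
import re

# Shortened stand-in for the input string in the question.
string = "1.Adam-Lee2.Peter-Smith3.Harrison-Lu"

parts = re.split(r'(\d\.)', string)
# parts == ['', '1.', 'Adam-Lee', '2.', 'Peter-Smith', '3.', 'Harrison-Lu']

# Skip the leading empty string, then glue each delimiter to the text after it.
records = [num + body for num, body in zip(parts[1::2], parts[2::2])]
print(records)  # ['1.Adam-Lee', '2.Peter-Smith', '3.Harrison-Lu']
```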
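You could split using a regex with lookaround assertions: (?=\d+\.) asserts one or more digits followed by a dot to the right, and (?<!^) asserts that the position is not the start of the string. Splitting on a pattern that can match an empty string requires Python 3.7 or later. With numbers of two or more digits, such as 10., the lookahead also succeeds between the digits, so the pattern as written suits single-digit numbering like the sample.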
(?<!^)(?=\d+\.)
import re
pattern = r"(?<!^)(?=\d+\.)"
string="1.Adam-Lee-Dotnet-9191919191-AdamLee#gmail.com-London-UK-Oracle-Banking2.Peter-Smith-Salesforce-9222291910-PeterSmith21#gmail.com-Mumbai-INDIA-Oracle-Engineering3.Harrison-Lu-Java-9223391910-HarrisonLu#gmail.com-Mumbai-INDIA-Samsung-Engineering4.Andrew-Joseph-Javascript-9200091910-AndrewJoseph#gmail.com-Toronto-CANADA-Dell-Engineering5.Larry-Ken-SQL-8880091910-LarryKen#gmail.com-Newyork-USA-HP-Management"
res = re.split(pattern, string)
print(res)
Output
[
'1.Adam-Lee-Dotnet-9191919191-AdamLee#gmail.com-London-UK-Oracle-Banking',
'2.Peter-Smith-Salesforce-9222291910-PeterSmith21#gmail.com-Mumbai-INDIA-Oracle-Engineering',
'3.Harrison-Lu-Java-9223391910-HarrisonLu#gmail.com-Mumbai-INDIA-Samsung-Engineering',
'4.Andrew-Joseph-Javascript-9200091910-AndrewJoseph#gmail.com-Toronto-CANADA-Dell-Engineering',
'5.Larry-Ken-SQL-8880091910-LarryKen#gmail.com-Newyork-USA-HP-Management'
]
Or instead of splitting, you could also use a pattern to match 1 or more digits followed by a dot, and then match until the first occurrence of the same pattern or the end of the string.
\d+\..*?(?=\d+\.|$)
import re
pattern = r"\d+\..*?(?=\d+\.|$)"
string="1.Adam-Lee-Dotnet-9191919191-AdamLee#gmail.com-London-UK-Oracle-Banking2.Peter-Smith-Salesforce-9222291910-PeterSmith21#gmail.com-Mumbai-INDIA-Oracle-Engineering3.Harrison-Lu-Java-9223391910-HarrisonLu#gmail.com-Mumbai-INDIA-Samsung-Engineering4.Andrew-Joseph-Javascript-9200091910-AndrewJoseph#gmail.com-Toronto-CANADA-Dell-Engineering5.Larry-Ken-SQL-8880091910-LarryKen#gmail.com-Newyork-USA-HP-Management"
res = re.findall(pattern, string)

Regex expression for a given string

I have a small issue I am running into. I need a regular expression that splits a given string into numbers, chunks of characters inside square brackets, and any remaining plain text, each as a separate item.
For example, if I have a string that resembles
s = 2[abc]3[cd]ef
I need a list lst = ['2','abc','3','cd','ef']
This is the code I have so far:
import re
s = "2[abc]3[cd]ef"
s_final = ""
res = re.findall("(\d+)\[([^[\]]*)\]", s)
print(res)
This is outputting a list of tuples that looks like this.
[('2', 'abc'), ('3', 'cd')]
I am very new to regular expression and learning.. Sorry if this is an easy one.
Thanks!
The immediate fix is getting rid of the capturing groups and using alternation to match either digits or chars other than square bracket chars:
import re
s = "2[abc]3[cd]ef"
res = re.findall(r"\d+|[^][]+", s)
print(res)
# => ['2', 'abc', '3', 'cd', 'ef']
Details:
\d+ - one or more digits
| - or
[^][]+ - one or more chars other than [ and ]
Other solutions that might help are:
re.findall(r'\w+', s)
re.findall(r'\d+|[^\W\d_]+', s)
where \w+ matches one or more word characters (letters, digits, and the underscore) and [^\W\d_]+ matches one or more letters only, including non-ASCII letters.
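The two alternatives only differ when letters, digits, and underscores sit next to each other. A small sketch on a made-up input (the string sample is an assumption, not from the question):

```python
import re

sample = "2[ab1_c]3"

word_runs = re.findall(r'\w+', sample)
typed_runs = re.findall(r'\d+|[^\W\d_]+', sample)

print(word_runs)   # ['2', 'ab1_c', '3']
print(typed_runs)  # ['2', 'ab', '1', 'c', '3']
```

\w+ keeps mixed runs together, while the second pattern separates digits from letters and drops underscores.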
Don't look for one regex that matches every part of the string at once; use a regex that matches each block. \w (meaning [a-zA-Z0-9_] for ASCII text) fits well:
import re
s = "2[abc]3[cd]ef"
print(re.findall(r"\w+", s)) # ['2', 'abc', '3', 'cd', 'ef']
Or split on brackets:
print(re.split(r"[\[\]]", s)) # ['2', 'abc', '3', 'cd', 'ef']
Regular expressions are meant for regular patterns, and your string is irregular.
Regexes are mostly used to find a specific pattern in a long text, to validate text, or to extract things from text.
For example, to find a phone number in a string I would use a regex, but to build a calculator that needs to extract operators and digits I would rather write Python code for it.
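As a rough illustration of the plain-Python route, here is one way to tokenize the string from the question without a regex. The helper names classify and tokenize are hypothetical, and the approach assumes brackets only act as separators:

```python
from itertools import groupby

def classify(ch):
    """Label a character so that consecutive characters of one kind group together."""
    if ch in "[]":
        return "bracket"
    if ch.isdigit():
        return "digit"
    return "text"

def tokenize(s):
    # groupby yields runs of characters sharing the same label; brackets are dropped.
    return ["".join(run) for kind, run in groupby(s, key=classify) if kind != "bracket"]

print(tokenize("2[abc]3[cd]ef"))  # ['2', 'abc', '3', 'cd', 'ef']
```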

how to make a list in python from a string and using regular expression [duplicate]

I have a sample string <alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card] ...>, created=1324336085, description='Customer for My Test App', livemode=False>
I only want the value cus_Y4o9qMEZAugtnW and NOT card (which is inside another [])
How could I do it in easiest possible way in Python?
Maybe by using RegEx (which I am not good at)?
How about:
import re
s = "alpha.Customer[cus_Y4o9qMEZAugtnW] ..."
m = re.search(r"\[([A-Za-z0-9_]+)\]", s)
print(m.group(1))
For me this prints:
cus_Y4o9qMEZAugtnW
Note that the call to re.search(...) finds the first match to the regular expression, so it doesn't find the [card] unless you repeat the search a second time.
Edit: The regular expression here is a python raw string literal, which basically means the backslashes are not treated as special characters and are passed through to the re.search() method unchanged. The parts of the regular expression are:
\[ matches a literal [ character
( begins a new group
[A-Za-z0-9_] is a character set matching any letter (capital or lower case), digit or underscore
+ matches the preceding element (the character set) one or more times.
) ends the group
\] matches a literal ] character
Edit: As D K has pointed out, the regular expression could be simplified to:
m = re.search(r"\[(\w+)\]", s)
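since \w is a special sequence that means the same thing as [a-zA-Z0-9_] for ASCII text. In Python 3, \w in a str pattern also matches non-ASCII word characters unless re.ASCII is set; in Python 2 this depended on the re.LOCALE and re.UNICODE settings.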
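How far \w reaches can be checked directly. A short sketch for Python 3, using a made-up string (text is an assumption for illustration):

```python
import re

text = "[café] [plain_1]"

unicode_hits = re.findall(r"\[(\w+)\]", text)          # default for str patterns
ascii_hits = re.findall(r"\[(\w+)\]", text, re.ASCII)  # \w limited to [a-zA-Z0-9_]

print(unicode_hits)  # ['café', 'plain_1']
print(ascii_hits)    # ['plain_1']
```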
You could use str.split to do this.
s = "<alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card]\
...>, created=1324336085, description='Customer for My Test App',\
livemode=False>"
val = s.split('[', 1)[1].split(']')[0]
Then we have:
>>> val
'cus_Y4o9qMEZAugtnW'
This should do the job:
re.match(r"[^[]*\[([^]]*)\]", yourstring).groups()[0]
your_string = "lnfgbdgfi343456dsfidf[my data] ljfbgns47647jfbgfjbgskj"
your_string[your_string.find("[")+1 : your_string.find("]")]
courtesy: Regular expression to return text between parenthesis
You can also use
re.findall(r"\[([A-Za-z0-9_]+)\]", string)
if there are many occurrences that you would like to find.
See also for more info:
How can I find all matches to a regular expression in Python?
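For comparison with re.search, a sketch of re.findall on a shortened version of the string from the question:

```python
import re

s = "<alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card] ...>"

all_ids = re.findall(r"\[([A-Za-z0-9_]+)\]", s)
print(all_ids)     # ['cus_Y4o9qMEZAugtnW', 'card']
print(all_ids[0])  # cus_Y4o9qMEZAugtnW
```

Because findall returns every bracketed value in order, the first element is the one the question asks for.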
You can use
import re
s = re.search(r"\[.*?]", string)
if s:
    print(s.group(0))
How about this? Example illustrated using a file:
import re
with open('abc.log', 'r') as f:
    for line in f:
        m = re.search(r"\[(.*?)\]", line)
        if m:  # skip lines that contain no [...] block
            print(m.group(1))
Hope this helps:
Magic regex : \[(.*?)\]
Explanation:
\[ : [ is a meta char and needs to be escaped if you want to match it literally.
(.*?) : match everything in a non-greedy way and capture it.
\] : ] is a meta char and needs to be escaped if you want to match it literally.
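The effect of the non-greedy ? is easiest to see on a line with more than one bracketed block. A small sketch with a made-up input (line is an assumption):

```python
import re

line = "a[x] b[y]"

greedy = re.search(r"\[(.*)\]", line).group(1)   # runs to the last ]
lazy = re.search(r"\[(.*?)\]", line).group(1)    # stops at the first ]

print(greedy)  # x] b[y
print(lazy)    # x
```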
This snippet should work too, but it will return any text enclosed within "[]"
re.findall(r"\[([a-zA-Z0-9 ._]*)\]", your_text)

Add [] around numbers in strings

I would like to add [] around any sequence of digits in a string, e.g.
"pixel1blue pin10off output2high foo9182bar"
should convert to
"pixel[1]blue pin[10]off output[2]high foo[9182]bar"
I feel there must be a simple way, but it's eluding me :(
Yes, there is a simple way, using re.sub():
result = re.sub(r'(\d+)', r'[\1]', inputstring)
Here \d matches a digit, \d+ matches 1 or more digits. The (...) around that pattern groups the match so we can refer to it in the second argument, the replacement pattern. That pattern simply replaces the matched digits with [...] around the group.
Note that I used r'..' raw string literals; if you don't you'd have to double all the \ backslashes; see the Backslash Plague section of the Python Regex HOWTO.
Demo:
>>> import re
>>> inputstring = "pixel1blue pin10off output2high foo9182bar"
>>> re.sub(r'(\d+)', r'[\1]', inputstring)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar'
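To make the raw-string note concrete, here is a sketch comparing the raw form with the doubled-backslash form, and showing what a non-raw '\1' actually contains:

```python
import re

inputstring = "pixel1blue pin10off"

raw = re.sub(r'(\d+)', r'[\1]', inputstring)
doubled = re.sub('(\\d+)', '[\\1]', inputstring)  # same patterns, backslashes doubled

print(raw)             # pixel[1]blue pin[10]off
print(raw == doubled)  # True

# Without the r prefix, '\1' is an octal escape for the control character chr(1),
# so the replacement would no longer refer to group 1.
print(repr('[\1]'))    # '[\x01]'
```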
You can use re.sub :
>>> s="pixel1blue pin10off output2high foo9182bar"
>>> import re
>>> re.sub(r'(\d+)',r'[\1]',s)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar'
Here the (\d+) will match any combinations of digits and re.sub function will replace it with the first group match within brackets r'[\1]'.
You can start here to learn regular expression http://www.regular-expressions.info/

Exclude matched string python re.findall

I am using python's re.findall method to find occurrence of certain string value in Input string.
For example, when searching the string 'ABCdef', I have two requirements:
1. Find a substring that starts with a single capital letter followed by lowercase letters.
2. After step 1, find a substring that consists entirely of capital letters.
For example, the input strings and expected outputs are:
'USA' -- output: ['USA']
'BObama' -- output: ['B', 'Obama']
'Institute20CSE' -- output: ['Institute', '20', 'CSE']
So My expectation from
>>> matched_value_list = re.findall ( '[A-Z][a-z]+|[A-Z]+' , 'ABCdef' )
is to return ['AB', 'Cdef'].
But that does not seem to happen. What I get is ['ABC'], where the second part of the regex matches as much of the string as it can.
So is there any way to exclude what has already been matched, so that once 'Cdef' is matched by '[A-Z][a-z]+', the second part of the regex (i.e. '[A-Z]+') only matches the remaining string 'AB'?
First you need to match AB, which is either followed by an uppercase letter and then a lowercase letter, or is at the end of the string. For that you can use a look-ahead.
Then you need to match an uppercase letter C, followed by one or more lowercase letters def.
So, you can use this pattern:
>>> s = "ABCdef"
>>> re.findall("([A-Z]+(?=[A-Z][a-z]|$)|[A-Z][a-z]+)", s)
['AB', 'Cdef']
>>> re.findall("([A-Z]+(?=[A-Z][a-z]|$)|[A-Z][a-z]+)", 'MumABXYZCdefXYZAbc')
['Mum', 'ABXYZ', 'Cdef', 'XYZ', 'Abc']
As pointed out in a comment by @sotapme, you can also modify the above regex to:
"([A-Z]+(?=[A-Z]|$)|[A-Z][a-z]+|\d+)"
\d+ was added because one of your examples also needs to match digits. The [a-z] part was removed from the look-ahead. That works because the + quantifier on the outer [A-Z] is greedy by default: it first matches the longest possible run of uppercase letters and backtracks only as far as needed, so it stops just before the last uppercase letter when a lowercase letter follows.
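A quick check of the modified pattern against the examples from the question (the outer capturing group is dropped here, which does not change what findall returns; the names pattern and results are just for illustration):

```python
import re

pattern = re.compile(r"[A-Z]+(?=[A-Z]|$)|[A-Z][a-z]+|\d+")

results = {text: pattern.findall(text)
           for text in ("USA", "BObama", "Institute20CSE", "ABCdef")}

for text, found in results.items():
    print(text, found)
# USA ['USA']
# BObama ['B', 'Obama']
# Institute20CSE ['Institute', '20', 'CSE']
# ABCdef ['AB', 'Cdef']
```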
You can use this regex
[A-Z][a-zA-Z]*?(?=[A-Z][a-z]|[^a-zA-Z]|$)
