RegEx or Pygrok Pattern match - python

I have text like this:
"visa code: ab c master number: efg discover: i j k"
Output should be like this:
abc, efg, ijk
Is there a way to use a Grok pattern or a regex to get the 3 characters after each ":" (ignoring spaces)?

You can start with this:
>>> import re
>>> p = re.compile(r"\b((?:\w\s*){2}\w)\b")
>>> re.findall(p, "visa code: ab c master number: efg discover: i j k")
['ab c', 'efg', 'i j k']
But you have more work to do. For example, nobody can guess what you mean - exactly - by "characters".
Beyond that, pattern matching systems match strings, but do not convert them. You'll have to remove spaces you don't want via some other means (which should be easy).
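As a follow-up to the points above, here is a minimal sketch that anchors the match on the colon and removes the spaces in a second step. It assumes that "characters" means regex word characters (\w: letters, digits, underscore), which the question does not confirm. The names pattern, raw and compact are only illustrative.

```python
import re

text = "visa code: ab c master number: efg discover: i j k"

# Assumption: a "character" is a \w character (letter, digit or underscore).
# After each colon, skip optional whitespace, then capture three word
# characters that may be separated by whitespace.
pattern = re.compile(r":\s*((?:\w\s*){2}\w)")

raw = pattern.findall(text)                  # ['ab c', 'efg', 'i j k']
compact = ["".join(m.split()) for m in raw]  # drop the inner whitespace

print(", ".join(compact))  # abc, efg, ijk
```

The regex only finds the spans; the str.split/join step does the conversion, since a match cannot rewrite the text it matched.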

Related

Python Regex Findall non-greedy

I am relatively new to regex and I seem to be struggling to understand the greedy vs non-greedy search (if that is indeed the issue here). Let's say I have a simple text such as this:
# numbers: 4 A 3 B
My goal would be to run a findall to get something like the following output:
['# numbers:', '4 A 3 B', ' 4 A', ' 3 B']
So if I use the following regex with findall, I would expect it to work:
matches = re.findall(r"(# numbers:)(((?:\s\d)(?:\s\D))*)", "# numbers: 4 A 3 B")
However, the actual output is this:
[('# numbers:', ' 4 A 3 B', ' 3 B')]
Can someone explain why the group ((?:\s\d)(?:\s\D)) only matches ' 3 B' and not also ' 4 A'? I assume it has something to do with the greedy vs. non-greedy behaviour of *. Is that true? And if so, could you explain how to solve this issue?
Thanks in advance!
I would use re.findall here, twice. First, extract the digit/non digit text series, then use re.findall a second time to find the tuples:
import re
inp = "# numbers: 4 A 3 B"
text = re.findall(r'^# numbers:\s+(.*)$', inp)[0]
matches = re.findall(r'(\d+)\s+(\D+)', text)
print(matches) # [('4', 'A '), ('3', 'B')]
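On the "why" part of the question: this is not a greedy vs. non-greedy issue. In Python's re module, a capturing group that sits inside a repetition is overwritten on each pass, so only the last repetition is reported. The sketch below shows that behaviour on the question's input and then recovers every repetition with a second scan. The pattern is a slightly simplified version of the one in the question, not the asker's exact regex.

```python
import re

inp = "# numbers: 4 A 3 B"

# The inner group (\s\d\s\D) is repeated by *, so its captured value is
# overwritten on every iteration and only the last one (' 3 B') survives.
m = re.match(r"(# numbers:)((\s\d\s\D)*)", inp)
print(m.groups())  # ('# numbers:', ' 4 A 3 B', ' 3 B')

# To get every repetition, scan the captured tail with a second pattern.
pairs = re.findall(r"\s\d\s\D", m.group(2))
print(pairs)  # [' 4 A', ' 3 B']
```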

regex the text string in python and split into arrays

I need to split a text like:
//string
s = CS -135IntrotoComputingCS -154IntroToWonderLand...
into an array like
inputarray[0]= CS -135 Intro to computing
inputarray[1]= CS -154 Intro to WonderLand
.
.
.
and so on;
I am trying something like this:
re.compile("[CS]+\s").split(s)
But it doesn't split at all, even if I try something like
re.compile("[CS]").split(s)
Can anyone shed some light on this?
You may use findall with a lookahead regex as this:
>>> import re
>>> s = 'CS -135IntrotoComputingCS -154IntroToWonderLand'
>>> print(re.findall(r'.+?(?=CS|$)', s))
['CS -135IntrotoComputing', 'CS -154IntroToWonderLand']
Regex: .+?(?=CS|$) lazily matches one or more characters up to, but not including, the next CS or the end of the string.
Although findall is more straightforward, finditer can also be used here:
import re
s = 'CS -135IntrotoComputingCS -154IntroToWonderLand'
x = [i.start() for i in re.finditer('CS ', s)]  # starting positions of each 'CS '
count = 0
l = []
while count + 1 < len(x):
    l.append(s[x[count]:x[count + 1]])  # slice from one 'CS ' to the next
    count += 1
l.append(s[x[count]:])  # the last chunk runs to the end of the string
print(l) # ['CS -135IntrotoComputing', 'CS -154IntroToWonderLand']
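To get closer to the "CS -135 Intro to computing" layout asked for, a small variation on the lookahead idea can capture the course code and the title separately. This is only a sketch: it assumes every entry starts with 'CS -' followed by digits, and it cannot re-insert the spaces inside the title, because the input carries no information about where they were.

```python
import re

s = "CS -135IntrotoComputingCS -154IntroToWonderLand"

# Group 1: the course code ('CS -' plus digits).
# Group 2: the title, matched lazily up to the next 'CS ' or the end.
courses = re.findall(r"(CS -\d+)(.+?)(?=CS |$)", s)

# Join code and title with a single space.
inputarray = [f"{code} {title}" for code, title in courses]
print(inputarray)
# ['CS -135 IntrotoComputing', 'CS -154 IntroToWonderLand']
```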

Regex for Matching Apostrophe 's' words

I'm trying to create a regex to match a word that has or doesn't have an apostrophe 's' at the end. In the example below, I'd like to replace the apostrophe with a regex that matches either an apostrophe 's' or just an 's'.
Philip K Dick's Electric Dreams
Philip K Dicks Electric Dreams
What I am trying so far is below, but I'm not getting it to match correctly. Any help here is great. Thanks!
Philip K Dick[\'[a-z]|[a-z]] Electric Dreams
Just set the apostrophe as optional in the regex pattern.
Like this: [a-zA-Z]+\'?s
For example, using your test strings:
>>> import re
>>> s1 = "Philip K Dick's Electric Dreams"
>>> s2 = "Philip K Dicks Electric Dreams"
>>> re.findall("[a-zA-Z]+\'?s", s1)
["Dick's", 'Dreams']
>>> re.findall("[a-zA-Z]+\'?s", s2)
['Dicks', 'Dreams']
You can use the regex (\w+)'s to represent any letters followed by 's. Then you can substitute back that word followed by just s.
>>> s = "Philip K Dick's Electric Dreams"
>>> re.sub(r"(\w+)'s", r'\1s', s)
'Philip K Dicks Electric Dreams'
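Applied to the full title from the question, the optional apostrophe can be dropped straight into the literal text. A minimal sketch follows; the variable names title and results are just for illustration.

```python
import re

# '? makes the apostrophe before the final 's' optional.
title = re.compile(r"Philip K Dick'?s Electric Dreams")

candidates = [
    "Philip K Dick's Electric Dreams",  # with apostrophe
    "Philip K Dicks Electric Dreams",   # without apostrophe
    "Philip K Dick Electric Dreams",    # no 's' at all
]
results = [bool(title.fullmatch(c)) for c in candidates]
print(results)  # [True, True, False]
```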

remove multiple substrings inside a string

Here's hoping somebody can shed some light on this question because it has me stumped. I have a string that looks like this:
s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
I want this result:
abcdef ghijk lmnop qrs tuv wxyz 0123456789
Having reviewed numerous questions and answers here, the closest I have come to a solution is:
s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
s = re.sub('\[\[.*?\|', '', s)
s = re.sub('[\]\]]', '', s)
--> abcdef ghijk lmnop wxyz 0123456789
Since not every substring within double brackets contains a pipe, the re.sub removes everything from '[[' to next '|' instead of checking within each set of double brackets.
Any assistance would be most appreciated.
What about this:
In [187]: re.sub(r'([\[|\]])|((?<=\[)\w+\s+\w+(?=\|))', '', s)
Out[187]: 'abcdef ghijk lmnop qrs tuv wxyz 0123456789'
I propose the opposite approach: instead of removing what you don't want, capture the patterns you do want. I think this makes the intent of the code clearer.
There are two patterns you want to capture:
Case: words outside [[...]]
Pattern: any word either preceded by ']] ' or followed by ' [['.
Regex: (?<=\]\]\s)\w+|\w+(?=\s\[\[)
Case: words inside [[...]]
Pattern: any word followed by ']]'.
Regex: \w+(?=\]\])
Example code:
#!/usr/bin/env python
import re

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789 "

# words after ']] ', words before ' [[', or words directly before ']]'
p = re.compile(r'(?<=\]\]\s)\w+|\w+(?=\s\[\[)|\w+(?=\]\])')
print(p.findall(s))
Result:
['abcdef', 'ghijk', 'lmnop', 'qrs', 'tuv', 'wxyz', '0123456789']
>>> import re
>>> s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
>>> re.sub(r'(\[\[[^]]+?\|)|([\[\]])', '', s)
'abcdef ghijk lmnop qrs tuv wxyz 0123456789'
This searches for and removes the following two items:
Two opening brackets followed by a bunch of stuff that isn't a closing bracket followed by a pipe.
Opening or closing brackets.
As a general approach with the built-in re module, you can use the following regex, which relies on look-arounds:
(?<!\[\[)\b([\w]+)\b(?!\|)|(?<=\[\[)[^|]*(?=\]\])
You can use re.finditer to get the desired result:
>>> g=re.finditer(r'(?<!\[\[)\b([\w]+)\b(?!\|)|(?<=\[\[)[^|]*(?=\]\])',s)
>>> [j.group() for j in g]
['abcdef', 'ghijk', 'lmnop', 'qrs', 'tuv', 'wxyz', '0123456789']
The regex has two parts. The first is:
(?<!\[\[)\b([\w]+)\b(?!\|)
which matches any run of word characters that is not preceded by [[ and not followed by |.
The second part is:
(?<=\[\[)[^|]*(?=\]\])
which matches anything between double brackets that contains no |.
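Another option, sketched here under the assumption that brackets are never nested, is a single re.sub that treats the "prefix|" part as optional and keeps only the text after it. The pattern is an illustration, not taken from any of the answers above.

```python
import re

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"

# \[\[              opening double bracket
# (?:[^\]|]*\|)?    optional "prefix|" that may not cross a ']' or a '|'
# ([^\]]*)          the text to keep (group 1)
# \]\]              closing double bracket
result = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", s)
print(result)  # abcdef ghijk lmnop qrs tuv wxyz 0123456789
```

Because the optional prefix cannot cross a closing bracket, a pipe-less pair such as [[qrs]] is handled on its own instead of being swallowed up to the next '|'.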

Python regex to remove all words which contains number

I am trying to write a Python regex that removes every word in a string that contains a number.
For example:
in = "ABCD abcd AB55 55CD A55D 5555"
out = "ABCD abcd"
The regex to delete the digits is trivial:
print(re.sub(r'[0-9]','','Paris a55a b55 55c 555 aaa'))
But I don't know how to delete the entire word and not just the digits.
Could you help me please?
Do you need a regex? You can do something like
>>> words = "ABCD abcd AB55 55CD A55D 5555"
>>> ' '.join(s for s in words.split() if not any(c.isdigit() for c in s))
'ABCD abcd'
If you really want to use regex, you can try \w*\d\w*:
>>> re.sub(r'\w*\d\w*', '', words).strip()
'ABCD abcd'
Here's my approach:
>>> import re
>>> s = "ABCD abcd AB55 55CD A55D 5555"
>>> re.sub("\S*\d\S*", "", s).strip()
'ABCD abcd'
The snippet below removes only the words that mix letters and digits:
import re
string = '1i am20 years old, weight is in between 65-70 kg '
string = re.sub(r"[A-Za-z]+\d+|\d+[A-Za-z]+", '', string).strip()
print(string)
OUTPUT:
years old, weight is in between 65-70 kg
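One thing the regex answers above leave open is the leftover whitespace when a removed word sits in the middle of the string, since .strip() only trims the ends. A possible sketch: replace each digit-bearing token with a space and then collapse the whitespace. The helper name drop_words_with_digits is made up for this example, and "word" is assumed to mean a whitespace-delimited token.

```python
import re

def drop_words_with_digits(text):
    """Remove whitespace-delimited tokens that contain at least one digit."""
    # Replace each token containing a digit with a space, then let
    # split()/join() collapse the resulting runs of whitespace.
    return " ".join(re.sub(r"\S*\d\S*", " ", text).split())

print(drop_words_with_digits("ABCD abcd AB55 55CD A55D 5555"))  # ABCD abcd
print(drop_words_with_digits("Paris a55a b55 55c 555 aaa"))     # Paris aaa
```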
