I have text like this
Example:
"visa code: ab c master number: efg discover: i j k"
Output should be like this:
abc, efg, ijk
Is there a way to use a Grok pattern or a regex to get the 3 characters after each ":" (ignoring spaces)?
You can start with this:
>>> import re
>>> p = re.compile(r"\b((?:\w\s*){2}\w)\b")
>>> re.findall(p, "visa code: ab c master number: efg discover: i j k")
['ab c', 'efg', 'i j k']
But you have more work to do. For example, nobody can guess what you mean - exactly - by "characters".
Beyond that, pattern matching systems match strings, but do not convert them. You'll have to remove spaces you don't want via some other means (which should be easy).
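The two steps described above (match, then strip the unwanted spaces) could be sketched as follows. This is only a sketch under the assumption that "characters" means regex word characters (\w) and that each code directly follows a colon; the names pattern and codes are illustrative.

```python
import re

text = "visa code: ab c master number: efg discover: i j k"

# After each colon, capture three word characters, allowing whitespace between them.
pattern = re.compile(r":\s*((?:\w\s*){2}\w)")

# findall returns only the captured group; remove the inner whitespace afterwards.
codes = [re.sub(r"\s+", "", raw) for raw in pattern.findall(text)]
print(", ".join(codes))  # abc, efg, ijk
```

Anchoring on the colon instead of on word boundaries means the pattern takes the first three word characters after each ":" even when more follow.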
I am relatively new to regex and I seem to be struggling to understand the greedy vs non-greedy search (if that is indeed the issue here). Let's say I have a simple text such as this:
# numbers: 4 A 3 B
My goal would be to run a findall to get something like the following output:
['# numbers:', '4 A 3 B', ' 4 A', ' 3 B']
So if I use the following regex with findall, I would expect it to work:
matches = re.findall(r"(# numbers:)(((?:\s\d)(?:\s\D))*)", "# numbers: 4 A 3 B")
However, the actual output is this:
[('# numbers:', ' 4 A 3 B', ' 3 B')]
Can someone explain why the group ((?:\s\d)(?:\s\D)) captures only ' 3 B' and not also ' 4 A'? I assume it has something to do with greedy vs. non-greedy matching of *. Is that true? And if so, could you explain how to solve this issue?
Thanks in advance!
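Greediness is not the issue: when a capturing group is repeated, it keeps only the text from its last iteration, which is why the inner group ends up holding ' 3 B'. I would use re.findall here, twice. First, extract the digit/non-digit text series, then use re.findall a second time to find the tuples:
import re
inp = "# numbers: 4 A 3 B"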
text = re.findall(r'^# numbers:\s+(.*)$', inp)[0]
matches = re.findall(r'(\d+)\s+(\D+)', text)
print(matches) # [('4', 'A '), ('3', 'B')]
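If the list from the question is what you are after, the same two-pass idea can be sketched like this. It assumes the input always starts with "# numbers:" and that each pair is a single digit followed by a single non-digit; header, body and pairs are illustrative names.

```python
import re

inp = "# numbers: 4 A 3 B"

# First pass: capture the header and the whole run of " digit non-digit" pairs.
m = re.match(r"(# numbers:)((?:\s\d\s\D)*)", inp)
header, body = m.groups()

# Second pass: findall walks the run and returns every pair, not just the last one.
pairs = re.findall(r"\s\d\s\D", body)

result = [header, body, *pairs]
print(result)  # ['# numbers:', ' 4 A 3 B', ' 4 A', ' 3 B']
```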
I need to split a text like:
//string
s = CS -135IntrotoComputingCS -154IntroToWonderLand...
in array like
inputarray[0]= CS -135 Intro to computing
inputarray[1]= CS -154 Intro to WonderLand
.
.
.
and so on;
I am trying something like this:
re.compile("[CS]+\s").split(s)
But it doesn't split at all, even if I try something like
re.compile("[CS]").split(s)
Can anyone shed some light on this?
You may use findall with a lookahead regex like this:
>>> import re
>>> s = 'CS -135IntrotoComputingCS -154IntroToWonderLand'
>>> print(re.findall(r'.+?(?=CS|$)', s))
['CS -135IntrotoComputing', 'CS -154IntroToWonderLand']
Regex: .+?(?=CS|$) lazily matches one or more characters up to a position that is followed by CS or by the end of the line.
Although findall is more straightforward, finditer can also be used here:
import re
s = 'CS -135IntrotoComputingCS -154IntroToWonderLand'
x=[i.start() for i in re.finditer('CS ',s)] # to get the starting positions of 'CS'
count=0
l=[]
while count+1<len(x):
    l.append(s[x[count]:x[count+1]]) # slice from one 'CS' to the next
    count+=1
l.append(s[x[count]:]) # the last course runs to the end of the string
print(l) # ['CS -135IntrotoComputing', 'CS -154IntroToWonderLand']
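Since the original attempt used split, here is a sketch of how re.split can be made to work with a lookahead. It assumes Python 3.7 or later, where re.split accepts patterns that match an empty string, and that every course starts with "CS " (with a trailing space); courses is an illustrative name.

```python
import re

s = 'CS -135IntrotoComputingCS -154IntroToWonderLand'

# Split at every position that is followed by 'CS ' (a zero-width split),
# and drop any empty string produced by the match at position 0.
courses = [part for part in re.split(r'(?=CS )', s) if part]
print(courses)  # ['CS -135IntrotoComputing', 'CS -154IntroToWonderLand']
```

Because the lookahead consumes nothing, the 'CS ' prefix stays attached to each piece instead of being thrown away as a delimiter.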
I'm trying to create a regex to match a word that may or may not have an apostrophe before the 's' at the end. In the example below, I'd like to replace the apostrophe with a pattern that matches either an apostrophe followed by 's' or just an 's'.
Philip K Dick's Electric Dreams
Philip K Dicks Electric Dreams
What I am trying so far is below, but I'm not getting it to match correctly. Any help here is great. Thanks!
Philip K Dick[\'[a-z]|[a-z]] Electric Dreams
Just set the apostrophe as optional in the regex pattern.
Like this: [a-zA-Z]+\'?s
For example, using your test strings:
import re
s1 = "Philip K Dick's Electric Dreams"
s2 = "Philip K Dicks Electric Dreams"
>>> re.findall("[a-zA-Z]+\'?s", s1)
["Dick's", 'Dreams']
>>> re.findall("[a-zA-Z]+\'?s", s2)
['Dicks', 'Dreams']
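Applied to the full title from the question, the optional apostrophe might look like the sketch below. The compiled pattern and the name title are illustrative, and fullmatch is used on the assumption that the whole string should be the title.

```python
import re

# The apostrophe before the final "s" is optional ('?), so both spellings match.
title = re.compile(r"Philip K Dick'?s Electric Dreams")

for candidate in ("Philip K Dick's Electric Dreams",
                  "Philip K Dicks Electric Dreams"):
    print(bool(title.fullmatch(candidate)))  # True for both
```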
You can use the regex (\w+)'s to represent any letters followed by 's. Then you can substitute back that word followed by just s.
>>> s = "Philip K Dick's Electric Dreams"
>>> re.sub(r"(\w+)'s", r'\1s', s)
'Philip K Dicks Electric Dreams'
Here's hoping somebody can shed some light on this question because it has me stumped. I have a string that looks like this:
s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
I want this result:
abcdef ghijk lmnop qrs tuv wxyz 0123456789
Having reviewed numerous questions and answers here, the closest I have come to a solution is:
s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
s = re.sub('\[\[.*?\|', '', s)
s = re.sub('[\]\]]', '', s)
--> abcdef ghijk lmnop wxyz 0123456789
Since not every substring within double brackets contains a pipe, the re.sub removes everything from '[[' to next '|' instead of checking within each set of double brackets.
Any assistance would be most appreciated.
What about this:
In [187]: re.sub(r'([\[|\]])|((?<=\[)\w+\s+\w+(?=\|))', '', s)
Out[187]: 'abcdef ghijk lmnop qrs tuv wxyz 0123456789'
I propose the opposite approach: instead of removing what you don't want, capture the patterns you do want. I think this makes the intent of the code clearer.
There are two patterns you wish to catch:
Case: words outside [[...]]
Pattern: any word that is either preceded by ']] ' or followed by ' [['.
Regex: (?<=\]\]\s)\w+|\w+(?=\s\[\[)
Case: words inside [[...]]
Pattern: any word that is followed by ']]'.
Regex: \w+(?=\]\])
Example code
#!/usr/bin/env python
import re

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789 "

# words after ']] ', words before ' [[', and words directly before ']]'
p = re.compile(r'(?<=\]\]\s)\w+|\w+(?=\s\[\[)|\w+(?=\]\])')
print(p.findall(s))
Result:
['abcdef', 'ghijk', 'lmnop', 'qrs', 'tuv', 'wxyz', '0123456789']
>>> import re
>>> s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
>>> re.sub(r'(\[\[[^]]+?\|)|([\[\]])', '', s)
'abcdef ghijk lmnop qrs tuv wxyz 0123456789'
This searches for and removes the following two items:
Two opening brackets followed by a bunch of stuff that isn't a closing bracket followed by a pipe.
Opening or closing brackets.
As a more general regex with the built-in re module, you can use the following pattern, which relies on look-arounds:
(?<!\[\[)\b([\w]+)\b(?!\|)|(?<=\[\[)[^|]*(?=\]\])
You can use re.finditer to get the desired result:
>>> g=re.finditer(r'(?<!\[\[)\b([\w]+)\b(?!\|)|(?<=\[\[)[^|]*(?=\]\])',s)
>>> [j.group() for j in g]
['abcdef', 'ghijk', 'lmnop', 'qrs', 'tuv', 'wxyz', '0123456789']
The regex consists of two parts. The first is:
(?<!\[\[)\b([\w]+)\b(?!\|)
which matches any run of word characters that is not preceded by [[ and not followed by |.
The second part is:
(?<=\[\[)[^|]*(?=\]\])
which matches anything between [[ and ]] that contains no |.
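Another way to look at the problem is to treat each [[...]] as one unit with an optional "label|" prefix and replace the whole unit with what follows the pipe. This is a sketch, assuming brackets are never nested and each link holds at most one pipe; link and result are illustrative names.

```python
import re

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"

# Match "[[", an optional "anything|" prefix, the text to keep, then "]]".
# Without a pipe the optional prefix is skipped and the whole content is kept.
link = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")

result = link.sub(r"\1", s)
print(result)  # abcdef ghijk lmnop qrs tuv wxyz 0123456789
```

Because the prefix cannot cross a closing bracket, a link without a pipe can no longer be swallowed up to the pipe of a later link, which was the problem with the two-step re.sub in the question.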
I am trying to make a Python regex which allows me to remove all words of a string that contain a number.
For example:
in = "ABCD abcd AB55 55CD A55D 5555"
out = "ABCD abcd"
The regex to delete just the digits is trivial:
print(re.sub(r'[1-9]','','Paris a55a b55 55c 555 aaa'))
But I don't know how to delete the entire word and not just the number.
Could you help me please?
Do you need a regex? You can do something like
>>> words = "ABCD abcd AB55 55CD A55D 5555"
>>> ' '.join(s for s in words.split() if not any(c.isdigit() for c in s))
'ABCD abcd'
If you really want to use regex, you can try \w*\d\w*:
>>> re.sub(r'\w*\d\w*', '', words).strip()
'ABCD abcd'
Here's my approach:
>>> import re
>>> s = "ABCD abcd AB55 55CD A55D 5555"
>>> re.sub(r"\S*\d\S*", "", s).strip()
'ABCD abcd'
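One thing to watch with the re.sub approaches: strip() only trims the ends, so words removed from the middle of the string leave runs of spaces behind. A possible sketch that also collapses those gaps, assuming words are whitespace-delimited (drop_words_with_digits is a hypothetical helper name):

```python
import re

def drop_words_with_digits(text):
    # Remove every whitespace-delimited token that contains a digit,
    # then split and re-join to collapse the leftover runs of spaces.
    return " ".join(re.sub(r"\S*\d\S*", "", text).split())

print(drop_words_with_digits("Paris a55a b55 55c 555 aaa"))  # Paris aaa
```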
The snippet below removes only the words that mix letters and digits:
import re
string='1i am20 years old, weight is in between 65-70 kg '
string=re.sub(r"[A-Za-z]+\d+|\d+[A-Za-z]+",'',string).strip()
print(string)
OUTPUT:
years old, weight is in between 65-70 kg