Match only list of words in string - python

I have a list of words and am creating a regular expression like so:
((word1)|(word2)|(word3){1,3})
Basically I want to match a string that contains 1 - 3 of those words.
This works, however I want it to match the string only if the string contains words from the regex. For example:
((investment)|(property)|(something)|(else){1,3})
This should match the string investmentproperty but not the string abcinvestmentproperty. Likewise it should match somethinginvestmentproperty because all those words are in the regex.
How do I go about achieving that?
Thanks

Use ^ and $ to anchor the pattern to the beginning and the end of the string you want to match. Also note that you need to wrap the whole group of words in (...) so that {1,3} applies to every alternative rather than only the last one:
^((investment)|(property)|(something)|(else)){1,3}$
Regex101 Example
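As a minimal sketch of the anchored pattern in Python: the word list is taken from the question, while the names WORDS, PATTERN and only_listed_words are made up for illustration. Non-capturing groups are used because the individual captures are not needed here.

```python
import re

# Words from the question; re.escape guards against regex metacharacters.
WORDS = ["investment", "property", "something", "else"]
ALTERNATION = "|".join(re.escape(word) for word in WORDS)

# Equivalent to ^(?:investment|property|something|else){1,3}$
PATTERN = re.compile(r"^(?:" + ALTERNATION + r"){1,3}$")


def only_listed_words(s):
    """Return True if s consists of 1 to 3 of the listed words and nothing else."""
    return PATTERN.match(s) is not None


for candidate in ("investmentproperty",
                  "abcinvestmentproperty",
                  "somethinginvestmentproperty"):
    print(candidate, only_listed_words(candidate))
```

Note that $ also matches just before a trailing newline. If that matters, dropping the anchors and calling PATTERN.fullmatch(s) instead is a stricter option.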

Related

Split a split (regex) in python

I have the string below and am looking for a way to split it so that I consistently end up with the following output:
'1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,234DE 165893440L3.111,005PL 65775644897L1.010,006DE 811506926L3.547,407AT U16235008L-830,008SE U57469158L3.001,30'
['1GB 02060250396L1.060,70',
'2BE 129517720L2.639,40',
'3NL 134187650L4.024,23',
'4DE 165893440L8.111,00',
'5PL 65775644897L3.010,00',
'6DE 811506926L3.547,40',
'7AT U16235008L-830,00',
'8SE U57469158L8.0221,30']
My current approach,
re.split("([0-9][0-9][0-9][A-Z][A-Z])", input), also splits my delimiter, and there is no other split I can use and still remain consistent. Is it possible to split my delimiter as well and assign a part of it "70" to the string in front and a part "2BE" to the following string?
Use re.findall() instead of re.split().
You want to match
a digit \d, followed by
two uppercase letters [A-Z]{2}, followed by
a whitespace character \s, followed by
one or more characters that are not a comma [^,]+, followed by
a comma and two digits ,\d{2}
Try it at regex101
So do:
input_str = '1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,234DE 165893440L3.111,005PL 65775644897L1.010,006DE 811506926L3.547,407AT U16235008L-830,008SE U57469158L3.001,30'
re.findall(r"\d[A-Z]{2}\s[^,]+,\d{2}", input_str)
Which gives
['1GB 02060250396L7.067,70',
'2BE 129517720L6.633,40',
'3NL 134187650L3.824,23',
'4DE 165893440L3.111,00',
'5PL 65775644897L1.010,00',
'6DE 811506926L3.547,40',
'7AT U16235008L-830,00',
'8SE U57469158L3.001,30']
Alternatively, if you don't want to be so specific with your pattern, you could simply use the regex
[^,]+,\d{2}
Try it at regex101
This matches one or more characters that are not a comma, then a single comma, then two digits.
re.findall(r"[^,]+,\d{2}", input_str)
# Output:
['1GB 02060250396L7.067,70',
'2BE 129517720L6.633,40',
'3NL 134187650L3.824,23',
'4DE 165893440L3.111,00',
'5PL 65775644897L1.010,00',
'6DE 811506926L3.547,40',
'7AT U16235008L-830,00',
'8SE U57469158L3.001,30']
Is it possible to split my delimiter as well and assign a part of it "70" to the string in front and a part "2BE" to the following string?
If you must use re.split at all costs, you can exploit a zero-length assertion for this task as follows (splitting on an empty match requires Python 3.7 or later):
import re
text = '1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,234DE 165893440L3.111,005PL 65775644897L1.010,006DE 811506926L3.547,407AT U16235008L-830,008SE U57469158L3.001,30'
parts = re.split(r'(?<=,[0-9][0-9])', text)
print(parts)
output
['1GB 02060250396L7.067,70', '2BE 129517720L6.633,40', '3NL 134187650L3.824,23', '4DE 165893440L3.111,00', '5PL 65775644897L1.010,00', '6DE 811506926L3.547,40', '7AT U16235008L-830,00', '8SE U57469158L3.001,30', '']
Explanation: this particular assertion is a positive lookbehind. It finds every zero-length position that is preceded by a comma and two digits. Note that parts has a superfluous empty str at the end, because the last such position is the end of the string.
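One way to get rid of the trailing empty string is to filter the pieces after splitting. This is only a sketch on a shortened sample (the first three records of the string from the question), and it assumes Python 3.7 or later.

```python
import re

# Shortened sample: the first three records of the string from the question.
text = '1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,23'

# Split at every position preceded by a comma and two digits,
# then discard the empty piece produced at the end of the string.
parts = [part for part in re.split(r'(?<=,[0-9][0-9])', text) if part]
print(parts)
```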

Match sequence of words with regex

I have a list of strings and I want to extract only the item name from each one, including any spaces it contains.
The strings are in a column named 0; the index is just for reference.
For example, from each index line I want the following results:
Index - Expected result
0 - BOV BCONTRA
1 - BF PARAROLE C
2 - CUBINHOS DACE
... and so on.
Notice that in line 25 the desired result is not separated from the preceding numbers by spaces.
There can be a dot . between the words, like in index line 30.
I've tried re.findall(r"\n\d{1,2} \d+(\b\w+\b)") with no success.
Also re.findall(r"\n\d{1,2} \d+( ?\w+)") brings me only the first word, and I want all the words, not only the first one.
Each line starts with a \n character that is not printed in the list.
So basically you need all the uppercase strings in the text.
Try this expression, which gets all the text with or without spaces:
re.findall('[A-Z]+[ A-Z]*', text)
It seems you want [A-Z .]+ bordered by integers, not "words" (represented by r'\w'). For ASCII text, \w is equivalent to [a-zA-Z0-9_].
The regex to use is therefore r'\d+ \d+([A-Z .]+)\d+'.
I don't know what you mean when you say that a newline precedes each line. If you have a string with lines in it, it is perhaps better to split the input into lines with str.splitlines() and then apply the regex to each relevant line with re.match, so that it only matches from the start of the line.
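The line-by-line approach could look roughly like the sketch below. The question's real data is not shown, so the sample lines here are invented purely for illustration (an index, a number, the item name, a trailing number), and the name extract_names is hypothetical.

```python
import re

# Invented sample lines; the original data is not shown in the question.
text = (
    "0 1001 BOV BCONTRA 12\n"
    "1 1002 BF PARAROLE C 7\n"
    "25 1003CUBINHOS DACE 3\n"
    "30 1004 QJ. MINAS 5"
)

ITEM_NAME = re.compile(r"\d+ \d+([A-Z .]+)\d+")


def extract_names(block):
    """Return the item name found at the start of each line, if any."""
    names = []
    for line in block.splitlines():
        match = ITEM_NAME.match(line)  # re.match only matches from the start
        if match:
            # The class includes the space, so trim surrounding blanks.
            names.append(match.group(1).strip())
    return names


print(extract_names(text))
```

The .strip() call is needed because [A-Z .]+ also consumes the spaces around the name.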

regex select sequences that start with specific number

I want to select all character strings that begin with 0
x= '1,1,1075 1,0,39 2,4,1,22409 0,1,1,755,300 0,1,1,755,50'
I have
re.findall(r'\b0\S*', x)
but this returns
['0,39', '0,1,1,755,300', '0,1,1,755,50']
I want
['0,1,1,755,300', '0,1,1,755,50']
The problem is that \b matches the boundaries between digits and commas too. The simplest way might be not to use a regex at all:
thingies = [thingy for thingy in x.split() if thingy.startswith('0')]
Instead of using the boundary \b, which will match between the comma and the number (between any word character [a-zA-Z0-9_] and a non-word character), you will want to match on the start of the string or a whitespace character, like (^|\s).
(^|\s)0\S*
https://regex101.com/r/Mrzs8a/1
This matches the start of the string or a whitespace character preceding the target string. However, the match will also include that whitespace if present, so I would suggest either trimming your matched string or wrapping the latter part in parentheses to make it a group and then taking group 1 from the matches, like:
(?:^|\s)(0\S*)
https://regex101.com/r/Mrzs8a/2
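In Python, re.findall returns the group's text when the pattern contains exactly one capturing group, so the grouped pattern can be used directly. The sketch below also shows a negative lookbehind variant, which is an alternative not taken from the answers above.

```python
import re

x = '1,1,1075 1,0,39 2,4,1,22409 0,1,1,755,300 0,1,1,755,50'

# With exactly one capturing group, findall returns only that group,
# so the leading whitespace is not part of the results.
with_group = re.findall(r'(?:^|\s)(0\S*)', x)

# Alternative (not from the answers): require that the 0 is not
# preceded by a non-whitespace character.
with_lookbehind = re.findall(r'(?<!\S)0\S*', x)

print(with_group)
print(with_lookbehind)
```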

Generate regex for exact words from a list

I am trying to write a regex that can match any of the following or similar words. The * in these strings is a literal * and not any character.
Jump
J**p
J*m*
J***
***p
J***ing
J***ed
****ed
I want to keep the length fixed:
1. Any string of length 4 that matches the string 'jump'
2. Any string of length 6 that matches 'jumped'
3. Any string of length 7 that matches 'jumping'
I was using the following statements, but for some reason I am not able to get the correct translation. It accepts other strings as well.
p = re.compile(r'(j|\*)(u|\*)(m|\*)...')
bool(p.match('******g'))
This is a fairly straightforward regex. We want to match a word, but allow each character to be an asterisk. The regex is therefore a sequence of character classes of the form [x*]:
[Jj*][u*][m*][p*](?:[i*][n*][g*]|[e*][d*])?
See it in action at regex101.
If you only want to match these exact words, make sure to use the pattern with re.fullmatch.
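Since the question asks about generating such a regex from a list, here is a rough sketch of how the [x*] classes could be built programmatically. The helper name build_masked_regex is hypothetical, and using re.IGNORECASE to accept both J and j is an assumption rather than something the question states.

```python
import re


def build_masked_regex(words):
    """Hypothetical helper: every character may be itself or a literal '*'."""
    alternatives = []
    for word in words:
        # 'jump' becomes [j*][u*][m*][p*]
        alternatives.append("".join("[%s*]" % re.escape(ch) for ch in word))
    return re.compile("|".join(alternatives), re.IGNORECASE)


MASKED = build_masked_regex(["jump", "jumped", "jumping"])


def is_masked_word(s):
    """True only if the whole string is one of the words, possibly masked."""
    return MASKED.fullmatch(s) is not None


for candidate in ("Jump", "J**p", "J***ing", "****ed", "jumps"):
    print(candidate, is_masked_word(candidate))
```

Because fullmatch must consume the whole string, each alternative enforces its own fixed length of 4, 6 or 7 characters.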

regex to strict check numbers in string

Example strings:
I am a numeric string 75698
I am a alphanumeric string A14-B32-C7D
So far my regex works: (\S+)$
I want to add a way (probably look ahead) to check if the result generated by above regex contains any digit (0-9) one or more times?
This is not working: (\S+(?=\S*\d\S*))$
How should I do it?
A lookahead is not necessary for this; the regex is simply:
(\S*\d+\S*)
Here is a test case :
http://regexr.com?34s7v
Move the lookahead to the front of the group and use the \D class instead of \S:
((?=\D*\d)\S+)$
explanation: \D = [^\d] in other words it is all that is not a digit.
You can be more explicit (and get better performance for your examples) with:
((?=[a-zA-Z-]*\d)[a-zA-Z\d-]+)$
If you have only uppercase letters, you know what to do. (The smaller the class, the better the regex.)
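To see the lookahead version at work, the sketch below runs ((?=\D*\d)\S+)$ against the two example strings from the question plus one string without digits. The function name last_token_with_digit is made up for this illustration.

```python
import re

LAST_TOKEN = re.compile(r"((?=\D*\d)\S+)$")


def last_token_with_digit(s):
    """Return the final whitespace-free token if it contains a digit, else None."""
    match = LAST_TOKEN.search(s)
    return match.group(1) if match else None


for line in ("I am a numeric string 75698",
             "I am a alphanumeric string A14-B32-C7D",
             "I am an alphabetic number: three"):
    print(repr(last_token_with_digit(line)))
```

The lookahead can only look forward from where the token starts, and the token has to run to the end of the string, so the digit it finds is necessarily inside the token.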
text = '''
I am a numeric string 75698 \t
I am a alphanumeric string A14-B32-C7D
I am a alphanumeric string A14-B32-C74578
I am an alphabetic number: three
'''
import re
regx = re.compile(r'\s(?=.*\d)([\da-zA-Z-]+)\s*$', re.MULTILINE)
print(regx.findall(text))
# result ['75698', 'A14-B32-C7D', 'A14-B32-C74578']
Note the presence of \s* in front of $ in order to catch alphanumeric portions that are separated from the end of the line by whitespace.
