How to find word with subscript?

How to find word with subscript? - python

Input: s = "test1 this is a sample subscript o₁"
I've tried: re.compile(r'\b[^\W\d_]{2,}\b').findall(s)
It finds the word with more than 2 chars and doesn't contain number
'this', 'is', 'sample', 'subscript', 'o₁',
but it still has the subscript number.
Is there a way to remove word that contains subscript in it?
Desire output: 'this', 'is', 'sample', 'subscript'

The point is that the Unicode aware \d in Python 3 regex does not match No Unicode category.
If you need to work with ASCII only letter words, use
r'\b[a-zA-Z]{2,}\b'
Or, make the pattern non-Unicode aware by using re.A / re.ASCII flag:
re.compile(r'\b[^\W\d_]{2,}\b', re.A)
See this Python 3 demo.
If you need to work with any Unicode letters you may fix it by either adding all the No characters to the regex negated character class (which might make it a tedious solution), or add a programmatic check after a match is found to see if the match contains any char from the No category.
See this Python 3 demo:
import re, sys, unicodedata
s = "test1 this is a sample subscript o₁"
No = [chr(i) for i in range(sys.maxunicode) if unicodedata.category(chr(i)) == 'No']
print([x for x in re.findall(r'\b[^\W\d_]{2,}\b', s) if not any(y in x for y in No)])
# => ['this', 'is', 'sample', 'subscript']
Make sure you are using the latest Python version to support the latest Unicode standard, or rely on the PyPi regex module:
p = regex.compile(r"\b\p{L}{2,}\b")
print(p.findall(s))

Related

Python regex for unicode capitalized words

I have a set of words in different languages (English, Polish, Finnish, Russian etc.) and need to check, what of them written with a capital letter.
I tried to use simple regular expression: ^[A-Z], but it matches only latinate letters, then I've added the russian capital letters: ^[A-ZА-Я].
But many unicode letters with diacritics remains. How I can add all capital letters to my regex?
It's possible to make this without enumerations of symbols?
P.S. I know, how to make this in Ruby, but now I'm using Python.

If you need to use a regex, you have 2 options:
Install PyPi regex module and use \p{Lu} or [[:upper:]] (having more uppercase chars in it) class (make sure you have the latest version installed)
Use re with a character class containing all uppercase letter ranges, either using Python utilities (and then the amount of the Unicode letters matched will depend on the Python version, the latest having up-to-date data) or by manually creating/updating the range from the Unicode Utilities CLDR page.
Here is a solution with a regex containing all uppercase letter ranges taken from Unicode Utilities CLDR reference page:
import re
pLu = "[A-Z\u00C0-\u00D6\u00D8-\u00DE\u0100\u0102\u0104\u0106\u0108\u010A\u010C\u010E\u0110\u0112\u0114\u0116\u0118\u011A\u011C\u011E\u0120\u0122\u0124\u0126\u0128\u012A\u012C\u012E\u0130\u0132\u0134\u0136\u0139\u013B\u013D\u013F\u0141\u0143\u0145\u0147\u014A\u014C\u014E\u0150\u0152\u0154\u0156\u0158\u015A\u015C\u015E\u0160\u0162\u0164\u0166\u0168\u016A\u016C\u016E\u0170\u0172\u0174\u0176\u0178\u0179\u017B\u017D\u0181\u0182\u0184\u0186\u0187\u0189-\u018B\u018E-\u0191\u0193\u0194\u0196-\u0198\u019C\u019D\u019F\u01A0\u01A2\u01A4\u01A6\u01A7\u01A9\u01AC\u01AE\u01AF\u01B1-\u01B3\u01B5\u01B7\u01B8\u01BC\u01C4\u01C7\u01CA\u01CD\u01CF\u01D1\u01D3\u01D5\u01D7\u01D9\u01DB\u01DE\u01E0\u01E2\u01E4\u01E6\u01E8\u01EA\u01EC\u01EE\u01F1\u01F4\u01F6-\u01F8\u01FA\u01FC\u01FE\u0200\u0202\u0204\u0206\u0208\u020A\u020C\u020E\u0210\u0212\u0214\u0216\u0218\u021A\u021C\u021E\u0220\u0222\u0224\u0226\u0228\u022A\u022C\u022E\u0230\u0232\u023A\u023B\u023D\u023E\u0241\u0243-\u0246\u0248\u024A\u024C\u024E\u0370\u0372\u0376\u037F\u0386\u0388-\u038A\u038C\u038E\u038F\u0391-\u03A1\u03A3-\u03AB\u03CF\u03D2-\u03D4\u03D8\u03DA\u03DC\u03DE\u03E0\u03E2\u03E4\u03E6\u03E8\u03EA\u03EC\u03EE\u03F4\u03F7\u03F9\u03FA\u03FD-\u042F\u0460\u0462\u0464\u0466\u0468\u046A\u046C\u046E\u0470\u0472\u0474\u0476\u0478\u047A\u047C\u047E\u0480\u048A\u048C\u048E\u0490\u0492\u0494\u0496\u0498\u049A\u049C\u049E\u04A0\u04A2\u04A4\u04A6\u04A8\u04AA\u04AC\u04AE\u04B0\u04B2\u04B4\u04B6\u04B8\u04BA\u04BC\u04BE\u04C0\u04C1\u04C3\u04C5\u04C7\u04C9\u04CB\u04CD\u04D0\u04D2\u04D4\u04D6\u04D8\u04DA\u04DC\u04DE\u04E0\u04E2\u04E4\u04E6\u04E8\u04EA\u04EC\u04EE\u04F0\u04F2\u04F4\u04F6\u04F8\u04FA\u04FC\u04FE\u0500\u0502\u0504\u0506\u0508\u050A\u050C\u050E\u0510\u0512\u0514\u0516\u0518\u051A\u051C\u051E\u0520\u0522\u0524\u0526\u0528\u052A\u052C\u052E\u0531-\u0556\u10A0-\u10C5\u10C7\u10CD\u13A0-\u13F5\u1E00\u1E02\u1E04\u1E06\u1E08\u1E0A\u1E0C\u1E0E\u1E10\u1E12\u1E14\u1E16\u1E18\u1E1A\u1E1C\u1E1E\u1E20\u1E22\u1E24\u1E26\u1E28\u1E2A\u1E2C\u1E2E\u1E30\u1E32\u1E34\u1E36\u1E38\u1E3A\u1E3C\u1E3E\u1E40\u1E42\u1E44\u1E46\u1E48\u1E4A\u1E4C\u1E4E\u1E50\u1E52\u1E54\u1E56\u1E58\u1E5A\u1E5C\u1E5E\u1E60\u1E62\u1E64\u1E66\u1E68\u1E6A\u1E6C\u1E6E\u1E70\u1E72\u1E74\u1E76\u1E78\u1E7A\u1E7C\u1E7E\u1E80\u1E82\u1E84\u1E86\u1E88\u1E8A\u1E8C\u1E8E\u1E90\u1E92\u1E94\u1E9E\u1EA0\u1EA2\u1EA4\u1EA6\u1EA8\u1EAA\u1EAC\u1EAE\u1EB0\u1EB2\u1EB4\u1EB6\u1EB8\u1EBA\u1EBC\u1EBE\u1EC0\u1EC2\u1EC4\u1EC6\u1EC8\u1ECA\u1ECC\u1ECE\u1ED0\u1ED2\u1ED4\u1ED6\u1ED8\u1EDA\u1EDC\u1EDE\u1EE0\u1EE2\u1EE4\u1EE6\u1EE8\u1EEA\u1EEC\u1EEE\u1EF0\u1EF2\u1EF4\u1EF6\u1EF8\u1EFA\u1EFC\u1EFE\u1F08-\u1F0F\u1F18-\u1F1D\u1F28-\u1F2F\u1F38-\u1F3F\u1F48-\u1F4D\u1F59\u1F5B\u1F5D\u1F5F\u1F68-\u1F6F\u1FB8-\u1FBB\u1FC8-\u1FCB\u1FD8-\u1FDB\u1FE8-\u1FEC\u1FF8-\u1FFB\u2102\u2107\u210B-\u210D\u2110-\u2112\u2115\u2119-\u211D\u2124\u2126\u2128\u212A-\u212D\u2130-\u2133\u213E\u213F\u2145\u2160-\u216F\u2183\u24B6-\u24CF\u2C00-\u2C2E\u2C60\u2C62-\u2C64\u2C67\u2C69\u2C6B\u2C6D-\u2C70\u2C72\u2C75\u2C7E-\u2C80\u2C82\u2C84\u2C86\u2C88\u2C8A\u2C8C\u2C8E\u2C90\u2C92\u2C94\u2C96\u2C98\u2C9A\u2C9C\u2C9E\u2CA0\u2CA2\u2CA4\u2CA6\u2CA8\u2CAA\u2CAC\u2CAE\u2CB0\u2CB2\u2CB4\u2CB6\u2CB8\u2CBA\u2CBC\u2CBE\u2CC0\u2CC2\u2CC4\u2CC6\u2CC8\u2CCA\u2CCC\u2CCE\u2CD0\u2CD2\u2CD4\u2CD6\u2CD8\u2CDA\u2CDC\u2CDE\u2CE0\u2CE2\u2CEB\u2CED\u2CF2\uA640\uA642\uA644\uA646\uA648\uA64A\uA64C\uA64E\uA650\uA652\uA654\uA656\uA658\uA65A\uA65C\uA65E\uA660\uA662\uA664\uA666\uA668\uA66A\uA66C\uA680\uA682\uA684\uA686\uA688\uA68A\uA68C\uA68E\uA690\uA692\uA694\uA696\uA698\uA69A\uA722\uA724\uA726\uA728\uA72A\uA72C\uA72E\uA732\uA734\uA736\uA738\uA73A\uA73C\uA73E\uA740\uA742\uA744\uA746\uA748\uA74A\uA74C\uA74E\uA750\uA752\uA754\uA756\uA758\uA75A\uA75C\uA75E\uA760\uA762\uA764\uA766\uA768\uA76A\uA76C\uA76E\uA779\uA77B\uA77D\uA77E\uA780\uA782\uA784\uA786\uA78B\uA78D\uA790\uA792\uA796\uA798\uA79A\uA79C\uA79E\uA7A0\uA7A2\uA7A4\uA7A6\uA7A8\uA7AA-\uA7AE\uA7B0-\uA7B4\uA7B6\uFF21-\uFF3A\U00010400-\U00010427\U000104B0-\U000104D3\U00010C80-\U00010CB2\U000118A0-\U000118BF\U0001D400-\U0001D419\U0001D434-\U0001D44D\U0001D468-\U0001D481\U0001D49C\U0001D49E\U0001D49F\U0001D4A2\U0001D4A5\U0001D4A6\U0001D4A9-\U0001D4AC\U0001D4AE-\U0001D4B5\U0001D4D0-\U0001D4E9\U0001D504\U0001D505\U0001D507-\U0001D50A\U0001D50D-\U0001D514\U0001D516-\U0001D51C\U0001D538\U0001D539\U0001D53B-\U0001D53E\U0001D540-\U0001D544\U0001D546\U0001D54A-\U0001D550\U0001D56C-\U0001D585\U0001D5A0-\U0001D5B9\U0001D5D4-\U0001D5ED\U0001D608-\U0001D621\U0001D63C-\U0001D655\U0001D670-\U0001D689\U0001D6A8-\U0001D6C0\U0001D6E2-\U0001D6FA\U0001D71C-\U0001D734\U0001D756-\U0001D76E\U0001D790-\U0001D7A8\U0001D7CA\U0001E900-\U0001E921\U0001F130-\U0001F149\U0001F150-\U0001F169\U0001F170-\U0001F189]"
p = re.compile(pLu)
if p.match("Żółw"):
print("Capitalized!")
See the IDEONE demo. To make it work in Python 2.x, make sure you add u prefix to the string literals.
There are other ways to get the Unicode upper-case letter character class in Python using unicodedata and sys packages like
# Python 3
pLu = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]))
# Python 2
pLu = u'[{}]'.format(u"".join([unichr(i) for i in xrange(sys.maxunicode) if unichr(i).isupper()]))
However, this range does not match all uppercase letters displayed at the Unicode Utilities: UnicodeSet page for [:upper:] POSIX character class.
Cf.:
Python 2.7 len([unichr(i) for i in xrange(sys.maxunicode) if unichr(i).isupper()]) displays 1427
Python 3.5 len([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]) shows 1751
Python 3.6 len([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]) shows 1822
Current Unicode Utilities CLDR page displays 1,822 uppercase letters for [:upper:] class, and 1,702 for the \p{Lu}.
With PyPi regex module, it is simpler:
import regex
p = regex.compile(r"\p{Lu}") # To support (currently) 1702 uppercase letters
# p = regex.compile(r"[[:upper:]]") # To support (currently) 1822 uppercase letters
if p.match("Żółw"):
print("Capitalized!")
In Python 2.x you should use:
p = regex.compile(ur"\p{Lu}")
p = regex.compile(ur"[[:upper:]]")
or
p = regex.compile(r"\p{Lu}", regex.U)
p = regex.compile(r"[[:upper:]]", regex.U)

re is excessive when you can just use word[0].isupper().
>>>> 'żółć'[0].isupper()
False
>>>> 'Żółw'[0].isupper()
True
>>>> 'ćMa'[0].isupper()
False

>>> words
'Does not match Äh Oh Äi Üs üx Öjjj'
>>> re.findall(r"(\b[A-ZÜÖÄ][a-z.-]+\b)", words, re.UNICODE)
['Does', 'Äh', 'Oh', 'Äi', 'Üs', 'Öjjj']
Just add to the list all the Unicode letters which are not in the range A-Z, I added the german umlauts only.
You can find all non ASCII letters (A-Z) like this:
>>> [c for c in words if not c.isalpha() and not c.isdigit() and not c.isspace()]
['\xc3', '\x84', '\xc3', '\x84', '\xc3', '\x9c', '\xc3', '\xbc', '\xc3', '\x96']
Now you'll have to figure which are capitals.

You can compare string with its capitalized version to tell if it's capitalized or not:
>>>> s = 'żółć Żółw ćMa'
>>>> l = s.split()
>>>> [word for word in l if word == word.capitalize()]
['Żółw']
>>>> frozenset(l).intersection(s.capitalize() for s in l)
frozenset({'Żółw'})
Note that in Python2 you need to use unicode strings for it to work correctly.

You should try \W and \w :
Example in pythex.org

Python regex splitting additionally on _

I'm trying to split a string in python using regular expressions. This line works almost perfectly for me:
from string import punctuation
import re
row = re.findall('\w+|[{0}]+'.format(punctuation), string)
However, it doesn't split the string on instances of _ as well. For instance:
>>> string = "Hi my name is _Mark. I like apples!! Do you?!"
>>> row = re.findall('\w+|[{0}]+'.format(punctuation), string)
>>> row
['Hi', 'my', 'name', 'is', '_Mark', '.', 'I', 'like', 'apples', '!!', 'Do', 'you', '?!']
What i want is:
['Hi', 'my', 'name', 'is', '_', 'Mark', '.', 'I', 'like', 'apples', '!!', 'Do', 'you', '?!']
I've read its because _ is considered a character. Does anyone know how to accomplish this? Thanks for the help.

Since \w will match the underscore, you can more directly specify what you consider a character without too much more work:
re.findall('[a-zA-Z0-9]+|[{0}]+'.format(punctuation), string)

Because the left side of a disjunction will always match first if possible, you can simply include _ with the punctuation characters before you match letters:
row = re.findall(r'[{0}_]+|\w+'.format(string.punctuation), mystring)
But you can do the same without bothering with string.punctuation at all. "Punctuation" is anything that's neither a space nor a word character:
row = re.findall(r"(?:[^\s\w]|_)+|\w+", mystring)
PS. In your code sample, the string named string "shadows" the module string. Don't do that, it's bad practice and leads to bugs.

It is clearly stated in Python docs that \w not only include alphanumerical characters but also the underscore as well:
\w
When the LOCALE and UNICODE flags are not specified, matches any
alphanumeric character and the underscore; this is equivalent to the
set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus
whatever characters are defined as alphanumeric for the current
locale. If UNICODE is set, this will match the characters [0-9_] plus
whatever is classified as alphanumeric in the Unicode character
properties database.
so like Eric pointed out in his solution, better specify a set of only alphanumerical characters [a-zA-Z0-9]

python regex find all words in text

This sounds very simple, I know, but for some reason I can't get all the results I need
Word in this case is any char but white-space that is separetaed with white-space
for example in the following string: "Hello there stackoverflow."
the result should be: ['Hello','there','stackoverflow.']
My code:
import re
word_pattern = "^\S*\s|\s\S*\s|\s\S*$"
result = re.findall(word_pattern,text)
print result
but after using this pattern on a string like I've shown it only puts the first and the last words in the list and not the words separeted with two spaces
What is the problem with this pattern?

Use the \b boundary test instead:
r'\b\S+\b'
Result:
>>> import re
>>> re.findall(r'\b\S+\b', 'Hello there StackOverflow.')
['Hello', 'there', 'StackOverflow']
or not use a regular expression at all and just use .split(); the latter would include the punctiation in a sentence (the regex above did not match the . in the sentence).

to find all words in a string best use split
>>> "Hello there stackoverflow.".split()
['Hello', 'there', 'stackoverflow.']
but if you must use regular expressions, then you should change your regex to something simpler and faster: r'\b\S+\b'.
r turns the string to a 'raw' string. meaning it will not escape your characters.
\b means a boundary, which is a space, newline, or punctuation.
\S you should know, is any non-whitespace character.
+ means one or more of the previous.
so together it means find all visible sets of characters (words/numbers).

How about simply using -
>>> s = "Hello there stackoverflow."
>>> s.split()
['Hello', 'there', 'stackoverflow.']

The other answers are good. Depending on what you want (eg. include/exclude punctuation or other non-word characters) an alternative could be to use a regex to split by one or more whitespace characters:
re.split(r'\s+', 'Hello there StackOverflow.')
['Hello', 'There', 'StackOverflow.']

Confusing Behaviour of regex in Python

I'm trying to match a specific pattern using the re module in python.
I wish to match a full sentence (More correctly I would say that they are alphanumeric string sequences separated by spaces and/or punctuation)
Eg.
"This is a regular sentence."
"this is also valid"
"so is This ONE"
I'm tried out of various combinations of regular expressions but I am unable to grasp the working of the patterns properly, with each expression giving me a different yet inexplicable result (I do admit I am a beginner, but still).
I'm tried:
"((\w+)(\s?))*"
To the best of my knowledge this should match one or more alpha alphanumerics greedily followed by either one or no white-space character and then it should match this entire pattern greedily. This is not what it seems to do, so clearly I am wrong but I would like to know why. (I expected this to return the entire sentence as the result)
The result I get for the first sample string mentioned above is [('sentence', 'sentence', ''), ('', '', ''), ('', '', ''), ('', '', '')].
"(\w+ ?)*"
I'm not even sure how this one should work. The official documentation(python help('re')) says that the ,+,? Match x or x (greedy) repetitions of the preceding RE.
In such a case is simply space the preceding RE for '?' or is '\w+ ' the preceding RE? And what will be the RE for the '' operator? The output I get with this is ['sentence'].
Others such as "(\w+\s?)+)" ; "((\w*)(\s??)) etc. which are basically variation of the same idea that the sentence is a set of alpha numerics followed by a single/finite number of white spaces and this pattern is repeated over and over.
Can someone tell me where I go wrong and why, and why the above expressions do not work the way I was expecting them to?
P.S I eventually got "[ \w]+" to work for me but With this I cannot limit the number of white-space characters in continuation.

Your reasoning about the regex is correct, your problem is coming from using capturing groups with *. Here's an alternative:
>>> s="This is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['This ', 'is ', 'a ', 'regular ', 'sentence']
In this case it might make more sense for you to use \b in order to match word boundries.
>>> re.findall(r'\w+\b', s)
['This', 'is', 'a', 'regular', 'sentence']
Alternatively you can match the entire sentence via re.match and use re.group(0) to get the whole match:
>>> r = r"((\w+)(\s?))*"
>>> s = "This is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'This is a regular sentence'

Here's an awesome Regular Expression tutorial website:
http://regexone.com/
Here's a Regular Expression that will match the examples given:
([a-zA-Z0-9,\. ]+)

Why do you want to limit the number of white space character in continuation? Because a sentence can have any number of words (sequences of alphanumeric characters) and spaces in a row, but rather a sentence is the area of text that ends with a punctuation mark or rather something that is not in the above sequence including white space.
([a-zA-Z0-9\s])*
The above regex will match a sentence wherein it is a series or spaces in series zero or more times. You can refine it to be the following though:
([a-zA-Z0-9])([a-zA-Z0-9\s])*
Which simply states that the above sequence must be prefaced with a alphanumeric character.
Hope this is what you were looking for.

Maybe this will help:
import re
source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one followed by this one
"""
re_sentence = re.compile(r'[^ \n.].*?(\.|\n| +)')
def main():
i = 0
for s in re_sentence.finditer(source):
print "%d:%s" % (i, s.group(0))
i += 1
if __name__ == '__main__':
main()
I am using alternation in the expression (\.|\n| +) to describe the end-of-sentence condition. Note the use of two spaces in the third alternation. The second space has the '+' meta-character so that two or more spaces in a row will be an end-of-sentence.

Searching a string and returning only things I specify

Hopefully this post goes better..
So I am stuck on this feature of this program that will return the whole word where a certain keyword is specified.
ie - If I tell it to look for the word "I=" in the string "blah blah blah blah I=1mV blah blah etc?", that it returns the whole word where it is found, so in this case, it would return I=1mV.
I have tried a bunch of different approaches, such as,
text = "One of the values, I=1mV is used"
print(re.split('I=', text))
However, this returns the same String without I in it, so it would return
['One of the values, ', '1mV is used']
If I try regex solutions, I run into the problem where the number could possibly be more then 1 digit, and so this bottom piece of code only works if the number is 1 digit. If I=10mV was that value, it would only return one, but if i have [/0-9] in twice, the code no longer works with only 1 value.
text = "One of the values, I=1mV is used"
print(re.findall("I=[/0-9]", text))
['I=1']
When I tried using re.match,
text = "One of the values, I=1mV is used"
print(re.search("I=", text))
<_sre.SRE_Match object at 0x02408BF0>
What is a good way to retrieve the word (In this case, I want to retrieve I=1mV) and cut out the rest of the string?

A better way would be to split the text into words first:
>>> text = "One of the values, I=1mV is used"
>>> words = text.split()
>>> words
['One', 'of', 'the', 'values,', 'I=1mV', 'is', 'used']
And then filter the words to find the one you need:
>>> [w for w in words if 'I=' in w]
['I=1mV']
This returns a list of all words with I= in them. We can then just take the first element found:
>>> [w for w in words if 'I=' in w][0]
'I=1mV'
Done! What we can do to clean this up a bit is to just look for the first match, rather then checking every word. We can use a generator expression for that:
>>> next(w for w in words if 'I=' in w)
'I=1mV'
Of course you could adapt the if condition to fit your needs better, you could for example use str.startswith() to check if the words starts with a certain string or re.match() to check if the word matches a pattern.

Using string methods
For the record, your attempt to split the string in two halves, using I= as the separator, was nearly correct. Instead of using str.split(), which discards the separator, you could have used str.partition(), which keeps it.
>>> my_text = "Loadflow current was I=30.63kA"
>>> my_text.partition("I=")
('Loadflow current was ', 'I=', '30.63kA')
Using regular expressions
A more flexible and robust solution is to use a regular expression:
>>> import re
>>> pattern = r"""
... I= # specific string "I="
... \s* # Possible whitespace
... -? # possible minus sign
... \s* # possible whitespace
... \d+ # at least one digit
... (\.\d+)? # possible decimal part
... """
>>> m = re.search(pattern, my_text, re.VERBOSE)
>>> m
<_sre.SRE_Match object at 0x044CCFA0>
>>> m.group()
'I=30.63'
This accounts for a lot more possibilities (negative numbers, integer or decimal numbers).
Note the use of:
Quantifiers to say how many of each thing you want.
a* - zero or more as
a+ - at least one a
a? - "optional" - one or zero as
Verbose regular expression (re.VERBOSE flag) with comments - much easier to understand the pattern above than the non-verbose equivalent, I=\s?-?\s?\d+(\.\d+).
Raw strings for regexp patterns, r"..." instead of plain strings "..." - means that literal backslashes don't have to be escaped. Not required here because our pattern doesn't use backslashes, but one day you'll need to match C:\Program Files\... and on that day you will need raw strings.
Exercises
Exercise 1: How do you extend this so that it can match the unit as well? And how do you extend this so that it can match the unit as either mA, A, or kA? Hint: "Alternation operator".
Exercise 2: How do you extend this so that it can match numbers in engineering notation, i.e. "1.00e3", or "-3.141e-4"?

import re
text = "One of the values, I=1mV is used"
l = (re.split('I=', text))
print str(l[1]).split(' ') [0]
if you have more than one I= do the above for each odd index in l sice 0 is the first one.
that is a good way since one can write "One of the values, I= 1mV is used"
and I guess you want to get that I is 1mv.
BTW I is current and its units are Ampers and not Volts :)

With your re.findall attempt you would want to add a + which means one or more.
Here are some examples:
import re
test = "This is a test with I=1mV, I=1.414mv, I=10mv and I=1.618mv."
result = re.findall(r'I=[\d\.]+m[vV]', test)
print(result)
test = "One of the values, I=1mV is used"
result = re.search(r'I=([\d\.]+m[vV])', test)
print(result.group(1))
The first print is: ['I=1mV', 'I=1.414mv', 'I=10mv', 'I=1.618mv']
I've grouped everything other than I= in the re.search example,
so the second print is: 1mV
incase you are interested in extracting that.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to find word with subscript? - python

Related

Python regex for unicode capitalized words

Python regex splitting additionally on _

python regex find all words in text

Confusing Behaviour of regex in Python

Searching a string and returning only things I specify

Categories

Resources