Given the index of a word in a string starting at zero ("index" is position two in this sentence), and a word being defined as that which is separated by whitespace, I need to find the index of the first char of that word.
My whitespace regex pattern is "( +|\t+)+", just to cover all my bases (except newline chars, which are excluded). I used split() to separate the string into words, and then summed the lengths of each of those words. However, I need to account for the possibility that more than one whitespace character is used between words, so I can't simply add the number of words minus one to that figure and still be accurate every time.
Example:
>>> example = "This is an example sentence"
>>> get_word_index(example, 2)
8
Change your regular expression to include the whitespace around each word to prevent it from being lost. The expression \s*\S+\s* will first consume leading whitespace, then the actual word, then trailing spaces, so only the first word in the resulting list might have leading spaces (if the string itself started with whitespace). The rest consist of the word itself potentially followed by whitespace. After you have that list, simply find the total length of all the words before the one you want, and account for any leading spaces the string may have.
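import re

def get_word_index(s, idx):
    # each item is one word plus the whitespace around it
    words = re.findall(r'\s*\S+\s*', s)
    # length of everything before the word, plus the word's own leading whitespace
    return sum(map(len, words[:idx])) + len(words[idx]) - len(words[idx].lstrip())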
Testing:
>>> example = "This is an example sentence"
>>> get_word_index(example, 2)
8
>>> example2 = ' ' + example
>>> get_word_index(example2, 2)
9
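As an alternative sketch (not part of the answer above), re.finditer() yields match objects that already carry their start offset, so no lengths have to be summed. The function name get_word_index_finditer is made up for this illustration, and a word is assumed to be any run of non-whitespace characters.

```python
import re

def get_word_index_finditer(s, idx):
    """Return the offset of the first char of the idx-th whitespace-separated word."""
    for i, m in enumerate(re.finditer(r'\S+', s)):
        if i == idx:
            return m.start()
    raise IndexError("word index out of range")

example = "This is an example sentence"
print(get_word_index_finditer(example, 2))        # 8
print(get_word_index_finditer(' ' + example, 2))  # 9
```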
Maybe you could try with:
your_string.index(your_word)
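One caveat worth checking before relying on this: str.index() searches for a substring, not a whole word, and returns the first occurrence. A small sketch with the example sentence from the question:

```python
sentence = "This is an example sentence"

# "is" is first found inside "This", not at the standalone word "is"
first_hit = sentence.index("is")
print(first_hit)  # 2, although the word "is" starts at offset 5
```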
I have a list of strings and I want to extract from it only the item name, with spaces, if there are.
The strings stay in column named 0, and index is just for reference.
For example, from each index line I want the following results:
Index - Expected result
0 - BOV BCONTRA
1 - BF PARAROLE C
2 - CUBINHOS DACE
... and so on.
Notice that in line 25 the desired result is not separated from the preceding numbers by spaces.
There can be a dot . between the words, like in index line 30.
I've tried re.findall(r"\n\d{1,2} \d+(\b\w+\b)") with no success.
Also re.findall(r"\n\d{1,2} \d+( ?\w+)") brings me only the first word, and I want all the words, not only the first one.
The lines start with a \n char that is not printed in the list.
So basically you need all the uppercase strings in the text.
Try this expression, which matches all the uppercase text with or without spaces:
re.findall('[A-Z]+[ A-Z]*', text)
It seems you want [A-Z .]+, not "words" (represented by r'\w'), bordered by integers. \w maps to [a-zA-Z0-9_].
That's the Regex string to have: r'\d+ \d+([A-Z .]+)\d+'.
I don't know what you mean that a newline precedes each line. If you have a string with lines in it, it's perhaps better to split the input in lines with string.splitlines(), then do a linear Regex match (re.match so the Regex only matches from the start) on each relevant line.
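A minimal sketch of that line-by-line approach. The original data was not shown, so the three sample lines below are invented purely for illustration (an index, a number, the item name, then another number); only the regex comes from the answer.

```python
import re

# Hypothetical input: made-up lines, since the real data is not available.
raw = "\n12 3456 BOV BCONTRA 100\n25 7890BF PARAROLE C 20\n30 1111 CUBINHOS.DACE 3"

line_re = re.compile(r'\d+ \d+([A-Z .]+)\d+')

names = []
for line in raw.splitlines():
    m = line_re.match(line)      # match() only matches from the start of the line
    if m:
        names.append(m.group(1).strip())

print(names)  # ['BOV BCONTRA', 'BF PARAROLE C', 'CUBINHOS.DACE']
```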
In trying to solve this challenge (which I pasted at the bottom of this question) using Python 3, the first of the two proposed solutions below, passes all test cases, while the second one doesn't. Since, in my eyes, they're doing pretty much the same, this leaves me very confused. Why doesn't the second block of code work?
It must be something very obvious because the second one fails most test cases, but having worked through custom-inputs, I still can't figure it out.
Working solution:
import re
import sys
lines = sys.stdin.readlines()
n=int(lines[0])
q=int(lines[n+1])
N=lines[1:n+1]
S=lines[n+2:]
text = "\n".join(N)
for s in S:
    print(len(re.findall(r"(?<!\W)(?="+s.strip()+r"\w)", text)))
Broken "solution":
import re
import sys
lines = sys.stdin.readlines()
n=int(lines[0])
q=int(lines[n+1])
N=lines[1:n+1]
S=lines[n+2:]
for s in S:
    total=0
    for string in N:
        total += len(re.findall("(?<!\W)(?="+s.strip()+"\w)", string))
    print(total)
We define a word character to be any of the following:
An English alphabetic letter (i.e., a-z and A-Z).
A decimal digit (i.e., 0-9).
An underscore (i.e., _, which corresponds to ASCII value 95).
We define a word to be a contiguous sequence of one or more word characters that is preceded and succeeded by one or more occurrences of non-word-characters or line terminators. For example, in the string I l0ve-cheese_?, the words are I, l0ve, and cheese_.
We define a sub-word as follows:
A sequence of word characters (i.e., English alphabetic letters, digits, and/or underscores) that occur in the same exact order (i.e., as a contiguous sequence) inside another word.
It is preceded and succeeded by word characters only.
Given n sentences consisting of one or more words separated by non-word characters, process q queries where each query consists of a single string. To process each query, count the number of occurrences of the query string as a sub-word in all sentences, then print the number of occurrences on a new line.
Input Format
The first line contains an integer, n, denoting the number of sentences.
Each of the n subsequent lines contains a sentence consisting of words separated by non-word characters.
The next line contains an integer, q, denoting the number of queries.
Each of the q subsequent lines contains a string to check.
Constraints
1 ≤ n ≤ 100
1 ≤ q ≤ 10
Output Format
For each query string, print the total number of times it occurs as a sub-word within all words in all sentences.
Sample Input
1
existing pessimist optimist this is
1
is
Sample Output
3
Explanation
We must count the number of times is occurs as a sub-word in our input sentence(s):
is occurs 1 time as a sub-word of existing.
is occurs 1 time as a sub-word of pessimist.
is occurs 1 time as a sub-word of optimist.
While is is a substring of the word this, it's followed by a blank space; because a blank space is non-alphabetic, non-numeric, and not an underscore, we do not count it as a sub-word occurrence.
While is is a substring of the word is in the sentence, we do not count it as a match because it is preceded and succeeded by non-word characters (i.e., blank spaces) in the sentence. This means it doesn't count as a sub-word occurrence.
Next, we sum the occurrences of is as a sub-word of all our words as 1+1+1+0+0=3. Thus, we print 3 on a new line.
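Use raw strings for your patterns. In a plain string literal Python processes backslash escapes before the regex engine sees them; \W and \w happen to pass through unchanged, but a sequence such as \b would be turned into a backspace character, and the pattern would not match as you expect.
Since you are no longer looking inside a single multiline string, every sentence now begins at the start of a string, where the negative lookbehind (?<!\W) succeeds because there is no preceding character at all. Change it to a positive lookbehind, so that a word character must actually precede the match: (?<=\w)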
As Wiktor mentions in his comment, it would be a good idea to escape s.strip so that any potential chars that could be treated as regex metachars will be escaped and taken literally. You can use re.escape(s.strip()) for that.
Your code will work with this change:
total += len(re.findall(r"(?<=\w)(?=" + re.escape(s.strip()) + r"\w)", string))
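For reference, here is the same idea as a small self-contained sketch, checked against the sample input from the challenge. The function name count_subword is made up for this illustration.

```python
import re

def count_subword(sentences, query):
    """Count occurrences of query that have a word character on both sides."""
    pattern = re.compile(r"(?<=\w)(?=" + re.escape(query) + r"\w)")
    # the pattern is zero-width, so findall() returns one empty string per hit
    return sum(len(pattern.findall(line)) for line in sentences)

print(count_subword(["existing pessimist optimist this is"], "is"))  # 3
```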
I would like to find words of length >= 1 which may contain a ' or a - within. Here is a test string:
a quake-prone area- (aujourd'hui-
In Python, I'm currently using this regex:
string = "a quake-prone area- (aujourd'hui-"
RE_WORDS = re.compile(r'[a-z]+[-\']?[a-z]+')
words = RE_WORDS.findall(string)
I would like to get this result:
>>> words
[u'a', u'quake-prone', u'area', u"aujourd'hui"]
but I get this instead:
>>> words
[u'quake-prone', u'area', u"aujourd'hui"]
Unfortunately, because of the last + quantifier, it skips all words of length 1. If I use the * quantifier, it will find a but also area- instead of area.
Then how could I create a conditional regex saying: if the word contains an apostrophe or a hyphen, use the + quantifier, else use the * quantifier?
I suggest making the last [-\']?[a-z]+ part optional by putting it into a group and adding a ? quantifier after that group.
>>> string = "a quake-prone area- (aujourd'hui-"
>>> RE_WORDS = re.compile(r'[a-z]+(?:[-\'][a-z]+)?')
>>> RE_WORDS.findall(string)
['a', 'quake-prone', 'area', "aujourd'hui"]
The a is not matched by your regex because it contains two [a-z]+ parts, which requires at least two lowercase letters in every match.
Note that the regex I suggested won't match area- because the optional group (?:[-\'][a-z]+)? requires at least one lowercase letter immediately after the - symbol. If there is none, the group matches nothing and the match ends just before the hyphen. That is why you get area in the output instead of area-.
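If words with more than one hyphen or apostrophe can occur, one possible extension (an assumption, since the question only shows single-hyphen words) is to replace the ? after the group with *, so the group may repeat. The test string below is made up for this sketch.

```python
import re

sample = "a mother-in-law area-"

one_group = re.compile(r"[a-z]+(?:[-'][a-z]+)?")   # the pattern from the answer
many_groups = re.compile(r"[a-z]+(?:[-'][a-z]+)*")  # group may repeat

print(one_group.findall(sample))    # ['a', 'mother-in', 'law', 'area']
print(many_groups.findall(sample))  # ['a', 'mother-in-law', 'area']
```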
I'm writing a Python script for a FOSS language learning initiative. Let's say I have an XML file (or to keep it simple, a Python list) with a list of words in a particular language (in my case, the words are in Tamil, which uses a Brahmi-based Indic script).
Given a list of letters, I need to draw out the subset of those words that can be spelled using just those letters.
An English example:
words = ["cat", "dog", "tack", "coat"]
get_words(['o', 'c', 'a', 't']) should return ["cat", "coat"]
get_words(['k', 'c', 't', 'a']) should return ["cat", "tack"]
A Tamil example:
words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]
get_words([u'ம', u'ப', u'ட', u'ம்']) should return [u"மடம்", u"படம்"]
get_words([u'ப', u'ம்', u'ட']) should return [u"படம்"]
The order that the words are returned in, or the order that the letters are entered in should not make a difference.
Although I understand the difference between unicode codepoints and graphemes, I'm not sure how they're handled in regular expressions.
In this case, I would want to match only those words that are made up of the specific graphemes in the input list, and nothing else (i.e. the markings that follow a letter should only follow that letter, but the graphemes themselves can occur in any order).
To support characters that can span several Unicode codepoints:
# -*- coding: utf-8 -*-
import re
import unicodedata
from functools import partial
NFKD = partial(unicodedata.normalize, 'NFKD')
def match(word, letters):
    word, letters = NFKD(word), map(NFKD, letters) # normalize
    return re.match(r"(?:%s)+$" % "|".join(map(re.escape, letters)), word)
words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]
get_words = lambda letters: [w for w in words if match(w, letters)]
print(" ".join(get_words([u'ம', u'ப', u'ட', u'ம்'])))
# -> மடம் படம்
print(" ".join(get_words([u'ப', u'ம்', u'ட'])))
# -> படம்
It assumes that the same character can be used zero or more times in a word.
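To see what the regex inside match() looks like, here is a small sketch with the English letters (the variable names are only for illustration). Each letter is escaped with re.escape(), the letters are joined with | into alternatives, and (?:...)+$ requires one or more of those alternatives all the way to the end of the word.

```python
import re

letters = ['o', 'c', 'a', 't']
pattern = r"(?:%s)+$" % "|".join(map(re.escape, letters))
print(pattern)  # (?:o|c|a|t)+$

coat_ok = re.match(pattern, "coat") is not None  # every letter is allowed
tack_ok = re.match(pattern, "tack") is not None  # 'k' is not in the list
print(coat_ok, tack_ok)  # True False
```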
If you want only words that contain exactly given characters:
import regex # $ pip install regex
chars = regex.compile(r"\X").findall # get all characters
def match(word, letters):
    return sorted(chars(word)) == sorted(letters)
words = ["cat", "dog", "tack", "coat"]
print(" ".join(get_words(['o', 'c', 'a', 't'])))
# -> coat
print(" ".join(get_words(['k', 'c', 't', 'a'])))
# -> tack
Note: there is no cat in the output in this case because cat doesn't use all given characters.
What does it mean to normalize? And could you please explain the syntax of the re.match() regex?
>>> import re
>>> re.escape('.')
'\\.'
>>> c = u'\u00c7'
>>> cc = u'\u0043\u0327'
>>> cc == c
False
>>> re.match(r'%s$' % (c,), cc) # do not match
>>> import unicodedata
>>> norm = lambda s: unicodedata.normalize('NFKD', s)
>>> re.match(r'%s$' % (norm(c),), norm(cc)) # do match
<_sre.SRE_Match object at 0x1364648>
>>> print c, cc
Ç Ç
Without normalization c and cc do not match. The characters are from the unicodedata.normalize() docs.
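The same experiment as a plain script, assuming Python 3 (where every str is Unicode). It shows that the two spellings differ in their code points but become identical after normalization.

```python
import unicodedata

composed = "\u00c7"     # LATIN CAPITAL LETTER C WITH CEDILLA, one code point
decomposed = "C\u0327"  # 'C' followed by COMBINING CEDILLA, two code points

nfkd_composed = unicodedata.normalize("NFKD", composed)
nfkd_decomposed = unicodedata.normalize("NFKD", decomposed)

print(composed == decomposed)             # False
print(nfkd_composed == nfkd_decomposed)   # True
print(len(composed), len(nfkd_composed))  # 1 2
```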
EDIT: Okay, don't use any of the answers from here. I wrote them all while thinking Python regular expressions didn't have a word boundary marker, and I tried to work around this lack. Then @Mark Tolonen added a comment that Python has \b as a word boundary marker! So I posted another answer, short and simple, using \b. I'll leave this here in case anyone is interested in seeing solutions that work around the lack of \b, but I don't really expect anyone to be.
It is easy to make a regular expression that matches only a string of a specific set of characters. What you need to use is a "character class" with just the characters you want to match.
I'll do this example in English.
[ocat] This is a character class that will match a single character from the set [o, c, a, t]. Order of the characters doesn't matter.
[ocat]+ Putting a + on the end makes it match one or more characters from the set. But this is not enough by itself; if you had the word "coach" this would match and return "coac".
Sadly, there isn't a regular expression feature for "word boundary". [EDIT: This turns out not to be correct, as I said in the first paragraph.] We need to make one of our own. There are two possible word beginnings: the start of a line, or whitespace separating our word from the previous word. Similarly, there are two possible word endings: end of a line, or whitespace separating our word from the next word.
Since we will be matching some extra stuff we don't want, we can put parentheses around the part of the pattern we do want.
To match two alternatives, we can make a group in parentheses and separate the alternatives with a vertical bar. Python regular expressions have a special notation to make a group whose contents we don't want to keep: (?:)
So, here is the pattern to match the beginning of a word. Start of line or white space: (?:^|\s)
Here is the pattern for end of word. White space or end of line: (?:\s|$)
Putting it all together, here is our final pattern:
(?:^|\s)([ocat]+)(?:\s|$)
You can build this dynamically. You don't need to hard-code the whole thing.
import re
s_pat_start = r'(?:^|\s)(['
s_pat_end = r']+)(?:\s|$)'
set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars is now set to the string: "ocat"
s_pat = s_pat_start + set_of_chars + s_pat_end
pat = re.compile(s_pat)
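One thing to watch when building the character class at runtime: characters such as -, ] or ^ have a special meaning inside [...]. A hedged sketch that escapes each character first; the helper name build_word_pattern is made up here.

```python
import re

def build_word_pattern(chars):
    """Build the (?:^|\\s)([...]+)(?:\\s|$) pattern for an arbitrary set of characters."""
    # escape every character so that '-', ']' or '^' are taken literally
    char_class = "[" + "".join(re.escape(c) for c in chars) + "]+"
    return re.compile(r"(?:^|\s)(" + char_class + r")(?:\s|$)")

print(build_word_pattern("ocat").findall("cat dog"))  # ['cat']
# without escaping, "a-c" would be read as the range a..c and match "abc"
print(build_word_pattern("a-c").findall("abc"))       # []
print(build_word_pattern("a-c").findall("a-c"))       # ['a-c']
```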
Now, this doesn't in any way check for valid words. If you have the following text:
This is sensible. This not: occo cttc
The pattern I showed you will match on occo and cttc, and those are not really words. They are strings made only of letters from [ocat] though.
So just do the same thing with Unicode strings. (If you are using Python 3.x then all strings are Unicode strings, so there you go.) Put the Tamil characters in the character class and you are good to go.
This has a confusing problem: re.findall() doesn't return all possible matches.
EDIT: Okay, I figured out what was confusing me.
What we want is for our pattern to work with re.findall() so you can collect all the words. But re.findall() only finds non-overlapping patterns. In my example, re.findall() only returned ['occo'] and not ['occo', 'cttc'] as I expected... but this is because my pattern was matching the white space after occo. The match group didn't collect the white space, but it was matched all the same, and since re.findall() wants no overlap between matches, the white space was "used up" and didn't work for cttc.
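The overlap problem can be reproduced with a short sketch. Doubling every space is only a way to demonstrate the cause, not a recommended fix.

```python
import re

text = "This is sensible. This not: occo cttc"
pat = re.compile(r'(?:^|\s)([ocat]+)(?:\s|$)')

# The space after "occo" is consumed by the first match, so "cttc"
# has no leading whitespace character left to match against.
found = pat.findall(text)
print(found)  # ['occo']

# With two spaces between words, one is left over for the next match.
found_doubled = pat.findall(text.replace(' ', '  '))
print(found_doubled)  # ['occo', 'cttc']
```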
The solution is to use a feature of Python regular expressions that I have never used before: special syntax that says "must not be preceded by" or "must not be followed by". The sequence \S matches any non-whitespace so we could use that. But punctuation is non-whitespace, and I think we do want punctuation to delimit a word. There is also special syntax for "must be preceded by" or "must be followed by". So here is, I think, the best we can do:
Build a string that means "match when the character class string is at start of line and followed by whitespace, or when character class string is preceded by whitespace and followed by whitespace, or when character class string is preceded by whitespace and followed by end of line, or when character class string is preceded by start of line and followed by end of line".
Here is that pattern using ocat:
r'(?:^([ocat]+)(?=\s)|(?<=\s)([ocat]+)(?=\s)|(?<=\s)([ocat]+)$|^([ocat]+)$)'
I'm very sorry but I really do think this is the best we can do and still work with re.findall()!
It's actually less confusing in Python code though:
import re
NMGROUP_BEGIN = r'(?:' # begin non-capturing group
NMGROUP_END = r')' # end non-capturing group
WS_BEFORE = r'(?<=\s)' # require white space before
WS_AFTER = r'(?=\s)' # require white space after
BOL = r'^' # beginning of line
EOL = r'$' # end of line
CCS_BEGIN = r'([' #begin a character class string
CCS_END = r']+)' # end a character class string
PAT_OR = r'|'
set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars now set to "ocat"
CCS = CCS_BEGIN + set_of_chars + CCS_END # build up character class string pattern
s_pat = (NMGROUP_BEGIN +
    BOL + CCS + WS_AFTER + PAT_OR +
    WS_BEFORE + CCS + WS_AFTER + PAT_OR +
    WS_BEFORE + CCS + EOL + PAT_OR +
    BOL + CCS + EOL +
    NMGROUP_END)
pat = re.compile(s_pat)
text = "This is sensible. This not: occo cttc"
pat.findall(text)
# returns: [('', 'occo', '', ''), ('', '', 'cttc', '')]
So, the crazy thing is that when we have alternative patterns that could match, re.findall() seems to return an empty string for the alternatives that didn't match. So we just need to filter out the length-zero strings from our results:
import itertools as it
raw_results = pat.findall(text)
results = [s for s in it.chain(*raw_results) if s]
# results set to: ['occo', 'cttc']
I guess it might be less confusing to just build four different patterns, run re.findall() on each, and join the results together.
EDIT: Okay, here is the code for building four patterns and trying each. I think this is an improvement.
import re
WS_BEFORE = r'(?<=\s)' # require white space before
WS_AFTER = r'(?=\s)' # require white space after
BOL = r'^' # beginning of line
EOL = r'$' # end of line
CCS_BEGIN = r'([' #begin a character class string
CCS_END = r']+)' # end a character class string
set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars now set to "ocat"
CCS = CCS_BEGIN + set_of_chars + CCS_END # build up character class string pattern
lst_s_pat = [
    BOL + CCS + WS_AFTER,
    WS_BEFORE + CCS + WS_AFTER,
    WS_BEFORE + CCS + EOL,
    BOL + CCS + EOL,  # EOL is needed here, otherwise a prefix such as "coac" in "coach" matches
]
lst_pat = [re.compile(s) for s in lst_s_pat]
text = "This is sensible. This not: occo cttc"
result = []
for pat in lst_pat:
    result.extend(pat.findall(text))
# result set to: ['occo', 'cttc']
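As far as I can tell, the four cases can also be collapsed into a single pattern with negative lookarounds: "not preceded by a non-whitespace character" covers both start of text and leading whitespace, and "not followed by a non-whitespace character" covers both end of text and trailing whitespace. This is a sketch of that idea, not something from the answer above.

```python
import re

text = "This is sensible. This not: occo cttc"

# (?<!\S) : no non-whitespace char before (start of text or whitespace)
# (?!\S)  : no non-whitespace char after  (end of text or whitespace)
pat = re.compile(r'(?<!\S)[ocat]+(?!\S)')

result = pat.findall(text)
print(result)  # ['occo', 'cttc']
```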
EDIT: Okay, here is a very different approach. I like this one best.
First, we will match all words in the text. A word is defined as one or more characters that are not punctuation and are not white space.
Then, we use a filter to remove words from the above; we keep only words that are made only of the characters we want.
import re
import string
# Create a pattern that matches all characters not part of a word.
#
# Note that '-' has a special meaning inside a character class, but it
# is valid punctuation that we want to match, so put in a backslash in
# front of it to disable the special meaning and just match it.
#
# Use '^' which negates all the chars following. So, a word is a series
# of characters that are all not whitespace and not punctuation.
WORD_BOUNDARY = string.whitespace + string.punctuation.replace('-', r'\-')
WORD = r'[^' + WORD_BOUNDARY + r']+'
# Create a pattern that matches only the words we want.
set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars now set to "ocat"
# build up character class string pattern; the trailing '$' makes
# pat.match() accept a word only if all of it is made of these characters
CCS = r'[' + set_of_chars + r']+$'
pat_word = re.compile(WORD)
pat = re.compile(CCS)
text = "This is sensible. This not: occo cttc"
# This makes it clear how we are doing this.
all_words = pat_word.findall(text)
result = [s for s in all_words if pat.match(s)]
# "lazy" generator expression that yields up good results when iterated
# May be better for very large texts.
result_genexp = (s for s in (m.group(0) for m in pat_word.finditer(text)) if pat.match(s))
# force the expression to expand out to a list
result = list(result_genexp)
# result set to: ['occo', 'cttc']
EDIT: Now I don't like any of the above solutions; please see the other answer, the one using \b, for the best solution in Python.
It is easy to make a regular expression that matches only a string of a specific set of characters. What you need to use is a "character class" with just the characters you want to match.
I'll do this example in English.
[ocat] This is a character class that will match a single character from the set [o, c, a, t]. Order of the characters doesn't matter.
[ocat]+ Putting a + on the end makes it match one or more characters from the set. But this is not enough by itself; if you had the word "coach" this would match and return "coac".
\b[ocat]+\b Now it only matches on word boundaries. (Thank you very much @Mark Tolonen for educating me about \b.)
So, just build up a pattern like the above, only using the desired character set at runtime, and there you go. You can use this pattern with re.findall() or re.finditer().
import re
words = ["cat", "dog", "tack", "coat"]
def get_words(chars_seq, words_seq=words):
    s_chars = ''.join(chars_seq)
    s_pat = r'\b[' + s_chars + r']+\b'
    pat = re.compile(s_pat)
    return [word for word in words_seq if pat.match(word)]
assert get_words(['o', 'c', 'a', 't']) == ["cat", "coat"]
assert get_words(['k', 'c', 't', 'a']) == ["cat", "tack"]
I would not use regular expressions to solve this problem. I would rather use collections.Counter like so:
>>> from collections import Counter
>>> def get_words(word_list, letter_string):
...     return [word for word in word_list if Counter(word) & Counter(letter_string) == Counter(word)]
...
>>> words = ["cat", "dog", "tack", "coat"]
>>> letters = 'ocat'
>>> get_words(words, letters)
['cat', 'coat']
>>> letters = 'kcta'
>>> get_words(words, letters)
['cat', 'tack']
This solution should also work for other languages. Counter(word) & Counter(letter_string) computes the intersection of the two counters: for each element x it keeps min(c[x], d[x]), where c and d are the two counters. If this intersection is equal to Counter(word), every letter of the word is available in letter_string, so the word is returned as a match.
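A short sketch of what the intersection produces for one word that fits and one that does not. Note that, unlike the character-class regexes, this check also counts how often each letter is used.

```python
from collections import Counter

letters = Counter("ocat")

common_tack = Counter("tack") & letters   # 'k' is missing from the letters
common_coat = Counter("coat") & letters   # all four letters are available

print(common_tack == Counter("tack"))  # False
print(common_coat == Counter("coat"))  # True
# a letter cannot be used more often than it was given
print(Counter("coco") & letters == Counter("coco"))  # False
```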
Hopefully this post goes better..
So I am stuck on this feature of this program that will return the whole word where a certain keyword is specified.
ie - If I tell it to look for the word "I=" in the string "blah blah blah blah I=1mV blah blah etc?", that it returns the whole word where it is found, so in this case, it would return I=1mV.
I have tried a bunch of different approaches, such as,
text = "One of the values, I=1mV is used"
print(re.split('I=', text))
However, this returns the same String without I in it, so it would return
['One of the values, ', '1mV is used']
If I try regex solutions, I run into the problem where the number could possibly be more than 1 digit, so the code below only works if the number is 1 digit. If the value was I=10mV, it would only return I=1, but if I put [/0-9] in twice, the code no longer works with a 1-digit value.
text = "One of the values, I=1mV is used"
print(re.findall("I=[/0-9]", text))
['I=1']
When I tried using re.search,
text = "One of the values, I=1mV is used"
print(re.search("I=", text))
<_sre.SRE_Match object at 0x02408BF0>
What is a good way to retrieve the word (In this case, I want to retrieve I=1mV) and cut out the rest of the string?
A better way would be to split the text into words first:
>>> text = "One of the values, I=1mV is used"
>>> words = text.split()
>>> words
['One', 'of', 'the', 'values,', 'I=1mV', 'is', 'used']
And then filter the words to find the one you need:
>>> [w for w in words if 'I=' in w]
['I=1mV']
This returns a list of all words with I= in them. We can then just take the first element found:
>>> [w for w in words if 'I=' in w][0]
'I=1mV'
Done! What we can do to clean this up a bit is to just look for the first match, rather than checking every word. We can use a generator expression for that:
>>> next(w for w in words if 'I=' in w)
'I=1mV'
Of course you could adapt the if condition to fit your needs better. You could, for example, use str.startswith() to check if the word starts with a certain string, or re.match() to check if the word matches a pattern.
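For example, a sketch with str.startswith(). It also passes a default to next(), so that a text without any match gives None instead of raising StopIteration; the helper name first_word_starting_with is made up for this illustration.

```python
def first_word_starting_with(text, prefix):
    """Return the first whitespace-separated word that starts with prefix, or None."""
    return next((w for w in text.split() if w.startswith(prefix)), None)

print(first_word_starting_with("One of the values, I=1mV is used", "I="))  # I=1mV
print(first_word_starting_with("No current given here", "I="))            # None
```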
Using string methods
For the record, your attempt to split the string in two halves, using I= as the separator, was nearly correct. Instead of using str.split(), which discards the separator, you could have used str.partition(), which keeps it.
>>> my_text = "Loadflow current was I=30.63kA"
>>> my_text.partition("I=")
('Loadflow current was ', 'I=', '30.63kA')
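Building on that, a minimal sketch that turns the partition() result into the whole word. The helper name word_after is invented here, and it assumes the value ends at the next whitespace.

```python
def word_after(text, key="I="):
    """Return key plus everything up to the next whitespace, or None if key is absent."""
    before, sep, after = text.partition(key)
    if not sep:              # partition() returns an empty separator when key is not found
        return None
    parts = after.split()
    return sep + (parts[0] if parts else "")

print(word_after("Loadflow current was I=30.63kA"))    # I=30.63kA
print(word_after("One of the values, I=1mV is used"))  # I=1mV
```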
Using regular expressions
A more flexible and robust solution is to use a regular expression:
>>> import re
>>> pattern = r"""
... I= # specific string "I="
... \s* # Possible whitespace
... -? # possible minus sign
... \s* # possible whitespace
... \d+ # at least one digit
... (\.\d+)? # possible decimal part
... """
>>> m = re.search(pattern, my_text, re.VERBOSE)
>>> m
<_sre.SRE_Match object at 0x044CCFA0>
>>> m.group()
'I=30.63'
This accounts for a lot more possibilities (negative numbers, integer or decimal numbers).
Note the use of:
Quantifiers to say how many of each thing you want.
a* - zero or more as
a+ - at least one a
a? - "optional" - one or zero as
Verbose regular expression (re.VERBOSE flag) with comments - much easier to understand the pattern above than the non-verbose equivalent, I=\s*-?\s*\d+(\.\d+)?.
Raw strings for regexp patterns, r"..." instead of plain strings "..." - means that literal backslashes don't have to be escaped. The pattern above uses \s, \d and \., which Python happens to leave untouched in a plain string, but a sequence such as \b or \t would be silently converted, and one day you'll need to match C:\Program Files\... and on that day you will need raw strings.
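To check that the verbose and the compact spelling really are the same pattern, here is a small sketch that compiles both and compares their results on two sample strings (the second string is made up).

```python
import re

verbose = re.compile(r"""
    I=          # literal "I="
    \s* -? \s*  # optional whitespace, optional minus sign, optional whitespace
    \d+         # at least one digit
    (\.\d+)?    # optional decimal part
""", re.VERBOSE)

compact = re.compile(r"I=\s*-?\s*\d+(\.\d+)?")

for sample in ("Loadflow current was I=30.63kA", "Fault current I= -5 A"):
    print(verbose.search(sample).group(), "|", compact.search(sample).group())
# I=30.63 | I=30.63
# I= -5 | I= -5
```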
Exercises
Exercise 1: How do you extend this so that it can match the unit as well? And how do you extend this so that it can match the unit as either mA, A, or kA? Hint: "Alternation operator".
Exercise 2: How do you extend this so that it can match numbers in engineering notation, i.e. "1.00e3", or "-3.141e-4"?
import re
text = "One of the values, I=1mV is used"
l = re.split('I=', text)
# split() without arguments also skips any whitespace right after "I="
print(l[1].split()[0])
If you have more than one I=, do the above for every index of l from 1 onward, since index 0 holds the text before the first I=.
This is a good way since one can also write "One of the values, I= 1mV is used", and I guess you would still want to get that I is 1mV.
BTW, I is current and its units are amperes, not volts :)
With your re.findall attempt you would want to add a + which means one or more.
Here are some examples:
import re
test = "This is a test with I=1mV, I=1.414mv, I=10mv and I=1.618mv."
result = re.findall(r'I=[\d\.]+m[vV]', test)
print(result)
test = "One of the values, I=1mV is used"
result = re.search(r'I=([\d\.]+m[vV])', test)
print(result.group(1))
The first print is: ['I=1mV', 'I=1.414mv', 'I=10mv', 'I=1.618mv']
I've grouped everything other than I= in the re.search example, so the second print is: 1mV, in case you are interested in extracting that.
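If the number and the unit are needed separately, one option is named groups. This is a sketch under the assumption that the unit is always mV or mv, as in the test string; the name READING is made up. It also uses \d+(?:\.\d+)? for the number, which is stricter than [\d\.]+ (that one would accept something like 1.2.3).

```python
import re

test = "This is a test with I=1mV, I=1.414mv, I=10mv and I=1.618mv."

READING = re.compile(r'I=(?P<value>\d+(?:\.\d+)?)(?P<unit>m[vV])')

readings = [(float(m.group('value')), m.group('unit'))
            for m in READING.finditer(test)]
print(readings)
# [(1.0, 'mV'), (1.414, 'mv'), (10.0, 'mv'), (1.618, 'mv')]
```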