string index out of range in inverted index implementation

*********************file a.py*********************************
a=input()
while (not (a[len(a)-1].isalpha())):
    a=a[:-1]
print(a)
*****************part of file b.py*********************************
for my_word in my_words.split():
    while(not(my_word[len(my_word)-1].isalpha())):
        my_word=my_word[:-1]
    ll=lemmatizer.lemmatize(my_word.lower())
    if ll not in stop_words:
        l.append(ll)
File a.py runs fine, but b.py gives this error:
Traceback (most recent call last):
  File "b.py", line 42, in <module>
    while(not(my_word[len(my_word)-1].isalpha())):
IndexError: string index out of range
If I remove the while loop
    while(not(my_word[len(my_word)-1].isalpha())):
        my_word=my_word[:-1]
my code (b.py) runs fine. But I want to remove special-character suffixes from my words.

You can use regular expression substitution (instead of the while loop) to remove non-alphabetic characters. Note that this removes them anywhere in the word, not only at the end:
import re
my_word = "Hello_world+?a123"
re.sub(r"(\W|\d|_)+", "", my_word)
#'Helloworlda'
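The IndexError itself most likely comes from a token that contains no letters at all (for example "--" or "123"): the loop strips it down to the empty string, and indexing an empty string then fails. a.py only runs fine as long as the input contains a letter. Below is a minimal Python 3 sketch that strips only trailing non-letters, assuming ASCII letters are what should be kept. The helper names are hypothetical and not part of the original code.

```python
import re

def strip_trailing_non_alpha(word):
    """Drop trailing characters that are not ASCII letters (hypothetical helper)."""
    return re.sub(r"[^A-Za-z]+$", "", word)

def strip_trailing_non_alpha_loop(word):
    """Same idea as the original loop, guarded against the empty string."""
    while word and not word[-1].isalpha():
        word = word[:-1]
    return word

words = [strip_trailing_non_alpha(w) for w in "Hello, world!! -- 123".split()]
words = [w for w in words if w]  # skip tokens that had no letters at all
print(words)  # ['Hello', 'world']
```

The two helpers differ slightly: str.isalpha() also accepts non-ASCII letters, while the character class [^A-Za-z] does not.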

Related

How to convert str to a float?

I imported a list of floats stored as strings and tried to convert them to floats, but this error kept popping up:
Traceback (most recent call last):
  File "c:\Users\peter\Documents\coding\projects\LineFitting.py", line 12, in <module>
    StockPriceFile = float(value.strip(''))
ValueError: could not convert string to float:
This is what I did to try to convert the list:
#1
for value in range(0, len(StockPriceFile)):
    StockPriceFile[value] = float(StockPriceFile[value])
#2
for value in StockPriceFile:
    value = float(value)
#3
StockPriceFile[0] = StockPriceFile[0].strip('[]')
for value in StockPriceFile:
    StockPriceFile = float(value.strip(''))
(Sample of data)
['[36800.]', '36816.666666666664', '36816.666666666664', '36833.333333333336', '36866.666666666664']
Where it's being written:
Data_AvgFile.write(str(Average) + ',')
What does this mean, and how can I fix it? It works fine when I do it one by one.
(Also tell me if you need more data; I don't know if this is sufficient.)

for value in StockPriceFile:
    stock_price = float(value.strip('[]'))
    print(stock_price)
strip() will remove the [] characters around the value.

As long as the brackets "[ ]" are in your string, you can't convert it to a number, because they make it invalid. The same goes for most letters and symbols; the dot (.) is an exception for float.
>>> print(float('[36800.]'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: '[36800.]'
>>> print(float('36800.'))
36800.0

l = ['[36800.]', '36816.666666666664', '36816.666666666664', '36833.333333333336', '36866.666666666664']
[float(f.strip('[]')) for f in l]
Output:
[36800.0,
36816.666666666664,
36816.666666666664,
36833.333333333336,
36866.666666666664]
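The error message in the question shows nothing after the colon, which suggests that at least one item is an empty string. Since each value is written followed by ',', splitting the file content on ',' would leave an empty last item. Also note that strip('') removes nothing, and that attempt #3 rebinds StockPriceFile to a single float. The following Python 3 sketch assumes the file content was read as one string and split on commas; the raw value is hypothetical.

```python
raw = "[36800.],36816.666666666664,36833.333333333336,"  # hypothetical file content

prices = []
for item in raw.split(","):
    cleaned = item.strip().strip("[]")  # drop whitespace, then any brackets
    if not cleaned:
        continue  # skip the empty item left by the trailing comma
    prices.append(float(cleaned))

print(prices)  # [36800.0, 36816.666666666664, 36833.333333333336]
```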

How to remove all characters not inside parentheses using regex

I have a string that contains commas both inside and outside of a parentheses block:
foo(bat,foo),bat
How can I use regex to replace the comma not inside parentheses?
foo(bat,foo)bat

Do you really want to use re, or is any way of achieving your goal OK?
In the latter case, here is a way to do it:
mystring = 'foo(bat,foo),bat'
''.join(si + ',' if '(' in si else si for si in mystring.split(','))
#'foo(bat,foo)bat'

Assuming that there are no nested parentheses and no invalid pairings of parentheses, you can do this with a regex: a comma is inside a pair of parentheses exactly when the next parenthesis after it is a closing one. A negative lookahead expresses this; the pattern below matches a comma only if it is not followed by a run of non-"(" characters and then a ")".
,(?![^(]*\))
If there are nested parentheses, it becomes a context-free grammar and you cannot capture this with a regular expression alone. You are better off just using split methods.
Example:
import re
ori_str = "foo(bat,foo),bat foo(bat,foo),bat"
rep_str = re.sub(r',(?![^(]*\))', '', ori_str)
print(rep_str)
# foo(bat,foo)bat foo(bat,foo)bat
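For the nested case that a plain regular expression cannot handle, one option is to track the parenthesis depth while scanning the string. This is a minimal Python 3 sketch, not the approach of the answer above; the function name remove_outer_commas is hypothetical, and a stray ")" is simply tolerated rather than reported.

```python
def remove_outer_commas(text):
    """Remove commas that are not enclosed in any parentheses (hypothetical helper)."""
    depth = 0
    out = []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)  # tolerate an unmatched ")"
        if ch == "," and depth == 0:
            continue  # comma outside all parentheses: drop it
        out.append(ch)
    return "".join(out)

print(remove_outer_commas("foo(bat,foo),bat"))  # foo(bat,foo)bat
print(remove_outer_commas("a,(b,(c,d),e),f"))   # a(b,(c,d),e)f
```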

Suppose we want to remove all commas outside of all blocks without modifying nested blocks.
Let's first add string validation for cases where unclosed or unopened blocks are found:
def validate_string(string):
    left_parts_count = len(string.split('('))
    right_parts_count = len(string.split(')'))
    diff = left_parts_count - right_parts_count
    if diff == 0:
        return
    if diff < 0:
        raise ValueError('Invalid string: "{string}". '
                         'Number of closed '
                         'but not opened blocks: {diff}.'
                         .format(string=string,
                                 diff=-diff))
    raise ValueError('Invalid string: "{string}". '
                     'Number of opened '
                     'but not closed blocks: {diff}.'
                     .format(string=string,
                             diff=diff))
Then we can do the job without regular expressions, using only str methods:
def remove_commas_outside_of_parentheses(string):
    # if you don't need string validation
    # then remove this line and string validator
    validate_string(string)
    left_parts = string.split('(')
    if len(left_parts) == 1:
        # no opened blocks found,
        # remove all commas
        return string.replace(',', '')
    left_outer_part = left_parts[0]
    left_outer_part = left_outer_part.replace(',', '')
    left_unopened_parts = left_parts[-1].split(')')
    right_outer_part = left_unopened_parts[-1]
    right_outer_part = right_outer_part.replace(',', '')
    return '('.join([left_outer_part] +
                    left_parts[1:-1] +
                    [')'.join(left_unopened_parts[:-1]
                              + [right_outer_part])])
It can look a bit nasty, I suppose, but it works.
Tests:
>>> remove_commas_outside_of_parentheses('foo,bat')
'foobat'
>>> remove_commas_outside_of_parentheses('foo,(bat,foo),bat')
'foo(bat,foo)bat'
>>> remove_commas_outside_of_parentheses('bar,baz(foo,(bat,foo),bat),bar,baz')
'barbaz(foo,(bat,foo),bat)barbaz'
"Broken" ones:
>>> remove_commas_outside_of_parentheses('(')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "<input>", line 4, in remove_commas_outside_of_parentheses
  File "<input>", line 17, in validate_string
ValueError: Invalid string: "(". Number of opened but not closed blocks: 1.
>>> remove_commas_outside_of_parentheses(')')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "<input>", line 4, in remove_commas_outside_of_parentheses
  File "<input>", line 12, in validate_string
ValueError: Invalid string: ")". Number of closed but not opened blocks: 1.

Using a dollar sign in enum (pypeg)?

I want to match types of the form $f, $c, ..., $d using pypeg, so I tried putting them in an Enum as follows:
class StatementType(Keyword):
    grammar = Enum( K("$f"), K("$c"),
                    K("$v"), K("$e"),
                    K("$a"), K("$p"),
                    K("$d"))
However, this fails:
>>> k = parse("$d", StatementType)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/site-packages/pypeg2/__init__.py", line 667, in parse
    t, r = parser.parse(text, thing)
  File "/usr/local/lib/python3.6/site-packages/pypeg2/__init__.py", line 794, in parse
    raise r
  File "<string>", line 1
    $d
    ^
SyntaxError: expecting StatementType
I have also tried replacing $x with \$x to escape the $ character, and using r"\$x" in the hope that it would be treated as a regex. Neither variant works; both give the same error message. How do I get it to match the example I gave?

The default regex for Keywords is \w+. You can change it by setting the Keyword.regex class variable:
import re
from pypeg2 import *
class StatementType(Keyword):
    grammar = Enum( K("$f"), K("$c"),
                    K("$v"), K("$e"),
                    K("$a"), K("$p"),
                    K("$d"))
Keyword.regex = re.compile(r"\$\w") # e.g. $a, $2, $_
k = parse("$d", StatementType)

Trying to do a simple regex

I want to extract (abc)(def) using a regex,
but I ended up with the error below:
import re
def main():
    str = "-->(abc)(def)<--"
    match = re.search("\-->(.*?)\<--" , str).group(1)
    print match
The error is:
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    match = re.search("\-->(.*?)\<--" , str).group()
  File "/usr/lib/python2.7/re.py", line 146, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or buffer

Corrected:
import re
def main():
    my_string = "-->(abc)(def)<--"
    match = re.search("\-->(.*?)\<--" , my_string).group(1)
    print match
    # (abc)(def)
main()
Note that I renamed str to my_string (do not use built-in names such as str for your own variables!). You may still be able to optimize the regex with lookarounds; the lazy star (.*?) can be very inefficient at times.
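As one possible alternative to the lazy star, a negated character class avoids backtracking when the captured text never contains "<". The sketch below is written for Python 3 and also guards against re.search returning None when nothing matches; the function name extract_between is hypothetical.

```python
import re

def extract_between(text):
    """Return the text between '-->' and '<--', or None if absent (hypothetical helper)."""
    # [^<]* assumes the captured part never contains '<'
    match = re.search(r"-->([^<]*)<--", text)
    return match.group(1) if match else None

print(extract_between("-->(abc)(def)<--"))  # (abc)(def)
print(extract_between("no markers here"))   # None
```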

Splitting up lines in a regular expression

I'm trying to break up a long regex into smaller chunks. Is it possible/good practice to change A to B?
A:
line = re.sub(r'\$\{([0-9]+)\}|\$([0-9]+)|\$\{(\w+?\=\w?+)\}|[^\\]\$(\w[^-]+)|[^\\]\$\{(\w[^-]+)\}',replace,line)
B:
line = re.sub(r'\$\{([0-9]+)\}|'
              r'\$([0-9]+)|'
              r'\$\{(\w+?\=\w?+)\}|'
              r'[^\\]\$(\w[^-]+)|'
              r'[^\\]\$\{(\w[^-]+)\}',replace,line)
Edit:
I receive the following error when running this in Python 2:
def main():
    while(1):
        line = raw_input("(%s)$ " % ncmd)
        line = re.sub(r'''
        \$\{([0-9]+)\}|
        \$([0-9]+)|
        \$\{(\w+?\=\w?+)\}|
        [^\\]\$(\w[^-]+)|
        [^\\]\$\{(\w[^-]+)\}
        ''',replace,line,re.VERBOSE)
        print '>> ' + line
Error:
(1)$ abc
Traceback (most recent call last):
  File "Test.py", line 4, in <module>
    main()
  File "Test.py", line 2, in main
    [^\\]\$\{(\w[^-]+)\}''',replace,line,re.VERBOSE)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 242, in _compile
    raise error, v # invalid expression
sre_constants.error: multiple repeat

You can use a triple-quoted (multi-line) string and set the re.VERBOSE flag, which allows you to break a regex pattern over multiple lines. Pass the flag by keyword, because the fourth positional parameter of re.sub is count, not flags:
line = re.sub(r'''
    \$\{([0-9]+)\}|
    \$([0-9]+)|
    \$\{(\w+?\=\w?+)\}|
    [^\\]\$(\w[^-]+)|
    [^\\]\$\{(\w[^-]+)\}
    ''', replace, line, flags=re.VERBOSE)
You can even include comments directly inside the string:
line = re.sub(r'''
    \$\{([0-9]+)\}|         # Pattern 1
    \$([0-9]+)|             # Pattern 2
    \$\{(\w+?\=\w?+)\}|     # Pattern 3
    [^\\]\$(\w[^-]+)|       # Pattern 4
    [^\\]\$\{(\w[^-]+)\}    # Pattern 5
    ''', replace, line, flags=re.VERBOSE)
Lastly, it should be noted that you can likewise activate the verbose flag by using re.X or by placing (?x) at the start of your Regex pattern.
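The "multiple repeat" error in the question most likely comes from \w?+: before Python 3.11 the re module rejects a quantifier applied directly to another quantifier (3.11 and later read ?+ as a possessive quantifier instead). The following Python 3 sketch uses a cut-down version of the pattern with that group rewritten as \w+?=\w+?, which is an assumption about the intent. The replace callback and the sample input are hypothetical.

```python
import re

# Cut-down, hypothetical version of the pattern from the question.
PATTERN = re.compile(r'''
    \$\{([0-9]+)\}       |  # ${12}
    \$([0-9]+)           |  # $12
    \$\{(\w+?=\w+?)\}       # ${name=value}
''', flags=re.VERBOSE)

def replace(match):
    # Hypothetical callback: wrap whichever group matched in angle brackets.
    value = next(g for g in match.groups() if g is not None)
    return '<' + value + '>'

result = PATTERN.sub(replace, 'echo ${1} $2 ${a=b}')
print(result)  # echo <1> <2> <a=b>
```

Compiling the pattern once with re.compile keeps the flag next to the pattern, so it cannot be mistaken for the count argument of re.sub.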

You can also split your expression over multiple lines using adjacent string literals, which Python concatenates into a single string:
line = re.sub(r"\$\{([0-9]+)\}|\$([0-9]+)|"
              r"\$\{(.+-.+)\}|"
              r"\$\{(\w+?\=\w+?)\}|"
              r"\$(\w[^-]+)|\$\{(\w[^-]+)\}",replace,line)
