I am currently doing a data analysis project involving text mining, and I am stuck on filtering out certain phrases.
Suppose I have this tokenized array of words:
arr = ['hello', ',', 'how', 'is', 'your', 'day', 'going', '?', '#', 'HelloWorld']
(hello, how is your day going? #HelloWorld)
and I want to remove the #HelloWorld from the sentence.
My original logic was to traverse the array and check for the #. Once the # has been found, I would replace the # and the element after it with a blank space, as follows:
N = 0
for index in arr:
    if arr[N] == '#':
        arr[N] = ' '
        arr[N+1] = ' '
    N += 1
Unfortunately, I got the error "list assignment index out of range" at line 5. I tried to use .append(), but it only allows modification at N.
Is there another approach to this?
This should work. Like the others said, you need to check whether you are at the end of the list.
EDIT: simplified!
arr = ['a', 'b', '#', 'aa']
indices = [idx for idx, elt in enumerate(arr) if elt == '#']
for idx in indices:
    if idx != len(arr) - 1:  # only touch the next element if '#' is not the last one
        arr[idx+1] = ' '
    arr[idx] = ' '
Your code will try to access an index outside the list when the last element is #, so you need to check for that.
There's also no need to use a separate variable for iteration and indexing; just iterate over the range of indexes.
for i in range(len(arr)):
    if arr[i] == '#':
        arr[i] = ' '
        if i < len(arr) - 1:  # there is a next element to blank out
            arr[i+1] = ' '
The root cause of the error is that N+1 goes out of range when the loop reaches the end of the list.
If one element must always follow a '#', try the code below:
arr = ['hello', ',', 'how', 'is', 'your', 'day', 'going', '?', '#', 'HelloWorld']
for index in range(len(arr)):
    if arr[index] == '#':
        arr[index:index+2] = ['', '']
print(arr)
Output:
['hello', ',', 'how', 'is', 'your', 'day', 'going', '?', '', '']
If the list ends with '#', the slice assignment still replaces that single '#' with ['', ''], so the list grows by one element (I am not sure whether this is the result you expect):
arr = ['hello', ',', 'how', 'is', 'your', 'day', 'going', '?', '#', 'HelloWorld', '#']
for index in range(len(arr)):
    if arr[index] == '#':
        arr[index:index+2] = ['', '']
print(arr)
Output:
['hello', ',', 'how', 'is', 'your', 'day', 'going', '?', '', '', '', '']
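Another way to avoid the index error altogether is to build a new list instead of overwriting slots in place. The sketch below is only one possible take and does not come from the answers above: `drop_hashtags` is a made-up helper name, and it assumes a hashtag is always tokenized as '#' followed by at most one word.

```python
def drop_hashtags(tokens):
    """Return a copy of tokens without '#' and the token right after it."""
    result = []
    skip_next = False
    for token in tokens:
        if skip_next:
            # This token directly follows a '#', so it is dropped as well.
            skip_next = False
            continue
        if token == '#':
            skip_next = True
            continue
        result.append(token)
    return result


arr = ['hello', ',', 'how', 'is', 'your', 'day', 'going', '?', '#', 'HelloWorld']
print(drop_hashtags(arr))
print(drop_hashtags(arr + ['#']))  # a trailing '#' needs no special case
```

Because nothing is ever assigned at `index + 1`, a '#' in the last position cannot go out of range, and the original list is left untouched.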
Input
['select', '*', 'from', 'ak','.','person']
I need to merge the elements after 'from' into a single element.
Expected Output
['select', '*', 'from', 'ak.person']
Code is below
m = []
for i in a:
    if '.' == i:
        ind = a.index('.')
        m.append(a[ind-1] + a[ind] + a[ind+1])
    else:
        m.append(i)
My output >> ['select', '*', 'from', 'ak', 'ak.person', 'person']
Expected is ['select', '*', 'from', 'ak.person']
Relatively short:
arr = ['select', '*', 'from', 'ak','.','person']
ind = arr.index('from') + 1
# keep everything up to and including 'from', then join the rest into one string:
print(arr[:ind] + ["".join(arr[ind:])])
The loop here checks whether the previous element is 'from' and, if so, joins the three elements that come after it.
This should work for test cases that follow the same pattern as the one you've given (including longer queries with where, as mentioned in the updated question).
a = ['select', '*', 'from', 'ak','.','person']
m = []
i = 0
while i < len(a):
    if i > 0 and a[i-1] == "from":
        m.append("".join(a[i:i+3]))
        i += 3
    else:
        m.append(a[i])
        i += 1
print(m)
Output
['select', '*', 'from', 'ak.person']
Try this. I first extracted a sub-list after 'from' and merged that.
orig = ['select', '*', 'from', 'ak', '.' ,'person']
from_pos = orig.index('from')
sublist = orig[from_pos+1:from_pos+4]
new_string = ''
for txt in sublist:
    new_string += txt
new_list = orig[0:from_pos+1]
new_list.append(new_string)
print(orig)
print(new_list)
If you have a where clause or a group by after that, you can try this:
orig = ['select', '*', 'from', 'ak', '.' ,'person', 'where', 'filter', 'group', 'by']
from_pos = orig.index('from')
sublist = orig[from_pos+1:from_pos+4]
new_string = ''
for txt in sublist:
    new_string += txt
new_list = orig[0:from_pos+1]
new_list.append(new_string)
where = orig[from_pos+4:]
new_list = new_list + where
print(orig)
print(new_list)
You get:
['select', '*', 'from', 'ak.person', 'where', 'filter', 'group', 'by']
Try this:
def my_func(my_list):
    lp = my_list[(my_list.index('from')+1):(len(my_list))]
    ans = my_list[0:(my_list.index('from')+1)] + ["".join(lp)]
    return ans
my_func(['select', '*', 'from', 'ak','.','person'])
Output:
['select', '*', 'from', 'ak.person']
Alternate Solution:
def sub_func(my_list):
    # merge the first '.' with its left and right neighbours
    lp = my_list[(my_list.index('.')-1):(my_list.index('.')+2)]
    ans = my_list[0:(my_list.index('.')-1)] + ["".join(lp)] + my_list[(my_list.index('.')+2):(len(my_list))]
    return ans
def my_func(my_list):
    lst = my_list
    # one pass of sub_func per '.' in the list
    for i in range(my_list.count('.')):
        lst = sub_func(lst)
    return lst
my_func(['select', '*', 'from', 'ak','.','person','where','foo','.','bar', '=', '30'])
Output:
['select', '*', 'from', 'ak.person', 'where', 'foo.bar', '=', '30']
This solution targets any elements separated by a period and joins them. Unlike the original solution, it will function with lists containing multiple periods and with lists that do not use a 'from' statement.
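The same idea can also be written as a single pass over the list, which avoids calling index('.') repeatedly. This is a minimal sketch rather than the answer's own code: `merge_dotted` is a hypothetical name, and it assumes a '.' should only be merged when it has a token on both sides.

```python
def merge_dotted(tokens):
    """Join 'left', '.', 'right' runs into a single 'left.right' token."""
    merged = []
    i = 0
    while i < len(tokens):
        # Merge only when the '.' has a neighbour on both sides.
        if tokens[i] == '.' and merged and i + 1 < len(tokens):
            merged[-1] = merged[-1] + '.' + tokens[i + 1]
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


query = ['select', '*', 'from', 'ak', '.', 'person',
         'where', 'foo', '.', 'bar', '=', '30']
print(merge_dotted(query))
```

A leading or trailing '.' is simply kept as its own token instead of raising an error.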
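Golf answer (needs Python 3.8+ for the := operator; the list is bound to a before it is sliced, so the line also runs when a is not defined yet):
print((a:=['select', '*', 'from', 'ak','.','person'])[:(i:=a.index('from')+1)]+[''.join(a[i:])])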
I have two lists. One is a list of selected punctuation marks and the other is a list of tokens.
punc = ['.', '!', '?']
tokens = ['today', 'i', 'went', 'to', 'the', 'park', '.', 'it', 'was', 'great', '!']
How do I get the index of the first punctuation (as defined by the list punc) that appears in tokens?
In the above case, my desired output is index = 6 since the first punctuation that appears is '.'.
You can do it like this with index() on the tokens list:
punc = ['.', '!', '?']
tokens = ['today', 'i', 'went', 'to', 'the', 'park', '.', 'it', 'was', 'great', '!']
for p in punc:
    if p in tokens:
        print(p, tokens.index(p), sep=" index is: ")
    else:
        print(p, 'not found', sep=' ')
This code prints the index in tokens of every element of punc that is present.
With a list comprehension:
[print(p, tokens.index(p), sep=" index is: ") if p in tokens else print(p, 'not found', sep=' ') for p in punc]
Output:
. index is: 6
! index is: 10
? not found
In case you just want to check the first item and not the entire punc list:
print(tokens.index(punc[0]) if punc[0] in tokens else 'not found')
Output:
6
Calling index() raises a ValueError exception when the element is not in the list:
Exception has occurred: ValueError
'?' is not in list
In your case this can happen for the value ?, which is not present in tokens.
There are two simple ways to solve this:
Check whether the item is in the list first, like: '?' in tokens (this is the clean/readable approach).
Wrap the .index() call inside a try/except and handle the error (this is the fast approach).
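The two options can be sketched side by side as follows. The helper names `index_with_check` and `index_with_try` are invented for this illustration, and returning None for a missing value is an assumption about what the caller wants.

```python
tokens = ['today', 'i', 'went', 'to', 'the', 'park', '.', 'it', 'was', 'great', '!']


def index_with_check(items, value):
    """Membership test first; the list is scanned twice when the value is present."""
    return items.index(value) if value in items else None


def index_with_try(items, value):
    """Call index() directly and treat ValueError as 'not found'."""
    try:
        return items.index(value)
    except ValueError:
        return None


print(index_with_check(tokens, '.'), index_with_try(tokens, '?'))  # prints: 6 None
```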
The solution to your problem would be this:
punc = ['.', '!', '?']
tokens = ['today', 'i', 'went', 'to', 'the', 'park', '.', 'it', 'was', 'great', '!']
for i, element in enumerate(tokens):
    if element in punc:
        print(f"Found {element} at index: {i}")
        break
What we do here is loop over tokens using enumerate, which yields the index and the element. On each iteration we check whether the element is in punc; if it is, you have found your first punctuation mark and the loop stops.
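The same enumerate-based search can be condensed with next() and a generator expression. This is a sketch, not part of the answer above: storing the punctuation in a set and using None as the "nothing found" default are both choices made for this example.

```python
punc = {'.', '!', '?'}  # a set keeps each membership test cheap
tokens = ['today', 'i', 'went', 'to', 'the', 'park', '.', 'it', 'was', 'great', '!']

# next() stops at the first match; the second argument is returned if there is none.
first_index = next((i for i, token in enumerate(tokens) if token in punc), None)
print(first_index)  # prints: 6
```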
I have a list of strings ['what', 'is', 'apple', '&', 'orange']
I want to join only when '&' sits between two strings.
Wanted:
['what', 'is', 'apple&orange']
What I can think of so far looks very dumb
Is there a Pythonic way to do this?
Here's a 1-liner using the string methods str.join, str.replace and str.split:
l = ['what', 'is', 'apple', '&', 'orange', 'apple', '&', 'banana']
new_list = ' '.join(l).replace(' & ', '&').split()
print(new_list)
# ['what', 'is', 'apple&orange', 'apple&banana']
This isn't very pythonic, but it's the cleanest way I could think of. Note that this code breaks if you have an '&' at the last position of your list.
your_list = ['what', 'is', 'apple', '&', 'orange']
out = []
i = 0
while i < len(your_list):
    word = your_list[i]
    if (i < len(your_list) - 1) and your_list[i + 1] == '&':
        word = word + '&' + your_list[i + 2]
        i += 2
    out.append(word)
    i += 1
print(out)
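One way to cover the case of an '&' in the last position is to pull the following token from an iterator and fall back when there is none. The sketch below is an illustration only: `join_around` is a hypothetical helper, and keeping a dangling '&' as its own token is an assumption about the desired behaviour.

```python
def join_around(tokens, sep='&'):
    """Merge 'left', sep, 'right' into one token; leave a dangling sep alone."""
    out = []
    it = iter(tokens)
    for token in it:
        if token == sep and out:
            following = next(it, None)  # None means sep was the last token
            if following is None:
                out.append(token)
            else:
                out[-1] += sep + following
        else:
            out.append(token)
    return out


print(join_around(['what', 'is', 'apple', '&', 'orange']))
print(join_around(['what', 'is', 'apple', '&']))
```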
I have difficulties splitting a string into specific parts in Python 3.
The string is basically a list with a colon (:) as a delimiter.
Only when the colon is prefixed with a backslash (\) does it not count as a delimiter but as part of the list item.
Example:
String --> I:would:like:to:find\:out:how:this\:works
Converted List --> ['I', 'would', 'like', 'to', 'find\:out', 'how', 'this\:works']
Any idea how this could work?
@Bertrand I was trying to give you some code and was able to figure out a workaround, but this is probably not the most beautiful solution:
text = r"I:would:like:to:find\:out:how:this\:works"
values = text.split(":")
new = []
temp = None  # holds a partial item whose colon was escaped
for element in values:
    if element.endswith("\\"):
        # the colon after this piece was escaped: keep accumulating
        temp = element if temp is None else temp + ":" + element
    elif temp is not None:
        # last piece of an escaped item: join it and reset temp
        new.append(temp + ":" + element)
        temp = None
    else:
        # ordinary element
        new.append(element)
print(new)
Output:
['I', 'would', 'like', 'to', 'find\\:out', 'how', 'this\\:works']
You should use re.split and perform a negative lookbehind to check for the backslash character.
import re
pattern = r'(?<!\\):'
s = r'I:would:like:to:find\:out:how:this\:works'
print(re.split(pattern, s))
Output:
['I', 'would', 'like', 'to', 'find\\:out', 'how', 'this\\:works']
You can replace the "\:" with a placeholder (just make sure it is something that doesn't occur anywhere else in the string; a long, unusual token is a safe choice), then split by ":" and put the "\:" back.
[x.replace("$", r"\:") for x in str1.replace(r"\:", "$").split(":")]
Explanation:
str1 = r'I:would:like:to:find\:out:how:this\:works'
Replace "\:" with "$" (or something else):
str1.replace(r"\:", "$")
Out: 'I:would:like:to:find$out:how:this$works'
Now split by ":":
str1.replace(r"\:", "$").split(":")
Out: ['I', 'would', 'like', 'to', 'find$out', 'how', 'this$works']
and replace "$" with "\:" in every element:
[x.replace("$", r"\:") for x in str1.replace(r"\:", "$").split(":")]
Out: ['I', 'would', 'like', 'to', 'find\\:out', 'how', 'this\\:works']
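The warning about choosing a placeholder that does not occur in the string can be turned into an explicit check. In this sketch, `split_escaped` is a made-up name and the NUL character is an assumed placeholder; neither comes from the answer above.

```python
def split_escaped(text, sep=':', escape='\\'):
    """Split on sep, except where sep is preceded by the escape character."""
    placeholder = '\x00'  # assumption: NUL never appears in real input
    if placeholder in text:
        raise ValueError('placeholder already present in text')
    protected = text.replace(escape + sep, placeholder)
    return [part.replace(placeholder, escape + sep)
            for part in protected.split(sep)]


text = r'I:would:like:to:find\:out:how:this\:works'
print(split_escaped(text))
```

Failing loudly when the placeholder is already present is safer than silently producing a wrong split.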
Use re.split
Example:
import re
s = r"I:would:like:to:find\:out:how:this\:works"
print( re.split(r"(?<=\w):", s) )
Output:
['I', 'would', 'like', 'to', 'find\\:out', 'how', 'this\\:works']
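The two re.split patterns in the answers above give the same result on the example string, but they are not equivalent. The short comparison below uses a hypothetical input in which one item ends with '!' to show where they diverge.

```python
import re

s = r'a!:b\:c'  # hypothetical input: the first item ends in '!'

# Split on every colon that is NOT preceded by a backslash.
negative = re.split(r'(?<!\\):', s)
# Split only on colons that ARE preceded by a word character.
positive = re.split(r'(?<=\w):', s)

print(negative)
print(positive)
```

The negative lookbehind splits after 'a!', while the positive lookbehind on \w leaves the whole string in one piece, so the former is closer to "every colon except escaped ones".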
I am trying to split a list into a new list. Here's the initial list:
initList = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word',
'title', 'PTE427', 'how', 'are', 'you']
I want to split the list on each PTExyz item, so that the new list looks like this:
newList = ['PTE123 I am programmer', 'PTE345 based word title', 'PTE427 how are you']
How should I develop a proper algorithm for the general case with repeated PTExyz items?
Thank You!
The algorithm goes something like this.
Iterate over the list, keeping a temp string that starts out empty. When a string s starts with PTE, append temp to the result list if it is not empty, then set temp to s. Otherwise, add s to the end of temp.
ls = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word', 'title', 'PTE427', 'how', 'are', 'you']
result = []
temp = ''
for s in ls:
    if s.startswith('PTE'):
        if temp != '':
            result.append(temp)
        temp = s
    else:
        if temp == '' or s == '':
            # nothing to attach to yet, or an empty token that would add a double space
            continue
        temp += ' ' + s
if temp != '':
    result.append(temp)
print(result)
Edit
To match the full PTExyz pattern you can use a regular expression. In that case, import re and replace the s.startswith('PTE') condition with:
re.match(r'PTE\w{3}$', s)
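Putting the regular-expression check and the grouping loop together could look like the sketch below. It is not the answer's exact code: `group_by_marker` and `MARKER` are hypothetical names, and reading "xyz" as exactly three digits is an assumption.

```python
import re

MARKER = re.compile(r'PTE\d{3}')  # assumption: 'xyz' means exactly three digits


def group_by_marker(tokens):
    """Start a new group at each PTExyz token and join each group with spaces."""
    groups = []
    for token in tokens:
        if MARKER.fullmatch(token):
            groups.append([token])  # a marker opens a new group
        elif token and groups:
            # Empty strings and tokens before the first marker are skipped.
            groups[-1].append(token)
    return [' '.join(group) for group in groups]


init_list = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word',
             'title', 'PTE427', 'how', 'are', 'you']
print(group_by_marker(init_list))
```

Collecting each group in its own list and joining at the end avoids the double space that an empty token would otherwise introduce.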
I think this will work:
l = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word', 'title', 'PTE427', 'how', 'are', 'you']
# drop empty tokens so the joined string has single spaces
s = ' '.join(w for w in l if w)
# splitting on 'PTE' leaves an empty first piece, which is skipped
resultlist = ['PTE' + part.strip() for part in s.split('PTE') if part]
print(resultlist)
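This version uses a regular expression to match PTExyz:
import re
l = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word',
'title', 'PTE427', 'how', 'are', 'you']
pattern = re.compile(r'PTE\d{3}')
# collect the PTExyz markers in order
k = [i for i in l if pattern.match(i)]
# drop empty tokens so the joined string has single spaces
s = ' '.join(w for w in l if w)
# splitting on the markers leaves an empty first piece, which is removed
parts = re.split(pattern, s)
parts.remove('')
result = [k[i] + ' ' + parts[i].strip() for i in range(len(k))]
print(result)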
>>> tokens = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word', 'title', 'PTE427', 'how', 'are', 'you']
>>> index_list = [i for i, item in enumerate(tokens) if item.startswith('PTE')]
>>> index_list.append(len(tokens))
>>> index_list
[0, 5, 9, 13]
>>> [' '.join(filter(None, tokens[a:b])) for a, b in zip(index_list, index_list[1:])]
['PTE123 I am programmer', 'PTE345 based word title', 'PTE427 how are you']