I have a keyword list and an input list of lists. My task is to find those lists that contain the keyword (even partially). I am able to extract the lists that contain the keyword using the following code:
t_list = [['Subtotal: ', '1,292.80 '], ['VAT ', ' 64.64 '], ['RECEIPT TOTAL ', 'AED1,357.44 '],
['NOT_SELECTED, upto2,000 ', 'Sub total ', '60.58 '],
['NOT_SELECTED, upto500 ', 'amount 160.58 ', '', '3.03 '],
['Learn', 'Bectricity total ', '', '', '63.61 ']]
keyword = ['total ', 'amount ']
for string_list in t_list:
    # Remove empty strings in place
    string_list[:] = [item for item in string_list if item != '']
    for element in string_list:
        element = element.lower()
        if any(s in element for s in keyword):
            print(string_list)
The output is:
[['Subtotal: ', '1,292.80 '], ['RECEIPT TOTAL ', 'AED1,357.44 '], ['NOT_SELECTED, upto2,000 ', 'Sub total ', '60.58 '], ['NOT_SELECTED, upto500 ', 'amount 160.58 ', '3.03 '],
['Learn', 'Bectricity total ', '63.61 ']]
Required output is to have only the string that matched with the keyword and the number in the list.
Required output:
[['Subtotal: ', '1,292.80 '], ['RECEIPT TOTAL ', 'AED1,357.44 '], ['Sub total ', '60.58 '], ['amount 160.58 ', '3.03 '],['Bectricity total ', '63.61 ']]
If I can have the output as a dictionary with the string matched to the keyword as key and the number a value, it would be perfect.
Thanks a ton in advance!
Here is the answer from our chat, slightly modified with some comments as some explanation for the code. Feel free to ask me to clarify or change anything.
import re
t_list = [
['Subtotal: ', '1,292.80 '],
['VAT ', ' 64.64 '],
['RECEIPT TOTAL ', 'AED1,357.44 '],
['NOT_SELECTED, upto2,000 ', 'Sub total ', '60.58 '],
['NOT_SELECTED, upto500 ', 'amount 160.58 ', '', '3.03 '],
['Learn', 'Bectricity total ', '', '', '63.61 ']
]
keywords = ['total ', 'amount ']
output = {}
for sub_list in t_list:
    # Becomes the string that matched the keyword if one is found
    matched = None
    for item in sub_list:
        for keyword in keywords:
            if keyword in item.lower():
                matched = item
    # If a match was found, then we start looking at the list again
    # looking for the numbers
    if matched:
        for item in sub_list:
            # split the string so for example 'amount 160.58 ' becomes ['amount', '160.58']
            # This allows us to more easily extract just the number
            split_items = item.split()
            for split_item in split_items:
                # Simple use of regex to match any '.' with digits either side
                re_search = re.search(r'[0-9][.][0-9]', split_item)
                if re_search:
                    # Try block because we are making a list. If the list exists,
                    # then just append a value, otherwise create the list with the item
                    # in it
                    try:
                        output[matched.strip()].append(split_item)
                    except KeyError:
                        output[matched.strip()] = [split_item]
print(output)
You mentioned wanting to match a string such as 'AED 63.61'. My solution is using .split() to separate strings and make it easier to grab just the number. For example, for a string like 'amount 160.58' it becomes much easier to just grab the 160.58. I'm not sure how to go about matching a string like the one you want to keep but not matching the one I just mentioned (unless, of course, it is just 'AED' in which case we could just add some more logic to match anything with 'aed').
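One possible way to handle prefixed values such as 'AED1,357.44' is to pull the number out with a regex instead of splitting on whitespace. The following is only a sketch under some assumptions: every number of interest contains a decimal point, the names `NUMBER` and `extract_totals` are made up for illustration, and the keywords are written without the trailing space so that 'Subtotal: ' also matches.

```python
import re

# Assumption: a "number" is digits (optionally with thousands commas)
# followed by a decimal point and more digits, e.g. 1,357.44
NUMBER = re.compile(r'\d[\d,]*\.\d+')

def extract_totals(rows, keywords):
    """Map the first keyword-matching string of each row to the numbers in that row."""
    result = {}
    for row in rows:
        # First item that contains any keyword (case-insensitive), or None
        label = next(
            (item for item in row if any(k in item.lower() for k in keywords)),
            None,
        )
        if label is None:
            continue
        # Collect every decimal number in the row, ignoring prefixes like 'AED'
        numbers = [n for item in row for n in NUMBER.findall(item)]
        if numbers:
            result[label.strip()] = numbers
    return result

t_list = [
    ['Subtotal: ', '1,292.80 '],
    ['VAT ', ' 64.64 '],
    ['RECEIPT TOTAL ', 'AED1,357.44 '],
    ['NOT_SELECTED, upto2,000 ', 'Sub total ', '60.58 '],
    ['NOT_SELECTED, upto500 ', 'amount 160.58 ', '', '3.03 '],
    ['Learn', 'Bectricity total ', '', '', '63.61 '],
]
totals = extract_totals(t_list, ['total', 'amount'])
print(totals)
```

Because the regex requires a decimal point, 'upto2,000' is skipped, while 'amount 160.58 ' contributes both its own number and the separate '3.03'.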
Related
say I have a text file with a format similarly to this
Q: hello what is your name?
A: Hi my name is John Smith
and I want to create a matrix such that it is a 2xn in this case
[['hello','what','is','your','name','?', ' '],['hi','my','name','is','John','Smith']]
note that the first row has an empty entry because it has 6 strings while second row has 7 strings
You can use re.split:
import re
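# Read the whole file; the with block closes it afterwards
with open('filename.txt') as f:
    file_data = f.read()
# Split on the "Q: " / "A: " labels and drop empty pieces
results = filter(None, re.split(r'A:\s|Q:\s', file_data))
new_results = [re.findall(r'\w+|\W', i) for i in results]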
Output:
[['hello', ' ', 'what', ' ', 'is', ' ', 'your', ' ', 'name', '?', ' '], ['Hi', ' ', 'my', ' ', 'name', ' ', 'is', ' ', 'John', ' ', 'Smith']]
Just split the strings using the split function:
with open('txt.txt') as my_file:
    lines = my_file.readlines()
# lines[0] = "Q: hello what is your name?"
# lines[1] = "A: Hi my name is John Smith"
then just use
output = [lines[0].split(), lines[1].split()]
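If the rows really need to form a rectangular matrix, one option is to tokenize each line and then pad the shorter rows. This is a rough sketch, not part of either answer: the function name `tokenize_rows` is made up, it assumes each line starts with a `Q:` or `A:` label, and it pads with empty strings.

```python
import re

def tokenize_rows(text):
    """Split each 'Q:'/'A:' line into word and punctuation tokens, padded to equal length."""
    rows = []
    for line in text.splitlines():
        # Drop the leading "Q:" or "A:" label, if present
        body = re.sub(r'^[QA]:\s*', '', line.strip())
        if body:
            # Words, or single non-space punctuation characters
            rows.append(re.findall(r'\w+|[^\w\s]', body))
    width = max((len(row) for row in rows), default=0)
    # Pad shorter rows with empty strings so every row has the same length
    return [row + [''] * (width - len(row)) for row in rows]

sample = "Q: hello what is your name?\nA: Hi my name is John Smith"
matrix = tokenize_rows(sample)
print(matrix)
```

With the sample text both rows happen to contain six tokens, so no padding is added; padding only appears when the rows differ in length.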
So I was trying to write a function called find_treasure which takes a 2D list as a parameter. The purpose of the function is to search through the 2D list given and to return the index of where the 'x' is located.
def find_treasure(my_list):
    str1 = 'x'
    if str1 in [j for i in (my_list) for j in i]:
        index = (j for i in my_list for j in i).index(str1)
        return(index)
treasure_map = [[' ', ' ', ' '], [' ', 'x', ' '], [' ', ' ', ' ']]
print(find_treasure(treasure_map))
However, I can't seem to get the function to return the index. I tried using the enumerate function too, but I must have been using it wrongly.
Using enumerate
def find_treasure(my_list):
    str1 = 'x'
    for i, n in enumerate(my_list):
        for j, m in enumerate(n):
            if m == str1:
                return (i, j)
treasure_map = [[' ', ' ', ' '], [' ', 'x', ' '], [' ', ' ', ' ']]
print(find_treasure(treasure_map))
Output:
(1, 1)
Using index function.
def find_treasure(my_list):
    str1 = 'x'
    for i, n in enumerate(my_list):
        try:
            return (i, n.index(str1))
        except ValueError:
            pass
treasure_map = [[' ', ' ', ' '], [' ', 'x', ' '], [' ', ' ', ' ']]
print(find_treasure(treasure_map))
Output
(1, 1)
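Both versions above return None implicitly when there is no 'x'. As a small variation, here is a sketch that makes the "not found" case explicit and can also report every match; the names `find_first` and `find_all` are illustrative only.

```python
def find_all(grid, target):
    """Return a list of (row, column) positions where target occurs."""
    return [
        (r, c)
        for r, row in enumerate(grid)
        for c, cell in enumerate(row)
        if cell == target
    ]

def find_first(grid, target):
    """Return the first (row, column) position of target, or None if absent."""
    positions = (
        (r, c)
        for r, row in enumerate(grid)
        for c, cell in enumerate(row)
        if cell == target
    )
    # next() with a default avoids StopIteration when nothing matches
    return next(positions, None)

treasure_map = [[' ', ' ', ' '], [' ', 'x', ' '], [' ', ' ', ' ']]
print(find_first(treasure_map, 'x'))
print(find_first(treasure_map, 'y'))
```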
I got a list like this:
[['zai4'], [' '], ['tui1'], ['jin4'], [' '], ['shi2'], ['pin3'], [' '], ['an1'], ['quan2'], [' '], ['xin4'], ['xi1'], [' ']]
how could I convert it into this kind:
zai4 tui1 jin4 shi2 pin3 an1 quan2 xin4 xi1
Thank you.
I think the unique part of my question is how to extract content from
sub-list into string.
Try this approach:
data=[['zai4'], [' '], ['tui1'], ['jin4'], [' '], ['shi2'], ['pin3'], [' '], ['an1'], ['quan2'], [' '], ['xin4'], ['xi1'], [' ']]
print([i[0] for i in data if i[0].isalnum()])
output:
['zai4', 'tui1', 'jin4', 'shi2', 'pin3', 'an1', 'quan2', 'xin4', 'xi1']
if you want without list then:
print(" ".join([i[0] for i in data if i[0].isalnum()]))
output:
zai4 tui1 jin4 shi2 pin3 an1 quan2 xin4 xi1
Using list comprehension and str.join
a = [['zai4'], [' '], ['tui1'], ['jin4'], [' '], ['shi2'], ['pin3'], [' '], ['an1'], ['quan2'], [' '], ['xin4'], ['xi1'], [' ']]
print(" ".join(" ".join(i[0] for i in a).split()))
Output:
zai4 tui1 jin4 shi2 pin3 an1 quan2 xin4 xi1
You can use unpacking:
s = [['zai4'], [' '], ['tui1'], ['jin4'], [' '], ['shi2'], ['pin3'], [' '], ['an1'], ['quan2'], [' '], ['xin4'], ['xi1'], [' ']]
new_s = ' '.join(a for [a] in s if a != ' ')
Output:
'zai4 tui1 jin4 shi2 pin3 an1 quan2 xin4 xi1'
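If the sub-lists might hold more than one string each, a slightly more general sketch is to flatten them first and let `str.split()` collapse the whitespace. The function name `flatten_to_string` is made up for this example.

```python
from itertools import chain

def flatten_to_string(nested):
    """Flatten a list of string lists and join the non-blank pieces with single spaces."""
    # Join everything with spaces, then split() with no argument drops
    # all runs of whitespace, including the blank ' ' entries
    return " ".join(" ".join(chain.from_iterable(nested)).split())

data = [['zai4'], [' '], ['tui1'], ['jin4'], [' '], ['shi2'], ['pin3'], [' '],
        ['an1'], ['quan2'], [' '], ['xin4'], ['xi1'], [' ']]
sentence = flatten_to_string(data)
print(sentence)
```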
For some reason I'm struggling to initialize a numpy.chararray with spaces.
This works:
char_array1 = np.chararray((3, 3))
char_array1[:] = 'a'
char_array1
Output:
chararray([['a', 'a', 'a'],
['a', 'a', 'a'],
['a', 'a', 'a']],
dtype='|S1')
This doesn't:
char_array2 = np.chararray((3, 3))
char_array2[:] = ' '
char_array2
Output:
chararray([['', '', ''],
['', '', ''],
['', '', '']],
dtype='|S1')
What is causing this? I can't see an option to strip the items or something.
In fact char arrays do remove whitespace:
Versus a regular NumPy array of type str or unicode, this class adds the following functionality:
1. values automatically have whitespace removed from the end when indexed
2. comparison operators automatically remove whitespace from the end when comparing values
3. vectorized string operations are provided as methods (e.g. endswith) and infix operators (e.g. "+", "*", "%")
So the answer is to use a regular array of type str or unicode:
char_array3 = np.empty((3, 3), dtype='str')
char_array3[:] = ' '
char_array3
Output:
array([[' ', ' ', ' '],
[' ', ' ', ' '],
[' ', ' ', ' ']],
dtype='|S1')
Just create your array with ndarray:
chararray = np.ndarray((3,3), dtype='S1')
chararray[:]=' '
gives:
array([[' ', ' ', ' '],
[' ', ' ', ' '],
[' ', ' ', ' ']],
dtype='|S1')
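As a further option, `np.full` creates and fills a regular array in one step. This is a sketch assuming a Python 3 environment, where a one-character unicode dtype ('U1') is used rather than the '|S1' bytes dtype shown above.

```python
import numpy as np

# A regular ndarray of one-character strings, filled with spaces.
# Unlike chararray, indexing does not strip trailing whitespace.
grid = np.full((3, 3), ' ', dtype='U1')
print(repr(grid))
print(repr(grid[0, 0]))
```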
I have a string of text that looks like this:
' 19,301 14,856 18,554'
Each gap, including the leading one, is a run of one or more spaces.
I'm trying to split it on the white space, but I need to retain all of the white space as an item in the new list. Like this:
[' ', '19,301',' ', '14,856', ' ', '18,554']
I have been using the following code:
re.split(r'( +)(?=[0-9])', item)
and it returns:
['', ' ', '19,301', ' ', '14,856', ' ', '18,554']
Notice that it always adds an empty element to the beginning of my list. It's easy enough to drop it, but I'm really looking to understand what is going on here, so I can get the code to treat things consistently. Thanks.
When using re.split, if the separator contains a capture group and matches at the start of the string, the "result will start with an empty string". The reason is so that the join method can behave as the inverse of the split method.
It might not make a lot of sense for your case, where the separator matches are of varying sizes, but if you think about the case where the separators were a | character and you wanted to perform a join on them, with the extra empty string it would work:
>>> import re
>>> item = '|19,301|14,856|18,554'
>>> items = re.split(r'\|', item)
>>> print(items)
['', '19,301', '14,856', '18,554']
>>> '|'.join(items)
'|19,301|14,856|18,554'
But without it, the initial pipe would be missing:
>>> items = ['19,301', '14,856', '18,554']
>>> '|'.join(items)
'19,301|14,856|18,554'
You can do it with re.findall():
>>> s = '\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s19,301\s\s\s\s\s\s\s\s\s14,856\s\s\s\s\s\s\s\s18,554'.replace('\\s',' ')
>>> re.findall(r' +|[^ ]+', s)
[' ', '19,301', ' ', '14,856', ' ', '18,554']
You said "space" in the question, so the pattern works with space. For matching runs of any whitespace character you can use:
>>> re.findall(r'\s+|\S+', s)
[' ', '19,301', ' ', '14,856', ' ', '18,554']
The pattern matches one or more whitespace characters or one or more non-whitespace character, for example:
>>> s=' \t\t ab\ncd\tef g '
>>> re.findall(r'\s+|\S+', s)
[' \t\t ', 'ab', '\n', 'cd', '\t', 'ef', ' ', 'g', ' ']