Iterate through list of lists and remove unwanted strings


I'm having a play about and I've scraped an ingredient list from a website.
I now have a list of lists.
ingrediant_list = []
for ingrediant in soup.select('.wprm-recipe-ingredient'):
    ingrediant_list.append(ingrediant.text)
full_list = []
for item in ingrediant_list:
    full_list.append(item.split())
This is my code that generates the list of lists. First I get the ingredients from the website and put them into ingrediant_list; then I split each string into a separate list, generating a list of lists under full_list.
My list is as follows:
[['400', 'g', '5%', 'Fat', 'Minced', 'Beef'], ['1', 'large', 'Onion',
'finely', 'chopped'], ['3', 'cloves', 'Garlic', 'finely', 'grated'],
['5', 'Mushrooms', 'sliced'], ['1', 'large', 'Carrot', 'finely',
'chopped'], ['1', 'stick', 'Celery', 'finely', 'chopped'], ['1',
'Red', 'Pepper', 'finely', 'chopped'], ['2', 'tins', 'Chopped',
'Tomatoes'], ['1', 'tbsp', 'Tomato', 'Puree'], ['1', 'tbsp', 'Mixed',
'Italian', 'Herbs'], ['1', 'tbsp', 'Balsamic', 'Vinegar'], ['1',
'Red', 'Wine', 'Stock', 'Pot'], ['250', 'ml', 'Beef', 'Stock', 'make',
'using', '1-2', 'beef', 'stock', 'cubes'], ['dash', "Henderson's",
'Relish/Worcestershire', 'Sauce'], ['Low', 'Calorie', 'Cooking',
'Spray'], ['200', 'g', 'Dried', 'Pasta', 'use', 'whichever', 'shape',
'you', 'prefer'], ['80', 'g', 'Reduced', 'Fat', 'Cheddar', 'Cheese']]
How can I iterate through this list of lists, removing strings like 'finely', 'chopped' and 'grated', replacing 'tbsp' with 'grams', and then create another list similar to 'ingrediant_list' with none of the stuff I didn't want?

Firstly, you don't need to split the strings to remove unwanted words; you can use str.replace():
full_list = []
replace_rules = {
    'finely': '',
    'chopped': '',
    'grated': '',
    'tbsp': 'grams'
}
for s in ingrediant_list:
    for old, new in replace_rules.items():
        s = s.replace(old, new)
    full_list.append(s.rstrip())  # .rstrip() removes trailing spaces, if any
The code above works, but it only replaces lower-case words. Regular expressions solve that:
import re
full_list = []
replace_rules = {
    r'\s*(finely|chopped|grated)': '',
    r'(\s*)tbsp': r'\1grams'
}
for s in ingrediant_list:
    for old, new in replace_rules.items():
        # flags must be passed by keyword; the fourth positional argument is count
        s = re.sub(old, new, s, flags=re.IGNORECASE)
    full_list.append(s)
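One caveat with patterns like these: without word boundaries, `grated` would also match inside a longer word such as `integrated`. A minimal sketch, assuming a hypothetical helper `clean_ingredient` and made-up sample strings (neither is from the original answer), that compiles the rules once and anchors them with `\b`:

```python
import re

# Hypothetical names; the word lists mirror the rules from the answer.
REMOVE = re.compile(r'\s*\b(?:finely|chopped|grated)\b', re.IGNORECASE)
REPLACE = re.compile(r'\btbsp\b', re.IGNORECASE)


def clean_ingredient(text):
    """Drop preparation words and swap 'tbsp' for 'grams' in one ingredient string."""
    text = REMOVE.sub('', text)
    text = REPLACE.sub('grams', text)
    return text.strip()


# Illustrative sample data, not the scraped list.
sample = ['1 large Onion finely chopped', '1 tbsp Tomato Puree', '2 integrated units']
cleaned = [clean_ingredient(s) for s in sample]
print(cleaned)
```

Note that with case-insensitive matching, '2 tins Chopped Tomatoes' also loses 'Chopped'; dropping the flag may be preferable if capitalised words are part of the ingredient name.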
If, for some reason, you need to keep the split lists, you can use a nested loop:
replace_rules = {
    'finely': '',
    'chopped': '',
    'grated': '',
    'tbsp': 'grams'
}
result_list = []
for l in full_list:
    temp_list = []
    for w in l:
        if w.lower() in replace_rules:
            # an empty replacement means the word is dropped
            if replace_rules[w.lower()]:
                temp_list.append(replace_rules[w.lower()])
        else:
            temp_list.append(w)
    result_list.append(temp_list)
Or you can do the same using list comprehension:
filter_list = {'finely', 'chopped', 'grated'} # words to ignore
replace_rules = {'tbsp': 'grams'} # words to replace
result_list = [[replace_rules.get(w.lower(), w) for w in l if w.lower() not in filter_list] for l in full_list]
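The question also asks for a list of strings like the original ingrediant_list. A minimal sketch of that last step, assuming a hypothetical helper `clean_tokens` and a two-item sample of the scraped data, joins each cleaned token list back into one string:

```python
FILTER_WORDS = {'finely', 'chopped', 'grated'}   # words to drop
REPLACE_RULES = {'tbsp': 'grams'}                # words to swap


def clean_tokens(tokens):
    """Return a copy of one token list with unwanted words removed or replaced."""
    return [REPLACE_RULES.get(w.lower(), w) for w in tokens
            if w.lower() not in FILTER_WORDS]


# Two entries taken from the list in the question.
full_list = [['3', 'cloves', 'Garlic', 'finely', 'grated'],
             ['1', 'tbsp', 'Tomato', 'Puree']]
clean_strings = [' '.join(clean_tokens(tokens)) for tokens in full_list]
print(clean_strings)  # ['3 cloves Garlic', '1 grams Tomato Puree']
```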

newlist = [i for i in oldlist if unwanted_string not in i]
I'll expand with an example:
item_list = ["BigCar", "SmallCar", "BigHouse", "SmallHouse"]
unwanted_string = "Big"
[i for i in item_list if unwanted_string not in i]
Result:
['SmallCar', 'SmallHouse']

Related

filter list by condition using python

def myFunction(cond_list, input_list):
    res = []
    data = list(set(input_list))  # filter duplicate elements
    for i in cond_list:
        for j in data:
            if i in j:
                res.append(i)
    return res
cond = ['cat', 'rabbit']
input_list = ['', 'cat 88.96%', '.', 'I have a dog', '', 'rabbit 12.44%', '', 'I like tiger']
result = myFunction(cond_list=cond, input_list=input_list)
print(result)  # the input list contains: ['cat', 'rabbit']
I have this function. Is there a better way to write it for these conditions?

If I understand you correctly, is this what you are looking for?
cond = ['cat', 'rabbit'] # filter duplicate elements
input_list = ['', 'cat 88.96%', '.', 'dog 40.12%', '', 'rabbit 12.44%', '', 'tiger 85.44%']
[i for i in cond for j in input_list if i in j]
['cat', 'rabbit']
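One thing to watch with the nested comprehension: a condition is emitted once per matching item, so it repeats if several inputs contain it. A minimal sketch, assuming a hypothetical function name `matched_conditions` and slightly altered sample data, that uses any() to report each condition at most once:

```python
def matched_conditions(cond_list, input_list):
    """Return each condition that occurs as a substring of at least one input item."""
    return [c for c in cond_list if any(c in item for item in input_list)]


cond = ['cat', 'rabbit']
# Illustrative data: 'cat' appears in two items on purpose.
input_list = ['', 'cat 88.96%', '.', 'cat 1.00%', 'rabbit 12.44%', 'I like tiger']
print(matched_conditions(cond, input_list))  # ['cat', 'rabbit']
```

any() stops at the first match, so there is no need to deduplicate the input first.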

You can use itertools.product to generate the pairs for comparison:
>>> import itertools
>>> product = itertools.product(cond, input_list)
>>> [p for (p, q) in product if p in q]
['cat', 'rabbit']

cond = ['cat', 'rabbit']
input_list = ['', 'cat 88.96%', '.', 'dog 40.12%', '', 'rabbit 12.44%', '', 'tiger 85.44%']
# keep the input items that contain any of the conditions, without duplicates
matching = list(set(s for s in input_list if any(xs in s for xs in cond)))
for i in matching:
    print(i)

This is one approach using regex and a list comprehension.
Ex:
import re
def myFunction(cond_list, input_list):
    data = set(input_list)  # filter repeated elements
    # escape each condition so it is matched literally, then search for it in the inputs
    return [c for c in cond_list
            if any(re.search(re.escape(c), item) for item in data)]
cond = ['cat', 'rabbit'] # filter duplicate elements
input_list = ['', 'cat 88.96%', '.', 'dog 40.12%', '', 'rabbit 12.44%', '', 'tiger 85.44%']
result = myFunction(cond_list=cond, input_list=input_list)
print(result)
Output:
['cat', 'rabbit']

grouping a list into sublists, breaked by alphabet elements in python

I have a mixed list in Python: some elements are numeric and some are alphabetic.
For example: l = ['999','123','hello','222','333','444','bye']
I want to split this list into sublists, ending each sublist at an element that is all alphabetic:
['999','123','hello'], ['222','333','444','bye']
For ['hello', '123', 'test', 'test', '456', 'test', '789']
the output will be: ['hello'],['123','test'],['test'],['456','test'],['789']
Every element is either all alphabetic or all numeric.
What is the most Pythonic way to do so?

output = []
for i in l:
    if not output or output[-1][-1].isalpha():
        output.append([i])
    else:
        output[-1].append(i)
so that with:
l = ['999','123','hello','222','333','444','bye']
output would become:
[['999', '123', 'hello'], ['222', '333', '444', 'bye']]
or with:
l = ['hello', '123', 'test', 'test', '456', 'test', '789']
output would become:
[['hello'], ['123', 'test'], ['test'], ['456', 'test'], ['789']]
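The same idea can be written as a generator, which avoids peeking at the last element of the previous sublist. This is a sketch only; the function name `split_after_alpha` is an assumption and not part of the answer:

```python
def split_after_alpha(items):
    """Yield sublists, closing the current one after each all-alphabetic element."""
    group = []
    for item in items:
        group.append(item)
        if item.isalpha():
            yield group
            group = []
    if group:  # trailing elements with no closing alphabetic item
        yield group


first = list(split_after_alpha(['999', '123', 'hello', '222', '333', '444', 'bye']))
second = list(split_after_alpha(['hello', '123', 'test', 'test', '456', 'test', '789']))
print(first)
print(second)
```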

split existed list based on the repeated word

I am trying to split a list into a new list. Here's the initial list:
initList = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word',
            'title', 'PTE427', 'how', 'are', 'you']
I want to split the list at each PTExyz item, so that the new list looks like this:
newList = ['PTE123 I am programmer', 'PTE345 based word title', 'PTE427 how are you']
How should I develop a proper algorithm for the general case with a repeated PTExyz item?
Thank you!

The algorithm goes something like this.
Iterate over the list with a temp string initialized to an empty string. When a string s starts with PTE, append temp to the result list if it is not empty, then start a new temp with s. Otherwise, add s to temp, skipping empty strings and anything that comes before the first PTE string. After the loop, append the last temp to the result.
ls = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word', 'title', 'PTE427', 'how', 'are', 'you']
result = []
temp = ''
for s in ls:
    if s.startswith('PTE'):
        if temp != '':
            result.append(temp)
        temp = s
    else:
        # skip words before the first marker, and empty strings (they would add double spaces)
        if temp == '' or s == '':
            continue
        temp += ' ' + s
result.append(temp)
print(result)
Edit
To match the exact pattern PTExyz you can use a regular expression. In that case, import re and replace the condition s.startswith('PTE') with:
re.match(r'PTE\w{3}$', s)
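Put together, the regex variant could look like the sketch below. The function name `group_by_marker` and the list-of-lists bookkeeping are assumptions for illustration; only the pattern comes from the edit above:

```python
import re

MARKER = re.compile(r'PTE\w{3}$')  # 'PTE' followed by exactly three word characters


def group_by_marker(words, marker=MARKER):
    """Join the words following each marker into one string per marker."""
    groups = []
    for word in words:
        if marker.match(word):
            groups.append([word])      # a marker starts a new group
        elif groups and word:          # skip leading words and empty strings
            groups[-1].append(word)
    return [' '.join(g) for g in groups]


init_list = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word',
             'title', 'PTE427', 'how', 'are', 'you']
new_list = group_by_marker(init_list)
print(new_list)
```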

I think this will work:
l = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word', 'title', 'PTE427', 'how', 'are', 'you']
s = ' '.join(w for w in l if w)  # drop empty strings before joining
# split on 'PTE', skip the empty piece before the first marker, and restore the prefix
resultlist = ['PTE' + part.strip() for part in s.split('PTE') if part]
print(resultlist)

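This version matches the PTExyz markers with a regular expression:
import re
l = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word',
     'title', 'PTE427', 'how', 'are', 'you']
pattern = re.compile(r'PTE\d\d\d')
# collect the markers in order of appearance
k = [i for i in l if pattern.match(i) is not None]
s = ' '.join(w for w in l if w)  # drop empty strings before joining
# split on the markers; the piece before the first marker is empty
parts = re.split(pattern, s)
parts.remove('')
result = [k[i] + ' ' + parts[i].strip() for i in range(len(k))]
print(result)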

>>> words = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word', 'title', 'PTE427', 'how', 'are', 'you']
>>> index_list = [i for i, item in enumerate(words) if "PTE" in item]
>>> index_list.append(len(words))
>>> index_list
[0, 5, 9, 13]
>>> [' '.join(filter(None, words[a:b])) for a, b in zip(index_list, index_list[1:])]
Output
['PTE123 I am programmer', 'PTE345 based word title', 'PTE427 how are you']

Grouping two lists in python

I have two lists which I want to group on the basis of the first element of the lists.
list1 = [['1','abc','zef'],['2','qwerty','opo'],['3','lol','pop']]
list2 = [['1','rofl','pole'],['2','sole','pop'],['3','lmao','wtf']]
Here the first elements in the list inside the list are '1' , '2' and '3'.
I want my final list to be like :-
Final_List = [['1', 'abc', 'zef', 'rofl', 'pole'], ['3', 'lol', 'pop', 'lmao', 'wtf'], ['2', 'qwerty', 'opo', 'sole', 'pop']]
I have tried this using the code below.
#!/usr/bin/python
list1 = [['1','abc','zef'],['2','qwerty','opo'],['3','lol','pop']]
list2 = [['1','rofl','pole'],['2','sole','pop'],['3','lmao','wtf']]
d = {}
for i in list1:
    d[i[0]] = i[1:]
for i in list2:
    d[i[0]].extend(i[1:])
Final_List = []
for key, value in d.iteritems():
    value.insert(0,key)
    Final_List.append(value)
This code works, but I was wondering if there was an easier and cleaner way to do it.
Any help?

I would have written it much as you did, with a small modification.
Prepare a dictionary that gathers, for each first element, all the elements from the second position onwards:
d = {}
for items in (list1, list2):
    for item in items:
        d.setdefault(item[0], [item[0]]).extend(item[1:])
And then just get all the values from the dictionary (thanks @jamylak) :-)
print(d.values())
Output
[['3', 'lol', 'pop', 'lmao', 'wtf'],
['1', 'abc', 'zef', 'rofl', 'pole'],
['2', 'qwerty', 'opo', 'sole', 'pop']]
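The setdefault approach generalises to any number of input lists. A minimal sketch, assuming a hypothetical wrapper named `merge_by_key`; on Python 3.7+ the rows come back in first-seen key order because dictionaries preserve insertion order:

```python
def merge_by_key(*lists):
    """Merge rows from several lists of lists that share the same first element."""
    merged = {}
    for rows in lists:
        for row in rows:
            # Start each entry with its key, then append the remaining fields.
            merged.setdefault(row[0], [row[0]]).extend(row[1:])
    return list(merged.values())


list1 = [['1', 'abc', 'zef'], ['2', 'qwerty', 'opo'], ['3', 'lol', 'pop']]
list2 = [['1', 'rofl', 'pole'], ['2', 'sole', 'pop'], ['3', 'lmao', 'wtf']]
final_list = merge_by_key(list1, list2)
print(final_list)
```

Because rows are matched by key rather than by position, list2 may be in a different order than list1.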

If item sequence in the lists inside of the Final_List is not important then this can be used,
[list(set(sum(itm, []))) for itm in zip(list1, list2)]

Your code seems correct. Just modify the following portion:
Final_List = []
for key in d:
    L = [key] + [x for x in d[key]]
    Final_List.append(L)

Yes, with list comprehension and enumerate
list1 = [['1','abc','zef'],['2','qwerty','opo'],['3','lol','pop']]
list2 = [['1','rofl','pole'],['2','sole','pop'],['3','lmao','wtf']]
print [set(v + list2[k]) for k,v in enumerate(list1)]
[['1', 'abc', 'zef', 'rofl', 'pole'], ['2', 'qwerty', 'opo', 'sole', 'pop'], ['3', 'lol', 'pop', 'lmao', 'wtf']]
EDIT
With index relation
list1 = [['1','abc','zef'],['2','qwerty','opo'],['3','lol','pop']]
list2 = [['1','rofl','pole'],['3','lmao','wtf'],['2','sole','pop']]
d1 = {a[0]:a for a in list1}
d2 = {a[0]:a for a in list2}
print [set(v + d2[k]) for k, v in d1.items()]

Using defaultdict and a list comprehension you can shorten your code:
from collections import defaultdict
list1 = [['1','abc','zef'],['2','qwerty','opo'],['3','lol','pop']]
list2 = [['1','rofl','pole'],['2','sole','pop'],['3','lmao','wtf']]
d = defaultdict(list)
for i in list1 + list2:
    d[i[0]].extend(i[1:])
# items() works on both Python 2 and Python 3
Final_List = [[key] + value for key, value in d.items()]
print(Final_List)

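list3 = []
# iterate over the rows, not over the fields of the first row
for i in range(min(len(list1), len(list2))):
    list3.append(list(list1[i]))
    list3[i].extend(x for x in list2[i] if x not in list3[i])
With range, you iterate through the lists only once. This assumes both lists hold the same keys in the same order.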

A bit of functional style:
import operator, itertools
from pprint import pprint
one = [['1','abc','zef'],['2','qwerty','opo'],['3','lol','pop']]
two = [['1','rofl','pole'],['2','sole','pop'],['3','lmao','wtf']]
A few helpers:
zero = operator.itemgetter(0)
all_but_the_first = operator.itemgetter(slice(1, None))
data = (one, two)
def foo(group):
    # group is (key, iterator) from itertools.groupby
    key = group[0]
    lists = group[1]
    result = [key]  # list(key) would split a multi-character key into characters
    for item in lists:
        result.extend(all_but_the_first(item))
    return result
Function to process the data:
def process(data, func = foo):
    # concatenate all the sublists
    new = itertools.chain(*data)
    # group by item zero
    three = sorted(new, key = zero)
    groups = itertools.groupby(three, zero)
    # apply func to each group to build the new lists
    return map(func, groups)
Usage
>>> pprint(list(process(data)))
[['1', 'abc', 'zef', 'rofl', 'pole'],
 ['2', 'qwerty', 'opo', 'sole', 'pop'],
 ['3', 'lol', 'pop', 'lmao', 'wtf']]
>>>
>>> for thing in process(data):
...     print(thing)
...
['1', 'abc', 'zef', 'rofl', 'pole']
['2', 'qwerty', 'opo', 'sole', 'pop']
['3', 'lol', 'pop', 'lmao', 'wtf']
>>>

list1 = [['1','abc','zef'],['2','qwerty','opo'],['3','lol','pop']]
list2 = [['1','rofl','pole'],['2','sole','pop'],['3','lmao','wtf']]
Final_List = []
for i in range(0, len(list1)):
    Final_List.append(list1[i] + list2[i])
    del Final_List[i][3]  # drop the duplicated key that came from list2
print(Final_List)
Output
[['1', 'abc', 'zef', 'rofl', 'pole'], ['2', 'qwerty', 'opo', 'sole', 'pop'], ['3', 'lol', 'pop', 'lmao', 'wtf']]

Splitting a list using indices

I'm struggling to cut a list into pieces at certain indices. Although I'm able to do it one piece at a time, I haven't arrived at an expression that will allow me to skip doing it piecewise.
import re
# Creating list to split
list = ['Leading', 'text', 'of', 'no', 'interest', '1.', 'Here', 'begins', 'section', '1', '2.', 'This', 'is', 'section', '2', '3.', 'Now', 'we', 'have', 'section', '3']
# Identifying where sections begin and end
section_ids = [i for i, item in enumerate(list) if re.search(r'[0-9]+\.(?![0-9])', item)]
# Simple creation of a new list for each section, piece by piece
section1 = list[section_ids[0]:section_ids[1]]
section2 = list[section_ids[1]:section_ids[2]]
section3 = list[section_ids[2]:]
# Iterative creation of a new list for each claim - DOES NOT WORK
for i in range(len(section_ids)):
    if i < max(range(len(section_ids))):
        section[i] = list[section_ids[i] : list[section_ids[i + 1]]
    else:
        section[i] = list[section_ids[i] : ]
    print section[i]
# This is what I'd like to get
# ['1.', 'Here', 'begins', 'section', '1']
# ['2.', 'This', 'is', 'section', '2']
# ['3.', 'Now', 'we', 'have', 'section', '3']

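# my_list is the list from the question, renamed so it does not shadow the built-in list
for i, j in map(None, section_ids, section_ids[1:]):
    print(my_list[i:j])
The itertools version is more efficient if section_ids is large (both versions are Python 2; Python 3 renamed izip_longest to zip_longest):
from itertools import izip_longest, islice
for i, j in izip_longest(section_ids, islice(section_ids, 1, None)):
    print(my_list[i:j])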
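On Python 3 the same pairing can be sketched with itertools.zip_longest. The shortened word list and the hard-coded `section_ids` below are illustrative assumptions, not the data from the question:

```python
from itertools import zip_longest

# Hypothetical, shortened sample data.
words = ['Leading', 'text', '1.', 'Here', 'begins', '2.', 'This', 'is', '3.', 'Now']
section_ids = [2, 5, 8]  # assumed start index of each section

# Pair each start index with the next one; the last pair ends with None,
# and a slice ending in None runs to the end of the list.
sections = [words[start:end]
            for start, end in zip_longest(section_ids, section_ids[1:])]
print(sections)
```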

I was able to produce the desired output with the following code:
section = []
for i, v in enumerate(section_ids + [len(list)]):
    if i == 0:
        continue
    section.append(list[section_ids[i-1]:v])

Are you trying to achieve something like this?
>>> section = [] # list to hold sublists ....
>>> for index, location in enumerate(section_ids):
...     if location != section_ids[-1]: # assume it's not the last one
...         section.append(list[location:section_ids[index + 1]])
...     else:
...         section.append(list[location:])
...     print(section[-1])
...
['1.', 'Here', 'begins', 'section', '1']
['2.', 'This', 'is', 'section', '2']
['3.', 'Now', 'we', 'have', 'section', '3']
>>>
or:
>>> import re
>>> from pprint import pprint
>>> values = ['Leading', 'text', 'of', 'no', 'interest', '1.', 'Here', 'begins', 'section', '1', '2.', 'This', 'is', 'section', '2', '3.', 'Now', 'we', 'have', 'section', '3']
>>> section_ids = [i for i, item in enumerate(values) if re.search(r'[0-9]+\.(?![0-9])', item)] + [len(values)]
>>> section = [values[location:section_ids[index + 1]] for index, location in enumerate(section_ids) if location != section_ids[-1]]
>>> pprint(section)
[['1.', 'Here', 'begins', 'section', '1'],
['2.', 'This', 'is', 'section', '2'],
['3.', 'Now', 'we', 'have', 'section', '3']]
