Split list for punctuation - python

I have a given title
I want to split the items of a list on whitespace and punctuation, so that no word in the resulting list contains any whitespace or punctuation character.
Ex: the word "Joe's" gets split into "Joe" and "s"
'ad sf' gets split into 'ad' and 'sf'
Starting:
['Toms', 'ad sf', "Joe's"]
Ending:
['Toms', 'ad', 'sf' , 'Joe', 's']
I have tried regex, split, but there's not an easy and concise way. Can anyone think of a better way?

Split each item and join the pieces:
import re
from itertools import chain
mylist = ['Toms', 'ad sf', "Joe's"]
list(chain(*[re.split(r"\W+", item) for item in mylist]))
#['Toms', 'ad', 'sf', 'Joe', 's']
Here's a "clean" functional solution:
list(chain(*map(lambda item: re.split(r"\W+", item), mylist)))
#['Toms', 'ad', 'sf', 'Joe', 's']

You can use re.split:
import re
s = ['Toms', 'ad sf', "Joe's"]
final_result = [j for i in s for j in re.split(r'\W+', i)]
Output:
['Toms', 'ad', 'sf', 'Joe', 's']

There is no builtin way to achieve what you want, but here is the most concise way I could think of using map.
import re
words = ['Toms', 'ad sf', "Joe's"]
sum(map(re.compile(r'\W+').split, words), [])
# Output: ['Toms', 'ad', 'sf', 'Joe', 's']
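One caveat with the re.split answers above: when an item starts or ends with a non-word character, re.split(r'\W+', ...) returns empty strings at the edges. Here is a minimal sketch of one way around that, matching the words with re.findall instead of splitting on the separators. The sample list `items` is made up for illustration and is not from the question.

```python
import re

# Hypothetical sample data: note the surrounding quotes and the trailing "!".
items = ['Toms', 'ad sf', "Joe's", '"quoted"', 'wow!']

# re.split leaves empty strings where an item begins or ends with a separator.
with_split = [w for item in items for w in re.split(r'\W+', item)]

# re.findall returns only the runs of word characters, so nothing needs filtering.
with_findall = [w for item in items for w in re.findall(r'\w+', item)]

print(with_split)
print(with_findall)
```

Keep in mind that \w also matches digits and the underscore, so a string such as 'a_b' stays in one piece with either approach.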

Related

Iterate through list of lists and remove unwanted strings

I'm having a play about and I've scraped an ingredient list from a website.
I now have a list of lists.
ingrediant_list = []
for ingrediant in soup.select('.wprm-recipe-ingredient'):
    ingrediant_list.append(ingrediant.text)
full_list = []
for item in ingrediant_list:
    full_list.append(item.split())
This is my code that generates the list of lists. First I get the ingredients from the website and put them into ingrediant_list; then I split each string into a separate list, generating a list of lists under full_list.
My list is as follows:
[['400', 'g', '5%', 'Fat', 'Minced', 'Beef'], ['1', 'large', 'Onion',
'finely', 'chopped'], ['3', 'cloves', 'Garlic', 'finely', 'grated'],
['5', 'Mushrooms', 'sliced'], ['1', 'large', 'Carrot', 'finely',
'chopped'], ['1', 'stick', 'Celery', 'finely', 'chopped'], ['1',
'Red', 'Pepper', 'finely', 'chopped'], ['2', 'tins', 'Chopped',
'Tomatoes'], ['1', 'tbsp', 'Tomato', 'Puree'], ['1', 'tbsp', 'Mixed',
'Italian', 'Herbs'], ['1', 'tbsp', 'Balsamic', 'Vinegar'], ['1',
'Red', 'Wine', 'Stock', 'Pot'], ['250', 'ml', 'Beef', 'Stock', 'make',
'using', '1-2', 'beef', 'stock', 'cubes'], ['dash', "Henderson's",
'Relish/Worcestershire', 'Sauce'], ['Low', 'Calorie', 'Cooking',
'Spray'], ['200', 'g', 'Dried', 'Pasta', 'use', 'whichever', 'shape',
'you', 'prefer'], ['80', 'g', 'Reduced', 'Fat', 'Cheddar', 'Cheese']]
How can I iterate through this list of lists, removing strings like 'finely', 'chopped' and 'grated', replacing 'tbsp' with 'grams', and then create another list similar to ingrediant_list with none of the stuff I didn't want?
Firstly, it's not necessary to split the strings in order to replace unwanted words; you can use str.replace():
full_list = []
replace_rules = {
    'finely': '',
    'chopped': '',
    'grated': '',
    'tbsp': 'grams'
}
for s in ingrediant_list:
    for old, new in replace_rules.items():
        s = s.replace(old, new)
    full_list.append(s.rstrip())  # .rstrip() removes trailing spaces if they exist
The code above works, but it only replaces words written in lower case. We can use regular expressions to make the match case-insensitive:
import re
full_list = []
replace_rules = {
    r'\s*(finely|chopped|grated)': '',
    r'(\s*)tbsp': r'\1grams'
}
for s in ingrediant_list:
    for old, new in replace_rules.items():
        # flags must be passed by keyword; the fourth positional argument is count
        s = re.sub(old, new, s, flags=re.IGNORECASE)
    full_list.append(s)
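A possible refinement, sketched under a few assumptions: adding \b word boundaries keeps the patterns from matching inside longer words, and compiling the patterns once with re.IGNORECASE avoids passing flags on every call. The `ingredients` list and the names `unwanted` and `tbsp` are hypothetical stand-ins, not part of the original code.

```python
import re

# Hypothetical sample data standing in for the scraped ingredient strings.
ingredients = [
    '1 large Onion finely chopped',
    '3 cloves Garlic Finely Grated',
    '1 tbsp Tomato Puree',
]

# \b restricts each match to whole words; the flag is compiled into the pattern.
unwanted = re.compile(r'\s*\b(?:finely|chopped|grated)\b', re.IGNORECASE)
tbsp = re.compile(r'\btbsp\b', re.IGNORECASE)

# Drop the unwanted words first, then swap the unit, then trim stray spaces.
cleaned = [tbsp.sub('grams', unwanted.sub('', s)).strip() for s in ingredients]
print(cleaned)
# ['1 large Onion', '3 cloves Garlic', '1 grams Tomato Puree']
```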
If, for some reason, you need to keep the sentences split into words, you can use a nested loop:
replace_rules = {
    'finely': '',
    'chopped': '',
    'grated': '',
    'tbsp': 'grams'
}
result_list = []
for l in full_list:
    temp_list = []
    for w in l:
        if w.lower() in replace_rules:
            # an empty replacement means the word is dropped
            if replace_rules[w.lower()]:
                temp_list.append(replace_rules[w.lower()])
        else:
            temp_list.append(w)
    result_list.append(temp_list)
Or you can do the same using list comprehension:
filter_list = {'finely', 'chopped', 'grated'} # words to ignore
replace_rules = {'tbsp': 'grams'} # words to replace
result_list = [[replace_rules.get(w.lower(), w) for w in l if w.lower() not in filter_list] for l in full_list]
newlist = [i for i in oldlist if unwanted_string not in i]
I'll expand with an example
item_list = ["BigCar", "SmallCar", "BigHouse", "SmallHouse"]
unwanted_string = "Big"
[i for i in item_list if not unwanted_string in i]
Result:
['SmallCar', 'SmallHouse']

How to find all words in a string that begin with an uppercase letter, for multiple strings in a list

I have a list of strings; each string is about 10 sentences. I am hoping to find all words in each string that begin with a capital letter, preferably only those after the first word in the sentence. I am using re.findall to do this. When I manually set the string I have no trouble doing this, but when I try to use a for loop over each entry in my list I get a different output.
for i in list_3:
    string = i
    test = re.findall(r"(\b[A-Z][a-z]*\b)", string)
print(test)
output:
['I', 'I', 'As', 'I', 'University', 'Illinois', 'It', 'To', 'It', 'I', 'One', 'Manu', 'I', 'I', 'Once', 'And', 'Through', 'I', 'I', 'Most', 'Its', 'The', 'I', 'That', 'I', 'I', 'I', 'I', 'I', 'I']
When I manually input the string value
txt = 0
for i in list_3:
    string = list_3[txt]
    test = re.findall(r"(\b[A-Z][a-z]*\b)", string)
print(test)
output:
['Remember', 'The', 'Common', 'App', 'Do', 'Your', 'Often', 'We', 'Monica', 'Lannom', 'Co', 'Founder', 'Campus', 'Ventures', 'One', 'Break', 'Campus', 'Ventures', 'Universities', 'Undermatching', 'Stanford', 'Yale', 'Undermatching', 'What', 'A', 'Yale', 'Lannom', 'There', 'During', 'Some', 'The', 'Lannom', 'That', 'It', 'Lannom', 'Institutions', 'University', 'Chicago', 'Boston', 'College', 'These', 'Students', 'If', 'Lannom', 'Recruiting', 'Elite', 'Campus', 'Ventures', 'Understanding', 'Campus', 'Ventures', 'The', 'For', 'Lannom', 'What', 'I', 'Wish', 'I', 'Knew', 'Before', 'Starting', 'Company', 'I', 'Even', 'I', 'Lannom', 'The', 'There']
But I can't seem to write a for loop that correctly prints the output for each of the 5 items in the list. Any ideas?
The easiest way to do that is to write a for loop that checks whether the first letter of each element of the list is capitalized. If it is, the element is appended to the output list.
output = []
for i in list_3:
    if i[0] == i[0].upper():
        output.append(i)
print(output)
We can also use a list comprehension to do the same in one line. Again, we check whether the first letter of each element is a capital letter.
output = [x for x in list_3 if x[0].upper() == x[0]]
print(output)
EDIT
You want to treat each multi-sentence string as one element of the list, so here is the solution. We iterate over list_3, then iterate over every word using the split() function. We then check whether the word is capitalized. If it is, it is added to the output.
list_3 = ["Remember your college application process? The tedious Common App applications, hours upon hours of research, ACT/SAT, FAFSA, visiting schools, etc. Do you remember who helped you through this process? Your family and guidance counselors perhaps, maybe your peers or you may have received little to no help"]
output = []
for i in list_3:
    for j in i.split():
        if j[0].isupper():
            output.append(j)
print(output)
Assuming sentences are separated by one space, you could use re.findall with the following regular expression.
r'(?m)(?<!^)(?<![.?!] )[A-Z][A-Za-z]*'
Python's regex engine performs the following operations.
(?m) : set multiline mode so that ^ and $ match the beginning
and the end of a line
(?<!^) : negative lookbehind asserts current location is not
at the beginning of a line
(?<![.?!] ) : negative lookbehind asserts current location is not
preceded by '.', '?' or '!', followed by a space
[A-Z] : match an uppercase letter
[A-Za-z]* : match 0+ letters
If sentences can be separated by one or two spaces, insert the negative lookbehind (?<![.?!]  ), written with two spaces before the closing parenthesis, after (?<![.?!] ).
If the PyPI regex module were used, one could use the variable-length lookbehind (?<![.?!] +)
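To see the single-space version of this pattern in action, here is a small sketch using the standard re module. The sample `text` is invented for illustration; only the pattern comes from the answer.

```python
import re

# Pattern from the answer: skip capitalized words at the start of a line
# and those that follow '.', '?' or '!' plus exactly one space.
pattern = re.compile(r'(?m)(?<!^)(?<![.?!] )[A-Z][A-Za-z]*')

# Hypothetical sample text with three sentences separated by single spaces.
text = "Remember the Common App? Your family helped. Then Monica Lannom called."

words = pattern.findall(text)
print(words)
# ['Common', 'App', 'Monica', 'Lannom']
```

Because the pattern has no word boundary before [A-Z], it would also pick up a capital letter in the middle of a word (the "Donald" in "McDonald", for example), which may or may not be what you want.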
As I understand it, you have a list like this:
list_3 = [
'First sentence. Another Sentence',
'And yet one another. Sentence',
]
You are iterating over the list, but every iteration overwrites the test variable, so you end up with only the last result. You either have to accumulate the results in an additional variable or print them right away on every iteration:
acc = []
for item in list_3:
    acc.extend(re.findall(regexp, item))
print(acc)
or
for item in list_3:
    print(re.findall(regexp, item))
As for a regexp that ignores the first word in a sentence, you can use
re.findall(r'(?<!\A)(?<!\.)\s+[A-Z]\w+', s)
(?<!\A) - not the beginning of the string
(?<!\.) - not the first word after dot
\s+ - one or more whitespace characters before the word.
You'll receive words prefixed by whitespace, so here's the final example:
acc = []
for item in list_3:
    words = [w.strip() for w in re.findall(r'(?<!\A)(?<!\.)\s+[A-Z]\w+', item)]
    acc.extend(words)
print(acc)
As I really like regexes, try this one:
#!/bin/python3
import re
PATTERN = re.compile(r'[A-Z][A-Za-z0-9]*')
all_sentences = [
"My House! is small",
"Does Annie like Cats???"
]
def flat_list(sentences):
    for sentence in sentences:
        yield from PATTERN.findall(sentence)
upper_words = list(flat_list(all_sentences))
print(upper_words)
# Result: ['My', 'House', 'Does', 'Annie', 'Cats']

How to convert a list of phrases into list of words?

I want to convert a list which contains both phrases and words to a list which contains only words. For example, if the input is:
list_of_phrases_and_words = ['I am', 'John', 'michael and', 'I am', '16',
'years', 'old']
The expected output is:
list_of_words = ['I', 'am', 'John', 'michael', 'and', 'I', 'am', '16', 'years', 'old']
What is an efficient way to achieve this in Python?
You can use a list comprehension:
list_of_words = [
    word
    for phrase in list_of_phrases_and_words
    for word in phrase.split()
]
An alternative that might be slightly less efficient for larger lists would be to first create one large string containing everything and then split it:
list_of_words = " ".join(list_of_phrases_and_words).split()
The trick is a nested for loop, whereby you split on the space character " ".
words = [word for phrase in list_of_phrases_and_words for word in phrase.split(" ")]
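The two list comprehensions above differ in one detail worth knowing: split() with no argument splits on any run of whitespace, while split(" ") splits on every single space. A short sketch, using a made-up `phrases` list that contains a double space:

```python
# Hypothetical input; the last phrase has two spaces between the words.
phrases = ['I am', 'John', 'michael  and']

# split(" ") treats each space as a separator, so a double space yields ''.
with_space = [word for phrase in phrases for word in phrase.split(" ")]

# split() collapses runs of whitespace and never returns empty strings.
with_default = [word for phrase in phrases for word in phrase.split()]

print(with_space)    # ['I', 'am', 'John', 'michael', '', 'and']
print(with_default)  # ['I', 'am', 'John', 'michael', 'and']
```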

Python split by multiple separators, including space?

Input:
Some Text here: Java, PHP, JS, HTML 5, CSS, Web, C#, SQL, databases, AJAX, etc.
Code:
import re
input_words = list(re.split('\s+', input()))
print(input_words)
Works perfect and returns me:
['Some', 'Text', 'here:', 'Java,', 'PHP,', 'JS,', 'HTML', '5,', 'CSS,', 'Web,', 'C#,', 'SQL,', 'databases,', 'AJAX,', 'etc.']
But when I add some other separators, like this:
import re
input_words = list(re.split('\s+ , ; : . ! ( ) " \' \ / [ ] ', input()))
print(input_words)
It doesn't split on spaces anymore. Where am I going wrong?
Expected output would be:
['Some', 'Text', 'here', 'Java', 'PHP', 'JS', 'HTML', '5', 'CSS', 'Web', 'C#', 'SQL', 'databases', 'AJAX', 'etc']
You should be splitting on a regex character class containing all those symbols:
input_words = re.split(r'[\s,;:.!()"\'\\\[\]]', input())
print(input_words)
This is a literal answer to your question. The solution you might actually want is to split on runs of those symbols together with any surrounding whitespace. Avoid a pattern that can match the empty string here, because from Python 3.7 on re.split also splits on empty matches. For example:
text = "A B ; C.D ! E[F] G"
input_words = re.split(r'[\s,;:.!()"\'\\\[\]]+', text)
print(input_words)
Prints:
['A', 'B', 'C', 'D', 'E', 'F', 'G']
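Applied to the input from the question, a sketch along the same lines might look like this. It assumes the separators listed in the question (including "/"), splits on runs of them, and filters out the empty string left behind by the trailing period. The names `text` and `separators` are illustrative.

```python
import re

text = ("Some Text here: Java, PHP, JS, HTML 5, CSS, Web, C#, SQL, "
        "databases, AJAX, etc.")

# One or more whitespace or punctuation characters count as a single separator.
separators = re.compile(r'[\s,;:.!()"\'\\/\[\]]+')

# The final "." produces a trailing empty string, so drop empty pieces.
words = [w for w in separators.split(text) if w]
print(words)
```

Since "#" is not in the character class, "C#" stays intact, which matches the expected output in the question.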
Write the separator characters inside square brackets, with the + quantifier after the closing bracket, as shown below. Hope it helps.
import re
input_words = re.split(r'[\s,:.!()]+', input())
Word tokenization using nltk module
#!/usr/bin/python3
import nltk
sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
words = nltk.tokenize.word_tokenize(sentence)
print(words)
output:
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', '...',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

Nested List Iteration

I was attempting some preprocessing on a nested list before attempting a small word2vec, and encountered an issue as follows:
corpus = ['he is a brave king', 'she is a kind queen', 'he is a young boy', 'she is a gentle girl']
corpus = [_.split(' ') for _ in corpus]
[['he', 'is', 'a', 'brave', 'king'], ['she', 'is', 'a', 'kind', 'queen'], ['he', 'is', 'a', 'young', 'boy'], ['she', 'is', 'a', 'gentle', 'girl']]
So the output above was given as a nested list & I intended to remove the stopwords e.g. 'is', 'a'.
for _ in range(0, len(corpus)):
    for x in corpus[_]:
        if x == 'is' or x == 'a':
            corpus[_].remove(x)
[['he', 'a', 'brave', 'king'], ['she', 'a', 'kind', 'queen'], ['he', 'a', 'young', 'boy'], ['she', 'a', 'gentle', 'girl']]
The output seems to indicate that the loop skipped to the next sub-list after removing 'is' in each sub-list, instead of iterating over it entirely.
What is the reasoning behind this? Index? If so, how to resolve assuming I'd like to retain the nested structure.
All your code is correct except for one minor change: iterate over a copy of each sub-list, made with [:], while removing items from the original. You create a copy of a list as lst_copy = lst[:]; this is one of several ways to copy a list. When you iterate over the original list and remove items from it at the same time, the remaining items shift one position to the left while the loop's internal index keeps advancing, so the item right after each removed one is skipped. That is why 'a', which directly follows 'is', survives.
for _ in range(0, len(corpus)):
    for x in corpus[_][:]:  # <--- create a copy of the list using [:]
        if x == 'is' or x == 'a':
            corpus[_].remove(x)
OUTPUT
[['he', 'brave', 'king'],
['she', 'kind', 'queen'],
['he', 'young', 'boy'],
['she', 'gentle', 'girl']]
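Another way to sidestep the problem, sketched here as an alternative rather than as a fix to the loop above, is to build new sub-lists instead of mutating the existing ones. The nested structure is retained; the `stopwords` set is an assumed name.

```python
corpus = [
    ['he', 'is', 'a', 'brave', 'king'],
    ['she', 'is', 'a', 'kind', 'queen'],
    ['he', 'is', 'a', 'young', 'boy'],
    ['she', 'is', 'a', 'gentle', 'girl'],
]

# A set gives fast membership tests; extend it with whatever words you need.
stopwords = {'is', 'a'}

# Nothing is removed while iterating, so no element can be skipped.
filtered = [[word for word in sentence if word not in stopwords]
            for sentence in corpus]
print(filtered)
```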
Maybe you can define a custom function that rejects elements matching a certain condition, similar to the helpers in itertools (for example, itertools.dropwhile).
def reject_if(predicate, iterable):
    for element in iterable:
        if not predicate(element):
            yield element
Once you have the function in place, you can use it this way:
stopwords = ['is', 'and', 'a']
[ list(reject_if(lambda x: x in stopwords, ary)) for ary in corpus ]
#=> [['he', 'brave', 'king'], ['she', 'kind', 'queen'], ['he', 'young', 'boy'], ['she', 'gentle', 'girl']]
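The standard library already ships a function with the same behaviour as reject_if: itertools.filterfalse yields the elements for which the predicate is false. A minimal sketch, assuming the stop words are kept in a set named `stopwords`:

```python
from itertools import filterfalse

corpus = [
    ['he', 'is', 'a', 'brave', 'king'],
    ['she', 'is', 'a', 'kind', 'queen'],
]
stopwords = {'is', 'and', 'a'}

# filterfalse keeps every word for which "word in stopwords" is False.
filtered = [list(filterfalse(stopwords.__contains__, sentence))
            for sentence in corpus]
print(filtered)
# [['he', 'brave', 'king'], ['she', 'kind', 'queen']]
```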
nested = [input()]
nested = [i.split() for i in nested]
