Python splitting text with line breaks into a list

I'm trying to convert some text into a list. The text contains special characters, numbers, and line breaks. Ultimately I want to have a list with each word as an item in the list without any special characters, numbers, or spaces.
Excerpt from the text:
I have no ambition to lose my life on the post-road between St. Petersburgh and Archangel. &lt;the&lt I
Currently I'm using this line to split each word into an item in the list:
text_list = [re.sub(r"[^a-zA-Z0-9]+", ' ', k)
             for k in content.split(" ")]
print(text_list)
This code leaves spaces in and combines words within items of the list, like below.
Result:
['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the',
'post road', 'between St ', 'Petersburgh', 'and', 'Archangel ', ' lt the lt I']
I would like to split the words into individual items of the list and remove the string ' lt ' and numbers from my list items.
Expected result:
['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the',
'post', 'road', 'between', 'St', 'Petersburgh', 'and', 'Archangel', 'the', 'I']
Please help me resolve this issue.
Thanks

Since it looks like you're parsing html text, it's likely all entities are enclosed in & and ;. Removing those makes matching the rest quite easy.
import re
content = 'I have no ambition to lose my life on the post-road between St. Petersburgh and Archangel. <the< I'
# first, remove entities, the question mark makes sure the expression isn't too greedy
content = re.sub(r'&[^ ]+?;', '', content)
# then just match anything that meets your rules
text_list = re.findall(r"[a-zA-Z0-9]+", content)
print(text_list)
Note that 'St Petersburg' likely got matched together because the character between the 't' and the 'P' probably isn't a space, but a non-breaking space. If this were just HTML, I'd expect there to be an &nbsp; or something of the sort, but it's possible that in your case there's some UTF non-breaking space character there.
That should not matter with the code above, but if you use a solution using .split(), it likely won't see that character as a space.
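If you do want to keep a .split()-based approach, one workaround is to normalize the non-breaking spaces first; a minimal sketch, assuming the invisible character is U+00A0:
s = 'between\xa0St. Petersburgh'
print(s.split(' '))                       # ['between\xa0St.', 'Petersburgh'] -- the NBSP survives
print(s.replace('\xa0', ' ').split(' '))  # ['between', 'St.', 'Petersburgh']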
In case the &lt is not your mistake, but in the original, this works as a replacement for the .sub() statement:
content = re.sub(r'&[^ ;]+?(?=[ ;]);?', '', content)
Clearly a bit more complicated: it substitutes any string that starts with & [&], followed by one or more characters that are not a space or ;, taking as little as possible [[^ ;]+?], but only if they are then followed by a space or a ; [(?=[ ;])], and in that case that ; is also matched [;?].
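For instance, on a made-up sample where one entity keeps its semicolon and one doesn't, the two substitutions behave differently:
import re

sample = 'Archangel. &lt;the&lt I'
print(re.sub(r'&[^ ]+?;', '', sample))            # 'Archangel. the&lt I' -- misses the entity without ';'
print(re.sub(r'&[^ ;]+?(?=[ ;]);?', '', sample))  # 'Archangel. the I'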

Here is what can be done. You just need to replace any known entity codes in advance.
import re
# define the special entity codes that we want to remove
special_syntax = r"&(lt|nbsp|gt|amp|quot|apos|cent|pound|yen|euro|copy|reg)[; ]"
# remove the entity codes first, then split and substitute the remaining special characters
text_list = [re.sub(r"[^a-zA-Z0-9]+", ' ', k).strip()
             for k in re.sub(special_syntax, ' ', content).split(" ")]
# remove empty strings from the list
filter_object = filter(lambda x: x != "", text_list)
print(list(filter_object))
Output
['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the',
'post road', 'between', 'St', 'Petersburgh', 'and', 'Archangel', 'the', 'I']
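As an aside, if the stray codes really are HTML entities, the standard library can decode them before tokenizing, which avoids hand-maintaining an entity list; a minimal sketch (my own, with a made-up sample string):
import html
import re

content = 'post-road between St.\xa0Petersburgh and Archangel. &lt;the&lt; I'
decoded = html.unescape(content)          # turns '&lt;' into '<', '&amp;' into '&', etc.
print(re.findall(r'[a-zA-Z]+', decoded))
# ['post', 'road', 'between', 'St', 'Petersburgh', 'and', 'Archangel', 'the', 'I']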

Python connect composed keywords in texts

So, I have a list of lowercase keywords. Let's say
keywords = ['machine learning', 'data science', 'artificial intelligence']
and a list of texts in lowercase. Let's say
texts = [
'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking',
'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
]
I need to transform the texts into:
[[['the', 'new',
'machine_learning',
'model',
'built',
'by',
'google',
'is',
'revolutionary',
'for',
'the',
'current',
'state',
'of',
'artificial_intelligence'],
['it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking']],
[['data_science',
'and',
'artificial_intelligence',
'are',
'two',
'different',
'fields',
'although',
'they',
'are',
'interconnected'],
['scientists',
'from',
'harvard',
'are',
'explaining',
'it',
'in',
'a',
'detailed',
'presentation',
'that',
'could',
'be',
'found',
'on',
'our',
'page']]]
What I do right now is checking if the keywords are in a text and replace them with the keywords with _. But this is of complexity m*n and it is really slow when you have 700 long texts and 2M keywords as in my case.
I was trying to use Phraser, but I can't manage to build one with only my keywords.
Could someone suggest a more optimized way of doing it?
The Phrases/Phraser classes of gensim are designed to use their internal, statistically-derived records of what word pairs should be promoted to phrases – not user-supplied pairings. (You could probably poke & prod a Phraser to do what you want, by synthesizing scores/thresholds, but that would be somewhat awkward & kludgey.)
You could, however, mimic their general approach: (1) operate on lists-of-tokens rather than raw strings; (2) learn & remember token-pairs that should be combined; & (3) perform combination in a single pass. That should work far more efficiently than anything based on doing repeated search-and-replace on a string – which it sounds like you've already tried and found wanting.
For example, let's first create a dictionary, where the keys are tuples of word-pairs that should be combined, and the values are tuples that include both their designated combination-token, and a 2nd item that's just an empty-tuple. (The reason for this will become clear later.)
keywords = ['machine learning', 'data science', 'artificial intelligence']
texts = [
'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking',
'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
]
combinations_dict = {tuple(kwsplit): ('_'.join(kwsplit), ())
                     for kwsplit in [kwstr.split() for kwstr in keywords]}
combinations_dict
After this step, combinations_dict is:
{('machine', 'learning'): ('machine_learning', ()),
('data', 'science'): ('data_science', ()),
('artificial', 'intelligence'): ('artificial_intelligence', ())}
Now, we can use a Python generator function to create an iterable transformation of any other sequence-of-tokens, that takes original tokens one-by-one – but before emitting any, adds the next to a buffered candidate pair-of-tokens. If that pair is one that should be combined, a single combined token is yielded – but if not, just the 1st token is emitted, leaving the 2nd to be combined with the next token in a new candidate pair.
For example:
def combining_generator(tokens, comb_dict):
    buff = ()  # start with empty buffer
    for in_tok in tokens:
        buff += (in_tok,)  # add latest to buffer
        if len(buff) < 2:  # grow buffer to 2 tokens if possible
            continue
        # lookup what to do for current pair...
        # ...defaulting to emit-[0]-item, keep-[1]-item in new buff
        out_tok, buff = comb_dict.get(buff, (buff[0], (buff[1],)))
        yield out_tok
    if buff:
        yield buff[0]  # last solo token if any
Here we see the reason for the earlier () empty-tuples: that's the preferred state of the buff after a successful replacement. And driving the result & next-state this way helps us use the form of dict.get(key, default) that supplies a specific value to be used if the key isn't found.
Now designated combinations can be applied via:
tokenized_texts = [text.split() for text in texts]
retokenized_texts = [list(combining_generator(tokens, combinations_dict)) for tokens in tokenized_texts]
retokenized_texts
...which reports retokenized_texts as:
[
['the', 'new', 'machine_learning', 'model', 'built', 'by', 'google', 'is', 'revolutionary', 'for', 'the', 'current', 'state', 'of', 'artificial', 'intelligence.', 'it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking'],
['data_science', 'and', 'artificial_intelligence', 'are', 'two', 'different', 'fields,', 'although', 'they', 'are', 'interconnected.', 'scientists', 'from', 'harvard', 'are', 'explaining', 'it', 'in', 'a', 'detailed', 'presentation', 'that', 'could', 'be', 'found', 'on', 'our', 'page.']
]
Note that the tokens ('artificial', 'intelligence.') aren't combined here, as the dirt-simple .split() tokenization used has left the punctuation attached, preventing an exact match to the rule.
Real projects will want to use a more sophisticated tokenization that might strip the punctuation, retain punctuation as tokens, or do other preprocessing, and as a result would properly pass 'artificial' as a token without the attached '.'. For example, a simple tokenization that just retains runs of word characters, discarding punctuation, would be:
import re
tokenized_texts = [re.findall(r'\w+', text) for text in texts]
tokenized_texts
Another that also keeps any stray non-word/non-space characters (punctuation) as standalone tokens would be:
tokenized_texts = [re.findall(r'\w+|(?:[^\w\s])', text) for text in texts]
tokenized_texts
Either of these alternatives to a simple .split() would ensure your 1st text presents the necessary ('artificial', 'intelligence') pair for combination.
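Putting the pieces together, here's a sketch of the full pipeline using the punctuation-stripping tokenization with the generator and dictionary defined above:
import re

tokenized_texts = [re.findall(r'\w+', text) for text in texts]
retokenized_texts = [list(combining_generator(tokens, combinations_dict))
                     for tokens in tokenized_texts]
# The first text now contains ..., 'state', 'of', 'artificial_intelligence', 'it', 'may', ...
# instead of the unjoined ('artificial', 'intelligence.') pair.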
This is probably not the most Pythonic way to do it, but it works in 3 steps.
keywords = ['machine learning', 'data science', 'artificial intelligence']
texts = ['the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking', 'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.']
import re

#Add underscore
for idx, text in enumerate(texts):
    for keyword in keywords:
        reload_text = texts[idx]
        if keyword in text:
            texts[idx] = reload_text.replace(keyword, keyword.replace(" ", "_"))

#Split text for each "." encountered
for idx, text in enumerate(texts):
    texts[idx] = list(filter(None, text.split(".")))
print(texts)

#Split text to get each word
for idx, text in enumerate(texts):
    for idx_s, sentence in enumerate(text):
        texts[idx][idx_s] = list(map(lambda x: re.sub(r"[,\.!?]", "", x), sentence.split()))  # map to delete every undesired character
print(texts)
Output
[
[
['the', 'new', 'machine_learning', 'model', 'built', 'by', 'google', 'is', 'revolutionary', 'for', 'the', 'current', 'state', 'of', 'artificial_intelligence'],
['it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking']
],
[
['data_science', 'and', 'artificial_intelligence', 'are', 'two', 'different', 'fields', 'although', 'they', 'are', 'interconnected'],
['scientists', 'from', 'harvard', 'are', 'explaining', 'it', 'in', 'a', 'detailed', 'presentation', 'that', 'could', 'be', 'found', 'on', 'our', 'page']
]
]

Create a regular expression for deleting whitespaces after a newline in python

I'd like to know how to create a regular expression to delete whitespaces after a newline, for example, if my text is like this:
So she refused to ex-
change the feather and the rock be-
cause she was afraid.
How can I create something to get:
["so","she","refused","to","exchange", "the","feather","and","the","rock","because","she","was","afraid" ]
I've tried to use replace("-\n", "") to try to join them, but I only get something like:
["be","cause"] and ["ex","change"]
Any suggestion? Thanks!!
import re
s = '''So she refused to ex-
change the feather and the rock be-
cause she was afraid.'''.lower()
s = re.sub(r'-\n\s*', '', s) # join hyphens
s = re.sub(r'[^\w\s]', '', s) # remove punctuation
print(s.split())
\s* means 0 or more whitespace characters.
From what I can tell, Alex Hall's answer more adequately answers your question (both explicitly in that it's regex and implicitly in that it adjusts capitalization and removes punctuation), but it jumped out as a good candidate for a generator.
Here, using a generator to join tokens popped from a stack-like list:
s = '''So she refused to ex-
change the feather and the rock be-
cause she was afraid.'''
def condense(lst):
    while lst:
        tok = lst.pop(0)
        if tok.endswith('-'):
            yield tok[:-1] + lst.pop(0)
        else:
            yield tok
print(list(condense(s.split())))
# Result:
# ['So', 'she', 'refused', 'to', 'exchange', 'the', 'feather',
# 'and', 'the', 'rock', 'because', 'she', 'was', 'afraid.']
import re
s = s.replace('-\n', '')  # Remove the hyphen and the newline to rejoin the split word
# Your s would now look like 'So she refused to exchange the feather and the rock because she was afraid.'
s = re.sub(r'\s\s+', ' ', s)  # Collapse 2 or more whitespace characters into a single space
You could use an optional greedy expression:
-?\n\s+
This needs to be replaced by nothing, see a demo on regex101.com.
For the second part, I'd suggest nltk so that you end up having:
import re
from nltk import word_tokenize
string = """
So she refused to ex-
change the feather and the rock be-
cause she was afraid.
"""
rx = re.compile(r'-?\n\s+')
words = word_tokenize(rx.sub('', string))
print(words)
# ['So', 'she', 'refused', 'to', 'exchange', 'the', 'feather', 'and', 'the', 'rock', 'because', 'she', 'was', 'afraid', '.']

easiest way to count the number of occurrences of a word in a string of a paragraph

For example, how would I count the word "paragraph" in the paragraph below?
A paragraph in Word is any text that ends with a hard return. You
insert a hard return anytime you press the Enter key. Paragraph
formatting lets you control the appearance if individual paragraphs.
For example, you can change the alignment of text from left to center
or the spacing between lines form single to double. You can indent
paragraphs, number them, or add borders and shading to them.
Paragraph formatting is applied to an entire paragraph. All formatting
for a paragraph is stored in the paragraph mark and carried to the
next paragraph when you press the Enter key. You can copy paragraph
formats from paragraph to paragraph and view formats through task
panes.
You want to use the count method on the input string, passing "paragraph" as the argument.
>>> text = """A paragraph in Word is any text that ends with a hard return. You insert a hard return anytime you press the Enter key. Paragraph formatting lets you control the appearance if individual paragraphs. For example, you can change the alignment of text from left to center or the spacing between lines form single to double. You can indent paragraphs, number them, or add borders and shading to them.
Paragraph formatting is applied to an entire paragraph. All formatting for a paragraph is stored in the paragraph mark and carried to the next paragraph when you press the Enter key. You can copy paragraph formats from paragraph to paragraph and view formats through task panes."""
>>> text.count('paragraph') # case sensitive
10
>>> text.lower().count('paragraph') # case insensitive
12
As mentioned in the comments, you can use lower() to transform the text to be all lowercase. This will include instances of "paragraph" and "Paragraph" in the count.
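One thing to note: str.count matches substrings, so the figures above also count occurrences of 'paragraphs'. If you only want whole-word hits, a word-boundary regex is one option (a sketch, not part of the original answer):
import re

# \b word boundaries keep 'paragraphs' from being counted
print(len(re.findall(r'\bparagraph\b', text.lower())))  # -> 10 for the text above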
I would do the following:
Split into a list of words (although not totally necessary)
Lowercase all the words
Use count to count the number of instances
>>> s
'A paragraph in Word is any text that ends with a hard return. You insert a hard return anytime you press the Enter key. Paragraph formatting lets you control the appearance if individual paragraphs. For example, you can change the alignment of text from left to center or the spacing between lines form single to double. You can indent paragraphs, number them, or add borders and shading to them.\n\n Paragraph formatting is applied to an entire paragraph. All formatting for a paragraph is stored in the paragraph mark and carried to the next paragraph when you press the Enter key. You can copy paragraph formats from paragraph to paragraph and view formats through task panes.'
>>> s.split()
['A', 'paragraph', 'in', 'Word', 'is', 'any', 'text', 'that', 'ends', 'with', 'a', 'hard', 'return.', 'You', 'insert', 'a', 'hard', 'return', 'anytime', 'you', 'press', 'the', 'Enter', 'key.', 'Paragraph', 'formatting', 'lets', 'you', 'control', 'the', 'appearance', 'if', 'individual', 'paragraphs.', 'For', 'example,', 'you', 'can', 'change', 'the', 'alignment', 'of', 'text', 'from', 'left', 'to', 'center', 'or', 'the', 'spacing', 'between', 'lines', 'form', 'single', 'to', 'double.', 'You', 'can', 'indent', 'paragraphs,', 'number', 'them,', 'or', 'add', 'borders', 'and', 'shading', 'to', 'them.', 'Paragraph', 'formatting', 'is', 'applied', 'to', 'an', 'entire', 'paragraph.', 'All', 'formatting', 'for', 'a', 'paragraph', 'is', 'stored', 'in', 'the', 'paragraph', 'mark', 'and', 'carried', 'to', 'the', 'next', 'paragraph', 'when', 'you', 'press', 'the', 'Enter', 'key.', 'You', 'can', 'copy', 'paragraph', 'formats', 'from', 'paragraph', 'to', 'paragraph', 'and', 'view', 'formats', 'through', 'task', 'panes.']
>>> [word.lower() for word in s.split()].count("paragraph")
9
Here's another example of splitting the paragraph into words and then looping through the word list and incrementing a counter when the target word is found.
paragraph = '''insert paragraph here'''
wordlist = paragraph.split(" ")
count = 0
for word in wordlist:
    if word == "paragraph":
        count += 1
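For a quick case-insensitive tally of every word at once, collections.Counter over the normalized words is another option (a sketch, not from the original answer):
from collections import Counter

word_counts = Counter(word.strip('.,').lower() for word in paragraph.split())
print(word_counts["paragraph"])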

Python Regex sentence filtering

I'm trying to filter the following sentence
'I'm using C++ in high-tech applications!', said peter (in a confident way)
into its individual words to get
I'm using C++ in high-tech applications said peter in a confident way
what I have so far is
parsing=re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*",text)
' '.join(w for w in parsing if w not in string.punctuation)
however this produces
I'm using C in high-tech applications said peter in a confident way
So 'C++' incorrectly turns into 'C' because '+' is in string.punctuation. Is there any way I can modify the regex so that the '+'s aren't tokenized separately and dropped? Any alternative method to get the desired output would also be welcome, thanks!
Just use (\w|\+) instead of \w. This will use both word characters and the plus sign.
Alternatively, you could use [a-zA-Z+] or ideally [\w+] as suggested by Kyle Strand.
Similar to C0deH4cker's answer but slightly simpler, replace all instances of \w with [\w+].
>>> parsing=re.findall(r"[\w+]+(?:[-'][\w+]+)*|'|[-.(]+|\S[\w+]*",text)
>>> parsing
["'", "I'm", 'using', 'C++', 'in', 'high-tech', 'applications', '!', "'", ',', 'said', 'peter', '(', 'in', 'a', 'confident', 'way', ')']
>>> ' '.join(w for w in parsing if w not in string.punctuation)
"I'm using C++ in high-tech applications said peter in a confident way"
Note that your original solution splits "C++" into three distinct tokens, so even excluding + from string.punctuation wouldn't have solved your problem:
>>> parsing=re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*",text)
>>> parsing
["'", "I'm", 'using', 'C', '+', '+', 'in', 'high-tech', 'applications', '!', "'", ',', 'said', 'r', '(', 'in', 'a', 'confident', 'way', ')']

How can I make my title-case regular expression match prefix titles?

I need to pull possible titles out of a chunk of text. So for instance, I want to match words like "Joe Smith", "The Firm", or "United States of America". I now need to modify it to match names that begin with a title of some kind (such as "Dr. Joe Smith"). Here's the regular expression I have:
NON_CAPPED_WORDS = (
# Articles
'the',
'a',
'an',
# Prepositions
'about',
'after',
'as',
'at',
'before',
'by',
'for',
'from',
'in',
'into',
'like',
'of',
'on',
'to',
'upon',
'with',
'without',
)
TITLES = (
'Dr\.',
'Mr\.',
'Mrs\.',
'Ms\.',
'Gov\.',
'Sen\.',
'Rep\.',
)
# These are words that don't match the normal title case regex, but are still allowed
# in matches
IRREGULAR_WORDS = NON_CAPPED_WORDS + TITLES
non_capped_words_re = r'[\s:,]+|'.join(IRREGULAR_WORDS)
TITLE_RE = re.compile(r"""(?P<title>([A-Z0-9&][a-zA-Z0-9]*[\s,:-]*|{0})+\s*)""".format(non_capped_words_re))
Which builds the following regular expression:
(?P<title>([A-Z0-9&][a-zA-Z0-9]*[\s,:-]*|the[\s:,]+|a[\s:,]+|an[\s:,]+|about[\s:,]+|after[\s:,]+|as[\s:,]+|at[\s:,]+|before[\s:,]+|by[\s:,]+|for[\s:,]+|from[\s:,]+|in[\s:,]+|into[\s:,]+|like[\s:,]+|of[\s:,]+|on[\s:,]+|to[\s:,]+|upon[\s:,]+|with[\s:,]+|without[\s:,]+|Dr\.[\s:,]+|Mr\.[\s:,]+|Mrs\.[\s:,]+|Ms\.[\s:,]+|Gov\.[\s:,]+|Sen\.[\s:,]+|Rep\.)+\s*)
This doesn't seem to be working though:
>>> whitelisting.TITLE_RE.findall('Dr. Joe Smith')
[('Dr', 'Dr'), ('Joe Smith', 'Smith')]
Can someone who has better regex-fu help me fix this mess of a regex?
The problem seems to be that the first part of the expression, [A-Z0-9&][a-zA-Z0-9]*[\s,:-]*, is gobbling up the initial characters in your "prefix titles", since they are title-cased until you get to the period. So, when the + is repeating the subexpression and encounters 'Dr.', that initial part of the expression matches 'Dr', and leaves only the non-matching period.
One easy fix is to simply move the "special cases" to the front of the expression, so they're matched as a first resort, not a last resort (this essentially just moves {0} from the end of the expression to the front):
TITLE_RE = re.compile(r"""(?P<title>({0}|[A-Z0-9&][a-zA-Z0-9]*[\s,:-]*)+\s*)""".format(non_capped_words_re))
Result:
>>> TITLE_RE.findall('Dr. Joe Smith');
[('Dr. Joe Smith', 'Smith')]
I would probably go further and modify the expression to avoid all the repetition of [\s:,]+, but I'm not sure there's any real benefit, aside from making the formatted expression look a little nicer:
non_capped_words_re = '|'.join(IRREGULAR_WORDS)
TITLE_RE = re.compile(r"""(?P<title>((?:{0})[\s:,]+|[A-Z0-9&][a-zA-Z0-9]*[\s,:-]*)+\s*)""".format(non_capped_words_re))
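For what it's worth, a quick check (mine, not from the original post) suggests the consolidated pattern gives the same result on the motivating example:
>>> TITLE_RE.findall('Dr. Joe Smith')
[('Dr. Joe Smith', 'Smith')]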
