Find same sequence of 2 characters in a sentence - python

I have a line that looks like this:
Amount:Category:Date:Description:55544355
My requirement is to find each sequence of two characters that is followed later in the string by that same two-character sequence, and to collect all such sequences. I attempted this as follows:
>>> import re
>>> my_str = 'Amount:Category:Date:Description:55544355'
>>> [item[0] for item in re.findall(r"((..)\2*)", my_str)]
['Am', 'ou', 'nt', ':C', 'at', 'eg', 'or', 'y:', 'Da', 'te', ':D', 'es', 'cr', 'ip', 'ti', 'on', ':5', '55', '44', '35']
This is obviously not the right output since the desired output is:
[[':D',':D'],['55','55'],['at', 'at']]
What am I doing wrong?

Would you please try the following:
import re

my_str = 'Amount:Category:Date:Description:55544355'
print(re.findall(r'(..)(?=.*?\1)', my_str))
Output:
['at', ':D', '55']
If you want to print all occurrences of the characters, another step is required.
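For instance, one possible extra step (just a sketch) is to count every occurrence of each repeated pair with a zero-width lookahead. Note that overlapping matches are counted, so '55' is reported three times here:
import re

my_str = 'Amount:Category:Date:Description:55544355'
repeated = dict.fromkeys(re.findall(r'(..)(?=.*?\1)', my_str))  # unique pairs, in order
# count every (possibly overlapping) occurrence of each repeated pair
all_occurrences = [[p] * len(re.findall(f'(?={re.escape(p)})', my_str)) for p in repeated]
print(all_occurrences)
# [['at', 'at'], [':D', ':D'], ['55', '55', '55']]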

You have to use a lookahead with a backreference. To get both values, you can also wrap the backreference in a capture group; re.findall then returns the groups as tuples.
import re
print(re.findall(r"(..)(?=.*?(\1))", "Amount:Category:Date:Description:55544355"))
Output
[('at', 'at'), (':D', ':D'), ('55', '55')]
If you want a list of lists:
import re
print([list(elem) for elem in re.findall(r"(..)(?=.*?(\1))", "Amount:Category:Date:Description:55544355")])
Output
[['at', 'at'], [':D', ':D'], ['55', '55']]

Related

Python splitting text with line breaks into a list

I'm trying to convert some text into a list. The text contains special characters, numbers, and line breaks. Ultimately I want to have a list with each word as an item in the list without any special characters, numbers, or spaces.
Excerpt from the text:
I have no ambition to lose my life on the post-road between St. Petersburgh and Archangel. &lt;the&lt I
Currently I'm using this line to split each word into an item in the list:
text_list = [re.sub(r"[^a-zA-Z0-9]+", ' ', k)
             for k in content.split(" ")]
print(text_list)
This code leaves spaces in and combines words within items of the list, as shown below.
Result:
['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the',
'post road', 'between St ', 'Petersburgh', 'and', 'Archangel ', ' lt the lt I']
I would like to split the words into individual items of the list and remove the string ' lt ' and numbers from my list items.
Expected result:
['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the',
'post', 'road', 'between', 'St', 'Petersburgh', 'and', 'Archangel', 'the', 'I']
Please help me resolve this issue.
Thanks
Since it looks like you're parsing HTML text, it's likely all entities are enclosed in & and ;. Removing those makes matching the rest quite easy.
import re
content = 'I have no ambition to lose my life on the post-road between St. Petersburgh and Archangel. &lt;the&lt; I'
# first, remove entities, the question mark makes sure the expression isn't too greedy
content = re.sub(r'&[^ ]+?;', '', content)
# then just match anything that meets your rules
text_list = re.findall(r"[a-zA-Z0-9]+", content)
print(text_list)
Note that 'between St ' likely got matched together because the character joining 'between' and 'St.' probably isn't a regular space, but a non-breaking space. If this were just HTML, I'd expect there to be &nbsp; or something of the sort, but it's possible that in your case there's some UTF non-breaking space character there.
That should not matter with the code above, but if you use a solution using .split(), it likely won't see that character as a space.
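If you do want a .split()-based approach, a minimal workaround (a sketch, assuming the culprit is U+00A0) is to normalize non-breaking spaces first:
# replace U+00A0 (non-breaking space) with a regular space before splitting
content = content.replace('\u00a0', ' ')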
In case the &lt is not your mistake, but in the original, this works as a replacement for the .sub() statement:
content = re.sub(r'&[^ ;]+?(?=[ ;]);?', '', content)
Clearly a bit more complicated: it substitutes any string that starts with & [&], followed by one or more characters that are not a space or ;, taking as little as possible [[^ ;]+?], but only if they are then followed by a space or a ; [(?=[ ;])], and in that case that ; is also matched [;?].
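For example, applied to the tail of the excerpt above, this strips both entity forms, with and without the trailing semicolon:
import re

print(re.sub(r'&[^ ;]+?(?=[ ;]);?', '', 'Archangel. &lt;the&lt I'))
# Archangel. the I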
Here is what can be done. You just need to replace any known entity syntax in advance:
import re

# define the special entity syntax we want to remove
special_syntax = r"&(lt|nbsp|gt|amp|quot|apos|cent|pound|yen|euro|copy|reg|)[; ]"

# remove the entities first, then split and substitute special chars again
text_list = [re.sub(r"[^a-zA-Z0-9]+", ' ', k).strip()
             for k in re.sub(special_syntax, ' ', content).split(" ")]

# remove empty strings from the list
filter_object = filter(lambda x: x != "", text_list)
print(list(filter_object))
Output
['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the',
'post road', 'between', 'St', 'Petersburgh', 'and', 'Archangel', 'the', 'I']

Removing lists that contain specific string from list of lists

I have the following list:
print(sentences_fam)
>>>[['30973', 'ok'],
['3044', 'ok'],
['53690', 'fd', '65', 'ca'],
['36471', 'none','good','standing'],
['j6426', 'none'],
['500861', 'm', 'br'],
['j0076', 'none'],
['mf4422', 'ok'],
['jf1816', 'father', '64', 'ca'],
['500854', 'no', 'fam', 'none', 'hx'],
['54480n', 'none'],
['mf583', 'none'],
...]
print (len(sentences_fam))
>>> 1523613
The lists are of many different lengths and contain all sorts of different strings.
I am trying to remove all lists that contain the keyword 'none'. Based on the list above my desired output should look like this.
[['30973', 'ok'],
['3044', 'ok'],
['53690', 'fd', '65', 'ca'],
['500861', 'm', 'br'],
['mf4422', 'ok'],
['jf1816', 'father', '64', 'ca'],
...]
My list comprehension skills are still not so great so I'm not sure what to do. I have tried converting this list into a dataframe but I have had no luck because each string gets assigned an individual column and I have not found a good way of formatting the data again into a list of lists. I need that type of format to be able to pass the data to the word2vec library.
Basically the whole list is the body of text and each sublist is a sentence. Also please keep in mind that I will be needing to apply this to a large list so performance/efficiency might be important.
filtered_list = [sublist for sublist in sentences_fam if "none" not in sublist]
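This should be fast enough even for ~1.5 million sublists, since the in membership test on each short sublist is cheap. If memory is a concern, a generator expression is a minimal variation (a sketch) that yields the filtered sentences lazily instead of building a second large list:
# lazy variant: consume once, e.g. when feeding sentences to word2vec
filtered = (sublist for sublist in sentences_fam if 'none' not in sublist)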

extract substring between multiple words in a pandas dataframe

I have a pandas data frame where I need to extract a sub-string from each row of a column, based on the following conditions:
We have start_list ('one','once I','he') and end_list ('fine','one','well').
The sub-string should be preceded by any of the elements of the start_list.
The sub-string may be succeeded by any of the elements of the end_list.
When any of the elements of the start_list is available then the succeeding sub string should be extracted with/without the presence of the elements of the end_list.
Example Problem:
df = pd.DataFrame({'a': ['one was fine today', 'we had to drive', ' ',
                         'I think once I was fine eating ham ',
                         'he studies really well and is polite ',
                         'one had to live well and prosper',
                         '43948785943one by onej89044809', '827364hjdfvbfv',
                         '&^%$&*+++===========one kfnv dkfjn uuoiu fine',
                         'they is one who makes me crazy'],
                   'b': ['11', '22', '33', '44', '55', '66', '77', '',
                         '88', '99']})
Expected Result:
df = pd.DataFrame({'a': ['was', '', '', 'was ', 'studies really', 'had to live',
                         'by', '', 'kfnv dkfjn uuoiu', 'who makes me crazy'],
                   'b': ['11', '22', '33', '44', '55', '66', '77', '',
                         '88', '99']})
I think this should work for you. This solution requires pandas and the built-in functools module.
Function: remove_preceders
This function takes as input a collection of words start_list and a string string. It looks to see if any of the items in start_list are in string, and if so returns only the piece of string that occurs after said items. Otherwise, it returns the original string.
def remove_preceders(start_list, string):
    for word in start_list:
        if word in string:
            string = string[string.find(word) + len(word):]
    return string
Function: remove_succeeders
This function is very similar to the first, except it returns only the piece of string that occurs before the items in end_list.
def remove_succeeders(end_list, string):
    for word in end_list:
        if word in string:
            string = string[:string.find(word)]
    return string
Function: to_apply
How do you actually run the above functions? The apply method allows you to run complex functions on a DataFrame or Series, but it then expects as input either a full row or a single value, respectively (based on whether you're running on a DataFrame or Series).
This function takes as input a function to run & a collection of words to check, and we can use it to run the above two functions:
import functools

def to_apply(func, words_to_check):
    return functools.partial(func, words_to_check)
How to Run
df['no_preceders'] = df.a.apply(
    to_apply(remove_preceders, ('one', 'once I', 'he'))
)
df['no_succeders'] = df.a.apply(
    to_apply(remove_succeeders, ('fine', 'one', 'well'))
)
df['substring'] = df.no_preceders.apply(
    to_apply(remove_succeeders, ('fine', 'one', 'well'))
)
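As an aside, functools.partial here simply pre-binds the word collection as the first positional argument, so an equivalent inline form (a sketch) would be:
df['substring'] = df.no_preceders.apply(
    lambda s: remove_succeeders(('fine', 'one', 'well'), s)
)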
And then there's one final step to remove the items from the substring column that were not affected by the filtering:
def final_cleanup(row):
    if len(row['a']) == len(row['substring']):
        return ''
    else:
        return row['substring']

df['substring'] = df.apply(final_cleanup, axis=1)
Results
Hope this works.

Splitting words using nltk module in Python

I am trying to find a way for splitting words in Python using the nltk module. I am unsure how to reach my goal given the raw data I have which is a list of tokenized words e.g.
['usingvariousmolecularbiology', 'techniques', 'toproduce', 'genotypes', 'following', 'standardoperatingprocedures', '.', 'Operateandmaintainautomatedequipment', '.', 'Updatesampletrackingsystemsandprocess', 'documentation', 'toallowaccurate', 'monitoring', 'andrapid', 'progression', 'ofcasework']
As you can see many words are stuck together (i.e. 'to' and 'produce' are stuck in one string 'toproduce'). This is an artifact of scraping data from a PDF file and I would like to find a way using the nltk module in python to split the stuck-together words (i.e. split 'toproduce' into two words: 'to' and 'produce'; split 'standardoperatingprocedures' into three words: 'standard', 'operating', 'procedures').
I appreciate any help!
I believe you will want to use word segmentation in this case, and I am not aware of any word segmentation features in the NLTK that will deal with English sentences without spaces. You could use pyenchant instead. I offer the following code only by way of example. (It would work for a modest number of relatively short strings--such as the strings in your example list--but would be highly inefficient for longer strings or more numerous strings.) It would need modification, and it will not successfully segment every string in any case.
import enchant # pip install pyenchant
eng_dict = enchant.Dict("en_US")
def segment_str(chars, exclude=None):
    """
    Segment a string of chars using the pyenchant vocabulary.
    Keeps longest possible words that account for all characters,
    and returns list of segmented words.

    :param chars: (str) The character string to segment.
    :param exclude: (set) A set of strings to exclude from consideration.
                    (These have been found previously to lead to dead ends.)
                    If an excluded word occurs later in the string, this
                    function will fail.
    """
    words = []

    if not chars.isalpha():  # don't check punctuation etc.; needs more work
        return [chars]

    if not exclude:
        exclude = set()

    working_chars = chars
    while working_chars:
        # iterate through segments of the chars starting with the longest segment possible
        for i in range(len(working_chars), 1, -1):
            segment = working_chars[:i]
            if eng_dict.check(segment) and segment not in exclude:
                words.append(segment)
                working_chars = working_chars[i:]
                break
        else:  # no matching segments were found
            if words:
                exclude.add(words[-1])
                return segment_str(chars, exclude=exclude)
            # let the user know a word was missing from the dictionary,
            # but keep the word
            print('"{chars}" not in dictionary (so just keeping as one segment)!'
                  .format(chars=chars))
            return [chars]

    # return a list of words based on the segmentation
    return words
As you can see, this approach (presumably) mis-segments only one of your strings:
>>> t = ['usingvariousmolecularbiology', 'techniques', 'toproduce', 'genotypes', 'following', 'standardoperatingprocedures', '.', 'Operateandmaintainautomatedequipment', '.', 'Updatesampletrackingsystemsandprocess', 'documentation', 'toallowaccurate', 'monitoring', 'andrapid', 'progression', 'ofcasework']
>>> [segment_str(chars) for chars in t]
"genotypes" not in dictionary (so just keeping as one segment)!
[['using', 'various', 'molecular', 'biology'], ['techniques'], ['to', 'produce'], ['genotypes'], ['following'], ['standard', 'operating', 'procedures'], ['.'], ['Operate', 'and', 'maintain', 'automated', 'equipment'], ['.'], ['Updates', 'ample', 'tracking', 'systems', 'and', 'process'], ['documentation'], ['to', 'allow', 'accurate'], ['monitoring'], ['and', 'rapid'], ['progression'], ['of', 'casework']]
You can then use chain to flatten this list of lists:
>>> from itertools import chain
>>> list(chain.from_iterable(segment_str(chars) for chars in t))
"genotypes" not in dictionary (so just keeping as one segment)!
['using', 'various', 'molecular', 'biology', 'techniques', 'to', 'produce', 'genotypes', 'following', 'standard', 'operating', 'procedures', '.', 'Operate', 'and', 'maintain', 'automated', 'equipment', '.', 'Updates', 'ample', 'tracking', 'systems', 'and', 'process', 'documentation', 'to', 'allow', 'accurate', 'monitoring', 'and', 'rapid', 'progression', 'of', 'casework']
Alternatively, you can install the wordsegment library and use it for this purpose:
pip install wordsegment
import wordsegment
help(wordsegment)
from wordsegment import load, segment
load()
segment('usingvariousmolecularbiology')
The output will be like this:
Out[4]: ['using', 'various', 'molecular', 'biology']
Please refer to http://www.grantjenks.com/docs/wordsegment/ for more details.

How can I make my title-case regular expression match prefix titles?

I need to pull possible titles out of a chunk of text. So for instance, I want to match phrases like "Joe Smith", "The Firm", or "United States of America". I now need to modify my regex to match names that begin with a title of some kind (such as "Dr. Joe Smith"). Here's the regular expression I have:
NON_CAPPED_WORDS = (
# Articles
'the',
'a',
'an',
# Prepositions
'about',
'after',
'as',
'at',
'before',
'by',
'for',
'from',
'in',
'into',
'like',
'of',
'on',
'to',
'upon',
'with',
'without',
)
TITLES = (
'Dr\.',
'Mr\.',
'Mrs\.',
'Ms\.',
'Gov\.',
'Sen\.',
'Rep\.',
)
# These are words that don't match the normal title case regex, but are still allowed
# in matches
IRREGULAR_WORDS = NON_CAPPED_WORDS + TITLES
non_capped_words_re = r'[\s:,]+|'.join(IRREGULAR_WORDS)
TITLE_RE = re.compile(r"""(?P<title>([A-Z0-9&][a-zA-Z0-9]*[\s,:-]*|{0})+\s*)""".format(non_capped_words_re))
Which builds the following regular expression:
(?P<title>([A-Z0-9&][a-zA-Z0-9]*[\s,:-]*|the[\s:,]+|a[\s:,]+|an[\s:,]+|about[\s:,]+|after[\s:,]+|as[\s:,]+|at[\s:,]+|before[\s:,]+|by[\s:,]+|for[\s:,]+|from[\s:,]+|in[\s:,]+|into[\s:,]+|like[\s:,]+|of[\s:,]+|on[\s:,]+|to[\s:,]+|upon[\s:,]+|with[\s:,]+|without[\s:,]+|Dr\.[\s:,]+|Mr\.[\s:,]+|Mrs\.[\s:,]+|Ms\.[\s:,]+|Gov\.[\s:,]+|Sen\.[\s:,]+|Rep\.)+\s*)
This doesn't seem to be working though:
>>> whitelisting.TITLE_RE.findall('Dr. Joe Smith')
[('Dr', 'Dr'), ('Joe Smith', 'Smith')]
Can someone who has better regex-fu help me fix this mess of a regex?
The problem seems to be that the first part of the expression, [A-Z0-9&][a-zA-Z0-9]*[\s,:-]*, is gobbling up the initial characters in your "prefix titles", since they are title-cased until you get to the period. So, when the + is repeating the subexpression and encounters 'Dr.', that initial part of the expression matches 'Dr', and leaves only the non-matching period.
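You can see this in isolation (a quick check using just the first branch of your pattern):
>>> import re
>>> re.findall(r'[A-Z0-9&][a-zA-Z0-9]*[\s,:-]*', 'Dr. Joe Smith')
['Dr', 'Joe ', 'Smith']
The 'Dr' is consumed, and the leftover period prevents the titles branch from ever matching.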
One easy fix is to simply move the "special cases" to the front of the expression, so they're matched as a first resort, not a last resort (this essentially just moves {0} from the end of the expression to the front):
TITLE_RE = re.compile(r"""(?P<title>({0}|[A-Z0-9&][a-zA-Z0-9]*[\s,:-]*)+\s*)""".format(non_capped_words_re))
Result:
>>> TITLE_RE.findall('Dr. Joe Smith')
[('Dr. Joe Smith', 'Smith')]
I would probably go further and modify the expression to avoid all the repetition of [\s:,]+, but I'm not sure there's any real benefit, aside from making the formatted expression look a little nicer:
non_capped_words_re = '|'.join(IRREGULAR_WORDS)
TITLE_RE = re.compile(r"""(?P<title>((?:{0})[\s:,]+|[A-Z0-9&][a-zA-Z0-9]*[\s,:-]*)+\s*)""".format(non_capped_words_re))
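With that change, the result on the earlier test should be unchanged:
>>> TITLE_RE.findall('Dr. Joe Smith')
[('Dr. Joe Smith', 'Smith')]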
