python regex songs - python

sorry, I need some help finding the right regular expression for this.
Basically I want to recognize via regex if the following format is in a string:
artist - title (something)
examples:
"ellie goulding - good gracious (the chainsmokers remix)"
"the xx - you got the love (florence and the machine cover)"
"neneh cherry - everything (loco dice remix)"
"my chemical romance - famous last words (video)"
I have been trying but haven't been able to find the right regular expression.
regex = "[A-Za-z0-9\s]+[\-]{1}[A-Za-z0-9\s]+[\(]{1}[\s]*[A-Za-z0-9\s]*[\)]{1}"
help please!

I would do
regex = r"^[^\-]*-[^\(]*\([^\)]*\)$"

>>> import re
>>> songstring = 'ellie goulding - good gracious (the chainsmokers remix)'
>>> re.match(r'(.*?) - (.*?) (\(.*?\))', songstring).groups()
('ellie goulding', 'good gracious', '(the chainsmokers remix)')

Related

Looking for a very specific regex, between two dashes, after a dash

I am trying to find a regex that will pull out information between two dashes, but after a dash has already existed.
I will list a few examples:
a = BRAND NAME - 34953 Some Information, PPF - 00003 - (3.355pz.) 300%
b = 893535 - MX LMESJFG - 01001 - ST2 | ls.33.dg - OK
c = 4539883 - AB05 - AG03J238B - | D.87.yes.4/3 - OK
What I want to pull out:
a = 00003
b = 01001
c = AG03J238B
I would greatly appreciate any help.
We can stick to the base string functions here and use split() and strip():
a = "BRAND NAME - 34953 Some Information, PPF - 00003 - (3.355pz.) 300%"
match = a.split("-")[2].strip()
print(match) # 00003
Little late to the party, but the following python regular expression should find all occurrences for each line of file (assuming your using python language).
Hope this works for you. Not limited to string splits failures I don't think
import re
match = re.findall(r"(?=(-\s\w+\s-))", c)
print(match)
the (?=()) allows you to find overlapping issues, look up in the documents for look aheads and look behinds.
Good luck.

Get sentences from transcript file

I have files of transcripts where the format is
(name 1): (sentence)\n (<-- There can be multiples of this pattern)
(name 2): (sentence)\n (sentence)\n
and so on. I need all of the sentences. So far I have gotten it to work by hard-coding the names in the file, but I need it to be generic.
utterances = re.findall(r'(?:CALLER: |\nCALLER:\nCRO: |\nCALLER:\nOPERATOR: |\nCALLER:\nRECORDER: |RECORDER: |CRO: |OPERATOR: )(.*?)(?:CALLER: |RECORDER : |CRO: |OPERATOR: |\nCALLER:\n)', raw_calls, re.DOTALL)
Python 3.6 using re. Or if anyone knows how to do this using spacy, that would be a great help, thanks.
I want to just grab the \n after an empty statement, and put it in its own string. And I suppose I will just have to grab the tape information given at the end of this, for example, since I can't think of a way to distinguish if the line is part of someone's speech or not. Also sometimes, there's more than one word between start of line and colon.
Mock data:
CRO: How far are you from the World Trade Center, how many blocks, about? Three or four blocks?
63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01
CALLER:
CRO: You're welcome. Thank you.
OPERATOR: Bye.
CRO: Bye.
RECORDER: The preceding portion of tape concludes at 0913 hours, 36 seconds.
This tape will continue on side B.
OPERATOR NEWELL: blah blah.
GUY IN DESK: I speak words!
You can use a lookahead expression that looks for the same pattern of a name at the beginning of a line and is followed by a colon:
s = '''CRO: How far are you from the World Trade Center, how many blocks, about? Three or four blocks?
63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01
CALLER:
CRO: You're welcome. Thank you.
OPERATOR: Bye.
CRO: Bye.
RECORDER: The preceding portion of tape concludes at 0913 hours, 36 seconds.
This tape will continue on side B.
OPERATOR NEWELL: blah blah.
GUY IN DESK: I speak words!'''
import re
from pprint import pprint
pprint(re.findall(r'^([^:\n]+):\s*(.*?)(?=^[^:\n]+?:|\Z)', s, flags=re.MULTILINE | re.DOTALL), width=200)
This outputs:
[('CRO', 'How far are you from the World Trade Center, how many blocks, about? Three or four blocks?\n63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01\n'),
('CALLER', ''),
('CRO', "You're welcome. Thank you.\n"),
('OPERATOR', 'Bye.\n'),
('CRO', 'Bye.\n'),
('RECORDER', 'The preceding portion of tape concludes at 0913 hours, 36 seconds.\nThis tape will continue on side B.\n'),
('OPERATOR NEWELL', 'blah blah.\n'),
('GUY IN DESK', 'I speak words!')]
I would use regular expressions and nested for loops in a list comprehension to grab all the sentences as illustrated in the code below.
s ='''(name 1): (sentence1 here)\n (<-- There can be multiples of this pattern)
(name 2): (sentence2 here)\n (sentence3 here)\n'''
[y.strip('()') for x in re.split('\(name \d+\):', s) for y in re.findall('\([^\)]+\)', x)]
>>> ['sentence1 here',
'<-- There can be multiples of this pattern',
'sentence2 here',
'sentence3 here']

Accurately splitting sentences

My program takes a text file and splits each sentence into a list using split('.') meaning that it will split when it registers a full stop however it can be inaccurate.
For Example
str='i love carpets. In fact i own 2.4 km of the stuff.'
Output
listOfSentences = ['i love carpets', 'in fact i own 2', '4 km of the stuff']
Desired Output
listOfSentences = ['i love carpets', 'in fact i own 2.4 km of the stuff']
My question is: How do I split the end of sentences and not at every full stop.
Any regex based approach cannot handle cases like "I saw Mr. Smith.", and adding hacks for those cases is not scalable. As user est has commented, any serious implementation uses data.
If you need to handle English only then spaCy is better than NLTK:
from spacy.en import English
en = English()
doc = en(u'i love carpets. In fact i own 2.4 km of the stuff.')
for s in list(doc.sents):
print s.string
Update: spaCy now supports many languages.
I found https://github.com/fnl/syntok/ to be quite good, actually the best from all the popular ones. Specifically, I tested nltk (punkt), spacy, and syntok on English news articles.
import syntok.segmenter as segmenter
document = "some text. some more text"
for paragraph in segmenter.analyze(document):
for sentence in paragraph:
for token in sentence:
# exactly reproduce the input
# and do not remove "imperfections"
print(token.spacing, token.value, sep='', end='')
print("\n") # reinsert paragraph separators
Not splitting at numbers can be done using the split function of the re module:
>>> import re
>>> s = 'i love carpets. In fact i own 2.4 km of the stuff.'
>>> re.split(r'\.[^0-9]', s)
['i love carpets', 'In fact i own 2.4 km of the stuff.']
If you have sentences both ending with "." and ". ", you can try regex:
import re
text = "your text here. i.e. something."
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
source: Python - RegEx for splitting text into sentences (sentence-tokenizing)
The simplest way is to split on a dot followed by a space as:
>>> s = 'i love carpets. In fact i own 2.4 km of the stuff.'
>>> s.split('. ')
['i love carpets', 'In fact i own 2.4 km of the stuff.']

Python - Split String with Parenthesis based off a Pattern

I have a problem in python where I have a pattern, that can repeat anywhere from 1 to XXX times.
The pattern is I have a string of format
Author (Affiliation) Author (Affiliation) etc etc etc as many authors/affiliations that there are.
What is the best way in Python to go about splitting a string up like this when you dont know if you'll have 1 instance of Author (Affiliation) or 100?
EDIT - Viktor Leis* (Technische Universität München) Alfons Kemper (Technische Universität München) Thomas Neumann (Technische Universität München, Germany)
That is a sample string I am working with. I have tried re.split / re.findall and am having no luck. I'm assuming I am doing something with regex's wrong.
EDIT 2 - '\w+{1,3}(\w{1,10})' Is the pattern I was attempting to use.
My logic was a name is 1-3 words, then (. Then an affiliation is between 1-10 words, and a closing ).
Here is a sample. Looks like you are wanting to match text with no ) or ( and the text in between ( and ). Below is one way to do it assuming it is exactly like above.
import re
text = r'Viktor Leis* (Technische Universitt Mnchen) Alfons Kemper (Technische Universitt Mnchen) Thomas Neumann (Technische Universitt Mnchen, Germany)'
pattern = '[^\(\)]* \([^\(]+\)'
result = re.findall(pattern,s)
print result
output:
['Viktor Leis* (Technische Universitt Mnchen)', ' Alfons Kemper (Technische Universitt Mnchen)', ' Thomas Neumann (Technische Universitt Mnchen, Germany)']
You may want to removing leading and trailing spaces using strip.
This is the first thing that comes to mind
import re
s = 'Bob (ABC) Steve (XYZ) Mike (ALPHA)'
pattern = '\w+ \(\w+\)'
>>> re.findall(pattern,s)
['Bob (ABC)', 'Steve (XYZ)', 'Mike (ALPHA)']
You could do it like this:
thing="Author1 (Affiliation) Author2 (Affiliation) Author3 (Affiliation)"
s=thing.split(') ')
list=[]
for i in s:
if not i.endswith(')'):
list.append(i+')')
else:
list.append(i)

Python - RegEx for splitting text into sentences (sentence-tokenizing) [duplicate]

This question already has answers here:
How can I split a text into sentences?
(20 answers)
Closed 3 years ago.
I want to make a list of sentences from a string and then print them out. I don't want to use NLTK to do this. So it needs to split on a period at the end of the sentence and not at decimals or abbreviations or title of a name or if the sentence has a .com This is attempt at regex that doesn't work.
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
for stuff in sentences:
print(stuff)
Example output of what it should look like
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s
Try this. split your string this.You can also check demo.
http://regex101.com/r/nG1gU7/27
Ok so sentence-tokenizers are something I looked at in a little detail, using regexes, nltk, CoreNLP, spaCy. You end up writing your own and it depends on the application. This stuff is tricky and valuable and people don't just give their tokenizer code away. (Ultimately, tokenization is not a deterministic procedure, it's probabilistic, and also depends very heavily on your corpus or domain, e.g. legal/financial documents vs social-media posts vs Yelp reviews vs biomedical papers...)
In general you can't rely on one single Great White infallible regex, you have to write a function which uses several regexes (both positive and negative); also a dictionary of abbreviations, and some basic language parsing which knows that e.g. 'I', 'USA', 'FCC', 'TARP' are capitalized in English.
To illustrate how easily this can get seriously complicated, let's try to write you that functional spec for a deterministic tokenizer just to decide whether single or multiple period ('.'/'...') indicates end-of-sentence, or something else:
function isEndOfSentence(leftContext, rightContext)
Return False for decimals inside numbers or currency e.g. 1.23 , $1.23, "That's just my $.02" Consider also section references like 1.2.A.3.a, European date formats like 09.07.2014, IP addresses like 192.168.1.1, MAC addresses...
Return False (and don't tokenize into individual letters) for known abbreviations e.g. "U.S. stocks are falling" ; this requires a dictionary of known abbreviations. Anything outside that dictionary you will get wrong, unless you add code to detect unknown abbreviations like A.B.C. and add them to a list.
Ellipses '...' at ends of sentences are terminal, but in the middle of sentences are not. This is not as easy as you might think: you need to look at the left context and the right context, specifically is the RHS capitalized and again consider capitalized words like 'I' and abbreviations. Here's an example proving ambiguity which : She asked me to stay... I left an hour later. (Was that one sentence or two? Impossible to determine)
You may also want to write a few patterns to detect and reject miscellaneous non-sentence-ending uses of punctuation: emoticons :-), ASCII art, spaced ellipses . . . and other stuff esp. Twitter. (Making that adaptive is even harder). How do we tell if #midnight is a Twitter user, the show on Comedy Central, text shorthand, or simply unwanted/junk/typo punctuation? Seriously non-trivial.
After you handle all those negative cases, you could arbitrarily say that any isolated period followed by whitespace is likely to be an end of sentence. (Ultimately, if you really want to buy extra accuracy, you end up writing your own probabilistic sentence-tokenizer which uses weights, and training it on a specific corpus(e.g. legal texts, broadcast media, StackOverflow, Twitter, forums comments etc.)) Then you have to manually review exemplars and training errors. See Manning and Jurafsky book or Coursera course [a].
Ultimately you get as much correctness as you are prepared to pay for.
All of the above is clearly specific to the English-language/ abbreviations, US number/time/date formats. If you want to make it country- and language-independent, that's a bigger proposition, you'll need corpora, native-speaking people to label and QA it all, etc.
All of the above is still only ASCII, which is practically speaking only 96 characters. Allow the input to be Unicode, and things get harder still (and the training-set necessarily must be either much bigger or much sparser)
In the simple (deterministic) case, function isEndOfSentence(leftContext, rightContext) would return boolean, but in the more general sense, it's probabilistic: it returns a float 0.0-1.0 (confidence level that that particular '.' is a sentence end).
References: [a] Coursera video: "Basic Text Processing 2-5 - Sentence Segmentation - Stanford NLP - Professor Dan Jurafsky & Chris Manning" [UPDATE: an unofficial version used to be on YouTube, was taken down]
Try to split the input according to the spaces rather than a dot or ?, if you do like this then the dot or ? won't be printed in the final result.
>>> import re
>>> s = """Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."""
>>> m = re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', s)
>>> for i in m:
... print i
...
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
sent = re.split('(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)',text)
for s in sent:
print s
Here the regex used is : (?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)
First block: (?<!\w\.\w.) : this pattern searches in a negative feedback loop (?<!) for all words (\w) followed by fullstop (\.) , followed by other words (\.)
Second block: (?<![A-Z][a-z]\.): this pattern searches in a negative feedback loop for anything starting with uppercase alphabets ([A-Z]), followed by lower case alphabets ([a-z]) till a dot (\.) is found.
Third block: (?<=\.|\?): this pattern searches in a feedback loop of dot (\.) OR question mark (\?)
Fourth block: (\s|[A-Z].*): this pattern searches after the dot OR question mark from the third block. It searches for blank space (\s) OR any sequence of characters starting with a upper case alphabet ([A-Z].*).
This block is important to split if the input is as
Hello world.Hi I am here today.
i.e. if there is space or no space after the dot.
Naive approach for proper english sentences not starting with non-alphas and not containing quoted parts of speech:
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
EndPunctuation = re.compile(r'([\.\?\!]\s+)')
NonEndings = re.compile(r'(?:Mrs?|Jr|i\.e)\.\s*$')
parts = EndPunctuation.split(text)
sentence = []
for part in parts:
if len(part) and len(sentence) and EndPunctuation.match(sentence[-1]) and not NonEndings.search(''.join(sentence)):
print(''.join(sentence))
sentence = []
if len(part):
sentence.append(part)
if len(sentence):
print(''.join(sentence))
False positive splitting may be reduced by extending NonEndings a bit. Other cases will require additional code. Handling typos in a sensible way will prove difficult with this approach.
You will never reach perfection with this approach. But depending on the task it might just work "enough"...
I'm not great at regular expressions, but a simpler version, "brute force" actually, of above is
sentence = re.compile("([\'\"][A-Z]|([A-Z][a-z]*\. )|[A-Z])(([a-z]*\.[a-z]*\.)|([A-Za-z0-9]*\.[A-Za-z0-9])|([A-Z][a-z]*\. [A-Za-z]*)|[^\.?]|[A-Za-z])*[\.?]")
which means
start acceptable units are '[A-Z] or "[A-Z]
please note, most regular expressions are greedy so the order is very important when we do |(or). That's, why I have written i.e. regular expression first, then is come forms like Inc.
Try this:
(?<!\b(?:[A-Z][a-z]|\d|[i.e]))\.(?!\b(?:com|\d+)\b)
I wrote this taking into consideration smci's comments above. It is a middle-of-the-road approach that doesn't require external libraries and doesn't use regex. It allows you to provide a list of abbreviations and accounts for sentences ended by terminators in wrappers, such as a period and quote: [.", ?', .)].
abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior', 'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']
def find_sentences(paragraph):
end = True
sentences = []
while end > -1:
end = find_sentence_end(paragraph)
if end > -1:
sentences.append(paragraph[end:].strip())
paragraph = paragraph[:end]
sentences.append(paragraph)
sentences.reverse()
return sentences
def find_sentence_end(paragraph):
[possible_endings, contraction_locations] = [[], []]
contractions = abbreviations.keys()
sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
for sentence_terminator in sentence_terminators:
t_indices = list(find_all(paragraph, sentence_terminator))
possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
for contraction in contractions:
c_indices = list(find_all(paragraph, contraction))
contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
max_end_start = max([pe[0] for pe in possible_endings])
possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
end = (-1 if not len(possible_endings) else max(possible_endings))
return end
def find_all(a_str, sub):
start = 0
while True:
start = a_str.find(sub, start)
if start == -1:
return
yield start
start += len(sub)
I used Karl's find_all function from this entry: Find all occurrences of a substring in Python
My example is based on the example of Ali, adapted to Brazilian Portuguese. Thanks Ali.
ABREVIACOES = ['sra?s?', 'exm[ao]s?', 'ns?', 'nos?', 'doc', 'ac', 'publ', 'ex', 'lv', 'vlr?', 'vls?',
'exmo(a)', 'ilmo(a)', 'av', 'of', 'min', 'livr?', 'co?ls?', 'univ', 'resp', 'cli', 'lb',
'dra?s?', '[a-z]+r\(as?\)', 'ed', 'pa?g', 'cod', 'prof', 'op', 'plan', 'edf?', 'func', 'ch',
'arts?', 'artigs?', 'artg', 'pars?', 'rel', 'tel', 'res', '[a-z]', 'vls?', 'gab', 'bel',
'ilm[oa]', 'parc', 'proc', 'adv', 'vols?', 'cels?', 'pp', 'ex[ao]', 'eg', 'pl', 'ref',
'[0-9]+', 'reg', 'f[ilí]s?', 'inc', 'par', 'alin', 'fts', 'publ?', 'ex', 'v. em', 'v.rev']
ABREVIACOES_RGX = re.compile(r'(?:{})\.\s*$'.format('|\s'.join(ABREVIACOES)), re.IGNORECASE)
def sentencas(texto, min_len=5):
# baseado em https://stackoverflow.com/questions/25735644/python-regex-for-splitting-text-into-sentences-sentence-tokenizing
texto = re.sub(r'\s\s+', ' ', texto)
EndPunctuation = re.compile(r'([\.\?\!]\s+)')
# print(NonEndings)
parts = EndPunctuation.split(texto)
sentencas = []
sentence = []
for part in parts:
txt_sent = ''.join(sentence)
q_len = len(txt_sent)
if len(part) and len(sentence) and q_len >= min_len and \
EndPunctuation.match(sentence[-1]) and \
not ABREVIACOES_RGX.search(txt_sent):
sentencas.append(txt_sent)
sentence = []
if len(part):
sentence.append(part)
if sentence:
sentencas.append(''.join(sentence))
return sentencas
Full code in: https://github.com/luizanisio/comparador_elastic
If you want to break up sentences at 3 periods (not sure if this is what you want) you can use this regular expresion:
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r'\.{3}', text)
for stuff in sentences:
print(stuff)

Categories

Resources