how to better parse random tracklistings - python

I am interested in parsing tracklistings in a variety of formats, containing lines such as:
artist - title
artist-title
artist / title
artist - "title"
1. artist - title
0:00 - artist - title
05 artist - title 12:20
artist - title [record label]
These are text files which generally contain one tracklist, but may also contain other text which I don't want to parse, so ideally the regex needs to be strict enough not to match lines which aren't tracklist entries, although really this is probably a question of balance.
I am having some success with the following regex:
simple = re.compile(r"""
    ^
    (?P<time>\d?\d:\d\d)?    # track time, e.g. 0:00 or 00:00
    (
        (?P<number>\d{1,2})  # track number, e.g. 0 or 01
        [^\w]                # followed by a non-word character
    )?
    [-.)]?                   # optional separator after the number
    "?
    (?P<artist>[^"##]+)      # artist: anything except a quote or #
    "?
    \s[-/\u2013]\s           # separator surrounded by spaces: -, / or en dash
    "?
    (?P<title>[^"##]+?)      # title, not greedy
    "?
    (?P<label>\[\w+\])?      # label, e.g. [somelabel]
    (//|&\#13;)?             # strip some weird endings, e.g. an HTML-encoded carriage return
    $
    """, re.VERBOSE)
However, it's a bit horrible; I only started learning regex very recently. It has problems with lines like these:
an artist-a title # couldn't find ' - '
2 Croozin' - 2 Pumpin' # mistakes 2 as track number
05 artist - title 12:20 # doesn't work at all
In the case of 2 Croozin' - 2 Pumpin', the only way of telling that 2 isn't a track number is to take into account the surrounding context, i.e. look at the other tracks. (I forgot to mention this - these tracks are usually part of a tracklist)
So my question is, how can I improve this in general? Some ideas I've had are:
Use several regexes, starting with very specific ones and falling back to less specific ones until one parses the line properly.
Dump regex and use a proper parser such as pyparsing or parsley, which might be able to make better use of surrounding context; however, I know absolutely nothing about parsing.
Use lookahead/lookbehind in a multiline regex to look at previous/next lines.
Use separate regexes to extract the time, track number, artist and title.
Give up and do something less pointless.
I can validate that it has parsed properly (to some degree) by doing things such as making sure artists and titles are all different, tracks are in order, times are sensible, and possibly even checking that artists/titles/labels actually exist.

At best, you are dealing with a context-sensitive grammar which moves you out of the realm of what regexps can handle alone and into parsing.
Even if your parser is implemented as regexps and a pile of heuristics, it is still a parser and techniques from parsing will be valuable. Some languages have a chicken-and-egg problem: I'd like to call "The Artist Formerly Known as the Artist Formerly Known as Prince" an artist and not a track title, but until I see it a second time, I don't have the context to make that determination.
To amplify @JonClements' comment, if the files do contain internal metadata there are plenty of tools to extract and manipulate that information. Even if internal metadata just increases the probability that "A Question of Balance" is an album title, you'll need that information.
Steal as many design approaches as you can: look for open source tag manipulators (e.g. EasyTag) and see how they do it. While you are learning, you might just find a tool that does your job for you.
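For what it's worth, here is a minimal sketch of the "cascade of regexes plus a validation pass" idea from the question; the two patterns and the sanity checks below are placeholder assumptions, not tuned ones:

import re

# two placeholder patterns, ordered from most to least specific
PATTERNS = [
    re.compile(r"^(?P<number>\d{1,2})[.)]?\s+(?P<artist>.+?)\s[-\u2013]\s(?P<title>.+)$"),
    re.compile(r"^(?P<artist>.+?)\s[-/\u2013]\s(?P<title>.+)$"),
]

def parse_line(line):
    """Return the first match, trying the most specific pattern first."""
    for pattern in PATTERNS:
        m = pattern.match(line.strip())
        if m:
            return m.groupdict()
    return None

def looks_like_tracklist(tracks):
    """Cheap validation: numbered tracks in order, no duplicate titles."""
    numbers = [int(t["number"]) for t in tracks if t.get("number")]
    titles = [t["title"] for t in tracks]
    return numbers == sorted(numbers) and len(set(titles)) == len(titles)

lines = ["1. Some Artist - Some Title", "2. Another Artist - Another Title"]
tracks = [t for t in map(parse_line, lines) if t]
if not looks_like_tracklist(tracks):
    pass  # e.g. re-parse with only the less specific patterns, or flag for manual review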

Related

How to place comma accurately with regular expression?

I am new to Python regular expressions. Here is my text:
'Condition: Remanufactured Grade: Commercial Warranty: 1 Year Parts & On-Site Labor w/Ext. Ships: Fully Assembled Processing Time: Ships from our Warehouse in 2-4 Weeks
I want to add commas using a Python regular expression so that the result looks like this:
'Condition: Remanufactured ,Grade: Commercial ,Warranty: 1 Year Parts & On-Site Labor w/Ext. Ships: Fully Assembled ,Processing Time: Ships from our Warehouse in 2-4 Weeks
Basically I want to target the words that end with a colon and add a comma before each one, starting from the second such field.
Honestly I wouldn't do this with a regular expression, in large part because of your "Processing Time" example, which makes it look like you've got a problem that can only be solved by knowing the specific field names to expect.
Code can't magically know that "Processing " is more tightly bound to "Time" than to "Fully Assembled".
So I see basically three solution shapes, and I'm just going to focus on the first one because I think it's the best one, but I'll briefly summarize all three:
Use a list of known field names which make the comma insertions harder, and replace those strings just for the duration of your comma-insertion logic. This frees your comma-insertion logic to be simpler and regular.
Get a list of all known field names, and look for them specifically to insert commas in front of them. This is probably worse but if the list of names doesn't change and isn't expected to change, and most names are tricky, then this could be cleaner.
Throw a modern language modeling text prediction AI at the problem: given an ambiguous string like "...: Fully Assembled Processing Time: ..." you could prompt your AI with "Assembled" and see how much confidence it gives to the next tokens being "Processing Time", then prompt it with "Processing" and see how much confidence it gives to the next tokens being "Time", and pick the one it has more confidence in as your field name. I think this is overkill unless you get so few guarantees about your input that you have to treat it like a natural language processing problem.
So I would do option 1, and the general idea looks something like this:
tricky_fields = {
    "Processing Time": "ProcessingTime",
    # add others here as needed
}

for proper_name, easier_name in tricky_fields.items():
    my_text = my_text.replace(f" {proper_name}: ", f" {easier_name}: ")

# do the actual comma insertions here

for proper_name, easier_name in tricky_fields.items():
    my_text = my_text.replace(f" {easier_name}: ", f" {proper_name}: ")
Notice that I put spaces and the colon around the field names in the replacements. If you know that your fields are always separated by spaces and colons like that, this is better practice because it's less likely to automatically replace something you didn't mean to replace, and thus less likely to be a source of bugs later.
Then the comma insertion itself becomes an easy regex, as long as none of your replacement names contain spaces or colons, because your target is just [^ :]+:. But regex is a cryptic micro-language which is not optimized for human readability, and it doesn't need to be a regex: you can split on ":", then for each piece split off the last word and rejoin with " ," (or ", "), and then rejoin the whole thing:
def insert_commas(text):
    parts = text.split(":")
    new_parts = []
    # in every piece that is followed by another colon, the last word is the
    # next field name, so insert " ," in front of it; the final piece is left alone
    for part in parts[:-1]:
        if " " in part:
            most, last = part.rsplit(" ", 1)
            part = most + " ," + last
        new_parts.append(part)
    new_parts.append(parts[-1])
    return ":".join(new_parts)
But if you really wanted to use a regex, here's a simple one that does what you want:
import re

def insert_commas(text):
    return re.sub(' ([^ :]+: )', r' ,\1', text)
Although in real production code I'd improve the tricky field replacements by factoring the two replacements out into one separate testable function and use something like bidict instead of a regular dictionary, like this:
from bidict import bidict

tricky_fields = bidict({
    "Processing Time": "ProcessingTime",
    # add others here as needed
})

def replace_fields(names, text):
    for old_name, new_name in names.items():
        text = text.replace(f" {old_name}: ", f" {new_name}: ")
    return text
Using a bidict and a dedicated function is clearer, more self-descriptive, more maintainable, less code to keep consistent, and easier to test/verify, and it even gets you some runtime safety against accidentally mapping two tricky field names to the same replacement.
So composing those two previous code blocks together:
text = replace_fields(tricky_fields, text)
text = insert_commas(text)
text = replace_fields(tricky_fields.inverse, text)
Of course, if you don't need the second replacement to undo the initial one, you can just leave the text as-is after the comma insertion is done. Either way, this decouples the comma-insertion problem from the problem of tricky field names that make it harder.
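As a quick sanity check that the masking round-trips cleanly (assuming the bidict package is installed and replace_fields is defined as above), something like this should hold:

sample = "Labor w/Ext. Ships: Fully Assembled Processing Time: Ships from our Warehouse"

masked = replace_fields(tricky_fields, sample)
assert " ProcessingTime: " in masked
assert replace_fields(tricky_fields.inverse, masked) == sample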

Regex for multiple lines separated with "return" and multiple unnecessary spaces

I was trying to parse together a script for a movie into a dataset containing two columns 'speaker_name' and 'line_spoken'. I don't have any issue with the Python part of the problem but parsing the script is the problem.
The schema of the script goes like this (it is shown as a screenshot in the original question). Copied and pasted into a .txt file, it is something like this:
ARTHUR
Yeah. I mean, that's just--
SOCIAL WORKER
Does my reading it upset you?
He leans in.
ARTHUR
No. I just,-- some of it's
personal. You know?
SOCIAL WORKER
I understand. I just want to make
sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
In the above case, the regex filtering should return the speaker name and the dialogue, and not what is happening in action lines like the last one: "slides his journal back". The dialogue often runs to more than two lines, so please do not provide solutions hard-coded for two lines only. I think I am approaching this problem from just one direction; some other method of filtering could also work.
I have worked with scripts that are colon-separated and I don't have any problem parsing those. But in this case, I am getting no specific endpoints to end the search at. It would be a great help if the answer you give has 2 groups, one with name, the other with the dialogue. Like in the case of colon-separated, my regex was:
pattern = r'(^[a-zA-Z]+):(.+)'
Also, if possible, please try and explain why you used that certain regex. It will be a learning experience for me.
Use https://www.onlineocr.net/ to convert the PDF to text. It immediately shows the outcome, with names on the same line as the dialogue, which allows for simple processing:
ARTHUR Yeah. I mean, that's just--
SOCIAL WORKER Does my reading it upset you?
He leans in.
ARTHUR No. I just,-- some of its personal. You know me ?
SOCIAL WORKER I understand. I just want to make sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
Not sure whether it will work for longer dialogs.
Another solution is to extract the data from the text file that you can download by clicking the "download output file" link. That file is formatted differently: 10 leading spaces indicate the dialog, and 5 leading spaces indicate the name, at least for your sample screenshot.
The regex is
r" (.+)(\n( [^ ].+\n)+)"
https://regex101.com/r/FQk8uH/1
It puts into group 1 whatever starts with ten spaces, and whatever starts with exactly five spaces into group 2:
the subexpression " {5}[^ ].+\n" denotes a line where the first five symbols are spaces, the sixth symbol is anything but a space, and the rest of the symbols until the end of the line are arbitrary. Since dialogs tend to be multiline, that subexpression is followed by a plus.
You will have to delete extra white space from dialogue with additional code and/or regex.
If the number of spaces varies a bit (say 4-6 and 7-14 respectively) but the two levels stay distinct, the regex needs to be adjusted by using the variable repetition operator (curly braces, e.g. {4,6}) or an optional space ( ?):
r" {7,14}(.+)(\n( {4,6}[^ ].+\n)+)"
The last idea is to use a preexisting list of the names in the play to match them, e.g. (SOCIAL WORKER|JOHN|MARY|ARTHUR). The https://www.onlineocr.net/ website could still be used to help spot and delete actions.
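If you go with that downloaded text file, a rough Python sketch of pulling name/dialogue pairs out of it might look like this; the exact indent widths (10 for the first-line group, 5 for the repeated group) are assumptions taken from the description above, so swap or tune them to match your file:

import re

# assumed layout: one indent level for the speaker name line,
# a different indent level for the dialogue lines that follow it
pattern = re.compile(r"^ {10}(\S.*)\n((?:^ {5}\S.*\n?)+)", re.MULTILINE)

with open("script.txt", encoding="utf8") as f:
    text = f.read()

rows = []
for name, block in pattern.findall(text):
    # collapse the wrapped, indented dialogue lines into one string
    spoken = " ".join(line.strip() for line in block.splitlines())
    rows.append((name.strip(), spoken))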
In Python, you can use DOTALL:
re_pattern = re.compile(r'(\b[A-Z ]{3,}(?=\n))\n*(.*?)\n*(?=\b[A-Z ]{3,}\n|$)', re.DOTALL)
print(re.findall(re_pattern, mystr))
\b[A-Z ]{3,}(?=\n) matches speaker name.
\b matches a word boundary
[A-Z ]{3,} matches three or more upper case letters or spaces. (This means the regex won't recognize speaker names with fewer than three characters. I did this to avoid false positives in special cases, but you might want to change it. Also check what kind of characters might occur in speaker names (dots, hyphens, lower case...).)
(?=\n) is a lookahead ensuring the speaker name is directly followed by a new line (avoids false positives if a similar expression appears in a spoken line)
\n* matches newlines
(.*?) matches everything (including new lines thanks to DOTALL) until the next part of the expression (? makes it lazy instead of greedy)
\n* matches newlines
(?=\b[A-Z ]{3,}\n|$) is a lookahead, i.e. a non-capturing expression ensuring that what follows is either a speaker name or the end of your string
Output:
[('ARTHUR', "Yeah. I mean, that's just--"), ('SOCIAL WORKER', 'Does my reading it upset you?\n\nHe leans in.'), ('ARTHUR', "No. I just,-- some of it's\n\npersonal. You know?"), ('SOCIAL WORKER', "I understand. I just want to make\n\nsure you're keeping up with it.\n\nShe slides his journal back to him. He holds it in his lap.")]
You'll have to adjust formatting if you want to remove actions from the result though.

To Split text based on words using python code

I have a long text like the one below. I need to split based on some words say ("In","On","These")
Below is sample data:
On the other hand, we denounce with righteous indignation and dislike men who are so beguiled and demoralized by the charms of pleasure of the moment, so blinded by desire, that they cannot foresee the pain and trouble that are bound to ensue; and equal blame belongs to those who fail in their duty through weakness of will, which is the same as saying through shrinking from toil and pain. These cases are perfectly simple and easy to distinguish. In a free hour, when our power of choice is untrammelled and when nothing prevents our being able to do what we like best, every pleasure is to be welcomed and every pain avoided. But in certain circumstances and owing to the claims of duty or the obligations of business it will frequently occur that pleasures have to be repudiated and annoyances accepted. The wise man therefore always holds in these matters to this principle of selection: he rejects pleasures to secure other greater pleasures, or else he endures pains to avoid worse pains.
Can this problem be solved with code, as I have 1000 rows in a CSV file?
As per my comment, I think a good option would be to use a regular expression with the pattern:
re.split(r'(?<!^)\b(?=(?:On|In|These)\b)', YourStringVariable)
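Since the question mentions 1000 rows in a CSV file, a minimal sketch of applying that pattern row by row might look like this; the file name and the assumption that the text sits in the first column are placeholders:

import csv
import re

# split before the whole words "On", "In", "These" (but not at the very start);
# splitting on a zero-width match like this requires Python 3.7+
splitter = re.compile(r'(?<!^)\b(?=(?:On|In|These)\b)')

with open("input.csv", newline='', encoding="utf8") as f:
    for row in csv.reader(f):
        text = row[0]  # assumes the long text is in the first column
        parts = [p.strip() for p in splitter.split(text)]
        print(parts)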
Yes, this can be done in Python. You can load the text into a variable and use the built-in split method of strings. For example:
with open(filename, 'r') as file:
    lines = file.read()

lines = lines.split('These')
# lines is now a list of strings, split wherever the string 'These' was encountered
To find whole words that are not part of larger words, I like using the regular expression:
[^\w]word[^\w]
Sample python code, assuming the text is in a variable named text:
import re
exp = re.compile(r'[^\w]in[^\w]', flags=re.IGNORECASE)
all_occurrences = list(exp.finditer(text))
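The same whole-word matches can be turned into an actual split by cutting the text at the match positions; a rough sketch (extending the pattern to all three words from the question, and still assuming the text is in a variable named text) could be:

import re

exp = re.compile(r'[^\w](?:In|On|These)[^\w]')

# each match starts at the non-word character before the word,
# so +1 makes the cut right at the word itself
starts = [m.start() + 1 for m in exp.finditer(text)]
bounds = [0] + starts + [len(text)]
chunks = [text[i:j].strip() for i, j in zip(bounds, bounds[1:])]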

How to remove occurrences of \*** in a string

I'm parsing through a pdf file whose content I converted to strings, and there are many occurrences of \*** (* meaning any symbol) happening inside words. For example:
transaction, a middle ground has seemed workable\xe2\x80\x94norms explicitly articulated, backed by sanctions of the relevant professional associations
Using text.replace("\\***","") obviously does not work and so I was looking into using re.sub().
I'm having trouble with the syntax (regular expressions) to put into the arguments and was hoping for some help with it.
how about text.decode("utf8") ... that's what i think you actually want to do
or you could strip them out with
text.decode("ascii","ignore")
(in python 3 you might need to use codecs.decode(text,"ascii","ignore") (not entirely sure off hand))
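for what it's worth, in python 3 the same idea looks roughly like this, assuming the pdf content is still a bytes object (if it is already a str, something like text.encode("ascii", "ignore").decode() does the same thing):

raw = b"workable\xe2\x80\x94norms explicitly articulated"

# decode as ASCII and silently drop anything that is not plain ASCII
print(raw.decode("ascii", "ignore"))
# -> workablenorms explicitly articulated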
you can use ^ (negation) to filter out any non-ASCII character:
import re
text = re.sub(r'[^\x00-\x7F]', ' ', text)
result will be
'transaction, a middle ground has seemed workablenorms explicitly articulated, backed by sanctions of the relevant professional associations'

Problem with eastern european characters when scraping data from the European Parliament Website

EDIT: thanks a lot for all the answers and points raised. As a novice I am a bit overwhelmed, but it is great motivation to continue learning Python!!
I am trying to scrape a lot of data from the European Parliament website for a research project. The first step is to create a list of all parliamentarians; however, due to the many Eastern European names and the accents they use, I get a lot of missing entries. Here is an example of what is giving me trouble (notice the accents at the end of the family name):
<td class="listcontentlight_left">
ANDRIKIENĖ, Laima Liucija
<br/>
Group of the European People's Party (Christian Democrats)
<br/>
</td>
So far I have been using PyParser and the following code:
#parser_names
name = Word(alphanums + alphas8bit)
begin, end = map(Suppress, "><")
names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end
for name in names.searchString(page):
    print(name)
However this does not catch the name from the html above. Any advice in how to proceed?
Best, Thomas
P.S: Here is all the code i have so far:
# -*- coding: utf-8 -*-
import urllib.request
from pyparsing_py3 import *
page = urllib.request.urlopen("http://www.europarl.europa.eu/members/expert/alphaOrder.do?letter=B&language=EN")
page = page.read().decode("utf8")
#parser_names
name = Word(alphanums + alphas8bit)
begin, end = map(Suppress, "><")
names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end
for name in names.searchString(page):
    print(name)
I was able to show 31 names starting with A with code:
extended_chars = srange(r"[\0x80-\0x7FF]")
special_chars = " -'"
name = Word(alphanums + alphas8bit + extended_chars + special_chars)
As John noticed, you need more unicode characters (extended_chars), and some names contain a hyphen etc. (special_chars). Count how many names you received and check whether the page has the same count as I got for 'A'.
The range 0x80-0x7FF encodes the 2-byte sequences in UTF-8, which covers probably all European languages. Among the pyparsing examples there is greetingInGreek.py for Greek, and another example for parsing Korean text.
If 2 bytes are not enough then try:
extended_chars = ''.join(chr(c) for c in range(127, 65536))
Are you sure that writing your own parser to pick bits out of HTML is the best option? You might find it easier to use a dedicated HTML parser such as Beautiful Soup, which lets you specify the location you're interested in using the DOM, so pulling the text from the first link inside a table cell with class "listcontentlight_left" is quite easy:
from bs4 import BeautifulSoup

soup = BeautifulSoup(htmlDocument)
cells = soup.findAll("td", "listcontentlight_left")
for cell in cells:
    print(cell.a.string)
Looks like you've got some kind of encoding problem if you are getting western European names OK (they have lots of accents etc also!). Show us all of your code plus the URL of a typical page that you are trying to scrape and has the East-only problem. Displaying the piece of html that you have is not much use; we have no idea what transformations it has been through; at the very least, use the result of the repr() function.
Update: The offending character in that MEP's name is U+0116 (LATIN CAPITAL LETTER E WITH DOT ABOVE). So it is not included in pyparsing's "alphanums + alphas8bit". The Westies (latin-1) will all fit in what you've got already. I know little about pyparsing; you'll need to find a pyparsing expression that includes ALL unicode alphabetics ... not just Latin-n, in case they start using Cyrillic for the Bulgarian MEPs instead of the current transcription into ASCII :-)
Other observations:
(1) alphaNUMs ... digits in a name?
(2) names may include apostrophe and hyphen e.g. O'Reilly, Foughbarre-Smith
at first i thought i’d recommend to try and build a custom letter class from python’s unicodedata.category method, which, when given a character, will tell you what class that codepoint is assigned to according to the unicode character categories; this would tell you whether a codepoint is e.g. an uppercase or lowercase letter, a digit or whatever.
on second thought and reminiscent of an answer i gave the other day, let me suggest another approach. there are many implicit assumptions we have to get rid of when going from national to global; one of them is certainly that ‘a character equals a byte’, and another is that ‘a person’s name is made up of letters, and i know what the possible letters are’. unicode is vast, and the eu currently has 23 official languages written in three alphabets; exactly what characters are used for each language will take quite a bit of work to figure out. greek uses those fancy apostrophes and is distributed across at least 367 codepoints; bulgarian uses the cyrillic alphabet with a slew of extra characters unique to the language.
so why not simply turn the tables and take advantage of the larger context those names appear in? i browsed through some sample data and it looks like the general pattern for MEP names is LASTNAME, Firstname with (1) the last name in (almost) upper case; (2) a comma and a space; (3) the given names in ordinary case. this even holds in more ‘deviant’ examples like GERINGER de OEDENBERG, Lidia Joanna, GALLAGHER, Pat the Cope (wow), McGUINNESS, Mairead. it would take some work to recover the ordinary case from the last names (maybe leave all the lower case letters in place, and lower-case any capital letters that are preceded by another capital letter), but to extract the names is, in fact, simple:
fullname := lastname ", " firstname
lastname := character+
firstname := character+
that’s right—since the EUP was so nice to present names enclosed in an HTML tag, you already know the maximum extent of it, so you can just cut out that maximum extent and split it up in two parts. as i see it, all you have to look for is the first occurrence of a sequence of comma, space—everything before that is the last, anything behind that the given names of the person. i call that the ‘silhouette approach’ since it’s like looking at the negative, the outline, rather than the positive, what the form is made up from.
as has been noted earlier, some names use hyphens; now there are several codepoints in unicode that look like hyphens. let’s hope the typists over there in brussels were consistent in their usage. ah, and there are many surnames using apostrophes, like d'Hondt, d'Alambert. happy hunting: possible incarnations include U+0060, U+00B4, U+0027, U+02BC and a fair number of look-alikes. most of these codepoints would be ‘wrong’ to use in surnames, but when was the last time you saw those used correctly?
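a quick sketch of that silhouette approach, assuming the cell text has already been fished out of the markup (say with beautiful soup), could look like this:

def split_mep_name(fullname):
    # everything before the first ", " is the (mostly upper-case) last name,
    # everything after it the given names
    lastname, _, firstname = fullname.partition(", ")
    return lastname.strip(), firstname.strip()

print(split_mep_name("GERINGER de OEDENBERG, Lidia Joanna"))
# ('GERINGER de OEDENBERG', 'Lidia Joanna')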
i somewhat distrust that alphanums + alphas8bit + extended_chars + special_chars pattern; at least that alphanums part is a tad bogey as it seems to include digits (which ones? unicode defines a few hundred digit characters), and that alphas8bit thingy does reek of a solvent made for another time. unicode conceptually works in a 32bit space. what’s 8bit intended to mean? letters found in codepage 852? c’mon this is 2010.
ah, and looking back i see you seem to be parsing the HTML with pyparsing. don’t do that. use e.g. beautiful soup for sorting out the markup; it’s quite good at dealing even with faulty HTML (most HTML in the wild does not validate), and once you get your head around its admittedly wonderlandish API (all you ever need is probably the find() method) it will be simple to fish out exactly those snippets of text you’re looking for.
Even though BeautifulSoup is the de facto standard for HTML parsing, pyparsing has some alternative approaches that lend themselves to HTML too (certainly a leg up over brute force reg exps). One function in particular is makeHTMLTags, which takes a single string argument (the base tag), and returns a 2-tuple of pyparsing expressions, one for the opening tag and one for the closing tag. Note that the opening tag expression does far more than just return the equivalent of "<"+tag+">". It also:
handles upper/lower casing of the tag itself
handles embedded attributes (returning them as named results)
handles attribute names that have namespaces
handles attribute values in single, double, or no quotes
handles empty tags, as indicated by a trailing '/' before the closing '>'
can be filtered for specific attributes using the withAttribute parse action
So instead of trying to match the specific name content, I suggest you try matching the surrounding <a> tag, and then accessing the title attribute. Something like this:
aTag, aEnd = makeHTMLTags("a")
for t, _, _ in aTag.scanString(page):
    if ";id=" in t.href:
        print(t.title)
Now you get whatever is in the title attribute, regardless of character set.
