Parse a very large text file with Python?

So, the file has about 57,000 book titles and author names, each with an ETEXT No. I am trying to parse the file to get only the ETEXT Nos.
The file looks like this:
TITLE and AUTHOR ETEXT NO.
Aspects of plant life; with special reference to the British flora,      56900
by Robert Lloyd Praeger
The Vicar of Morwenstow, by Sabine Baring-Gould 56899
[Subtitle: Being a Life of Robert Stephen Hawker, M.A.]
Raamatun tutkisteluja IV, mennessä Charles T. Russell 56898
[Subtitle: Harmagedonin taistelu]
[Language: Finnish]
Raamatun tutkisteluja III, mennessä Charles T. Russell 56897
[Subtitle: Tulkoon valtakuntasi]
[Language: Finnish]
Tom Thatcher's Fortune, by Horatio Alger, Jr. 56896
A Yankee Flier in the Far East, by Al Avery 56895
and George Rutherford Montgomery
[Illustrator: Paul Laune]
Nancy Brandon's Mystery, by Lillian Garis 56894
Nervous Ills, by Boris Sidis 56893
[Subtitle: Their Cause and Cure]
Pensées sans langage, par Francis Picabia 56892
[Language: French]
Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss 56891
[Subtitle: A picture of Judaism, in the century
which preceded the advent of our Savior]
Fra Tommaso Campanella, Vol. 1, di Luigi Amabile 56890
[Subtitle: la sua congiura, i suoi processi e la sua pazzia]
[Language: Italian]
The Blue Star, by Fletcher Pratt 56889
Importanza e risultati degli incrociamenti in avicoltura, 56888
di Teodoro Pascal
[Language: Italian]
And this is what I tried:
def search_by_etext():
    fhand = open('GUTINDEX.ALL')
    print("Search by ETEXT:")
    for line in fhand:
        if not line.startswith(" [") and not line.startswith("~"):
            if not line.startswith(" ") and not line.startswith("TITLE"):
                words = line.rstrip()
                words = line.lstrip()
                words = words[-7:]
                print(words)
search_by_etext()
Well, the code mostly works. However, for some lines it gives me part of the title or other things.
For example, the output contains 'decott', which is part of an author's name and shouldn't be there. It comes from this part of the file:
The Bashful Earthquake, by Oliver Herford                                56765
[Subtitle: and Other Fables and Verses]
The House of Orchids and Other Poems, by George Sterling                 56764
North Italian Folk, by Alice Vansittart Strettel Carr                    56763
 and Randolph Caldecott
[Subtitle: Sketches of Town and Country Life]
Wild Life in New Zealand. Part 1, Mammalia, by George M. Thomson 56762
[Subtitle: New Zealand Board of Science and Art, Manual No. 2]
Universal Brotherhood, Volume 13, No. 10, January 1899, by Various 56761
De drie steden: Lourdes, door Émile Zola 56760
[Language: Dutch]
Another example, from this part of the file:
Rhandensche Jongens, door Jan Lens 56702
[Illustrator: Tjeerd Bottema]
[Language: Dutch]
The Story of The Woman's Party, by Inez Haynes Irwin 56701
Mormon Doctrine Plain and Simple, by Charles W. Penrose 56700
[Subtitle: Or Leaves from the Tree of Life]
The Stone Axe of Burkamukk, by Mary Grant Bruce 56699
[Illustrator: J. Macfarlane]
The Latter-Day Prophet, by George Q. Cannon 56698
[Subtitle: History of Joseph Smith Written for Young People]
Here, 'Life]' shouldn't be there. Lines starting with a blank space have been filtered out with this:
if not line.startswith(" [") and not line.startswith("~"):
But I am still getting those stray values in my output.

Simple solution: regexps to the rescue !
import re
with open("etext.txt") as f:
    for line in f:
        match = re.search(r" (\d+)$", line.strip())
        if match:
            print(match.group(1))
The regular expression r" (\d+)$" matches "a space followed by one or more digits at the end of the string", and captures only the "one or more digits" group.
You can refine the regexp later: for example, if you know all etext codes are exactly 5 digits long, you can change it to r" (\d{5})$".
This works with the example text you posted. If it doesn't properly work on your own file then we need enough of the real data to find out what you really have.
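As a quick check of the approach above, here is a minimal sketch that runs the same regular expression over a few lines taken from the question. The function name extract_etext_numbers and the in-memory sample list are assumptions made for illustration; the real GUTINDEX.ALL file may contain cases (for instance a continuation line that ends in a year) that this does not handle.

```python
import re

# Same pattern as in the answer: a space, then digits, at the end of the line.
ETEXT_RE = re.compile(r" (\d+)$")

def extract_etext_numbers(lines):
    """Return the trailing number of every line that ends with one."""
    numbers = []
    for line in lines:
        match = ETEXT_RE.search(line.strip())
        if match:
            numbers.append(match.group(1))
    return numbers

# A few lines copied from the question (continuation lines are indented).
sample = [
    "Aspects of plant life; with special reference to the British flora,      56900",
    " by Robert Lloyd Praeger",
    "The Vicar of Morwenstow, by Sabine Baring-Gould 56899",
    " [Subtitle: Being a Life of Robert Stephen Hawker, M.A.]",
    "North Italian Folk, by Alice Vansittart Strettel Carr                    56763",
    " and Randolph Caldecott",
]

print(extract_etext_numbers(sample))
```

Because the match depends only on how a line ends, it does not matter whether continuation lines start with a space, a tab, or nothing at all.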

It could be that the extra lines that are not being filtered out start with whitespace other than a " " character, such as a tab. As a minimal change that might work, try filtering out lines that start with any whitespace rather than specifically a space.
To check for whitespace in general rather than a space character, you can use a regular expression, for example if not re.match(r'\s', line) and ..., or test the first character with str.isspace().
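A minimal sketch of that filter, assuming that entry lines never start with whitespace and that bracketed lines such as "[Subtitle: ...]" should also be skipped. The helper name is_entry_line is hypothetical, and the "[" check is an assumption added because the pasted sample shows bracketed lines without indentation.

```python
def is_entry_line(line):
    """True for lines that look like the first line of an index entry."""
    if not line.strip():
        return False          # blank line
    if line[0].isspace():
        return False          # continuation line: space, tab, or other whitespace
    # "~" and "TITLE" come from the question's own filter; "[" is an assumption.
    return not line.startswith(("[", "~", "TITLE"))

lines = [
    "North Italian Folk, by Alice Vansittart Strettel Carr                    56763\n",
    "\t and Randolph Caldecott\n",
    " [Subtitle: Sketches of Town and Country Life]\n",
    "TITLE and AUTHOR ETEXT NO.\n",
]

for line in lines:
    if is_entry_line(line):
        # The number is the last whitespace-separated field of the entry line.
        print(line.split()[-1])
```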

Related

BeautifulSoup, trying to extract text from anchor tags that contain author names

I am trying to scrape some data from this books site. I need to extract the title and the author(s). I was able to extract the titles without much trouble. However, I am having trouble extracting the authors when there is more than one, since they appear on the same line and belong to separate anchor tags within an h4 header.
<h4>
"5
. "
The Elements of Style
" by "
William Strunk, Jr
", "
E. B. White
</h4>
This is what I tried:
book_container = soup.find_all('li', class_='item pb-3 pt-3 border-bottom')
for container in book_container:
    # title
    title = container.h4.a.text
    titles.append(title)
    # author(s)
    author_s = container.h4.find_all('a')
    print('### SECOND FOR LOOP ###')
    for a in author_s:
        if a['href'].startswith('/authors/'):
            print(a.text)
I'd like to have two authors in a tuple.

You can extract all the <a> links under each <h4> (the tag that holds the title and authors). The first <a> tag is the title; the remaining <a> tags are the authors:
import requests
from bs4 import BeautifulSoup
url = 'https://thegreatestbooks.org/the-greatest-nonfiction-since/1900'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for item in soup.select('h4:has(>a)'):
    elements = [i.get_text(strip=True) for i in item.select('a')]
    title = elements[0]
    authors = elements[1:]
    print('{:<40} {}'.format(title, authors))
Prints:
The Diary of a Young Girl ['Anne Frank']
The Autobiography of Malcolm X ['Alex Haley']
Silent Spring ['Rachel Carson']
In Cold Blood ['Truman Capote']
The Elements of Style ['William Strunk, Jr', 'E. B. White']
The Double Helix: A Personal Account of the Discovery of the Structure of DNA ['James D. Watson']
Relativity ['Albert Einstein']
Look Homeward, Angel ['Thomas Wolfe']
Homage to Catalonia ['George Orwell']
Speak, Memory ['Vladimir Nabokov']
The General Theory of Employment, Interest and Money ['John Maynard Keynes']
The Second World War ['Winston Churchill']
The Education of Henry Adams ['Henry Adams']
Out of Africa ['Isak Dinesen']
The Structure of Scientific Revolutions ['Thomas Kuhn']
Dispatches ['Michael Herr']
The Gulag Archipelago ['Aleksandr Solzhenitsyn']
I Know Why the Caged Bird Sings ['Maya Angelou']
The Civil War ['Shelby Foote']
If This Is a Man ['Primo Levi']
Collected Essays of George Orwell ['George Orwell']
The Electric Kool-Aid Acid Test ['Tom Wolfe']
Civilization and Its Discontents ['Sigmund Freud']
The Death and Life of Great American Cities ['Jane Jacobs']
Selected Essays of T. S. Eliot ['T. S. Eliot']
A Room of One's Own ['Virginia Woolf']
The Right Stuff ['Tom Wolfe']
The Road to Serfdom ['Friedrich von Hayek']
R. E. Lee ['Douglas Southall Freeman']
The Varieties of Religious Experience ['Will James']
The Liberal Imagination ['Lionel Trilling']
Angela's Ashes: A Memoir ['Frank McCourt']
The Second Sex ['Simone de Beauvoir']
Mere Christianity ['C. S. Lewis']
Moveable Feast ['Ernest Hemingway']
The Autobiography of Alice B. Toklas ['Gertrude Stein']
The Origins of Totalitarianism ['Hannah Arendt']
Black Lamb and Grey Falcon ['Rebecca West']
Orthodoxy ['G. K. Chesterton']
Philosophical Investigations ['Ludwig Wittgenstein']
Night ['Elie Wiesel']
The Affluent Society ['John Kenneth Galbraith']
Mythology ['Edith Hamilton']
The Open Society ['Karl Popper']
The Color of Water: A Black Man's Tribute to His White Mother ['James McBride']
The Seven Storey Mountain ['Thomas Merton']
Hiroshima ['John Hersey']
Let Us Now Praise Famous Men ['James Agee']
Pragmatism ['Will James']
The Making of the Atomic Bomb ['Richard Rhodes']
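Since the question asks for the authors as a tuple, the same "first link is the title, the rest are authors" idea can be sketched without any network access. The example below deliberately swaps BeautifulSoup for the standard library's html.parser so that it runs on its own; the class name BookHeadingParser and the sample markup (including the href values) are assumptions for illustration, not the site's real HTML.

```python
from html.parser import HTMLParser

class BookHeadingParser(HTMLParser):
    """Collect (title, authors) pairs from the <a> tags inside each <h4>."""

    def __init__(self):
        super().__init__()
        self.books = []
        self._links = None      # link texts of the current <h4>, None outside one
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "h4":
            self._links = []
        elif tag == "a" and self._links is not None:
            self._in_link = True
            self._links.append("")

    def handle_data(self, data):
        if self._in_link:
            self._links[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag == "h4" and self._links is not None:
            if self._links:
                # First link is the title, the remaining links are the authors.
                title, *authors = (text.strip() for text in self._links)
                self.books.append((title, tuple(authors)))
            self._links = None

# Hypothetical markup modelled on the heading shown in the question.
html = (
    '<h4>5. <a href="/items/1">The Elements of Style</a> by '
    '<a href="/authors/1">William Strunk, Jr</a>, '
    '<a href="/authors/2">E. B. White</a></h4>'
)

parser = BookHeadingParser()
parser.feed(html)
print(parser.books)
```

With BeautifulSoup itself, the equivalent change to the answer's loop would be to wrap the author list in tuple(), i.e. tuple(elements[1:]).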

This might not be the most pythonic way, but it's a workaround.
newlist = []
for a in author_s:
    if a['href'].startswith('/authors/'):
        if len(author_s)>2:
            newlist.append(a.text)
            print(tuple(newlist))
        else:
            print(a.text)
I'm using the fact that author_s is a list whose length we can check: more than 2 entries means more than one author. (Alternatively, you could check for a newline in the printed text.)
You will also notice that the printed output has two tuples; always take the second one. Entries with a single author are printed unchanged. Since this page does not have several entries with two authors, I couldn't check for complications.
Output:
[The Elements of Style, William Strunk, Jr, E. B. White]
### SECOND FOR LOOP ###
('William Strunk, Jr',)
('William Strunk, Jr', 'E. B. White')

Regex in Python: How to match a word pattern, if not preceded by another word of variable length?

I would like to reconstruct full names from photo captions using regex in Python, by appending the last name to the first name in patterns like "FirstName1 and FirstName2 LastName". We can rely on names starting with a capital letter.
For example,
'John and Albert McDonald' becomes 'John McDonald' and 'Albert McDonald'
'Stephen Stewart, John and Albert Diamond' becomes 'John Diamond' and 'Albert Diamond'
I need to avoid matching patterns like 'Jay Smith and Albert Diamond', which would generate a non-existent name, 'Smith Diamond'.
The photo captions may or may not have more words before this pattern, for example, 'It was a great day hanging out with John and Stephen Diamond.'
This is the code I have so far:
import re
s = 'John and Albert McDonald'
so = re.search(r'([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?)', s)
if so:
    print(so.group(1) + ' ' + so.group(2).split()[1])
    print(so.group(2))
This returns 'John McDonald' and 'Albert McDonald', but 'Jay Smith and Albert Diamond' will result in a non-existent name 'Smith Diamond'.
An idea would be to check whether the pattern is preceded by a capitalized word, something like (?<![A-Z][a-z\-]+)\s([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?) but unfortunately negative lookbehind only works if we know the exact length of the preceding word, which I don't.
Could you please let me know how I can correct my regex expression? Or is there a better way to do what I want? Thanks!

As you can rely on names starting with a capital letter, then you could do something like:
((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)
Swapping your current pattern for this one should work with your current Python code. You do need to strip() the captured results, though.
For your examples and current code, this would yield:
| Input | First print | Second print |
| --- | --- | --- |
| John and Albert McDonald | John McDonald | Albert McDonald |
| Stephen Stewart, John and Albert Diamond | John Diamond | Albert Diamond |
| It was a great day hanging out with John and Stephen Diamond. | John Diamond | Stephen Diamond |
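A minimal sketch of how the pattern above might be wrapped in a function. The name expand_names and the rule "skip the match when the part before 'and' already has more than one capitalized word" are assumptions added here to cover the 'Jay Smith and Albert Diamond' case from the question; they are not part of the original answer.

```python
import re

# Pattern from the answer: capitalized words, "and", then capitalized words.
PATTERN = re.compile(r"((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)")

def expand_names(caption):
    """Return (first_full_name, second_full_name), or None if nothing applies."""
    match = PATTERN.search(caption)
    if not match:
        return None
    before = match.group(1).split()   # e.g. ['John'] or ['Jay', 'Smith']
    after = match.group(2).split()    # e.g. ['Albert', 'McDonald']
    if len(before) != 1 or len(after) < 2:
        # Either the first person already has a last name, or there is
        # no last name to borrow: leave the caption alone.
        return None
    last_name = after[-1]
    return before[0] + " " + last_name, " ".join(after)

for caption in [
    "John and Albert McDonald",
    "Stephen Stewart, John and Albert Diamond",
    "Jay Smith and Albert Diamond",
]:
    print(caption, "->", expand_names(caption))
```

Splitting the captured groups with split() also takes care of the trailing whitespace that the answer says must be stripped.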

reformat unstructured text into single line after removing punctuation

I have unstructured text that I want to convert into one line, with all punctuation marks removed.
For the punctuation marks I used the solution from "Best way to strip punctuation from a string in Python".
How can I reformat the unstructured text into one line using Python?
Example 1:
The Bourne Identity is a 2002 spy film loosely based on Robert
Ludlum's novel of the same name. It stars Matt Damon as Jason Bourne,
an amnesiac attempting to discover his true identity amidst a
clandestine conspiracy within the Central Intelligence Agency (CIA) to
track him down and arrest or kill him for inexplicably failing to
carry out an officially unsanctioned assassination and then failing to
report back in afterwards. Along the way he teams up with Marie,
played by Franka Potente, who assists him on the initial part of his
journey to learn about his past and regain his memories. The film also
stars Chris Cooper as Alexander Conklin, Clive Owen as The Professor,
Brian Cox as Ward Abbott, and Julia Stiles as Nicky Parsons.
The film was directed by Doug Liman and adapted for the screen by Tony
Gilroy and William Blake Herron from the novel of the same name
written by Robert Ludlum, who also produced the film alongside Frank
Marshall. Universal Studios released the film to theaters in the
United States on June 14, 2002 and it received a positive critical and
public reaction. The film was followed by a 2004 sequel, The Bourne
Supremacy, and a third part released in 2007 entitled The Bourne
Ultimatum.
Plot
Example 2:
12 (0) 0 4 (0) 38 (3) 0 3 (0) 0 1 (0)
Example 3:
Franklin Township is one of the eighteen townships of Monroe County, Ohio,
United States. The 2000 census found 453 people in the township, 367 of whom
lived in the unincorporated portions of the township.
Geography
Located in the western part of the county, it borders the following townships:
The village of Stafford lies in southwestern Franklin Township.
Name and history
It is one of twenty-one Franklin Townships statewide.
Government
The township is governed by a three-member board of trustees, who are elected in
November of odd-numbered years to a four-year term beginning on the following
January 1. Two are elected in the year after the presidential election and one
is elected in the year before it. There is also an elected township clerk, who
serves a four-year term beginning on April 1 of the year after the election,
which is held in November of the year before the presidential election.
Vacancies in the clerkship or on the board of trustees are filled by the
remaining trustees.
As you can see in the previous examples, the texts have different formats. How can I turn every text into one line?

This is pretty straightforward: other than the punctuation, you now also want to eliminate the line endings.
So, you can simply do:
import string
exclude = set(string.punctuation + "\n\t\r")
print(''.join(ch for ch in input_string if ch not in exclude))
input_string = """The Bourne Identity is a 2002 spy film loosely based on Robert Ludlum's novel of the same name. It stars Matt Damon as Jason Bourne, an amnesiac attempting to discover his true identity amidst a clandestine conspiracy within the Central Intelligence Agency (CIA) to track him down and arrest or kill him for inexplicably failing to carry out an officially unsanctioned assassination and then failing to report back in afterwards. Along the way he teams up with Marie, played by Franka Potente, who assists him on the initial part of his journey to learn about his past and regain his memories. The film also stars Chris Cooper as Alexander Conklin, Clive Owen as The Professor, Brian Cox as Ward Abbott, and Julia Stiles as Nicky Parsons.
The film was directed by Doug Liman and adapted for the screen by Tony Gilroy and William Blake Herron from the novel of the same name written by Robert Ludlum, who also produced the film alongside Frank Marshall. Universal Studios released the film to theaters in the United States on June 14, 2002 and it received a positive critical and public reaction. The film was followed by a 2004 sequel, The Bourne Supremacy, and a third part released in 2007 entitled The Bourne Ultimatum."""
>>> print(''.join(ch for ch in input_string if ch not in exclude))
The Bourne Identity is a 2002 spy film loosely based on Robert Ludlums novel of the same name It stars Matt Damon as Jason Bourne an amnesiac attempting to discover his true identity amidst a clandestine conspiracy within the Central Intelligence Agency CIA to track him down and arrest or kill him for inexplicably failing to carry out an officially unsanctioned assassination and then failing to report back in afterwards Along the way he teams up with Marie played by Franka Potente who assists him on the initial part of his journey to learn about his past and regain his memories The film also stars Chris Cooper as Alexander Conklin Clive Owen as The Professor Brian Cox as Ward Abbott and Julia Stiles as Nicky ParsonsThe film was directed by Doug Liman and adapted for the screen by Tony Gilroy and William Blake Herron from the novel of the same name written by Robert Ludlum who also produced the film alongside Frank Marshall Universal Studios released the film to theaters in the United States on June 14 2002 and it received a positive critical and public reaction The film was followed by a 2004 sequel The Bourne Supremacy and a third part released in 2007 entitled The Bourne Ultimatum
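Note that deleting the newline characters outright glues together the words on either side of a line break (the output above contains "ParsonsThe"). A minimal sketch of one way around this, assuming that any run of whitespace should collapse to a single space; the function name to_single_line is hypothetical.

```python
import string

def to_single_line(text):
    """Remove punctuation, then collapse all whitespace runs to single spaces."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments splits on any whitespace, including newlines.
    return " ".join(no_punct.split())

wrapped = (
    "The Bourne Identity is a 2002 spy film loosely based on Robert\n"
    "Ludlum's novel of the same name.\n"
    "Plot\n"
)

print(to_single_line(wrapped))
print(to_single_line("12 (0) 0 4 (0) 38 (3) 0 3 (0) 0 1 (0)"))
```

Hyphenated words such as "twenty-one" lose their hyphen and become "twentyone" under this approach; replacing punctuation with spaces instead of deleting it would be an alternative design.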

Find a US street address in text (preferably using Python regex)

Disclaimer: I read very carefully this thread:
Street Address search in a string - Python or Ruby
and many other resources.
Nothing works for me so far.
In more detail, here is what I am looking for:
The rules are relaxed, and I am definitely not asking for perfect code that covers all cases; just a few simple, basic ones, assuming that the address is in this format:
a) Street number (1...N digits);
b) Street name : one or more words capitalized;
b-2) (optional) would be best if it could be prefixed with abbrev. "S.", "N.", "E.", "W."
c) (optional) unit/apartment/etc can be any (incl. empty) number of arbitrary characters
d) Street "type": one of ("st.", "ave.", "way");
e) City name : 1 or more Capitalized words;
f) (optional) state abbreviation (2 letters)
g) (optional) zip which is any 5 digits.
None of the above needs to be a valid thing (e.g. an existing city or zip).
I am trying expressions like these so far:
pat = re.compile(r'\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?', re.IGNORECASE)
>>> pat.search("123 East Virginia avenue, unit 123, San Ramondo, CA, 94444")
This doesn't work, and it's not easy for me to understand why. Specifically: how do I separate, in my pattern, a group of arbitrary words from one of the specific words that should follow, like a state abbreviation or a street "type" ("st.", "ave.")?
Anyhow: here is an example of what I am hoping to get:
Given
def ex_addr(text):
    # does the re magic
    # returns 1st address (all addresses?) or None if nothing found
    pass  # placeholder for the missing implementation

for t in [
    'The meeting will be held at 22 West Westin st., South Carolina, 12345 on Nov.-18',
    'The meeting will be held at 22 West Westin street, SC, 12345 on Nov.-18',
    'Hi there,\n How about meeting tomorr. #10am-sh in Chadds # 123 S. Vancouver ave. in Ottawa? \nThanks!!!',
    'Hi there,\n How about meeting tomorr. #10am-sh in Chadds # 123 S. Vancouver avenue in Ottawa? \nThanks!!!',
    'This was written in 1999 in Montreal',
    "Cool cafe at 420 Funny Lane, Cupertino CA is way too cool",
    "We're at a party at 12321 Mammoth Lane, Lexington MA 77777; Come have a beer!"
]:
    print(ex_addr(t))
I would like to get:
'22 West Westin st., South Carolina, 12345'
'22 West Westin street, SC, 12345'
'123 S. Vancouver ave. in Ottawa'
'123 S. Vancouver avenue in Ottawa'
None # for 'This was written in 1999 in Montreal',
"420 Funny Lane, Cupertino CA",
"12321 Mammoth Lane, Lexington MA 77777"
Could you please help?

I just ran across this on GitHub, as I am having a similar problem. It appears to work and to be more robust than your current solution.
https://github.com/madisonmay/CommonRegex
Looking at the code, the regex for street address accounts for many more scenarios. '\d{1,4} [\w\s]{1,20}(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr|court|ct|parkway|pkwy|circle|cir|boulevard|blvd)\W?(?=\s|$)'
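To see what that pattern does and does not match, here is a small sketch that applies it to two of the question's test strings. The pattern is copied from the answer as quoted; compiling it with re.IGNORECASE and the helper name find_street are assumptions made for this example, and nothing here uses the CommonRegex package itself.

```python
import re

# Pattern as quoted in the answer; IGNORECASE is an assumption so that
# "Street" and "street" are treated alike.
STREET_RE = re.compile(
    r"\d{1,4} [\w\s]{1,20}"
    r"(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr"
    r"|court|ct|parkway|pkwy|circle|cir|boulevard|blvd)\W?(?=\s|$)",
    re.IGNORECASE,
)

def find_street(text):
    """Return the first street-like match in text, or None."""
    match = STREET_RE.search(text)
    return match.group(0) if match else None

print(find_street("The meeting will be held at 22 West Westin street, SC, 12345 on Nov.-18"))
print(find_street("This was written in 1999 in Montreal"))
```

The pattern only covers the street part (number, name, type), not city, state or zip, and "lane" and "way" are not in its list of types. It can also stop early when a word in the street name happens to end in one of the listed types, for example "West" ending in "st".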

\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?
In this regex, you have one space too many: the ", " after (.*) ends with a space, and the group ( \w+){1,5} that follows already begins with one. After removing one of them, the pattern matches your example.
I don't think you can assume that a "unit 123" or similar will be there, or there might be several of them (e.g. "building A, apt 3"). Note that in your initial regex, the . can also match a comma, which could lead to very long (and unwanted) matches.
You should probably accept several such groups, with a limit on their number (e.g. replace , (.*) with something like (, [^,]{1,20}){0,5}).
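Both suggestions can be tried on the question's example string. This is a minimal sketch; the variable names fixed and bounded are made up for illustration, and the second address without a unit is a hypothetical variation of the first.

```python
import re

address = "123 East Virginia avenue, unit 123, San Ramondo, CA, 94444"

# Original pattern with the doubled space removed:
# ", (.*), ( \w+)" becomes ", (.*),( \w+)".
fixed = re.compile(
    r"\d{1,4}( \w+){1,5}, (.*),( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?",
    re.IGNORECASE,
)

# Variant with up to five short, comma-free unit parts instead of the greedy (.*).
bounded = re.compile(
    r"\d{1,4}( \w+){1,5}(, [^,]{1,20}){0,5},( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?",
    re.IGNORECASE,
)

print(fixed.search(address).group(0))
print(bounded.search(address).group(0))
# The bounded variant also accepts an address with no unit part at all.
print(bounded.search("123 East Virginia avenue, San Ramondo, CA, 94444").group(0))
```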
In any case, you will probably never get something 100% accurate that accepts every variation people might throw at it. Do lots of tests! Good luck.

Remove sequential characters from the start of a string

What's the best way to strip out the alphabetical letters that are sometimes at the start of Wikipedia references?
e.g. From
a b c d Star Wars Episode III: Revenge of the Sith (DVD). 20th Century Fox. 2005.
to
Star Wars Episode III: Revenge of the Sith (DVD). 20th Century Fox. 2005.
I've hacked together a solution that works, but seems clunky. My version uses a regular expression in the form '^(?:a (?:b (?:c )?)?)?'. What's a proper, fast way to do it?
import re
a = list('abcdefghijklmnopqrstuvwxyz')
reg = "^%s%s" % ( "".join(["(?:%s " %b for b in a]), ")?"*len(a) )
re.sub(reg, "", "a b c d Wikipedia Reference")

I would probably just do something like this:
title = re.sub(r'^([a-z]\s)*', '', 'a b c d Wikipedia Reference')
which does the same as what you've got there. As @joran-beasley points out, however, you might need something cleverer for the complicated cases.

How about using a character class in your regular expression, i.e.:
re.sub('^([a-z] )*', '', ...)
That should remove any number of leading occurrences of a single alphabetic character followed by a single space.
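A short sketch of that substitution applied to the reference from the question. The helper name strip_backlink_letters is hypothetical, and the pattern uses a non-capturing group, which behaves the same as the capturing one here.

```python
import re

# Zero or more "single lowercase letter + space" pairs at the start of the string.
LEADING_LETTERS = re.compile(r"^(?:[a-z] )*")

def strip_backlink_letters(reference):
    """Remove leading 'a b c ...' back-link letters from a reference string."""
    return LEADING_LETTERS.sub("", reference)

ref = "a b c d Star Wars Episode III: Revenge of the Sith (DVD). 20th Century Fox. 2005."
print(strip_backlink_letters(ref))
```

Unlike the nested pattern in the question, this does not require the letters to appear in alphabetical order, so a reference whose title itself begins with a single lowercase letter would lose that letter too.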

If you are copying and pasting webpage text rather than processing HTML, some problems like those mentioned in the question are inevitable. But if you process the HTML (the relevant line is shown below) using htmllib, you can remove items like <sup><i><b>c</b></i></sup> (which contributes the c) as units. [Edit: I now see htmllib is deprecated; I believe the replacement is HTMLParser (the html.parser module in Python 3).]
The displayed line is somewhat like
^ a b c d e Star Wars: Episode III Revenge of the Sith DVD commentary featuring George Lucas, Rick McCallum, Rob Coleman, John Knoll and Roger Guyett, [2005]
and the html source of the line is
<li id="cite_note-DVDcom-13"><span class="mw-cite-backlink">^ <sup><i><b>a</b></i></sup> <sup><i><b>b</b></i></sup> <sup><i><b>c</b></i></sup> <sup><i><b>d</b></i></sup> <sup><i><b>e</b></i></sup></span> <span class="reference-text"><i>Star Wars: Episode III Revenge of the Sith</i> DVD commentary featuring George Lucas, Rick McCallum, Rob Coleman, John Knoll and Roger Guyett, [2005]</span></li>
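A minimal sketch of that idea using the standard library's html.parser, assuming the reference text always sits inside a span with class "reference-text" as in the line above. The class name ReferenceTextParser is made up for this example, the sample markup is a shortened version of that line, and the simple depth counter does not account for void tags such as <br>.

```python
from html.parser import HTMLParser

class ReferenceTextParser(HTMLParser):
    """Collect only the text inside <span class="reference-text">."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside the target span (0 = outside)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif tag == "span" and ("class", "reference-text") in attrs:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

# Shortened version of the list item shown above.
html = (
    '<li id="cite_note-DVDcom-13"><span class="mw-cite-backlink">^ '
    '<sup><i><b>a</b></i></sup> <sup><i><b>b</b></i></sup></span> '
    '<span class="reference-text"><i>Star Wars: Episode III Revenge of the Sith</i> '
    'DVD commentary featuring George Lucas, [2005]</span></li>'
)

parser = ReferenceTextParser()
parser.feed(html)
print("".join(parser.parts))
```

Because the back-link letters live in a separate span, they never reach the output, so no regular expression is needed to strip them.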

Do they always follow that pattern, with four extra letters separated by spaces in front of the title? If so, you could do this:
s = "a b c d Star Wars Episode III: Revenge of the Sith (DVD). 20th Century Fox. 2005."
if all(len(x) == 1 and x.isalpha() for x in s.split()[0:4]):
    print(s[8:])
