Cleaning Wikipedia content with Python

Hi, I have used a Python library to collect the data of a topic. For example, I chose the topic of New York and retrieved the content with the following code:
import wikipedia
f2 = open('newyork', 'w')
ny = wikipedia.page("New York")
f2.write(ny.content.encode('utf8')+"\n")
I am able to extract the information in the format below:
New York is a state in the Northeastern United States and is the 27th-most extensive, fourth-most populous, and seventh-most densely populated U.S. state. New York is bordered by New Jersey and Pennsylvania to the south and Connecticut, Massachusetts, and Vermont to the east. The state has a maritime border in the Atlantic Ocean with Rhode Island, east of Long Island, as well as an international border with the Canadian provinces of Quebec to the north and Ontario to the west and north. The state of New York, with an estimated 19.8 million residents in 2015, is often referred to as New York State to distinguish it from New York City, the state's most populous city and its economic hub.
With an estimated population of 8.55 million in 2015, New York City is the most populous city in the United States and the premier gateway for legal immigration to the United States. The New York City Metropolitan Area is one of the most populous urban agglomerations in the world. New York City is a global city, exerting a significant impact upon commerce, finance, media, art, fashion, research, technology, education, and entertainment, its fast pace defining the term New York minute. The home of the United Nations Headquarters, New York City is an important center for international diplomacy and has been described as the cultural and financial capital of the world, as well as the world's most economically powerful city. New York City makes up over 40% of the population of New York State. Two-thirds of the state's population lives in the New York City Metropolitan Area, and nearly 40% live on Long Island. Both the state and New York City were named for the 17th century Duke of York, future King James II of England. The next four most populous cities in the state are Buffalo, Rochester, Yonkers, and Syracuse, while the state capital is Albany.
The earliest Europeans in New York were French colonists and Jesuit missionaries who arrived southward from settlements at Montreal for trade and proselytizing. New York had been inhabited by tribes of Algonquian and Iroquoian-speaking Native Americans for several hundred years by the time Dutch settlers moved into the region in the early 17th century. In 1609, the region was first claimed by Henry Hudson for the Dutch, who built Fort Nassau in 1614 at the confluence of the Hudson and Mohawk rivers, where the present-day capital of Albany later developed. The Dutch soon also settled New Amsterdam and parts of the Hudson Valley, establishing the colony of New Netherland, a multicultural community from its earliest days and a center of trade and immigration. The British annexed the colony from the Dutch in 1664. The borders of the British colony, the Province of New York, were similar to those of the present-day state.
Many landmarks in New York are well known to both international and domestic visitors, with New York State hosting four of the world's ten most-visited tourist attractions in 2013: Times Square, Central Park, Niagara Falls (shared with Ontario), and Grand Central Terminal. New York is home to the Statue of Liberty, a symbol of the United States and its ideals of freedom, democracy, and opportunity. In the 21st century, New York has emerged as a global node of creativity and entrepreneurship, social tolerance, and environmental sustainability. New York's higher education network comprises approximately 200 colleges and universities, including Columbia University, Cornell University, New York University, and Rockefeller University, which have been ranked among the top 35 in the world.
== History ==
=== 16th century ===
In 1524, Giovanni da Verrazzano, an Italian explorer in the service of the French crown, explored the Atlantic coast of North America between the Carolinas and Newfoundland, including New York Harbor and Narragansett Bay. On April 17, 1524 Verrazanno entered New York Bay, by way of the Strait now called the Narrows into the northern bay which he named Santa Margherita, in honour of the King of France's sister. Verrazzano described it as "a vast coastline with a deep delta in which every kind of ship could pass" and he adds: "that it extends inland for a league and opens up to form a beautiful lake. This vast sheet of water swarmed with native boats". He landed on the tip of Manhattan and perhaps on the furthest point of Long Island. Verrazanno's stay in this place was interrupted by a storm which pushed him north towards Martha's Vineyard.
In 1540 French traders from New France built a chateau on Castle Island, within present-day Albany; due to flooding, it was abandoned the next year. In 1614, the Dutch under the command of Hendrick Corstiaensen, rebuilt the French chateau, which they called Fort Nassau. Fort Nassau was the first Dutch settlement in North America, and was located along the Hudson River, also within present-day Albany. The small fort served as a trading post and warehouse. Located on the Hudson River flood plain, the rudimentary "fort" was washed away by flooding in 1617, and abandoned for good after Fort Orange (New Netherland) was built nearby in 1623.
=== 17th century ===
Henry Hudson's 1609 voyage marked the beginning of European involvement with the area. Sailing for the Dutch East India Company and looking for a passage to Asia, he entered the Upper New York Bay on September 11 of that year. Word of his findings encouraged Dutch merchants to explore the coast in search for profitable fur trading with local Native American tribes.
During the 17th century, Dutch trading posts established for the trade of pelts from the Lenape, Iroquois, and other tribes were founded in the colony of New Netherland. The first of these trading posts were Fort Nassau (1614, near present-day Albany); Fort Orange (1624, on the Hudson River just south of the current city of Albany and created to replace Fort Nassau), developing into settlement Beverwijck (1647), and into what became Albany; Fort Amsterdam (1625, to develop into the town New Amsterdam which is present-day New York City); and Esopus, (1653, now Kingston). The success of the patroonship of Rensselaerswyck (1630), which surrounded Albany and lasted until the mid-19th century, was also a key factor in the early success of the colony. The English captured the colony during the Second Anglo-Dutch War and governed it as the Province of New York. The city of New York was recaptured by the Dutch in 1673 during the Third Anglo-Dutch War (1672–1674) and renamed New Orange. It was returned to the English under the terms of the Treaty of Westminster a year later.
== References ==
== Further reading ==
French, John Homer (1860). Historical and statistical gazetteer of New York State. Syracuse, New York: R. Pearsall Smith. OCLC 224691273. (Full text via Google Books.)
New York State Historical Association (1940). New York: A Guide to the Empire State. New York City: Oxford University Press. ISBN 978-1-60354-031-5. OCLC 504264143. (Full text via Google Books.)
== External links ==
New York at DMOZ
Geographic data related to New York at OpenStreetMap
The Problems:
Problem 1:
I am having trouble removing all the content from the "References" and "Further reading" sections.
For example:
== History ==
some text under the section History
=== 17th century ===
some text under the section 17 century
=== 19th century ===
some text under the section 19 century
== References ==
some references
== Further reading ==
some further reading sources
Desired Result:
== History ==
some text under the section History
=== 17th century ===
some text under the section 17 century
=== 19th century ===
some text under the section 19 century
Problem 1B:
I will be getting the content of many topics, so there will be many references to delete. How can I do it?
For example, I'd like to delete all sections that begin with "References" and "Further reading":
== New York ==
== References ==
== Further reading ==
== California ==
== References ==
== Further reading ==
== Florida ==
== References ==
== Further reading ==
Desired Result:
== New York ==
== California ==
== Florida ==
Sorry for the long post, and please forgive me as I have very little knowledge of Python.
All advice and help is greatly appreciated.
Thank you.
Edit
Current Problem
Hi osantana,
I have tried the code that you have provided as shown below:
import wikipedia
import re

f2 = open('osantana', 'w')
ny = wikipedia.page("New York")
section_title_re = re.compile("^=+\s+.*\s+=+$")
raw_content = ny.content
content = []
skip = False
for l in raw_content.splitlines():
    line = l.strip()
    if "== References ==" in line.lower():
        skip = True  # replace with break if this is the last section
        continue
    if "== Further reading ==" in line.lower():
        skip = True  # replace with break if this is the last section
        continue
    if "== External links ==" in line.lower():
        skip = True  # replace with break if this is the last section
        continue
    if section_title_re.match(line):
        skip = False
        continue
    if skip:
        continue
    content.append(line)
content = '\n'.join(content) + '\n'
f2.write(content.encode('utf8')+"\n")
It works fine for everything except these parts:
Original File:
== References ==
Index of New York-related articles
Outline of New York – organized list of topics about New York
== Further reading ==
French, John Homer (1860). Historical and statistical gazetteer of New York State. Syracuse, New York: R. Pearsall Smith. OCLC 224691273. (Full text via Google Books.)
New York State Historical Association (1940). New York: A Guide to the Empire State. New York City: Oxford University Press. ISBN 978-1-60354-031-5. OCLC 504264143. (Full text via Google Books.)
Result of the code:
Index of New York-related articles
Outline of New York – organized list of topics about New York
French, John Homer (1860). Historical and statistical gazetteer of New York State. Syracuse, New York: R. Pearsall Smith. OCLC 224691273. (Full text via Google Books.)
New York State Historical Association (1940). New York: A Guide to the Empire State. New York City: Oxford University Press. ISBN 978-1-60354-031-5. OCLC 504264143. (Full text via Google Books.)
The headings were removed but the content is still intact.

I'll assume that References/Further reading are not the last sections in all pages. If those topics are the last sections, replace the highlighted code below with a break statement.
import re

def parse(raw_content):
    section_title_re = re.compile("^=+\s+.*\s+=+$")
    content = []
    skip = False
    for l in raw_content.splitlines():
        line = l.strip()
        if "= references =" in line.lower():
            skip = True  # replace with break if this is the last section
            continue
        if "= further reading =" in line.lower():
            skip = True  # replace with break if this is the last section
            continue
        if section_title_re.match(line):
            skip = False
            continue
        if skip:
            continue
        content.append(line)
    return '\n'.join(content) + '\n'

print(parse(ny.content))
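If References/Further reading/External links always come at the tail of the dump (as they do in the excerpt at the top of this question), a shorter sketch is to cut the text at the first such heading. The sample text below is illustrative:

```python
import re

raw = """== History ==
some text under the section History
=== 17th century ===
some text under the section 17 century
== References ==
some references
== Further reading ==
some further reading sources
"""

# Cut everything from the first trailing section heading onward.
m = re.search(r"^== (References|Further reading|External links) ==$",
              raw, flags=re.MULTILINE)
cleaned = raw[:m.start()].rstrip() if m else raw
print(cleaned)
```

This keeps the History section (and its subsections) and drops everything from the first unwanted heading to the end.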

For problem 2 you could do something like this:
contents = re.sub(r'=+\s*.+\s*=+', '', contents)
Just remember to import re, the regular expressions module.
The method being used is re.sub(pattern, repl, string). pattern is a regular expression pattern (the re documentation provides an overview of it).
repl is what you want to replace all occurrences of the pattern with. In this case you want to remove the pattern, so just use an empty string as the replacement.
string is of course the string you're performing the substitution on. This method returns the final result, so if you want to overwrite the original string, just assign the returned value back to the input string.
Here's the pattern I used explained, just in case. '=+\s*.+\s*=+' means any part of the string where there is one or more equal signs (=+), followed by zero or more whitespace characters (\s*), followed by one or more of any character (.+), followed again by zero or more whitespace characters (\s*), finally ending with one or more equal signs (=+).
For problem 1 I'd say you could probably accomplish what you want to using regular expressions as well, and the re module makes it pretty easy. The link I gave above should help.
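As a quick sanity check of that substitution (the sample string here is made up):

```python
import re

contents = "Some text\n== References ==\nmore text"

# The heading line is replaced by an empty string; the newlines around it remain.
contents = re.sub(r'=+\s*.+\s*=+', '', contents)
print(contents)
```

Note that this removes every ==...== heading, not just References/Further reading, and leaves an empty line where each heading was.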

import re
import wikipedia

def clean_data(f):
    def inner(word):
        text = f(word)
        text = text.encode("utf-8", errors='ignore').decode("utf-8")
        text = re.sub("https?:.*(?=\s)", '', text)                # strip URLs
        text = re.sub("[’‘\"]", "'", text)                        # normalize quote characters
        text = re.sub("[^\x00-\x7f]+", '', text)                  # drop non-ASCII characters
        text = re.sub('[#&\\*+/<>#[\]^`{|}~ \t\n\r]', ' ', text)  # special characters to spaces
        text = re.sub('\(.*?\)', '', text)                        # remove parenthesised text
        text = re.sub('\=\=.*?\=\=', '', text)                    # remove == section == headings
        text = re.sub(' , ', ',', text)
        text = re.sub(' \.', '.', text)
        text = re.sub(" +", ' ', text)                            # collapse runs of spaces
        text = re.sub(";", 'and', text)
        return text.strip()
    return inner

@clean_data
def get_data(word):
    try:
        data = wikipedia.summary(word, sentences=300)
    except wikipedia.DisambiguationError as e:
        print("picking the data from:", e.options[:3])
        data = ''.join([wikipedia.summary(s, sentences=100) for s in e.options[:3]])
    return data

data = get_data("Orange")

Related

BeautifulSoup get_text result as a string variable?

I need to collect text from a certain section of a wikipedia page, and put it into a single string variable. I can find the right text relatively easily, but I have no idea how to get it into a string variable.
My code so far:
from bs4 import BeautifulSoup

with open('uottawa_wiki.html', 'rb') as infile:
    html_content = infile.read()

soup = BeautifulSoup(html_content, 'html.parser')
soup = soup.body
campus_subsection = soup.find(id='Campus')
campus_subsection_siblings = campus_subsection.find_parent().find_next_siblings()
for sibling in campus_subsection_siblings:
    if sibling.name == 'p':
        print(sibling.get_text())
    else:
        break
And this is what it outputs, which is perfect:
The university's main campus is situated within the neighbourhood of
Sandy Hill (Côte-de-Sable). The main campus is bordered to the north
by the ByWard Market district, to the east by Sandy Hill's residential
area, and to the southwest by Nicholas Street, which runs adjacent to
the Rideau Canal on the western half of the university. As of the
2010–2011 academic year, the main campus occupied 35.3 ha (87 acres),
though the university owns and manages other properties throughout the
city, raising the university's total extent to 42.5 ha (105
acres).[32] The main campus moved two times before settling in its
final location in 1856. When the institution was first founded, the
campus was located next to the Notre-Dame Cathedral Basilica. With
space a major issue in 1852, the campus moved to a location that is
now across from the National Gallery of Canada. In 1856, the
institution moved to its present location.[18]
The buildings at the university vary in age from 100 Laurier (1893) to
120 University (Faculty of Social Sciences, 2012).[33] In 2011 the
average age of buildings was 63.[32] In the 2011–2012 academic year,
the university owned and managed 30 main buildings, 806 research
laboratories, 301 teaching laboratories and 257 classrooms and seminar
rooms.[4][32] The main campus is divided between its older Sandy Hill
campus and its Lees campus, purchased in 2007. While Lees Campus is
not adjacent to Sandy Hill, it is displayed as part of the main campus
on school maps.[34] Lees campus, within walking distance of Sandy
Hill, was originally a satellite campus owned by Algonquin
College.[35]
However, I need all of this text, exactly as it is (line breaks and all), to be in a single string variable. I have NO CLUE how to do that.
Just append to a list in a loop and then "\n".join() it:
paragraphs = []
for sibling in campus_subsection_siblings:
    if sibling.name == 'p':
        paragraphs.append(sibling.get_text())
    else:
        break
full_text = "\n".join(paragraphs)

Get rid of weird indents when getting description in beautiful soup

I have a bs4 program where I collect the descriptions of links. It first checks to see if there are any meta description tags, and if there aren't any it gets the descriptions from the <p> tags.
This is the code:
from bs4 import BeautifulSoup
import requests
def find_title(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    with open('descrip.txt', 'a', encoding='utf-8') as f:
        description = soup.find('meta', attrs={'name': 'og:description'}) or soup.find('meta', attrs={'property': 'description'}) or soup.find('meta', attrs={'name': 'description'})
        if description:
            desc = description["content"]
        else:
            desc = soup.find_all('p')[0].getText()
            lengths = len(desc)
            index = 0
            while lengths == 1:
                index = index + 1
                desc = soup.find_all('p')[index].getText()
                lengths = len(desc)
            if lengths > 300:
                desc = soup.find_all('p')[index].getText()[0:300]
            elif lengths < 300:
                desc = soup.find_all('p')[index].getText()[0:lengths]
        print(desc)
        f.write(desc + '\n')
find_title('https://en.wikipedia.org/wiki/Portal:The_arts')
find_title('https://en.wikipedia.org/wiki/Portal:Biography')
find_title('https://en.wikipedia.org/wiki/Portal:Geography')
find_title('https://en.wikipedia.org/wiki/November_15')
find_title('https://en.wikipedia.org/wiki/November_16')
find_title('https://en.wikipedia.org/wiki/Wikipedia:Selected_anniversaries/November')
find_title('https://lists.wikimedia.org/mailman/listinfo/daily-article-l')
find_title('https://en.wikipedia.org/wiki/List_of_days_of_the_year')
find_title('https://en.wikipedia.org/wiki/File:Proclama%C3%A7%C3%A3o_da_Rep%C3%BAblica_by_Benedito_Calixto_1893.jpg')
find_title('https://en.wikipedia.org/wiki/First_Brazilian_Republic')
find_title('https://en.wikipedia.org/wiki/Empire_of_Brazil')
find_title('https://en.wikipedia.org/wiki/Pedro_II_of_Brazil')
find_title('https://en.wikipedia.org/wiki/Benedito_Calixto')
find_title('https://en.wikipedia.org/wiki/Rio_de_Janeiro')
find_title('https://en.wikipedia.org/wiki/Deodoro_da_Fonseca')
But the output in descrip.txt has some weird indents, some descriptions run on for multiple lines, and there are gaps between some of them.
This is the output:
The arts refers to the theory, human application and physical expression of creativity found in human cultures and societies through skills and imagination in order to produce objects, environments and experiences. Major constituents of the arts include visual arts (including architecture, ceramics,
A biography, or simply bio, is a detailed description of a person's life. It involves more than just the basic facts like education, work, relationships, and death; it portrays a person's experience of these life events. Unlike a profile or curriculum vitae (résumé), a biography presents a subject's
Geography (from Greek: γεωγραφία, geographia, literally "earth description") is a field of science devoted to the study of the lands, features, inhabitants, and phenomena of the Earth and planets. The first person to use the word γεωγραφία was Eratosthenes (276–194 BC). Geography is an all-encompass
November 15 is the 319th day of the year (320th in leap years) in the Gregorian calendar. 46 days remain until the end of the year.
November 16 is the 320th day of the year (321st in leap years) in the Gregorian calendar. 45 days remain until the end of the year.
The arts refers to the theory, human application and physical expression of creativity found in human cultures and societies through skills and imagination in order to produce objects, environments and experiences. Major constituents of the arts include visual arts (including architecture, ceramics,
A biography, or simply bio, is a detailed description of a person's life. It involves more than just the basic facts like education, work, relationships, and death; it portrays a person's experience of these life events. Unlike a profile or curriculum vitae (résumé), a biography presents a subject's
Geography (from Greek: γεωγραφία, geographia, literally "earth description") is a field of science devoted to the study of the lands, features, inhabitants, and phenomena of the Earth and planets. The first person to use the word γεωγραφία was Eratosthenes (276–194 BC). Geography is an all-encompass
November 15 is the 319th day of the year (320th in leap years) in the Gregorian calendar. 46 days remain until the end of the year.
November 16 is the 320th day of the year (321st in leap years) in the Gregorian calendar. 45 days remain until the end of the year.
Selected anniversaries / On this day archive
All · January · February · March · April · May · June · July · August · September · October · November · December
The sum of all human knowledge. Delivered to your inbox every day.
The following pages list the historical events, births, deaths, and holidays and observances of the specified day of the year:
Original file ‎(5,799 × 3,574 pixels, file size: 15.11 MB, MIME type: image/jpeg)
The First Brazilian Republic or República Velha (Portuguese pronunciation: [ʁeˈpublikɐ ˈvɛʎɐ], "Old Republic"), officially the Republic of the United States of Brazil, refers to the period of Brazilian history from 1889 to 1930. The República Velha ended with the Brazilian Revolution of 1930 that installed Getúlio Vargas as a new president.
The Empire of Brazil was a 19th-century state that broadly comprised the territories which form modern Brazil and (until 1828) Uruguay. Its government was a representative parliamentary constitutional monarchy under the rule of Emperors Dom Pedro I and his son Dom Pedro II. A colony of the Kingdom of Portugal, Brazil became the seat of the Portuguese colonial Empire in 1808, when the Portuguese Prince regent, later King Dom João VI, fled from Napoleon's invasion of Portugal and established himself and his government in the Brazilian city of Rio de Janeiro. João VI later returned to Portugal, leaving his eldest son and heir, Pedro, to rule the Kingdom of Brazil as regent. On 7 September 1822, Pedro declared the independence of Brazil and, after waging a successful war against his father's kingdom, was acclaimed on 12 October as Pedro I, the first Emperor of Brazil. The new country was huge, sparsely populated and ethnically diverse.
Early life (1825–40)
Consolidation (1840–53)
Growth (1853–64)
Paraguayan War (1864–70)
Apogee (1870–81)
Decline and fall (1881–89)
Exile and death (1889–91)
Legacy
Benedito Calixto de Jesus (14 October 1853 – 31 May 1927) was a Brazilian painter.[1] His works usually depicted figures from Brazil and Brazilian culture, including a famous portrait of the bandeirante Domingos Jorge Velho in 1923,[2] and scenes from the coastline of São Paulo.[3] Unlike many artis
Rio de Janeiro (/ˈriːoʊ di ʒəˈnɛəroʊ, - deɪ -, - də -/; Portuguese: [ˈʁi.u d(ʒi) ʒɐˈne(j)ɾu] (listen);[3]), or simply Rio,[4] is anchor to the Rio de Janeiro metropolitan area and the second-most populous municipality in Brazil and the sixth-most populous in the Americas. Rio de Janeiro is the capit
Manuel Deodoro da Fonseca (Portuguese pronunciation: [mɐnuˈɛw deoˈdɔɾu da fõˈsekɐ]; 5 August 1827 – 23 August 1892) was a Brazilian politician and military officer who served as the first President of Brazil. He took office after heading a military coup that deposed Emperor Pedro II and proclaimed t
Is there any way to fix this problem?
Add strip=True to getText() (note: it's an alias of get_text()), and then add a space as a separator. For example:
get_text(strip=True, separator=' ')
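A small self-contained check of what those two arguments do (the HTML snippet here is made up): with strip=True each text fragment is stripped and whitespace-only fragments are dropped, and separator controls what is placed between the remaining fragments.

```python
from bs4 import BeautifulSoup

html = "<div>\n  <b>Geography</b>\n  <span>is a field of science</span>\n</div>"
soup = BeautifulSoup(html, "html.parser")

print(soup.get_text())                           # keeps the raw newlines and indents
print(soup.get_text(strip=True, separator=' '))  # one clean line
```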

I want to write the regex expression for the following text

I am trying to parse the text of this format:
1ST Circuit U.S. District Court for NEW YORK SOUTHERN District Judge SMITH, JOHN T., JR
In the text, I want to capture:
Circuit name: In the example above, 1ST CIRCUIT. Circuit number can be between 1ST and 99TH. This information is not always there.
State name: In the text above, NEW YORK SOUTHERN. It can be at most three words. This information is not always there.
Title: It can be either District or Magistrate.
Last Name: Here, it is SMITH
Name: The name is JOHN T., JR
To make my problem more clear, let me give two more examples of the text I want to parse.
15TH Circuit U.S. District Court for ALABAMA Magistrate Judge NEELY, CATHERINE
Magistrate Judge COOKE, THOMAS M
I have tried the following expression. It was able to capture the name of the judge but failed to capture the circuit and the state.
((?P<circuit>\d{1,2}\w{2} Circuit)?\s?(U\.S\. District Court for )?\s?(?P<state>\b[A-Z]*(\s[A-Z]*)\b)*)?.* (?<=Judge )(?P<lname>[A-Z]*), (?P<name>[A-Z,. ]*)( {1,2}\(.*\))?
Many thanks.
This should help.
import re

s = ["1ST Circuit U.S. District Court for NEW YORK SOUTHERN District Judge SMITH, JOHN T., JR",
     "15TH Circuit U.S. District Court for ALABAMA Magistrate Judge NEELY, CATHERINE"]
for sVal in s:
    m = re.search(r"((?P<circuit>\d*(ST|TH) Circuit)) U\.S\. District Court for (?P<state>\b[A-Z\s]*\b)(?P<title>(District|Magistrate)) Judge (?P<lname>[A-Z]*), (?P<name>.*$)", sVal)
    if m:
        for i in ["circuit", "state", "title", "lname", "name"]:
            print(m.group(i))
        print("-----")
Output:
1ST Circuit
NEW YORK SOUTHERN
District
SMITH
JOHN T., JR
-----
15TH Circuit
ALABAMA
Magistrate
NEELY
CATHERINE
-----
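The pattern above assumes the circuit and court are always present, but the question's third sample ("Magistrate Judge COOKE, THOMAS M") has neither. One hedged sketch that makes that whole prefix optional (group names are illustrative):

```python
import re

# Circuit/court prefix is optional so bare "Magistrate Judge ..." lines still match.
pattern = re.compile(
    r"(?:(?P<circuit>\d+(?:ST|ND|RD|TH) Circuit) U\.S\. District Court for "
    r"(?P<state>[A-Z][A-Z ]*?) )?"
    r"(?P<title>District|Magistrate) Judge (?P<lname>[A-Z]+), (?P<name>.*)$"
)

for line in [
    "1ST Circuit U.S. District Court for NEW YORK SOUTHERN District Judge SMITH, JOHN T., JR",
    "Magistrate Judge COOKE, THOMAS M",
]:
    m = pattern.search(line)
    if m:
        print(m.groupdict())
```

When the prefix is absent, the circuit and state groups simply come back as None.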

reformat unstructured text into single line after removing punctuation

I have an unstructured text which I want to convert into 1 line and remove all the punctuation marks.
For the punctuation marks I used the following solution: Best way to strip punctuation from a string in Python
How can I reformat the unstructured text into one line using Python?
Example 1:
The Bourne Identity is a 2002 spy film loosely based on Robert
Ludlum's novel of the same name. It stars Matt Damon as Jason Bourne,
an amnesiac attempting to discover his true identity amidst a
clandestine conspiracy within the Central Intelligence Agency (CIA) to
track him down and arrest or kill him for inexplicably failing to
carry out an officially unsanctioned assassination and then failing to
report back in afterwards. Along the way he teams up with Marie,
played by Franka Potente, who assists him on the initial part of his
journey to learn about his past and regain his memories. The film also
stars Chris Cooper as Alexander Conklin, Clive Owen as The Professor,
Brian Cox as Ward Abbott, and Julia Stiles as Nicky Parsons.
The film was directed by Doug Liman and adapted for the screen by Tony
Gilroy and William Blake Herron from the novel of the same name
written by Robert Ludlum, who also produced the film alongside Frank
Marshall. Universal Studios released the film to theaters in the
United States on June 14, 2002 and it received a positive critical and
public reaction. The film was followed by a 2004 sequel, The Bourne
Supremacy, and a third part released in 2007 entitled The Bourne
Ultimatum.
Plot
Example 2:
12 (0) 0 4 (0) 38 (3) 0 3 (0) 0 1 (0)
Example 3:
Franklin Township is one of the eighteen townships of Monroe County, Ohio,
United States. The 2000 census found 453 people in the township, 367 of whom
lived in the unincorporated portions of the township.
Geography
Located in the western part of the county, it borders the following townships:
The village of Stafford lies in southwestern Franklin Township.
Name and history
It is one of twenty-one Franklin Townships statewide.
Government
The township is governed by a three-member board of trustees, who are elected in
November of odd-numbered years to a four-year term beginning on the following
January 1. Two are elected in the year after the presidential election and one
is elected in the year before it. There is also an elected township clerk, who
serves a four-year term beginning on April 1 of the year after the election,
which is held in November of the year before the presidential election.
Vacancies in the clerkship or on the board of trustees are filled by the
remaining trustees.
As you can see in the previous examples, the texts have different formats. How can I turn every single text into one line?
This is pretty straightforward - basically, other than the punctuation, you are now also looking to eliminate the line endings.
So, you can simply do:
import string

exclude = set(string.punctuation + "\n\t\r")
print(''.join(ch for ch in input_string if ch not in exclude))
input_string = """The Bourne Identity is a 2002 spy film loosely based on Robert Ludlum's novel of the same name. It stars Matt Damon as Jason Bourne, an amnesiac attempting to discover his true identity amidst a clandestine conspiracy within the Central Intelligence Agency (CIA) to track him down and arrest or kill him for inexplicably failing to carry out an officially unsanctioned assassination and then failing to report back in afterwards. Along the way he teams up with Marie, played by Franka Potente, who assists him on the initial part of his journey to learn about his past and regain his memories. The film also stars Chris Cooper as Alexander Conklin, Clive Owen as The Professor, Brian Cox as Ward Abbott, and Julia Stiles as Nicky Parsons.
The film was directed by Doug Liman and adapted for the screen by Tony Gilroy and William Blake Herron from the novel of the same name written by Robert Ludlum, who also produced the film alongside Frank Marshall. Universal Studios released the film to theaters in the United States on June 14, 2002 and it received a positive critical and public reaction. The film was followed by a 2004 sequel, The Bourne Supremacy, and a third part released in 2007 entitled The Bourne Ultimatum."""
>>> print(''.join(ch for ch in input_string if ch not in exclude))
The Bourne Identity is a 2002 spy film loosely based on Robert Ludlums novel of the same name It stars Matt Damon as Jason Bourne an amnesiac attempting to discover his true identity amidst a clandestine conspiracy within the Central Intelligence Agency CIA to track him down and arrest or kill him for inexplicably failing to carry out an officially unsanctioned assassination and then failing to report back in afterwards Along the way he teams up with Marie played by Franka Potente who assists him on the initial part of his journey to learn about his past and regain his memories The film also stars Chris Cooper as Alexander Conklin Clive Owen as The Professor Brian Cox as Ward Abbott and Julia Stiles as Nicky ParsonsThe film was directed by Doug Liman and adapted for the screen by Tony Gilroy and William Blake Herron from the novel of the same name written by Robert Ludlum who also produced the film alongside Frank Marshall Universal Studios released the film to theaters in the United States on June 14 2002 and it received a positive critical and public reaction The film was followed by a 2004 sequel The Bourne Supremacy and a third part released in 2007 entitled The Bourne Ultimatum
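Note that the join above fuses the last word of one paragraph into the first word of the next ("ParsonsThe"). A variant (a sketch, not the answer above) that removes punctuation and then lets str.split collapse every whitespace run into a single space avoids that:

```python
import string

text = ("The film was followed by a 2004 sequel, The Bourne\n"
        "Supremacy, and a third part released in 2007 entitled The Bourne\n"
        "Ultimatum.")

# Remove punctuation, then collapse all whitespace runs to single spaces.
no_punct = text.translate(str.maketrans('', '', string.punctuation))
one_line = ' '.join(no_punct.split())
print(one_line)
```

str.split() with no arguments splits on any whitespace (spaces, tabs, newlines) and discards empty strings, so the rejoin always yields exactly one space between words.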

How to split string of biographical info into different dictionaries using regex, in Python?

Recently I got my hands on a research project that would greatly benefit from learning how to parse a string of biographical data on several individuals into a set of dictionaries for each individual.
The string contains break words and I was hoping to create keys off of the breakwords and separate dictionaries by line breaks. So here are two people I want to create two different dictionaries for within my data:
Bankers = [ ' Bakstansky, Peter; Senior Vice President, Federal
Reserve Bank of New York, in charge of public information
since 1976, when he joined the NY Fed as Vice President. Senior
Officer in charge of die Office of Regional and Community Affairs,
Ombudsman for the Bank and Senior Administrative Officer for Executive
Group, m zero children Educ City College of New York (Bachelor of
Business Administration, 1961); University of Illinois, Graduate
School, and New York University, Graduate School of Business. 1962-6:
Business and financial writer, New York, on American Banker, New
York-World Telegram & Sun, Neia York Herald Tribune (banking editor
1964-6). 1966-74: Chase Manhattan Bank: Manager of Public Relations,
based in Paris, 1966-71; Manager of Chase's European Marketing and
Planning, based in Brussels, 1971-2; Vice President and Director of
Public Relations, 1972-4.1974-76: Bache & Co., Vice President and
Director of Corporate Communications. Barron, Patrick K.; First Vice
President and < Operating Officer of the Federal Reserve Bank o
Atlanta since February 1996. Member of the Fed" Reserve Systems
Conference of first Vice Preside Vice chairman of the bank's
management Con and of the Discount Committee, m three child Educ
University of Miami (Bachelor's degree in Management); Harvard
Business School (Prog Management Development); Stonier Graduate Sr of
Banking, Rutgers University. 1967: Joined Fed Reserve Bank of Atlanta
in computer operations 1971: transferred to Miami Branch; 1974:
Assist: President; 1987: Senior Vice President.1988: re1- Atlanta as
Head of Corporate Services. Member Executive Committee of the Georgia
Council on Igmic Education; former vice diairman of Greater
ji§?Charnber of Commerce and the President'sof the University of
Miami; in Atlanta, former ||Mte vice chairman for the United Way of
Atlanta feiSinber of Leadership Atlanta. Member of the Council on
Economic Education. Interest. ' ]
So for example, in this data I have two people - Peter Batanksy and Patrick K. Barron. I want to create a dictionary for each individual with these 4 keys: bankerjobs, Number of children, Education, and nonbankerjobs.
In this text there are already break words: "m" marks the number of children and "Educ" marks education. Anything before "m" is bankerjobs, and anything after the first "." following "Educ" is nonbankerjobs. The keyword to break between individuals seems to be a "." followed by more than one space.
How can I create a dictionary for each of these two individuals with these 4 keys using regular expressions on these break words?
specifically, what set of regex could help me create a dictionary for these two individuals with these 4 keys (built on the above specified break words)?
A pattern i am thinking would be something like this in perl:
pattern = [r'(m/[ '(.*);(.*)m(.*)Educ(.*)/)']
but I'm not sure.
I'm thinking the code would be similar to this, but please correct it if I'm wrong:
my_banker_parser = re.compile(r'somefancyregex')

def nested_dict_from_text(text):
    m = re.search(my_banker_parser, text)
    if not m:
        raise ValueError
    d = m.groupdict()
    return {"centralbanker": d}

result = nested_dict_from_text(bankers)
print(result)
My hope is to take this code and run it through the rest of the biographies for all of individuals of interest.
Using named groups will probably be less brittle, since it doesn't depend on the pieces of data being in the same order in each biography. Something like this should work:
>>> import re
>>> regex = re.compile(r'(?P<foo>foo)|(?P<bar>bar)|(?P<baz>baz)')
>>> data = {}
>>> for match in regex.finditer('bar baz foo something'):
... data.update((k, v) for k, v in match.groupdict().items() if v is not None)
...
>>> data
{'baz': 'baz', 'foo': 'foo', 'bar': 'bar'}
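For the break-word idea itself, here is a sketch on a simplified single record (the field names and the trimmed text are illustrative, not the real data):

```python
import re

# One simplified banker record; the real data is much noisier.
record = ("Senior Vice President, Federal Reserve Bank of New York. "
          "m zero children Educ City College of New York (BBA, 1961). "
          "1962-6: Business and financial writer.")

# Split on the break words: a standalone "m", then "Educ", then the first "."
# after the education field.
m = re.search(r"^(?P<bankerjobs>.*?)\s+m\s+(?P<children>.*?)\s+Educ\s+"
              r"(?P<education>.*?\.)\s*(?P<nonbankerjobs>.*)$", record)
if m:
    print(m.groupdict())
```

Lazy quantifiers (.*?) make each field stop at the first following break word rather than swallowing the rest of the record.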
