Python Regular Expression Extract Lookahead

I have been trying to extract transport node names and location coordinate strings from a webpage scrape (that I have permission to scrape). The names and locations are in cdata blocks of javascript. See here: http://pastebin.com/6Vtup2dE
Using regular expressions in python
re.findall("(?:\(new\sMicrosoft\.Maps\.Location\()(.+?(?=\)\,))(?:.+?(?=new\ssimpleInfo\(\\\'))(.+?(?=\\)))", test_str)
I get
[(u'55.86527,-4.2517133',
u"new simpleInfo('Buchanan Bus Station','Glasgow, Buchanan Bus Station - entrance to station is situated on Killermont Street. It is a short walk from George Square and within easy reach of Glasgow?s main shopping and leisure areas. Please check the bus station passenger displays for stance information for megabus services.'"),
(u'55.86068,-4.257852', u"new simpleInfo('Central Train Station',''"),
(u'51.492653,-0.14765126',
u"new simpleInfo('Victoria, Buckingham Palace Rd, Stop 10','London Victoria, Buckingham Palace Road - at the corner of Elizabeth Bridge and diagonally across from the main entrance to Victoria Coach Station. megabus Oxford Tube services leave from Stop 10.'"),
(u'51.492596,-0.14985295',
u"new simpleInfo('Victoria Coach Station','London Victoria Coach Station is situated on Buckingham Palace Rd at the junction with Elizabeth St. megabus services depart from Stands 15-20, located in the departures area of North West terminal '"),
(u'51.503437,-0.112076715',
u"new simpleInfo('Waterloo Train Station','London Waterloo - London Waterloo Station is located on Station Approach, SE1 London - just behind the London Eye. The station is a terminus for trains serving the south-west of England and Eurostar services. Waterloo is the largest station in the UK by area. Its spacious, curved concourse is lined with shops and all the modern amenities.\\n'"),
(u'51.53062,-0.12585254',
u"new simpleInfo('St Pancras International Train Station','For East Midlands Trains services only. London St Pancras International, London - St Pancras Station is located on Pancras Rd NW1 between the national Library and Kings Cross station. The station is the terminus for trains serving East Midlands and South Yorkshire. It is also the new London terminal for the Eurostar services to the continent. Kings Cross St Pancras tube station provides links via the London underground to other London destinations.'"),
(u'51.52678,-0.13297649',
u"new simpleInfo('Euston Train Station','For Virgin Trains Services Only. London Euston - The station is the main terminal for trains to London from the West Midlands and North West England. It is connected to Euston Tube Station for easy access to the London Underground network'"),
(u'51.52953,-0.12506014',
u"new simpleInfo('St Pancras, Coach Road','In some instances megabusplus services which operate as coach only will pick up from Coach Road, outside London St Pancras.'"),
(u'55.86527,-4.2517133',
u"new simpleInfo('Buchanan Bus Station','Glasgow, Buchanan Bus Station - entrance to station is situated on Killermont Street. It is a short walk from George Square and within easy reach of Glasgow?s main shopping and leisure areas. Please check the bus station passenger displays for stance information for megabus services.'"),
(u'55.86068,-4.257852', u"new simpleInfo('Central Train Station',''")]
But what I am trying to get is just:
[(u'55.86527,-4.2517133','Buchanan Bus Station'),
(u'55.86068,-4.257852', 'Central Train Station'),
(u'51.492653,-0.14765126','Victoria, Buckingham Palace Rd, Stop 10'),
(u'51.492596,-0.14985295','Victoria Coach Station')....etc]
I've written plenty of regex in my time but I've never had problems like this. I am trying (believe it or not) to hide everything up to and including "new simpleInfo(' and then grab the string up to the next "'" but I can't work it out. help!

Try this:
re.findall(r"(?:\(new\sMicrosoft\.Maps\.Location\(([^)]+)\).+?new\ssimpleInfo\(\\?'(.+?)\\?')", test_str)
This regex finds all occurrences, whether the name is written as \'Buchanan Bus Station\' or 'Buchanan Bus Station'.

(?:\(new\sMicrosoft\.Maps\.Location\()(.+?(?=\)\,))(?:.+?).*?new\ssimpleInfo\(\\'([^'\\]+)
Try this. This should give you what you want.
import re
p = re.compile(ur'(?:\(new\sMicrosoft\.Maps\.Location\()(.+?(?=\)\,))(?:.+?).*?new\ssimpleInfo\(\\\'([^\'\\]+)')
test_str = u"jQuery(function(){ jQuery(\'#JourneyPlanner_txtOutboundDate\').datepicker({dateFormat: \'dd/mm/yy\', firstDay: 1, beforeShowDay: function(dte){ return [((dte >= new Date(2014,9,29) && dte <= new Date(2015,0,4)) || false)]; }, minDate: new Date(2014,9,29), maxDate: new Date(2015,0,4),buttonImage: \"/images/icon_calendar.gif\", showOn: \"both\", buttonImageOnly: true}); });\njQuery(function(){ jQuery(\'#JourneyPlanner_txtReturnDate\').datepicker({dateFormat: \'dd/mm/yy\', firstDay: 1,buttonImage: \"/images/icon_calendar.gif\", showOn: \"both\", buttonImageOnly: true}); });\nEmperorBing.addCallback(function(){ var map = new Microsoft.Maps.Map(document.getElementById(\'confirm1_Map1\'), {credentials:\'Aodb7Wd7D9Kq5gKNryfW6V29yf8aw2Sbu-tXAlkH7OLJtm8zG2bQzzhDKK5zM9FE\',height: 320,width: 299, zoom: 13, mapTypeId: Microsoft.Maps.MapTypeId.auto, enableClickableLogo: false , enableSearchLogo: false , showDashboard: true, showCopyright: true, showScalebar: true, showMapTypeSelector: true});\r\nEmperorBing.addMarker(map, new Microsoft.Maps.Pushpin(new Microsoft.Maps.Location(55.86527,-4.2517133), { undefined: undefined, icon:\'/images/mapmarker.gif\', width:42, height:42, anchor: new Microsoft.Maps.Point(21,21)}),new simpleInfo(\'Buchanan Bus Station\',\'Glasgow, Buchanan Bus Station - entrance to station is situated on Killermont Street. It is a short walk from George Square and within easy reach of Glasgow?s main shopping and leisure areas. Please check the bus station passenger displays for stance information "
re.findall(p, test_str)
See demo.
http://regex101.com/r/dP9rO4/9
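A simpler variant of the patterns above may be easier to maintain: capture the coordinates with a negated character class and the name with a lazy group, instead of nesting lookaheads. The following is only a minimal sketch, not either poster's exact code. `sample` is an abbreviated stand-in for the scraped script (an assumption), `MARKER_RE` is a hypothetical name, and the approach assumes station names contain no apostrophes.

```python
import re

# Abbreviated stand-in for the scraped JavaScript (an assumption; the real page is longer).
sample = (
    "EmperorBing.addMarker(map, new Microsoft.Maps.Pushpin("
    "new Microsoft.Maps.Location(55.86527,-4.2517133), "
    "{ icon:'/images/mapmarker.gif', anchor: new Microsoft.Maps.Point(21,21)}),"
    "new simpleInfo('Buchanan Bus Station','Glasgow, Buchanan Bus Station'));\n"
    "EmperorBing.addMarker(map, new Microsoft.Maps.Pushpin("
    "new Microsoft.Maps.Location(55.86068,-4.257852), "
    "{ icon:'/images/mapmarker.gif', anchor: new Microsoft.Maps.Point(21,21)}),"
    "new simpleInfo('Central Train Station',''));"
)

MARKER_RE = re.compile(
    r"Microsoft\.Maps\.Location\(([^)]+)\)"  # group 1: everything up to the closing ')'
    r".*?new\s+simpleInfo\(\\?'"             # skip lazily to the opening quote (optionally escaped)
    r"(.*?)\\?'",                            # group 2: the name, up to the next quote
    re.DOTALL,                               # let '.' cross line breaks inside the script
)

stops = MARKER_RE.findall(sample)
print(stops)
```

Because the name group is lazy and stops at the first quote, the description argument that follows the name is never captured.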

Related

How can I use regular expressions to extract all words with at least one digit in text with Python

I am new to regular expressions and I have a text as follows. How can I use the RegEx to extract all words with at least one digit in it? Really appreciate it.
text = '''The start of the Civil War in 1861 followed by Tennessee’s secession from the Union and the lodging of
wounded Confederate soldiers on campus did not close East Tennessee University. By spring 1862 when the
trustees finally suspended operations, the majority of students had joined the military, President Joseph
Ridley had resigned, and two professors had left the university. Wounded Confederate soldiers were lodged
at university buildings after the January 1862 Battle of Mill Springs in Kentucky, known as the Battle of
Fishing Creek to the Confederacy. In the fall of 1863, Union troops forced the Confederates out of
Knoxville. On the Hill, the Union Army enclosed the three university buildings with an earthen
fortification they named Fort Byington in honor of an officer from Michigan who was killed in the defense
of Knoxville. They used the buildings for their headquarters, barracks, and a hospital for Black troops.
Despite a Confederate attempt to retake the city by siege—climaxed by a bloody, abortive attack on Fort
Sanders on November 29, 1863—the Union held and occupied Knoxville for the rest of the war. During the
battle, the Hill was hit with artillery fire from Confederate guns located in a trench at the site of
UT’s present-day Sorority Village. Campus also sustained a great deal of damage caused by the Union Army.
Troops denuded the grounds of trees, ruined the steward’s house, and destroyed the gymnasium with
misdirected cannon fire aimed at Confederate troops across the river. After the Civil War ended in 1865
and the Union Army left campus, Thomas Humes was elected university president. The university reopened in
1866 and operated for six months downtown in the Deaf and Dumb Asylum while repairs began at the damaged
campus. A petition to the federal war department for monetary compensation for campus damage done by the
Union Army undoubtedly received more favorable consideration because of Humes’s known Union loyalty
throughout the war. A Senate committee which considered the bill for damages also noted that East
Tennessee University was “particularly deserving of the favorable consideration of Congress” because it
was “the only educational institution of known loyalty…in any of the seceding states.” However in 1873,
President Ulysses S. Grant vetoed the bill that would have provided $18,500 to the university because he
felt it would set a bad precedent. The bill was redrafted specifying that the payment was compensation
for aid East Tennessee University gave to the Union during the war. On June 22, 1874, President Grant
signed the new bill and the trustees accepted the funds the same day with an agreement to release the
government from all claims. (More than a century and a half later, a buried Union trench was located in
2019 on the north side of the present-day McClung Museum with the use of ground-penetrating radar.)
'''
You could use this pattern:
'\w*\d+\w*'
How does it work:
\w* matches zero or more word characters (letters, digits or underscore; not spaces or punctuation)
\d+ matches one or more digits
\w* matches zero or more word characters again
Using re.findall:
re.findall(r'\w*\d+\w*', your_text)
we get:
['1861',
'1862',
'1862',
'1863',
'29',
'1863',
'1865',
'1866',
'1873',
'18',
'500',
'22',
'1874',
'2019']
Is this what you mean?
re.findall(r"\S*\d+\S*", text)
\S any non-whitespace character,
\d any digit,
+ one or more occurrences,
* zero or more occurrences
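The two patterns suggested above behave differently around punctuation, which explains why "$18,500" shows up as '18' and '500' in the first answer's output. The sketch below compares them on a short made-up sentence (the sentence is an assumption, loosely based on the question's text).

```python
import re

sentence = "In 1873 a bill for $18,500 was vetoed; on June 22, 1874, a new one passed."

# \w stops at punctuation, so "$18,500" is split at the comma.
word_tokens = re.findall(r"\w*\d+\w*", sentence)

# \S only stops at whitespace, so attached punctuation is kept.
chunk_tokens = re.findall(r"\S*\d+\S*", sentence)

print(word_tokens)   # ['1873', '18', '500', '22', '1874']
print(chunk_tokens)  # ['1873', '$18,500', '22,', '1874,']
```

Which one is preferable depends on whether amounts such as "$18,500" should stay in one piece; with the `\S` version, trailing commas and periods may need stripping afterwards.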

BeautifulSoup get_text result as a string variable?

I need to collect text from a certain section of a wikipedia page, and put it into a single string variable. I can find the right text relatively easily, but I have no idea how to get it into a string variable.
My code so far:
with open('uottawa_wiki.html', 'rb') as infile:
    html_content = infile.read()
soup = BeautifulSoup(html_content, 'html.parser')
soup = soup.body
campus_subsection = soup.find(id='Campus')
campus_subsection_siblings = campus_subsection.find_parent().find_next_siblings()
for sibling in campus_subsection_siblings:
    if sibling.name == 'p':
        print(sibling.get_text())
    else:
        break
And this is what it outputs, which is perfect:
The university's main campus is situated within the neighbourhood of
Sandy Hill (Côte-de-Sable). The main campus is bordered to the north
by the ByWard Market district, to the east by Sandy Hill's residential
area, and to the southwest by Nicholas Street, which runs adjacent to
the Rideau Canal on the western half of the university. As of the
2010–2011 academic year, the main campus occupied 35.3 ha (87 acres),
though the university owns and manages other properties throughout the
city, raising the university's total extent to 42.5 ha (105
acres).[32] The main campus moved two times before settling in its
final location in 1856. When the institution was first founded, the
campus was located next to the Notre-Dame Cathedral Basilica. With
space a major issue in 1852, the campus moved to a location that is
now across from the National Gallery of Canada. In 1856, the
institution moved to its present location.[18]
The buildings at the university vary in age from 100 Laurier (1893) to
120 University (Faculty of Social Sciences, 2012).[33] In 2011 the
average age of buildings was 63.[32] In the 2011–2012 academic year,
the university owned and managed 30 main buildings, 806 research
laboratories, 301 teaching laboratories and 257 classrooms and seminar
rooms.[4][32] The main campus is divided between its older Sandy Hill
campus and its Lees campus, purchased in 2007. While Lees Campus is
not adjacent to Sandy Hill, it is displayed as part of the main campus
on school maps.[34] Lees campus, within walking distance of Sandy
Hill, was originally a satellite campus owned by Algonquin
College.[35]
However, I need all of this text, exactly as it is (line break and all) to be in a single string variable. I have NO CLUE how to do that.
Just append to a list in a loop and then "\n".join() it:
paragraphs = []
for sibling in campus_subsection_siblings:
    if sibling.name == 'p':
        paragraphs.append(sibling.get_text())
    else:
        break
full_text = "\n".join(paragraphs)

I want to write the regex expression for the following text

I am trying to parse the text of this format:
1ST Circuit U.S. District Court for NEW YORK SOUTHERN District Judge SMITH, JOHN T., JR
In the text, I want to capture:
Circuit name: In the example above, 1ST CIRCUIT. Circuit number can be between 1ST and 99TH. This information is not always there.
State name: In the text above, NEW YORK SOUTHERN. It can be at most three words. This information is not always there.
Title: It can be either District or Magistrate.
Last Name: Here, it is SMITH
Name: The name is JOHN T.,JR
To make my problem more clear, let me give two more examples of the text I want to parse.
15TH Circuit U.S. District Court for ALABAMA Magistrate Judge NEELY, CATHERINE
Magistrate Judge COOKE, THOMAS M
I have tried the following expression. It was able to capture the name of the judge but failed to capture the circuit and the state.
((?P<circuit>\d{1,2}\w{2} Circuit)?\s?(U\.S\. District Court for )?\s?(?P<state>\b[A-Z]*(\s[A-Z]*)\b)*)?.* (?<=Judge )(?P<lname>[A-Z]*), (?P<name>[A-Z,. ]*)( {1,2}\(.*\))?
Many thanks.
This should help.
import re
s = ["1ST Circuit U.S. District Court for NEW YORK SOUTHERN District Judge SMITH, JOHN T., JR", "15TH Circuit U.S. District Court for ALABAMA Magistrate Judge NEELY, CATHERINE"]
for sVal in s:
    m = re.search(r"((?P<circuit>\d*(ST|TH) Circuit)) U.S. District Court for (?P<state>\b[A-Z\s]*\b)(?P<title>(District|Magistrate)) Judge (?P<lname>[A-Z]*), (?P<name>.*$)", sVal)
    if m:
        for i in ["circuit", "state", "title", "lname", "name"]:
            print(m.group(i))
    print("-----")
Output:
1ST Circuit
NEW YORK SOUTHERN
District
SMITH
JOHN T., JR
-----
15TH Circuit
ALABAMA
Magistrate
NEELY
CATHERINE
-----
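The pattern above requires the circuit and court prefix, so it would not match the third example from the question ("Magistrate Judge COOKE, THOMAS M"), and `(ST|TH)` misses ordinals such as 2ND or 3RD. A possible variant that makes both leading parts optional is sketched below. `JUDGE_RE` and `parse_judge` are hypothetical names, and the sketch assumes last names are plain uppercase letters and state names are one to three uppercase words.

```python
import re

JUDGE_RE = re.compile(
    r"^(?:(?P<circuit>\d{1,2}(?:ST|ND|RD|TH) Circuit)\s+)?"                     # optional circuit
    r"(?:U\.S\. District Court for (?P<state>[A-Z]+(?: [A-Z]+){0,2})\s+)?"      # optional state, 1-3 words
    r"(?P<title>District|Magistrate) Judge "                                    # title
    r"(?P<lname>[A-Z]+), (?P<name>.+)$"                                         # last name, rest of name
)

def parse_judge(line):
    """Return a dict of the named groups, or None if the line does not match."""
    m = JUDGE_RE.match(line.strip())
    return m.groupdict() if m else None

examples = [
    "1ST Circuit U.S. District Court for NEW YORK SOUTHERN District Judge SMITH, JOHN T., JR",
    "15TH Circuit U.S. District Court for ALABAMA Magistrate Judge NEELY, CATHERINE",
    "Magistrate Judge COOKE, THOMAS M",
]
for line in examples:
    print(parse_judge(line))
```

Groups that are absent from a line come back as `None`, so the caller can tell a missing circuit or state apart from an empty one.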

Cleaning Wikipedia content with python

Hi, I have used a Python library to collect data on a topic. For example, I chose the topic New York and retrieved the content with the following code:
import wikipedia
f2 = open('newyork', 'w')
ny = wikipedia.page("New York")
f2.write(ny.content.encode('utf8')+"\n")
I am able to extract the information in the format below:
New York is a state in the Northeastern United States and is the 27th-most extensive, fourth-most populous, and seventh-most densely populated U.S. state. New York is bordered by New Jersey and Pennsylvania to the south and Connecticut, Massachusetts, and Vermont to the east. The state has a maritime border in the Atlantic Ocean with Rhode Island, east of Long Island, as well as an international border with the Canadian provinces of Quebec to the north and Ontario to the west and north. The state of New York, with an estimated 19.8 million residents in 2015, is often referred to as New York State to distinguish it from New York City, the state's most populous city and its economic hub.
With an estimated population of 8.55 million in 2015, New York City is the most populous city in the United States and the premier gateway for legal immigration to the United States. The New York City Metropolitan Area is one of the most populous urban agglomerations in the world. New York City is a global city, exerting a significant impact upon commerce, finance, media, art, fashion, research, technology, education, and entertainment, its fast pace defining the term New York minute. The home of the United Nations Headquarters, New York City is an important center for international diplomacy and has been described as the cultural and financial capital of the world, as well as the world's most economically powerful city. New York City makes up over 40% of the population of New York State. Two-thirds of the state's population lives in the New York City Metropolitan Area, and nearly 40% live on Long Island. Both the state and New York City were named for the 17th century Duke of York, future King James II of England. The next four most populous cities in the state are Buffalo, Rochester, Yonkers, and Syracuse, while the state capital is Albany.
The earliest Europeans in New York were French colonists and Jesuit missionaries who arrived southward from settlements at Montreal for trade and proselytizing. New York had been inhabited by tribes of Algonquian and Iroquoian-speaking Native Americans for several hundred years by the time Dutch settlers moved into the region in the early 17th century. In 1609, the region was first claimed by Henry Hudson for the Dutch, who built Fort Nassau in 1614 at the confluence of the Hudson and Mohawk rivers, where the present-day capital of Albany later developed. The Dutch soon also settled New Amsterdam and parts of the Hudson Valley, establishing the colony of New Netherland, a multicultural community from its earliest days and a center of trade and immigration. The British annexed the colony from the Dutch in 1664. The borders of the British colony, the Province of New York, were similar to those of the present-day state.
Many landmarks in New York are well known to both international and domestic visitors, with New York State hosting four of the world's ten most-visited tourist attractions in 2013: Times Square, Central Park, Niagara Falls (shared with Ontario), and Grand Central Terminal. New York is home to the Statue of Liberty, a symbol of the United States and its ideals of freedom, democracy, and opportunity. In the 21st century, New York has emerged as a global node of creativity and entrepreneurship, social tolerance, and environmental sustainability. New York's higher education network comprises approximately 200 colleges and universities, including Columbia University, Cornell University, New York University, and Rockefeller University, which have been ranked among the top 35 in the world.
== History ==
=== 16th century ===
In 1524, Giovanni da Verrazzano, an Italian explorer in the service of the French crown, explored the Atlantic coast of North America between the Carolinas and Newfoundland, including New York Harbor and Narragansett Bay. On April 17, 1524 Verrazanno entered New York Bay, by way of the Strait now called the Narrows into the northern bay which he named Santa Margherita, in honour of the King of France's sister. Verrazzano described it as "a vast coastline with a deep delta in which every kind of ship could pass" and he adds: "that it extends inland for a league and opens up to form a beautiful lake. This vast sheet of water swarmed with native boats". He landed on the tip of Manhattan and perhaps on the furthest point of Long Island. Verrazanno's stay in this place was interrupted by a storm which pushed him north towards Martha's Vineyard.
In 1540 French traders from New France built a chateau on Castle Island, within present-day Albany; due to flooding, it was abandoned the next year. In 1614, the Dutch under the command of Hendrick Corstiaensen, rebuilt the French chateau, which they called Fort Nassau. Fort Nassau was the first Dutch settlement in North America, and was located along the Hudson River, also within present-day Albany. The small fort served as a trading post and warehouse. Located on the Hudson River flood plain, the rudimentary "fort" was washed away by flooding in 1617, and abandoned for good after Fort Orange (New Netherland) was built nearby in 1623.
=== 17th century ===
Henry Hudson's 1609 voyage marked the beginning of European involvement with the area. Sailing for the Dutch East India Company and looking for a passage to Asia, he entered the Upper New York Bay on September 11 of that year. Word of his findings encouraged Dutch merchants to explore the coast in search for profitable fur trading with local Native American tribes.
During the 17th century, Dutch trading posts established for the trade of pelts from the Lenape, Iroquois, and other tribes were founded in the colony of New Netherland. The first of these trading posts were Fort Nassau (1614, near present-day Albany); Fort Orange (1624, on the Hudson River just south of the current city of Albany and created to replace Fort Nassau), developing into settlement Beverwijck (1647), and into what became Albany; Fort Amsterdam (1625, to develop into the town New Amsterdam which is present-day New York City); and Esopus, (1653, now Kingston). The success of the patroonship of Rensselaerswyck (1630), which surrounded Albany and lasted until the mid-19th century, was also a key factor in the early success of the colony. The English captured the colony during the Second Anglo-Dutch War and governed it as the Province of New York. The city of New York was recaptured by the Dutch in 1673 during the Third Anglo-Dutch War (1672–1674) and renamed New Orange. It was returned to the English under the terms of the Treaty of Westminster a year later.
== References ==
== Further reading ==
French, John Homer (1860). Historical and statistical gazetteer of New York State. Syracuse, New York: R. Pearsall Smith. OCLC 224691273. (Full text via Google Books.)
New York State Historical Association (1940). New York: A Guide to the Empire State. New York City: Oxford University Press. ISBN 978-1-60354-031-5. OCLC 504264143. (Full text via Google Books.)
== External links ==
New York at DMOZ
Geographic data related to New York at OpenStreetMap
The Problems:
Problem 1:
I am having trouble removing all the content from the "References" and "Further reading" sections.
For example:
== History ==
some text under the section History
=== 17th century ===
some text under the section 17 century
=== 19th century ===
some text under the section 19 century
== References ==
some references
== Further reading ==
some further reading sources
Desired Result:
== History ==
some text under the section History
=== 17th century ===
some text under the section 17 century
=== 19th century ===
some text under the section 19 century
Problem 1B:
I will be getting the content of many topics, so there will be many reference sections to delete. How can I do it?
For example, I would like to delete all sections that begin with "References" and "Further reading":
== New York ==
== References ==
== Further reading ==
== California ==
== References ==
== Further reading ==
== Floria ==
== References ==
== Further reading ==
Desired Result:
== New York ==
== California ==
== Floria ==
Sorry for the long post and please forgive me as I have very little knowledge of python.
All advice and help is greatly appreciated.
Thank you.
Edit
Current Problem
Hi osantana,
I have tried the code that you have provided as shown below:
import wikipedia
import re
f2 = open('osantana', 'w')
ny = wikipedia.page("New York")
section_title_re = re.compile("^=+\s+.*\s+=+$")
raw_content = ny.content
content = []
skip = False
for l in raw_content.splitlines():
    line = l.strip()
    if "== References ==" in line.lower():
        skip = True  # replace with break if this is the last section
        continue
    if "== Further reading ==" in line.lower():
        skip = True  # replace with break if this is the last section
        continue
    if "== External links ==" in line.lower():
        skip = True  # replace with break if this is the last section
        continue
    if section_title_re.match(line):
        skip = False
        continue
    if skip:
        continue
    content.append(line)
content = '\n'.join(content) + '\n'
f2.write(content.encode('utf8')+"\n")
It works fine except for these sections:
Original File:
== References ==
Index of New York-related articles
Outline of New York – organized list of topics about New York
== Further reading ==
French, John Homer (1860). Historical and statistical gazetteer of New York State. Syracuse, New York: R. Pearsall Smith. OCLC 224691273. (Full text via Google Books.)
New York State Historical Association (1940). New York: A Guide to the Empire State. New York City: Oxford University Press. ISBN 978-1-60354-031-5. OCLC 504264143. (Full text via Google Books.)
Result of the code:
Index of New York-related articles
Outline of New York – organized list of topics about New York
French, John Homer (1860). Historical and statistical gazetteer of New York State. Syracuse, New York: R. Pearsall Smith. OCLC 224691273. (Full text via Google Books.)
New York State Historical Association (1940). New York: A Guide to the Empire State. New York City: Oxford University Press. ISBN 978-1-60354-031-5. OCLC 504264143. (Full text via Google Books.)
The headings were removed but the content is still intact.
I'll assume that References/Further reading are not the last sections in all pages. If those sections are always last, replace the lines marked with a comment below with a break statement.
import re
def parse(raw_content):
    section_title_re = re.compile(r"^=+\s+.*\s+=+$")
    content = []
    skip = False
    for l in raw_content.splitlines():
        line = l.strip()
        if "= references =" in line.lower():
            skip = True  # replace with break if this is the last section
            continue
        if "= further reading =" in line.lower():
            skip = True  # replace with break if this is the last section
            continue
        if section_title_re.match(line):
            skip = False
            continue
        if skip:
            continue
        content.append(line)
    return '\n'.join(content) + '\n'
print(parse(ny.content))
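Two details are worth noting. In the asker's version, a mixed-case literal such as "== References ==" is compared against line.lower(), so it can never match, which is why the section bodies survived. In the loop above, the `continue` after `skip = False` also drops every remaining heading, whereas the desired result keeps them. The sketch below is one possible way to handle both; `drop_sections`, `SECTION_RE` and `UNWANTED` are hypothetical names, and the sample text is made up to mirror the question's format.

```python
import re

SECTION_RE = re.compile(r"^(=+)\s*(.*?)\s*=+$")
UNWANTED = {"references", "further reading", "external links"}

def drop_sections(raw_content, unwanted=UNWANTED):
    """Remove unwanted sections (heading, body and subsections); keep everything else."""
    kept = []
    skip_level = None  # heading depth of the section currently being skipped
    for line in raw_content.splitlines():
        m = SECTION_RE.match(line.strip())
        if m:
            level = len(m.group(1))
            # A heading at the same or a shallower depth ends the skipped section.
            if skip_level is not None and level <= skip_level:
                skip_level = None
            # Titles are compared in lower case so the match is case-insensitive.
            if skip_level is None and m.group(2).lower() in unwanted:
                skip_level = level
        if skip_level is None:
            kept.append(line)
    return "\n".join(kept) + "\n"

sample = (
    "== History ==\nsome text\n=== 17th century ===\nmore text\n"
    "== References ==\nsome references\n== Further reading ==\nsome sources\n"
    "== California ==\nother text"
)
cleaned = drop_sections(sample)
print(cleaned)
```

Tracking the heading depth means a subsection nested under "References" is skipped too, rather than switching output back on.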
For problem 2 you could do something like this
contents = re.sub('=+\s*.+\s*=+', '', contents)
Just remember to import re, the regular expressions module.
The method being used is re.sub(pattern, repl, string). pattern is a regular expression pattern (the re documentation provides an overview of the syntax).
repl is what you want to replace all occurrences of the pattern with. In this case you want to remove the pattern, so just use an empty string as the replacement.
string is of course the string you're performing the substitution on. This method returns the final result, so if you want to overwrite the original string, just assign the returned value back to the input string.
Here's the pattern I used explained just in case. '=+\s*.+\s*=+' means any part of the string where there is one or more equal sign (=+), followed by zero or more spaces (\s*), followed by one or more of any character (.+), followed again by zero or more spaces (\s*), finally ending with one or more equal signs (=+).
For problem 1 I'd say you could probably accomplish what you want to using regular expressions as well, and the re module makes it pretty easy. The link I gave above should help.
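To make the re.sub explanation concrete, here is a small sketch on a made-up snippet of Wikipedia-style text (the snippet is an assumption). It also shows an anchored variant using re.MULTILINE, which only touches lines that consist entirely of a heading and removes the line break along with it; that variant is an alternative for illustration, not the pattern from the answer.

```python
import re

text = "== History ==\nSome text\n=== 17th century ===\nMore text"

# Pattern from the answer: removes the headings but leaves empty lines behind.
stripped = re.sub(r"=+\s*.+\s*=+", "", text)

# Anchored alternative: match whole heading lines only, including the newline.
anchored = re.sub(r"^=+\s*.+?\s*=+$\n?", "", text, flags=re.MULTILINE)

print(repr(stripped))  # '\nSome text\n\nMore text'
print(repr(anchored))  # 'Some text\nMore text'
```

The unanchored pattern can also match inside ordinary prose that happens to contain two equal signs on one line, which is another reason to prefer anchoring when the input is not known in advance.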
import re
import wikipedia
def clean_data(f):
    # Decorator: call f(word), then clean up the text it returns.
    def inner(word):
        text=f(word)
        text=text.encode("utf-8",errors='ignore').decode("utf-8")
        text=re.sub("https?:.*(?=\s)",'',text)
        text=re.sub("[’‘\"]","'",text)
        text=re.sub("[^\x00-\x7f]+",'',text)
        text=re.sub('[#&\\*+/<>#[\]^`{|}~ \t\n\r]',' ',text)
        text=re.sub('\(.*?\)','',text)
        text=re.sub('\=\=.*?\=\=','',text)
        text=re.sub(' , ',',',text)
        text=re.sub(' \.','.',text)
        text=re.sub(" +",' ',text)
        text=re.sub(";",'and',text)
        return text.strip()
    return inner
#clean_data
def get_data(word):
    try:
        # Use the requested word rather than a hard-coded topic.
        data = wikipedia.summary(word,sentences=300)
    except wikipedia.DisambiguationError as e:
        print("picking the data from:",e.options[:3])
        data=''.join([wikipedia.summary(s,sentences=100) for s in e.options[:3]])
    return data
data=get_data("Orange")

How do I split a string BEFORE a certain character?

timeline = '''
1961 - Second child, John, is born. 1963 - Is ordained as a Presbyterian minister. Makes his on-camera debut as host of a series of 15-minute episodes for children produced in Toronto. The program is titled "Mister Rogers." 1964 - Returns to Pittsburgh
and turns his 15-minute show into the half-hour "Mister Rogers' Neighborhood." 1968 - "Mister Rogers' Neighborhood" debuts on PBS and wins the first of its several Emmy Awards. Rogers is appointed chairman of the Forum on Mass Media and Child Development
of the White House Conference on Youth. 1969 - Wins the first of his two George Foster Peabody Awards for television excellence. Fred Rogers readies the opening pitch to start the Pirates' season in 1988. (Post-Gazette archives) 1971 - Forms his production
company, Family Communications Inc. 1975 - Ceases production on "Mister Rogers' Neighborhood," which continues to air on PBS in reruns. 1978 - Creates the PBS series "Old Friends, New Friends," focusing on older people. 1979 - Production resumes on
"Mister Rogers' Neighborhood." 1981 - Eddie Murphy plays an inner-city children's show host in his "Mister Robinson's Neighborhood" sketch on "Saturday Night Live," which is probably the most famous of the many "Mister Rogers" spoofs. 1984 - The Smithsonian
Institution in Washington makes Fred Rogers' trademark sweater part of its permanent collection. Rogers forces the fast-food chain Burger King to remove ads featuring a Mister Rogers lookalike. 1987 - Appears on a children's TV show in the Soviet
Union and reciprocates by having a Soviet host appear on his show later in the year. 1989 - "Mister Rogers' Neighborhood of Make Believe" opens as an attraction at Idlewild Park. 1990 - Sues the Ku Klux Klan and forces the organization to stop playing
racist telephone recordings featuring imitations of Fred Rogers' voice, speech patterns and theme song. 1991 - Tapes spots for PBS on the eve of the Gulf War to reassure children that they will be all right if hostilities occur. 1996 - TV Guide names
Fred Rogers one of the 50 greatest TV stars of all time. Rogers makes his only appearance as someone other than himself when he takes a cameo role as a preacher on the drama series "Dr. Quinn, Medicine Woman," which he says is one of his favorite
shows. 1997 - Is honored for lifetime achievement by the National Academy of Television Arts and Sciences (who hand out the Emmy Awards) and by the Television Critics Association. Named Pittsburgher of the Year by Pittsburgh magazine. 1998 - Gets
a star on the Hollywood Walk of Fame. The Pittsburgh Children's Museum opens a Mister Rogers exhibit. 1999 - Is inducted into the Television Hall of Fame. 2000 - Unveils a planetarium show featuring "Neighborhood" characters at the Carnegie Science
Center; ceases production on "Mister Rogers' Neighborhood." 2001 - Last original episodes of "Mister Rogers' Neighborhood" airs on PBS. 2002 - In July, is presented the Presidential Medal of Freedom, the nation's highest civilian honor. Filmed PBS
public service announcements in August encouraging parents to read to their children and giving them advice on how to handle the first anniversary of the Sept. 11, 2001, terrorist attacks. Is named grand marshal of the Tournament of Roses Parade in
Pasadena, Calif., with Bill Cosby and Art Linkletter. The theme: "Children's Wishes, Dreams and Imagination." 2003 - Dies of stomach cancer at age 74.
'''
My goal is to make this a dictionary where the keys are the years and the values are the events on the timeline. However, I'm having a little trouble.
I want to first split the timeline into a list like so: {key, value, key, value} and then turn it into a dict. I want to split the string 5 or 6 characters before " - " in order to accomplish this, but I have no idea how to do that. Can anyone help me?
EDIT:
Came up with this possible solution -
timeline = timeline.replace("\n","")
timeline = timeline.replace("\\","")
timeline = timeline.replace("19","zz19")
timeline = timeline.replace("20","zz20")
timeline = timeline.split(" - ")
So basically I end up with this:
['zz1961', 'Second child, John, is born. zz1963', 'Is ordained as a Presbyterian minister. Makes his on-camera debut as host of a series of 15-minute episodes for children produced in Toronto. The program is titled "Mister Rogers." zz1964', 'Returns to Pittsburgh and turns his 15-minute show into the half-hour "Mister Rogers\' Neighborhood." zz1968', '"Mister Rogers\' Neighborhood" debuts on PBS and wins the first of its several Emmy Awards. Rogers is appointed chairman of the Forum on Mass Media and Child Development of the White House Conference on Youth. zz1969', "Wins the first of his two George Foster Peabody Awards for television excellence. Fred Rogers readies the opening pitch to start the Pirates' season in zz1988. (Post-Gazette archives) zz1971", 'Forms his production company, Family Communications Inc. zz1975', 'Ceases production on "Mister Rogers\' Neighborhood," which continues to air on PBS in reruns. zz1978', 'Creates the PBS series "Old Friends, New Friends," focusing on older people. zz1979', 'Production resumes on "Mister Rogers\' Neighborhood." zz1981', 'Eddie Murphy plays an inner-city children\'s show host in his "Mister Robinson\'s Neighborhood" sketch on "Saturday Night Live," which is probably the most famous of the many "Mister Rogers" spoofs. zz1984', "The Smithsonian Institution in Washington makes Fred Rogers' trademark sweater part of its permanent collection. Rogers forces the fast-food chain Burger King to remove ads featuring a Mister Rogers lookalike. zz1987", "Appears on a children's TV show in the Soviet Union and reciprocates by having a Soviet host appear on his show later in the year. zz1989", '"Mister Rogers\' Neighborhood of Make Believe" opens as an attraction at Idlewild Park. zz1990', "Sues the Ku Klux Klan and forces the organization to stop playing racist telephone recordings featuring imitations of Fred Rogers' voice, speech patterns and theme song. 
zz1991", 'Tapes spots for PBS on the eve of the Gulf War to reassure children that they will be all right if hostilities occur. zz1996', 'TV Guide names Fred Rogers one of the 50 greatest TV stars of all time. Rogers makes his only appearance as someone other than himself when he takes a cameo role as a preacher on the drama series "Dr. Quinn, Medicine Woman," which he says is one of his favorite shows. zz1997', 'Is honored for lifetime achievement by the National Academy of Television Arts and Sciences (who hand out the Emmy Awards) and by the Television Critics Association. Named Pittsburgher of the Year by Pittsburgh magazine. zz1998', "Gets a star on the Hollywood Walk of Fame. The Pittsburgh Children's Museum opens a Mister Rogers exhibit. zz1999", 'Is inducted into the Television Hall of Fame. zz2000', 'Unveils a planetarium show featuring "Neighborhood" characters at the Carnegie Science Center; ceases production on "Mister Rogers\' Neighborhood." zz2001', 'Last original episodes of "Mister Rogers\' Neighborhood" airs on PBS. zz2002', 'In July, is presented the Presidential Medal of Freedom, the nation\'s highest civilian honor. Filmed PBS public service announcements in August encouraging parents to read to their children and giving them advice on how to handle the first anniversary of the Sept. 11, zz2001, terrorist attacks. Is named grand marshal of the Tournament of Roses Parade in Pasadena, Calif., with Bill Cosby and Art Linkletter. The theme: "Children\'s Wishes, Dreams and Imagination." zz2003', 'Dies of stomach cancer at age 74.']
So I want to split this by "zz". The only problem is that I can't split it a second time, because it's already a list. Obviously I could join the list back into a string, but then I would just run into the same problem as now.
Tried doing this:
for i in timeline:
    i = i.split("zz")
But it doesn't work :(
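A likely reason the loop has no effect: assigning to `i` only rebinds the loop variable, so the list itself is never changed. A minimal sketch, using a shortened, made-up sample in place of the real list, that collects the split pieces in a new list instead:

```python
# Shortened, made-up sample standing in for the real list.
timeline = ['zz1961', 'Second child, John, is born. zz1963', 'Is ordained as a minister.']

# Rebinding the loop variable leaves the list untouched.
for i in timeline:
    i = i.split("zz")

# Building a new list keeps the pieces produced by each split.
pieces = [part for item in timeline for part in item.split("zz")]
print(pieces)
```

Note that the first item starts with "zz", so the flattened list begins with an empty string that would still need to be dropped.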
You could make the dictionary like this:
parts = timeline.replace('\n','').split(" - ")
d = dict()
for x in range(len(parts) - 1):
    d[parts[x][-4:]] = parts[x + 1][:len(parts[x+1]) - 4]
Your delimiter is clearly " - ", so splitting on it gives you a list of items like:
'1961'
'blablablablabla. 1963'
'blablablablabla. 1968'
'blablablablabla. 1969'
'blablablablabla'
Then you only need to pair the last four characters of each item (the year) with the next item minus its own last four characters. A simple split and slice is enough.
Using this to show the result:
for a in d:
    print(a + " " + d[a])
This is what I got:
1975 Ceases production on "Mister Rogers' Neighborhood," which continues to air on PBS in reruns.
1968 "Mister Rogers' Neighborhood" debuts on PBS and wins the first of its several Emmy Awards. Rogers is appointed chairman of the Forum on Mass Media and Child Development of the White House Conference on Youth.
2001 Last original episodes of "Mister Rogers' Neighborhood" airs on PBS.
1990 Sues the Ku Klux Klan and forces the organization to stop playing racist telephone recordings featuring imitations of Fred Rogers' voice, speech patterns and theme song.
2002 In July, is presented the Presidential Medal of Freedom, the nation's highest civilian honor. Filmed PBS public service announcements in August encouraging parents to read to their children and giving them advice on how to handle the first anniversary of the Sept. 11, 2001, terrorist attacks. Is named grand marshal of the Tournament of Roses Parade in Pasadena, Calif., with Bill Cosby and Art Linkletter. The theme: "Children's Wishes, Dreams and Imagination."
1989 "Mister Rogers' Neighborhood of Make Believe" opens as an attraction at Idlewild Park.
1996 TV Guide names Fred Rogers one of the 50 greatest TV stars of all time. Rogers makes his only appearance as someone other than himself when he takes a cameo role as a preacher on the drama series "Dr. Quinn, Medicine Woman," which he says is one of his favorite shows.
1987 Appears on a children's TV show in the Soviet Union and reciprocates by having a Soviet host appear on his show later in the year.
1963 Is ordained as a Presbyterian minister. Makes his on-camera debut as host of a series of 15-minute episodes for children produced in Toronto. The program is titled "Mister Rogers."
1984 The Smithsonian Institution in Washington makes Fred Rogers' trademark sweater part of its permanent collection. Rogers forces the fast-food chain Burger King to remove ads featuring a Mister Rogers lookalike.
1971 Forms his production company, Family Communications Inc.
1991 Tapes spots for PBS on the eve of the Gulf War to reassure children that they will be all right if hostilities occur.
1997 Is honored for lifetime achievement by the National Academy of Television Arts and Sciences (who hand out the Emmy Awards) and by the Television Critics Association. Named Pittsburgher of the Year by Pittsburgh magazine.
1999 Is inducted into the Television Hall of Fame.
1964 Returns to Pittsburgh and turns his 15-minute show into the half-hour "Mister Rogers' Neighborhood."
2000 Unveils a planetarium show featuring "Neighborhood" characters at the Carnegie Science Center; ceases production on "Mister Rogers' Neighborhood."
1981 Eddie Murphy plays an inner-city children's show host in his "Mister Robinson's Neighborhood" sketch on "Saturday Night Live," which is probably the most famous of the many "Mister Rogers" spoofs.
1978 Creates the PBS series "Old Friends, New Friends," focusing on older people.
1969 Wins the first of his two George Foster Peabody Awards for television excellence. Fred Rogers readies the opening pitch to start the Pirates' season in 1988. (Post-Gazette archives)
1961 Second child, John, is born.
2003 Dies of stomach cancer at age
1998 Gets a star on the Hollywood Walk of Fame. The Pittsburgh Children's Museum opens a Mister Rogers exhibit.
1979 Production resumes on "Mister Rogers' Neighborhood."
Note that in Python versions before 3.7 a plain dict does not guarantee insertion order. Use collections.OrderedDict if you need the entries to be ordered.
An ordered dictionary is built just as easily:
import collections
parts = timeline.replace('\n','').split(" - ")
od = collections.OrderedDict()
for x in range(len(parts) - 1):
    od[parts[x][-4:]] = parts[x + 1][:len(parts[x+1]) - 4]
And using this:
for a in od:
    print(a + " " + od[a])
The result will be ordered:
1961 Second child, John, is born.
1963 Is ordained as a Presbyterian minister. Makes his on-camera debut as host of a series of 15-minute episodes for children produced in Toronto. The program is titled "Mister Rogers."
1964 Returns to Pittsburgh and turns his 15-minute show into the half-hour "Mister Rogers' Neighborhood."
1968 "Mister Rogers' Neighborhood" debuts on PBS and wins the first of its several Emmy Awards. Rogers is appointed chairman of the Forum on Mass Media and Child Development of the White House Conference on Youth.
1969 Wins the first of his two George Foster Peabody Awards for television excellence. Fred Rogers readies the opening pitch to start the Pirates' season in 1988. (Post-Gazette archives)
1971 Forms his production company, Family Communications Inc.
1975 Ceases production on "Mister Rogers' Neighborhood," which continues to air on PBS in reruns.
1978 Creates the PBS series "Old Friends, New Friends," focusing on older people.
1979 Production resumes on "Mister Rogers' Neighborhood."
1981 Eddie Murphy plays an inner-city children's show host in his "Mister Robinson's Neighborhood" sketch on "Saturday Night Live," which is probably the most famous of the many "Mister Rogers" spoofs.
1984 The Smithsonian Institution in Washington makes Fred Rogers' trademark sweater part of its permanent collection. Rogers forces the fast-food chain Burger King to remove ads featuring a Mister Rogers lookalike.
1987 Appears on a children's TV show in the Soviet Union and reciprocates by having a Soviet host appear on his show later in the year.
1989 "Mister Rogers' Neighborhood of Make Believe" opens as an attraction at Idlewild Park.
1990 Sues the Ku Klux Klan and forces the organization to stop playing racist telephone recordings featuring imitations of Fred Rogers' voice, speech patterns and theme song.
1991 Tapes spots for PBS on the eve of the Gulf War to reassure children that they will be all right if hostilities occur.
1996 TV Guide names Fred Rogers one of the 50 greatest TV stars of all time. Rogers makes his only appearance as someone other than himself when he takes a cameo role as a preacher on the drama series "Dr. Quinn, Medicine Woman," which he says is one of his favorite shows.
1997 Is honored for lifetime achievement by the National Academy of Television Arts and Sciences (who hand out the Emmy Awards) and by the Television Critics Association. Named Pittsburgher of the Year by Pittsburgh magazine.
1998 Gets a star on the Hollywood Walk of Fame. The Pittsburgh Children's Museum opens a Mister Rogers exhibit.
1999 Is inducted into the Television Hall of Fame.
2000 Unveils a planetarium show featuring "Neighborhood" characters at the Carnegie Science Center; ceases production on "Mister Rogers' Neighborhood."
2001 Last original episodes of "Mister Rogers' Neighborhood" airs on PBS.
2002 In July, is presented the Presidential Medal of Freedom, the nation's highest civilian honor. Filmed PBS public service announcements in August encouraging parents to read to their children and giving them advice on how to handle the first anniversary of the Sept. 11, 2001, terrorist attacks. Is named grand marshal of the Tournament of Roses Parade in Pasadena, Calif., with Bill Cosby and Art Linkletter. The theme: "Children's Wishes, Dreams and Imagination."
2003 Dies of stomach cancer at age
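One caveat is visible in the output above: the 2003 entry stops at "age". The slice removes the last four characters from every part, including the final one, which has no trailing year. A minimal sketch of one way around that, assuming the same " - " layout; the sample string and the name `build_timeline` are made up for illustration:

```python
from collections import OrderedDict

def build_timeline(text):
    """Pair each year with the text that follows it (hypothetical helper)."""
    parts = text.replace('\n', '').split(" - ")
    od = OrderedDict()
    last = len(parts) - 1
    for x in range(last):
        year = parts[x][-4:]
        nxt = parts[x + 1]
        # Only cut off a trailing year when another entry follows.
        event = nxt if x + 1 == last else nxt[:-4]
        od[year] = event.strip()
    return od

# Shortened, made-up sample in the same "year - text" layout.
sample = "1961 - Second child, John, is born. 1963 - Is ordained. 2003 - Dies of stomach cancer at age 74."
result = build_timeline(sample)
for year, event in result.items():
    print(year + " " + event)
```

The `strip()` call also removes the space left behind where each trailing year was cut off.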
Let's split using regex!
import re
stripped_string = timeline.replace('\n', '').strip() # remove newlines and surrounding whitespace
splitted = re.split(r'(\d{4}) - ', stripped_string) # split to ['', year1, event1, year2, event2, ...]
lst = splitted[1:] # drop the empty string before the first year
keys = lst[0::2] # get years
values = lst[1::2] # get events
dic = dict(zip(keys, values))
You could make the pattern more specific. For example, if you know that the years are all 19xx or 20xx, then:
splitted = re.split(r'((?:19|20)\d\d) - ', stripped_string)
will do as well.
Also, since you have the keys and values as separate lists, further cleanup is easy, such as stripping whitespace from each element of values. For example:
values = [event.strip() for event in lst[1::2]]
Here's the result:
>>> print("\n".join("{}: {}".format(k, v) for k, v in dic.items()))
1989: "Mister Rogers' Neighborhood of Make Believe" opens as an attraction at Idlewild Park.
1987: Appears on a children's TV show in the Soviet Union and reciprocates by having a Soviet host appear on his show later in the year.
1984: The Smithsonian Institution in Washington makes Fred Rogers' trademark sweater part of its permanent collection. Rogers forces the fast-food chain Burger King to remove ads featuring a Mister Rogers lookalike.
1968: "Mister Rogers' Neighborhood" debuts on PBS and wins the first of its several Emmy Awards. Rogers is appointed chairman of the Forum on Mass Media and Child Development of the White House Conference on Youth.
1969: Wins the first of his two George Foster Peabody Awards for television excellence. Fred Rogers readies the opening pitch to start the Pirates' season in 1988. (Post-Gazette archives)
1981: Eddie Murphy plays an inner-city children's show host in his "Mister Robinson's Neighborhood" sketch on "Saturday Night Live," which is probably the most famous of the many "Mister Rogers" spoofs.
1964: Returns to Pittsburgh and turns his 15-minute show into the half-hour "Mister Rogers' Neighborhood."
1961: Second child, John, is born.
1963: Is ordained as a Presbyterian minister. Makes his on-camera debut as host of a series of 15-minute episodes for children produced in Toronto. The program is titled "Mister Rogers."
1991: Tapes spots for PBS on the eve of the Gulf War to reassure children that they will be all right if hostilities occur.
1990: Sues the Ku Klux Klan and forces the organization to stop playing racist telephone recordings featuring imitations of Fred Rogers' voice, speech patterns and theme song.
1997: Is honored for lifetime achievement by the National Academy of Television Arts and Sciences (who hand out the Emmy Awards) and by the Television Critics Association. Named Pittsburgher of the Year by Pittsburgh magazine.
1979: Production resumes on "Mister Rogers' Neighborhood."
1996: TV Guide names Fred Rogers one of the 50 greatest TV stars of all time. Rogers makes his only appearance as someone other than himself when he takes a cameo role as a preacher on the drama series "Dr. Quinn, Medicine Woman," which he says is one of his favorite shows.
1999: Is inducted into the Television Hall of Fame.
1998: Gets a star on the Hollywood Walk of Fame. The Pittsburgh Children's Museum opens a Mister Rogers exhibit.
1975: Ceases production on "Mister Rogers' Neighborhood," which continues to air on PBS in reruns.
1978: Creates the PBS series "Old Friends, New Friends," focusing on older people.
1971: Forms his production company, Family Communications Inc.
2002: In July, is presented the Presidential Medal of Freedom, the nation's highest civilian honor. Filmed PBS public service announcements in August encouraging parents to read to their children and giving them advice on how to handle the first anniversary of the Sept. 11, 2001, terrorist attacks. Is named grand marshal of the Tournament of Roses Parade in Pasadena, Calif., with Bill Cosby and Art Linkletter. The theme: "Children's Wishes, Dreams and Imagination."
2003: Dies of stomach cancer at age 74.
2000: Unveils a planetarium show featuring "Neighborhood" characters at the Carnegie Science Center; ceases production on "Mister Rogers' Neighborhood."
2001: Last original episodes of "Mister Rogers' Neighborhood" airs on PBS.
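The regex approach can be tried end to end on a shortened sample. This is only a sketch: the sample string is made up, and it assumes every entry starts with a 19xx or 20xx year followed by " - ".

```python
import re

# Shortened, made-up sample with stray whitespace and a newline.
sample = " 1961 - Second child, John, is born. 1963 - Is ordained.\n 2003 - Dies of stomach cancer at age 74. "

stripped = sample.replace('\n', '').strip()
# The capturing group makes re.split keep each year in the result.
pieces = re.split(r'((?:19|20)\d\d) - ', stripped)
years = pieces[1::2]                                 # odd positions hold the years
events = [event.strip() for event in pieces[2::2]]   # even positions (after the first) hold the events
timeline_by_year = dict(zip(years, events))
print(timeline_by_year)
```

A year mentioned inside an event, such as "in 1988.", is left alone because it is not followed by " - ".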
