How to improve NLTK sentence segmentation? - python

Given the paragraph from Wikipedia:
An ambitious campus expansion plan was proposed by Fr. Vernon F.
Gallagher in 1952. Assumption Hall, the first student dormitory, was
opened in 1954, and Rockwell Hall was dedicated in November 1958,
housing the schools of business and law. It was during the tenure of
F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to
action.
I run NLTK's nltk.sent_tokenize to get the sentences. This returns:
['An ambitious campus expansion plan was proposed by Fr.',
'Vernon F. Gallagher in 1952.',
'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.',
'It was during the tenure of Fr.',
"Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."
]
While NLTK could handle F. Henry J. McAnulty as one entity,
it failed for Fr. Vernon F. Gallagher, and this broke the sentence into two.
The correct tokenization should be:
[
'An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.',
'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.',
"It was during the tenure of Fr. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."
]
How can I improve the tokenizer performance?

The nice thing about the Kiss and Strunk (2006) Punkt algorithm is that it is unsupervised. So given a new text, you can retrain the model on that text and then apply the retrained model to it, e.g.
>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
>>> text = "An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952. Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law. It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."
# Training a new model with the text.
>>> tokenizer = PunktSentenceTokenizer()
>>> tokenizer.train(text)
<nltk.tokenize.punkt.PunktParameters object at 0x106c5d828>
# It automatically learns the abbreviations.
>>> tokenizer._params.abbrev_types
{'f', 'fr', 'j'}
# Use the customized tokenizer.
>>> tokenizer.tokenize(text)
['An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.', 'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', "It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."]
When there is not enough data to generate good statistics by retraining the model, you can also supply a predetermined list of abbreviations before training; see "How to avoid NLTK's sentence tokenizer splitting on abbreviations?"
>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
>>> punkt_param = PunktParameters()
>>> abbreviation = ['f', 'fr', 'k']
>>> punkt_param.abbrev_types = set(abbreviation)
>>> tokenizer = PunktSentenceTokenizer(punkt_param)
>>> tokenizer.train(text)
<nltk.tokenize.punkt.PunktParameters object at 0x106c5d828>
>>> tokenizer.tokenize(text)
['An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.', 'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', "It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."]

Related

How can I use regular expressions to extract all words with at least one digit in text with Python

I am new to regular expressions and I have a text as follows. How can I use a regex to extract all words that contain at least one digit? I would really appreciate any help.
text = '''The start of the Civil War in 1861 followed by Tennessee’s secession from the Union and the lodging of
wounded Confederate soldiers on campus did not close East Tennessee University. By spring 1862 when the
trustees finally suspended operations, the majority of students had joined the military, President Joseph
Ridley had resigned, and two professors had left the university. Wounded Confederate soldiers were lodged
at university buildings after the January 1862 Battle of Mill Springs in Kentucky, known as the Battle of
Fishing Creek to the Confederacy. In the fall of 1863, Union troops forced the Confederates out of
Knoxville. On the Hill, the Union Army enclosed the three university buildings with an earthen
fortification they named Fort Byington in honor of an officer from Michigan who was killed in the defense
of Knoxville. They used the buildings for their headquarters, barracks, and a hospital for Black troops.
Despite a Confederate attempt to retake the city by siege—climaxed by a bloody, abortive attack on Fort
Sanders on November 29, 1863—the Union held and occupied Knoxville for the rest of the war. During the
battle, the Hill was hit with artillery fire from Confederate guns located in a trench at the site of
UT’s present-day Sorority Village. Campus also sustained a great deal of damage caused by the Union Army.
Troops denuded the grounds of trees, ruined the steward’s house, and destroyed the gymnasium with
misdirected cannon fire aimed at Confederate troops across the river. After the Civil War ended in 1865
and the Union Army left campus, Thomas Humes was elected university president. The university reopened in
1866 and operated for six months downtown in the Deaf and Dumb Asylum while repairs began at the damaged
campus. A petition to the federal war department for monetary compensation for campus damage done by the
Union Army undoubtedly received more favorable consideration because of Humes’s known Union loyalty
throughout the war. A Senate committee which considered the bill for damages also noted that East
Tennessee University was “particularly deserving of the favorable consideration of Congress” because it
was “the only educational institution of known loyalty…in any of the seceding states.” However in 1873,
President Ulysses S. Grant vetoed the bill that would have provided $18,500 to the university because he
felt it would set a bad precedent. The bill was redrafted specifying that the payment was compensation
for aid East Tennessee University gave to the Union during the war. On June 22, 1874, President Grant
signed the new bill and the trustees accepted the funds the same day with an agreement to release the
government from all claims. (More than a century and a half later, a buried Union trench was located in
2019 on the north side of the present-day McClung Museum with the use of ground-penetrating radar.)
'''
You could use this pattern:
r'\w*\d+\w*'
How it works:
\w* matches zero or more word characters (letters, digits, or underscore, but not spaces or punctuation)
\d+ matches one or more digits
\w* matches zero or more word characters again
Using re.findall:
import re
re.findall(r'\w*\d+\w*', your_text)
we get:
['1861',
'1862',
'1862',
'1863',
'29',
'1863',
'1865',
'1866',
'1873',
'18',
'500',
'22',
'1874',
'2019']
Is this what you mean?
re.findall(r"\S*\d+\S*", text)
\S any non-whitespace character,
\d any digit,
+ one or more occurrences,
* zero or more occurrences
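To see how the two patterns above differ, here is a small sketch on a made-up sentence (the sample string and the variable names are assumptions for illustration, not part of the question). Because a comma is not a word character, \w splits $18,500 into two matches, while \S keeps the attached punctuation, which can then be trimmed from both ends:

```python
import re
import string

# Hypothetical sample text, shaped like the input in the question.
sample = "In 1873 Grant vetoed the bill for $18,500; a 2nd draft passed on June 22, 1874."

# \w covers letters, digits and underscore, so punctuation ends a match.
word_matches = re.findall(r"\w*\d+\w*", sample)

# \S covers any non-whitespace character, so attached punctuation is kept.
chunk_matches = re.findall(r"\S*\d+\S*", sample)

# Trim leading/trailing punctuation but keep it inside the token (18,500).
trimmed = [chunk.strip(string.punctuation) for chunk in chunk_matches]

print(word_matches)   # ['1873', '18', '500', '2nd', '22', '1874']
print(chunk_matches)  # ['1873', '$18,500;', '2nd', '22,', '1874.']
print(trimmed)        # ['1873', '18,500', '2nd', '22', '1874']
```

Which variant fits depends on whether a number like 18,500 should count as one word or two.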

Python, keep only sentence from html and remove all non-alphabet, number

I got the HTML from a website and converted it to text.
How can I clean the text so that I keep only the sentences?
For example, I want to remove all irrelevant information such as 1990, ... himself, 1987, the 59th ...
and keep the sentences:
tom cruise is an american actor who has starred in many blockbuster movies and as of 2012 is the highest paid actor in hollywood. he is also a film producer and owns a production company. tom cruise has been the winner of three golden globe awards and has been nominated thrice for academy awards. apart from this, many of the movies cruise has starred in have been huge blockbusters on the box office.
after repeated success in many films, tom cruise kept going on with release of two mission impossible movies, war of the worlds which was a super duper box office hit and many more.
and so on.
1990
... himself
1987
the 59th annual academy awards
(tv special)
jack
/
maverick
/
vincent lauria
(uncredited)
related videos
none
none
none
see all 35 videos  »
#csm.csm_widget />
reality tv
the office
late night
sitcoms
music
rappers
action
religion
top paid
how much money does tom cruise make? (salary & net worth)
tom cruise is an american actor who has starred in many blockbuster movies and as of 2012 is the highest paid actor in hollywood. he is also a film producer and owns a production company. tom cruise has been the winner of three golden globe awards and has been nominated thrice for academy awards. apart from this, many of the movies cruise has starred in have been huge blockbusters on the box office.
history
thomas cruise mapother iv a.k.a tom cruise was born in syracuse, new york to mother mary lee and father thomas cruise mapother iii. cruise’s mother was a special education teacher and father was an electrical engineer. tom cruise is basically of irish, german and english origin. cruise’s family had the male domination of his abusive father whom cruise had once described as the merchant of chaos. he was often bullied and beaten by his father and cruise called him a coward. a part of tom cruise’s childhood was spent in canada. however, when cruise was in the sixth grade, his mother left his father and brought cruise and his siblings back to america.
acting career
acting career of tom cruise started quite early but with a small role in the movie endless love (1981). however, he got his big break as a supporting actor in the movie taps later that year. in 1983, his movies risky business and all the right moves along with top gun in 1986 paved the path for tom cruise as an established actor and a superstar. after this there was no looking back and tom cruise went to star in many super-successful movies like cocktail, rain man, days of thunder, interview with the vampire.
then in 1996, he starred as a superspy ethan hunt in the very popular and blockbuster movie which went on to be a series, mission: impossible. that same year he also was seen in the lead role of the movie jerry maguire and won a golden globe for the same. in 1999, his supporting role in the movie magnolia again won him his second golden globe.
after repeated success in many films, tom cruise kept going on with release of two mission impossible movies, war of the worlds which was a super duper box office hit and many more.
net worth
tom cruise’s films have gained $7.3 million worldwide as of 2013. however, the net worth of the highest paid actor in hollywood is $270 million and he still gets paychecks from his previous movies.
154 magazine cover photos
|
none »
official sites:
facebook
|
official site
|
none
»
alternate names:
tomu kurûzu
height:
5' 7" (1.7 m)
none
did you know?
personal quote:
(1992 quote) i really enjoy talking to other actors and directors. sometimes, if i see their movies, i'll call them up or write them a note saying, "i enjoyed it," or asking, "how did you do that? how did you make that work?". i just saw
The HTML text is stored in a variable called text:
sentence = re.sub(' ', '\n', text)
sentence = re.sub('none', '', sentence)
print(sentence)
The result: the sentences are destroyed.
ethan
hunt
/
ray
ferrier
(uncredited)
2006
the
late
late
show
with
craig
ferguson
(tv
series)
himself
-
episode
#2.140
(2006)
...
himself
(uncredited)
2006
getaway
(tv
series)
himself
-
seven
wonders
of
the
world
(2006)
...
himself
2006
cmt
insider
(tv
series)
himself
-
episode
dated
29
april
2006
(2006)
...
himself
2005-2006
corazón
de...
(tv
series)
himself
-
episode
dated
19
january
2006
(2006)
...
himself
-
episode
dated
15
november
2005
(2005)
...
himself
-
Try this:
^(\s*?\S*){5}$
With the multiline flag enabled, this pattern selects any line that has five words or fewer, so those lines can then be removed. You can raise or lower the word limit by changing the value in {5}.
Demo: https://regex101.com/r/z2qxrx/3
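The same "drop lines with five words or fewer" idea can also be sketched in Python without a regex, by counting words with str.split. This is a swapped-in technique rather than the pattern above, and the function name, the max_words parameter, and the sample text are all assumptions made for illustration:

```python
def drop_short_lines(text, max_words=5):
    """Return text without the lines that have max_words words or fewer."""
    kept = []
    for line in text.splitlines():
        # str.split() with no argument splits on any run of whitespace.
        if len(line.split()) > max_words:
            kept.append(line.strip())
    return "\n".join(kept)


# Hypothetical sample in the shape of the scraped page from the question.
sample = """1990
... himself
the 59th annual academy awards
tom cruise is an american actor who has starred in many blockbuster movies.
(tv special)
he is also a film producer and owns a production company."""

cleaned = drop_short_lines(sample)
print(cleaned)
```

A word-count threshold is only a heuristic: a very short sentence would be dropped too, and a long heading would be kept, so the limit may need tuning for a given page.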

How do I split a string BEFORE a certain character?

timeline = '''
1961 - Second child, John, is born. 1963 - Is ordained as a Presbyterian minister. Makes his on-camera debut as host of a series of 15-minute episodes for children produced in Toronto. The program is titled "Mister Rogers." 1964 - Returns to Pittsburgh
and turns his 15-minute show into the half-hour "Mister Rogers' Neighborhood." 1968 - "Mister Rogers' Neighborhood" debuts on PBS and wins the first of its several Emmy Awards. Rogers is appointed chairman of the Forum on Mass Media and Child Development
of the White House Conference on Youth. 1969 - Wins the first of his two George Foster Peabody Awards for television excellence. Fred Rogers readies the opening pitch to start the Pirates' season in 1988. (Post-Gazette archives) 1971 - Forms his production
company, Family Communications Inc. 1975 - Ceases production on "Mister Rogers' Neighborhood," which continues to air on PBS in reruns. 1978 - Creates the PBS series "Old Friends, New Friends," focusing on older people. 1979 - Production resumes on
"Mister Rogers' Neighborhood." 1981 - Eddie Murphy plays an inner-city children's show host in his "Mister Robinson's Neighborhood" sketch on "Saturday Night Live," which is probably the most famous of the many "Mister Rogers" spoofs. 1984 - The Smithsonian
Institution in Washington makes Fred Rogers' trademark sweater part of its permanent collection. Rogers forces the fast-food chain Burger King to remove ads featuring a Mister Rogers lookalike. 1987 - Appears on a children's TV show in the Soviet
Union and reciprocates by having a Soviet host appear on his show later in the year. 1989 - "Mister Rogers' Neighborhood of Make Believe" opens as an attraction at Idlewild Park. 1990 - Sues the Ku Klux Klan and forces the organization to stop playing
racist telephone recordings featuring imitations of Fred Rogers' voice, speech patterns and theme song. 1991 - Tapes spots for PBS on the eve of the Gulf War to reassure children that they will be all right if hostilities occur. 1996 - TV Guide names
Fred Rogers one of the 50 greatest TV stars of all time. Rogers makes his only appearance as someone other than himself when he takes a cameo role as a preacher on the drama series "Dr. Quinn, Medicine Woman," which he says is one of his favorite
shows. 1997 - Is honored for lifetime achievement by the National Academy of Television Arts and Sciences (who hand out the Emmy Awards) and by the Television Critics Association. Named Pittsburgher of the Year by Pittsburgh magazine. 1998 - Gets
a star on the Hollywood Walk of Fame. The Pittsburgh Children's Museum opens a Mister Rogers exhibit. 1999 - Is inducted into the Television Hall of Fame. 2000 - Unveils a planetarium show featuring "Neighborhood" characters at the Carnegie Science
Center; ceases production on "Mister Rogers' Neighborhood." 2001 - Last original episodes of "Mister Rogers' Neighborhood" airs on PBS. 2002 - In July, is presented the Presidential Medal of Freedom, the nation's highest civilian honor. Filmed PBS
public service announcements in August encouraging parents to read to their children and giving them advice on how to handle the first anniversary of the Sept. 11, 2001, terrorist attacks. Is named grand marshal of the Tournament of Roses Parade in
Pasadena, Calif., with Bill Cosby and Art Linkletter. The theme: "Children's Wishes, Dreams and Imagination." 2003 - Dies of stomach cancer at age 74.
'''
My goal is to make this a dictionary where the keys are the years and the values are the events on the timeline. However, I'm having a little trouble.
I want to first split the timeline into a list like so: {key, value, key, value} and then turn it into a dict. I want to split the string 5 or 6 characters before " - " in order to accomplish this, but I have no idea how to do that. Can anyone help me?
EDIT:
Came up with this possible solution -
timeline = timeline.replace("\n","")
timeline = timeline.replace("\\","")
timeline = timeline.replace("19","zz19")
timeline = timeline.replace("20","zz20")
timeline = timeline.split(" - ")
So basically I end up with this:
['zz1961', 'Second child, John, is born. zz1963', 'Is ordained as a Presbyterian minister. Makes his on-camera debut as host of a series of 15-minute episodes for children produced in Toronto. The program is titled "Mister Rogers." zz1964', 'Returns to Pittsburgh and turns his 15-minute show into the half-hour "Mister Rogers\' Neighborhood." zz1968', '"Mister Rogers\' Neighborhood" debuts on PBS and wins the first of its several Emmy Awards. Rogers is appointed chairman of the Forum on Mass Media and Child Development of the White House Conference on Youth. zz1969', "Wins the first of his two George Foster Peabody Awards for television excellence. Fred Rogers readies the opening pitch to start the Pirates' season in zz1988. (Post-Gazette archives) zz1971", 'Forms his production company, Family Communications Inc. zz1975', 'Ceases production on "Mister Rogers\' Neighborhood," which continues to air on PBS in reruns. zz1978', 'Creates the PBS series "Old Friends, New Friends," focusing on older people. zz1979', 'Production resumes on "Mister Rogers\' Neighborhood." zz1981', 'Eddie Murphy plays an inner-city children\'s show host in his "Mister Robinson\'s Neighborhood" sketch on "Saturday Night Live," which is probably the most famous of the many "Mister Rogers" spoofs. zz1984', "The Smithsonian Institution in Washington makes Fred Rogers' trademark sweater part of its permanent collection. Rogers forces the fast-food chain Burger King to remove ads featuring a Mister Rogers lookalike. zz1987", "Appears on a children's TV show in the Soviet Union and reciprocates by having a Soviet host appear on his show later in the year. zz1989", '"Mister Rogers\' Neighborhood of Make Believe" opens as an attraction at Idlewild Park. zz1990', "Sues the Ku Klux Klan and forces the organization to stop playing racist telephone recordings featuring imitations of Fred Rogers' voice, speech patterns and theme song. 
zz1991", 'Tapes spots for PBS on the eve of the Gulf War to reassure children that they will be all right if hostilities occur. zz1996', 'TV Guide names Fred Rogers one of the 50 greatest TV stars of all time. Rogers makes his only appearance as someone other than himself when he takes a cameo role as a preacher on the drama series "Dr. Quinn, Medicine Woman," which he says is one of his favorite shows. zz1997', 'Is honored for lifetime achievement by the National Academy of Television Arts and Sciences (who hand out the Emmy Awards) and by the Television Critics Association. Named Pittsburgher of the Year by Pittsburgh magazine. zz1998', "Gets a star on the Hollywood Walk of Fame. The Pittsburgh Children's Museum opens a Mister Rogers exhibit. zz1999", 'Is inducted into the Television Hall of Fame. zz2000', 'Unveils a planetarium show featuring "Neighborhood" characters at the Carnegie Science Center; ceases production on "Mister Rogers\' Neighborhood." zz2001', 'Last original episodes of "Mister Rogers\' Neighborhood" airs on PBS. zz2002', 'In July, is presented the Presidential Medal of Freedom, the nation\'s highest civilian honor. Filmed PBS public service announcements in August encouraging parents to read to their children and giving them advice on how to handle the first anniversary of the Sept. 11, zz2001, terrorist attacks. Is named grand marshal of the Tournament of Roses Parade in Pasadena, Calif., with Bill Cosby and Art Linkletter. The theme: "Children\'s Wishes, Dreams and Imagination." zz2003', 'Dies of stomach cancer at age 74.']
So I would want to split this string by "zz", the only problem is I can't split it twice because it's already a list. Obviously I could concatenate the list, but then I would just run into the same problem as now.
Tried doing this:
for i in timeline:
    i = i.split("zz")
But it doesn't work :(
You could make the dictionary like this:
parts = timeline.replace('\n','').split(" - ")
d = dict()
for x in range(len(parts) - 1):
    d[parts[x][-4:]] = parts[x + 1][:len(parts[x+1]) - 4]
Since your delimiter is clearly " - ", splitting on it gives you a list of items like:
'1961'
'blablablablabla. 1963'
'blablablablabla. 1968'
'blablablablabla. 1969'
'blablablablabla'
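Then you only need to pair the last four characters of each item with the next item minus its own last four characters. A simple split and slice will therefore work for you. Note that the final item has no trailing year, so this slice also cuts off its last four characters (see the 2003 entry in the output).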
Using this to show the result:
for a in d:
    print(a + " " + d[a])
This is what I got:
1975 Ceases production on "Mister Rogers' Neighborhood," which continues to air on PBS in reruns.
1968 "Mister Rogers' Neighborhood" debuts on PBS and wins the first of its several Emmy Awards. Rogers is appointed chairman of the Forum on Mass Media and Child Development of the White House Conference on Youth.
2001 Last original episodes of "Mister Rogers' Neighborhood" airs on PBS.
1990 Sues the Ku Klux Klan and forces the organization to stop playing racist telephone recordings featuring imitations of Fred Rogers' voice, speech patterns and theme song.
2002 In July, is presented the Presidential Medal of Freedom, the nation's highest civilian honor. Filmed PBS public service announcements in August encouraging parents to read to their children and giving them advice on how to handle the first anniversary of the Sept. 11, 2001, terrorist attacks. Is named grand marshal of the Tournament of Roses Parade in Pasadena, Calif., with Bill Cosby and Art Linkletter. The theme: "Children's Wishes, Dreams and Imagination."
1989 "Mister Rogers' Neighborhood of Make Believe" opens as an attraction at Idlewild Park.
1996 TV Guide names Fred Rogers one of the 50 greatest TV stars of all time. Rogers makes his only appearance as someone other than himself when he takes a cameo role as a preacher on the drama series "Dr. Quinn, Medicine Woman," which he says is one of his favorite shows.
1987 Appears on a children's TV show in the Soviet Union and reciprocates by having a Soviet host appear on his show later in the year.
1963 Is ordained as a Presbyterian minister. Makes his on-camera debut as host of a series of 15-minute episodes for children produced in Toronto. The program is titled "Mister Rogers."
1984 The Smithsonian Institution in Washington makes Fred Rogers' trademark sweater part of its permanent collection. Rogers forces the fast-food chain Burger King to remove ads featuring a Mister Rogers lookalike.
1971 Forms his production company, Family Communications Inc.
1991 Tapes spots for PBS on the eve of the Gulf War to reassure children that they will be all right if hostilities occur.
1997 Is honored for lifetime achievement by the National Academy of Television Arts and Sciences (who hand out the Emmy Awards) and by the Television Critics Association. Named Pittsburgher of the Year by Pittsburgh magazine.
1999 Is inducted into the Television Hall of Fame.
1964 Returns to Pittsburgh and turns his 15-minute show into the half-hour "Mister Rogers' Neighborhood."
2000 Unveils a planetarium show featuring "Neighborhood" characters at the Carnegie Science Center; ceases production on "Mister Rogers' Neighborhood."
1981 Eddie Murphy plays an inner-city children's show host in his "Mister Robinson's Neighborhood" sketch on "Saturday Night Live," which is probably the most famous of the many "Mister Rogers" spoofs.
1978 Creates the PBS series "Old Friends, New Friends," focusing on older people.
1969 Wins the first of his two George Foster Peabody Awards for television excellence. Fred Rogers readies the opening pitch to start the Pirates' season in 1988. (Post-Gazette archives)
1961 Second child, John, is born.
2003 Dies of stomach cancer at age
1998 Gets a star on the Hollywood Walk of Fame. The Pittsburgh Children's Museum opens a Mister Rogers exhibit.
1979 Production resumes on "Mister Rogers' Neighborhood."
Note that a plain dict only guarantees insertion order from Python 3.7 on; in earlier versions it is unordered. Use OrderedDict if you need ordering there.
An ordered dictionary is built just as easily:
import collections
parts = timeline.replace('\n','').split(" - ")
od = collections.OrderedDict()
for x in range(len(parts) - 1):
    od[parts[x][-4:]] = parts[x + 1][:len(parts[x+1]) - 4]
And using this:
for a in od:
    print(a + " " + od[a])
The result will be ordered:
1961 Second child, John, is born.
1963 Is ordained as a Presbyterian minister. Makes his on-camera debut as host of a series of 15-minute episodes for children produced in Toronto. The program is titled "Mister Rogers."
1964 Returns to Pittsburgh and turns his 15-minute show into the half-hour "Mister Rogers' Neighborhood."
1968 "Mister Rogers' Neighborhood" debuts on PBS and wins the first of its several Emmy Awards. Rogers is appointed chairman of the Forum on Mass Media and Child Development of the White House Conference on Youth.
1969 Wins the first of his two George Foster Peabody Awards for television excellence. Fred Rogers readies the opening pitch to start the Pirates' season in 1988. (Post-Gazette archives)
1971 Forms his production company, Family Communications Inc.
1975 Ceases production on "Mister Rogers' Neighborhood," which continues to air on PBS in reruns.
1978 Creates the PBS series "Old Friends, New Friends," focusing on older people.
1979 Production resumes on "Mister Rogers' Neighborhood."
1981 Eddie Murphy plays an inner-city children's show host in his "Mister Robinson's Neighborhood" sketch on "Saturday Night Live," which is probably the most famous of the many "Mister Rogers" spoofs.
1984 The Smithsonian Institution in Washington makes Fred Rogers' trademark sweater part of its permanent collection. Rogers forces the fast-food chain Burger King to remove ads featuring a Mister Rogers lookalike.
1987 Appears on a children's TV show in the Soviet Union and reciprocates by having a Soviet host appear on his show later in the year.
1989 "Mister Rogers' Neighborhood of Make Believe" opens as an attraction at Idlewild Park.
1990 Sues the Ku Klux Klan and forces the organization to stop playing racist telephone recordings featuring imitations of Fred Rogers' voice, speech patterns and theme song.
1991 Tapes spots for PBS on the eve of the Gulf War to reassure children that they will be all right if hostilities occur.
1996 TV Guide names Fred Rogers one of the 50 greatest TV stars of all time. Rogers makes his only appearance as someone other than himself when he takes a cameo role as a preacher on the drama series "Dr. Quinn, Medicine Woman," which he says is one of his favorite shows.
1997 Is honored for lifetime achievement by the National Academy of Television Arts and Sciences (who hand out the Emmy Awards) and by the Television Critics Association. Named Pittsburgher of the Year by Pittsburgh magazine.
1998 Gets a star on the Hollywood Walk of Fame. The Pittsburgh Children's Museum opens a Mister Rogers exhibit.
1999 Is inducted into the Television Hall of Fame.
2000 Unveils a planetarium show featuring "Neighborhood" characters at the Carnegie Science Center; ceases production on "Mister Rogers' Neighborhood."
2001 Last original episodes of "Mister Rogers' Neighborhood" airs on PBS.
2002 In July, is presented the Presidential Medal of Freedom, the nation's highest civilian honor. Filmed PBS public service announcements in August encouraging parents to read to their children and giving them advice on how to handle the first anniversary of the Sept. 11, 2001, terrorist attacks. Is named grand marshal of the Tournament of Roses Parade in Pasadena, Calif., with Bill Cosby and Art Linkletter. The theme: "Children's Wishes, Dreams and Imagination."
2003 Dies of stomach cancer at age
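The 2003 entry above loses its ending ("74.") because the last item has no year after it, yet four characters are still sliced off. A minimal sketch of the same split-and-slice idea that leaves the final item untouched; the shortened sample string and the variable names are assumptions for illustration:

```python
# Hypothetical, shortened version of the timeline from the question.
sample = ("1961 - Second child, John, is born. "
          "1963 - Is ordained as a Presbyterian minister. "
          "2003 - Dies of stomach cancer at age 74.")

parts = sample.split(" - ")
events = {}
for i in range(len(parts) - 1):
    year = parts[i][-4:]          # the year sits at the end of each item
    following = parts[i + 1]
    is_last = (i + 1 == len(parts) - 1)
    # Every item except the last ends with the next year; drop it there only.
    event = following if is_last else following[:-4]
    events[year] = event.strip()  # also removes the space left before the year

for year, event in events.items():
    print(year, event)
```

This still assumes every year has exactly four digits and that " - " never appears inside an event description.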
Let's split using regex!
import re
stripped_string = timeline.replace('\n', '').strip()  # drop newlines and surrounding whitespace
splitted = re.split(r'(\d{4}) - ', stripped_string)  # split to ['', year1, event1, year2, event2, ...]
lst = splitted[1:]  # drop the empty first element
keys = lst[0::2]  # years
values = lst[1::2]  # events
dic = dict(zip(keys, values))
You could make the pattern more specific. For example, if you know that the years are 19xx or 20xx, then:
splitted = re.split(r'((?:19|20)\d\d) - ', stripped_string)
will do as well.
Also, since you have the keys and values explicitly, you can easily post-process them, for example by stripping whitespace from each element in values:
values = [event.strip() for event in lst[1::2]]
Here's the result:
>>> print("\n".join("{}: {}".format(k, v) for k, v in dic.items()))
1989: "Mister Rogers' Neighborhood of Make Believe" opens as an attraction at Idlewild Park.
1987: Appears on a children's TV show in the Soviet Union and reciprocates by having a Soviet host appear on his show later in the year.
1984: The Smithsonian Institution in Washington makes Fred Rogers' trademark sweater part of its permanent collection. Rogers forces the fast-food chain Burger King to remove ads featuring a Mister Rogers lookalike.
1968: "Mister Rogers' Neighborhood" debuts on PBS and wins the first of its several Emmy Awards. Rogers is appointed chairman of the Forum on Mass Media and Child Development of the White House Conference on Youth.
1969: Wins the first of his two George Foster Peabody Awards for television excellence. Fred Rogers readies the opening pitch to start the Pirates' season in 1988. (Post-Gazette archives)
1981: Eddie Murphy plays an inner-city children's show host in his "Mister Robinson's Neighborhood" sketch on "Saturday Night Live," which is probably the most famous of the many "Mister Rogers" spoofs.
1964: Returns to Pittsburgh and turns his 15-minute show into the half-hour "Mister Rogers' Neighborhood."
1961: Second child, John, is born.
1963: Is ordained as a Presbyterian minister. Makes his on-camera debut as host of a series of 15-minute episodes for children produced in Toronto. The program is titled "Mister Rogers."
1991: Tapes spots for PBS on the eve of the Gulf War to reassure children that they will be all right if hostilities occur.
1990: Sues the Ku Klux Klan and forces the organization to stop playing racist telephone recordings featuring imitations of Fred Rogers' voice, speech patterns and theme song.
1997: Is honored for lifetime achievement by the National Academy of Television Arts and Sciences (who hand out the Emmy Awards) and by the Television Critics Association. Named Pittsburgher of the Year by Pittsburgh magazine.
1979: Production resumes on "Mister Rogers' Neighborhood."
1996: TV Guide names Fred Rogers one of the 50 greatest TV stars of all time. Rogers makes his only appearance as someone other than himself when he takes a cameo role as a preacher on the drama series "Dr. Quinn, Medicine Woman," which he says is one of his favorite shows.
1999: Is inducted into the Television Hall of Fame.
1998: Gets a star on the Hollywood Walk of Fame. The Pittsburgh Children's Museum opens a Mister Rogers exhibit.
1975: Ceases production on "Mister Rogers' Neighborhood," which continues to air on PBS in reruns.
1978: Creates the PBS series "Old Friends, New Friends," focusing on older people.
1971: Forms his production company, Family Communications Inc.
2002: In July, is presented the Presidential Medal of Freedom, the nation's highest civilian honor. Filmed PBS public service announcements in August encouraging parents to read to their children and giving them advice on how to handle the first anniversary of the Sept. 11, 2001, terrorist attacks. Is named grand marshal of the Tournament of Roses Parade in Pasadena, Calif., with Bill Cosby and Art Linkletter. The theme: "Children's Wishes, Dreams and Imagination."
2003: Dies of stomach cancer at age 74.
2000: Unveils a planetarium show featuring "Neighborhood" characters at the Carnegie Science Center; ceases production on "Mister Rogers' Neighborhood."
2001: Last original episodes of "Mister Rogers' Neighborhood" airs on PBS.

Reformat unstructured text into a single line after removing punctuation

I have unstructured text that I want to convert into 1 line, with all the punctuation marks removed.
For the punctuation marks I used the solution from "Best way to strip punctuation from a string in Python".
How can I reformat the unstructured text into 1 line using Python?
Example 1:
The Bourne Identity is a 2002 spy film loosely based on Robert
Ludlum's novel of the same name. It stars Matt Damon as Jason Bourne,
an amnesiac attempting to discover his true identity amidst a
clandestine conspiracy within the Central Intelligence Agency (CIA) to
track him down and arrest or kill him for inexplicably failing to
carry out an officially unsanctioned assassination and then failing to
report back in afterwards. Along the way he teams up with Marie,
played by Franka Potente, who assists him on the initial part of his
journey to learn about his past and regain his memories. The film also
stars Chris Cooper as Alexander Conklin, Clive Owen as The Professor,
Brian Cox as Ward Abbott, and Julia Stiles as Nicky Parsons.
The film was directed by Doug Liman and adapted for the screen by Tony
Gilroy and William Blake Herron from the novel of the same name
written by Robert Ludlum, who also produced the film alongside Frank
Marshall. Universal Studios released the film to theaters in the
United States on June 14, 2002 and it received a positive critical and
public reaction. The film was followed by a 2004 sequel, The Bourne
Supremacy, and a third part released in 2007 entitled The Bourne
Ultimatum.
Plot
Example 2:
12 (0) 0 4 (0) 38 (3) 0 3 (0) 0 1 (0)
Example 3:
Franklin Township is one of the eighteen townships of Monroe County, Ohio,
United States. The 2000 census found 453 people in the township, 367 of whom
lived in the unincorporated portions of the township.
Geography
Located in the western part of the county, it borders the following townships:
The village of Stafford lies in southwestern Franklin Township.
Name and history
It is one of twenty-one Franklin Townships statewide.
Government
The township is governed by a three-member board of trustees, who are elected in
November of odd-numbered years to a four-year term beginning on the following
January 1. Two are elected in the year after the presidential election and one
is elected in the year before it. There is also an elected township clerk, who
serves a four-year term beginning on April 1 of the year after the election,
which is held in November of the year before the presidential election.
Vacancies in the clerkship or on the board of trustees are filled by the
remaining trustees.
As you can see in the previous examples, the texts have different formats. How can I turn each text into 1 line?
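This is pretty straightforward: besides the punctuation, you now also want to eliminate the line endings.
So, you can simply do:
import string
# Characters to drop: all ASCII punctuation plus newline, tab and carriage return.
exclude = set(string.punctuation + "\n\t\r")
# Line breaks are deleted outright, so the words on either side of a break end up joined.
print(''.join(ch for ch in input_string if ch not in exclude))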
input_string = """The Bourne Identity is a 2002 spy film loosely based on Robert Ludlum's novel of the same name. It stars Matt Damon as Jason Bourne, an amnesiac attempting to discover his true identity amidst a clandestine conspiracy within the Central Intelligence Agency (CIA) to track him down and arrest or kill him for inexplicably failing to carry out an officially unsanctioned assassination and then failing to report back in afterwards. Along the way he teams up with Marie, played by Franka Potente, who assists him on the initial part of his journey to learn about his past and regain his memories. The film also stars Chris Cooper as Alexander Conklin, Clive Owen as The Professor, Brian Cox as Ward Abbott, and Julia Stiles as Nicky Parsons.
The film was directed by Doug Liman and adapted for the screen by Tony Gilroy and William Blake Herron from the novel of the same name written by Robert Ludlum, who also produced the film alongside Frank Marshall. Universal Studios released the film to theaters in the United States on June 14, 2002 and it received a positive critical and public reaction. The film was followed by a 2004 sequel, The Bourne Supremacy, and a third part released in 2007 entitled The Bourne Ultimatum."""
>>> print(''.join(ch for ch in input_string if ch not in exclude))
The Bourne Identity is a 2002 spy film loosely based on Robert Ludlums novel of the same name It stars Matt Damon as Jason Bourne an amnesiac attempting to discover his true identity amidst a clandestine conspiracy within the Central Intelligence Agency CIA to track him down and arrest or kill him for inexplicably failing to carry out an officially unsanctioned assassination and then failing to report back in afterwards Along the way he teams up with Marie played by Franka Potente who assists him on the initial part of his journey to learn about his past and regain his memories The film also stars Chris Cooper as Alexander Conklin Clive Owen as The Professor Brian Cox as Ward Abbott and Julia Stiles as Nicky ParsonsThe film was directed by Doug Liman and adapted for the screen by Tony Gilroy and William Blake Herron from the novel of the same name written by Robert Ludlum who also produced the film alongside Frank Marshall Universal Studios released the film to theaters in the United States on June 14 2002 and it received a positive critical and public reaction The film was followed by a 2004 sequel The Bourne Supremacy and a third part released in 2007 entitled The Bourne Ultimatum
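Note that in the output above the two paragraphs run together ("ParsonsThe"), because the newline between them is deleted rather than replaced. If that is not wanted, one possible variation is to strip the punctuation first and then collapse every run of whitespace into a single space. The following is only a sketch, assuming Python 3; the helper name to_single_line is made up for illustration and is not part of any library.

```python
import string

def to_single_line(text):
    """Strip ASCII punctuation, then collapse whitespace runs to single spaces.

    Hypothetical helper for illustration; assumes Python 3 (str.maketrans).
    """
    # Delete every character listed in string.punctuation.
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments splits on any run of spaces, tabs or newlines,
    # so joining with " " leaves exactly one space between words.
    return " ".join(no_punct.split())

sample = "Julia Stiles as Nicky Parsons.\nThe film was directed by Doug Liman."
print(to_single_line(sample))
# Julia Stiles as Nicky Parsons The film was directed by Doug Liman
```

The same call also works on Example 2, where only the parentheses are removed and the numbers stay separated by single spaces.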

In NLTK, how do I get the concordance of a text?

>>> c = t.concordance('president')
Displaying 25 of 142 matches:
of Hays , Kan. as the school's new president . Dr. Clark will succeed Dr. J. R.
dollars , said C. Virgil Martin , president of Carson Pirie Scott & Co. , comm
dants '' . Washington , July 24 -- president Kennedy today pushed aside other W
ionwide television and radio . The president spent much of the week-end at his
drafts '' . Salinger said the work president Kennedy , advisers , and members o
miss them . Washington , Feb. 9 -- president Kennedy today proposed a mammoth n
railroad retirement programs . The president , in a special message to Congress
ged care plan , similar to one the president sponsored last year as a senator ,
or up to 240 days an illness . The president noted that Congress last year pass
e medical and dental schools . The president said the nation's 92 medical and 4
go up to 21 millions by 1966 . The president recommended federal `` matching gr
community health services '' , the president called for doubling the present 10
. In the child health field , the president said he will recommend later an in
nstitute . Asks research funds The president said he will ask Congress to incre
building research facilities . The president said he will also propose increasi
ernment research in medicine . The president said his proposals combine the ``
e ( D. , Ore. ) in connection with president Eisenhower's cabinet selections in
r's cabinet selections in 1953 and president Kennedy's in 1961 . Oslo The most
e was critical of what he feels is president Kennedy's tendency to be too conci
cation . But he did recommend that president Kennedy state clearly that if Comm
any observer would have said that president Kennedy had blended a program that
nquency in the United States . The president is deeply concerned over this prob
orities on juvenile problems . The president asks the support and cooperation o
rime trend . Offenses multiply The president has also called upon the Attorney
h the problem . Simultaneously the president announced Thursday the appointment
>>> print(c)
None
But concordance() only prints the matches and returns None, so I can't store the result. I want to assign it to a variable so I can do stuff with it.
The code you're calling is in nltk/nltk/text.py, and looks like:
if '_concordance_index' not in self.__dict__:
    print "Building index..."
    self._concordance_index = ConcordanceIndex(self.tokens,
                                               key=lambda s:s.lower())
self._concordance_index.print_concordance(word, width, lines)
So you should be able to create a ConcordanceIndex yourself, and manipulate it however you want to. The ConcordanceIndex class is in the same file, and includes the code for print_concordance, which is probably a good place to start.
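To make the idea concrete, here is a rough, self-contained sketch of what such an index does: it records the positions of every token and then slices a context window around each hit, returning the results instead of printing them. This is not NLTK's code. The function names build_offsets and concordance_lines and the context parameter are invented for illustration, and Python 3 is assumed. With NLTK itself you would build a ConcordanceIndex from your tokens and read the positions from it in a similar way.

```python
from collections import defaultdict

def build_offsets(tokens, key=str.lower):
    """Map each normalized token to the list of positions where it occurs."""
    offsets = defaultdict(list)
    for i, tok in enumerate(tokens):
        offsets[key(tok)].append(i)
    return offsets

def concordance_lines(tokens, word, context=5, key=str.lower):
    """Return (left, match, right) tuples for every hit instead of printing."""
    offsets = build_offsets(tokens, key)
    results = []
    # .get() avoids creating an empty entry for words that never occur.
    for i in offsets.get(key(word), []):
        left = " ".join(tokens[max(0, i - context):i])
        right = " ".join(tokens[i + 1:i + 1 + context])
        results.append((left, tokens[i], right))
    return results

tokens = "The president said the President , in a special message".split()
hits = concordance_lines(tokens, "president", context=2)
for left, match, right in hits:
    print(left, "|", match, "|", right)
```

Because hits is an ordinary list, it can be stored, filtered or counted like any other value, which is what the original concordance() call did not allow.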
