I have the following strings:
text_one = str("\"A Ukrainian American woman who lives near Boston, Massachusetts, told Fox News Digital on Monday that she can no longer speak on the phone with her own mother, who lives in southern Ukraine, because of the Russian attacks on Ukraine and the fear these attacks have engendered.")
text_two = str("\n\nMany people in southern Ukraine — as well as throughout the country — are right now living in fear for their lives as Russian soldiers overtake the area, the Boston-area woman said.\"")
I need to replace every instance of s/S with $, but not the first instance of s/S in a given word. So the input/output would look something like:
> Mississippi
> Mis$i$$ippi
My idea is to do something like 'after every " " character, skip first "s" and then replace all others up until " " character' but I have no idea how I might go about this. I also thought about creating a list to handle each word.
Solution with re:
import re
text_one = '"A Ukrainian American woman who lives near Boston, Massachusetts, told Fox News Digital on Monday that she can no longer speak on the phone with her own mother, who lives in southern Ukraine, because of the Russian attacks on Ukraine and the fear these attacks have engendered.'
text_two = '\n\nMany people in southern Ukraine — as well as throughout the country — are right now living in fear for their lives as Russian soldiers overtake the area, the Boston-area woman said."'
def replace(s):
    return re.sub(
        r"(?<=[sS])(\S+)",
        lambda g: g.group(1).replace("s", "$").replace("S", "$"),
        s,
    )
print(replace(text_one))
print(replace(text_two))
Prints:
"A Ukrainian American woman who lives near Boston, Mas$achu$ett$, told Fox News Digital on Monday that she can no longer speak on the phone with her own mother, who lives in southern Ukraine, because of the Rus$ian attacks on Ukraine and the fear these attacks have engendered.
Many people in southern Ukraine — as well as throughout the country — are right now living in fear for their lives as Rus$ian soldier$ overtake the area, the Boston-area woman said."
The first thing you're going to want to do is find the index of the first s.
Then split the string into two variables: everything up to and including the first s, and the rest of the string.
Next, replace all of the s's in the second string with dollar signs.
Finally, join the two strings with an empty string.
test = "mississippi"
first_index = test.find("s")
tests = [test[:first_index+1], test[first_index+1:]]
tests[1] = tests[1].replace("s", "$")
result = ''.join(tests)
print(result)
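The snippet above handles a single lowercase word. A minimal sketch of the same split-and-replace idea applied to every word of a longer text, covering both s and S, might look like this. The names dollarize_word and dollarize are made up for illustration, and a "word" is assumed to be any run of non-whitespace characters.

```python
import re

def dollarize_word(word):
    # Locate the first s/S in the word, ignoring case.
    first = word.lower().find("s")
    if first == -1:
        return word  # no s at all, nothing to replace
    head, tail = word[:first + 1], word[first + 1:]
    # Keep the first s untouched, replace every later s/S.
    return head + tail.replace("s", "$").replace("S", "$")

def dollarize(text):
    # Apply the per-word rule to each run of non-whitespace characters,
    # leaving the original whitespace in place.
    return re.sub(r"\S+", lambda m: dollarize_word(m.group(0)), text)

print(dollarize("Mississippi"))                  # Mis$i$$ippi
print(dollarize("Boston, Massachusetts, told"))  # Boston, Mas$achu$ett$, told
```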
I am new to regular expressions and I have the text below. How can I use a regex to extract all words that contain at least one digit? I'd really appreciate any help.
text = '''The start of the Civil War in 1861 followed by Tennessee’s secession from the Union and the lodging of
wounded Confederate soldiers on campus did not close East Tennessee University. By spring 1862 when the
trustees finally suspended operations, the majority of students had joined the military, President Joseph
Ridley had resigned, and two professors had left the university. Wounded Confederate soldiers were lodged
at university buildings after the January 1862 Battle of Mill Springs in Kentucky, known as the Battle of
Fishing Creek to the Confederacy. In the fall of 1863, Union troops forced the Confederates out of
Knoxville. On the Hill, the Union Army enclosed the three university buildings with an earthen
fortification they named Fort Byington in honor of an officer from Michigan who was killed in the defense
of Knoxville. They used the buildings for their headquarters, barracks, and a hospital for Black troops.
Despite a Confederate attempt to retake the city by siege—climaxed by a bloody, abortive attack on Fort
Sanders on November 29, 1863—the Union held and occupied Knoxville for the rest of the war. During the
battle, the Hill was hit with artillery fire from Confederate guns located in a trench at the site of
UT’s present-day Sorority Village. Campus also sustained a great deal of damage caused by the Union Army.
Troops denuded the grounds of trees, ruined the steward’s house, and destroyed the gymnasium with
misdirected cannon fire aimed at Confederate troops across the river. After the Civil War ended in 1865
and the Union Army left campus, Thomas Humes was elected university president. The university reopened in
1866 and operated for six months downtown in the Deaf and Dumb Asylum while repairs began at the damaged
campus. A petition to the federal war department for monetary compensation for campus damage done by the
Union Army undoubtedly received more favorable consideration because of Humes’s known Union loyalty
throughout the war. A Senate committee which considered the bill for damages also noted that East
Tennessee University was “particularly deserving of the favorable consideration of Congress” because it
was “the only educational institution of known loyalty…in any of the seceding states.” However in 1873,
President Ulysses S. Grant vetoed the bill that would have provided $18,500 to the university because he
felt it would set a bad precedent. The bill was redrafted specifying that the payment was compensation
for aid East Tennessee University gave to the Union during the war. On June 22, 1874, President Grant
signed the new bill and the trustees accepted the funds the same day with an agreement to release the
government from all claims. (More than a century and a half later, a buried Union trench was located in
2019 on the north side of the present-day McClung Museum with the use of ground-penetrating radar.)
'''
You could use this pattern:
r'\w*\d+\w*'
How does it work:
\w* matches zero or more word characters (letters, digits, or underscore)
\d+ matches one or more digits
\w* matches zero or more word characters again
Using re.findall:
re.findall(r'\w*\d+\w*', your_text)
we get:
['1861',
'1862',
'1862',
'1863',
'29',
'1863',
'1865',
'1866',
'1873',
'18',
'500',
'22',
'1874',
'2019']
Is this what you mean?
re.findall(r"\S*\d+\S*", text)
\S any character except whitespace,
\d any digit,
+ one or more occurrences,
* zero or more occurrences
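The two patterns suggested in this thread behave differently around punctuation, which a short comparison can make visible. This is only a sketch on a made-up sample sentence (the variable names are illustrative): \w stops at commas, periods and "$", while \S keeps them attached to the token.

```python
import re

sample = "In 1863, the bill gave $18,500 on June 22, 1874."

# \w only covers letters, digits and underscore, so "$18,500" is split in two.
word_chars = re.findall(r"\w*\d+\w*", sample)
# \S covers any non-whitespace character, so punctuation stays attached.
non_space = re.findall(r"\S*\d+\S*", sample)

print(word_chars)  # ['1863', '18', '500', '22', '1874']
print(non_space)   # ['1863,', '$18,500', '22,', '1874.']
```

Which one is preferable depends on whether trailing punctuation should count as part of the "word".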
I got HTML from a website and converted it to text.
How can I clean the text so that I keep only the sentences?
For example, I want to remove all irrelevant information such as 1990, ... himself, 1987, the 59th ...
and keep the sentences:
tom cruise is an american actor who has starred in many blockbuster movies and as of 2012 is the highest paid actor in hollywood. he is also a film producer and owns a production company. tom cruise has been the winner of three golden globe awards and has been nominated thrice for academy awards. apart from this, many of the movies cruise has starred in have been huge blockbusters on the box office.
after repeated success in many films, tom cruise kept going on with release of two mission impossible movies, war of the worlds which was a super duper box office hit and many more.
and so on.
1990
... himself
1987
the 59th annual academy awards
(tv special)
jack
/
maverick
/
vincent lauria
(uncredited)
related videos
none
none
none
see all 35 videos »
#csm.csm_widget />
reality tv
the office
late night
sitcoms
music
rappers
action
religion
top paid
how much money does tom cruise make? (salary & net worth)
tom cruise is an american actor who has starred in many blockbuster movies and as of 2012 is the highest paid actor in hollywood. he is also a film producer and owns a production company. tom cruise has been the winner of three golden globe awards and has been nominated thrice for academy awards. apart from this, many of the movies cruise has starred in have been huge blockbusters on the box office.
history
thomas cruise mapother iv a.k.a tom cruise was born in syracuse, new york to mother mary lee and father thomas cruise mapother iii. cruise’s mother was a special education teacher and father was an electrical engineer. tom cruise is basically of irish, german and english origin. cruise’s family had the male domination of his abusive father whom cruise had once described as the merchant of chaos. he was often bullied and beaten by his father and cruise called him a coward. a part of tom cruise’s childhood was spent in canada. however, when cruise was in the sixth grade, his mother left his father and brought cruise and his siblings back to america.
acting career
acting career of tom cruise started quite early but with a small role in the movie endless love (1981). however, he got his big break as a supporting actor in the movie taps later that year. in 1983, his movies risky business and all the right moves along with top gun in 1986 paved the path for tom cruise as an established actor and a superstar. after this there was no looking back and tom cruise went to star in many super-successful movies like cocktail, rain man, days of thunder, interview with the vampire.
then in 1996, he starred as a superspy ethan hunt in the very popular and blockbuster movie which went on to be a series, mission: impossible. that same year he also was seen in the lead role of the movie jerry maguire and won a golden globe for the same. in 1999, his supporting role in the movie magnolia again won him his second golden globe.
after repeated success in many films, tom cruise kept going on with release of two mission impossible movies, war of the worlds which was a super duper box office hit and many more.
net worth
tom cruise’s films have gained $7.3 million worldwide as of 2013. however, the net worth of the highest paid actor in hollywood is $270 million and he still gets paychecks from his previous movies.
154 magazine cover photos
|
none »
official sites:
facebook
|
official site
|
none
»
alternate names:
tomu kurûzu
height:
5' 7" (1.7 m)
none
did you know?
personal quote:
(1992 quote) i really enjoy talking to other actors and directors. sometimes, if i see their movies, i'll call them up or write them a note saying, "i enjoyed it," or asking, "how did you do that? how did you make that work?". i just saw
The HTML text is stored in a variable called text:
sentence = re.sub(' ', '\n', text)
sentence = re.sub('none', '', sentence)
print(sentence)
The result: the sentences are destroyed.
ethan
hunt
/
ray
ferrier
(uncredited)
2006
the
late
late
show
with
craig
ferguson
(tv
series)
himself
-
episode
#2.140
(2006)
...
himself
(uncredited)
2006
getaway
(tv
series)
himself
-
seven
wonders
of
the
world
(2006)
...
himself
2006
cmt
insider
(tv
series)
himself
-
episode
dated
29
april
2006
(2006)
...
himself
2005-2006
corazón
de...
(tv
series)
himself
-
episode
dated
19
january
2006
(2006)
...
himself
-
episode
dated
15
november
2005
(2005)
...
himself
-
Try this:
^(\s*?\S*){5}$
The pattern is currently set to select any line that has five words or fewer. You can increase or decrease the number of words by changing the value of {5}.
Demo: https://regex101.com/r/z2qxrx/3
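In Python the same five-word threshold can be applied line by line. The sketch below swaps the regex for str.split() to count words, which is a different technique from the pattern above but follows the same rule. The function name drop_short_lines and the max_words parameter are made up for illustration.

```python
def drop_short_lines(text, max_words=5):
    # Keep only lines with more than max_words whitespace-separated words.
    kept = [line for line in text.splitlines() if len(line.split()) > max_words]
    return "\n".join(kept)

sample = """1990
... himself
the 59th annual academy awards
tom cruise is an american actor who has starred in many blockbuster movies.
(tv special)"""

print(drop_short_lines(sample))
```

Longer non-sentence lines (for example a ten-word page heading) would still survive this filter, so the threshold is a heuristic rather than a sentence detector.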
Hi, I have used a Python library to collect data on a topic. For example, I chose the topic of New York and retrieved the content with the following code:
import wikipedia
f2 = open('newyork', 'w')
ny = wikipedia.page("New York")
f2.write(ny.content.encode('utf8')+"\n")
I am able to extract the information in the format below:
New York is a state in the Northeastern United States and is the 27th-most extensive, fourth-most populous, and seventh-most densely populated U.S. state. New York is bordered by New Jersey and Pennsylvania to the south and Connecticut, Massachusetts, and Vermont to the east. The state has a maritime border in the Atlantic Ocean with Rhode Island, east of Long Island, as well as an international border with the Canadian provinces of Quebec to the north and Ontario to the west and north. The state of New York, with an estimated 19.8 million residents in 2015, is often referred to as New York State to distinguish it from New York City, the state's most populous city and its economic hub.
With an estimated population of 8.55 million in 2015, New York City is the most populous city in the United States and the premier gateway for legal immigration to the United States. The New York City Metropolitan Area is one of the most populous urban agglomerations in the world. New York City is a global city, exerting a significant impact upon commerce, finance, media, art, fashion, research, technology, education, and entertainment, its fast pace defining the term New York minute. The home of the United Nations Headquarters, New York City is an important center for international diplomacy and has been described as the cultural and financial capital of the world, as well as the world's most economically powerful city. New York City makes up over 40% of the population of New York State. Two-thirds of the state's population lives in the New York City Metropolitan Area, and nearly 40% live on Long Island. Both the state and New York City were named for the 17th century Duke of York, future King James II of England. The next four most populous cities in the state are Buffalo, Rochester, Yonkers, and Syracuse, while the state capital is Albany.
The earliest Europeans in New York were French colonists and Jesuit missionaries who arrived southward from settlements at Montreal for trade and proselytizing. New York had been inhabited by tribes of Algonquian and Iroquoian-speaking Native Americans for several hundred years by the time Dutch settlers moved into the region in the early 17th century. In 1609, the region was first claimed by Henry Hudson for the Dutch, who built Fort Nassau in 1614 at the confluence of the Hudson and Mohawk rivers, where the present-day capital of Albany later developed. The Dutch soon also settled New Amsterdam and parts of the Hudson Valley, establishing the colony of New Netherland, a multicultural community from its earliest days and a center of trade and immigration. The British annexed the colony from the Dutch in 1664. The borders of the British colony, the Province of New York, were similar to those of the present-day state.
Many landmarks in New York are well known to both international and domestic visitors, with New York State hosting four of the world's ten most-visited tourist attractions in 2013: Times Square, Central Park, Niagara Falls (shared with Ontario), and Grand Central Terminal. New York is home to the Statue of Liberty, a symbol of the United States and its ideals of freedom, democracy, and opportunity. In the 21st century, New York has emerged as a global node of creativity and entrepreneurship, social tolerance, and environmental sustainability. New York's higher education network comprises approximately 200 colleges and universities, including Columbia University, Cornell University, New York University, and Rockefeller University, which have been ranked among the top 35 in the world.
== History ==
=== 16th century ===
In 1524, Giovanni da Verrazzano, an Italian explorer in the service of the French crown, explored the Atlantic coast of North America between the Carolinas and Newfoundland, including New York Harbor and Narragansett Bay. On April 17, 1524 Verrazanno entered New York Bay, by way of the Strait now called the Narrows into the northern bay which he named Santa Margherita, in honour of the King of France's sister. Verrazzano described it as "a vast coastline with a deep delta in which every kind of ship could pass" and he adds: "that it extends inland for a league and opens up to form a beautiful lake. This vast sheet of water swarmed with native boats". He landed on the tip of Manhattan and perhaps on the furthest point of Long Island. Verrazanno's stay in this place was interrupted by a storm which pushed him north towards Martha's Vineyard.
In 1540 French traders from New France built a chateau on Castle Island, within present-day Albany; due to flooding, it was abandoned the next year. In 1614, the Dutch under the command of Hendrick Corstiaensen, rebuilt the French chateau, which they called Fort Nassau. Fort Nassau was the first Dutch settlement in North America, and was located along the Hudson River, also within present-day Albany. The small fort served as a trading post and warehouse. Located on the Hudson River flood plain, the rudimentary "fort" was washed away by flooding in 1617, and abandoned for good after Fort Orange (New Netherland) was built nearby in 1623.
=== 17th century ===
Henry Hudson's 1609 voyage marked the beginning of European involvement with the area. Sailing for the Dutch East India Company and looking for a passage to Asia, he entered the Upper New York Bay on September 11 of that year. Word of his findings encouraged Dutch merchants to explore the coast in search for profitable fur trading with local Native American tribes.
During the 17th century, Dutch trading posts established for the trade of pelts from the Lenape, Iroquois, and other tribes were founded in the colony of New Netherland. The first of these trading posts were Fort Nassau (1614, near present-day Albany); Fort Orange (1624, on the Hudson River just south of the current city of Albany and created to replace Fort Nassau), developing into settlement Beverwijck (1647), and into what became Albany; Fort Amsterdam (1625, to develop into the town New Amsterdam which is present-day New York City); and Esopus, (1653, now Kingston). The success of the patroonship of Rensselaerswyck (1630), which surrounded Albany and lasted until the mid-19th century, was also a key factor in the early success of the colony. The English captured the colony during the Second Anglo-Dutch War and governed it as the Province of New York. The city of New York was recaptured by the Dutch in 1673 during the Third Anglo-Dutch War (1672–1674) and renamed New Orange. It was returned to the English under the terms of the Treaty of Westminster a year later.
== References ==
== Further reading ==
French, John Homer (1860). Historical and statistical gazetteer of New York State. Syracuse, New York: R. Pearsall Smith. OCLC 224691273. (Full text via Google Books.)
New York State Historical Association (1940). New York: A Guide to the Empire State. New York City: Oxford University Press. ISBN 978-1-60354-031-5. OCLC 504264143. (Full text via Google Books.)
== External links ==
New York at DMOZ
Geographic data related to New York at OpenStreetMap
The Problems:
Problem 1:
I am having trouble removing all the content from the sections "References" and "Further reading".
For example:
== History ==
some text under the section History
=== 17th century ===
some text under the section 17 century
=== 19th century ===
some text under the section 19 century
== References ==
some references
== Further reading ==
some further reading sources
Desired Result:
== History ==
some text under the section History
=== 17th century ===
some text under the section 17 century
=== 19th century ===
some text under the section 19 century
Problem 1B:
I will be getting the content of many topics, so there will be many references to delete. How can I do it?
For example, I would like to delete all sections that begin with "References" and "Further reading":
== New York ==
== References ==
== Further reading ==
== California ==
== References ==
== Further reading ==
== Floria ==
== References ==
== Further reading ==
Desired Result:
== New York ==
== California ==
== Floria ==
Sorry for the long post and please forgive me as I have very little knowledge of python.
All advice and help is greatly appreciated.
Thank you.
Edit
Current Problem
Hi osantana,
I have tried the code that you have provided as shown below:
import wikipedia
import re
f2 = open('osantana', 'w')
ny = wikipedia.page("New York")
section_title_re = re.compile("^=+\s+.*\s+=+$")
raw_content = ny.content
content = []
skip = False
for l in raw_content.splitlines():
    line = l.strip()
    if "== References ==" in line.lower():
        skip = True # replace with break if this is the last section
        continue
    if "== Further reading ==" in line.lower():
        skip = True # replace with break if this is the last section
        continue
    if "== External links ==" in line.lower():
        skip = True # replace with break if this is the last section
        continue
    if section_title_re.match(line):
        skip = False
        continue
    if skip:
        continue
    content.append(line)
content = '\n'.join(content) + '\n'
f2.write(content.encode('utf8')+"\n")
It works fine for everything except these three parts:
Original File:
== References ==
Index of New York-related articles
Outline of New York – organized list of topics about New York
== Further reading ==
French, John Homer (1860). Historical and statistical gazetteer of New York State. Syracuse, New York: R. Pearsall Smith. OCLC 224691273. (Full text via Google Books.)
New York State Historical Association (1940). New York: A Guide to the Empire State. New York City: Oxford University Press. ISBN 978-1-60354-031-5. OCLC 504264143. (Full text via Google Books.)
Result of the code:
Index of New York-related articles
Outline of New York – organized list of topics about New York
French, John Homer (1860). Historical and statistical gazetteer of New York State. Syracuse, New York: R. Pearsall Smith. OCLC 224691273. (Full text via Google Books.)
New York State Historical Association (1940). New York: A Guide to the Empire State. New York City: Oxford University Press. ISBN 978-1-60354-031-5. OCLC 504264143. (Full text via Google Books.)
The headings were removed but the content is still intact.
I'll assume that References/Further reading are not the last sections in all pages. If those are the last sections, replace the lines marked with a comment below with a break statement.
import re
def parse(raw_content):
    section_title_re = re.compile(r"^=+\s+.*\s+=+$")
    content = []
    skip = False
    for l in raw_content.splitlines():
        line = l.strip()
        if "= references =" in line.lower():
            skip = True # replace with break if this is the last section
            continue
        if "= further reading =" in line.lower():
            skip = True # replace with break if this is the last section
            continue
        if section_title_re.match(line):
            skip = False
            continue
        if skip:
            continue
        content.append(line)
    return '\n'.join(content) + '\n'
print(parse(ny.content))
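Note that the checks compare against line.lower(), so the literals have to be lowercase. A mixed-case literal such as "== References ==" can never be found in a lowercased line, so the line falls through to the generic heading check: the heading is dropped but its content is kept. A minimal self-contained sketch of the same skip-flag idea follows, assuming (per the "Desired Result" in the question) that headings of kept sections should stay. The names drop_sections and unwanted are made up for illustration.

```python
import re

# Matches wiki-style headings such as "== History ==" or "=== 17th century ===".
SECTION_TITLE_RE = re.compile(r"^(=+)\s*(.*?)\s*=+$")

def drop_sections(raw_content, unwanted=("references", "further reading", "external links")):
    kept = []
    skip_level = None  # heading depth of the section being skipped, if any
    for raw_line in raw_content.splitlines():
        line = raw_line.strip()
        match = SECTION_TITLE_RE.match(line)
        if match:
            level = len(match.group(1))
            title = match.group(2).lower()  # compare in lowercase only
            if skip_level is not None and level <= skip_level:
                skip_level = None  # the skipped section has ended
            if skip_level is None and title in unwanted:
                skip_level = level  # start skipping this section
        if skip_level is None:
            kept.append(line)
    return "\n".join(kept) + "\n"

sample = (
    "== History ==\n"
    "some text under the section History\n"
    "=== 17th century ===\n"
    "some text under the section 17 century\n"
    "== References ==\n"
    "some references\n"
    "== Further reading ==\n"
    "some further reading sources\n"
)
print(drop_sections(sample))
```

Tracking the heading depth means subsections nested inside an unwanted section are skipped along with it.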
For problem 2 you could do something like this
contents = re.sub(r'=+\s*.+\s*=+', '', contents)
Just remember to import re, the regular expressions module.
The method being used is re.sub(pattern, repl, string). pattern is a regular expression pattern (the re documentation provides an overview of the syntax).
repl is what you want to replace all occurrences of the pattern with. In this case you want to remove the pattern, so just use an empty string as the replacement.
string is of course the string you're performing the substitution on. This method returns the final result, so if you want to overwrite the original string, just assign the returned value back to the input string.
Here's the pattern I used explained just in case. '=+\s*.+\s*=+' means any part of the string where there is one or more equal sign (=+), followed by zero or more spaces (\s*), followed by one or more of any character (.+), followed again by zero or more spaces (\s*), finally ending with one or more equal signs (=+).
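As a quick illustration of what that substitution leaves behind, here is a sketch on a small made-up sample. The second call is not from the answer: it is an assumed variant that anchors the pattern to whole lines with re.MULTILINE and also consumes the trailing newline, so no blank lines remain.

```python
import re

contents = "== History ==\nsome text\n=== 17th century ===\nmore text\n"

# Pattern from the answer: the heading text is removed, its line break stays.
stripped = re.sub(r'=+\s*.+\s*=+', '', contents)
print(repr(stripped))  # '\nsome text\n\nmore text\n'

# Assumed variant: match whole heading lines only and swallow the newline too.
line_anchored = re.sub(r'^=+[ \t]*.+?[ \t]*=+[ \t]*$\n?', '', contents, flags=re.MULTILINE)
print(repr(line_anchored))  # 'some text\nmore text\n'
```

Because \s also matches newlines, the unanchored pattern can reach across a line break when two headings follow each other directly, which is one reason to consider the line-anchored form.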
For problem 1 I'd say you could probably accomplish what you want to using regular expressions as well, and the re module makes it pretty easy. The link I gave above should help.
import re
import wikipedia

def clean_data(f):
    def inner(word):
        text = f(word)
        text = text.encode("utf-8", errors='ignore').decode("utf-8")
        text = re.sub(r"https?:.*(?=\s)", '', text)  # drop URLs
        text = re.sub('[’‘"]', "'", text)  # normalize quote characters
        text = re.sub(r"[^\x00-\x7f]+", '', text)  # drop non-ASCII characters
        text = re.sub(r'[#&*+/<>\[\]^`{|}~ \t\n\r]', ' ', text)  # symbols and whitespace to spaces
        text = re.sub(r'\(.*?\)', '', text)  # drop parenthesized text
        text = re.sub(r'==.*?==', '', text)  # drop section headings
        text = re.sub(' , ', ',', text)
        text = re.sub(r' \.', '.', text)
        text = re.sub(" +", ' ', text)  # collapse repeated spaces
        text = re.sub(";", 'and', text)
        return text.strip()
    return inner

#clean_data
def get_data(word):
    try:
        data = wikipedia.summary(word, sentences=300)
    except wikipedia.DisambiguationError as e:
        print("picking the data from:", e.options[:3])
        data = ''.join([wikipedia.summary(s, sentences=100) for s in e.options[:3]])
    return data

data = get_data("Orange")
Disclaimer: I read very carefully this thread:
Street Address search in a string - Python or Ruby
and many other resources.
Nothing works for me so far.
In more detail, here is what I am looking for:
The rules are relaxed and I definitely am not asking for a perfect code that covers all cases; just a few simple basic ones with assumptions that the address should be in the format:
a) Street number (1...N digits);
b) Street name : one or more words capitalized;
b-2) (optional) would be best if it could be prefixed with abbrev. "S.", "N.", "E.", "W."
c) (optional) unit/apartment/etc can be any (incl. empty) number of arbitrary characters
d) Street "type": one of ("st.", "ave.", "way");
e) City name : 1 or more Capitalized words;
f) (optional) state abbreviation (2 letters)
g) (optional) zip which is any 5 digits.
None of the above needs to be a valid thing (e.g. an existing city or zip).
I am trying expressions like these so far:
pat = re.compile(r'\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?', re.IGNORECASE)
>>> pat.search("123 East Virginia avenue, unit 123, San Ramondo, CA, 94444")
This doesn't work, and it's not easy for me to understand why. Specifically: how do I separate, in my pattern, a group of arbitrary words from one of the specific words that should follow, like a state abbreviation or a street "type" ("st.", "ave.")?
Anyhow: here is an example of what I am hoping to get:
Given
def ex_addr(text):
    # does the re magic
    # returns 1st address (all addresses?) or None if nothing found
    pass

for t in [
    'The meeting will be held at 22 West Westin st., South Carolina, 12345 on Nov.-18',
    'The meeting will be held at 22 West Westin street, SC, 12345 on Nov.-18',
    'Hi there,\n How about meeting tomorr. #10am-sh in Chadds # 123 S. Vancouver ave. in Ottawa? \nThanks!!!',
    'Hi there,\n How about meeting tomorr. #10am-sh in Chadds # 123 S. Vancouver avenue in Ottawa? \nThanks!!!',
    'This was written in 1999 in Montreal',
    "Cool cafe at 420 Funny Lane, Cupertino CA is way too cool",
    "We're at a party at 12321 Mammoth Lane, Lexington MA 77777; Come have a beer!"
]:
    print(ex_addr(t))
I would like to get:
'22 West Westin st., South Carolina, 12345'
'22 West Westin street, SC, 12345'
'123 S. Vancouver ave. in Ottawa'
'123 S. Vancouver avenue in Ottawa'
None # for 'This was written in 1999 in Montreal',
"420 Funny Lane, Cupertino CA",
"12321 Mammoth Lane, Lexington MA 77777"
Could you please help?
I just ran across this in GitHub as I am having a similar problem. Appears to work and be more robust than your current solution.
https://github.com/madisonmay/CommonRegex
Looking at the code, the regex for street address accounts for many more scenarios. '\d{1,4} [\w\s]{1,20}(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr|court|ct|parkway|pkwy|circle|cir|boulevard|blvd)\W?(?=\s|$)'
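A minimal sketch of trying that pattern directly with the re module, without installing the library, could look like the following. The helper name find_street_addresses is made up, and compiling with re.IGNORECASE is a choice made here so that "Street" and "street" are treated alike.

```python
import re

# Street-address pattern quoted above, split over two lines for readability.
STREET_ADDRESS_RE = re.compile(
    r'\d{1,4} [\w\s]{1,20}(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq'
    r'|trail|trl|drive|dr|court|ct|parkway|pkwy|circle|cir|boulevard|blvd)\W?(?=\s|$)',
    re.IGNORECASE,
)

def find_street_addresses(text):
    # The pattern has no capturing groups, so findall returns whole matches.
    return STREET_ADDRESS_RE.findall(text)

print(find_street_addresses(
    'The meeting will be held at 22 West Westin street, SC, 12345 on Nov.-18'))
# ['22 West Westin street,']
print(find_street_addresses(
    'Cool cafe at 420 Funny Lane, Cupertino CA is way too cool'))
# []
```

The second call shows a limit of the pattern: "Lane" is not in its list of street types, and the match stops at the street type, so city, state and zip would still need separate handling.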
\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?
In this regex, you have one too many spaces (before ( \w+){1,5}, which already begins with one). Removing it, it matches your example.
I don't think you can assume that a "unit 123" or similar will be there, or there might be several ones (e.g. "building A, apt 3"). Note that in your initial regex, the . might match , which could lead to very long (and unwanted) matches.
You should probably accept several such groups with a limit on their number (e.g. replace , (.*) with something like (, [^,]{1,20}){0,5}).
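The two suggestions can be checked side by side. This is only a sketch: the variable names are made up, the shared tail of the pattern is factored out for readability, and the second test address (without a unit) is an invented variation of the one in the question.

```python
import re

address = "123 East Virginia avenue, unit 123, San Ramondo, CA, 94444"
no_unit = "123 East Virginia avenue, San Ramondo, CA, 94444"

# Shared ending: city words, state abbreviation, zip code.
tail = r"( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?"

# Pattern from the question: the extra space before the city group requires two spaces.
original = re.compile(r"\d{1,4}( \w+){1,5}, (.*), " + tail, re.IGNORECASE)
# Extra space removed, as suggested.
fixed = re.compile(r"\d{1,4}( \w+){1,5}, (.*)," + tail, re.IGNORECASE)
# Unit part replaced by zero to five bounded, comma-free groups.
bounded = re.compile(r"\d{1,4}( \w+){1,5}(, [^,]{1,20}){0,5}," + tail, re.IGNORECASE)

print(original.search(address))         # None
print(fixed.search(address).group(0))   # the full address
print(fixed.search(no_unit))            # None, a unit part is still required
print(bounded.search(no_unit).group(0)) # the full address without a unit
```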
In any case, you will probably never get something 100% accurate that accepts every variation people might throw at it. Do lots of tests! Good luck.
Recently I got my hands on a research project that would greatly benefit from learning how to parse a string of biographical data on several individuals into a set of dictionaries for each individual.
The string contains break words and I was hoping to create keys off of the breakwords and separate dictionaries by line breaks. So here are two people I want to create two different dictionaries for within my data:
Bankers = [ ' Bakstansky, Peter; Senior Vice President, Federal
Reserve Bank of New York, in charge of public information
since 1976, when he joined the NY Fed as Vice President. Senior
Officer in charge of die Office of Regional and Community Affairs,
Ombudsman for the Bank and Senior Administrative Officer for Executive
Group, m zero children Educ City College of New York (Bachelor of
Business Administration, 1961); University of Illinois, Graduate
School, and New York University, Graduate School of Business. 1962-6:
Business and financial writer, New York, on American Banker, New
York-World Telegram & Sun, Neia York Herald Tribune (banking editor
1964-6). 1966-74: Chase Manhattan Bank: Manager of Public Relations,
based in Paris, 1966-71; Manager of Chase's European Marketing and
Planning, based in Brussels, 1971-2; Vice President and Director of
Public Relations, 1972-4.1974-76: Bache & Co., Vice President and
Director of Corporate Communications. Barron, Patrick K.; First Vice
President and < Operating Officer of the Federal Reserve Bank o
Atlanta since February 1996. Member of the Fed" Reserve Systems
Conference of first Vice Preside Vice chairman of the bank's
management Con and of the Discount Committee, m three child Educ
University of Miami (Bachelor's degree in Management); Harvard
Business School (Prog Management Development); Stonier Graduate Sr of
Banking, Rutgers University. 1967: Joined Fed Reserve Bank of Atlanta
in computer operations 1971: transferred to Miami Branch; 1974:
Assist: President; 1987: Senior Vice President.1988: re1- Atlanta as
Head of Corporate Services. Member Executive Committee of the Georgia
Council on Igmic Education; former vice diairman of Greater
ji§?Charnber of Commerce and the President'sof the University of
Miami; in Atlanta, former ||Mte vice chairman for the United Way of
Atlanta feiSinber of Leadership Atlanta. Member of the Council on
Economic Education. Interest. ' ]
So, for example, this data contains two people: Peter Bakstansky and Patrick K. Barron. I want to create a dictionary for each individual with these 4 keys: bankerjobs, Number of children, Education, and nonbankerjobs.
The text already contains break words: "m" introduces the number of children and "Educ" introduces the education. Anything before "m" is bankerjobs, and anything after the first "." following "Educ" is nonbankerjobs. The break between individuals seems to be a "." followed by more than one space.
How can I create a dictionary for each of these two individuals with these 4 keys using regular expressions on these break words?
Specifically, what set of regexes could help me create a dictionary for these two individuals with these 4 keys (built on the break words specified above)?
The pattern I am thinking of would be something like this in Perl:
pattern = [r'(m/[ '(.*);(.*)m(.*)Educ(.*)/)']
but i'm not sure..
I'm thinking the code would be similar to this but please correct it if im wrong:
my_banker_parser = re.compile(r'somefancyregex')
def nested_dict_from_text(text):
    m = re.search(my_banker_parser, text)
    if not m:
        raise ValueError
    d = m.groupdict()
    return { "centralbanker": d }
result = nested_dict_from_text(bankers)
print(result)
My hope is to take this code and run it through the rest of the biographies for all of individuals of interest.
Using named groups will probably be less brittle, since it doesn't depend on the pieces of data being in the same order in each biography. Something like this should work:
>>> import re
>>> regex = re.compile(r'(?P<foo>foo)|(?P<bar>bar)|(?P<baz>baz)')
>>> data = {}
>>> for match in regex.finditer('bar baz foo something'):
...     data.update((k, v) for k, v in match.groupdict().items() if v is not None)
...
>>> data
{'bar': 'bar', 'baz': 'baz', 'foo': 'foo'}
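Applied to the biographies, a named-group pattern built on the break words from the question might look like the sketch below. It is only a starting point under several assumptions: the text has already been split into one string per person, the sample is a shortened and cleaned-up version of the first entry, and the group names (name, bankerjobs, children, education, nonbankerjobs) and the function parse_banker are made up for illustration.

```python
import re

BANKER_RE = re.compile(
    r"(?P<name>[^;]+);\s*"                 # "Surname, First name" up to the semicolon
    r"(?P<bankerjobs>.*?),?\s+m\s+"        # everything before the break word "m"
    r"(?P<children>.*?)\s+Educ\s+"         # text between "m" and "Educ"
    r"(?P<education>.*?\.)\s+"             # up to the first "." followed by whitespace
    r"(?P<nonbankerjobs>.*)",              # the rest of the entry
    re.DOTALL,
)

def parse_banker(entry):
    match = BANKER_RE.match(entry.strip())
    if not match:
        raise ValueError("entry does not follow the expected layout")
    return match.groupdict()

sample = (
    "Bakstansky, Peter; Senior Vice President, Federal Reserve Bank of New York, "
    "m zero children Educ City College of New York (Bachelor of Business "
    "Administration, 1961). 1962-6: Business and financial writer, New York."
)
print(parse_banker(sample))
```

Real entries with OCR damage (a missing "m" or "Educ", or a period with no space after it) would not match this layout and would need looser rules or manual cleanup first.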