I have the following string in python:
"\n[[[\"guns\",0,[143,362,396,357],{\"zf\":33,\"zl\":8,\"zp\":{\"gs_ss\":\"1\"}}],[\"china chinese spy balloon\",0,[143,362,396,357],{\"zf\":33,\"zl\":8,\"zp\":{\"gs_ss\":\"1\"}}],[\"aris hampers grand rapids\",0,[143,362,396,357],{\"zf\":33,\"zl\":8,\"zp\":{\"gs_ss\":\"1\"}}],[\"mountain lion p 22\",0,[143,362,396,357],{\"zf\":33,\"zl\":8,\"zp\":{\"gs_ss\":\"1\"}}],[\"real estate housing market\",0,[143,362,396,357],{\"zf\":33,\"zl\":8,\"zp\":{\"gs_ss\":\"1\"}}],[\"hunter biden\",46,[143,362,396,357],{\"lm\":[],\"zf\":33,\"zh\":\"Hunter Biden\",\"zi\":\"American attorney\",\"zl\":8,\"zp\":{\"gs_ssp\":\"eJzj4tLP1TcwycrOK88xYPTiySjNK0ktUkjKTEnNAwBulQip\"},\"zs\":\"https://encrypted-tbn0.gstatic.com/images?q\\u003dtbn:ANd9GcQaO4eyFc6sDCa7A26Y_9g71clgC0Ot11Elt0KxAFiQo0Ey7Tp69FWxS8o\\u0026s\\u003d10\"}],[\"maui firefighter tre evans dumaran\",0,[143,362,396,357],{\"zf\":33,\"zl\":8,\"zp\":{\"gs_ss\":\"1\"}}],[\"pope francis benedict\",0,[143,362,396,357],{\"zf\":33,\"zl\":8,\"zp\":{\"gs_ss\":\"1\"}}],[\"coast guard rescue stolen boat\",0,[143,362,396,357],{\"zf\":33,\"zl\":8,\"zp\":{\"gs_ss\":\"1\"}}],[\"lauren boebert\",46,[143,362,396,357],{\"lm\":[],\"zf\":33,\"zh\":\"Lauren Boebert\",\"zi\":\"United States Representative\",\"zl\":8,\"zp\":{\"gs_ssp\":\"eJzj4tVP1zc0zDIqMzCrMCswYPTiy0ksLUrNU0jKT01KLSoBAJDsCeg\"},\"zs\":\"https://encrypted-tbn0.gstatic.com/images?q\\u003dtbn:ANd9GcS1qLJyZQJkVxsOTuP4gnADPLG5oBWe0LWSFClElzhcVrwVCfnNa_s64Zs\\u0026s\\u003d10\"}]],{\"ag\":{\"a\":{\"8\":[\"Trending searches\"]}}}"
How can I clean it using Python so that it only outputs the text:
"guns",
"china chinese spy balloon",
"aris hampers grand rapids",
"mountain lion p 22",
....
I am assuming you left off the final ] character. With that added, you have a valid JSON string, so you can just parse it and grab the parts you want. Here I am assuming you want the strings from the lists:
import json
s = "\n[[[\"guns\",0,[143,362,396,357],{\"zf\":33,\"zl\":8,\"zp\":{\"gs_ss\":\"1\"}}],[\"china chinese spy balloon\",0,[143,362,396,357],{\"zf\":33,\"zl\":8,\"zp\":{\"gs_ss\":\"1\"}}],[\"aris hampers grand rapids\",0,[143,362,396,357],{\"zf\":33,\"zl\":8,\"zp\":{\"gs_ss\":\"1\"}}],[\"mountain lion p 22\",0,[143,362,396,357],{\"zf\":33,\"zl\":8,\"zp\":{\"gs_ss\":\"1\"}}],[\"real estate housing market\",0,[143,362,396,357],{\"zf\":33,\"zl\":8,\"zp\":{\"gs_ss\":\"1\"}}],[\"hunter biden\",46,[143,362,396,357],{\"lm\":[],\"zf\":33,\"zh\":\"Hunter Biden\",\"zi\":\"American attorney\",\"zl\":8,\"zp\":{\"gs_ssp\":\"eJzj4tLP1TcwycrOK88xYPTiySjNK0ktUkjKTEnNAwBulQip\"},\"zs\":\"https://encrypted-tbn0.gstatic.com/images?q\\u003dtbn:ANd9GcQaO4eyFc6sDCa7A26Y_9g71clgC0Ot11Elt0KxAFiQo0Ey7Tp69FWxS8o\\u0026s\\u003d10\"}],[\"maui firefighter tre evans dumaran\",0,[143,362,396,357],{\"zf\":33,\"zl\":8,\"zp\":{\"gs_ss\":\"1\"}}],[\"pope francis benedict\",0,[143,362,396,357],{\"zf\":33,\"zl\":8,\"zp\":{\"gs_ss\":\"1\"}}],[\"coast guard rescue stolen boat\",0,[143,362,396,357],{\"zf\":33,\"zl\":8,\"zp\":{\"gs_ss\":\"1\"}}],[\"lauren boebert\",46,[143,362,396,357],{\"lm\":[],\"zf\":33,\"zh\":\"Lauren Boebert\",\"zi\":\"United States Representative\",\"zl\":8,\"zp\":{\"gs_ssp\":\"eJzj4tVP1zc0zDIqMzCrMCswYPTiy0ksLUrNU0jKT01KLSoBAJDsCeg\"},\"zs\":\"https://encrypted-tbn0.gstatic.com/images?q\\u003dtbn:ANd9GcS1qLJyZQJkVxsOTuP4gnADPLG5oBWe0LWSFClElzhcVrwVCfnNa_s64Zs\\u0026s\\u003d10\"}]],{\"ag\":{\"a\":{\"8\":[\"Trending searches\"]}}}]"
obj = json.loads(s)
def get_strings(item):
    if isinstance(item, str):
        yield item
    elif isinstance(item, list):
        for subitem in item:
            yield from get_strings(subitem)
list(get_strings(obj))
This will give you:
['guns',
'china chinese spy balloon',
'aris hampers grand rapids',
'mountain lion p 22',
'real estate housing market',
'hunter biden',
'maui firefighter tre evans dumaran',
'pope francis benedict',
'coast guard rescue stolen boat',
'lauren boebert']
This assumes there's nothing you want in those dictionaries (like: {\"zf\":33,\"zl\":8,\"zp\"). If there is, it's simple enough to add another clause to deal with them, but you will need to figure out which text is junk and what is real (it all looked like junk to me).
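If you only want the search terms themselves rather than every string anywhere in the structure, you can also index into the parsed object directly, since each term is the first element of its inner list. A minimal sketch (the string below is an abbreviated stand-in with the same shape as the one above):

```python
import json

# Abbreviated stand-in for the full string: the outer list holds a list
# of entries, and each entry's first element is the query text.
s = '[[["guns", 0, [143], {"zf": 33}], ["hunter biden", 46, [143], {"zh": "Hunter Biden"}]], {"ag": {}}]'

obj = json.loads(s)
# The search terms are the first element of each inner list in obj[0].
terms = [entry[0] for entry in obj[0]]
print(terms)  # ['guns', 'hunter biden']
```

This is more fragile than the recursive approach (it assumes the layout never changes), but it cannot accidentally pick up strings from the metadata dictionaries.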
import re
#list of names to identify in input strings
result_list = ['Thomas Edd', 'Melissa Clark', 'Ada White', 'Louis Pasteur', 'Edd Thomas', 'Clark Melissa', 'White Eda', 'Pasteur Louis', 'Thomas', 'Melissa', 'Ada', 'Louis', 'Edd', 'Clark', 'White', 'Pasteur']
result_list.sort() # sorts normally by alphabetical order (optional)
result_list.sort(key=len, reverse=True) # sorts by descending length
#example 1
input_text = "Melissa went for a walk in the park, then Melisa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so Thomas Edd is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker."
# In example 2 below, the text is almost the same; however, some of the names are
# already encapsulated in the ((PERS)name) structure and should not be encapsulated again.
input_text = "((PERS)Melissa) went for a walk in the park, then Melisa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker." #example 2
for i in result_list:
    input_text = re.sub(r"\(\(PERS\)" + r"(" + str(i) + r")" + r"\)",
                        lambda m: (f"((PERS){m[1]})"),
                        input_text)
print(repr(input_text)) # --> output
Note that the names must meet certain conditions to be identified: they must sit between whitespace (\s*the searched name\s*), or be at the beginning ((?:(?<=\s)|^)) and/or at the end of the input string.
It may also be the case that a name is followed by a comma, for example "Ada White, Melissa and Louis went shopping", or that spaces are accidentally omitted, as in "Ada White,Melissa and Louis went shopping".
For this reason it is important that a name can also be found immediately after [.,;].
Cases where the names should NOT be encapsulated would be, for example:
"the Edd's business"
"The whitespace"
"the pasteurization process takes time"
"Those White-spaces in that text are unnecessary"
since in these cases the name is preceded or followed by other characters that make it part of a different word, so it should not be treated as the name being searched for.
For examples 1 and 2 (note that example 2 is the same as example 1 but already has some encapsulated names and you have to prevent them from being encapsulated again), you should get the following output.
"((PERS)Melissa) went for a walk in the park, then ((PERS)Melisa Clark) went to the cosmetics store. There ((PERS)Thomas) showed her a wide variety of cosmetic products. ((PERS)Edd Thomas) is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as ((PERS)Edd) is always honest with his customers. ((PERS)White) is a new client who came to Edd's business due to the good social media reviews she saw from ((PERS)Melissa), her co-worker."
You can use lookarounds to exclude already encapsulated names and those followed by ', an alphanumeric character or -:
import re
result_list = ['Thomas Edd', 'Melissa Clark', 'Ada White', 'Louis Pasteur', 'Edd Thomas', 'Clark Melissa', 'White Eda', 'Pasteur Louis', 'Thomas', 'Melissa', 'Ada', 'Louis', 'Edd', 'Clark', 'White', 'Pasteur']
result_list.sort(key=len, reverse=True) # sorts by descending length
input_text = "((PERS)Melissa) went for a walk in the park, then Melissa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker." #example 1
pat = re.compile(rf"(?<!\(PERS\))({'|'.join(result_list)})(?!['\w)-])")
input_text = re.sub(pat, r'((PERS)\1)', input_text)
Output:
((PERS)Melissa) went for a walk in the park, then ((PERS)Melissa Clark) went to the cosmetics store. There ((PERS)Thomas) showed her a wide variety of cosmetic products. ((PERS)Edd Thomas) is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as ((PERS)Edd) is always honest with his customers. ((PERS)White) is a new client who came to Edd's business due to the good social media reviews she saw from ((PERS)Melissa), her co-worker.
Of course you can refine the content of your lookahead based on further edge cases.
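For instance, one possible refinement (an assumption, not something required by the question) is an extra lookbehind that also rejects names glued to a preceding word character, apostrophe or hyphen, as in "Anti-White". A sketch with a shortened result_list:

```python
import re

result_list = ['Thomas Edd', 'Edd Thomas', 'Thomas', 'Edd', 'White']
result_list.sort(key=len, reverse=True)  # longest names first

# (?<!\(PERS\))  skips names that are already encapsulated;
# (?<![\w'-])    additionally rejects matches glued to a preceding
#                word character, apostrophe or hyphen.
pat = re.compile(rf"(?<!\(PERS\))(?<![\w'-])({'|'.join(result_list)})(?!['\w)-])")

text = "Anti-White sentiment aside, ((PERS)Edd Thomas) met Thomas."
result = pat.sub(r'((PERS)\1)', text)
print(result)
```

Here "Anti-White" and the already-encapsulated "((PERS)Edd Thomas)" are left alone, while the bare "Thomas" is encapsulated.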
I'm new to Python web scraping and am trying to scrape one of the Wikiquote pages for practice.
Link to the Wikiquote page
Code I Tried:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
req = Request('https://en.wikiquote.org/wiki/India', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
html = BeautifulSoup(webpage, 'html.parser')
quotes = html.find('ul').findAll("b")
print(quotes)
I got the first quote, but I want all of the quotes on the page.
Can anyone provide a solution? TIA!
You have to use findAll to get all the ul elements, then extract the text from each one:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
req = Request('https://en.wikiquote.org/wiki/India', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
html = BeautifulSoup(webpage, 'html.parser')
quotes = html.findAll('ul')
for quote in quotes:
    print(quote.get_text())
Result:
In India I found a race of mortals living upon the Earth, but not adhering to it. Inhabiting cities, but not being fixed to them, possessing everything but possessed by nothing.
Apollonius of Tyana, quoted in The Transition to a Global Society (1991) by Kishor Gandhi, p. 17, and in The Age of Elephants (2006) by Peter Moss, p. v
Apollonius of Tyana, quoted in The Transition to a Global Society (1991) by Kishor Gandhi, p. 17, and in The Age of Elephants (2006) by Peter Moss, p. v
This also is remarkable in India, that all Indians are free, and no Indian at all is a slave. In this the Indians agree with the Lacedaemonians. Yet the Lacedaemonians have Helots for slaves, who perform the duties of slaves; but the Indians have no slaves at all, much less is any Indian a slave.
Arrian, Anabasis Alexandri, Book VII : Indica, as translated by Edgar Iliff Robson (1929), p. 335
Arrian, Anabasis Alexandri, Book VII : Indica, as translated by Edgar Iliff Robson (1929), p. 335
No Indian ever went outside his own country on a warlike expedition, so righteous were they.
Arrian, Anabasis Alexandri, Book VII : Indica, as translated by Edgar Iliff Robson (1929), p. 18
Arrian, Anabasis Alexandri, Book VII : Indica, as translated by Edgar Iliff Robson (1929), p. 18
India of the ages is not dead nor has She spoken her last creative word; She lives and has still something to do for herself and the human peoples. And that which must seek now to awake is not an Anglicized oriental people, docile pupil of the West and doomed to repeat the cycle of the Occident's success and failure, but still the ancient immemorial Shakti recovering Her deepest self, lifting Her head higher toward the supreme source of light and strength and turning to discover the complete meaning and a vaster form of her Dharma.
Sri Aurobindo, in the last issue of Arya: A Philosophical Review (January 1921), as quoted in The Modern Review, Vol. 29 (1921), p. 626.
Sri Aurobindo, in the last issue of Arya: A Philosophical Review (January 1921), as quoted in The Modern Review, Vol. 29 (1921), p. 626.
For what is a nation? What is our mother-country? It is not a piece of earth, nor a figure of speech, nor a fiction of the mind. It is a mighty Shakti, composed of the Shaktis of all the millions of units that make up the nation, just as Bhawani Mahisha Mardini sprang into being from the Shaktis of all the millions of gods assembled in one mass of force and welded into unity. The Shakti we call India, Bhawani Bharati, is the living unity of the Shaktis of three hundred million people …
Sri Aurobindo (Bhawāni Mandir) quoted in Issues of Identity in Indian English Fiction: A Close Reading of Canonical Indian English Novels by H. S. Komalesha
Sri Aurobindo (Bhawāni Mandir) quoted in Issues of Identity in Indian English Fiction: A Close Reading of Canonical Indian English Novels by H. S. Komalesha
India is the guru of the nations, the physician of the human soul in its profounder maladies; she is destined once more to remould the life of the world and restore the peace of the human spirit. But Swaraj is the necessary condition of her work and before she can do the work , she must fulfil the condition.
Sri Aurobindo, Sri Aurobindo Mandir Annual (1947), p. 196
Sri Aurobindo, Sri Aurobindo Mandir Annual (1947), p. 196
...
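Each attribution prints twice above because findAll('ul') returns both the outer list and the nested attribution sub-list, and get_text() on the outer one includes the nested text. If you only want the quote text, you could select just the bold elements the quotes are wrapped in. A sketch using a tiny inline stand-in for the page structure (the real page's markup may of course differ or change):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the page structure: the quote sits in a <b> tag,
# and the attribution lives in a nested <ul> (which is why get_text()
# on the outer <ul> prints it twice).
sample = """
<ul><li><b>First quote.</b>
  <ul><li>Some Author, Some Book</li></ul>
</li></ul>
<ul><li><b>Second quote.</b>
  <ul><li>Another Author</li></ul>
</li></ul>
"""
soup = BeautifulSoup(sample, 'html.parser')
# Only <b> tags that are direct children of a top-level list item,
# skipping the nested attribution sub-lists entirely.
quotes = [b.get_text() for b in soup.select('ul > li > b')]
print(quotes)  # ['First quote.', 'Second quote.']
```

On the live page you would build soup from the downloaded HTML as in the answer above; the select('ul > li > b') call is the only change.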
So, the file has about 57,000 book titles, author names and an ETEXT No. for each. I am trying to parse the file to get only the ETEXT Nos.
The File is like this:
TITLE and AUTHOR ETEXT NO.
Aspects of plant life; with special reference to the British flora, 56900
by Robert Lloyd Praeger
The Vicar of Morwenstow, by Sabine Baring-Gould 56899
[Subtitle: Being a Life of Robert Stephen Hawker, M.A.]
Raamatun tutkisteluja IV, mennessä Charles T. Russell 56898
[Subtitle: Harmagedonin taistelu]
[Language: Finnish]
Raamatun tutkisteluja III, mennessä Charles T. Russell 56897
[Subtitle: Tulkoon valtakuntasi]
[Language: Finnish]
Tom Thatcher's Fortune, by Horatio Alger, Jr. 56896
A Yankee Flier in the Far East, by Al Avery 56895
and George Rutherford Montgomery
[Illustrator: Paul Laune]
Nancy Brandon's Mystery, by Lillian Garis 56894
Nervous Ills, by Boris Sidis 56893
[Subtitle: Their Cause and Cure]
Pensées sans langage, par Francis Picabia 56892
[Language: French]
Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss 56891
[Subtitle: A picture of Judaism, in the century
which preceded the advent of our Savior]
Fra Tommaso Campanella, Vol. 1, di Luigi Amabile 56890
[Subtitle: la sua congiura, i suoi processi e la sua pazzia]
[Language: Italian]
The Blue Star, by Fletcher Pratt 56889
Importanza e risultati degli incrociamenti in avicoltura, 56888
di Teodoro Pascal
[Language: Italian]
And this is what I tried:
def search_by_etext():
    fhand = open('GUTINDEX.ALL')
    print("Search by ETEXT:")
    for line in fhand:
        if not line.startswith(" [") and not line.startswith("~"):
            if not line.startswith(" ") and not line.startswith("TITLE"):
                words = line.rstrip()
                words = line.lstrip()
                words = words[-7:]
                print(words)

search_by_etext()
Well, the code mostly works. However, for some lines it gives me part of a title or an author name instead, such as an output containing 'decott' (part of the author name 'Caldecott'), which shouldn't be there.
For this:
The Bashful Earthquake, by Oliver Herford 56765
[Subtitle: and Other Fables and Verses]
The House of Orchids and Other Poems, by George Sterling 56764
North Italian Folk, by Alice Vansittart Strettel Carr 56763
and Randolph Caldecott
[Subtitle: Sketches of Town and Country Life]
Wild Life in New Zealand. Part 1, Mammalia, by George M. Thomson 56762
[Subtitle: New Zealand Board of Science and Art, Manual No. 2]
Universal Brotherhood, Volume 13, No. 10, January 1899, by Various 56761
De drie steden: Lourdes, door Émile Zola 56760
[Language: Dutch]
Another example of a stray value in the output, for this input:
Rhandensche Jongens, door Jan Lens 56702
[Illustrator: Tjeerd Bottema]
[Language: Dutch]
The Story of The Woman's Party, by Inez Haynes Irwin 56701
Mormon Doctrine Plain and Simple, by Charles W. Penrose 56700
[Subtitle: Or Leaves from the Tree of Life]
The Stone Axe of Burkamukk, by Mary Grant Bruce 56699
[Illustrator: J. Macfarlane]
The Latter-Day Prophet, by George Q. Cannon 56698
[Subtitle: History of Joseph Smith Written for Young People]
Here, "Life]" shouldn't be in the output. Lines starting with a blank space are supposed to have been filtered out with this:
if not line.startswith(" [") and not line.startswith("~"):
But I am still getting these stray values in my output.
Simple solution: regexps to the rescue!
import re

with open("etext.txt") as f:
    for line in f:
        match = re.search(r" (\d+)$", line.strip())
        if match:
            print(match.group(1))
The regular expression " (\d+)$" will match "a space followed by one or more digits at the end of the string", and capture only the "one or more digits" group.
You can eventually improve the regexp; e.g. if you know all etext codes are exactly 5 digits long, you can change it to " (\d{5})$".
This works with the example text you posted. If it doesn't work properly on your own file, then we'd need to see enough of the real data to find out what it really contains.
It could be that those extra lines that are not being filtered out start with whitespace other than a " " char, like a tab for example. As a minimal change that might work, try filtering out lines that start with any whitespace rather than specifically a space char?
To check for whitespace in general rather than a space char, you'll need to use regular expressions. Try if not re.match(r'^\s', line) and ...
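Putting the two suggestions together, a sketch that skips any line beginning with whitespace (space, tab, etc.) and then pulls the trailing number (the sample lines are taken from the listing above; the real file would be read line by line instead):

```python
import re

lines = [
    "The Blue Star, by Fletcher Pratt                          56889",
    "\t[Language: Italian]",          # starts with a tab, not a space
    " [Subtitle: Being a Life ...]",  # starts with a space
]

etexts = []
for line in lines:
    # Skip any line that begins with whitespace of any kind.
    if re.match(r'\s', line):
        continue
    # \s(\d+)$ is a slight variant of the regex above: any whitespace
    # character before the digits, at the end of the line.
    m = re.search(r'\s(\d+)$', line.rstrip())
    if m:
        etexts.append(m.group(1))

print(etexts)  # ['56889']
```

Both continuation lines are filtered out regardless of which whitespace character they start with, which is exactly the failure mode suspected above.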
Disclaimer: I read this thread very carefully:
Street Address search in a string - Python or Ruby
and many other resources.
Nothing works for me so far.
In more detail, here is what I am looking for:
The rules are relaxed and I am definitely not asking for perfect code that covers all cases; just a few simple, basic ones, with the assumption that the address is in the format:
a) Street number (1...N digits);
b) Street name : one or more words capitalized;
b-2) (optional) would be best if it could be prefixed with abbrev. "S.", "N.", "E.", "W."
c) (optional) unit/apartment/etc can be any (incl. empty) number of arbitrary characters
d) Street "type": one of ("st.", "ave.", "way");
e) City name : 1 or more Capitalized words;
f) (optional) state abbreviation (2 letters)
g) (optional) zip which is any 5 digits.
None of the above needs to be a valid thing (e.g. an existing city or zip).
I am trying expressions like these so far:
pat = re.compile(r'\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?', re.IGNORECASE)
>>> pat.search("123 East Virginia avenue, unit 123, San Ramondo, CA, 94444")
It doesn't work, and it's not easy for me to understand why. Specifically: how do I separate, in my pattern, a group of arbitrary words from a group of specific words that should follow it, like a state abbreviation or a street "type" ("st.", "ave.")?
Anyhow: here is an example of what I am hoping to get:
Given
def ex_addr(text):
    # does the re magic
    # returns 1st address (all addresses?) or None if nothing found
for t in [
'The meeting will be held at 22 West Westin st., South Carolina, 12345 on Nov.-18',
'The meeting will be held at 22 West Westin street, SC, 12345 on Nov.-18',
'Hi there,\n How about meeting tomorr. #10am-sh in Chadds # 123 S. Vancouver ave. in Ottawa? \nThanks!!!',
'Hi there,\n How about meeting tomorr. #10am-sh in Chadds # 123 S. Vancouver avenue in Ottawa? \nThanks!!!',
'This was written in 1999 in Montreal',
"Cool cafe at 420 Funny Lane, Cupertino CA is way too cool",
"We're at a party at 12321 Mammoth Lane, Lexington MA 77777; Come have a beer!"
]:
    print(ex_addr(t))
I would like to get:
'22 West Westin st., South Carolina, 12345'
'22 West Westin street, SC, 12345'
'123 S. Vancouver ave. in Ottawa'
'123 S. Vancouver avenue in Ottawa'
None # for 'This was written in 1999 in Montreal',
"420 Funny Lane, Cupertino CA",
"12321 Mammoth Lane, Lexington MA 77777"
Could you please help?
I just ran across this on GitHub, as I was having a similar problem. It appears to work and to be more robust than your current solution.
https://github.com/madisonmay/CommonRegex
Looking at the code, the regex for street address accounts for many more scenarios. '\d{1,4} [\w\s]{1,20}(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr|court|ct|parkway|pkwy|circle|cir|boulevard|blvd)\W?(?=\s|$)'
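To see the quoted CommonRegex pattern in action, here is a small sketch compiling it verbatim and running it over a made-up sentence (the input text is purely illustrative):

```python
import re

# The street-address regex quoted above from CommonRegex, unchanged.
street = re.compile(
    r'\d{1,4} [\w\s]{1,20}(?:street|st|avenue|ave|road|rd|highway|hwy|'
    r'square|sq|trail|trl|drive|dr|court|ct|parkway|pkwy|circle|cir|'
    r'boulevard|blvd)\W?(?=\s|$)',
    re.IGNORECASE,
)

hits = street.findall("Checkout 101 Main St. or 123 Vancouver Ave for details")
print(hits)  # ['101 Main St.', '123 Vancouver Ave']
```

Note it matches the street line only (number, name, type); it does not capture the city, state, or zip the question also asks about, so it would need extending for that.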
\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?
In this regex, you have one too many spaces (before ( \w+){1,5}, which already begins with one). Removing it, it matches your example.
I don't think you can assume that a "unit 123" or similar will be there, or there might be several such groups (e.g. "building A, apt 3"). Note that in your initial regex, the .* might match commas, which could lead to very long (and unwanted) matches.
You should probably accept several such groups with a limit on their number, e.g. replace ", (.*)" with something like "(, [^,]{1,20}){0,5}".
In any case, you will probably never get something 100% accurate that will accept any variation people might throw at them. Do lots of tests! Good luck.
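For what it's worth, here is a sketch of an ex_addr along the lines described in the question. It is deliberately loose and incomplete: the street-type list is a short illustrative sample, the city/name matching is a naive "one or more capitalized words", and the state and zip parts are optional, so it handles only some of the example inputs:

```python
import re

ADDR = re.compile(
    r'\d{1,5}'                                  # a) street number
    r'(?: [NSEW]\.)?'                           # b-2) optional direction prefix
    r'(?: [A-Z][a-z]+)+'                        # b) street name word(s)
    r' (?:Lane|Way|St\.|Ave\.|Street|Avenue)'   # d) street type (short sample list)
    r', [A-Z][a-z]+(?: [A-Z][a-z]+)*'           # e) city: capitalized word(s)
    r'(?: [A-Z]{2})?'                           # f) optional state abbreviation
    r'(?: \d{5})?'                              # g) optional zip
)

def ex_addr(text):
    # Returns the first address found, or None.
    m = ADDR.search(text)
    return m.group(0) if m else None

print(ex_addr("Cool cafe at 420 Funny Lane, Cupertino CA is way too cool"))
print(ex_addr("This was written in 1999 in Montreal"))
```

The key trick for the question asked above (separating "any capitalized words" from "one of these specific words") is simply to put the open-ended word group first and follow it with a literal alternation; backtracking then hands the last word over to the alternation when needed.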
Recently I got my hands on a research project that would greatly benefit from learning how to parse a string of biographical data on several individuals into a set of dictionaries for each individual.
The string contains break words, and I was hoping to create keys off of those break words and separate the dictionaries by line breaks. Here are two people I want to create two different dictionaries for within my data:
Bankers = [ ' Bakstansky, Peter; Senior Vice President, Federal
Reserve Bank of New York, in charge of public information
since 1976, when he joined the NY Fed as Vice President. Senior
Officer in charge of die Office of Regional and Community Affairs,
Ombudsman for the Bank and Senior Administrative Officer for Executive
Group, m zero children Educ City College of New York (Bachelor of
Business Administration, 1961); University of Illinois, Graduate
School, and New York University, Graduate School of Business. 1962-6:
Business and financial writer, New York, on American Banker, New
York-World Telegram & Sun, Neia York Herald Tribune (banking editor
1964-6). 1966-74: Chase Manhattan Bank: Manager of Public Relations,
based in Paris, 1966-71; Manager of Chase's European Marketing and
Planning, based in Brussels, 1971-2; Vice President and Director of
Public Relations, 1972-4.1974-76: Bache & Co., Vice President and
Director of Corporate Communications. Barron, Patrick K.; First Vice
President and < Operating Officer of the Federal Reserve Bank o
Atlanta since February 1996. Member of the Fed" Reserve Systems
Conference of first Vice Preside Vice chairman of the bank's
management Con and of the Discount Committee, m three child Educ
University of Miami (Bachelor's degree in Management); Harvard
Business School (Prog Management Development); Stonier Graduate Sr of
Banking, Rutgers University. 1967: Joined Fed Reserve Bank of Atlanta
in computer operations 1971: transferred to Miami Branch; 1974:
Assist: President; 1987: Senior Vice President.1988: re1- Atlanta as
Head of Corporate Services. Member Executive Committee of the Georgia
Council on Igmic Education; former vice diairman of Greater
ji§?Charnber of Commerce and the President'sof the University of
Miami; in Atlanta, former ||Mte vice chairman for the United Way of
Atlanta feiSinber of Leadership Atlanta. Member of the Council on
Economic Education. Interest. ' ]
So for example, in this data I have two people, Peter Bakstansky and Patrick K. Barron. I want to create a dictionary for each individual with these 4 keys: bankerjobs, number of children, education, and nonbankerjobs.
In this text there are already break words: "m" introduces the number of children and "Educ" introduces the education; anything before "m" belongs to bankerjobs, and anything after the first "." following the Educ section belongs to nonbankerjobs. The keyword to break between individuals seems to be a "." followed by more than one space.
How can I create a dictionary for each of these two individuals with these 4 keys using regular expressions on these break words?
specifically, what set of regex could help me create a dictionary for these two individuals with these 4 keys (built on the above specified break words)?
A pattern I am thinking of would be something like this, in Perl style:
pattern = [r'(m/[ '(.*);(.*)m(.*)Educ(.*)/)']
but I'm not sure.
I'm thinking the code would be similar to this, but please correct it if I'm wrong:
my_banker_parser = re.compile(r'somefancyregex')
def nested_dict_from_text(text):
    m = re.search(my_banker_parser, text)
    if not m:
        raise ValueError
    d = m.groupdict()
    return {"centralbanker": d}
result = nested_dict_from_text(bankers)
print(result)
My hope is to take this code and run it over the rest of the biographies for all of the individuals of interest.
Using named groups will probably be less brittle, since it doesn't depend on the pieces of data being in the same order in each biography. Something like this should work:
>>> import re
>>> regex = re.compile(r'(?P<foo>foo)|(?P<bar>bar)|(?P<baz>baz)')
>>> data = {}
>>> for match in regex.finditer('bar baz foo something'):
... data.update((k, v) for k, v in match.groupdict().items() if v is not None)
...
>>> data
{'baz': 'baz', 'foo': 'foo', 'bar': 'bar'}
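Applied to the break words described in the question, one possible shape for "somefancyregex" uses lazy named groups between the " m " and "Educ" markers. This is only a sketch on a heavily shortened, cleaned-up biography (the real OCR text above would need more tolerance for noise):

```python
import re

# Shortened, cleaned-up sample of one biography (illustrative only).
bio = ("Bakstansky, Peter; Senior Vice President, Federal Reserve Bank "
       "of New York. m zero children Educ City College of New York "
       "(Bachelor of Business Administration, 1961). 1962-6: Business "
       "and financial writer, New York.")

pat = re.compile(
    r'(?P<bankerjobs>.*?)\sm\s'     # everything before the standalone "m"
    r'(?P<children>.*?)\sEduc\s'    # between "m" and "Educ"
    r'(?P<education>.*?\.)\s'       # "Educ" up to the first "."
    r'(?P<nonbankerjobs>.*)'        # everything after that
)

m = pat.search(bio)
d = m.groupdict()
print(d['children'])  # zero children
```

Splitting the full string into per-person chunks first (e.g. on a "." followed by several spaces, as described above) and then running this per chunk would likely be more robust than one giant regex over everything.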