How to extract particular data from a nested data structure in Python

['[{"word":"meaning","phonetics":[{"text":"/ˈmiːnɪŋ/","audio":"https://lex-audio.useremarkable.com/mp3/meaning_gb_1.mp3"}],"meanings":[{"partOfSpeech":"noun","definitions":[{"definition":"What '
'is meant by a word, text, concept, or '
'action.","synonyms":["definition","sense","explanation","denotation","connotation","interpretation","elucidation","explication"],"example":"the '
'meaning of the Hindu word is ‘breakthrough, '
'release’"}]},{"partOfSpeech":"adjective","definitions":[{"definition":"Intended '
'to communicate something that is not directly '
'expressed.","synonyms":["meaningful","significant","pointed","eloquent","expressive","pregnant","speaking","telltale","revealing","suggestive"]}]}]}]']
This is the format.
I want to extract:
"meanings":[{"partOfSpeech":"noun","definitions":[{"definition":"A single distinct meaningful element of speech or writing, used with others (or sometimes alone) to form a sentence and typically shown with a space on either side when written or printed.",
How can I do this in Python?

I believe this is what you want:
import json
data = ['[{"word":"meaning","phonetics":[{"text":"/ˈmiːnɪŋ/","audio":"https://lex-audio.useremarkable.com/mp3/meaning_gb_1.mp3"}],"meanings":[{"partOfSpeech":"noun","definitions":[{"definition":"What ' 'is meant by a word, text, concept, or ' 'action.","synonyms":["definition","sense","explanation","denotation","connotation","interpretation","elucidation","explication"],"example":"the ' 'meaning of the Hindu word is ‘breakthrough, ' 'release’"}]},{"partOfSpeech":"adjective","definitions":[{"definition":"Intended ' 'to communicate something that is not directly ' 'expressed.","synonyms":["meaningful","significant","pointed","eloquent","expressive","pregnant","speaking","telltale","revealing","suggestive"]}]}]}]']
# data holds a single JSON string; parse it, then index into the result
json_data = json.loads(data[0])
meanings = json_data[0]['meanings']
print(meanings)
# [{'partOfSpeech': 'noun', 'definitions': [{'definition': 'What is meant by a word, text, concept, or action.', 'synonyms': ['definition', 'sense', 'explanation', 'denotation', 'connotation', 'interpretation', 'elucidation', 'explication'], 'example': 'the meaning of the Hindu word is ‘breakthrough, release’'}]}, {'partOfSpeech': 'adjective', 'definitions': [{'definition': 'Intended to communicate something that is not directly expressed.', 'synonyms': ['meaningful', 'significant', 'pointed', 'eloquent', 'expressive', 'pregnant', 'speaking', 'telltale', 'revealing', 'suggestive']}]}]
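If you need to drill further, keep indexing into the parsed structure; for example, the text of the first noun definition (a short sketch based on the sample data above):

# Continue indexing into the parsed structure to reach the first definition:
first_definition = meanings[0]['definitions'][0]['definition']
print(first_definition)
# What is meant by a word, text, concept, or action.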

Related

Extract text with multiple regex patterns in Python

I have a list with address information.
The placement of words in the list can be random.
address = [' South region', ' district KTS', ' 4', ' app. 106', ' ent. 1', ' st. 15']
I want to extract each item of the list into a new string.
r = re.compile(".region")
region = list(filter(r.match, address))
It works, but there is more than one pattern for "region". For example, there can be "South reg." or "South r-n".
How can I combine multiple patterns?
Also, the digit 4 in the list means the building number. It can be only digits, or something like 4k1.
How can I extract the building number?
Hopefully I understood the requirement correctly.
For extracting the region, I chose to get it by the first word, but if you can be sure of the regions which are accepted, it would be better to construct the regex based on the valid values, not the first word.
Also, for the building extraction, I am not sure which characters you want to keep versus the ones you may want to remove. In this case I chose to keep only alphanumerics, meaning that everything else is stripped.
CODE
import re

list1 = [' South region', ' district KTS', ' -4k-1.', ' app. 106', ' ent. 1', ' st. 15']

def GetFirstWord(list2, column):
    # Return the first word of the stripped string at the given index.
    return re.search(r'\w+', list2[column].strip()).group()

def KeepAlpha(list2, column):
    # Keep only alphanumerics and spaces; everything else is stripped.
    return re.sub(r'[^A-Za-z0-9 ]+', '', list2[column].strip())

print(GetFirstWord(list1, 0))
print(KeepAlpha(list1, 2))
OUTPUT
South
4k1
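To combine several region spellings directly, alternation in a single compiled regex is one option. A minimal sketch, assuming the three variants mentioned in the question ("region", "reg.", "r-n") are the accepted ones:

import re

# Match any accepted spelling of "region" at the end of the string.
region_re = re.compile(r'(?:region|reg\.|r-n)\s*$')
address = [' South region', ' district KTS', ' 4', ' app. 106', ' ent. 1', ' st. 15']
region = [s for s in address if region_re.search(s)]
print(region)  # [' South region']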

Split a string in python with side by side delimiters

The Question:
Given a list of strings, create a function that returns the same list but split along any of the following delimiters ['&', 'OR', 'AND', 'AND/OR', 'IFT'] into a list of lists of strings.
Note the delimiters can be mixed inside a string, there can be many adjacent delimiters, and the list is a column from a dataframe.
EX//
function(["Mary & had a little AND lamb", "Twinkle twinkle ITF little OR star"])
>> [['Mary', 'had a little', 'lamb'], ['Twinkle twinkle', 'little', 'star']]
function(["Mary & AND had a little OR IFT lamb", "Twinkle twinkle AND & ITF little OR & star"])
>> [['Mary', 'had a little', 'lamb'], ['Twinkle twinkle', 'little', 'star']]
My Solution Attempt
Start by replacing any kind of delimiter with a &. I include spaces on either side so that other words like HANDY don't get affected. Next, split each string along the & delimiter, knowing that every other kind of delimiter has been replaced.
def clean_and_split(lolon):
    # Constants
    banned_list = {' AND ', ' OR ', ' ITF ', ' AND/OR '}
    # Loop through each list of strings
    for i in range(len(lolon)):
        # Loop through each delimiter and replace it with ' & '
        for word in banned_list:
            lolon[i] = lolon[i].replace(word, ' & ')
        # Split the string along the ' & ' delimiter
        lolon[i] = lolon[i].split('&')
    return lolon
The problem is that side-by-side delimiters often get replaced in a way that leaves an empty string in the middle. Also, certain combinations of delimiters don't get removed. This is because when the replace method reads ' OR OR OR ', it replaces the first ' OR ' (since it matches) but won't replace the second, because it reads it as 'OR '.
EX//
clean_and_split(["Mario AND Luigi AND & Peach"]) >> ['Mario ', ' Luigi ', ' ', ' Peach'])
clean_and_split(["Mario OR OR OR Luigi", "Testing AND AND PlsWork "])
>> ['Mario ',' OR ', ' Luigi '], ['Testing', 'AND PlsWork]]
The workaround to resolve this is to make banned_list = {' AND ', ' OR ', ' ITF ', ' AND/OR ', ' AND ', ' OR ', ' ITF ', ' AND/OR '}, forcing the code to loop through everything twice.
Alternate Solution?
Split the column along a list of delimiters. The problem with this is that back-to-back delimiters don't get caught:
df['Correct_Column'].str.split('(?: AND | IFT | OR | & )')
EX//
function(["Mary & AND had a little OR IFT lamb", "Twinkle twinkle AND & ITF little OR & star"])
>> [['Mary', 'AND had a little', 'IFT lamb'], ['Twinkle twinkle', '& little', '& star']]
There HAS to be a more elegant way!
This is where a lookahead and lookbehind are useful, as they won't eat up the spaces you use to match correctly:
import re
text = 'Mary & had a little AND OR lamb, white as ITF snow OR'
replaced = re.sub(r'(?<=\s)&|OR|AND|ITF|AND/OR(?=\s)', '&', text)
parts = [stripped for s in replaced.split('&') if (stripped := s.strip())]  # walrus requires Python 3.8+
print(parts)
Result:
['Mary', 'had a little', 'lamb, white as', 'snow']
However, note that:
the parts = line may solve most of your problems anyway, using your own method;
a lookbehind or lookahead requires a fixed-width pattern in Python, so something like (?<=\s|^) won't work, i.e. the OR at the end causes an empty string to be found at the end;
the lookahead/lookbehind correctly deals with 'AND OR', but still finds an empty string in between, which is removed on the parts = line;
the walrus operator is in the parts = line as a simple way to filter out empty strings; stripped := s.strip() is not truthy if the result is an empty string, so stripped will only show up in the list if it is not an empty string.
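A single-pass alternative is to split on a whole run of delimiters at once, so adjacent delimiters never leave empty strings behind. A minimal sketch, assuming the delimiters are always space-separated; both IFT and ITF are included because the question spells the delimiter both ways:

import re

# One or more space-separated delimiters, consumed as a single split point.
SPLITTER = re.compile(r'(?:\s+(?:&|AND/OR|AND|OR|ITF|IFT))+(?:\s+|$)')

def clean_and_split(strings):
    # Filter out the empty string a trailing delimiter leaves behind.
    return [[part for part in SPLITTER.split(s) if part] for s in strings]

print(clean_and_split(["Mary & AND had a little OR IFT lamb",
                       "Twinkle twinkle AND & ITF little OR & star"]))
# [['Mary', 'had a little', 'lamb'], ['Twinkle twinkle', 'little', 'star']]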

Regex - Splitting Strings at full-stops unless it's part of an honorific [closed]

I have a list containing all possible titles:
['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Rev.', 'Capt.', 'Lt.-Col.', 'Col.', 'Lt.-Cmdr.', 'The Hon.', 'Cmdr.', 'Flt. Lt.', 'Brgdr.', 'Wng. Cmdr.', 'Group Capt.' ,'Rt.', 'Maj.-Gen.', 'Rear Admrl.', 'Esq.', 'Mx', 'Adv', 'Jr.']
I need Python 2.7 code that can replace all full stops \. with newlines \n, unless the full stop is part of one of the above titles.
Splitting it into a list of strings would be fine as well.
Sample Input:
Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India. The bill is set to pass.
Sample Output:
Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India.
The bill is set to pass.
This should do the trick. Here we use a list comprehension with a conditional statement to concatenate a \n to each word that contains a full stop and is not in the list of keywords; otherwise we just concatenate a space.
Finally the words in the sentence are joined using join(), and we use rstrip() to eliminate any newline remaining at the end of the string.
l = set(['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Rev.', 'Capt.', 'Lt.-Col.',
         'Col.', 'Lt.-Cmdr.', 'The Hon.', 'Cmdr.', 'Flt. Lt.', 'Brgdr.', 'Wng. Cmdr.',
         'Group Capt.', 'Rt.', 'Maj.-Gen.', 'Rear Admrl.', 'Esq.', 'Mx', 'Adv', 'Jr.'])

s = ('Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road '
     'map for introduction of GST in India. The bill is set to pass.')

def split_at_period(input_str, keywords):
    final = []
    split_l = input_str.split(' ')
    for word in split_l:
        # A period in a non-title word ends a sentence: append a newline.
        if '.' in word and word not in keywords:
            final.append(word + '\n')
            continue
        # Otherwise restore the space that split() removed.
        final.append(word + ' ')
    return ''.join(final).rstrip()

print split_at_period(s, l)
or a one-liner :D
print ''.join([w + '\n' if '.' in w and w not in l else w + ' ' for w in s.split(' ')]).rstrip()
Sample output:
Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India.
The bill is set to pass.
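If you're on Python 3 rather than 2.7, only the print syntax changes; the same one-liner becomes:

print(''.join([w + '\n' if '.' in w and w not in l else w + ' ' for w in s.split(' ')]).rstrip())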
How it works
Firstly we split up our string with a space ' ' delimiter using the split() string function, thus returning the following list:
>>> ['Modi', 'is', 'waiting', 'in', 'line', 'to', 'Thank', 'Dr.',
'Manmohan', 'Singh', 'for', 'preparing', 'a', 'road', 'map', 'for',
'introduction', 'of', 'GST', 'in', 'India.', 'The', 'bill', 'is',
'set', 'to', 'pass.']
We then start to build up a new list by iterating through the split-up list. If we see a word that contains a period but is not a keyword (e.g. India. and pass. in this case), then we have to concatenate a newline \n to the word to begin the new sentence. We can then append() it to our final list and continue to the next iteration.
If the word does not end a sentence with a period, we just concatenate a space to rebuild the original string.
This is what final looks like before it is built as a string using join().
>>> ['Modi ', 'is ', 'waiting ', 'in ', 'line ', 'to ', 'Thank ', 'Dr. ',
'Manmohan ', 'Singh ', 'for ', 'preparing ', 'a ', 'road ', 'map ',
'for ', 'introduction ', 'of ', 'GST ', 'in ', 'India.\n', 'The ', 'bill ',
'is ', 'set ', 'to ', 'pass.\n']
Excellent, we have spaces and newlines where they need to be! Now we can rebuild the string. Notice, however, that the last element in the list also happens to contain a \n; we can clean that up by calling rstrip() on our new string.
The initial solution did not support spaces in the keywords; I've included a new, more robust solution below:
import re

def format_string(input_string, keywords):
    # Escape the keywords so their dots are matched literally.
    regexes = '|'.join(re.escape(k) for k in keywords)
    split_list = re.split(regexes, input_string)  # Split on keys.
    removed = re.findall(regexes, input_string)   # Find removed keys.
    newly_joined = split_list + removed           # Interleave removed and split.
    newly_joined[::2] = split_list
    newly_joined[1::2] = removed
    space_regex = r'\.\s*'
    for index, section in enumerate(newly_joined):
        if '.' in section and section not in removed:
            newly_joined[index] = re.sub(space_regex, '.\n', section)
    return ''.join(newly_joined).strip()
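For instance, reusing the sample sentence s and the title set l from the first snippet (a quick sketch of the expected output):

print(format_string(s, l))
# Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India.
# The bill is set to pass.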
convert all titles (and the sole dot) into a regular expression
use a replacement callback
code:
import re
l = "|".join(map(re.escape,['.','Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Rev.', 'Capt.', 'Lt.-Col.', 'Col.', 'Lt.-Cmdr.', 'The Hon.', 'Cmdr.', 'Flt. Lt.', 'Brgdr.', 'Wng. Cmdr.', 'Group Capt.' ,'Rt.', 'Maj.-Gen.', 'Rear Admrl.', 'Esq.', 'Mx', 'Adv', 'Jr.']))
e="Dear Mr. Foo, I would like to thank you. Because Lt.-Col. Collins told me blah blah. Bye."
def do_repl(m):
    s = m.group(1)
    if s == ".":
        rval = ".\n"   # a bare full stop ends a sentence
    else:
        rval = s       # a title keeps its dot
    return rval

z = re.sub("(" + l + ")", do_repl, e)
# bonus: strip leading blanks, even though that's not part of the question
z = re.sub(r"\s*\n\s*", "\n", z)
print(z)
output:
Dear Mr. Foo, I would like to thank you.
Because Lt.-Col. Collins told me blah blah.
Bye.

Why isn't this regex parsing the whole string?

Writing a simple script to parse a large text file into words, their parent sentences, and some metadata (whether they are within a quote, etc.). I'm trying to get the regex to function properly and running into a strange issue. Here's a small bit of test code showing what's going on with my parsing. The white space is intentional, but I can't understand why the last 'word' is not parsing. It is not preceded by any problematic characters (at least as far as I can tell using repr), and when I run parse() on just the problem 'word' it returns the expected array of single words and spaces.
Code:
def parse(new_line):
    new_line = new_line.rstrip()
    word_array = re.split('([\.\?\!\ ])', new_line, re.M)
    print(word_array)

x = full_text.readline()
print(repr(x))
parse(x)
Output:
'Far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy\n'
['Far', ' ', 'out', ' ', 'in', ' ', 'the', ' ', 'uncharted', ' ', 'backwaters', ' ', 'of', ' ', 'the', ' ', 'unfashionable end of the western spiral arm of the Galaxy']
re.M is 8, and you're passing that as the maxsplit positional argument. You want flags=re.M instead.
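A minimal sketch of the corrected call, with new_line as in the question's parse(); the flag is passed by keyword so it is no longer mistaken for maxsplit:

import re

word_array = re.split(r'([\.\?\!\ ])', new_line, flags=re.M)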

Python: getting a certain no. of strings from a dictionary

I have a dictionary in the following format. I split the different elements (where a comma (,) occurred) using a split function, and am now trying to extract the names from the list. I am trying to use regular expressions but am miserably failing, being new to Python. The names are in the following formats:
firstname(space)last name
name(space)name(space)name
x.name
x.y.name
name(space) x.(space)(name)
where x and y represent a name initial, like J. for John, etc.
Also, if you can guide me in removing the "\t" while keeping the other information intact, that would be great.
Any sort of help would be more than welcome. Thank you all.
[[' I. Antonov', ' I. Antonova', ' E. R. Kandel', ' and R. D. Hawkins. Activity-dependent presynaptic facilitation and hebbian ltp are both required and interact during classical conditioning in aplysia. Neuron', ' 37(1):135--47', ' Jan 2003.'], ['\tSander M. Bohte ', ' Joost N. Kok', ' Applications of spiking neural networks', ' Information Processing Letters', ' v.95 n.6', ' p.519-520'], [' L. J. Eshelman. The CHC Adaptive Search Algorithm: How to Have Safe Search When Engaging in Nontraditional Genetic Recombination. Foundations Of Genetic Algorithms', ' pages 265-283', ' 1990.'], ['Wulfram Gerstner ', ' Werner Kistler', ' Spiking Neuron Models: An Introduction', ' Cambridge University Press', ''], [' D. O. Hebb. Organization of behavior. New York: Wiley', ' 1949.'], [' D. Z. Jin. Spiking neural network for recognizing spatiotemporal sequences of spikes. Physical Review E', '69', ' 2004.'], ['Wolfgang Maass ', ' Christopher M. Bishop', ' Pulsed Neural Networks', ' MIT Press', ' '], ['Wolfgang Maass ', ' Henry Markram', ' Synapses as dynamic memory buffers', ' Neural Networks', ' v.15 n.2', ' p.'], [' H. Markram', ' Y. Wang', ' and M. Tsodyks. Differential signaling via the same axon of neocortical pyramidal neurons. Neurobiology', ' 95:5323--5328', ' April 1998.'], ['\t\tD. E. Rumelhart ', ' G. E. Hinton ', ' R. J. Williams', ' Learning internal representations by error propagation', ' Parallel distributed processing: explorations in the microstructure of cognition', ' vol. 1: foundations', ' MIT Press', ' Cambridge', ' MA', ' 1986 </a> \t\t\t\t\t\t\t\t\t'], ['\t J. D. Schaffer', ' L. D. Whitley', ' and L. J. Eshelman. Combinations of genetic algorithms and neural networks: A survey of the state of the art. In Combinations of Genetic Algorithms and NeuralNetworks', ' 1992.', ' COGANN-92. International Workshop on', ' pages 1--37', ' Philips Labs.', ' Briarcliff Manor', ' NY', ' 6 Jun 1992.'], ['\t S. Song', ' K. D. Miller', ' and L. F. Abbott. Competitive hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience', ' 3(9):919--926', ' 2000.'], ['\t L. Watts. Event-driven simulation of networks of spiking neurons. Advances in Neural Information Processing Systems', ' 6:927--934', ' 1994.']]
It looks like you're going to have to tailor this pretty heavily to your input. Because there are so many different words and constructs in the text you're parsing, you're probably not going to get 100% accuracy with the rules you create. Here's an example, though, assuming your original input text is called input_text (and I don't think using the split() method is really all that useful, because the commas don't just delimit names):
import re

regexes = (r'[A-Z][a-z]+ [A-Z][a-z]+',  # capitalized first and last name
           r'[A-Z]\. [A-Z][a-z]+')      # capitalized initial, then last name

names = []
for regex in regexes:
    names += re.findall(regex, input_text)
You'd obviously want to write additional specific regexes for your various name types. This does a good job of finding names, but also comes up with a lot of false positives (Information Processing looks a lot like a name based on these rules). This should give you a starting point though.
To remove the tabs (and other whitespace at the beginning or end of the strings):
stripped = [[s.strip() for s in sublist] for sublist in mylist]  # mylist is the nested list above
To be honest, if you are trying to extract names, splitting lines like that will not help -- notice how some names are still grouped together with titles. It would be better to build a good regex that will match names, and use re.findall on individual lines.
To remove tabs and extra spaces, use strip():
>>> "\t foobar \t\t\t".strip()
'foobar'
It may also be that it's easier to find some online source of information where this job has already been done, for example at places like this or this.
strip all the strings
identify the strings that are surely not names (very long ones, ones that include numbers, and the ones after these in the list)
identify the strings that are surely names (short strings at the beginning of the list, strings starting with the pattern ^[A-Z][a-z]{0,3}\.?\s (Dr., Miss, Mr, Prof, etc.))
study the remaining strings that you can't match with these rules, and try to make fuzzy rules to choose, by building a coefficient of certitude: the closer a string is to the beginning of the list and the shorter it is, the higher its score compared to something long at the end. Add criteria like that and set a minimum score.
If you need high accuracy, look for a names database and Bayesian filters.
It won't be perfect: it's very hard to tell the difference between 'name name name' and 'word word word'.
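A minimal sketch combining the advice above: strip everything, then run a couple of name-shaped regexes over each chunk. Here mylist stands for the nested list from the question, and the patterns are illustrative, not exhaustive:

import re

name_res = [
    re.compile(r'(?:[A-Z]\. )+[A-Z][a-z]+'),              # 'I. Antonov', 'E. R. Kandel'
    re.compile(r'[A-Z][a-z]+ (?:[A-Z]\. )?[A-Z][a-z]+'),  # 'Wulfram Gerstner', 'Sander M. Bohte'
]

names = []
for entry in mylist:
    for chunk in entry:
        chunk = chunk.strip()  # drops the '\t' and stray spaces
        for rx in name_res:
            names.extend(rx.findall(chunk))

Expect false positives (e.g. journal and publisher names that look like 'Firstname Lastname'); a whitelist or a scoring pass as described above would be needed to filter them.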
