Converting A Value In A Dictionary to List - python

So I have a list of dictionaries. Here are some of the entries in the list that I'm trying to search through:
[{
'title': '"Adult" Pimiento Cheese ',
'categories': [
'Cheese',
'Vegetable',
'No-Cook',
'Vegetarian',
'Quick & Easy',
'Cheddar',
'Hot Pepper',
'Winter',
'Gourmet',
'Alabama',
],
'ingredients': [
'2 or 3 large garlic cloves',
'a 2-ounce jar diced pimientos',
'3 cups coarsely grated sharp Cheddar (preferably English, Canadian, or Vermont; about 12 ounces)'
,
'1/3 to 1/2 cup mayonnaise',
'crackers',
'toasted baguette slices',
"crudit\u00e9s",
],
'directions': ['Force garlic through a garlic press into a large bowl and stir in pimientos with liquid in jar. Add Cheddar and toss mixture to combine well. Stir in mayonnaise to taste and season with freshly ground black pepper. Cheese spread may be made 1 day ahead and chilled, covered. Bring spread to room temperature before serving.'
, 'Serve spread with accompaniments.'],
'rating': 3.125,
}, {
'title': '"Blanketed" Eggplant ',
'categories': [
'Tomato',
'Vegetable',
'Appetizer',
'Side',
'Vegetarian',
'Eggplant',
'Pan-Fry',
'Vegan',
"Bon App\u00e9tit",
],
'ingredients': [
'8 small Japanese eggplants, peeled',
'16 large fresh mint leaves',
'4 large garlic cloves, 2 slivered, 2 flattened',
'2 cups olive oil (for deep frying)',
'2 pounds tomatoes',
'7 tablespoons extra-virgin olive oil',
'1 medium onion, chopped',
'6 fresh basil leaves',
'1 tablespoon dried oregano',
'1 1/2 tablespoons drained capers',
],
'directions': ['Place eggplants on double thickness of paper towels. Salt generously. Let stand 1 hour. Pat dry with paper towels. Cut 2 deep incisions in each eggplant. Using tip of knife, push 1 mint leaf and 1 garlic sliver into each incision.'
,
"Pour 2 cups oil into heavy medium saucepan and heat to 375\u00b0F. Add eggplants in batches and fry until deep golden brown, turning occasionally, about 4 minutes. Transfer eggplants to paper towels and drain."
,
'Blanch tomatoes in pot of boiling water for 20 seconds. Drain. Peel tomatoes. Cut tomatoes in half; squeeze out seeds. Chop tomatoes; set aside.'
,
"Heat 4 tablespoons extra-virgin olive oil in large pot over high heat. Add 2 flattened garlic cloves; saut\u00e9 until light brown, about 3 minutes. Discard garlic. Add onion; saut\u00e9 until translucent, about 5 minutes. Add reduced to 3 cups, stirring occasionally, about 20 minutes."
,
'Mix capers and 3 tablespoons extra-virgin olive oil into sauce. Season with salt and pepper. Reduce heat. Add eggplants. Simmer 5 minutes, spooning sauce over eggplants occasionally. Spoon sauce onto platter. Top with eggplants. Serve warm or at room temperature.'
],
'rating': 3.75,
'calories': 1386.0,
'protein': 9.0,
'fat': 133.0,
}]
I have the following code that searches through the list and creates a list of the recipes that contain all the words in the query argument. matching is the function that finds the matching recipes and returns them in a list of dictionaries. tokenisation is another function that removes all punctuation and digits from the query, makes it lower case, and returns a list of each word found in the query.
For example, the query "cheese!22banana" would be turned into ['cheese', 'banana'].
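The tokenisation function itself isn't shown in the question; a minimal stand-in with the behaviour described (an assumption, not the actual implementation) could look like:
import re

def tokenisation(text):
    # hypothetical stand-in: treat anything that isn't a lowercase letter as a separator
    return [w for w in re.split(r'[^a-z]+', text.lower()) if w]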
def matching(query):
    # split up the input string and have a list to put the recipes in
    token_list = tokenisation(query)
    matching_recipes = []
    # loop through whole file
    for recipe in recipes:
        recipe_tokens = []
        # check each key
        for key in recipe:
            # checking the keys for types
            if type(recipe[key]) != list:
                continue
            # look at the values for each key
            for sentence in recipe[key]:
                # make a big list of tokens from the keys
                recipe_tokens.extend([t for t in tokenisation(sentence)])
        # checking if all the query tokens appear in the recipe, if so append them
        if all([tl in recipe_tokens for tl in token_list]):
            matching_recipes.append(recipe)
    return matching_recipes
The issue I am having is that the first key in the dictionary isn't a list, so the function never checks whether the words appear in the title; it only checks the other keys, adds each of their words to a list, and then I check whether every word in the query is present in that list of words from the recipe. Because the title is skipped (it's a string, not a list), a word that only appears in the title won't cause the recipe to be returned.
How would I add this title check into this code? I've tried turning the title into a list (it currently has type string), but then I get a 'float' is not iterable error and have no clue how to tackle this issue.

To avoid the error, replace
if type(recipe[key]) != list:
with
if type(recipe[key]) == str:
Or better,
if isinstance(value, str):
and handle strings explicitly instead of skipping them. You get the error from trying to iterate over certain values, because some values in the dicts really are of type float, for example the value of the 'rating' key, so those still need to be skipped.
If the tokenisation function returns a list of sentences, this should work:
def matching(query):
    token_list = tokenisation(query)
    matching_recipes = []
    for recipe in recipes:
        recipe_tokens = []
        for value in recipe.values():
            if isinstance(value, str):
                recipe_tokens.append(value)
                continue
            if not isinstance(value, list):
                continue  # skip the float values such as 'rating' and 'calories'
            for sentence in value:
                recipe_tokens.extend(tokenisation(sentence))
        if all([tl in recipe_tokens for tl in token_list]):
            matching_recipes.append(recipe)
    return matching_recipes
If it returns a list of words:
def matching(query):
    token_list = tokenisation(query)
    matching_recipes = []
    for recipe in recipes:
        recipe_tokens = []
        for value in recipe.values():
            if isinstance(value, str):
                value = tokenisation(value)
            elif not isinstance(value, list):
                continue  # skip the float values such as 'rating' and 'calories'
            for sentence in value:
                recipe_tokens.extend(tokenisation(sentence))
        if all([tl in recipe_tokens for tl in token_list]):
            matching_recipes.append(recipe)
    return matching_recipes
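For example, with the sample data above and the second version (tokenisation returning words, as the question describes), a word that only occurs in a title is now matched:
matching('pimiento cheese')  # 'pimiento' appears only in the first recipe's title, so this now returns that recipe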

Related

cast separate lists into one list

I am following this semantic clustering example:
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
embedder = SentenceTransformer('all-MiniLM-L6-v2')
# Corpus with example sentences
corpus = ['A man is eating food.',
'A man is eating a piece of bread.',
'A man is eating pasta.',
'The girl is carrying a baby.',
'The baby is carried by the woman',
'A man is riding a horse.',
'A man is riding a white horse on an enclosed ground.',
'A monkey is playing drums.',
'Someone in a gorilla costume is playing a set of drums.',
'A cheetah is running behind its prey.',
'A cheetah chases prey on across a field.'
]
corpus_embeddings = embedder.encode(corpus)
# Perform kmean clustering
num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
clustered_sentences = [[] for i in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])

for i, cluster in enumerate(clustered_sentences):
    print("Cluster", i+1)
    print(cluster)
    print(len(cluster))
    print("")
Which results in the following lists:
Cluster 1
['The girl is carrying a baby.', 'The baby is carried by the woman']
2
Cluster 2
['A man is riding a horse.', 'A man is riding a white horse on an enclosed ground.']
2
Cluster 3
['A man is eating food.', 'A man is eating a piece of bread.', 'A man is eating pasta.']
3
Cluster 4
['A cheetah is running behind its prey.', 'A cheetah chases prey on across a field.']
2
Cluster 5
['A monkey is playing drums.', 'Someone in a gorilla costume is playing a set of drums.']
2
How can I add these separate lists into one list of lists?
Expected outcome:
list2 = [['The girl is carrying a baby.', 'The baby is carried by the woman'], ..., ['A monkey is playing drums.', 'Someone in a gorilla costume is playing a set of drums.']]
I tried the following:
list2 = []
for i in cluster:
    list2.append(i)
list2
But it returns only the last one:
['A monkey is playing drums.',
'Someone in a gorilla costume is playing a set of drums.']
Any ideas?
Following that example, you don't need to do anything to get a list of lists; that's already been done for you.
Try printing clustered_sentences.
If instead you need to get a "flat" list from the list of lists, you can achieve that with a Python list comprehension:
flat = [item for sub in clustered_sentences for item in sub]
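Equivalently, itertools.chain.from_iterable flattens the list of lists:
from itertools import chain

flat = list(chain.from_iterable(clustered_sentences))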

Python for matching paper IDs in Scholarly

I have a list of the following authors of Google Scholar papers: Zoe Pikramenou, James H. R. Tucker, Alison Rodger, Timothy Dafforn. I want to extract and print the titles of the papers that are present for at least 3 of these authors.
You can get a dictionary of paper info from each author using Scholarly:
from scholarly import scholarly
AuthorList = ['Zoe Pikramenou', 'James H. R. Tucker', 'Alison Rodger', 'Timothy Dafforn']
for Author in AuthorList:
    search_query = scholarly.search_author(Author)
    author = next(search_query).fill()
    print(author)
The output looks something like (just a small excerpt from what you'd get from one author)
{'bib': {'cites': '69',
'title': 'Chalearn looking at people and faces of the world: Face '
'analysis workshop and challenge 2016',
'year': '2016'},
'filled': False,
'id_citations': 'ZhUEBpsAAAAJ:_FxGoFyzp5QC',
'source': 'citations'},
{'bib': {'cites': '21',
'title': 'The NoXi database: multimodal recordings of mediated '
'novice-expert interactions',
'year': '2017'},
'filled': False,
'id_citations': 'ZhUEBpsAAAAJ:0EnyYjriUFMC',
'source': 'citations'},
{'bib': {'cites': '11',
'title': 'Automatic habitat classification using image analysis and '
'random forest',
'year': '2014'},
'filled': False,
'id_citations': 'ZhUEBpsAAAAJ:qjMakFHDy7sC',
'source': 'citations'},
{'bib': {'cites': '10',
'title': 'AutoRoot: open-source software employing a novel image '
'analysis approach to support fully-automated plant '
'phenotyping',
'year': '2017'},
'filled': False,
'id_citations': 'ZhUEBpsAAAAJ:hqOjcs7Dif8C',
'source': 'citations'}
How can I collect the bib and specifically title for papers which are present for three or more out of the four authors?
EDIT: in fact it has been pointed out that id_citations is not unique for each paper, my mistake. It is better to just use the title itself.
Expanding on my comment, you can achieve this using Pandas groupby:
import pandas as pd
from scholarly import scholarly
AuthorList = ['Zoe Pikramenou', 'James H. R. Tucker', 'Alison Rodger', 'Timothy Dafforn']
frames = []
for Author in AuthorList:
    search_query = scholarly.search_author(Author)
    author = next(search_query).fill()
    # creating DataFrame with authors
    df = pd.DataFrame([x.__dict__ for x in author.publications])
    df['author'] = Author
    frames.append(df.copy())
# joining all author DataFrames
df = pd.concat(frames, axis=0)
# taking bib dict into separate columns
df[['title', 'cites', 'year']] = pd.DataFrame(df.bib.to_list())
# counting unique authors attached to each title
n_authors = df.groupby('title').author.nunique()
# locating the unique titles for all publications with n_authors >= 2
output = n_authors[n_authors >= 2].index
This finds 202 papers which have 2 or more of the authors in that list (out of 774 total papers). Here is an example of the output:
Index(['1, 1′-Homodisubstituted ferrocenes containing adenine and thymine nucleobases: synthesis, electrochemistry, and formation of H-bonded arrays',
'722: Iron chelation by biopolymers for an anti-cancer therapy; binding up the'ferrotoxicity'in the colon',
'A Luminescent One-Dimensional Copper (I) Polymer',
'A Unidirectional Energy Transfer Cascade Process in a Ruthenium Junction Self-Assembled by r-and-Cyclodextrins',
'A Zinc(II)-Cyclen Complex Attached to an Anthraquinone Moiety that Acts as a Redox-Active Nucleobase Receptor in Aqueous Solution',
'A ditopic ferrocene receptor for anions and cations that functions as a chromogenic molecular switch',
'A ferrocene nucleic acid oligomer as an organometallic structural mimic of DNA',
'A heterodifunctionalised ferrocene derivative that self-assembles in solution through complementary hydrogen-bonding interactions',
'A locking X-ray window shutter and collimator coupling to comply with the new Health and Safety at Work Act',
'A luminescent europium hairpin for DNA photosensing in the visible, based on trimetallic bis-intercalators',
...
'Up-Conversion Device Based on Quantum Dots With High-Conversion Efficiency Over 6%',
'Vectorial Control of Energy‐Transfer Processes in Metallocyclodextrin Heterometallic Assemblies',
'Verteporfin selectively kills hypoxic glioma cells through iron-binding and increased production of reactive oxygen species',
'Vibrational Absorption from Oxygen-Hydrogen (Oi-H2) Complexes in Hydrogenated CZ Silicon',
'Virginia review of sociology',
'Wildlife use of log landings in the White Mountain National Forest',
'Yttrium 1995',
'ZUSCHRIFTEN-Redox-Switched Control of Binding Strength in Hydrogen-Bonded Metallocene Complexes Stichworter: Carbonsauren. Elektrochemie. Metallocene. Redoxchemie …',
'[2] Rotaxanes comprising a macrocylic Hamilton receptor obtained using active template synthesis: synthesis and guest complexation',
'pH-controlled delivery of luminescent europium coated nanoparticles into platelets'],
dtype='object', name='title', length=202)
Since all of the data is in Pandas, you can also explore which authors are attached to each of the papers, as well as all of the other information you have access to within the author.publications array coming in from scholarly.
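For instance, a minimal sketch (reusing the df and output built above) that shows which of the four authors are attached to each shared title:
# group the qualifying rows by title and collect the attached authors
shared = df[df.title.isin(output)]
print(shared.groupby('title').author.unique())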
First, let's convert this into a more friendly format. You say that the id_citations is unique for each paper, so we'll use it as a hashtable/dict key.
We can then map each id_citation to the bib dict and author(s) it appears for, as a list of tuples (bib, author_name).
author_list = ['Zoe Pikramenou', 'James H. R. Tucker', 'Alison Rodger', 'Timothy Dafforn']
bibs = {}
for author_name in author_list:
    search_query = scholarly.search_author(author_name)
    for bib in search_query:
        bib = bib.fill()
        bibs.setdefault(bib['id_citations'], []).append((bib, author_name))
Thereafter, we can sort the keys in bibs based on how many authors are attached to them:
most_cited = sorted(bibs.items(), key=lambda k: len(k[1]))
# most_cited is now a list of tuples (key, value)
# which maps to (id_citation, [(bib1, author1), (bib2, author2), ...])
and/or filter that list to citations that have three or more appearances:
cited_enough = [tup[1][0][0] for tup in most_cited if len(tup[1]) >= 3]
# using key [0] in the middle is arbitrary. It can be anything in the
# list, provided the bib objects are identical, but index 0 is guaranteed
# to be there.
# otherwise, the first index is to grab the list rather than the id_citation,
# and the last index is to grab the bib, rather than the author_name
and now we can retrieve the titles of the papers from there:
paper_titles = [bib['bib']['title'] for bib in cited_enough]
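A small usage note: paper_titles is now just a list of strings, so it can be printed or written out directly, e.g.:
for title in sorted(paper_titles):
    print(title)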

Filtering SpaCy noun_chunks by pos_tag

As the subject line says, I'm trying to extract elements of noun_chunks based on their individual POS tags. It seems that elements of a noun_chunk do not have access to the global sentence POS tags.
To demonstrate the issue:
[i.pos_ for i in nlp("Great coffee at a place with a great view!").noun_chunks]
>>>
AttributeError: 'spacy.tokens.span.Span' object has no attribute 'pos_'
Here is my inefficient solution:
def parse(text):
    doc = nlp(text.lower())
    tags = [(idx, i.text, i.pos_) for idx, i in enumerate(doc)]
    chunks = [i for i in doc.noun_chunks]
    indices = []
    for c in chunks:
        indices.extend(j for j in range(c.start_char, c.end_char))
    non_chunks = [w for w in ''.join([i for idx, i in enumerate(text) if idx not in indices]).split(' ')
                  if w != '']
    chunk_words = [tup[1] for tup in tags
                   if tup[1] not in non_chunks
                   and tup[2] not in ['DET', 'VERB', 'SYM', 'NUM']]  # these are the POS tags which I wanted to filter out from the beginning!
    new_chunks = []
    for c in chunks:
        new_words = [w for w in str(c).split(' ') if w in chunk_words]
        if len(new_words) > 1:
            new_chunk = ' '.join(new_words)
            new_chunks.append(new_chunk)
    return new_chunks
parse(
"""
I may be biased about Counter Coffee since I live in town, but this is a great place that makes a great cup of coffee. I have been coming here for about 2 years and wish I would have found it sooner. It is located right in the heart of Forest Park and there is a ton of street parking. The coffee here is great....many other words could describe it, but that sums it up perfectly. You can by coffee by the pound, order a hot drink, and they also have food. On the weekend, there are donuts brought in from Do-Rite Donuts which have almost a cult like following. The food is a little on the high end price wise, but totally worth it. I am a self admitted latte snob and they make an amazing latte here. You can add skim, whole, almond or oat milk and they will make it happen. I always order easy foam and they always make it perfectly. My girlfriend loves the Chai Latte with Oat Milk and I will admit it is pretty good. Give them a try.
""")
>>>
['counter coffee',
'great place',
'great cup',
'forest park',
'street parking',
'many other words',
'hot drink',
'almost cult',
'high end price',
'latte snob',
'amazing latte',
'oat milk',
'easy foam',
'chai latte',
'oat milk']
Any quicker approaches to the same solution would be welcomed!
This doesn't work:
[i.pos_ for i in nlp("Great coffee at a place with a great view!").noun_chunks]
because noun_chunks returns Span objects, not Token objects.
You can get to the POS tags within each noun chunk by iterating over the tokens:
import spacy

nlp = spacy.load("en_core_web_md")
for i in nlp("Great coffee at a place with a great view!").noun_chunks:
    print(i, [t.pos_ for t in i])
which will give you
Great coffee ['ADJ', 'NOUN']
a place ['DET', 'NOUN']
a great view ['DET', 'ADJ', 'NOUN']
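From there, a minimal sketch of the filtering described in the question (assuming the same excluded tags and the rule of keeping only chunks with more than one remaining word) could be:
import spacy

nlp = spacy.load("en_core_web_md")  # same model as above
unwanted = {'DET', 'VERB', 'SYM', 'NUM'}

def filtered_chunks(text):
    results = []
    for chunk in nlp(text.lower()).noun_chunks:
        # keep only the tokens in the chunk whose POS tag is not excluded
        words = [t.text for t in chunk if t.pos_ not in unwanted]
        if len(words) > 1:
            results.append(' '.join(words))
    return results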
Original credit to this link:
Phrase extraction
def get_nns(doc):
    nns = []
    for token in doc:
        # Try this with other parts of speech for different subtrees.
        if token.pos_ == 'NOUN':
            pp = ' '.join([tok.orth_ for tok in token.subtree])
            nns.append(pp)
    return nns

import spacy
nlp = spacy.load('en_core_web_sm')
ex = 'I am having a Great coffee at a place with a great view!'
doc = nlp(ex)
print(get_nns(doc))
Output:
['a Great coffee', 'a place with a great view', 'a great view']

How to format a list to rows with certain number of items?

I'm having issues formatting a list into output where each row contains five elements, and I am stuck.
words = ["letter", "good", "course", "land", "car", "tea", "speaker",
         "music", "length", "apple", "cash", "floor", "dance", "rice",
         "bow", "peach", "cook", "hot", "none", "word", "happy", "apple",
         "monitor", "light", "access"]
Desired output:
letter good course land car
tea speaker music length apple
cash floor dance rice bow
peach cook hot none word
happy apple monitor light access
Try this:
>>> for i in range(0, len(words), 5):
...     print(' '.join(words[i:i+5]))
...
letter good course land car
tea speaker music length apple
cash floor dance rice bow
peach cook hot none word
happy apple monitor light access
Using a list comprehension:
num=5
[' '.join(words[i:i+num]) for i in range(0,len(words),num)]
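To print the same layout, join the rows with newlines:
num = 5
print('\n'.join(' '.join(words[i:i+num]) for i in range(0, len(words), num)))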
You can also use chunked, but you might have to install more_itertools first:
from more_itertools import chunked
list(chunked(words, 5))
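Note that list(chunked(words, 5)) gives a list of five-word lists rather than joined rows; to print the layout shown above, join each chunk, for example:
from more_itertools import chunked

for row in chunked(words, 5):
    print(' '.join(row))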

how to sort the target word by value of dictionary and count associated words?

I have two text files, one is sample.txt and the other is common.txt. First I would like to remove the common words from sample.txt; the common words are found in common.txt, and in the code below sample.txt is processed as desired. common.txt is:
a
about
after
again
against
ago
all
along
also
always
an
and
another
any
are
around
as
at
away
back
be
because
been
before
began
being
between
both
but
by
came
can
come
could
course
day
days
did
do
down
each
end
even
ever
every
first
for
four
from
get
give
go
going
good
got
great
had
half
has
have
he
head
her
here
him
his
house
how
hundred
i
if
in
into
is
it
its
just
know
last
left
life
like
little
long
look
made
make
man
many
may
me
men
might
miles
more
most
mr
much
must
my
never
new
next
no
not
nothing
now
of
off
old
on
once
one
only
or
other
our
out
over
own
people
pilot
place
put
right
said
same
saw
say
says
see
seen
she
should
since
so
some
state
still
such
take
tell
than
that
the
their
them
then
there
these
they
thing
think
this
those
thousand
three
through
time
times
to
told
too
took
two
under
up
upon
us
use
used
very
want
was
way
we
well
went
were
what
when
where
which
while
who
will
with
without
work
world
would
year
years
yes
yet
you
young
your
sample.txt is:
THE Mississippi is well worth reading about. It is not a commonplace
river, but on the contrary is in all ways remarkable. Considering the
Missouri its main branch, it is the longest river in the world--four
thousand three hundred miles. It seems safe to say that it is also the
crookedest river in the world, since in one part of its journey it uses
up one thousand three hundred miles to cover the same ground that the
crow would fly over in six hundred and seventy-five. It discharges three
times as much water as the St. Lawrence, twenty-five times as much
as the Rhine, and three hundred and thirty-eight times as much as the
Thames. No other river has so vast a drainage-basin: it draws its water
supply from twenty-eight States and Territories; from Delaware, on the
Atlantic seaboard, and from all the country between that and Idaho on
the Pacific slope--a spread of forty-five degrees of longitude. The
Mississippi receives and carries to the Gulf water from fifty-four
subordinate rivers that are navigable by steamboats, and from some
hundreds that are navigable by flats and keels. The area of its
drainage-basin is as great as the combined areas of England, Wales,
Scotland, Ireland, France, Spain, Portugal, Germany, Austria, Italy,
and Turkey; and almost all this wide region is fertile; the Mississippi
valley, proper, is exceptionally so.
After removing the common words, I need to break the text into sentences, using "." as the full stop, and count the appearances of the target word in the sentences. I also need to create a profile for the target word that shows its associated words and their counts. For example, if "river" is the target word, the associated words include "commonplace", "contrary" and so on, that is, the words that occur in the same sentence (within a full stop) as "river". The desired output, listed in descending order, is:
river 4
ground: 1
journey: 1
longitude: 1
main: 1
world--four: 1
contrary: 1
cover: 1
...
mississippi 3
area: 1
steamboats: 1
germany: 1
reading: 1
france: 1
proper: 1
...
The three dots mean there are more associated words that are not listed here. Here is my code so far:
def open_file(file):
    file = "/Users/apple/Documents/sample.txt"
    file1 = "/Users/apple/Documents/common.txt"
    with open(file1, "r") as f:
        common_words = {i.strip() for i in f}
    punctionmark = ":;,'\"."
    trans_table = str.maketrans(punctionmark, " " * len(punctionmark))
    word_counter = {}
    with open(file, "r") as f:
        for line in f:
            for word in line.translate(trans_table).split():
                if word.lower() not in common_words:
                    word_counter[word.lower()] = word_counter.get(word, 0) + 1
    #print(word_counter)
    print("\n".join("{} {}".format(w, c) for w, c in word_counter.items()))
And my output now is:
mississipi 1
reading 1
about 1
commonplace 1
river 4
.
.
.
So far I have counted the occurrences of the target word, but I am stuck on sorting the target words in descending order and on getting the counts for their associated words. Can anyone provide a solution without importing other modules? Thank you so much.
You can use re.findall to tokenize, filter, and group the text into sentences, and then traverse your structure of target and associated words to find the final counts:
import re, string
from collections import namedtuple
import itertools
stop_words = [i.strip('\n') for i in open('common.txt')]  # the common words
text = open('sample.txt').read()                           # the text to process
grammar = {'punctuation':string.punctuation, 'stopword':stop_words}
token = namedtuple('token', ['name', 'value'])
tokenized_file = [token((lambda x:'word' if not x else x[0])([a for a, b in grammar.items() if i.lower() in b]), i) for i in re.findall('\w+|\!|\-|\.|;|,:', text)]
filtered_file = [i for i in tokenized_file if i.name != 'stopword']
grouped_data = [list(b) for _, b in itertools.groupby(filtered_file, key=lambda x:x.value not in '!.?')]
text_with_sentences = ' '.join([' '.join([c.value for c in grouped_data[i]])+grouped_data[i+1][0].value for i in range(0, len(grouped_data), 2)])
Currently, the result of text_with_sentences is:
'Mississippi worth reading. commonplace river contrary ways remarkable. Considering Missouri main branch longest river - -. seems safe crookedest river part journey uses cover ground crow fly six seventy - five. discharges water St. Lawrence twenty - five Rhine thirty - eight Thames. river vast drainage - basin draws water supply twenty - eight States Territories ; Delaware Atlantic seaboard country Idaho Pacific slope - - spread forty - five degrees longitude. Mississippi receives carries Gulf water fifty - subordinate rivers navigable steamboats hundreds navigable flats keels. area drainage - basin combined areas England Wales Scotland Ireland France Spain Portugal Germany Austria Italy Turkey ; almost wide region fertile ; Mississippi valley proper exceptionally.'
To find the counts for the keyword profiling, you can use collections.Counter:
import collections
counts = collections.Counter(map(str.lower, re.findall('[\w\-]+', text)))
structure = [['river', ['ground', 'journey', 'longitude', 'main', 'world--four', 'contrary', 'cover']], ['mississippi', ['area', 'steamboats', 'germany', 'reading', 'france', 'proper']]]
new_structure = [{'keyword':counts.get(a, 0), 'associated':{i:counts.get(i, 0) for i in b}} for a, b in structure]
Output:
[{'associated': {'cover': 1, 'longitude': 1, 'journey': 1, 'contrary': 1, 'main': 1, 'world--four': 1, 'ground': 1}, 'keyword': 4}, {'associated': {'area': 1, 'france': 1, 'germany': 1, 'proper': 1, 'reading': 1, 'steamboats': 1}, 'keyword': 3}]
Without using any modules, str.split can be used:
words = [[i[:-1], i[-1]] if i[-1] in string.punctuation else [i] for i in text.split()]
new_words = [i for b in words for i in b if i.lower() not in stop_words]
def find_groups(d, _pivot='.'):
    current = []
    for i in d:
        if i == _pivot:
            yield ' '.join(current) + '.'
            current = []
        else:
            current.append(i)

print(list(find_groups(new_words)))
counts = {}
for i in new_words:
    if i.lower() not in counts:
        counts[i.lower()] = 1
    else:
        counts[i.lower()] += 1
structure = [['river', ['ground', 'journey', 'longitude', 'main', 'world--four', 'contrary', 'cover']], ['mississippi', ['area', 'steamboats', 'germany', 'reading', 'france', 'proper']]]
new_structure = [{'keyword':counts.get(a, 0), 'associated':{i:counts.get(i, 0) for i in b}} for a, b in structure]
Output:
['Mississippi worth reading.', 'commonplace river , contrary ways remarkable.', 'Considering Missouri main branch , longest river world--four.', 'seems safe crookedest river , part journey uses cover ground crow fly six seventy-five.', 'discharges water St.', 'Lawrence , twenty-five Rhine , thirty-eight Thames.', 'river vast drainage-basin : draws water supply twenty-eight States Territories ; Delaware , Atlantic seaboard , country Idaho Pacific slope--a spread forty-five degrees longitude.', 'Mississippi receives carries Gulf water fifty-four subordinate rivers navigable steamboats , hundreds navigable flats keels.', 'area drainage-basin combined areas England , Wales , Scotland , Ireland , France , Spain , Portugal , Germany , Austria , Italy , Turkey ; almost wide region fertile ; Mississippi valley , proper , exceptionally.']
[{'associated': {'cover': 1, 'longitude': 1, 'journey': 1, 'contrary': 1, 'main': 1, 'world--four': 1, 'ground': 1}, 'keyword': 4}, {'associated': {'area': 1, 'france': 1, 'germany': 1, 'proper': 1, 'reading': 1, 'steamboats': 1}, 'keyword': 3}]
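To get the descending order the question asks for, a minimal sketch reusing the counts dict and structure list built above:
# print each target word with its count, then its associated words sorted by count, all descending
for target, associated in sorted(structure, key=lambda s: counts.get(s[0], 0), reverse=True):
    print(target, counts.get(target, 0))
    for word, count in sorted(((w, counts.get(w, 0)) for w in associated), key=lambda wc: wc[1], reverse=True):
        print('   {}: {}'.format(word, count))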
