Refine core info from string

I have a list of fruits whose details are stored as strings. I want to extract the core information, i.e. the fruit name, from each string.
Some names are a single word (e.g. Apple, Grape); others are two words (Water melon, Star fruit).
I tried an n-gram approach:
from nltk import ngrams
from collections import Counter

strings = [
    "Apples Fresh Golden Delicious",
    "Apples Fresh Red Delicious 12",
    "Apple Sliced 12",
    "Apple Diced 24",
    "Water melon Fresh Petite Green on the turn",
    "16 count Star fruit",
    "8 count Star fruit",
    "4 count Star fruit",
    "Grapes Red Fresh Seedless",
    "Grapes Green Fresh Seedless",
    "Orange Naval Fresh 100 Count",
    "Orange Naval Fresh 48 Count",
    "Orange Naval Fresh 24 Count",
    "Orange Naval Fresh 12 Count"]
basket = []
for s in strings:
    grams = ngrams(s.split(), 1)  # 2 for 2-gram
    for g in grams:
        basket.append("-".join(g))
print(Counter(basket))
1-gram:
Counter({'Fresh': 9, 'Orange': 4, 'Naval': 4})
2-gram:
Counter({'Orange-Naval': 4, 'Naval-Fresh': 4, 'count-Star': 3})
Obviously this doesn't work well: the most frequent n-grams are descriptors such as 'Fresh', not the fruit names.
What would be a better way? Thank you.
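One possible direction, sketched here under assumptions and not taken from an answer in this thread: if a list of candidate fruit names is available, each string can be matched against that list instead of relying on n-gram frequency. The names KNOWN_FRUITS, normalise and extract_fruit are hypothetical, and the plural handling is a crude heuristic.

```python
import re

# Hypothetical vocabulary; in practice it would come from your own data.
KNOWN_FRUITS = ["apple", "water melon", "star fruit", "grape", "orange"]


def normalise(token):
    """Lowercase a token and strip a trailing plural 's' (crude heuristic)."""
    token = token.lower()
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def extract_fruit(text, vocabulary=KNOWN_FRUITS):
    """Return the first known fruit name found in text, or None."""
    # Keep alphabetic tokens only, so counts such as "16" are ignored.
    tokens = [normalise(t) for t in re.findall(r"[A-Za-z]+", text)]
    # Try multi-word names first so they win over shorter names.
    for name in sorted(vocabulary, key=lambda n: -len(n.split())):
        words = name.split()
        for start in range(len(tokens) - len(words) + 1):
            if tokens[start:start + len(words)] == words:
                return name
    return None


for s in ["Apples Fresh Golden Delicious", "16 count Star fruit",
          "Water melon Fresh Petite Green on the turn"]:
    print(s, "->", extract_fruit(s))
```

This only works when the vocabulary is known up front; discovering the names from the strings alone would need a different technique.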

Related

Python: Split string such that each substring is a key in a dictionary

I have a sample string:
"green apple, sly fox, cunning quick fox fur, cool water, yellow sand"
and a dictionary:
strr_dict = {"green": "color", "apple": "fruit", "sly": "behavior", "fox": "animal", "cunning": "behavior", "quick fox": "animal", "cool water": "drink", "yellow": "color", "sand": "matter"}
I want to display substrings in the string with their values from the dictionary as a dataframe. This is what I have done:
import pandas as pd

sample_str = "green apple, sly fox, cunning quick fox fur, cool water, yellow sand"
strr_dict = {"green": "color", "apple": "fruit", "sly": "behavior", "fox": "animal", "cunning": "behavior", "quick fox": "animal", "cool water": "drink", "yellow": "color", "sand": "matter"}
df_list = []
stripped_list = [i.strip() for i in sample_str.split(',')]
for i in stripped_list:
    if i in strr_dict:
        df_list.append([i, strr_dict[i]])
    else:
        for j in i.split():
            if j in strr_dict:
                df_list.append([j, strr_dict[j]])
            else:
                df_list.append([j, ""])
strr_df = pd.DataFrame(df_list, columns=['Text', 'Value'])
print(strr_df)
The output I am getting is:
    Text        Value
0   green       color
1   apple       fruit
2   sly         behavior
3   fox         animal
4   cunning     behavior
5   quick
6   fox         animal
7   fur
8   cool water  drink
9   yellow      color
10  sand        matter
My desired output is:
    Text        Value
0   green       color
1   apple       fruit
2   sly         behavior
3   fox         animal
4   cunning     behavior
5   quick fox   animal
6   fur
7   cool water  drink
8   yellow      color
9   sand        matter
I want to display the values only when a substring is an exact match for a dictionary key, and I am wondering how to split the string accordingly. In this case, "cunning quick fox fur" should be split as "cunning", "quick fox", "fur". But this may not always be the case; sometimes it should be split as "cunning", "quick fox fur" to get the values from the dictionary. I am not sure how to handle this.

This gives the output you specified. I don't know how or why you would want this, and I don't know whether it works for the other inputs you might have, but it should; feel free to test it with whatever other eldritch data sets you have ready.
import pandas as pd

sample_str = "green apple, sly fox, cunning quick fox fur, cool water, yellow sand"
strr_dict = {"green": "color", "apple": "fruit", "sly": "behavior", "fox": "animal", "cunning": "behavior",
             "quick fox": "animal", "cool water": "drink", "yellow": "color", "sand": "matter"}
df_list = []
stripped_list = [i.strip() for i in sample_str.split(',')]
checklist = []  # keys (or whole phrases) that have already been emitted
for i in stripped_list:
    if i in strr_dict:
        # The whole comma-separated phrase is a key.
        df_list.append([i, strr_dict[i]])
        checklist.append(i)
    else:
        for z in list(strr_dict.keys()):
            if z in str(checklist):
                continue
            if z in i:
                try:
                    df_list.append([i, strr_dict[i]])
                    checklist.append(i)
                except KeyError:
                    # The phrase itself is not a key, so record the matching key.
                    df_list.append([z, strr_dict[z]])
                    checklist.append(z)
        # Emit leftover words that matched nothing, with an empty value.
        for x in i.split():
            if x not in str(checklist) and x not in list(strr_dict.keys()):
                df_list.append([x, ""])
strr_df = pd.DataFrame(df_list, columns=['Text', 'Value'])
print(strr_df)
Output:
    Text        Value
0   green       color
1   apple       fruit
2   sly         behavior
3   fox         animal
4   cunning     behavior
5   quick fox   animal
6   fur
7   cool water  drink
8   yellow      color
9   sand        matter
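As an alternative sketch (not the answer author's method), the split can be treated as a greedy longest-match over the words of each comma-separated phrase: at each position, try the longest run of words that is a dictionary key and fall back to shorter runs. The function name split_by_dictionary is hypothetical, and the result is returned as plain tuples so the example does not depend on pandas.

```python
sample_str = "green apple, sly fox, cunning quick fox fur, cool water, yellow sand"
strr_dict = {"green": "color", "apple": "fruit", "sly": "behavior",
             "fox": "animal", "cunning": "behavior", "quick fox": "animal",
             "cool water": "drink", "yellow": "color", "sand": "matter"}


def split_by_dictionary(text, lookup):
    """Greedy longest-match split; unmatched words get an empty value."""
    rows = []
    for chunk in (part.strip() for part in text.split(",")):
        tokens = chunk.split()
        pos = 0
        while pos < len(tokens):
            # Try the longest candidate phrase first, then shorter ones.
            for size in range(len(tokens) - pos, 0, -1):
                phrase = " ".join(tokens[pos:pos + size])
                if phrase in lookup:
                    rows.append((phrase, lookup[phrase]))
                    pos += size
                    break
            else:
                # No key starts at this word: keep it with an empty value.
                rows.append((tokens[pos], ""))
                pos += 1
    return rows


rows = split_by_dictionary(sample_str, strr_dict)
for text, value in rows:
    print(text, "|", value)
```

Because longer phrases are tried first, adding a key such as "quick fox fur" to the dictionary would make that phrase win over "quick fox". The rows can be passed to pd.DataFrame(rows, columns=['Text', 'Value']) to get the table from the question.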

How do I order vectors from sentence embeddings and give them out with their respective input?

I managed to generate vectors for every sentence in my two corpora and calculate the Cosine Similarity between every possible pair (dot product):
import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings1 = ["I'd like an apple juice",
               "An apple a day keeps the doctor away",
               "Eat apple every day",
               "We buy apples every week",
               "We use machine learning for text classification",
               "Text classification is subfield of machine learning"]
embeddings1 = embed(embeddings1)
embeddings2 = ["I'd like an orange juice",
               "An orange a day keeps the doctor away",
               "Eat orange every day",
               "We buy orange every week",
               "We use machine learning for document classification",
               "Text classification is some subfield of machine learning"]
embeddings2 = embed(embeddings2)
print(cosine_similarity(embeddings1, embeddings2))
array([[ 0.7882168 , 0.3366559 , 0.22973989, 0.15428472, -0.10180502,
-0.04344492],
[ 0.256085 , 0.7713026 , 0.32120776, 0.17834462, -0.10769081,
-0.09398925],
[ 0.23850328, 0.446203 , 0.62606746, 0.25242645, -0.03946173,
-0.00908459],
[ 0.24337521, 0.35571027, 0.32963073, 0.6373588 , 0.08571904,
-0.01240187],
[-0.07001016, -0.12002315, -0.02002328, 0.09045915, 0.9141338 ,
0.8373743 ],
[-0.04525191, -0.09421931, -0.00631144, -0.00199519, 0.75919366,
0.9686416 ]]
To get a meaningful output I need to sort the similarities and return them together with the corresponding input sentences. Does anyone have an idea how to do that? I did not find any tutorial for this task.

You might use np.argsort(...) for sorting:
import numpy as np
import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
seq1 = ["I'd like an apple juice",
        "An apple a day keeps the doctor away",
        "Eat apple every day",
        "We buy apples every week",
        "We use machine learning for text classification",
        "Text classification is subfield of machine learning"]
embeddings1 = embed(seq1)
seq2 = ["I'd like an orange juice",
        "An orange a day keeps the doctor away",
        "Eat orange every day",
        "We buy orange every week",
        "We use machine learning for document classification",
        "Text classification is some subfield of machine learning"]
embeddings2 = embed(seq2)
a = cosine_similarity(embeddings1, embeddings2)

def get_pairs(a, b):
    # Build a (len(a), len(b), 2) array whose [i, j] entry is the pair (a[i], b[j]).
    a = np.array(a)
    b = np.array(b)
    c = np.array(np.meshgrid(a, b))
    c = c.T.reshape(len(a), -1, 2)
    return c

pairs = get_pairs(seq1, seq2)
# For each seq2 sentence (column), order the seq1 sentences by ascending similarity.
sorted_idx = np.argsort(a, axis=0)[..., None]
sorted_pairs = np.take_along_axis(pairs, sorted_idx, axis=0)
print(pairs[0, 0])
print(pairs[0, 1])
print(pairs[0, 2])
Output:
["I'd like an apple juice" "I'd like an orange juice"]
["I'd like an apple juice" 'An orange a day keeps the doctor away']
["I'd like an apple juice" 'Eat orange every day']
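The ranking step can also be illustrated without loading the encoder. The sketch below assumes the similarity matrix has already been computed; the values are rounded from three columns of the matrix printed in the question, and rank_matches is a hypothetical helper name. For every sentence in the first list it returns the sentences of the second list from most to least similar.

```python
import numpy as np

seq1 = ["I'd like an apple juice", "Eat apple every day"]
seq2 = ["I'd like an orange juice", "Eat orange every day",
        "We buy orange every week"]

# Stand-in for cosine_similarity(embeddings1, embeddings2): one row per
# seq1 sentence, one column per seq2 sentence (rounded, for illustration).
similarity = np.array([
    [0.788, 0.230, 0.154],
    [0.239, 0.626, 0.252],
])


def rank_matches(similarity, queries, candidates):
    """Map each query to its candidates sorted by descending similarity."""
    # Negating the matrix turns argsort's ascending order into descending.
    order = np.argsort(-similarity, axis=1)
    ranked = {}
    for row, query in enumerate(queries):
        ranked[query] = [(candidates[col], float(similarity[row, col]))
                         for col in order[row]]
    return ranked


ranked = rank_matches(similarity, seq1, seq2)
for query, matches in ranked.items():
    print(query)
    for sentence, score in matches:
        print(f"    {score:.3f}  {sentence}")
```

Keeping the original sentence lists separate from the embeddings, as the answer does with seq1 and seq2, is what makes it possible to map sorted indices back to text.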

I passed strings instead of a list of strings. Problem solved.

Converting A Value In A Dictionary to List

So I have a list of dictionaries. Here are some of the entries in the list that I'm trying to search through.
[{
'title': '"Adult" Pimiento Cheese ',
'categories': [
'Cheese',
'Vegetable',
'No-Cook',
'Vegetarian',
'Quick & Easy',
'Cheddar',
'Hot Pepper',
'Winter',
'Gourmet',
'Alabama',
],
'ingredients': [
'2 or 3 large garlic cloves',
'a 2-ounce jar diced pimientos',
'3 cups coarsely grated sharp Cheddar (preferably English, Canadian, or Vermont; about 12 ounces)'
,
'1/3 to 1/2 cup mayonnaise',
'crackers',
'toasted baguette slices',
"crudit\u00e9s",
],
'directions': ['Force garlic through a garlic press into a large bowl and stir in pimientos with liquid in jar. Add Cheddar and toss mixture to combine well. Stir in mayonnaise to taste and season with freshly ground black pepper. Cheese spread may be made 1 day ahead and chilled, covered. Bring spread to room temperature before serving.'
, 'Serve spread with accompaniments.'],
'rating': 3.125,
}, {
'title': '"Blanketed" Eggplant ',
'categories': [
'Tomato',
'Vegetable',
'Appetizer',
'Side',
'Vegetarian',
'Eggplant',
'Pan-Fry',
'Vegan',
"Bon App\u00e9tit",
],
'ingredients': [
'8 small Japanese eggplants, peeled',
'16 large fresh mint leaves',
'4 large garlic cloves, 2 slivered, 2 flattened',
'2 cups olive oil (for deep frying)',
'2 pounds tomatoes',
'7 tablespoons extra-virgin olive oil',
'1 medium onion, chopped',
'6 fresh basil leaves',
'1 tablespoon dried oregano',
'1 1/2 tablespoons drained capers',
],
'directions': ['Place eggplants on double thickness of paper towels. Salt generously. Let stand 1 hour. Pat dry with paper towels. Cut 2 deep incisions in each eggplant. Using tip of knife, push 1 mint leaf and 1 garlic sliver into each incision.'
,
"Pour 2 cups oil into heavy medium saucepan and heat to 375\u00b0F. Add eggplants in batches and fry until deep golden brown, turning occasionally, about 4 minutes. Transfer eggplants to paper towels and drain."
,
'Blanch tomatoes in pot of boiling water for 20 seconds. Drain. Peel tomatoes. Cut tomatoes in half; squeeze out seeds. Chop tomatoes; set aside.'
,
"Heat 4 tablespoons extra-virgin olive oil in large pot over high heat. Add 2 flattened garlic cloves; saut\u00e9 until light brown, about 3 minutes. Discard garlic. Add onion; saut\u00e9 until translucent, about 5 minutes. Add reduced to 3 cups, stirring occasionally, about 20 minutes."
,
'Mix capers and 3 tablespoons extra-virgin olive oil into sauce. Season with salt and pepper. Reduce heat. Add eggplants. Simmer 5 minutes, spooning sauce over eggplants occasionally. Spoon sauce onto platter. Top with eggplants. Serve warm or at room temperature.'
],
'rating': 3.75,
'calories': 1386.0,
'protein': 9.0,
'fat': 133.0,
}]
My current code searches through the list and builds a list of the recipes that contain all the words in the query argument.
The function below finds the matching recipes and returns them as a list of dictionaries. tokenisation is another function that removes all punctuation and digits from the query and makes it lower case. It returns a list of the words found in the query.
For example, the query "cheese!22banana" would be turned into [cheese, banana].
def matching(query):
    #split up the input string and have a list to put the recipes in
    token_list = tokenisation(query)
    matching_recipes = []
    #loop through whole file
    for recipe in recipes:
        recipe_tokens = []
        #check each key
        for key in recipe:
            #checking the keys for types
            if type(recipe[key]) != list:
                continue
            #look at the values for each key
            for sentence in recipe[key]:
                #make a big list of tokens from the keys
                recipe_tokens.extend([t for t in tokenisation(sentence)])
        #checking if all the query tokens appear in the recipe, if so append them
        if all([tl in recipe_tokens for tl in token_list]):
            matching_recipes.append(recipe)
    return matching_recipes
The issue I am having is that the first key in the dictionary, the title, isn't a list, so the function skips it: it only collects the words from the list-valued keys and then checks whether every word in the query is present in that list of words. As a result, if a query word appears only in the title, the recipe isn't returned.
How would I add the title check to this code? I've tried turning the title into a list, since it currently has type string, but then I get a 'float' is not iterable error, and I have no clue how to tackle this issue.

To avoid the error, treat strings as a special case instead of skipping everything that is not a list. Replace
if type(recipe[key]) != list:
with
if type(recipe[key]) == str:
or better,
if isinstance(value, str):
and keep skipping any value that is neither a string nor a list. You get the error from trying to iterate over certain values, because some values in the dicts are indeed of type float, for example the value of the 'rating' key.
If the tokenisation function returns a list of sentences, this should work:
def matching(query):
    token_list = tokenisation(query)
    matching_recipes = []
    for recipe in recipes:
        recipe_tokens = []
        for value in recipe.values():
            if isinstance(value, str):
                # Keep the title as a single entry.
                recipe_tokens.append(value)
                continue
            if not isinstance(value, list):
                # Skip floats such as 'rating'; they cannot be iterated.
                continue
            for sentence in value:
                recipe_tokens.extend(tokenisation(sentence))
        if all([tl in recipe_tokens for tl in token_list]):
            matching_recipes.append(recipe)
    return matching_recipes
If it returns a list of words:
def matching(query):
    token_list = tokenisation(query)
    matching_recipes = []
    for recipe in recipes:
        recipe_tokens = []
        for value in recipe.values():
            if isinstance(value, str):
                # Split the title into words so it is handled like the lists.
                value = tokenisation(value)
            elif not isinstance(value, list):
                # Skip floats such as 'rating'; they cannot be iterated.
                continue
            for sentence in value:
                recipe_tokens.extend(tokenisation(sentence))
        if all([tl in recipe_tokens for tl in token_list]):
            matching_recipes.append(recipe)
    return matching_recipes
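For a version that can be run on its own, here is a minimal sketch. The tokenisation function below is a hypothetical stand-in written from the question's description (lower case, no digits or punctuation, list of words), the two recipes are trimmed-down copies of the question's data, and recipes is passed as an argument instead of being read from a global.

```python
import re


def tokenisation(text):
    """Hypothetical stand-in: lowercase words with digits/punctuation removed."""
    return re.findall(r"[a-z]+", text.lower())


def recipe_tokens(recipe):
    """Collect the words of every string and list-of-strings value."""
    tokens = set()
    for value in recipe.values():
        if isinstance(value, str):
            value = [value]          # treat the title like a one-item list
        elif not isinstance(value, list):
            continue                 # skip floats such as 'rating'
        for sentence in value:
            tokens.update(tokenisation(sentence))
    return tokens


def matching(query, recipes):
    """Return the recipes that contain every word of the query."""
    wanted = set(tokenisation(query))
    return [recipe for recipe in recipes if wanted <= recipe_tokens(recipe)]


recipes = [
    {"title": '"Adult" Pimiento Cheese ', "categories": ["Cheddar", "Winter"],
     "rating": 3.125},
    {"title": '"Blanketed" Eggplant ', "categories": ["Vegan", "Tomato"],
     "rating": 3.75, "calories": 1386.0},
]

print([r["title"] for r in matching("adult!22cheddar", recipes)])
```

Using a set for the recipe words makes the "all query words present" check a subset test, which avoids scanning a list once per query word.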

Searching over a list of individual sentences by a specific term in Python

I have a list of terms in Python that look like this.
Fruit
apple
banana
grape
orange
As well as a list of individual sentences that may contain the name of that fruit in a data frame. Something similar to this:
Customer Review
1 ['the banana was delicious','he called the firetruck','I had only half an orange']
2 ['I liked the banana','there was a worm in my apple','Cantaloupes are better then melons']
3 ['It could use some more cheese','the grape and orange was sour']
And I want to take the sentences in the review column, match them with the fruit mentioned in the text and print out a data frame of that as a final result. So, something like this:
Fruit   Review
apple   ['there was a worm in my apple']
banana  ['the banana was delicious','I liked the banana']
grape   ['the grape and orange was sour']
orange  ['the grape and orange was sour','I had only half an orange']
How could I go about doing this?

While the exact answer depends on how you're storing the data, I think the methodology is the same:
1. Create and store an empty list for every fruit name to store its reviews.
2. For each review, check each of the fruits to see if they appear. If a fruit appears in the comment at all, add the review to that fruit's list.
Here's an example of what that would look like:
#The list of fruits
fruits = ['apple', 'banana', 'grape', 'orange']
#The collection of reviews (based on the way it was presented, I'm assuming it was in a dictionary)
reviews = {
    '1':['the banana was delicious','he called the firetruck','I had only half an orange'],
    '2':['I liked the banana','there was a worm in my apple','Cantaloupes are better then melons'],
    '3':['It could use some more cheese','the grape and orange was sour']
}
fruitDictionary = {}
#1. Create and store an empty list for every fruit name to store its reviews
for fruit in fruits:
    fruitDictionary[fruit] = []
for customerReviews in reviews.values():
    #2. For each review,...
    for review in customerReviews:
        #...check each of the fruits to see if they appear.
        for fruit in fruits:
            # If a fruit appears in the comment at all,...
            if fruit.lower() in review:
                #...add the review to that fruit's list
                fruitDictionary[fruit].append(review)
This differs from previous answers in that sentences like "I enjoyed this grape. I thought the grape was very juicy" are only added to the grape section once.
If your data is stored as a list of lists, the process is very similar:
#The list of fruits
fruits = ['apple', 'banana', 'grape', 'orange']
#The collection of reviews
reviews = [
    ['the banana was delicious','he called the firetruck','I had only half an orange'],
    ['I liked the banana','there was a worm in my apple','Cantaloupes are better then melons'],
    ['It could use some more cheese','the grape and orange was sour']
]
fruitDictionary = {}
#1. Create and store an empty list for every fruit name to store its reviews
for fruit in fruits:
    fruitDictionary[fruit] = []
for customerReviews in reviews:
    #2. For each review,...
    for review in customerReviews:
        #...check each of the fruits to see if they appear.
        for fruit in fruits:
            # If a fruit appears in the comment at all,...
            if fruit.lower() in review:
                #...add the review to that fruit's list
                fruitDictionary[fruit].append(review)
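One caveat with a plain substring test is that 'apple' also matches inside 'pineapple', while splitting on whitespace misses 'apple,' with trailing punctuation. A possible refinement, sketched here as an assumption and not as part of the answer above, is a case-insensitive whole-word match with an optional plural 's'. The name group_reviews_by_fruit and the extra sample reviews are made up for illustration.

```python
import re


def group_reviews_by_fruit(fruits, reviews):
    """Map each fruit to the reviews that mention it as a whole word."""
    grouped = {fruit: [] for fruit in fruits}
    # \b anchors the match at word boundaries; "s?" tolerates a plural.
    patterns = {
        fruit: re.compile(r"\b" + re.escape(fruit) + r"s?\b", re.IGNORECASE)
        for fruit in fruits
    }
    for customer_reviews in reviews:
        for review in customer_reviews:
            for fruit, pattern in patterns.items():
                if pattern.search(review):
                    grouped[fruit].append(review)
    return grouped


fruits = ["apple", "banana", "grape", "orange"]
reviews = [
    ["the banana was delicious", "I had only half an orange"],
    ["I love pineapple.", "The Apple, though, was fine."],
    ["the grape and orange was sour", "We buy apples every week"],
]
grouped = group_reviews_by_fruit(fruits, reviews)
for fruit, matched in grouped.items():
    print(fruit, matched)
```

Each review is appended at most once per fruit because search stops at the first match, which keeps the "only added once" property described above.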

You can keep a dictionary keyed by fruit, and then look up each word of a review in it:
# your fruits list
fruits = ["apple", "banana", "grape", "orange"]
reviews = [['the banana was delicious','he called the firetruck','I had only half an orange'], ['I liked the banana','there was a worm in my apple','Cantaloupes are better then melons'], ['It could use some more cheese','the grape and orange was sour']]
# Initialize the dictionary, make each fruit a key
fruitReviews = {fruit.lower():[] for fruit in fruits}
# for each review, if a word in the review is a fruit, add it to that
# fruit's reviews list
for reviewer in reviews:
    for review in reviewer:
        for word in review.split():
            fruitReview = fruitReviews.get(word.lower(), None)
            if fruitReview is not None:
                fruitReview.append(review)
"""
result:
{
    "orange": [
        "I had only half an orange",
        "the grape and orange was sour"
    ],
    "grape": [
        "the grape and orange was sour"
    ],
    "apple": [
        "there was a worm in my apple"
    ],
    "banana": [
        "the banana was delicious",
        "I liked the banana"
    ]
}
"""

You can use the .explode function to expand the reviews, then use sets to find the intersection:
import pandas as pd

fruits = pd.DataFrame({'Fruit': 'apple banana grape orange'.split()})
reviews = pd.DataFrame({
    'Customer': [1, 2, 3],
    'Review': [
        ['the banana was delicious','he called the firetruck','I had only half an orange'],
        ['I liked the banana','there was a worm in my apple','Cantaloupes are better then melons'],
        ['It could use some more cheese','the grape and orange was sour'],
    ]})
# review per row
explode_reviews = reviews.explode('Review')
# create a set
fruits_set = set(fruits['Fruit'].tolist())
# find intersection
explode_reviews['Fruit'] = explode_reviews['Review'].apply(lambda x: ' '.join(set(x.split()).intersection(fruits_set)))
print(explode_reviews)
If you don't want to explode your data, you can just do:
# ...
flatten = lambda l: [item for sublist in l for item in sublist]
reviews['Fruit'] = reviews['Review'].apply(lambda x: flatten([set(i.split()).intersection(fruits_set) for i in x]))

How to format a list to rows with certain number of items?

I'm trying to print a list so that each row of the output contains five elements, but I am stuck.
words = ["letter", "good", "course", "land", "car", "tea", "speaker",
         "music", "length", "apple", "cash", "floor", "dance", "rice",
         "bow", "peach", "cook", "hot", "none", "word", "happy", "apple",
         "monitor", "light", "access"]
Desired output:
letter good course land car
tea speaker music length apple
cash floor dance rice bow
peach cook hot none word
happy apple monitor light access

Try this:
>>> for i in range(0, len(words), 5):
...     print(' '.join(words[i:i+5]))
...
letter good course land car
tea speaker music length apple
cash floor dance rice bow
peach cook hot none word
happy apple monitor light access

Using a list comprehension:
num = 5
[' '.join(words[i:i+num]) for i in range(0, len(words), num)]
You can also use chunked, but you might have to install more_itertools first:
from more_itertools import chunked
list(chunked(words, 5))  # a list of 5-item lists; join each one to print it
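If the rows should also line up in columns, the same slicing idea can be combined with str.ljust. This is a small sketch under that assumption; format_rows is a hypothetical helper, and padding every word to the longest word's width is just one possible layout.

```python
def format_rows(words, per_row=5):
    """Return one string per row, with words padded to a common width."""
    width = max((len(word) for word in words), default=0)
    lines = []
    for start in range(0, len(words), per_row):
        chunk = words[start:start + per_row]
        # Pad each word, then drop the padding after the last word.
        lines.append(" ".join(word.ljust(width) for word in chunk).rstrip())
    return lines


words = ["letter", "good", "course", "land", "car", "tea", "speaker",
         "music", "length", "apple", "cash", "floor", "dance", "rice",
         "bow", "peach", "cook", "hot", "none", "word", "happy", "apple",
         "monitor", "light", "access"]
for line in format_rows(words):
    print(line)
```

A list whose length is not a multiple of per_row simply ends with a shorter last row.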
