Fast replacement with a 2D matrix in Python

Sorry if it's a simple question, I'm new to Python. I have a string (an array of words) and a 2-dimensional array of words, and I'm going to replace them one by one, something like the following:
str="Jim is a good person"
# and will convert to:
parts=['Jim','is','a','good','person']
and a 2-dimensional array in which each row is an array of words that can replace the element with the same index in parts, for example something like this:
replacement=[['john','Nock','Kati'],
             ['were','was','are'],
             ['a','an'],
             ['bad','perfect','awesome'],
             ['cat','human','dog']]
The result can be something like this:
1: nike is a good person
2: John are an bad human
3: Kati were a perfect cat
and so on
Actually I'm going to replace each word of a sentence with some possible words and then do some calculation on the new sentence. I need to generate all possible replacements.
Many thanks.

itertools.product might be the best choice for creating all of the combinations that you're looking for.
Let's use your replacement list as a starting point for what could work. A way to get all the combinations you're looking for could look something like this:
from itertools import product
word_options=[['john','Nock','Kati'],
              ['were','was','are'],
              ['a','an'],
              ['bad','perfect','awesome'],
              ['cat','human','dog']]
for option in product(*word_options):
    new_sentence = ' '.join(option)
    # do calculation on new_sentence
Each option that is being iterated through is a tuple, where each element is a single choice from each of the individual sub-lists of the original 2D list. Then the ' '.join(option) will combine the individual strings into a single string where the words are separated by a space. If you were to just print new_sentence, the output would look as follows.
john were a bad cat
john were a bad human
john were a bad dog
john were a perfect cat
john were a perfect human
john were a perfect dog
.
.
.
Kati are an perfect cat
Kati are an perfect human
Kati are an perfect dog
Kati are an awesome cat
Kati are an awesome human
Kati are an awesome dog
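If you also want the original words from parts to remain available as options (as in your sample results, where words such as is, a, good and person stay in place), one possible sketch is to merge parts and replacement before calling product; the variable names here just mirror the question:

from itertools import product

parts = ['Jim', 'is', 'a', 'good', 'person']
replacement = [['john', 'Nock', 'Kati'],
               ['were', 'was', 'are'],
               ['a', 'an'],
               ['bad', 'perfect', 'awesome'],
               ['cat', 'human', 'dog']]

# Keep the original word as one of the choices at each position
word_options = [[original] + alternatives
                for original, alternatives in zip(parts, replacement)]

for option in product(*word_options):
    new_sentence = ' '.join(option)
    # do calculation on new_sentence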

Related

Extract full sentence with list of words

I want to extract the full sentence if it contains certain key words (like or love).
import re

text = 'I like blueberry icecream. He has a green car. She has blue car.'
pattern = '[^.]* like|love [^.]*\.'
re.findall(pattern,text)
Using | as the divider, I expected ['I like blueberry icecream.']
But only got ['I like']
I also tried pattern = '[^.]*(like|love)[^.]*\.' but got only ['like']
What did I do wrong? I know a single word works with the following RegEx - '[^.]* like [^.]*\.'
You need to put a group around like|love. Otherwise the | applies to the entire patterns on either side of it. So it's matching either a string ending with like or a string beginning with love.
pattern = '[^.]* (?:like|love) [^.]*\.'
I researched more and found out I was missing ?:
text = 'I like blueberry icecream. He has a green car. She has blue car.'
pattern = '[^.]*(?:like|love)[^.]*\.'
Output
['I like blueberry icecream.']
Source: https://www.ocpsoft.org/tutorials/regular-expressions/or-in-regex/
I actually think it would be easier to do this without regex. Just my two cents.
text = 'I like blueberry icecream. He has a green car. She has blue car. I love dogs.'
print([x for x in text.split('.') if any(y in x for y in ['like', 'love'])])
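For what it's worth, that prints the matching fragments without the trailing periods and with a leading space on all but the first; a strip() tidies them up:

text = 'I like blueberry icecream. He has a green car. She has blue car. I love dogs.'
print([x.strip() for x in text.split('.') if any(y in x for y in ['like', 'love'])])
# ['I like blueberry icecream', 'I love dogs']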
You can use the regex below:
regex = /[^.]* (?:like|love) [^.]*\./g

Python - Applying a function to separate string in column every two words

I want to add a separator (,) every two words to capture/better delineate the full names in the row.
For example df['Names'] is currently:
John Smith David Smith Golden Brown Austin James
and I would like to be:
John Smith, David Smith, Golden Brown, Austin James
I was able to find some code which splits the string every x words, which would be perfect for my purposes, shown below:
def splitText(string):
    words = string.split()
    grouped_words = [' '.join(words[i: i + 2]) for i in range(0, len(words), 2)]
    return grouped_words
However I'm not sure how to apply this to the column of choice.
I tried the following:
df['Names'].apply(splitText())
This gives me a missing positional argument.
Asking for any advice on either modifying the function or my application of it to a column dataframe. I'm pretty new to this stuff so any advice would be great!
Cheers
You need to pass only the function, without ():
df['Names'].apply(splitText)
It works the same as using a lambda function:
df['Names'].apply(lambda x: splitText(x))
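As a sketch of the full flow (assuming the end goal is a single comma-separated string per row rather than a list of pairs), you can join the groups inside the function:

import pandas as pd

def splitText(string):
    words = string.split()
    grouped_words = [' '.join(words[i: i + 2]) for i in range(0, len(words), 2)]
    return ', '.join(grouped_words)

df = pd.DataFrame({'Names': ['John Smith David Smith Golden Brown Austin James']})
df['Names'] = df['Names'].apply(splitText)
print(df['Names'].iloc[0])
# John Smith, David Smith, Golden Brown, Austin James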

Create a list of alphabetically sorted UNIQUE words and display the first N words in python

I am new to Python, apologies for a simple question. My task is the following:
Create a list of alphabetically sorted unique words and display the first 5 words
I have text variable, which contains a lot of text information
I did
test = text.split()
sorted(test)
As a result, I receive a list which starts with symbols like $ and numbers.
How do I get to the words and print N of them?
I'm assuming by "word", you mean strings that consist of only alphabetical characters. In such a case, you can use filter to first get rid of the unwanted strings, turn the result into a set, sort it and then print your stuff.
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Extract only the words that consist of alphabets
words = filter(lambda x: x.isalpha(), text.split(' '))
# Print the first 5 words
sorted(set(words))[:5]
Output-
['atop', 'king', 'mountain', 'of', 'peak']
But the problem with this is that it will still ignore words like mountain's, because of that pesky '. A regex solution might actually be far better in such a case-
For now, we'll be going with this regex - ^[A-Za-z']+$, which means the string must contain only alphabetic characters and '; you may add more to this regex according to what you deem as "words". Read more on regexes here.
We'll be using re.match instead of .isalpha this time.
import re

WORD_PATTERN = re.compile(r"^[A-Za-z']+$")
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Extract only the words that consist of alphabets
words = filter(lambda x: bool(WORD_PATTERN.match(x)), text.split(' '))
# Print the first 5 words
sorted(set(words))[:5]
Output-
['atop', 'king', 'mountain', "mountain's", 'of']
Keep in mind however, this gets tricky when you have a string like hi! What's your name?. Here hi! and name? are words, except they are not fully alphabetic. The trick to this is to split them in such a way that you get hi instead of hi!, and name instead of name?, in the first place.
Unfortunately, a true word split is far outside the scope of this question. I suggest taking a look at this question
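As a rough sketch, one simple way to get that kind of punctuation-free split is to extract word-like tokens directly with re.findall instead of splitting on spaces:

import re

text = "hi! What's your name?"
# Pull out runs of letters (and apostrophes) so punctuation never sticks to the words
words = re.findall(r"[A-Za-z']+", text)
print(sorted(set(words))[:5])
# ["What's", 'hi', 'name', 'your']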
I am a newbie here, apologies for mistakes. Thank you.
test = '''The coronavirus outbreak has hit hard the cattle farmers in Pabna and Sirajganj as they are now getting hardly any customer for the animals they prepared for the last year targeting the Eid-ul-Azha this year.
Normally, cattle traders flock in large numbers to the belt -- one of the biggest cattle producing areas of the country -- one month ahead of the festival, when Muslims slaughter animals as part of their efforts to honour Prophet Ibrahim's spirit of sacrifice.
But the scene is different this year.'''
test = test.lower().split()
test2 = sorted([j for j in test if j.isalpha()])
print(test2[:5])
You can slice the sorted returned list up to position 5:
sorted(test)[:5]
or if looking only for words
sorted([i for i in test if i.isalpha()])[:5]
or by regex
import re
sorted([i for i in test if re.search(r"[a-zA-Z]", i)])[:5]
By using a slice of a list you will be able to get all list elements up to a specific index, in this case 5.

Function that insert words into text

I have a text that goes like this:
text = "All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood."
How do I write a function hedging(text) that processes my text and produces a new version that inserts the word "like" after every third word of the text?
The outcome should be like that:
text2 = "All human beings like are born free like and equal in like..."
Thank you!
Instead of giving you something like
solution=' like '.join(map(' '.join, zip(*[iter(text.split())]*3)))
I'm posting general advice on how to approach the problem. The "algorithm" is not particularly "pythonic", but hopefully easy to understand:
words = split text into words
number of words processed = 0
for each word in words
    output word
    number of words processed += 1
    if number of words processed is divisible by 3 then
        output like
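For reference, a direct translation of that pseudocode into Python could look like this:

text = ("All human beings are born free and equal in dignity and rights. "
        "They are endowed with reason and conscience and should act towards "
        "one another in a spirit of brotherhood.")

def hedging(text):
    output = []
    for count, word in enumerate(text.split(), start=1):
        output.append(word)
        if count % 3 == 0:  # every third word has been output
            output.append('like')
    return ' '.join(output)

print(hedging(text))
# All human beings like are born free like and equal in like ...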
Let us know if you have questions.
You could go with something like this:
' '.join([n + ' like' if i % 3 == 2 else n for i, n in enumerate(text.split())])

What is a good strategy to group similar words?

Say I have a list of movie names with misspellings and small variations like this -
"Pirates of the Caribbean: The Curse of the Black Pearl"
"Pirates of the carribean"
"Pirates of the Caribbean: Dead Man's Chest"
"Pirates of the Caribbean trilogy"
"Pirates of the Caribbean"
"Pirates Of The Carribean"
How do I group or find such sets of words, preferably using python and/or redis?
Have a look at "fuzzy matching". There are some great tools in the thread below that calculate similarities between strings.
I'm especially fond of the difflib module
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison
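Applied to the movie titles in the question, for example (the exact results depend on difflib's default cutoff):

from difflib import get_close_matches

titles = ["Pirates of the Caribbean: The Curse of the Black Pearl",
          "Pirates of the carribean",
          "Pirates of the Caribbean: Dead Man's Chest",
          "Pirates of the Caribbean trilogy",
          "Pirates of the Caribbean",
          "Pirates Of The Carribean"]

# Returns up to three titles from the list that are closest to the query
print(get_close_matches("Pirates of the Caribbean", titles))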
You might notice that similar strings have a large common subsequence, for example:
"Bla bla bLa" and "Bla bla bRa" => the common subsequence is "Bla bla ba" (notice the third word)
To find a common subsequence you may use a dynamic programming algorithm. One related variation is Levenshtein distance (the distance between very similar strings is small, and between more different strings the distance is bigger) - http://en.wikipedia.org/wiki/Levenshtein_distance.
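As a rough sketch, the classic dynamic-programming formulation of Levenshtein distance looks like this:

def levenshtein(a, b):
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(a)][len(b)]

print(levenshtein("Pirates of the Caribbean", "Pirates of the carribean"))  # a small number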
Also, for quick performance you may try to adapt the Soundex algorithm - http://en.wikipedia.org/wiki/Soundex.
So after calculating the distance between all your strings, you have to cluster them. The simplest way is k-means (but it needs you to define the number of clusters). If you don't know the number of clusters, you have to use hierarchical clustering. Note that the number of clusters in your situation is the number of different movie titles + 1 (for totally misspelled strings).
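If you would rather not set up full k-means or hierarchical clustering, a very naive greedy grouping based on a similarity threshold (here via difflib.SequenceMatcher, purely as a sketch with a hand-picked threshold) can already go a long way on these titles:

from difflib import SequenceMatcher

titles = ["Pirates of the Caribbean: The Curse of the Black Pearl",
          "Pirates of the carribean",
          "Pirates of the Caribbean: Dead Man's Chest",
          "Pirates of the Caribbean trilogy",
          "Pirates of the Caribbean",
          "Pirates Of The Carribean"]

def similar(a, b, threshold=0.85):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

groups = []
for title in titles:
    # Join the first group whose representative is similar enough, else start a new group
    for group in groups:
        if similar(title, group[0]):
            group.append(title)
            break
    else:
        groups.append([title])

print(groups)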
I believe there are in fact two distinct problems.
The first is spell correction. You can find one in Python here
http://norvig.com/spell-correct.html
The second is more functional. Here is what I'd do after the spell correction. I would make a relation function.
related(sentence1, sentence2) if and only if sentence1 and sentence2 have rare common words. By rare, I mean words other than (The, what, is, etc...). You can take a look at the TF/IDF system to determine if two documents are related using their words. Just googling a bit I found this:
https://code.google.com/p/tfidf/
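A sketch of that idea, with a small hand-picked stopword set standing in for a real TF/IDF weighting:

STOPWORDS = {'the', 'of', 'what', 'is', 'a', 'an', 'and', 'in', 'on', 'to'}

def rare_words(sentence):
    return {w for w in sentence.lower().split() if w not in STOPWORDS}

def related(sentence1, sentence2):
    # Related if and only if they share at least one non-stopword
    return bool(rare_words(sentence1) & rare_words(sentence2))

print(related("Pirates of the carribean", "Pirates of the Caribbean trilogy"))  # True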
To add another tip to Fredrik's answer, you could also take inspiration from search-engine-like code, such as this one:
def dosearch(terms, searchtype, case, adddir, files = []):
    found = []
    if files != None:
        titlesrch = re.compile('>title<.*>/title<')
        for file in files:
            title = ""
            if not (file.lower().endswith("html") or file.lower().endswith("htm")):
                continue
            filecontents = open(BASE_DIR + adddir + file, 'r').read()
            titletmp = titlesrch.search(filecontents)
            if titletmp != None:
                title = filecontents.strip()[titletmp.start() + 7:titletmp.end() - 8]
            filecontents = remove_tags(filecontents)
            filecontents = filecontents.lstrip()
            filecontents = filecontents.rstrip()
            if dofind(filecontents, case, searchtype, terms) > 0:
                found.append(title)
                found.append(file)
    return found
Source and more information: http://www.zackgrossbart.com/hackito/search-engine-python/
Regards,
Max
One approach would be to pre-process all the strings before you compare them: convert all to lowercase, standardize whitespace (eg, replace any whitespace with single spaces). If punctuation is not important to your end goal, you can remove all punctuation characters as well.
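A tiny normalization helper along those lines (just a sketch, the exact rules are up to you):

import re
import string

def normalize(title):
    # Lowercase, drop punctuation, collapse any whitespace to single spaces
    title = title.lower()
    title = title.translate(str.maketrans('', '', string.punctuation))
    return re.sub(r'\s+', ' ', title).strip()

print(normalize("Pirates of the Caribbean: The Curse of the Black Pearl"))
# pirates of the caribbean the curse of the black pearl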
Levenshtein distance is commonly used to determine the similarity of strings; this should help you group strings which differ by small spelling errors.
