Python text document similarities (w/o libraries) [closed] - python

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I need to write a vanilla Python program (without libraries) that computes similarities between text documents.
The program takes documents as an input and computes a dictionary (matrix) for words of the given input. Each document consists of a sentence and when a new document goes into the program, we need to compare it to the other documents in order to find similar documents. See example below:
Given text input:
input_text = ["Why I like music", "Beer and music is my favorite combination",
              "The sun is shining", "How to dance in GTA5"]
The sentences have to be transformed into vectors, see example:
Hope you can help.

Here are some ideas:
Use new_str = sentence.upper() so that "beer" and "Beer" are treated as the same word (if you need this).
Use words = new_str.split() to make a list of the words in your string.
Use unique_words = set(words) to get rid of duplicate words, if needed.
Start with an empty word_list. Copy the first set into word_list. In the following steps, loop over the entries in each new set and check whether they are already part of word_list:
    for word in unique_words:
        if word not in word_list:
            word_list.append(word)
Now you can make a multi-hot vector from your sentence (1 if word_list[i] is in the sentence, else 0).
Don't forget to make your existing multi-hot vectors longer (pad them with additional zeros) whenever you add a word to word_list.
Last step: make a matrix from your vectors.
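The steps above could be sketched roughly like this. The function names are made up for illustration. The answer stops at building the vectors, so the cosine similarity at the end is an assumption about how you might compare them, not something the answer prescribes. The vocabulary is built from all documents first, which avoids having to pad older vectors afterwards.

```python
def tokenize(sentence):
    # Upper-case so that "beer" and "Beer" count as the same word.
    return sentence.upper().split()


def build_vocabulary(documents):
    # Collect every distinct word, in order of first appearance.
    word_list = []
    for doc in documents:
        for word in tokenize(doc):
            if word not in word_list:
                word_list.append(word)
    return word_list


def multi_hot(sentence, word_list):
    # 1 if the vocabulary word occurs in the sentence, else 0.
    present = set(tokenize(sentence))
    return [1 if word in present else 0 for word in word_list]


def cosine_similarity(a, b):
    # Assumed similarity measure; any other vector measure would also work.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


input_text = ["Why I like music", "Beer and music is my favorite combination",
              "The sun is shining", "How to dance in GTA5"]

word_list = build_vocabulary(input_text)
matrix = [multi_hot(doc, word_list) for doc in input_text]

# Compare the first document with every other one.
for i in range(1, len(matrix)):
    print(i, cosine_similarity(matrix[0], matrix[i]))
```

A new document can be compared the same way: rebuild the vocabulary with the new sentence included, regenerate the vectors, and compute the similarity against each existing row.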

Related

Looking for Words in a List with Similar Letters [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I have a list of types of cheese, and I want to be able to search for gouda by typing just "g" and "o" instead of the full word.
I've looked for solutions, but none are exactly what I am looking for. Maybe this is something common, but I only started with Python a week ago and don't know many of the terms.
For some reason this got cancelled, so I'm writing this paragraph so that the person who answered can answer again.
Here is a link to another StackOverflow post I found: Link.
This explains what I think you are looking for in your problem.
This code will print gouda from the wordlist:
wordlist = ['gouda', 'miss', 'lake', 'que', 'mess']
letters = set('g')
for word in wordlist:
    if letters & set(word):
        print(word)
All you have to do is put the letters you want to search for into the letters variable (inside the brackets), and it will print the words that contain at least one of those letters.
For example, I added gouda (your example) to this list. If you set letters to 'g', it searches the wordlist for any words that contain the letter g. In this case it prints gouda, as it is the only word that contains the letter 'g'.
The only downside is that if you enter 'ms' to search this wordlist you will get two results, miss and mess, as they both contain the letters 'm' and 's'. In some cases you will have to be more specific if you only want one word to be returned.
Note: this is not my code, I got it from the post linked here, and above.
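Since the question mentions typing "g" and "o", here are two stricter variants as a rough sketch. Both function names are hypothetical, and which behaviour is wanted is an assumption on my part: the first requires every typed letter to appear somewhere in the word, the second requires the word to start with what was typed.

```python
wordlist = ['gouda', 'miss', 'lake', 'que', 'mess']


def words_with_all_letters(letters, words):
    # Keep words that contain every typed letter (order does not matter).
    required = set(letters)
    return [word for word in words if required <= set(word)]


def words_with_prefix(prefix, words):
    # Keep words that begin with the typed characters.
    return [word for word in words if word.startswith(prefix)]


print(words_with_all_letters('go', wordlist))  # ['gouda']
print(words_with_prefix('go', wordlist))       # ['gouda']
```

The subset test (`<=`) is the stricter counterpart of the `&` test above, which is true as soon as a single letter matches.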

Search multi word strings in a single word list in python [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I have a list of words that I store in a set() for fast lookups such as:
one
two
three
I want to check whether a given string (e.g. 'one three') can be written using only the words in the dictionary (it would be a multiword anagram).
My first idea to accomplish this would be to create a new wordlist such as:
one
two
three
one two
one three
two three
I would then look up the matching string. I see some flaws with this approach:
The generated wordlist will be very large, especially if I decide to add three-word combinations.
I'm not sure of the best way to create the wordlist.
In the end, the solution proposed (thanks @all) is to split the multiword string instead and check whether each member is in the wordlist.
If your words are in a set, lookup is constant time. There's no need to generate all permutations of the words. With the word list in a set, you can split the string into words and check that all of them are in the set:
words = {'one', 'two','three'}
sentence = "one two two three"
all(s in words for s in sentence.split())
# True
sentence = "one two two three four"
all(s in words for s in sentence.split())
# False
If you store all combinations of words in the set, the set is likely to grow exponentially without providing much value. To check if a particular string can be made using words from the set, we should check each word in the string at runtime:
def words_in_set(my_str, words_set):
    words = my_str.split()
    return all(word in words_set for word in words)
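One edge case worth knowing: all() returns True for an empty sequence, so an empty or whitespace-only string passes the check above. If that is not what you want, a small variant (the name nonempty_words_in_set is just for illustration) could look like this:

```python
def nonempty_words_in_set(my_str, words_set):
    words = my_str.split()
    # all() is True for an empty list, so reject empty input explicitly.
    return bool(words) and all(word in words_set for word in words)


words = {'one', 'two', 'three'}
print(nonempty_words_in_set("one three", words))
print(nonempty_words_in_set("", words))
```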

Compute lengths of words inside string of a list Python [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 5 years ago.
I'm looking for a way to calculate the mean number of letters per sentence in my list. I'm trying to split the strings by whitespace and then count the length of each word, but I'm not able to.
Any guidance would be helpful.
I am not going to write the code for you, but this will probably help. If I understand the question correctly, you are trying to take a big block of text (a paragraph), split it into sentences, and then get the average number of characters in each sentence.
So:
1) Break text block into sentences. Here is a post that should help you do that.
2) Count the letters in a sentence. Here is another post that will help remove the whitespace from the sentence. If you need to remove all the punctuation (everything except letters) from the string check out this post. Now you have a string that you can simply do a len(sentence_string) to get how many characters are in the sentence.
Next time please post the code you have tried, the errors you have gotten, and the text of the data you are trying to use. DON'T post pictures of words. It makes it a lot harder to help when we can't just copy and paste everything and debug it ourselves.
This code would work:
    def mean_word_length(sentence):
        # Assumes the sentence contains at least one word.
        words = sentence.split()
        total = 0
        for word in words:
            total += len(word)
        return float(total) / len(words)
The code splits on whitespace, then adds up the lengths of the words and divides by the number of words in the sentence. This gives the average word length.
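If what you need is the mean number of letters per sentence across the whole list, as the question describes, a sketch could look like the following. The function names and the sample list are made up, and counting only alphabetic characters (so spaces, digits and punctuation are ignored) is an assumption about what "letters" means here.

```python
def letter_count(sentence):
    # Count alphabetic characters only; spaces and punctuation are skipped.
    return sum(1 for ch in sentence if ch.isalpha())


def mean_letters_per_sentence(sentences):
    # Guard against an empty list to avoid dividing by zero.
    if not sentences:
        return 0.0
    return sum(letter_count(s) for s in sentences) / len(sentences)


sentences = ["The sun is shining", "Why I like music"]
print(mean_letters_per_sentence(sentences))  # 14.0
```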

pseudo code in python [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I am fairly new to Python.
One of the exercises I have been given is to write Python pseudocode for the following problem:
Write an algorithm that, given a dictionary (list) of words, finds up to 10 anagrams of a given word.
I'm stuck for ideas on how to solve this.
Currently I have (it's not even proper pseudocode):
# Go through the words in the list
# put each letter in some sort of array
# Find words with the letters from this array
I guess this is way too general. I have searched online for specific functions I could use but have not found any.
Can you suggest specific functions that would help me write slightly more specific pseudocode?
Here is some help, without writing the code for you:
# define a key method
#   determine the key using each set of letters, such as the letters of a word in
#   alphabetical order
#   keyof("word") returns "dorw"
#   keyof("dad") returns "add"
#   keyof("add") returns "add"
# ingest the word set method
#   put the word set into a dictionary which maps
#   key -> list of up to 10 anagrams
# get anagrams method
#   accept a word as a parameter
#   convert the word to its key
#   look up the key in the dictionary
#   return the up to 10 anagrams
# test case: add "dad" and "add" to the word set.
#   getting anagrams for "dad" should return "dad" and "add"
# test case: add "palm" and "lamp" to the word set.
#   getting anagrams for "palm" should return "palm" and "lamp"
# consider storing 11 anagrams in the list
#   a01, a02, a03, a04, a05, a06, a07, a08, a09, a10, a11.
#   Then if a01 is provided, you can return a02-a11, which is 10 anagrams
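For anyone who wants to see the outline as running Python, here is one possible sketch. The names key_of, build_index and get_anagrams are hypothetical. It departs from the outline in two ways, both assumptions on my part: it stores every anagram instead of capping the list at 10 or 11, and it leaves the queried word out of the result (following the last note rather than the test cases, which return the word itself too).

```python
MAX_ANAGRAMS = 10


def key_of(word):
    # Letters in alphabetical order: all anagrams share the same key.
    return "".join(sorted(word))


def build_index(words):
    # Map key -> list of words that have that key.
    index = {}
    for word in words:
        index.setdefault(key_of(word), []).append(word)
    return index


def get_anagrams(word, index, limit=MAX_ANAGRAMS):
    # Look up the key, drop the queried word itself, cap the result.
    candidates = index.get(key_of(word), [])
    return [w for w in candidates if w != word][:limit]


index = build_index(["dad", "add", "palm", "lamp"])
print(get_anagrams("dad", index))   # ['add']
print(get_anagrams("palm", index))  # ['lamp']
```

Building the index once makes each lookup a single dictionary access, instead of comparing the query against every word in the list.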

Elongated words and combination of words in a sentence python [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I have a few lines such as:
biggestfoolofall, sooo, hiiieee, footballfan
If you notice the pattern above, some items are several words combined into one, such as "biggestfoolofall" and "footballfan".
1) I want to know how I can detect that one word is actually made up of multiple words.
2) sooo and hiiieee are elongated words. I want to detect elongated words in Python. How can I do that?
I am new to Python, so I got stuck at this part. Also, if you can share any helpful sites for learning for loops, string splitting, etc., that would be very helpful.
I guess you have a list of valid words. So iterate over your words and check if they are in your line:
    for word in words:  # iterate over all valid words
        if word in line:  # if a valid word is found in line
            print('I found a valid word: ' + word)
            line = line.replace(word, '')  # strings are immutable, so reassign
At the end, you will have found all valid words, and only junk characters are left in your line variable.
See string methods for further string operations.
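For both parts of the question, a rough sketch might look like this. Everything here is an assumption for illustration: the function names, the rule that "elongated" means the same character three or more times in a row, and the small valid_words set, which stands in for whatever real word list you have.

```python
import re

# Same character repeated three or more times in a row, e.g. "ooo".
ELONGATED = re.compile(r'(.)\1{2,}')


def is_elongated(word):
    return ELONGATED.search(word) is not None


def split_compound(text, valid_words):
    # best[i] holds one way to split text[:i] into valid words, or None.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in valid_words:
                best[end] = best[start] + [piece]
                break
    return best[len(text)]


valid_words = {"biggest", "fool", "of", "all", "football", "fan"}

for item in ["biggestfoolofall", "sooo", "hiiieee", "footballfan"]:
    print(item, is_elongated(item), split_compound(item, valid_words))
```

Unlike the replace loop above, split_compound only reports a result when the whole string can be covered by valid words, and returns None otherwise.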
