Bag of Words encoding for Python with vocabulary

I am trying to add new feature columns to my ML model. A numeric column should be created for each specific word that is found in the text of the scraped data. For this I created a dummy script for testing.
import pandas as pd
bagOfWords = ["cool", "place"]
wordsFound = ""
mystring = "This is a cool new place"
mystring = mystring.lower()
for word in bagOfWords:
    if word in mystring:
        wordsFound = wordsFound + word + " "
print(wordsFound)
pd.get_dummies(wordsFound)
The output is
cool place
0 1
This means there is one sentence "0" and a single entry for "cool place". This is not correct. The expected output would be:
cool place
0 1 1

I found a different solution, as I could not find any way forward. It is a simple, direct one-hot encoding: for every word I need, I add a new column to the dataframe and fill in the encoding directly.
vocabulary = ["achtung", "suchen"]
for word in vocabulary:
    df2[word] = 0
    for index, row in df2.iterrows():
        if word in row["title"].lower():
            # set_value was removed in pandas 1.0; .at is the replacement
            df2.at[index, word] = 1
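
For reference, the same encoding can be sketched in plain Python, without iterating over dataframe rows. The helper name encode_bag_of_words and the sample titles are made up for illustration. Like the loop above, it uses substring matching, so "cool" would also match "cooler".

```python
def encode_bag_of_words(text, vocabulary):
    """Return {word: 1 or 0} marking which vocabulary words occur in text."""
    lowered = text.lower()
    # Substring test, same as `word in row["title"].lower()` above.
    return {word: int(word in lowered) for word in vocabulary}


vocabulary = ["cool", "place"]
titles = ["This is a cool new place", "Nothing to see here"]  # sample data

# One dict per title; each key becomes a column, each value a 0/1 flag.
rows = [encode_bag_of_words(title, vocabulary) for title in titles]
print(rows)
```

A list of dicts like rows can be passed to pd.DataFrame(rows) to get one 0/1 column per vocabulary word.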

Related

String search by coincidence?

I just wanted to know if there's a simple way to search a string by coincidence with another one in Python. Or if anyone knows how it could be done.
To make myself clear I'll do an example.
text_sample = "baguette is a french word"
words_to_match = ("baguete","wrd")
letters_to_match = ('b','a','g','u','t','e','w','r','d') # With just one 'e'
coincidences = sum(text_sample.count(x) for x in letters_to_match)
# coincidences = 14 Current output
# coincidences = 10 Expected output
My current method breaks words_to_match into single characters, as in letters_to_match, but then every occurrence of each of those letters in "baguette is a french word" is counted (coincidences = 14).
But I want to obtain coincidences = 10, where only the letters of "baguette is a french word" that line up with words_to_match are counted, i.e. by checking the similarity between words_to_match and the words in text_sample.
How do I get my expected output?

It looks like you need the length of the longest common subsequence (LCS). See the algorithm in the Wikipedia article for computing it. You may also be able to find a C extension which computes it quickly. A search for such packages turns up many results, including pylcs. After installation (pip install pylcs):
import pylcs
text_sample = "baguette is a french word"
words_to_match = ("baguete","wrd")
# assuming pylcs.lcs returns the subsequence length (lcs2 is the substring variant)
print(pylcs.lcs(text_sample, ''.join(words_to_match))) #: 10
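
If you would rather not install anything, here is a minimal pure-Python sketch of the standard dynamic-programming LCS length. The function name lcs_length is my own, and this is much slower than a C extension on long strings.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    # prev[j] is the LCS length of the part of a seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence that ends before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise skip one character from either string.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


text_sample = "baguette is a french word"
words_to_match = ("baguete", "wrd")
print(lcs_length(text_sample, "".join(words_to_match)))  # 10
```

Joining words_to_match without a separator gives 10 here because every letter of "baguetewrd" can be found in order inside text_sample.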

First, break words_to_match into a tuple of single letters:
letters = tuple("".join(words_to_match))
Then walk through text_sample and count the letters that appear in order:
x = 0
coincidence = 0
for i in text_sample:
    # stop advancing once every letter has been matched
    if x < len(letters) and letters[x] == i:
        x += 1
        coincidence += 1
Also, if the order does not matter, just do:
coincidence = 0
for i in text_sample:
    if i in letters: coincidence += 1
(Note that in some Python versions you may need to put the statement on its own line.)

Find the word behind the index numbers in python

I have a question, as I am new to NLP. I have a dataframe that consists of two columns. The first holds a sentence, say "Hello World, how are you?", and the second holds [6,7,8,9,10], the character index positions of the word "World" in that sentence. Is there a function in Python that can return, for each row, the word specified by the index numbers?
Thank you
https://ibb.co/LRnWM7G

I think it's what you need (updated according to comments)
dataframe = [["Hello World", [6, 7, 8, 9, 10]],
             ["Hello World", [6, 7, 8, 9]],
             ["Hello World", [7, 8, 9, 10]]]
resultdataframe = []
def textbyindexlist(df):
    text = df[0]
    letterindex = df[1]
    newtext = ""
    # concatenate the characters found at the given index positions
    for ind in letterindex:
        newtext = newtext + text[ind]
    return newtext
for i in dataframe:
    resultdataframe.append(textbyindexlist(i))
print(resultdataframe)

How to determine the number of negation words per sentence

I would like to know how to count how many negative words (no, not) and abbreviation (n't) there are in a sentence and in the whole text.
For number of sentences I am applying the following one:
df["sent"]=df['text'].str.count('[\w][\.!\?]')
However, this only gives me the number of sentences in a text. I need the number of negation words in each sentence as well as in the whole text.
Can you please give me some tips?
The expected output for text column is shown below
text                                   sent  count_n_s  count_tot
I haven't tried it yet                 1     1          1
I do not like it. What do you think?   2     0.5        1
It's marvellous!!!                     1     0          0
No, I prefer the other one.            2     1          1
count_n_s is given by counting the total number of negation words in the text, then dividing by the number of sentences.
I tried
split_w = re.split("\w+",df['text'])
neg_words=['no','not','n\'t']
words = [w for i,w in enumerate(split_w) if i and (split_w[i-1] in neg_words)]

This would get a count of total negations in the text (not for individual sentences):
import re
NEG = r"""(?:^(?:no|not)$)|n't"""
NEG_RE = re.compile(NEG, re.VERBOSE)
def get_count(text):
    count = 0
    for word in text:
        if NEG_RE.search(word):
            count += 1
    return count
df['text_list'] = df['text'].apply(lambda x: x.split())
df['count'] = df['text_list'].apply(get_count)

To get the count of negations for individual lines, use the code below. Words like haven't are not counted as they stand, because the search keeps only the leading word characters and so strips the n't; if you want them counted, add the stripped form (e.g. haven) to neg_words.
import re
str1 = '''I haven't tried it yet
I do not like it. What do you think?
It's marvellous!!!
No, I prefer the other one.'''
neg_words = ['no', 'not', "n't"]
for text in str1.split('\n'):
    split_w = re.split(r"\s+", text.lower())
    # to get rid of special characters such as the comma in 'No,' keep only the leading word characters
    matches = [re.search(r'^\w+', w) for w in split_w]
    # skip tokens with no leading word characters instead of failing on None
    split_w = [m.group(0) for m in matches if m]
    words = [w for w in split_w if w in neg_words]
    print(len(words))
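
Neither snippet above produces the per-sentence average the question asks for. A rough sketch of one way to get it follows, assuming that sentences end in ., ! or ? followed by whitespace and that a negation is the word no, the word not, or any word ending in n't. The names negation_stats, NEGATION_RE and SENTENCE_SPLIT_RE are made up for this example.

```python
import re

# Standalone "no" / "not", or any word ending in "n't" (haven't, don't, ...).
NEGATION_RE = re.compile(r"\b(?:no|not)\b|\b\w+n't\b", re.IGNORECASE)
# Split on whitespace that follows sentence-ending punctuation.
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")


def negation_stats(text):
    """Return sentence count, negations per sentence, and total negations."""
    sentences = [s for s in SENTENCE_SPLIT_RE.split(text.strip()) if s]
    per_sentence = [len(NEGATION_RE.findall(s)) for s in sentences]
    total = sum(per_sentence)
    # Average over sentences; an empty text has no sentences to average over.
    average = total / len(sentences) if sentences else 0.0
    return {"sent": len(sentences), "count_n_s": average, "count_tot": total}


print(negation_stats("I do not like it. What do you think?"))
```

For a dataframe, something like df['text'].apply(negation_stats) would give one dict per row, which can then be expanded into columns. Words such as cannot or never are not covered by this pattern and would need to be added.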

To replace internet acronyms in a dataframe using dictionary

I'm working on a text mining project where I'm trying to replace abbreviations, slang words and internet acronyms present in text (In a dataframe column) using a manually prepared dictionary.
The problem I'm facing is the code stops with the first word of the text in the dataframe column and does not replace it with lookup words from dict
Here is the sample dictionary and code I use:
abbr_dict = {"abt":"about", "b/c":"because"}
def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in abbr_dict:
            word = abbr_dict[word.lower()]
        new_words.append(word)
        new_text = " ".join(new_words)
        return new_text
df['new_text'] = df['text'].apply(_lookup_words)
Example Input:
df['text'] =
However, industry experts are divided abt whether a Bitcoin ETF is necessary or not.
Desired Output:
df['New_text'] =
However, industry experts are divided about whether a Bitcoin ETF is necessary or not.
Current Output:
df['New_text'] =
However

You can use apply with a lambda that combines split and join:
import pandas as pd
abbr_dict = {"abt":"about", "b/c":"because"}
df = pd.DataFrame({'text': ['However, industry experts are divided abt whether a Bitcoin ETF is necessary or not.']})
# look up the lowercased word so that the key matches the membership test
df['new_text'] = df['text'].apply(
    lambda row: " ".join(abbr_dict[w.lower()] if w.lower() in abbr_dict else w
                         for w in row.split()))
Or, to fix the code above, I think you need to move the join for new_text and the return statement outside of the for loop:
def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in abbr_dict:
            word = abbr_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words) # ..... change here
    return new_text # ..... change here also
df['new_text'] = df['text'].apply(_lookup_words)
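
One thing to watch with split(): an abbreviation followed by punctuation, such as "abt," or "b/c.", will not match a dictionary key. A possible sketch that handles this with a single compiled regex is below. The names expand_abbreviations and ABBR_RE are my own, and it assumes an abbreviation only counts when it is not glued to other word characters.

```python
import re

abbr_dict = {"abt": "about", "b/c": "because"}

# One alternation of all keys, longest first so overlapping keys prefer the
# longest match. The lookarounds keep "abt" from matching inside a longer word.
ABBR_RE = re.compile(
    r"(?<!\w)("
    + "|".join(re.escape(key) for key in sorted(abbr_dict, key=len, reverse=True))
    + r")(?!\w)",
    re.IGNORECASE,
)


def expand_abbreviations(text):
    """Replace every abbreviation in text with its expansion from abbr_dict."""
    return ABBR_RE.sub(lambda m: abbr_dict[m.group(1).lower()], text)


print(expand_abbreviations("Experts are divided abt this, b/c it is new."))
```

It can be applied to the column in the same way: df['new_text'] = df['text'].apply(expand_abbreviations).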

Replace a word in a String by indexing without "string replace function" -python

Is there a way to replace a word within a string without using a "string replace function," e.g., string.replace(string,word,replacement)?
[out] = forecast('This snowy weather is so cold.','cold','awesome')
out => 'This snowy weather is so awesome.'
Here the word cold is replaced with awesome.
This is from my MATLAB homework, which I am trying to do in Python. When doing this in MATLAB we were not allowed to use strrep().
In MATLAB, I can use strfind to find the index and work from there. However, I noticed that there is a big difference between lists and strings. Strings are immutable in Python, so I will likely have to import some module to convert the string to a different data type that I can work with the way I want, without using a string replace function.

just for fun :)
st = 'This snowy weather is so cold .'.split()
given_word = 'awesome'
for i, word in enumerate(st):
    if word == 'cold':
        # overwrite the matched word in place
        st[i] = given_word
        break # break if we found first word
print(' '.join(st))

Here's another answer that might be closer to the solution you described using MATLAB:
st = 'This snow weather is so cold.'
given_word = 'awesome'
word_to_replace = 'cold'
n = len(word_to_replace)
index_of_word_to_replace = st.find(word_to_replace)
print(st[:index_of_word_to_replace] + given_word + st[index_of_word_to_replace + n:])

You can convert your string into a list object, find the index of the word you want to replace and then replace the word.
sentence = "This snowy weather is so cold"
# Split the sentence into a list of the words
words = sentence.split(" ")
# Get the index of the word you want to replace
word_to_replace_index = words.index("cold")
# Replace the target word with the new word based on the index
words[word_to_replace_index] = "awesome"
# Generate a new sentence
new_sentence = ' '.join(words)

Using Regex and a list comprehension.
import re
def strReplace(sentence, toReplace, toReplaceWith):
    return " ".join([re.sub(toReplace, toReplaceWith, i) if re.search(toReplace, i) else i for i in sentence.split()])
print(strReplace('This snowy weather is so cold.', 'cold', 'awesome'))
Output:
This snowy weather is so awesome.
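
For completeness, here is a sketch of the forecast function from the question that uses only str.find and slicing and replaces every occurrence, not just the first. The signature is taken from the MATLAB call in the question; the body is just one possible way to write it.

```python
def forecast(sentence, word, replacement):
    """Replace every occurrence of word in sentence using find and slicing."""
    if not word:
        return sentence  # nothing to search for
    pieces = []
    start = 0
    while True:
        idx = sentence.find(word, start)
        if idx == -1:
            # No more matches: keep the rest of the sentence unchanged.
            pieces.append(sentence[start:])
            break
        # Keep the text before the match, then add the replacement.
        pieces.append(sentence[start:idx])
        pieces.append(replacement)
        start = idx + len(word)
    return "".join(pieces)


print(forecast('This snowy weather is so cold.', 'cold', 'awesome'))
# This snowy weather is so awesome.
```

Since strings are immutable, the function collects the pieces in a list and joins them once at the end instead of modifying the string in place.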
