Find the word behind the index numbers in Python

I have a question as I am new to NLP. I have a dataframe with 2 columns. The first holds a sentence, say "Hello World, how are you?", and the second holds a list of character index positions, e.g. [6, 7, 8, 9, 10], which represents the word "World" in the first sentence. Is there any function in Python that can recognize and return the word specified by the index numbers in each row?
Thank you
https://ibb.co/LRnWM7G

I think this is what you need (updated according to comments):
dataframe = [
    ["Hello World", [6, 7, 8, 9, 10]],
    ["Hello World", [6, 7, 8, 9]],
    ["Hello World", [7, 8, 9, 10]],
]
resultdataframe = []

def textbyindexlist(df):
    text = df[0]
    letterindex = df[1]
    newtext = ""
    for ind in letterindex:
        newtext = newtext + text[ind]  # append the character at each index
    return newtext

for i in dataframe:
    resultdataframe.append(textbyindexlist(i))

print(resultdataframe)
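Since the question mentions a DataFrame, the same idea can be applied row-wise with pandas. A minimal sketch (the column names `sentence` and `indices` are assumptions, not from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "sentence": ["Hello World, how are you?"],
    "indices": [[6, 7, 8, 9, 10]],
})

# Join the characters at the listed positions, one row at a time
df["word"] = df.apply(
    lambda row: "".join(row["sentence"][i] for i in row["indices"]),
    axis=1)
print(df["word"].tolist())
```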


Apply function to entire group in groupby instead of each row in groupby

I have a dataframe consisting of user_id, zip_code and text. I want to remove the values in text that are not unique within each (user_id, zip_code) pair. E.g., if
(user_id=2, zip_code=123) has text=["hello world"] and (user_id=2, zip_code=456) has text=["hello foo"], then I want to remove "hello" from both of those sentences.
My thought was to just groupby ["user_id", "zip_code"] and then apply the function to text, e.g.
import pandas as pd

def remove_common_words(sentences: pd.Series):
    all_words = sentences.sum()  # Get all the words for each (user, zip) pair
    common_words = pd.Series(all_words).value_counts()  # count each word (this line was missing above)
    keep_words = set(common_words[common_words == 1].index)  # Keep words that are only in one pair
    return sentences.apply(lambda x: set(x) - keep_words)

data.groupby(["user_id", "zip_code"])["text"].unique().apply(remove_common_words)
The issue is that the function seems to receive each value/row of a (user, zipcode) pair rather than the entire group, i.e. sentences is not a Series but a single row, so the first line all_words = sentences.sum() gathers the words of one row rather than the whole group.
Is there a way to make the input to apply be the entire grouped dataframe/series, and operate on that, instead of operating on each row at a time?
Example:
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "zip_code": [123, 456, 789, 111, 112, 334],
    "text": ["hello world", "hello foo", "bar foo baz",
             "hello world", "baz boom", "baz foo"]})

def print_values(sentences):
    print(sentences)

df.groupby(["user_id", "zip_code"])["text"].unique().apply(print_values)
# ["hello world"]
As seen, print(sentences) only prints the first row, not the entire group/Series like this:
0    ["hello world"]
1    ["hello foo"]
2    ["bar foo baz"]
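One way around the per-group apply entirely (a sketch of one possible approach, not from the original thread) is to explode the text into words, count in how many distinct (user_id, zip_code) pairs each word occurs, and keep only words that occur in exactly one pair:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "zip_code": [123, 456, 789, 111, 112, 334],
    "text": ["hello world", "hello foo", "bar foo baz",
             "hello world", "baz boom", "baz foo"]})

# One row per (user_id, zip_code, word)
words = df.assign(word=df["text"].str.split()).explode("word")

# Number of distinct pairs each word appears in
pair_counts = (words.drop_duplicates(["user_id", "zip_code", "word"])
                    .groupby("word").size())
unique_words = set(pair_counts[pair_counts == 1].index)

# Keep only the words that are unique to a single pair
df["text"] = df["text"].apply(
    lambda s: " ".join(w for w in s.split() if w in unique_words))
print(df["text"].tolist())
```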

How to remove Rows that have English (or a specific languages) sentences in pandas

I have a pandas data frame with 2 columns: the first contains Arabic sentences and the second contains labels (1, 0).
I want to remove all rows that contain English sentences.
Any suggestions?
Here is an example; I want to remove the second row:
إيطاليا لتسريع القرارات اللجوء المهاجرين، الترحيل [0]
Border Patrol Agents Recover 44 Migrants from Stash House [0]
الديمقراطيون مواجهة الانتخابات رفض عقد اجتماعات تاون هول [0]
شاهد لايف: احتفال ترامب "اجعل أمريكا عظيمة مرة أخرى" - بريتبارت [0]
المغني البريطاني إم آي إيه: إدارة ترامب مليئة بـ "كذابون باثولوجي" [0]
You can create an array of common English letters and remove any line that contains one of these letters, like this:
ENGletters = ['a', 'o', 'i']
df = df[~df['text_col'].str.contains('|'.join(ENGletters))]
You could make a list of common English words and detect when one is used.
letters = 'hello how a thing person'.split(' ')

def lookForLetters(text):
    text = text.lower().split(' ')  # lowercase so 'Hello' matches 'hello'
    for i in letters:
        if i in text:
            return True
    return False

print(lookForLetters('المغني البريطاني إم آي إيه: إدارة ترامب مليئة بـ "كذابون باثولوجي"'))
print(lookForLetters("Hello sir! How are you doing?"))
The first print line will return False and the second print line will return True.
Be sure to add more common English words to the first list.
You can use langdetect to filter your df.
from langdetect import detect
df['language'] = df['title'].apply(detect)
Then filter your df to remove non-Arabic titles:
df = df[df['language'] == 'ar']
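If pulling in langdetect is overkill, a rough heuristic (a sketch; the column name `title` is assumed as above) is to drop any row whose title contains Latin letters at all:

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["الديمقراطيون مواجهة الانتخابات",
              "Border Patrol Agents Recover 44 Migrants"],
    "label": [0, 0]})

# Rows whose title contains at least one Latin letter are treated as English
latin = df["title"].str.contains(r"[A-Za-z]", regex=True)
df = df[~latin]
print(len(df))
```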

Pandas: Truncate string in column based on substring pulled from other column (Python 3)

I have a dataframe with two pertinent columns, "rm_word" and "article."
Data Sample:
,grouping,fts,article,rm_word
0,"1",fts,"This is the article. This is a sentence. This is a sentence. This is a sentence. This goes on for awhile and that's super ***crazy***. It goes on and on.",crazy
I want to query the last 100 characters of each "article" to determine whether its row's respective "rm_word" appears. If it does, then I want to delete the entire sentence in which "rm_word" appears, as well as all the sentences that follow it, from the "article."
Desired Result (when "crazy" is the "rm_word"):
,grouping,fts,article,rm_word
0,"1",fts,"This is the article. This is a sentence. This is a sentence. This is a sentence.",crazy
This mask is able to determine when an article contains its "rm_word," but I'm having trouble with the sentence deletion bit.
mask = ([ (str(a) in b[-100:].lower()) for a,b in zip(df["rm_word"], df["article"])])
print (df.loc[mask])
Any help would be much appreciated! Thank you so much.
Does this work?
import pandas as pd

df = pd.DataFrame(
    columns=['article', 'rm_word'],
    data=[["This is the article. This is a sentence. This is a sentence. This is a sentence.", 'crazy'],
          ["This is the article. This is a sentence. This is a sentence. This is a sentence. This goes on for awhile and that's super crazy. It goes on and on.", 'crazy']]
)

def clean_article(x):
    # Nothing to do unless rm_word shows up in the last 100 characters
    if x['rm_word'] not in x['article'][-100:].lower():
        return x
    # Cut at rm_word, then drop the partial sentence it sat in
    article = x['article'].rsplit(x['rm_word'])[0]
    article = article.split('.')[:-1]
    x['article'] = '.'.join(article) + '.'
    return x

df = df.apply(lambda x: clean_article(x), axis=1)
df['article'].values
Returns
array(['This is the article. This is a sentence. This is a sentence. This is a sentence.',
'This is the article. This is a sentence. This is a sentence. This is a sentence.'],
dtype=object)
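The split('.') above can misfire on abbreviations or repeated periods; an alternative sketch using a regex sentence split (the boundary pattern is an assumption, not from the answer):

```python
import re

def truncate_article(article, rm_word):
    # Only act when rm_word appears in the last 100 characters
    if rm_word not in article[-100:].lower():
        return article
    # Split into sentences; keep everything before the one containing rm_word
    sentences = re.split(r"(?<=\.)\s+", article)
    keep = []
    for s in sentences:
        if rm_word in s.lower():
            break
        keep.append(s)
    return " ".join(keep)

print(truncate_article(
    "This is the article. This is a sentence. This is a sentence. "
    "This is a sentence. This goes on for awhile and that's super crazy. "
    "It goes on and on.", "crazy"))
```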

Bag of Words encoding for Python with vocabulary

I am trying to add new columns to my ML model's input. A numeric column should be created if a specific word is found in the text of the scraped data. For this I created a dummy script for testing:
import pandas as pd

bagOfWords = ["cool", "place"]
wordsFound = ""
mystring = "This is a cool new place"
mystring = mystring.lower()
for word in bagOfWords:
    if word in mystring:
        wordsFound = wordsFound + word + " "
print(wordsFound)
pd.get_dummies(wordsFound)
The output is
   cool place
0          1
This means there is one row "0" and a single column "cool place". This is not correct. The expected output would be:
   cool  place
0     1      1
Found a different solution, as I could not find any other way forward. It's a simple direct one-hot encoding: for every word I need, I add a new column to the dataframe and fill in the encoding directly.
vocabulary = ["achtung", "suchen"]
for word in vocabulary:
    df2[word] = 0
    for index, row in df2.iterrows():
        if word in row["title"].lower():
            df2.at[index, word] = 1  # set_value was removed in newer pandas; use .at
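The iterrows loop can also be replaced by vectorized string operations; a sketch of the same encoding (the column name `title` is assumed as above, and the sample data is made up for illustration):

```python
import pandas as pd

vocabulary = ["achtung", "suchen"]
df2 = pd.DataFrame({"title": ["Achtung: neue Stelle",
                              "Wir suchen Verstärkung",
                              "Nichts davon"]})

# One 0/1 column per vocabulary word, computed without an explicit row loop
for word in vocabulary:
    df2[word] = df2["title"].str.lower().str.contains(word, regex=False).astype(int)
print(df2[vocabulary].values.tolist())
```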

Replace a word in a string by indexing without "string replace function" - Python

Is there a way to replace a word within a string without using a "string replace function," e.g., str.replace(old, new)?
[out] = forecast('This snowy weather is so cold.','cold','awesome')
out => 'This snowy weather is so awesome.'
Here the word cold is replaced with awesome.
This is from my MATLAB homework, which I am trying to do in Python. When doing this in MATLAB we were not allowed to use strrep().
In MATLAB, I can use strfind to find the index and work from there. However, I noticed that there is a big difference between lists and strings: strings are immutable in Python, so I will likely have to convert to a different data type to work with it the way I want without using a string replace function.
Just for fun :)
st = 'This snowy weather is so cold .'.split()
given_word = 'awesome'
for i, word in enumerate(st):
    if word == 'cold':
        st[i] = given_word  # swap the word in place
        break  # stop after the first match
print(' '.join(st))
Here's another answer that might be closer to the solution you described using MATLAB:
st = 'This snowy weather is so cold.'
given_word = 'awesome'
word_to_replace = 'cold'
n = len(word_to_replace)
index_of_word_to_replace = st.find(word_to_replace)
print(st[:index_of_word_to_replace] + given_word + st[index_of_word_to_replace + n:])
You can convert your string into a list object, find the index of the word you want to replace and then replace the word.
sentence = "This snowy weather is so cold"
# Split the sentence into a list of the words
words = sentence.split(" ")
# Get the index of the word you want to replace
word_to_replace_index = words.index("cold")
# Replace the target word with the new word based on the index
words[word_to_replace_index] = "awesome"
# Generate a new sentence
new_sentence = ' '.join(words)
Using Regex and a list comprehension.
import re

def strReplace(sentence, toReplace, toReplaceWith):
    return " ".join([re.sub(toReplace, toReplaceWith, i) if re.search(toReplace, i) else i
                     for i in sentence.split()])

print(strReplace('This snowy weather is so cold.', 'cold', 'awesome'))
Output:
This snowy weather is so awesome.
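The find-and-slice idea above can be wrapped into a function with the signature from the question. A sketch that also handles repeated occurrences by looping:

```python
def forecast(sentence, word, replacement):
    # Locate each occurrence with str.find and splice the pieces with slicing
    out = sentence
    i = out.find(word)
    while i != -1:
        out = out[:i] + replacement + out[i + len(word):]
        i = out.find(word, i + len(replacement))
    return out

print(forecast('This snowy weather is so cold.', 'cold', 'awesome'))
```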
