I have a very large data frame with two columns called sentence1 and sentence2.
I am trying to make a new column with the words that differ between two sentences, for example:
sentence1=c("This is sentence one", "This is sentence two", "This is sentence three")
sentence2=c("This is the sentence four", "This is the sentence five", "This is the sentence six")
df = as.data.frame(cbind(sentence1,sentence2))
My data frame has the following structure:
ID sentence1 sentence2
1 This is sentence one This is the sentence four
2 This is sentence two This is the sentence five
3 This is sentence three This is the sentence six
And my expected result is:
ID sentence1 sentence2 Expected_Result
1 This is ... This is ... one the four
2 This is ... This is ... two the five
3 This is ... This is ... three the six
In R I was trying to split the sentences and then get the elements that differ between the two lists, something like:
df$split_Sentence1<-strsplit(df$sentence1, split=" ")
df$split_Sentence2<-strsplit(df$sentence2, split=" ")
df$Dif<-setdiff(df$split_Sentence1, df$split_Sentence2)
But this approach does not work when applying setdiff...
In Python I was trying to apply NLTK, getting the tokens first and then extracting the difference between the two lists, something like:
from nltk.tokenize import word_tokenize
df['tokensS1'] = df.sentence1.apply(lambda x: word_tokenize(x))
df['tokensS2'] = df.sentence2.apply(lambda x: word_tokenize(x))
And at this point I cannot find a function that gives me the result I need.
I hope you can help me. Thanks
Here's an R solution.
I've created an exclusiveWords function that finds the unique words between the two sets, and returns a 'sentence' made up of those words. I've wrapped it in Vectorize() so that it works on all rows of the data.frame at once.
df = as.data.frame(cbind(sentence1,sentence2), stringsAsFactors = F)
exclusiveWords <- function(x, y){
  x <- strsplit(x, " ")[[1]]
  y <- strsplit(y, " ")[[1]]
  u <- union(x, y)
  u <- union(setdiff(u, x), setdiff(u, y))
  return(paste0(u, collapse = " "))
}
exclusiveWords <- Vectorize(exclusiveWords)
df$result <- exclusiveWords(df$sentence1, df$sentence2)
df
# sentence1 sentence2 result
# 1 This is sentence one This is the sentence four the four one
# 2 This is sentence two This is the sentence five the five two
# 3 This is sentence three This is the sentence six the six three
Essentially the same as the answer from @SymbolixAU, written as an apply function. Note this assumes the split_Sentence1 and split_Sentence2 list columns created with strsplit in the question already exist on df.
df$Dif <- apply(df, 1, function(r) {
paste(setdiff(union (unlist(r[['split_Sentence1']]), unlist(r[['split_Sentence2']])),
intersect(unlist(r[['split_Sentence1']]), unlist(r[['split_Sentence2']]))),
collapse = " ")
})
In Python, you can build a function that treats words in a sentence as a set and calculates the set theoretical exclusive 'or' (a set of words that are in one sentence but not in the other):
df.apply(lambda x:
set(word_tokenize(x['sentence1'])) \
^ set(word_tokenize(x['sentence2'])), axis=1)
The result is a Series of sets:
#0 {one, the, four}
#1 {the, two, five}
#2 {the, three, six}
#dtype: object
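If you would rather have a plain string column like the Expected_Result in the question, the sets can be joined back together. A minimal sketch, reusing word_tokenize from the question (note that set order is arbitrary, so the word order in the result is not guaranteed):

df['Expected_Result'] = df.apply(
    lambda x: " ".join(set(word_tokenize(x['sentence1']))
                       ^ set(word_tokenize(x['sentence2']))),
    axis=1)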
Related
I have a dataframe consisting of user_id, zip_code and text. I want to remove the values in text that are not unique within each (user_id, zip_code) pair, e.g. if
(user_id=2, zip_code=123) has text=["hello world"] and (user_id=2, zip_code=456) has text=["hello foo"], then I want to remove "hello" from both of those sentences.
My thought was to just groupby ["user_id", "zip_code"] and then apply the function to text, e.g.
import pandas as pd
def remove_common_words(sentences: pd.Series):
    all_words = sentences.sum()  # Get all the words for each (user, zip) pair
    common_words = pd.Series(all_words).value_counts()  # Count how often each word occurs
    keep_words = set(common_words[common_words == 1].index)  # Keep words that appear in only one pair
    return sentences.apply(lambda x: set(x) - keep_words)

data.groupby(["user_id", "zip_code"])["text"].unique().apply(remove_common_words)
The issue is that the function seems to receive each value/row of the (user, zipcode) pair rather than the entire group, i.e. sentences is not a Series but a single row, so the first line all_words = sentences.sum() does not gather all the words for the group, only those of each row.
Is there a way to make the input to apply be the entire grouped dataframe/series and operate on that, instead of operating on one row at a time?
Example:
df = pd.DataFrame({
"user_id":[1,1,1,2,2,2],
"zip_code":[123,456,789,111,112,334],
"text":["hello world","hello foo", "bar foo baz",
"hello world", "baz boom", "baz foo"]})
def print_values(sentences):
    print(sentences)

df.groupby(["user_id","zip_code"])["text"].unique().apply(print_values)
# ["hello world"]
As seen, print(sentences) only prints the first row, not the entire group/series like this:
0 ["hello world"]
1 ["hello foo"]
2 ["bar foo baz"]
I have a question as I am new to NLP. I have a dataframe consisting of 2 columns. The first has a sentence, let's say "Hello World, how are you?", and the second column has [6,7,8,9,10], which represents the word "World" (index positions) in the first sentence. Is there any function in Python that would let me recognize and extract the word specified by the index numbers in each row?
Thank you
https://ibb.co/LRnWM7G
I think it's what you need (updated according to comments)
# The data is represented here as a plain list of [text, index_list] pairs.
dataframe = [["Hello World", [6, 7, 8, 9, 10]],
             ["Hello World", [6, 7, 8, 9]],
             ["Hello World", [7, 8, 9, 10]]]
resultdataframe = []

def textbyindexlist(df):
    text = df[0]
    letterindex = df[1]
    newtext = ""
    for ind in letterindex:
        newtext = newtext + text[ind]  # append the character at each index position
    return newtext

for i in dataframe:
    resultdataframe.append(textbyindexlist(i))

print(resultdataframe)
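Since the question mentions a dataframe, here is a small pandas sketch of the same idea; the column names 'sentence' and 'indices' are assumptions for illustration:

import pandas as pd

df = pd.DataFrame({
    "sentence": ["Hello World, how are you?"],
    "indices": [[6, 7, 8, 9, 10]],
})

# Pick the characters at the given index positions and join them into a word.
df["word"] = df.apply(
    lambda row: "".join(row["sentence"][i] for i in row["indices"]), axis=1)
print(df["word"])  # 0    World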
I have a pandas data frame which has 2 columns: the first contains Arabic sentences and the second contains labels (1, 0).
I want to remove all rows that contain English sentences.
Any suggestions?
Here is an example; I want to remove the second row:
إيطاليا لتسريع القرارات اللجوء المهاجرين، الترحيل [0]
Border Patrol Agents Recover 44 Migrants from Stash House [0]
الديمقراطيون مواجهة الانتخابات رفض عقد اجتماعات تاون هول [0]
شاهد لايف: احتفال ترامب "اجعل أمريكا عظيمة مرة أخرى" - بريتبارت [0]
المغني البريطاني إم آي إيه: إدارة ترامب مليئة بـ "كذابون باثولوجي" [0]
You can create an array of common English letters and remove any line that contains one of these letters, like this:
ENGletters = ['a', 'o', 'i']
df = df[~df['text_col'].str.contains('|'.join(ENGletters))]
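If the intent is to drop rows containing any Latin letter at all (an assumption that goes beyond the three letters above), a regex character class covers the whole alphabet:

df = df[~df['text_col'].str.contains('[A-Za-z]', regex=True)]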
You could make a list of English words and detect when a word is used.
letters = 'hello how a thing person'.split(' ')

def lookForLetters(text):
    # Compare the lower-cased words of the sentence against the word list.
    text = text.lower().split(' ')
    for i in letters:
        if i in text:
            return True
    return False

print(lookForLetters('المغني البريطاني إم آي إيه: إدارة ترامب مليئة بـ "كذابون باثولوجي"'))
print(lookForLetters("Hello sir! How are you doing?"))
The first call returns False and the second call returns True.
Be sure to add more common English words to the first list.
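To actually filter the dataframe with this check, something like the following should work (the column name text_col is borrowed from the other answer and is an assumption):

df = df[~df['text_col'].apply(lookForLetters)]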
You can use langdetect to filter your df.
from langdetect import detect
df['language'] = df['title'].apply(detect)
Then filter your df to remove non-Arabic titles:
df = df[df['language']=='ar']
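One caveat: detect() raises a LangDetectException on strings it cannot classify (e.g. empty or purely numeric titles), so a guarded wrapper may be safer. A sketch:

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def safe_detect(text):
    # Fall back to a placeholder label when detection fails.
    try:
        return detect(text)
    except LangDetectException:
        return "unknown"

df['language'] = df['title'].apply(safe_detect)
df = df[df['language'] == 'ar']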
I have the same problem that was discussed in this link
Python extracting sentence containing 2 words
but the difference is that I need to extract only the sentences that contain the two words within a defined search window size. For example:
sentences = [ 'There was peace and happiness','hello every one',' How to Find Inner Peace ,love and Happiness ','Inner peace is closely related to happiness']
search_words= ['peace','happiness']
windows_size = 3  # search only the three words after the first word, 'peace'
#output must be :
output= ['There was peace and happiness',' How to Find Inner Peace love and Happiness ']
Here is a crude solution.
def search(sentences, keyword1, keyword2, window=3):
    res = []
    for sentence in sentences:
        words = sentence.lower().split(" ")
        if keyword1 in words and keyword2 in words:
            keyword1_idx = words.index(keyword1)
            keyword2_idx = words.index(keyword2)
            if keyword2_idx - keyword1_idx <= window:
                res.append(sentence)
    return res
Given a sentences list and two keywords, keyword1 and keyword2, we iterate through the sentences list one by one. We split the sentence into words, assuming that the words are separated by a single space. Then, after performing a cursory check of whether or not both keywords are present in the words list, we find the index of each keyword in words to make sure that the indices are at most window apart, i.e. the words are close together within window words. We append only the sentences that satisfy this condition to the res list, and return that result.
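Run against the question's data, this reproduces the expected output (the last sentence is skipped because the keywords are five positions apart, which exceeds the window of 3):

sentences = ['There was peace and happiness', 'hello every one',
             ' How to Find Inner Peace ,love and Happiness ',
             'Inner peace is closely related to happiness']

print(search(sentences, 'peace', 'happiness', window=3))
# ['There was peace and happiness', ' How to Find Inner Peace ,love and Happiness ']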
I have a Pandas DataFrame (or a Series, given that I'm just using one column) that contains strings. I also have a list of words. For each word in this list, I want to check how many different rows it appears in at least once. For example:
words = ['hi', 'bye', 'foo', 'bar']
df = pd.Series(["hi hi hi bye foo",
"bye bye bye bye",
"bar foo hi bar",
"hi bye foo bar"])
In this case, the output should be
0 hi 3
1 bye 3
2 foo 3
3 bar 2
Because "hi" appears in three different rows (1st, 3rd and 4th), "bar" appears in two (3rd and 4th), and so on.
I came up with the following way to do this:
word_appearances = {}
for word in words:
    appearances = df.str.count(word).clip(upper=1).sum()
    word_appearances.update({word: appearances})
pd.DataFrame(word_appearances.items())
This works fine, but the problem is that I have a rather long list of words (around 40,000), around 30,000 rows to check and strings that are not as short as the ones I used in the example. When I try my approach with my real data, it takes forever to run. Is there a way to do this in a more efficient way?
Try a list comprehension with str.contains and sum:
df_out = pd.DataFrame([[word, sum(df.str.contains(word))] for word in words],
columns=['word', 'word_count'])
Out[58]:
word word_count
0 hi 3
1 bye 3
2 foo 3
3 bar 2
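If any of the search words could contain regex metacharacters, the same idea works with plain substring matching by passing regex=False to str.contains, which is also slightly faster:

df_out = pd.DataFrame(
    [[word, df.str.contains(word, regex=False).sum()] for word in words],
    columns=['word', 'word_count'])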
word_appearances = {}
for word in words:
    appearances = df.str.count(word).clip(upper=1).sum()
    word_appearances[word] = appearances

pd.DataFrame.from_dict(word_appearances, columns=['Frequency'], orient='index')