I have two columns from two different CSV files. One column contains words; the other contains sentences. I import each column as a list. Now I want to find all sentences in which any of the imported words occurs, and then write each matching sentence and word back into a CSV.
Does anyone have an idea how to design this words-vs-sentences comparison in Python?
Thanks in advance :-)
Given that the two columns are separate lists and you want a list to write back to a CSV, use something like:
l1 = ['bill has a ball', 'tom has a toy', 'fred has a frog']
l2 = ['ball', 'frog']

sentences_with_words = []
for s in l1:
    for w in l2:
        if w in s:
            sentences_with_words.append([s, w])

print(sentences_with_words)
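To write the pairs back out, here is a minimal sketch using the standard csv module (the output file name matches.csv is just an assumption):

import csv

with open('matches.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['sentence', 'word'])  # header row
    writer.writerows(sentences_with_words)  # one [sentence, word] pair per row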
I would like to use something like a VLOOKUP/map function in Python.
I have only a portion of the entire name of some companies. I would like to know whether each company is in the dataset, as in the following example.
Thank you
I can recreate the results of checking one list against another, but it's not very clear or logical what your match criteria are. "john usa" is a successful match with "aviation john" on the basis that "john" appears in both. But would "john usa" also constitute a match with "usa mark sas", since "usa" appears in both? What about hyphens, commas, etc.?
It would help if this was cleared up.
In any case, I hope the following will help. Good luck:
import pandas as pd

# create two lists of tuples based on the existing dataframes
check_list = list(df_check.to_records(index=False))
full_list = list(df_full.to_records(index=False))

# create a set - entries in a set are unique
results = set()
for check in check_list:  # for each record to check...
    for search_word in check[0].split(" "):  # take the first column and split it into its words, using space as a delimiter
        found = any(search_word in rec[0] for rec in full_list)  # is the word a substring of any record in full_list? True or False
        results.add((check[0], found))  # add the record we checked with its result (the set avoids duplicate entries)

# build a dataframe based on the results
df_results = pd.DataFrame(results, columns=["check", "found"])
# a terser alternative, though .isin flags exact (whole-string) matches only:
df1['in DATASET'] = df1['NAME'].isin(df2['FULL DATASET'])
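If substring matching is what you actually need (since the names are only partial), here is a hedged sketch along the lines of the loop above, assuming df1['NAME'] holds the partial names and df2['FULL DATASET'] the complete ones:

import re

# True where the partial name occurs as a substring of any full name;
# re.escape guards against regex metacharacters in company names
df1['in DATASET'] = df1['NAME'].apply(
    lambda name: df2['FULL DATASET'].str.contains(re.escape(name)).any()
)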
How can I select rows from a dataframe using a list of words as a reference?
E.g. I have a dataframe, df_business, and each item in the last column is a string containing comma-separated categories, like this:
categories: "Restaurants, Burgers, Coffee & Tea, Fast Food, Food"
I tried this, but it only gives me the rows for the businesses whose categories contain ONLY the word Coffee:
bus_int = df_business.loc[(df_business['categories'].isin(['Coffee']))]
How can I get the businesses containing my word even when it's present among others, as shown above?
What you want is the contains method:
bus_int = df_business.loc[df_business.categories.str.contains('Coffee', regex=False)]
(The default is to treat the supplied argument as a regular expression, which you don't need if you're simply looking for a substring like this.)
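If you need to match any of several categories, or ignore case, a variant along these lines should work ('Coffee|Tea' and case=False are just illustrative; with regex=True, the default, the argument is treated as a regular expression, so | means "or"):

bus_int = df_business.loc[df_business.categories.str.contains('Coffee|Tea', case=False, regex=True)]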
Just use .isin() with the list of values you want to match. Example:
fruits = ['apple', 'mango', 'banana']  # creating a list (named fruits so it doesn't shadow the built-in list)
df[df['Fruits_name'].isin(fruits)]
# this finds all the rows whose Fruits_name is one of these 3 fruits;
# note that .isin matches whole cell values, not substrings
I am passing about 11,000 texts, extracted from a CSV as a dataframe, to the remove_unique function below. In the function I find the unique words across the entire column and save them as a list named "unique".
Using a regex I then try to remove those unique words from each row of the (single-column) pandas dataframe, but the unique words are not removed as expected; instead all the words are removed and empty "text" is returned.
import re

def remove_unique(text):
    # gets all the unique words in the entire corpus
    unique = list(set(text.str.findall(r"\w+").sum()))
    pattern = re.compile(r'\b(?:{})\b'.format('|'.join(unique)))
    # ideally this should remove the unique words from the corpus
    text = text.apply(lambda x: re.sub(pattern, '', x))
    return text
Can somebody point out what the issue is?
Before:
0    card layout broken window resiz unabl error ex...
1    chart lower rang border patch merg recheck...
2    left align text team close c...
3    descript sma...
4    list disappear navig make contain...
Name: description_plus, dtype: object

After:
0
1    ...
2
3
4    ...
Name: description_plus, dtype: object
Not sure I totally understand. Are you trying to see if a word appears multiple times throughout the entire column?
Maybe
import re

a_list = list(df["column"].values)  # column to list
string = " ".join(a_list)  # list of rows to a single string
words = re.findall(r"(\w+)", string)  # split into a single list of words
print([item for item in words if words.count(item) > 1])  # words that appear multiple times
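As for why the original function returns empty text: unique ends up containing every distinct word in the corpus, not just the words that occur once, so the pattern matches everything. Here is a hedged sketch of what the question seems to be after, dropping only the words that occur exactly once across the whole column (collections.Counter avoids the quadratic words.count lookups; df["column"] is assumed as above):

from collections import Counter

# count every word across the whole column
counts = Counter(re.findall(r"\w+", " ".join(df["column"].astype(str))))
singletons = {w for w, n in counts.items() if n == 1}  # words appearing only once

# rebuild each row without the singleton words
df["column"] = df["column"].apply(
    lambda row: " ".join(w for w in row.split() if w not in singletons)
)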
books_over10['Keywords'] = ""
r = Rake()  # uses stopwords for English from NLTK, and all punctuation characters
for index, row in books_over10.iterrows():
    a = r.extract_keywords_from_text(row['bookTitle'])
    c = r.get_ranked_phrases()  # to get keyword phrases ranked with scores, highest to lowest
    books_over10.at[index, 'Keywords'] = c
books_over10.head()
I am using the above code to process all rows, extract keywords from each row's bookTitle column, and insert them as a list into a new column named Keywords on the same row. The question is whether there is a more efficient way to do this without iterating over all rows, because it takes a lot of time. Any help would be appreciated. Thanks in advance!
Solution by Changming:
def extractor(row):
    a = r.extract_keywords_from_text(row)
    return r.get_ranked_phrases()  # to get keyword phrases ranked with scores, highest to lowest

r = Rake()  # uses stopwords for English from NLTK, and all punctuation characters
books_over10['Keywords'] = books_over10['bookTitle'].map(lambda row: extractor(row))
Try looking into map. Not sure exactly which Rake you are using, and the way you have it coded is a bit confusing, but the general syntax would be:
books_over10['Keywords'] = books_over10['bookTitle'].map(lambda a: FUNCTION(a))
I have a dataframe named final with a column named CleanedText that holds user reviews (text). A review spans multiple lines. I have done preprocessing and removed all commas, full stops, HTML tags, etc., so the data looks like this: Review1 (row 1): pizza extremely delicious delivery late. Just like this, I have 10000 reviews (corresponding to 10000 rows). Now I want a list of lists where every review is its own list, e.g.: [['Pizza','extremely','delicious','delivery','late'],['Tommatos','rotten'......[]...[]].
This assumes you've truly stripped the text of all of the 'fun' stuff. Give this a shot.
fulltext = 'stuff with many\nlines and words'
text_array = [line.split() for line in fulltext.splitlines()]
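For the dataframe case described in the question, a hedged one-liner (assuming final['CleanedText'] holds one cleaned review per row):

# split each review on whitespace and collect the rows into a list of lists
list_of_lists = final['CleanedText'].astype(str).str.split().tolist()

str.split() with no arguments splits each row on runs of whitespace, and tolist() turns the resulting Series of lists into a plain list of lists.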