How can I select rows from a dataframe using a list of words as a reference?
E.g. I have a dataframe, df_business, and each item in the last column is a string containing comma-separated categories, like this:
categories: "Restaurants, Burgers, Coffee & Tea, Fast Food, Food"
I tried this, but it only gives me the rows for the businesses containing ONLY the word coffee in their categories:
bus_int = df_business.loc[(df_business['categories'].isin(['Coffee']))]
How can I get the businesses whose categories contain my word even when it appears among other words, as shown above?
What you want is the contains method:
bus_int = df_business.loc[df_business.categories.str.contains('Coffee', regex=False)]
(The default is to treat the supplied argument as a regular expression, which you don't need if you're simply looking for a substring like this.)
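For instance, with a small made-up frame shaped like the question's data (the names and category values here are invented purely for illustration):

```python
import pandas as pd

# Invented sample data shaped like the question's categories column
df_business = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'categories': [
        'Restaurants, Burgers, Coffee & Tea, Fast Food, Food',
        'Bars, Nightlife',
        'Coffee & Tea, Bakeries',
    ],
})

# Substring match: catches 'Coffee' wherever it appears in the string
bus_int = df_business.loc[df_business.categories.str.contains('Coffee', regex=False)]
print(bus_int['name'].tolist())  # ['A', 'C']
```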
Just use - .isin(list_name / the string you want)
Example -
fruits = ['apple', 'mango', 'banana']  # creating a list (avoid naming it "list", which shadows the builtin)
df[df['Fruits_name'].isin(fruits)]
# this will find all the rows with these 3 fruit names
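Note that .isin compares the entire cell value, not substrings, so it will not catch a fruit name buried in a longer comma-separated string. A quick sketch contrasting the two (sample data invented):

```python
import pandas as pd

# Invented sample data: one cell holds multiple comma-separated values
df = pd.DataFrame({'Fruits_name': ['apple', 'mango, apple', 'banana']})

# isin matches the whole cell value, so 'mango, apple' is NOT matched
exact = df[df['Fruits_name'].isin(['apple'])]

# str.contains matches substrings, so both rows mentioning 'apple' are kept
partial = df[df['Fruits_name'].str.contains('apple', regex=False)]

print(len(exact), len(partial))  # 1 2
```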
Related
I have two columns from two different csv. A column contains words. The other column contains sentences. I import the columns as a list. Now I want to find all sentences in which one of the imported words occurs. Then I want to write the sentence and the word back into a csv.
Does anyone have any idea how I should design the comparison of words vs. sentences in Python?
Thanks in advance :-)
On the basis of the two columns being separate lists, and you wanting a list to write back to a CSV, use something like:
l1 = ['bill has a ball', 'tom has a toy', 'fred has a frog']
l2 = ['ball', 'frog']
sentences_with_words = []
for s in l1:
    for w in l2:
        if w in s:
            sentences_with_words.append([s, w])
print(sentences_with_words)
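If the matches then need to go back into a CSV, the standard csv module is enough. A minimal sketch (the file name and header row are assumptions, not from the question):

```python
import csv

# Matches produced by the loop above (values repeated here for self-containment)
sentences_with_words = [['bill has a ball', 'ball'], ['fred has a frog', 'frog']]

# 'matches.csv' and the header names are hypothetical
with open('matches.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['sentence', 'word'])
    writer.writerows(sentences_with_words)
```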
I would like to use something like a VLOOKUP/map function in Python.
I have only a portion of the entire name of some companies. I would like to know if the company is in the dataset, as in the following example.
Thank you
I can recreate the results by checking one list against another. It's not entirely clear what your match criteria are. "john usa" is a successful match with "aviation john" on the basis that "john" appears in both. But would "john usa" constitute a match with "usa mark sas", since "usa" appears in both? What about hyphens, commas, etc.?
It would help if this was cleared up.
In any case, I hope the following will help, good luck:-
# create two lists of tuples based on the existing dataframes
check_list = list(df_check.to_records(index=False))
full_list = list(df_full.to_records(index=False))

# create a set - entries in a set are unique
results = set()

for check in check_list:  # for each record to check...
    for search_word in check[0].split(" "):  # take the first column and split it into words on spaces
        found = any(search_word in rec[0] for rec in full_list)  # is the word a substring of any record in full_list? True or False
        results.add((check[0], found))  # add the checked record to the set with the result (the set avoids duplicates)

# build a dataframe based on the results
df_results = pd.DataFrame(results, columns=["check", "found"])
Alternatively, if an exact whole-name match is enough:
df1['in DATASET'] = df1['NAME'].isin(df2['FULL DATASET'])
I have a pandas dataframe called df. It has a column called article. The article column contains 600 strings, each of the strings represent a news article.
I want to KEEP only those articles whose first four sentences contain the keywords "COVID-19" AND ("China" OR "Chinese"), but I'm unable to find a way to do this on my own.
(in the string, sentences are separated by \n. An example article looks like this:)
\nChina may be past the worst of the COVID-19 pandemic, but they aren’t taking any chances.\nWorkers in Wuhan in service-related jobs would have to take a coronavirus test this week, the government announced, proving they had a clean bill of health before they could leave the city, Reuters reported.\nThe order will affect workers in security, nursing, education and other fields that come with high exposure to the general public, according to the edict, which came down from the country’s National Health Commission.\ .......
First we define a function to return a boolean based on whether your keywords appear in a given sentence:
def contains_covid_kwds(sentence):
    kw1 = 'COVID-19'
    kw2 = 'China'
    kw3 = 'Chinese'
    return kw1 in sentence and (kw2 in sentence or kw3 in sentence)
Then we create a boolean series by applying this function (using Series.apply) to the sentences of your df.article column.
Note that we use a lambda function in order to truncate the sentence passed to contains_covid_kwds at the fifth occurrence of '\n', i.e. your first four sentences:
series = df.article.apply(lambda s: contains_covid_kwds(s[:s.replace('\n', '#', 4).find('\n')]))
Then we pass the boolean series to df.loc, in order to localize the rows where the series was evaluated to True:
filtered_df = df.loc[series]
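Putting the whole answer together on a tiny invented frame (the article text below is made up purely for illustration):

```python
import pandas as pd

def contains_covid_kwds(sentence):
    kw1 = 'COVID-19'
    kw2 = 'China'
    kw3 = 'Chinese'
    return kw1 in sentence and (kw2 in sentence or kw3 in sentence)

# Two invented articles: only the first mentions COVID-19 and China early on
df = pd.DataFrame({'article': [
    '\nChina reported new COVID-19 cases.\nMore text.\nAnd more.\nAnd more.\nBuried sentence.',
    '\nUnrelated story about sports.\nNothing here.\nStill nothing.\nNope.\nNo match.',
]})

# Truncate each article at the fifth '\n' (= first four sentences) before checking
series = df.article.apply(lambda s: contains_covid_kwds(s[:s.replace('\n', '#', 4).find('\n')]))
filtered_df = df.loc[series]
print(len(filtered_df))  # 1
```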
You can use the pandas apply method and do it the way I did.
string = "\nChina may be past the worst of the COVID-19 pandemic, but they aren’t taking any chances.\nWorkers in Wuhan in service-related jobs would have to take a coronavirus test this week, the government announced, proving they had a clean bill of health before they could leave the city, Reuters reported.\nThe order will affect workers in security, nursing, education and other fields that come with high exposure to the general public, according to the edict, which came down from the country’s National Health Commission."
df = pd.DataFrame({'article':[string]})
def findKeys(string):
    string_list = string.strip().lower().split('\n')
    flag = 0
    keywords = ['china', 'covid-19', 'wuhan']
    # Checking if the article has more than 4 sentences
    if len(string_list) > 4:
        # iterating over the first four sentences in string_list
        for i in range(4):
            # iterating over the keywords list
            for key in keywords:
                # checking if the sentence contains any keyword
                if key in string_list[i]:
                    flag = 1
                    break
    # Else block is executed when the article has 4 or fewer sentences
    else:
        # Iterating over every sentence in string_list
        for i in range(len(string_list)):
            # iterating over the keywords list
            for key in keywords:
                # Checking if the sentence contains any keyword
                if key in string_list[i]:
                    flag = 1
                    break
    return flag == 1
and then call the pandas apply method on df:-
df['Contains Keywords?'] = df['article'].apply(findKeys)
First I create a series which contains just the first four sentences from the original df['articles'] column, and convert it to lower case, assuming that searches should be case-insensitive.
articles = df['articles'].apply(lambda x: "\n".join(x.split("\n", maxsplit=4)[:4])).str.lower()
Then use a simple boolean mask to filter only those rows where the keywords were found in the first four sentences.
df[(articles.str.contains("covid")) & (articles.str.contains("chinese") | articles.str.contains("china"))]
Here:
found = []
s1 = "hello"
s2 = "good"
s3 = "great"
for string in article:
    if s1 in string and (s2 in string or s3 in string):
        found.append(string)
I am passing about 11,000 texts, extracted from a CSV as a dataframe, to the remove_unique function. In the function I find the unique words and save them as a list named "unique". The list is built from all the unique words found in the entire column.
Using regex I'm trying to remove the unique words from each row (single column) of the pandas dataframe, but the unique words do not get removed as expected; instead all the words are removed and empty "text" is returned.
def remove_unique(text):
    # Gets all the unique words in the entire corpus
    unique = list(set(text.str.findall(r"\w+").sum()))
    pattern = re.compile(r'\b(?:{})\b'.format('|'.join(unique)))
    # Ideally should remove the unique words from the corpus.
    text = text.apply(lambda x: re.sub(pattern, '', x))
    return text
Can somebody point out what the issue is?
before
0 card layout broken window resiz unabl error ex...
1 chart lower rang border patch merg recheck...
2 left align text team close c...
3 descript sma...
4 list disappear navig make contain...
Name: description_plus, dtype: object
after
0
1 ...
2
3
4 ...
Name: description_plus, dtype: object
Not sure I totally understand. Are you trying to see if a word appears multiple times throughout the entire column?
Maybe
import re

a_list = list(df["column"].values)  # column to list
string = " ".join(a_list)  # list of rows to one string
words = re.findall(r"(\w+)", string)  # split into a single list of words
print([item for item in words if words.count(item) > 1])  # list of words that appear multiple times
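If the actual intent was to drop words that occur only once in the entire column (which would explain why removing every word in the "unique" set empties the text), one possible sketch with collections.Counter (the sample column values are invented):

```python
from collections import Counter

import pandas as pd

# Invented column values resembling the question's preprocessed text
s = pd.Series(['card layout broken', 'chart layout merge', 'card chart recheck'])

# Count every token across the entire column
counts = Counter(" ".join(s).split())

# Keep only tokens that occur more than once anywhere in the column
s_filtered = s.apply(lambda row: " ".join(w for w in row.split() if counts[w] > 1))
print(s_filtered.tolist())  # ['card layout', 'chart layout', 'card chart']
```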
I have a dataframe named final where I have a column named CleanedText containing user reviews (text). A review spans multiple lines. I have done preprocessing and removed all commas, full stops, HTML tags, etc. So the data looks like, Review1 (row1): pizza extremely delicious delivery late. Just like this, I have 10000 reviews (corresponding to 10000 rows). Now I want a list of lists where every review is a list. Ex: [['Pizza','extremely','delicious','delivery','late'],['Tommatos','rotten'......[]...[]].
This assumes you've truly stripped the text of all of the 'fun' stuff. Give this a shot.
fulltext = 'stuff with many\nlines and words'
text_array = [line.split() for line in fulltext.splitlines()]
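Since the reviews sit in a dataframe column, str.split on the column gives the same list-of-lists directly. A short sketch ('CleanedText' is the column name from the question; the sample rows are invented):

```python
import pandas as pd

# Invented sample rows; one cleaned review per row as described in the question
final = pd.DataFrame({'CleanedText': [
    'pizza extremely delicious delivery late',
    'tomatoes rotten',
]})

# One list of words per review, collected into a list of lists
list_of_lists = final['CleanedText'].str.split().tolist()
print(list_of_lists)
```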