Searching for a substring through 3 million records in Python

I have a huge DataFrame with 3M records and a column called description. I also have a set of around 5k possible substrings.
I want to get the rows in which the description contains any of the substrings.
I used the following loop:
for i in range(0, len(searchstring)):
    ss = searchstring[i]
    for k in range(0, len(df)):
        desc = df['description'].iloc[k].lower()
        if re.search(ss, desc):
            trans.append(df.iloc[k])
The issue is that it takes too much time, since the search loops 5k times over 3M rows.
Is there a better way to search for the substrings?

It should be faster if you use the pandas isin() function.
Example:
import pandas as pd
a ='Hello world'
ss = a.split(" ")
df = pd.DataFrame({'col1': ['Hello', 'asd', 'asdasd', 'world']})
df.loc[df['col1'].isin(ss)].index
Returns a list of indexes:
Int64Index([0, 3], dtype='int64')
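Note that isin() checks whole cell values for exact matches, not substrings. If you need true substring matching, a vectorized alternative (a sketch, assuming the 5k search strings are plain text rather than regex patterns) is to combine them into a single regex and use str.contains:
import re
import pandas as pd
search_strings = ['foo', 'bar baz']  # hypothetical stand-in for the 5k substrings
# Escape each substring so it is matched literally, then join into one alternation pattern
pattern = '|'.join(re.escape(s) for s in search_strings)
mask = df['description'].str.contains(pattern, case=False, regex=True, na=False)
matching_rows = df[mask]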

I found an alternative method. I created a word dictionary (an inverted index) for the description column of the 3M-row data set by splitting each description into words. (I replaced the numbers in the descriptions with zeros before building the dictionary.)
def tokenize(desc):
    desc = re.sub(r'\d', '0', desc)
    tokens = re.split(r'\s+', desc)
    return tokens
def make_inv_index(df):
    inv_index = {}
    for i, tokens in df['description_removed_numbers'].items():
        for token in tokens:
            try:
                inv_index[token].append(i)
            except KeyError:
                inv_index[token] = [i]
    return inv_index
df['description_removed_numbers'] = df['description'].apply(tokenize)
inv_index_df = make_inv_index(df)
Now, when searching, I apply the same tokenization to the search string, take the intersection of the index lists for those particular words using the dictionary, and search only those rows. This substantially reduced the overall time taken to run the program.
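A minimal sketch of that lookup step, reusing tokenize and inv_index_df from above (search_candidates is a hypothetical helper name, and it assumes one search string ss with the same casing as the descriptions):
def search_candidates(search_string, inv_index, df):
    # Return only the rows whose description contains every token of the search string.
    tokens = tokenize(search_string)
    # Collect the row labels recorded for each token; a missing token means no match.
    index_sets = [set(inv_index.get(token, [])) for token in tokens]
    if not index_sets or not all(index_sets):
        return df.iloc[0:0]  # empty DataFrame with the same columns
    candidates = set.intersection(*index_sets)
    return df.loc[sorted(candidates)]
matches = search_candidates(ss, inv_index_df, df)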

Related

Find date and time in a text column in a dataframe Python

I'm trying to find and extract the date and time in a column that contains text sentences. The example data is below.
df = {'Id': ['001', '002', ...],
      'Description': ['THERE IS AN INTERUPTION/FAILURE # 9.6AM ON 27.1.2020 FOR JB BRANCH. THE INTERUPTION ALSO INVOLVED A, B, C AND SOME OTHER TOWN AREAS. OTC AND SST SERVICES INTERRUPTED AS GENSET ALSO WORKING AT THAT TIME. WE CALL FOR SERVICE. THE TECHNICHIAN COME AT 10.30AM. THEN IT BECOME OK AROUND 10.45AM',
                      'today is 23/3/2013 #10:AM we have', ...],
      ...
     }
df = pd.DataFrame(df, columns=['Id', 'Description'])
I have tried the datefinder library below, but it gives today's date, which is wrong.
findDate = dtf.find_dates(df['Description'][0])
for dates in findDate:
    print(dates)
Does anyone know the best way to extract them and automatically put them into a new column? Or does anyone know a library that can calculate the duration between a time and a date found in a text string? Thank you.
So you have two issues here:
You want to know how to apply a function to a DataFrame.
You want a function to extract a pattern from a bunch of text.
Here is how to apply a function to a Series (if you select only one column, as I did, you get a Series). Bonus points: read the DataFrame.apply() and Series.apply() documentation (30s) to become a Pandas-chad!
def do_something(x):
    some_code()  # placeholder for your processing logic
df['new_text_column'] = df['original_text_column'].apply(do_something)
And here is one way to extract patterns from a string using regexes. Read the regex doc (or follow a course) and play around with RegExr to become an omniscient god (that is, if you use a command line on Linux, along with your regex knowledge).
Modified from: How to extract the substring between two markers?
import re
text = 'gfgfdAAA1234ZZZuijjk'
# Searching for numbers.
m = re.search(r'\d+', text)
if m:
    found = m.group(0)
# found: 1234
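Putting the two together for the original question, a rough sketch could look like the following; the regex patterns here are assumptions tuned to formats like 27.1.2020, 23/3/2013, 9.6AM and 10.30AM, and will need adjusting for your real data:
import re
# Rough patterns for dates such as 27.1.2020 or 23/3/2013 and times such as 9.6AM or 10.30AM.
date_pattern = re.compile(r'\b\d{1,2}[./]\d{1,2}[./]\d{2,4}\b')
time_pattern = re.compile(r'\b\d{1,2}(?:[.:]\d{1,2})?\s*[AP]M\b', re.IGNORECASE)
# Extract all date-like and time-like matches from each Description into new columns.
df['Dates'] = df['Description'].apply(lambda s: date_pattern.findall(s))
df['Times'] = df['Description'].apply(lambda s: time_pattern.findall(s))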

How to filter strings if the first three sentences contain keywords

I have a pandas dataframe called df. It has a column called article. The article column contains 600 strings, each of which represents a news article.
I want to only KEEP those articles whose first four sentences contain the keywords "COVID-19" AND ("China" OR "Chinese"), but I'm unable to find a way to do this on my own.
(In the strings, sentences are separated by \n. An example article looks like this:)
\nChina may be past the worst of the COVID-19 pandemic, but they aren’t taking any chances.\nWorkers in Wuhan in service-related jobs would have to take a coronavirus test this week, the government announced, proving they had a clean bill of health before they could leave the city, Reuters reported.\nThe order will affect workers in security, nursing, education and other fields that come with high exposure to the general public, according to the edict, which came down from the country’s National Health Commission.\ .......
First we define a function to return a boolean based on whether your keywords appear in a given sentence:
def contains_covid_kwds(sentence):
    kw1 = 'COVID-19'
    kw2 = 'China'
    kw3 = 'Chinese'
    return kw1 in sentence and (kw2 in sentence or kw3 in sentence)
Then we create a boolean series by applying this function (using Series.apply) to the sentences of your df.article column.
Note that we use a lambda function to truncate the text passed to contains_covid_kwds at the fifth occurrence of '\n', i.e. your first four sentences (more info on how this works here):
series = df.article.apply(lambda s: contains_covid_kwds(s[:s.replace('\n', '#', 4).find('\n')]))
Then we pass the boolean series to df.loc, in order to localize the rows where the series was evaluated to True:
filtered_df = df.loc[series]
You can use the pandas apply method and do it the way I did.
string = "\nChina may be past the worst of the COVID-19 pandemic, but they aren’t taking any chances.\nWorkers in Wuhan in service-related jobs would have to take a coronavirus test this week, the government announced, proving they had a clean bill of health before they could leave the city, Reuters reported.\nThe order will affect workers in security, nursing, education and other fields that come with high exposure to the general public, according to the edict, which came down from the country’s National Health Commission."
df = pd.DataFrame({'article':[string]})
def findKeys(string):
    string_list = string.strip().lower().split('\n')
    flag = 0
    keywords = ['china', 'covid-19', 'wuhan']
    # Checking if the article has more than 4 sentences
    if len(string_list) > 4:
        # Iterating over string_list, which contains sentences
        for i in range(4):
            # Iterating over the keywords list
            for key in keywords:
                # Checking if the sentence contains any keyword
                if key in string_list[i]:
                    flag = 1
                    break
    # Else block is executed when the article has 4 or fewer sentences
    else:
        # Iterating over string_list, which contains sentences
        for i in range(len(string_list)):
            # Iterating over the keywords list
            for key in keywords:
                # Checking if the sentence contains any keyword
                if key in string_list[i]:
                    flag = 1
                    break
    if flag == 0:
        return False
    else:
        return True
and then call the pandas apply method on df:
df['Contains Keywords?'] = df['article'].apply(findKeys)
First I create a series which contains just the first four sentences from the original df['articles'] column, and convert it to lower case, assuming that searches should be case-insensitive.
articles = df['articles'].apply(lambda x: "\n".join(x.split("\n", maxsplit=4)[:4])).str.lower()
Then use a simple boolean mask to filter only those rows where the keywords were found in the first four sentences.
df[(articles.str.contains("covid")) & (articles.str.contains("chinese") | articles.str.contains("china"))]
Here:
found = []
s1 = "hello"
s2 = "good"
s3 = "great"
for string in article:
    if s1 in string and (s2 in string or s3 in string):
        found.append(string)

is there a more efficient way to iterate over a dataframe?

books_over10['Keywords'] = ""
r = Rake()  # Uses stopwords for English from NLTK, and all punctuation characters.
for index, row in books_over10.iterrows():
    a = r.extract_keywords_from_text(row['bookTitle'])
    c = r.get_ranked_phrases()  # Keyword phrases ranked with scores, highest to lowest.
    books_over10.at[index, 'Keywords'] = c
books_over10.head()
I am using the above code to process all rows, extract keywords from each row's bookTitle column, and insert them as a list into a new column named Keywords on the same row. The question is whether there is a more efficient way to do this without iterating over all the rows, because it takes a lot of time. Any help would be appreciated. Thanks in advance!
Solution by Changming:
def extractor(row):
    a = r.extract_keywords_from_text(row)
    return r.get_ranked_phrases()  # Keyword phrases ranked with scores, highest to lowest.
r = Rake()  # Uses stopwords for English from NLTK, and all punctuation characters.
books_over10['Keywords'] = books_over10['bookTitle'].map(lambda row: extractor(row))
Try looking into map. Not sure exactly what Rake you are using, and the way you have it coded is a bit confusing, but the general syntax would be.
books_over10['Keywords'] = books_over10['bookTitle'].map(lambda a: FUNCTION(a))

Python Text processing (str.contains)

I am using str.contains for text analytics in Pandas. For the sentence "My latest Data job was an Analyst", I want to match a combination of the words "Data" and "Analyst", but at the same time I want to specify the number of words allowed between the two words in the combination (here there are 2 words between "Data" and "Analyst"). Currently I am using (DataFile.XXX.str.contains('job') & DataFile.XXX.str.contains('Analyst')) to get the counts for "job Analyst".
How can I specify the number of words in between the 2 words in the str.contains syntax?
Thanks in advance
You can't. At least, not in a simple or standardized way.
Even the basics, like how you define a "word," are a lot more complex than you probably imagine. Both word parsing and lexical proximity (e.g. "are two words within distance D of one another in sentence s?") are the realm of natural language processing (NLP). NLP and proximity searches are not part of basic Pandas, nor of Python's standard string processing. You could import something like NLTK, the Natural Language Toolkit, to solve this problem in a general way, but that's a whole 'nother story.
Let's look at a simple approach. First you need a way to parse a string into words. The following is rough by NLP standards, but will work for simpler cases:
def parse_words(s):
    """
    Simple parser to grab English words from a string.
    CAUTION: A simplistic solution to a hard problem.
    Many possibly-important edge and corner cases are
    not handled. Just one example: hyphenated words.
    """
    return re.findall(r"\w+(?:'[st])?", s, re.I)
E.g.:
>>> parse_words("and don't think this day's last moment won't come ")
['and', "don't", 'think', 'this', "day's", 'last', 'moment', "won't", 'come']
Then you need a way to find all the indices in a list where a target word is found:
def list_indices(target, seq):
    """
    Return all indices in seq at which the target is found.
    """
    indices = []
    cursor = 0
    while True:
        try:
            index = seq.index(target, cursor)
        except ValueError:
            return indices
        else:
            indices.append(index)
            cursor = index + 1
And finally a decision making wrapper:
def words_within(target_words, s, max_distance, case_insensitive=True):
    """
    Determine if the two target words are within max_distance positions of one
    another in the string s.
    """
    if len(target_words) != 2:
        raise ValueError('must provide 2 target words')
    # fold case for case insensitivity
    if case_insensitive:
        s = s.casefold()
        target_words = [tw.casefold() for tw in target_words]
        # for Python 2, replace `casefold` with `lower`
    # parse words and establish their logical positions in the string
    words = parse_words(s)
    target_indices = [list_indices(t, words) for t in target_words]
    # words not present
    if not target_indices[0] or not target_indices[1]:
        return False
    # compute all combinations of distance for the two words
    # (there may be more than one occurrence of a word in s)
    actual_distances = [i2 - i1 for i2 in target_indices[1] for i1 in target_indices[0]]
    # answer whether the minimum observed distance is <= our specified threshold
    return min(actual_distances) <= max_distance
So then:
>>> s = "and don't think this day's last moment won't come at last"
>>> words_within(["THIS", 'last'], s, 2)
True
>>> words_within(["think", 'moment'], s, 2)
False
The only thing left to do is map that back to Pandas:
df = pd.DataFrame({'desc': [
'My latest Data job was an Analyst',
'some day my prince will come',
'Oh, somewhere over the rainbow bluebirds fly',
"Won't you share a common disaster?",
'job! rainbow! analyst.'
]})
df['ja2'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 2))
df['ja3'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 3))
This is basically how you'd solve the problem. Keep in mind, it's a rough and simplistic solution. Some simply-posed questions are not simply-answered. NLP questions are often among them.

Find and replace for millions of rows with regex in Python

I have 2 sets of data.
The first one, which serves as a dictionary, has two columns, keyword and id, and 180,000 rows. Below is some sample data.
Also, note that some of the keywords are as small as 2 characters and as big as 700 characters, and there is no fixed length for the keywords. The id, however, has a fixed pattern: a 3-digit number with a hash symbol before and after the number.
keyword id
salesman #123#
painter #486#
senior painter #215#
The second file has one column, corpus; it runs to 22 million records, and the length of each record varies between 10 and 1000 characters. Below is sample data which can be considered the input.
corpus
I am working as a salesman. salesmanship is not my forte, however i have become a good at it
I have been a painter since i was 19
are you the salesman?
Output
corpus
I am working as a #123#. salesmanship is not my forte, however i have become a good at it
I have been a #486# since i was 19
are you the #123#?
Please note that I want to replace complete words and not overlapping matches; in the first sentence, salesman was replaced with #123#, whereas salesmanship was not replaced with #123#ship. This requires me to add the regular expression '\b' before and after each keyword, which is why regex is essential for the search.
So this is a search-and-replace operation over multi-million rows that uses regex. I have read
Mass string replace in python?
and
Speed up millions of regex replacements in Python 3; however, it is taking me days to do this find and replace, which I can't afford, as this is a weekly task. I want to be able to do this much faster. Below is my code:
Id = df_dict.Id.tolist()
# convert the keywords to regex patterns with word boundaries
keyword = [r'\b' + x + r'\b' for x in df_dict.keyword]
# light on memory: drop the dictionary frame
del df_dict
# replace
df_corpus["corpus_text"].replace(keyword, Id, regex=True, inplace=True)
