I'm trying my hand at the Kaggle COVID-19 competition (https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks) just to see if I can help. I have a question about improving the efficiency of regular expression search in a pandas DataFrame.
I've organised the dataset so that I have a DataFrame with the title, abstract, and full text of an article in each row. The goal is to search the full text for keywords using a regular expression, and then return a set in a new column. As a first step, I am searching for the viruses mentioned in each article. I am also using a dataset from the International Committee on Taxonomy of Viruses to help me identify them (https://talk.ictvonline.org/files/master-species-lists/m/msl/8266).
While I understand that my dataset is large and there is a lot of data in the full-text column (400,000+ rows, with hundreds of words of full text in each), my current script has been running for 2 days non-stop. I would like to check whether there is a way to improve its efficiency, as I want to run other regular expression searches and would prefer not to wait so long.
I've created a mock dataset below, together with the script I am using. Is there any way for me to improve its efficiency?
import pandas as pd
import re
# Mock dataset
df = pd.DataFrame(data={"Title": ["Article 1", "Article 2", "Article 3"],
                        "Abstract": ["Abstract 1", "Abstract 2", "Abstract 3"],
                        "Text": ["Lorem ipsum dolor sit amet, consectetur adipiscing elit, coronavirus sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, papavirus, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia paleovirus deserunt mollit anim id est laborum.",
                                 "Lorem ipsum dolor sit amet, paleovirus consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui coronavirus officia deserunt mollit anim id est laborum.",
                                 "Lorem coronavirus ipsum dolor sit amet, astrovirus consectetur adipiscing elit, sed do eiusmod tempor astrovirus incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."]
                        })
This is from the ICTV website. I downloaded the spreadsheet from there; to replicate, please download the spreadsheet and change the folder path.
virus = pd.read_excel(r"D:\Python work\2020-03-13\ICTV Master Species List 2018b.v2.xlsx", sheet_name = "ICTV 2018b Master Species #34 v")
Organising the data
virus = virus[['Realm', 'Subrealm', 'Kingdom', 'Subkingdom', 'Phylum',
               'Subphylum', 'Class', 'Subclass', 'Order', 'Suborder', 'Family',
               'Subfamily', 'Genus', 'Subgenus', 'Species']]
I've melted the dataset into a single column. Articles may mention the same virus in different conjugations (coronavirus, coronaviridae, coronavirale, etc.), and I want to capture all versions.
virus = virus.melt(id_vars= None, var_name = "virus_class", value_name = "virus_name")
virus.drop_duplicates(subset = ["virus_name"], inplace=True)
virus.dropna(subset = ["virus_name"], inplace=True)
virus["virus_name"] = virus["virus_name"] .apply(lambda x : x.lower())
From my understanding, all virus names contain the stem "vir" somewhere: at the beginning, in the middle, or at the end.
This line attempts to capture the prefix of the virus name using a regular expression:
virus["tgt"] = virus["virus_name"].apply(lambda x: re.findall("[a-z].*(?=vir)", x))
These lines convert the list returned by the regular expression to a string:
virus["tgt"] = virus["tgt"].astype(str)
virus["tgt"] = virus["tgt"].apply(lambda x: x.strip())
virus["tgt"] = virus["tgt"].apply(lambda x: x.replace("[",""))
virus["tgt"] = virus["tgt"].apply(lambda x: x.replace("]",""))
virus["tgt"] = virus["tgt"].apply(lambda x: x.replace("'",""))
virus["tgt"] = "[a-z]+" + virus["tgt"] + "vir[a-z0-9]+"
virus["tgt"].drop_duplicates(inplace=True)
This step takes all the virus patterns in the pandas Series and combines them into a single string. Also, thank you for providing this code (Python: Elegant way to check if at least one regex in list matches a string):
regexes = virus["tgt"].tolist()
combined = "(" + ")|(".join(regexes) + ")"
This is the code I am most worried about. It runs the regular expression, joins each match into a string, and collects the results into a set (the set removes duplicates):
def working(x):
    x = set(["".join(x) for x in re.findall(combined, x)])
    print(x)
    return x
This line applies the function to the text row by row. However, as mentioned, it is taking a long time:
df["ID_virus"] = df["Text"].apply(lambda x: working(x))
The script returns what I want, but it is slow. I apologise for the very long entry, but I wanted to provide as much information as possible to recreate the problem. The script works (I think), but as mentioned, it has been running for two days.
Any help would be appreciated.
Consider removing the line
virus["tgt"] = "[a-z]+" + virus["tgt"] + "vir[a-z0-9]+"
and changing
combined = "(" + ")|(".join(regexes) + ")"
to
combined = "([a-z]+(?:" + "|".join(regexes) + ")vir[a-z0-9]+)"
so it processes the [a-z]+ part only once, instead of once for each of the individual regexes.
Also, notice that
virus["virus_name"]\
.apply(lambda x: re.findall("[a-z].*(?=vir)", x))\
.apply(lambda x: len(x))\
.value_counts()
shows that quite a few rows do not match your first regex (because the strings do not contain "vir"):
1 6745
0 153
Name: virus_name, dtype: int64
Therefore, in practice, both versions of the combined regex will match anything that the simpler
[a-z]+vir[a-z0-9]+
matches.
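Putting that together, a minimal sketch of the simplified approach, using the mock df from the question (this is an illustration under that assumption, not a drop-in replacement for the full ICTV-derived pattern), compiles the single pattern once and applies it directly:
import re

# Compile the simplified pattern once, outside the per-row function.
pattern = re.compile(r"[a-z]+vir[a-z0-9]+")

def find_viruses(text):
    # One pass over the text; lowercase it so the lowercase-only
    # character classes match, and let the set remove duplicates.
    return set(pattern.findall(text.lower()))

df["ID_virus"] = df["Text"].apply(find_viruses)
With no capturing groups, findall returns plain strings, so the "".join step from the question is no longer needed.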
Related
Let's say I've scraped this from a website.
PARIS - Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua (2015). Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat 22/05/2015. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
I can just use .replace('PARIS - ', '') and then get the text with a regex, but what if the place changes in a different article?
How do I exclude the first "PARIS" and " - " and get the rest of the text?
Should I separate the location and the content with a regex?
What should I think about or do first when facing a problem like this?
Here's my code to get the first string for my third question; assume that text is a variable that contains the text above:
location = re.findall(r'^\w+', text)
Use a regular expression that matches a sequence of uppercase letters and spaces followed by a hyphen at the beginning, and replace it with an empty string:
text = re.sub(r'^[A-Z\s]+\s-\s*', '', text)
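If you also want to keep the location rather than discard it, a variation on the same idea (a sketch; the group names loc and body are my own) captures both pieces in one match:
import re

text = "PARIS - Lorem ipsum dolor sit amet, consectetur adipiscing elit..."

# Capture the leading location, then keep the rest of the text separately.
m = re.match(r'^(?P<loc>[A-Z][A-Z\s]*?)\s*-\s*(?P<body>.*)', text, re.S)
if m:
    location = m.group('loc')  # "PARIS"
    body = m.group('body')     # the article text without the prefix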
I am trying to create a program that simulates word wrapping text found in programs like Word or Notepad. If I have a long text, I would like to print out 64 characters (or less) per line, followed by a newline return, without truncating words. Using Windows 10, PyCharm 2018.2.4 and Python 3.6, I've tried the following code:
long_str = "Lorem ipsum dolor sit amet, consectetur adipiscing elit," \
"sed do eiusmod tempor incididunt ut labore et dolore magna aliqua." \
"Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris" \
"nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in" \
"reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur." \
"Excepteur sint occaecat cupidatat non proident, sunt in culpa qui" \
"officia deserunt mollit anim id est laborum."
concat_str = long_str[:64]  # The first 64 characters
rest_str = long_str[65:]  # The rest of the string
rest_str_len = len(rest_str)

while rest_str_len > 64:
    print(concat_str.lstrip() + " (" + str(len(concat_str)) + ")" + "\n")
    concat_str = rest_str[:64]
    rest_str = rest_str[65:]
    rest_str_len = len(rest_str)

print(concat_str.lstrip() + " (" + str(len(concat_str)) + ")" + "\n")
print(rest_str.lstrip() + " (" + str(len(rest_str)) + ")")
This is so close, but there are two problems. First, the code truncates letters at the end or beginning of lines, as in the following output:
# I've added the total len() at the end of each line just to check-sum.
'Lorem ipsum dolor sit amet, consectetur adipiscing elit,sed do e (64)'
'usmod tempor incididunt ut labore et dolore magna aliqua. Ut enim (64)'
'ad minim veniam, quis nostrud exercitation ullamco laborisnisi u (64)'
'aliquip ex ea commodo consequat. Duis aute irure dolor inrepreh (64)'
'nderit in voluptate velit esse cillum dolore eu fugiat nulla par (64)'
'atur. Excepteur sint occaecat cupidatat non proident, sunt in cul (64)'
'a quiofficia deserunt mollit anim id est laborum. (49)'
The second problem is that I need the code to print a newline only after a whole word (or punctuation), instead of chopping up the word at 64 characters.
Use textwrap.wrap:
import textwrap
long_str = "Lorem ipsum dolor sit amet, consectetur adipiscing elit," \
"sed do eiusmod tempor incididunt ut labore et dolore magna aliqua." \
"Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris" \
"nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in" \
"reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur." \
"Excepteur sint occaecat cupidatat non proident, sunt in culpa qui" \
"officia deserunt mollit anim id est laborum."
lines = textwrap.wrap(long_str, 64, break_long_words=False)
print('\n'.join(lines))
This takes a long string and splits it into a list of lines of at most the given width. Setting break_long_words to False prevents words from being split across lines.
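To reproduce the length checksums from the question, a quick check (a sketch using the lines list from the code above) prints each wrapped line with its length; every line comes out at 64 characters or fewer:
for line in lines:
    print(line + " (" + str(len(line)) + ")")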
I have a 1000 character long text string and I want to split this text into chunks smaller than 100 characters without splitting a whole word (99 characters are fine, but 100 are not). The wrapping/splitting should only be made on whitespace:
Example:
text = "... this is a test , and so on..."
^
#position: 100
should be split into:
newlist = ['... this is a test ,', ' and so on...', ...]
I want to get a list newlist of the text split properly into readable (not word-cropped) chunks. How would you do this?
Use the textwrap module's wrap function. The example below wraps the text at 10 characters:
In [1]: import textwrap
In [2]: textwrap.wrap("... this is a test , and so on...", 10)
Out[2]: ['... this', 'is a test', ', and so', 'on...']
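If you want the result back as a single string rather than a list, textwrap.fill joins the wrapped lines with newlines for you:
In [3]: textwrap.fill("... this is a test , and so on...", 10)
Out[3]: '... this\nis a test\n, and so\non...'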
You can use the textwrap module:
In [2]: import textwrap
In [3]: textwrap.wrap("""Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
...: tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
...: quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
...: consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
...: cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
...: proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
""", 40)
Out[3]:
['Lorem ipsum dolor sit amet, consectetur',
'adipisicing elit, sed do eiusmod tempor',
'incididunt ut labore et dolore magna',
'aliqua. Ut enim ad minim veniam, quis',
'nostrud exercitation ullamco laboris',
'nisi ut aliquip ex ea commodo consequat.',
'Duis aute irure dolor in reprehenderit',
'in voluptate velit esse cillum dolore eu',
'fugiat nulla pariatur. Excepteur sint',
'occaecat cupidatat non proident, sunt in',
'culpa qui officia deserunt mollit anim',
'id est laborum.']
Use textwrap as the others said; however, for an alternative option:
def splitter(s, n):
    # Yield successive n-character chunks (note: this may split mid-word).
    for start in range(0, len(s), n):
        yield s[start:start + n]

data = "abcdefghijabcdefghijabcdefghijabcdefghijabcdefghij"
for splitee in splitter(data, 10):
    print(splitee)
I keep a diary file of tech notes. Each entry is timestamped like so:
# Monday 02012-05-07 at 01:45:20 PM
This is a sample note
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
# Wednesday 02012-06-06 at 03:44:11 PM
Here is another one.
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia
deserunt mollit anim id est laborum.
I would like to break these notes down into individual files based on the timestamp headers, e.g. This is a sample note.txt, Here is another really long title.txt. I'm sure I would have to truncate the filename at some point, but the idea would be to seed the filename from the first line of the diary entry.
It doesn't look like I can modify the file's creation date via Python, so I would like to preserve the entry's timestamp as part of the note's body.
I've got a RegEx pattern to capture the timestamps that suits me well:
#(\s)(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)(\s)(.*)
and can likely use that regex to loop through the file and break each entry down, but I'm not quite sure how to loop through the diary file and break it out into individual files. There are a lot of examples of grabbing the actual regex pattern, or a particular line, but I want to do a few more things here and am having some difficulty piecing it together.
Here is an example of the desired file contents (datestamp + all text up until next datestamp match):
bash$ cat This\ is\ a\ sample\ note.txt
Monday 02012-05-07 at 01:45:20 PM
This is a sample note
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
bash$
Here's the general ;-) approach:
f = open("diaryfile", "r")
body = []
for line in f:
if your_regexp.match(line):
if body:
write_one(body)
body = []
body.append(line)
if body:
write_one(body)
f.close()
In short, you just keep appending all lines to a list (body). When you find a magical line, you call write_one() to dump what you have so far, and clear the list. The last chunk of the file is a special case, because you're not going to find your magical regexp again. So you again dump what you have after the loop.
You can make any transformations you like in your write_one() function. For example, sounds like you want to remove the leading "# " from the input timestamp lines. That's fine - just do, e.g.,
body[0] = body[0][2:]
in write_one. All the lines can be written out in one gulp via, e.g.,
with open(file_name_extracted_from_body_goes_here, "w") as f:
    f.writelines(body)
You probably want to check that the file doesn't exist first! If it's anything like my diary, the first line of many entries will be "Rotten day." ;-)
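For completeness, one possible write_one (a sketch only; the title extraction and the filename sanitizing are my assumptions, not part of the answer above):
import re

def write_one(body):
    # Strip the leading "# " from the timestamp line.
    body[0] = body[0][2:]
    # Seed the filename from the first non-empty line after the timestamp.
    title = next((ln.strip() for ln in body[1:] if ln.strip()), "untitled")
    # Remove characters unsafe in filenames and truncate to a sane length.
    safe = re.sub(r'[\\/:*?"<>|]', "", title)[:100]
    with open(safe + ".txt", "w") as f:
        f.writelines(body)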
It really doesn't require as much regex as you would think.
First, just load the file so you have it split into lines:
fl = 'file.txt'
with open(fl, 'r') as f:
    lines = f.readlines()
Now just loop through it! Compare each line with the regex you provided; if it matches, that means it's a new date.
Then grab the next non-empty line after that and set it as the name of the file.
Then keep going through and writing lines to that specific file name until you hit another match to your regex, at which point you know it's meant to be a new file. Here is the logic loop:
for line in lines:
    m = re.match(your_regex, line)  # your_regex: the timestamp pattern above
    if m:
        new_file = True
    else:
        new_file = False
    # now you will know when it's a new entry so you can easily do the rest
Let me know if you need any more of the logic broken down. Hopefully this was helpful.
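Filling in "the rest", a possible completion of that loop (a sketch; the incremental file handling and the "untitled" fallback are my additions):
import re

pattern = re.compile(r'#\s(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\s(.*)')

out = None
for i, line in enumerate(lines):
    if pattern.match(line):
        if out:
            out.close()
        # The next non-empty line is the entry title; seed the filename from it.
        title = next((l.strip() for l in lines[i + 1:] if l.strip()), "untitled")
        out = open(title + ".txt", "w")
    if out:
        out.write(line)
if out:
    out.close()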
You set the "batch-file" tag in your question, so I wrote a Batch file .bat solution. Here it is:
@echo off
setlocal EnableDelayedExpansion
set daysOfWeek=/Monday/Tuesday/Wednesday/Thursday/Friday/Saturday/Sunday/
for /F "delims=" %%a in (input.txt) do (
    if not defined timeStamp (
        set timeStamp=%%a
    ) else if not defined fileName (
        set fileName=%%a
        (
            echo !timeStamp!
            echo/
            echo !fileName!
            echo/
        ) > "!fileName!.txt"
    ) else (
        for /F "tokens=2" %%b in ("%%a") do if "!daysOfWeek:/%%b/=!" equ "%daysOfWeek%" (
            echo %%a>> "!fileName!.txt"
        ) else (
            set timeStamp=%%a
            set "fileName="
        )
    )
)
For example:
C:\Users\Antonio\Documents\test>dir /B
input.txt
test.bat
C:\Users\Antonio\Documents\test>test
C:\Users\Antonio\Documents\test>dir /B
Here is another one.txt
input.txt
test.bat
This is a sample note.txt
C:\Users\Antonio\Documents\test>type "Here is another one.txt"
# Wednesday 02012-06-06 at 03:44:11 PM
Here is another one
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia
deserunt mollit anim id est laborum.
C:\Users\Antonio\Documents\test>type "This is a sample note.txt"
# Monday 02012-05-07 at 01:45:20 PM
This is a sample note
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
When using the html2text Python package to convert HTML to Markdown, it adds '\n' characters to the text. I also see this behaviour when trying the demo at http://www.aaronsw.com/2002/html2text/
Is there any way to turn this off? Of course I can remove them myself, but there might be occurrences of '\n' in the original text which I don't want to remove.
html2text('Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.')
u'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod\ntempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,\nquis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo\nconsequat. Duis aute irure dolor in reprehenderit in voluptate velit esse\ncillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non\nproident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n\n'
In the latest version of html2text do this:
import html2text
h = html2text.HTML2Text()
h.body_width = 0
note = h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!")
This removes the word wrapping that html2text otherwise does.
Looking at the source to html2text.py, it looks like you can disable the wrapping behavior by setting BODY_WIDTH to 0. Something like this:
import html2text
html2text.BODY_WIDTH = 0
text = html2text.html2text('...')
Of course, resetting BODY_WIDTH globally changes the module's behavior. If I had a need to access this functionality, I'd probably seek to patch the module, creating a parameter to html2text() to modify this behavior per-call, and provide this patch back to the author.
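Short of patching the module, a small wrapper (a sketch; the helper name html2text_nowrap is my own) keeps the global change contained to each call:
import html2text

def html2text_nowrap(html):
    # Temporarily disable wrapping, then restore the previous setting.
    saved = html2text.BODY_WIDTH
    html2text.BODY_WIDTH = 0
    try:
        return html2text.html2text(html)
    finally:
        html2text.BODY_WIDTH = saved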