I have a query
t1 = query.Term("content", "field")
t2 = query.Term("content", "information")
t3 = query.Term("content", "document")
q = spans.SpanNear2([t1, t2, t3], slop=5, ordered=True)
It finds and marks individual words:
[information] A [field] is a piece [document] create [document] of
[information] for each [document] in the index,...
But I need it to mark the whole matching expression instead:
information A [field is a piece document create document of
information for each document] in the index,...
There has already been some research on this topic. I haven't personally tried it, but it might be helpful:
Another SO thread on the matter: 1
To match a full phrase you can use phrase queries: 2
Apparently some unofficial code has been written regarding this issue: 3
First of all, I checked these previous posts, and they did not help me: 1 & 2 & 3
I have this string (or a similar one) that needs to be handled with regex:
"Text Table 6-2: Management of children study and actions"
What I am supposed to do is:
1. Detect the word Table and the word(s) before it, if there are any.
2. Detect the numbers that follow; they can be in one of these formats: 6 or 6-2 or 66-22 or 66-2.
3. Finally, capture the rest of the string (in this case: Management of children study and actions).
After doing so, the return value must combine parts 1 and 2 into one string and return the rest as another string,
e.g. the returned value must look like this: Text Table 6-2, Management of children study and actions
Below is my code:
import re

mystr = "Text Table 6-2: Management of children study and actions"
if re.match("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr):
    print("True matched")
parts_of_title = re.search("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr)
print(parts_of_title)
print(" ".join(parts_of_title.group().split()[0:3]), parts_of_title.group().split()[-1])
The first requirement matches as it should, but the second doesn't, so I changed the code to use compile. That changed how the regex behaves; the code now looks like this:
mystr = "Text Table 6-2: Management of children study and actions"
if re.match("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr):
    print("True matched")
parts_of_title = re.compile("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?").split(mystr)
print(parts_of_title)
Output:
True matched
['', 'Text ', 'Table', '-2', ':\tManagement of children study and actions']
So based on this, how can I achieve this while keeping the code clean and readable? And why does using compile change the matching?
The matching changes because:
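In the first part, you call .group().split(), where .group() returns the full match as a string, so str.split simply splits that string on whitespace.
In the second part, you call re.compile("...").split(), where re.compile returns a regular expression object. Its split method splits the input at each match of the pattern and, because the pattern contains capture groups, also includes the text of those groups in the result.
In the pattern, the part [a-zA-Z0-9]+[ ] matches only a single word. And if the number in [0-9]([-][0-9]+)? is meant to be captured, note that the first (single) digit is currently outside the capture group.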
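To see the difference concretely, here is a minimal sketch that runs the question's pattern through both calls. The variable names are only for illustration.

```python
import re

pattern = r"([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?"
mystr = "Text Table 6-2: Management of children study and actions"

# re.search returns a match object; .group() is the whole matched text.
match = re.search(pattern, mystr)
full_match = match.group()
print(full_match)          # Text Table 6-2
print(full_match.split())  # ['Text', 'Table', '6-2']

# Pattern.split cuts the input at the match and inserts the capture groups
# ('Text ', 'Table', '-2') between the text before and after the match.
parts = re.compile(pattern).split(mystr)
print(parts)
# ['', 'Text ', 'Table', '-2', ': Management of children study and actions']
```

The leading digit 6 disappears from the split result because it is part of the match but not inside any capture group.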
You could write the pattern with 4 capture groups:
^(.*? )?((?:[Ll]ist|[Tt]able|[Ff]igure))\s+(\d+(?:-\d+)?):\s+(.+)
See a regex demo.
import re
pattern = r"^(.*? )?((?:[Ll]ist|[Tt]able|[Ff]igure))\s+(\d+(?:-\d+)?):\s+(.+)"
s = "Text Table 6-2: Management of children study and actions"
m = re.match(pattern, s)
if m:
    print(m.groups())
Output
('Text ', 'Table', '6-2', 'Management of children study and actions')
If you want point 1 and 2 as one string, then you can use 2 capture groups instead.
^((?:.*? )?(?:[Ll]ist|[Tt]able|[Ff]igure)\s+\d+(?:-\d+)?):\s+(.+)
Regex demo
The output will be
('Text Table 6-2', 'Management of children study and actions')
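As a rough sketch, the two-group pattern could be wrapped in a small helper and tried on a few captions. The helper name split_caption and the extra sample captions are made up for illustration; only the first caption comes from the question.

```python
import re

# The two-group pattern from above: (label)(title).
CAPTION_RE = re.compile(
    r"^((?:.*? )?(?:[Ll]ist|[Tt]able|[Ff]igure)\s+\d+(?:-\d+)?):\s+(.+)"
)

def split_caption(text):
    """Return (label, title) for a caption, or None if it does not match."""
    m = CAPTION_RE.match(text)
    return m.groups() if m else None

samples = [
    "Text Table 6-2: Management of children study and actions",
    "Table 6: Overview",          # hypothetical caption
    "Figure 66-22: Results",      # hypothetical caption
    "No caption here",            # should not match
]
for caption in samples:
    print(split_caption(caption))
```

Returning None for non-matching input keeps the calling code to a single check instead of a separate re.match followed by re.search.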
You already have answers, but I wanted to try your problem as practice, so here is what I found in case you are interested:
((?:[a-zA-Z0-9]+)? ?(?:[Ll]ist|[Tt]able|[Ff]igure)).*?((?:[0-9]+\-[0-9]+)|(?<!-)[0-9]+): (.*)
And here is the link to my tests: https://regex101.com/r/7VpPM2/1
So I wrote a program for my Kindle e-reader that searches my highlights and deletes repetitive text (usually information about the book title, author, page number, etc.). I thought it was functional, but sometimes there would randomly be periods (.) on certain lines of the output. At first I thought the program was just buggy, but then I realized that the regex I'm using to match the book's title and author was also matching any sentence that ended in brackets.
This is the regex that I'm using to detect the book's title and author:
titleRegex = re.compile(r'(.+)\((.+)\)')
Example
Desired book title and author match: Book title (Author name)
What would also get matched: I like apples because they are green (they are sometimes red as well).
In this case it would delete everything and leave just the period at the end of the sentence. This is obviously not ideal because it deletes the text I highlighted.
Here is the unformatted text file that goes into my program
The program works by finding all of the matches for the regexes I wrote, looping through those matches and one by one replacing them with empty strings.
Is there a way to make my title regex more specific so that it only picks up book titles and authors, and not full sentences that end in brackets? If not, what steps would I have to take to restructure this program?
I've attached my code to the bottom of this post. I would greatly appreciate any help as I'm a total coding newbie. Thanks :)
import re
titleRegex = re.compile(r'(.+)\((.+)\)')
titleRegex2 = re.compile(r'\ufeff (.+)\((.+)\)')
infoRegex = re.compile(r'(.) ([a-zA-Z]+) (Highlight|Bookmark|Note) ([a-zA-Z]+) ([a-zA-Z]+) ([0-9]+) (\|)')
locationRegex = re.compile(r' Location (\d+)(-\d+)? (\|)')
dateRegex = re.compile(r'([a-zA-Z]+) ([a-zA-Z]+) ([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+)')
timeRegex = re.compile(r'([0-9]+):([0-9]+):([0-9]+) (AM|PM)')
newlineRegex = re.compile(r'\n')
sepRegex = re.compile('==========')
regexList = [titleRegex, titleRegex2, infoRegex, locationRegex, dateRegex, timeRegex, sepRegex, newlineRegex]
string = open("/Users/devinnagami/myclippings.txt").read()
# Apply each regex in turn, replacing every match with a space
for regex in regexList:
    string = regex.sub(' ', string)
finalText = string.split(' ')
with open('booknotes.txt', 'w') as f:
    for item in finalText:
        f.write('%s\n' % item)
There isn't enough information to tell if "Book title (Book Author)" is different than something like "I like Books (Good Ones)" without context. Thankfully, the text you showed has plenty of context. Instead of creating several different regular expressions, you can combine them into one expression to encode that context.
For instance:
import re

quoteInfoRegex = re.compile(
    r"^=+\n(?P<title>.*?) \((?P<author>.*?)\)\n" +
    r"- Your Highlight on page (?P<page>[\d]+) \| Location (?P<location>[\d-]+) \| Added on (?P<added>.*?)\n" +
    r"\n" +
    r"(?P<quote>.*?)\n", flags=re.MULTILINE)

# data holds the full text of the clippings file
with open("/Users/devinnagami/myclippings.txt") as f:
    data = f.read()
for m in quoteInfoRegex.finditer(data):
    print(m.groupdict())
This matches each clipping as a whole and parses it, relying on the fact that the book title is the first line after the row of equals signs, and that the quote itself comes after the blank line that follows the info line.
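Here is a small self-contained run of that combined regex. The sample clipping is made up and only assumes the layout the regex expects (separator line, title line, info line, blank line, quote); a real clippings file may differ, for example in how the first entry starts.

```python
import re

quote_info_regex = re.compile(
    r"^=+\n(?P<title>.*?) \((?P<author>.*?)\)\n"
    r"- Your Highlight on page (?P<page>[\d]+) \| Location (?P<location>[\d-]+) \| Added on (?P<added>.*?)\n"
    r"\n"
    r"(?P<quote>.*?)\n",
    flags=re.MULTILINE,
)

# Made-up sample data in the assumed layout.
data = (
    "==========\n"
    "Book title (Author name)\n"
    "- Your Highlight on page 12 | Location 180-181 | Added on Monday, January 6, 2020 10:15:00 AM\n"
    "\n"
    "I like apples because they are green (they are sometimes red as well).\n"
    "==========\n"
)

entries = [m.groupdict() for m in quote_info_regex.finditer(data)]
for entry in entries:
    # The quote keeps its trailing "(...)" because only the title line
    # is parsed as "title (author)".
    print(entry["title"], "|", entry["author"], "|", entry["quote"])
```

Because the title pattern is tied to the line right after the separator, a highlighted sentence that happens to end in brackets is captured as the quote rather than being mistaken for a title.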
I'm trying to find and extract the date and time in a column that contains text sentences. The example data is shown below.
df = {'Id': ['001', '002',...],
      'Description': ['THERE IS AN INTERUPTION/FAILURE # 9.6AM ON 27.1.2020 FOR JB BRANCH. THE INTERUPTION ALSO INVOLVED A, B, C AND SOME OTHER TOWN AREAS. OTC AND SST SERVICES INTERRUPTED AS GENSET ALSO WORKING AT THAT TIME. WE CALL FOR SERVICE. THE TECHNICHIAN COME AT 10.30AM. THEN IT BECOME OK AROUND 10.45AM', 'today is 23/3/2013 #10:AM we have',...],
      ....
      }
df = pd.DataFrame(df, columns=['Id', 'Description'])
I have tried the datefinder library as shown below, but it gives today's date, which is wrong.
findDate = dtf.find_dates(le['Description'][0])
for dates in findDate:
    print(dates)
Does anyone know the best way to extract it and automatically put it into a new column? Or does anyone know of a library that can calculate the duration between times and dates in a text string? Thank you.
So you have two issues here.
1. You want to know how to apply a function to a DataFrame.
2. You want a function to extract a pattern from a bunch of text.
Here is how to apply a function to a Series (if you select only one column, as I did, you get a Series). Bonus points: Read the DataFrame.apply() and Series.apply() documentation (30s) to become a Pandas-chad!
def do_something(x):
    # Replace this with your own processing of a single cell value
    return x
df['new_text_column'] = df['original_text_column'].apply(do_something)
And here is one way to extract patterns from a string using regexes. Read the regex doc (or follow a course) and play around with RegExr to become an omniscient god (that is, if you use a command line on Linux, along with your regex knowledge).
Modified from: How to extract the substring between two markers?
import re
text = 'gfgfdAAA1234ZZZuijjk'
# Searching numbers.
m = re.search(r'\d+', text)
if m:
    found = m.group(0)
    # found: 1234
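Putting the two parts together for the question's data, a minimal sketch could look like the following. It assumes dates are written as day.month.year or day/month/year and times as hour.minute or hour:minute followed by AM or PM, which is only a guess based on the sample text. The function names find_dates and find_times are made up, and the sample sentence is a shortened version of the one in the question.

```python
import re
from datetime import date

# Assumed formats: 27.1.2020 or 23/3/2013 for dates, 9.6AM or 10:30 PM for times.
DATE_RE = re.compile(r"\b(\d{1,2})[./](\d{1,2})[./](\d{4})\b")
TIME_RE = re.compile(r"\b\d{1,2}[.:]\d{1,2}\s*(?:AM|PM)\b")

def find_dates(text):
    """Return every date found in text as a datetime.date.

    Raises ValueError if a matched string is not a real calendar date.
    """
    return [date(int(y), int(m), int(d)) for d, m, y in DATE_RE.findall(text)]

def find_times(text):
    """Return every time found in text as the raw matched string."""
    return TIME_RE.findall(text)

text = ("THERE IS AN INTERUPTION/FAILURE # 9.6AM ON 27.1.2020 FOR JB BRANCH. "
        "THE TECHNICHIAN COME AT 10.30AM. THEN IT BECOME OK AROUND 10.45AM")
print(find_dates(text))  # [datetime.date(2020, 1, 27)]
print(find_times(text))  # ['9.6AM', '10.30AM', '10.45AM']
```

Either function can then be passed to Series.apply, for example df['Description'].apply(find_dates), to fill a new column. Incomplete times such as "10:AM" in the second sample row are not matched by this sketch.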
I have an MS Word document that contains some text and headings, and I want to extract the headings. I installed Python for Win32, but I don't know which method to use; the help document of Python for Windows does not seem to list the functions of the Word object. Take the following code as an example:
import win32com.client as win32
word = win32.Dispatch("Word.Application")
word.Visible = 0
word.Documents.Open("MyDocument")
doc = word.ActiveDocument
How can I find out all the functions of the Word object? I didn't find anything useful in the help document.
The Word object model can be found here. Your doc object will contain these properties, and you can use them to perform your desired actions (note that I haven't used this feature with Word, so my knowledge of the object model is sparse). For instance, if you wanted to read all the words in a document, you could do:
for word in doc.Words:
    print(word)
And you would get all of the words. Each of those word items would be a Word object (reference here), so you could access those properties during iteration. In your case, here is how you would get the style:
for word in doc.Words:
    print(word.Style)
On a sample doc with a single Heading 1 and normal text, this prints:
Heading 1
Heading 1
Heading 1
Heading 1
Heading 1
Normal
Normal
Normal
Normal
Normal
To group the headings together, you can use itertools.groupby. As explained in the code comments below, you need to reference the str() of the object itself, as using word.Style returns an instance that won't properly group with other instances of the same style:
from itertools import groupby
import win32com.client as win32
# All the same as yours
word = win32.Dispatch("Word.Application")
word.Visible = 0
word.Documents.Open("testdoc.doc")
doc = word.ActiveDocument
# Here we use itertools.groupby (without sorting anything) to
# find groups of words that share the same heading (note it picks
# up newlines). The tricky/confusing thing here is that you can't
# just group on the Style itself - you have to group on the str().
# There was some other interesting behavior, but I have zero
# experience with COMObjects so I'll leave it there :)
# All of these comments for two lines of code :)
for heading, grp_wrds in groupby(doc.Words, key=lambda x: str(x.Style)):
    print(heading, ''.join(str(word) for word in grp_wrds))
This outputs:
Heading 1 Here is some text
Normal
No header
If you replace the join with a list comprehension, you get the below (where you can see the newlines):
Heading 1 ['Here ', 'is ', 'some ', 'text', '\r']
Normal ['\r', 'No ', 'header', '\r', '\r']
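If you want to try the grouping logic without Word installed, here is a minimal sketch that replaces doc.Words with a plain list of (style name, word text) pairs. The list is a stand-in built from the output above, not real COM objects.

```python
from itertools import groupby

# Stand-in for doc.Words: (style name, word text) pairs in document order.
words = [
    ("Heading 1", "Here "), ("Heading 1", "is "), ("Heading 1", "some "),
    ("Heading 1", "text"), ("Heading 1", "\r"),
    ("Normal", "\r"), ("Normal", "No "), ("Normal", "header"),
    ("Normal", "\r"), ("Normal", "\r"),
]

# groupby only merges *consecutive* items with the same key, which is what
# we want here: each run of words sharing a style becomes one group.
groups = [
    (style, [text for _, text in group])
    for style, group in groupby(words, key=lambda pair: pair[0])
]
for style, texts in groups:
    print(style, texts)
```

Because groupby works on consecutive runs, a document with two separate Heading 1 sections would produce two Heading 1 groups rather than one merged group.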
Convert the Word document to .docx and use the python-docx module:
from docx import Document
file = 'test.docx'
document = Document(file)
for paragraph in document.paragraphs:
    if paragraph.style.name == 'Heading 1':
        print(paragraph.text)
You can also use the Google Drive SDK to convert the Word document to something more useful, like HTML, where you can easily extract the headers.
https://developers.google.com/drive/manage-uploads
I'm trying to learn how to interpret and parse a string in Python. I want to make a "string command" (I don't know if that is the right expression). To explain better, here is an example: I want a command like in SQL, where a string with keywords tells a process what to do, like this: cursor.execute("UPDATE Cars SET Price=? WHERE Id=?", (50000, 1)). But I want to create my own format for my project, like this (it doesn't have to involve SQL): mydef("U={Cars[Price=50000], Id=1}")
Syntax table: <command>={<table>[<value name>=<value (int/str/float/bool)>], <id>=<value to id>}
Where command is: U=update, C=create, S=select, I=insert, D=delete
I really want to learn how I can do this in Python, if it is possible.
Pyparsing is a simple pure-Python, small-footprint, liberally-licensed module for creating parsers like the one you describe. Here are a couple of presentations I gave at PyCon'06 (updated for the Texas Python UnConference, 2008), one an intro to pyparsing itself, and one a demo of using pyparsing for parsing and executing a simple command language (a text adventure game).
Intro to Pyparsing - http://www.ptmcg.com/geo/python/confs/TxUnconf2008Pyparsing.html
A Simple Adventure Game Command Parser - http://www.ptmcg.com/geo/python/confs/pyCon2006_pres2.html
Both presentations are written using S5, so if you mouse into the lower right hand corner, you'll see << and >> buttons, a Ø button to see the entire presentation as a single printable web page, and a combo box to jump to a particular page.
You can find out more about pyparsing at http://pyparsing.wikispaces.com.
Just to be clear, are you aware that Python 2.5+ includes sqlite?
import sqlite3
conn = sqlite3.connect("dbname.db")
curs = conn.cursor()
curs.execute("""CREATE TABLE Cars (UID INTEGER PRIMARY KEY, \
"Id" VARCHAR(42), \
"Price" VARCHAR(42))""")
curs.execute("UPDATE Cars SET Price=? WHERE Id=?", (50000, 1))
Edit to add: I didn't actually test this; you'll at least need an insert statement to make this work.
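For reference, here is a minimal runnable sketch that includes the insert. It uses an in-memory database and a simplified two-column schema, both of which are assumptions made for illustration rather than the schema above.

```python
import sqlite3

# ":memory:" creates a throwaway database that lives only for this run.
conn = sqlite3.connect(":memory:")
curs = conn.cursor()

# Simplified, assumed schema: integer Id and integer Price.
curs.execute("CREATE TABLE Cars (Id INTEGER PRIMARY KEY, Price INTEGER)")
curs.execute("INSERT INTO Cars (Id, Price) VALUES (?, ?)", (1, 40000))

# The parameterized update from the question.
curs.execute("UPDATE Cars SET Price=? WHERE Id=?", (50000, 1))
conn.commit()

row = curs.execute("SELECT Id, Price FROM Cars WHERE Id=?", (1,)).fetchone()
print(row)  # (1, 50000)
conn.close()
```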
I wrote this code, but I don't know if it will work. I'd just like an opinion.
>>> s = '<command>={<table>[<value name>=<value>], <id>=<value id>}'
>>> s1 = s.split('=', 1)
>>> s2 = s1[1].split(',', 1)
>>> s2 = s1[1].replace('{', '').replace('}', '').split(',', 1)
>>> s3 = s2[0].replace(']', '').split('[')
>>> s4 = s3[1].split('=')
>>> s1
['<command>', '{<table>[<value name>=<value>], <id>=<value id>}']
>>> s2
['<table>[<value name>=<value>]', ' <id>=<value id>']
>>> s3
['<table>', '<value name>=<value>']
>>> s4
['<value name>', '<value>']
>>> s5 = s2[1].split('=')
This splits the entire command and gets the args:
<command>={<table>[<value name>=<value>],<id>=<value id>}
["<command>", "{<table>[<value name>=<value>],<id>=<value id>}"]
["<table>[<value name>=<value>]", "<id>=<value id>"]
["<table>", "<value name>=<value>"]
["<value name>", "<value>"]
["<id>", "<value id>"]
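As an alternative to chaining split and replace calls, the same pieces can be pulled out with one regular expression. This is only a sketch for the exact syntax in the question; the names COMMAND_RE and parse_command are made up, values are left as strings, and nested or quoted values are not handled.

```python
import re

COMMANDS = {"U": "update", "C": "create", "S": "select", "I": "insert", "D": "delete"}

# <command>={<table>[<value name>=<value>], <id>=<value to id>}
COMMAND_RE = re.compile(
    r"^(?P<command>[UCSID])=\{"
    r"(?P<table>\w+)\[(?P<name>\w+)=(?P<value>[^\]]*)\],\s*"
    r"(?P<id_name>\w+)=(?P<id_value>[^}]*)\}$"
)

def parse_command(text):
    """Split a command string into its named parts, or raise ValueError."""
    m = COMMAND_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not a valid command: {text!r}")
    parts = m.groupdict()
    parts["command"] = COMMANDS[parts["command"]]  # expand the one-letter code
    return parts

print(parse_command("U={Cars[Price=50000], Id=1}"))
# {'command': 'update', 'table': 'Cars', 'name': 'Price', 'value': '50000',
#  'id_name': 'Id', 'id_value': '1'}
```

A regex like this rejects malformed input in one place, whereas the split-based version silently produces odd lists when a bracket or comma is missing. For a grammar that is likely to grow, a parser library such as pyparsing is probably the better fit.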