How to convert inkml file to an image format - python

I have a dataset consisting of InkML files of handwritten text. I want to convert it to a usable image format to train a CNN. A Python script would be helpful.
I found a method; the source code is given below:
import xml.etree.ElementTree as ET

def get_traces_data(inkml_file_abs_path):
    traces_data = []
    tree = ET.parse(inkml_file_abs_path)
    root = tree.getroot()
    doc_namespace = "{http://www.w3.org/2003/InkML}"

    # Store all traces with their corresponding id
    traces_all = [{'id': trace_tag.get('id'),
                   'coords': [[round(float(axis_coord))
                               for axis_coord in coord.strip().split(' ')]
                              for coord in (trace_tag.text).replace('\n', '').split(',')]}
                  for trace_tag in root.findall(doc_namespace + 'trace')]
    # print("before sort ", traces_all)

    # Sort traces_all by id to make searching for references faster
    traces_all.sort(key=lambda trace_dict: int(trace_dict['id']))
    # print("after sort ", traces_all)

    # The first traceGroup is always a redundant wrapper
    traceGroupWrapper = root.find(doc_namespace + 'traceGroup')
    if traceGroupWrapper is not None:
        for traceGroup in traceGroupWrapper.findall(doc_namespace + 'traceGroup'):
            label = traceGroup.find(doc_namespace + 'annotation').text

            # Traces of the current traceGroup
            traces_curr = []
            for traceView in traceGroup.findall(doc_namespace + 'traceView'):
                # Id reference to the specific trace tag corresponding to the current label
                traceDataRef = int(traceView.get('traceDataRef'))
                # Each trace is represented by a list of coordinates to connect
                single_trace = traces_all[traceDataRef]['coords']
                traces_curr.append(single_trace)
            traces_data.append({'label': label, 'trace_group': traces_curr})
    else:
        # Validation data has no labels
        for trace in traces_all:
            traces_data.append({'trace_group': [trace['coords']]})
    return traces_data

You may consider using xml.etree.ElementTree in Python to parse your InkML files, and OpenCV's cv2.line method to connect the points and draw each stroke.
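For example, here is a minimal sketch of that approach (my own illustration, assuming the get_traces_data function above and that the coordinates already fit within the chosen canvas; real traces usually need to be rescaled/offset first):

import cv2
import numpy as np

def traces_to_image(traces_data, width=500, height=500, thickness=2):
    # Blank white canvas (grayscale)
    img = np.full((height, width), 255, dtype=np.uint8)
    for group in traces_data:
        for trace in group['trace_group']:
            # Connect consecutive points of each stroke; only the first
            # two channels (x, y) of each coordinate are used
            for p0, p1 in zip(trace[:-1], trace[1:]):
                cv2.line(img, (int(p0[0]), int(p0[1])), (int(p1[0]), int(p1[1])),
                         color=0, thickness=thickness)
    return img

traces = get_traces_data('sample.inkml')            # 'sample.inkml' is a placeholder file name
cv2.imwrite('sample.png', traces_to_image(traces))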

Related

Isolating the Sentence in which a Term appears

I have the following script that does the following:
Extracts all text from a PowerPoint (all separated by a ":::")
Compares each term in my search term list to the text and isolates just those lines of text that contain one or more of the terms
Creates a dataframe of the term + the file in which that term appeared
Iterates through each PowerPoint for the given folder
I am hoping to adjust this to also include the specific sentence in which the term appears (i.e. the entire content between the ::: before and the ::: after the term).
import os
import pandas as pd
from pptx import Presentation

end = r'C:\Users\xxx\Table Lookup.xlsx'
rfps = r'C:\Users\xxx\Folder1'
ls = os.listdir(rfps)
ppt = [s for s in ls if '.ppt' in s]

files = []
text = []
for p in ppt:
    try:
        prs_text = []
        prs = Presentation(os.path.join(rfps, p))
        for slide in prs.slides:
            for shape in slide.shapes:
                if hasattr(shape, "text"):
                    prs_text.append(shape.text)
        prs_text = ':::'.join(prs_text)
        files.append(p)
        text.append(prs_text)
    except:
        print("Failed: " + str(p))

agg = pd.DataFrame()
agg['File'] = files
agg['Unstructured'] = text
agg['Unstructured'] = agg['Unstructured'].str.lower()
terms = ['test', 'testing']
a = [(x, z, i) for x, z, y in zip(agg['File'], agg['Unstructured'], agg['Unstructured']) for i in terms if i in y]
# how do I also include the sentence where this term appears?
onepager = pd.DataFrame(a, columns=['File', 'Unstructured', 'Term'])  # will need to add a column here
onepager = onepager.drop_duplicates(keep="first")
1 line sample of agg:
File | Unstructured
File1.pptx | competitive offerings:::real-time insights and analyses for immediate use:::disruptive “moves”:::deeper strategic insights through analyses generated and assessed over time:::launch new business models:::enter new markets::::::::::::internal data:::external data:::advanced computing capabilities:::insights & applications::::::::::::::::::machine learning
write algorithms that continue to “learn” or test and improve themselves as they ingest data and identify patterns:::natural language processing
allow interactions between computers and human languages using voice and/or text. machines directly interact, analyze, understand, and reproduce information:::intelligent automation
Adjustment based on input:
onepager = pd.DataFrame(a, columns=['File', 'Unstructured', 'Term'])
for t in terms:
    onepager['Sentence'] = onepager["Unstructured"].apply(lambda x: x[x.rfind(":::", 0, x.find(t)) + 3: x.find(":::", x.find(t)) - 3])
To find the sentence containing the word "test", try:
>>> agg["Unstructured"].apply(lambda x: x[x.rfind(":::", 0, x.find("test"))+3: x.find(":::",x.find("test"))-3])
Looping through your terms:
onepager = pd.DataFrame(a, columns=['File', 'Unstructured', 'Term'])
for t in terms:
    onepager[t] = onepager["Unstructured"].apply(lambda x: x[x.rfind(":::", 0, x.find(t)) + 3: x.find(":::", x.find(t)) - 3])
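Alternatively, a small sketch of a split-based variant (my own suggestion, not part of the answer above) that collects every ':::'-delimited segment containing the term and avoids the index arithmetic:

def segments_with_term(unstructured, term, sep=":::"):
    # Return all sep-delimited segments that contain the term
    return [seg for seg in unstructured.split(sep) if term in seg]

for t in terms:
    onepager[t + '_sentences'] = onepager["Unstructured"].apply(
        lambda x, t=t: "; ".join(segments_with_term(x, t)))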

Mapping of two text documents with python

I have annotated some textual data and now I am trying to map it with the original text file to get more information out.
I have all information of the annotations in a JSON file, from which I successfully parsed all the relevant information. I stored the information as seen below.
Column = entity class
Column = starting point of the text
Column = length of the text (in char)
Column = value of entity label
Column = actual text that was annotated
My goal now is to include the non-annotated text as well. Not every sentence or character of a text document has been annotated, but I want to include them so I can feed all the information into a DL algorithm. So every sentence that has not been annotated should be included, showing "None" as the entity class and entity label.
Appreciate any hint or help on that!
Thanks!
The information in your annotation file is not quite accurate: since you stripped out whitespace, the stored lengths no longer match the original text and have to be adjusted accordingly.
def map_with_text(data_file, ann_file, out_file):
    annots = []
    # Read annotation information
    with open(ann_file, 'r') as file_in:
        for line in file_in:
            components = line.split("\t")
            label = components[0]
            begin = int(components[1])
            length = int(components[2])
            f_4 = int(components[3])
            f_5 = int(components[4])
            text = components[5].strip()
            annots.append((label, begin, length, f_4, f_5, text))
    annots = sorted(annots, key=lambda c: c[1])

    # Read text data
    with open(data_file, 'r') as original:
        original_text = original.read()
    length_original = len(original_text)

    # Get positions of text already annotated. Since the annotated text was
    # stripped, we cannot rely on the stored length. You can switch back to it
    # if you think that information is accurate.
    # pos_tup = [(begin, begin + length)
    #            for _, begin, length, _, _, text in annots]
    pos_tup = [(begin, begin + len(text))
               for _, begin, length, _, _, text in annots]

    # Position markers: document start, every annotation boundary, document end
    pos_marker = [0] + [e for l in pos_tup for e in l] + [length_original]

    # Ranges of positions of text which have not been annotated
    not_ann_pos = [(x, y)
                   for x, y in zip(pos_marker[::2], pos_marker[1::2])]

    # Texts which have not been annotated
    not_ann_txt = [original_text[start:stop]
                   for start, stop in not_ann_pos]

    # Include the non-annotated pieces, with None as entity class and label
    all_components = [(None, start, len(txt.strip()), None, None, txt.strip())
                      for start, txt in zip(pos_marker[::2], not_ann_txt) if len(txt.strip()) != 0]

    # Add annotated information
    all_components += annots

    # Sort by the start index
    all_components = sorted(all_components, key=lambda c: c[1])

    # Write to the output file
    with open(out_file, 'w') as f:
        for a in all_components:
            f.write(str(a[0]) + "\t" + str(a[1]) + "\t" + str(a[2]) +
                    "\t" + str(a[3]) + "\t" + str(a[4]) + "\t" + str(a[5]) + "\n")


map_with_text('0.txt', '0.ann', 'out0.tsv')
# You can call the function in a loop to process several files.
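For reference, a tiny made-up example of what the function expects and produces, based only on how it reads and writes the tab-separated fields (file names, labels, and numbers are hypothetical):

# 0.txt (original text):            "Barack Obama visited Paris."
# 0.ann (one tab-separated line):   PERSON  0   12  1   1   Barack Obama
# out0.tsv then contains one line per annotated or non-annotated span:
#   PERSON  0   12  1     1     Barack Obama
#   None    12  14  None  None  visited Paris.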

Date recognition in text - Latin

I am working on some Latin texts that contain dates and have been using various regex patterns and rule-based statements to extract them. I was wondering if I can train an algorithm to extract these dates instead of the method I am currently using. Thanks
This is an extract of my algorithm:
def checkLatinDates(i, record, no):
    if(i == 0 and isNumber(record[i])):  # get deed no
        df.loc[no, 'DeedNo'] = record[i]
    rec = record[i].lower()
    split = rec.split()
    if(split[0] == 'die'):
        items = deque(split)
        items.popleft()
        split = list(items)
    if('eodem' in rec):
        n = no - 1
        if(no > 1):
            while (pd.isnull(df.ix[n]['LatinDate'])):
                n = n - 1
                print n
            df['LatinDate'][no] = df.ix[n]['LatinDate']
    if(words_in_string(latinMonths, rec.lower()) and len(split) < 10):
        if not (dates.loc[dates['Latin'] == split[0], 'Number'].empty):
            day = dates.loc[dates['Latin'] == split[0], 'Number'].iloc[0]
            split[0] = day
            nd = ' '.join(map(str, split))
            df['LatinDate'][no] = nd
        elif(convertArabic(split[0]) != ''):
            day = convertArabic(split[0])
            split[0] = day
            nd = ' '.join(map(str, split))
            df['LatinDate'][no] = nd
You could use a machine learning algorithm like AdaBoost with IOB tagging,
adding some context features, like the type of word, a regex flag indicating whether it is obviously a date, the types of the surrounding words, etc.
Here is a tutorial.
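To make that concrete, here is a minimal sketch using scikit-learn (my own illustration; the feature set, the tiny hand-labelled sample, and all names are made up): each token gets a feature dictionary and an IOB label (B-DATE / I-DATE / O), and a per-token classifier such as AdaBoost is trained on them.

import re
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline

def token_features(tokens, i):
    # Context features for token i: the word itself, its neighbours,
    # and simple shape/regex flags that hint at a date component
    word = tokens[i]
    return {
        'word': word.lower(),
        'is_digit': word.isdigit(),
        'is_roman': bool(re.match(r'^[ivxlcdm]+$', word.lower())),
        'prev': tokens[i - 1].lower() if i > 0 else '<s>',
        'next': tokens[i + 1].lower() if i < len(tokens) - 1 else '</s>',
    }

# Tiny hand-labelled training sentence in IOB format (made up)
sentence = ['solvit', 'die', 'tertio', 'mensis', 'januarii']
labels = ['O', 'B-DATE', 'I-DATE', 'I-DATE', 'I-DATE']

X = [token_features(sentence, i) for i in range(len(sentence))]
model = make_pipeline(DictVectorizer(sparse=False), AdaBoostClassifier())
model.fit(X, labels)

test = ['factum', 'die', 'quarto', 'mensis', 'maii']
print(model.predict([token_features(test, i) for i in range(len(test))]))

In practice you would label many sentences and probably prefer a sequence model (e.g. a CRF), but the IOB-plus-context-features setup stays the same.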

A Solution for Extracting Tabular Data from a PDF file (sort-of)

I had a need to extract tabular data on a large number of pages from many PDF documents. Using the built-in text export capability from within Adobe’s Acrobat Reader was useless – text extracted that way loses the spatial relationships established by the tables. There have been a number of questions raised by others, and many solutions offered for this problem that I had tried, but the results varied between poor and terrible. So I set about to develop my own solution. It’s developed enough (I think) that it’s ready to share here.
I first tried to look at the distribution of text (in terms of their x & y locations on the page) to try and identify where the row and column breaks are located. By using the Python Module ‘pdfminer’, I extracted the text and BoundingBox parameters, sifted through each piece of text and mapped how many pieces of text were on a page for a given x or y value. The idea was to look through the distribution of text (horizontally for row breaks, and vertically for column breaks), and when the density was zero (meaning there was a clear gap across, or up/down, the table), that would identify a row or column break.
The idea does work, but only sometimes. It assumes the table has the same number and alignment of cells vertically and horizontally (a simple grid), and that there is a distinct gap between the text of adjacent cells. Also, if there’s text that spans across multiple columns (like a title above the table, a footer below the table, merged cells, etc.), identification of column breaks is more difficult – you might be able to identify which text elements above or below the table should be ignored, but I couldn’t find a good approach for dealing with merged cells.
When it came time to look horizontally to identify row breaks, there were several other challenges. First, pdfminer automatically tries to group pieces of text that are located near each other, even when they span more than one cell in the table. In those instances, the BoundingBox for that text object includes multiple lines, obscuring any row breaks that might have been crossed. Even if every line of text were extracted separately, the challenge would be to distinguish what was a normal space separating consecutive lines of text, and what was a row break.
After exploring various work-arounds and conducting a number of tests, I decided to step back and try another approach.
The tables that had the data I needed to extract all have borders around them, so I reasoned I should be able to find the elements in the PDF file that draws those lines. However, when I looked at the elements that I could extract from the source file, I got some surprising results.
You would think that lines would be represented as a “line object”, but you’d be wrong (at least for the files I was looking at). If they aren’t “lines”, then maybe they simply draw rectangles for each cell, adjusting the linewidth attribute to get the line thickness they wanted, right? No. It turned out the lines were actually drawn as “rectangle objects” with a very small dimension (narrow width to create vertical lines, or short height to create horizontal lines). And where it looks like the lines meet at the corners, the rectangles don’t – there is a very small rectangle to fill in each gap.
Once I was able to recognize what to look for, I then had to contend with multiple rectangles placed adjacent to one another to create thick lines. Ultimately, I wrote a routine to group similar values and calculate an average value to use for the row and column breaks that I would use later.
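To illustrate that grouping step in isolation (the full listing below does the same job in its add_new_value and condense_list routines), here is a small stand-alone sketch with made-up values:

def group_breaks(values, tol=3):
    # Cluster values that lie within 'tol' of each other and
    # return the rounded average of each cluster
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    return [int(round(sum(g) / float(len(g)))) for g in groups]

# Midpoints of several adjacent thin rectangles forming two vertical lines
print(group_breaks([71.8, 72.4, 73.1, 210.0, 210.6]))   # -> [72, 210]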
Now, it was a matter of processing the text from the table. I chose to use an SQLite database to store, analyze, and regroup the text from the PDF file. I know there are other “pythonic” options out there, and some may find those approaches more familiar and easy to use, but I felt that the amount of data I would be dealing with would be best handled using an actual database file.
As I mentioned earlier, pdfminer groups text that is located near one another, and it may cross cell boundaries. An initial attempt to split pieces of text shown on separate lines within one of these text groups was only partially successful; it’s one of the areas that I intend to develop further (namely, how to bypass the pdfminer LTTextbox routine so I can get the pieces individually).
There is another shortcoming of the pdfminer module when it comes to vertical text. I have been unable to identify any attribute that indicates when text is vertical, or at what angle (e.g., +90 or -90 degrees) the text is displayed. The text grouping routine doesn’t seem to know either: for text rotated +90 degrees (i.e., rotated CCW so the letters are read from the bottom up), it concatenates the letters in reverse order, separated by newline characters.
The routine below works fairly well, under the circumstances. I know it’s still rough, there are several enhancements to be made, and it’s not packaged in a way that’s ready for widespread distribution, but it seems to have “broken the code” on how to extract tabular data from a PDF file (for the most part). Hopefully, others may be able to use this for their own purposes, and maybe even improve it.
I welcome any ideas, suggestions, or recommendations that you may have.
EDIT: I posted a revised version that includes additional parameters (cell_htol_up, etc.) to help "tune" the algorithm as to which pieces of text belong to a particular cell in the table.
# This was written for use w/Python 2. Use w/Python 3 hasn't been tested & proper execution is not guaranteed.
import os # Library of Operating System routines
import sys # Library of System routines
import sqlite3 # Library of SQLite dB routines
import re # Library for Regular Expressions
import csv # Library to output as Comma Separated Values
import codecs # Library of text Codec types
import cStringIO # Library of String manipulation routines
from pdfminer.pdfparser import PDFParser # Library of PDF text extraction routines
from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTFigure, LTImage, LTLine, LTRect, LTTextBoxVertical
from pdfminer.converter import PDFPageAggregator
########################################################################################################################
def add_new_value(new_value, list_values=[]):
    # Used to exclude duplicate values in a list
    not_in_list = True
    for list_value in list_values:
        # if list_value == new_value:
        if abs(list_value - new_value) < 1:
            not_in_list = False
    if not_in_list:
        list_values.append(new_value)
    return list_values
########################################################################################################################
def condense_list(list_values, grp_tolerance=1):
    # Group values & eliminate duplicate/close values
    tmp_list = []
    for n, list_value in enumerate(list_values):
        if sum(1 for val in tmp_list if abs(val - list_values[n]) < grp_tolerance) == 0:
            tmp_val = sum(list_values[n] for val in list_values if abs(val - list_values[n]) < grp_tolerance) / \
                      sum(1 for val in list_values if abs(val - list_values[n]) < grp_tolerance)
            tmp_list.append(int(round(tmp_val)))
    return tmp_list
########################################################################################################################
class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, quotechar='"', quoting=csv.QUOTE_ALL, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([unicode(s).encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
########################################################################################################################
# In case a connection to the database can't be created, set 'conn' to 'None'
conn = None
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Define variables for use later on
#_______________________________________________________________________________________________________________________
sqlite_file = "pdf_table_text.sqlite" # Name of the sqlite database file
brk_tol = 3 # Tolerance for grouping LTRect values as line break points
# *** This may require tuning to get optimal results ***
cell_htol_lf = -2 # Horizontal & Vertical tolerances (up/down/left/right)
cell_htol_rt = 2 # for over-scanning table cell bounding boxes
cell_vtol_up = 8 # i.e., how far outside cell bounds to look for text to include
cell_vtol_dn = 0 # *** This may require tuning to get optimal results ***
replace_newlines = True # Switch for replacing newline codes (\n) with spaces
replace_multspaces = True # Switch for replacing multiple spaces with a single space
# txt_concat_str = "' '" # Concatenate cell data with a single space
txt_concat_str = "char(10)" # Concatenate cell data with a line feed
#=======================================================================================================================
# Default values for sample input & output files (path, filename, pagelist, etc.)
filepath = "" # Path of the source PDF file (default = current folder)
srcfile = "" # Name of the source PDF file (quit if left blank)
pagelist = [1, ] # Pages to extract table data (Make an interactive input?)
# --> THIS MUST BE IN THE FORM OF A LIST OR TUPLE!
#=======================================================================================================================
# Impose required conditions & abort execution if they're not met
# Should check if files are locked: sqlite database, input & output files, etc.
if filepath + srcfile == "" or pagelist == None:
    print "Source file not specified and/or page list is blank! Execution aborted!"
    sys.exit()
dmp_pdf_data = "pdf_data.csv"
dmp_tbl_data = "tbl_data.csv"
destfile = srcfile[:-3]+"csv"
#=======================================================================================================================
# First test to see if this file already exists & delete it if it does
if os.path.isfile(sqlite_file):
    os.remove(sqlite_file)
#=======================================================================================================================
try:
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Open or Create the SQLite database file
    #___________________________________________________________________________________________________________________
    print "-" * 120
    print "Creating SQLite Database & working tables ..."
    # Connecting to the database file
    conn = sqlite3.connect(sqlite_file)
    curs = conn.cursor()
    qry_create_table = "CREATE TABLE {tn} ({nf} {ft} PRIMARY KEY)"
    qry_alter_add_column = "ALTER TABLE {0} ADD COLUMN {1}"
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Create 1st Table
    #___________________________________________________________________________________________________________________
    tbl_pdf_elements = "tbl_pdf_elements"   # Name of the 1st table to be created
    new_field = "idx"                       # Name of the index column
    field_type = "INTEGER"                  # Column data type
    # Delete the table if it exists so old data is cleared out
    curs.execute("DROP TABLE IF EXISTS " + tbl_pdf_elements)
    # Create output table for PDF text w/1 column (index) & set it as PRIMARY KEY
    curs.execute(qry_create_table.format(tn=tbl_pdf_elements, nf=new_field, ft=field_type))
    # Table fields: index, text_string, pg, x0, y0, x1, y1, orient
    cols = ("'pdf_text' TEXT",
            "'pg' INTEGER",
            "'x0' INTEGER",
            "'y0' INTEGER",
            "'x1' INTEGER",
            "'y1' INTEGER",
            "'orient' INTEGER")
    # Add other columns
    for col in cols:
        curs.execute(qry_alter_add_column.format(tbl_pdf_elements, col))
    # Committing changes to the database file
    conn.commit()
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Create 2nd Table
    #___________________________________________________________________________________________________________________
    tbl_table_data = "tbl_table_data"       # Name of the 2nd table to be created
    new_field = "idx"                       # Name of the index column
    field_type = "INTEGER"                  # Column data type
    # Delete the table if it exists so old data is cleared out
    curs.execute("DROP TABLE IF EXISTS " + tbl_table_data)
    # Create output table for Table Data w/1 column (index) & set it as PRIMARY KEY
    curs.execute(qry_create_table.format(tn=tbl_table_data, nf=new_field, ft=field_type))
    # Table fields: index, text_string, pg, row, column
    cols = ("'tbl_text' TEXT",
            "'pg' INTEGER",
            "'row' INTEGER",
            "'col' INTEGER")
    # Add other columns
    for col in cols:
        curs.execute(qry_alter_add_column.format(tbl_table_data, col))
    # Committing changes to the database file
    conn.commit()
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Start PDF text extraction code here
    #___________________________________________________________________________________________________________________
    print "Opening PDF file & preparing for text extraction:"
    print " -- " + filepath + srcfile
    # Open a PDF file.
    fp = open(filepath + srcfile, "rb")
    # Create a PDF parser object associated with the file object.
    parser = PDFParser(fp)
    # Create a PDF document object that stores the document structure.
    # Supply the password for initialization (if needed)
    # document = PDFDocument(parser, password)
    document = PDFDocument(parser)
    # Check if the document allows text extraction. If not, abort.
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    # Create a PDF resource manager object that stores shared resources.
    rsrcmgr = PDFResourceManager()
    # Create a PDF device object.
    device = PDFDevice(rsrcmgr)
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Set parameters for analysis.
    laparams = LAParams()
    # Create a PDF page aggregator object.
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Extract text & location data from PDF file (examine & process only pages in the page list)
    #___________________________________________________________________________________________________________________
    # Initialize variables
    idx1 = 0
    idx2 = 0
    lastpg = max(pagelist)
    print "Starting text extraction ..."
    qry_insert_pdf_txt = "INSERT INTO " + tbl_pdf_elements + " VALUES(?, ?, ?, ?, ?, ?, ?, ?)"
    qry_get_pdf_txt = "SELECT group_concat(pdf_text, " + txt_concat_str + \
        ") FROM {0} WHERE pg=={1} AND x0>={2} AND x1<={3} AND y0>={4} AND y1<={5} ORDER BY y0 DESC, x0 ASC;"
    qry_insert_tbl_data = "INSERT INTO " + tbl_table_data + " VALUES(?, ?, ?, ?, ?)"
    # Process each page contained in the document.
    for i, page in enumerate(PDFPage.create_pages(document)):
        interpreter.process_page(page)
        # Get the LTPage object for the page.
        lt_objs = device.get_result()
        pg = device.pageno - 1   # Must subtract 1 to correct 'pageno'
        # Exit the loop if past last page to parse
        if pg > lastpg:
            break
        #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        # If it finds a page in the pagelist, process the contents
        if pg in pagelist:
            print "- Processing page {0} ...".format(pg)
            xbreaks = []
            ybreaks = []
            #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            # Iterate thru list of pdf layout elements (LT* objects) then capture the text & attributes of each
            for lt_obj in lt_objs:
                # Examine LT objects & get parameters for text strings
                if isinstance(lt_obj, LTTextBox) or isinstance(lt_obj, LTTextLine):
                    # Increment index
                    idx1 += 1
                    # Assign PDF LTText object parameters to variables
                    pdftext = lt_obj.get_text()   # Need to convert escape codes & unicode characters!
                    pdftext = pdftext.strip()     # Remove leading & trailing whitespaces
                    # Save integer bounding box coordinates: round down @ start, round up @ end
                    # (x0, y0, x1, y1) = lt_obj.bbox
                    x0 = int(lt_obj.bbox[0])
                    y0 = int(lt_obj.bbox[1])
                    x1 = int(lt_obj.bbox[2] + 1)
                    y1 = int(lt_obj.bbox[3] + 1)
                    orient = 0   # What attribute gets this value?
                    #---- These approaches don't work for identifying vertical text ... --------------------------------
                    # orient = lt_obj.rotate
                    # orient = lt_obj.char_disp
                    # if lt_obj.get_writing_mode == "tb-rl":
                    #     orient = 90
                    # if isinstance(lt_obj, LTTextBoxVertical):   # vs LTTextBoxHorizontal
                    #     orient = 90
                    # if LAParams(lt_obj).detect_vertical:
                    #     orient = 90
                    #---------------------------------------------------------------------------------------------------
                    # Split text strings at line feeds
                    if "\n" in pdftext:
                        substrs = pdftext.split("\n")
                        lineheight = (y1 - y0) / (len(substrs) + 1)
                        # y1 = y0 + lineheight
                        y0 = y1 - lineheight
                        for substr in substrs:
                            substr = substr.strip()   # Remove leading & trailing whitespaces
                            if substr != "":
                                # Insert values into tuple for uploading into dB
                                pdf_txt_export = [(idx1, substr, pg, x0, y0, x1, y1, orient)]
                                # Insert values into dB
                                curs.executemany(qry_insert_pdf_txt, pdf_txt_export)
                                conn.commit()
                                idx1 += 1
                                # y0 = y1
                                # y1 = y0 + lineheight
                                y1 = y0
                                y0 = y1 - lineheight
                    else:
                        # Insert values into tuple for uploading into dB
                        pdf_txt_export = [(idx1, pdftext, pg, x0, y0, x1, y1, orient)]
                        # Insert values into dB
                        curs.executemany(qry_insert_pdf_txt, pdf_txt_export)
                        conn.commit()
                elif isinstance(lt_obj, LTLine):
                    # LTLine - Lines drawn to define tables
                    pass
                elif isinstance(lt_obj, LTRect):
                    # LTRect - Borders drawn to define tables
                    # Grab the lt_obj.bbox values
                    x0 = round(lt_obj.bbox[0], 2)
                    y0 = round(lt_obj.bbox[1], 2)
                    x1 = round(lt_obj.bbox[2], 2)
                    y1 = round(lt_obj.bbox[3], 2)
                    xmid = round((x0 + x1) / 2, 2)
                    ymid = round((y0 + y1) / 2, 2)
                    # rectline = lt_obj.linewidth
                    # If width less than tolerance, assume it's used as a vertical line
                    if (x1 - x0) < brk_tol:     # Vertical Line or Corner
                        xbreaks = add_new_value(xmid, xbreaks)
                    # If height less than tolerance, assume it's used as a horizontal line
                    if (y1 - y0) < brk_tol:     # Horizontal Line or Corner
                        ybreaks = add_new_value(ymid, ybreaks)
                elif isinstance(lt_obj, LTImage):
                    # An image, so do nothing
                    pass
                elif isinstance(lt_obj, LTFigure):
                    # LTFigure objects are containers for other LT* objects which shouldn't matter, so do nothing
                    pass
            col_breaks = condense_list(xbreaks, brk_tol)   # Group similar values & eliminate duplicates
            row_breaks = condense_list(ybreaks, brk_tol)
            col_breaks.sort()
            row_breaks.sort()
            #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            # Regroup the text into table 'cells'
            #___________________________________________________________________________________________________________
            print " -- Text extraction complete. Grouping data for table ..."
            row_break_prev = 0
            col_break_prev = 0
            table_data = []
            table_rows = len(row_breaks)
            for i, row_break in enumerate(row_breaks):
                if row_break_prev == 0:     # Skip the rest the first time thru
                    row_break_prev = row_break
                else:
                    for j, col_break in enumerate(col_breaks):
                        if col_break_prev == 0:     # Skip query the first time thru
                            col_break_prev = col_break
                        else:
                            # Run query to get all text within cell lines (+/- htol & vtol values)
                            curs.execute(qry_get_pdf_txt.format(tbl_pdf_elements, pg, col_break_prev + cell_htol_lf, \
                                col_break + cell_htol_rt, row_break_prev + cell_vtol_dn, row_break + cell_vtol_up))
                            rows = curs.fetchall()   # Retrieve all rows
                            for row in rows:
                                if row[0] != None:   # Skip null results
                                    idx2 += 1
                                    table_text = row[0]
                                    if replace_newlines:      # Option - Replace newline codes (\n) with spaces
                                        table_text = table_text.replace("\n", " ")
                                    if replace_multspaces:    # Option - Replace multiple spaces w/single space
                                        table_text = re.sub(" +", " ", table_text)
                                    table_data.append([idx2, table_text, pg, table_rows - i, j])
                            col_break_prev = col_break
                    row_break_prev = row_break
            curs.executemany(qry_insert_tbl_data, table_data)
            conn.commit()
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Export the regrouped table data:
    # Determine the number of columns needed for the output file
    # -- Should the data be extracted all at once or one page at a time?
    print "Saving exported table data ..."
    qry_col_count = "SELECT MIN([col]) AS colmin, MAX([col]) AS colmax, MIN([row]) AS rowmin, MAX([row]) AS rowmax, " + \
        "COUNT([row]) AS rowttl FROM [{0}] WHERE [pg] = {1} AND [tbl_text]!=' ';"
    qry_sql_export = "SELECT * FROM [{0}] WHERE [pg] = {1} AND [row] = {2} AND [tbl_text]!=' ' ORDER BY [col];"
    f = open(filepath + destfile, "wb")
    writer = UnicodeWriter(f)
    for pg in pagelist:
        curs.execute(qry_col_count.format(tbl_table_data, pg))
        rows = curs.fetchall()
        if len(rows) > 1:
            print "Error retrieving row & column counts! More than one record returned!"
            print " -- ", qry_col_count.format(tbl_table_data, pg)
            print rows
            sys.exit()
        for row in rows:
            (col_min, col_max, row_min, row_max, row_ttl) = row
            # Insert a page separator
            writer.writerow(["Data for Page {0}:".format(pg), ])
            if row_ttl == 0:
                writer.writerow(["Unable to export text from PDF file. No table structure found.", ])
            else:
                k = 0
                for j in range(row_min, row_max + 1):
                    curs.execute(qry_sql_export.format(tbl_table_data, pg, j))
                    rows = curs.fetchall()
                    if rows == None:    # No records match the given criteria
                        pass
                    else:
                        i = 1
                        k += 1
                        column_data = [k, ]     # 1st column as an Index
                        for row in rows:
                            (idx, tbl_text, pg_num, row_num, col_num) = row
                            if pg_num != pg:    # Exit the loop if Page # doesn't match
                                break
                            while i < col_num:
                                column_data.append("")
                                i += 1
                                if i >= col_num or i == col_max: break
                            column_data.append(unicode(tbl_text))
                            i += 1
                        writer.writerow(column_data)
    f.close()
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Dump the SQLite regrouped data (for error checking):
    print "Dumping SQLite table of regrouped (table) text ..."
    qry_sql_export = "SELECT * FROM [{0}] WHERE [tbl_text]!=' ' ORDER BY [pg], [row], [col];"
    curs.execute(qry_sql_export.format(tbl_table_data))
    rows = curs.fetchall()
    # Output data with Unicode intact as CSV
    with open(dmp_tbl_data, "wb") as f:
        writer = UnicodeWriter(f)
        writer.writerow(["idx", "tbl_text", "pg", "row", "col"])
        writer.writerows(rows)
    f.close()
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Dump the SQLite temporary PDF text data (for error checking):
    print "Dumping SQLite table of extracted PDF text ..."
    qry_sql_export = "SELECT * FROM [{0}] WHERE [pdf_text]!=' ' ORDER BY pg, y0 DESC, x0 ASC;"
    curs.execute(qry_sql_export.format(tbl_pdf_elements))
    rows = curs.fetchall()
    # Output data with Unicode intact as CSV
    with open(dmp_pdf_data, "wb") as f:
        writer = UnicodeWriter(f)
        writer.writerow(["idx", "pdf_text", "pg", "x0", "y0", "x1", "y1", "orient"])
        writer.writerows(rows)
    f.close()
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    print "Conversion complete."
    print "-" * 120
except sqlite3.Error, e:
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Rollback the last database transaction if the connection fails
    #___________________________________________________________________________________________________________________
    if conn:
        conn.rollback()
    print "Error '{0}':".format(e.args[0])
    sys.exit(1)
finally:
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Close the connection to the database file
    #___________________________________________________________________________________________________________________
    if conn:
        conn.close()

removing iterated string from string array

I am writing a small script that lists the hard disks currently connected to my machine. I only need the disk identifier (disk0), not the partition IDs (disk0s1, disk0s2, etc.).
How can I iterate through an array that contains disk IDs and partition IDs and remove the partition-ID entries? Here's what I'm trying so far:
import os

allDrives = os.listdir("/dev/")
parsedDrives = []

def parseAllDrives():
    parsedDrives = []
    matching = []
    for driveName in allDrives:
        if 'disk' in driveName:
            parsedDrives.append(driveName)
        else:
            continue
    for itemName in parsedDrives:
        if len(parsedDrives) != 0:
            if 'rdisk' in itemName:
                parsedDrives.remove(itemName)
            else:
                continue
        else:
            continue
    #### this is where the problem starts: #####
    # iterate through possible partition identifiers
    for i in range(5):
        # create a string for the partitionID
        systemPostfix = 's' + str(i)
        matching.append(filter(lambda x: systemPostfix in x, parsedDrives))
    for match in matching:
        if match in parsedDrives:
            parsedDrives.remove(match)
            print("found a match and removed it")
    print("matched: %s" % matching)
    print(parsedDrives)

parseAllDrives()
That last bit is just the most recent thing I've tried. Definitely open to going a different route.
try beginning with
allDrives = os.listdir("/dev/")
disks = [drive for drive in allDrives if ('disk' in drive)]
then, given that disk identifiers are only 5 characters long,
short_disks = [disk[:5] for disk in disks]
unique_short_disks = list(set(short_disks))
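If the identifiers can be longer than five characters (disk10 and up), a regex-based variant may be more robust; this is my own sketch, not part of the answer above:

import os
import re

all_drives = os.listdir("/dev/")
# Keep whole-disk entries like 'disk0' or 'disk12'; drop partitions like
# 'disk0s1' and raw devices like 'rdisk0'
disk_pattern = re.compile(r"^disk\d+$")
disks = sorted(d for d in all_drives if disk_pattern.match(d))
print(disks)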
