I'm relatively new to coding and have a tricky (to me) logic determination sequence that I assume can be simplified from what I currently have. I have yet to be able to find something similar that I can comprehend at my current level of understanding and thus adapt accordingly.
I have a data frame containing a list of ~200 wells. Each well has a perforated casing that exposes it to groundwater over a range of depths (the open interval), and the length and depth of the open interval vary from well to well. Based on the open interval, I need to determine which layers within a groundwater model should be associated with each individual well (location within the model is given by row/column values) and then append that information to an array for export in text format that will be fed back into the model (I've got that part). Furthermore, the number of layers could potentially increase, so ideally the code could adapt to any number of layers. I would always have some form of the example data set below. If the number of layers increases, the data set with those new values would be provided by the model. The well open intervals will not change, so the layers that each well's open interval falls within would change.
Example dataset:
import pandas as pd
df = pd.DataFrame({'well_ID': ['GR800', 'HA009', 'HA219', 'HA323','HA463'],
'Top_open_int':[4450.0, 4530.0, 4390.0, 3900.0, 4140.0], #top of open interval
'Bot_open_int':[4110.0, 3800.0, 4250.0, 3750.0, 3650.0], #bottom of open interval
'Top_1':[4500.0, 4550.0, 4100.0, 4200.0, 4150.0], #top of layer 1
'Bot_1':[4300.0, 4250.0, 3900.0, 4050.0, 3900.0], #bottom of layer 1
'Bot_2':[4100.0, 3900.0, 3750.0, 3850.0, 3750.0], #bottom of layer 2
'Bot_3':[3820.0, 3650.0, 3520.0, 3650.0, 3570.0], #bottom of layer 3
'Bot_4':[3360.0, 3480.0, 3300.0, 3380.0, 3350.0]}) #bottom of layer 4
What I'm currently doing is something like the code below, where I write out every possible boundary condition combination that could exist. If the number of layers increases, I have to add all the additional possible combinations to the script.
Current script approach:
# initiate empty array
layers = []
# loop through all combinations and appending appropriate text if the interval matches
for well, row in df.iterrows():
    if row['Top_open_int'] >= row['Bot_1'] and row['Bot_open_int'] < row['Bot_3']:
        layers.append((row['well_ID'], '1', row['Row'], row['Column']))
        layers.append((row['well_ID'], '2', row['Row'], row['Column']))
        layers.append((row['well_ID'], '3', row['Row'], row['Column']))
        layers.append((row['well_ID'], '4', row['Row'], row['Column']))
    elif row['Top_open_int'] >= row['Bot_1'] and row['Bot_open_int'] < row['Bot_2']:
        layers.append((row['well_ID'], '1', row['Row'], row['Column']))
        layers.append((row['well_ID'], '2', row['Row'], row['Column']))
        layers.append((row['well_ID'], '3', row['Row'], row['Column']))
    elif row['Top_open_int'] >= row['Bot_1'] and row['Bot_open_int'] < row['Bot_1']:
        layers.append((row['well_ID'], '1', row['Row'], row['Column']))
        layers.append((row['well_ID'], '2', row['Row'], row['Column']))
    elif row['Top_open_int'] > row['Bot_1'] and row['Bot_open_int'] >= row['Bot_1']:
        layers.append((row['well_ID'], '1', row['Row'], row['Column']))
# script continues for all possible combinations that the open interval could
# potentially fall within. There doesn't seem to be a point in writing it all out here
If you run the above code you'll see what the expected outcome would be. It would be an array like this, but for all the wells in the data set:
[('GR800', '1', 20, 100),
('GR800', '2', 20, 100),
('HA009', '1', 45, 10),
('HA009', '2', 45, 10),
('HA009', '3', 45, 10),
('HA219', '1', 105, 65),
('HA463', '1', 250, 15),
('HA463', '2', 250, 15),
('HA463', '3', 250, 15)]
Is there a way to simplify this approach and make it more robust so that it can adapt to changes in the number of layers?
I don't know if it helps but these 3 lines will do the work of that for loop:
df[(df['Top_open_int'] >= df['Bot_1']) & (df['Bot_open_int'] >= df['Bot_3'])]
df[(df['Top_open_int'] >= df['Bot_1']) & (df['Bot_open_int'] >= df['Bot_2'])]
df[(df['Top_open_int'] >= df['Bot_1']) & (df['Bot_open_int'] >= df['Bot_1'])]
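To make this adapt to any number of layers, one option is to loop over the Bot_<n> columns instead of writing out every combination. What follows is only a minimal sketch under a few assumptions. The function name layers_for_wells is made up. It works on plain dict records (for a DataFrame, df.to_dict('records') would produce them). Like the conditions in the question, it treats layer 1 as having no upper limit, so an interval that sits above Top_1 still lands in layer 1. It also counts a layer as soon as the open interval overlaps it at all, using strict comparisons, so an interval that only touches a layer boundary is not counted.

```python
def layers_for_wells(records):
    """Return (well_ID, layer) tuples for every layer an open interval overlaps.

    `records` is a list of dicts with the keys well_ID, Top_open_int,
    Bot_open_int and any number of Bot_<n> keys (hypothetical helper).
    """
    out = []
    for rec in records:
        # Discover the layer numbers from the Bot_<n> keys,
        # so extra layers need no code change.
        layer_nums = sorted(
            int(key.split('_', 1)[1])
            for key in rec
            if key.startswith('Bot_') and key.split('_', 1)[1].isdigit()
        )
        layer_top = float('inf')  # assumption: layer 1 is open-ended upward
        for n in layer_nums:
            layer_bot = rec[f'Bot_{n}']
            # Overlap test: the interval top is above the layer bottom
            # and the interval bottom is below the layer top.
            if rec['Top_open_int'] > layer_bot and rec['Bot_open_int'] < layer_top:
                out.append((rec['well_ID'], str(n)))
            layer_top = layer_bot  # the next layer starts where this one ends
    return out


# Values copied from the example dataset in the question
records = [
    {'well_ID': 'GR800', 'Top_open_int': 4450.0, 'Bot_open_int': 4110.0,
     'Bot_1': 4300.0, 'Bot_2': 4100.0, 'Bot_3': 3820.0, 'Bot_4': 3360.0},
    {'well_ID': 'HA009', 'Top_open_int': 4530.0, 'Bot_open_int': 3800.0,
     'Bot_1': 4250.0, 'Bot_2': 3900.0, 'Bot_3': 3650.0, 'Bot_4': 3480.0},
    {'well_ID': 'HA219', 'Top_open_int': 4390.0, 'Bot_open_int': 4250.0,
     'Bot_1': 3900.0, 'Bot_2': 3750.0, 'Bot_3': 3520.0, 'Bot_4': 3300.0},
    {'well_ID': 'HA323', 'Top_open_int': 3900.0, 'Bot_open_int': 3750.0,
     'Bot_1': 4050.0, 'Bot_2': 3850.0, 'Bot_3': 3650.0, 'Bot_4': 3380.0},
    {'well_ID': 'HA463', 'Top_open_int': 4140.0, 'Bot_open_int': 3650.0,
     'Bot_1': 3900.0, 'Bot_2': 3750.0, 'Bot_3': 3570.0, 'Bot_4': 3350.0},
]
layers = layers_for_wells(records)
print(layers)
```

If the records also carry Row and Column values, they could be appended to each tuple in the same place where the well ID and layer number are added.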
Let me try to explain to the best of my ability, as I am not a Python wizard. I have used PyPDF2 to read a PDF table of data regarding covid-19 in Mexico and tokenized it, and got back a list of strings with a len of 16792, which is fine. (Long story: I tried doing it with tabula but did not get the format I was expecting, and I was going to spend more time reformatting the CSV document I got back than analyzing it.)
Now, the problem I am facing is that I need to format it appropriately by concatenating some (not all) of those strings together, so I can create a list of lists in which every inner list has the same length of 9 columns.
This is an example of how it looks right now, the columns are Case number, State, Locality, Gender, Age, Date when symptoms started, Status, Type of contagion, Date of arrival to Mexico:
['1', 'PUEBLA', 'PUEBLA', 'M', '49', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '2', 'GUERRERO', 'ZONA', 'NORTE', 'M', '29', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '3', 'BAJA', 'CALIFORNIA', 'TIJUANA', 'F', '34', '14/03/2020', 'Sospechoso', 'Estados', 'Unidos', '08/03/2020', '4', 'CIUDAD', 'DE', 'MÉXICO', 'TLALPAN', 'F', '69', '25/02/2020', 'Sospechoso', 'Italia', '03/03/2020', '5', 'JALISCO', 'CENTRO', 'GUADALAJARA', 'M', '19', '18/03/2020', 'Sospechoso', 'España', '17/03/2020'
What I would want is to get certain strings like 'ZONA', 'NORTE' as 'ZONA NORTE' or 'CIUDAD', 'DE', 'MEXICO' as 'CIUDAD DE MEXICO' or 'ESTADOS', 'UNIDOS' as 'ESTADOS UNIDOS'...
I seriously do not know how to tackle this. I have tried split(), replace(), trying to find the index of each frequency, read all the questions about manipulating lists, tried almost all the responses provided... and haven't been able to do it.
Any guidance will be greatly appreciated. Sorry if this is a very basic question, but I know there has to be a way, I just don't know it.
Since the phrases are not split the same way in each row, you need a way to "recognize" them. One way would be: if two list items that are next to each other both contain at least 3 letters in a row, join them.
import re
row_list = ['1', 'PUEBLA', 'PUEBLA', 'M', '49', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '2', 'GUERRERO', 'ZONA', 'NORTE', 'M', '29', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '3', 'BAJA', 'CALIFORNIA', 'TIJUANA', 'F', '34', '14/03/2020', 'Sospechoso', 'Estados', 'Unidos', '08/03/2020', '4', 'CIUDAD', 'DE', 'MÉXICO', 'TLALPAN', 'F', '69', '25/02/2020', 'Sospechoso', 'Italia', '03/03/2020', '5', 'JALISCO', 'CENTRO', 'GUADALAJARA', 'M', '19', '18/03/2020', 'Sospechoso', 'España', '17/03/2020']
words_longer_than_3 = r'([^\d\W]){3,}'
def is_a_word(text):
    return bool(re.findall(words_longer_than_3, text))
def get_next_item(row_list, i):
    try:
        next_item = row_list[i+1]
    except IndexError:
        return
    return is_a_word(next_item)
for i, item in enumerate(row_list):
    item_is_a_word = is_a_word(row_list[i])
    if not item_is_a_word:
        continue
    next_item_is_a_word = get_next_item(row_list, i)
    while next_item_is_a_word:
        row_list[i] += f' {row_list[i+1]}'
        del row_list[i+1]
        next_item_is_a_word = get_next_item(row_list, i)
print(row_list)
result:
['1', 'PUEBLA PUEBLA', 'M', '49', '15/03/2020', 'Sospechoso Contacto', 'NA', '2', 'GUERRERO ZONA NORTE', 'M', '29', '15/03/2020', 'Sospechoso Contacto', 'NA', '3', 'BAJA CALIFORNIA TIJUANA', 'F', '34', '14/03/2020', 'Sospechoso Estados Unidos', '08/03/2020', '4', 'CIUDAD', 'DE', 'MÉXICO TLALPAN', 'F', '69', '25/02/2020', 'Sospechoso Italia', '03/03/2020', '5', 'JALISCO CENTRO GUADALAJARA', 'M', '19', '18/03/2020', 'Sospechoso España', '17/03/2020']
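As the result shows, the length-based join also glues 'Sospechoso' to the type of contagion and leaves 'CIUDAD', 'DE' unjoined. Another possible approach is to walk the tokens record by record and use the fixed-shape fields as anchors. This is only a sketch, and it leans on assumptions taken from the sample in the question: every record starts with a case number, the gender token is exactly 'M' or 'F', the status is a single token, and the arrival field is either a dd/mm/yyyy date or 'NA'. The name group_records is made up. State and locality stay together in one field, because nothing in the tokens marks where one ends and the other begins.

```python
import re

# dd/mm/yyyy, as in the sample data
DATE = re.compile(r'\d{2}/\d{2}/\d{4}$')


def group_records(tokens):
    """Group a flat token list into records of
    [case, place, gender, age, symptoms_date, status, contagion, arrival]."""
    records = []
    i = 0
    while i < len(tokens):
        case = tokens[i]
        i += 1
        # Everything up to the gender token belongs to state + locality
        place = []
        while tokens[i] not in ('M', 'F'):
            place.append(tokens[i])
            i += 1
        # Four single-token fields follow
        gender, age, symptoms, status = tokens[i:i + 4]
        i += 4
        # Type of contagion runs until the arrival field (a date or 'NA')
        contagion = []
        while tokens[i] != 'NA' and not DATE.match(tokens[i]):
            contagion.append(tokens[i])
            i += 1
        arrival = tokens[i]
        i += 1
        records.append([case, ' '.join(place), gender, age, symptoms,
                        status, ' '.join(contagion), arrival])
    return records


tokens = ['2', 'GUERRERO', 'ZONA', 'NORTE', 'M', '29', '15/03/2020',
          'Sospechoso', 'Contacto', 'NA',
          '3', 'BAJA', 'CALIFORNIA', 'TIJUANA', 'F', '34', '14/03/2020',
          'Sospechoso', 'Estados', 'Unidos', '08/03/2020',
          '4', 'CIUDAD', 'DE', 'MÉXICO', 'TLALPAN', 'F', '69', '25/02/2020',
          'Sospechoso', 'Italia', '03/03/2020']
for record in group_records(tokens):
    print(record)
```

A truncated or malformed record would raise an IndexError or ValueError here, so real input would probably need some validation around the loop.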
I suppose that the data you want to process comes from a file similar to this one here, which contains 2623 rows x 8 columns.
You can load the data from the PDF file using tabula-py. You can install it through pip install tabula-py==1.3.0. One issue with extracting the table like this is that tabula-py can mess up things sometimes. For example, the header of the table was extracted from the PDF file like this:
"",,,,Identificación Fecha de,
"",,,,de COVID-19,Fecha del llegada a
N° Caso,Estado,Sexo,Edad,Inicio de Procedencia por RT-PCR en,México
"",,,,síntomas,
"",,,,tiempo real,
Pretty nasty huh?
In addition, tabula-py could not separate some of the columns, that is, it did not write a comma in the right place for the output CSV file to parse cleanly. For instance, the line with número de caso (case number) 8:
8,BAJA CALIFORNIA,F,65,13/03/2020 Sospechoso Estados Unidos,08/03/2020
could be fixed by just replacing " Sospechoso " by ",Sospechoso,". And you are lucky because this is the only parsing issue you will have to deal with for now. Thus, iterating through the lines of the output CSV file and replacing " Sospechoso " by ",Sospechoso," takes care of everything.
Finally, I added an option (removeaccents) for removing the accents from the data. This can help you avoid encoding issues in the future. For this you will need unidecode: pip install unidecode.
Putting everything together, the code that reads the PDF file and converts it to a CSV file and loads it as a pandas dataframe is as follows (You can download the preprocessed CSV file here):
import tabula
from tabula import wrapper
import pandas as pd
import unidecode
"""
1. Adds the header.
2. Skips "corrupted" lines from tabula.
3. Replacing " Sospechoso " with ",Sospechoso," automatically separates
   the previous column ("fecha_sintomas") from the next one ("procedencia").
4. Finally, eliminates accents to avoid encoding issues.
"""
def simplePrep(
    input_path,
    header = [
        "numero_caso",
        "estado",
        "sexo",
        "edad",
        "fecha_sintomas",
        "identification_tiempo_real",
        "procedencia",
        "fecha_llegada_mexico"
    ],
    lookfor = " Sospechoso ",
    replacewith = ",Sospechoso,",
    output_path = "preoprocessed.csv",
    skiprowsupto = 5,
    removeaccents = True
):
    fin = open(input_path, "rt")
    fout = open(output_path, "wt")
    fout.write(",".join(header) + "\n")
    count = 0
    for line in fin:
        if count > skiprowsupto - 1:
            if removeaccents:
                fout.write(unidecode.unidecode(line.replace(lookfor, replacewith)))
            else:
                fout.write(line.replace(lookfor, replacewith))
        count += 1
    fin.close()
    fout.close()
"""
Reads all the pdf pages specifying that the table spans multiple pages.
multiple_tables = True, otherwise the first row of each page will be missed.
"""
tabula.convert_into(
input_path = "data.pdf",
output_path = "output.csv",
output_format = "csv",
pages = 'all',
multiple_tables = True
)
simplePrep("output.csv", removeaccents = True)
# reads preprocess data set
df = pd.read_csv("preoprocessed.csv", header = 0)
# prints the first 5 samples in the dataframe
print(df.head(5))
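To see what the two clean-up steps do to a single line, here is a small standalone sketch. It does not use unidecode; the strip_accents helper is a made-up stand-in built on the standard library's unicodedata module, and it only drops combining marks, so it is not a full replacement for unidecode's transliteration.

```python
import unicodedata


def strip_accents(text):
    """Drop accents by decomposing characters and removing the combining marks
    (hypothetical stdlib stand-in for unidecode.unidecode)."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


# Line 8 exactly as tabula-py wrote it: the date and the next two columns are fused
line = "8,BAJA CALIFORNIA,F,65,13/03/2020 Sospechoso Estados Unidos,08/03/2020"
fixed = line.replace(" Sospechoso ", ",Sospechoso,")

print(fixed.split(","))                   # eight fields, matching the header
print(strip_accents("MÉXICO, España"))    # MEXICO, Espana
```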
I'm stuck at a point where I have to write multiple pandas DataFrames to a PDF file. The function accepts a DataFrame as input.
However, while I'm able to write to the PDF the first time, all subsequent calls overwrite the existing data, leaving only one DataFrame in the PDF at the end.
Please find the Python function below:
def fn_print_pdf(df):
    pp = PdfPages('Sample.pdf')
    total_rows, total_cols = df.shape
    rows_per_page = 30  # Number of rows per page
    rows_printed = 0
    page_number = 1
    while (total_rows > 0):
        fig = plt.figure(figsize=(8.5, 11))
        plt.gca().axis('off')
        matplotlib_tab = pd.tools.plotting.table(plt.gca(), df.iloc[rows_printed:rows_printed+rows_per_page],
                                                 loc='upper center', colWidths=[0.15]*total_cols)
        # Tabular styling
        table_props = matplotlib_tab.properties()
        table_cells = table_props['child_artists']
        for cell in table_cells:
            cell.set_height(0.024)
            cell.set_fontsize(12)
        # Header, Footer and Page Number
        fig.text(4.25/8.5, 10.5/11., "Sample", ha='center', fontsize=12)
        fig.text(4.25/8.5, 0.5/11., 'P'+str(page_number), ha='center', fontsize=12)
        pp.savefig()
        plt.close()
        # Update variables
        rows_printed += rows_per_page
        total_rows -= rows_per_page
        page_number += 1
    pp.close()
And I'm calling this function as ::
raw_data = {
'subject_id': ['1', '2', '3', '4', '5'],
'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
df_a = pd.DataFrame(raw_data, columns=['subject_id', 'first_name', 'last_name'])
fn_print_pdf(df_a)
raw_data = {
'subject_id': ['4', '5', '6', '7', '8'],
'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
df_b = pd.DataFrame(raw_data, columns=['subject_id', 'first_name', 'last_name'])
fn_print_pdf(df_b)
The PDF file is available at SamplePDF. As you can see, only the data from the second DataFrame is saved in the end. Is there a way to append to the same Sample.pdf in the second pass and so on, while still preserving the former data?
Your PDFs are being overwritten because you're creating a new PDF document every time you call fn_print_pdf(). You can try keeping your PdfPages instance open between function calls, and call pp.close() only after all your plots are written. For reference see this answer.
Another option is to write the PDFs to different files and use pyPDF to merge them, see this answer.
Edit: Here is some working code for the first approach.
Your function is modified to:
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages

def fn_print_pdf(df, pp):
    total_rows, total_cols = df.shape
    rows_per_page = 30  # Number of rows per page
    rows_printed = 0
    page_number = 1
    while (total_rows > 0):
        fig = plt.figure(figsize=(8.5, 11))
        plt.gca().axis('off')
        # pd.plotting.table is where current pandas keeps the old pd.tools.plotting.table
        matplotlib_tab = pd.plotting.table(plt.gca(), df.iloc[rows_printed:rows_printed+rows_per_page],
                                           loc='upper center', colWidths=[0.15]*total_cols)
        # Tabular styling; get_celld() returns the table's cells keyed by (row, col)
        for cell in matplotlib_tab.get_celld().values():
            cell.set_height(0.024)
            cell.set_fontsize(12)
        # Header, Footer and Page Number
        fig.text(4.25/8.5, 10.5/11., "Sample", ha='center', fontsize=12)
        fig.text(4.25/8.5, 0.5/11., 'P'+str(page_number), ha='center', fontsize=12)
        pp.savefig()
        plt.close()
        # Update variables
        rows_printed += rows_per_page
        total_rows -= rows_per_page
        page_number += 1
Call your function with:
pp = PdfPages('Sample.pdf')
fn_print_pdf(df_a,pp)
fn_print_pdf(df_b,pp)
pp.close()
I am working in Python using ReportLab. I need to generate a report in PDF format. The data is retrieved from the database and inserted into a table.
Here is the simple code:
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle
from reportlab.lib.units import inch
doc = SimpleDocTemplate("simple_table.pdf", pagesize=letter)
elements = []
data= [['00', '01', '02', '03', '04'],
['10', 'Here is large field retrieve from database', '12', '13', '14'],
['20', '21', '22', '23', '24'],
['30', '31', '32', 'Here is second value', '34']]
t=Table(data)
cellWidth = 1.9*inch
for x in range(5):
    t._argW[x] = cellWidth
elements.append(t)
doc.build(elements)
There are three issues:
1. Lengthy data in a cell overlaps the other cells in the row.
2. When I increase the column width manually, such as cellWidth = 2.9*inch, the page is not fully visible and does not scroll left to right.
3. I do not know how to wrap the data in a cell. I mean, if the data is large, it should continue on the next line in the same cell.
How can I solve this problem?
For starters, I would not set the column size as you did. Just pass Table the colWidths argument like this:
Table(data, colWidths=[1.9*inch] * 5)
Now to your problem. If you don't set the colWidths parameter, reportlab will do this for you and space the columns according to your data. If this is not what you want, you can encapsulate your data in Paragraphs, like Bertrand said. Here is a modified example of your code:
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import inch
styles = getSampleStyleSheet()
doc = SimpleDocTemplate("simple_table.pdf", pagesize=letter)
elements = []
data= [['00', '01', '02', '03', '04'],
['10', Paragraph('Here is large field retrieve from database', styles['Normal']), '12', '13', '14'],
['20', '21', '22', '23', '24'],
['30', '31', '32', 'Here is second value', '34']]
t=Table(data)
elements.append(t)
doc.build(elements)
I think you will get the idea.
I was facing the same problem today. I was looking for a solution where I could only resize a single column - with much different content length than other columns - and have reportlab do the rest for me.
This is what worked for me:
Table(data, colWidths=[1.9*inch] + [None] * (len(data[0]) - 1))
This specifies only the first column. But of course you can easily place the number somewhere in between the Nones as well
Hi, I was getting the same issue while resizing the content of a table, and then I found a solution. This solution might help you and resolve all three of the issues mentioned here.
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle
from reportlab.lib.units import inch
doc = SimpleDocTemplate("simple_table.pdf", pagesize=letter)
elements = []
data= [['00', '01', '02', '03', '04'],
['10', 'Here is large field retrieve from database', '12', '13', '14'],
['20', '21', '22', '23', '24'],
['30', '31', '32', 'Here is second value', '34']]
t=Table(data,colWidths=[1.9*inch]*5, rowHeights=[0.9*inch] *4)
#colWidth = size * number of columns
#rowHeights= size * number of rows
elements.append(t)
doc.build(elements)
My scenario is as follows: I have a table of data (handful of fields, less than a hundred rows) that I use extensively in my program. I also need this data to be persistent, so I save it as a CSV and load it on start-up. I choose not to use a database because every option (even SQLite) is an overkill for my humble requirement (also - I would like to be able to edit the values offline in a simple way, and nothing is simpler than notepad).
Assume my data looks as follows (in the file it's comma separated without titles, this is just an illustration):
Row | Name | Year | Priority
------------------------------------
1 | Cat | 1998 | 1
2 | Fish | 1998 | 2
3 | Dog | 1999 | 1
4 | Aardvark | 2000 | 1
5 | Wallaby | 2000 | 1
6 | Zebra | 2001 | 3
Notes:
Row may be a "real" value written to the file or just an auto-generated value that represents the row number. Either way it exists in memory.
Names are unique.
Things I do with the data:
Look-up a row based on either ID (iteration) or name (direct access).
Display the table in different orders based on multiple field: I need to sort it e.g. by Priority and then Year, or Year and then Priority, etc.
I need to count instances based on sets of parameters, e.g. how many rows have their year between 1997 and 2002, or how many rows are in 1998 and priority > 2, etc.
I know this "cries" for SQL...
I'm trying to figure out what's the best choice for data structure. Following are several choices I see:
List of row lists:
a = []
a.append( [1, "Cat", 1998, 1] )
a.append( [2, "Fish", 1998, 2] )
a.append( [3, "Dog", 1999, 1] )
...
List of column lists (there will obviously be an API for add_row etc):
a = []
a.append( [1, 2, 3, 4, 5, 6] )
a.append( ["Cat", "Fish", "Dog", "Aardvark", "Wallaby", "Zebra"] )
a.append( [1998, 1998, 1999, 2000, 2000, 2001] )
a.append( [1, 2, 1, 1, 1, 3] )
Dictionary of columns lists (constants can be created to replace the string keys):
a = {}
a['ID'] = [1, 2, 3, 4, 5, 6]
a['Name'] = ["Cat", "Fish", "Dog", "Aardvark", "Wallaby", "Zebra"]
a['Year'] = [1998, 1998, 1999, 2000, 2000, 2001]
a['Priority'] = [1, 2, 1, 1, 1, 3]
Dictionary with keys being tuples of (Row, Field):
# Create constants to avoid string searching
NAME=1
YEAR=2
PRIORITY=3
a={}
a[(1, NAME)] = "Cat"
a[(1, YEAR)] = 1998
a[(1, PRIORITY)] = 1
a[(2, NAME)] = "Fish"
a[(2, YEAR)] = 1998
a[(2, PRIORITY)] = 2
...
And I'm sure there are other ways... However each way has disadvantages when it comes to my requirements (complex ordering and counting).
What's the recommended approach?
EDIT:
To clarify, performance is not a major issue for me. Because the table is so small, I believe almost every operation will be in the range of milliseconds, which is not a concern for my application.
Having a "table" in memory that needs lookups, sorting, and arbitrary aggregation really does call out for SQL. You said you tried SQLite, but did you realize that SQLite can use an in-memory-only database?
connection = sqlite3.connect(':memory:')
Then you can create/drop/query/update tables in memory with all the functionality of SQLite and no files left over when you're done. And as of Python 2.5, sqlite3 is in the standard library, so it's not really "overkill" IMO.
Here is a sample of how one might create and populate the database:
import csv
import sqlite3
db = sqlite3.connect(':memory:')
def init_db(cur):
    cur.execute('''CREATE TABLE foo (
        Row INTEGER,
        Name TEXT,
        Year INTEGER,
        Priority INTEGER)''')
def populate_db(cur, csv_fp):
    rdr = csv.reader(csv_fp)
    cur.executemany('''
        INSERT INTO foo (Row, Name, Year, Priority)
        VALUES (?,?,?,?)''', rdr)
cur = db.cursor()
init_db(cur)
populate_db(cur, open('my_csv_input_file.csv'))
db.commit()
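Once the table is populated, the lookups, orderings and counts from the question could be written as plain queries. The sketch below is self-contained rather than a continuation of the code above: it inserts the six rows from the question by hand instead of reading the CSV, and it quotes the "Row" column name as a precaution because ROW is also an SQL keyword. Adding Name as the last sort key is my own choice, so that ties come out in a fixed order.

```python
import sqlite3

db = sqlite3.connect(':memory:')
cur = db.cursor()
cur.execute('''CREATE TABLE foo (
    "Row" INTEGER,
    Name TEXT,
    Year INTEGER,
    Priority INTEGER)''')
cur.executemany('INSERT INTO foo VALUES (?,?,?,?)', [
    (1, 'Cat', 1998, 1),
    (2, 'Fish', 1998, 2),
    (3, 'Dog', 1999, 1),
    (4, 'Aardvark', 2000, 1),
    (5, 'Wallaby', 2000, 1),
    (6, 'Zebra', 2001, 3),
])
db.commit()

# Direct lookup by the unique name
aardvark = cur.execute('SELECT * FROM foo WHERE Name = ?', ('Aardvark',)).fetchone()

# Sort by Priority then Year, and by Year then Priority
by_priority = [r[0] for r in cur.execute(
    'SELECT Name FROM foo ORDER BY Priority, Year, Name')]
by_year = [r[0] for r in cur.execute(
    'SELECT Name FROM foo ORDER BY Year, Priority, Name')]

# Counts based on sets of parameters
in_range = cur.execute(
    'SELECT COUNT(*) FROM foo WHERE Year BETWEEN 1997 AND 2002').fetchone()[0]
high_1998 = cur.execute(
    'SELECT COUNT(*) FROM foo WHERE Year = 1998 AND Priority > 2').fetchone()[0]

print(aardvark)
print(by_priority)
print(by_year)
print(in_range, high_1998)
```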
If you'd really prefer not to use SQL, you should probably use a list of dictionaries:
lod = [ ] # "list of dicts"
def populate_lod(lod, csv_fp):
    rdr = csv.DictReader(csv_fp, ['Row', 'Name', 'Year', 'Priority'])
    lod.extend(rdr)
def query_lod(lod, filter=None, sort_keys=None):
    if filter is not None:
        lod = (r for r in lod if filter(r))
    if sort_keys is not None:
        lod = sorted(lod, key=lambda r:[r[k] for k in sort_keys])
    else:
        lod = list(lod)
    return lod
def lookup_lod(lod, **kw):
    for row in lod:
        # items() works on Python 3 (the original used the Python 2 iteritems())
        for k,v in kw.items():
            if row[k] != str(v): break
        else:
            return row
    return None
Testing then yields:
>>> lod = []
>>> populate_lod(lod, csv_fp)
>>>
>>> pprint(lookup_lod(lod, Row=1))
{'Name': 'Cat', 'Priority': '1', 'Row': '1', 'Year': '1998'}
>>> pprint(lookup_lod(lod, Name='Aardvark'))
{'Name': 'Aardvark', 'Priority': '1', 'Row': '4', 'Year': '2000'}
>>> pprint(query_lod(lod, sort_keys=('Priority', 'Year')))
[{'Name': 'Cat', 'Priority': '1', 'Row': '1', 'Year': '1998'},
{'Name': 'Dog', 'Priority': '1', 'Row': '3', 'Year': '1999'},
{'Name': 'Aardvark', 'Priority': '1', 'Row': '4', 'Year': '2000'},
{'Name': 'Wallaby', 'Priority': '1', 'Row': '5', 'Year': '2000'},
{'Name': 'Fish', 'Priority': '2', 'Row': '2', 'Year': '1998'},
{'Name': 'Zebra', 'Priority': '3', 'Row': '6', 'Year': '2001'}]
>>> pprint(query_lod(lod, sort_keys=('Year', 'Priority')))
[{'Name': 'Cat', 'Priority': '1', 'Row': '1', 'Year': '1998'},
{'Name': 'Fish', 'Priority': '2', 'Row': '2', 'Year': '1998'},
{'Name': 'Dog', 'Priority': '1', 'Row': '3', 'Year': '1999'},
{'Name': 'Aardvark', 'Priority': '1', 'Row': '4', 'Year': '2000'},
{'Name': 'Wallaby', 'Priority': '1', 'Row': '5', 'Year': '2000'},
{'Name': 'Zebra', 'Priority': '3', 'Row': '6', 'Year': '2001'}]
>>> print(len(query_lod(lod, lambda r:1997 <= int(r['Year']) <= 2002)))
6
>>> print(len(query_lod(lod, lambda r:int(r['Year'])==1998 and int(r['Priority']) > 2)))
0
Personally I like the SQLite version better since it preserves your types better (without extra conversion code in Python) and easily grows to accommodate future requirements. But then again, I'm quite comfortable with SQL, so YMMV.
A very old question I know but...
A pandas DataFrame seems to be the ideal option here.
http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.html
From the blurb
Two-dimensional size-mutable, potentially heterogeneous tabular data
structure with labeled axes (rows and columns). Arithmetic operations
align on both row and column labels. Can be thought of as a dict-like
container for Series objects. The primary pandas data structure
http://pandas.pydata.org/
I personally would use the list of row lists. Because the data for each row is always in the same order, you can easily sort by any of the columns by simply accessing that element in each of the lists. You can also easily count based on a particular column in each list, and make searches as well. It's basically as close as it gets to a 2-d array.
Really the only disadvantage here is that you have to know in what order the data is in, and if you change that ordering, you'll have to change your search/sorting routines to match.
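As a rough illustration of the list-of-row-lists idea, here is a small sketch using the table from the question. The column index constants and variable names are my own, and the name index at the end is just one possible way to get direct access by name.

```python
from operator import itemgetter

# Column positions; these must match the order used in every row
ROW, NAME, YEAR, PRIORITY = range(4)

rows = [
    [1, "Cat", 1998, 1],
    [2, "Fish", 1998, 2],
    [3, "Dog", 1999, 1],
    [4, "Aardvark", 2000, 1],
    [5, "Wallaby", 2000, 1],
    [6, "Zebra", 2001, 3],
]

# Sort by Priority and then Year; sorted() is stable, so ties keep file order
by_priority_year = sorted(rows, key=itemgetter(PRIORITY, YEAR))

# Count rows matching a set of conditions
in_range = sum(1 for r in rows if 1997 <= r[YEAR] <= 2002)
high_1998 = sum(1 for r in rows if r[YEAR] == 1998 and r[PRIORITY] > 2)

# Direct access by name, relying on names being unique
by_name = {r[NAME]: r for r in rows}

print([r[NAME] for r in by_priority_year])
print(in_range, high_1998)
print(by_name["Aardvark"])
```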
Another thing you can do is have a list of dictionaries.
rows = []
rows.append({"ID":"1", "name":"Cat", "year":"1998", "priority":"1"})
This would avoid needing to know the order of the parameters, so you can look through each "year" field in the list.
Have a Table class whose rows attribute is a list of dicts or, better, row objects.
In the table, do not add rows directly; instead have a method that also updates a few lookup maps, e.g. for name.
If you are not adding rows in order, or the ids are not consecutive, you can have an idMap too.
e.g.
class Table(object):
    def __init__(self):
        self.rows = []  # list of row objects, we assume they are in order of id
        self.nameMap = {}  # for faster direct lookup for row by name
    def addRow(self, row):
        self.rows.append(row)
        self.nameMap[row['name']] = row
    def getRow(self, name):
        return self.nameMap[name]
table = Table()
table.addRow({'ID':1,'name':'a'})
First, given that you have a complex data retrieval scenario, are you sure even SQLite is overkill?
You'll end up having an ad hoc, informally-specified, bug-ridden, slow implementation of half of SQLite, paraphrasing Greenspun's Tenth Rule.
That said, you are very right in saying that choosing a single data structure will impact one or more of searching, sorting or counting, so if performance is paramount and your data is constant, you could consider having more than one structure for different purposes.
Above all, measure what operations will be more common and decide which structure will end up costing less.
I personally wrote a lib for pretty much that quite recently. It is called BD_XML, as its most fundamental reason for existence is to serve as a way to send data back and forth between XML files and SQL databases.
It is written in Spanish (if that matters in a programming language) but it is very simple.
from BD_XML import Tabla
It defines an object called Tabla (Table). It can be created with a name for identification and a pre-created connection object of a PEP 249 compatible database interface.
Table = Tabla('Animals')
Then you need to add columns with the agregar_columna (add_column) method, which can take various keyword arguments:
campo (field): the name of the field
tipo (type): the type of data stored; it can be things like 'varchar' and 'double', or names of Python objects if you aren't interested in exporting to a database later.
defecto (default): sets a default value for the column if there is none when you add a row
There are 3 others, but they are only there for database things and are not actually functional.
For example:
Table.agregar_columna(campo='Name', tipo='str')
Table.agregar_columna(campo='Year', tipo='date')
#declaring it date, time, datetime or timestamp is important for being able to store it as a time object and not only as a number, But you can always put it as a int if you don't care for dates
Table.agregar_columna(campo='Priority', tipo='int')
Then you add the rows with the += operator (or + if you want to create a copy with an extra row)
Table += ('Cat', date(1998,1,1), 1)
Table += {'Year':date(1998,1,1), 'Priority':2, 'Name':'Fish'}
#…
#The condition for adding is that is a container accessible with either the column name or the position of the column in the table
Then you can generate XML and write it to a file with exportar_XML (export_XML) and escribir_XML (write_XML):
file = os.path.abspath(os.path.join(os.path.dirname(__file__), 'Animals.xml'))
Table.exportar_xml()
Table.escribir_xml(file)
And then import it back with importar_XML (import_XML), passing the file name and an indication that you are using a file and not a string literal:
Table.importar_xml(file, tipo='archivo')
#archivo means file
Advanced
These are ways you can use a Tabla object in a SQL manner.
#UPDATE <Table> SET Name = CONCAT(Name,' ',Priority), Priority = NULL WHERE id = 2
for row in Table:
    if row['id'] == 2:
        row['Name'] += ' ' + row['Priority']
        row['Priority'] = None
print(Table)
#DELETE FROM <Table> WHERE MOD(id,2) = 0 LIMIT 1
n = 0
nmax = 1
for row in Table:
    if row['id'] % 2 == 0:
        del Table[row]
        n += 1
        if n >= nmax: break
print(Table)
These examples assume a column named 'id', but it can be replaced with row.pos for your example:
if row.pos == 2:
The file can be downloaded from:
https://bitbucket.org/WolfangT/librerias