I'm loading data into an SQLite database via Luigi with the following code:
import sqlite3

import luigi


class LoadData(luigi.Task):
    def requires(self):
        return TransformData()

    def run(self):
        # The sqlite3 connection context manager commits the transaction on exit.
        with sqlite3.connect('database.db') as db:
            cursor = db.cursor()
            cursor.execute("INSERT INTO prod SELECT * FROM staging;")

    def output(self):
        return luigi.LocalTarget('database.db')
This works, but when I want to update or insert new data, the task doesn't execute because Luigi considers it complete (database.db already exists).
Maybe I haven't understood how LocalTarget is meant to be used. What is the right way to approach this?
///EDIT: My question applies to the example given on this page (the code for le_create_db.py). How would you handle updates and inserts in that example?
///EDIT: This question about appending to a file is similar, but the solution using marker files does not work because sqla expects an SQLAlchemyTarget output. Are there any other answers, specifically about appending to a database?
Consider using a mock file as the task's target:
http://gouthamanbalaraman.com/blog/building-luigi-task-pipeline.html
With that approach, each execution creates a new file, so the task is not skipped as already complete.
Another solution could be to create a marker table inside the database itself, the way luigi.contrib.postgres.PostgresTarget does: https://luigi.readthedocs.io/en/stable/api/luigi.contrib.postgres.html#luigi.contrib.postgres.PostgresTarget
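A minimal sketch of the marker-file idea, assuming the pipeline runs per batch date and reusing the question's TransformData task (the date parameter and the data/markers/ path are made up for illustration):
import sqlite3

import luigi


class LoadData(luigi.Task):
    # Hypothetical parameter: a new date produces a new marker file,
    # so the task re-runs for every new batch of data.
    date = luigi.DateParameter()

    def requires(self):
        return TransformData()

    def run(self):
        with sqlite3.connect('database.db') as db:
            db.execute("INSERT INTO prod SELECT * FROM staging;")
        # Touch the marker file only after the insert has succeeded.
        with self.output().open('w') as marker:
            marker.write('done')

    def output(self):
        return luigi.LocalTarget('data/markers/load_data_{}.marker'.format(self.date))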
I had the same issue and was able to solve it by overriding the complete() method to simply return False:
    def complete(self):
        return False
Now the task is re-run every time, even if the database file is present.
I'm using sqlalchemy in combination with sqlite and the databases library and I'm trying to wrap my head around what that combination returns when doing update queries. I'm running a testcase and I have sqlalchemy set up to roll back upon execution of each testcase via force_rollback=True.
import databases
from sqlalchemy import update

db = databases.Database(DB_URL, force_rollback=True)
query = update(my_table).where(my_table.columns.id == some_id_to_update).values(**values)
res = await db.execute(query)
When working with psql, I'd expect res to be the number of rows that were affected by the UPDATE query, but from reading the documentation, sqlite seems to behave differently in that it doesn't return anything. I tested this manually by connecting to the database via sqlite3 and as expected, there is no return when doing UPDATE queries. sqlalchemy however does return something, which I assume is the number of total rows in the table, but I'm not sure. Can anybody shed some light into what is actually returned?
What's more, when I tried to get the number of rows affected by the UPDATE query via SELECT changes(), I'm also getting the number of total rows in the table and not the rows affected by the most recent query. Do I have a misunderstanding of what changes() does?
"The changes() function returns the number of database rows that were changed or inserted or deleted by the most recently completed INSERT, DELETE, or UPDATE statement, exclusive of statements in lower-level triggers."
When you use the Python sqlite3 module, you call the .executeXXX methods to prepare and evaluate your query. If the query modifies the database, the modification happens at this stage. You use the same interface to run a SELECT statement. In either case, the .executeXXX methods do not return the query's result rows; to get the result of a SELECT query, you have to call a .fetchXXX method after running .executeXXX.
To get the number of changed rows after an INSERT, DELETE, or UPDATE statement via sqlite3, you can also take the difference in con.total_changes before and after running .executeXXX.
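A small sketch of both approaches with the plain sqlite3 module (the users table and its columns are made up for illustration):
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany("INSERT INTO users (name) VALUES (?)", [("a",), ("b",), ("c",)])

before = con.total_changes
cur = con.execute("UPDATE users SET name = 'x' WHERE id > 1")

# Rows affected by this statement, via the difference in total_changes ...
print(con.total_changes - before)  # 2

# ... or via the cursor's rowcount attribute for DML statements.
print(cur.rowcount)  # 2

# SELECT changes() is itself a query, so its result must be fetched.
print(con.execute("SELECT changes()").fetchone()[0])  # 2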
I am new to SQLAlchemy and need a way to run a script whenever a new entry is added to a table. I am currently using the following method to get the task done, but I am sure there has to be a more efficient way.
I am using Python 2 for my project and MS SQL Server as the database.
Suppose my table is carData and a new row of car details is added from the website. My code works as follows:
class CarData:
    <fields for table class>

with session_scope() as session:
    car_data = session.query(CarData)
    reference_df = pd.read_sql_query(car_data.statement, car_data.session.bind)

while True:
    with session_scope() as session:
        new_df = pd.read_sql_query(car_data.statement, car_data.session.bind)
        if len(new_df) > len(reference_df):
            print "New Car details added"
            <code to get the id of new row added>
            <run script>
            reference_df = new_df
    sleep(10)
The above is of course a much simpler version of the code that I am using, but the idea is to have a reference point and then keep checking every 10 seconds whether there is a new entry. However, even after using session_scope() I have seen connection issues after a few days, as this script is supposed to run indefinitely.
Is there a better way to know that a new row has been added, get the id of the new row and run the required script?
I believe the error you've described is a connectivity issue with the database, e.g. a temporary network problem:
OperationalError: TCP Provider: Error code 0x68
So what you need to do is cater for this with error handling!
try:
    new_df = pd.read_sql_query(car_data.statement, car_data.session.bind)
except:
    print("Problem with query, will try again shortly")
Note: Using flask_sqlalchemy here
I'm working on adding versioning to multiple services on the same DB. To make sure it works, I'm adding unit tests that confirm I get an error (for this case my error should be StaleDataError). For other services in other languages, I pulled the same object twice from the DB, updated one instance, saved it, updated the other instance, then tried to save that as well.
However, because SQLAlchemy adds a fake-cache layer between the DB and the service, when I update the first object it automatically updates the other object I hold in memory. Does anyone have a way around this? I created a second session (that solution had worked in other languages) but SQLAlchemy knows not to hold the same object in two sessions.
I was able to manually test it by putting time.sleep() halfway through the test and manually changing data in the DB, but I'd like a way to test this using just the unit code.
Example code:
def test_optimistic_locking(self):
    c = Customer(formal_name='John', id=1)
    db.session.add(c)
    db.session.flush()

    cust = Customer.query.filter_by(id=1).first()
    db.session.expire(cust)
    same_cust = Customer.query.filter_by(id=1).first()
    db.session.expire(same_cust)

    same_cust.formal_name = 'Tim'
    db.session.add(same_cust)
    db.session.flush()
    db.session.expire(same_cust)

    cust.formal_name = 'Jon'
    db.session.add(cust)
    with self.assertRaises(StaleDataError):
        db.session.flush()
    db.session.rollback()
It actually is possible; you need to create two separate sessions. See SQLAlchemy's own unit tests for inspiration. Here's a snippet of one of our unit tests, written with pytest:
def test_article__versioning(connection, db_session: Session):
    article = ProductSheetFactory(title="Old Title", version=1)
    db_session.refresh(article)
    assert article.version == 1

    db_session2 = Session(bind=connection)
    article2 = db_session2.query(ProductSheet).get(article.id)
    assert article2.version == 1

    article.title = "New Title"
    article.version += 1
    db_session.commit()
    assert article.version == 2

    with pytest.raises(sqlalchemy.orm.exc.StaleDataError):
        article2.title = "Yet another title"
        assert article2.version == 1
        article2.version += 1
        db_session2.commit()
Hope that helps. Note that we use "version_id_generator": False in the model, which is why we increment the version ourselves. See the docs for details.
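For reference, the mapper configuration might look roughly like this. A sketch only: the columns of ProductSheet are assumptions, the __mapper_args__ part is what the answer relies on:
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class ProductSheet(Base):
    __tablename__ = 'product_sheet'

    id = Column(Integer, primary_key=True)
    title = Column(String)
    version = Column(Integer, nullable=False)

    __mapper_args__ = {
        # Use the version column for optimistic locking, but let the
        # application increment it instead of SQLAlchemy.
        "version_id_col": version,
        "version_id_generator": False,
    }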
For anyone who comes across this question: my current hypothesis is that it can't be done. SQLAlchemy is incredibly powerful, and given that the functionality is so good that we can't test this line, we should trust that it works as expected.
I have a simple web2py server that we use to visualize data from our PostgreSQL Server. The following functions are all part of the global models in web2py.
The current solution to fetch data is very simple: every time I need data, I connect, and after I've fetched it I close the connection:
# Old way:
# (imports excluded)
def get_data(query):
    postgres_connection = psycopg2.connect("credentials")
    # Pandas function to put data from the query into a DataFrame
    df = psql.frame_query(query, con=postgres_connection)
    postgres_connection.close()
    return df
For small queries, opening and closing the connection takes about 9/10 of the time it takes to run the function.
Is this a good way to do it instead? If not, what is a better way?
# Better way?
def connect():
    """
    Create a connection to the server.
    """
    return psycopg2.connect("credentials")


db_connection = connect()


def create_pandas_frame(query):
    """
    Run the query on the open connection.
    """
    return psql.frame_query(query, con=db_connection)


def get_data(query):
    """
    Try to get data; open a new connection if the connection is closed.
    """
    try:
        data = create_pandas_frame(query)
    except:
        global db_connection
        db_connection = connect()
        data = create_pandas_frame(query)
    return data
If you run that code in a web2py model file, you'll end up creating a new connection on each HTTP request anyway. Instead, you might consider connection pooling.
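If you want to keep using psycopg2 directly, pooling could look roughly like this (a sketch, using psycopg2's built-in pool module; "credentials" stands in for your real DSN, and the pool is created at module level so it survives across requests):
import pandas.io.sql as psql
from psycopg2 import pool

# One pool per process, created when the module is first imported.
connection_pool = pool.ThreadedConnectionPool(1, 10, "credentials")


def get_data(query):
    conn = connection_pool.getconn()
    try:
        return psql.frame_query(query, con=conn)
    finally:
        # Return the connection to the pool instead of closing it.
        connection_pool.putconn(conn)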
An easier option might be to use the web2py DAL to fetch the data. Something like:
from pandas.core.api import DataFrame
db = DAL([db connection string], pool_size=10, migrate_enabled=False)
rows = db.executesql(query)
data = DataFrame.from_records(rows, columns=[list, of, column, names])
If you specify the pool_size argument to DAL(), it will automatically maintain a connection pool to be used across requests.
Note, I haven't tried this, so it may need some tweaking, but something along these lines should work.
If you'd like, you can even use the DAL to generate the SQL by defining database table models:
db.define_table('mytable',
                Field('field1', 'integer'),
                Field('field2', 'double'),
                Field('field3', 'boolean'))

rows = db.executesql(db(db.mytable.id > 0)._select())
data = DataFrame.from_records(rows, columns=db.mytable.fields)
The ._select() method just generates the SQL without actually doing the select. The SQL is then passed to .executesql() to fetch the data.
An alternative is to create a special Pandas processor and pass it as the processor argument to .select().
def pandas_processor(rows, fields, columns, cacheable):
    return DataFrame.from_records(rows, columns=columns)


data = db(db.mytable.id > 0).select(processor=pandas_processor)
I used Anthony's answer and now have functions that look like this:
# In one of the models files.
from pandas.core.api import DataFrame

external_db = DAL('postgres://connection_stuff', pool_size=10, migrate_enabled=False)


def create_simple_html_table(query):
    dict_from_db = external_db.executesql(query, as_dict=True)
    return DataFrame(dict_from_db).to_html()
Then, later, an HTML table is created in a controller or view using:
# In Controller:
my_table = create_simple_html_table('select * from random_table limit 50')
# In View:
{{=XML(create_simple_html_table('select * from random_table limit 50'))}}
I still need to do more testing, but my understanding so far is that this solution lets me query the external db while web2py keeps the connection open and reuses it for all users.
Note that this solution is only good if all you want to do is read from and write to your Postgres server with raw SQL.
If you want to use the DAL to read and write, you need to either try the DAL alternative called MyDAL or play around with the search_path option in Postgres.
I'm looking for a complete example of using select for update in SQLAlchemy, but haven't found one by googling. I need to lock a single row and update a column; the following code doesn't work (it blocks forever):
s = table.select(table.c.user == "test", for_update=True)
# Do update or not depending on the row
u = table.update().where(table.c.user == "test")
u.execute(email="foo")
Do I need a commit? How do I do that? As far as I know you need to:
begin transaction
select ... for update
update
commit
If you are using the ORM, try the with_for_update() method:
foo = session.query(Foo).filter(Foo.id==1234).with_for_update().one()
# this row is now locked
foo.name = 'bar'
session.add(foo)
session.commit()
# this row is now unlocked
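If you are working with Core tables as in the question, a roughly equivalent sketch on a modern SQLAlchemy connection might look like this (the table and column names are taken from the question; the engine setup is assumed to exist elsewhere):
from sqlalchemy import select

with engine.begin() as conn:  # begins a transaction, commits on exit
    row = conn.execute(
        select(table).where(table.c.user == "test").with_for_update()
    ).first()
    # The selected row stays locked until the transaction ends.
    conn.execute(
        table.update().where(table.c.user == "test").values(email="foo")
    )
# Leaving the block commits the transaction and releases the lock.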
Late answer, but maybe someone will find it useful.
First, you don't need to commit (at least not in between the queries, which I'm assuming you are asking about). Your second query hangs indefinitely because you are effectively creating two concurrent connections to the database: the first one obtains a lock on the selected records, and then the second one tries to modify the locked records, so it can't work properly. (By the way, in the example given you are not executing the first query at all, so I'm assuming in your real tests you did something like s.execute() somewhere.) So, to the point: a working implementation should look more like this:
s = conn.execute(table.select(table.c.user == "test", for_update=True))
u = conn.execute(table.update().where(table.c.user == "test"), {"email": "foo"})
conn.commit()
Of course, in such a simple case there's no reason to do any locking, but I assume it is only an example and you were planning to add some additional logic between those two calls.
Yes, you do need to commit, which you can do on the connection or by creating a Transaction explicitly. Also, the new values are specified in the values(...) method, not in execute():
>>> conn.execute(
...     table.update().
...     where(table.c.user == "test").
...     values(email="foo")
... )
>>> conn.commit()