I am planning to create a simple search engine in Python (Python 3). Going through the documentation for SQLite FTS3/FTS4, it became my choice for storing the documents, since full-text searches are fast. I already have a set of web pages, with their text extracted and saved in text files.
Hence I planned to create the FTS4 table the following way:
import sqlite3

conn = sqlite3.connect('/home/xyz/exampledb.db')
c = conn.cursor()
c.execute("CREATE VIRTUAL TABLE mypages USING fts4(docid, name, content)")
Then I would iterate over the text files, store each one in a string, and insert this string into the FTS table along with the name and docid (an integer from 1 to n, where n is the total number of documents).
But the following statement in the SQLite documentation has me confused, and I am not sure my above code will work:
A virtual table is an interface to an external storage or computation engine that appears to be a table but does not actually store information in the database file.
So where will the information be stored? If it was a regular SQLite table, I would first create a database file and create the table in this database file. If I had to use the same database on another machine, I would simply copy this file and paste it on that machine. I might have missed something in the documentation, but I want to be clear on how the information will be stored before I implement it.
That statement from the documentation is somewhat misleading; the virtual table itself does not store data in the database, but the engine that implements the virtual table might choose to use other tables to store the data.
What happens for FTS is explained in section 9.1 of the documentation:
For each FTS virtual table in a database, three to five real (non-virtual) tables are created to store the underlying data. These real tables are called "shadow tables". The real tables are named "%_content", "%_segdir", "%_segments", "%_stat", and "%_docsize", where "%" is replaced by the name of the FTS virtual table.
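That is also why simply copying the .db file to another machine still works: the shadow tables travel inside it. As a quick illustration (a sketch only; the file name follows the question, and docid does not need to be declared as a column because FTS4 already provides it as an alias for rowid):

import sqlite3

conn = sqlite3.connect('/home/xyz/exampledb.db')
c = conn.cursor()
c.execute("CREATE VIRTUAL TABLE IF NOT EXISTS mypages USING fts4(name, content)")
c.execute("INSERT INTO mypages(docid, name, content) VALUES (?, ?, ?)",
          (1, 'page1.txt', 'text extracted from the first web page'))
conn.commit()

# The shadow tables that actually hold the data live in the same .db file:
c.execute("SELECT name FROM sqlite_master WHERE type='table' AND name LIKE 'mypages_%'")
print(c.fetchall())  # e.g. [('mypages_content',), ('mypages_segments',), ('mypages_segdir',), ...]

# A full-text query against the virtual table:
c.execute("SELECT docid, name FROM mypages WHERE mypages MATCH ?", ('extracted',))
print(c.fetchall())  # [(1, 'page1.txt')]
conn.close()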
I'm trying to migrate the database for an existing application from access to SQLite. The application uses Autonumbers to generate unique IDs in Access and some tables reference rows from other tables by these unique IDs.
What's a good way to migrate these and keep this functionality intact?
From what I've read, SQLite uses Auto indexing for this. How would I create the links between the tables? Do I have to search the other tables for the row with that unique ID and replace the reference with the SQL generated ID?
example:
table 1 has a column linkedID with a row with the value {7F99297A-DE91-4BD6-9ED8-FC13D668CDA2}, which is linked to a row in table 2 with primary key {7F99297A-DE91-4BD6-9ED8-FC13D668CDA2}.
Well, there is not really an automated way to do this.
But here is what I do to migrate data:
I set up a linked table in Access and double-check that the linked table works (you need to install the ODBC driver).
Assuming you have a working linked table, you can then do this in VBA to export the Access table to SQLite:
Dim LocalTable As String   ' name of local table link
Dim ServerTable As String  ' name of table in SQLite
LocalTable = "Table1"
ServerTable = "TableSLite"

Dim strCon As String
strCon = CurrentDb.TableDefs("Test1").Connect
' above is a way to steal a working connection string from a valid,
' already-working linked table ("Test1" here) - I hate connection strings in code
Debug.Print strCon

DoCmd.TransferDatabase acExport, "ODBC Database", strCon, acTable, LocalTable, ServerTable
Debug.Print "done export of " & LocalTable
That will get you the table in SQLite. But there are no DDL (data definition) commands in SQLite to THEN change that PK id from Access into a PK with autonumber.
However, assuming you have, say, "DB Browser"?
Then simply export the table(s) as per above.
Now, in DB Browser, open up the table, choose Modify, and simply check the AI (auto increment) and PK settings - in fact, if you check the AI box, the PK usually gets selected for you. Do this after you have done the above export (and you should consider closing down Access, since you had/have linked tables).
So, for just a few tables, the above is not really hard.
However, the above export (transfer) of data does not set the PK and auto increment for you.
If you need to do this with code, and this is not a one-time export/transfer, then I don't have a good solution.
Unfortunately, SQLite does NOT allow an ALTER TABLE command to set a PK and auto increment (if that were possible, then after an export you could execute the DDL command in SQLite, or send the command from your client software, to make this alteration).
I am not sure if SQLite can spit out the "create table" command that exists for a given table (but I think it can). So, you might export the schema, get the DDL command, modify that command, drop the table, re-run the create table command (with the correct PK and auto increment), and THEN use an export or append query in Access.
But transfer of the table(s) in question can be done quite easily as per above; the result just does not set nor include the PK setting(s) for you.
However, if this is a one-time export, the export of the tables - and even the dirty work of figuring out the correct data types to be used - works well with the above approach; you just have to open up the tables in a tool like, say, DB Browser, and then set the PK and auto increment.
I do the above quite often to transfer Access tables to SQLite tables, but it does then require some extra steps to set up the PK and auto increment.
Another possible way, if this had to be done more than one time?
I would export as per above, and then add the PK (and auto increment).
I would then grab, say, the 8 tables' CREATE TABLE commands from SQLite, and save those CREATE TABLE commands in the client software.
Then you execute the correct CREATE TABLE command, and then do an append query from Access. So, it really depends on whether this is a one-time export, or whether this process of having to create the table(s) in SQLite is to occur over and over.
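If you do end up scripting that last step, a rough Python sketch of the idea (the column names here are made up purely for illustration) is to pull the existing CREATE TABLE text out of sqlite_master, re-create the table with an INTEGER PRIMARY KEY AUTOINCREMENT column, and copy the rows across:

import sqlite3

# Hypothetical file/table/column names - adjust to your own export.
conn = sqlite3.connect('exported.db')
cur = conn.cursor()

# SQLite does keep the original CREATE TABLE text for each table:
cur.execute("SELECT sql FROM sqlite_master WHERE type='table' AND name=?", ('TableSLite',))
print(cur.fetchone()[0])  # inspect it, then write the corrected DDL by hand

# Rebuild the table with a proper autonumber-style key and copy the data over.
cur.executescript("""
    ALTER TABLE TableSLite RENAME TO TableSLite_old;
    CREATE TABLE TableSLite (
        ID INTEGER PRIMARY KEY AUTOINCREMENT,
        LinkedID TEXT,
        SomeColumn TEXT
    );
    INSERT INTO TableSLite (ID, LinkedID, SomeColumn)
        SELECT ID, LinkedID, SomeColumn FROM TableSLite_old;
    DROP TABLE TableSLite_old;
""")
conn.commit()
conn.close()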
I'm learning AWS Glue. With traditional ETL, a common pattern is to look up the primary key from the destination table to decide if you need to do an update or an insert (aka the upsert design pattern). With Glue there doesn't seem to be that same control. Simply writing out the dynamic frame is just an insert process. There are two design patterns I can think of for how to solve this:
Load the destination as a data frame and, in Spark, left outer join to only insert new rows (how would you update rows if you needed to? delete then insert??? Since I'm new to Spark, this is the most foreign to me)
Load the data into a stage table and then use SQL to perform the final merge
It's this second method that I'm exploring first. How can I, in the AWS world, execute a SQL script or stored procedure once the AWS Glue job is complete? Do you use a Python shell job, a Lambda, something directly part of Glue, or some other way?
I have used the pymysql library as a zip file uploaded to AWS S3, and configured it in the AWS Glue job parameters. For UPSERTs, I have used INSERT INTO TABLE ... ON DUPLICATE KEY.
So based on the primary key validations, the code will either update a record if it already exists, or insert a new record. Hope this helps. Please refer to this:
import pymysql

rds_host = "rds.url.aaa.us-west-2.rds.amazonaws.com"
name = "username"
password = "userpwd"
db_name = "dbname"

conn = pymysql.connect(host=rds_host, user=name, passwd=password,
                       db=db_name, connect_timeout=5)

with conn.cursor() as cur:
    # zip_code, territory_code, territory_name and state hold the values
    # of the record currently being processed.
    insertQry = ("INSERT INTO ZIP_TERR (zip_code, territory_code, territory_name, state) "
                 "VALUES (%s, %s, %s, %s) "
                 "ON DUPLICATE KEY UPDATE territory_name = VALUES(territory_name), "
                 "state = VALUES(state);")
    cur.execute(insertQry, (zip_code, territory_code, territory_name, state))
conn.commit()
In the above code sample, territory_code and zip_code are the primary keys. Please refer here as well: More on looping inserts using a for loop
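If the job produces many records, the same upsert can simply be run once per row over whatever your Glue job emits, for example with executemany; the values below are placeholders:

# Reusing conn and insertQry from the snippet above; rows would come from your Glue job.
rows = [
    ("90210", "T01", "West", "CA"),   # (zip_code, territory_code, territory_name, state)
    ("10001", "T02", "East", "NY"),
]
with conn.cursor() as cur:
    cur.executemany(insertQry, rows)  # runs the insert-or-update once per row
conn.commit()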
As always, AWS's ever-changing feature list resolves many of these problems (arising from user demand and common work patterns).
AWS has published documentation on Updating and Inserting new data, using staging tables (which you mentioned in your second strategy).
Generally speaking, the most rigorous approach for ETL is to truncate and reload from source data, but this depends on your source data. If your source data is a time-series dataset spanning billions of records, you may need to use a delta/incremental load pattern.
I have two PostgreSQL databases, db1 = source database and db2 = destination database, on an AWS server endpoint. For db1 I only have read rights, and for db2 I have both read and write rights. db1, being the production database, has a table called 'public.purchases'; my task is to incrementally get all the data from the 'public.purchases' table in db1 into a to-be-newly-created table in db2 (let me call that table 'public.purchases_copy'). And every time I run the script to perform this action, the destination table 'public.purchases_copy' in db2 needs to be updated without fully reloading the table.
My question is: what would be the best way to achieve this task efficiently? I did quite a bit of research online and found out that it can be achieved by connecting Python to PostgreSQL using the 'psycopg2' module. Since I am not so proficient in Python, it would be of great help if somebody could point me to StackOverflow links where a similar question was answered, or guide me on what can be done, how this can be achieved, or any particular tutorial I can refer to. Thanks in advance.
PostgreSQL version: 9.5
PostgreSQL GUI: pgAdmin 3
Python version: 3.5
While it is possible to do this using Python, I would recommend first looking into Postgres's own module postgres_fdw, if it is possible for you to use it:
The postgres_fdw module provides the foreign-data wrapper postgres_fdw, which can be used to access data stored in external PostgreSQL servers.
Details are available in the Postgres docs, but specifically, after you set it up, you can:
Create a foreign table, using CREATE FOREIGN TABLE or IMPORT FOREIGN SCHEMA, for each remote table you want to access. The columns of the foreign table must match the referenced remote table. You can, however, use table and/or column names different from the remote table's, if you specify the correct remote names as options of the foreign table object.
Now you need only SELECT from a foreign table to access the data stored in its underlying remote table.
For a simpler setup, it should probably be best to use the read-only db as the foreign one.
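Since you already mentioned psycopg2, the whole thing can then be driven from db2 with a short Python script. This is only a sketch: the connection details, the server name, and the id-based watermark are assumptions you would need to adapt (it also assumes your db2 role is allowed to create the extension):

import psycopg2

# Connect to the destination database (db2) - placeholder connection details.
conn = psycopg2.connect(host="db2.host", dbname="db2", user="me", password="secret")
conn.autocommit = True
cur = conn.cursor()

# --- one-time setup: run these once, then leave them out ---
cur.execute("CREATE EXTENSION IF NOT EXISTS postgres_fdw;")
cur.execute("""CREATE SERVER db1_server FOREIGN DATA WRAPPER postgres_fdw
               OPTIONS (host 'db1.host', dbname 'db1', port '5432');""")
cur.execute("""CREATE USER MAPPING FOR CURRENT_USER SERVER db1_server
               OPTIONS (user 'readonly_user', password 'secret');""")
cur.execute("""IMPORT FOREIGN SCHEMA public LIMIT TO (purchases)
               FROM SERVER db1_server INTO public;""")
# Empty local copy with the same columns as the (now foreign) purchases table.
cur.execute("""CREATE TABLE public.purchases_copy AS
               SELECT * FROM public.purchases WHERE false;""")

# Incremental load: pull only the rows db2 has not seen yet, assuming purchases
# has a monotonically increasing id column (swap in your own watermark column).
cur.execute("""INSERT INTO public.purchases_copy
               SELECT * FROM public.purchases
               WHERE id > (SELECT COALESCE(MAX(id), 0) FROM public.purchases_copy);""")
conn.close()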
I would like to store CSV files in SQL Server. I've created a table with a column "myDoc" of type varbinary(max). I generate the CSVs on a server using Python/Django. I would like to insert the actual CSV (not the path) as a BLOB object so that I can later retrieve the actual CSV file.
How do I do this? I haven't been able to make much headway with this documentation, as it mostly refers to .jpgs:
https://msdn.microsoft.com/en-us/library/a1904w6t(VS.80).aspx
Edit:
I wanted to add that I'm trying to avoid FILESTREAM. The CSVs are too small (5 KB) and I don't need text search over them.
Not sure why you want varbinary over varchar, but it will work either way
Insert Into YourTable (myDoc)
Select doc = BulkColumn FROM OPENROWSET(BULK 'C:\Working\SomeXMLFile.csv', SINGLE_BLOB) x
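If you would rather push the bytes from the Python/Django side instead of having SQL Server read the file off its own disk, a parameterised insert works too. A sketch, assuming a pyodbc connection (the driver name, file name and id column are placeholders):

import pyodbc

# Placeholder connection string - adjust driver/server/credentials to your setup.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;UID=user;PWD=pass"
)
cur = conn.cursor()

with open("report.csv", "rb") as f:   # the CSV generated by Django, read as raw bytes
    csv_bytes = f.read()

# A bytes parameter is bound to the varbinary(max) column.
cur.execute("INSERT INTO YourTable (myDoc) VALUES (?)", (csv_bytes,))
conn.commit()

# Later, read it back (assuming the table also has an id key column):
cur.execute("SELECT myDoc FROM YourTable WHERE id = ?", (1,))
blob = cur.fetchone()[0]              # bytes holding the original CSV content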
I'm looking for a way to access a CSV file's cells in a random fashion. If I use Python's csv module, I can only iterate through all the lines, which is rather slow. I should also add that the file is pretty large (>100 MB) and that I'm looking for short response times.
I could preprocess the file into a different data format for faster row/column access. Perhaps someone has done this before and can share some experiences.
Background:
I'd like to show an extract of the CSV on screen, served by a web server (depending on scroll position). Keeping the file in memory is not an option.
I have found SQLite good for this sort of thing. It is easy to set up and you can store the data locally, but you also get easier control over what you select than with CSV files, and you get the facility to add indexes etc.
There is also a built-in facility for loading CSV files into a table: http://www.sqlite.org/cvstrac/wiki?p=ImportingFiles.
Let me know if you want any further details on the SQLite route i.e. how to create the table, load the data in or query it from Python.
SQLite instructions to load a .csv file into a table
To create a database file you can just add the filename required as an argument when opening SQLite. Navigate to the directory containing the csv file from the command line (I am assuming here that you want the SQLite .db file to be contained in the same dir). If using Windows add SQLite to your PATH environment variable if not already done, (instructions here if you need them) and open SQLite as follows with an argument for the name that you want to give your database file e.g.:
sqlite3 example.db
Check the database file has been created by entering:
.databases
Create a table to hold the data. I am using an example for a simple customer table here. If data types are inconsistent for any columns use text:
create table customers (ID integer, Title text, Forename text, Surname text, Postcode text, Addr_Line1 text, Addr_Line2 text, Town text, County text, Home_Phone text, Mobile text, Comments text);
Specify the separator to be used:
.separator ","
Issue the command to import the data; the syntax takes the form .import filename.ext table_name, e.g.:
.import cust.csv customers
Check that the data has loaded in:
select count(*) from customers;
Add an index for columns that you are likely to filter on (full syntax described here) e.g.:
create index cust_surname on customers(surname);
You should now have fast access to the data when filtering on any of the indexed columns. To leave SQLite use .exit, to get a list of other helpful non-SQL commands use .help.
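Since the front end here is a Python web server, you can then query that same table straight from Python with the built-in sqlite3 module, pulling just the slice of rows for the current scroll position. A small sketch using the example table above:

import sqlite3

conn = sqlite3.connect("example.db")
cur = conn.cursor()

# Fetch one "screenful" of rows for the current scroll position.
offset, page_size = 200, 50
cur.execute(
    "SELECT ID, Forename, Surname, Town FROM customers "
    "ORDER BY ID LIMIT ? OFFSET ?",
    (page_size, offset),
)
rows = cur.fetchall()

# A filtered lookup that benefits from the index created above:
cur.execute("SELECT * FROM customers WHERE Surname = ?", ("Smith",))
print(cur.fetchall())
conn.close()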
Python Alternative
Alternatively if you want to stick with pure Python and pre-process the file then you could load the data into a dictionary which would allow much faster access to the data as the dictionary keys behave like an index meaning that you can get to values associated with a key quickly without going through the records one by one. I would need further details of your input data and what fields the lookups would be based on to provide further details on how to implement this.
However, unless you will know in advance when the data will be required (to be able to pre-process the file before the request for data) then you would still have the overhead of loading the file from disk into memory every time you run this. Depending on your exact usage this may make the database solution more appropriate.
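For example, a minimal sketch of the dictionary approach, assuming the CSV has a header row and the lookups are keyed on an 'ID' column (both assumptions on my part):

import csv

# Pre-process: read the CSV once and key every row by its ID column.
with open("data.csv", newline="") as f:
    reader = csv.DictReader(f)
    rows_by_id = {row["ID"]: row for row in reader}

# Any later lookup is then a single dictionary access rather than a scan:
print(rows_by_id["12345"]["Surname"])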