I am building an Airflow DAG that takes CSV files from GCS and inserts them into a PostgreSQL table in Cloud SQL. I have several options:
Use SQLAlchemy to insert the rows.
Use pandas.
Explore the PostgreSQL Airflow operators (I don't know how to connect them with GCS).
Which is the most Pythonic way to do so?
You should go with COPY.
See https://www.postgresql.org/docs/current/sql-copy.html
COPY moves data between PostgreSQL tables and standard file-system files. COPY TO copies the contents of a table to a file, while COPY FROM copies data from a file to a table (appending the data to whatever is in the table already).
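For the Airflow case in the question, a rough sketch of driving COPY from a task could look like this. It assumes the Google and Postgres provider packages are installed, and the connection IDs, bucket, object and table names are hypothetical placeholders:

from airflow.providers.google.cloud.hooks.gcs import GCSHook
from airflow.providers.postgres.hooks.postgres import PostgresHook

def gcs_csv_to_postgres(bucket: str, object_name: str, table: str) -> None:
    # Download the CSV from GCS to a local temporary file...
    local_path = "/tmp/" + object_name.replace("/", "_")
    GCSHook(gcp_conn_id="google_cloud_default").download(
        bucket_name=bucket, object_name=object_name, filename=local_path
    )
    # ...and stream it into the Cloud SQL table with COPY FROM STDIN.
    pg = PostgresHook(postgres_conn_id="cloud_sql_postgres")
    pg.copy_expert(
        sql=f"COPY {table} FROM STDIN WITH (FORMAT csv, HEADER true)",
        filename=local_path,
    )

Wrap the function in a PythonOperator (or @task) and COPY does the bulk load for you instead of row-by-row inserts.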
Related
I'm new to using Postgres, so I'm not sure if this is a basic question. I obtain a Postgres dump from my company that I load as a Postgres schema using pgAdmin. I then use the psycopg2 package in Python to load the various tables of this schema into pandas dataframes and do whatever processing I need.
Every time I get a new version of the data, I have to go through a three-step process in pgAdmin to update the schema before I work with it in Python:
"Drop cascade" the existing public schema
Create a new public schema
Restore the public schema by pointing it to the new pgdump file
Rather than doing these three steps in pgAdmin, can this be done programmatically from within a Python script?
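The three steps can be scripted roughly like this; it is only a sketch, assuming a custom-format dump that pg_restore understands (a plain-SQL dump would need psql instead) and hypothetical connection settings:

import subprocess
import psycopg2

def refresh_public_schema(dump_path):
    conn = psycopg2.connect(host="localhost", dbname="mydb", user="me", password="secret")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("DROP SCHEMA IF EXISTS public CASCADE;")  # step 1: drop cascade
        cur.execute("CREATE SCHEMA public;")                  # step 2: recreate schema
    conn.close()
    # Step 3: restore the dump into the freshly created schema.
    subprocess.run(
        ["pg_restore", "--dbname=postgresql://me:secret@localhost/mydb",
         "--no-owner", dump_path],
        check=True,
    )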
I wish to copy data from S3 to Redshift.
However, the Copy command always duplicates the rows whenever the Lambda function triggers:
cur.execute('copy table from S3...... ' )
Can someone suggest other ways to do it without truncating existing data?
For the commenters: I tried to push directly from the dataframe to Redshift with append.
There is a library, pandas_redshift, but it needs an S3 connection first (which might solve the appending issue).
I also tried:
# cur.execute('truncate') would keep the table empty, but I don't have delete rights
cur.execute('select distinct * from ABC.xyz')
cur.execute('copy......')
The results keep appending...
Can someone please provide code or the right series of statements to execute?
Unfortunately, there is no straightforward way to COPY the files and perform an upsert that handles duplicates.
If you don't want to truncate the table, there are two workarounds:
You can create a staging table where you copy the data first and then perform a merge; that also acts as an upsert (see the sketch after these links).
https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-upsert.html
You can use a manifest to control which files you want to copy and which should be skipped.
https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html#copy-command-examples-manifest
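A rough sketch of the staging-table merge (option 1), assuming hypothetical table names, S3 path, IAM role, and a matching "id" key column; adapt it to your schema:

import psycopg2

# Hypothetical cluster endpoint and credentials.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="me", password="secret",
)
conn.autocommit = True
cur = conn.cursor()

cur.execute("""
BEGIN;

CREATE TEMP TABLE xyz_staging (LIKE abc.xyz);

COPY xyz_staging
FROM 's3://my-bucket/incoming/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS CSV;

-- Delete rows that are about to be replaced, then insert the new ones.
DELETE FROM abc.xyz USING xyz_staging WHERE abc.xyz.id = xyz_staging.id;
INSERT INTO abc.xyz SELECT * FROM xyz_staging;

END;
""")

Note that this still needs DELETE privileges on the target table; without them, the manifest route (option 2) is the way to copy only the files that have not been loaded yet.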
Given a Parquet file, how can I create the associated table in my Redshift database? The Parquet file is Snappy-compressed.
If you're dealing with multiple files, especially over a long term, then I think the best solution is to upload them to an S3 bucket and run a Glue crawler.
In addition to populating the Glue data catalog, you can also use this information to configure external tables for Redshift Spectrum, and create your on-cluster tables using create table as select.
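As a rough sketch of that flow (the cluster endpoint, Glue database, crawled table name, and IAM role are hypothetical; adapt them to your environment):

import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="me", password="secret",
)
conn.autocommit = True
cur = conn.cursor()

# Expose the Glue data catalog (populated by the crawler) as an external schema.
cur.execute("""
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
FROM DATA CATALOG
DATABASE 'my_glue_database'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-spectrum-role';
""")

# Materialize the Snappy-compressed Parquet data as a regular on-cluster table.
cur.execute("""
CREATE TABLE public.my_table AS
SELECT * FROM spectrum.my_crawled_table;
""")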
If this is just a one-off task, then I've used parquet-tools in the past. The version that I've used is a Java library, but I see that there's also a version on PyPi.
I need to copy an existing neo4j database in Python. I do not even need it for backup, just to play around with while keeping the original database untouched. However, there is nothing about copy/backup operations in the neo4j.py documentation (I am using the Python embedded binding).
Can I just copy the whole folder with the original neo4j database to a folder with a new name?
Or is there any special method available in neo4j.py?
Yes, you can copy the whole DB directory for backup once you have cleanly shut down the DB.
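A minimal sketch, assuming the embedded database has been shut down cleanly and using hypothetical paths:

import shutil

SOURCE_DB = "/var/lib/neo4j/data/graph.db"            # original database directory
PLAYGROUND_DB = "/var/lib/neo4j/data/graph-copy.db"   # scratch copy to play with

# Copy the whole directory tree; the copy is only consistent if the
# original database was shut down cleanly beforehand.
shutil.copytree(SOURCE_DB, PLAYGROUND_DB)

Then point your embedded neo4j.py instance at PLAYGROUND_DB and experiment freely while the original stays untouched.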
We're currently working on a Python project that basically reads and writes M2M data into/from an SQLite database. This database consists of multiple tables, one of them storing the current values coming from the cloud. This last table worries me a bit, since it is written to very often and the application runs on a flash drive.
I've read that virtual tables could be the solution. I've thought about converting the critical table into a virtual one and then linking its contents to a real file (XML or JSON) stored in RAM (/tmp, for example, on Debian). I've been reading this article:
http://drdobbs.com/database/202802959?pgno=1
It explains more or less how to do what I want, but it's quite complex and I don't think it is very doable in Python. Maybe we would need to develop our own SQLite extension, I don't know...
Any idea about how to "place" our conflicting table in RAM while the rest of the database stays in flash? Any better/simpler approach to taking the virtual table route from Python?
A very simple, SQL-only solution to create an in-memory table is to use SQLite's ATTACH command with the special ":memory:" pseudo-filename:
ATTACH DATABASE ":memory:" AS memdb;
CREATE TABLE memdb.my_table (...);
Since the whole database "memdb" is kept in RAM, the data will be lost once you close the database connection, so you will have to take care of persistence by yourself.
One way to do it could be:
Open your main SQLite database file
Attach an in-memory secondary database
Duplicate your performance-critical table in the in-memory database
Run all queries on the duplicate table
Once done, write the in-memory table back to the original table (BEGIN; DELETE FROM real_table; INSERT INTO real_table SELECT * FROM memory_table; COMMIT;)
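A small sketch of that cycle with Python's sqlite3 module, assuming a hypothetical hot table called current_values (with key/value columns) and a hypothetical database path; note that CREATE TABLE ... AS SELECT copies the data but not constraints or indexes:

import sqlite3

conn = sqlite3.connect("/path/on/flash/m2m.db")

# Attach an in-memory secondary database and duplicate the hot table there.
conn.execute('ATTACH DATABASE ":memory:" AS memdb')
conn.execute("CREATE TABLE memdb.current_values AS SELECT * FROM main.current_values")

# ... run the frequent reads/writes against memdb.current_values ...
conn.execute("UPDATE memdb.current_values SET value = ? WHERE key = ?", (42.0, "sensor_1"))

# Periodically (or at shutdown) write the in-memory table back to flash.
with conn:  # one transaction
    conn.execute("DELETE FROM main.current_values")
    conn.execute("INSERT INTO main.current_values SELECT * FROM memdb.current_values")

conn.close()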
But the best advice I can give you: Make sure that you really have a performance problem, the simple solution could just as well be fast enough!
Use an in-memory data structure server. Redis is a sexy option, and you can easily implement a table using lists. Also, it comes with a decent python driver.