I'm trying to create an external table in BQ using data stored in GCS bucket. Below is the DDL command I'm using:
CREATE OR REPLACE EXTERNAL TABLE `external table`
OPTIONS (
format = 'parquet',
uris = ['gs://...', 'gs://...']
);
How can I exclude a particular column from being imported into the external table? I cannot ALTER an external table to DROP a column after it has been created.
There is no option to select specific columns while loading data from GCS.
The following Google documentation lists all possible configurations and properties for loading data into BigQuery from GCS, and the option you are looking for is not among them.
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv
But there are other ways to do this.
First option:
Load the data into a temporary table first, then select only the required columns from it to populate the second table (a sketch follows below).
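A minimal sketch of this with the BigQuery Python client, where the project, dataset, table names, GCS URI, and unwanted_col are all placeholders you would substitute:

from google.cloud import bigquery

client = bigquery.Client()

# Load the Parquet files from GCS into a temporary table as-is.
load_job = client.load_table_from_uri(
    ["gs://my-bucket/path/*.parquet"],  # placeholder URIs
    "my_project.my_dataset.temp_table",
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET),
)
load_job.result()  # wait for the load to finish

# Populate the second table with every column except the unwanted one.
client.query("""
    CREATE OR REPLACE TABLE `my_project.my_dataset.final_table` AS
    SELECT * EXCEPT (unwanted_col)
    FROM `my_project.my_dataset.temp_table`
""").result()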
Second option:
If you don't want to do this manually, you can use Cloud Run triggered by BigQuery events. Whenever you load data into the first table, it triggers a Cloud Run service in which your code removes the column you do not want and inserts the remaining data into the second table (a minimal handler sketch follows the link below).
https://cloud.google.com/blog/topics/developers-practitioners/how-trigger-cloud-run-actions-bigquery-events
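A very small sketch of what the Cloud Run service's handler could do once triggered; the trigger wiring is described in the linked post, and the table and column names here are placeholders:

import flask
from google.cloud import bigquery

app = flask.Flask(__name__)
client = bigquery.Client()

@app.route("/", methods=["POST"])
def handle_bq_event():
    # Triggered after data lands in the first table; copy everything except the
    # unwanted column. In practice you would restrict this to the newly loaded rows.
    client.query("""
        INSERT INTO `my_project.my_dataset.second_table`
        SELECT * EXCEPT (unwanted_col)
        FROM `my_project.my_dataset.first_table`
    """).result()
    return ("", 204)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)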
Third option:
If this is just a one-time activity, you can load the whole data into one table and then create a view with only the required columns on top of it (a sketch follows below).
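The view route could look like this, again with placeholder project, dataset, table, and column names:

from google.cloud import bigquery

client = bigquery.Client()

# Expose only the columns you need; the underlying table keeps everything.
client.query("""
    CREATE OR REPLACE VIEW `my_project.my_dataset.trimmed_view` AS
    SELECT * EXCEPT (unwanted_col)
    FROM `my_project.my_dataset.full_table`
""").result()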
I am trying to figure out how to iterate through the rows of a .CSV file and enter that data into a table in SQLite, but only if the data in that row meets certain criteria.
I am trying to build a database of my personal spending. I have used Python to categorise my spending data, and I now want to enter that data into a database with each category as a different table. This means I need to sort the data and enter it into different tables based on the category of spend.
I looked for quite a long time. Can anyone help?
You need to read the CSV file using pandas and store it in a pandas DataFrame. Then (if you have not already created a database) use the SQLAlchemy library (here is the documentation) to create an engine: engine = sqlalchemy.create_engine('sqlite:///file.db').
Afterwards, you need to write the DataFrame to the SQL database using the pandas to_sql function (documentation): df.to_sql('file_name', engine, index=False). I used index=False to avoid creating a column for the DataFrame's index.
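Putting this together with the filtering and per-category tables from the question, a rough sketch (the file name, column names, and filter condition are assumptions to adapt):

import pandas as pd
import sqlalchemy

# Read the CSV into a DataFrame.
df = pd.read_csv("spending.csv")  # placeholder file name

# Keep only the rows that meet your criteria, e.g. positive amounts.
df = df[df["amount"] > 0]

# Create (or connect to) the SQLite database.
engine = sqlalchemy.create_engine("sqlite:///spending.db")

# Write each spending category to its own table.
for category, rows in df.groupby("category"):
    rows.to_sql(category, engine, if_exists="append", index=False)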
I have a dataset in BigQuery with 100 thousand+ rows and 10 columns, and I'm continuously adding new data to it. I want to fetch the data that has not been processed yet, process it, and write it back to my table. Currently, I'm fetching it into a pandas DataFrame using the BigQuery Python library and processing it with pandas.
Now, I want to update the table with the new pre-processed data. One way of doing it is with a SQL statement, calling the query function of the bigquery.Client() class, or using a job like here.
bqclient = bigquery.Client(
    credentials=credentials,
    project=project_id,
)
query = """UPDATE `dataset.table` SET field_1 = '3' WHERE field_2 = '1'"""
bqclient.query(query).result()  # run the UPDATE and wait for it to finish
But it doesn't make sense to create an UPDATE statement for each row.
Another way I found is the to_gbq function of the pandas-gbq package. The disadvantage of this is that it replaces the whole table.
Question: what is the best way of updating a BigQuery table from a pandas DataFrame?
Google BigQuery is mainly used for data analysis over data that is static or append-only; its architecture is not designed around updating individual values. Therefore, if you want to update the data, there are some options, but they are all fairly heavy:
1. The one you mentioned: running an UPDATE query, row by row.
2. Recreating the table using only the new values.
3. Appending the new data with a different timestamp (see the sketch after the links below).
4. Using partitioned tables [1] and, if possible, clustered tables [2]. This way, when you want to update the table you can filter on the partitioning and clustering columns, and the query will be less heavy. You can also append the new data into a new partition, say the current day's.
If you are using the data for analytical purposes, options 2 and 3 are probably the best, but I always recommend using [1] and [2].
[1] https://cloud.google.com/bigquery/docs/querying-partitioned-tables
[2] https://cloud.google.com/bigquery/docs/clustered-tables
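A minimal sketch of option 3, assuming df is the processed DataFrame and that the project, dataset, and table names are placeholders (load_table_from_dataframe also needs pyarrow installed):

import datetime
from google.cloud import bigquery

client = bigquery.Client()

# Mark the processed rows so they can be told apart from the raw ones.
df["processed_at"] = datetime.datetime.utcnow()

job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Append the processed DataFrame to the (ideally time-partitioned) table.
client.load_table_from_dataframe(
    df, "my_project.my_dataset.my_table", job_config=job_config
).result()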
I want to delete records from a Snowflake table based on a DataFrame.
Similarly, I want to perform an update on the Snowflake table on the basis of a "key" column in the DataFrame.
My research indicates that the Utils method can perform such DDL/DML operations, but I am unable to find an example to refer to.
As you mentioned, you can use the runQuery() method of the Utils object to execute DDL/DML SQL statements:
https://docs.snowflake.net/manuals/user-guide/spark-connector-use.html#executing-ddl-dml-sql-statements
If you want to do it based on some keys, you can iterate over the rows of the DataFrame and run an SQL statement for each one:
how to loop through each row of dataFrame in pyspark
But this will be a performance killer. Snowflake is a data warehouse, so you should always prefer "batch updates" instead of single row updates.
I would suggest writing your DataFrame to a staging table in Snowflake, and then calling a SQL statement that updates the rows in the target table based on the staging table.
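A rough PySpark sketch of that staging-table approach. Here sfOptions is assumed to hold your Snowflake connection options, STAGING_TABLE/TARGET_TABLE and the key/value columns are placeholders, and runQuery is reached through the JVM gateway (a common pattern with the Spark connector, not a separate Python API):

# Write the DataFrame to a staging table, then merge it into the target.
SNOWFLAKE_SOURCE = "net.snowflake.spark.snowflake"

(df.write
   .format(SNOWFLAKE_SOURCE)
   .options(**sfOptions)                # your Snowflake connection options
   .option("dbtable", "STAGING_TABLE")  # placeholder staging table
   .mode("overwrite")
   .save())

# Utils.runQuery is exposed through spark._jvm in PySpark.
sf_utils = spark._jvm.net.snowflake.spark.snowflake.Utils
sf_utils.runQuery(sfOptions, """
    MERGE INTO TARGET_TABLE t
    USING STAGING_TABLE s
      ON t.key = s.key
    WHEN MATCHED THEN UPDATE SET t.value = s.value
    WHEN NOT MATCHED THEN INSERT (key, value) VALUES (s.key, s.value)
""")

A DELETE based on the staging table can be issued the same way through runQuery.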
I would like to store CSV files in SQL Server. I've created a table with column "myDoc" as varbinary(max). I generate the CSV's on a server using Python/Django. I would like to insert the actual CSV (not the path) as a BLOB object so that I can later retrieve the actual CSV file.
How do I do this? I haven't been able to make much headway with this documentation, as it mostly refers to .jpg files:
https://msdn.microsoft.com/en-us/library/a1904w6t(VS.80).aspx
Edit:
I wanted to add that I'm trying to avoid FILESTREAM. The CSVs are too small (about 5 KB) to warrant it, and I don't need full-text search over them.
Not sure why you want varbinary over varchar, but it will work either way:
Insert Into YourTable (myDoc)
Select doc = BulkColumn FROM OPENROWSET(BULK 'C:\Working\SomeXMLFile.csv', SINGLE_BLOB) x
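If you'd rather push the bytes from the Python/Django side (instead of reading a file path on the SQL Server host, which OPENROWSET requires), a parameterized insert also works. A minimal sketch with pyodbc, where the connection string, file name, and table/column names are placeholders:

import pyodbc

# Read the generated CSV as raw bytes.
with open("report.csv", "rb") as f:  # placeholder file name
    csv_bytes = f.read()

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;UID=user;PWD=secret"
)
cursor = conn.cursor()

# The bytes parameter binds to the varbinary(max) column.
cursor.execute("INSERT INTO YourTable (myDoc) VALUES (?)", pyodbc.Binary(csv_bytes))
conn.commit()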
I have an HTML file on the network which updates almost every minute with new rows in a table. At any point, the file contains close to 15,000 rows. I want to create a MySQL table with all the data in that table, plus some more columns that I compute from the available data.
The HTML table contains, say, rows from the last 3 days. I want to store all of them in my MySQL table and update the table every hour or so (can this be done via cron?).
For connecting to the DB, I'm using MySQLdb, which works fine. However, I'm not sure what the best practices are for doing this. I can scrape the data using bs4 and connect to the table using MySQLdb. But how should I update the table? And what logic should I use to scrape the page so that it uses the least resources?
I am not fetching any results, just scraping and writing.
Any pointers, please?
My suggestion: instead of updating values row by row, try a bulk insert into a temporary table and then move the data into the actual table based on some timing key. If you have a key column like that, it will also make it easy to read only the most recently added rows.
You can adopt the following approach:
For the purpose of the discussion, let master be the final destination of the scraped data.
Then we can adopt the following steps:
1. Scrape the data from the web page.
2. Store this scraped data in a temporary table within MySQL, say temp.
3. Perform an EXCEPT-style operation to pull out only those rows which exist within temp but not in master.
4. Persist the rows obtained in step 3 within the master table.
Please refer to this link for understanding how to perform set operations in MySQL. Also, it would be advisable to place all this logic within a stored procedure and pass it the set of data to be processed (not sure if this part is possible in MySQL).
Adding one more step to the approach: based on the discussion below, we can use a timestamp-based column to determine the newest rows that need to be placed into the table. The above approach with set-based operations works well in case there are no timestamp-based columns.
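A rough sketch of steps 1-4 with bs4 and MySQLdb (assuming the page is reachable over HTTP and that the HTML table's cells map to two placeholder columns, row_key and value; the LEFT JOIN ... IS NULL anti-join emulates EXCEPT, which older MySQL versions don't support):

import MySQLdb
import requests
from bs4 import BeautifulSoup

# 1. Scrape the rows from the HTML table.
html = requests.get("http://example.com/data.html").text  # placeholder URL
soup = BeautifulSoup(html, "html.parser")
rows = [
    tuple(td.get_text(strip=True) for td in tr.find_all("td"))
    for tr in soup.find("table").find_all("tr")
    if tr.find_all("td")
]

db = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
cur = db.cursor()

# 2. Bulk-load the scraped rows into a temporary table.
cur.execute("CREATE TEMPORARY TABLE temp LIKE master")
cur.executemany("INSERT INTO temp (row_key, value) VALUES (%s, %s)", rows)

# 3 + 4. Insert only the rows that are in temp but not yet in master.
cur.execute("""
    INSERT INTO master (row_key, value)
    SELECT t.row_key, t.value
    FROM temp t
    LEFT JOIN master m ON m.row_key = t.row_key
    WHERE m.row_key IS NULL
""")
db.commit()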