I made a Python 3 script to process some CSV files, but I have a problem with the data.
I send the data to the streaming insert with the insert_rows function. If I import only one file, the CSV and the BigQuery table end up with the same rows, but when I import more files, BigQuery ends up with fewer rows than the CSV files, even though insert_rows doesn't return any errors.
errors = connection.client.insert_rows(table_ref, info, selected_fields=schema) # API request
Thanks for the help
The issue was fixed by adding a new unique column to the CSV file, using the Python standard library to generate a unique id and add it to every row.
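For reference, a minimal sketch of that fix, assuming the rows are dicts and the standard library's uuid module is used (the column name row_uid is illustrative, and the table schema has to include the new column too):

import uuid

def add_unique_column(rows):
    # Attach a unique id to every row so otherwise identical rows stay distinct.
    for row in rows:
        row["row_uid"] = str(uuid.uuid4())  # illustrative column name
    return rows

# info is the list of row dicts read from the CSV:
# errors = connection.client.insert_rows(table_ref, add_unique_column(info), selected_fields=schema)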
Situation: A CSV file lands in AWS S3 every month. The vendor adds/removes/modifies columns in the file as they please, so the schema is not known ahead of time. The requirement is to create a table on the fly in Snowflake and load the data into it. Matillion is our ELT tool.
This is what I have done so far.
Set up a Lambda to detect the arrival of the file, convert it to JSON, upload it to another S3 dir, and add the filename to SQS.
Matillion detects the SQS message and loads the file with the JSON data into a VARIANT column in a Snowflake table.
A Snowflake stored proc takes the VARIANT column and generates a table based on the number of fields in the JSON data. The VARIANT column in Snowflake only works this way if it's JSON data; CSV is sadly not supported.
This works with 10,000 rows. The problem arises when I run it with a full file, which is over 1 GB and more than 10M rows: the Lambda job crashes with an out-of-disk-space error at runtime.
These are the alternatives I have thought of so far:
Attach an EFS volume to the Lambda and use it to store the JSON file prior to the upload to S3. JSON data files are much larger than their CSV counterparts; since the file has over 10M rows, I expect the JSON file to be around 10-20 GB.
Matillion has an Excel Query component that can take the headers, create a table on the fly, and load the file. I was thinking I could convert the header row from the CSV into an Excel file within the Lambda, pass it over to Matillion, have it create the structures, and then load the CSV file once the structure is created.
What are my other options here? Considerations include a nice repeatable design pattern for future large CSVs or similar requirements, the cost of EFS, and whether I am making the best use of the tools available to me. Thanks!!!
Why not split the initial CSV file into multiple files and then process each file in the same way you currently are?
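For illustration, a minimal sketch of that splitting step, assuming it is done in Python (e.g. inside the Lambda) and that the chunk size of 500,000 rows is an arbitrary choice:

import csv

def split_csv(src_path, dst_prefix, rows_per_chunk=500_000):
    # Split a large CSV into smaller chunks, repeating the header in each chunk.
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        chunk, part = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) >= rows_per_chunk:
                _write_chunk(dst_prefix, part, header, chunk)
                chunk, part = [], part + 1
        if chunk:
            _write_chunk(dst_prefix, part, header, chunk)

def _write_chunk(prefix, part, header, rows):
    with open(f"{prefix}_{part:04d}.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)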
Why are you converting CSV into JSON? The CSV can be loaded directly into a table without the data transformation that JSON specifically requires (the LATERAL FLATTEN to convert JSON into relational data rows). And why not use the Snowflake Snowpipe feature to load the data directly into Snowflake without using Matillion? You can split large CSV files into smaller chunks before loading them into Snowflake; this will help distribute the data processing load across Snowflake warehouses.
I also load CSV files from SFTP into Snowflake, using Matillion, with no idea of the schema.
In my process, I create a "temp" table in Snowflake, with 50 VARCHAR columns (Our files should never exceed 50 columns). Our data always contains text, dates or numbers, so VARCHAR isn't a problem. I can then load the .csv file into the temp table. I believe this should work for files coming from S3 as well.
That will at least get the data into Snowflake. How to create the "final" table however, given your scenario, I'm not sure.
I can imagine being able to use the header row, and/or doing some analysis on the 'type' of data contained in each column, to determine the column type needed.
But if you can get the 'final' table created, you could move the data over from temp. Or alter the temp table itself.
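For what it's worth, a rough sketch of the header/type-analysis idea above, assuming the file is sampled in Python and the generated DDL is then executed through whatever Snowflake connection you already have (all names and the type-guessing rules are illustrative):

import csv
from datetime import datetime

def _is_number(v):
    try:
        float(v)
        return True
    except ValueError:
        return False

def _is_date(v):
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            datetime.strptime(v, fmt)
            return True
        except ValueError:
            pass
    return False

def _guess_type(values):
    # Very rough type guess from sample values: NUMBER, DATE, else VARCHAR.
    non_empty = [v for v in values if v.strip()]
    if non_empty and all(_is_number(v) for v in non_empty):
        return "NUMBER"
    if non_empty and all(_is_date(v) for v in non_empty):
        return "DATE"
    return "VARCHAR"

def build_create_table(csv_path, table_name, sample_rows=1000):
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        sample = [row for _, row in zip(range(sample_rows), reader)]
    columns = []
    for i, name in enumerate(header):
        col_type = _guess_type([row[i] for row in sample if i < len(row)])
        columns.append(f'"{name.strip().upper()}" {col_type}')
    return f'CREATE OR REPLACE TABLE {table_name} ({", ".join(columns)})'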
This can be achieved using an external table, where the external table is mapped to a single column and the delimiter is a newline character. The external table also has a special virtual column that can be processed to extract all the columns dynamically, and a stored procedure can then create a table based on the number of columns at any given time. There is an interesting video that talks about this limitation in Snowflake (https://youtu.be/hRNu58E6Kmg)
I'm a beginner in Flask and would like to know how to update and delete CSV data using a Flask website.
My CSV database is:
name
Mark
Tom
Matt
I would like to know how I could add, update, and delete data in a CSV file using a Flask website.
Try out pandas
# Load the Pandas libraries with alias 'pd'
import pandas as pd
# Read data from file 'filename.csv'
data = pd.read_csv("filename.csv")
# Preview the first 5 lines of the loaded data
data.head()
Check out more here: pandas
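Since the question also asks about adding, updating, and deleting, here is a hedged sketch of those operations with pandas, assuming the single name column from the example CSV (you would call these helpers from your Flask routes):

import pandas as pd

CSV_PATH = "filename.csv"  # same file as above

def add_row(name):
    df = pd.read_csv(CSV_PATH)
    df.loc[len(df)] = [name]                            # append a new row
    df.to_csv(CSV_PATH, index=False)

def update_row(old_name, new_name):
    df = pd.read_csv(CSV_PATH)
    df.loc[df["name"] == old_name, "name"] = new_name   # update matching rows
    df.to_csv(CSV_PATH, index=False)

def delete_row(name):
    df = pd.read_csv(CSV_PATH)
    df = df[df["name"] != name]                         # drop matching rows
    df.to_csv(CSV_PATH, index=False)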
Why do you need to store or process the data in a CSV file? You will probably need conditional CRUD, which looks like a very troublesome way to do it.
You could use SQLite or a similar database instead of a CSV file; it is a much more efficient way. SQLite
Even so, if you are determined to use a CSV file, maybe this helps: CRUD with CSV
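As a rough illustration of the SQLite suggestion (the database file, table, and column names are made up to match the example data):

import sqlite3

conn = sqlite3.connect("people.db")  # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT)")

# Create
conn.execute("INSERT INTO people (name) VALUES (?)", ("Mark",))
# Read
rows = conn.execute("SELECT name FROM people").fetchall()
# Update
conn.execute("UPDATE people SET name = ? WHERE name = ?", ("Matt", "Tom"))
# Delete
conn.execute("DELETE FROM people WHERE name = ?", ("Mark",))

conn.commit()
conn.close()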
I am developing a data visualization dashboard in Tableau from hundreds of CSV files in an AWS S3 bucket, and new files are generated every day.
To achieve this and make the process faster, I am loading the files into an AWS Redshift DB. The CSV files can have new columns, and sometimes previously existing columns are not present in the incoming files. To handle this, I have modified my code to read and compare the headers; if new headers are present, it alters the table and adds the new columns.
However, the issue I am facing is the following:
CSV header values change over time, i.e. if a column is currently named 'cost', next month the 'cost' column might not be present but be mapped to a new column named 'Blended Cost'.
The COPY command to Redshift only works when the header positions match the column positions in the table. However, with such a dynamic file, matching column positions is not feasible. I'm exploring the DynamoDB option to overcome this issue.
What would be the best way to handle this situation? Any recommendation will be highly appreciated.
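For reference, a minimal sketch of the header-comparison step described above, assuming the existing column list has already been fetched from Redshift and that every new column is added as VARCHAR (the table, file, and type names are illustrative, and this does not by itself solve the renamed-column problem):

import csv

def missing_columns(csv_path, existing_columns):
    # Return header fields in the CSV that are not yet columns on the table.
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    existing = {c.lower() for c in existing_columns}
    return [h for h in header if h.strip().lower() not in existing]

def alter_statements(table, csv_path, existing_columns, col_type="VARCHAR(256)"):
    return [
        f'ALTER TABLE {table} ADD COLUMN "{col.strip()}" {col_type}'
        for col in missing_columns(csv_path, existing_columns)
    ]

# Example (illustrative); the statements would then be executed against Redshift:
# for stmt in alter_statements("billing", "report.csv", ["cost", "usage_date"]):
#     cursor.execute(stmt)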
I have a complex Excel spreadsheet that I'm trying to ingest and cleanse via xlrd. The existing spreadsheet is really designed to be more of a "readable" document, but I'm tasked with ingesting it as a data source. The trouble is that there is frequently a lot of spacing between the field names and the actual data. Ultimately I'd like to read in the contents of the Excel file, process it, and write a simplified file with just the data. Any ideas?
For example:
Have:
Want:
Here's the example spreadsheet: download
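Without seeing the exact layout, a hedged sketch of the general approach with xlrd: walk the sheet, drop empty cells and rows, and write the compacted values out as CSV (the file paths are placeholders):

import csv
import xlrd

book = xlrd.open_workbook("complex.xls")  # placeholder path
sheet = book.sheet_by_index(0)

with open("simplified.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for r in range(sheet.nrows):
        # keep only non-empty cells in this row
        values = [sheet.cell_value(r, c) for c in range(sheet.ncols)
                  if str(sheet.cell_value(r, c)).strip() != ""]
        if values:  # skip rows that are entirely blank
            writer.writerow(values)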
Problem
I was trying to implement a web API (based on Flask) that would be used to query the database given some specific conditions, reconstruct the data, and finally export the result to a .csv file.
Since the amount of data is really huge, I cannot construct the whole dataset and generate the .csv file all at once (e.g. create a DataFrame using pandas and finally call df.to_csv()), because that would cause a slow query and the HTTP connection might end up timing out.
So I created a generator which queries the database 500 records at a time and yields the results one by one, like:
def __generator(q):
    [...] # some code here
    offset, limit = 0, 500
    while True:
        records = q[offset:offset+limit]  # q means a sqlalchemy query object
        if not records:
            break
        [...] # omit some reconstruct code
        for record in records:
            yield record
        offset += limit
and finally construct a Response object and send the .csv to the client side:
return Response(__generator(q), mimetype='text/csv')  # Flask
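As a side note, each yielded record eventually has to become a line of CSV text; a minimal sketch of that serialization step using the standard csv module (the helper name is illustrative, not part of the original code):

import csv
import io

def record_to_csv_line(record):
    # Serialize one record (an iterable of fields) into a single CSV line.
    buf = io.StringIO()
    csv.writer(buf).writerow(record)
    return buf.getvalue()  # includes the trailing newline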
The generator works well and all data is encoded as UTF-8, but when I try to open the .csv file using Microsoft Excel, the text appears garbled.
Measures Already Tried
adding a BOM header to the export file: doesn't work;
using other encodings like 'gb18030' and 'cp936': most of the garbled text disappears, some still remains, and parts of the table structure become weird.
My Question Is
How can I make my code compatible with Microsoft Excel? That means at least two conditions should be satisfied:
no garbled text, displayed correctly;
a well-structured table;
I would really appreciate your answer!
How are you importing the CSV file into Excel? Have you tried importing the CSV as a text file?
By reading each column in text format, Excel won't modify columns that it would otherwise read as different types, like dates. Your code may be correct, and Excel may just be modifying the data when it parses it as a CSV; by importing in text format, it won't modify anything.
I would recommend you look into xlutils. It's been around for quite some time, and our company has used it both for reading configuration files to run automated tests and for generating reports of test results.