I want to load a csv.gz file from Cloud Storage into BigQuery. Right now I am using the code below, but I am not sure whether it is an efficient way to load the data.
# -*- coding: utf-8 -*-
from io import BytesIO
import pandas as pd
from google.cloud import storage
import pandas_gbq as gbq

client = storage.Client.from_service_account_json(service_account)
bucket = client.get_bucket("bucketname")
blob = storage.blob.Blob("somefile.csv.gz", bucket)
content = blob.download_as_string()
df = pd.read_csv(BytesIO(content), delimiter=',', quotechar='"', low_memory=False)
df = df.astype(str)
df.columns = df.columns.str.replace("|", "")
df["dateinsert"] = pd.Timestamp.now()  # pd.datetime.now() is deprecated
gbq.to_gbq(df, 'desttable',
           'projectid',
           chunksize=None,
           if_exists='append')
Please help me write this code in a more efficient way.
I propose this process:
Perform a load job into BigQuery
Add the schema (yes, 150 columns is tedious...)
Add the skip-leading-rows option to skip the header: job_config.skip_leading_rows = 1
Name your table like this: <dataset>.<tableBaseName>_<Datetime>. The datetime must be in a string format compliant with BigQuery table names, for example YYYYMMDDHHMM.
When you query your data, you can query a subset of the tables and inject the table name into the query result, like this:
SELECT *,
  (SELECT table_id
   FROM `<project>.<dataset>.__TABLES_SUMMARY__`
   WHERE table_id LIKE '<tableBaseName>%')
FROM `<project>.<dataset>.<tableBaseName>*`
Of course, you can refine the * with the year, month, day, ...
I think this meets all your requirements. Comment if something goes wrong.
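A minimal sketch of this load-job approach with the google-cloud-bigquery client; the project, dataset, bucket and column names here are placeholders, not from the question, and the schema is truncated to two columns:

```python
from datetime import datetime


def dated_table_id(project, dataset, base_name, now=None):
    """Build <project>.<dataset>.<tableBaseName>_<YYYYMMDDHHMM>."""
    now = now or datetime.now()
    return f"{project}.{dataset}.{base_name}_{now:%Y%m%d%H%M}"


def load_gzipped_csv(uri, table_id, schema):
    """Run a BigQuery load job for a gs://...csv.gz file."""
    # imported here so the table-name helper above works without the client library
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # skip the header row
        schema=schema,         # explicit schema, column by column
    )
    # BigQuery decompresses the .gz file automatically during the load
    job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    return job.result()        # wait for the load to finish
```

Usage would look like load_gzipped_csv('gs://bucketname/somefile.csv.gz', dated_table_id('projectid', 'dataset', 'tableBaseName'), schema), where schema is your list of bigquery.SchemaField entries.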
I am trying to load a CSV file from S3 into a Redshift table using Python. I used boto3 to pull the data from S3, used pandas to convert data types (timestamp, string and integer), and tried to upload the dataframe to the table using to_sql (SQLAlchemy). It ended up with this error:
cursor.executemany(statement, parameters)
psycopg2.errors.StringDataRightTruncation: value too long for type character varying(256)
Additional info: the string column contains a large amount of mixed data. I am able to write the output as a CSV on my local machine.
My code is as follows:
import io
import boto3
import pandas as pd
from sqlalchemy import create_engine
from datetime import datetime

client = boto3.client('s3', aws_access_key_id="",
                      aws_secret_access_key="")
response = client.get_object(Bucket='', Key='*.csv')
file = response['Body'].read()
df = pd.read_csv(io.BytesIO(file))
df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
df['text'] = df['text'].astype(str)
df['count'] = df['count'].fillna(0).astype(int)
con = create_engine('postgresql://*.redshift.amazonaws.com:5439/dev')
select_list = ['date', 'text', 'count']
df = df[select_list]
df.to_sql('test', con, schema='parent', index=False, if_exists='replace')
I am a beginner; please help me understand what I am doing wrong. Ignore any typos. Thanks.
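One likely cause is that to_sql maps pandas string columns to Redshift's default VARCHAR(256), which is too short for the text column. A sketch of a possible fix, passing an explicit wider type through the dtype parameter (65535 is Redshift's maximum VARCHAR length; the column names are from the question, the helper is hypothetical):

```python
import pandas as pd
from sqlalchemy import types


def redshift_dtypes(df, text_length=65535):
    """Map every object (string) column to a wide VARCHAR so Redshift
    doesn't fall back to its default VARCHAR(256)."""
    return {
        col: types.VARCHAR(length=text_length)
        for col in df.columns
        if df[col].dtype == object
    }


df = pd.DataFrame({"date": pd.to_datetime(["2024-01-01"]),
                   "text": ["x" * 1000],
                   "count": [1]})
dtypes = redshift_dtypes(df)
# then: df.to_sql('test', con, schema='parent', index=False,
#                 if_exists='replace', dtype=dtypes)
```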
I'm using this script to query data from a CSV file that's saved in an AWS S3 bucket. It works well with CSV files that were originally saved in comma-separated format, but I have a lot of data saved with a tab delimiter (sep='\t'), which makes the code fail.
The original data is very large, which makes it difficult to rewrite. Is there a way to query the data while specifying the delimiter/separator for the CSV file?
I used it from this post: https://towardsdatascience.com/how-i-improved-performance-retrieving-big-data-with-s3-select-2bd2850bc428 ... I'd like to thank the writer for the tutorial which helped me save a lot of time.
Here's the code:
import boto3
import os
import pandas as pd

S3_KEY = r'source/df.csv'
S3_BUCKET = 'my_bucket'
TARGET_FILE = 'dataset.csv'
aws_access_key_id = 'my_key'
aws_secret_access_key = 'my_secret'

s3_client = boto3.client(service_name='s3',
                         region_name='us-east-1',
                         aws_access_key_id=aws_access_key_id,
                         aws_secret_access_key=aws_secret_access_key)

query = """SELECT column1
           FROM S3Object
           WHERE column1 = '4223740573'"""

result = s3_client.select_object_content(Bucket=S3_BUCKET,
                                         Key=S3_KEY,
                                         ExpressionType='SQL',
                                         Expression=query,
                                         InputSerialization={'CSV': {'FileHeaderInfo': 'Use'}},
                                         OutputSerialization={'CSV': {}})

# remove the file if it exists, since we append filtered rows line by line
if os.path.exists(TARGET_FILE):
    os.remove(TARGET_FILE)

with open(TARGET_FILE, 'a+') as filtered_file:
    # write the header as the first line, then append each row from S3 Select
    filtered_file.write('Column1\n')
    for record in result['Payload']:
        if 'Records' in record:
            res = record['Records']['Payload'].decode('utf-8')
            filtered_file.write(res)

result = pd.read_csv(TARGET_FILE)
The InputSerialization option also allows you to specify:
FieldDelimiter - A single character used to separate individual fields in a record. You can specify an arbitrary delimiter instead of the default comma.
So you could try:
result = s3_client.select_object_content(
    Bucket=S3_BUCKET,
    Key=S3_KEY,
    ExpressionType='SQL',
    Expression=query,
    InputSerialization={'CSV': {'FileHeaderInfo': 'Use', 'FieldDelimiter': '\t'}},
    OutputSerialization={'CSV': {}})
In my case I had a TSV file, and I used this InputSerialization:
InputSerialization={'CSV': {'FileHeaderInfo': 'None', 'RecordDelimiter': '\n', 'FieldDelimiter': '\t'}}
It works for files that have newlines between records and tabs between fields.
How can I create a loop with pandas read_csv?
I need to loop over the data to list it and save it to the database.
How can I do this loop with the data from a CSV?
Thank you all for your attention.
produtos = pd.read_csv('tabela.csv', delimiter=';')
for produto in produtos:
    print(produto['NOME'])
To iterate over the DataFrame returned by pandas read_csv, use iterrows(), which yields (index, row) pairs, as in the example below:
for i, produto in produtos.iterrows():
    print(produto['NOME'])
If you have files that you need to save, I recommend this:
import os
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite://', echo=False)
path = "C:/path/to/directory"

# list all files in the directory, assuming this directory
# contains only the csv files that you need to save
for file in os.listdir(path):
    df = pd.read_csv(os.path.join(path, file))
    # some other data cleaning/manipulation
    # write the dataframe to the database; append so the second
    # file does not fail on the already-existing table
    df.to_sql("table_name", con=engine, if_exists='append')
Alternatively, you can build a list of all the file locations and iterate through that instead. See the docs for more info on to_sql(), and check out this answer.
You can create the loop with pandas and access each column by name:
produtos = pd.read_csv('tabela.csv', delimiter=';')
for i, produto in produtos.iterrows():
    print(produto['NOME'])
But if you want to insert directly into your database, use SQLAlchemy and the to_sql function like this:
from sqlalchemy import create_engine
import pandas as pd
...
engine = create_engine("mysql://user:pwd@localhost/database")
produtos = pd.read_csv('tabela.csv', delimiter=';')
if_exists_do = 'append'
produtos.to_sql('table_name', con=engine, if_exists=if_exists_do)
Then it will be inserted into the database. The variable if_exists_do can be set to 'replace' instead if you want that behaviour.
I have been scraping CSV files from the web every minute and storing them in a directory.
The files are being named according to the time of retrieval:
name = 'train'+str(datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S"))+'.csv'
I need to upload each file into a database created on some remote server.
How can I do the above?
You can use pandas and SQLAlchemy to load CSVs into databases. I use MSSQL, and my code looks like this:
import os
import pandas as pd
import sqlalchemy as sa

server = 'your server'
database = 'your database'
engine = sa.create_engine('mssql+pyodbc://' + server + '/' + database +
                          '?driver=SQL+Server+Native+Client+11.0')

for filename in os.listdir(directory):  # iterate over files
    df = pd.read_csv(os.path.join(directory, filename), sep=',')
    tableName = os.path.splitext(filename)[0]  # removes the .csv extension
    df.to_sql(tableName, con=engine, dtype=None)  # send data to the server
By setting the dtype parameter you can control the datatype conversion (e.g. if you want smallint instead of integer, etc.).
To ensure you don't write the same file/table twice, I would suggest keeping a logfile in the directory where you record which CSV files have been written to the DB, and then excluding those in your for-loop.
I have a number of large csv (tab delimited) data stored as azure blobs, and I want to create a pandas dataframe from these. I can do this locally as follows:
from azure.storage.blob import BlobService
import pandas as pd
import os.path
STORAGEACCOUNTNAME= 'account_name'
STORAGEACCOUNTKEY= "key"
LOCALFILENAME= 'path/to.csv'
CONTAINERNAME= 'container_name'
BLOBNAME= 'bloby_data/000000_0'
blob_service = BlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
# Only get a local copy if we haven't already got it
if not os.path.isfile(LOCALFILENAME):
    blob_service.get_blob_to_path(CONTAINERNAME, BLOBNAME, LOCALFILENAME)
df_customer = pd.read_csv(LOCALFILENAME, sep='\t')
However, when running the notebook on Azure ML Notebooks, I can't save a local copy and then read from the CSV, so I'd like to do the conversion directly (something like pd.read_azure_blob(blob_csv), or just pd.read_csv(blob_csv), would be ideal).
I can get to the desired end result (pandas dataframe for blob csv data), if I first create an azure ML workspace, and then read the datasets into that, and finally using https://github.com/Azure/Azure-MachineLearning-ClientLibrary-Python to access the dataset as a pandas dataframe, but I'd prefer to just read straight from the blob storage location.
The accepted answer will not work with the latest Azure Storage SDK. MS has rewritten the SDK completely, which is kind of annoying if you are on the old version and have to update. The code below should work with the new version.
from azure.storage.blob import ContainerClient
from io import StringIO
import pandas as pd
conn_str = ""
container = ""
blob_name = ""
container_client = ContainerClient.from_connection_string(
    conn_str=conn_str,
    container_name=container
)
# Download blob as StorageStreamDownloader object (stored in memory)
downloaded_blob = container_client.download_blob(blob_name)
df = pd.read_csv(StringIO(downloaded_blob.content_as_text()))
I think you want to use get_blob_to_bytes or get_blob_to_text; these output a string which you can use to create a dataframe:
from io import StringIO
blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME)
df = pd.read_csv(StringIO(blobstring))
Thanks for the answer, but I think a correction is needed: you need to get the content from the returned blob object, and get_blob_to_text takes no local file name.
from io import StringIO
blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME).content
df = pd.read_csv(StringIO(blobstring))
Simple answer, working as of 12th June 2022.
Below are the steps to read a CSV file from Azure Blob into a Jupyter notebook dataframe (Python).
STEP 1: Generate a SAS token & URL for the target CSV (blob) file on Azure Storage by right-clicking the blob/CSV file.
STEP 2: Copy the Blob SAS URL that appears below the button used for generating the SAS token and URL.
STEP 3: Use the line of code below in your Jupyter notebook to import the desired CSV. Replace the url value with the Blob SAS URL copied in the step above.
import pandas as pd
url ='Your Blob SAS URL'
df = pd.read_csv(url)
df.head()
Use ADLFS (pip install adlfs), which is an fsspec-compatible API for Azure data lakes (Gen1 and Gen2):
storage_options = {
    'tenant_id': tenant_id,
    'account_name': account_name,
    'client_id': client_id,
    'client_secret': client_secret
}
url = 'az://some/path.csv'
pd.read_csv(url, storage_options=storage_options)