I have been scraping csv files from the web every minute and storing them into a directory.
The files are being named according to the time of retrieval:
name = 'train'+str(datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S"))+'.csv'
I need to upload each file into a database created on some remote server.
How can I do the above?
You can use pandas and sqlalchemy for loading CSV into databases. I use MSSQL and my code looks like this:
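import os
import pandas as pd
import sqlalchemy as sa
server = 'your server'
database = 'your database'
directory = 'your directory'  # placeholder: folder that holds the scraped CSV files
# Create the engine once and reuse it for every file
engine = sa.create_engine('mssql+pyodbc://' + server + '/' + database + '?driver=SQL+Server+Native+Client+11.0')
for filename in os.listdir(directory):  # iterate over files
    if not filename.endswith('.csv'):
        continue  # skip anything that is not a CSV file
    df = pd.read_csv(os.path.join(directory, filename), sep=',')
    tableName = os.path.splitext(filename)[0]  # removes .csv extension
    df.to_sql(tableName, con=engine, dtype=None)  # send data to server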
By setting the dtype parameter you can control how column datatypes are converted (e.g. if you want smallint instead of integer).
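As a rough illustration of the dtype parameter, the sketch below writes a small frame to an in-memory SQLite database and asks for SMALLINT on one column. SQLite is used only so the example runs without a server; the table name, column names and data are hypothetical.

```python
import pandas as pd
import sqlalchemy as sa

# In-memory SQLite stands in for the remote MSSQL server (assumption for this demo)
engine = sa.create_engine('sqlite://')

# Hypothetical data; 'qty' would otherwise get pandas' default integer type
df = pd.DataFrame({'name': ['a', 'b'], 'qty': [1, 2]})

# Map column names to SQLAlchemy types to override the default conversion
df.to_sql('demo_table', con=engine, index=False,
          dtype={'qty': sa.types.SmallInteger()})

# Read back the declared column types to confirm the override took effect
column_types = {col['name']: str(col['type'])
                for col in sa.inspect(engine).get_columns('demo_table')}
print(column_types)
```

Columns that are not listed in the dtype mapping keep whatever type pandas would have chosen on its own.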
To ensure you don't write the same file/table twice, I would suggest keeping a log file in the directory that records which CSV files have already been written to the DB, and then excluding those in your for-loop.
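One possible shape for that log file, as a minimal sketch: the helper names `pending_csv_files` and `mark_uploaded` and the log name `uploaded.log` are made up for illustration, and the demo runs in a throwaway directory rather than against a real database.

```python
import os
import tempfile

def pending_csv_files(directory, log_path):
    """Return CSV file names in `directory` that are not yet listed in the log."""
    done = set()
    if os.path.exists(log_path):
        with open(log_path) as log:
            done = {line.strip() for line in log if line.strip()}
    return sorted(f for f in os.listdir(directory)
                  if f.endswith('.csv') and f not in done)

def mark_uploaded(log_path, filename):
    """Append a file name to the log after a successful upload."""
    with open(log_path, 'a') as log:
        log.write(filename + '\n')

# Small demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as directory:
    for name in ('train-a.csv', 'train-b.csv'):
        open(os.path.join(directory, name), 'w').close()
    log_path = os.path.join(directory, 'uploaded.log')

    first_pass = pending_csv_files(directory, log_path)
    mark_uploaded(log_path, 'train-a.csv')  # call this right after df.to_sql succeeds
    second_pass = pending_csv_files(directory, log_path)

print(first_pass, second_pass)
```

In the upload loop you would iterate over `pending_csv_files(...)` instead of `os.listdir(...)` and call `mark_uploaded` only once the write to the database has succeeded, so a failed upload is retried on the next run.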
I am trying to run a query, with the result saved as a CSV that is uploaded to a SharePoint folder. This is within Databricks via Pyspark.
My code below is close to doing this, but the final line is not functioning correctly - the file generated in SharePoint does not contain any data, though the dataframe does.
I'm new to Python and Databricks, so if anyone can provide some guidance on how to correct that final line I'd really appreciate it!
from shareplum import Site, Office365
from shareplum.site import Version
import pandas as pd
sharepointUsername =
sharepointPassword =
sharepointSite =
website =
sharepointFolder =
# Connect to SharePoint Folder
authcookie = Office365(website, username=sharepointUsername, password=sharepointPassword).GetCookies()
site = Site(sharepointSite, version=Version.v2016, authcookie=authcookie)
folder = site.Folder(sharepointFolder)
FileName = "Data_Export.csv"
Query = "SELECT * FROM TABLE"
df = spark.sql(Query)
pandasdf = df.toPandas()
folder.upload_file(pandasdf.to_csv(FileName, encoding = 'utf-8'), FileName)
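Sure, my code is still garbage, but it does work. I needed to convert the dataframe into a variable containing CSV-formatted data before uploading it to SharePoint; effectively I was trying to skip a step before. Calling to_csv with a file name writes the file locally and returns None, so nothing was being passed to upload_file. The last two lines were updated: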
from shareplum import Site, Office365
from shareplum.site import Version
import pandas as pd
sharepointUsername =
sharepointPassword =
sharepointSite =
website =
sharepointFolder =
# Connect to SharePoint Folder
authcookie = Office365(website, username=sharepointUsername, password=sharepointPassword).GetCookies()
site = Site(sharepointSite, version=Version.v2016, authcookie=authcookie)
folder = site.Folder(sharepointFolder)
FileName = "Data_Export.csv"
Query = "SELECT * FROM TABLE"
df = (spark.sql(Query)).toPandas().to_csv(header=True, index=False, encoding='utf-8')
folder.upload_file(df, FileName)
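The difference between the two versions comes down to what `to_csv` returns. A small sketch with made-up data (no Spark or SharePoint involved) shows both behaviours side by side:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'value': ['a', 'b']})  # hypothetical data

# With a path, to_csv writes the file to disk and returns None
with tempfile.TemporaryDirectory() as tmp:
    result_with_path = df.to_csv(os.path.join(tmp, 'Data_Export.csv'), index=False)

# Without a path, to_csv returns the CSV text itself
csv_text = df.to_csv(index=False)

print(result_with_path)
print(csv_text)
```

The text form (here `csv_text`) is the value that can be handed to an upload call expecting file content.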
I have the following code that successfully uploads an Excel file to PostgreSQL:
import os
import pandas as pd
from sqlalchemy import create_engine
dir_path = os.path.dirname(os.path.realpath(__file__))
df = pd.read_excel(os.path.join(dir_path, file_name), "Sheet1")  # file_name is defined elsewhere
engine = create_engine('postgresql://postgres:!Password@localhost/Database')
df.to_sql('identifier', con=engine, if_exists='replace', index=False)
However this leads to problems when trying to do simple queries such as updates in PgAdmin4.
Are there any other ways to insert an Excel file into a PostgreSQL table using Python?
There is a faster way.
Take a look.
I am a novice at Python and I am trying to create my first automated script in Jupyter notebooks. It should export my data pull from SQL Server to a specific path, and it needs to run daily.
My questions:
1- It needs to export the CSV file to a specific folder, and I don't know how to do that.
2- I need the code to run by itself on a daily basis.
I am stuck; any help is appreciated.
I have connected to SQL Server, successfully pulled the report, and written a CSV file.
import smtplib
import pyodbc
import pandas as pd
import pandas.io.sql
server = 'example server'
db = 'ExternalUser'
conn = pyodbc.connect('Driver={SQL Server};'
'Server=example server;'
'Database=ExternalUser;'
'Trusted_Connection=yes;')
cursor = conn.cursor()
cursor.execute("my SQL query")
col_headers = [ i[0] for i in cursor.description ]
rows = [ list(i) for i in cursor.fetchall()]
df = pd.DataFrame(rows, columns=col_headers)
df.to_csv("Test v2.csv", header = True, index=False)
For exporting the CSV to a certain folder: it depends on where and how you run the script. If you run the script in the folder where you want the CSV file saved, your current df.to_csv('filename.csv') works as is; otherwise pass a path such as 'Test_dir/filename.csv'. Alternatively, you could use a library like shutil (https://docs.python.org/3/library/shutil.html) to move the .csv file to a given folder afterwards.
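Both options can be sketched roughly as follows. The helper `output_path` and the folder names are hypothetical, and plain text files stand in for the DataFrame output so the example runs without a database:

```python
import shutil
import tempfile
from pathlib import Path

def output_path(folder, filename):
    """Build a full path inside `folder`, creating the folder if needed."""
    target = Path(folder)
    target.mkdir(parents=True, exist_ok=True)
    return target / filename

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)  # temporary stand-in for a real export location

    # Option 1: write straight to the destination,
    # e.g. df.to_csv(direct_path, header=True, index=False)
    direct_path = output_path(base / 'reports', 'Test v2.csv')
    direct_path.write_text('col\n1\n')

    # Option 2: write the file somewhere first, then move it with shutil
    staged = base / 'staged.csv'
    staged.write_text('col\n1\n')
    moved_path = Path(shutil.move(str(staged), str(output_path(base / 'archive', 'staged.csv'))))

    checks = (direct_path.exists(), moved_path.exists(), staged.exists())

print(checks)
```

Writing directly to the target path is usually the simpler of the two; moving with shutil is mainly useful when the file is produced by something you cannot point at a different folder.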
For running the code on a daily basis, you could do this locally on your machine (https://medium.com/@thabo_65610/three-ways-to-automate-python-via-jupyter-notebook-d14aaa78de9). Or you could look into configuring a cron job.
I want to load a csv.gz file from storage into BigQuery. Right now I am using the code below, but I am not sure whether it is an efficient way to load data into BigQuery.
# -*- coding: utf-8 -*-
from io import BytesIO
import pandas as pd
from google.cloud import storage
import pandas_gbq as gbq
client = storage.Client.from_service_account_json(service_account)
bucket = client.get_bucket("bucketname")
blob = storage.blob.Blob("""somefile.csv.gz""", bucket)
content = blob.download_as_string()
df = pd.read_csv(BytesIO(content), delimiter=',', quotechar='"', low_memory=False)
df = df.astype(str)
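df.columns = df.columns.str.replace("|", "", regex=False)  # treat "|" as a literal character, not a regex
df["dateinsert"] = pd.Timestamp.now()  # pd.datetime is no longer available in recent pandas versions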
gbq.to_gbq(df, 'desttable',
'projectid',
chunksize=None,
if_exists='append'
)
Please help me write this code in a more efficient way.
I propose this process:
Perform a load job into BigQuery.
Add the schema (yes, 150 columns is tedious...).
Add the skip-leading-rows option to skip the header: job_config.skip_leading_rows = 1
Name your table like this: <dataset>.<tableBaseName>_<Datetime>. The datetime must be a string in a format that is valid in a BigQuery table name, for example YYYYMMDDHHMM.
When you query your data, you can query a subset of table, and inject the table name in the query result, like this:
SELECT *,(SELECT table_id
FROM `<project>.<dataset>.__TABLES_SUMMARY__`
WHERE table_id LIKE '<tableBaseName>%') FROM `<project>.<dataset>.<tableBaseName>*`
Of course, you can refine the * wildcard with the year, month, day, and so on.
I think this meets all your requirements. Comment if something goes wrong.
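The load-job steps above could look roughly like this with the google-cloud-bigquery client. This is a sketch under assumptions: the helper names `dated_table_id` and `load_gzip_csv`, the dataset name `mydataset` and the example date are invented for illustration, and the schema has to be supplied by the caller.

```python
from datetime import datetime

def dated_table_id(dataset, base_name, when):
    """Build <dataset>.<tableBaseName>_<YYYYMMDDHHMM> as described above."""
    return f"{dataset}.{base_name}_{when.strftime('%Y%m%d%H%M')}"

def load_gzip_csv(uri, table_id, schema):
    """Run a BigQuery load job for a (gzip-compressed) CSV stored in Cloud Storage."""
    # Imported here so the naming helper above can be used without the library
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the header row
        schema=schema,        # list of bigquery.SchemaField, one per column
    )
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    return load_job.result()  # blocks until the job finishes

# Hypothetical dataset name and timestamp, just to show the naming scheme
table_id = dated_table_id('mydataset', 'desttable', datetime(2020, 1, 31, 9, 5))
print(table_id)
```

A call such as `load_gzip_csv('gs://bucketname/somefile.csv.gz', table_id, schema)` would then load the file straight from the bucket, without downloading it or passing it through pandas first.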
I have the following python script that downloads two files from an S3 compatible service. Then merges them and uploads the output to another bucket.
import time
import boto3
import pandas as pd
timestamp = int(time.time())
conn = boto3.client('s3')
conn.download_file('segment', 'segment.csv', 'segment.csv')
conn.download_file('payment', 'payments.csv', 'payments.csv')
paymentsfile = 'payments.csv'
segmentsfile = 'segment.csv'
outputfile = 'payments_merged_' + str(timestamp) + '.csv'
csv_payments = pd.read_csv(paymentsfile, dtype={'ID': float})
csv_segments = pd.read_csv(segmentsfile, dtype={'ID': float})
csv_payments = csv_payments.merge(csv_segments, on='ID')
open(outputfile, 'a').close()
csv_payments.to_csv(outputfile)
conn.upload_file(outputfile, backup, outputfile)
However, when I execute the script, it stores the files in the folder of my script. For security reasons I would like to prevent this from happening. I could delete the files after the script has run, but let's assume my script is located in the folder /app/script/. This means that for a short time, while the script is being executed, someone could open the URL example.com/app/script/payments.csv and download the file. What is a good solution for that?
In fact, pandas.read_csv lets you read from a buffer or bytes object, so you can do everything in memory. You can put this script on an instance or, even better, run it as an AWS Lambda function if the files are small.
import io
import time
import boto3
import pandas as pd
timestamp = int(time.time())
paymentsfile = 'payments.csv'
segmentsfile = 'segment.csv'
outputfile = 'payments_merged_' + str(timestamp) + '.csv'
s3 = boto3.client('s3')
# Read both objects straight from S3 without writing them to disk
payment_obj = s3.get_object(Bucket='payment', Key=paymentsfile)
segment_obj = s3.get_object(Bucket='segment', Key=segmentsfile)
csv_payments = pd.read_csv(payment_obj['Body'], dtype={'ID': float})
csv_segments = pd.read_csv(segment_obj['Body'], dtype={'ID': float})
csv_merge = csv_payments.merge(csv_segments, on='ID')
# Serialize to CSV text in memory, then wrap the encoded bytes for upload
buffer = io.BytesIO(csv_merge.to_csv().encode('utf-8'))
s3.upload_fileobj(buffer, 'bucket_name', outputfile)
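To try the in-memory part without touching S3, here is a small self-contained sketch. The helper name `merge_csv_bytes` and the byte payloads are hypothetical stand-ins for the S3 object bodies:

```python
import io
import pandas as pd

def merge_csv_bytes(payments_bytes, segments_bytes):
    """Merge two CSV payloads on 'ID' and return the result as a BytesIO buffer."""
    payments = pd.read_csv(io.BytesIO(payments_bytes), dtype={'ID': float})
    segments = pd.read_csv(io.BytesIO(segments_bytes), dtype={'ID': float})
    merged = payments.merge(segments, on='ID')
    # to_csv() without a path returns text; encode it for binary upload APIs
    return io.BytesIO(merged.to_csv(index=False).encode('utf-8'))

# Hypothetical payloads standing in for the downloaded S3 objects
payments_bytes = b'ID,amount\n1,10\n2,20\n'
segments_bytes = b'ID,segment\n1,gold\n3,silver\n'

buffer = merge_csv_bytes(payments_bytes, segments_bytes)
merged_text = buffer.getvalue().decode('utf-8')
print(merged_text)
```

The returned buffer is positioned at the start, so it can be passed directly to a call like `upload_fileobj`. Note that this sketch drops the index column with `index=False`, which is a choice made here for readability.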
The simplest way would be to modify the configuration of your web server to not serve the directory that you are writing to or write to a directory that isn't served. For example, a common practice is to use /scr for this type of thing. You would need to modify permissions for the user your web server runs under to ensure it has access to /scr.
To restrict web server access to the directory you write to you can use the following in Nginx -
https://serverfault.com/questions/137907/how-to-restrict-access-to-directory-and-subdirs
For Apache you can use this example -
https://serverfault.com/questions/174708/apache2-how-do-i-restrict-access-to-a-directory-but-allow-access-to-one-file-w
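If the files do have to touch the disk, writing to a directory that isn't served can also be done from the script itself. A minimal sketch, assuming the system temp location is outside the web root; `run_in_private_dir` and `demo_work` are illustrative names, not part of any library:

```python
import os
import tempfile

def run_in_private_dir(work):
    """Call work(path) with a private temporary directory that is deleted afterwards."""
    with tempfile.TemporaryDirectory(prefix='merge_') as tmp_dir:
        return tmp_dir, work(tmp_dir)

def demo_work(tmp_dir):
    # Stand-in for downloading, merging and uploading the CSV files
    path = os.path.join(tmp_dir, 'payments.csv')
    with open(path, 'w') as handle:
        handle.write('ID,amount\n1,10\n')
    return os.path.exists(path)

used_dir, existed_during_run = run_in_private_dir(demo_work)
still_there = os.path.exists(used_dir)
print(existed_during_run, still_there)
```

The directory and everything in it are removed when the `with` block exits, even if the work function raises, so no manual clean-up step is needed.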