I have the following code that successfully uploads an Excel file to PostgreSQL:
import os
import pandas as pd
from sqlalchemy import create_engine
dir_path = os.path.dirname(os.path.realpath(__file__))
df = pd.read_excel(os.path.join(dir_path, file_name), sheet_name="Sheet1")  # file_name is defined elsewhere
engine = create_engine('postgresql://postgres:!Password@localhost/Database')
df.to_sql('identifier', con=engine, if_exists='replace', index=False)
However, this leads to problems when trying to do simple queries such as updates in pgAdmin 4.
Are there any other ways to insert an Excel file into a PostgreSQL table using Python?
There is a faster way.
Take a look.
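One faster approach (a minimal sketch, not necessarily what this answer refers to; the file name and connection string are placeholders) is to bulk-load through PostgreSQL's COPY command via psycopg2's copy_expert, instead of row-by-row inserts:
import io
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('postgresql://postgres:password@localhost/Database')
df = pd.read_excel('file.xlsx', sheet_name='Sheet1')
# create an empty table with matching columns, then bulk-load the rows via COPY
df.head(0).to_sql('identifier', con=engine, if_exists='replace', index=False)
buffer = io.StringIO()
df.to_csv(buffer, header=False, index=False)
buffer.seek(0)
raw = engine.raw_connection()  # the underlying psycopg2 connection
try:
    with raw.cursor() as cur:
        cur.copy_expert('COPY identifier FROM STDIN WITH CSV', buffer)
    raw.commit()
finally:
    raw.close()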
I am trying to load a CSV file from S3 into a Redshift table using Python. I used boto3 to pull the data from S3 and pandas to convert the data types (timestamp, string and integer), then tried to upload the dataframe to the table using to_sql (SQLAlchemy). It ended up with this error:
cursor.executemany(statement, parameters) psycopg2.errors.StringDataRightTruncation: value too long for type character varying(256)
Additional info: the string column contains a large amount of mixed data. I am also able to write the output as a CSV on my local machine.
My code is as follows:
import io
import boto3
import pandas as pd
from sqlalchemy import create_engine
from datetime import datetime
client = boto3.client('s3', aws_access_key_id="",
                      aws_secret_access_key="")
response = client.get_object(Bucket='', Key='*.csv')
file = response['Body'].read()
df = pd.read_csv(io.BytesIO(file))
df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
df['text'] = df['text'].astype(str)
df['count'] = df['count'].fillna(0).astype(int)
con = create_engine('postgresql://*.redshift.amazonaws.com:5439/dev')
select_list = ['date','text','count']
write = df[select_list]
df = pd.DataFrame(write)
df.to_sql('test', con, schema='parent', index=False, if_exists='replace')
I am a beginner, so please help me understand what I am doing wrong. Ignore any typos. Thanks.
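For context on the error itself: to_sql's default string type typically ends up as VARCHAR(256) on Redshift, so long text values get truncated. A minimal sketch of one workaround, assuming the 'text' column from the code above is the offending one, is to pass an explicit dtype mapping with a larger length:
from sqlalchemy import types
df.to_sql('test', con, schema='parent', index=False, if_exists='replace',
          dtype={'text': types.VARCHAR(65535)})  # 65535 is Redshift's maximum VARCHAR length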
I want to load a csv.gz file from Cloud Storage into BigQuery. Right now I am using the code below, but I am not sure whether it is an efficient way to load the data into BigQuery.
# -*- coding: utf-8 -*-
from io import BytesIO
import pandas as pd
from google.cloud import storage
import pandas_gbq as gbq
client = storage.Client.from_service_account_json(service_account)  # path to the service-account JSON key
bucket = client.get_bucket("bucketname")
blob = bucket.blob("somefile.csv.gz")
content = blob.download_as_string()
df = pd.read_csv(BytesIO(content), delimiter=',', quotechar='"', low_memory=False)
df = df.astype(str)
df.columns = df.columns.str.replace("|", "", regex=False)  # strip '|' from column names
df["dateinsert"] = pd.Timestamp.now()
gbq.to_gbq(df, 'desttable', 'projectid', chunksize=None, if_exists='append')
Please help me write this code more efficiently.
I propose this process (a sketch of the load job follows at the end of this answer):
Perform a load job into BigQuery
Add the schema (yes, 150 columns is tedious...)
Add the skip-leading-rows option to skip the header: job_config.skip_leading_rows = 1
Name your table like this: <dataset>.<tableBaseName>_<Datetime>. The datetime must be a string format compliant with BigQuery table names, for example YYYYMMDDHHMM.
When you query your data, you can query a subset of the tables and inject the table name into the query result, like this:
SELECT *,
       (SELECT table_id
        FROM `<project>.<dataset>.__TABLES_SUMMARY__`
        WHERE table_id LIKE '<tableBaseName>%')
FROM `<project>.<dataset>.<tableBaseName>*`
Of course, you can refine the * with the year, month, day, ...
I think this meets all your requirements. Comment if something goes wrong.
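A rough sketch of that load job, assuming the service_account variable and the bucket, file and project names from the question (the dataset name and the 'some_column' field are placeholders, and only one schema field is shown):
from datetime import datetime
from google.cloud import bigquery
client = bigquery.Client.from_service_account_json(service_account)
table_name = 'desttable_' + datetime.now().strftime('%Y%m%d%H%M')
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row
    schema=[
        bigquery.SchemaField('some_column', 'STRING'),
        # ...one SchemaField per column, ~150 in total
    ],
)
load_job = client.load_table_from_uri(
    'gs://bucketname/somefile.csv.gz',  # gzip-compressed CSV is supported by load jobs
    'projectid.dataset.' + table_name,
    job_config=job_config,
)
load_job.result()  # wait for the load to complete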
I have been scraping CSV files from the web every minute and storing them in a directory.
The files are named according to the time of retrieval:
name = 'train'+str(datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S"))+'.csv'
I need to upload each file into a database created on some remote server.
How can I do the above?
You can use pandas and SQLAlchemy for loading CSVs into databases. I use MS SQL Server, and my code looks like this:
import os
import pandas as pd
import sqlalchemy as sa
server = 'your server'
database = 'your database'
directory = 'your directory of csv files'
engine = sa.create_engine('mssql+pyodbc://' + server + '/' + database
                          + '?driver=SQL+Server+Native+Client+11.0')
for filename in os.listdir(directory):  # iterate over files
    df = pd.read_csv(os.path.join(directory, filename), sep=',')
    tableName = os.path.splitext(filename)[0]  # removes the .csv extension
    df.to_sql(tableName, con=engine, dtype=None)  # send the data to the server
By setting the dtype parameter you can change the data-type conversion (e.g. if you want smallint instead of integer, etc.).
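For example (a minimal sketch reusing df, tableName and engine from the code above; the column name 'count' is just an illustration):
import sqlalchemy.types as st
df.to_sql(tableName, con=engine, dtype={'count': st.SMALLINT()})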
To ensure you don't write the same file/table twice, I would suggest keeping a logfile in the directory where you record which CSV files have been written to the DB, and then excluding those in your for-loop.
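A minimal sketch of that logfile idea (the log file name 'uploaded.log' is an assumption):
import os
log_path = os.path.join(directory, 'uploaded.log')
# collect the names of files already written to the DB
uploaded = set()
if os.path.exists(log_path):
    with open(log_path) as log:
        uploaded = set(line.strip() for line in log)
for filename in os.listdir(directory):
    if not filename.endswith('.csv') or filename in uploaded:
        continue  # skip non-CSV files and files already uploaded
    # ...read_csv / to_sql as above...
    with open(log_path, 'a') as log:
        log.write(filename + '\n')  # remember that this file has been uploaded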
I would like to know how to read several json files from a single folder (without specifying the file names, just that they are json files).
Also, is it possible to turn them into a pandas DataFrame?
Can you give me a basic example?
One option is listing all files in a directory with os.listdir and then finding only those that end in '.json':
import os, json
import pandas as pd
path_to_json = 'somedir/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
print(json_files) # for me this prints ['foo.json']
Now you can use pandas DataFrame.from_dict to read in the json (a python dictionary at this point) to a pandas dataframe:
montreal_json = pd.DataFrame.from_dict(many_jsons[0])
print(montreal_json['features'][0]['geometry'])
Prints:
{'type': 'Point', 'coordinates': [-73.6051013, 45.5115944]}
In this case I had appended some jsons to a list many_jsons. The first json in my list is actually a geojson with some geo data on Montreal. I'm familiar with the content already so I print out the 'geometry' which gives me the lon/lat of Montreal.
The following code sums up everything above:
import os, json
import pandas as pd
# this finds our json files
path_to_json = 'json/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
# here I define my pandas Dataframe with the columns I want to get from the json
jsons_data = pd.DataFrame(columns=['country', 'city', 'long/lat'])
# we need both the json and an index number so use enumerate()
for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json, js)) as json_file:
        json_text = json.load(json_file)
        # here you need to know the layout of your json and each json has to have
        # the same structure (obviously not the structure I have here)
        country = json_text['features'][0]['properties']['country']
        city = json_text['features'][0]['properties']['name']
        lonlat = json_text['features'][0]['geometry']['coordinates']
        # here I push a list of data into a pandas DataFrame at row given by 'index'
        jsons_data.loc[index] = [country, city, lonlat]
# now that we have the pertinent json data in our DataFrame let's look at it
print(jsons_data)
for me this prints:
  country           city                   long/lat
0  Canada  Montreal city  [-73.6051013, 45.5115944]
1  Canada        Toronto  [-79.3849008, 43.6529206]
It may be helpful to know that for this code I had two geojsons in a directory named 'json'. Each json had the following structure:
{"features":
[{"properties":
{"osm_key":"boundary","extent":
[-73.9729016,45.7047897,-73.4734865,45.4100756],
"name":"Montreal city","state":"Quebec","osm_id":1634158,
"osm_type":"R","osm_value":"administrative","country":"Canada"},
"type":"Feature","geometry":
{"type":"Point","coordinates":
[-73.6051013,45.5115944]}}],
"type":"FeatureCollection"}
Iterating a (flat) directory is easy with the glob module
from glob import glob
for f_name in glob('foo/*.json'):
...
As for reading JSON directly into pandas, see here.
Load all files that end with .json from a specific directory into a dict:
import os,json
path_to_json = '/lala/'
for file_name in [file for file in os.listdir(path_to_json) if file.endswith('.json')]:
    with open(path_to_json + file_name) as json_file:
        data = json.load(json_file)
        print(data)
Try it yourself:
https://repl.it/#SmaMa/loadjsonfilesfromfolderintodict
To read the json files:
import os
import glob
import json
contents = []
json_dir_name = '/path/to/json/dir'
json_pattern = os.path.join(json_dir_name, '*.json')
file_list = glob.glob(json_pattern)
for file in file_list:
    with open(file) as f:
        contents.append(json.load(f))
If you want to turn them into a pandas dataframe, use the pandas API.
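For instance (a sketch, assuming each file holds a single JSON object so that contents is a list of dicts):
import pandas as pd
df = pd.DataFrame(contents)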
More generally, you can use a generator:
import glob
import json
def data_generator(my_path_regex):
    for filename in glob.glob(my_path_regex):
        for json_line in open(filename, 'r'):
            yield json.loads(json_line)  # assumes one JSON object per line
my_arr = [_json for _json in data_generator(my_path_regex)]
I am using glob with pandas. Check out the code below:
import pandas as pd
from glob import glob
df = pd.concat([pd.read_json(f_name, lines=True) for f_name in glob('foo/*.json')])
A simple and very easy-to-understand answer.
import os
import glob
import pandas as pd
path_to_json = r'\path\here'
# collect all files in the folder that end with .json
json_files = glob.glob(os.path.join(path_to_json, '*.json'))
# read all files into one dataframe
df = pd.concat((pd.read_json(f) for f in json_files))
print(df.head())
I feel a solution using pathlib is missing :)
from pathlib import Path
file_list = list(Path("/path/to/json/dir").glob("*.json"))
One more option is to read the files as a PySpark DataFrame and then convert it to a pandas DataFrame (if really necessary; depending on the operation, I'd suggest keeping it as a PySpark DF). Spark natively handles a directory of JSON files as the input path, without needing extra libraries to read or iterate over each file:
# pip install pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark_df = spark.read.json('/some_dir_with_json/*.json')
Next, to convert it into a pandas DataFrame, you can do:
df = spark_df.toPandas()