Dropping some data while saving dataframe into csv file - python

I am running a Redshift query that returns about 40 million records, but when I save the result to a CSV file it only contains about 7 thousand records. Could you please help me figure out how to solve this?
Example:
Code:
conn = gcso_conn1()
with conn.cursor() as cur:
    query = "select * from (select a.src_nm Source_System ,b.day_id Date,b.qty Market_Volume,b.cntng_unt Volume_Units,b.sls_in_lcl_crncy Market_Value,b.crncy_cd Value_Currency,a.panel Sales_Channel,a.cmpny Competitor_Name,a.lcl_mnfcr Local_Manufacturer ,a.src_systm_id SKU_PackID_ProductNumber,upper(a.mol_list) Molecule_Name,a.brnd_nm BrandName_Intl,a.lcl_prod_nm BrandName_Local,d.atc3_desc Brand_Indication,a.prsd_strngth_1_nbr Strength,a.prsd_strngth_1_unt Strength_Units,a.pck_desc Pack_Size_Number,a.prod_nm Product_Description,c.iso3_cntry_cd Country_ISO_Code,c.cntry_nm Country_Name from gcso_prd_cpy.dim_prod a join gcso_prd_cpy.fct_sales b on (a.SRC_NM='IMS' and b.SRC_NM='IMS' and a.prod_id = b.prod_id) join gcso_prd_cpy.dim_cntry c on (a.cntry_id = c.cntry_id) left outer join gcso_prd_cpy.dim_thrc_area d on (a.prod_id = d.prod_id) WHERE a.SRC_NM='IMS' and c.iso3_cntry_cd in ('JPN','IND','CAN','USA') and upper(a.mol_list) in ('AMBRISENTAN', 'BERAPROST','BOSENTAN') ORDER BY b.day_id ) a"
    #print(query)
    cur.execute(query)
    result = cur.fetchall()
    conn.commit()
    column = [i[0] for i in cur.description]
    sqldf = pd.DataFrame(result, columns=column)
    print(sqldf.count())
    #print(df3)
    sqldf.to_csv(Output_Path, index=False, sep='\001', encoding='utf-8')

Everything should work correctly. I think the main problem is that you are debugging with count(). You expect the number of records, but the docs say:
Count non-NA cells for each column or row.
When debugging a DataFrame it is better to use:
print(len(df))
print(df.shape)
print(df.info())
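As a quick illustration (a minimal sketch with made-up data), count() reports per-column non-NA counts, so it can be far smaller than the real row count whenever columns contain NaN:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, np.nan], "b": [np.nan, np.nan, 3]})

print(df.count())  # per-column non-NA counts: a -> 2, b -> 1
print(len(df))     # actual number of rows: 3
print(df.shape)    # (3, 2)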
You can also do it more easily using read_sql:
import pandas as pd
from sqlalchemy import create_engine

header = True
for chunk in pd.read_sql(
    'your query here - SELECT * FROM ... ',
    con=create_engine('creds', echo=True),  # set creds - postgresql+psycopg2://user:password@host:5432/db_name
    chunksize=1000,  # read in chunks
):
    file_path = '/tmp/path_to_your.csv'
    chunk.to_csv(
        file_path,
        header=header,
        mode='a',
        index=False,
    )
    header = False
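If you would rather keep the existing DB-API cursor from gcso_conn1() instead of building a SQLAlchemy engine, a similar streaming sketch (reusing query and Output_Path from the question) fetches in batches with fetchmany() and appends each batch to the CSV, so the full 40 million rows never have to sit in memory at once:
import pandas as pd

conn = gcso_conn1()  # connection helper from the question
with conn.cursor() as cur:
    cur.execute(query)  # same query string as above
    columns = [c[0] for c in cur.description]

    header = True
    while True:
        rows = cur.fetchmany(100000)  # pull 100,000 rows at a time
        if not rows:
            break
        pd.DataFrame(rows, columns=columns).to_csv(
            Output_Path, mode='a', header=header,
            index=False, sep='\001', encoding='utf-8')
        header = False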

Related

Python pandas not importing values as is on Oracle

I imported a txt file in my Python script and converted it to a dataframe. Then I created a function that uses cx_Oracle to insert my data into an Oracle database faster. It works pretty well and only took 15 minutes to import 1 million+ rows, but it doesn't copy the values as-is. This is a chunk of that code:
sqlquery = 'INSERT INTO {} VALUES({})'.format(tablename, inserttext)
df_list = df.values.tolist()
cur = con.cursor()
cur.execute(sql_query1)
logger.info("Completed: %s", sql_query1)

for b in df_list:
    for index, value in enumerate(b):
        if isinstance(value, float) and math.isnan(value):
            b[index] = None
        elif isinstance(value, type(pd.NaT)):
            b[index] = None
Here is a sample of the data I expected:

DATE                             STORE  COST         PARTIAL
16-JUN-21 08.00.00.000000000 PM  00006  +00000.0082  false

But instead this is what is being imported:

DATE       STORE  COST    PARTIAL
16-JUN-21  6      0.0082  F
I need it to be exactly the same, with the zeros, symbols, etc. I've already tried converting the dataframe to strings with df = df.astype(str), but it doesn't work.
Hopefully you can help!
Without going into whether the schema design and architecture are really what you should be using: with this schema:
create table t (d varchar2(31), s varchar2(6), c varchar(12), p varchar(5));
and this data in t.csv:
16-JUN-21 08.00.00.000000000 PM,00006,+00000.0082,false
and this code:
import cx_Oracle
import os
import sys
import csv

if sys.platform.startswith("darwin"):
    cx_Oracle.init_oracle_client(lib_dir=os.environ.get("HOME")+"/Downloads/instantclient_19_8")

username = os.environ.get("PYTHON_USERNAME")
password = os.environ.get("PYTHON_PASSWORD")
connect_string = os.environ.get("PYTHON_CONNECTSTRING")

connection = cx_Oracle.connect(username, password, connect_string)

with connection.cursor() as cursor:

    # Predefine the memory areas to match the table definition
    cursor.setinputsizes(31, 6, 12, 5)

    # Adjust the batch size to meet your memory and performance requirements
    batch_size = 10000

    with open('t.csv', 'r') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        sql = "insert into t (d, s, c, p) values (:1, :2, :3, :4)"
        data = []
        for line in csv_reader:
            data.append(line)
            if len(data) % batch_size == 0:
                cursor.executemany(sql, data)
                data = []
        if data:
            cursor.executemany(sql, data)
        connection.commit()

with connection.cursor() as cursor:
    sql = """select * from t"""
    for r in cursor.execute(sql):
        print(r)
the output is:
('16-JUN-21 08.00.00.000000000 PM', '00006', '+00000.0082', 'false')
For general reference see the cx_Oracle documentation Batch Statement Execution and Bulk Loading.
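If you do want to keep going through pandas as in the original code, a hedged alternative (a sketch; the file name and lack of a header row are assumptions) is to read the text file with dtype=str so pandas never parses anything, since astype(str) after the fact cannot restore leading zeros or formats that were already lost at parse time:
import pandas as pd

# Every column is read as a plain string; nothing is coerced to dates or numbers,
# so '00006' and '+00000.0082' arrive exactly as written in the file.
df = pd.read_csv('t.csv', dtype=str, keep_default_na=False,
                 names=['d', 's', 'c', 'p'])
print(df.values.tolist())
# [['16-JUN-21 08.00.00.000000000 PM', '00006', '+00000.0082', 'false']]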

More efficient way to query this SQL table from python?

I need to query rows where a column matches my list of ~60K IDs, out of a table that contains millions of IDs. I think normally you would insert the IDs into a temporary table in the database and merge on that, but I can't edit this database. I am doing it like this using a loop with a Python wrapper, but is there a better way? I mean, it works, but still:
import pyodbc
import pandas as pd

# connect to the database using windows authentication
conn = pyodbc.connect('DRIVER={SQL Server Native Client 11.0};SERVER=my_fav_server;DATABASE=my_fav_db;Trusted_Connection=yes;')
cursor = conn.cursor()

# read in all the ids
ids_list = [...60K ids in here..]

# query in 10K chunks to prevent memory error
def chunks(l, n):
    # split list into chunks of at most n items
    n = max(1, n)
    return [l[i:i+n] for i in range(0, len(l), n)]

chunked_ids_lists = chunks(ids_list, 10000)

# looping through to retrieve all cols
for chunk_num, chunked_ids_list in enumerate(chunked_ids_lists):
    temp_ids_string = "('" + "','".join(chunked_ids_list) + "')"
    temp_sql = f"SELECT * FROM dbo.my_fav_table WHERE ID IN {temp_ids_string};"
    temp_data = pd.read_sql_query(temp_sql, conn)
    temp_path = f"temp_chunk_{chunk_num}.txt"
    temp_data.to_csv(temp_path, sep='\t', index=None)

# read the query chunks
all_data_list = []
for chunk_num in range(len(chunked_ids_lists)):
    temp_path = f"temp_chunk_{chunk_num}.txt"
    temp_data = pd.read_csv(temp_path, sep='\t')
    all_data_list.append(temp_data)

all_data = pd.concat(all_data_list)
Another way is to use Psycopg's cursor.
import psycopg2

# Connect to an existing database
conn = psycopg2.connect("dbname=test user=postgres")

# Open a cursor to perform database operations
cur = conn.cursor()

# get data from the query
# no need to construct an 'SQL-correct syntax' filter by hand:
# psycopg2 adapts a Python tuple to an IN (...) list
cur.execute("SELECT * FROM dbo.my_fav_table WHERE ID IN %(filter)s;",
            {"filter": tuple(ids_list)})

# loop over the returned rows
for record in cur:
    # we got one record
    print(record)  # or do other processing
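If you want the rows back in a DataFrame, as in the question, a small follow-up sketch (instead of looping over the cursor; column names come from cursor.description) could be:
import pandas as pd

cur.execute("SELECT * FROM dbo.my_fav_table WHERE ID IN %(filter)s;",
            {"filter": tuple(ids_list)})
columns = [desc[0] for desc in cur.description]
all_data = pd.DataFrame(cur.fetchall(), columns=columns)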
Use parameters rather than concatenating strings.
I don't see the need for the CSV files, if you're just going to read them all into Python in the next loop. Just put everything into all_data_list during the query loop.
all_data_list = []
for chunk in chunked_ids_lists:
    params = ','.join(['?'] * len(chunk))
    sql = f"SELECT * FROM dbo.my_fav_table WHERE ID IN ({params});"
    cursor.execute(sql, chunk)
    rows = cursor.fetchall()
    all_data_list.extend(rows)

all_data = pd.DataFrame.from_records(
    all_data_list, columns=[col[0] for col in cursor.description])

Inserting selective columns to postgres using pandas python

The objective is to write Excel column data into a Postgres table, but not all of the column names in the Excel sheet match the table columns.
So what I am doing is trying to insert only the common columns.
I am able to get the common columns in a set.
I am stuck on how to insert the data in a single query.
I am using a pandas dataframe.
#Getting table columns in a list
conn = psycopg2.connect(dbname=dbname, host=host, port=port, user=user, password=pwd)
print("Connecting to Database")
cur = conn.cursor()
cur.execute("SELECT * FROM " + table_name + " LIMIT 0")
table_columns = [desc[0] for desc in cur.description]
#print table_columns

#Getting excel sheet columns in a list
df = pd.read_excel('/Users/.../plans.xlsx', sheet_name='plans')
engine = create_engine('postgresql://postgres:postgres@localhost:5432/test_db')
column_list = df.columns.values.tolist()
#print(column_list)

s = set(column_list).intersection(set(table_columns))

for x in df['column_1']:
    sql = "insert into test_table(column_1) values ('" + x + "')"
    cur.execute(sql)
cur.execute("commit;")
conn.close()
Update: I changed the code based on the answers, but with this, every time I run the program it inserts new records. Is there any option where I can just do an update instead?
s = set(column_list).intersection(set(table_columns))
df1 = df[df.columns.intersection(table_columns)]
#print df1
df1.to_sql('medical_plans', con=engine, if_exists='append', index=False, index_label=None)

how to insert variables in read_sql_query using python

I am trying to retrieve data from sqlite3 with the help of variables. It works fine with an execute() statement, but I would also like to retrieve the columns, and for that purpose I am using read_sql_query(). However, I am unable to pass variables to read_sql_query(). Please see the code below:
def cal():
    tab = ['LCOLOutput']
    column_name = 'CUSTOMER_EMAIL_ID'
    xyz = '**AVarma1#ra.rockwell.com'
    for index, m in enumerate(tab):
        table_name = m
        sq = "SELECT * FROM ? where ?=?;" , (table_name, column_name, xyz,)
        df = pandas.read_sql_query(sq, conn)
        writer = pandas.ExcelWriter('D:\pandas_simple.xlsx', engine='xlsxwriter')
        df.to_excel(writer, sheet_name='Sheet1')
        writer.save()
You need to change the syntax for the read_sql_query() method from pandas; check the docs.
For sqlite, parameters are passed like this:
sq = "SELECT * FROM ? where ?=?;"
param = (table_name, column_name, xyz,)
df = pandas.read_sql_query(sq, conn, params=param)
EDIT:
Otherwise, try the following formatting: the table and column names have to go into the query text (they are identifiers, not values), and only the value is bound as a parameter:
sq = "SELECT * FROM {} where {} = ?;".format(table_name, column_name)
param = (xyz,)
df = pandas.read_sql_query(sq, conn, params=param)
Check this answer explaining why a table (or column) name cannot be passed as a parameter directly.
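Putting it together, a minimal self-contained sketch (the in-memory database and sample data are just illustrative) that interpolates the identifiers and binds only the value:
import sqlite3
import pandas

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE LCOLOutput (CUSTOMER_EMAIL_ID TEXT, ORDER_ID INTEGER)")
conn.execute("INSERT INTO LCOLOutput VALUES ('a@example.com', 1), ('b@example.com', 2)")

table_name = 'LCOLOutput'
column_name = 'CUSTOMER_EMAIL_ID'
xyz = 'a@example.com'

# Table and column names go into the SQL text; the value is bound as a parameter.
sq = "SELECT * FROM {} WHERE {} = ?;".format(table_name, column_name)
df = pandas.read_sql_query(sq, conn, params=(xyz,))
print(df)  # one matching row: a@example.com, 1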

How can I execute multiple SQL statements in Python script?

I am working with the following code template in Python (using Atom to build/write).
import pyodbc
import pandas as pd
import win32com.client
cnxn = pyodbc.connect('Trusted_Connection=yes', driver='{SQL Server}', server='prodserver', database='XXXX')
cnxn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
cnxn.setencoding(str, encoding='utf-8')
cnxn.setencoding(unicode, encoding='utf-8', ctype=pyodbc.SQL_CHAR)
cursor = cnxn.cursor()
script ="""SELECT AccountsCount.AccountClass, COUNT(*) as Count
FROM
(SELECT *
FROM XXXX.dbo.table
where SubNo='001'
AND (DATENAME(WEEKDAY, GETDATE()) = 'Sunday' AND
convert(date,AddDate) = DATEADD(DAY, -2, CAST(GETDATE() as DATE))
) OR
(DATENAME(WEEKDAY, GETDATE()) = 'Monday' AND
convert(date,AddDate) = DATEADD(DAY, -3, CAST(GETDATE() as DATE))
) OR
(DATENAME(WEEKDAY, GETDATE()) = 'Sunday' AND
convert(date,AddDate) = DATEADD(DAY, -2, CAST(GETDATE() as DATE))
) OR
(DATENAME(WEEKDAY, GETDATE()) NOT IN ('Sunday', 'Monday') AND
convert(date,AddDate) = DATEADD(DAY, -1, CAST(GETDATE() as DATE))
)) AS AccountsCount
Group by AccountsCount.AccountClass
"""
df = pd.read_sql_query(script,cnxn)
writer = pd.ExcelWriter ('ExcelFile.xlsx')
df.to_excel(writer, sheet_name = 'Data Export')
writer.save()
xlApp = win32com.client.DispatchEx('Excel.Application')
xlsPath = ('OtherExcelFile.xlsm')
wb = xlApp.Workbooks.Open(Filename=xlsPath)
xlApp.Run('CopyIntoOutlook')
wb.Save()
xlApp.Quit()
All I need to do is add a second, completely separate SQL command to this script, which runs absolutely flawlessly and does what I need it to do as is. My additional script is something like this:
script= """ select AccountClass, COUNT(*) as Count
FROM XXXX.dbo.table
where SubNo='001'
AND AddDate >= '1/1/2017'
Group by AccountClass """
I have had no luck with anything I've tried as far as adding it into the script; any help is greatly appreciated! You'll notice the second script uses the same DB and table as the original; I just need YTD data as well as the top query, which looks at one day previous.
Update:
I was able to figure this out; I wrote in the following under writer.save():
script2= """
select AccountClass, COUNT(*) as Count
FROM XXXX.dbo.table
where SubNo='001'
AND AddDate >= '1/1/2017'
Group by AccountClass
"""
df2 = pd.read_sql_query(script2,cnxn)
writer = pd.ExcelWriter ('ExcelFile.xlsx')
df2.to_excel(writer, sheet_name = 'Data Export')
writer.save()
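Note that creating a second pd.ExcelWriter for the same path replaces the first file, so only the last sheet written survives. A hedged sketch (the second sheet name is an assumption) that writes both result sets into one workbook with a single writer:
df = pd.read_sql_query(script, cnxn)    # daily query from the top of the script
df2 = pd.read_sql_query(script2, cnxn)  # YTD query

writer = pd.ExcelWriter('ExcelFile.xlsx')
df.to_excel(writer, sheet_name='Data Export')
df2.to_excel(writer, sheet_name='YTD Export')  # hypothetical sheet name
writer.save()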
