I have large text files that need parsing and cleaning, and I am using a Jupyter notebook to do it. I have a SQL Server DB that I want to insert the data into once it is prepared. I used pyodbc to insert the final dataframe into SQL Server. df is my dataframe and sqlInsertQuery holds my SQL INSERT statement:
df_records = df.values.tolist()
cursor.executemany(sqlInsertQuery,df_records)
cursor.commit()
For a few rows it works fine, but when I try to insert the whole dataframe at once with the code above, executemany() runs for hours and keeps running until I stop the kernel.
I exported one file/dataframe to an Excel file and it is about 83 MB, as my dataframe contains very large strings and lists.
Someone recommended using fast_executemany instead, but it seems to be faulty.
Others recommended packages other than pyodbc.
Some said it is better not to use Jupyter and to use PyCharm or IPython instead.
I could not work out what the best/fastest way is to insert my data into my DB in my case. I am not a developer and I would really appreciate your help on this.
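For reference, turning on fast_executemany with pyodbc is a one-line change on the cursor. This is only a sketch: the driver name and connection string below are placeholders, and df and sqlInsertQuery are the variables from the question. Note that with very long strings fast_executemany can itself use a lot of memory, which may be why it was described as faulty.
import pyodbc

# Placeholder connection details -- replace with your own server/database/driver.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my_server;DATABASE=my_db;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# fast_executemany batches the parameter sets client-side instead of sending
# one round trip per row, which usually speeds up large executemany() inserts.
cursor.fast_executemany = True

df_records = df.values.tolist()  # df and sqlInsertQuery as in the question
cursor.executemany(sqlInsertQuery, df_records)
conn.commit()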
I am currently working on a POC where we would like to get Snowflake query results into an email using Python.
For example: when executing an INSERT statement in Snowflake, I would like to capture the result showing how many records were inserted. Please note that we are using the Snowflake Connector for Python to execute our queries from a Python script. We also use dataframes to store and process data internally.
Any help is appreciated!
Following the INSERT statement, you can retrieve the number of rows inserted from cursor.rowcount.
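A minimal sketch with the Snowflake Connector for Python (the connection parameters and table names are placeholders):
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="my_db", schema="my_schema",
)
cursor = conn.cursor()
cursor.execute("INSERT INTO target_table SELECT * FROM staging_table")

# rowcount reflects the number of rows affected by the last executed statement.
print("%d records were inserted" % cursor.rowcount)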
I am currently working in a dev environment in Databricks, using a notebook to apply some Python code to analyse some dummy data (just a few thousand rows) held in a database table. I then deploy this to the main environment and run it on the real data (hundreds of millions of rows).
To start with, I just need values from a single column that meet a certain criterion. To get at the data I'm currently doing this:
spk_data = spark.sql("SELECT field FROM database.table WHERE field == 'value'")
data = spk_data.toPandas()
The rest of the Python notebook then does its thing on that data, which works fine in the dev environment, but when I run it for real it falls over at line 2 (the toPandas() call) saying it's out of memory.
I want to import the data DIRECTLY into the pandas dataframe and so remove the need to convert from Spark, as I'm assuming that will avoid the error, but after a LOT of Googling I still can't work out how. The only thing I've tried that appears syntactically valid is:
data = pd.read_table (r'database.table')
but just get:
'PermissionError: [Errno 13] Permission denied:'
(nb. unfortunately I have no control over the content, form or location of the database I'm querying)
You have to use pd.read_sql_query for this case.
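The call pattern looks roughly like this (a sketch only; the engine below uses a made-up generic SQLAlchemy URL rather than anything Databricks-specific, so you would still need a working DBAPI/JDBC connection to the database):
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection URL -- replace with one that actually reaches your database.
engine = create_engine("postgresql://user:password@host:5432/my_db")

data = pd.read_sql_query("SELECT field FROM my_table WHERE field = 'value'", con=engine)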
Your assumption is very likely to be untrue.
Spark is a distributed computation engine; pandas is a single-node toolset. So when you run a query on millions of rows, it's likely to fail. When you call df.toPandas(), Spark moves all of the data to your driver node, so if it's more than the driver's memory, it's going to fail with an out-of-memory exception. In other words, if your dataset is larger than memory, pandas is not going to work well.
Also, when using pandas on Databricks you are missing all of the benefits of the underlying cluster. You are just using the driver.
There are two sensible options to solve this:
redo your solution using Spark
use Koalas, which has an API mostly compatible with pandas (see the sketch below)
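A minimal sketch of the Koalas option, assuming the koalas package is available on the cluster and reusing the query from the question:
import databricks.koalas as ks

# ks.sql returns a Koalas DataFrame: the data stays distributed on the cluster
# but the object exposes a mostly pandas-compatible API.
kdf = ks.sql("SELECT field FROM database.table WHERE field == 'value'")

# pandas-style operations run on Spark under the hood instead of on the driver.
print(kdf["field"].value_counts().head())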
I have read access to a SQL Server instance and I reference 2 separate databases on that server. I need to run a query for a set of filtered ids, ranging from 500 to 10,000 ids depending on the day, received as an Excel spreadsheet and loaded into Python via a pandas DataFrame.
Note, I don't have access to this database via SSMS, so Python is my only way in.
The query is very simple:
query = "SELECT case.id as m, case.owner as o FROM case WHERE case.id = ? "
I put this through a loop and append the data to a list, ref:
for i in case['id']:
    ref.append(cursor.execute(query, i).fetchone())
or append to a dataframe
for i in case['id']:
    df = df.append(pd.read_sql_query(query, con, params=[i]))
Fairly straightforward; however, it is agonizingly slow. Am I doing something wrong here?
I used to do this with Visual Basic and, using arrays and loops, was blindingly fast.
Any advice on this would be duly appreciated.
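The row-by-row loop itself is the usual culprit: each iteration is a separate round trip to the server. One common alternative (a sketch under the assumption that con is the pyodbc connection and case the DataFrame from the question) is to send the ids in batches with an IN clause:
import pandas as pd

ids = list(case["id"])
frames = []

# Query in chunks of 1000 ids (an arbitrary batch size kept well under SQL
# Server's ~2100-parameter limit) instead of one round trip per id.
for start in range(0, len(ids), 1000):
    chunk = ids[start:start + 1000]
    placeholders = ",".join("?" * len(chunk))
    sql = ("SELECT case.id AS m, case.owner AS o "
           "FROM case WHERE case.id IN (%s)" % placeholders)
    frames.append(pd.read_sql_query(sql, con, params=chunk))

result = pd.concat(frames, ignore_index=True)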
I am running multiple (about 60) queries in Impala using impala-shell from a file and outputting to a file. I am using:
impala-shell -q "query_here; query_here; etc;" -o output_path.csv -B --output_delimiter=','
The issue is that the results are not separated between queries, so query 2's rows are appended directly as new rows onto the bottom of query 1's. I need to separate the results to match them up with each query, but I cannot tell where one query's results end and another's begin because it is one continuous CSV file.
Is there a way to run multiple queries like this and leave some type of space or delimiter between query results or any way to separate the results by which query they came from?
Thanks.
You could insert your own separators by issuing some extra queries, for example select '-----'; between the real queries.
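For example, reusing the command shape from the question (query_1, query_2, ... stand for the real queries):
impala-shell -q "query_1; select '-----'; query_2; select '-----'; query_3;" -o output_path.csv -B --output_delimiter=','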
Writing results of individual queries to local files is not yet possible, but there is already a feature request for it (IMPALA-2073). You can, however, easily save query results into HDFS as CSV files. You just have to create a new table to store the results specifying row format delimited fields terminated by ',', then use insert into table [...] select [...] to populate it. Please refer to the documentation sections Using Text Data Files with Impala Tables and INSERT Statement for details.
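A rough sketch of that approach in Impala SQL (the table and column names are made up for illustration):
-- Results table stored as comma-delimited text files in HDFS.
create table query_results (col_a string, col_b bigint)
  row format delimited fields terminated by ','
  stored as textfile;

-- Populate it from one of the real queries.
insert into table query_results
  select col_a, col_b from source_table;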
One comment suggested running the individual queries as separate commands and saving their results into separate CSV files. If you choose this solution, please be aware that DDL statements like create table are only guaranteed to take immediate effect in the connection in which they were issued. This means that creating a table and then immediately querying it in another impala shell is prone to failure. Even if you find that it works correctly, it may fail the next time you run it. On the other hand, running such queries one after the other in the same shell is always okay.
I have a 22 million row .csv file (~850 MB) that I am trying to load into a Postgres DB on Amazon RDS. It fails every time (I get a time-out error), even when I split the file into smaller parts (each of 100,000 rows) and even when I use chunksize.
All I am doing at the moment is loading the .csv as a dataframe and then writing it to the db using df.to_sql(table_name, engine, index=False, if_exists='append', chunksize=1000)
I am using create_engine from sqlalchemy to create the connection: engine = create_engine('postgresql:database_info')
I have tested writing smaller amounts of data with psycopg2 without a problem, but it takes around 50 seconds to write 1000 rows. Obviously for 22m rows that won't work.
Is there anything else I can try?
The pandas DataFrame.to_sql() method is not especially designed for large inserts, since it does not utilize the PostgreSQL COPY command.
Regular SQL queries can time out; it's not the fault of pandas, it's controlled by the database server but can be modified per connection. See this page and search for 'statement_timeout'.
What I would recommend is to consider using Redshift, which is optimized for data warehousing and can read huge data dumps directly from S3 buckets using the Redshift COPY command.
If you are in no position to use Redshift, I would still recommend finding a way to do this operation using the PostgreSQL COPY command, since it was invented to circumvent exactly the problem you are experiencing.
You can write the dataframe to a cStringIO buffer and then write it to the database using the copy_from method in psycopg2, which I believe does implement the PostgreSQL COPY command that @firelynx mentions.
import cStringIO  # Python 2; on Python 3 use io.StringIO instead

# Build a tab-separated, newline-delimited text buffer from the dataframe rows.
dboutput = cStringIO.StringIO()
rows = output.T.to_dict().values()
dboutput.write('\n'.join(
    '\t'.join([row['1_str'], row['2_str'], str(row['3_float'])])
    for row in rows
))
dboutput.seek(0)

# Stream the buffer into the table using the COPY protocol.
cursor.copy_from(dboutput, 'TABLE_NAME')
connection.commit()
where output is originally a pandas dataframe with columns [1_str, 2_str, 3_float] that you want to write to the database.