Importing huge .sql script file (30GB) with only inserts - python

I need to import some SQL scripts generated in SSMS (Generate Scripts). These scripts only contain INSERTs.
So far I have managed to import almost everything using DbUp (https://dbup.readthedocs.io/en/latest/). My problem is with the larger files; in this case I have two, one of 2 GB and one of 30 GB.
The 2 GB file I imported using BigSqlRunner (https://github.com/kevinly1989/BigSqlRunner).
For the 30 GB file I've tried everything (PowerShell, split, etc.) without success; it always gives a memory error, and I can't find anything to help me split the file into multiple smaller files.
I'm asking for your help if you know of a better way or a solution for this.
The goal is to migrate data from one database (PRODUCTION) to another, empty database (also PRODUCTION, but not used). I am doing it by generating scripts in SSMS and then executing the scripts on the target database (for safety, since the source is production and I don't want to be reading it line by line while importing the data into the target database).
I am open to other solutions such as SSIS (SQL Server Integration Services), Python, PowerShell, C#, etc., but I have to be careful not to impact the production database when reading the data from the tables.

Update: I managed to solve it with SSIS. I created a source and a destination connection and used a source query with WITH (NOLOCK); it's running and is already halfway through. It's not the fastest, but it's working. Thank you all for your help.
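For anyone who still needs a pure-Python route, below is a minimal sketch that streams the script instead of loading it into memory, splitting on the GO separators that SSMS writes and executing one batch at a time with pyodbc. The file path, connection string, and encoding are assumptions and would need adjusting for your environment.

```python
# Stream a huge SSMS-generated .sql script (INSERTs separated by GO lines)
# without loading the whole file into memory. The file path, connection
# string, and encoding are placeholders -- adjust for your environment.
import pyodbc

SQL_FILE = r"C:\scripts\big_inserts.sql"   # hypothetical path
CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=target-server;DATABASE=TargetDb;Trusted_Connection=yes;"
)

def batches(path, encoding="utf-16"):      # SSMS often writes UTF-16; verify yours
    """Yield one batch of statements at a time, split on GO separator lines."""
    buffer = []
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            if line.strip().upper() == "GO":
                if buffer:
                    yield "".join(buffer)
                    buffer = []
            else:
                buffer.append(line)
        if buffer:
            yield "".join(buffer)

with pyodbc.connect(CONN_STR, autocommit=True) as conn:
    cursor = conn.cursor()
    for i, batch in enumerate(batches(SQL_FILE), start=1):
        cursor.execute(batch)
        if i % 1000 == 0:
            print(f"{i} batches executed")
```

Because the file is read line by line, memory use stays flat regardless of the script size.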

Related

Automate a manual task using Python

I have a question and hope someone can point me in the right direction. Basically, every week I have to run a query in SSMS to get a table containing some information (date, client number, client ID, order ID, etc.), and then I copy all the information from that table and paste it into a folder as a CSV file. It takes me about 15 minutes to do all this, but I keep thinking: can I automate it? If yes, how, and can I also schedule it so it runs by itself every week? We live in a technological era and this should be done without human input, so I hope I can find someone here willing to show me how to do it using Python.
Many thanks for considering my request.
This should be pretty simple to automate (a minimal sketch follows these steps):
Use a database adapter that can work with your database; for MSSQL, pyodbc will be fine.
Within the script, connect to the database, run the query, and parse the output.
Save the parsed output to a .csv file (you can use the csv module from the standard library).
Run the script as a periodic task using cron or schtasks, depending on whether you are on Linux or Windows.
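A minimal sketch of those steps, assuming pyodbc is installed; the connection string, query, and output path below are placeholders:

```python
# Run a query against SQL Server and dump the result set to a CSV file.
# Connection details, the query, and the output path are placeholders.
import csv
import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my-server;DATABASE=MyDb;Trusted_Connection=yes;"
)
QUERY = "SELECT OrderDate, ClientNumber, ClientID, OrderID FROM dbo.Orders"  # hypothetical
OUT_FILE = r"C:\exports\weekly_orders.csv"                                   # hypothetical

with pyodbc.connect(CONN_STR) as conn:
    cursor = conn.cursor()
    cursor.execute(QUERY)
    with open(OUT_FILE, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cursor.description])  # header row
        writer.writerows(cursor.fetchall())                      # data rows
```

Once the script works when run manually, point Task Scheduler (schtasks) or cron at it to run it weekly.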
Please note that your question is too broad, and shows no research effort.
You will find that Python can do the tasks you desire.
There are many different ways to interact with SQL servers, depending on your implementation. I suggest you learn Python + SQL using the built-in sqlite3 library first. You will want to save your query as a string and pass it into the SQL connection manager of your choice; which one depends on your server setup, and there are many different SQL packages for Python.
You can use pandas for parsing the data and saving it to a .csv file (the method is literally called to_csv).
Python does have many libraries for scheduling tasks, but I suggest you hold off for a while. Develop your code so that it can be run manually, which will still be much faster and easier than doing the work without Python. Once you know your code works, you can easily add a scheduler. The downside is that your program will always need to be running, and you will need to keep checking that it is. Personally, I would keep it restricted to manually running the script; you could compile it to an .exe and bind it to a hotkey if you need the accessibility.
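If you go the pandas route mentioned above, a sketch of the same task could look like this (connection string, query, and output path are again placeholders):

```python
# Same task via pandas: read the query result into a DataFrame and write
# it out with to_csv. Connection string, query, and path are placeholders.
import pandas as pd
import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my-server;DATABASE=MyDb;Trusted_Connection=yes;"
)

with pyodbc.connect(CONN_STR) as conn:
    # pandas may warn that it prefers a SQLAlchemy engine, but a plain
    # DBAPI connection works for a simple read like this.
    df = pd.read_sql("SELECT * FROM dbo.WeeklyReport", conn)  # hypothetical query

df.to_csv(r"C:\exports\weekly_report.csv", index=False)
```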

Data Integration Structure using Python and/or SSIS

I have a question on the general strategy of how to integrate data into an MSSQL database.
Currently, I use Python for my whole ETL process. I use it to clean, transform, and integrate the data into an MSSQL database. My data is small, so I think this process works fine for now.
However, I find it a little awkward that my code constantly reads data from and writes data to the database. I think this strategy will become an issue once I'm dealing with large amounts of data, and the constant read/write seems very inefficient; however, I don't know enough to tell whether this is a real problem or not.
I want to know whether this is a feasible approach or whether I should switch entirely to SSIS to handle it. SSIS feels clunky to me, and I'd prefer not to rewrite my entire codebase. Any input on the general ETL architecture would be very helpful.
Is this practice alright? Maybe?
There are too many factors to give a definitive answer. Conceptually, what you're doing - extract data from a source, transform it, load it to a destination (ETL) - is all that SSIS does. It can likely do things more efficiently than Python - at least I've had a devil of a time getting a bulk load to work with memory-mapped data. Dump to disk and bulk insert that via Python - no problem. But if the existing process works, then let it be until it doesn't work.
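As an illustration of that dump-to-disk-then-bulk-insert pattern, here is a rough sketch; the paths, table name, and connection string are placeholders, and BULK INSERT requires the file to be readable by the SQL Server service account:

```python
# Dump a DataFrame to disk, then load it with BULK INSERT. Paths, table
# name, and connection string are placeholders; the CSV must be visible
# to the SQL Server instance (e.g. on a share it can read).
import pandas as pd
import pyodbc

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})  # stand-in for real data

CSV_PATH = r"\\fileshare\staging\load_me.csv"   # hypothetical, reachable by the server
df.to_csv(CSV_PATH, index=False, header=False)

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my-server;DATABASE=MyDb;Trusted_Connection=yes;"
)
with pyodbc.connect(CONN_STR, autocommit=True) as conn:
    conn.execute(f"""
        BULK INSERT dbo.StagingTable
        FROM '{CSV_PATH}'
        WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n');
    """)
```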
If your team knows Python, introducing SSIS just to do ETL is likely going to be a bigger maintenance cost than scaling up your existing approach. On the other hand, if it's standard-ish Python plus libraries and you're on SQL Server 2017+, you might be able to execute your scripts from within the database itself via sp_execute_external_script.
If the ETL process runs on the same box as the database, then ensure you have sufficient resources to support both processes at their maximum observed levels of activity. If the ETL runs elsewhere, then you'll want to ensure you have fast, full duplex connectivity between the database server and the processing box.
Stand up a load testing environment that parallels production's resources. Dummy up a 10x increase in source data and observe how the ETL fares. 100x, 1000x. At some point, you'll identify what development sins you committed that do not scale and then you're poised to ask a really good, detailed question describing the current architecture, the specific code that does not perform well under load and how one can reproduce this load.
The above design considerations will hold true for Python, SSIS or any other ETL solution - prepackaged or bespoke.

Pandas HDFStore caching

I am working with a medium-size dataset that consists of around 150 HDF files, 0.5GB each. There is a scheduled process that updates those files using store.append from pd.HDFStore.
I am trying to achieve the following scenario:
For each HDF file:
1. Keep the process that updates the store running.
2. Open the store in read-only mode.
3. Run a while loop that continuously selects the latest available row from the store.
4. Close the store on script exit.
Now, this works fine, because we can have as many readers as we want, as long as all of them are in read-only mode. However, in step 3, because HDFStore caches the file, it does not return the rows that were appended after the connection was opened. Is there a way to select the newly added rows without re-opening the store?
After doing more research, I concluded that this is not possible with HDF files. The only reliable way of achieving the functionality above is to use a database (SQLite is closest - the read/write speed is lower than HDF but still faster than a fully-fledged database like Postgres or MySQL).
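For completeness, the workaround is to re-open the store on every poll; a minimal sketch, with a hypothetical file name, key, and polling interval:

```python
# Poll the latest row by re-opening the store on each iteration, since a
# long-lived read-only handle will not see rows appended after it was
# opened. File name, key, and sleep interval are placeholders.
import time
import pandas as pd

PATH, KEY = "data_001.h5", "ticks"   # hypothetical

while True:
    with pd.HDFStore(PATH, mode="r") as store:
        nrows = store.get_storer(KEY).nrows          # table-format stores only
        last_row = store.select(KEY, start=nrows - 1, stop=nrows)
    print(last_row)
    time.sleep(1.0)
```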

Lock and unlock database access - database is locked

I have three programs running, one of which iterates over a table in my database non-stop (over and over again in a loop), just reading from it, using a SELECT statement.
The other programs have a line where they insert a row into the table and a line where they delete it. The problem is that I often get the error sqlite3.OperationalError: database is locked.
I'm trying to find a solution, but I don't understand the exact source of the problem (is it reading and writing at the same time that causes this error? Or the writing and deleting? Maybe neither combination is supposed to work).
Either way, I'm looking for a solution. If it were a single program, I could coordinate the database I/O with mutexes and other multithreading tools, but it's not. How can I wait until the database is unlocked for reading/writing/deleting without using too much CPU?
You need to switch databases.
I would use the following:
PostgreSQL as the database
psycopg2 as the driver
The syntax is fairly similar to SQLite, and the migration shouldn't be too hard for you; a small sketch is below.
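A minimal psycopg2 sketch to show how close the syntax is (the connection parameters and query are placeholders):

```python
# Basic psycopg2 usage -- very close to the sqlite3 API.
# Connection parameters and the query are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="localhost", dbname="mydb", user="me", password="secret"  # hypothetical
)
with conn:                        # commits on success, rolls back on error
    with conn.cursor() as cur:
        # note: %s placeholders instead of sqlite3's ?
        cur.execute("SELECT * FROM my_table WHERE id = %s", (42,))
        rows = cur.fetchall()
conn.close()
```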

What is the best utility/library/strategy with Python to copy files across multiple computers?

I have data stored in folders across several computers. Many of the folders contain 40-100 GB of files ranging in size from 500 KB to 125 MB. There are some 4 TB of files that I need to archive, and I need to build a unified metadata system based on the metadata stored on each computer.
All systems run Linux, and we want to use Python. What is the best way to copy the files and archive them?
We already have programs to analyze the files and fill the metadata tables, and they are all written in Python. What we need to figure out is a way to copy the files without data loss and ensure that they have been copied successfully.
We have considered running rsync and unison via subprocess.Popen, but they are essentially sync utilities, whereas this is a copy-once job that just has to be done properly. Once the files are copied, the users will move to the new storage system.
My worries are: 1) there should be no corruption when the files are copied; 2) the copying must be efficient, though we have no strict speed expectations. The LAN is 10/100 with Gigabit ports.
Are there any scripts that could be incorporated, or any other suggestions? All computers will have SSH keys set up so we can make passwordless connections.
The directory structure will be maintained on the new server, and it is very similar to that of the old computers.
I would look at the Python Fabric library. This library streamlines the use of SSH, and if you are concerned about data integrity, I would use SHA-1 or some other hash algorithm to create a fingerprint for each file before transfer and then compare the fingerprint values generated at the source and the destination. All of this could be done using Fabric.
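As a small illustration of the fingerprinting idea, the following standard-library sketch hashes a file in chunks; the path is a placeholder, and the same function would be run on both ends (e.g. via Fabric) so the digests can be compared:

```python
# Compute a SHA-1 fingerprint of a file in chunks so large files do not
# have to fit in memory. Run on source and destination and compare.
import hashlib

def sha1_of_file(path, chunk_size=1 << 20):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha1_of_file("/data/folder/somefile.bin"))   # hypothetical path
```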
If a more seamless Python integration is the goal, you can look at:
Duplicity
pyrsync
I think rsync is the solution. If you are concerned about data integrity, look at the explanation of the "--checksum" parameter in the man page.
Other arguments that might come in handy are "--delete" and "--archive". Make sure the exit code of the command is checked properly.
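For example, a sketch of driving rsync from Python and checking the exit code (hosts and paths are placeholders):

```python
# Call rsync over SSH with --archive and --checksum and fail loudly if
# the exit code is non-zero. Hosts and paths are placeholders.
import subprocess

result = subprocess.run(
    [
        "rsync", "--archive", "--checksum", "--partial",
        "/data/folder/", "user@newserver:/archive/folder/",
    ],
    check=False,
)
if result.returncode != 0:
    raise RuntimeError(f"rsync failed with exit code {result.returncode}")
```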
