Pandas HDFStore caching

Pandas HDFStore caching - python

I am working with a medium-size dataset that consists of around 150 HDF files, 0.5GB each. There is a scheduled process that updates those files using store.append from pd.HDFStore.
I am trying to achieve the following scenario:
For HDF file:
Keep the process that updates the store running
Open a store in a read-only mode
Run a while loop that will be continuously selecting the latest available row from the store.
Close the store on script exit
Now, this works fine, because we can have as many readers as we want, as long as all of them are in read-only mode. However, in step 3, because HDFStore caches the file, it is not returning the rows that were appended after the connection was open. Is there a way to select the newly added rows without re-opening the store?

After doing more research, I concluded that this is not possible with HDF files. The only reliable way of achieving the functionality above is to use a database (SQLite is closest - the read/write speed is lower than HDF but still faster than a fully-fledged database like Postgres or MySQL).

Related

Importing huge .sql script file (30GB) with only inserts

I need to import some SQL scripts generated in SSMS (generate scripts). These scripts only contain INSERTS.
So far I managed to import almost everything using DbUp (https://dbup.readthedocs.io/en/latest/). My problem is with larger files, in this case I have 2, one 2GB and one 30GB.
The 2GB I imported using BigSqlRunner (https://github.com/kevinly1989/BigSqlRunner)
The 30GB one I've tried everything (PowerShell, split, etc..) and I'm not succeeding, it always gives a memory error and I can't find anything to help me split the file into multiples small files...
I'm asking for your help if you know of a better way or a solution for this.
The goal is to migrate data from one database (PRODUCTION) to the another empty database (PRODUCTION but not used) and I am doing it through generate scripts (SSMS) and then execute the scripts on the target database (for security due to it being production and I don't want to be reading it line by line while importing the data to the target database).
I am open to other solutions that may exist such as SSIS (SQL Server Integration Services), Python, PowerShell, C#, etc... but I have to be careful not to impact the production database when reading the data from the tables.

To update I managed to solve it through SSIS, I made a source and a destination connection and used the source query with WITH(NOLOCK), it's running and already half way through. It's not the fastest but it's working. Thank you all for your help.

Multiple External Processes Reading From the Same Data Source

I have a situation where I have multiple sources that will need to read from the same (small in size) data source, possibly at the same time. For example, multiple different computers calling a function that needs to read from an external data source (e.g. excel file). Since it multiple different sources are involved, I cannot simply read from the data source once and pass it into the function---it must be loaded in the function.
Is there a data source that can handle this effectively? A pandas dataframe was an acceptable format for information that need to be read so I tried storing that dataframe in an sqlite3 databases since according to the sqlite3 website, sqlite3 databases can handle concurrent reads. Unfortunately, it is failing too often. I tried multiple different iterations and simply could not get it to work.
Is there another data format/source that would work/be effective? I tried scouring the internet for whether or not something as simple as an excel file + the pandas read_excel function could handle this type of concurrency but I could not find information. I tried an experiment of using a multiprocessing pool to simultaneously load the same very large (i.e. 1 minute load) excel file and it did not crash. But of course, that is not exactly a perfect experiment.
Thanks!

You can try using openpyxl's read-only mode. It uses generator instead of loading whole file.
Also take a look at processing large xlsx file in python

Large data from SQL server to local hard disc to Tableau and Pandas

I am trying to export a large dataset from SQL Server to my local hard disc for some data analysis. The file size goes up to 30gb, with 6 million over rows and about 10 columns.
This data will then be fed through python Pandas or Tableau for consumption. I am thinking the size of the file itself will give me poor performances during my analysis.
Any best practices to be shared for analyzing big-ish data on a local machine?
I am running an i7 4570 with 8gb ram. I am hoping to be less reliant on SQL queries and be able to run huge analysis offline.
Due to the nature of the database, a fresh extract needs to happen and this process will have to repeat itself, meaning there will not be much of appending happening.
I have explored HDFStores and also Tableau Data Extracts, but still curious whether I can get better performances by reading whole CSV files.
Is there a compression method of sorts that I might be missing out? Again the objective here is to run the analysis without constant querying to the server, the source itself (which I am optimizing) will refresh itself every morning so when I get in office I can just focus on getting coffee and some blazing fast analytics done.

With Tableau you would want to take an extract of the CSV (it will be much quicker to query than a CSV). That should be fine since the extract sits on disk. However, as mentioned, you need to create a new extract once your data changes.
With Pandas I usually load everything into memory, but if it doesn't fit then you can read the CSV in chunks using chunksize (see this thread: How to read a 6 GB csv file with pandas)

How to modify a large file remotely

I have a large XML file, ~30 MB.
Every now and then I need to update some of the values. I am using element tree module to modify the XML. I am currently fetching the entire file, updating it and then placing it again. SO there is ~60 MB of data transfer every time. Is there a way I update the file remotely?
I am using the following code to update the file.
import xml.etree.ElementTree as ET
tree = ET.parse("feed.xml")
root = tree.getroot()
skus = ["RUSSE20924","PSJAI22443"]
qtys = [2,3]
for child in root:
sku = child.find("Product_Code").text.encode("utf-8")
if sku in skus:
print "found"
i = skus.index(sku)
child.find("Quantity").text = str(qtys[i])
child.set('updated', 'yes')
tree.write("feed.xml")

Modifying a file directly via FTP without uploading the entire thing is not possible except when appending to a file.
The reason is that there are only three commands in FTP that actually modify a file (Source):
APPE: Appends to a file
STOR: Uploads a file
STOU: Creates a new file on the server with a unique name
What you could do
Track changes
Cache the remote file locally and track changes to the file using the MDTM command.
Pros:
Will half the required data transfer in many cases.
Hardly requires any change to existing code.
Almost zero overhead.
Cons:
Other clients will have to download the entire thing every time something changes(no change from current situation)
Split up into several files
Split up your XML into several files. (One per product code?)
This way you only have to download the data that you actually need.
Pros:
Less data to transfer
Allows all scripts that access the data to only download what they need
Combinable with suggestion #1
Cons:
All existing code has to be adapted
Additional overhead when downloading or updating all the data
Switch to a delta-sync protocol
If the storage server supports it switching to a delta synchronization protocol like rsync would help a lot because these only transmit the changes (with little overhead).
Pros:
Less data transfer
Requires little change to existing code
Cons:
Might not be available
Do it remotely
You already pointed out that you can't but it still would be the best solution.
What won't help
Switch to a network filesystem
As somebody in the comments already pointed out switching to a network file system (like NFS or CIFS/SMB) would not really help because you cannot actually change parts of the file unless the new data has the exact same length.
What to do
Unless you can do delta synchronization I'd suggest to implement some caching on the client side first and if it doesn't help enough to then split up your files.

Quickly dumping a database in memory to file

I want to take advantage of the speed benefits of holding an SQLite database (via SQLAlchemy) in memory while I go through a one-time process of inserting content, and then dump it to file, stored to be used later.
Consider a bog-standard database created in the usual way:
# in-memory database
e = create_engine('sqlite://')
Is there a quicker way of moving its contents to disc, other than just creating a brand new database and inserting each entry manually?
EDIT:
There is some doubt as to whether or not I'd even see any benefits to using an in-memory database. Unfortunately I already see a huge time difference of about 120x.
This confusion is probably due to me missing out some important detail in the question. Also probably due to a lack of understanding on my part re: caches / page sizes / etc. Allow me to elaborate:
I am running simulations of a system I have set up, with each simulation going through the following stages:
Make some queries to the database.
Make calculations / run a simulation based on the results of those queries.
insert new entries into the database based on the most recent simulation.
Make sure the database is up to date with the new entries by running commit().
While I only ever make a dozen or so insertions on each simulation run, I do however run millions of simulations, and the results of each simulation need to be available for future simulations to take place. As I say, this read and write process takes considerably longer when running a file-backed database; it's the difference between 6 hours and a month.
Hopefully this clarifies things. I can cobble together a simple python script to outline my process further a little further if necessary.

SQLAlchemy and SQLite know how to cache and do batch-inserts just fine.
There is no benefit in using an in-memory SQLite database here, because that database uses pages just like the on-disk version would, and the only difference is that eventually those pages get written to disk for disk-based database. The difference in performance is only 1.5 times, see SQLite Performance Benchmark -- why is :memory: so slow...only 1.5X as fast as disk?
There is also no way to move the in-memory database to a disk-based database at a later time, short of running queries on the in-memory database and executing batch inserts into the disk-based database on two separate connections.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.