I downloaded a large (~70 GB) dataset consisting of 500 '.sec2' files. They should contain the calculated solution to a board game I'm currently researching. I'm expecting some kind of mapping between a current board state and the best action to take.
However, I'm struggling to open the dataset with Python - it appears to be a really obscure format. There seems to be a connection to HDF5, and I already tried to naively open it with the h5py reader, which of course didn't work. Every text editor also fails to open the file (a read error is displayed).
Do I need to combine the files before reading? Is there a Python package or other resources on how to access .sec2 files?
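In case the extension is just unusual and the container really is HDF5, a quick sanity check (the file name is a placeholder) is to look at the first bytes and ask h5py directly:
import h5py

path = 'solution_000.sec2'   # placeholder file name
with open(path, 'rb') as f:
    print(f.read(8))         # genuine HDF5 files start with b'\x89HDF\r\n\x1a\n'
print(h5py.is_hdf5(path))    # True only if the file carries the HDF5 signature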
I am working with a medium-size dataset that consists of around 150 HDF files, 0.5 GB each. There is a scheduled process that updates those files using store.append from pd.HDFStore.
I am trying to achieve the following scenario:
For each HDF file:
1. Keep the process that updates the store running.
2. Open the store in read-only mode.
3. Run a while loop that continuously selects the latest available row from the store.
4. Close the store on script exit.
Now, this works fine, because we can have as many readers as we want, as long as all of them are in read-only mode. However, in step 3, because HDFStore caches the file, it does not return the rows that were appended after the connection was opened. Is there a way to select the newly added rows without re-opening the store?
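For reference, a minimal sketch of steps 2-4 (the file name and table key are placeholders); as described above, it keeps returning the state of the table at the moment the store was opened:
import time
import pandas as pd

store = pd.HDFStore('data.h5', mode='r')   # placeholder file name
try:
    while True:
        nrows = store.get_storer('table').nrows                    # 'table' is a placeholder key
        latest = store.select('table', start=nrows - 1, stop=nrows)
        # ... process `latest` here ...
        time.sleep(1)
finally:
    store.close()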
After doing more research, I concluded that this is not possible with HDF files. The only reliable way of achieving the functionality above is to use a database; SQLite is the closest fit - its read/write speed is lower than HDF's, but still faster than a fully-fledged database like Postgres or MySQL.
I am currently developing some Python code to extract data from 14,000 PDFs (7 MB per PDF). They are dynamic XFAs made with Adobe LiveCycle Designer 11.0, so they contain streams that need to be decoded later (meaning there are some non-ASCII characters, if that makes any difference).
My problem is that calling open() on those files takes around 1 second each, if not more.
I tried the same operation on 13 MB text files created by copy-pasting a character, and they take less than 0.01 s to open. Where does this time increase come from when I open the dynamic PDFs with open()? Can I avoid this bottleneck?
I got those timings using cProfile like this:
from cProfile import Profile
profiler = Profile()
profiler.enable()
f = open('test.pdf', 'rb')
f.close()
profiler.disable()
profiler.print_stats('tottime')
The result of print_stats is the following for a given XFA PDF:
io.open() takes around 1 second to execute once
Additional information:
I have noticed that the opening time is around 10x faster when the same PDF file was opened within the last 15 or 30 minutes, even if I delete the __pycache__ directory inside my project. A solution that makes this speed-up apply regardless of the elapsed time could be worth it, though I only have 50 GB left on my PC.
Also, parallel processing of the PDFs is not an option, since I only have 1 free core to run my implementation...
To solve this problem you can do one of the following:
specify files/directories/extensions to exclude from real-time scanning in the Windows Defender settings
temporarily turn off real-time protection in Windows Defender
save the files in an encoded format in which Windows Defender can't detect links to other files/websites, and decode them on read (I have not tried this)
As "user2357112 supports monica" said in the comments, the culprit is the anti-virus software scanning the files before making them available to python.
I was able to verify this by calling open() on a list of files while having the task manager open. Python used almost 0% of the CPU while Service antivirus Microsoft Defender was maxing out one of my cores.
I compared the results to another run of my script where I opened the same file multiple times and python was maxing out the core while the antivirus stayed at 0%.
I tried to run a quick-scan of a single pdf file 2 times with Windows Defender. The first execution resulted in 800 files being scanned in 1 seconds (hence the 1 second delay of the open() execution) and the second scan resulted in one scanned file instantly.
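A minimal sketch of such a timing check (the directory is a placeholder); re-running it right away should show much lower numbers if Defender's scan cache is what is kicking in:
import glob
import time

for path in glob.glob(r'C:\data\pdfs\*.pdf'):   # placeholder directory
    start = time.perf_counter()
    with open(path, 'rb'):
        pass
    print(f'{path}: {time.perf_counter() - start:.3f} s')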
Explanation:
Windows Defender scans through all the file/internet links written in the file, which is why it takes so long to scan and why around 800 files show up in the first report. Windows Defender keeps a cache of files scanned since powering on the PC. Files not linked to the internet don't need to be rescanned, but XFAs contain links to websites. Since it is impossible to tell whether a website was maliciously modified, files that contain such links need to be rescanned periodically to make sure they are still safe.
Here is a link to the official Microsoft forum thread.
I have 638 Excel files in a directory, each about 3,000 KB in size. I want to concatenate all of them together, hopefully using only Python or the command line (no other programming software or languages).
Essentially, this is part of a larger process that involves some simple data manipulation, and I want it all to be doable by just running a single Python file (or double-clicking a batch file).
I've tried variations of the code below with pandas, openpyxl, and xlrd, and they all seem to have about the same speed. Converting to CSV seems to require VBA, which I do not want to get into.
import os
import pandas as pd

# filepath, X (the sheet name) and fields (the columns) are defined elsewhere
temp_list = []
for filename in os.listdir(filepath):
    temp = pd.read_excel(filepath + filename,
                         sheet_name=X, usecols=fields)
    temp_list.append(temp)
Are there simpler command-line solutions to convert these into CSV files or merge them into one Excel document? Or is this pretty much it - just using the basic libraries to read the individual files?
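For reference, once the individual frames have been read as in the snippet above, they can be combined and written out in one go (the output file name is a placeholder):
combined = pd.concat(temp_list, ignore_index=True)
combined.to_csv('combined.csv', index=False)   # placeholder output name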
.xls(x) is a very (over)complicated format with lots of features and quirks accumulated over the years, and it is thus rather hard to parse. It was never designed for speed or for large amounts of data, but rather for ease of use by business people.
So with your number of files, your best bet is to convert those to .csv or another easy-to-parse format (or use such a format for data exchange in the first place) - and preferably do this before you get to process them, e.g. upon a file's arrival.
E.g. this is how you can save the first sheet of a .xls(x) file to .csv with pywin32, using Excel's COM interface:
import win32com.client
# Need the typelib metadata to have Excel-specific constants
x = win32com.client.gencache.EnsureDispatch("Excel.Application")
# Need to pass full paths, see https://stackoverflow.com/questions/16394842/excel-can-only-open-file-if-using-absolute-path-why
w = x.Workbooks.Open("<full path to file>")
s = w.Worksheets(1)
s.SaveAs("<full path to file without extension>",win32com.client.constants.xlCSV)
w.Close(False)
Running this in parallel would normally have no effect, because the same server process would be reused. You can force creating a different process for each batch as per "How can I force python (using win32com) to create a new instance of excel?".
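For completeness, here is a sketch of wrapping the conversion above in a loop over a whole directory (the source path is a placeholder; Excel still wants absolute paths):
import pathlib
import win32com.client

excel = win32com.client.gencache.EnsureDispatch("Excel.Application")
try:
    for xlsx in pathlib.Path(r"C:\data\excel").glob("*.xlsx"):   # placeholder directory
        wb = excel.Workbooks.Open(str(xlsx))
        wb.Worksheets(1).SaveAs(str(xlsx.with_suffix("")),
                                win32com.client.constants.xlCSV)
        wb.Close(False)
finally:
    excel.Quit()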
tl;dr: is there any reason one file would fail to transfer completely over FTP, while every other file uploaded in the exact same way works fine?
Every week, I use Python's ftplib to update my website. Usually this consists of transferring around 30-31 files total: 4 that overwrite existing files, and the rest completely new. For basically all of these files, my code looks like:
myFile = open('[fileURL]', 'rb')
ftp.storbinary(cmd='STOR [fileURLonServer]', fp=myFile)
myFile.close()
This works completely fine for almost all of my files, except for one: the top-level index.html file. This file is usually around 7.8-8.1 KB in size, depending on its contents from week to week. It seems that only the first 4096 bytes of the file are transferred to the server, and I have to manually go in and upload the full version of the index every week. Why is it doing this, and how can I get it to stop? Is there some metadata in the file that could be causing the problem?
StackOverflow recommended this question to me, but it doesn't solve my problem: I'm already using the rb mode to open every file I'm trying to transfer, and every file except this one works perfectly fine.
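One way to narrow this down, assuming `ftp` is the already-connected ftplib.FTP instance from the snippet above, is to compare the local size with what the server reports right after the upload (the paths are placeholders):
import os

local, remote = 'index.html', 'index.html'   # placeholder paths
with open(local, 'rb') as fh:
    ftp.storbinary(f'STOR {remote}', fh)
# SIZE needs binary transfer mode, which storbinary has already selected
print(os.path.getsize(local), ftp.size(remote))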
In short:
I need to split a single file (or more) into multiple max-sized archives, using a dummy-safe format (e.g. zip or rar; anything that works will do!).
I would love to know when a certain part is done (a callback?) so I could start shipping it away.
I would rather not do it with the rar or zip command-line utilities unless it's impossible otherwise.
I'm trying to make it OS-independent for the future, but right now I can live with the compression being done only on Linux (my main PC); I still need the archives to be easily opened on Windows (my wife's PC).
In long:
I'm writing a hopefully-to-be-awesome backup utility that scans my pictures folder, zips each folder, and uploads it to whatever uploading class is registered (be it mail-sending, FTP-uploading, or HTTP-uploading).
I used zipfile to create a gigantic archive for each folder, but since my upload speed is really bad I let it run only at night, and my internet goes off occasionally, so the whole thing messes up. So I decided to split the data into ~10 MB pieces. I found no way of doing that with zipfile, so I just added files to the zip until it reached > 10 MB.
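Here is roughly the shape of that workaround as a hedged sketch (the function and file names are made up; the size check uses the compressed sizes recorded so far):
import os
import zipfile

def split_into_zips(paths, out_prefix, max_size=10 * 1024 * 1024, on_part_done=None):
    # Start a new archive whenever the current one passes max_size; a single
    # file larger than max_size still ends up in one oversized part.
    part, current = 0, None
    for path in paths:
        if current is None:
            current = zipfile.ZipFile(f'{out_prefix}.{part:03d}.zip', 'w',
                                      zipfile.ZIP_DEFLATED)
        current.write(path, arcname=os.path.basename(path))
        written = sum(info.compress_size for info in current.infolist())
        if written >= max_size:
            finished = current.filename
            current.close()
            if on_part_done:            # the "callback": start uploading this part
                on_part_done(finished)
            part, current = part + 1, None
    if current is not None:
        current.close()
        if on_part_done:
            on_part_done(current.filename)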
The problem is that there are often 200-300 MB (and sometimes larger) videos in there, and again we hit the middle-of-the-night cutoffs.
I am using subprocess with rar right now to create the split archives, but since the directories are so big and I'm using high compression, this takes ages even when the first volumes are already finished - this is why I'd love to know when a part is ready to be sent.
So, short story long: I need a good way to split this into max-sized archives.
I am looking at making it somewhat generic and as dummy-proof as possible, as eventually I'm planning on turning it into some awesome extensible backup library...