I have 638 Excel files in a directory, each about 3,000 KB in size. I want to concatenate all of them together, hopefully using only Python or the command line (no other programming software or languages).
Essentially, this is part of a larger process that involves some simple data manipulation, and I want it all to be doable by running a single Python file (or double-clicking a batch file).
I've tried variations of the code below with pandas, openpyxl, and xlrd, and they all seem to have about the same speed. Converting to CSV seems to require VBA, which I do not want to get into.
import os
import pandas as pd

temp_list = []
for filename in os.listdir(filepath):
    temp = pd.read_excel(filepath + filename,
                         sheet_name=X, usecols=fields)
    temp_list.append(temp)
Are there simpler command-line solutions to convert these into CSV files or merge them into one Excel document? Or is this pretty much it, just using the basic libraries to read individual files?
.xls(x) is a very (over)complicated format, with lots of features and quirks accumulated over the years, and it is thus rather hard to parse. It was also never designed for speed or for large amounts of data, but rather for ease of use by business people.
So with your number of files, your best bet is to convert them to .csv or another easy-to-parse format (or use such a format for data exchange in the first place), and preferably do this before you get to process them, e.g. upon a file's arrival.
For example, this is how you can save the first sheet of an .xls(x) file to .csv with pywin32, using Excel's COM interface:
import win32com.client
# Need the typelib metadata to have Excel-specific constants
x = win32com.client.gencache.EnsureDispatch("Excel.Application")
# Need to pass full paths, see https://stackoverflow.com/questions/16394842/excel-can-only-open-file-if-using-absolute-path-why
w = x.Workbooks.Open("<full path to file>")
s = w.Worksheets(1)
s.SaveAs("<full path to file without extension>", win32com.client.constants.xlCSV)
w.Close(False)
Running this in parallel would normally have no effect because the same Excel server process would be reused. You can force a different process for each batch as per How can I force python (using win32com) to create a new instance of excel?.
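A rough sketch of forcing a separate Excel process, assuming win32com.client.DispatchEx and placeholder paths:
import win32com.client

# DispatchEx starts a separate EXCEL.EXE process per call, so parallel
# workers do not all share the same COM server.
xlCSV = 6  # numeric value of the xlCSV file-format constant
excel = win32com.client.DispatchEx("Excel.Application")
excel.DisplayAlerts = False
wb = excel.Workbooks.Open(r"C:\data\book1.xlsx")    # placeholder full path
wb.Worksheets(1).SaveAs(r"C:\data\book1", xlCSV)    # placeholder output path
wb.Close(False)
excel.Quit()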
Related
I downloaded a large (~70 GB) dataset consisting of 500 '.sec2' files. They should contain the calculated solution to a board game I'm currently researching. I'm expecting some kind of mapping between a current board state and the best action to take.
However, I struggle to open the dataset with Python - it appears to be a really obscure format. There seems to be a connection to HDF5, and I already tried to naively open it with the h5py reader, which of course didn't work. Every text editor also fails to open the file (a read error is displayed).
Do I need to combine the files before reading? Is there a Python package or other resources on how to access sec2 files?
I have a situation where I have multiple sources that will need to read from the same (small in size) data source, possibly at the same time. For example, multiple different computers calling a function that needs to read from an external data source (e.g. an Excel file). Since multiple different sources are involved, I cannot simply read from the data source once and pass it into the function; it must be loaded inside the function.
Is there a data source that can handle this effectively? A pandas DataFrame was an acceptable format for the information that needs to be read, so I tried storing that DataFrame in an SQLite database, since according to the SQLite website, SQLite databases can handle concurrent reads. Unfortunately, it is failing too often. I tried multiple different iterations and simply could not get it to work.
Is there another data format/source that would work/be effective? I tried scouring the internet for whether something as simple as an Excel file plus the pandas read_excel function could handle this type of concurrency, but I could not find any information. I tried an experiment of using a multiprocessing pool to simultaneously load the same very large (roughly 1-minute load) Excel file, and it did not crash. But of course, that is not exactly a perfect experiment.
Thanks!
You can try using openpyxl's read-only mode. It uses a generator instead of loading the whole file into memory.
Also take a look at processing large xlsx file in python
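A minimal sketch of read-only mode, with a placeholder filename and a hypothetical per-row handler:
from openpyxl import load_workbook

# Read-only mode streams rows lazily instead of building the whole
# workbook in memory.
wb = load_workbook("large_file.xlsx", read_only=True)   # placeholder filename
ws = wb.active
for row in ws.iter_rows(values_only=True):
    process(row)   # hypothetical function that handles one row at a time
wb.close()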
I am very new to Python.
In our company we use Base SAS for data analysis (ETL, EDA, basic model building). We want to check whether replacing it with Python is possible for big chunks of data. With respect to that, I have the following questions:
How does Python handle large files? My PC has 8 GB of RAM and I have a 30 GB flat file (say a CSV file). I would generally do operations like left joins, deletes, group-bys, etc. on such a file. This is easily doable in SAS, i.e. I don't have to worry about low RAM. Are the same operations doable in Python? I would appreciate it if somebody could provide a list of libraries and code for the same.
How can I perform SAS operations like PROC SQL in Python to create a dataset on my local PC while fetching the data from a server?
I.e., in SAS I would download 10 million rows (7.5 GB of data) from SQL Server by performing the following operation:
libname aa odbc dsn=sql user=pppp pwd=XXXX;
libname bb '<<local PC path>>';

proc sql outobs=10000000;
    create table bb.foo as
    select * from aa.bar;
quit;
What is the method to perform the same in Python? Again, just to remind you: my PC has only 8 GB of RAM.
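A rough sketch of what that same fetch might look like in Python, assuming pandas plus pyodbc and the same placeholder DSN, credentials, and table names; chunked reads keep only one chunk in RAM at a time:
import pandas as pd
import pyodbc

# Pull 10 million rows from SQL Server and append them chunk by chunk
# to a local CSV, keeping memory use bounded.
conn = pyodbc.connect("DSN=sql;UID=pppp;PWD=XXXX")   # placeholder DSN/credentials
chunks = pd.read_sql("SELECT TOP 10000000 * FROM bar", conn, chunksize=100000)
for i, chunk in enumerate(chunks):
    chunk.to_csv("foo.csv", mode="a", header=(i == 0), index=False)
conn.close()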
Python, and especially Python 3.x, provides a lot of tools for handling large files. One of them is using iterators.
Python returns the result of opening a file (text, CSV, etc.) as an iterator, so you don't have the problem of loading the whole file into memory; with this trick you can read your file line by line and handle the lines based on your needs.
For example, if you want to chunk your file into blocks, you can use a deque object to hold the lines that belong to one block (based on your condition).
Alongside collections.deque, you can use some itertools functions to handle your lines and apply your conditions to them. For example, if you want access to the next line in each iteration you can use itertools.zip_longest, and for creating multiple independent iterators from your file object you can use itertools.tee.
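A minimal sketch putting these pieces together, with a placeholder filename and a toy filter condition:
from collections import deque
from itertools import tee, zip_longest

# Stream a large file line by line; never load it fully into memory.
with open("huge.log") as f:              # placeholder filename
    current, ahead = tee(f)              # two independent iterators over the same file
    next(ahead, None)                    # shift one iterator forward by one line
    block = deque(maxlen=1000)           # keep at most the last 1000 matching lines
    for line, nxt in zip_longest(current, ahead, fillvalue=""):
        if "ERROR" in line:              # toy condition; nxt is the following line
            block.append(line.rstrip())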
Recently I wrote some code for filtering huge log files (30 GB and larger) which performs very well. I have put the code on GitHub; you can check it out and use it:
https://github.com/Kasramvd/log-filter
I currently have the following csv writer class:
import csv

class csvwriter():
    writer = None
    writehandler = None

    @classmethod
    def open(cls, file):
        cls.writehandler = open(file, 'wb')
        cls.writer = csv.writer(cls.writehandler, delimiter=',',
                                quotechar='"', quoting=csv.QUOTE_NONNUMERIC)

    @classmethod
    def write(cls, arr):
        cls.writer.writerow(arr)

    @classmethod
    def close(cls):
        cls.writehandler.close()
which can generate proper CSV files without ever having to store the full array in memory at any single time.
However, the files created with this code can be quite large, so I'm looking to compress them rather than write them uncompressed (in order to save on disk usage). I can't effectively store the whole file in memory either, as I'm expecting files of well over 20 GB to be a regular occurrence.
The recipients of the resulting files are generally not sysadmins of their PCs, nor do they all use Linux, so I'm constrained in the types of algorithms I'm able to use for this task. Preferably, the solution would use a compression scheme that's natively readable (no executables required) on Windows, OSX and any Linux distribution.
I've found that gzip provides a very handy interface in Python, but reading gzipped files on Windows seems like quite a hassle. Ideally I'd put them in a zip archive, but zip archives don't allow you to append data to files already present in the archive, which then forces me to either store the whole file in memory or write the data out to several smaller files that I would be able to fit in memory.
My question: Is there a solution that would benefit from the best of both worlds? Widespread availability of tools to read the target format on the end-user's machine, and also the ability to append, rather than write the whole file in one go?
Thanks in advance for your consideration!
gzlog may provide the functionality you're looking for. It efficiently appends short strings to a gzip file, intended for applications where short messages are appended to a long log.
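If staying in pure Python is acceptable, the standard gzip module can also open a file in append mode; each append writes another gzip member, and ordinary gzip readers decompress the members as one continuous stream. A rough Python 3 sketch with a placeholder filename:
import csv
import gzip

# Append CSV rows to a gzip file; each 'at' open adds a new gzip member,
# which standard decompressors read back as a single stream.
with gzip.open("output.csv.gz", "at", newline="") as gz:   # placeholder filename
    writer = csv.writer(gz, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerow(["id", "value"])
    writer.writerow([1, 3.14])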
I just found out that I can save space / speed up reads of CSV files.
Using the answer to my previous question,
How do I create a CSV file from database in Python?
and 'wb' for opens:
w = csv.writer(open(Fn, 'wb'), dialect='excel')
How can I open all the files in a directory and save each one back under the same name it started with, using 'wb' to rewrite all the files? I guess that would convert all the CSVs to binary CSVs.
You can't "overwrite a file on the fly". You have two options:
if the files are small enough (smaller than the amount of available RAM by a comfortable margin), just loop over them (os.listdir makes that loop easy, or os.walk if you want to catch the whole tree of subdirectories, not just one directory), and for each one, read it into memory first, then overwrite the on-disk copy.
otherwise, loop over them, and each time write to a new file (e.g. by appending .new to the name), then move the new file over the old one. This is safer (no risk of running out of memory, no risk of damaging a file if the computer crashes) but more complicated.
So, what is your situation: small enough files (plus backups to safeguard against computer and disk crashes), in which case I can, if you wish, show the simple code; or huge multi-GB files, in which case it will have to be the more complex code? Let us know!
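For the huge-file case, a rough Python 3 sketch of the second, safer approach, with a placeholder directory and the rewrite step left as a plain CSV round-trip:
import csv
import os

# Rewrite each CSV in a directory by writing to a temporary ".new" file,
# then replacing the original, so a crash never leaves a half-written file.
directory = "csvs"                       # placeholder path
for name in os.listdir(directory):
    if not name.endswith(".csv"):
        continue
    src = os.path.join(directory, name)
    dst = src + ".new"
    with open(src, "r", newline="") as fin, open(dst, "w", newline="") as fout:
        writer = csv.writer(fout, dialect="excel")
        for row in csv.reader(fin):
            writer.writerow(row)         # transform rows here as needed
    os.replace(dst, src)                 # move the new file over the old one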