I am working on backing up 5,000+ files of SigmaPlot data. Right now, I have to do it all manually by exporting to .csv and .txt files, and each tab of data has to be individually exported. Has anybody had any luck finding a way to process these files? I'd really like to use Python to write the backup files, but I'll take any help you have.
I have a situation where multiple sources will need to read from the same (small) data source, possibly at the same time. For example, multiple different computers call a function that needs to read from an external data source (e.g. an Excel file). Since multiple different sources are involved, I cannot simply read from the data source once and pass it into the function; it must be loaded inside the function.
Is there a data source that can handle this effectively? A pandas DataFrame was an acceptable format for the information that needs to be read, so I tried storing that DataFrame in an SQLite database, since according to the SQLite website, SQLite databases can handle concurrent reads. Unfortunately, it fails too often. I tried multiple different iterations and simply could not get it to work.
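Roughly, the setup I tried looks like the sketch below; the file name and table name are made up for illustration. The DataFrame is written once, and each machine/process calls the reader function with its own connection.
import sqlite3
import pandas as pd

# Writer side, run once ("shared.db" and the "lookup" table are placeholder names)
df = pd.DataFrame({"key": [1, 2], "value": ["a", "b"]})
conn = sqlite3.connect("shared.db")
df.to_sql("lookup", conn, if_exists="replace", index=False)
conn.close()

# Reader side, called independently from each machine/process
def load_lookup():
    conn = sqlite3.connect("shared.db", timeout=30)  # timeout waits out brief locks
    try:
        return pd.read_sql("SELECT * FROM lookup", conn)
    finally:
        conn.close()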
Is there another data format/source that would be effective? I scoured the internet for whether something as simple as an Excel file plus the pandas read_excel function could handle this type of concurrency, but I could not find any information. As an experiment, I used a multiprocessing pool to simultaneously load the same very large (i.e. roughly one minute to load) Excel file, and it did not crash. But of course, that is not exactly a conclusive experiment.
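The experiment looked roughly like this (the file name and pool size are arbitrary placeholders):
import multiprocessing as mp
import pandas as pd

def load(_):
    # every worker reads the same very large workbook at the same time
    return pd.read_excel("big_file.xlsx")  # placeholder path

if __name__ == "__main__":
    with mp.Pool(4) as pool:
        frames = pool.map(load, range(4))  # 4 simultaneous reads of the same file
        print(len(frames))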
Thanks!
You can try using openpyxl's read-only mode. It uses a generator instead of loading the whole file into memory.
Also take a look at processing large xlsx file in python
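A minimal sketch of read-only mode (the file name is just a placeholder); rows come back lazily from a generator instead of the whole workbook being parsed up front:
from openpyxl import load_workbook

wb = load_workbook("data.xlsx", read_only=True)  # placeholder path
ws = wb.active
for row in ws.iter_rows(values_only=True):
    # each row is a plain tuple of cell values, yielded lazily
    print(row)
wb.close()  # read-only workbooks keep the file handle open until closed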
I have 638 Excel files in a directory, each about 3,000 KB in size. I want to concatenate all of them together, ideally using only Python or the command line (no other programming software or languages).
Essentially, this is part of a larger process that involves some simple data manipulation, and I want it all to be doable by just running a single Python file (or double-clicking a batch file).
I've tried variations of the code below with pandas, openpyxl, and xlrd, and they all seem to run at about the same speed. Converting to CSV seems to require VBA, which I do not want to get into.
import os
import pandas as pd

temp_list = []
for filename in os.listdir(filepath):
    # X and fields are defined earlier: the sheet name to read and the columns to keep
    temp = pd.read_excel(os.path.join(filepath, filename),
                         sheet_name=X, usecols=fields)
    temp_list.append(temp)
combined = pd.concat(temp_list, ignore_index=True)
Are there simpler command-line solutions to convert these into CSV files or merge them into one Excel document? Or is this pretty much it, just using the basic libraries to read individual files?
.xls(x) is a very (over)complicated format with lots of features and quirks accumulated over the years, and it is thus rather hard to parse. It was never designed for speed or for large amounts of data, but rather for ease of use for business people.
So with your number of files, your best bet is to convert them to .csv or another easy-to-parse format (or use such a format for data exchange in the first place), and preferably do this before you need to process them, e.g. upon each file's arrival.
E.g. this is how you can save the first sheet of a .xls(x) to .csv with pywin32 using Excel's COM interface:
import win32com.client
# Need the typelib metadata to have Excel-specific constants
x = win32com.client.gencache.EnsureDispatch("Excel.Application")
# Need to pass full paths, see https://stackoverflow.com/questions/16394842/excel-can-only-open-file-if-using-absolute-path-why
w = x.Workbooks.Open("<full path to file>")
s = w.Worksheets(1)
s.SaveAs("<full path to file without extension>", win32com.client.constants.xlCSV)
w.Close(False)
Running this in parallel would normally have no effect because the same Excel server process would be reused. You can force a different process to be created for each batch, as per "How can I force python (using win32com) to create a new instance of excel?".
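A rough sketch of what a per-process worker could look like (the batch list and helper name are my own illustration, not from the linked post): DispatchEx starts a fresh Excel instance instead of attaching to an already running one, and the literal 6 is Excel's xlCSV format code, useful because win32com.client.constants may be empty when the generated typelib cache isn't available.
import win32com.client

def convert_batch(paths):  # paths: absolute paths of the workbooks assigned to this worker (assumed)
    excel = win32com.client.DispatchEx("Excel.Application")  # separate Excel process per worker
    excel.DisplayAlerts = False
    try:
        for path in paths:
            wb = excel.Workbooks.Open(path)
            wb.Worksheets(1).SaveAs(path.rsplit(".", 1)[0], 6)  # 6 == xlCSV
            wb.Close(False)
    finally:
        excel.Quit()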
I have a pretty complex Excel file that includes pivot tables and is about 70 MB in size, and what I need is to edit one single cell with a Python script. I'm trying openpyxl.
The problem is that it runs out of memory just opening the file. Do you see any way around this?
You can try pandas.read_excel. It may be better optimized for your purpose (reading one cell from one sheet).
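For example, something along these lines, where the file name, sheet name, and cell position are all assumptions on my part; it restricts the read to one column and one row:
import pandas as pd

# Pull the value of cell C10 of the "Data" sheet (names and position are placeholders)
value = pd.read_excel(
    "big_workbook.xlsx",
    sheet_name="Data",
    usecols="C",
    skiprows=9,   # skip the 9 rows above the target row
    nrows=1,      # read a single row
    header=None,
).iat[0, 0]
print(value)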
I will try to describe my requirement as best as I can, but please feel free to ask me if anything is still unclear.
The environment
I have 5 nodes (there will be more in the future). Each of them generates a big CSV file (about 1 to 2 GB) every 5 minutes. I need to use Apache Spark Streaming to process these CSV files within five minutes, so these 5 files are my input DStream source.
What I plan to do
I plan to use textFileStream like below:
ssc.textFileStream(dataDirectory)
Every 5 minutes I will put those CSV files in a directory on HDFS, then use the above function to generate the input DStream.
The problem with this approach
textFileStream seems to need one complete file instead of 5 files, and I do not know how to merge files in HDFS.
Question
Can you tell me how to merge files in HDFS with Python?
Do you have any better suggestion than my approach? Please also advise me.
You can always read the files in a directory using a wildcard character.
That should not be a problem; it means that at any given time your DStream's RDD is the merged result of all the files present at that time.
As far as the approach goes, yours is simple and it works.
NB: The only thing you should be careful about is the atomicity of the CSV files themselves. Your files should land in the folder (which you are watching for incoming files) via a move (mv), not a copy, so that Spark never sees a partially written file.
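A minimal PySpark sketch of that setup; the batch interval, app name, and HDFS path are assumptions, and the final action is just a placeholder for your real processing:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="csv-stream")              # app name is arbitrary
ssc = StreamingContext(sc, 300)                      # 5-minute batch interval (assumed)

lines = ssc.textFileStream("hdfs:///data/incoming")  # placeholder directory the CSVs are moved into
records = lines.map(lambda line: line.split(","))    # all files arriving in a batch end up in one DStream
records.count().pprint()                             # placeholder action

ssc.start()
ssc.awaitTermination()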
Thanks
Manas
I have a large text file stored in a shared directory on a server, to which several other machines have access. I'm running various analyses on this text file without changing or updating it. I'd like to know whether I can run different Python scripts on different machines, all of which read that large text file. None of the scripts makes any change to the file; they just need to read it.
You should be able to do multiple read access, but it might get really, really, really slow compared to several scripts reading the same file on the same computer (obviously, how much slower will very much depend on how much reading you are doing). You may want to copy the file over before processing.
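If you do copy it over first, a rough sketch (the share path is made up) could be:
import shutil
import tempfile

SHARED_PATH = "/mnt/shared/big_corpus.txt"  # placeholder path to the file on the share

# Copy once to local temporary storage, then read from the local copy
local_copy = tempfile.NamedTemporaryFile(delete=False, suffix=".txt").name
shutil.copyfile(SHARED_PATH, local_copy)

with open(local_copy, "r", encoding="utf-8") as f:
    for line in f:
        pass  # the read-only analysis goes here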