I have a large number of Excel files that I need to download from the web, extract only the header (column names) from, and then move on. So far I have only managed to download the whole file and read it into a pandas DataFrame, from which I can extract the column names.
Is there a faster way: can I read the header without downloading the whole file, or parse only the header rather than the entire Excel file?
import requests
import pandas as pd

resp = requests.get(test_url)
with open('test.xls', 'wb') as output:
    output.write(resp.content)

# Parse the third sheet (sheets are 0-indexed) and read off its columns.
headers = pd.ExcelFile('test.xls').parse(sheet_name=2)
headers.columns
If there is not an efficient way to "partially" download the Excel file to get only the header, is there an efficient way to read only the header after it has already been downloaded?
I would say no, because .xls Excel files are binary files, so pandas' ExcelFile parser needs the complete file. If you give it a partial file, it will report a corrupt file (with some reason or other).
If you really want to do this, you will have to thoroughly analyze (in binary form) some of the Excel files you want to process and try to identify the minimum number of bytes needed to find the names in the first row. Then you would have to download them by implementing the HTTP protocol at a low enough level to be able to close the connection, or at least stop reading, as soon as you have enough bytes. Finally, you would have to write a dedicated parser and hope that nothing changes in those files, because you would no longer be using high-level, maintained tools, only binary reads.
TL;DR: unless you have a very strong reason to do this, just forget it; it will be hard, error prone and hardly maintainable, if possible at all.
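That said, if the file is already on disk and the goal is merely to avoid building the full DataFrame, pandas can stop after the header row. A minimal sketch (this still reads the file, but skips constructing the data rows; the nrows argument needs a reasonably recent pandas):

import pandas as pd

# nrows=0 parses just the header row and returns an empty DataFrame
# whose .columns attribute holds the column names.
cols = pd.read_excel('test.xls', sheet_name=2, nrows=0).columns
print(list(cols))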
Related
I have a set of measurement data in .atfx format. I know it is possible to read this data with ArtemiS SUITE, but I need to do some post-processing in Python. I tried to look into the files, and as far as I can see, .atfx is a header file (with an XML structure) that points to binary files, so I'm not sure how I could write a Python script to decode it, or whether it is possible at all.
Is there a way to open ATFX files in Python, or is there a workaround?
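Not a full answer, but since the .atfx part is plain XML, a first step is to inspect it with the standard library and see how the binary component files are referenced. A minimal sketch that assumes nothing about the ATFX schema (the file name is hypothetical):

import xml.etree.ElementTree as ET

tree = ET.parse('measurement.atfx')  # hypothetical file name
root = tree.getroot()

# Walk the whole tree and print every element that carries text or
# attributes, to discover how the binary blobs are referenced.
for elem in root.iter():
    text = (elem.text or '').strip()
    if text or elem.attrib:
        print(elem.tag, elem.attrib, text[:80])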
I'm working with slightly big data and I need to write it to an .xlsx file. Sometimes these files can be as large as 15 GB. I have Python code that gets data as DataFrames and writes it to Excel continuously, so I need to write to an existing Excel file and an existing sheet. I was using openpyxl.
I ran into two problems while working with that library.
Firstly, to append to an existing Excel file it needs to load the whole workbook, which is impossible for me because of the data size; I must use as little RAM as I can.
Secondly, the library is only useful for writing to different sheets. When I try to write data to the same sheet, even if I give a 'startrow' for the saving process, it deletes the old data and writes the new data starting from that row.
I already tried the solution available here to address my problem, but it doesn't fit my requirements.
Do you have any idea how I can do this?
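For the RAM half of the problem, openpyxl's write-only mode streams rows to disk instead of holding the workbook in memory; it cannot append to an existing sheet (which is the other limitation described above), but a minimal sketch looks like this:

from openpyxl import Workbook

# write_only=True streams rows to disk instead of keeping the whole
# workbook in memory, so RAM usage stays flat even for very large files.
wb = Workbook(write_only=True)
ws = wb.create_sheet('data')
for i in range(1_000_000):
    ws.append([i, i * 2, f'row {i}'])  # rows must be appended in order
wb.save('big_output.xlsx')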
I am programmatically creating CSV files using Python. Many end users open and interact with those files using Excel. The problem is that Excel by default mutates many of the string values within the file. For example, Excel converts 0123 to 123.
The values being written to the csv are correct and display correctly if I open them with some other program, such as Notepad. If I open a file with Excel, save it, then open it with Notepad, the file now contains incorrect values.
I know that there are ways for an end user to change their Excel settings to disable this behavior, but asking every single user to do so is not possible for my situation.
Is there a way to generate a csv file using Python that a default copy of Excel will NOT mutate the values of?
Edit: Although these files are often opened in Excel, they are not only opened in Excel and must be output as .csv, not .xlsx.
The short answer is no, it is not possible to generate a single CSV that will display (arbitrary) data the same way in Excel and in non-Excel programs.
There are convoluted ways to force strings to appear how you want when you open a CSV in Excel, but then non-Excel programs will almost certainly not display them the way you want.
Though you say you must stick to CSV due to non-Excel programs, you don't say which programs those are. If it is possible that they can open .xlsx files after all, then .xlsx would be the best choice.
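For completeness, one of those convoluted tricks is writing each value as an Excel text formula: Excel then displays 0123 with its leading zero, but any non-Excel program will show the ="0123" wrapper literally, which is exactly the trade-off described above. A sketch:

import csv

rows = [['0123', 'foo'], ['0456', 'bar']]
with open('out.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['code', 'name'])
    # ="0123" forces Excel to keep the value as text, leading zero intact;
    # non-Excel programs will show the wrapper verbatim.
    writer.writerows([[f'="{code}"', name] for code, name in rows])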
The solution is to declare the data type while writing the file. Excel is trying to be smart and converts the whole column to a numeric type. The output should be written directly in .xlsx format, like so:
import pandas as pd

data = {'x': ['011', '012', '013'], 'y': ['022', '033', '041']}
df = pd.DataFrame(data=data)

# A context manager closes (and saves) the writer automatically;
# writer.save() was removed in recent pandas versions.
with pd.ExcelWriter('path/to/save.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1')
Source: https://stackoverflow.com/a/31136119/8819895
Have you tried expressly formatting the relevant column(s) to 'str' before exporting?
# Cast the column to string before exporting.
df['column_ex'] = df['column_ex'].astype(str)
df.to_csv('df_ex.csv')
Another workaround may be to open the Excel program (not the file), go to the Data menu, then Import from Text. Excel's import utility will give you options to define each column's data type. I believe LibreOffice defaults to keeping the leading 0s, but Excel doesn't.
I wrote a tool that extracts data from a large DB and outputs it to an Excel file along with (conditional) formatting to improve readability. For this I use Python with openpyxl on a Linux machine. It works great, but the package is rather slow at writing Excel files.
It seems to be a lot quicker to dump the table as (compressed) CSV, import that into Excel and apply the formatting there using a macro/VBA.
To automate the process I'd like to create an empty Excel file pre-loaded with the required VBA to do the formatting: a template. For every data dump, the data is embedded (compressed using deflate) into the Excel file and loaded into the workbook upon opening the document (or via a "LOAD" button to circumvent macro-related security prompts).
However, just adding a file to the Excel ZIP raises an error when the document is opened:
We found a problem with some content in 'Werkmap1_test_embed.xlsx'. Do you want us to try to recover as much as we can? If you trust the source of this workbook, click Yes.
Clicking Yes opens the file and shows some tracing information as XML:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<logFileName>Repair Result to Werkmap1_OLE_Word0.xml</logFileName>
<summary>Errors were detected in file '/Users/joostk/mnt/cluster/Werkmap1_OLE_Word.xlsx'</summary>
<additionalInfo>
<info>Excel completed file level validation and repair. Some parts of this workbook may have been repaired or discarded.</info>
</additionalInfo>
</recoveryLog>
Is it possible to avoid this? How would I embed a file into the Excel ZIP? Do I need to update some file table (which I could not find easily)?
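On the embedding step: Python's zipfile can append a member to the .xlsx package, but the repair prompt quoted above is most likely Excel's package validator objecting to a part that is not declared in [Content_Types].xml, so the "file table" to update is that content-types part. A minimal sketch (file names are hypothetical):

import shutil
import zipfile

# Work on a copy so the original template stays intact (names hypothetical).
shutil.copy('template.xlsm', 'report.xlsm')
with zipfile.ZipFile('report.xlsm', 'a') as zf:
    # Append an arbitrary member; unless its extension or part name is
    # declared in [Content_Types].xml, Excel may show the repair prompt.
    zf.write('dump.deflate', 'customParts/dump.bin')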
When that's done, I'd like to import the data. Can I access files in the Excel ZIP from VBA? I guess not, and I need to extract the data to some temporary path and load it from there.
I have found these helpful answers elsewhere to load ZIP and plain text:
https://stackoverflow.com/a/35781621/4998990
https://stackoverflow.com/a/11267603/4998990
Many thanks for sharing your thoughts!
so my "Answer" here is that this is caused by using Named Ranges, or an underlying table, or an embedded Query/Connection. When you start manipulating this file you will get the error that you are talking about:
There is no harm to the file if you click "yes" and open. Excel will open this in Repaired Mode which will require you to re-save the file.
The way I've worked around this is to re-read the "repaired" file, in python, and save it as another file or replace it. Essentially just do an extra step of re-reading the data into memory, and write it to a new file. The error will go away. As always, test this method before deploying to production to ensure no records are lost. The way I solve it is with two lines of pandas.
import pandas as pd

# Re-read the "repaired" file and write it back out to rebuild the workbook;
# note to_excel() returns None, so there is nothing useful to assign.
repair = pd.read_excel('PATH_TO_REPAIR_FILE')
repair.to_excel('PATH_TO_WHERE_NEW_FILE_GOES')
I have hundreds of CSV files which all appear to be corrupt in the same way. Each file has 5 headers, but the data is always split across 2 rows, e.g.
I was thinking of a Python script to correct this, and was wondering whether there is a function or library that can do it quickly, rather than writing a whole script to adjust it. The expected format is below. How can I correct this for all files?
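Without seeing the example rows, here is a short standard-library sketch under the assumption that each logical record was simply split across two consecutive physical rows, so the fix is to glue row pairs back together (the glob pattern is hypothetical):

import csv
import glob

# Assumption: row 0 is the 5-column header and every logical record
# occupies two consecutive physical rows that just need concatenating.
for path in glob.glob('broken/*.csv'):
    with open(path, newline='') as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    fixed = [body[i] + body[i + 1] for i in range(0, len(body) - 1, 2)]
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(fixed)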