XML to CSV/Excel

XML to CSV/Excel - python

I have an RSS-formatted XML file. What I usually do is import it into Excel using the developer tools on a PC. That feature builds a tree for me automatically; I simply drag and drop the root element onto the spreadsheet, hit refresh data, and I have a CSV or Excel file that I can do anything with that I could do with the raw RSS file.
I'd like to skip the step of going to Excel on a PC and use something like Python to get the job done on my Mac. The problem is that I don't want to have to tell Python the tree, elements, etc. I want it to figure that out and give me a CSV!
Any guidance on how I might be able to accomplish this task?

XML2Json actually worked out OK:
xml2json --input "/Users/me/Downloads/file.xml" --output "file_2.json"
There are some formatting issues with the headers, but I can clean that up.
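For readers who want to stay in Python, here is a minimal sketch of the "figure out the columns for me" idea using only the standard library. It assumes an RSS 2.0 style feed in which each record is an un-namespaced <item> element with flat child elements; the function names are made up for this illustration, and nested elements or attributes are not handled.

```python
import csv
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def rss_items_to_rows(xml_text):
    """Collect every <item> as a dict and infer the column list from the data."""
    root = ET.fromstring(xml_text)
    columns, rows = [], []
    for item in root.iter("item"):  # assumption: RSS 2.0, items have no namespace
        row = {}
        for child in item:
            name = local_name(child.tag)
            text = (child.text or "").strip()
            # Repeated tags (e.g. several <category>) are joined into one cell.
            row[name] = f"{row[name]}; {text}" if name in row else text
            if name not in columns:
                columns.append(name)
        rows.append(row)
    return columns, rows


def rows_to_csv(columns, rows):
    """Render the rows as CSV text; missing cells are left empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The columns come out in the order the tags are first seen, so the header row may differ from what Excel's import produces.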

Related

Python + Linux - Excel to HTML (keeping format)

I'm looking for a way to convert excel to html while preserving formatting.
I know this is doable on Windows thanks to some underlying win32 libraries (e.g. via xlwings; see "Python - Excel to HTML (keeping format)").
But I'm looking for a solution on Linux.
I've also come by Aspose Cells but this requires a paid license or else it will add a lot of extra junk to the output that needs to be scrubbed out.
And lastly I tried the python lib xlsx2html but it does a very poor job at preserving formatting.
Are there any suggestions for a Linux based solution? I'd also be interested in tools written in other languages that can be easily wrapped around via python.
Thanks in advance!
Update:
Here is an example of a random excel sheet I converted via excel itself that I would like to reproduce. It has some colors, some border variations, some merged cells and some font sizes to see if they all work.

You can use LibreOffice to convert an Excel file to an HTML file from the command line:
# --convert-to implies --headless so it's not mandatory to specify --headless
soffice --headless --convert-to html data.xlsx
You can refer to the documentation to know more about other CLI parameters.
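Since the question also asks for something that can be wrapped from Python, the command above could be driven with subprocess roughly as follows. This is a sketch: it assumes the soffice binary is on PATH, the helper names are invented for the example, and --outdir is used so the result lands in a known directory.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(xlsx_path, out_dir, binary="soffice"):
    """Return the argument list for a headless LibreOffice HTML conversion."""
    return [
        binary,
        "--headless",
        "--convert-to", "html",
        "--outdir", str(out_dir),
        str(xlsx_path),
    ]


def convert_to_html(xlsx_path, out_dir, binary="soffice"):
    """Run the conversion and return the path where the HTML file is expected."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary!r} was not found on PATH")
    subprocess.run(build_convert_command(xlsx_path, out_dir, binary), check=True)
    # LibreOffice names the output after the input file's stem.
    return Path(out_dir) / (Path(xlsx_path).stem + ".html")
```

A call such as convert_to_html("data.xlsx", "out") would then be expected to produce out/data.html.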

I think you should search for Excel-to-HTML tools in the JS world rather than in Python. I am not saying it is impossible in Python, but it is more common in JS, and I expect you will get better results there.
In my opinion, finding a JS-based solution and writing a Python wrapper around it would be more helpful, because the JS community has had to put more effort than other communities into importing and working with Excel files.
Another idea is to change your approach: look into embedding an Excel file in an HTML page with JS (or in an iframe) and then exporting it.
But again, I highly recommend checking JS libraries and GitHub repositories; some of them do care about formatting.

Jupyter Notebook issue

I ran some commands in Jupyter Notebook and expected the data from a .csv file to be printed in tabular form, but I get incomplete output instead.
This is the result I get from the .csv file.
I ran this command:
df1=pandas.read_csv("supermarkets.csv", on_bad_lines='skip')
df1
I expected printed output in tabular form, as in the attached image.
The data gets printed in well-tabulated form here.
Here is a link to the online version of the file
[pythonhow.com/supermarkets.csv]

Getting clean, good-quality data where the file extension correctly matches the actual content is often a challenge. Assessing the state of the input data is an important first step.
It appears the data you are trying to get is also online here. GitHub renders it as a table in the browser because it has a viewer mode. To look at the 'raw' file content, click here. You'll see it is a plain comma-delimited file, with columns separated by commas and each row on its own line. The header with the column names is on the first line.
Now open the file you are working with in a good text editor and compare it to the content I pointed you at. That should show you what the issue is.
At this point you may just wish to switch to the version of the file that I pointed you at.
Use the link below to obtain it as a proper CSV file:
https://raw.githubusercontent.com/kenvilar/data-analysis-using-python/master/supermarkets.csv
You should be able to paste that link into your browser, then right-click on the page and choose 'Save as..' to download it to your local machine. The resulting file should open just fine with the code shown in the screenshot in your post.
Please work on writing better questions with specific titles; see here for guidance. The current title is overly broad and not accurate. This code would not work with the data you apparently have even if you ran it in a plain Python script, so it is not a Jupyter notebook issue. To make a title specific, it helps to write for your future self. If you continue to use notebooks you'll have hundreds of problems that could be called a 'Jupyter Notebook issue', so what makes this one different from those?

I believe there is an issue with your CSV file, not the code.
To me it looks like the data in your CSV file is written in JSON format.
Have you opened the supermarkets.csv file in Excel? It should look like a table, not a JSON-formatted file.
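One way to test the "JSON saved with a .csv extension" suspicion without opening the file by hand is to check whether its whole content parses as JSON. The helper below is a small sketch with an invented name; it only looks at the content, not at the extension.

```python
import json
from pathlib import Path


def looks_like_json(path):
    """Return True if the whole file parses as a JSON object or array."""
    # utf-8-sig drops a leading byte-order mark if the file has one.
    text = Path(path).read_text(encoding="utf-8-sig", errors="replace").lstrip()
    if not text.startswith(("{", "[")):
        return False
    try:
        json.loads(text)
    except ValueError:
        return False
    return True
```

If this returns True for supermarkets.csv, pandas.read_json is probably a better fit for that file than pandas.read_csv.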

Did you try df1.head() to see if the CSV got read in the first place?

Embed CSV in Excel and import the data

I wrote a tool that extracts data from a large DB and outputs it to an Excel file along with (conditional) formatting to improve readability. For this I use Python with openpyxl on a Linux machine. It works great, but this package is rather slow for writing Excel.
It seems to be a lot quicker to dump the table as (compressed) csv, import that into Excel and apply formatting there using a macro/vba.
To automate the process I'd like to create an empty Excel file pre-loaded with the required VBA to do the formatting; a template. For every data dump, the data is embedded (compressed using deflate) into the Excel file and loaded into the Workbook upon opening the document (or using a "LOAD" button to circumvent macro related security things).
However, just adding some file into the Excel file raises an error when opened:
We found a problem with some content in 'Werkmap1_test_embed.xlsx'. Do you want us to try to recover as much as we can? If you trust the source of this workbook, click Yes.
Clicking Yes opens the file and shows some tracing information as XML:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<logFileName>Repair Result to Werkmap1_OLE_Word0.xml</logFileName>
<summary>Errors were detected in file '/Users/joostk/mnt/cluster/Werkmap1_OLE_Word.xlsx'</summary>
<additionalInfo>
<info>Excel completed file level validation and repair. Some parts of this workbook may have been repaired or discarded.</info>
</additionalInfo>
</recoveryLog>
Is it possible to avoid this? How would I embed a file into the Excel ZIP? Do I need to update some file table (which I could not find easily)?
When that's done, I'd like to import the data. Can I access files in the Excel ZIP from VBA? I guess not, and I need to extract the data to some temporary path and load it from there.
I have found these helpful answers elsewhere to load ZIP and plain text:
https://stackoverflow.com/a/35781621/4998990
https://stackoverflow.com/a/11267603/4998990
Many thanks for sharing your thoughts!

My answer here is that this is caused by named ranges, an underlying table, or an embedded query/connection. When you start manipulating such a file you will get the error you describe.
There is no harm to the file if you click "Yes" and open it. Excel opens it in repaired mode, which requires you to re-save the file.
The way I've worked around this is to re-read the "repaired" file in Python and save it as another file, or replace it. Essentially, just add an extra step that reads the data into memory and writes it to a new file. The error will go away. As always, test this method before deploying to production to ensure no records are lost. I do it with two lines of pandas:
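import pandas as pd
# Re-read the repaired workbook into memory (first sheet only by default).
repair = pd.read_excel('PATH_TO_REPAIR_FILE')
# Write it back out; to_excel returns None, so there is nothing to assign.
repair.to_excel('PATH_TO_WHERE_NEW_FILE_GOES', index=False)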
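On the question's "file table": an .xlsx file is an Open Packaging Conventions ZIP, and every part in it is expected to have a content type declared in [Content_Types].xml. A file added without such a declaration is one plausible cause of the repair prompt, though that is an assumption and not something confirmed for this workbook. The sketch below copies a workbook, adds a payload, and registers a default content type for its extension; the function name, the "bin" extension, and the part name are all hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def embed_part(src_xlsx, dst_xlsx, part_name, payload,
               extension="bin", content_type="application/octet-stream"):
    """Copy src_xlsx to dst_xlsx, add `payload` as `part_name`, and declare
    a default content type for `extension` in [Content_Types].xml."""
    ET.register_namespace("", CT_NS)  # keep the default namespace on output
    with zipfile.ZipFile(src_xlsx) as src, \
            zipfile.ZipFile(dst_xlsx, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "[Content_Types].xml":
                root = ET.fromstring(data)
                known = {d.get("Extension", "").lower()
                         for d in root.findall(f"{{{CT_NS}}}Default")}
                if extension.lower() not in known:
                    root.insert(0, ET.Element(
                        f"{{{CT_NS}}}Default",
                        {"Extension": extension, "ContentType": content_type}))
                data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
            dst.writestr(item.filename, data)
        # The payload is stored deflate-compressed like the other parts.
        dst.writestr(part_name, payload)
```

Whether Excel then opens the result without complaint, and whether it keeps the extra part when the workbook is saved again, would need to be tested; reading the part from VBA would still mean extracting it to a temporary path, as the question suspects.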

Converting HTML to Excel with Django

I have a reporting module in my Django app that gives the user the ability to see their reports on screen or to export them and have the export opened by Excel.
The export is a cheat. I take the exact same output as the screen version and save it to a file with an .xls extension and
response = HttpResponse(body, content_type='application/vnd.ms-excel')
and badda-boom, badda-bing I have an Excel file that is lightly formatted, i.e. it respects the css styling that I've applied.
The nice thing for the user is that the file auto-opens in Excel; there aren't any extra steps for them. (find the download, import a text file, etc.)
Unfortunately it looks like Excel 2016 has decided (I'm guessing) that that's a security issue and no longer opens the file.
I'm aware of various python -> Excel tools. openpyxl looks promising. But that's going to require me to touch each report.
So, what I'm looking for is something that would give me what I have now, take an html file and have Excel open it as a native file and recognize the existing formatting.

The behavior change has been noted by Microsoft, and there are workarounds for the user:
https://support.microsoft.com/en-us/kb/3181507
It sounds like they're working on a fix.

Extracting Data from a .txt file using python

I have many, many .xml files and I need to extract some coordinates from them.
Extracting data straight from .xml files seems to be very complicated, so I am saving the .xml files as .txt files and extracting the data that way. However, when I open the .txt file, my data is all bunched together on about 6 lines, and all the scripts I have found so far select the data by reading the first word on each line, which obviously won't work for me!
I need to extract the numbers between these tags:
<gml:lowerCorner>137796 483752</gml:lowerCorner> <gml:upperCorner>138178 484222</gml:upperCorner>
In the text file they are all grouped together! Does anyone know how to extract this data? Thank you!

This is absolutely the wrong approach. Leave it alone and improve your ways :-)
Seriously, if the file is XML, then just use an XML parser to read it. Learning how to do it in Python isn't hard and will make your life easier now and much easier in the future, when you may find yourself facing more complex parsing needs, and you won't have to re-learn it.
Look at xml.etree.ElementTree.ElementTree. Here's some sample code:
>>> from xml.etree.ElementTree import ElementTree
>>> tree = ElementTree()
>>> tree.parse("your_xml_file.xml")
Now just read the documentation of the module and see what you can do with tree. You'll be surprised how simple it is to get at the information this way. If you have specific questions about extracting data, I suggest you open another question in which you specify the format of the XML file you have to parse and what data you need to take out of it. I'm sure you will have working code suggested to you in a matter of minutes.
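As an illustration of what the parser approach might look like for the corner elements in the question, here is a sketch. It matches elements by local name so it does not need to know which namespace URI the gml: prefix is bound to in the real files; the function name is made up for the example, and the file is assumed to declare the prefix (otherwise it would not be well-formed XML).

```python
import xml.etree.ElementTree as ET


def extract_corners(xml_text):
    """Return (name, coordinates) pairs for every lowerCorner/upperCorner."""
    root = ET.fromstring(xml_text)
    corners = []
    for element in root.iter():
        # ElementTree spells namespaced tags as '{uri}local'; keep the local part.
        name = element.tag.rsplit("}", 1)[-1]
        if name in ("lowerCorner", "upperCorner") and element.text:
            coordinates = tuple(float(value) for value in element.text.split())
            corners.append((name, coordinates))
    return corners
```

For a file on disk, ET.parse("your_xml_file.xml").getroot() could replace the ET.fromstring call.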

You can also open the .xml file in a Python script the same way you would open a .txt file:
with open("file.xml") as data:
    xml = data.read()
Then you can use regular expressions to find the numbers you want.

The top answer is still the top answer. However, I've been doing just this with HTML and have found lxml and XPath ideal.
Open your browser to the site (or data) of interest. In Chrome, right-click and choose 'Inspect Element'. In the Developer window, right-click again on the highlighted text and choose 'Copy XPath'. For google.com, clicking on the main search box gives me the following XPath:
//*[@id="lst-ib"]
You can use lxml to grab various data from this item. See what you get when you append 'text()', '@value' or '@href' to the end.
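To show the shape of such a lookup without requiring lxml, the sketch below uses the limited XPath subset built into the standard library's xml.etree.ElementTree instead. That is a deliberate substitution: ElementTree only handles well-formed XML and cannot evaluate '/@value' style expressions directly, so the attribute is read with .get(). The function name and the sample markup in the usage note are invented for the example.

```python
import xml.etree.ElementTree as ET


def attribute_by_id(xml_text, element_id, attribute):
    """Find the first element with the given id and return one of its attributes."""
    root = ET.fromstring(xml_text)
    # ElementTree supports the [@attrib='value'] predicate from XPath.
    element = root.find(f".//*[@id='{element_id}']")
    if element is None:
        return None
    return element.get(attribute)
```

For example, attribute_by_id('<form><input id="lst-ib" value="hello"/></form>', "lst-ib", "value") looks up the value attribute of the input element. Real-world HTML is often not well-formed XML, which is one reason lxml is the more practical choice there.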

For really simple XML I just use a regex; I can't be bothered to start a slow XML parser for a simple XML packet.
In [1]: data = open("file.txt","r").read()
In [2]: import re
In [3]: re.compile(r"(\d+)").findall(data)
Out[3]: ['137796', '483752', '138178', '484222']
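The pattern above picks up every run of digits in the file, including any that appear outside the corner tags. A slightly more targeted variant, sketched below with an invented function name, anchors on the tag names from the question and assumes each corner holds exactly two whitespace-separated integers.

```python
import re

# \1 makes the closing tag match whichever corner name was opened.
CORNER_PATTERN = re.compile(
    r"<gml:(lowerCorner|upperCorner)>\s*(\d+)\s+(\d+)\s*</gml:\1>"
)


def find_corners(text):
    """Map each corner name to its (x, y) pair; a later match overwrites an earlier one."""
    return {
        name: (int(x), int(y))
        for name, x, y in CORNER_PATTERN.findall(text)
    }
```

This remains a shortcut for very regular input; it will not cope with attributes on the tags, decimal values, or a different namespace prefix.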
