XML retrieving from a URL to CSV [closed] - python

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 3 years ago.
I need to retrieve the current currencies and rates from the European Central Bank, which are published in XML format, and convert them to a CSV file using Python. My script creates a file, but it does not write the data I need.
XML follows here:
https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml?
This is my code, but it does not work:
import xml.etree.ElementTree as ET
import requests
import csv
kurzbanky_xml = requests.get("https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml")
root = ET.fromstring(kurzbanky_xml.text)
with open('banka.csv','w',newline='') as Currency_Rate:
    csvwriter = csv.writer(Currency_Rate)
    csvwriter.writerow(['currency','rate'])
    for member in root.iterfind('Cube'):
        cur = cube.attrib['currency']
        rat = cube.attrib['rate']
        csvwriter.writerow([cur,rat])

You can use the xmltodict library to parse the XML into a dict and then iterate over it:
import csv
import requests
import xmltodict
r = requests.get("https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml").text
data = xmltodict.parse(r)['gesmes:Envelope']['Cube']['Cube']
with open('{}.csv'.format(data['@time']), 'w', newline='') as f:
    csvwriter = csv.writer(f)
    csvwriter.writerow(['currency', 'rate'])
    for cur in data['Cube']:
        csvwriter.writerow([cur['@currency'], cur['@rate']])
Output 2019-03-27.csv file:
currency,rate
USD,1.1261
JPY,124.42
BGN,1.9558
CZK,25.797
DKK,7.4664
GBP,0.85118
etc.
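Alternatively, the original ElementTree approach from the question also works once the ECB default namespace is taken into account and the loop variable name is fixed. A minimal sketch, assuming the namespace URI declared in the feed (http://www.ecb.int/vocabulary/2002-08-01/eurofxref):

import csv
import requests
import xml.etree.ElementTree as ET

root = ET.fromstring(requests.get(
    "https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml").text)

# The rate elements live in the ECB default namespace, so a plain 'Cube' path finds nothing.
ns = {'ecb': 'http://www.ecb.int/vocabulary/2002-08-01/eurofxref'}

with open('banka.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['currency', 'rate'])
    # Only the innermost Cube elements carry currency/rate attributes.
    for cube in root.findall('.//ecb:Cube[@currency]', ns):
        writer.writerow([cube.attrib['currency'], cube.attrib['rate']])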

Related

How to pull values from a messy JSON file using python? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 19 hours ago.
I am trying to pull the values of all pin and p_status fields from the sample data below into another file:
{'exchange':'vs',
'msg_count':660,
'payload': {"allocation":"645985649","p_status":"ordered","pin":"323232134455","bytes":"998"},
{'exchange':'vse',
So far I am only able to load and pretty-print my JSON file using the code below:
import json
import pprint
file = open('/Users/sjain34/Downloads/jsonviewer.json','r')
data = json.load(file)
my_data= pprint.pprint(data)
print(my_data)
Try this:
import pandas as pd
Df = pd.read_json('/Users/sjain34/Downloads/jsonviewer.json')
preferred_df = Df[["p_status", "pin"]]
preferred_df.to_csv('file_name.csv')
This would produce a CSV with just the statuses and pins.
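Note that in the sample shown, pin and p_status are nested inside a 'payload' object, so a flat read may not reach them. Here is a minimal sketch using only the standard library, assuming the file is valid JSON containing a list of such records (the input path from the question and the output filename 'pins.csv' are illustrative):

import csv
import json

# Assumption: the file is a list of records, each with a 'payload' dict
# that holds 'pin' and 'p_status'.
with open('/Users/sjain34/Downloads/jsonviewer.json', 'r') as f:
    records = json.load(f)

with open('pins.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['pin', 'p_status'])
    for record in records:
        payload = record.get('payload', {})
        writer.writerow([payload.get('pin'), payload.get('p_status')])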

Why does my word to JSON converter always return the same output? [closed]

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 5 days ago.
import aspose.words as aw
import os
import glob
import openpyxl
import json
import aspose.cells
from aspose.cells import License,Workbook,FileFormatType
workbook = Workbook("bookwithChart.xlsx")
os.chdir(os.path.join(r"C:\Users\13216\Downloads\pythontests"))
docx_files = glob.glob("*.docx")
for files in docx_files:
    doc = aw.Document(files)
    doc.save("document1.docx")
    doc = aw.Document("document1.docx")
    doc.save("html_output.html", aw.SaveFormat.HTML)
    book = Workbook("html_output.html")
    book.save("word-to-json.json", FileFormatType.JSON)
I need to convert a set of Word documents to JSON. It works great. However, when I change the path and document name to test other documents, the output doesn't change: the JSON file contains the same output as the initial test document.
I tried changing the save command for the JSON and HTML files, but it didn't work. I assume the program is storing the output from the very first test document ("document1.docx"). I have tried inputting different documents many times, and the output does not change.
There is no need to open and re-save the DOCX document in your code. Also, you save all the documents into the same output file, so it is overwritten on each iteration. You can modify your code like this:
import aspose.words as aw
import aspose.cells as ac
import os
import glob
os.chdir(os.path.join(r"C:\Temp"))
docx_files = glob.glob("*.docx")
i = 0
for files in docx_files:
    doc = aw.Document(files)
    doc.save("tmp.html", aw.SaveFormat.HTML)
    book = ac.Workbook("tmp.html")
    book.save("word-to-json_" + str(i) + ".json", ac.FileFormatType.JSON)
    i += 1
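As a small variation (my own suggestion, not part of the answer above), you could derive each JSON filename from the source document name instead of a counter, which makes the outputs easier to trace back:

import glob
import os
import aspose.words as aw
import aspose.cells as ac

for docx in glob.glob("*.docx"):
    base = os.path.splitext(docx)[0]          # e.g. "report.docx" -> "report"
    aw.Document(docx).save("tmp.html", aw.SaveFormat.HTML)
    ac.Workbook("tmp.html").save(base + ".json", ac.FileFormatType.JSON)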

Implementing a class error and returning 0 text [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I hope this finds you well. I am struggling a bit and have two questions. First, when I try to use my class it prints something like <__main__.Transform object at 0x02C08790>. I have read other answers about this and do not quite understand them. Second, when I run the code below it says there are no items in the PDF I saved earlier. I think I am passing the document incorrectly, but I am unsure. I have tested each piece of code separately and both work independently, but not together. Any help is greatly appreciated.
import os
from pdfminer3.layout import LAParams, LTTextBox
from pdfminer3.pdfpage import PDFPage
from pdfminer3.pdfinterp import PDFResourceManager
from pdfminer3.pdfinterp import PDFPageInterpreter
from pdfminer3.converter import PDFPageAggregator
from pdfminer3.converter import TextConverter
import io
import PyPDF2
from PyPDF2 import PdfFileMerger, PdfFileReader
import pandas as pd
class Transform:
    # method for extracting data and merging it into one pdf
    def __init__(self):
        try:
            source_dir = os.getcwd()
            merger = PdfFileMerger()
            for item in os.listdir(source_dir):
                if item.endswith("pdf"):
                    merger.append(item)
        except Exception:
            print("unable to collect")
        finally:
            merger.write("test.pdf")
            merger.close()

    # running that method extract
    def extract(self):
        resource_manager = PDFResourceManager()
        fake_file_handle = io.StringIO()
        converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
        page_interpreter = PDFPageInterpreter(resource_manager, converter)
        with open('test.pdf', 'rb') as fh:
            for page in PDFPage.get_pages(fh,
                                          caching=True,
                                          check_extractable=True):
                page_interpreter.process_page(page)
            text = fake_file_handle.getvalue()
        # close open handles
        converter.close()
        fake_file_handle.close()
print(Transform)
I expect your code to merge all PDFs in the current directory into test.pdf and then print the text of the merged PDF. It needs just two corrections. First, replace
print(Transform)
with
print(Transform().extract())
Transform by itself is a class; you need to create (instantiate) an object from it using Transform(). Then you can call methods on it, like .extract(), which runs the method defined in that class. You may read about classes and objects here.
Second, add
return text
as the last line of the def extract(self) function body. This return is necessary so that extract returns the text it has extracted from the PDF; otherwise it does some work but, as in the original code, doesn't return any result.
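Putting the two corrections together, the end of extract() and the call site look like this (a partial sketch, not the full listing):

        # close open handles
        converter.close()
        fake_file_handle.close()
        return text                      # correction 2: return the extracted text

print(Transform().extract())             # correction 1: instantiate, then call extract()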
You can run the full corrected code here.

Efficient way of converting CSV file to XML file? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I have written the code below to convert a CSV file to an XML file. The input file has 10 million records, and the problem is that it runs for hours with that many records. With fewer records, e.g. 2000, it takes 5-10 seconds.
Is there a way to do this more efficiently, in less time?
import csv
import sys
import os
from xml.dom.minidom import Document
filename = sys.argv[1]
filename = os.path.splitext(filename)[0]+'.xml'
pathname = "/tmp/"
output_file = pathname + filename
f = sys.stdin
reader = csv.reader(f)
fields = next(reader)
fields = [x.lower() for x in fields]
fieldsR = fields
doc = Document()
dataRoot = doc.createElement("rowset")
dataRoot.setAttribute('xmlns:xsi', "http://www.w3.org/2001/XMLSchema-instance")
dataRoot.setAttribute('xsi:schemaLocation', "./schema.xsd")
doc.appendChild(dataRoot)
for line in reader:
    dataElt = doc.createElement("row")
    for i in range(len(fieldsR)):
        dataElt.setAttribute(fieldsR[i], line[i])
        dataRoot.appendChild(dataElt)
xmlFile = open(output_file,'w')
xmlFile.write(doc.toprettyxml(indent = '\t'))
xmlFile.close()
sys.stdout.write(output_file)
I don't know Python or Minidom, but you seem to be executing the line
dataRoot.appendChild(dataElt)
once for every field in every row, rather than once for every row.
Your performance numbers suggest that something is very wrong here; I don't know if this is it. With 2000 records I would expect to measure the time in milliseconds.
I have to say I'm constantly amazed at how people write complex procedural code for this kind of thing when it could be done in half a dozen lines of XSLT or XQuery.
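Apart from that, building the entire DOM and then calling toprettyxml() keeps all 10 million rows in memory at once. A minimal streaming sketch (my suggestion, not from the answer; the output path is illustrative, and quoteattr handles the attribute escaping):

import csv
import sys
from xml.sax.saxutils import quoteattr

reader = csv.reader(sys.stdin)
fields = [x.lower() for x in next(reader)]

with open('/tmp/out.xml', 'w') as xml_file:
    xml_file.write('<rowset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
                   'xsi:schemaLocation="./schema.xsd">\n')
    for line in reader:
        # Write each row directly instead of accumulating a DOM tree.
        attrs = ' '.join('%s=%s' % (name, quoteattr(value))
                         for name, value in zip(fields, line))
        xml_file.write('\t<row %s/>\n' % attrs)
    xml_file.write('</rowset>\n')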

Extracting with Python [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 9 years ago.
I am currently working on a Python program that extracts information from a stock website:
http://markets.usatoday.com/custom/usatoday-com/html-mktscreener.asp
I need to extract all the columns from Symbol to Volume. Before this program I had to create a bash script that downloaded the page every minute for an hour to get 60 pages, which I have done. But I do not understand how to extract the information so I can insert it into a MySQL database.
import libxml2
import sys
import os
import commands
import re
import sys
import MySQLdb
from xml.dom.minidom import parse, parseString
# for converting dict to xml
from cStringIO import StringIO
from xml.parsers import expat
def get_elms_for_atr_val(tag,atr,val):
    lst=[]
    elms = dom.getElementsByTagName(tag)
    # ............
    return lst

# get all text recursively to the bottom
def get_text(e):
    lst=[]
    # ............
    return lst

def extract_values(dm):
    lst = []
    l = get_elms_for_atr_val('table','class','most_actives')
    # ............
    # get_text(e)
    # ............
    return lst
I'm very new to Python and that's the best I can provide. There are 60 downloaded HTML pages, and I believe all I need to do is extract the information from one page; or at least, if I can start on one page, I can figure out a loop for the others and extract that information to be used in MySQL.
Any help to get me started is appreciated!
Use a robust HTML parser instead of the xml module, as the latter will reject malformed documents, which the URL you pointed to appears to be. Here's a quick solution:
from lxml.html import parse
import sys
def process(htmlpage):
    tree = parse(htmlpage).getroot()
    # Helper function
    xpath_to_column = lambda expr: [el.text for el in tree.xpath(expr)]
    symbol = xpath_to_column('//*[@id="idcquoteholder"]/table/tr/td[1]/a')
    price = xpath_to_column('//*[@id="idcquoteholder"]/table/tr/td[3]')
    volume = xpath_to_column('//*[@id="idcquoteholder"]/table/tr/td[6]')
    return zip(symbol, price, volume)

def main():
    for filename in sys.argv[1:]:
        with open(filename, 'r') as page:
            print process(page)

if __name__ == '__main__':
    main()
You will have to elaborate on this example a bit, as some elements (like the "Symbol") are further contained in span or a nodes, but the spirit is: use XPath to query and extract column contents. Add columns as needed.
Hint: use Chrome Inspector or Firebug to get the right XPath.
EDIT: pass all the filenames on the command line to this script. If you need to process each file separately, then remove the for loop in main().
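Since the end goal is to load the extracted rows into MySQL, here is a hedged sketch of that step using the MySQLdb module the question already imports (the connection details, table name, and example row are placeholders):

import MySQLdb

# Placeholder connection details; assumes a table like stocks(symbol, price, volume).
db = MySQLdb.connect(host="localhost", user="user", passwd="password", db="quotes")
cur = db.cursor()

# rows would be the (symbol, price, volume) tuples returned by process(page) above;
# a single made-up row is shown here purely to illustrate the shape.
rows = [("ABC", "1.23", "456789")]
cur.executemany(
    "INSERT INTO stocks (symbol, price, volume) VALUES (%s, %s, %s)",
    rows)
db.commit()
db.close()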
