How can I import data, for example from the field A1?
When I use etree.parse() I get an error, because I don't have an XML file. It's a zip file:
import zipfile
from lxml import etree

# An .ods document is a zip archive; the sheet data lives in content.xml
z = zipfile.ZipFile('mydocument.ods')
data = z.read('content.xml')
data = etree.XML(data)
etree.dump(data)
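Building on that, here is a minimal sketch (not part of the original post) of how one might reach the first cell (A1) from the parsed content.xml, assuming the standard ODF namespace URIs; note that repeated cells (table:number-columns-repeated) can shift positions in real documents:

ns = {
    'table': 'urn:oasis:names:tc:opendocument:xmlns:table:1.0',
    'text': 'urn:oasis:names:tc:opendocument:xmlns:text:1.0',
}
first_row = data.find('.//table:table-row', ns)      # first row of the first sheet
first_cell = first_row.find('table:table-cell', ns)  # cell A1
print(first_cell.findtext('text:p', namespaces=ns))  # the cell's displayed text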
I would like to know how to parse a 16 GB XML file using Python, since it always raises a memory error:
import numpy as np
import xml.etree.ElementTree as ET
import pandas as pd
import datetime

# ET.parse() builds the whole tree in memory, which fails for a 16 GB file
tree = ET.parse('M.xml')
root = tree.getroot()
root.tag
newsitems = []
For such a case use the Pull API for non-blocking parsing. You can feed parts of your XML to the XMLPullParser:
import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(['start', 'end'])  # other events are comment, pi, start-ns, end-ns
with open("M.xml", 'r') as f_xml:
    for line in f_xml:
        parser.feed(line)
        for event, elem in parser.read_events():
            print(event)
            print(elem.tag, 'text=', elem.text)
            if event == 'end':
                elem.clear()  # free finished elements, or memory still grows with the tree
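For completeness, a sketch of the other common standard-library idiom (not part of the original answer): ET.iterparse() streams the file for you and keeps memory flat, provided each element is cleared once processed.

import xml.etree.ElementTree as ET

for event, elem in ET.iterparse('M.xml', events=('end',)):
    # process the finished element here, e.g. print(elem.tag, elem.text)
    elem.clear()  # drop its contents so the implicit tree stays small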
Currently I am using the following code to read XML files and extract data.
import pandas as pd
import numpy as np
import xml.etree.cElementTree as et
import datetime

tree = et.parse(r'/data/dump_xml/myfile1.xml')
root = tree.getroot()

NAME = []
for name in root.iter('name'):
    NAME.append(name.text)

UPDATE = []
for update in root.iter('lastupdate'):
    UPDATE.append(update.text)
updated = datetime.datetime.fromtimestamp(int(UPDATE[0]))
lastupdate = updated.strftime('%Y-%m-%d %H:%M:%S')

ParaValue = []
for parameterevalue in root.iter('value'):
    ParaValue.append(parameterevalue.text)

print(ParaValue[0])
print(ParaValue[1])
print(lastupdate, NAME[0], ParaValue[0])
print(lastupdate, NAME[1], ParaValue[1])
From one file I get the following output:
2022-05-23 11:25:01 traffic_in 1.5012356187e+05
2022-05-23 11:25:01 traffic_out 1.7723777592e+05
But I have a set of XML files in /data/dump_xml/, and I need all the results, with the file name as well, exported as a DataFrame. Can someone help me do this for the whole directory?
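One possible approach, as a sketch rather than a definitive answer: loop over the directory with glob and collect one row per (name, value) pair, assuming every file has the same structure as myfile1.xml.

import glob
import os
import datetime
import pandas as pd
import xml.etree.ElementTree as et

rows = []
for path in glob.glob('/data/dump_xml/*.xml'):
    root = et.parse(path).getroot()
    names = [n.text for n in root.iter('name')]
    values = [v.text for v in root.iter('value')]
    stamp = int(next(root.iter('lastupdate')).text)
    lastupdate = datetime.datetime.fromtimestamp(stamp).strftime('%Y-%m-%d %H:%M:%S')
    for name, value in zip(names, values):
        rows.append({'file': os.path.basename(path), 'lastupdate': lastupdate,
                     'name': name, 'value': value})

df = pd.DataFrame(rows)
print(df)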
I am a beginner in Python and I am stuck on a critical part of my script. The goal is to get fillable PDF data into one single CSV file. I have built a script to do so with pdfminer, but I am stuck at the part where the script should write all of the output data into my CSV. Instead, I am only getting one data line written to my CSV. Any clue, please?
Here is my code:
import sys, os
import six
import csv
import numpy as np
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

fp = open("C:\\Users\\Sufyan\\Desktop\\test\\ccc.pdf", "rb")
parser = PDFParser(fp)
doc = PDFDocument(parser)
fields = resolve1(doc.catalog["AcroForm"])["Fields"]
for i in fields:
    field = resolve1(i)
    name, value = field.get("T"), field.get("V")
    filehehe = "{0}:{1}".format(name, value)
    values = resolve1(value)
    names = resolve1(name)
    print(values)

with open('test.csv', 'wb') as f:
    for i in names:
        f.write(values)
Clues:
The loop variable i is never used inside this loop:
with open('test.csv', 'wb') as f:
    for i in names:
        f.write(values)
For the same for loop: what does names contain? How many values are there? One? Two? Why?
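Putting these clues together, here is one possible restructuring as a sketch (the to_text helper is mine, not pdfminer's): collect one (name, value) row per form field, then open the CSV once, after the loop, and write all rows.

import csv
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

def to_text(obj):
    # Hypothetical helper: pdfminer returns PDF strings as bytes, so decode them
    obj = resolve1(obj)
    return obj.decode('utf-8', 'ignore') if isinstance(obj, bytes) else obj

rows = []
with open("C:\\Users\\Sufyan\\Desktop\\test\\ccc.pdf", "rb") as fp:
    doc = PDFDocument(PDFParser(fp))
    for i in resolve1(doc.catalog["AcroForm"])["Fields"]:
        field = resolve1(i)
        rows.append([to_text(field.get("T")), to_text(field.get("V"))])

# Open the file once, outside the loop, so earlier rows are not overwritten
with open('test.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)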
I have an XML file I am trying to parse, to access one element: DonorAdvisedFundInd. I shouldn't have a problem with that, but when I try to parse the XML file I get an error message saying:
[Errno 36] File name too long
Here's the code I'm currently using (I cut off most of it so it's easier to see the problem; the error occurs on the parse line):
import pandas as pd
import xml.etree.ElementTree as et
import requests
xml_data = requests.get("https://s3.amazonaws.com/irs-form-990/201903199349320465_public.xml").content
xtree = et.parse(xml_data)
Now the reason I'm so confused is if you open that link, the XML file really isn't all that long. It should be able to be parsed. I'm using IBM Watson Studio's online compiler if it makes any difference.
I'd appreciate any insight or feedback anyone can provide.
Try fromstring: et.parse() expects a file name or file object, so the downloaded bytes were being interpreted as a (very long) file path, while et.fromstring() parses them directly.
import pandas as pd
import xml.etree.ElementTree as et
import requests
xml_data = requests.get("https://s3.amazonaws.com/irs-form-990/201903199349320465_public.xml").content
xtree = et.fromstring(xml_data)
Update (for finding the specific element):
for i in xtree.findall(".//"):
    if 'DonorAdvisedFundInd' in i.tag:
        print(i.tag, i.attrib, i.text)
Another way would be to use the xmltodict library, like this:
import xmltodict

result = xmltodict.parse(xml_data)
result['Return']['ReturnData']['IRS990']['DonorAdvisedFundInd']
I would like to automate the download of CSV files from the World Bank's dataset.
My problem is that the URL corresponding to a specific dataset does not lead directly to the desired CSV file but is instead a query to the World Bank's API. As an example, this is the URL to get the GDP per capita data: http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv.
If you paste this URL in your browser, it will automatically start the download of the corresponding file. As a consequence, the code I usually use to collect and save CSV files in Python is not working in the present situation:
baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen("%s" %(baseUrl))
myData = csv.reader(remoteCSV)
How should I modify my code in order to download the file coming from the query to the API?
This will download the zip, open it, and get you a csv reader for whichever file you want.
import urllib2
import StringIO
from zipfile import ZipFile
import csv

baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen(baseUrl)

# A StringIO object lets us work on the result of the request (a string)
# as though it were a file.
sio = StringIO.StringIO()
sio.write(remoteCSV.read())

# Create a ZipFile object over that in-memory buffer
z = ZipFile(sio, 'r')

# A list with the names of all the files in the zip you just downloaded
print z.namelist()

# Use z.namelist()[1] to refer to 'ny.gdp.pcap.cd_Indicator_en_csv_v2.csv'
with z.open(z.namelist()[1]) as f:  # opens the 2nd file in the zip
    csvr = csv.reader(f)
    for row in csvr:
        print row
For more information see ZipFile Docs and StringIO Docs
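Since that answer predates Python 3, here is a sketch of the same approach for Python 3 (an addition, not part of the original answer): io.BytesIO replaces StringIO, urllib.request replaces urllib2, and the zip member has to be decoded to text for csv.reader.

import csv
import io
import urllib.request
from zipfile import ZipFile

baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
with urllib.request.urlopen(baseUrl) as resp:
    z = ZipFile(io.BytesIO(resp.read()))  # the response body is bytes

print(z.namelist())
with z.open(z.namelist()[1]) as f:
    # ZipFile.open yields bytes, so wrap it to get text for csv.reader
    for row in csv.reader(io.TextIOWrapper(f, encoding='utf-8')):
        print(row)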
import os
import urllib
import zipfile
from StringIO import StringIO

package = StringIO(urllib.urlopen("http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv").read())
zip = zipfile.ZipFile(package, 'r')
pwd = os.path.abspath(os.curdir)
for filename in zip.namelist():
    csv = os.path.join(pwd, filename)
    with open(csv, 'w') as fp:
        fp.write(zip.read(filename))
    print filename, 'downloaded successfully'
From here you can use your approach to handle CSV files.
We have a script to automate access and data extraction for World Bank World Development Indicators like: https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS
The script does the following:
Downloading the metadata and data
Extracting metadata and data
Converting to a Data Package
The script is Python-based and uses Python 3. It has no dependencies outside of the standard library. Try it:
python scripts/get.py
python scripts/get.py https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS
You can also read our analysis of World Bank data:
https://datahub.io/awesome/world-bank
Just a suggestion rather than a solution: you can use pd.read_csv to read a CSV file directly from a URL, as long as the URL points at an actual CSV file.
import pandas as pd
data = pd.read_csv('http://url_to_the_csv_file')
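Since the World Bank URL above returns a zip rather than a bare CSV, here is a hedged sketch of combining the two ideas; the 'API_' member-name filter and skiprows=4 are assumptions about the current layout of World Bank CSV exports, not documented guarantees.

import io
import urllib.request
import zipfile
import pandas as pd

url = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
with urllib.request.urlopen(url) as resp:
    z = zipfile.ZipFile(io.BytesIO(resp.read()))

# Assumption: the data file (as opposed to the metadata files) starts with 'API_'
data_name = [n for n in z.namelist() if n.startswith('API_')][0]
# Assumption: the first four rows of the data file are a metadata header
df = pd.read_csv(z.open(data_name), skiprows=4)
print(df.head())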