I am new to Python and am attempting to process some XML into CSV files for later diff validation against the output from a database. The code I have below does a good job of taking the 'tct-id' attributes from the XML and outputting them in a nice column under the heading 'DocumentID', as I need for my validation.
However, the outputs from the database just come as numbers, whereas the outputs from this code include the version number of the XML ID; for example
tct-id="D-TW-0010054;3;"
where I need the ;3; removed so I can validate properly.
This is the code I have. Is there a way to rewrite it so that it pre-processes each attribute to remove that suffix before writing it to the CSV, for example by taking only the first 12 characters of each value?
from lxml import etree
import csv

xml_fname = 'example.xml'
csv_fname = 'output.csv'
fields = ['tct-id']

xml = etree.parse(xml_fname)

with open(xml_fname) as infile, open(csv_fname, 'w', newline='') as outfile:
    r = csv.DictReader(infile)
    w = csv.DictWriter(outfile, fields, delimiter=';', extrasaction="ignore")
    wtr = csv.writer(outfile)
    wtr.writerow(["DocumentID"])

    for node in xml.xpath("//*[self::StringVariables or self::ElementVariables or self::PubInfo or self::Preface or self::DocHistory or self::Glossary or self::StatusInfo or self::Chapter]"):
        atts = node.attrib
        atts["elm_name"] = node.tag
        w.writerow(node.attrib)
All help is very much appreciated.
Assuming you'll only have one ;3;-type string to remove from the tct-id, you can use regular expressions:
import re
tct_id="D-TW-0010054;3;"
to_rem=re.findall(r'(;.*;)',tct_id)[0]
tct_id=tct_id.replace(to_rem,'')
Note that I'm using tct_id instead of tct-id, since Python doesn't allow hyphens in variable names.
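For completeness, here's a rough sketch of how that could be applied inside the loop from your code before each row is written (this assumes the document ID itself never contains a ';', so splitting on the first ';' is enough and avoids the regex):

for node in xml.xpath("//*[self::StringVariables or self::Chapter]"):  # same idea with the full XPath from your code
    atts = dict(node.attrib)
    # "D-TW-0010054;3;" -> "D-TW-0010054"
    if "tct-id" in atts:
        atts["tct-id"] = atts["tct-id"].split(";")[0]
    w.writerow(atts)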
Related Links:
Sample XML
Tools in use:
Power Automate Desktop (PAD)
Command Prompt (CMD) Session operated by PAD
Python 3.10 (importing NumPy and Pandas, and using ElementTree)
SQL to read data and insert into primary database
Method:
XML to CSV conversion
I'm using Power Automate Desktop (PAD) to automate all of this because it is what I know.
The conversion method uses Python inside a CMD session, importing numpy and pandas and using ElementTree.
Goal(s):
I would like to avoid the namespace being written to headers so SQL can interact with the csv data
I would like to append a master csv file with the new data coming in from every call
This last Goal Item is more of a wishlist and I will likely take it over to a PAD forum. Just including it in the rare event someone has a solution:
(I'd like to avoid using CMD altogether,) but given the limitations of PAD's IronPython support, I have to use a CMD session to run regular Python.
I am using Power Automate Desktop (PAD), which has a Python module. The only issue is that it uses IronPython 2.7, so I cannot access the libraries and don't know how to write the Python code below in IronPython. If I could get this sorted out, the entire process would be much more efficient.
IronPython module operators (again, this would be a nice-to-have; the other goals are the priority).
Issues:
xmlns is being added to some of the headers when writing converted data to csv
The python code is generating a new .csv file each loop, where I would like for it to simply append the data into a common .csv file, to build the dataset.
Details (Probably Unnecessary):
First, I am not a Python expert, and I am pretty novice with SQL as well.
Currently, the Python code (see below) converts a web service call with an XML-formatted body to CSV format, then exports that data to an actual .csv file. This works great, but it overwrites the file each time, which means I need to script reads of this data before the file is deleted/replaced with the new file. I then use SQL to INSERT INTO (append to) a main dataset. It also prints the XMLNS in some of the headers, which is something I need to avoid.
If you are wondering why I am performing the conversion and not simply parsing the XML data, my client requires CSV format for their datasets; otherwise, I'd just parse out the XML data as needed. The other reason is that this data is being updated incrementally at set intervals, building an audit trail, so a bulk conversion is not possible.
What I would like to do is have the Python code perform the conversion & then append a main dataset inside of a csv file.
The other issue here is that the XMLNS from the XML is being pulled into some (not all) of the headers of the CSV table, which has made using SQL to read and insert into the main table an issue. I cannot figure out a way around this (again, novice).
Also, if anyone knows how one would write this in IronPython 2.7, that would be great too because I'd be able to get around using the CMD Session.
So, if I could use Python to append a primary table with the converted data, while escaping the namespace, this would solve all of my (current) issues & would have the added benefit of being 100% more efficient in the movement of this data.
Also, due to the limited toolset I have, I am scripting this with Power Automate through a CMD session.
Python Code (within CMD environment):
cd\WORK\XML
python
import numpy as np
import pandas as pd
from xml.etree import ElementTree

maintree = ElementTree.parse('FILE_XML.xml')
parentroot = maintree.getroot()

all_tags = list(set([elem.tag for elem in parentroot.iter()]))

rows = []
for child in parentroot:
    temp_dict = {}
    for i in all_tags:
        tag_values = {}
        for inners in child.iter(i):
            temp_tag_value = {}
            temp_dict.update(inners.attrib)
            temp_tag_value[i.rsplit("}", 1)[1]] = inners.text
            tag_values.update(temp_tag_value)
        temp_dict.update(tag_values)
    rows.append(temp_dict)

dataframe = pd.DataFrame.from_dict(rows, orient='columns')
dataframe = dataframe.replace({np.nan: None})
dataframe.to_csv('FILE_TABLE_CMD.csv', index=False)
Given that no analysis is needed, avoid pandas and consider building the CSV with csv.DictWriter, where you can open the output file in append mode. The code below parses all descendants of <ClinicalData> and migrates each set into a CSV row.
from csv import DictWriter
from xml.etree import ElementTree

maintree = ElementTree.parse('FILE_XML.xml')
parentroot = maintree.getroot()
nmsp = {"doc": "http://www.cdisc.org/ns/odm/v1.3"}

# RETRIEVE ALL ELEMENT TAGS
all_tags = list(set([elem.tag for elem in parentroot.iter()]))

# RETRIEVE ALL ATTRIB KEYS
all_keys = [list(elem.attrib.keys()) for elem in maintree.iter()]

# UNNEST AND DE-DUPLICATE
all_keys = set(sum([key for key in all_keys], []))

# COMBINE ELEM AND ATTRIB NAMES
all_tags = all_tags + list(all_keys)
all_tags = [(tag.split('}')[1] if '}' in tag else tag) for tag in all_tags]

# APPEND TO EXISTING DATA WITH 'a'
with open('FILE_TABLE_CMD.csv', 'a', newline='') as f:
    writer = DictWriter(f, fieldnames=all_tags)
    writer.writeheader()

    # ITERATE THROUGH ALL ClinicalData ELEMENTS
    for cd in parentroot.findall('doc:ClinicalData', namespaces=nmsp):
        temp_dict = {}
        # ITERATE THROUGH ALL DESCENDANTS
        for elem in cd.iter():
            # UPDATE DICT FOR ELEMENT TAG/TEXT
            temp_dict[elem.tag.split("}", 1)[1]] = elem.text
            # MERGE ELEM DICT WITH ATTRIB DICT
            temp_dict = {**temp_dict, **elem.attrib}

        # REMOVE NAMESPACES IN KEYS
        temp_dict = {
            (k.split('}')[1] if '}' in k else k): v
            for k, v in temp_dict.items()
        }

        # WRITE ROW TO CSV
        writer.writerow(temp_dict)
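One caveat with the append approach above (an extra note, not part of the original snippet): opening the file with 'a' and calling writeheader() on every run will repeat the header row each time the script runs. A minimal, hedged refinement is to write the header only when the file is new or empty:

import os

write_header = (not os.path.exists('FILE_TABLE_CMD.csv')
                or os.path.getsize('FILE_TABLE_CMD.csv') == 0)

with open('FILE_TABLE_CMD.csv', 'a', newline='') as f:
    writer = DictWriter(f, fieldnames=all_tags)
    if write_header:
        writer.writeheader()
    # ... same ClinicalData loop and writer.writerow(temp_dict) as above ...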
Actually, you can use the new iterparse feature of pandas.read_xml in the latest v1.5. Though intended for very large XML, this feature allows parsing any underlying element or attribute without the relationship restrictions required of XPath.
You will still need to find all element and attribute names. The CSV output will differ from the above, since this method removes any all-empty columns and retains the order of elements/attributes as presented in the XML. Also, pandas.DataFrame.to_csv does support append mode, but conditional logic may be needed for writing headers.
import os
from xml.etree import ElementTree
import pandas as pd   # VERSION 1.5+

maintree = ElementTree.parse('FILE_XML.xml')
parentroot = maintree.getroot()

# RETRIEVE ALL ELEMENT TAGS
all_tags = list(set([elem.tag for elem in parentroot.iter()]))

# RETRIEVE ALL ATTRIB KEYS
all_keys = [list(elem.attrib.keys()) for elem in maintree.iter()]

# UNNEST AND DE-DUPLICATE
all_keys = set(sum([key for key in all_keys], []))

# COMBINE ELEM AND ATTRIB NAMES
all_tags = all_tags + list(all_keys)
all_tags = [(tag.split('}')[1] if '}' in tag else tag) for tag in all_tags]

clinical_data_df = pd.read_xml(
    "FILE_XML.xml", iterparse={"ClinicalData": all_tags}, parser="etree"
)

if os.path.exists("FILE_TABLE_CMD.csv"):
    # APPEND TO CSV WITHOUT HEADERS
    clinical_data_df.to_csv("FILE_TABLE_CMD.csv", index=False, mode="a", header=False)
else:
    # CREATE CSV WITH HEADERS
    clinical_data_df.to_csv("FILE_TABLE_CMD.csv", index=False)
I recommend splitting the large XML into branches and parsing those parts separately. This can be done with an object. The object can also hold the data until it is written to the CSV or to a database like sqlite3, MySQL, etc. The object can also be called from a different thread.
I have not defined the CSV writing, because I don't know which data you would like to capture, but I think you can finish that part easily.
Here is my recommended concept:
import xml.etree.ElementTree as ET
import pandas as pd
#import re

class ClinicData:

    def __init__(self, branch):
        self.clinical_data = []
        self.tag_list = []
        for elem in branch.iter():
            self.tag_list.append(elem.tag)
            #print(elem.tag)

    def parse_cd(self, branch):
        for elem in branch.iter():
            if elem.tag in self.tag_list:
                print(f"{elem.tag}--->{elem.attrib}")
            if elem.tag == "{http://www.cdisc.org/ns/odm/v1.3}AuditRecord":
                AuditRec_val = pd.read_xml(ET.tostring(elem))
                print(AuditRec_val)
        branch.clear()

def main():
    """Parse each ClinicalData branch into the class."""
    xml_file = 'Doug_s.xml'
    #tree = ET.parse(xml_file)
    #root = tree.getroot()
    #ns = re.match(r'{.*}', root.tag).group(0)
    #print("Namespace:", ns)
    parser = ET.XMLPullParser(['end'])
    with open(xml_file, 'r', encoding='utf-8') as et_xml:
        for line in et_xml:
            parser.feed(line)
            for event, elem in parser.read_events():
                if elem.tag == "{http://www.cdisc.org/ns/odm/v1.3}ClinicalData" and event == 'end':
                    #print(event, elem.tag)
                    elem_part = ClinicData(elem)
                    elem_part.parse_cd(elem)

if __name__ == "__main__":
    main()
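Since the CSV writing is left undefined above, here is one possible hedged sketch of how it could be added; the column naming scheme and the output path FILE_TABLE_CMD.csv are assumptions for illustration, and it presumes every ClinicalData branch carries roughly the same attributes:

import csv
import os

NS = "{http://www.cdisc.org/ns/odm/v1.3}"

def write_cd_row(branch, csv_path="FILE_TABLE_CMD.csv"):
    """Collect attribute values from one ClinicalData branch and append them as a single CSV row."""
    row = {}
    for elem in branch.iter():
        tag = elem.tag.replace(NS, "")            # drop the namespace prefix from the tag
        for key, value in elem.attrib.items():
            row[f"{tag}.{key}"] = value           # e.g. a hypothetical "AuditRecord.UserOID" column
    new_file = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(row), extrasaction="ignore")
        if new_file:
            writer.writeheader()
        writer.writerow(row)

A call to write_cd_row(elem) could then sit just before elem_part.parse_cd(elem) in main(), since parse_cd clears the branch afterwards.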
I've got a bunch of .dcm files (DICOM files) from which I would like to extract the header and save the information in a CSV file.
As you can see in the following picture, I've got a problem with the delimiters:
For example when looking at the second line in the picture: I'd like to split it like this:
0002 | 0000 | File Meta Information Group Length | UL | 174
But as you can see, there are not only multiple delimiters, but sometimes a space acts as one and sometimes it doesn't. Also, the length of the third column varies, so sometimes there is only shorter text there, e.g. Image Type further down in the picture.
Does anyone have a clever idea, how to write it in a CSV file?
I use pydicom to read and display the files in my IDE.
I'd be very thankful for any advice :)
I would suggest going back to the data elements themselves and working from there, rather than from the string output (which is really meant for exploring in interactive sessions).
The following code should work for a dataset with no Sequences; it would need some modification to work with sequences:
import csv

import pydicom
from pydicom.data import get_testdata_file

filename = get_testdata_file("CT_small.dcm")  # substitute your own filename here
ds = pydicom.dcmread(filename)

with open('my.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow("Group Elem Description VR value".split())
    for elem in ds:
        writer.writerow([
            f"{elem.tag.group:04X}", f"{elem.tag.element:04X}",
            elem.description(), elem.VR, str(elem.value)
        ])
It may also require a bit of change to make the elem.value part look the way you want, or you may want to set the CSV writer to use quotes around items, etc.
Output looks like:
Group,Elem,Description,VR,value
0008,0005,Specific Character Set,CS,ISO_IR 100
0008,0008,Image Type,CS,"['ORIGINAL', 'PRIMARY', 'AXIAL']"
0008,0012,Instance Creation Date,DA,20040119
0008,0013,Instance Creation Time,TM,072731
...
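If your files do contain Sequences, a hedged extension (my sketch, not from the original code) is to recurse into sequence items and prefix each description with the parent element's name:

def rows_for_dataset(ds, prefix=""):
    """Yield one CSV row per data element, recursing into any Sequence (SQ) items."""
    for elem in ds:
        if elem.VR == "SQ":
            # each item of a sequence is itself a Dataset
            for i, item in enumerate(elem.value):
                yield from rows_for_dataset(item, prefix=f"{prefix}{elem.description()}[{i}].")
        else:
            yield [
                f"{elem.tag.group:04X}", f"{elem.tag.element:04X}",
                prefix + elem.description(), elem.VR, str(elem.value)
            ]

# usage inside the 'with' block above:
#     for row in rows_for_dataset(ds):
#         writer.writerow(row)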
Using Python, I'm trying to create a summary from existing CSV data, and I'm having difficulty extracting data from one of the cells.
The input CSV file:
I want to include only the city name and the file path from the Info 4 column, and I'm expecting a summary like: AlexxxxxyyyyzzzzzNewyork\Folder1\Folder2\Test.txt
The code:
csv_data_out[csv_line_out].append(conten['Name'])
csv_data_out[csv_line_out].append(conten['info 1'])
csv_data_out[csv_line_out].append(conten['info 2'])
csv_data_out[csv_line_out].append(conten['info 3'])
csv_data_out[csv_line_out].append(conten['info 4'])
csv_summary = ("".join(csv_data_out[csv_line_out]))

with open(outputfile, 'wb') as newfile:
    writer = csv.writer(newfile, delimiter=';')
    writer.writerow(csv_columns_out[:])
    writer.writerows(csv_data_out)
newfile.close()
Any idea how to fetch only the required details from the Info 4 column?
Essentially you have a CSV inside a CSV. There's not enough info posted to give a fully complete answer, but here's most of it.
You can take a string and process it as a csv using io.StringIO (or io.BytesIO if a byte string).
#! /usr/bin/env python
# -*- coding: utf-8 -*-

import csv
from io import StringIO

# Create somewhere to put the inputs in case needed later
stored_items = []

with open('data.csv', 'r') as csvfile:
    inputs = csv.reader(csvfile)
    # skip the header row
    next(inputs)
    for row in inputs:
        # Extract the Info 4 column for processing
        f = StringIO(row[4])
        string_file = csv.reader(f, quotechar='"')
        build_string = ""
        for string_row in string_file:
            build_string = f"{string_row[0]}{string_row[1]}"
        # Merge everything into a summary
        summary_string = f"{row[0]}{row[1]}{row[2]}{row[3]}{build_string}"
        # Add all the data back to storage
        stored_items.append((row[0], row[1], row[2], row[3], row[4], summary_string))
        print(summary_string)
The reason why I say there's not enough information posted is because, for example, will the location always be (a), which could have a fixed text replacement, or will it be conditional, e.g. it could be (a) or (b), in which case it would possibly require regex? (My preference is not to use regex unless absolutely necessary.) Also, is it always the first two terms you are after from Info 4, or will the terms be found in different places in the text, etc.? Without seeing more samples of the data it's impossible to answer definitively.
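To illustrate the regex route anyway, here is a hedged example; the sample cell value and the (a)/(b) markers are invented for illustration, since the real layout of Info 4 isn't shown:

import re

# hypothetical Info 4 cell value; the real layout may differ
cell = "(a) Newyork, \\Folder1\\Folder2\\Test.txt"

# drop a leading "(a)" or "(b)" marker, then split the remainder on the first comma
cleaned = re.sub(r"^\([ab]\)\s*", "", cell)
city, path = [part.strip() for part in cleaned.split(",", 1)]
print(city + path)   # Newyork\Folder1\Folder2\Test.txt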
I'm trying to write a script that will convert each email element of an .mbox file into a .csv file. I specifically need the following elements, but if there was a way to "write for each element," that'd be preferred:
To, From, CC'd, BCC'd, Date, Subject, Body
I found a script online that looks to be the start of what I need, along with the documentation for the email module, but I can't find any specifics on how to
identify the different attribute options (to, from, cc'd, etc.)
write them as unique cell values in a .csv.
Here is sample code I found:
import mailbox
import csv

writer = csv.writer(open("clean_mail_B.csv", "wb"))
for message in mailbox.mbox('Saks.mbox'):
    writer.writerow([message['to'], message['from'], message['date']])
To do that you would first need to determine the complete list of possible keys present in all mailbox items. Then you can use that to write the CSV header.
Next you need to get all the key-value pairs from each message using .items(). This can then be converted back into a dictionary and written to your CSV file.
The mailbox library unfortunately does not directly expose the message dictionary, otherwise it would have been possible to write this directly.
import mailbox
import csv

mbox_file = 'sample.mbox'

with open('clean_mail_B.csv', 'w', newline='', encoding='utf-8') as f_output:
    # Create a column for the first 30 message payload sections
    fieldnames = {f'Part{part:02}' for part in range(1, 31)}

    for message in mailbox.mbox(mbox_file):
        fieldnames.update(message.keys())

    csv_output = csv.DictWriter(f_output, fieldnames=sorted(fieldnames), restval='', extrasaction='ignore')
    csv_output.writeheader()

    for message in mailbox.mbox(mbox_file):
        items = dict(message.items())
        for part, payload in enumerate(message.get_payload(), start=1):
            items[f'Part{part:02}'] = payload
        csv_output.writerow(items)
A DictWriter is used rather than a standard CSV writer. This copes better when certain messages do not contain all possible header values.
The message payload can be in multiple parts; these are added as separate column headers, e.g. Part01, Part02. Normally there should be 1 or 2, but your sample mbox contained one with a strange signature containing 25.
If the mbox contains more payload entries for a message (i.e. >30), these are ignored using extrasaction='ignore'. An alternative approach would be to combine all payloads into a single column.
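As a hedged illustration of that alternative (with a made-up 'Body' column added to the fieldnames in place of the PartNN columns), the per-message loop could be changed along these lines:

for message in mailbox.mbox(mbox_file):
    items = dict(message.items())
    if message.is_multipart():
        # join the payloads of all sub-parts into one text blob
        items['Body'] = "\n".join(str(part.get_payload()) for part in message.get_payload())
    else:
        items['Body'] = message.get_payload()
    csv_output.writerow(items)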
I have created a database in MongoDB with tweets and their sentiment analysis based on tweepy and NLTK. After some experience with mongoexport to create a CSV file with a dataset from this database, I decided to explore other, more flexible options (especially with delimiters other than the comma), for example using Python itself to generate the CSV file. So far I have been able to print the dataset successfully, correcting the ASCII and Unicode problems and using "|" as the delimiter; however, I am struggling to create a CSV file from the printed results. The code so far is as follows:
import json
import csv
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['twitter_db_stream_1']
collection = db['twitter_collection']

data_python = collection.find({"user.location": {"$exists": True}, "user.location": {"$ne": "null"}}, {"created_at": 1, "text": 1, "user.name": 1, "user.location": 1, "geo.coordinates": 1, "sentiment_value": 1, "confidence_value": 1})

for data in data_python:
    print(data['created_at'], '|', data['text'].encode('utf8'), '|', data['user']['name'].encode('utf8'), '|', data['user']['location'], '|', data['sentiment_value'], '|', data['confidence_value'])
The printed results are as follows:
Tue Apr 18 06:51:58 +0000 2017 | b'Samsung Galaxy S8 International Giveaway #androidauth #giveaway | b'Matt Torok' | None | pos | 1.0
I tried to add the following piece of code using csv.writer, based on some examples from tutorials, but it is not working:
csv_file = open('Sentiment_Analisys.csv', 'wb')
writer = csv.writer(csv_file)
fields = [["created_at"], ["text"], ["user.name"], ["user.location"], ["sentiment_value"], ["confidential_value"]]  # field names
writer.writerow(fields)
for data in data_python:
    writer.writerow(data['created_at'], data['text'].encode('utf8'), data['user']['name'].encode('utf8'), data['user']['location'], data['sentiment_value'], data['confidence_value'])
csv_file.close()
Please, could someone give me some guidance in how to create this CSV file from the printing results above?
Thanks a lot!
You appear to have copied a Python 2.x example, but are writing Python 3.x code. The CSV usage is slightly different between these two versions. Also, it is preferable to use a with statement whilst dealing with files, which avoids the need to explicitly close the file at the end.
writerow() takes a list of strings. Your field names were defined as a list of lists, and your data writerow() call needs to be converted to pass a single list, as follows:
field_names = ["created_at", "text", "user.name", "user.location", "sentiment_value", "confidential_value"]

with open('Sentiment_Analisys.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(field_names)

    for data in data_python:
        csv_output.writerow(
            [
                data['created_at'],
                data['text'].encode('utf8', 'ignore'),
                data['user']['name'].encode('utf8'),
                data['user']['location'],
                data['sentiment_value'],
                data['confidence_value']
            ])
Below I would like to share the final code, after getting the support of good friends on Stack Overflow. Mongoexport has its advantages, but if you need some flexibility to define your own delimiter for the CSV file, this code might be interesting. The only problem is that you may lose emoji characters, since they are converted to text codes through UTF-8 encoding. Anyway, depending on your requirements, such a limitation might not be a problem. Compared with the code posted above, there is one difference in the query: "user.location":{"$ne":"null"} was carried over from the Mongo client, but in Python code you should change "null" to None. I hope my journey to find the right code below, and the kind support of my friends in this post, will be useful for someone in the future! Best regards!
import pymongo
import json
import csv
import numpy
import sys
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['twitter_db_stream_1']
collection = db['twitter_collection']

data_python = collection.find({"user.location": {"$exists": True}, "user.location": {"$ne": None}}, {"created_at": 1, "text": 1, "user.name": 1, "user.location": 1, "sentiment_value": 1, "confidence_value": 1})

field_names = ["created_at", "text", "user.name", "user.location", "sentiment_value", "confidential_value"]

with open('Sentiment_Analisys.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output, delimiter="|")
    csv_output.writerow(field_names)

    for data in data_python:
        csv_output.writerow(
            [
                data['created_at'],
                data['text'].encode('utf8', 'ignore'),
                data['user']['name'].encode('utf8'),
                data['user']['location'],
                data['sentiment_value'],
                data['confidence_value']
            ])
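One further hedged refinement (my addition, not part of the posted solution): in Python 3, calling .encode('utf8') produces bytes objects, so the CSV cells end up containing literal b'...' strings, which is also why emoji appear as escape codes. Writing the text fields as plain strings and letting open() handle the encoding keeps the text readable and may preserve emoji as well:

with open('Sentiment_Analisys.csv', 'w', newline='', encoding='utf-8') as f_output:
    csv_output = csv.writer(f_output, delimiter="|")
    csv_output.writerow(field_names)
    for data in data_python:
        csv_output.writerow([
            data['created_at'],
            data['text'],                 # keep as str; open() handles the UTF-8 encoding
            data['user']['name'],
            data['user']['location'],
            data['sentiment_value'],
            data['confidence_value']
        ])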