Python - Iterate through CSV rows and create XML string

I have a CSV file that contains a header row followed by a potentially unlimited number of rows with values. For example:
FieldA,FieldB,FieldC,FieldD
1,asdf,2,ghjk
3,qwer,4,yuio
5,slslkd,,aldkjslkj
What I need to do is for each row, create a quasi-XML string where the elements are labeled as the column name and information within each element is the value of the cell. Using the above as an example, if I iterate through each of the three rows I would end up with these three strings:
<FieldA>1</FieldA><FieldB>asdf</FieldB><FieldC>2</FieldC><FieldD>ghjk</FieldD>
<FieldA>3</FieldA><FieldB>qwer</FieldB><FieldC>4</FieldC><FieldD>yuio</FieldD>
<FieldA>5</FieldA><FieldB>slslkd</FieldB><FieldD>aldkjslkj</FieldD>
The way I am currently doing it is:
for row in r:
    if row['FieldA']:
        fielda = '<FieldA>{0}</FieldA>'.format(row['FieldA'])
    else:
        fielda = ''
    if row['FieldB']:
        fieldb = '<FieldB>{0}</FieldB>'.format(row['FieldB'])
    else:
        fieldb = ''
    if row['FieldC']:
        fieldc = '<FieldC>{0}</FieldC>'.format(row['FieldC'])
    else:
        fieldc = ''
    if row['FieldD']:
        fieldd = '<FieldD>{0}</FieldD>'.format(row['FieldD'])
    else:
        fieldd = ''
    # Compile the string
    final_string = fielda + fieldb + fieldc + fieldd
    # Process further
    do_something(final_string)
As it iterates through each row, this creates the appropriate string and then I can pass it on for further processing.
Is there a better way to achieve what I want, or is my approach the best way? My guess is there is a better, more Pythonic, and more efficient way, but I'm new-ish to Python.
Thanks.

Slightly modified code that fixed the issue I was having. Turned out to be pretty trivial:
with open(csv_file) as f:
    for row in csv.DictReader(f):
        top = Element('event')
        for k, v in row.items():
            child = SubElement(top, k)
            child.text = v
        print(tostring(top))
Thanks for the help!

Python is Batteries Included.
In this case, you can use the csv module and the xml module, with code that looks like this:
# CSV module
import csv
# Stuff from the XML module
from xml.etree.ElementTree import Element, SubElement, tostring
# Topmost XML element
top = Element('top')
# Open a file
with open('stuff.csv') as csvfile:
    # And use a dictionary-reader
    for d in csv.DictReader(csvfile):
        # For each mapping in the dictionary
        for (k, v) in d.items():
            # Create an XML node
            child = SubElement(top, k)
            child.text = v
print(tostring(top))

'Top' is just the highest level node -- you could use whatever text you want to wrap the whole document.
You can pretty-print it pretty simply as well:
http://pymotw.com/2/xml/etree/ElementTree/create.html#pretty-printing-xml
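
If the goal is the exact per-row string from the question (empty cells skipped, no wrapping element), a generic loop over the columns may be enough. The following is only a sketch for Python 3: the helper name row_to_xml and the in-memory sample are made up for illustration, and escaping the values with xml.sax.saxutils.escape is an added assumption that the original code does not make.

```python
import csv
import io
from xml.sax.saxutils import escape


def row_to_xml(row):
    """Build '<Col>value</Col>' fragments for every non-empty cell of one row."""
    return ''.join(
        '<{0}>{1}</{0}>'.format(name, escape(value))
        for name, value in row.items()
        if value  # skip empty cells, as in the question's if/else chain
    )


# In-memory stand-in for the CSV file from the question (hypothetical sample).
sample = io.StringIO(
    "FieldA,FieldB,FieldC,FieldD\n"
    "1,asdf,2,ghjk\n"
    "5,slslkd,,aldkjslkj\n"
)
strings = [row_to_xml(row) for row in csv.DictReader(sample)]
for s in strings:
    print(s)
```

Because the column names come from the header row, this works for any number of columns without one if/else pair per field.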

Related

Problem skipping line whilst iterating using previous line and current line comparison

I have a list of sorted data arranged so that each item in the list is a csv line to be written to file.
The final step of the script checks the contents of each field and if all but the last field match then it will copy the current line's last field onto the previous line's last field.
Once I've found and processed one of these matches, I would like to skip the current line (the one the field was copied from), so that only one of the two lines remains.
Here's an example set of data
field1,field2,field3,field4,something
field1,field2,field3,field4,else
Desired output
field1,field2,field3,field4,something else
This is what I have so far
output_csv = ['field1,field2,field3,field4,something',
              'field1,field2,field3,field4,else']
# run through the output
# open and create a csv file to save output
with open('output_table.csv', 'w') as f:
    previous_line = None
    part_duplicate_line = None
    part_duplicate_flag = False
    for line in output_csv:
        part_duplicate_flag = False
        if previous_line is not None:
            previous = previous_line.split(',')
            current = line.split(',')
            if (previous[0] == current[0]
                    and previous[1] == current[1]
                    and previous[2] == current[2]
                    and previous[3] == current[3]):
                print(previous[0], current[0])
                previous[4] = previous[4].replace('\n', '') + ' ' + current[4]
                part_duplicate_line = ','.join(previous)
                part_duplicate_flag = True
                f.write(part_duplicate_line)
            if part_duplicate_flag is False:
                f.write(previous_line)
        previous_line = line
At the moment the script adds the merged line but doesn't skip the next line. I've tried various placements of continue statements after part_duplicate_line is written to the file, but to no avail.

It looks like you want one entry for each combination of the first four fields.
You can use a dict to aggregate the data:
#First we extract the key and values
output_csv_keys = list(map(lambda x: ','.join(x.split(',')[:-1]), output_csv))
output_csv_values = list(map(lambda x: x.split(',')[-1], output_csv))
#Then we construct a dictionary with these keys and combine the values into a list
from collections import defaultdict
output_csv_dict = defaultdict(list)
for key, value in zip(output_csv_keys, output_csv_values):
    output_csv_dict[key].append(value)
#Then we extract the key/value combinations from this dictionary into a list
for_printing = [','.join([k, ' '.join(v)]) for k, v in output_csv_dict.items()]
print(for_printing)
#Output is ['field1,field2,field3,field4,something else']
#Each entry of this list can be output to the csv file

I propose encapsulating what you want to do in a function whose important part follows this logic:
either join the new info to the old record,
or output the old record and forget it.
Of course, at the end of the loop there is always a dangling old record left to output.
def join(inp_fname, out_fname):
    '''Input file contains sorted records, when two (or more) records differ
    only in the last field, we join the last fields with a space
    and output only once, otherwise output the record as-is.'''
    ######################### Prepare for action ##########################
    from csv import reader, writer
    with open(inp_fname) as finp, open(out_fname, 'w') as fout:
        r, w = reader(finp), writer(fout)
        ######################### Important Part starts here ##############
        old = next(r)
        for new in r:
            if old[:-1] == new[:-1]:
                old[-1] += ' '+new[-1]
            else:
                w.writerow(old)
                old = new
        w.writerow(old)
To check what I've proposed you can use these two snippets (note that these records are shorter than yours, but it's an example and it doesn't matter because we use only -1 to index our records).
The 1st one has a "regular" last record
open('a0.csv', 'w').write('1,1,2\n1,1,3\n1,2,0\n1,3,1\n1,3,2\n3,3,0\n')
join('a0.csv', 'a1.csv')
while the 2nd has a last record that must be joined to the previous one.
open('b0.csv', 'w').write('1,1,2\n1,1,3\n1,2,0\n1,3,1\n1,3,2\n')
join('b0.csv', 'b1.csv')
If you run the snippets, as I have done before posting, in the environment where you have defined join you should get what you want.
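
The same "merge neighbouring records that differ only in the last field" idea can also be sketched with itertools.groupby, which groups consecutive items sharing a key. This is an alternative sketch for Python 3, not the code from either answer; the function name merge_adjacent is made up for illustration, and it assumes the input is already sorted and that no field contains an embedded comma.

```python
from itertools import groupby


def merge_adjacent(lines):
    """Merge consecutive CSV lines that agree on every field except the last."""
    merged = []
    # Split each line into fields, dropping any trailing newline first.
    rows = (line.rstrip('\n').split(',') for line in lines)
    # groupby only groups *consecutive* rows, so the input must be sorted.
    for prefix, group in groupby(rows, key=lambda fields: fields[:-1]):
        last_fields = [fields[-1] for fields in group]
        merged.append(','.join(prefix + [' '.join(last_fields)]))
    return merged


output_csv = ['field1,field2,field3,field4,something',
              'field1,field2,field3,field4,else']
print(merge_adjacent(output_csv))
```

Each group produces exactly one output line, so there is no need to track a "previous line" or to skip anything explicitly.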

Python: dictionary to collection

I have a file with 2 columns:
Anzegem Anzegem
Gijzelbrechtegem Anzegem
Ingooigem Anzegem
Aalst Sint-Truiden
Aalter Aalter
The first column is a town and the second column is the district of that town.
I made a dictionary of that file like this:
def readTowns(text):
    input = open(text, 'r')
    file = input.readlines()
    dict = {}
    verzameling = set()
    for line in file:
        tmp = line.split()
        dict[tmp[0]] = tmp[1]
    return dict
If I set a variable 'writeTown' equal to readTowns(text) and then evaluate writeTown['Anzegem'], I want to get the collection {'Anzegem', 'Gijzelbrechtegem', 'Ingooigem'}.
Does anybody know how to do this?

I think you can just write another function that builds the data structure you need. Otherwise you will end up writing code that manipulates the dictionary returned by readTowns to get the data you want, so why not keep the code clean and create a separate function for that? Just build a dictionary that maps each district name to a list of towns and you are all set.
def writeTowns(text):
    input = open(text, 'r')
    file = input.readlines()
    dict = {}
    for line in file:
        tmp = line.split()
        dict[tmp[1]] = dict.get(tmp[1]) or []
        dict.get(tmp[1]).append(tmp[0])
    return dict
writeTown = writeTowns('file.txt')
print(writeTown['Anzegem'])
And if you are concerned about reading the same file twice, you can do something like this as well,
def readTowns(text):
    input = open(text, 'r')
    file = input.readlines()
    dict2town = {}
    town2dict = {}
    for line in file:
        tmp = line.split()
        dict2town[tmp[0]] = tmp[1]
        town2dict[tmp[1]] = town2dict.get(tmp[1]) or []
        town2dict.get(tmp[1]).append(tmp[0])
    return dict2town, town2dict
dict2town, town2dict = readTowns('file.txt')
print(town2dict['Anzegem'])

You could do something like this, although please have a look at @ubadub's answer; there are better ways to organise your data.
[town for town, region in dic.items() if region == 'Anzegem']

It sounds like you want to make a dictionary where the keys are the districts and the values are a list of towns.
A basic way to do this is:
def readTowns(text):
    with open(text, 'r') as f:
        lines = f.readlines()
    my_dict = {}
    for line in lines:
        tmp = line.split()
        if tmp[1] in my_dict:
            my_dict[tmp[1]].append(tmp[0])
        else:
            my_dict[tmp[1]] = [tmp[0]]
    return my_dict
The if/else block can also be replaced with Python's collections.defaultdict, a dict subclass, but I've used the if/else statements here for readability.
Some other points: the names dict and file are Python types, so it is bad practice to overwrite them with your own local variables (notice I've changed dict to my_dict in the code above).

If you build your dictionary as {town: district}, so the town is the key and the district is the value, you can't do this easily*, because a dictionary is not meant to be used in that way. Dictionaries allow you to easily find the values associated with a given key. So if you want to find all the towns in a district, you are better off building your dictionary as:
{district: [list_of_towns]}
So for example the district Anzegem would appear as {'Anzegem': ['Anzegem', 'Gijzelbrechtegem', 'Ingooigem']}
And of course the value is your collection.
*you could probably do it by iterating through the entire dict and checking where your matches occur, but this isn't very efficient.
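
Since the question asks for a collection like {'Anzegem', 'Gijzelbrechtegem', 'Ingooigem'}, the {district: towns} layout can be sketched with collections.defaultdict(set). This is a minimal Python 3 sketch, not code from any of the answers: the function name towns_by_district is made up for illustration, and it takes an iterable of lines (an open file works too) rather than a file name so it can be tried without a file on disk.

```python
from collections import defaultdict


def towns_by_district(lines):
    """Map each district to the set of towns that belong to it."""
    districts = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        town, district = parts
        districts[district].add(town)
    return districts


sample_lines = [
    "Anzegem Anzegem\n",
    "Gijzelbrechtegem Anzegem\n",
    "Ingooigem Anzegem\n",
    "Aalst Sint-Truiden\n",
    "Aalter Aalter\n",
]
districts = towns_by_district(sample_lines)
print(districts['Anzegem'])
```

Note that whitespace splitting assumes town and district names contain no spaces, as in the sample data.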

Writing out comma separated values in a single cell in spreadsheet

I am cataloging the attribute fields of each feature class in the input list below, and then writing the output to a spreadsheet that records the occurrence of each attribute in one or more of the feature classes.
import arcpy,collections,re
arcpy.env.overwriteOutput = True
input = [list of feature classes]
outfile= # path to csv file
f=open(outfile,'w')
f.write('ATTRIBUTE,FEATURE CLASS\n\n')
mydict = collections.defaultdict(list)
for fc in input:
    cmp=[]
    lstflds=arcpy.ListFields(fc)
    for fld in lstflds:
        cmp.append(fld.name)
    for item in cmp:
        mydict[item].append(fc)
for keys, vals in mydict.items():
    #remove these characters
    char_removal = ["[","'",",","]"]
    new_char = '[' + re.escape(''.join(char_removal)) + ']'
    v=re.sub(new_char,'', str(vals))
    line=','.join([keys,v])+'\n'
    print(line)
    f.write(line)
f.close()
This code gets me 90% of the way to the intended solution. I still cannot get the feature classes (the values) to be separated by commas within the same cell (since the file is comma-delimited, each value shifts over to the next column). In this particular code, the "v" on line 20 (the feature class names) is output to the spreadsheet separated by spaces (" ") in the same cell. Not a huge deal, because replacing " " with "," can be done very quickly in the spreadsheet itself, but it would be nice to work this into the code to improve reusability.

For a CSV file, use double-quotes around the cell content to preserve interior commas within, like this:
content1,content2,"content3,contains,commas",content4
Generally speaking, many libraries that output CSV just put all contents in quotes, like this:
"content1","content2","content3,contains,commas","content4"
As a side note, I'd strongly recommend using an existing library to create CSV files instead of reinventing the wheel. One such library is built into Python 2.6+.
As they say, "Good coders write. Great coders reuse."
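
As a rough illustration of that advice, the csv module quotes a cell automatically when it contains the delimiter. The sketch below is written for Python 3 and writes to an in-memory buffer instead of a file; the attribute and feature class names (OBJECTID, roads, parcels) are made-up sample values, not data from the question.

```python
import csv
import io

buffer = io.StringIO()  # stands in for the output file
writer = csv.writer(buffer)  # default dialect quotes only when needed

writer.writerow(['ATTRIBUTE', 'FEATURE CLASS'])
# Join the feature class names with commas; the writer adds the quotes.
feature_classes = ['roads', 'parcels']  # hypothetical sample values
writer.writerow(['OBJECTID', ', '.join(feature_classes)])

output = buffer.getvalue()
print(output)
```

With this approach there is no need to strip brackets and quotes from str(vals) with a regular expression; joining the list directly gives the cell text.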

import arcpy,collections,re,csv
arcpy.env.overwriteOutput = True
input = [# list of feature classes]
outfile= # path to output csv file
f=open(outfile,'wb')
csv_write=csv.writer(f)
csv_write.writerow(['Field','Feature Class'])
csv_write.writerow('')
mydict = collections.defaultdict(list)
for fc in input:
    cmp=[]
    lstflds=arcpy.ListFields(fc)
    for fld in lstflds:
        cmp.append(fld.name)
    for item in cmp:
        mydict[item].append(fc)
for keys, vals in mydict.items():
    # remove these characters
    char_removal = ["[","'","]"]
    new_char = '[' + re.escape(''.join(char_removal)) + ']'
    v=re.sub(new_char,'', str(vals))
    csv_write.writerow([keys,""+v+""])
f.close()

Python - reading data from file with variable attributes and line lengths

I'm trying to find the best way to parse through a file in Python and create a list of namedtuples, with each tuple representing a single data entity and its attributes. The data looks something like this:
UI: T020
STY: Acquired Abnormality
ABR: acab
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found
in or deriving from a previously normal structure. Acquired abnormalities are
distinguished from diseases even though they may result in pathological
functioning (e.g., "hernias incarcerate").
HL: {isa} Anatomical Abnormality
UI: T145
RL: exhibits
ABR: EX
RIN: exhibited_by
RTN: R3.3.2
DEF: Shows or demonstrates.
HL: {isa} performs
STL: [Animal|Behavior]; [Group|Behavior]
UI: etc...
While several attributes are shared (e.g. UI), some are not (e.g. STY). However, I could hardcode an exhaustive list of the necessary attributes.
Since each grouping is separated by an empty line, I used split so I can process each chunk of data individually:
input = file.read().split("\n\n")
for chunk in input:
    process(chunk)
I've seen some approaches use string find/splice, itertools.groupby, and even regexes. I was thinking of doing a regex of '[A-Z]*:' to find where the headers are, but I'm not sure how to approach pulling out multiple lines afterwards until another header is reached (such as the multilined data following DEF in the first example entity).
I appreciate any suggestions.

I assumed that if a string spans multiple lines, you want the newlines replaced with spaces (and any additional spaces removed).
import re

def process_file(filename):
    reg = re.compile(r'([\w]{2,3}):\s') # Matches line header
    tmp = '' # Stored/cached data for multiline string
    key = None # Current key
    data = {}
    with open(filename,'r') as f:
        for row in f:
            row = row.rstrip()
            match = reg.match(row)
            # Matches header or is end, put string to list:
            if (match or not row) and key:
                data[key] = tmp
                key = None
                tmp = ''
            # Empty row, next dataset
            if not row:
                # Prevent empty returns
                if data:
                    yield data
                    data = {}
                continue
            # We do have header
            if match:
                key = str(match.group(1))
                tmp = row[len(match.group(0)):]
                continue
            # No header, just append string -> here goes assumption that you want to
            # remove newlines, trailing spaces and replace them with one single space
            tmp += ' ' + row
    # Missed row?
    if key:
        data[key] = tmp
    # Missed group?
    if data:
        yield data
This generator yields a dict with pairs like UI: T020 on each iteration (always containing at least one item).
Since it is a generator and reads the file line by line, it should be efficient even on large files, because it never reads the whole file into memory at once.
Here's a little demo:
for data in process_file('data.txt'):
    print('-'*20)
    for i in data:
        print('%s:'%(i), data[i])
    print()
And actual output:
--------------------
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found in or deriving from a previously normal structure. Acquired abnormalities are distinguished from diseases even though they may result in pathological functioning (e.g., "hernias incarcerate").
STY: Acquired Abnormality
HL: {isa} Anatomical Abnormality
UI: T020
ABR: acab
--------------------
DEF: Shows or demonstrates.
STL: [Animal|Behavior]; [Group|Behavior]
RL: exhibits
HL: {isa} performs
RTN: R3.3.2
UI: T145
RIN: exhibited_by
ABR: EX

source = """
UI: T020
STY: Acquired Abnormality
ABR: acab
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found
in or deriving from a previously normal structure. Acquired abnormalities are
distinguished from diseases even though they may result in pathological
functioning (e.g., "hernias incarcerate").
HL: {isa} Anatomical Abnormality
"""
inpt = source.split("\n") # just emulating a file
import re
reg = re.compile(r"^([A-Z]{2,3}):(.*)$")
output = dict()
current_key = None
current = ""
for line in inpt:
    line_match = reg.match(line) # check if we hit a "CODE: content" line
    if line_match is not None:
        if current_key is not None:
            output[current_key] = current # if so, store the finished value under the previous key
        current_key = line_match.group(1)
        current = line_match.group(2).strip()
    else:
        # if not, the line continues the previous key's value; join with a space so words don't fuse
        current = (current + " " + line).strip()
output[current_key] = current # don't forget the last guy
print(output)

import re
from collections import namedtuple
def process(chunk):
    # Splitting on a captured group yields ['', key1, value1, key2, value2, ...]
    split_chunk = re.split(r'^([A-Z]{2,3}):', chunk, flags=re.MULTILINE)
    d = dict()
    fields = list()
    # Keys sit at the odd indices; each value follows its key
    for i in range(1, len(split_chunk) - 1, 2):
        fields.append(split_chunk[i])
        d[split_chunk[i]] = split_chunk[i+1].strip()
    my_tuple = namedtuple(split_chunk[1], fields)
    return my_tuple(**d)
should do. I think I'd just do the dict though -- why are you so attached to a namedtuple?
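
If a namedtuple is still wanted, one option is to define a single type up front with every known attribute and fill the missing ones with None, since the question mentions that an exhaustive list could be hardcoded. The sketch below assumes Python 3; the field list is only what appears in the sample data, and the names Record and to_record are made up for illustration.

```python
from collections import namedtuple

# Attribute names seen in the sample data; extend as needed (assumption).
FIELDS = ('UI', 'STY', 'RL', 'ABR', 'STN', 'RIN', 'RTN', 'DEF', 'HL', 'STL')
Record = namedtuple('Record', FIELDS)


def to_record(data):
    """Turn one parsed dict into a Record; absent attributes become None."""
    return Record(**{name: data.get(name) for name in FIELDS})


entry = to_record({'UI': 'T020', 'STY': 'Acquired Abnormality', 'ABR': 'acab'})
print(entry.UI, entry.STY, entry.RL)
```

Keys that are not listed in FIELDS are silently dropped by this helper, so every record ends up with the same shape regardless of which attributes a chunk contains.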

problem about split a string

I wrote a program to read a registry entry from a file.
And the entry looks like this:
reg = r'HKEY_LOCAL_MACHINE\SOFTWARE\TT\Tools\SYS\exePath' # it means rootKey=HKEY_LOCAL_MACHINE, subKey='SOFTWARE\TT\Tools\SYS', property=exePath
I want to read this entry from the file and break it into rootKey, subKey and property.
Apparently, I can do it this way:
rootKey = reg.split('\\', 1)[0]
subKey = reg.split('\\', 1)[1].rsplit('\\', 1)[0] #might be a stupid way
property = reg.rsplit('\\', 1)[1]
Maybe the entry is a stupid one, but is there a better way to break it into parts like the above?

import re
t=re.search(r"(.+?)\\(.+)\\(.+)", reg)
t.groups()
('HKEY_LOCAL_MACHINE', 'SOFTWARE\\TT\\Tools\\SYS', 'exePath')

How about doing the following? There's no need to call .split() so many times, anyway...
s = reg.split('\\')
property = s.pop()
root_key = s.pop(0)
sub_key = '\\'.join(s)

I like to use partition over split when I can, because partition ensures each of the returned tuple elements is a string.
root_key, _, s = reg.partition("\\")
sub_key, _, property = s.rpartition("\\") # note, _r_partition
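
The partition approach can be wrapped in a small helper to make the behaviour easy to check. This is a sketch for Python 3; the function name split_registry_path is made up for illustration. str.partition and str.rpartition both return a (head, separator, tail) tuple, so the separator is the middle element in each case.

```python
def split_registry_path(path):
    """Split 'ROOT\\sub\\key\\prop' into (root_key, sub_key, property_name)."""
    # Everything before the first backslash is the root key.
    root_key, _, rest = path.partition('\\')
    # Everything after the last backslash is the property name.
    sub_key, _, property_name = rest.rpartition('\\')
    return root_key, sub_key, property_name


reg = r'HKEY_LOCAL_MACHINE\SOFTWARE\TT\Tools\SYS\exePath'
print(split_registry_path(reg))
```

For an entry with only one backslash the sub key comes back as an empty string, so callers may want to check for that case.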
