I am looking for a way to automate the conversion of CSV to XML.
Here is an example of a CSV file, containing a list of movies:
Here is the file in XML format:
<collection shelf="New Arrivals">
<movie title="Enemy Behind">
<type>War, Thriller</type>
<format>DVD</format>
<year>2003</year>
<rating>PG</rating>
<stars>10</stars>
<description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
<type>Anime, Science Fiction</type>
<format>DVD</format>
<year>1989</year>
<rating>R</rating>
<stars>8</stars>
<description>A schientific fiction</description>
</movie>
<movie title="Trigun">
<type>Anime, Action</type>
<format>DVD</format>
<episodes>4</episodes>
<rating>PG</rating>
<stars>10</stars>
<description>Vash the Stampede!</description>
</movie>
<movie title="Ishtar">
<type>Comedy</type>
<format>VHS</format>
<rating>PG</rating>
<stars>2</stars>
<description>Viewable boredom</description>
</movie>
</collection>
I've tried a few examples where I am able to read the CSV and the XML format in Python using DOM and SAX, but I have yet to find a simple example of the conversion. So far I have:
import csv
f = open('movies2.csv')
csv_f = csv.reader(f)
def convert_row(row):
    return """<movie title="%s">
<type>%s</type>
<format>%s</format>
<year>%s</year>
<rating>%s</rating>
<stars>%s</stars>
<description>%s</description>
</movie>""" % (
        row.Title, row.Type, row.Format, row.Year, row.Rating, row.Stars, row.Description)

print('\n'.join(csv_f.apply(convert_row, axis=1)))
But I get the error:
File "moviesxml.py", line 16, in <module>
print ('\n'.join(csv_f.apply(convert_row, axis=1)))
AttributeError: '_csv.reader' object has no attribute 'apply'
I am pretty new to Python, so any help would be much appreciated!
I am using Python 3.5.2.
Thanks!
Lisa
A possible solution is to first load the csv into pandas and then convert it row by row into XML, like so:
import pandas as pd
df = pd.read_csv('untitled.txt', sep='|')
With the sample data (assuming separator and so on) loaded as:
Title Type Format Year Rating Stars \
0 Enemy Behind War,Thriller DVD 2003 PG 10
1 Transformers Anime,Science Fiction DVD 1989 R 9
Description
0 Talk about...
1 A Schientific fiction
And then converting to xml with a custom function:
def convert_row(row):
    return """<movie title="%s">
<type>%s</type>
<format>%s</format>
<year>%s</year>
<rating>%s</rating>
<stars>%s</stars>
<description>%s</description>
</movie>""" % (
        row.Title, row.Type, row.Format, row.Year, row.Rating, row.Stars, row.Description)

print('\n'.join(df.apply(convert_row, axis=1)))
This way you get a string containing the xml:
<movie title="Enemy Behind">
<type>War,Thriller</type>
<format>DVD</format>
<year>2003</year>
<rating>PG</rating>
<stars>10</stars>
<description>Talk about...</description>
</movie>
<movie title="Transformers">
<type>Anime,Science Fiction</type>
<format>DVD</format>
<year>1989</year>
<rating>R</rating>
<stars>9</stars>
<description>A Schientific fiction</description>
</movie>
which you can dump into a file or wherever you need it.
Inspired by this great answer.
Edit: Using the loading method you posted (or a version that actually loads the data to a variable):
import csv
f = open('movies2.csv')
csv_f = csv.reader(f)
data = []
for row in csv_f:
    data.append(row)
f.close()

print(data[1:])
We get:
[['Enemy Behind', 'War', 'Thriller', 'DVD', '2003', 'PG', '10', 'Talk about...'], ['Transformers', 'Anime', 'Science Fiction', 'DVD', '1989', 'R', '9', 'A Schientific fiction']]
And we can convert to XML with minor modifications:
def convert_row(row):
    return """<movie title="%s">
<type>%s</type>
<format>%s</format>
<year>%s</year>
<rating>%s</rating>
<stars>%s</stars>
<description>%s</description>
</movie>""" % (row[0], row[1], row[2], row[3], row[4], row[5], row[6])

print('\n'.join([convert_row(row) for row in data[1:]]))
Getting similar results (note that the unquoted commas in the Type field shift the columns here, so the fields end up misaligned):
<movie title="Enemy Behind">
<type>War</type>
<format>Thriller</format>
<year>DVD</year>
<rating>2003</rating>
<stars>PG</stars>
<description>10</description>
</movie>
<movie title="Transformers">
<type>Anime</type>
<format>Science Fiction</format>
<year>DVD</year>
<rating>1989</rating>
<stars>R</stars>
<description>9</description>
</movie>
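Note that plain string formatting never escapes characters like & or < that may appear in the data. A sketch of the same conversion using the standard library's xml.etree.ElementTree, which escapes values for you and, together with quoted CSV fields, keeps commas inside a single column (column names assumed from the sample data; the inline CSV stands in for the real file):

```python
import csv
import io
import xml.etree.ElementTree as ET

# Quoted CSV sample, so the comma inside "War, Thriller" stays in one field.
sample = '''Title,Type,Format,Year,Rating,Stars,Description
"Enemy Behind","War, Thriller",DVD,2003,PG,10,Talk about a US-Japan war
'''

collection = ET.Element('collection', shelf='New Arrivals')
for row in csv.DictReader(io.StringIO(sample)):
    movie = ET.SubElement(collection, 'movie', title=row['Title'])
    for field in ('Type', 'Format', 'Year', 'Rating', 'Stars', 'Description'):
        # ElementTree escapes special characters like & and < for us.
        ET.SubElement(movie, field.lower()).text = row[field]

xml_text = ET.tostring(collection, encoding='unicode')
print(xml_text)
```

With a real file you would pass the open file object to csv.DictReader instead of the StringIO wrapper.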
I tried to generalize robertoia's function convert_row for any header instead of writing it by hand.
import csv
import pandas as pd

f = open('movies2.csv')
csv_f = csv.reader(f)
data = []
for row in csv_f:
    data.append(row)
f.close()

df = pd.read_csv('movies2.csv')
header = list(df.columns)

def convert_row(row):
    str_row = """<%s>%s</%s> \n""" * (len(header) - 1)
    str_row = """<%s>%s""" + "\n" + str_row + """</%s>"""
    var_values = [list_of_elements[k] for k in range(1, len(header)) for list_of_elements in [header, row, header]]
    var_values = [header[0], row[0]] + var_values + [header[0]]
    var_values = tuple(var_values)
    return str_row % var_values

text = """<collection shelf="New Arrivals">""" + "\n" + '\n'.join([convert_row(row) for row in data[1:]]) + "\n" + "</collection>"
print(text)
with open('output.xml', 'w') as myfile:
    myfile.write(text)
Of course, with pandas (version 1.3 or newer) it is now simpler to just use to_xml():
df = pd.read_csv('movies2.csv')
with open('outputf.xml', 'w') as myfile:
    myfile.write(df.to_xml())
I found an easier way to insert variables into a string or block of text:
'''Twas brillig and the slithy {what}
Did gyre and gimble in the {where}
All {how} were the borogoves
And the {who} outgrabe.'''.format(what='toves',
where='wabe',
how='mimsy',
who='momeraths')
Alternatively:
'''Twas brillig and the slithy {0}
Did gyre and gimble in the {1}
All {2} were the borogoves
And the {3} outgrabe.'''.format('toves',
'wabe',
'mimsy',
'momeraths')
(substitute name of incoming data variable for 'toves', 'wabe', 'mimsy', and 'momeraths')
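On Python 3.6+, f-strings offer an even shorter spelling of the same idea, with the variables interpolated in place:

```python
# Same substitution as the .format() examples, using an f-string.
what, where, how, who = 'toves', 'wabe', 'mimsy', 'momeraths'
poem = f'''Twas brillig and the slithy {what}
Did gyre and gimble in the {where}
All {how} were the borogoves
And the {who} outgrabe.'''
print(poem)
```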
Related
I am trying to insert a new line into a csv file that I am writing this data into. The data is
data = [[{'Hi': 'O'}, {'mr': 'O'}, {'you': 'O'}, {'president': 'O'}, {'USA': 'Country'}, {'for': 'O'}, {'answering': 'O'}, {'football': 'O'}, {'questions': 'O'}, {'music': 'JAZZ'}], [{'Hi': 'O'}, {'You': 'O'}, {'have': 'O'}, {'granted': 'STATE'}, {'purchased': 'O'}, {'GHC3': 'O'}, {'Bundle': 'O'}, {'248803151': 'O'}]]
This is the code I have but I am not sure how to re-code it to accommodate the new line per array in the data.
def convert_to_biolu(dico, biolu_list=defaultdict(list)):  # dico here is output_data
    for dict_item in dico:  # you can list as many input dicts as you want
        for key, value in dict_item.items():
            if value not in biolu_list[key]:
                biolu_list[key].append(value)
    return biolu_list

def save_to_file(path, data_):
    data_ = [convert_to_biolu(item) for item in data][-1]
    with open(path, 'w', newline='') as file:
        fieldnames = ['word', 'label']
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for key, val in data_.items():
            writer.writerow({'word': key, 'label': " ".join(val)})
You can write CSVs without the csv module. I prefer to do it myself, like this:
def write_csv_with_spaces(data, filename):
    with open(filename, 'w+') as file:
        for group in data:
            for record in group:
                file.write(','.join([str(key) + ',' + str(value) for key, value in record.items()]) + '\n')
            file.write('\n')
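One caveat with joining on ',' by hand: it breaks as soon as a value itself contains a comma. A sketch using csv.writer (variable names are made up), which handles quoting and still writes the blank line after each group:

```python
import csv
import io

# Two groups of word/label pairs, shaped like the question's data.
data = [[{'Hi': 'O'}, {'music': 'JAZZ'}], [{'You': 'O'}, {'have': 'O'}]]

buf = io.StringIO()
writer = csv.writer(buf)
for group in data:
    for item in group:
        for key, value in item.items():
            writer.writerow([key, value])  # quoting is handled if a value contains a comma
    writer.writerow([])  # blank line after each group
print(buf.getvalue())
```

To write straight to disk, replace the StringIO buffer with `open(filename, 'w', newline='')`.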
I have a dataset and I would like to extract the appositive feature from this dataset.
در
همین
حال
،
<coref coref_coref_class="set_0" coref_mentiontype="ne" markable_scheme="coref" coref_coreftype="ident">
نجیب
الله
خواجه
عمری
,
</coref>
<coref coref_coref_class="set_0" coref_mentiontype="np" markable_scheme="coref" coref_coreftype="atr">
سرپرست
وزارت
تحصیلات
عالی
افغانستان
</coref>
گفت
که
در
سه
ماه
گذشته
در
۳۳
ولایت
کشور
<coref coref_coreftype="ident" coref_coref_class="empty" coref_mentiontype="ne" markable_scheme="coref">
خدمات
ملکی
</coref>
از
حدود
۱۴۹
هزار
I want to store the data from the dataset in two lists. In the find_atr list I store the data where the coref tag includes coref_coreftype="atr"; in the find_ident list I want to store the data where coref_coreftype="ident". The last coref tag in this dataset is another coref tag, one with coref_coref_class="empty", and I do not want to store the data from tags with coref_coref_class="empty". In the regex I specified that it should only include tags where coref_coref_class="set_.*?", not coref_coref_class="empty", but it still stores the data from coref_coref_class="empty" tags, where it should only store the coref_coref_class="set_.*?" ones.
How can I avoid this? My code:
i_ident = []
j_atr = []
find_ident = re.findall(r'<coref.*?coref_coref_class="set_.*?coref_mentiontype="ne".*?coref_coreftype="ident".*?>(.*?)</coref>', read_dataset, re.S)
ident_list = list(map(lambda x: x.replace('\n', ' '), find_ident))
for i in range(len(ident_list)):
    i_ident.append(str(ident_list[i]))
find_atr = re.findall(r'<coref.*?coref_coreftype="atr".*?>(.*?)</coref>', read_dataset, re.S)
atr_list = list(map(lambda x: x.replace('\n', ' '), find_atr))
#print(coref_list)
for i in range(len(atr_list)):
    j_atr.append(str(atr_list[i]))
print(i_ident)
print()
print(j_atr)
I reduced your dataset file to:
A
<coref coref_coref_class="set_0" coref_mentiontype="ne" markable_scheme="coref" coref_coreftype="ident">
B
</coref>
<coref coref_coref_class="set_0" coref_mentiontype="np" markable_scheme="coref" coref_coreftype="atr">
C
</coref>
D
<coref coref_coreftype="ident" coref_coref_class="empty" coref_mentiontype="ne" markable_scheme="coref">
E
</coref>
F
And tried this code, which is almost the same as what you provided:
import re
with open("test_dataset.log", "r") as myfile:
    read_dataset = myfile.read()
i_ident = []
j_atr = []
find_ident = re.findall(r'<coref.*?coref_coref_class="set_.*?coref_mentiontype="ne".*?coref_coreftype="ident".*?>(.*?)</coref>', read_dataset, re.S)
ident_list = list(map(lambda x: x.replace('\n', ' '), find_ident))
for i in range(len(ident_list)):
    i_ident.append(str(ident_list[i]))
find_atr = re.findall(r'<coref.*?coref_coreftype="atr".*?>(.*?)</coref>', read_dataset, re.S)
atr_list = list(map(lambda x: x.replace('\n', ' '), find_atr))
#print(coref_list)
for i in range(len(atr_list)):
    j_atr.append(str(atr_list[i]))
print(i_ident)
print()
print(j_atr)
And got this output, which seems right to me:
[' B ']
[' C ']
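If you want to be extra safe, restricting the attribute part of the pattern to `[^>]*` guarantees a match can never run past the `>` that closes one opening tag and pick up attributes from a neighbouring tag. A sketch on the reduced dataset (the pattern drops the mentiontype condition for brevity):

```python
import re

# Reduced dataset: one set_/ident tag (B), one set_/atr tag (C),
# and one empty/ident tag (E) that must not match.
data = '''A
<coref coref_coref_class="set_0" coref_mentiontype="ne" markable_scheme="coref" coref_coreftype="ident">
B
</coref>
<coref coref_coref_class="set_0" coref_mentiontype="np" markable_scheme="coref" coref_coreftype="atr">
C
</coref>
<coref coref_coreftype="ident" coref_coref_class="empty" coref_mentiontype="ne" markable_scheme="coref">
E
</coref>
F'''

# [^>]* cannot cross the ">" that ends a tag, so all the attribute
# conditions are forced to hold inside the *same* opening tag.
pattern = r'<coref[^>]*coref_coref_class="set_[^>]*coref_coreftype="ident"[^>]*>(.*?)</coref>'
matches = [m.strip() for m in re.findall(pattern, data, re.S)]
print(matches)
```

The tag with coref_coref_class="empty" can no longer borrow a "set_" from an earlier tag, so only B survives.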
This is an empty dictionary:
d = {}
This is the csv file data:
M, Max, Sporting, Football, Cricket
M, Jack, Sporting, Cricket, Tennis
M, Kevin, Sporting, Cricket, Basketball
M, Ben, Sporting, Football, Rugby
I tried to use the following code to append data from the csv to the dictionary:
with open('example.csv', "r") as csvfile:
    csv_reader = csv.reader(csvfile)
    for row in csv_reader:
        if row:
            d.setdefault(row[0], {})[row[1]] = {row[2]: [row[3]]}
But it gives me an error:
d.setdefault(row[0], {})[row[1]] = {row[2]: [row[3]]}
IndexError: list index out of range
Is there any way I can add data from the csv to the dictionary in the form:
d = {'M': {'Max': {'Sporting': ['Football', 'Cricket']}, 'Jack': {'Sporting': ['Cricket', 'Tennis']}}}
I am new to this, so any help would be appreciated.
import csv

d = {}
with open('JJ.csv', "r") as csvfile:
    csv_reader = csv.reader(csvfile)
    for row in csv_reader:
        if row:
            d.setdefault(row[0], {})[row[1]] = {row[2]: [row[3], row[4]]}

print(d)
{'M': {' Max': {' Sporting': [' Football', ' Cricket']}, ' Jack': {' Sporting': [' Cricket', ' Tennis']}, ' Kevin': {' Sporting': [' Cricket', ' Basketball']}, ' Ben': {' Sporting': [' Football', ' Rugby']}}}
To remove all the leading/trailing spaces in the output, you can use the line below instead. There might be a better way that I'm not aware of as of now.
d.setdefault(row[0],{})[row[1].strip()] = {row[2].strip(): [row[3].strip(),row[4].strip()]}
You can use a nested collections.defaultdict tree and check if the rows are long enough:
from collections import defaultdict

def tree():
    return defaultdict(tree)

d = tree()

# ...
for row in csv_reader:
    if len(row) >= 3:
        d[row[0]][row[1]][row[2]] = row[3:]
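A self-contained version of the same idea, with the sample rows inlined instead of read from the file:

```python
from collections import defaultdict

def tree():
    # Each missing key materializes another level of defaultdict.
    return defaultdict(tree)

rows = [
    ['M', 'Max', 'Sporting', 'Football', 'Cricket'],
    ['M', 'Jack', 'Sporting', 'Cricket', 'Tennis'],
]

d = tree()
for row in rows:
    if len(row) >= 3:
        d[row[0]][row[1]][row[2]] = row[3:]

print(d['M']['Max']['Sporting'])
```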
Change "for column in csv_reader:" to "for row in csv_reader:"
Straightforwardly:
import csv, collections

with open('example.csv', 'r') as f:
    reader = csv.reader(f, skipinitialspace=True)
    result = collections.defaultdict(dict)
    for r in reader:
        if not result[r[0]].get(r[1]): result[r[0]][r[1]] = {}
        if not result[r[0]][r[1]].get(r[2]):
            result[r[0]][r[1]][r[2]] = r[-2:]

print(dict(result))
The output:
{'M': {'Kevin': {'Sporting': ['Cricket', 'Basketball']}, 'Max': {'Sporting': ['Football', 'Cricket']}, 'Jack': {'Sporting': ['Cricket', 'Tennis']}, 'Ben': {'Sporting': ['Football', 'Rugby']}}}
Data: [(Taru, 1234ABCD, 4536, EF32), (Aarul, 10045660, 4562, ABDE), (Vinay, 1254EFDC, 2587, AC42)] in list form
The output should be like this (tabular form):
Response: Taru 1234ABCD
4536
EF32
Aarul 10045660
4562
ABDE
Vinay 1254EFDC
2587
AC42
Please give your inputs to resolve this query. Thanks!
You can use this small script:
l = [['Taru', '12345678ABCDEF', 453678], ['Aarul', '10045660ABDECABF', 45621278]]
print("HEADER1 HEADER2 HEADER3")
for ele1, ele2, ele3 in l:
    print("{:<14}{:<11}{:13}".format(ele1, ele2, ele3))
Result:
HEADER1 HEADER2 HEADER3
Taru 12345678ABCDEF 453678
Aarul 10045660ABDECABF 45621278
I think your main question is how to split the list you got; this seems to be the pattern to do so.
EDIT: as per the comment, it was mainly about formatting; this is one possible solution:
entries = [["Taru", "1234ABCD", "4536", "EF32"], ["Aarul", "10045660", "4562", "ABDE"], ["Vinay", "1254EFDC", "2587", "AC42"]]

csv = 'Name,information\n'
# this has split your array into the parts you want
for entry in entries:
    left = entry[0]
    for word in entry[1:]:
        print("{:<10}{:<10}".format(left, word))
        csv += str(left) + ',' + str(word) + '\n'
        left = ''
    print()

with open('output.csv', 'w') as file:
    file.write(csv)
OUTPUT:
Taru 1234ABCD
4536
EF32
Aarul 10045660
4562
ABDE
Vinay 1254EFDC
2587
AC42
output.csv:
Name,information
Taru,1234ABCD
,4536
,EF32
Aarul,10045660
,4562
,ABDE
Vinay,1254EFDC
,2587
,AC42
I am new to python and need help. I am trying to make a list of comma separated values.
I have this data.
EasternMountain 84,844 39,754 24,509 286 16,571 3,409 315
EasternHill 346,373 166,917 86,493 1,573 66,123 23,924 1,343
EasternTerai 799,526 576,181 206,807 2,715 6,636 1,973 5,214
CentralMountain 122,034 103,137 13,047 8 2,819 2,462 561
Now, how do I get something like this:
"EasternMountain": 84844,
"EasternHill": 346373,
and so on??
So far I have been able to do this:
fileHandle = open("testData", "r")
data = fileHandle.readlines()
fileHandle.close()
dataDict = {}
for i in data:
    temp = i.split(" ")
    dataDict[temp[0]] = temp[1]
    with_comma = '"' + temp[0] + '"' + ':' + temp[1] + ','
    print with_comma
Use the csv module
import csv
with open('k.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ')
    my_dict = {}
    for row in reader:
        my_dict[row[0]] = [''.join(e.split(',')) for e in row[1:]]
    print my_dict
k.csv is a text file containing:
EasternMountain 84,844 39,754 24,509 286 16,571 3,409 315
EasternHill 346,373 166,917 86,493 1,573 66,123 23,924 1,343
EasternTerai 799,526 576,181 206,807 2,715 6,636 1,973 5,214
CentralMountain 122,034 103,137 13,047 8 2,819 2,462 561
Output:
{'EasternHill': ['346373', '166917', '86493', '1573', '66123', '23924', '1343', ''], 'EasternTerai': ['799526', '576181', '206807', '2715', '6636', '1973', '5214', ''], 'CentralMountain': ['122034', '103137', '13047', '8', '2819', '2462', '561', ''], 'EasternMountain': ['84844', '39754', '24509', '286', '16571', '3409', '315', '']}
Try this:
def parser(file_path):
    d = {}
    with open(file_path) as f:
        for line in f:
            if not line:
                continue
            parts = line.split()
            d[parts[0]] = [part.replace(',', '') for part in parts[1:]]
    return d
Running it:
result = parser("testData")
for key, value in result.items():
    print key, ':', value
Result:
EasternHill : ['346373', '166917', '86493', '1573', '66123', '23924', '1343']
EasternTerai : ['799526', '576181', '206807', '2715', '6636', '1973', '5214']
CentralMountain : ['122034', '103137', '13047', '8', '2819', '2462', '561']
EasternMountain : ['84844', '39754', '24509', '286', '16571', '3409', '315']
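If you later need the counts as numbers rather than strings, a small follow-up sketch (the helper name `to_ints` is made up) converts the de-comma'd values:

```python
def to_ints(d):
    # The parser already stripped the commas, so every entry is a digit string.
    return {key: [int(v) for v in values] for key, values in d.items()}

converted = to_ints({'EasternMountain': ['84844', '39754']})
print(converted)
```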