I have been struggling to parse a horrible txt file. I have managed to extract the information I need into a list:
['OS-EXT-SRV-ATTR:host', 'compute-0-4.domain.tld']
['OS-EXT-SRV-ATTR:hostname', 'commvault-vsa-vm']
['OS-EXT-SRV-ATTR:hypervisor_hostname', 'compute-0-4.domain.tld']
['OS-EXT-SRV-ATTR:instance_name', 'instance-00000008']
['OS-EXT-SRV-ATTR:root_device_name', '/dev/vda']
['hostId', '985035a85d3c98137796f5799341fb65df21e8893fd988ac91a03124']
['key_name', '-']
['name', 'Commvault_VSA_VM']
['OS-EXT-SRV-ATTR:host', 'compute-0-28.domain.tld']
['OS-EXT-SRV-ATTR:hostname', 'dummy-vm']
['OS-EXT-SRV-ATTR:hypervisor_hostname', 'compute-0-28.domain.tld']
['OS-EXT-SRV-ATTR:instance_name', 'instance-0000226e']
['OS-EXT-SRV-ATTR:root_device_name', '/dev/hda']
['hostId', '7bd08d963a7c598f274ce8af2fa4f7beb4a66b98689cc7cdc5a6ef22']
['key_name', '-']
['name', 'Dummy_VM']
['OS-EXT-SRV-ATTR:host', 'compute-0-20.domain.tld']
['OS-EXT-SRV-ATTR:hostname', 'mavtel-sif-vsifarvl11']
['OS-EXT-SRV-ATTR:hypervisor_hostname', 'compute-0-20.domain.tld']
['OS-EXT-SRV-ATTR:instance_name', 'instance-00001da6']
['OS-EXT-SRV-ATTR:root_device_name', '/dev/vda']
['hostId', 'dd82c20a014e05fcfb3d4bcf653c30fa539a8fd4e946760ee1cc6f07']
['key_name', 'mav_tel_key']
['name', 'MAVTEL-SIF-vsifarvl11']
I would like to use element 0 of each pair as a header and element 1 as the row value, for example:
OS-EXT-SRV-ATTR:host, OS-EXT-SRV-ATTR:hostname,...., name
compute-0-4.domain.tld, commvault-vsa-vm,....., Commvault_VSA_VM
compute-0-28.domain.tld, dummy-vm,...., Dummy_VM
Here is my code so far:
import re
with open('metadata.txt', 'r') as infile:
    lines = infile.readlines()
    for line in lines:
        if re.search('hostId|properties|OS-EXT-SRV-ATTR:host|OS-EXT-SRV-ATTR:hypervisor_hostname|name', line):
            re.sub("[\t]+", " ", line)
            find = line.strip()
            format = ''.join(line.split()).replace('|', ',')
            list = format.split(',')
            new_list = list[1:-1]
I am very new to Python, so I sometimes run out of ideas on how to make things work.
Looking at your input file, I see that it contains what appears to be output from the openstack nova show command, mixed with other stuff. There are basically two types of lines: valid ones, and invalid ones (duh).
The valid ones have this structure:
'| key | value |'
and the invalid ones have anything else.
So we could define that every valid line
can be split at the | into exactly four parts, of which
the first and the last part must be empty, and the other parts must be filled.
Python can do this (it's called unpacking assignment):
a, b, c, d = [1, 2, 3, 4]
a, b, c, d = some_string.split('|')
which will succeed when the right-hand side has exactly four parts, otherwise it will fail with a ValueError. When we now make sure that a and d are empty, and b and c are not empty - we have a valid line.
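As a minimal sketch of that check in isolation (the helper name split_table_line is hypothetical and not part of the original answer), the unpacking and the emptiness test can be combined like this:

```python
def split_table_line(line):
    """Return (key, value) for a '| key | value |' line, or None if invalid.

    Hypothetical helper, shown only to illustrate the unpacking check.
    """
    try:
        # Unpacking raises ValueError unless there are exactly four parts.
        a, key, val, z = (part.strip() for part in line.split('|'))
    except ValueError:
        return None
    # The outer parts must be empty, the inner parts must be filled.
    if a or z or not key or not val:
        return None
    return key, val


print(split_table_line('| name | Dummy_VM |'))
print(split_table_line('+------+----------+'))
```

The first call returns a (key, value) tuple; the second returns None because the border line does not split into four parts.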
Furthermore we can say, if b equals 'Property' and c equals 'Value', we have hit a header row and what follows must describe a "new record".
This function does exactly that:
def parse_metadata_file(path):
    """ parses a data file generated by `nova show` into records """
    with open(path, 'r', encoding='utf8') as file:
        record = {}
        for line in file:
            try:
                # unpack line into 4 fields: "| key | val |"
                a, key, val, z = map(str.strip, line.split('|'))
                if a != '' or z != '' or key == '' or val == '':
                    continue
            except ValueError:
                # skip invalid lines
                continue
            if key == 'Property' and val == 'Value':
                # header row: output the current record (if any) and start a new one
                if record:
                    yield record
                    record = {}
            else:
                # write property to current record
                record[key] = val
        # output last record
        if record:
            yield record
It spits out a new dict for each record it finds and disregards all lines that do not pass the sanity check. Effectively this function generates a stream of dicts.
Now we can use the csv module to write this stream of dicts to a CSV file:
import csv
# list of fields we are interested in
fields = ['hostId', 'properties', 'OS-EXT-SRV-ATTR:host', 'OS-EXT-SRV-ATTR:hypervisor_hostname', 'name']
with open('output.csv', 'w', encoding='utf8', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()
    writer.writerows(parse_metadata_file('metadata.txt'))
The CSV module has a DictWriter which is designed to accept dicts as input and write them—according to the given key names—to a CSV row.
With extrasaction='ignore' it does not matter if the current record has more fields than required.
With the fields list it becomes extremely easy to extract a different set of fields.
Configure the writer to suit your needs (docs).
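To see what extrasaction='ignore' does without touching any files, here is a small sketch that writes to an in-memory buffer. The sample records and the reduced field list are made up for illustration:

```python
import csv
import io

# Made-up sample records; the second one lacks 'hostId' on purpose.
records = [
    {'name': 'Dummy_VM', 'hostId': 'abc', 'key_name': '-'},
    {'name': 'Other_VM'},
]

buf = io.StringIO(newline='')
writer = csv.DictWriter(buf, fieldnames=['name', 'hostId'], extrasaction='ignore')
writer.writeheader()
writer.writerows(records)

csv_text = buf.getvalue()
print(csv_text)
```

The extra 'key_name' entry is silently dropped, and the missing 'hostId' is written as an empty cell because DictWriter's restval defaults to an empty string.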
This:
writer.writerows(parse_metadata_file('metadata.txt'))
is a convenient shorthand for
for record in parse_metadata_file('metadata.txt'):
    writer.writerow(record)
You can take a step by step approach to build a 2D array by keeping track of your headers and each entry in the text file.
headers = list(set([entry[0] for entry in data]))  # obtain unique headers
num_rows = 1
for entry in data:  # figuring out how many rows we are going to need
    if entry[0] == 'name':  # name is unique per row so using that
        num_rows += 1
num_cols = len(headers)
mat = [[0 for _ in range(num_cols)] for _ in range(num_rows)]
mat[0] = headers  # add headers as first row
header_lookup = {header: i for i, header in enumerate(headers)}
row = 1
for entry in data:
    header, val = entry[0], entry[1]
    col = header_lookup[header]
    mat[row][col] = val  # add entries to each subsequent row
    if header == 'name':
        row += 1
print(mat)
output:
[['hostId', 'OS-EXT-SRV-ATTR:host', 'name', 'OS-EXT-SRV-ATTR:hostname', 'OS-EXT-SRV-ATTR:instance_name', 'OS-EXT-SRV-ATTR:root_device_name', 'OS-EXT-SRV-ATTR:hypervisor_hostname', 'key_name'], ['985035a85d3c98137796f5799341fb65df21e8893fd988ac91a03124', 'compute-0-4.domain.tld', 'Commvault_VSA_VM', 'commvault-vsa-vm', 'instance-00000008', '/dev/vda', 'compute-0-4.domain.tld', '-'], ['7bd08d963a7c598f274ce8af2fa4f7beb4a66b98689cc7cdc5a6ef22', 'compute-0-28.domain.tld', 'Dummy_VM', 'dummy-vm', 'instance-0000226e', '/dev/hda', 'compute-0-28.domain.tld', '-'], ['dd82c20a014e05fcfb3d4bcf653c30fa539a8fd4e946760ee1cc6f07', 'compute-0-20.domain.tld', 'MAVTEL-SIF-vsifarvl11', 'mavtel-sif-vsifarvl11', 'instance-00001da6', '/dev/vda', 'compute-0-20.domain.tld', 'mav_tel_key']]
If you need to write the new 2D array to a file so it's not as "horrible" :)
with open('output.txt', 'w') as f:
    for lines in mat:
        lines_out = '\t'.join(lines)
        f.write(lines_out)
        f.write('\n')
Looks like a job for pandas:
import pandas as pd
list_to_export = [['OS-EXT-SRV-ATTR:host', 'compute-0-4.domain.tld'],
['OS-EXT-SRV-ATTR:hostname', 'commvault-vsa-vm'],
['OS-EXT-SRV-ATTR:hypervisor_hostname', 'compute-0-4.domain.tld'],
['OS-EXT-SRV-ATTR:instance_name', 'instance-00000008'],
['OS-EXT-SRV-ATTR:root_device_name', '/dev/vda'],
['hostId', '985035a85d3c98137796f5799341fb65df21e8893fd988ac91a03124'],
['key_name', '-'],
['name', 'Commvault_VSA_VM'],
['OS-EXT-SRV-ATTR:host', 'compute-0-28.domain.tld'],
['OS-EXT-SRV-ATTR:hostname', 'dummy-vm'],
['OS-EXT-SRV-ATTR:hypervisor_hostname', 'compute-0-28.domain.tld'],
['OS-EXT-SRV-ATTR:instance_name', 'instance-0000226e'],
['OS-EXT-SRV-ATTR:root_device_name', '/dev/hda'],
['hostId', '7bd08d963a7c598f274ce8af2fa4f7beb4a66b98689cc7cdc5a6ef22'],
['key_name', '-'],
['name', 'Dummy_VM'],
['OS-EXT-SRV-ATTR:host', 'compute-0-20.domain.tld'],
['OS-EXT-SRV-ATTR:hostname', 'mavtel-sif-vsifarvl11'],
['OS-EXT-SRV-ATTR:hypervisor_hostname', 'compute-0-20.domain.tld'],
['OS-EXT-SRV-ATTR:instance_name', 'instance-00001da6'],
['OS-EXT-SRV-ATTR:root_device_name', '/dev/vda'],
['hostId', 'dd82c20a014e05fcfb3d4bcf653c30fa539a8fd4e946760ee1cc6f07'],
['key_name', 'mav_tel_key'],
['name', 'MAVTEL-SIF-vsifarvl11']]
data_dict = {}
for i in list_to_export:
    if i[0] not in data_dict:
        data_dict[i[0]] = [i[1]]
    else:
        data_dict[i[0]].append(i[1])
pd.DataFrame.from_dict(data_dict, orient = 'index').T.to_csv('filename.csv')
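If pandas is not available, the same grouping can be sketched with plain dicts. This is only an illustration: the function name group_pairs is hypothetical, and it assumes a new record starts whenever a key repeats.

```python
def group_pairs(pairs):
    """Group a flat list of [key, value] pairs into one dict per record.

    Assumption: a key that is already present marks the start of a new record.
    """
    records = []
    current = {}
    for key, value in pairs:
        if key in current:
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records


pairs = [
    ['OS-EXT-SRV-ATTR:host', 'compute-0-4.domain.tld'],
    ['name', 'Commvault_VSA_VM'],
    ['OS-EXT-SRV-ATTR:host', 'compute-0-28.domain.tld'],
    ['name', 'Dummy_VM'],
]
for record in group_pairs(pairs):
    print(record)
```

The resulting list of dicts can be passed directly to csv.DictWriter.writerows.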
I am trying to create a function that reads a CSV file into memory as lists. When I run my code, it gives me this error message ("string indices must be integers"). Where am I going wrong?
Below is the code. Thanks for your help.
# create the empty set to carry the values of the columns
Hydropower_heading = []
Solar_heading = []
Wind_heading = []
Other_heading = []
def my_task1_file(filename): # defines the function "my_task1_file"
    with open(filename,'r') as myNew_file: # opens and read the file
        for my_file in myNew_file.readlines(): # loops through the file
            # read the values into the empty set created
            Hydropower_heading.append(my_file['Hydropower'])
            Solar_heading.append(my_file['Solar'])
            Wind_heading.append(my_file['Wind'])
            Other_heading.append(my_file['Other'])
            #Hydropower_heading = int(Hydropower)
            #Solar_heading = int(Solar)
            #Wind_heading = int(Wind)
            #Other_heading = int(Other)
my_task1_file('task1.csv') # calls the csv file into the function
# print the Heading and the column values in a row form
print('Hydropower: ', Hydropower_heading)
print('Solar: ', Solar_heading)
print('Wind: ', Wind_heading)
print('Other: ', Other_heading)
We can read a CSV file column by column using the csv.DictReader class.
Code: (code.py)
import csv
def my_task1_file(filename): # defines the function "my_task1_file"
    Hydropower_heading = []
    Solar_heading = []
    Wind_heading = []
    Other_heading = []
    # the csv docs recommend opening the file with newline=''
    with open(filename, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            # read the values into the lists created above
            Hydropower_heading.append(row['Hydropower'])
            Solar_heading.append(row['Solar'])
            Wind_heading.append(row['Wind'])
            Other_heading.append(row['Other'])
    return Hydropower_heading, Solar_heading, Wind_heading, Other_heading
if __name__ == "__main__":
    Hydropower_heading, Solar_heading, Wind_heading, Other_heading = my_task1_file('task1.csv')
    # print the Heading and the column values in a row form
    print('Hydropower: ', Hydropower_heading)
    print('Solar: ', Solar_heading)
    print('Wind: ', Wind_heading)
    print('Other: ', Other_heading)
task1.csv:
Hydropower,Solar,Wind,Other
5,6,3,8
6,8,5,12
3,6,9,7
Output:
Hydropower: ['5', '6', '3']
Solar: ['6', '8', '6']
Wind: ['3', '5', '9']
Other: ['8', '12', '7']
Explanation:
The __main__ condition will check if the file is running directly. If the file is being run directly by using python code.py, it will execute this portion. Otherwise if we import code.py from another python file, this portion will not be executed.
You can remove the __main__ block as necessary like below. But it is a good practice to separate the methods from executing while importing one python file from another using the __main__ block. Let me know if it clears your confusion.
code.py (without __main__):
import csv
def my_task1_file(filename): # defines the function "my_task1_file"
    Hydropower_heading = []
    Solar_heading = []
    Wind_heading = []
    Other_heading = []
    # the csv docs recommend opening the file with newline=''
    with open(filename, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            # read the values into the lists created above
            Hydropower_heading.append(row['Hydropower'])
            Solar_heading.append(row['Solar'])
            Wind_heading.append(row['Wind'])
            Other_heading.append(row['Other'])
    return Hydropower_heading, Solar_heading, Wind_heading, Other_heading
Hydropower_heading, Solar_heading, Wind_heading, Other_heading = my_task1_file('task1.csv')
print('Hydropower: ', Hydropower_heading)
print('Solar: ', Solar_heading)
print('Wind: ', Wind_heading)
print('Other: ', Other_heading)
References:
csv.DictReader method
__main__ documentation from Python website
Since the error is "string indices must be integers", you must be using a data type that cannot take in a string value as an index. In this segment of your code...
for my_file in myNew_file.readlines():
    Hydropower_heading.append(my_file['Hydropower'])
    Solar_heading.append(my_file['Solar'])
    Wind_heading.append(my_file['Wind'])
    Other_heading.append(my_file['Other'])
... you are using "Hydropower", "Solar", "Wind", and "Other" as index values, which cannot be valid index values of my_file. Each item returned by readlines() is one line of the file, so my_file is a string, and a string can only be indexed by integers (or slices). If you split each line into fields and index those by integer position, the error should not appear anymore.
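A small sketch of the failure and of one way around it, using a made-up line in the shape of task1.csv (the variable names here are only illustrative):

```python
# One line as readlines() would return it: a plain string.
line = "5,6,3,8\n"

try:
    line['Hydropower']          # indexing a str with a str
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Splitting the line gives a list, which can be indexed by position.
header = "Hydropower,Solar,Wind,Other".split(',')
fields = line.strip().split(',')
hydropower = fields[header.index('Hydropower')]
print(hydropower)
```

Looking up the position through the header list keeps the code readable while still indexing with integers.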
I have a record as below:
29 16
A 1.2595034 0.82587254 0.7375044 1.1270138 -0.35065323 0.55985355
0.7200067 -0.889543 0.2300735 0.56767654 0.2789483 0.32296127 -0.6423197 0.26456305 -0.07363393 -1.0788593
B 1.2467299 0.78651106 0.4702038 1.204216 -0.5282698 0.13987103
0.5911153 -0.6729466 0.377103 0.34090135 0.3052503 0.028784657 -0.39129165 0.079238065 -0.29310825 -0.99383247
I want to split the data into key-value pairs, ignoring the top row (29 16).
The output should be something like this:
x = A , B
y = 1.2595034 0.82587254 0.7375044 1.1270138 -0.35065323 0.55985355 0.7200067 -0.889543 0.2300735 0.56767654 0.2789483 0.32296127 -0.6423197 0.26456305 -0.07363393 -1.0788593
1.2467299 0.78651106 0.4702038 1.204216 -0.5282698 0.13987103 0.5911153 -0.6729466 0.377103 0.34090135 0.3052503 0.028784657 -0.39129165 0.079238065 -0.29310825 -0.99383247
I am able to neglect the first line using the below code:
f = open(fileName, 'r')
lines = f.readlines()[1:]
Now how do I separate the rest of the records in Python?
So here's my take :D I expect you'd want to have the numbers parsed as well?
def generate_kv(fileName):
    with open(fileName, 'r') as file:
        # ignore first line
        file.readline()
        for line in file:
            if '' == line.strip():
                # empty line
                continue
            # split on any run of whitespace so repeated spaces do not yield empty strings
            values = line.split()
            try:
                yield values[0], [float(x) for x in values[1:]]
            except ValueError:
                print(f'one of the elements was not a float: {line}')
if __name__ == '__main__':
    x = []
    y = []
    for key, value in generate_kv('sample.txt'):
        x.append(key)
        y.append(value)
    print(x)
    print(y)
This assumes that the values in sample.txt look like this:
% cat sample.txt
29 16
A 1.2595034 0.82587254 0.7375044 1.1270138 -0.35065323 0.55985355 0.7200067 -0.889543 0.2300735 0.56767654 0.2789483 0.32296127 -0.6423197 0.26456305 -0.07363393 -1.0788593
B 1.2467299 0.78651106 0.4702038 1.204216 -0.5282698 0.13987103 0.5911153 -0.6729466 0.377103 0.34090135 0.3052503 0.028784657 -0.39129165 0.079238065 -0.29310825 -0.99383247
and the output:
% python sample.py
['A', 'B']
[[1.2595034, 0.82587254, 0.7375044, 1.1270138, -0.35065323, 0.55985355, 0.7200067, -0.889543, 0.2300735, 0.56767654, 0.2789483, 0.32296127, -0.6423197, 0.26456305, -0.07363393, -1.0788593], [1.2467299, 0.78651106, 0.4702038, 1.204216, -0.5282698, 0.13987103, 0.5911153, -0.6729466, 0.377103, 0.34090135, 0.3052503, 0.028784657, -0.39129165, 0.079238065, -0.29310825, -0.99383247]]
Alternatively, if you want a dictionary, do:
if __name__ == '__main__':
    print(dict(generate_kv('sample.txt')))
That will collect the generated pairs into a dictionary and output:
{'A': [1.2595034, 0.82587254, 0.7375044, 1.1270138, -0.35065323, 0.55985355, 0.7200067, -0.889543, 0.2300735, 0.56767654, 0.2789483, 0.32296127, -0.6423197, 0.26456305, -0.07363393, -1.0788593], 'B': [1.2467299, 0.78651106, 0.4702038, 1.204216, -0.5282698, 0.13987103, 0.5911153, -0.6729466, 0.377103, 0.34090135, 0.3052503, 0.028784657, -0.39129165, 0.079238065, -0.29310825, -0.99383247]}
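If the records are wrapped over several lines, as they appear in the question, each continuation line has to be attached to the most recent label. The following is only a sketch under that assumption. The function name parse_wrapped_records is hypothetical, and it treats any first token that float() rejects as a label, so labels such as "nan" or "inf" would be misread as numbers.

```python
def parse_wrapped_records(lines):
    """Group wrapped numeric rows under the most recent label."""
    records = {}
    current = None
    for line in lines[1:]:              # skip the "29 16" header row
        tokens = line.split()
        if not tokens:
            continue                    # blank line
        try:
            float(tokens[0])
        except ValueError:
            # First token is not a number: treat it as a new label.
            current = tokens[0]
            records[current] = []
            tokens = tokens[1:]
        if current is not None:
            records[current].extend(float(t) for t in tokens)
    return records


# Shortened, made-up sample in the wrapped layout.
sample = [
    "29 16",
    "A 1.25 0.82 0.73",
    "0.72 -0.88",
    "B 1.24 0.78",
    "0.59 -0.67 0.37",
]
print(parse_wrapped_records(sample))
```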
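You can use this script if your file is a text file with one record per line:
filename='file.text'
with open(filename) as f:
    # skip the first line and split every remaining non-empty line into tokens
    data = [line.split() for line in f.readlines()[1:] if line.strip()]
x=[data[0][0],data[1][0]]
y=[data[0][1:],data[1][1:]]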
If you're happy to store the data in a dictionary here is what you can do:
records = dict()
with open(filename, 'r') as f:
    f.readline()  # skip the first line
    for line in f:
        key, value = line.split(maxsplit=1)
        records[key] = value.split()
The structure of records would be:
{
'A': ['1.2595034', '0.82587254', '0.7375044', ... ],
'B': ['1.2467299', '0.78651106', '0.4702038', ... ]
}
What's happening
with ... as f: we're opening the file within a context manager (more info here). This automatically closes the file when the block finishes.
Because the open file keeps track of where it is in the file, we can use f.readline() to move the pointer down a line. (docs)
line.split() turns a string into a list of strings. With the maxsplit=1 argument it splits only once, at the first run of whitespace.
e.g. after x, y = 'foo bar baz'.split(maxsplit=1), x = 'foo' and y = 'bar baz'
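Putting the maxsplit idea together with a float conversion, a single record line could be handled roughly like this (the sample line is shortened and made up):

```python
line = "A 1.25 0.82 -0.35\n"

# Split once: the label goes to key, everything else stays in rest.
key, rest = line.split(maxsplit=1)

# Split the remainder on whitespace and convert each token.
values = [float(v) for v in rest.split()]

print(key, values)
```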
If I understood correctly, you want the numbers to be collected in a list. One way of doing this is:
import string
text = '''
29 16
A 1.2595034 0.82587254 0.7375044 1.1270138 -0.35065323 0.55985355 0.7200067 -0.889543 0.2300735 0.56767654 0.2789483 0.32296127 -0.6423197 0.26456305 -0.07363393 -1.0788593
B 1.2467299 0.78651106 0.4702038 1.204216 -0.5282698 0.13987103 0.5911153 -0.6729466 0.377103 0.34090135 0.3052503 0.028784657 -0.39129165 0.079238065 -0.29310825 -0.99383247
'''
lines = text.split('\n')
x = [
    line[1:].strip().split()
    for i, line in enumerate(lines)
    if line and line[0].lower() in string.ascii_letters]
This will produce a list of lists, where the outer list has one entry for each of A, B, etc. and the inner lists contain the numbers associated with A, B, etc.
This code assumes that you are interested in lines starting with any single letter (case-insensitive).
For more elaborated conditions you may want to look into regular expressions.
Obviously, if your text is in a file, you could substitute lines = ... with:
with open(filepath, 'r') as lines:
    x = ...
Also, if the items in x should not be separated, but rather in a string, you may want to change line[1:].strip().split() with line[1:].strip().
Instead, if you want the numbers as float and not string, you should replace line[1:].strip().split() with [float(value) for value in line[1:].strip().split()].
EDIT:
Alternatively to line[1:].strip().split() you may want to do:
line.split(maxsplit=1)[1].split()
as suggested in some other answer. This would generalize better if the first token is not a single character.
Here is my file that I want to convert to a python dict:
#
# DATABASE
#
Database name FooFileName
Database file FooDBFile
Info file FooInfoFile
Database ID 3
Total entries 8888
I have tried several things and I can't get it to convert to a dict. I ultimately want to be able to pick off the 'Database file' as a string. Thanks in advance.
Here is what I have tried already and the errors:
# ValueError: need more than 1 value to unpack
#d = {}
#for line in json_dump:
#for k,v in [line.strip().split('\n')]:
# for k,v in [line.strip().split(None, 1)]:
# d[k] = v.strip()
#print d
#print d['Database file']
# IndexError: list index out of range
#d = {}
#for line in json_dump:
# line = line.strip()
# parts = [p.strip() for p in line.split('/n')]
# d[parts[0]] = (parts[1], parts[2])
#print d
First you need to isolate the text after the last #. You can do it with a regular expression; re.search will do it:
>>> import re
>>> s="""#
... # DATABASE
... #
... Database name FooFileName
... Database file FooDBFile
... Info file FooInfoFile
... Database ID 3
... Total entries 8888"""
>>> re.search(r'#\n([^#]+)',s).group(1)
'Database name FooFileName\nDatabase file FooDBFile\nInfo file FooInfoFile\nDatabase ID 3\nTotal entries 8888'
In this case you can also just use split: split the text on # and take the last element:
>>> s2=s.split('#')[-1]
Then you can use a dictionary comprehension together with a list comprehension. Note that re.split is a good choice here, because the pattern r' {2,}' splits on runs of two or more spaces:
>>> {k:v for k,v in [re.split(r' {2,}',i) for i in s2.split('\n') if i]}
{'Database name': 'FooFileName', 'Total entries': '8888', 'Database ID': '3', 'Database file': 'FooDBFile', 'Info file': 'FooInfoFile'}
When we split a data line on whitespace, we get a list of three values, so we need three variables to hold the result. We then join the first and second values with a space to form the key, and the third value becomes its value. This may be the simplest approach, but I guess it will get your job done, and it is easy to understand as well.
d = {}
for line in json_dump:
    # skip comment lines and blank lines
    if line.startswith('#') or not line.strip():
        continue
    u, k, v = line.split()
    d[u + " " + k] = v
print(d)
print(d['Database file'])
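If the key is not always exactly two words, one hedged alternative is to split from the right instead, so that only the last token is taken as the value. This sketch assumes the value itself never contains whitespace. The function name parse_key_value_lines is hypothetical.

```python
def parse_key_value_lines(lines):
    """Map everything before the last token of each line to that last token."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue                      # skip blanks and comment lines
        parts = line.rsplit(maxsplit=1)   # split once, from the right
        if len(parts) == 2:
            result[parts[0]] = parts[1]
    return result


sample = [
    "#",
    "# DATABASE",
    "#",
    "Database name FooFileName",
    "Database file FooDBFile",
    "Database ID 3",
    "Total entries 8888",
]
info = parse_key_value_lines(sample)
print(info['Database file'])
```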
EDITED to reflect a line-wise regular expression approach.
Since it appears your file is not tab-delimited, you could use a regular expression to isolate the columns:
import re
#
# The rest of your code that loads up json_dump
#
d = {}
for line in json_dump:
    if line.startswith('#'): continue ## For filtering out comment lines
    line = line.strip()
    #parts = [p.strip() for p in line.split('/n')]
    try:
        (key, value) = re.split(r'\s\s+', line) ## Split the line of input using 2 or more consecutive white spaces as the delimiter
    except ValueError: continue ## Skip malformed lines
    #d[parts[0]] = (parts[1], parts[2])
    d[key] = value
print(d)
This yields this dictionary:
{'Database name': 'FooFileName', 'Total entries': '8888', 'Database ID': '3', 'Database file': 'FooDBFile', 'Info file': 'FooInfoFile'}
Which should allow you to isolate the individual values.
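For reference, here is a self-contained sketch of the same idea that works on a text string rather than on json_dump. The function name parse_info and the column spacing in the sample are assumptions. It relies on the key and value being separated by at least two whitespace characters.

```python
import re


def parse_info(text):
    """Build a dict from lines of the form 'key<2+ spaces>value'."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        # Split once on the first run of two or more whitespace characters.
        parts = re.split(r'\s{2,}', line, maxsplit=1)
        if len(parts) == 2:
            result[parts[0]] = parts[1]
    return result


text = """#
# DATABASE
#
Database name    FooFileName
Database file    FooDBFile
Total entries    8888
"""
info = parse_info(text)
print(info['Database file'])
```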
I have a big parent list containing many lists of tuples like the small example:
[
[('id', 'name', 'Trans'), ('ENS001', 'EGSB', 'TTP')],
[('id', 'name', 'Trans'), ('EN02', 'EHGT', 'GFT')]
]
My goal is to make a text file with several columns. The rows are the
second tuple of each list in the parent list. The first tuple is the same
in all nested lists, and it holds the column names.
I used this code (z is the list above)
rows= [i[1] for i in z]
to get
[('ENS001', 'EGSB', 'TTP'), ('EN02', 'EHGT', 'GFT')]
And this one (which I call “A”)
with open('out.txt','w') as f :
    f.write (' '.join(z[0][0]))
    for i in rows:
        f.write (' '.join(i))
to get the file. But the columns in the file do not come out separated like this:
id name Trans
ENS001 EGSB TTP
EN02 EHGT GFT
You are writing it all on one line, you need to add a newline:
rows = (sub[1] for sub in z)
with open('out.txt','w') as f:
    f.write ("{}\n".format(' '.join(z[0][0]))) # add a newline
    for i in rows:
        f.write ("{}\n".format(' '.join(i))) # newline again
If you always have three elements in your rows and you want them aligned:
rows = [sub[1] for sub in z]
mx_len = 0
for tup in rows:
    mx = len(max(tup[:-1],key=len))
    if mx > mx_len:
        mx_len = mx
with open('out.txt', 'w') as f:
    a, b, c = z[0][0]
    f.write("{:<{mx_len}} {:<{mx_len}} {}\n".format(a, b, c, mx_len=mx_len))
    for a, b, c in rows:
        f.write("{:<{mx_len}} {:<{mx_len}} {}\n".format(a, b, c, mx_len=mx_len))
Output:
id     name   Trans
ENS001 EGSB   TTP
EN02   EHGT   GFT
If the length varies:
with open('out.txt', 'w') as f:
    f.write(("{:<{mx_len}}"*len(z[0][0])).format(*z[0][0], mx_len=mx_len) + "\n")
    for row in rows:
        f.write(("{:<{mx_len}}"*len(row)).format(*row, mx_len=mx_len) + "\n")
If you want to align column with spaces, first you have to determine what each column's width will be -- presumably the length of the longer header or content of each column, e.g:
wids = [len(h) for h in z[0][0]]
for i in rows:
    wids = [max(len(r), w) for w, r in zip(wids, i)]
Then on this basis you can prepare a format string, such as
fmt = ' '.join('%%%ds' % w for w in wids) + '\n'
and finally, you can write things out:
with open('out.txt','w') as f:
    f.write(fmt % z[0][0])
    for i in rows:
        f.write(fmt % i)
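The same per-column width idea can also be sketched with str.ljust, which left-aligns instead of right-aligning. The function name format_table is hypothetical, and the sample data is the small list from the question:

```python
def format_table(header, rows):
    """Return the header and rows as left-aligned, space-separated columns."""
    table = [header, *rows]
    # Width of each column = longest cell in that column (header included).
    widths = [max(len(str(cell)) for cell in column) for column in zip(*table)]
    lines = []
    for row in table:
        padded = (str(cell).ljust(width) for cell, width in zip(row, widths))
        lines.append(' '.join(padded).rstrip())
    return '\n'.join(lines)


z = [
    [('id', 'name', 'Trans'), ('ENS001', 'EGSB', 'TTP')],
    [('id', 'name', 'Trans'), ('EN02', 'EHGT', 'GFT')],
]
rows = [sub[1] for sub in z]
text = format_table(z[0][0], rows)
print(text)
```

Writing text plus a trailing newline to the file then gives one aligned line per record.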
If you want the output to be separated by tabs, you can join on '\t' rather than on ' ', which you are using now. Each row still needs its own trailing newline, so the last line of your code would look like f.write('\t'.join(i) + '\n').
My friend asked me to help him parse an eBay CSV file and save only a couple of important fields, so I thought it would be a good opportunity to learn Python (I write mostly in C for now).
The problem is, the eBay CSV file format is giving me a hard time:
Numer rekordu sprzedaży,Nazwa użytkownika,Imię i nazwisko kupującego,Numer telefonu kupującego,Adres e-mail kupującego,Adres 1 kupującego,Adres 2 kupującego,Miejscowość kupującego,Województwo kupującego,Kod pocztowy kupującego,Kraj kupującego,Numer przedmiotu,Nazwa przedmiotu,Etykieta niestandardowa,Ilość,Cena sprzedaży,Wysyłka i obsługa,Ubezpieczenie,Koszt płatności za pobraniem,Cena łączna,Forma płatności,Data sprzedaży,Data realizacji transakcji,Data zapłaty,Data wysyłki,Opinia wystawiona,Opinia otrzymana,Uwagi własne,Identyfikator transakcji PayPal,Usługa wysyłkowa,Opcja płatności za pobraniem,Identyfikator transakcji,Identyfikator zamówienia,Szczegóły wersji
"610","xxx","John Rodriguez","(860) 000-00000","mail#yahoo.com","0 Branford Ave Bldg 11","","City","CT","00000","Stany Zjednoczone","330972592582","Honda CBR 900 RR","","1","US $21,49","US $5,50","US $0,00","","US $26,99","PayPal","23-03-2014","23-03-2014","23-03-2014","","Nie","","","4EP58","Standard Shipping from outside US","","9639014","",""
"627","yyy","Name","063100000","mail#orange.fr","Rue barillettes","","st main","Rhône","00000","Francja","3311071","Suzuki SV 650","","1","EUR 15,99","EUR 4,00","EUR 0,00","","EUR 19,99","PayPal","31-03-2014","31-03-2014","31-03-2014","","Nie","","","6E03683046","Livraison standard ? partir de l'étranger","","9659014","",""
Pobrano rekordów: 8,,od ,23-03-2014,15:06:14, do ,11-04-2014,14:32:17
Nazwa sprzedawcy: mail#gmail.com
Parsing it with csv.DictReader, as in the manual, puts every line into a list under the key None:
import csv
filename = "SalesHistory.csv"
csvfile = open(filename, encoding="iso-8859-2")
input_file = csv.DictReader(csvfile, quotechar='"', skipinitialspace=True)
for row in input_file:
    print (row)
{None: ['\tNumer rekordu sprzedaży', 'Nazwa użytkownika', 'Imię i nazwisko kupującego', 'Numer telefonu kupującego',
'Adres e-mail kupującego', 'Adres 1 kupującego', 'Adres 2 kupującego', 'Miejscowość kupującego',
'Województwo kupującego', 'Kod pocztowy kupującego', 'Kraj kupującego', 'Numer przedmiotu', 'Nazwa przedmiotu',
'Etykieta niestandardowa', 'Ilość', 'Cena sprzedaży', 'Wysyłka i obsługa', 'Ubezpieczenie',
'Koszt płatności za pobraniem', 'Cena łączna', 'Forma płatności', 'Data sprzedaży',
'Data realizacji transakcji', 'Data zapłaty', 'Data wysyłki', 'Opinia wystawiona', 'Opinia otrzymana',
'Uwagi własne', 'Identyfikator transakcji PayPal', 'Usługa wysyłkowa', 'Opcja płatności za pobraniem',
'Identyfikator transakcji', 'Identyfikator zamówienia', 'Szczegóły wersji']}
instead of the first line being read as the keys for the transactions in the other lines.
I read Python CSV manual, looked at some examples, searched Stack Overflow but I still don't know what to do next - most of them cover more 'standard' version of csv.
Any tips to get me moving in the right direction would be great.
That's odd... your code didn't give me the error that you posted in your question (although I'm using Python 2.7 and you seem to be using 3.x, so maybe that's why).
Also, the file doesn't start with a blank (empty) line, does it? If it does, it'll confuse the csv module: csv.DictReader takes its keys from the first line, so if that line is blank it has no keys to work with. You should "clean" the file before trying to parse it with csv (removing empty lines should do the trick), or you could read it row by row and skip the empty rows. The latter complicates using csv.DictReader, though: you'd have to take the first non-empty row as the keys of your result dictionary and then treat the values of each following row as the values of your result dictionary. I'd just remove the empty lines from the file before parsing it.
In the code below I've added a try/except block to deal with incomplete lines (such as the last 2 lines in your sample file), but even without it, it was working pretty well.
import csv
filename = "SalesHistory.csv"
read_dcts = []
with open(filename, 'r') as csvfile:
    input_file = csv.DictReader(csvfile, quotechar='"', skipinitialspace=True)
    for i, dct in enumerate(input_file):
        try:
            utf_dict=dict((k.decode('utf-8'), v.decode('utf-8')) \
                for k, v in dct.items())
            read_dcts.append(utf_dict)
        except AttributeError:
            print "Weird line %d found" % (i + 1)
# Verify:
for i, dct in enumerate(read_dcts):
    print "Dict %d" % (i + 1)
    for k, v in dct.iteritems():
        print "\t%s: %s" % (k, v)
If I execute the code above, I get:
Weird line 3 found
Weird line 4 found
Dict 1
Opinia otrzymana:
Cena sprzedaży: US $21,49
[ . . . ]
Wysyłka i obsługa: US $5,50
Opcja płatności za pobraniem:
Dict 2
Opinia otrzymana:
Cena sprzedaży: EUR 15,99
[ . . . ]
Wysyłka i obsługa: EUR 4,00
Opcja płatności za pobraniem
I've removed many of the lines loaded, just for clarity's sake but besides that, it should be loading what you wanted.
If you have an update, let me know through a comment.
EDIT:
Just in case the file contains an empty line and you don't want to pre-clean it, you can pretty much do "manually" what the DictReader class does for you (use the first non-empty line as keys, and the rest of the non-empty lines as values):
import csv
filename = "SalesHistory.csv"
read_dcts = []
keys = []
with open(filename, 'r') as csvfile:
    reader = csv.reader(csvfile, quotechar='"', skipinitialspace=True)
    for i, row in enumerate(reader):
        try:
            if len(row) == 0:
                raise IndexError("Row %d is empty. Should skip" % (i + 1))
            if len(keys) == 0:
                keys = [ val.decode('utf-8') for val in row ]
            elif len(row) == len(keys):
                utf_dict = dict(zip(keys, [ val.decode('utf-8') for val in row ]))
                read_dcts.append(utf_dict)
        except (IndexError, AttributeError), e:
            print "Weird line %d found (got %s)" % ((i + 1), e)
# Verify:
for i, dct in enumerate(read_dcts):
    print "Dict %d" % (i + 1)
    for k, v in dct.iteritems():
        print "\t%s: %s" % (k, v)
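For Python 3, where the csv module already yields str values and no decode step is needed, the same manual approach might look roughly like the sketch below. The function name read_records and the tiny in-memory sample are assumptions made for illustration. With a real file you would pass the object returned by open(filename, encoding=..., newline='') instead.

```python
import csv
import io


def read_records(lines):
    """Use the first non-empty row as keys; keep rows with a matching length."""
    reader = csv.reader(lines, quotechar='"', skipinitialspace=True)
    keys = None
    records = []
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue                      # empty or whitespace-only row
        if keys is None:
            keys = [cell.strip() for cell in row]
        elif len(row) == len(keys):
            records.append(dict(zip(keys, row)))
        # rows with a different number of fields (e.g. footers) are skipped
    return records


# Made-up sample: leading blank line, tab before the header, footer line.
sample = io.StringIO(
    '\n'
    '\tid,item,price\n'
    '"610","Honda CBR 900 RR","US $21,49"\n'
    '\n'
    'Footer: 1 record\n'
)
records = read_records(sample)
print(records)
```

Note that a footer row that happens to have as many fields as the header would still be accepted, so a stricter check may be needed for real data.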
A reasonably simple function to read a CSV file, using the first line of the file as keys and the other lines as values.
import csv
def dict_from_csv(filename):
    '''
    (file)->list of dictionaries
    Function to read a csv file and format it to a list of dictionaries.
    The headers are the keys with all other data becoming values
    '''
    #open the file and read it using csv.reader()
    #read the file. for each row that has content add it to list mf
    #the keys for our user dict are the first content line of the file mf[0]
    #the values to our user dict are the other lines in the file mf[1:]
    mf = []
    with open(filename, 'r') as f:
        my_file = csv.reader(f)
        for row in my_file:
            if any(row):
                mf.append(row)
    file_keys = mf[0]
    file_values = mf[1:]
    #Combine the two lists, turning into a list of dictionaries, using the keys list as the key and each row as the values
    my_list = []
    for value in file_values:
        my_list.append(dict(zip(file_keys, value)))
    #return the list of dictionaries
    return my_list
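A shorter variant of the same idea, sketched here with a hypothetical function name, filters out blank lines first and then lets csv.DictReader pair the header with each row. csv.DictReader accepts any iterable of strings, so a generator works as input:

```python
import csv


def dicts_from_lines(lines):
    """Drop blank lines, then let DictReader use the first remaining line as keys."""
    non_blank = (line for line in lines if line.strip())
    return [dict(row) for row in csv.DictReader(non_blank)]


rows = dicts_from_lines(['\n', 'a,b\n', '1,2\n', '\n', '3,4\n'])
print(rows)
```

With a real file, the open file object can be passed as lines. This simple filter assumes that no quoted field contains a blank line of its own.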