I have a record as below:
29 16
A 1.2595034 0.82587254 0.7375044 1.1270138 -0.35065323 0.55985355
0.7200067 -0.889543 0.2300735 0.56767654 0.2789483 0.32296127 -0.6423197 0.26456305 -0.07363393 -1.0788593
B 1.2467299 0.78651106 0.4702038 1.204216 -0.5282698 0.13987103
0.5911153 -0.6729466 0.377103 0.34090135 0.3052503 0.028784657 -0.39129165 0.079238065 -0.29310825 -0.99383247
I want to split the data into key-value pairs, ignoring the top row (i.e. 29 16).
The output should be something like this:
x = A , B
y = 1.2595034 0.82587254 0.7375044 1.1270138 -0.35065323 0.55985355 0.7200067 -0.889543 0.2300735 0.56767654 0.2789483 0.32296127 -0.6423197 0.26456305 -0.07363393 -1.0788593
1.2467299 0.78651106 0.4702038 1.204216 -0.5282698 0.13987103 0.5911153 -0.6729466 0.377103 0.34090135 0.3052503 0.028784657 -0.39129165 0.079238065 -0.29310825 -0.99383247
I am able to neglect the first line using the below code:
f = open(fileName, 'r')
lines = f.readlines()[1:]
Now how do I split the rest of the record in Python?
So here's my take :D I expect you'd want to have the numbers parsed as well?
def generate_kv(fileName):
    with open(fileName, 'r') as file:
        # ignore first line
        file.readline()
        for line in file:
            if '' == line.strip():
                # empty line
                continue
            # split on any run of whitespace, so repeated spaces don't produce empty tokens
            values = line.split()
            try:
                yield values[0], [float(x) for x in values[1:]]
            except ValueError:
                print(f'one of the elements was not a float: {line}')

if __name__ == '__main__':
    x = []
    y = []
    for key, value in generate_kv('sample.txt'):
        x.append(key)
        y.append(value)
    print(x)
    print(y)
This assumes that each record in sample.txt sits on a single line, like this:
% cat sample.txt
29 16
A 1.2595034 0.82587254 0.7375044 1.1270138 -0.35065323 0.55985355 0.7200067 -0.889543 0.2300735 0.56767654 0.2789483 0.32296127 -0.6423197 0.26456305 -0.07363393 -1.0788593
B 1.2467299 0.78651106 0.4702038 1.204216 -0.5282698 0.13987103 0.5911153 -0.6729466 0.377103 0.34090135 0.3052503 0.028784657 -0.39129165 0.079238065 -0.29310825 -0.99383247
and the output:
% python sample.py
['A', 'B']
[[1.2595034, 0.82587254, 0.7375044, 1.1270138, -0.35065323, 0.55985355, 0.7200067, -0.889543, 0.2300735, 0.56767654, 0.2789483, 0.32296127, -0.6423197, 0.26456305, -0.07363393, -1.0788593], [1.2467299, 0.78651106, 0.4702038, 1.204216, -0.5282698, 0.13987103, 0.5911153, -0.6729466, 0.377103, 0.34090135, 0.3052503, 0.028784657, -0.39129165, 0.079238065, -0.29310825, -0.99383247]]
Alternatively, if you want a dictionary, do:
if __name__ == '__main__':
    print(dict(generate_kv('sample.txt')))
That builds a dictionary from the generated key-value pairs and outputs:
{'A': [1.2595034, 0.82587254, 0.7375044, 1.1270138, -0.35065323, 0.55985355, 0.7200067, -0.889543, 0.2300735, 0.56767654, 0.2789483, 0.32296127, -0.6423197, 0.26456305, -0.07363393, -1.0788593], 'B': [1.2467299, 0.78651106, 0.4702038, 1.204216, -0.5282698, 0.13987103, 0.5911153, -0.6729466, 0.377103, 0.34090135, 0.3052503, 0.028784657, -0.39129165, 0.079238065, -0.29310825, -0.99383247]}
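The record in the question is wrapped over two lines per key, whereas the code above expects one line per key. A minimal sketch for the wrapped layout follows, assuming that any line whose first token is not a number starts a new record. The name `parse_wrapped` and the shortened sample values are made up for this illustration.

```python
import io

def parse_wrapped(lines):
    """Group wrapped records: a non-numeric first token starts a new record."""
    records = {}
    current = None
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # blank line
        try:
            float(tokens[0])
        except ValueError:
            # first token is not a number, so treat it as a new key
            current = tokens[0]
            records[current] = []
            tokens = tokens[1:]
        if current is None:
            continue  # numeric line before any key, e.g. the "29 16" header
        records[current].extend(float(token) for token in tokens)
    return records

# hypothetical, shortened sample in the same shape as the question's data
sample = io.StringIO("29 16\nA 1.5 2.5\n3.5 -4.5\nB 0.5\n-1.0 2.0\n")
wrapped = parse_wrapped(sample)
print(wrapped)
```

With a real file, the same function can be given the open file object instead of the `io.StringIO` stand-in.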
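You can use this script if your file is a text file with one record per line:
filename = 'file.text'
with open(filename) as f:
    data = f.readlines()[1:]  # skip the header line
# split every non-empty line into its whitespace-separated tokens
rows = [line.split() for line in data if line.strip()]
x = [row[0] for row in rows]   # the keys: 'A', 'B'
y = [row[1:] for row in rows]  # the numbers (as strings) that follow each key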
If you're happy to store the data in a dictionary here is what you can do:
records = dict()
with open(filename, 'r') as f:
    f.readline()  # skip the first line
    for line in f:
        key, value = line.split(maxsplit=1)
        records[key] = value.split()
The structure of records would be:
{
    'A': ['1.2595034', '0.82587254', '0.7375044', ... ],
    'B': ['1.2467299', '0.78651106', '0.4702038', ... ]
}
What's happening
with ... as f opens the file within a context manager, which automatically closes the file when the block finishes.
Because the open file keeps track of its position, f.readline() moves the pointer down one line, which skips the header.
line.split() turns a string into a list of strings. With the maxsplit=1 argument it splits only once, at the first run of whitespace.
e.g. after x, y = 'foo bar baz'.split(maxsplit=1), x is 'foo' and y is 'bar baz'
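To make these steps concrete, here is a minimal sketch that reads from an in-memory string rather than a file. The name `parse_records` and the sample text are assumptions made for this illustration, and the conversion to float is an extra step the answer above does not perform.

```python
import io

def parse_records(lines):
    """Map each line's first token to the list of floats that follow it."""
    records = {}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        # split once: the key on the left, the rest of the line on the right
        key, rest = line.split(maxsplit=1)
        records[key] = [float(token) for token in rest.split()]
    return records

# hypothetical sample standing in for the real file
sample = io.StringIO("29 16\nA 1.5 -0.25 3.0\nB 2.0 0.5 -1.0\n")
sample.readline()  # skip the header line, as f.readline() does above
records = parse_records(sample)
print(records)
```

A line that contains only a key and no numbers would make the two-name unpacking fail, so real input may need an extra check.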
If I understood correctly, you want the numbers to be collected in a list. One way of doing this is:
import string
text = '''
29 16
A 1.2595034 0.82587254 0.7375044 1.1270138 -0.35065323 0.55985355 0.7200067 -0.889543 0.2300735 0.56767654 0.2789483 0.32296127 -0.6423197 0.26456305 -0.07363393 -1.0788593
B 1.2467299 0.78651106 0.4702038 1.204216 -0.5282698 0.13987103 0.5911153 -0.6729466 0.377103 0.34090135 0.3052503 0.028784657 -0.39129165 0.079238065 -0.29310825 -0.99383247
'''
lines = text.split('\n')
x = [
    line[1:].strip().split()
    for i, line in enumerate(lines)
    if line and line[0].lower() in string.ascii_letters]
This will produce a list of lists: the outer list has one entry for each of A, B, etc., and each inner list contains the numbers associated with that letter.
This code assumes that you are interested in lines starting with any single letter (case-insensitive).
For more elaborate conditions you may want to look into regular expressions.
Obviously, if your text is in a file, you could substitute lines = ... with:
with open(filepath, 'r') as lines:
    x = ...
Also, if the numbers of each record should not be split but kept as a single string, replace line[1:].strip().split() with line[1:].strip().
If instead you want the numbers as floats rather than strings, replace line[1:].strip().split() with [float(value) for value in line[1:].strip().split()].
EDIT:
Alternatively to line[1:].strip().split() you may want to do:
line.split(maxsplit=1)[1].split()
as suggested in some other answer. This would generalize better if the first token is not a single character.
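For the regular-expression route mentioned earlier, one possible sketch follows. The pattern, the name `parse_with_regex`, and the sample text are assumptions for illustration: a record line is taken to be a key that starts with a letter, followed by whitespace and the numbers.

```python
import re

# key: a letter followed by any word characters; rest: everything after the whitespace
RECORD = re.compile(r"^([A-Za-z]\w*)\s+(.*)$")

def parse_with_regex(text):
    """Return {key: [floats]} for every line that looks like a record."""
    result = {}
    for line in text.splitlines():
        match = RECORD.match(line.strip())
        if match:
            key, rest = match.groups()
            result[key] = [float(value) for value in rest.split()]
    return result

# hypothetical sample; the "29 16" header starts with a digit and is skipped
parsed = parse_with_regex("29 16\nA 1.5 2.5\nrow_B -1.0 0.0\n")
print(parsed)
```

Unlike the single-letter test, this also accepts multi-character keys such as `row_B`.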
I have a CSV file that contains a header row followed by a potentially unlimited number of rows with values. For example:
FieldA,FieldB,FieldC,FieldD
1,asdf,2,ghjk
3,qwer,4,yuio
5,slslkd,,aldkjslkj
What I need to do is for each row, create a quasi-XML string where the elements are labeled as the column name and information within each element is the value of the cell. Using the above as an example, if I iterate through each of the three rows I would end up with these three strings:
<FieldA>1</FieldA><FieldB>asdf</FieldB><FieldC>2</FieldC><FieldD>ghjk</FieldD>
<FieldA>3</FieldA><FieldB>qwer</FieldB><FieldC>4</FieldC><FieldD>yuio</FieldD>
<FieldA>5</FieldA><FieldB>slslkd</FieldB><FieldD>aldkjslkj</FieldD>
The way I am currently doing it is:
for row in r:
    if row['FieldA']:
        fielda = '<FieldA>{0}</FieldA>'.format(row['FieldA'])
    else:
        fielda = ''
    if row['FieldB']:
        fieldb = '<FieldB>{0}</FieldB>'.format(row['FieldB'])
    else:
        fieldb = ''
    if row['FieldC']:
        fieldc = '<FieldC>{0}</FieldC>'.format(row['FieldC'])
    else:
        fieldc = ''
    if row['FieldD']:
        fieldd = '<FieldD>{0}</FieldD>'.format(row['FieldD'])
    else:
        fieldd = ''
    # Compile the string
    final_string = fielda + fieldb + fieldc + fieldd
    # Process further
    do_something(final_string)
As it iterates through each row, this creates the appropriate string and then I can pass it on for further processing.
Is there a better way to achieve what I want, or is my approach the best way? My guess is there is a better, more Pythonic, and more efficient way, but I'm new-ish to Python.
Thanks.
Slightly modified code that fixed the issue I was having. Turned out to be pretty trivial:
with open(csv_file) as f:
    for row in csv.DictReader(f):
        top = Element('event')
        for k, v in row.items():
            child = SubElement(top, k)
            child.text = v
        print(tostring(top))
Thanks for the help!
Python is Batteries Included.
In this case, you can use the csv module and the xml module, with code that looks like this:
# CSV module
import csv
# Stuff from the XML module
from xml.etree.ElementTree import Element, SubElement, tostring
# Topmost XML element
top = Element('top')
# Open a file
with open('stuff.csv') as csvfile:
    # And use a dictionary-reader
    for d in csv.DictReader(csvfile):
        # For each mapping in the dictionary
        for (k, v) in d.items():
            # Create an XML node
            child = SubElement(top, k)
            child.text = v
print(tostring(top))
'Top' is just the highest level node -- you could use whatever text you want to wrap the whole document.
You can pretty-print it pretty simply as well:
http://pymotw.com/2/xml/etree/ElementTree/create.html#pretty-printing-xml
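The question asks for one string per row with empty cells left out, which the ElementTree code does not do by itself. A minimal sketch of that variant is below, assuming plain string building is acceptable. `row_to_xml` is a made-up helper name, and `xml.sax.saxutils.escape` is used only to keep characters such as & and < from breaking the markup.

```python
import csv
import io
from xml.sax.saxutils import escape

def row_to_xml(row, fieldnames):
    """Build '<Name>value</Name>' for every non-empty cell, in header order."""
    parts = []
    for name in fieldnames:
        value = row.get(name)
        if value:  # skip empty cells such as FieldC in the third row
            parts.append("<{0}>{1}</{0}>".format(name, escape(value)))
    return "".join(parts)

# the sample from the question, read from memory instead of a file
data = io.StringIO(
    "FieldA,FieldB,FieldC,FieldD\n"
    "1,asdf,2,ghjk\n"
    "3,qwer,4,yuio\n"
    "5,slslkd,,aldkjslkj\n"
)
reader = csv.DictReader(data)
strings = [row_to_xml(row, reader.fieldnames) for row in reader]
for s in strings:
    print(s)
```

Because the loop runs over the header names, adding a column to the CSV needs no change to the code.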
My friend asked me to help him parse an eBay CSV file and save only a couple of important fields, so I thought it would be a good opportunity to learn Python (I mostly write C for now).
The problem is, the eBay CSV file format is giving me a hard time:
Numer rekordu sprzedaży,Nazwa użytkownika,Imię i nazwisko kupującego,Numer telefonu kupującego,Adres e-mail kupującego,Adres 1 kupującego,Adres 2 kupującego,Miejscowość kupującego,Województwo kupującego,Kod pocztowy kupującego,Kraj kupującego,Numer przedmiotu,Nazwa przedmiotu,Etykieta niestandardowa,Ilość,Cena sprzedaży,Wysyłka i obsługa,Ubezpieczenie,Koszt płatności za pobraniem,Cena łączna,Forma płatności,Data sprzedaży,Data realizacji transakcji,Data zapłaty,Data wysyłki,Opinia wystawiona,Opinia otrzymana,Uwagi własne,Identyfikator transakcji PayPal,Usługa wysyłkowa,Opcja płatności za pobraniem,Identyfikator transakcji,Identyfikator zamówienia,Szczegóły wersji
"610","xxx","John Rodriguez","(860) 000-00000","mail#yahoo.com","0 Branford Ave Bldg 11","","City","CT","00000","Stany Zjednoczone","330972592582","Honda CBR 900 RR","","1","US $21,49","US $5,50","US $0,00","","US $26,99","PayPal","23-03-2014","23-03-2014","23-03-2014","","Nie","","","4EP58","Standard Shipping from outside US","","9639014","",""
"627","yyy","Name","063100000","mail#orange.fr","Rue barillettes","","st main","Rhône","00000","Francja","3311071","Suzuki SV 650","","1","EUR 15,99","EUR 4,00","EUR 0,00","","EUR 19,99","PayPal","31-03-2014","31-03-2014","31-03-2014","","Nie","","","6E03683046","Livraison standard ? partir de l'étranger","","9659014","",""
Pobrano rekordów: 8,,od ,23-03-2014,15:06:14, do ,11-04-2014,14:32:17
Nazwa sprzedawcy: mail#gmail.com
Parsing it with csv.DictReader, as in the manual, turns every line into a dictionary with the single key None and a list of values:
import csv
filename = "SalesHistory.csv"
csvfile = open(filename, encoding="iso-8859-2")
input_file = csv.DictReader(csvfile, quotechar='"', skipinitialspace=True)
for row in input_file:
    print (row)
{None: ['\tNumer rekordu sprzedaży', 'Nazwa użytkownika', 'Imię i nazwisko kupującego', 'Numer telefonu kupującego',
'Adres e-mail kupującego', 'Adres 1 kupującego', 'Adres 2 kupującego', 'Miejscowość kupującego',
'Województwo kupującego', 'Kod pocztowy kupującego', 'Kraj kupującego', 'Numer przedmiotu', 'Nazwa przedmiotu',
'Etykieta niestandardowa', 'Ilość', 'Cena sprzedaży', 'Wysyłka i obsługa', 'Ubezpieczenie',
'Koszt płatności za pobraniem', 'Cena łączna', 'Forma płatności', 'Data sprzedaży',
'Data realizacji transakcji', 'Data zapłaty', 'Data wysyłki', 'Opinia wystawiona', 'Opinia otrzymana',
'Uwagi własne', 'Identyfikator transakcji PayPal', 'Usługa wysyłkowa', 'Opcja płatności za pobraniem',
'Identyfikator transakcji', 'Identyfikator zamówienia', 'Szczegóły wersji']}
instead of the first line being read as the keys for the transactions in the other lines.
I read the Python csv manual, looked at some examples, and searched Stack Overflow, but I still don't know what to do next; most of them cover a more 'standard' version of CSV.
Any tips to get me moving in the right direction would be great.
That's odd... your code didn't give me the result that you posted in your question (although I'm using Python 2.7 and you seem to be using 3.x, so maybe that is why).
Also, the file doesn't start with a blank (empty) line, does it? If it does, it will confuse the csv module: csv.DictReader takes its keys from the first line, so a blank first line leaves it without any keys. You should "clean" the file before parsing it with csv (removing empty lines should do the trick), or you could read row by row and skip the empty rows. The second option complicates using csv.DictReader: you have to take the first non-empty row as the keys of your result dictionary and then read the remaining rows as its values. I'd just remove the empty lines from the file before parsing it.
In the code below I've added a try/except block to deal with incomplete lines (such as the last 2 lines in your sample file), but even without it, it was working pretty well:
import csv
filename = "SalesHistory.csv"
read_dcts = []
with open(filename, 'r') as csvfile:
    input_file = csv.DictReader(csvfile, quotechar='"', skipinitialspace=True)
    for i, dct in enumerate(input_file):
        try:
            utf_dict=dict((k.decode('utf-8'), v.decode('utf-8')) \
                for k, v in dct.items())
            read_dcts.append(utf_dict)
        except AttributeError:
            print "Weird line %d found" % (i + 1)
# Verify:
for i, dct in enumerate(read_dcts):
    print "Dict %d" % (i + 1)
    for k, v in dct.iteritems():
        print "\t%s: %s" % (k, v)
If I execute the code above, I get:
Weird line 3 found
Weird line 4 found
Dict 1
Opinia otrzymana:
Cena sprzedaży: US $21,49
[ . . . ]
Wysyłka i obsługa: US $5,50
Opcja płatności za pobraniem:
Dict 2
Opinia otrzymana:
Cena sprzedaży: EUR 15,99
[ . . . ]
Wysyłka i obsługa: EUR 4,00
Opcja płatności za pobraniem
I've removed many of the lines loaded, just for clarity's sake but besides that, it should be loading what you wanted.
If you have an update, let me know through a comment.
EDIT:
Just in case the file contains an empty line and you don't want to pre-clean it, you can pretty much do "manually" what the DictReader class does for you (use the first non-empty line as keys, and the rest of the non-empty lines as values):
import csv
filename = "SalesHistory.csv"
read_dcts = []
keys = []
with open(filename, 'r') as csvfile:
    reader = csv.reader(csvfile, quotechar='"', skipinitialspace=True)
    for i, row in enumerate(reader):
        try:
            if len(row) == 0:
                raise IndexError("Row %d is empty. Should skip" % (i + 1))
            if len(keys) == 0:
                keys = [ val.decode('utf-8') for val in row ]
            elif len(row) == len(keys):
                utf_dict = dict(zip(keys, [ val.decode('utf-8') for val in row ]))
                read_dcts.append(utf_dict)
        except (IndexError, AttributeError), e:
            print "Weird line %d found (got %s)" % ((i + 1), e)
# Verify:
for i, dct in enumerate(read_dcts):
    print "Dict %d" % (i + 1)
    for k, v in dct.iteritems():
        print "\t%s: %s" % (k, v)
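The code above is written for Python 2.7, while the question uses Python 3. A minimal Python 3 sketch of the same idea follows, assuming the trouble comes from blank lines and from footer lines whose field count differs from the header. `read_sales` is a made-up name, and the three-column sample is a shortened stand-in for the real eBay export.

```python
import csv
import io

def read_sales(lines):
    """Return one dict per complete data row, ignoring blank and odd-length lines."""
    # drop blank lines so the first line DictReader sees is the header
    non_blank = (line for line in lines if line.strip())
    reader = csv.DictReader(non_blank, quotechar='"', skipinitialspace=True)
    rows = []
    for row in reader:
        # DictReader puts surplus cells under the key None
        # and fills missing cells with the value None
        if None in row or None in row.values():
            continue
        rows.append(row)
    return rows

# hypothetical sample: blank first line, header, two sales, two footer lines
sample = io.StringIO(
    '\n'
    'Numer,Nazwa,Cena\n'
    '"610","Honda CBR 900 RR","US $21,49"\n'
    '"627","Suzuki SV 650","EUR 15,99"\n'
    'Pobrano: 8,,od ,23-03-2014\n'
    'Nazwa sprzedawcy: mail#gmail.com\n'
)
rows = read_sales(sample)
for row in rows:
    print(row)
```

For the real file, the function can be passed the object returned by open(filename, encoding="iso-8859-2", newline=""), with the encoding taken from the question.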
A reasonably simple function to read a csv file, using the first line of the file as keys and the other lines as values.
import csv
def dict_from_csv(filename):
    '''
    (file)->list of dictionaries
    Function to read a csv file and format it to a list of dictionaries.
    The headers are the keys with all other data becoming values
    '''
    #open the file and read it using csv.reader()
    #read the file. for each row that has content add it to list mf
    #the keys for our user dict are the first content line of the file mf[0]
    #the values to our user dict are the other lines in the file mf[1:]
    mf = []
    with open(filename, 'r') as f:
        my_file = csv.reader(f)
        for row in my_file:
            if any(row):
                mf.append(row)
    file_keys = mf[0]
    file_values = mf[1:]
    #Combine the two lists, turning into a list of dictionaries, using the keys list as the key and the value list as the values
    my_list = []
    for value in file_values:
        #pair the keys with this row's values (not with the whole list of rows)
        my_list.append(dict(zip(file_keys, value)))
    #return the list of dictionaries
    return my_list
I have some values that I want to write in a text file with the constraint that each value has to go to a particular column of each line.
For example, let's say that I have values = [a, b, c, d] and I want to write them in a line so that a is written in the 10th column of the line, b in the 25th, c in the 34th, and d in the 48th column.
How would I do this in Python?
Does Python have something like column.insert(10, a)? It would make my life way easier.
I appreciate your help.
In this case, I'd think you'd just use the padding functions with python's string formatting syntax.
Something like "%10d%15d%9d%14d" % tuple(values) will place the right-most digit of a, b, c, d on the columns you listed.
If you want the left-most digits placed there instead, use the - flag for left alignment and prepend 9 spaces: " "*9 + "%-15d%-9d%-14d%d" % tuple(values).
EDIT: The new-style formatting syntax does the same with < for left alignment:
" "*9 + "{:<15}{:<9}{:<14}{}".format(*values)
This should print, for values=[20,30,403,50]:
.........                                          <-- from " "*9
         20.............                           <-- {:<15}
                        30.......                  <-- {:<9}
                                 403...........    <-- {:<14}
                                               50  <-- {}
----=----1----=----2----=----3----=----4----=----5 <-- guide
         20             30       403           50  <-- Actual output, all together
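The widths in such a format string are just the gaps between consecutive start columns, so they can also be derived from the column list. A small sketch under that assumption: `place_at_columns` is a made-up helper that takes 1-based start columns and refuses to continue when a value runs into the next column.

```python
def place_at_columns(values, columns):
    """Return a line in which each value starts at its 1-based column."""
    line = ""
    for value, column in zip(values, columns):
        if len(line) > column - 1:
            # the previous value already reaches into this column
            raise ValueError("value before column %d is too long" % column)
        # pad with spaces up to the start column, then append the value
        line = line.ljust(column - 1) + str(value)
    return line

line = place_at_columns([20, 30, 403, 50], [10, 25, 34, 48])
print(line)
```

Raising an error on overlap is only one possible policy; truncating the value would be another.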
class ColumnWriter(object):
    def __init__(self, columns):
        columns = (-1, ) + tuple(columns)
        widths = (c2 - c1 for c1, c2 in zip(columns, columns[1:]))
        format_codes = ("{" + str(i) + ":>" + str(width) +"}"
                        for i, width in enumerate(widths))
        self.format_string = ''.join(format_codes)
    def get_row(self, values):
        return self.format_string.format(*values)
cw = ColumnWriter((1, 20, 21))
print cw.get_row((1, 2, 3))
print cw.get_row((1, 'a', 'a'))
If you need the columns to vary from row to row, you can build a writer per row in one line:
import itertools
for columns in itertools.combinations(range(10), 3):
    print ColumnWriter(columns).get_row(('.','.','.'))
It is light on error checking: it should check that columns is sorted and that len(values) == len(columns).
It also has a problem when a value is longer than the area allocated to hold it, and I'm not sure what to do about this. Currently, if that occurs, the field simply grows, so the value ends to the right of its column and every later column shifts with it. Example:
print ColumnWriter((1, 2, 3)).get_row((1, 1, 'aa'))
If you had an iterable of rows that you wanted to write to a file, you could do something like this
rows = [(1, 3, 4), ('a', 'b', 4), ['foo', 'ten', 'mongoose']]
format = ColumnWriter((20, 30, 50)).get_row
with open(filename, 'w') as fout:
    fout.write("\n".join(format(row) for row in rows))
You can use the mmap module to memory-map a file.
http://docs.python.org/library/mmap.html
With mmap you can do something like this:
import mmap
# the file has to be opened for both reading and writing before it can be mapped
fh = open('your_file', 'r+b')
map = mmap.mmap(fh.fileno(), <length of the file you want to create>)
map[10] = a
map[25] = b
Not sure if that is what you're looking for, but it might work :)
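A runnable sketch of the memory-mapping idea, under a few assumptions: the file is first filled with spaces so that there is something to map, columns are counted from 1, and the values are ASCII. `write_at_columns` and the temporary file path are made up for this illustration.

```python
import mmap
import os
import tempfile

def write_at_columns(path, line_length, placements):
    """Create a one-line file of spaces and overwrite bytes at 1-based columns."""
    with open(path, "w+b") as fh:
        fh.write(b" " * line_length + b"\n")
        fh.flush()
        # length 0 maps the whole (non-empty) file
        mm = mmap.mmap(fh.fileno(), 0)
        try:
            for column, text in placements:
                data = text.encode("ascii")
                start = column - 1
                # slice assignment must keep the length unchanged
                mm[start:start + len(data)] = data
        finally:
            mm.close()

path = os.path.join(tempfile.mkdtemp(), "columns.txt")
write_at_columns(path, 50, [(10, "a"), (25, "b"), (34, "c"), (48, "d")])
with open(path) as fh:
    line = fh.read()
print(repr(line))
```

For a single line of text this is heavier than string formatting; it mainly pays off when patching fixed positions in an existing large file.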
It seems I might have misunderstood the question. The old answer is below
Perhaps you're looking for the csv module?
http://docs.python.org/library/csv.html
import csv
fh = open('eggs.csv', 'wb')
spamWriter = csv.writer(fh, delimiter=' ')
spamWriter.writerow(['Spam'] * 5 + ['Baked Beans'])
spamWriter.writerow(['Spam', 'Lovely Spam', 'Wonderful Spam'])
I have a text file containing data in rows and columns (~17000 rows in total). Each column is a uniform number of characters long, with the 'unused' characters filled in by spaces. For example, the first column is 11 characters long, but the last four characters in that column are always spaces (so that it appears to be a nice column when viewed with a text editor). Sometimes it's more than four if the entry is less than 7 characters.
The columns are not otherwise separated by commas, tabs, or spaces. They are also not all the same number of characters (the first two are 11, the next two are 8 and the last one is 5 - but again, some are spaces).
What I want to do is import the entries (which are numbers) in the last two columns if the second column contains the string 'OW' somewhere in it. Any help would be greatly appreciated.
Python's struct.unpack is probably the quickest way to split fixed-length fields. Here's a function that will lazily read your file and return tuples of numbers that match your criteria:
import struct
def parsefile(filename):
    # read bytes: struct.unpack works on byte strings
    with open(filename, 'rb') as myfile:
        for line in myfile:
            line = line.rstrip(b'\r\n')
            fields = struct.unpack('11s11s8s8s5s', line)
            if b'OW' in fields[1]:
                yield (int(fields[3]), int(fields[4]))
Usage:
if __name__ == '__main__':
    for field in parsefile('file.txt'):
        print(field)
Test data:
1234567890a1234567890a123456781234567812345
something  maybe OW d 111111118888888855555
aaaaa      bbbbb      1234    1212121233333
other thinganother OW 121212  6666666644444
Output:
(88888888, 55555)
(66666666, 44444)
In Python you can extract a substring at known positions using a slice - this is normally done with the list[start:end] syntax. However you can also create slice objects that you can use later to do the indexing.
So you can do something like this:
columns = [slice(11,22), slice(30,38), slice(38,44)]
myfile = open('some/file/path')
for line in myfile:
    fields = [line[column].strip() for column in columns]
    if "OW" in fields[0]:
        value1 = int(fields[1])
        value2 = int(fields[2])
        ....
Separating out the slices into a list makes it easy to change the code if the data format changes, or you need to do stuff with the other fields.
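As a sketch of how that flexibility might look, the slices can be given names in a dictionary. The field names, `make_line`, and the sample lines below are assumptions for illustration, with column widths of 11, 11, 8, 8 and 5 taken from the question.

```python
# hypothetical names for the fixed-width fields (0-based, end-exclusive slices)
FIELDS = {
    "name": slice(0, 11),
    "kind": slice(11, 22),
    "first": slice(30, 38),
    "second": slice(38, 43),
}

def parse_line(line):
    """Cut one fixed-width line into stripped, named fields."""
    return {name: line[part].strip() for name, part in FIELDS.items()}

def matching_values(lines):
    """Yield the last two numbers of every line whose second column contains 'OW'."""
    for line in lines:
        record = parse_line(line)
        if "OW" in record["kind"]:
            yield int(record["first"]), int(record["second"])

def make_line(*cells):
    """Helper for the demo only: pad each cell to its column width."""
    widths = (11, 11, 8, 8, 5)
    return "".join(cell.ljust(width) for cell, width in zip(cells, widths)) + "\n"

sample = [
    make_line("something", "maybe OW d", "11111111", "88888888", "55555"),
    make_line("aaaaa", "bbbbb", "1234", "12121212", "33333"),
    make_line("other thing", "another OW", "121212", "66666666", "44444"),
]
result = list(matching_values(sample))
print(result)
```

Changing the file layout then means editing only the `FIELDS` table.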
Here's a function which might help you:
def rows(f, columnSizes):
    while True:
        row = {}
        for (key, size) in columnSizes:
            value = f.read(size)
            if len(value) < size: # EOF
                return
            row[key] = value
        yield row
For an example of how it's used:
from StringIO import StringIO
sample = StringIO("""aaabbbccc
d  e  f  
g  h  i  
""")
for row in rows(sample, [('first', 3),
                         ('second', 3),
                         ('third', 4)]):
    print repr(row)
Note that unlike the other answers, this example is not line-delimited: it uses the file purely as a provider of bytes, not as an iterator of lines. Since you specifically mentioned that the fields were not separated, I assumed that the rows might not be either, so the newline is accounted for explicitly as part of the last field's size.
You can test if one string is a substring of another with the 'in' operator. For example,
>>> 'OW' in 'hello'
False
>>> 'OW' in 'helOWlo'
True
So in this case, you might do
if 'OW' in row['third']:
    stuff()
but you can obviously test any field for any value as you see fit.
entries = ((float(line[30:38]), float(line[38:43])) for line in myfile if "OW" in line[11:22])
for num1, num2 in entries:
    # whatever
entries = []
with open('my_file.txt', 'r') as f:
    for line in f.read().splitlines():
        line = line.split()
        if line[1].find('OW') >= 0:
            entries.append( ( int(line[-2]) , int(line[-1]) ) )
entries is a list containing tuples of the last two entries
edit: oops