parse a csv file into a text file

I am a second year EE student.
I just started learning python for my project.
I intend to parse a csv file with a format like
3520005,"Toronto (Ont.)",C ,F,2503281,2481494,F,F,0.9,1040597,979330,630.1763,3972.4,1
2466023,"Montréal (Que.)",V ,F,1620693,1583590,T,F,2.3,787060,743204,365.1303,4438.7,2
5915022,"Vancouver (B.C.)",CY ,F,578041,545671,F,F,5.9,273804,253212,114.7133,5039.0,8
3519038,"Richmond Hill (Ont.)",T ,F,162704,132030,F,F,23.2,53028,51000,100.8917,1612.7,28
into a text file like the following
Toronto 2503281
Montreal 1620693
Vancouver 578041
I am extracting the 1st and 5th columns and saving them into a text file.
This is what I have so far.
import csv
file = open('raw.csv')
reader = csv.reader(file)
f = open('NicelyDone.text','w')
for line in reader:
    f.write("%s %s"%line[1],%line[5])
This is not working for me. I was able to extract the data from the csv file as line[1] and line[5] (I am able to print them out),
but I don't know how to write them to a .text file in the format I want.
Also, I have to process the city column, e.g. turn "Toronto (Ont.)" into "Toronto".
I am familiar with the function find(), and I assume that I could extract Toronto out of Toronto (Ont.) using "(" as the stopping character,
but based on my research, I have no idea how to use it to get back the string (Toronto).
Here are my questions:
What is the data format for line[1]?
If it is a string, how come f.write() does not work?
If it is not a string, how do I convert it to a string?
How do I extract the word Toronto out of Toronto (Ont.) as a string using find() or other methods?
My thinking is that I could add those 2 strings together like c = a + ' ' + b, which would give me the format I want.
Then I can use f.write() to write it into a file :)
Sorry if my questions sound too easy or stupid.
Thanks ahead
Zhen

All the data you get from csv.reader are strings.
There are various ways to extract the city name, but the simplest would be to split on ( and strip away any whitespace:
>>> a = 'Toronto (Ont.)'
>>> b = a.split('(')
>>> b
['Toronto ', 'Ont.)']
>>> c = b[0]
>>> c
'Toronto '
>>> c.strip()
'Toronto'
or in one line:
>>> print 'Toronto (Ont.)'.split('(')[0].strip()
Another option would have been to use regular expression (the re module).
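For comparison, here is a minimal sketch of the regular-expression route, written for Python 3. The helper name strip_parenthetical is made up for illustration, and it assumes the province always appears as a trailing parenthesised suffix.

```python
import re

def strip_parenthetical(name):
    """Drop a trailing '(...)' suffix and surrounding whitespace (hypothetical helper)."""
    # \s*\([^)]*\)\s*$ matches optional spaces, one parenthesised group, end of string.
    return re.sub(r'\s*\([^)]*\)\s*$', '', name).strip()

print(strip_parenthetical('Toronto (Ont.)'))
print(strip_parenthetical('Richmond Hill (Ont.)'))
```

For a single fixed pattern like this, split() is probably still the simpler choice; the regex mainly pays off if the suffix format varies.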
The specific problem in your code lies here:
f.write("%s %s"%line[1],%line[5])
When formatting a string with the % operator, you have to provide either a single value or a tuple with one value per placeholder (a list will not work). In your case this should be:
f.write("%s %s" % (line[1], line[5]))
Another way to do the exact same thing, is to use the format method.
f.write('{} {}'.format(line[1], line[5]))
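This is a flexible way of formatting strings, and I recommend that you read about it in the docs. Note that write() does not add a line break, so append '\n' to the format string to get one record per line.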
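A small Python 3 sketch of the two formatting styles side by side, plus what happens when a list is passed to % instead of a tuple. The sample values are taken from the question; the variable names are only illustrative.

```python
city, population = 'Toronto', '2503281'

# A tuple supplies one value per placeholder.
line_percent = "%s %s\n" % (city, population)

# str.format takes the values as separate arguments.
line_format = "{} {}\n".format(city, population)

# A list counts as a single value, so the second placeholder stays unfilled.
try:
    "%s %s" % [city, population]
except TypeError as exc:
    error_message = str(exc)

print(line_percent, line_format, error_message)
```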
Regarding your code, there are a couple of things you should consider.
Always remember to close your file handles. If you use with open(...) as fp, this is taken care of for you.
with open('myfile.txt') as ifile:
    # Do stuff
    pass
# The file is closed here
Don't use built-in names as variable names. file is a built-in type in Python 2, and by using it for something else (shadowing it), you may cause problems later on in your code.
To write your data, you can use csv.writer:
with open('myfile.txt', 'wb') as ofile:
    writer = csv.writer(ofile)
    writer.writerow(['my', 'data'])
From Python 2.7 (and 3.1) onward, you can combine multiple with statements in one statement:
with open('raw.csv') as ifile, open('NicelyDone.text', 'wb') as ofile:
    reader = csv.reader(ifile)
    writer = csv.writer(ofile)
Combining this knowledge, your script can be rewritten to something like:
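import csv
with open('raw.csv') as ifile, open('NicelyDone.text', 'wb') as ofile:
    reader = csv.reader(ifile)
    writer = csv.writer(ofile, delimiter=' ')
    for row in reader:
        # row[4] is the population shown in your desired output (2503281 for Toronto);
        # row[5] would give the next column (2481494).
        city, num = row[1].split('(')[0].strip(), row[4]
        # Note: names containing a space (e.g. Richmond Hill) will be quoted by csv.writer.
        writer.writerow([city, num])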
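If you are on Python 3, a rough equivalent could look like the sketch below. It writes the lines with plain write() so that multi-word city names are not quoted. The function name write_city_population is made up, the io.StringIO objects stand in for real files, and the use of index 4 is an assumption based on the sample output in the question (Toronto 2503281).

```python
import csv
import io

def write_city_population(infile, outfile):
    """Copy 'City population' lines from an open CSV file to an open text file."""
    for row in csv.reader(infile):
        city = row[1].split('(')[0].strip()
        # Index 4 is the column that matches the sample output (assumption).
        outfile.write('{} {}\n'.format(city, row[4]))

# In-memory demo; with real files, open the input with newline='' as the
# csv documentation recommends for Python 3.
sample = io.StringIO('3520005,"Toronto (Ont.)",C ,F,2503281,2481494,F\n'
                     '5915022,"Vancouver (B.C.)",CY ,F,578041,545671,F\n')
result = io.StringIO()
write_city_population(sample, result)
print(result.getvalue())
```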

I don't recall csv that well, so I don't know if it's a string or not. What error are you getting? In any case, assuming it is a string, your line should be:
f.write("%s %s\n" % (line[1], line[5]))
In other words, you need a set of parentheses around the two values. Also, you should end the string with a newline so that each record goes on its own line.
To get just the city name, a somewhat hackish but concise way is: line[1].split("(")[0]
This creates a list by splitting on the ( symbol, and then you extract the first element. Add .strip() to drop the space left before the parenthesis.
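Since the question asked about find() specifically, here is a minimal sketch of that approach. The helper name city_name is hypothetical; the only thing to watch is that find() returns -1 when the character is missing.

```python
def city_name(field):
    """Return the text before the first '(' using str.find (hypothetical helper)."""
    cut = field.find('(')
    if cut == -1:
        # No parenthesis at all: keep the whole field.
        return field.strip()
    # Slice up to (not including) the parenthesis, then trim the space before it.
    return field[:cut].strip()

print(city_name('Toronto (Ont.)'))
```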

Related

Parsing a text file with line breaks in python

I have a text file with about 20 entries. They look like this:
~
England
Link: http://imgur.com/foobar.jpg
Capital: London
~
Iceland
Link: http://imgur.com/foobar2.jpg
Capital: Reykjavik
...
etc.
I would like to take these entries and turn them into a CSV.
There is a '~' separating each entry. I'm scratching my head trying to figure out how to go thru line by line and create the CSV values for each country. Can anyone give me a clue on how to go about this?

Use the libraries, Luke :)
I'm assuming your data is well formatted. Most real world data isn't that way. So, here goes a solution.
>>> content = open('entries.txt').read()  # 'entries.txt' is a placeholder file name
>>> entries = [e for e in content.split('~') if e.strip()]  # drop the empty piece before the first '~'
>>> entries
['\nEngland\nLink: http://imgur.com/foobar.jpg\nCapital: London\n', '\nIceland\nLink: http://imgur.com/foobar2.jpg\nCapital: Reykjavik\n', '\nEngland\nLink: http://imgur.com/foobar.jpg\nCapital: London\n', '\nIceland\nLink: http://imgur.com/foobar2.jpg\nCapital: Reykjavik\n']
For writing the CSV, Python has standard library functions.
>>> import csv
>>> csvfile = open('foo.csv', 'wb')
>>> fieldnames = ['Country', 'Link', 'Capital']
>>> writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
>>> for entry in entries:
...     cols = entry.strip().splitlines()
...     writer.writerow({'Country': cols[0], 'Link': cols[1].split(': ')[1], 'Capital': cols[2].split(': ')[1]})
...
>>> csvfile.close()
If your data is more semi structured or badly formatted, consider using a library like PyParsing.
Edit:
Second column contains URLs, so we need to handle the splits well.
>>> cols[1]
'Link: http://imgur.com/foobar2.jpg'
>>> cols[1].split(':')[1]
' http'
>>> cols[1].split(': ')[1]
'http://imgur.com/foobar2.jpg'
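Putting the pieces together for Python 3, a sketch might look like this. It assumes every entry has exactly the three lines shown in the question, in that order. The function name entries_to_rows is made up, and io.StringIO stands in for a real output file (which you would open with newline='').

```python
import csv
import io

def entries_to_rows(content):
    """Turn '~'-separated blocks into [country, link, capital] rows."""
    rows = []
    for block in content.split('~'):
        lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
        if len(lines) < 3:
            # Skip empty or incomplete blocks, e.g. the piece before the first '~'.
            continue
        country = lines[0]
        # partition splits at the first ': ' only, so URLs stay intact.
        link = lines[1].partition(': ')[2]
        capital = lines[2].partition(': ')[2]
        rows.append([country, link, capital])
    return rows

content = """~
England
Link: http://imgur.com/foobar.jpg
Capital: London
~
Iceland
Link: http://imgur.com/foobar2.jpg
Capital: Reykjavik
"""
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(['Country', 'Link', 'Capital'])
writer.writerows(entries_to_rows(content))
print(out.getvalue())
```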

The way that I would do that would be to use the open() function with the syntax:
f = open('NameOfFile.extensionType', 'r')
Here "r" is read mode, which is all that is needed to loop over the entries. For reference: "a" is append mode (existing content is kept and new data is added at the end), and a "+" after the letter opens the file for both reading and writing. "w" and "a" create the file if it does not exist, while "r" and "r+" require it to exist.
After that I would use a for loop like this:
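data = []
tmp = []
for line in f:
    line = line.strip()  # strip() returns a new string without the trailing newline
    if line == '~':
        if tmp:  # skip the empty group before the first '~'
            data.append(tmp)
        tmp = []
    else:
        tmp.append(line)
if tmp:  # keep the last entry, which has no '~' after it
    data.append(tmp)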
Now you have all of the data stored in a list, but you could also reformat it as a class object using a slightly different algorithm.
I have never edited CSV files using python, but I believe you can use a loop like this to add the data:
f2 = open('CSVfileName.csv', 'w') #Can change "w" for other needs i.e "a+"
for entry in data:
    for subentry in entry:
        f2.write(str(subentry) + '\n') #Use '\n' to create a new line
From my knowledge of CSV that loop would create a single column of all of the data. At the end remember to close the files in order to save the changes:
f.close()
f2.close()
You could combine the two loops into one in order to save space, but for the sake of explanation I have not.
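To get one CSV row per entry rather than a single column, something along these lines might work in Python 3. The data list here is made up to mirror what the grouping loop would produce from the question's sample, and io.StringIO stands in for the real output file.

```python
import csv
import io

# `data` as built by the grouping loop: one inner list per entry.
data = [
    ['England', 'Link: http://imgur.com/foobar.jpg', 'Capital: London'],
    ['Iceland', 'Link: http://imgur.com/foobar2.jpg', 'Capital: Reykjavik'],
]

# Stand-in for open('CSVfileName.csv', 'w', newline='').
out = io.StringIO()
# writerows() turns each inner list into one comma-separated row.
csv.writer(out).writerows(data)
csv_text = out.getvalue()
print(csv_text)
```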

python csv replace listitem

I have the following output from a csv file:
word1|word2|word3|word4|word5|word6|01:12|word8
word1|word2|word3|word4|word5|word6|03:12|word8
word1|word2|word3|word4|word5|word6|01:12|word8
What I need to do is change the time string to look like this: 00:01:12.
My idea is to extract list item [7] and add "00:" as a string to the front.
import csv
with open('temp', 'r') as f:
    reader = csv.reader(f, delimiter="|")
    for row in reader:
        fixed_time = (str("00:") + row[7])
        begin = row[:6]
        end = row[:8]
        print begin + fixed_time +end
I get this error message:
TypeError: can only concatenate list (not "str") to list.
I also had a look at this post:
how to change [1,2,3,4] to '1234' using python
I need to know if my approach to a solution is the right way. Maybe I need to use split or something else for this.
Thanks for any help.

The line that's throwing the exception is
print begin + fixed_time +end
because begin and end are both lists and fixed_time is a string. Whenever you take a slice of a list (that's the row[:6] and row[:8] parts), a list is returned. If you just want to print it out, you can do
print begin, fixed_time, end
and you won't get an error.
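If you do want to build a new row by concatenation, one way (sketched here for Python 3, with a sample row from the question) is to wrap the string in a list so that all three operands are lists. Note that the time sits at index 6, and the tail slice has to start at 7.

```python
row = 'word1|word2|word3|word4|word5|word6|01:12|word8'.split('|')

fixed_time = '00:' + row[6]
# Wrap the string in a list so that list + list + list is valid.
new_row = row[:6] + [fixed_time] + row[7:]
print('|'.join(new_row))
```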
Corrected code:
I'm opening a new file for writing (I'm calling it 'final', but you can call it whatever you want), and I'm just writing everything to it with the one modification. It's easiest to just change the one element of the list that has the line (row[6] here), and use '|'.join to write a pipe character between each column.
import csv
with open('temp', 'r') as f, open('final', 'w') as fw:
    reader = csv.reader(f, delimiter="|")
    for row in reader:
        # just change the element in the row to have the extra zeros
        row[6] = '00:' + row[6]
        # write the row back out, separated by | characters, and a new line.
        fw.write('|'.join(row) + '\n')

you can use regex for that:
>>> txt = """\
... word1|word2|word3|word4|word5|word6|01:12|word8
... word1|word2|word3|word4|word5|word6|03:12|word8
... word1|word2|word3|word4|word5|word6|01:12|word8"""
>>> import re
>>> print(re.sub(r'\|(\d\d:\d\d)\|', r'|00:\1|', txt))
word1|word2|word3|word4|word5|word6|00:01:12|word8
word1|word2|word3|word4|word5|word6|00:03:12|word8
word1|word2|word3|word4|word5|word6|00:01:12|word8

what is a quick way to import a text file in python?

I have a plain text file with a sequence of numbers, one on each line. I need to import those values into a list. I'm currently learning python and I'm not sure of which is a fast or even "standard" way of doing this (also, I come from R so I'm used to the scan or readLines functions that makes this task a breeze).
The file looks like this (note: this isn't a csv file, commas are decimal points):
204,00
10,00
10,00
10,00
10,00
11,00
70,00
276,00
58,00
...
Since it uses commas instead of '.' for decimal points, I guess the task's a little harder, but it should be more or less the same, right?
This is my current solution, which I find quite cumbersome:
f = open("some_file", "r")
data = f.read().replace('\n', '|')
data = data[0:(len(data) - 2)].replace(',', '.')
data = data.split('|')
x = range(len(data))
for i in range(len(data)):
    x[i] = float(data[i])
Thanks in advance.

UPDATE
I didn't realize the comma was the decimal separator. If the locale is set to one that uses a decimal comma, something like this should work (after import locale):
lines = [locale.atof(line.strip()) for line in open(filename)]
If not, you could do:
lines = [float(line.strip().replace(',','.')) for line in open(filename)]
Original answer, which treated the comma as a field separator. To read the lines as strings:
lines = [line.strip() for line in open(filename)]
If you want the data as numbers:
lines = [map(float,line.strip().split(',')) for line in open(filename)]
(Edited as per the first two comments below.)

bsoist's answer is good if locale is set correctly. If not, you can simply read the entire file in and split on the line breaks (\n), then use a list comprehension for replacements.
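with open('some_file.txt', 'r') as datafile:
    data = datafile.read()
# Skip empty pieces (such as the one after a trailing newline) so float() does not fail.
x = [float(value.replace(",", ".")) for value in data.split('\n') if value.strip()]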
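The same idea can be wrapped in a small function that works line by line, which avoids reading the whole file into one string. This is a Python 3 sketch; the name read_decimal_comma is made up, and io.StringIO stands in for the opened file.

```python
import io

def read_decimal_comma(lines):
    """Parse one number per line, treating ',' as the decimal point."""
    values = []
    for line in lines:
        text = line.strip()
        if text:
            # Blank lines (e.g. at the end of the file) are ignored.
            values.append(float(text.replace(',', '.')))
    return values

sample = io.StringIO('204,00\n10,00\n276,00\n\n')
numbers = read_decimal_comma(sample)
print(numbers)
```

With a real file you would pass the open file object directly, for example inside a with open('some_file.txt') as datafile: block.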

A simpler way is to just do:
lines = []
with open('File.txt', 'r') as f:
    lines = f.readlines()  # note the lowercase name: readLines() does not exist
for line in lines:
    print line
The with open() block opens the file and closes it automatically once the block is finished. This is good practice IIRC.
Then the for loop just loops over lines and prints each one.

How to correctly read csv and input into list?

I am trying to read a bunch of data in .csv file into an array in format:
[ [a,b,c,d], [e,f,g,h], ...]
Running the code below, when I print an entry with a space (' ') the way I'm accessing the element isn't correct because it stops at the first space (' ').
For example if Business, Fast Company, Youtube, fastcompany is the 10th entry...when I print the below I get on separate lines:
Business,Fast
Company,YouTube,FastCompany
Any advice on how to get as the result: [ [a,b,c,d], [Business, Fast Company, Youtube, fastcompany], [e,f,g,h], ...]?
import csv
partners = []
partner_dict = {}
i=9
with open('partners.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in spamreader:
        partners.append(row)
    print len(partners)
    for entry in partners[i]:
        print entry

The delimiter argument specifies which character to use to split each row of the file into separate values. Since you're passing ' ' (a space), the reader is splitting on spaces.
If this is really a comma-separated file, use ',' as the delimiter (or just leave the delimiter argument out and it will default to ',').
Also, the pipe character is an unusual value for the quote character. Is it really true that your input file contains pipes in place of quotes? The sample data you supplied contains neither pipes nor quotes.
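To see the effect of the delimiter argument in isolation, here is a short Python 3 sketch that parses the example row from the question both ways. csv.reader accepts any iterable of strings, so a one-element list is used in place of a file.

```python
import csv

line = ['Business,Fast Company,YouTube,FastCompany']

# Splitting on spaces breaks the row at the space inside "Fast Company".
by_space = next(csv.reader(line, delimiter=' ', quotechar='|'))

# The default delimiter is ',', so the row comes back as four fields.
by_comma = next(csv.reader(line))

print(by_space)
print(by_comma)
```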

There are a few issues with your code:
The usual way to iterate over the whole list is for entry in partners:, not for entry in partners[i]: (the latter loops over the fields of row i only)
The partner_dict variable in your code seems to be unused; I assume you'll use it later, so I'll ignore it for now
You're opening a text file as binary with open(file_name, "rb"); that is the Python 2 convention for csv, while Python 3 expects text mode, open(file_name, "r")
Your handling of the processed data is still done inside the context manager (the with ... [as ...]: block), although the file is no longer needed at that point
Your input text seems to be delimited by ", ", but you delimit by " " when parsing
If I understood your question right, your problem is caused by the last one. The "obvious solution" would be to change the delimiter argument to ", ", but the module only allows single-character strings as delimiters. So what do we do? Since "," is really the "true" delimiter (unlike spaces, it is never supposed to appear inside unquoted data), using it is a good solution. However, now all values except the first start with " ", which is probably not what you want.
Fortunately, all strings have a strip() method which by default removes all whitespace at the beginning and end of the string. To strip() all the values, let's use a list comprehension (which evaluates an expression for every item in a list and returns a new list with the results): [i.strip() for i in row], applied to each row before appending it to partners.
In the end your code should hopefully look somewhat like this:
import csv
partners = []
with open('partners.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='|')
    for row in spamreader:
        partners.append([i.strip() for i in row])
print len(partners)
for entry in partners:
    print entry
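As an alternative to calling strip() on every value, the csv module has a skipinitialspace option that ignores whitespace directly after a delimiter. A minimal Python 3 sketch, using a list of strings in place of the file:

```python
import csv

lines = ['Business, Fast Company, Youtube, fastcompany']

# skipinitialspace=True drops whitespace that directly follows a delimiter.
partners = [row for row in csv.reader(lines, skipinitialspace=True)]
print(partners)
```

This only handles leading whitespace; if values can also have trailing spaces, strip() is still needed.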

Python: How do I delete periods occurring alone in a CSV file?

I have a bunch of CSV files. In some of them, missing data are represented by empty cells, but in others there is a period. I want to loop over all my files, open them, delete any periods that occur alone, and then save and close the file.
I've read a bunch of other questions about doing whole-word-only searches using re.sub(). That is what I want to do (delete . when it occurs alone but not the . in 3.5), but I can't get the syntax right for a whole-word-only search where the whole word is a special character ('.'). Also, I'm worried those answers might be a little different in the case where a whole word can be delimited by tabs and newlines too. That is, does \b work in my CSV file case?
UPDATE: Here is a function I wound up writing after seeing the help below. Maybe it will be useful to someone else.
import csv, re
def clean(infile, outfile, chars):
    '''
    Open a file, remove all specified special characters used to represent missing data, and save.\n\n
    infile:\tAn input file path\n
    outfile:\tAn output file path\n
    chars:\tA list of strings representing missing values to get rid of
    '''
    in_temp = open(infile)
    out_temp = open(outfile, 'wb')
    csvin = csv.reader(in_temp)
    csvout = csv.writer(out_temp)
    for row in csvin:
        row = re.split('\t', row[0])
        for colno, col in enumerate(row):
            for char in chars:
                if col.strip() == char:
                    row[colno] = ''
        csvout.writerow(row)
    in_temp.close()
    out_temp.close()

Something like this should do the trick... This data wouldn't happen to be coming out of SAS, would it? IIRC, SAS quite often uses '.' as the missing value for numeric fields.
import csv
with open('input.csv') as fin, open('output.csv', 'wb') as fout:
    csvin = csv.reader(fin)
    csvout = csv.writer(fout)
    for row in csvin:
        for colno, col in enumerate(row):
            if col.strip() == '.':
                row[colno] = ''
        csvout.writerow(row)

Why not just use the csv module?
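#!/usr/bin/env python
import csv
# somefile and newfile are the input and output paths, defined elsewhere
with open(somefile) as infile:
    r = csv.reader(infile)
    rows = []
    for row in r:  # iterate over the reader, not the csv module
        rows.append(['' if f == "." else f for f in row])
with open(newfile, 'w') as outfile:
    w = csv.writer(outfile)
    w.writerows(rows)  # csv writers provide writerows(), not writelines()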
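A Python 3 variation on the same idea, sketched as a generator so that several missing-value markers can be handled and rows are processed one at a time. The name blank_missing and the markers parameter are made up for illustration; io.StringIO stands in for the real files, which in Python 3 would be opened with newline=''.

```python
import csv
import io

def blank_missing(rows, markers=('.',)):
    """Yield rows with cells equal to a missing-value marker replaced by ''."""
    for row in rows:
        # strip() so that ' . ' is treated like '.', while '3.5' is left alone.
        yield ['' if cell.strip() in markers else cell for cell in row]

source = io.StringIO('a,.,3.5\n.,b, . \n')
cleaned = io.StringIO()
writer = csv.writer(cleaned, lineterminator='\n')
writer.writerows(blank_missing(csv.reader(source)))
print(cleaned.getvalue())
```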

The safest way would be to use the CSV module to process the file, then identify any fields that only contain ., delete those and write the new CSV file back to disk.
A brittle workaround would be to search and replace a dot that is not surrounded by alphanumerics: \B\.\B is the regex that would find those dots. But that might also find other dots like the middle dot in "...".
So, to find a dot that is surrounded by commas, you could search for (?<=,)\.(?=,).
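The comma lookarounds miss a lone dot in the first or last column of a line. A possible extension, sketched for Python 3, also accepts the start and end of a line as boundaries. It assumes the file has no quoted fields containing commas and no spaces around the dots; if either can occur, the csv module route is safer.

```python
import re

# A lone '.' cell: preceded by line start or a comma, followed by a comma or line end.
LONE_DOT = re.compile(r'(?m)(^|,)\.(?=,|$)')

text = '1,.,3.5\n.,2,.\n'
# Keep the captured boundary (empty string or comma) and drop the dot itself.
cleaned = LONE_DOT.sub(r'\1', text)
print(cleaned)
```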
