Download CSV directly into Python CSV parser - python

I'm trying to download CSV content from Morningstar and then parse it. If I pass the HTTP content directly to Python's CSV parser, the result is not split into rows correctly. Yet if I save the HTTP content to a file (/tmp/tmp.csv) and then open that file with Python's csv parser, the result is correct. In other words, why does:
import csv
import httplib2
def finDownload(code, report):
    h = httplib2.Http('.cache')
    url = 'http://financials.morningstar.com/ajax/ReportProcess4CSV.html?t=' + code + '&region=AUS&culture=en_us&reportType='+ report + '&period=12&dataType=A&order=asc&columnYear=5&rounding=1&view=raw&productCode=usa&denominatorView=raw&number=1'
    headers, data = h.request(url)
    return data
balancesheet = csv.reader(finDownload('FGE', 'is'))
for row in balancesheet:
    print row
return:
['F']
['o']
['r']
['g']
['e']
[' ']
['G']
['r']
['o']
['u']
(etc...)
instead of:
['Forge Group Limited (FGE) Income Statement']
?

The problem is that csv.reader iterates over whatever object it is given: iterating over a file yields one line at a time, whereas iterating over a string yields one character at a time, so every character is parsed as its own row.
You want StringIO/cStringIO (Python 2) or io.StringIO (Python 3, thanks to John Machin for pointing me to it) so a string can be treated as a file-like object:
Python 2:
import csv
import cStringIO
mystring = 'a,"b\nb",c\n1,2,3'
csvio = cStringIO.StringIO(mystring)
mycsv = csv.reader(csvio)
Python 3:
import csv
import io
mystring = 'a,"b\nb",c\n1,2,3'
csvio = io.StringIO(mystring, newline="")
mycsv = csv.reader(csvio)
Both will correctly preserve newlines inside quoted fields:
>>> for row in mycsv: print(row)
...
['a', 'b\nb', 'c']
['1', '2', '3']
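Applied to the download case, the same idea might look like the following Python 3 sketch. The helper name parse_csv_bytes, the UTF-8 encoding, and the stand-in payload are assumptions for illustration; in the question, the bytes would be the body returned by h.request(url).

```python
import csv
import io


def parse_csv_bytes(data, encoding="utf-8"):
    """Decode a downloaded payload and parse it as CSV rows.

    The encoding is an assumption; use whatever the server actually sends.
    """
    text = data.decode(encoding)
    # newline="" leaves line endings untouched so the csv module can
    # handle newlines embedded in quoted fields itself.
    return list(csv.reader(io.StringIO(text, newline="")))


# Stand-in payload (hypothetical); no network access is needed to try this.
payload = b'Forge Group Limited (FGE) Income Statement\r\na,"b\nb",c\r\n'
rows = parse_csv_bytes(payload)
print(rows)
```

Wrapping the decoded text in io.StringIO is what makes csv.reader see lines rather than single characters.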

Related

A function to read CSV data from a file into memory

I am trying to create a function that reads a CSV file into memory as lists. When I run my code, it gives me this error message: "string indices must be integers". Where am I going wrong?
Below is the code. Thanks for your help.
# create the empty set to carry the values of the columns
Hydropower_heading = []
Solar_heading = []
Wind_heading = []
Other_heading = []
def my_task1_file(filename): # defines the function "my_task1_file"
    with open(filename,'r') as myNew_file: # opens and read the file
        for my_file in myNew_file.readlines(): # loops through the file
            # read the values into the empty set created
            Hydropower_heading.append(my_file['Hydropower'])
            Solar_heading.append(my_file['Solar'])
            Wind_heading.append(my_file['Wind'])
            Other_heading.append(my_file['Other'])
            #Hydropower_heading = int(Hydropower)
            #Solar_heading = int(Solar)
            #Wind_heading = int(Wind)
            #Other_heading = int(Other)
my_task1_file('task1.csv') # calls the csv file into the function
# print the Heading and the column values in a row form
print('Hydropower: ', Hydropower_heading)
print('Solar: ', Solar_heading)
print('Wind: ', Wind_heading)
print('Other: ', Other_heading)
You can read a CSV file column by column using csv.DictReader, which turns each row into a dictionary keyed by the header names.
Code: (code.py)
import csv
def my_task1_file(filename): # defines the function "my_task1_file"
    Hydropower_heading = []
    Solar_heading = []
    Wind_heading = []
    Other_heading = []
    # the csv module expects files to be opened with newline=''
    with open(filename, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            # append each column value to its list
            Hydropower_heading.append(row['Hydropower'])
            Solar_heading.append(row['Solar'])
            Wind_heading.append(row['Wind'])
            Other_heading.append(row['Other'])
    return Hydropower_heading, Solar_heading, Wind_heading, Other_heading
if __name__ == "__main__":
    Hydropower_heading, Solar_heading, Wind_heading, Other_heading = my_task1_file('task1.csv')
    # print the Heading and the column values in a row form
    print('Hydropower: ', Hydropower_heading)
    print('Solar: ', Solar_heading)
    print('Wind: ', Wind_heading)
    print('Other: ', Other_heading)
task1.csv:
Hydropower,Solar,Wind,Other
5,6,3,8
6,8,5,12
3,6,9,7
Output:
Hydropower:  ['5', '6', '3']
Solar:  ['6', '8', '6']
Wind:  ['3', '5', '9']
Other:  ['8', '12', '7']
Explanation:
The if __name__ == "__main__": condition checks whether the file is being run directly. If it is run with python code.py, the block executes; if code.py is imported from another Python file, the block is skipped.
You can remove the __main__ block if you do not need it, as shown below. It is still good practice to keep it, so that importing the file from another module does not run the script code. Let me know if this clears up your confusion.
code.py (without __main__):
import csv
def my_task1_file(filename): # defines the function "my_task1_file"
    Hydropower_heading = []
    Solar_heading = []
    Wind_heading = []
    Other_heading = []
    # the csv module expects files to be opened with newline=''
    with open(filename, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            # append each column value to its list
            Hydropower_heading.append(row['Hydropower'])
            Solar_heading.append(row['Solar'])
            Wind_heading.append(row['Wind'])
            Other_heading.append(row['Other'])
    return Hydropower_heading, Solar_heading, Wind_heading, Other_heading
Hydropower_heading, Solar_heading, Wind_heading, Other_heading = my_task1_file('task1.csv')
print('Hydropower: ', Hydropower_heading)
print('Solar: ', Solar_heading)
print('Wind: ', Wind_heading)
print('Other: ', Other_heading)
References:
csv.DictReader method
__main__ documentation from Python website
Since the error is "string indices must be integers", you must be indexing a string with something that is not an integer. In this segment of your code...
for my_file in myNew_file.readlines():
    Hydropower_heading.append(my_file['Hydropower'])
    Solar_heading.append(my_file['Solar'])
    Wind_heading.append(my_file['Wind'])
    Other_heading.append(my_file['Other'])
... you are using "Hydropower", "Solar", "Wind", and "Other" as indices into my_file, which is a string, because readlines() returns each line of the file as a str. A string only accepts integers (or slices) as indices, so if you change these values to integers the error will no longer appear. Note, however, that an integer index only returns a single character of the line; to look values up by column name, each line has to be parsed first.
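The commented-out int() calls in the question suggest the values are wanted as numbers. A minimal Python 3 sketch of that, assuming every cell holds an integer, is shown here; the helper name read_columns is hypothetical, and the sample text is the task1.csv content from above, held in memory so the snippet runs on its own.

```python
import csv
import io


def read_columns(csvfile):
    """Collect each column of a CSV file-like object into a list of ints.

    Assumes a header row and that every value parses with int().
    """
    reader = csv.DictReader(csvfile)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(int(value))
    return columns


# Same content as task1.csv; pass open('task1.csv', newline='') for a real file.
sample = "Hydropower,Solar,Wind,Other\n5,6,3,8\n6,8,5,12\n3,6,9,7\n"
columns = read_columns(io.StringIO(sample, newline=""))
print(columns["Hydropower"])
```

Keeping the columns in one dictionary avoids four parallel variables and works for any header names.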

How to encode a paragraph for use in a CSV file in Python

I am completely new to Python and struggling to make a simple thing work.
I am reading a bunch of information from a Web service, parsing the results, and I want to write it out to a flat file. Most of my items are single-line items, but one of the things I get back from my Web service is a paragraph. The paragraph will contain newlines, quotes, and any random characters.
I was going to use Python's csv module, but I am unsure which parameters I should use and how to escape my string so that the paragraph gets put onto a single line and all characters are guaranteed to be properly escaped (especially the delimiter).
The default csv.writer setup should handle this properly. Here's a simple example:
import csv
myparagraph = """
this is a long paragraph, with "quotes" and stuff.
"""
# Python 2: open in binary mode; on Python 3 use open('foo.csv', 'w', newline='')
with open('foo.csv', 'wb') as f:
    mycsv = csv.writer(f)
    mycsv.writerow([myparagraph, 'word1'])
    mycsv.writerow(['word2', 'word3'])
This yields the following csv file:
"
this is a long paragraph, with ""quotes"" and stuff.
",word1
word2,word3
This should load into your favorite CSV tool with no problems, as having two rows and two columns.
You don't have to do anything special. The CSV module takes care of the quoting for you.
>>> import csv
>>> from StringIO import StringIO
>>> s = StringIO()
>>> w = csv.writer(s)
>>> w.writerow(['the\nquick\t\r\nbrown,fox\\', 32])
>>> s.getvalue()
'"the\nquick\t\r\nbrown,fox\\",32\r\n'
>>> s.seek(0)
>>> r = csv.reader(s)
>>> next(r)
['the\nquick\t\r\nbrown,fox\\', '32']
To help with setting your expectations, the following is executable pseudocode for how the quoting etc works in the de-facto standard CSV output:
>>> def csv_output_record(input_row):
...     delimiter = ','
...     q = '"' # quotechar
...     quotables = set([delimiter, '\r', '\n'])
...     return delimiter.join(
...         q + value.replace(q, q + q) + q if q in value
...         else q + value + q if any(c in quotables for c in value)
...         else value
...         for value in input_row
...     ) + '\r\n'
...
>>> csv_output_record(['foo', 'x,y,z', 'Jack "Ripper" Jones', 'top\nmid\nbot'])
'foo,"x,y,z","Jack ""Ripper"" Jones","top\nmid\nbot"\r\n'
Note that there is no escaping, only quoting, and hence if the quotechar appears in the field, it must be doubled.
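For Python 3, a round trip through an in-memory buffer is a quick way to check that a paragraph survives intact. This is only a sketch; the paragraph text is made up for illustration.

```python
import csv
import io

# Hypothetical paragraph containing a comma, quotes, and a newline.
paragraph = 'Line one, with "quotes".\nLine two; still the same field.'

# newline="" keeps the writer's own line endings untouched.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow([paragraph, "word1"])
writer.writerow(["word2", "word3"])

# Rewind and read the same data back.
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows[0][0] == paragraph)
```

The paragraph comes back as a single field because the writer quoted it and doubled the embedded quote characters.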

Python CSV reader isn't reading CSV data the way I expect

I'm trying to read some CSV data into an array. I can probably explain what I'm trying to do better in Python than in English:
>>> line = ImportFile.objects.all().reverse()[0].file.split("\n")[0]
>>> line
'"007147","John Smith","100 Farley Ln","","Berlin NH 03570","Berlin","NH",2450000,"John",24643203,3454,"E","",2345071,1201,"N",15465,"I",.00,20102456,945610,20247320,1245712,"0T",.00100000,"",.00,.00,780,"D","000",.00,0\r'
>>> s = cStringIO.StringIO()
>>> s
<cStringIO.StringO object at 0x9ab1960>
>>> s.write(line)
>>> r = csv.reader(s)
>>> r
<_csv.reader object at 0x9aa217c>
>>> [line for line in r]
[]
As you can see, the CSV data starts in memory, not in a file. I would expect my reader to have some of that data but it doesn't. What am I doing wrong?
You are using StringIO in the wrong way. Try
s = cStringIO.StringIO(line)
r = csv.reader(s)
next(r)
# "['007147', 'John Smith', '100 Farley Ln', '', 'Berlin NH 03570', 'Berlin', 'NH', '2450000', 'John', '24643203', '3454', 'E', '', '2345071', '1201', 'N', '15465', 'I', '.00', '20102456', '945610', '20247320', '1245712', '0T', '.00100000', '', '.00', '.00', '780', 'D', '000', '.00', '0']"
and the result should be what you expect.
Edit: To explain in more detail: After writing to the StringIO instance, the file pointer will point past the end of the contents. This is where you would expect new contents to be written by subsequent write() calls. But this also means that read() calls will not return anything. You would need to call s.reset() or s.seek(0) to reset the position to the beginning, or initialise the StringIO with the desired contents.
Add s.seek(0) after s.write(line). Current pointer in the file-like object s is just past the written line.
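The effect of the file position can be seen in a short Python 3 sketch using io.StringIO in place of cStringIO. The sample line is a shortened, made-up version of the one in the question.

```python
import csv
import io

# Shortened stand-in for the line in the question.
line = '"007147","John Smith","100 Farley Ln","",2450000,.00'

s = io.StringIO(newline="")
s.write(line)

# The position is now past the written text, so nothing is read.
before = list(csv.reader(s))

# Rewind to the start, then read again.
s.seek(0)
after = list(csv.reader(s))

print(before)
print(after)
```

Passing the text to the constructor, as in io.StringIO(line), leaves the position at the start and avoids the explicit seek.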

Sorting CSV in Python

I assumed sorting a CSV file on multiple text/numeric fields using Python would be a problem that was already solved. But I can't find any example code anywhere, except for specific code focusing on sorting date fields.
How would one go about sorting a relatively large CSV file (tens of thousand lines) on multiple fields, in order?
Python code samples would be appreciated.
Python's sort works in-memory only; however, tens of thousands of lines should fit in memory easily on a modern machine. So:
import csv
import operator
def sortcsvbymanyfields(csvfilename, themanyfieldscolumnnumbers):
    # 'rb'/'wb' are the Python 2 modes; on Python 3 use 'r'/'w' with newline=''
    with open(csvfilename, 'rb') as f:
        readit = csv.reader(f)
        thedata = list(readit)
    thedata.sort(key=operator.itemgetter(*themanyfieldscolumnnumbers))
    with open(csvfilename, 'wb') as f:
        writeit = csv.writer(f)
        writeit.writerows(thedata)
Here's Alex's answer, reworked to support column data types:
import csv
import operator
def sort_csv(csv_filename, types, sort_key_columns):
    """sort (and rewrite) a csv file.
    types: data types (conversion functions) for each column in the file
    sort_key_columns: column numbers of columns to sort by"""
    data = []
    with open(csv_filename, 'rb') as f:
        for row in csv.reader(f):
            data.append(convert(types, row))
    data.sort(key=operator.itemgetter(*sort_key_columns))
    with open(csv_filename, 'wb') as f:
        csv.writer(f).writerows(data)
Edit:
I did a stupid. I was playing with various things in IDLE and wrote a convert function a couple of days ago. I forgot I'd written it, and I haven't closed IDLE in a good long while - so when I wrote the above, I thought convert was a built-in function. Sadly no.
Here's my implementation, though John Machin's is nicer:
def convert(types, values):
    return [t(v) for t, v in zip(types, values)]
Usage:
import datetime
def date(s):
    return datetime.datetime.strptime(s, '%m/%d/%y')
>>> convert((int, date, str), ('1', '2/15/09', 'z'))
[1, datetime.datetime(2009, 2, 15, 0, 0), 'z']
Here's the convert() that's missing from Robert's fix of Alex's answer:
>>> def convert(convert_funcs, seq):
...     return [
...         item if func is None else func(item)
...         for func, item in zip(convert_funcs, seq)
...     ]
...
>>> convert(
... (None, float, lambda x: x.strip().lower()),
... [" text ", "123.45", " TEXT "]
... )
[' text ', 123.45, 'text']
>>>
I've changed the name of the first arg to highlight that the per-columns function can do what you need, not merely type-coercion. None is used to indicate no conversion.
You bring up 3 issues:
file size
csv data
sorting on multiple fields
Here is a solution for the third part. The split-based parsing below is only for illustration; real CSV data should be handled in a more sophisticated way, for example with the csv module.
>>> data = 'a,b,c\nb,b,a\nb,c,a\n'
>>> lines = [e.split(',') for e in data.strip().split('\n')]
>>> lines
[['a', 'b', 'c'], ['b', 'b', 'a'], ['b', 'c', 'a']]
>>> def f(e):
...     field_order = [2,1]
...     return [e[i] for i in field_order]
...
>>> sorted(lines, key=f)
[['b', 'b', 'a'], ['b', 'c', 'a'], ['a', 'b', 'c']]
Edited to use a list comprehension, generator does not work as I had expected it to.
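Putting the pieces from these answers together, a Python 3 sketch of a typed multi-column sort might look like this. It works on CSV text in memory so that it is self-contained; the function name sort_csv_text and the sample data are assumptions, and the input is assumed to have no header row.

```python
import csv
import io
import operator


def sort_csv_text(text, converters, sort_key_columns):
    """Return CSV text with rows sorted by the given column numbers.

    converters: one callable per column (None leaves the string as-is).
    Assumes there is no header row.
    """
    rows = [
        [value if conv is None else conv(value)
         for conv, value in zip(converters, row)]
        for row in csv.reader(io.StringIO(text, newline=""))
    ]
    # itemgetter builds the sort key from the chosen columns, in order.
    rows.sort(key=operator.itemgetter(*sort_key_columns))
    out = io.StringIO(newline="")
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Sort by column 0 (text), then column 1 (as a number, not a string).
data = "b,10,x\na,9,y\na,10,z\n"
result = sort_csv_text(data, (None, int, None), (0, 1))
print(result)
```

Converting column 1 with int matters here: compared as strings, '10' would sort before '9'.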

How can I merge fields in a CSV string using Python?

I am trying to merge three fields in each line of a CSV file using Python. This would be simple, except some of the fields are surrounded by double quotes and include commas. Here is an example:
,,Joe,Smith,New Haven,CT,"Moved from Portland, CT",,goo,
Is there a simple algorithm that could merge fields 7-9 for each line in this format? Not all lines include commas in double quotes.
Thanks.
Something like this?
import csv
source= csv.reader( open("some file","rb") )
dest= csv.writer( open("another file","wb") )
for row in source:
    result= row[:6] + [ row[6]+row[7]+row[8] ] + row[9:]
    dest.writerow( result )
Example
>>> data=''',,Joe,Smith,New Haven,CT,"Moved from Portland, CT",,goo,
... '''.splitlines()
>>> rdr= csv.reader( data )
>>> row= next(rdr)
>>> row
['', '', 'Joe', 'Smith', 'New Haven', 'CT', 'Moved from Portland, CT', '', 'goo', '']
>>> row[:6] + [ row[6]+row[7]+row[8] ] + row[9:]
['', '', 'Joe', 'Smith', 'New Haven', 'CT', 'Moved from Portland, CTgoo', '']
You can use the csv module to do the heavy lifting: http://docs.python.org/library/csv.html
You didn't say exactly how you wanted to merge the columns; presumably you don't want your merged field to be "Moved from Portland, CTgoo". The code below allows you to specify a separator string (maybe ", ") and handles empty/blank fields.
[transcript of session]
prompt>type merge.py
import csv
def merge_csv_cols(infile, outfile, startcol, numcols, sep=", "):
    reader = csv.reader(open(infile, "rb"))
    writer = csv.writer(open(outfile, "wb"))
    endcol = startcol + numcols
    for row in reader:
        merged = sep.join(x for x in row[startcol:endcol] if x.strip())
        row[startcol:endcol] = [merged]
        writer.writerow(row)
if __name__ == "__main__":
    import sys
    args = sys.argv[1:6]
    args[2:4] = map(int, args[2:4])
    merge_csv_cols(*args)
prompt>type input.csv
1,2,3,4,5,6,7,8,9,a,b,c
1,2,3,4,5,6,,,,a,b,c
1,2,3,4,5,6,7,8,,a,b,c
1,2,3,4,5,6,7,,9,a,b,c
prompt>\python26\python merge.py input.csv output.csv 6 3 ", "
prompt>type output.csv
1,2,3,4,5,6,"7, 8, 9",a,b,c
1,2,3,4,5,6,,a,b,c
1,2,3,4,5,6,"7, 8",a,b,c
1,2,3,4,5,6,"7, 9",a,b,c
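The same merge can be tried on the sample line from the question without touching any files. This Python 3 sketch reuses the joining logic from merge.py; the helper name merge_fields is hypothetical.

```python
import csv
import io


def merge_fields(row, startcol, numcols, sep=", "):
    """Return a copy of row with numcols fields from startcol joined into one.

    Empty or blank fields are skipped, as in merge.py above.
    """
    endcol = startcol + numcols
    merged = sep.join(x for x in row[startcol:endcol] if x.strip())
    return row[:startcol] + [merged] + row[endcol:]


# The example line from the question.
line = ',,Joe,Smith,New Haven,CT,"Moved from Portland, CT",,goo,\n'
row = next(csv.reader(io.StringIO(line, newline="")))
merged_row = merge_fields(row, 6, 3)

# Write the merged row back out; the writer re-quotes the merged field.
out = io.StringIO(newline="")
csv.writer(out).writerow(merged_row)
print(out.getvalue())
```

Because the csv module does the parsing, the comma inside the quoted field is never mistaken for a delimiter.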
There's a builtin module in Python for parsing CSV files:
http://docs.python.org/library/csv.html
You have tagged this question as 'database'. In fact, it might be easier to load the two files into separate tables of a database (you can use SQLite or any Python SQL library, such as sqlalchemy) and then join them.
That would give you some advantages later: you could query the tables with SQL syntax, and the data would be stored on disk instead of being kept in memory, so think about it.. :)
