Parsing CSV in Python 101 - python

I'm trying to understand/visualise the process of parsing a raw csv data file in Python from dataquest.io's training course.
I understand that rows = data.split('\n') splits the long string of csv file into rows based on where the line break is. ie:
day1, sunny, \n day2, rain \n
becomes
day1, sunny
day2, rain
I thought the for loop would further break the data into something like:
day 1
sunny
day 2
rain
Instead, the course says the result is actually a list of lists. I don't understand why that happens.
weather_data = []
f = open("la_weather.csv", 'r')
data = f.read()
rows = data.split('\n')
for row in rows:
    split_row = row.split(",")
    weather_data.append(split_row)

I'm ignoring the CSV stuff and concentrating just on your list misunderstanding. When you split the text of the file on line breaks, it becomes a list of strings. That is, rows becomes: ["day1, sunny","day2, rain"].
The for statement, applied to a list, iterates through the elements of that list. So, on the first time through row will be "day1, sunny", the second time through it will be "day2, rain", etc.
Inside each iteration of the for loop, it creates a new list, by splitting row at the commas into, eg, ["day1"," sunny"]. All of these lists are added to the weather_data list you created at the start. You end up with a list of lists, ie [['day1', ' sunny'], ['day2', ' rain']]. If you wanted ['day1', ' sunny', 'day2', ' rain'], you could do:
for row in rows:
    split_row = row.split(",")
    for ele in split_row:
        weather_data.append(ele)
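To see the difference concretely, here is a small sketch that builds both results from an in-memory string instead of la_weather.csv. The sample text is made up for illustration, and list.extend is used as a shorthand for the inner loop above.

```python
# Sample data standing in for the file contents (hypothetical values).
data = "day1,sunny\nday2,rain"

rows = data.split("\n")  # ['day1,sunny', 'day2,rain']

# append() adds each split row as ONE item, so the result is nested.
nested = []
for row in rows:
    nested.append(row.split(","))

# extend() adds each element of the split row separately, so the result is flat.
flat = []
for row in rows:
    flat.extend(row.split(","))

print(nested)  # [['day1', 'sunny'], ['day2', 'rain']]
print(flat)    # ['day1', 'sunny', 'day2', 'rain']
```

The nested form is usually the more useful one for tabular data, because nested[1][0] addresses "row 1, column 0" directly.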

That code does make it a list of lists.
As you say, the first split converts the data into a list, one element per line.
Then, for each line, the second split converts it into another list, one element per column.
And then the second list is appended, as a single item, to the weather_data list - which is now, as the instructions say, a list of lists.
Note that this code isn't very good - quite apart from the fact that you would always use the csv module, as others have pointed out, you would never do f.read() and then split the result. You would just do for line in f which automatically iterates over each row.
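As a rough sketch of the "for line in f" approach, the following iterates over a file-like object line by line. io.StringIO stands in for open("la_weather.csv") so the example runs on its own; the contents are invented.

```python
import io

# io.StringIO stands in for open("la_weather.csv"); the contents are made up.
f = io.StringIO("day1,sunny\nday2,rain\n")

weather_data = []
for line in f:                  # a file object yields one line at a time
    line = line.rstrip("\n")    # each line still carries its trailing newline
    if line:                    # skip blank lines
        weather_data.append(line.split(","))

print(weather_data)  # [['day1', 'sunny'], ['day2', 'rain']]
```

Unlike f.read().split('\n'), this never holds the whole file in memory as one string, and it does not produce a trailing empty row when the file ends with a newline.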

As a more Pythonic and flexible way of dealing with CSV files, you can use the csv module instead of reading the file as raw text:
import csv
# Python 3: open in text mode with newline='' as the csv docs recommend
with open("la_weather.csv", newline='') as f:
    spamreader = csv.reader(f, delimiter=',')
    for row in spamreader:
        pass  # do stuff with row
Here spamreader is a reader object, and looping over it gives you each row as a list of strings.
If you want all of the rows in a single list, you can just convert spamreader to a list:
with open("la_weather.csv", newline='') as f:
    spamreader = csv.reader(f, delimiter=',')
    print(list(spamreader))
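One practical reason to prefer the csv module over str.split is quoted fields. The sketch below compares the two on a made-up sample where one field contains a comma inside quotes; io.StringIO is used in place of a real file.

```python
import csv
import io

# Made-up sample: the second field of the first row has a comma inside quotes.
sample = io.StringIO('day1,"sunny, warm"\nday2,rain\n')

rows = list(csv.reader(sample))          # each row is a list of strings
naive = 'day1,"sunny, warm"'.split(",")  # plain split ignores the quoting

print(rows)   # [['day1', 'sunny, warm'], ['day2', 'rain']]
print(naive)  # ['day1', '"sunny', ' warm"']
```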

Related

From a file containing prime numbers to a list of integers on Python

In order to work out some asymptotic behavior related to the twin prime conjecture, I need to take a raw file (.csv or .txt) and convert its data into a Python list whose elements I can access by index.
That is, I have a big (~10 million numbers) list of prime numbers in a .csv file. Let's say that this is that list:
2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83
I am trying to produce the following
[2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83]
in order to examine, say, the third element in the list, which is 5.
The approach I am taking is the following:
import sys
import csv
# The csv file might contain very huge fields, therefore increase the field_size_limit:
csv.field_size_limit(sys.maxsize)
with open('primes1.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ')
    output = []
    for i in reader:
        output.append(i)
Then, if printing,
for rows in output:
    print(rows)
I am getting
['2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83'].
How does one resolve this? Thank you very much.
Maybe this:
with open("primes1.csv", "r") as f:
    lst = [int(i) for i in f.read().split(",")]
You don't need to use the csv reader for that (like the other answer showed) but if you want to, you could do it like this, reading just the first row.
Your code is iterating rows and adding them to the output list, but you need to iterate columns just in the first row. The next(reader) call returns just the first row.
with open('test.csv','r') as csvFile:
    reader = csv.reader(csvFile, delimiter=',')
    output = [int(i) for i in next(reader)]
    # alternate approach
    # output = [int(i) for i in csvFile.read().strip().split(',')]
print(output)
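If the primes might be spread over more than one line, a variation on the same idea is to flatten every row the reader yields. This is only a sketch: io.StringIO stands in for open('primes1.csv'), and the assumption that the file may span several lines is mine, not the asker's.

```python
import csv
import io

# Stand-in for open('primes1.csv'); assumed to hold comma-separated integers,
# possibly spread over several lines.
csvfile = io.StringIO("2,3,5,7,11\n13,17,19,23\n")

# Flatten all rows into one list of ints, skipping empty fields.
primes = [int(field) for row in csv.reader(csvfile) for field in row if field]

print(primes[2])  # 5 (the third element, since indexing starts at 0)
```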

All of my data from columns of one file go into one column in my output file. How to keep it the same?

I'm trying to delete some number of data rows from a file, essentially just because there are too many data points. I can easily print them to IDLE but when I try to write the lines to a file, all of the data from one row goes into one column. I'm definitely a noob but it seems like this should be "trivial"
I've tried it with writerow and writerows, zip(), with and without [], I've changed the delimiter and line terminator.
import csv
filename = "velocity_result.csv"
with open(filename, "r") as source:
    for i, line in enumerate(source):
        if i % 2 == 0:
            with open ("result.csv", "ab") as result:
                result_writer = csv.writer(result, quoting=csv.QUOTE_ALL, delimiter=',', lineterminator='\n')
                result_writer.writerow([line])
This is what happens:
input = |a|b|c|d| <row
|e|f|g|h|
output = |abcd|
<every other row deleted
(just one column)
My expectation is
input = |a|b|c|d| <row
|e|f|g|h|
output = |a|b|c|d|
<every other row deleted
Once you've read the line, it becomes a single item as far as Python is concerned. Sure, maybe it is a string which has comma separated values in it, but it is a single item still. So [line] is a list of 1 item, no matter how it is formatted.
If you want to make sure the line is recognized as a list of separate values, you need to make it such, perhaps with split:
result_writer.writerow(line.split('<input file delimiter here>'))
Now the line becomes a list of 4 items, so it makes sense for csv writer to write them as 4 separated values in the file.
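An alternative to splitting by hand is to let csv.reader parse the input, so each row already arrives as a list of fields. The sketch below keeps every other row; the in-memory buffers stand in for velocity_result.csv and result.csv, and the sample contents are invented.

```python
import csv
import io

# In-memory stand-ins for velocity_result.csv and result.csv (contents made up).
source = io.StringIO("a,b,c,d\ne,f,g,h\ni,j,k,l\n")
result = io.StringIO()

reader = csv.reader(source)
writer = csv.writer(result, lineterminator="\n")
for i, row in enumerate(reader):
    if i % 2 == 0:            # keep rows 0, 2, 4, ...
        writer.writerow(row)  # row is already a list of separate fields

print(result.getvalue())
```

With real files, the output file would be opened once, outside the loop, rather than reopened in append mode for every row.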

Iterating through CSV file in python to find titles with leading spaces

I'm working with a large csv file that contains songs and their ownership properties. Each song record is written top-down, with associated writer and publisher names below each title. So a given song may consist of, say, 4-6 rows, depending on how many writers/publishers control it (example with header row below):
Title,RoleType,Name,Shares,Note
BOOGIE BREAK 2,ASCAP,Total Current ASCAP Share,100,
BOOGIE BREAK 2,W,MERCADO JOSEPH M,,
BOOGIE BREAK 2,P,CRAFTIN MUSIC,,
BOOGIE BREAK 2,P,NEXT DIMENSION MUSIC,,
I'm currently trying to loop through the entire file to extract all of the song titles that contain leading spaces (e.g.,' song title'). Here's the code that I'm currently using:
import csv
import re
with open('output/sws.txt', 'w') as sws:
    with open('data/ascap_catalog1.csv', 'r') as ac:
        ascap = csv.reader(ac, delimiter=',')
        ascap = list(ascap)
        for row in ascap:
            for strings in row:
                if re.search('\A\s+', strings):
                    row = str(row)
                    sws.write(row)
                    sws.write('\n')
                else:
                    continue
Due to the size of the csv file that I'm working with (~2GB), it takes quite a bit of time to iterate through and produce a result file. However, based on the results that I've gotten, it appears the song titles with leading spaces are all clustered at the beginning of the file. Once those songs have all been listed, then normal songs w/o leading spaces appear.
Is there a way to make this code a bit more efficient, time-wise? I tried using a few breaks after every for and if statement, but depending on the number that I used, it either didn't affect the loop at all, or broke too quickly, not capturing any rows.
I also tried wrapping it in a function and implementing return, however, for some reason the code only seemed to iterate through the first row (not counting the header row, which I would skip).
Thanks so much for your time,
list(ascap) isn't doing you any favors. Reader objects are iterators over their contents, so they read rows one at a time instead of loading the whole file into memory. Just iterate over the reader object directly.
For each row, just check row[0][0].isspace(). That checks the first character of the first entry, which is all you need to determine whether something begins with whitespace.
with open('output/sws.txt', 'w', newline="") as sws:
    with open('data/ascap_catalog1.csv', 'r', newline="") as ac:
        ascap = csv.reader(ac, delimiter=',')
        for row in ascap:
            if row and row[0] and row[0][0].isspace():
                print(row, file=sws)
You could also play with your output, like saving all the rows you want to keep in a list before writing them at the end. It sounds like your input might be sorted, if all the leading whitespace names come first. If that's the case, you can just add else: break to skip the rest of the file.
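The else: break idea could look roughly like the sketch below. It assumes, as the question suggests, that all titles with leading whitespace come before the others. The sample rows are invented and io.StringIO stands in for the real catalog file.

```python
import csv
import io

# Made-up sample standing in for the catalog; assumes rows whose title has
# leading whitespace all come before the rest.
ac = io.StringIO(
    "Title,RoleType,Name,Shares,Note\n"
    " SONG A,W,WRITER ONE,,\n"
    " SONG B,P,PUBLISHER ONE,,\n"
    "BOOGIE BREAK 2,W,MERCADO JOSEPH M,,\n"
)

reader = csv.reader(ac)
next(reader)  # skip the header row, which would otherwise trigger the break
matches = []
for row in reader:
    if row and row[0] and row[0][0].isspace():
        matches.append(row)
    else:
        break  # first title without leading whitespace: stop reading

print([row[0] for row in matches])  # [' SONG A', ' SONG B']
```

Skipping the header matters here: "Title" has no leading space, so without next(reader) the loop would stop on the very first row.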
You can use a dictionary to find each song and group all of its associated values:
from collections import defaultdict
import csv, re
d = defaultdict(list)
count = 0 #count needed to remove the header, without loading the full data into memory
with open('filename.csv') as f:
    for a, *b in csv.reader(f):
        if count:
            if re.findall(r'^\s', a):
                d[a].append(b)
        count += 1
This one worked well for me and seems to be simple enough.
import csv
import re
with open('C:\\results.csv', 'w') as sws:
    with open('C:\\ascap.csv', 'r') as ac:
        ascap = csv.reader(ac, delimiter=',')
        for row in ascap:
            if re.match('\s+', row[0]):
                sws.write(str(row)+ '\n')
Here are some things you can improve:
Use the reader object as an iterator directly without creating an intermediate list. This will save you both computation time and memory.
Check only the first value in a row (which is a title), not all.
Remove an unnecessary else clause.
Combining all of this and applying some best practices you can do:
import csv
import re
with open('data/ascap_catalog1.csv') as ac, open('output/sws.txt', 'w') as sws:
    reader = csv.reader(ac)
    for row in reader:
        if re.search(r'\A\s+', row[0]):
            print(row, file=sws)
It appears the song titles with leading spaces are all clustered at
the beginning of the file.
In this case you can use itertools.takewhile to only iterate the file as long the titles have leading spaces:
import csv
import re
from itertools import takewhile
with open('data/ascap_catalog1.csv') as ac, open('output/sws.txt', 'w') as sws:
    reader = csv.reader(ac)
    next(reader) # skip the header
    for row in takewhile(lambda x: re.search(r'\A\s+', x[0]), reader):
        print(row, file=sws)

Python CSV library returning 1 item instead of a list of items

I'm trying to use the CSV library to do some Excel processing, but when I use the code posted below, row returns the entirety of the data as 1 item, so row[0] returns the entire row and row[1] raises an index out of range error. Is there a way to make each row a list with each cell being an item, making the final product a list of lists? I was thinking of using split every time there was a close bracket ']'. If needed I can post the Excel file.
Here's a sample of what some of the output looks like. Each line is a single item in its list:
['3600035671,"$13,668",8/11/2008,8/11/2013,,,2,4A,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,']
['3910435005,"$34,872",4/1/2010,10/8/2016,,,2,4A,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,']
['5720636344,"$1,726",8/30/2010,9/5/2011,,,3,6C,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,']
['15260473510,"-$1,026,580",7/22/2005,3/5/2008,,,6,1C2A,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,']
import csv
csvfile = open('Invictus.csv', 'rU')
data = csv.reader(csvfile, dialect=csv.excel_tab)
for char in data:
    char = filter(None, char)
    print char
Assuming you are giving examples of your data above the line import csv, it looks like your data is comma delimited but you are setting up your CSV reader to expect tab delimited data (dialect=csv.excel_tab).
What happens if you change that line to:
data = csv.reader(csvfile, dialect=csv.excel)
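The effect of the dialect can be checked on a single line. The sketch below parses a shortened version of the first sample row (the long run of trailing commas is dropped for readability) with both dialects, using io.StringIO in place of Invictus.csv.

```python
import csv
import io

# Shortened version of the first sample row from the question.
line = '3600035671,"$13,668",8/11/2008,8/11/2013\n'

# excel_tab splits on tabs; there are none, so the whole line is one field.
as_tab = next(csv.reader(io.StringIO(line), dialect=csv.excel_tab))

# excel splits on commas and honours the quotes around "$13,668".
as_comma = next(csv.reader(io.StringIO(line), dialect=csv.excel))

print(len(as_tab))  # 1
print(as_comma)     # ['3600035671', '$13,668', '8/11/2008', '8/11/2013']
```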

python: adding a zero if my value is less then 3 digits long

I have a csv file where I need to add a zero in front of a number if it's less than 4 digits long.
I only have to update a particular row:
import csv
f = open('csvpatpos.csv')
csv_f = csv.reader(f)
for row in csv_f:
    print row[5]
Then I want to parse through that row and add a 0 to the front of any number that is shorter than 4 digits, and then write the adjusted data to a new csv file.
You want to use string formatting for these things:
>>> '{:04}'.format(99)
'0099'
Format String Syntax documentation
When you think about parsing, you either need to think about regex or pyparsing. In this case, regex would perform the parsing quite easily.
But that's not all, once you are able to parse the numbers, you need to zero fill it. For that purpose, you need to use str.format for padding and justifying the string accordingly.
Consider your string
st = "parse through that row and add a 0 to the front of any number that is shorter than 4 digits."
With that string, you can do something like this:
Implementation
import re
parts = re.split(r"(\d+)", st)
''.join("{:>04}".format(elem) if elem.isdigit() else elem for elem in parts)
Output
'parse through that row and add a 0000 to the front of any number that is shorter than 0004 digits.'
The following code will read in the given csv file, iterate through each row and each item in each row, and output it to a new csv file.
import csv
import os
f = open('csvpatpos.csv')
# open temp .csv file for output
out = open('csvtemp.csv','w')
csv_f = csv.reader(f)
for row in csv_f:
    # create a temporary list for this row
    temp_row = []
    # iterate through all of the items in the row
    for item in row:
        # add the zero filled value of each temporary item to the list
        temp_row.append(item.zfill(4))
    # join the current temporary list with commas and write it to the out file
    out.write(','.join(temp_row) + '\n')
out.close()
f.close()
Your results will be in csvtemp.csv. If you want to save the data with the original filename, just add the following code to the end of the script
# remove original file
os.remove('csvpatpos.csv')
# rename temp file to original file name
os.rename('csvtemp.csv','csvpatpos.csv')
Pythonic Version
The code above is very verbose in order to make it understandable. Here is the code refactored to make it more Pythonic:
import csv
new_rows = []
with open('csvpatpos.csv', 'r', newline='') as f:
    csv_f = csv.reader(f)
    for row in csv_f:
        row = [ x.zfill(4) for x in row ]
        new_rows.append(row)
# Python 3: open csv files in text mode with newline=''
with open('csvpatpos.csv', 'w', newline='') as f:
    csv_f = csv.writer(f)
    csv_f.writerows(new_rows)
I'll leave you with two hints:
s = "486"
s.isdigit() == True
for finding what things are numbers.
And
s = "486"
s.zfill(4) == "0486"
for filling in zeroes.
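Putting the two hints together, a minimal sketch might pad only the column from the question (index 5) and only when the value is all digits. The helper name pad_field and the sample rows are hypothetical, and the in-memory buffers stand in for csvpatpos.csv and the output file.

```python
import csv
import io

def pad_field(value, width=4):
    """Zero-pad value to width if it is all digits; otherwise return it unchanged."""
    return value.zfill(width) if value.isdigit() else value

# In-memory stand-ins for csvpatpos.csv and the output file (contents made up).
source = io.StringIO("a,b,c,d,e,99\na,b,c,d,e,1234\na,b,c,d,e,n/a\n")
target = io.StringIO()

writer = csv.writer(target, lineterminator="\n")
for row in csv.reader(source):
    row[5] = pad_field(row[5])  # only the sixth column, as in the question
    writer.writerow(row)

print(target.getvalue())
```

Values that already have four or more digits, and non-numeric values such as "n/a", pass through untouched.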
