I'm working with a large csv file that contains songs and their ownership properties. Each song record is written top-down, with associated writer and publisher names below each title. So a given song may consist of, say, 4-6 rows, depending on how many writers/publishers control it (example with header row below):
Title,RoleType,Name,Shares,Note
BOOGIE BREAK 2,ASCAP,Total Current ASCAP Share,100,
BOOGIE BREAK 2,W,MERCADO JOSEPH M,,
BOOGIE BREAK 2,P,CRAFTIN MUSIC,,
BOOGIE BREAK 2,P,NEXT DIMENSION MUSIC,,
I'm currently trying to loop through the entire file to extract all of the song titles that contain leading spaces (e.g., ' song title'). Here's the code that I'm currently using:
import csv
import re

with open('output/sws.txt', 'w') as sws:
    with open('data/ascap_catalog1.csv', 'r') as ac:
        ascap = csv.reader(ac, delimiter=',')
        ascap = list(ascap)
        for row in ascap:
            for strings in row:
                if re.search('\A\s+', strings):
                    row = str(row)
                    sws.write(row)
                    sws.write('\n')
                else:
                    continue
Due to the size of the csv file that I'm working with (~2GB), it takes quite a bit of time to iterate through and produce a result file. However, based on the results that I've gotten, it appears the song titles with leading spaces are all clustered at the beginning of the file. Once those songs have all been listed, normal songs without leading spaces appear.
Is there a way to make this code a bit more efficient, time-wise? I tried adding a few breaks after every for and if statement, but depending on how many I used, it either didn't affect the result at all or broke too quickly, not capturing any rows.
I also tried wrapping it in a function and implementing return; however, for some reason the code only seemed to iterate through the first row (not counting the header row, which I would skip).
Thanks so much for your time,
list(ascap) isn't doing you any favors. reader objects are iterators over their contents; they don't load it all into memory until it's needed. Just iterate over the reader object directly.
For each row, just check row[0][0].isspace(). That checks the first character of the first entry, which is all you need to determine whether something begins with whitespace.
with open('output/sws.txt', 'w', newline="") as sws:
    with open('data/ascap_catalog1.csv', 'r', newline="") as ac:
        ascap = csv.reader(ac, delimiter=',')
        for row in ascap:
            if row and row[0] and row[0][0].isspace():
                print(row, file=sws)
You could also play with your output, like saving all the rows you want to keep in a list before writing them at the end. It sounds like your input might be sorted, if all the leading-whitespace names come first. If that's the case, you can just add else: break to skip the rest of the file, as sketched below.
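A minimal sketch of that early-exit version, assuming the input really is sorted that way (the header has to be skipped first, since it has no leading space):

import csv

with open('data/ascap_catalog1.csv', newline='') as ac, open('output/sws.txt', 'w') as sws:
    reader = csv.reader(ac)
    next(reader)  # skip the header row, which has no leading space
    for row in reader:
        if row and row[0] and row[0][0].isspace():
            print(row, file=sws)
        else:
            break  # sorted input: nothing after this point starts with whitespace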
You can use a dictionary to find each song and group all of its associated values:
from collections import defaultdict
import csv, re

d = defaultdict(list)
count = 0  # count needed to remove the header, without loading the full data into memory
with open('filename.csv') as f:
    for a, *b in csv.reader(f):
        if count:
            if re.findall(r'^\s', a):
                d[a].append(b)
        count += 1
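As a follow-up usage sketch (continuing from the d built above; output/grouped.txt is only an example path), the grouped rows could then be written back out per title:

with open('output/grouped.txt', 'w') as out:
    for title, rows in d.items():
        out.write(title + '\n')
        for fields in rows:
            out.write('    ' + ','.join(fields) + '\n')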
This one worked well for me and seems simple enough.
import csv
import re

with open('C:\\results.csv', 'w') as sws:
    with open('C:\\ascap.csv', 'r') as ac:
        ascap = csv.reader(ac, delimiter=',')
        for row in ascap:
            if re.match(r'\s+', row[0]):
                sws.write(str(row) + '\n')
Here are some things you can improve:
Use the reader object as an iterator directly without creating an intermediate list. This will save you both computation time and memory.
Check only the first value in a row (which is a title), not all.
Remove an unnecessary else clause.
Combining all of this and applying some best practices, you can do:
import csv
import re

with open('data/ascap_catalog1.csv') as ac, open('output/sws.txt', 'w') as sws:
    reader = csv.reader(ac)
    for row in reader:
        if re.search(r'\A\s+', row[0]):
            print(row, file=sws)
It appears the song titles with leading spaces are all clustered at the beginning of the file.
In this case you can use itertools.takewhile to only iterate the file as long as the titles have leading spaces:
import csv
import re
from itertools import takewhile

with open('data/ascap_catalog1.csv') as ac, open('output/sws.txt', 'w') as sws:
    reader = csv.reader(ac)
    next(reader)  # skip the header
    for row in takewhile(lambda x: re.search(r'\A\s+', x[0]), reader):
        print(row, file=sws)
I have several CSV files that do not have a header row but do have a variable number of free-text comment lines at the start. Free text meaning they can contain spaces, commas, and anything else you can think of.
All the comment lines begin with a #
After the comments, the CSV part is fixed at 4 columns, but is of variable length, up to 86400 lines.
I am trying to read the file, ignoring the comment lines, and put columns 1 and 2 from the remaining CSV into an array.
I have been trying to process it line by line and also with a DictReader filter, but not having much luck.
#Instrument: Mark2
#Type: Phase
#Start: 2000/03/29 19:33:43
#StartPC: 2013/01/15 17:31:06
#UTCOffset: 0:00
#Tau: 1
#MTIE1: 0000000010,0000000021,0000000033,0000000082,0000000168,0000000386,0000001007,0000001920,0000003720,0000007308,0000014526,0000028941,0000000000
#MTIE2: 0000000002,0000000006,0000000013,0000000047,0000000116,0000000339,0000001001,0000001985,0000003954,0000007824,0000015200,0000029773,0000000000
954358423.902,-315,-363,0000
954358424.902,-315,-363,0000
954358425.902,-319,-363,0000
954358426.902,-319,-363,0000
954358427.902,-317,-363,0000
954358428.902,-318,-363,0000
954358429.902,-320,-363,0000
954358430.902,-321,-362,0000
954358431.902,-324,-363,0000
954358432.902,-326,-363,0000
954358433.902,-329,-363,0000
954358434.902,-332,-362,0000
954358435.902,-331,-363,0000
954358436.902,-331,-363,0000
954358437.902,-336,-363,0000
954358438.902,-336,-363,0000
954358439.903,-334,-363,0000
954358440.903,-336,-363,0000
My most successful code so far is shown after the sample output below; however, it returns more data per line than I expect. I am also not sure how to read it into an array after that.
{'954358423.902,-315,-363,0000': '954418183.599,-60158,-60125,0000'}
{'954358423.902,-315,-363,0000': '954418184.599,-60158,-60126,0000'}
{'954358423.902,-315,-363,0000': '954418185.599,-60156,-60127,0000'}
{'954358423.902,-315,-363,0000': '954418186.599,-60157,-60128,0000'}
from itertools import dropwhile
import csv

fname = "file1.csv"
with open(fname) as fin:
    start = dropwhile(lambda L: L.lower().lstrip().startswith('#'), fin)
    for row in csv.DictReader(start, delimiter='\t'):
        print row
Your code is basically fine but for some reason you are using a DictReader when you should be using a reader. DictReaders expect there to be a header
line and you don't have one.
However, you then have the problem that you have specified "\t" as a delimiter whereas your file is delimited with ",". Fixing that as well makes it all work:
from itertools import dropwhile
import csv

fname = "file1.csv"
with open(fname) as fin:
    start = dropwhile(lambda L: L.lower().lstrip().startswith('#'), fin)
    for row in csv.reader(start, delimiter=','):
        print row
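To then get columns 1 and 2 into arrays, one possible sketch (it assumes every data row has at least two fields and that both are numeric, as in the sample above):

from itertools import dropwhile
import csv

col1, col2 = [], []
with open("file1.csv") as fin:
    # skip the leading '#' comment lines, then parse the rest as CSV
    start = dropwhile(lambda L: L.lstrip().startswith('#'), fin)
    for row in csv.reader(start, delimiter=','):
        if len(row) >= 2:
            col1.append(float(row[0]))  # e.g. 954358423.902
            col2.append(float(row[1]))  # e.g. -315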
Here is a section of the log file I want to parse:
And here is the code I am writing:
import csv

with open('Coplanarity_PedCheck.log', 'rt') as tsvin, open('YX2.csv', 'wt') as csvout:
    read_tsvin = csv.reader(tsvin, delimiter='\t')
    for row in read_tsvin:
        print(row)
        filters = row[0]
        if "#Log File Initialized!" in filters:
            print(row)
            datetime = row[0]
            print("looklook", datetime[23:46])
            csvout.write(datetime[23:46] + ",")
            BS = row[0]
            print("looklook", BS[17:21])
            csvout.write(datetime[17:21] + ",")
    csvout.write("\n")

csvout.close()
I need to get the date and time information from row 1, then get "Left" from row 2, and then I need to skip section 4. How should I do it?
Since csv.reader makes row 1 a list with only one element, I converted it to a string again to split out the datetime info I need, but I think that is not efficient.
I did the same thing for row 2; then I want to skip rows 3-6, but I don't know how.
Also, csv.reader converts my float data into text; how can I convert it back before I write it into another file?
You are going to want to learn to use regular expressions.
For example, you could do something like this:
import csv
import re

with open('Coplanarity_PedCheck.log', 'rt') as tsvin, open('YX2.csv', 'wt') as csvout:
    read_tsvin = csv.reader(tsvin, delimiter='\t')

    # Get the first line and use the first field
    header = next(read_tsvin)[0]
    m = re.search(r'\[([0-9: -]+)\]', header)
    datetime = m.group(1)
    csvout.write(datetime + ',')

    # Find whether 'Left' is in line 2
    direction = next(read_tsvin)[0]
    m = re.search('Left', direction)
    if m:
        # If a match is found, write the matched text
        csvout.write(m.group(0))
    csvout.write('\n')

    # Skip lines 3-6
    for i in range(4):
        next(read_tsvin)

    # Loop over the rest of the rows
    for row in read_tsvin:
        # Parse the data
        pass
Modifying this to look for a line containing '#Log File Initialized!' rather than hard coding for the first line would be fairly simple using regular expressions. Take a look at the regular expression documentation
This probably isn't exactly what you want to do, but rather a suggestion for a good starting point.
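On the separate question about floats: the csv module always hands back strings, so there is no automatic conversion; you call float() on the fields you know are numeric before writing them out. A rough fragment that could stand in for the "# Parse the data" placeholder above (the column positions and output format here are assumptions):

# Fragment only: would replace the final "# Parse the data" loop above.
for row in read_tsvin:
    try:
        values = [float(x) for x in row[1:]]  # csv gives strings; convert back explicitly
    except ValueError:
        continue  # skip rows that are not purely numeric
    csvout.write(','.join('{:.3f}'.format(v) for v in values) + '\n')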
I have a csv file that needs a zero added in front of the number if it's less than 4 digits.
I only have to update a particular row:
import csv

f = open('csvpatpos.csv')
csv_f = csv.reader(f)
for row in csv_f:
    print row[5]
Then I want to parse through that row, add a 0 to the front of any number that is shorter than 4 digits, and write it into a new csv file with the adjusted data.
You want to use string formatting for these things:
>>> '{:04}'.format(99)
'0099'
Format String Syntax documentation
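Applied to your loop, a minimal sketch might look like this (padding only field 5 and writing to a hypothetical output file; it assumes that field is purely numeric):

import csv

# Sketch: 'csvpatpos_out.csv' is a made-up output name.
with open('csvpatpos.csv') as f, open('csvpatpos_out.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for row in csv.reader(f):
        if row[5].isdigit():
            row[5] = '{:04}'.format(int(row[5]))  # pad numbers shorter than 4 digits
        writer.writerow(row)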
When you think about parsing, you either need to think about regex or pyparsing. In this case, regex performs the parsing quite easily.
But that's not all: once you are able to parse the numbers, you need to zero-fill them. For that purpose, you can use str.format for padding and justifying the string accordingly.
Consider your string
st = "parse through that row and add a 0 to the front of any number that is shorter than 4 digits."
Given the above, you can do something like
Implementation
import re
parts = re.split(r"(\d{0,3})", st)
''.join("{:>04}".format(elem) if elem.isdigit() else elem for elem in parts)
Output
'parse through that row and add a 0000 to the front of any number that is shorter than 0004 digits.'
The following code will read in the given csv file, iterate through each row and each item in each row, and output it to a new csv file.
import csv
import os

f = open('csvpatpos.csv')
# open temp .csv file for output
out = open('csvtemp.csv', 'w')
csv_f = csv.reader(f)

for row in csv_f:
    # create a temporary list for this row
    temp_row = []
    # iterate through all of the items in the row
    for item in row:
        # add the zero filled value of each item to the list
        temp_row.append(item.zfill(4))
    # join the current temporary list with commas and write it to the out file
    out.write(','.join(temp_row) + '\n')

out.close()
f.close()
Your results will be in csvtemp.csv. If you want to save the data with the original filename, just add the following code to the end of the script
# remove original file
os.remove('csvpatpos.csv')
# rename temp file to original file name
os.rename('csvtemp.csv','csvpatpos.csv')
Pythonic Version
The code above is very verbose in order to make it understandable. Here is the code refactored to make it more Pythonic:
import csv

new_rows = []
with open('csvpatpos.csv', 'r') as f:
    csv_f = csv.reader(f)
    for row in csv_f:
        row = [x.zfill(4) for x in row]
        new_rows.append(row)

with open('csvpatpos.csv', 'wb') as f:
    csv_f = csv.writer(f)
    csv_f.writerows(new_rows)
I'll leave you with two hints:
s = "486"
s.isdigit() == True
for finding what things are numbers.
And
s = "486"
s.zfill(4) == "0486"
for filling in zeroes.
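Put together, those two hints might be used like this (a sketch that pads every purely numeric field and writes a hypothetical new file):

import csv

# Sketch: 'csvpatpos_padded.csv' is just an example output name.
with open('csvpatpos.csv') as fin, open('csvpatpos_padded.csv', 'w', newline='') as fout:
    writer = csv.writer(fout)
    for row in csv.reader(fin):
        writer.writerow([x.zfill(4) if x.isdigit() else x for x in row])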
I want to parse a csv file which is in the following format:
Test Environment INFO for 1 line.
Test,TestName1,
TestAttribute1-1,TestAttribute1-2,TestAttribute1-3
TestAttributeValue1-1,TestAttributeValue1-2,TestAttributeValue1-3
Test,TestName2,
TestAttribute2-1,TestAttribute2-2,TestAttribute2-3
TestAttributeValue2-1,TestAttributeValue2-2,TestAttributeValue2-3
Test,TestName3,
TestAttribute3-1,TestAttribute3-2,TestAttribute3-3
TestAttributeValue3-1,TestAttributeValue3-2,TestAttributeValue3-3
Test,TestName4,
TestAttribute4-1,TestAttribute4-2,TestAttribute4-3
TestAttributeValue4-1-1,TestAttributeValue4-1-2,TestAttributeValue4-1-3
TestAttributeValue4-2-1,TestAttributeValue4-2-2,TestAttributeValue4-2-3
TestAttributeValue4-3-1,TestAttributeValue4-3-2,TestAttributeValue4-3-3
and would like to turn this into tab-separated format like the following:
TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3
TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3
TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3
TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3
The number of TestAttributes varies from test to test. For some tests there are only 3 values, for others 7, etc. Also, as in the TestName4 example, some tests are executed more than once, and each execution has its own TestAttributeValue line (in the example, TestName4 is executed 3 times, hence we have 3 value lines).
I am new to Python and do not have much knowledge, but I would like to parse the csv file with Python. I checked Python's csv library and could not decide whether it will be enough for me or whether I should write my own string parser. Could you please help me?
Best
I'd go with a solution based on the itertools.groupby function and the csv module. Please have a close look at the documentation of itertools -- you can use it more often than you think!
I've used blank lines to differentiate the datasets, and this approach uses lazy evaluation, storing only one dataset in memory at a time:
import csv
from itertools import groupby

with open('my_data.csv') as ifile, open('my_out_data.csv', 'wb') as ofile:
    # Use the csv module to handle reading and writing of delimited files.
    reader = csv.reader(ifile)
    writer = csv.writer(ofile, delimiter='\t')

    # Skip the info line
    next(reader)

    # Group datasets by whether len(row) > 0 or not, then filter
    # out all empty lines
    for group in (v for k, v in groupby(reader, lambda x: bool(len(x))) if k):
        test_data = list(group)
        # Write header
        writer.writerow([test_data[0][1]])
        # Write transposed data
        writer.writerows(zip(*test_data[1:]))
        # Write blank line
        writer.writerow([])
Output, given that the supplied data is stored in my_data.csv:
TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3
TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3
TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3
TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3
The following does what you want, and only reads up to one section at a time (saves memory for a large file). Replace in_path and out_path with the input and output file paths respectively:
import csv

def print_section(section, f_out):
    if len(section) > 0:
        # find maximum column length
        max_len = max([len(col) for col in section])
        # build and print each row
        for i in xrange(max_len):
            f_out.write('\t'.join([col[i] if len(col) > i else '' for col in section]) + '\n')
        f_out.write('\n')

with open(in_path, 'r') as f_csv, open(out_path, 'w') as f_out:
    f_in = csv.reader(f_csv)
    line = f_in.next()  # skip the info line
    section = []
    for line in f_in:
        # test for a new "Test" section
        if len(line) == 3 and line[0] == 'Test' and line[2] == '':
            # write previous section data
            print_section(section, f_out)
            # reset section
            section = []
            # write new section header
            f_out.write(line[1] + '\n')
        else:
            # add line to section
            section.append(line)
    # print the last section
    print_section(section, f_out)
Note that you'll want to change 'Test' in the line[0] == 'Test' statement to the correct word for indicating the header line.
The basic idea here is that we read each section into a list of lists, then write that list of lists back out using a list comprehension to transpose it (as well as adding in blank elements when the columns are uneven).
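As a tiny illustration of the transpose step (made-up rows, and assuming the columns are the same length), zip simply flips rows and columns:

# Made-up rows: an attribute row followed by one value row.
rows = [['TestAttribute1-1', 'TestAttribute1-2'],
        ['TestAttributeValue1-1', 'TestAttributeValue1-2']]
for pair in zip(*rows):
    print('\t'.join(pair))
# TestAttribute1-1    TestAttributeValue1-1
# TestAttribute1-2    TestAttributeValue1-2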
I have a bunch of CSV files. In some of them, missing data are represented by empty cells, but in others there is a period. I want to loop over all my files, open them, delete any periods that occur alone, and then save and close the file.
I've read a bunch of other questions about doing whole-word-only searches using re.sub(). That is what I want to do (delete . when it occurs alone but not the . in 3.5), but I can't get the syntax right for a whole-word-only search where the whole word is a special character ('.'). Also, I'm worried those answers might work a little differently in my case, where a whole word can be delimited by tabs and newlines too. That is, does \b work in my CSV file case?
UPDATE: Here is a function I wound up writing after seeing the help below. Maybe it will be useful to someone else.
import csv, re

def clean(infile, outfile, chars):
    '''
    Open a file, remove all specified special characters used to represent missing data, and save.

    infile:  An input file path
    outfile: An output file path
    chars:   A list of strings representing missing values to get rid of
    '''
    in_temp = open(infile)
    out_temp = open(outfile, 'wb')
    csvin = csv.reader(in_temp)
    csvout = csv.writer(out_temp)
    for row in csvin:
        row = re.split('\t', row[0])
        for colno, col in enumerate(row):
            for char in chars:
                if col.strip() == char:
                    row[colno] = ''
        csvout.writerow(row)
    in_temp.close()
    out_temp.close()
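For example, with hypothetical paths and markers:

# Hypothetical file names; '.' and 'NA' are example missing-value markers.
clean('data/raw_survey.csv', 'data/clean_survey.csv', ['.', 'NA'])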
Something like this should do the trick... This data wouldn't happen to be coming out of SAS, would it? IIRC, SAS quite often uses '.' as missing for numeric values.
import csv

with open('input.csv') as fin, open('output.csv', 'wb') as fout:
    csvin = csv.reader(fin)
    csvout = csv.writer(fout)
    for row in csvin:
        for colno, col in enumerate(row):
            if col.strip() == '.':
                row[colno] = ''
        csvout.writerow(row)
Why not just use the csv module?
#!/usr/bin/env python
import csv

with open(somefile) as infile:
    r = csv.reader(infile)
    rows = []
    for row in r:
        rows.append(['' if f == "." else f for f in row])

with open(newfile, 'w') as outfile:
    w = csv.writer(outfile)
    w.writerows(rows)
The safest way would be to use the CSV module to process the file, then identify any fields that only contain ., delete those and write the new CSV file back to disk.
A brittle workaround would be to search and replace a dot that is not surrounded by alphanumerics: \B\.\B is the regex that would find those dots. But that might also find other dots like the middle dot in "...".
So, to find a dot that is surrounded by commas, you could search for (?<=,)\.(?=,).
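A small sketch of that last idea (it only catches a dot sitting between two commas, not one at the start or end of a line):

import re

# Sketch: removes a lone '.' field only when it sits between two commas.
line = '3.5,.,7,.,2.1'
cleaned = re.sub(r'(?<=,)\.(?=,)', '', line)
print(cleaned)  # 3.5,,7,,2.1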