Python: How do I delete periods occurring alone in a CSV file?

I have a bunch of CSV files. In some of them, missing data are represented by empty cells, but in others there is a period. I want to loop over all my files, open them, delete any periods that occur alone, and then save and close the file.
I've read a bunch of other questions about doing whole-word-only searches using re.sub(). That is what I want to do (delete . when it occurs alone, but not the . in 3.5), but I can't get the syntax right for a whole-word-only search where the whole word is a special character ('.'). Also, I'm worried those answers might not quite apply when a whole word can be delimited by tabs and newlines too. That is, does \b work in my CSV file case?
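For what it's worth, a quick throwaway check (not from any answer below) shows that \b is actually the wrong tool here, because \b only exists at a transition between a word character and a non-word character:

```python
import re

# A lone '.' between commas has non-word characters on both sides, so
# \b\.\b never matches it -- but it DOES match the dot inside 3.5,
# which is exactly the dot we want to keep:
print(re.search(r'\b\.\b', 'a,.,b'))  # None: the lone dot has no word boundary
print(re.search(r'\b\.\b', '3.5'))    # matches!
```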
UPDATE: Here is a function I wound up writing after seeing the help below. Maybe it will be useful to someone else.
import csv, re

def clean(infile, outfile, chars):
    '''
    Open a file, remove all specified special characters used to
    represent missing data, and save.

    infile:  an input file path
    outfile: an output file path
    chars:   a list of strings representing missing values to get rid of
    '''
    in_temp = open(infile)
    out_temp = open(outfile, 'wb')
    csvin = csv.reader(in_temp)
    csvout = csv.writer(out_temp)
    for row in csvin:
        # the input is tab-delimited, so the default (comma) reader
        # returns each line as a single field that still needs splitting
        row = re.split('\t', row[0])
        for colno, col in enumerate(row):
            for char in chars:
                if col.strip() == char:
                    row[colno] = ''
        csvout.writerow(row)
    in_temp.close()
    out_temp.close()

Something like this should do the trick... This data wouldn't happen to be coming out of SAS, would it? IIRC, SAS quite often used '.' as the missing value for numeric variables.
import csv

with open('input.csv') as fin, open('output.csv', 'wb') as fout:
    csvin = csv.reader(fin)
    csvout = csv.writer(fout)
    for row in csvin:
        for colno, col in enumerate(row):
            if col.strip() == '.':
                row[colno] = ''
        csvout.writerow(row)

Why not just use the csv module?
#!/usr/bin/env python
import csv

with open(somefile) as infile:
    r = csv.reader(infile)
    rows = []
    for row in r:  # iterate over the reader object, not the csv module
        rows.append(['' if f == "." else f for f in row])

with open(newfile, 'w') as outfile:
    w = csv.writer(outfile)
    w.writerows(rows)  # csv writers have writerows(), not writelines()

The safest way would be to use the CSV module to process the file, then identify any fields that only contain ., delete those and write the new CSV file back to disk.
A brittle workaround would be to search and replace a dot that is not adjacent to alphanumerics: \B\.\B is a regex that finds those dots. But it would also match other dots, such as the middle dot in "...".
So, to find a dot that is surrounded by commas, you could search for (?<=,)\.(?=,).
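As a sketch (the sample line is invented), that lookaround pattern can be applied with re.sub(); note it only catches dots with a comma on both sides, so a lone dot at the start or end of a line would need extra alternatives:

```python
import re

line = 'id,.,3.5,.'
# Remove a dot only when it sits between two commas. The trailing lone
# dot is NOT matched because no comma follows it.
cleaned = re.sub(r'(?<=,)\.(?=,)', '', line)
print(cleaned)  # id,,3.5,.
```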

Related

Iterating through CSV file in python to find titles with leading spaces

I'm working with a large csv file that contains songs and their ownership properties. Each song record is written top-down, with associated writer and publisher names below each title. So a given song may comprise, say, 4-6 rows, depending on how many writers/publishers control it (example with header row below):
Title,RoleType,Name,Shares,Note
BOOGIE BREAK 2,ASCAP,Total Current ASCAP Share,100,
BOOGIE BREAK 2,W,MERCADO JOSEPH M,,
BOOGIE BREAK 2,P,CRAFTIN MUSIC,,
BOOGIE BREAK 2,P,NEXT DIMENSION MUSIC,,
I'm currently trying to loop through the entire file to extract all of the song titles that contain leading spaces (e.g., ' song title'). Here's the code that I'm currently using:
import csv
import re

with open('output/sws.txt', 'w') as sws:
    with open('data/ascap_catalog1.csv', 'r') as ac:
        ascap = csv.reader(ac, delimiter=',')
        ascap = list(ascap)
        for row in ascap:
            for strings in row:
                if re.search(r'\A\s+', strings):
                    row = str(row)
                    sws.write(row)
                    sws.write('\n')
                else:
                    continue
Due to the size of the csv file I'm working with (~2GB), it takes quite a bit of time to iterate through and produce a result file. However, based on the results I've gotten, it appears the song titles with leading spaces are all clustered at the beginning of the file. Once those songs have all been listed, normal songs without leading spaces appear.
Is there a way to make this code a bit more efficient, time-wise? I tried using a few breaks after every for and if statement, but depending on how many I used, it either didn't affect the result at all or broke too quickly, not capturing any rows.
I also tried wrapping it in a function and implementing return, however, for some reason the code only seemed to iterate through the first row (not counting the header row, which I would skip).
Thanks so much for your time,
list(ascap) isn't doing you any favors. reader objects are iterators over their contents, but they don't load it all into memory until it's needed. Just iterate over the reader object directly.
For each row, just check row[0][0].isspace(). That checks the first character of the first entry, which is all you need to determine whether something begins with whitespace.
import csv

with open('output/sws.txt', 'w', newline="") as sws:
    with open('data/ascap_catalog1.csv', 'r', newline="") as ac:
        ascap = csv.reader(ac, delimiter=',')
        for row in ascap:
            if row and row[0] and row[0][0].isspace():
                print(row, file=sws)
You could also play with your output, like saving all the rows you want to keep in a list before writing them at the end. It sounds like your input might be sorted, if all the leading whitespace names come first. If that's the case, you can just add else: break to skip the rest of the file.
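A minimal sketch of that else: break idea, using an in-memory sample (io.StringIO stands in for the real file) under the stated assumption that the input really is sorted:

```python
import csv
import io

# Invented sorted input: leading-space titles first, then normal titles.
data = io.StringIO(' FIRST SONG,W,X\n SECOND SONG,P,Y\nTHIRD SONG,W,Z\n')
matches = []
for row in csv.reader(data):
    if row and row[0] and row[0][0].isspace():
        matches.append(row)
    else:
        break  # sorted input: once a normal title appears, we can stop reading
print(matches)  # [[' FIRST SONG', 'W', 'X'], [' SECOND SONG', 'P', 'Y']]
```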
You can use a dictionary to find each song and group all of its associated values:
from collections import defaultdict
import csv, re

d = defaultdict(list)
count = 0  # count needed to skip the header without loading the full data into memory
with open('filename.csv') as f:
    for a, *b in csv.reader(f):
        if count:
            if re.findall(r'^\s', a):
                d[a].append(b)
        count += 1
This one worked well for me and seems simple enough.
import csv
import re

with open('C:\\results.csv', 'w') as sws:
    with open('C:\\ascap.csv', 'r') as ac:
        ascap = csv.reader(ac, delimiter=',')
        for row in ascap:
            if re.match(r'\s+', row[0]):
                sws.write(str(row) + '\n')
Here are some things you can improve:
Use the reader object as an iterator directly without creating an intermediate list. This will save you both computation time and memory.
Check only the first value in a row (which is a title), not all.
Remove an unnecessary else clause.
Combining all of this and applying some best practices you can do:
import csv
import re

with open('data/ascap_catalog1.csv') as ac, open('output/sws.txt', 'w') as sws:
    reader = csv.reader(ac)
    for row in reader:
        if re.search(r'\A\s+', row[0]):
            print(row, file=sws)
It appears the song titles with leading spaces are all clustered at the beginning of the file.
In this case you can use itertools.takewhile to only iterate the file as long the titles have leading spaces:
import csv
import re
from itertools import takewhile

with open('data/ascap_catalog1.csv') as ac, open('output/sws.txt', 'w') as sws:
    reader = csv.reader(ac)
    next(reader)  # skip the header
    for row in takewhile(lambda x: re.search(r'\A\s+', x[0]), reader):
        print(row, file=sws)

How can I use csv tools with a zipped text file?

Update: my file.txt.gz is tab delimited and looks kind of like this:
file.txt.zp
I want to split the first col by : _ /
original post:
I have a very large zipped tab delimited file.
I want to open it, scan it one row at a time, split some of the columns, and write it to a new file.
I got various errors (every time I fix one, another pops up).
This is my code:
import csv
import re
import gzip

f = gzip.open('file.txt.gz')
original = f.readlines()
f.close()
original_l = csv.reader(original)
for row in original_l:
    file_l = re.split('_|:|/', row)
    with open('newfile.gz', 'w', newline='') as final:
        finalfile = csv.writer(final, delimiter=' ')
        finalfile.writerow(file_l)
Thanks!
for this code i got the error:
for row in original_l:
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
so based on what I found here I added this after f.close():
original = original.decode('utf8')
and then got the error:
original = original.decode('utf8')
AttributeError: 'list' object has no attribute 'decode'
Update 2
This code should produce the output that you're after.
import csv
import gzip
import re

with gzip.open('file.txt.gz', mode='rt') as f, \
        open('newfile.gz', 'w') as final:
    writer = csv.writer(final, delimiter=' ')
    reader = csv.reader(f, delimiter='\t')
    _ = next(reader)  # skip header row
    for row in reader:
        writer.writerow(re.split(r'_|:|/', row[0]))
Update
Open the gzip file in text mode because str objects are required by the CSV module in Python 3.
f = gzip.open('file.txt.gz', 'rt')
Also specify the delimiter when creating the csv.reader.
original_l = csv.reader(original, delimiter='\t')
This will get you past the first hurdle.
Now you need to explain what the data is, which columns you wish to extract, and what the output should look like.
Original answer follows...
One obvious problem is that the output file is constantly being overwritten by the next row of input. This is because the output file is opened in (over)write mode ('w') once per row.
It would be better to open the output file once outside of the loop.
Also, the CSV file delimiter is not specified when creating the reader. You said that the file is tab delimited so specify that:
original_l = csv.reader(original, delimiter='\t')
On the other hand, your code attempts to split each row using other delimiters; however, the rows coming from the csv.reader are represented as a list, not the string that the re.split() code would require.
Another problem is that the output file is not zipped as the name suggests.
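To actually compress the output, gzip.open() works for writing as well; a hedged sketch (the file name and rows are invented stand-ins for the processed data):

```python
import csv
import gzip

rows = [['a', 'b'], ['c', 'd']]  # stand-in for the processed rows
# 'wt' opens the gzip stream in text mode, which is what csv.writer needs.
with gzip.open('newfile.txt.gz', 'wt', newline='') as final:
    writer = csv.writer(final, delimiter=' ')
    writer.writerows(rows)
```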

Python: Read csv file with an arbitrary number of tabs as delimiter

I have my csv file formatted with all columns nicely aligned by using one or more tabs in between different values.
I know it is possible to use a single tab as delimiter with csv.register_dialect("tab_delimiter", delimiter="\t"). But this only works with exactly one tab between the values. I would like to process the file keeping its format, i.e., not deleting duplicate tabs. Each field (row, column) contains a value.
Is it possible to use a number of 1+ tabs as delimiter or ignore additional tabs without affecting the numbering of the values in a row? row[1] should be the second value independent of how many tabs are in between row[0].
# Sample.txt (tab-padded):
# ID    name    Age
# 1     11      111
# 2     22      222
import pandas as pd

df = pd.read_csv('Sample.txt', sep=r'\t+', engine='python')  # regex separators need the python engine
print(df)
Assuming that there will never be empty fields, you can use a generator to remove duplicates from the incoming CSV file and then use the csv module as usual:
import csv

def de_dup(f, delimiter='\t'):
    for line in f:
        yield delimiter.join(field for field in line.split(delimiter) if field)

with open('data.csv') as f:
    for row in csv.reader(de_dup(f), delimiter='\t'):
        print(row)
An alternative way is to use re.sub() in the generator:
import re

def de_dup(f, delimiter='\t'):
    for line in f:
        yield re.sub(r'{}{{2,}}'.format(delimiter), delimiter, line)
but this still has the limitation that all fields must contain a value.
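The limitation is easy to demonstrate with a throwaway snippet: a run of tabs that represents a genuinely empty field is indistinguishable from alignment padding, so the collapsing approach always eats the empty field:

```python
import re

# 'a\t\tb' could be padding between two fields, or an empty middle
# field; collapsing always assumes it is padding.
print(re.sub(r'\t{2,}', '\t', 'a\t\tb'))  # 'a\tb' -- any empty field is lost
```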
The most convenient way for me to deal with the multiple tabs was using an additional function that takes the row and removes the empty values/fields that are created by multiple tabs in a row. This doesn't affect the formatting of the csv file, and I can access the second value in the row with row[1] - even with multiple tabs before it.
def remove_empty(line):
    result = []
    for i in range(len(line)):
        if line[i] != "":
            result.append(line[i])
    return result
And in the code where I read the file and process the values:
for row in reader:
    row = remove_empty(row)
    # ...continue processing normally...
I think this solution is similar to mhawke's, but with his solution I couldn't access the same values with row[i] as before (i.e., with only one delimiter in between each value).
Or, a completely general solution for any type of repeated separator is to repeatedly replace each multiple separator with a single separator and write the result to a new file (although this is slow for gigabyte-sized CSV files):
def replaceMultipleSeparators(fileName, oldSeparator, newSeparator):
    linesOfCsvInputFile = open(fileName, encoding='utf-8', mode='r').readlines()
    csvNewFileName = fileName + ".new"
    print('Writing: %s replacing %s with %s' % (csvNewFileName, oldSeparator, newSeparator), end='')
    outputFileStream = open(csvNewFileName, 'w')  # was newFileName, which is undefined
    for line in linesOfCsvInputFile:
        newLine = line.rstrip()
        processedLine = ""
        # keep collapsing doubled separators until the line stops changing
        while newLine != processedLine:
            processedLine = newLine
            newLine = processedLine.replace(oldSeparator + oldSeparator, oldSeparator)
        newLine = newLine.replace(oldSeparator, newSeparator)
        outputFileStream.write(newLine + '\n')
    outputFileStream.close()
which, given input testFile.csv, will generate testFile.csv.new with TABs replaced by PIPEs if you run:
replaceMultipleSeparators('testFile.csv', '\t', '|')
Sometimes you will need to replace the 'utf-8' encoding with 'latin-1' for some Microsoft-generated CSV files. See errors mentioning byte 0xe4 on read for this issue.

Using python to parse a log file (real case) - how to skip lines in a csv.reader loop, and how to use a different delimiter

Here is a section of the log file I want to parse:
And here is the code I am writing:
import csv

with open('Coplanarity_PedCheck.log', 'rt') as tsvin, open('YX2.csv', 'wt') as csvout:
    read_tsvin = csv.reader(tsvin, delimiter='\t')
    for row in read_tsvin:
        print(row)
        filters = row[0]
        if "#Log File Initialized!" in filters:
            print(row)
            datetime = row[0]
            print("looklook", datetime[23:46])
            csvout.write(datetime[23:46] + ",")
            BS = row[0]
            print("looklook", BS[17:21])
            csvout.write(datetime[17:21] + ",")
            csvout.write("\n")
csvout.close()
I need to get the date and time information from row1, then get "left" from row2, then need to skip section 4. How should I do it?
Since csv.reader makes row1 a list with only 1 element, I converted it to a string again to split out the datetime info I need. But I think it is not efficient.
I did the same thing for row2; then I want to skip rows 3-6, but I don't know how.
Also, csv.reader converts my float data into text, how can I convert them back before I write them into another file?
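On those last two points: rows can be skipped with next() or itertools.islice, and since the csv module always yields strings, numeric fields just need an explicit float() call. A small sketch with invented stand-in data:

```python
import csv
import io
import itertools

# Invented stand-in for the log: two junk lines, then tab-separated numbers.
data = io.StringIO('header line\njunk line\n1.5\t2.5\n3.0\t4.0\n')
reader = csv.reader(data, delimiter='\t')
rows = []
for row in itertools.islice(reader, 2, None):  # skip the first two rows
    rows.append([float(f) for f in row])       # csv gives strings; convert back
print(rows)  # [[1.5, 2.5], [3.0, 4.0]]
```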
You are going to want to learn to use regular expressions.
For example, you could do something like this:
import csv
import re

with open('Coplanarity_PedCheck.log', 'rt') as tsvin, open('YX2.csv', 'wt') as csvout:
    read_tsvin = csv.reader(tsvin, delimiter='\t')
    # Get the first line and use the first field
    header = next(read_tsvin)[0]
    m = re.search(r'\[([0-9: -]+)\]', header)
    datetime = m.group(1)
    csvout.write(datetime + ',')
    # Find if 'Left' is in line 2
    direction = next(read_tsvin)[0]
    m = re.search('Left', direction)
    if m:
        # If a match is found, write the matched text
        csvout.write(m.group(0))
        csvout.write('\n')
    # Skip lines 3-6
    for i in range(4):
        next(read_tsvin)
    # Loop over the rest of the rows
    for row in read_tsvin:
        pass  # parse the data here
Modifying this to look for a line containing '#Log File Initialized!' rather than hard-coding the first line would be fairly simple using regular expressions. Take a look at the regular expression documentation.
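A small illustration of that idea (the timestamp format and sample lines are guessed, since the log excerpt isn't shown):

```python
import re

lines = [
    '#Log File Initialized! [2015-06-01 12:34:56]',  # guessed format
    'Position: Left',
]
for line in lines:
    # Only extract the bracketed timestamp from the initialization line.
    if '#Log File Initialized!' in line:
        m = re.search(r'\[([0-9: -]+)\]', line)
        if m:
            print(m.group(1))  # 2015-06-01 12:34:56
```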
This probably isn't exactly what you want to do, but rather a suggestion for a good starting point.

Remove special characters from csv file using python

There seems to be something on this topic already (How to replace all those Special Characters with white spaces in python?), but I can't figure this simple task out for the life of me.
I have a .CSV file with 75 columns and almost 4000 rows. I need to replace all the 'special characters' ($ # & * ect) with '_' and write to a new file. Here's what I have so far:
import csv

input = open('C:/Temp/Data.csv', 'rb')
lines = csv.reader(input)
output = open('C:/Temp/Data_out1.csv', 'wb')
writer = csv.writer(output)
conversion = '-"/.$'
text = input.read()
newtext = '_'
for c in text:
    newtext += '_' if c in conversion else c
    writer.writerow(c)
input.close()
output.close()
All this succeeds in doing is writing everything to the output file as a single column, producing over 65K rows. Additionally, the special characters are still present!
Sorry for the redundant question.
Thank you in advance!
I might do something like
import csv
with open("special.csv", "rb") as infile, open("repaired.csv", "wb") as outfile:
reader = csv.reader(infile)
writer = csv.writer(outfile)
conversion = set('_"/.$')
for row in reader:
newrow = [''.join('_' if c in conversion else c for c in entry) for entry in row]
writer.writerow(newrow)
which turns
$ cat special.csv
th$s,2.3/,will-be
fixed.,even.though,maybe
some,"shoul""dn't",be
(note that I have a quoted value) into
$ cat repaired.csv
th_s,2_3_,will-be
fixed_,even_though,maybe
some,shoul_dn't,be
Right now, your code is reading in the entire text into one big line:
text = input.read()
Starting from a _ character:
newtext = '_'
Looping over every single character in text:
for c in text:
Add the corrected character to newtext (very slowly):
newtext += '_' if c in conversion else c
And then write the original character (?), as a column, to a new csv:
writer.writerow(c)
.. which is unlikely to be what you want. :^)
This doesn't seem to need to deal with CSV's in particular (as long as the special characters aren't your column delimiters).
lines = []
with open('C:/Temp/Data.csv', 'r') as input:
    lines = input.readlines()

conversion = '-"/.$'
newtext = '_'
outputLines = []
for line in lines:
    temp = line.rstrip('\n')  # readlines() keeps the newline; strip it so it isn't doubled below
    for c in conversion:
        temp = temp.replace(c, newtext)
    outputLines.append(temp)

with open('C:/Temp/Data_out1.csv', 'w') as output:
    for line in outputLines:
        output.write(line + "\n")
In addition to the bug pointed out by @Nisan.H and the valid point made by @dckrooney that you may not need to treat the file specially just because it is a CSV file (but see my comment below):
writer.writerow() should take a sequence of strings, each of which would be written out separated by commas (see here). In your case you are writing a single string.
This code is setting up to read from 'C:/Temp/Data.csv' in two ways - through input and through lines but it only actually reads from input (therefore the code does not deal with the file as a CSV file anyway).
The code appends characters to newtext and writes out each version of that variable. Thus, the first version of newtext would be 1 character long, the second 2 characters long, the third 3 characters long, etc.
Finally, given that a CSV file can have quote marks in it, it may actually be necessary to deal with the input file specifically as a CSV to avoid replacing quote marks that you want to keep, e.g. quote marks that are there to protect commas that exist within fields of the CSV file. In that case, it would be necessary to process each field of the CSV file individually, then write each row out to the new CSV file.
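A field-wise sketch of that last point (in-memory data and an invented special-character set): processing each field through the csv module keeps the protective quotes around embedded commas intact:

```python
import csv
import io

conversion = set('$#&*')  # invented set of specials to replace
src = io.StringIO('a$b,"1,5",c#\n')
out = io.StringIO()
writer = csv.writer(out)
for row in csv.reader(src):
    # Replace specials inside each field. The comma in "1,5" is field
    # content, not a delimiter, so it survives and is re-quoted on output.
    writer.writerow([''.join('_' if ch in conversion else ch for ch in f)
                     for f in row])
print(out.getvalue())  # a_b,"1,5",c_
```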
Maybe try
s = open('myfile.csv', 'r').read()
chars = ('$', '%', '^', '*')  # etc
for c in chars:
    s = '_'.join(s.split(c))
out_file = open('myfile_new.csv', 'w')
out_file.write(s)
out_file.close()
