I have an Excel file containing data with multiple columns of varying width that I need to work with on my PC. However, the file uses SOH and STX characters as delimiters, since it came from TextEdit on a Mac: SOH is the field delimiter and STX is the row delimiter. On my PC, both characters are displayed as a rectangle (see screenshot). I can't use the fixed-width import option, since I would lose data. I tried writing a Python script, but Python doesn't recognize the SOH and STX either; it just displays them as rectangles too. How do I split these records appropriately? I would appreciate any possible method.
Thanks!
This should work:

SOH = '\x01'
STX = '\x02'

# As it is, this function returns the values as strings, not as integers.
def read_lines(filename):
    with open(filename, "rb") as f:
        rawdata = f.read()
    # Rows are terminated by SOH+STX; fields within a row are separated by SOH.
    for l in rawdata.split(SOH + STX):
        if not l:
            continue
        yield l.split(SOH)

# rows is a list. Each element in the list is a row of values
# (either a list or a tuple, for example).
def write_lines(filename, rows):
    with open(filename, "wb") as f:
        for row in rows:
            f.write(SOH.join([str(x) for x in row]) + SOH + STX)
Edit: Example use...

for row in read_lines("myfile.csv"):
    print ", ".join(row)
I have a .csv file with comma-separated fields. I am receiving this file from a 3rd party and the content cannot change. I need to import the file into a database, but there are commas in some of the "comma-separated" fields. The fields are also fixed length - when I straight up print them as per the lines below in the function insert_line_csv, they are spaced at a fixed length.
Essentially I need an efficient method of collecting fields that could have commas included in them. I was hoping to combine the two methods, though I'm not sure if that would be efficient.
I am using Python 3 - willing to use any libraries to make the job efficient and easy.
Currently I have the following:
with open(FileName, 'r') as f:
    for count, line in enumerate(f):
        insert_line_csv(count, line)
with the insert_line_csv function looking like:
def insert_line_csv(line_no, line):
    line = line.split(",")
    field0 = line[0]
    field1 = line[1]
    ......
I am importing the line_no, as that is also being entered into the db.
Any insight would be appreciated.
A sample dataset:
text ,2000.00 ,2018-07-07,textwithoutcomma ,text ,1
text ,3000.00 ,2018-07-08,textwith,comma ,text ,7
text ,1000.00 ,2018-07-07,textwithoutcomma ,text ,4
If the comma-separated fields are all fixed length, you should be able to just slice them off by character count instead of splitting on commas; see Split string by count of characters.
As mock-up code you have something like this (X, the known field width, and the db write are placeholders):

to_parse = line
while to_parse != "":
    chunk = to_parse[:X]          # first X chars of to_parse
    rest_of_line = to_parse[X:]   # to_parse without the chars just cut off
    write_chunk_to_db(chunk)      # write chunk to db (hypothetical helper)
    to_parse = rest_of_line
That should work imho
Edit: Upon seeing your sample dataset: can there only be one field with commas inside it? If so, you could split on commas, read out the first 3 fields and the last 2, and whatever is left you concatenate again, because it is the value of the 4th field. (If it had commas, you'll need to actually re-join them; if not, it's already the value.) See the sketch below.
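A minimal sketch of that idea, assuming the comma-bearing field is always the 4th of six (the function name parse_line is just for illustration):

def parse_line(line):
    parts = line.rstrip("\n").split(",")
    first = parts[:3]               # the first three fields
    last = parts[-2:]               # the last two fields
    middle = ",".join(parts[3:-2])  # whatever is left is the 4th field,
                                    # with any internal commas restored
    return first + [middle] + last

print(parse_line("text ,3000.00 ,2018-07-08,textwith,comma ,text ,7"))
# ['text ', '3000.00 ', '2018-07-08', 'textwith,comma ', 'text ', '7']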
I'm trying to delete some number of data rows from a file, essentially just because there are too many data points. I can easily print them to IDLE, but when I try to write the lines to a file, all of the data from one row goes into one column. I'm definitely a noob, but it seems like this should be "trivial".
I've tried it with writerow and writerows, zip(), with and without [], and I've changed the delimiter and line terminator.
import csv

filename = "velocity_result.csv"
with open(filename, "r") as source:
    for i, line in enumerate(source):
        if i % 2 == 0:
            with open("result.csv", "ab") as result:
                result_writer = csv.writer(result, quoting=csv.QUOTE_ALL, delimiter=',', lineterminator='\n')
                result_writer.writerow([line])
This is what happens:

input =  |a|b|c|d|   <- row
         |e|f|g|h|
output = |abcd|      <- every other row deleted, but everything in one column

My expectation is:

input =  |a|b|c|d|   <- row
         |e|f|g|h|
output = |a|b|c|d|   <- every other row deleted
Once you've read the line, it becomes a single item as far as Python is concerned. Sure, maybe it is a string which has comma-separated values in it, but it is still a single item. So [line] is a list of 1 item, no matter how it is formatted.
If you want to make sure the line is recognized as a list of separate values, you need to make it such, perhaps with split:
result_writer.writerow(line.split('<input file delimiter here>'))
Now the line becomes a list of 4 items, so it makes sense for the csv writer to write them as 4 separate values in the file.
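Putting that together, a minimal corrected sketch of the original loop (in Python 3 terms, assuming the input is comma-delimited; the filenames are from the question):

import csv

with open("velocity_result.csv", "r") as source, \
     open("result.csv", "w", newline="") as result:
    result_writer = csv.writer(result, quoting=csv.QUOTE_ALL)
    for i, line in enumerate(source):
        if i % 2 == 0:  # keep every other row
            result_writer.writerow(line.rstrip("\n").split(","))

Opening the output once, outside the loop, also avoids re-opening the file for every row.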
I have my csv file formatted with all columns nicely aligned by using one or more tabs in between the different values.
I know it is possible to use a single tab as the delimiter with csv.register_dialect("tab_delimiter", delimiter="\t"), but this only works with exactly one tab between the values. I would like to process the file keeping its format, i.e., not deleting duplicate tabs. Each field (row, column) contains a value.
Is it possible to use a run of 1+ tabs as the delimiter, or to ignore the additional tabs without affecting the numbering of the values in a row? row[1] should be the second value regardless of how many tabs separate it from row[0].
# Sample.txt
# ID    name    Age
# 1     11      111
# 2     22      222
import pandas as pd

df = pd.read_csv('Sample.txt', sep=r'\t+', engine='python')  # regex separators need the python engine
print(df)
Assuming that there will never be empty fields, you can use a generator to remove the duplicate delimiters from the incoming CSV file and then use the csv module as usual:
import csv

def de_dup(f, delimiter='\t'):
    for line in f:
        yield delimiter.join(field for field in line.split(delimiter) if field)

with open('data.csv') as f:
    for row in csv.reader(de_dup(f), delimiter='\t'):
        print(row)
An alternative way is to use re.sub() in the generator:
import re

def de_dup(f, delimiter='\t'):
    for line in f:
        yield re.sub(r'{}{{2,}}'.format(delimiter), delimiter, line)
but this still has the limitation that all fields must contain a value.
The most convenient way for me to deal with the multiple tabs was using an additional function that takes the row and removes the empty values/fields that are created by multiple tabs in a row. This doesn't affect the formatting of the csv file, and I can access the second value in the row with row[1] - even with multiple tabs before it.
def remove_empty(line):
    result = []
    for field in line:
        if field != "":
            result.append(field)
    return result
And in the code where I read the file and process the values:
for row in reader:
    row = remove_empty(row)
    # ...continue processing normally
I think this solution is similar to mhawke's, but with his solution I couldn't access the same values with row[i] as before (i.e., with only one delimiter in between each value).
Or, the completely general solution for any type of repeated separator is to repeatedly replace each multiple separator with a single separator and write to a new file (although this is slow for gigabyte-sized CSV files):
def replaceMultipleSeparators(fileName, oldSeparator, newSeparator):
    with open(fileName, encoding='utf-8', mode='r') as inputFileStream:
        linesOfCsvInputFile = inputFileStream.readlines()
    csvNewFileName = fileName + ".new"
    print('Writing: %s replacing %s with %s' % (csvNewFileName, oldSeparator, newSeparator), end='')
    outputFileStream = open(csvNewFileName, 'w')
    for line in linesOfCsvInputFile:
        newLine = line.rstrip()
        processedLine = ""
        # Collapse runs of oldSeparator until nothing changes.
        while newLine != processedLine:
            processedLine = newLine
            newLine = processedLine.replace(oldSeparator + oldSeparator, oldSeparator)
        newLine = newLine.replace(oldSeparator, newSeparator)
        outputFileStream.write(newLine + '\n')
    outputFileStream.close()
which, given input testFile.csv, will generate testFile.csv.new with TABs replaced by PIPEs if you run:

replaceMultipleSeparators('testFile.csv', '\t', '|')

Sometimes you will need to replace the 'utf-8' encoding with 'latin-1' for some Microsoft US-generated CSV files. See errors related to reading byte 0xe4 for this issue.
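For example, only the encoding argument in the open() call changes (a sketch; latin-1 is one common fallback for such files):

with open(fileName, encoding='latin-1', mode='r') as inputFileStream:
    linesOfCsvInputFile = inputFileStream.readlines()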
There seems to be something on this topic already (How to replace all those Special Characters with white spaces in python?), but I can't figure this simple task out for the life of me.
I have a .CSV file with 75 columns and almost 4000 rows. I need to replace all the 'special characters' ($ # & * etc.) with '_' and write to a new file. Here's what I have so far:
import csv

input = open('C:/Temp/Data.csv', 'rb')
lines = csv.reader(input)
output = open('C:/Temp/Data_out1.csv', 'wb')
writer = csv.writer(output)

conversion = '-"/.$'
text = input.read()
newtext = '_'
for c in text:
    newtext += '_' if c in conversion else c
    writer.writerow(c)

input.close()
output.close()
All this succeeds in doing is writing everything to the output file in a single column, producing over 65K rows. Additionally, the special characters are still present!
Sorry for the redundant question.
Thank you in advance!
I might do something like this:

import csv

with open("special.csv", "rb") as infile, open("repaired.csv", "wb") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    conversion = set('_"/.$')
    for row in reader:
        newrow = [''.join('_' if c in conversion else c for c in entry) for entry in row]
        writer.writerow(newrow)
which turns
$ cat special.csv
th$s,2.3/,will-be
fixed.,even.though,maybe
some,"shoul""dn't",be
(note that I have a quoted value) into
$ cat repaired.csv
th_s,2_3_,will-be
fixed_,even_though,maybe
some,shoul_dn't,be
Right now, your code is reading the entire text in as one big string:
text = input.read()
Starting from a _ character:
newtext = '_'
Looping over every single character in text:
for c in text:
Add the corrected character to newtext (very slowly):
newtext += '_' if c in conversion else c
And then write the original character (?), as a column, to a new csv:
writer.writerow(c)
.. which is unlikely to be what you want. :^)
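A minimal corrected sketch along those lines (it splits on commas after the replacement, so it sidesteps the quoting issue discussed below):

import csv

conversion = '-"/.$'
with open('C:/Temp/Data.csv', 'r') as infile, \
     open('C:/Temp/Data_out1.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for line in infile:
        # Replace the unwanted characters once, for the whole line.
        newtext = ''.join('_' if c in conversion else c for c in line.rstrip('\n'))
        # Write the fields of the line, not individual characters.
        writer.writerow(newtext.split(','))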
This doesn't seem to need to deal with CSVs in particular (as long as the special characters aren't your column delimiters).
lines = []
with open('C:/Temp/Data.csv', 'r') as input:
    lines = input.readlines()

conversion = '-"/.$'
newtext = '_'
outputLines = []
for line in lines:
    temp = line.rstrip('\n')
    for c in conversion:
        temp = temp.replace(c, newtext)
    outputLines.append(temp)

with open('C:/Temp/Data_out1.csv', 'w') as output:
    for line in outputLines:
        output.write(line + "\n")
In addition to the bug pointed out by @Nisan.H and the valid point made by @dckrooney that you may not need to treat the file as a CSV in this case (but see my comment below):

writer.writerow() should take a sequence of strings, each of which will be written out separated by commas (see here). In your case you are writing a single string.

Your code is setting up to read from 'C:/Temp/Data.csv' in two ways - through input and through lines - but it only ever reads from input (so the code does not deal with the file as a CSV file anyway).

The code appends characters to newtext and writes out each version of that variable. Thus, the first version of newtext would be 1 character long, the second 2 characters long, the third 3 characters long, etc.

Finally, given that a CSV file can have quote marks in it, it may actually be necessary to deal with the input file specifically as a CSV, to avoid replacing quote marks that you want to keep, e.g. quote marks that are there to protect commas that exist within fields of the file. In that case it would be necessary to process each field of the CSV individually (as the first answer above does), then write each row out to the new CSV file.
Maybe try:

s = open('myfile.csv', 'r').read()
chars = ('$', '%', '^', '*')  # etc.
for c in chars:
    s = '_'.join(s.split(c))

out_file = open('myfile_new.csv', 'w')
out_file.write(s)
out_file.close()
I am trying to read a very simple but somewhat large (800 MB) csv file using the csv library in Python. The delimiter is a single tab and each line consists of some numbers.
Each line is a record, and I have 20681 rows in my file. I had some problems during my calculations using this file; it always stopped at a certain row, so I got suspicious about the number of rows in the file. I used the code below to count the number of rows:
import csv

tfdf_Reader = csv.reader(open('v2-host_tfdf_en.txt'), delimiter=' ')
c = 0
for row in tfdf_Reader:
    c = c + 1
print c
To my surprise c is printed with the value of 61722!!! Why is this happening? What am I doing wrong?
800 million bytes in the file and 20681 rows means that the average row size is over 38 THOUSAND bytes. Are you sure? How many numbers do you expect in each line? How do you know that you have 20681 rows? That the file is 800 Mb?
61722 rows is almost exactly 3 times 20681 -- is the number 3 of any significance e.g. 3 logical sub-sections of each record?
To find out what you really have in your file, don't rely on what it looks like. Python's repr() function is your friend.
Are you on Windows? Even if not, always open(filename, 'rb').
If the fields are tab-separated, then don't put delimiter=" " (whatever is between the quotes appears not to be a tab). Put delimiter="\t".
Try putting some debug statements in your code, like this:
import csv

DEBUG = True

f = open('v2-host_tfdf_en.txt', 'rb')
if DEBUG:
    rawdata = f.read(200)
    f.seek(0)
    # What is the delimiter between fields? Between rows?
    print 'rawdata', repr(rawdata)

tfdf_Reader = csv.reader(f, delimiter=' ')
c = 0
for row in tfdf_Reader:
    c = c + 1
    # Are you getting rows like you expect?
    if DEBUG and c <= 10:
        print "row", c, repr(row)
print "rowcount", c
Note: if you are getting Error: field larger than field limit (131072), that means your file has 128Kb of data with no delimiters.
I'd suspect that:
(a) your file has random junk or a big chunk of binary zeroes appended to it - this should be obvious in a hex editor; it should also be obvious in a TEXT editor. Print all the rows that you do get, to help identify where the trouble starts.
or (b) the delimiter is a string of one or more whitespace characters (space, tab): the first few rows have tabs and the remaining rows have spaces. If so, this should be obvious in a hex editor (or in Notepad++, especially if you do View/Show Symbol/Show all characters). In that case you can't use csv; you'd need something simple like:
f = open('v2-host_tfdf_en.txt', 'r') # NOT 'rb'
rows = [line.split() for line in f]
My first guess would be the delimiter. How are you ensuring that the delimiter is a tab?
What is the value you are actually passing? (The code you pasted shows a space, but I'm sure you intended to pass something else.)
If your file is tab-separated, then look specifically for '\t' as your delimiter. Looking for a space would mess up situations where there is a space in your data that is not a column separator.
Also, if your file is tab-delimited in the Excel style, then there is a special "excel-tab" dialect for that; see the sketch below.
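For instance (a minimal sketch; the filename is taken from the question):

import csv

with open('v2-host_tfdf_en.txt') as f:
    reader = csv.reader(f, dialect='excel-tab')  # equivalent to delimiter='\t'
    print(sum(1 for row in reader))              # count the rows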