Reading a tab separated file with Python Pandas - python

I have encountered a problem reading a tab separated file using Pandas.
All the cell values have double quotations but for some rows, there is an extra double quotation that breaks the whole procedure. For instance:
Column A Column B Column C
"foo1" "121654" "unit"
"foo2" "1214" "unit"
"foo3" "15884""
The error I get is: Error tokenizing data. C error: Expected 31 fields in line 8355, saw 58
The code I used is:
csv = pd.read_csv(file, sep='\t',  lineterminator='\n', names=None)
and it works fine for the rest of the files but not for the ones where this extra double quotation appears.

If you cannot change the buggy input, the best way would be to read the input file into a io.StringIO object, replacing the double quotes, then pass this file-like object to pd.read (it supports filenames and file-like objects)
That way you don't have to create a temporary file or to alter the input data.
import io
with open(file) as f:
fileobject = io.StringIO(f.read().replace('""','"'))
csv = pd.read_csv(fileobject, sep='\t', lineterminator='\n', names=None)

You can do the preprocessing step to fix the quotation issue:
with open(file, 'r') as fp:
text = fp.read().replace('""', '"')
with open(file, 'w') as fp:
fp.write(text)

Related

Want to convert the csv file from line break mode to be separated by comma

Currently the csv file is saved in line break mode. But it should be separated by comma for inputting these datas as an array.
The current csv file:
test#eaxmple.com
test#eaxmple.com
test#eaxmple.com
The ideal csv file:
test#eaxmple.com, test#eaxmple.com, test#eaxmple.com
The code:
def get_addresses():
with open('./addresses.csv') as f:
addresses_file = csv.reader(f)
# Need to be converted
How can I convert it? I hope to use Python.
tried this.
with open('./addresses.txt') as input, open('./addresses.csv', 'w') as output:
output.write(','.join(input.readlines()))
output.write('\n')
the result:
test#eaxmple.com
,test#eaxmple.com
,test#eaxmple.com
with open('./addresses.txt') as f:
print(",".join(f.read().splitlines()))
Load the original file into pandas using:
import pandas as pd
df = pd.read_csv({YOUR_FILE}, escapechar='\\')
Then export it back to .csv (by default this will be comma separated).
df.to_csv({YOUR_FILE})
For this simple task, just read them into an array, then join the array on commas.
with open('./addresses.txt') as input, open('./addresses.csv', 'w') as output:
output.write(','.join(input.read().splitlines()))
output.write('\n')
This ignores any complications in the CSV formatting - if your data could contain commas (which are reserved as the field separator) or double quotes (which are reserved for quoting other reserved characters) you will want to switch to the proper csv module for output and perhaps for input.
Overwriting your input file is also an unnecessary complication, so I suggest you rename the input file to addresses.txt and use addresses.csv only for output.
Demo: https://repl.it/repls/AdequateStunningVideogames
Another common trick is to read one line at a time, and write a separator before each output except the first. This is more scalable for large input files.
with open blah blah blah ...:
separator = '' # for first line
for line in input:
output.write(separator)
output.write(line)
separator = ',' # for subsequent input lines
output.write('\n')

Write to a csv file multiple times? [duplicate]

I am trying to add a new row to my old CSV file. Basically, it gets updated each time I run the Python script.
Right now I am storing the old CSV rows values in a list and then deleting the CSV file and creating it again with the new list value.
I wanted to know are there any better ways of doing this.
with open('document.csv','a') as fd:
fd.write(myCsvRow)
Opening a file with the 'a' parameter allows you to append to the end of the file instead of simply overwriting the existing content. Try that.
I prefer this solution using the csv module from the standard library and the with statement to avoid leaving the file open.
The key point is using 'a' for appending when you open the file.
import csv
fields=['first','second','third']
with open(r'name', 'a') as f:
writer = csv.writer(f)
writer.writerow(fields)
If you are using Python 2.7 you may experience superfluous new lines in Windows. You can try to avoid them using 'ab' instead of 'a' this will, however, cause you TypeError: a bytes-like object is required, not 'str' in python and CSV in Python 3.6. Adding the newline='', as Natacha suggests, will cause you a backward incompatibility between Python 2 and 3.
Based in the answer of #G M and paying attention to the #John La Rooy's warning, I was able to append a new row opening the file in 'a'mode.
Even in windows, in order to avoid the newline problem, you must declare it as newline=''.
Now you can open the file in 'a'mode (without the b).
import csv
with open(r'names.csv', 'a', newline='') as csvfile:
fieldnames = ['This','aNew']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writerow({'This':'is', 'aNew':'Row'})
I didn't try with the regular writer (without the Dict), but I think that it'll be ok too.
If you use pandas, you can append your dataframes to an existing CSV file this way:
df.to_csv('log.csv', mode='a', index=False, header=False)
With mode='a' we ensure that we append, rather than overwrite, and with header=False we ensure that we append only the values of df rows, rather than header + values.
Are you opening the file with mode of 'a' instead of 'w'?
See Reading and Writing Files in the python docs
7.2. Reading and Writing Files
open() returns a file object, and is most commonly used with two arguments: open(filename, mode).
>>> f = open('workfile', 'w')
>>> print f <open file 'workfile', mode 'w' at 80a0960>
The first argument is a string containing the filename. The second argument is
another string containing a few characters describing the way in which
the file will be used. mode can be 'r' when the file will only be
read, 'w' for only writing (an existing file with the same name will
be erased), and 'a' opens the file for appending; any data written to
the file is automatically added to the end. 'r+' opens the file for
both reading and writing. The mode argument is optional; 'r' will be
assumed if it’s omitted.
On Windows, 'b' appended to the mode opens the file in binary mode, so
there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows
makes a distinction between text and binary files; the end-of-line
characters in text files are automatically altered slightly when data
is read or written. This behind-the-scenes modification to file data
is fine for ASCII text files, but it’ll corrupt binary data like that
in JPEG or EXE files. Be very careful to use binary mode when reading
and writing such files. On Unix, it doesn’t hurt to append a 'b' to
the mode, so you can use it platform-independently for all binary
files.
If the file exists and contains data, then it is possible to generate the fieldname parameter for csv.DictWriter automatically:
# read header automatically
with open(myFile, "r") as f:
reader = csv.reader(f)
for header in reader:
break
# add row to CSV file
with open(myFile, "a", newline='') as f:
writer = csv.DictWriter(f, fieldnames=header)
writer.writerow(myDict)
I use the following approach to append a new line in a .csv file:
pose_x = 1
pose_y = 2
with open('path-to-your-csv-file.csv', mode='a') as file_:
file_.write("{},{}".format(pose_x, pose_y))
file_.write("\n") # Next line.
[NOTE]:
mode='a' is append mode.
# I like using the codecs opening in a with
field_names = ['latitude', 'longitude', 'date', 'user', 'text']
with codecs.open(filename,"ab", encoding='utf-8') as logfile:
logger = csv.DictWriter(logfile, fieldnames=field_names)
logger.writeheader()
# some more code stuff
for video in aList:
video_result = {}
video_result['date'] = video['snippet']['publishedAt']
video_result['user'] = video['id']
video_result['text'] = video['snippet']['description'].encode('utf8')
logger.writerow(video_result)

creating a csv file from a function result, python

I am using this pdf to csv function from {Python module for converting PDF to text} and I was wondering how can I now export the result to a csv file on my drive? I tried adding in the function
with open('C:\location', 'wb') as f:
writer = csv.writer(f)
for row in data:
writer.writerow(row)
but the resulting csv file has one character per row and not the rows I have when printing data in python.
If you are printing a single character per row, then what you have is a string. Your loop
for row in data:
translates to
for character in string:
so you need to break your string up into the chunks you want written on a single row. You might be able to use something like data.split() but it's hard to say without seeing more of your code and data.
In response to your comment:
yes, you can just dump the data to a CSV... If it adheres to the rules of CSV. If your data is separated by commas, with each row terminated by a newline, then you can just write your data to a file.
with open ("file.csv",'w') as f:
f.write(data)
This will ONLY work if your data adheres to the rules of csv.

Python csv writer : "Unknown Dialect" Error

I have a very large string in the CSV format that will be written to a CSV file.
I try to write it to CSV using the simplest if the python script
results=""" "2013-12-03 23:59:52","/core/log","79.223.39.000","logging-4.0",iPad,Unknown,"1.0.1.59-266060",NA,NA,NA,NA,3,"1385593191.865",true,ERROR,"app_error","iPad/Unknown/webkit/537.51.1",NA,"Does+not",false
"2013-12-03 23:58:41","/core/log","217.7.59.000","logging-4.0",Win32,Unknown,"1.0.1.59-266060",NA,NA,NA,NA,4,"1385593120.68",true,ERROR,"app_error","Win32/Unknown/msie/9.0",NA,"Does+not,false
"2013-12-03 23:58:19","/core/client_log","79.240.195.000","logging-4.0",Win32,"5.1","1.0.1.59-266060",NA,NA,NA,NA,6,"1385593099.001",true,ERROR,"app_error","Win32/5.1/mozilla/25.0",NA,"Could+not:+{"url":"/all.json?status=ongoing,scheduled,conflict","code":0,"data":"","success":false,"error":true,"cached":false,"jqXhr":{"readyState":0,"responseText":"","status":0,"statusText":"error"}}",false"""
resultArray = results.split('\n')
with open(csvfile, 'wb') as f:
writer = csv.writer(f)
for row in resultArray:
writer.writerows(row)
The code returns
"Unknown Dialect"
Error
Is the error because of the script or is it due to the string that is being written?
EDIT
If the problem is bad input how do I sanitize it so that it can be used by the csv.writer() method?
You need to specify the format of your string:
with open(csvfile, 'wb') as f:
writer = csv.writer(f, delimiter=',', quotechar="'", quoting=csv.QUOTE_ALL)
You might also want to re-visit your writing loop; the way you have it written you will get one column in your file, and each row will be one character from the results string.
To really exploit the module, try this:
import csv
lines = ["'A','bunch+of','multiline','CSV,LIKE,STRING'"]
reader = csv.reader(lines, quotechar="'")
with open('out.csv', 'wb') as f:
writer = csv.writer(f)
writer.writerows(list(reader))
out.csv will have:
A,bunch+of,multiline,"CSV,LIKE,STRING"
If you want to quote all the column values, then add quoting=csv.QUOTE_ALL to the writer object; then you file will have:
"A","bunch+of","multiline","CSV,LIKE,STRING"
To change the quotes to ', add quotechar="'" to the writer object.
The above code does not give csv.writer.writerows input that it expects. Specifically:
resultArray = results.split('\n')
This creates a list of strings. Then, you pass each string to your writer and tell it to writerows with it:
for row in resultArray:
writer.writerows(row)
But writerows does not expect a single string. From the docs:
csvwriter.writerows(rows)
Write all the rows parameters (a list of row objects as described above) to the writer’s file object, formatted according to the current dialect.
So you're passing a string to a method that expects its argument to be a list of row objects, where a row object is itself expected to be a sequence of strings or numbers:
A row must be a sequence of strings or numbers for Writer objects
Are you sure your listed example code accurately reflects your attempt? While it certainly won't work, I would expect the exception produced to be different.
For a possible fix - if all you are trying to do is to write a big string to a file, you don't need the csv library at all. You can just write the string directly. Even splitting on newlines is unnecessary unless you need to do something like replacing Unix-style linefeeds with DOS-style linefeeds.
If you need to use the csv module after all, you need to give your writer something it understands - in this example, that would be something like writer.writerow(['A','bunch+of','multiline','CSV,LIKE,STRING']). Note that that's a true Python list of strings. If you need to turn your raw string "'A','bunch+of','multiline','CSV,LIKE,STRING'" into such a list, I think you'll find the csv library useful as a reader - no need to reinvent the wheel to handle the quoted commas in the substring 'CSV,LIKE,STRING'. And in that case you would need to care about your dialect.
you can use 'register_dialect':
for example for escaped formatting:
csv.register_dialect('escaped', escapechar='\\', doublequote=True, quoting=csv.QUOTE_ALL)

Python CSV: Remove quotes from value

I have a process where a CSV file can be downloaded, edited then uploaded again. On the download, the CSV file is in the correct format, with no wrapping double quotes
1, someval, someval2
When I open the CSV in a spreadsheet, edit and save, it adds double quotes around the strings
1, "someEditVal", "someval2"
I figured this was just the action of the spreadsheet (in this case, openoffice). I want my upload script to remove the wrapping double quotes. I cannot remove all quotes, just incase the body contains them, and I also dont want to just check first and last characters for double quotes.
Im almost sure that the CSV library in python would know how to handle this, but not sure how to use it...
EDIT
When I use the values within a dictionary, they turn out as follows
{'header':'"value"'}
Thanks
For you example, the following works:
import csv
writer = csv.writer(open("out.csv", "wb"), quoting=csv.QUOTE_NONE)
reader = csv.reader(open("in.csv", "rb"), skipinitialspace=True)
writer.writerows(reader)
You might need to play with the dialect options of the CSV reader and writer -- see the documentation of the csv module.
Thanks to everyone who was trying to help me, but I figured it out. When specifying the reader, you can define the quotechar
csv.reader(upload_file, delimiter=',', quotechar='"')
This handles the wrapping quotes of strings.
For Python 3:
import csv
writer = csv.writer(open("query_result.csv", "wt"), quoting=csv.QUOTE_NONE, escapechar='\\')
reader = csv.reader(open("out.txt", "rt"), skipinitialspace=True)
writer.writerows(reader)
The original answer gives this error under Python 3. Also See this SO for detail: csv.Error: iterator should return strings, not bytes
Traceback (most recent call last):
File "remove_quotes.py", line 11, in
writer.writerows(reader)
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

Categories

Resources