Python: how to build a text file for loading data?

I'm new to Python, and I am following this guide to implement a linear regression:
http://nbviewer.jupyter.org/github/jdwittenauer/ipython-notebooks/blob/master/notebooks/ml/ML-Exercise1.ipynb
Basically I am at the step where I need to build a data set and import it into Python.
I have created a text file with two columns, with the values separated by a tab.
However, this is what I get.
I looked around online and it seems that the tab is the delimiter. What am I doing wrong? How can I build this text file?

I would advise using the official documentation instead of "looking around online": if you check the pandas read_csv() documentation, it lists (at the very top) the default value of each argument. The default value of the sep (separator) argument is ','. So just change your call to pd.read_csv() to add sep='\t'.
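A minimal sketch of that fix, using an in-memory file standing in for the exercise's two-column tab-separated data (the sample values and column names here are illustrative, not from the original notebook):

```python
import io

import pandas as pd

# Simulated contents of a two-column, tab-separated text file.
data = "6.1101\t17.592\n5.5277\t9.1302\n8.5186\t13.662\n"

# sep defaults to ',', so a tab-separated file needs sep='\t'.
# The file has no header row, so supply column names explicitly.
df = pd.read_csv(io.StringIO(data), sep='\t', header=None,
                 names=['population', 'profit'])
print(df.shape)  # (3, 2)
```

With a real file on disk, replace the io.StringIO(...) argument with the file path.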

Use ',' instead of a tab as the delimiter in your text file ex1data.txt, as the pandas default delimiter is ','.
Here is the explanation of sep from the official pandas documentation:
sep : str, default ','
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used automatically. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'

Related

Python Pandas - use Multiple Character Delimiter when writing to_csv

It appears that the pandas to_csv function only allows single-character delimiters/separators.
Is there some way to allow a string of characters to be used, like "::" or "%%", instead?
I tried:
df.to_csv(local_file, sep='::', header=None, index=False)
and got:
TypeError: "delimiter" must be a 1-character string
Use numpy.savetxt.
Ex:
np.savetxt('file.csv', np.char.decode(chunk_data.values.astype(np.bytes_), 'UTF-8'), delimiter='~|', fmt='%s', encoding=None)
np.savetxt('file.dat', chunk_data.values, delimiter='~|', fmt='%s', encoding='utf-8')
Think about what the line a::b::c means to a standard CSV tool: an a, an empty column, a b, an empty column, and a c. Even in a more complicated case with quoting or escaping: "abc::def"::2 means an abc::def, an empty column, and a 2.
So, all you have to do is add an empty column between every column, and then use : as a delimiter, and the output will be almost what you want.
I say “almost” because Pandas is going to quote or escape single colons. Depending on the dialect options you’re using, and the tool you’re trying to interact with, this may or may not be a problem. Unnecessary quoting usually isn’t a problem (unless you ask for QUOTE_ALL, because then your columns will be separated by :"":, so hopefully you don’t need that dialect option), but unnecessary escapes might be (e.g., you might end up with every single : in a string turned into a \: or something). So you have to be careful with the options. But it’ll work for the basic “quote as needed, with mostly standard other options” settings.
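A sketch of that empty-column trick (the DataFrame here is invented for illustration): interleave an empty column between every pair of real columns, then write with sep=':' so the output shows '::' between real fields.

```python
import pandas as pd

# Toy DataFrame, purely for illustration.
df = pd.DataFrame({'a': [1, 3], 'b': ['x', 'y'], 'c': [2, 4]})

# Interleave an empty, unnamed column between every pair of real columns.
pieces = []
for i, name in enumerate(df.columns):
    pieces.append(df[name])
    if i < len(df.columns) - 1:
        pieces.append(pd.Series([''] * len(df), name=''))
padded = pd.concat(pieces, axis=1)

# Writing with the single-character sep ':' now yields '::' between fields.
text = padded.to_csv(sep=':', index=False)
print(text)
# a::b::c
# 1::x::2
# 3::y::4
```

As the answer notes, any real value that itself contains a ':' would get quoted by QUOTE_MINIMAL, so check your dialect options against the consuming tool.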

Defining proper separators with text in pandas read_csv

I've been reading up on machine learning with python and sklearn.
I've tried practicing with the iris dataset and then went on to find other datasets on the UCI website.
I found one called "Amazon Book Reviews".
The documentation says each entry is separated with a new line and each of the four attributes is separated with a blank-space " ".
Unfortunately the data contains spaces everywhere, since you have a title (text) and a description (HTML).
When I try to use the pandas read_csv function, of course it doesn't know where to separate the columns, and neither do I.
Any ideas? Am I just way out of my depth as a machine learning (and programming in general) beginner?
You haven't done anything wrong, the documentation is actually incorrect. The delimiter used in the data files is actually a tab '\t' character. You can use this as the delimiter parameter to pandas.read_csv.
Good luck with your analysis!
each entry is separated with a new line and each of the four attributes is separated with a blank-space " "
read_csv provides an optional sep argument; the default is ','.
You can set it to a space.
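Given the first answer's observation that the files are actually tab-delimited, reading them can be sketched like this (the sample line and column names are invented, since the actual dataset layout isn't shown here):

```python
import io

import pandas as pd

# Invented sample in the dataset's rough shape: four tab-separated
# attributes, where the title and review freely contain spaces.
sample = "B000F83SZQ\t5.0\tA great read\tLoved every chapter of it\n"

df = pd.read_csv(io.StringIO(sample), sep='\t', header=None,
                 names=['product_id', 'rating', 'title', 'review'])
print(df.shape)  # (1, 4)
```

Because only tabs split fields, the spaces inside the title and review columns no longer confuse the parser.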

LPTHW: double quotes around CSV.writer

While going through LPTHW, I set out to read the code here:
https://github.com/BrechtDeMan/secretsanta/blob/master/pairing.py
I've been trying to understand why the output CSV has double quotes. There are several questions here about the problem, but I'm not grokking it.
Where are the quotes getting introduced?
Edit: I wrote the author a couple of weeks back but haven't heard back.
Edit 2: An example of the output...
"Alice,101,alice#mail.org,Wendy,204,wendy#mail.org"
The double quotes are introduced in the write_file function.
CSV files look simple on the surface, but sooner or later you will encounter more complex problems. The first one is: what should happen if the character denoting the delimiter occurs in field content? Because there is no real standard for the CSV format, different people had different ideas about the correct answer to this question.
The Python csv library tries to abstract away this complexity and the various approaches, and to make it easier to read and write CSV files following different rules. This is done via Dialect class objects.
The author of the write_file function decided to construct the output row manually by joining all fields and delimiter characters together, but then used the csv module to actually write the data into the file:
writer.writerow([givers_list[ind][1] + ',' + givers_list[ind][2]
                 + ',' + givers_list[ind][3]
                 + ',' + givers_list[rand_vec[ind]][1] + ','
                 + givers_list[rand_vec[ind]][2] + ',' + givers_list[rand_vec[ind]][3]])
This inconsistent usage of the csv module resulted in the entire row of data being treated as a single field. Because that field contains characters used as field delimiters, Dialect.quoting decides how it should be handled. The default quoting configuration, csv.QUOTE_MINIMAL, says that such a field should be quoted using Dialect.quotechar, which defaults to the double quote character ("). That's why the entire field ends up surrounded by double quotes.
A fast and easy, but not correct, solution would be changing the quoting algorithm to csv.QUOTE_NONE. This tells the writer object never to surround fields with quotes, and instead to escape special characters with Dialect.escapechar. According to the documentation, leaving escapechar set to None (the default) will raise an error when a character that needs escaping is encountered, so you would have to supply one (e.g. '\\').
The correct solution is feeding writer.writerow the expected input data: a list of fields. This should do (untested):
writer.writerow([givers_list[ind][1], givers_list[ind][2],
                 givers_list[ind][3],
                 givers_list[rand_vec[ind]][1],
                 givers_list[rand_vec[ind]][2], givers_list[rand_vec[ind]][3]])
In general, (double) quotes are needed when there is a separator character inside a field; and if there are quotes inside that field, they need to be 'escaped' with another quote.
Do you have an example of the output and the quotes you are talking about?
Edit (after example):
Ok, the whole row is treated as one field here. As Miroslaw Zalewski mentioned, those values should be treated as separate fields instead of one long string.
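The difference is easy to demonstrate in isolation (the sample row below is made up):

```python
import csv
import io

row = ['Alice', '101', 'alice#mail.org']

# Wrong: pre-joining the fields hands csv.writer a single field that
# contains commas, so QUOTE_MINIMAL wraps the whole row in quotes.
wrong = io.StringIO()
csv.writer(wrong).writerow([','.join(row)])
print(wrong.getvalue())  # "Alice,101,alice#mail.org"

# Right: pass the list of fields and let csv.writer insert the commas.
right = io.StringIO()
csv.writer(right).writerow(row)
print(right.getvalue())  # Alice,101,alice#mail.org
```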

csv.reader removing commas within quoted values

I have two programs that I converted to using the csv module in Python 2.7. The first I just used csv.reader:
f = open(sys.argv[1], 'r')
line = csv.reader(f)
for values in line:
    ...  # do some stuff
In this case commas within quoted strings are removed. Another program was too complex to just read, so I copied the file, changing the delimiter to a semicolon:
fin = open(sys.argv[1], 'rb')
lines = csv.reader(fin)
with open('temp.csv', 'wb') as fout:
    w = csv.writer(fout, delimiter=';')
    w.writerows(lines)
In this case the commas inside quoted strings are preserved. I found no posts about this on bugs.python.org, so I conclude it must be me. Has anyone else experienced this behavior? In the second case all elements are quoted; in the first case quotes are used only where required.
It's hard to guess without seeing a sample of your input file and the resulting output, but maybe this is related: Read CSV file with comma within fields in Python
Try csv.reader(f, skipinitialspace=True).
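The usual culprit is a space after each comma: with the default skipinitialspace=False, a quote that follows a space is treated as part of the field rather than as a quote character. A small sketch with an invented line:

```python
import csv

line = ['1, "a, b", 2']  # note the spaces after the commas

# Without skipinitialspace the quotes are not recognized as quoting,
# because the field does not *start* with the quote character.
no_skip = next(csv.reader(line))
print(no_skip)    # ['1', ' "a', ' b"', ' 2']

# With skipinitialspace=True the quoted field survives intact.
with_skip = next(csv.reader(line, skipinitialspace=True))
print(with_skip)  # ['1', 'a, b', '2']
```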

Pandas error tokenizing data when field in csv file contains quotation mark

I'm using pandas.read_csv to read a tab delimited file and am running into the error: Error tokenizing data. C error: Expected 364 fields in line 73058, saw 398
After much searching, it seems that the offending entry is: "– SO ,쳌 \\ ?Œ  ø ,d -L ,ú ,‚ ZO
Removing the quotation mark seems to solve things. I've got a lot of large files with a lot of strange characters in them, so this will no doubt repeat itself. Do I need to remove stray quotation marks ahead of time, or is there some way around this?
There is a quoting argument for read_csv:
quoting : int or csv.QUOTE_* instance, default None
Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3). Default (None) results in QUOTE_MINIMAL behavior.
These are described in the csv docs.
Try setting quoting=3 (i.e. csv.QUOTE_NONE).
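A sketch with an in-memory file containing a stray quote (the data is invented): with quoting=csv.QUOTE_NONE the lone quote is treated as an ordinary character instead of opening a quoted field that swallows the following tabs and newlines.

```python
import csv
import io

import pandas as pd

# Tab-separated data with an unmatched quote in one field.
data = 'c1\tc2\tc3\nok\t"stray\tlast\n'

df = pd.read_csv(io.StringIO(data), sep='\t', quoting=csv.QUOTE_NONE)
print(df.iloc[0].tolist())  # ['ok', '"stray', 'last']
```

The quote character stays in the parsed value, so you may still want to strip it afterwards.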
