I've been reading up on machine learning with python and sklearn.
I've tried practicing with the iris dataset and then went on to find other datasets on the UCI website.
I found one called "Amazon Book Reviews".
The documentation says each entry is separated with a new line and each of the four attributes is separated with a blank-space " ".
Unfortunately the data contains spaces everywhere, since you have a title (text) and a description (HTML).
When I try to use the pandas read_csv function, of course it doesn't know where to separate the columns, and neither do I.
Any ideas? Am I just way too out of my depth for a machine learning (and programming in general) beginner?
You haven't done anything wrong, the documentation is actually incorrect. The delimiter used in the data files is actually a tab '\t' character. You can use this as the delimiter parameter to pandas.read_csv.
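For example, something like this (a minimal sketch; the file name and column names are placeholders, not the dataset's real ones):
import pandas as pd
# 'amazon_reviews.txt' and the four column names are assumptions
df = pd.read_csv('amazon_reviews.txt', sep='\t', header=None,
                 names=['title', 'description', 'rating', 'review'])
print(df.head())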
Good luck with your analysis!
each entry is separated with a new line and each of the four attributes is separated with a blank-space " "
read_csv provides an optional sep argument; the default is ','. You can set this to a space.
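Something along these lines (a sketch; 'data.txt' is a placeholder file name):
import pandas as pd
# treat a single space as the field separator, as the documentation describes
df = pd.read_csv('data.txt', sep=' ', header=None)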
Very newbie programmer asking a question here. I have searched all over the forums but can't find anything to solve this issue; I thought there would be a simple function for it. Is there a way to do this?
I am trying to reformat a txt file so I can use it with pandas, but this requires my data to be in a specific format.
Currently my data is in a txt file in the following format:
01/09/21,00:28,7.1,75,3.0,3.7,3.7,292,0.0,0.0,1025.8,81.9,17.1,44,3.7,4.6,7.1,0,0,0.00,0.00,3.0,0,0.0,292,0.0,0.0
01/09/21,00:58,7.0,75,2.9,5.1,5.1,248,0.0,0.0,1025.9,81.9,17.0,44,5.1,3.8,7.0,0,0,0.00,0.00,1.9,0,0.0,248,0.0,0.0
It is required to be formatted like this for processing with pandas:
["06/09/21","19:58",11.4,69,5.9,0.0,0.0,0,0.0,0.3,1006.6,82.2,21.8,52,0.0,11.4,11.4,0,0,0.00,0.00,10.5,0,1.5,0,0.0,0.3],
["06/09/21","20:28",10.6,73,6.0,0.0,0.0,0,0.0,0.3,1006.3,82.2,22.4,49,0.0,10.6,10.6,0,0,0.00,0.00,9.7,0,1.5,0,0.0,0.3],
This requires adding a [" at the start and adding a " at the end of the date before the comma, then adding another " after the comma and another " at the end of the time section. At the end of the line, I also need to add a ],
I thought something like this would work, but I get an error when trying to run it.
info = "06/09/21,19:58,11.4,69,5.9,0.0,0.0,0,0.0,0.3,1006.6,82.2,21.8,52,0.0,11.4,11.4,0,0,0.00,0.00,10.5,0,1.5,0,0.0,0.3"
info = info[:1] + "['" + info[1:]
print(info)
I have over 1000 lines of data so doing it manually is out of the question. I've seen other questions like this, but they didn't get helpful answers. Can it be done, preferably with either a method or a loop?
You are confusing the CONTENTS of your data with the REPRESENTATION of your data. You don't really need brackets and quotes at all. What you need is a list that contains strings and integers. What you've shown there is how Python would PRINT a list containing strings and integers. The list doesn't actually contain brackets or quotes.
You can use pandas.read_csv directly on that data file with no extra processing. You just need to provide the column names.
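For example (a minimal sketch; 'weather.txt' and the column names are placeholders for your actual file and field names):
import pandas as pd
# two named columns plus 25 generic ones -- 27 fields per row in the sample
names = ['date', 'time'] + ['field%d' % i for i in range(1, 26)]
df = pd.read_csv('weather.txt', header=None, names=names)
print(df.head())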
It appears that the pandas to_csv function only allows single-character delimiters/separators.
Is there some way to allow a string of characters to be used, like "::" or "%%", instead?
I tried:
df.to_csv(local_file, sep='::', header=None, index=False)
and got:
TypeError: "delimiter" must be a 1-character string
Use numpy.savetxt
Ex:
import numpy as np
np.savetxt('file.csv', np.char.decode(chunk_data.values.astype(np.bytes_), 'UTF-8'), delimiter='~|', fmt='%s', encoding=None)
np.savetxt('file.dat', chunk_data.values, delimiter='~|', fmt='%s', encoding='utf-8')
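Note that reading such a file back with pandas needs the separator escaped, since '|' is a regex metacharacter (a sketch):
import pandas as pd
# a multi-character separator is treated as a regex and needs the Python engine
df = pd.read_csv('file.dat', sep=r'~\|', engine='python', header=None)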
Think about what the line a::b::c means to a standard CSV tool: an a, an empty column, a b, an empty column, and a c. Even in a more complicated case with quoting or escaping: "abc::def"::2 means an abc::def, an empty column, and a 2.
So, all you have to do is add an empty column between every column, and then use : as a delimiter, and the output will be almost what you want.
I say “almost” because Pandas is going to quote or escape single colons. Depending on the dialect options you’re using, and the tool you’re trying to interact with, this may or may not be a problem. Unnecessary quoting usually isn’t a problem (unless you ask for QUOTE_ALL, because then your columns will be separated by :"":, so hopefully you don’t need that dialect option), but unnecessary escapes might be (e.g., you might end up with every single : in a string turned into a \: or something). So you have to be careful with the options. But it’ll work for the basic “quote as needed, with mostly standard other options” settings.
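A sketch of that trick (the column and file names here are made up):
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y'], 'c': [3.0, 4.0]})
# interleave an empty column between every pair of real columns; two
# consecutive ':' separators then come out as one '::' delimiter
out = pd.DataFrame(index=df.index)
for i, col in enumerate(df.columns):
    out[col] = df[col]
    if i < len(df.columns) - 1:
        out['blank%d' % i] = ''
out.to_csv('out.csv', sep=':', index=False, header=False)
# out.csv now contains lines like: 1::x::3.0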
I'm new to Python, and I am following this guide to implement a linear regression
http://nbviewer.jupyter.org/github/jdwittenauer/ipython-notebooks/blob/master/notebooks/ml/ML-Exercise1.ipynb
Basically I am on the step where I need to build a data set to import it into Python
I have created a text file with two columns; the values are separated by a tab.
However, this is what I get
I looked around online and it seems that the tab is the delimiter. What am I doing wrong? How can I build this text file?
I would advise using the official documentation instead of "looking around online" - if you check the pandas read_csv() documentation, it lists (at the very top) the default values of each argument. The default value of the sep (separator) argument is ','. So just change your call to pd.read_csv() to add sep='\t'.
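For example (a sketch; the file name and column names are taken from the exercise and may differ in your setup):
import pandas as pd
df = pd.read_csv('ex1data.txt', sep='\t', header=None,
                 names=['Population', 'Profit'])
print(df.head())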
Or use ',' instead of a tab as the delimiter in your text file ex1data.txt, since pandas' default delimiter is ','.
Here is an explanation from the official pandas documentation for the sep parameter:
sep : str, default ','
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used automatically. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'
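So a multi-character separator does work if you accept the regex behaviour (a sketch; 'data.txt' is a placeholder):
import pandas as pd
# a separator longer than one character is treated as a regular expression
# and forces the Python parsing engine
df = pd.read_csv('data.txt', sep='::', engine='python', header=None)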
I have a large set of strings, and am looking to extract a certain part of each of the strings. Each string contains a substring like this:
my_token:[
"key_of_interest"
],
This is the only place in each string where it says my_token. I was thinking of getting the end index position of 'my_token:["', then getting the beginning index position of '"],' and taking all the text between those two index positions.
Is there a better or more efficient way of doing this? I'll be doing this for strings of length ~10,000 and sets of size ~100,000.
Edit: The file is a .ion file. From my understanding it can be treated as a flat file - as it is text based and used for describing metadata.
How can this possibly be done "the dumbest and simplest way"?
find the starting position
look on for the ending position
grab everything indiscriminately between the two
This is indeed what you're doing (a minimal sketch follows this list). Thus any further improvement can only come from the optimization of each step. Possible ways include:
narrow down the search region (requires additional constraints/assumptions as per comment56995056)
speed up the search operation bits, which include:
extracting raw data from the format
you already did this by disregarding the format altogether - so you have to make sure there'll never be any incorrect parsing (e.g. your search terms embedded in strings elsewhere or matching a part of a token) as per comment56995034
elementary pattern comparison operation
unlikely to improve on in pure Python, since str.index is implemented in C already and the implementation is probably already as simple as it can possibly be
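Here is that three-step approach as code (a minimal sketch; the exact start and end tokens are assumptions based on the question):
def extract_key(s):
    # find the starting position
    i = s.index('my_token:[') + len('my_token:[')
    # look on for the ending position
    j = s.index(']', i)
    # grab everything indiscriminately between the two, trimming the
    # surrounding whitespace and quotes
    return s[i:j].strip().strip('"')

print(extract_key('... my_token:[\n    "key_of_interest"\n], ...'))  # key_of_interest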
The underlying requirement shows through when you clarify:
I was thinking of getting the end index position of 'my_token:["', then getting the beginning index position of '"],' and taking all the text between those two index positions.
That sounds like you're trying to avoid the correct approach: use a parser for whatever language is in the string.
There is no good reason to build directly on top of string primitives for parsing, unless you are interested in writing yet another parsing framework.
So, use libraries written by people who have dealt with the issues before you.
If it's JSON, use the standard library json module; ditto if it's some other language with a parser already in the Python standard library.
If it's some other widely-implemented standard: get whichever already-existing third-party Python library knows how to parse that properly.
If it's not already implemented: write a custom parser using pyparsing or some other well-known solid library.
So to make a good choice you need to know what the data format is (this is not answered by "what are the file names"; rather, you need to know the data format of the content of those files). Then you'll be able to search for a parser library that knows about that data format.
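If the fragment does turn out to be JSON, the whole task collapses to a couple of lines (a sketch, assuming a hypothetical well-formed input):
import json
snippet = '{"my_token": ["key_of_interest"]}'  # hypothetical input
key = json.loads(snippet)['my_token'][0]
print(key)  # key_of_interest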
Well, as already mentioned - a parser seems the best option.
But to answer your question without all this extra advice ... if you're just looking at speed, a parser isn't really the fastest method of doing this. The faster method, if you already have a string like this, would be to use regex.
import re
# re.search, not re.match: the token can occur anywhere in the string
matches = re.search(r'my_token:\[\s*"(.*?)"\s*\]', s)
key_of_interest = matches.groups()[0]
There are other issues that come up. For example, what if your key has a " inside it? Stringified JSON will automatically use an escape character there, and that will be captured by the regex too. And therefore this gets a bit too complicated.
And JSON is not regex-parsable in itself (is-json-a-regular-language). So, use at your own risk. But with the appropriate restrictions and assumptions, regex would be faster than a JSON parser.
While going through LPTHW, I've set on reading the code here:
https://github.com/BrechtDeMan/secretsanta/blob/master/pairing.py
I've been trying to understand why the output CSV has double-quotes. There are several questions here about the problem, but I'm not grokking it.
Where are the quotes getting introduced?
Edit: I wrote the author a couple of weeks back but haven't heard back.
Edit 2: An example of the output...
"Alice,101,alice#mail.org,Wendy,204,wendy#mail.org"
Double quotes are introduced in the write_file function.
CSV files look simple on the surface, but sooner or later you will encounter some more complex problems. The first one is: what should happen if the character denoting the delimiter occurs in field content? Because there is no real standard for the CSV format, different people had different ideas about the correct answer to this question.
The Python csv library tries to abstract away this complexity and the various approaches, making it easier to read and write CSV files following different rules. This is done with Dialect class objects.
The author of the write_file function decided to construct the output row manually, joining all fields and delimiter characters together, but then used the csv module to actually write the data into the file:
writer.writerow([givers_list[ind][1] + ',' + givers_list[ind][2]
+ ',' + givers_list[ind][3]
+ ',' + givers_list[rand_vec[ind]][1] + ','
+ givers_list[rand_vec[ind]][2] + ',' + givers_list[rand_vec[ind]][3]])
This inconsistent usage of the csv module resulted in the entire row of data being treated as a single field. Because that field contains the characters used as field delimiters, Dialect.quoting decides how it should be handled. The default quoting configuration, csv.QUOTE_MINIMAL, says that the field should be quoted using Dialect.quotechar - which defaults to the double quote character ("). That's why the entire field eventually ends up surrounded by double quote characters.
A fast and easy, but not correct, solution would be changing the quoting algorithm to csv.QUOTE_NONE. This will tell the writer object to never surround fields, and instead to escape special characters with Dialect.escapechar. According to the documentation, leaving escapechar set to None (the default) will raise an error when a special character needs escaping. I guess that setting it to an empty string could do the job.
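A rough sketch of that route, using a backslash as the escape character instead (some escape character is required once quoting is disabled and the delimiter appears in the data):
import csv
# 'pairs.csv' is a stand-in file name; with QUOTE_NONE the embedded
# commas are written as \, instead of the field being quoted
with open('pairs.csv', 'w', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_NONE, escapechar='\\')
    writer.writerow(['Alice,101,alice#mail.org'])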
The correct solution is feeding writer.writerow the expected input data - a list of fields. This should do (untested):
writer.writerow([givers_list[ind][1], givers_list[ind][2],
givers_list[ind][3],
givers_list[rand_vec[ind]][1],
givers_list[rand_vec[ind]][2], givers_list[rand_vec[ind]][3]])
In general, (double) quotes are needed when there is a separator character inside a field - and if there are quotes inside that field, they need to be 'escaped' with another quote.
Do you have an example of the output and the quotes you are talking about?
Edit (after example):
Ok, the whole row is treated as one field here. As Miroslaw Zalewski mentioned, those values should be treated as separate fields instead of one long string.