Decimal point / comma handling in pandas.read_clipboard - python

My script reads in data from the clipboard with pd.read_clipboard(). Since it runs on different computers and receives data from different sources, with either European decimal format (1,23) or international decimal format (1.23), I'd like some flexibility in the parsing process.
I could first read the content of the clipboard and decide according to some heuristics if I set decimal='.' or decimal=',' as parameter to pd.read_clipboard(), but is there a more elegant way to achieve this? So far I haven't found an option to let pandas decide on the correct decimal separator.
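A hedged sketch of the heuristic route described above: peek at the raw clipboard text, guess the separator, then hand the guess to pd.read_clipboard(). clipboard_get is pandas' internal clipboard helper (any clipboard library would do), and the guessing rule is an assumption, not a pandas feature.
import pandas as pd
from pandas.io.clipboard import clipboard_get  # internal helper used by read_clipboard

# Crude guess: commas but no dots suggests European decimals. Tab-separated
# clipboard data keeps this reasonably unambiguous.
text = clipboard_get()
decimal = "," if ("," in text and "." not in text) else "."
df = pd.read_clipboard(decimal=decimal)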

Related

how to read a csv and write it back exactly the same with pandas overcoming float imprecision

I would like to read in a csv and write it back exactly the same as it was using pandas or similar
example csv
019-12-12 23:45:00,95480,12.41,-10.19,11.31851,2.1882
and when I go to write it back, due to floating point properties I might get something like
019-12-12 23:45:00,95480,12.410000009,-10.19,11.31851.000000002,2.1822
I've seen suggestions to use float_format but the format is different for each column and different across files I'm looping through.
I'm not sure what you're doing, but if you need pandas and want to re-save the file, you likely will want to change data somewhere. If so, I'd recommend using
pd.set_option('display.precision', num_decimals)
as long as your decimals are reasonably close in precision, in the example given, 4 would allow for enough precision and remove any straggling floating point inaccuracy. Otherwise, you'll have to look for several zeros in a row and delete all decimal places after that.
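A minimal sketch of that idea; note that df.round() changes the data itself (display precision alone does not affect what to_csv writes), and the file names and the number of decimals here are assumptions:
import pandas as pd

df = pd.read_csv("input.csv", header=None)   # hypothetical re-read of the file
df = df.round(5)                             # 5 covers the widest column in the example above
df.to_csv("output.csv", header=False, index=False)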
If you don't need to change any data, I would go with an alternative solution: shutil
import shutil
shutil.copyfile(path_to_file, path_to_target_file)
This way, there's no mutation that can occur as it's just copying the raw contents.

How many bytes will Django/Postgres use for a decimal field?

I would like to store a percentage with two decimal places (0 to 100). Thus a decimal field (5 digits total) seems like a good choice, but smallint would also get the job done with only two bytes.
How many bytes would be utilized in the database for this configuration of decimal field? And as a follow-up question, how would I query Postgres to investigate the details of the data structure used behind the scenes?
The documentation says:
The actual storage requirement is two bytes for each group of four decimal digits, plus three to eight bytes overhead.
That would be between 5 and 10 bytes in your case.
But storage size should be the last of your concerns here:
Depending on the types of the fields before and after that number, you might not save as much space as you think, because some data types are aligned at 4 or 8 byte boundaries, and you might lose some of the saved space to padding anyway.
If you use a lot of arithmetic operations on these data (number crunching), smallint or integer will perform much better than numeric.
On the other hand, you'll have to shift the decimal point around during arithmetic. That is fast, but might make your code less readable, resulting in a maintenance burden.
To learn about the storage and processing of numeric, you'd have to browse the source code.
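To make the two options concrete, here is a minimal Django sketch; the model and field names are made up for illustration.
from django.db import models

class Metric(models.Model):
    # Option 1: maps to numeric(5,2) in Postgres and stores 0.00-100.00 directly.
    percent_decimal = models.DecimalField(max_digits=5, decimal_places=2)

    # Option 2: maps to a 2-byte smallint; store hundredths of a percent and
    # shift the decimal point in code (12.34% is stored as 1234).
    percent_hundredths = models.SmallIntegerField()

    @property
    def percent(self):
        return self.percent_hundredths / 100
For the follow-up question, running Postgres' pg_column_size() on a stored value (e.g. SELECT pg_column_size(some_column) FROM some_table) is one way to see how many bytes a particular value actually occupies on disk.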

Did Python pandas to_csv add decimal separator support, like read_csv?

I just noticed (in pandas 0.16.x) there is now a decimal parameter in read_csv() method. It's really useful and simplifies a lot when reading European CSV files (with comma decimal separator).
I didn't see the same feature in to_csv().
Is there some reason not to have it, or is it just a question of time or release? I would very much appreciate having this feature too.
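For reference, a hedged sketch of the read side the question refers to, plus one workaround on the write side (newer pandas releases do accept a decimal argument in to_csv itself); the sample data is made up:
import io
import pandas as pd

# Reading European-style decimals, as the question describes:
csv_eu = "a;b\n1,23;4,56\n"
df = pd.read_csv(io.StringIO(csv_eu), sep=";", decimal=",")

# One workaround for writing: fix the float format, then swap the separator.
# The naive replace assumes no other dots appear in the output.
out = df.to_csv(sep=";", index=False, float_format="%.2f").replace(".", ",")
print(out)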

FontTools: extracting useful UTF information provided by it

FontTools is producing some XML with all sorts of details in this structure
<cmap>
<tableVersion version="0"/>
<cmap_format_4 platformID="0" platEncID="3" language="0">
<map code="0x20" name="space"/><!-- SPACE -->
<!--many, many more characters-->
</cmap_format_4>
<cmap_format_0 platformID="1" platEncID="0" language="0">
<map code="0x0" name=".notdef"/>
<!--many, many more characters again-->
</cmap_format_0>
<cmap_format_4 platformID="0" platEncID="3" language="0"> <!--"cmap_format_4" again-->
<map code="0x20" name="space"/><!-- SPACE -->
<!--more "map" nodes-->
</cmap_format_4>
</cmap>
I'm trying to figure out every character this font supports, so these code attributes are what I'm interested in. I believe I am correct in thinking that all code attributes are UTF-8 values: is this correct? I am also curious why there are two cmap_format_4 nodes (they seem to be identical, but I haven't tested that against a thorough number of fonts, so if someone familiar with this module knows for certain, that is my first question).
To be assured I am seeing all characters contained in the typeface, do I need to combine all code attribute values, or just one or two? Will FontTools always produce these three XML nodes, or is the quantity variable? Any idea why? The documentation is a little vague.
The number of cmap_format_N nodes ("cmap subtables") is variable, as is the N (the format). There are several formats; the most common is 4, but there are also formats 12, 0, 6, and a few others.
Fonts may have multiple cmap subtables, but are not required to. The reason for this is the history of the development of TrueType (which has evolved into OpenType). The format was invented before Unicode, at a time when each platform had its own way(s) of character mapping. The different formats and the ability to have multiple mappings were a necessity at the time in order to have a single font file that could map everything without multiple files, duplication, etc. Nowadays most fonts that are produced will have only a single Unicode subtable, but there are many floating around that have multiple subtables.
The code values in the map node are code point values expressed as hexadecimal. They might be Unicode values, but not necessarily (see the next point).
I think your font may be corrupted (or possibly there was a copy/paste mix-up). It is possible to have multiple cmap_format_N entries in the cmap, but each combination of platformID/platEncID/language should be unique. Also, it is important to note that not all cmap subtables map Unicodes; some express older, pre-Unicode encodings. You should look at tables where platformID="3" first, then platformID="0" and finally platformID="2" as a last resort. Other platformIDs do not necessarily map Unicode values.
As for discovering "all Unicodes mapped in a font": that can be a bit tricky when there are multiple Unicode subtables, especially if their contents differ. You might get close by taking the union of all code values in all of the subtables that are known to be Unicode maps, but it is important to understand that most platforms will only use one of the maps at a time. Usually there is a preferred picking order similar to what I stated above; when one is found, that is the one used. There's no standardized order of preference that applies to all platforms (that I'm aware of), but most of the popular ones follow an order pretty close to what I listed.
Finally, regarding Unicode vs UTF-8: the code values are Unicode code points; NOT UTF-8 byte sequences. If you're not sure of the difference, spend some time reading about character encodings and byte serialization at Unicode.org.
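A hedged sketch of extracting those code points with fontTools itself; getBestCmap() picks a preferred Unicode subtable in roughly the order described above, and the font path is a placeholder.
from fontTools.ttLib import TTFont

font = TTFont("SomeFont.ttf")   # placeholder path
cmap_table = font["cmap"]

# Preferred Unicode subtable: {codepoint: glyph name}
best = cmap_table.getBestCmap()
print([hex(cp) for cp in sorted(best)])

# Union of every Unicode subtable, if you want all mapped code points:
all_codepoints = set()
for subtable in cmap_table.tables:
    if subtable.isUnicode():
        all_codepoints.update(subtable.cmap.keys())
print(len(all_codepoints))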

Preferred data format for R dataframe

I am writing a data-harvesting code in Python. I'd like to produce a data frame file that would be as easy to import into R as possible. I have full control over what my Python code will produce, and I'd like to avoid unnecessary data processing on the R side, such as converting columns into factor/numeric vectors and such. Also, if possible, I'd like to make importing that data as easy as possible on the R side, preferably by calling a single function with a single argument of file name.
How should I store data into a file to make this possible?
You can write data to CSV using Python's csv module (http://docs.python.org/2/library/csv.html); then it's a simple matter of using read.csv in R (see ?read.csv).
When you read in data to R using read.csv, unless you specify otherwise, character strings will be converted to factors, numeric fields will be converted to numeric. Empty values will be converted to NA.
The first thing you should do after you import some data is to look at str() of it (see ?str) to ensure that the classes of the data contained within meet your expectations. Many times I have made the mistake of mixing a character value into a numeric field and ended up with a factor instead of a numeric vector.
One thing to note is that you may have to set your own NA strings. For example, if you have "-", ".", or some other such character denoting a blank, you'll need to use the na.strings argument (which can accept a vector of strings ie, c("-",".")) to read.csv.
If you have date fields, you will need to convert them properly. R does not necessarily recognize dates or times without you specifying what they are (see ?as.Date)
If you know in advance what each column is going to be you can specify the class using colClasses.
A thorough read through of ?read.csv will provide you with more detailed information. But I've outlined some common issues.
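To tie the points above together, a minimal sketch of the Python side, writing a CSV that read.csv() can ingest with few extra arguments: ISO-8601 dates, empty fields for missing numeric values (read.csv reads those as NA), and a header row. The column names and values are made up for illustration.
import csv
from datetime import date

rows = [
    {"id": 1, "label": "a", "value": 1.25, "measured": date(2013, 1, 2)},
    {"id": 2, "label": "b", "value": None, "measured": date(2013, 1, 3)},
]

with open("harvest.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["id", "label", "value", "measured"])
    writer.writeheader()
    for row in rows:
        writer.writerow({k: ("" if v is None else v) for k, v in row.items()})

# In R:  df <- read.csv("harvest.csv", colClasses = c(measured = "Date"))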
Brandon's suggestion of using CSV is great if your data isn't enormous, and particularly if it doesn't contain a whole honking lot of floating point values, in which case the CSV format is extremely inefficient.
An option that handles huge datasets a little better might be to construct an equivalent DataFrame in pandas and use its facilities to dump it to HDF5, then open it in R that way. See this question for an example of that.
It may feel like overkill, but you could also transfer the dataframe in-memory to R using pandas's experimental R interface and then save it from R directly.
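A short sketch of the HDF5 route (it requires the PyTables package on the Python side; the file and key names are placeholders, and on the R side a package such as rhdf5 can read the result, as in the question linked above).
import pandas as pd

df = pd.DataFrame({"x": range(5), "y": [0.1, 0.2, 0.3, 0.4, 0.5]})
df.to_hdf("harvest.h5", key="df", mode="w")   # needs the "tables" (PyTables) package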
