I would like to store a percentage with two decimal places (0 to 100). A decimal field (5 digits total) therefore seems like a good choice, but smallint would also get the job done with only two bytes.
How many bytes would be utilized in the database for this configuration of decimal field? And as a follow-up question, how would I query Postgres to investigate the details of the data structure used behind the scenes?
The documentation says:
The actual storage requirement is two bytes for each group of four decimal digits, plus three to eight bytes overhead.
That would be between 5 and 10 bytes in your case.
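If you want to sanity-check that for concrete values, Postgres' pg_column_size() reports the size of an individual datum. A minimal sketch, assuming psycopg2 is installed and a database is reachable (the connection string is hypothetical):

import psycopg2

conn = psycopg2.connect("dbname=test")   # hypothetical connection string
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT pg_column_size(12.34::numeric(5,2)) AS numeric_bytes,
               pg_column_size(1234::smallint)      AS smallint_bytes
    """)
    print(cur.fetchone())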
But storage size should be the last of your concerns here:
Depending on the types of the fields before and after that number, you might not save as much space as you think, because some data types are aligned at 4 or 8 byte boundaries, and you might lose some of the saved space to padding anyway.
If you use a lot of arithmetic operations on these data (number crunching), smallint or integer will perform much better than numeric.
On the other hand, you would have to shift the decimal point around during arithmetic processing. That is fast, but it can make your code less readable and turn into a maintenance burden.
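To make the decimal-point shifting concrete, here is a tiny illustration of the smallint-with-implicit-scale idea on the application side (a sketch, not part of the original answer):

def pct_to_db(percent: float) -> int:
    # store 12.34 % as the smallint value 1234 (hundredths of a percent)
    return round(percent * 100)

def pct_from_db(raw: int) -> float:
    # turn 1234 back into 12.34
    return raw / 100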
To learn about the storage and processing of numeric, you'd have to browse the source code.
I'm building a social network and I want to generate my own random URLs. The question that has been on my mind for a long time is how to create URLs like Instagram post URLs (my project uses Django), for example:
https://www.instagram.com/p/CeqcZdeoNaP/
or
https://www.youtube.com/watch?v=MhRaaU9-Jg4
My problem is that, on the one hand, these URLs have to be unique, and on the other hand, at large scale (say more than 100,000 uploaded posts) it does not seem reasonable to simply set unique=True, because database performance decreases.
Another point is the use of UUIDs, which largely solves the uniqueness problem, but the strings produced by a UUID are very long, and if I shorten them by reducing the number of characters, there is a possibility of a collision, i.e. that several identical strings are produced.
I wanted to know if there is a solution to this issue, so that the generated URLs are both short and unique while maintaining database performance.
Thank you for your time.
You might choose to design around ULIDs. https://github.com/ulid/spec
It's still 128 bits. The engineering tradeoff they made was 48 bits of predictable low-entropy clock catenated with an 80-bit nonce. Starting with a timestamp makes it play very nicely with postgres B-trees.

They serialize 5 bits per character instead of the 4 bits offered by hex. You could choose to go for 6 if you want, for the sake of brevity. Similarly, you could also choose to adjust the clock tick granularity and reduce its range.

Keeping the Birthday Paradox in mind, you might choose to use a smaller nonce as well. The current design offers good collision resistance up to around 2^40 identifiers per clock tick, which might be overkill for your needs.
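As a sketch of the kind of scaled-down ULID described above (the sizes and alphabet here are illustrative choices, not the ULID spec: a seconds-granularity clock, a smaller nonce, and 6 bits per character via URL-safe base64):

import base64
import secrets
import time

def short_id(clock_bits: int = 32, random_bits: int = 48) -> str:
    ts = int(time.time()) & ((1 << clock_bits) - 1)   # coarse, seconds-level clock
    rnd = secrets.randbits(random_bits)               # the random "nonce" part
    n = (ts << random_bits) | rnd                     # timestamp-first, ULID-style
    raw = n.to_bytes((clock_bits + random_bits + 7) // 8, "big")
    # Note: unlike ULID's Crockford base32, the base64 alphabet is not
    # lexicographically order-preserving, so index ordering only roughly
    # follows time; use a base32 variant if strict ordering matters.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

print(short_id())   # 80 bits -> 14 URL-safe characters

With 48 random bits, the birthday bound suggests collisions only become likely somewhere around 2^24 identifiers generated in the same second, which may or may not be enough headroom for your scale.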
I have data (mostly a series of numpy arrays) that I want to convert into text that can be copied/pasted/emailed, etc. I created the following function, which does this.
import base64 as b64
import pickle
import zlib

def convert_to_ascii85(x):
    p = pickle.dumps(x)          # serialize the object to bytes
    p = zlib.compress(p)         # compress the pickle
    return b64.b85encode(p)      # encode the compressed bytes as ASCII85 text
My issue is that the string it produces is longer than it needs to be because it only uses a subset of letters, numbers, and symbols. If I was able to encode using unicode, I feel like it could produce a shorter string because it would have access to more characters. Is there a way to do this?
Edit to clarify:
My goal is NOT the smallest amount of data/information/bytes. My goal is the smallest number of characters. The reason is that the channel I'm sending the data through is capped by characters (100k to be precise) instead of bytes (strange, I know). I've already tested that I can send 100k unicode characters, I just don't know how to convert my bytes into unicode.
UPDATE: I just saw that you changed your question to clarify that you care about character length rather than byte length. This is a really strange constraint. I've never heard of it before. I don't know quite what to make of it. But if that's your need, and you want predictable blocking behavior, then I'm thinking that your problem is pretty simple. Just pick the compatible character encoding that can represent the most possible unique characters, and then map blocks of your binary across that character set, making each block as long as it can be while its number of possible values (2^bits) still does not exceed the number of representable characters in your encoding. Each such block then becomes a single character. Since this constraint is kinda strange, I don't know if there are libraries out there that do this.
UPDATE 2: Being curious about the above myself, I just Googled and found this: https://qntm.org/unicodings. If your tools and communication channels can deal with UTF-16 or UTF-32, then you might be onto something in seeking to use that. If so, I hope this article opens up the solution you're looking for. The article is still optimizing for byte length rather than character length, so maybe it won't provide the optimal solution, but it can only help (many more usable bits per character than ASCII's 7 or 8). I couldn't find anything seeking to optimize on character count alone, but maybe a scheme like Base65536 is your answer. Check out https://github.com/qntm/base65536 .
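To make the "one block of bits per character" idea above concrete, here is a naive sketch (not Base65536, and not any library's API) that packs 14 bits into each character by offsetting into the CJK Unified Ideographs range, which contains well over 2^14 contiguous assigned code points:

# Naive "bits per character" packing: NOT Base65536, just an illustration.
BASE = 0x4E00          # start of a large contiguous block of assigned code points
BITS_PER_CHAR = 14     # 2**14 = 16384 distinct characters

def bits_encode(data: bytes) -> str:
    n = int.from_bytes(data, "big")
    nbits = len(data) * 8
    chars = []
    for shift in range(0, nbits, BITS_PER_CHAR):
        chars.append(chr(BASE + ((n >> shift) & (2**BITS_PER_CHAR - 1))))
    # prefix with the original byte length so decoding is unambiguous
    return f"{len(data)}:" + "".join(chars)

def bits_decode(text: str) -> bytes:
    length, _, body = text.partition(":")
    n = 0
    for i, ch in enumerate(body):
        n |= (ord(ch) - BASE) << (i * BITS_PER_CHAR)
    return n.to_bytes(int(length), "big")

Compared to Base85's roughly 6.4 bits per character, this fits a bit more than twice as much binary under a character cap; schemes like Base65536 push it to 16 bits per character by using a carefully curated (non-contiguous) alphabet.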
If it is byte length that you care about, and you want to stick to what is usually meant by "printable characters" or "plain printable text", then here's my original answer...
There are options for getting better "readable text" encoding space efficiency from an encoding other than Base85. There's also a case to be made for giving up more space efficiency and going with Base64. Here I'll make the case for using both Base85 and Base64. If you can use Base85, you only take a 25% hit on the inflation of your binary, and you save a whole lot of headaches in doing so.
Base85 is pretty close to the best you're going to do if you seek to encode arbitrary binary to "plain text", and it is the BEST you can do if you want a "plain text" encoding that you can logically break into meaningful, predictable chunks. You can in theory use a character set that includes printable characters in the high-ASCII range, but experience has shown that many tools and communication channels don't deal well with high-ASCII if they can't handle straight binary. You don't get much additional space savings from the extra 5 bits or so per 4 binary bytes that could potentially be squeezed out of a 256-character (high-ASCII) set versus the 128-character ASCII set.
For any BaseXX encoding, the algorithm takes incoming binary bits and encodes them as tightly as it can using the XX printable characters it has at its disposal. Base85 will be more compact than Base64 because it uses more of the printable characters (85) than Base64 does (64 characters).
There are 95 printable characters in standard ASCII, so there is a Base95 that is the most compact encoding possible using all the printable characters. But trying to use all 95 characters is messy, because it leads to uneven blockings of the incoming bits: each 4 binary bytes maps to some fractional number of characters less than 5.
It turns out that 85 characters is what you need to encode 4 bytes as exactly 5 printable characters. Many will accept a small amount of extra length in exchange for the property that every 4 binary bytes encode to exactly 5 ASCII characters. That is only a 25% inflation of the binary, which is not bad at all for all the headaches it saves. Hence the motivation behind Base85.
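A quick way to see those expansion factors with the standard library (the payload size is illustrative):

import base64
import os

raw = os.urandom(6000)                      # arbitrary binary payload
b64_len = len(base64.b64encode(raw))        # 4 chars per 3 bytes -> ~33% bigger
b85_len = len(base64.b85encode(raw))        # 5 chars per 4 bytes -> 25% bigger
print(b64_len / len(raw), b85_len / len(raw))   # ~1.333 and 1.25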
Base64 is used to produce longer, but even less problematic, encodings. Characters that cause trouble in various text documents, like HTML, XML, JSON, etc., are not used. In this way, Base64 is useful in almost any context without any escaping. You have to be more careful with Base85, as it doesn't throw out any of these problematic characters. For encoding/decoding efficiency it uses the range 33 ('!') through 117 ('u'), starting at 33 rather than 32 just to avoid the often problematic space character. The characters above 'u' that it doesn't use are nothing special.
So that's pretty much the story on the binary -> ASCII encoding side. The other question is what you can do to reduce the size of what you're representing before encoding its binary representation to ASCII. You're choosing to use pickle.dumps() and zlib.compress(). Whether those are your best choices is left for another discussion...
FontTools is producing some XML with all sorts of details in this structure:
<cmap>
<tableVersion version="0"/>
<cmap_format_4 platformID="0" platEncID="3" language="0">
<map code="0x20" name="space"/><!-- SPACE -->
<!--many, many more characters-->
</cmap_format_4>
<cmap_format_0 platformID="1" platEncID="0" language="0">
<map code="0x0" name=".notdef"/>
<!--many, many more characters again-->
</cmap_format_0>
<cmap_format_4 platformID="0" platEncID="3" language="0"> <!--"cmap_format_4" again-->
<map code="0x20" name="space"/><!-- SPACE -->
<!--more "map" nodes-->
</cmap_format_4>
</cmap>
I'm trying to figure out every character this font supports, so these code attributes are what I'm interested in. I believe I am correct in thinking that all code attributes are UTF-8 values: is this correct? I am also curious why there are two cmap_format_4 nodes (they seem to be identical, but I haven't tested that across a thorough number of fonts, so if someone familiar with this module knows for certain, that is my first question).
To be assured I am seeing all characters contained in the typeface, do I need to combine all code attribute values, or just one or two? Will FontTools always produce these three XML nodes, or is the quantity variable? Any idea why? The documentation is a little vague.
The number of cmap_format_N nodes ("cmap subtables") is variable, as is the N (the format). There are several formats; the most common is 4, but there are also formats 12, 0, 6, and a few others.
Fonts may have multiple cmap subtables, but are not required to. The reason for this is the history of the development of TrueType (which has evolved into OpenType). The format was invented before Unicode, at a time when each platform had its own way(s) of character mapping. The different formats and the ability to have multiple mappings were a necessity at the time in order to have a single font file that could map everything without multiple files, duplication, etc. Nowadays most fonts that are produced will only have a single Unicode subtable, but there are many floating around that have multiple subtables.
The code values in the map node are code point values expressed as hexadecimal. They might be Unicode values, but not necessarily (see the next point).
I think your font may be corrupted (or possibly there was a copy/paste mix-up). It is possible to have multiple cmap_format_N entries in the cmap, but each combination of platformID/platEncID/language should be unique. Also, it is important to note that not all cmap subtables map Unicodes; some express older, pre-Unicode encodings. You should look at tables where platformID="3" first, then platformID="0", and finally platformID="2" as a last resort. Other platformIDs do not necessarily map Unicode values.
As for discovering "all Unicodes mapped in a font": that can be a bit tricky when there are multiple Unicode subtables, especially if their contents differ. You might get close by taking the union of all code values in all of the subtables that are known to be Unicode maps, but it is important to understand that most platforms will only use one of the maps at a time. Usually there is a preferred picking order similar to what I stated above; when one is found, that is the one used. There's no standardized order of preference that applies to all platforms (that I'm aware of), but most of the popular ones follow an order pretty close to what I listed.
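As a sketch of that union approach with fontTools (a reasonably recent version is assumed; the font path is hypothetical):

from fontTools.ttLib import TTFont

font = TTFont("SomeFont.ttf")               # hypothetical path
codepoints = set()
for subtable in font["cmap"].tables:
    if subtable.isUnicode():                # skip legacy, non-Unicode encodings
        codepoints.update(subtable.cmap.keys())

print(len(codepoints), sorted(hex(cp) for cp in codepoints)[:10])

# If you only want a single "preferred" Unicode mapping rather than the union,
# font["cmap"].getBestCmap() returns one merged codepoint -> glyph-name dict.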
Finally, regarding Unicode vs UTF-8: the code values are Unicode code points; NOT UTF-8 byte sequences. If you're not sure of the difference, spend some time reading about character encodings and byte serialization at Unicode.org.
I'm processing, with Python, a large CSV file (200 MB) that was generated using R (I'm the one using Python).
I do some tinkering with the file (normalization, scaling, removing junk columns, etc.) and then save it again using numpy's savetxt with ',' as the delimiter to keep the CSV format.
Thing is, the new file is almost twice as large as the original (almost 400 MB). The original data as well as the new data are just arrays of floats.
If it helps, it looks as if the new file has really small values that need exponential notation, which the original did not have.
Any idea why this is happening?
Have you looked at the way floats are represented in text before and after? You might have a line "1.,2.,3." become "1.000000e+0,2.000000e+0,3.000000e+0" or something like that; the two are both valid and both represent the same numbers.
More likely, however, is that if the original file contained floats as values with relatively few significant digits (for example "1.1, 2.2, 3.3"), after you do normalization and scaling, you "create" more digits which are needed to represent the results of your math but do not correspond to real increase in precision (for example, normalizing the sum of values to 1.0 in the last example gives "0.1666666, 0.3333333, 0.5").
I guess the short answer is that there is no guarantee (and no requirement) for floats represented as text to occupy any particular amount of storage space, or less than the maximum possible per float; it can vary a lot even if the data remains the same, and will certainly vary if the data changes.
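One concrete, numpy-specific place to look: np.savetxt's default format is '%.18e', which spells out every float with 18 significant digits in exponential notation. If that is the culprit, a sketch like this caps the width with the fmt argument (file names and precision are illustrative):

import numpy as np

data = np.random.rand(1000, 20)

np.savetxt("default.csv", data, delimiter=",")              # default fmt='%.18e'
np.savetxt("compact.csv", data, delimiter=",", fmt="%.6g")  # 6 significant digits

# "compact.csv" will be a fraction of the size of "default.csv"; choose a
# precision that matches what your data actually carries.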
A bit of background first: GeoModel is a library I wrote that adds very basic geospatial indexing and querying functionality to App Engine apps. It is similar in approach to geohashing. The equivalent location hash in GeoModel is called a 'geocell.'
Currently, the GeoModel library adds 13 properties (location_geocell_n, n = 1..13) to each location-aware entity. For example, an entity can have property values such as:
location_geocell_1 = 'a'
location_geocell_2 = 'a3'
location_geocell_3 = 'a3f'
...
This is required in order to not use up an inequality filter during spatial queries.
The problem with the 13-properties approach is that, for any geo query an app would like to run, 13 new indexes must be defined and built. This is definitely a maintenance hassle, as I've just painfully realized while rewriting the demo app for the project. This leads to my first question:
QUESTION 1: Is there any significant storage overhead per index? i.e. if I have 13 indexes with n entities in each, versus 1 index with 13n entities in it, is the former much worse than the latter in terms of storage?
It seems like the answer to (1) is no, per this article, but I'd just like to see if anyone has had a different experience.
Now, I'm considering adjusting the GeoModel library so that instead of 13 string properties, there'd only be one StringListProperty called location_geocells, i.e.:
location_geocells = ['a', 'a3', 'a3f']
This results in a much cleaner index.yaml. But, I do question the performance implications:
QUESTION 2: If I switch from 13 string properties to 1 StringListProperty, will query performance be adversely affected; my current filter looks like:
query.filter('location_geocell_%d =' % len(search_cell), search_cell)
and the new filter would look like:
query.filter('location_geocells =', search_cell)
Note that the first query has a search space of n entities, whereas the second query has a search space of 13n entities.
It seems like the answer to (2) is that both result in equal query performance, per tip #6 in this blog post, but again, I'd like to see if anyone has any differing real-world experiences with this.
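For reference, here is a sketch of the single-list layout being considered, using the classic google.appengine.ext.db API (class and property names are illustrative):

from google.appengine.ext import db

class LocationAwareEntity(db.Model):
    location = db.GeoPtProperty()
    location_geocells = db.StringListProperty()   # e.g. ['a', 'a3', 'a3f', ...]

# List properties index every element, so a plain equality filter matches any
# entity whose list contains the search cell:
search_cell = 'a3f'   # hypothetical cell
q = LocationAwareEntity.all()
q.filter('location_geocells =', search_cell)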
Lastly, if anyone has any other suggestions or tips that can help improve storage utilization, query performance and/or ease of use (specifically w.r.t. index.yaml), please do let me know! The source can be found here: geomodel & geomodel.py
You're correct that there's no significant overhead per-index - 13n entries in one index is more or less equivalent to n entries in 13 indexes. There's a total index count limit of 100, though, so this eats up a fair chunk of your available indexes.
That said, using a ListProperty is definitely a far superior approach from usability and index consumption perspectives. There is, as you supposed, no performance difference between querying a small index and a much larger index, supposing both queries return the same number of rows.
The only reason I can think of for using separate properties is if you knew you only needed to index on certain levels of detail - but that could be accomplished better at insert-time by specifying the levels of detail you want added to the list in the first place.
Note that in either case you only need the indexes if you intend to query the geocell properties in conjunction with a sort order or inequality filter, though - in all other cases, the automatic indexing will suffice.
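A sketch of that insert-time idea (the cell value and level range are hypothetical): store only the prefixes for the resolutions you will actually query at, rather than all 13.

full_cell = 'a3f9c24d81b27'                   # hypothetical 13-level geocell
query_levels = range(6, 11)                   # only the resolutions you query at
geocells_to_store = [full_cell[:n] for n in query_levels]
print(geocells_to_store)   # ['a3f9c2', 'a3f9c24', 'a3f9c24d', 'a3f9c24d8', 'a3f9c24d81']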
Lastly, if anyone has any other suggestions or tips that can help improve storage utilization, query performance and/or ease of use
The StringListProperty is the way to go for the reasons mentioned above, but in actual usage one might want to add the geocells to one's own previously existing StringListProperty, so one could query against multiple properties.
So, if you were to provide a lower-level API, it could work with full text search implementations like Bill Katz's:
def point2StringList(point, stub="blah"):
    ...
    return ["blah_1:a", "blah_2:a3", "blah_3:a3f"]   # and so on

def boundingbox2Wheresnippet(box, stringlist="words", stub="blah"):
    ...
    return "words='%s_3:a3f' AND words='%s_3:b4g' ..." % (stub, stub)
etc.
Looks like you ended up with 13 indices because you encoded in hex (for human readability / map levels?).
If you had utilized the full potential of a byte (ByteString), you'd have had 256 cells per character (byte) instead of 16, thereby reducing the number of indices needed for the same precision.
ByteString is just a subclass of str and is indexed the same way if it is less than 500 bytes in length.
However, the number of levels might be lower; to me, 4 or 5 levels is practically good enough for most situations on 'the Earth'. For a larger planet, or when cataloging each sand particle, more divisions might need to be introduced anyway, irrespective of the encoding used. In either case, ByteString is better than hex encoding and helps reduce indexing substantially.
For representing 4 billion low(est)-level cells, all we need is 4 bytes, or just 4 indices (from basic computer architecture / memory addressing).
To represent the same with hex encoding, we'd need 8 hex digits, or 8 indices.
I could be wrong. Maybe the number of index levels matching map zoom levels is more important. Please correct me. I'm planning to try this instead of hex if just one (other) person here finds this meaningful :)
Or a solution that has fewer large cells (16) but more (128,256) as we go down the hierarchy.
Any thoughts?
e.g.:
[0-15][0-31][0-63][0-127][0-255] gives 1G lowest-level cells with 5 indices, doubling the branching factor (one extra bit) at each level.
[0-15][0-63][0-255][0-255][0-255] gives 16G lowest-level cells with 5 indices.
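For what it's worth, the cell counts behind these schemes are easy to check (this just reproduces the arithmetic above):

print(16 ** 8)                       # ~4.29e9 cells from 8 hex characters
print(256 ** 4)                      # the same ~4.29e9 cells from just 4 bytes
print(16 * 32 * 64 * 128 * 256)      # ~1.07e9  (the doubling-branching scheme)
print(16 * 64 * 256 * 256 * 256)     # ~1.72e10 (the 16G scheme)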