I have a Python script aggregating data stored in several CSV files. After aggregating the data, I export it using
df.to_json("myfile.json", orient="table")
This is done locally on my Windows 10 computer.
I run the same code and functions on the same data on AWS Lambda, reading the data from an S3 bucket. I downloaded the exported JSON file and loaded both files with pandas.
Using
myLocaljson.equals(myAwsjson)
I verified that they are indeed equal. The dtypes are also equal. But the sizes on disk are different: one is 259 KB and the other is 267 KB.
Do you have ideas why?
This is most probably caused by the whitespace characters.
Windows uses the line ending \r\n, while most other OSes use only \n. That's one extra character per line.
Also, this JSON file:
{"foo":"bar","baz":[]}
is equal to this one:
{
	"foo": "bar",
	"baz": [
	]
}
which is actually a nice representation of this:
{\n\t"foo": "bar",\n\t"baz": [\n\t]\n}
So two JSON objects can be "equal" even though the strings representing them are very different.
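A quick way to confirm this is to compare the raw bytes instead of the parsed objects. A minimal sketch, assuming the two downloaded files are named as below:

import json

# Hypothetical filenames for the two exports.
with open("myfile_local.json", "rb") as f:
    local_bytes = f.read()
with open("myfile_aws.json", "rb") as f:
    aws_bytes = f.read()

# The byte counts can differ even though the parsed content is identical.
print(len(local_bytes), len(aws_bytes))
print(local_bytes.count(b"\r\n"), aws_bytes.count(b"\r\n"))  # CRLF vs LF
assert json.loads(local_bytes) == json.loads(aws_bytes)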
I would guess non-printable characters at the end of the file. Open both files with a hex editor and compare them.
The other option is the way small files are stored on a hard disk: 1024 files of 1 KB each will not take up exactly 1024 KB of disk space, but significantly more, depending on the cluster size of the drive.
Related
I'm trying to load a set of public flat files (using COPY INTO from Python) that are apparently saved in ANSI format. Some of the files load with no issue, but in at least one case the COPY INTO statement hangs (no error is returned, and nothing is logged, as far as I can tell). I isolated the error to a particular row with a non-standard character, e.g. the ¢ character in the 2nd row:
O}10}49771}2020}02}202002}4977110}141077}71052900}R }}N}0}0}0}0}0}0}0}0}0}0}0}0}0}0}0}0}08}CWI STATE A}CENTENNIAL RESOURCE PROD, LLC}PHANTOM (WOLFCAMP)
O}10}50367}2020}01}202001}5036710}027348}73933500}R }}N}0}0}0}0}0}0}0}0}0}0}0}0}0}0}0}0}08}A¢C 34-197}APC WATER HOLDINGS 1, LLC}QUITO, WEST (DELAWARE)
Re-saving these rows into a file with UTF-8 encoding solves the issue, but I thought I'd pass this along in case someone wants to take a look at the back-end to handle these types of characters and/or return some kind of error.
Why do you save into a file?
If possible, just handle the strings internally in Python, decoding the ANSI bytes and re-encoding them as UTF-8:
utf8_bytes = ansi_bytes.decode("cp1252").encode("utf-8")  # assuming "ANSI" means Windows-1252 here
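If the data has to go through a file anyway, here is a minimal sketch of re-encoding the whole file to UTF-8 first (the filenames are hypothetical, and cp1252 is an assumption for what "ANSI" means on a US Windows machine):

# Read the ANSI (assumed Windows-1252) file and write a UTF-8 copy.
with open("flatfile.txt", "r", encoding="cp1252") as src, \
        open("flatfile_utf8.txt", "w", encoding="utf-8") as dst:
    dst.write(src.read())  # e.g. the ¢ byte 0xA2 becomes the UTF-8 pair 0xC2 0xA2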
What is a sparse file and why do we need it?
The only thing I have been able to gather is that sparse files are very large files (gigabytes) stored efficiently. How are they efficient?
Say you have a file containing many empty bytes (\x00). Long runs of these empty bytes are called holes. Storing them literally is just not efficient: we know there are many of them in the file, so why store them on the storage device? We could instead store metadata describing those zeros. When a process reads the file, the zero-byte blocks are generated dynamically, as opposed to being read from physical storage (see the schematic on the Wikipedia page for sparse files).
This is why a sparse file is efficient, because it does not store the zeros on disk, instead it holds enough data describing the zeros that will be generated.
Note: the logical file size is greater than the physical file size for sparse files. This is because we have not stored the zeros physically on a storage device.
Edit:
When you run:
$ dd if=/dev/zero of=output bs=1G count=4
The command copies four 1 GiB blocks of null bytes to output. To see that:
$ stat output
File: output
Size: 4294967296 Blocks: 8388616 IO Block: 4096 regular file
--omitted--
You can see that this file has 8388616 blocks allocated to it. These blocks store nothing but empty bytes copied from /dev/zero, and they do occupy physical disk space: they are the file's holes, stored explicitly on disk rather than as sparse zeros. dd did what you asked for: it copied blocks of data from one file to another.
Now, run this command to detect the holes and make the file sparse in-place:
$ fallocate -d output
$ stat output
File: output
Size: 4294967296 Blocks: 0 IO Block: 4096 regular file
--omitted--
Do you notice something? The number of blocks is now 0, because the blocks that were storing only empty bytes were de-allocated. Remember, output's blocks stored nothing but empty zeros; fallocate -d detected the blocks containing only zeros and de-allocated them. Since all of this file's blocks contained zeros, all of them were de-allocated.
Also notice how the size remained the same. This is the logical (virtual) size of the file, not its size on disk. It's crucial to know that output no longer occupies physical storage space: it has 0 blocks allocated to it and thus doesn't really use disk space. The logical size is preserved after running fallocate -d, so when you later read from the file, the empty bytes are generated dynamically for you by the filesystem at runtime; they are not physically stored on disk. The size reported by stat is the logical size, while the physical size is zero: the filesystem has to generate 4 GiB of empty bytes when a process reads the whole file.
To generate a sparse file using dd:
$ dd if=/dev/zero of=output2 bs=1G seek=0 count=0
$ stat output2
File: output2
Size: 4294967296 Blocks: 0 IO Block: 4096 regular file
GNU dd internally uses lseek and ftruncate, so check truncate(2) and lseek(2).
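You can also create a sparse file directly from Python by seeking past the end of an empty file before writing. A minimal sketch (the filename is hypothetical, and the result is only sparse on filesystems that support holes):

import os

# Seek 4 GiB - 1 bytes into an empty file, then write a single byte.
with open("sparse.bin", "wb") as f:
    f.seek(4 * 1024**3 - 1)
    f.write(b"\0")

st = os.stat("sparse.bin")
print("logical size:", st.st_size)           # 4294967296
print("physical size:", st.st_blocks * 512)  # much smaller: only the last block is allocated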
A sparse file is a file that is mostly empty, i.e. it contains large blocks of bytes whose value is 0 (zero).
On the disk, the content of a file is stored in blocks of fixed size (usually 4 KiB or more). When all the bytes contained in such a block are 0, a file system that implements sparse files does not store the block on disk, instead it keeps the information somewhere in the file meta-data.
Advantages of using sparse files:
empty blocks of data do not occupy disk space; instead of being stored as regular blocks of data, only their identifiers (which take just a few bytes) are stored in the file meta-data; this way 4 KiB of disk space (or more) are saved for each empty block;
reading an empty block of data from a sparse file does not take time; this happens because no data is read from disk; since the file system knows all the bytes in the block are 0, it just sets to 0 all the bytes in the input buffer and the data is ready; there is no need to access the slow storage device;
writing an empty block of data into a sparse file does not take time; on writing, the file system detects that the block is empty (all its bytes are 0) and puts the block ID into the list of empty blocks (in the file meta-data); no data is written to the disk.
More information about sparse files can be found on the Wikipedia page.
After executing the following code to make a copy of a text file with Python, newfile.txt does not have exactly the same file size as oldfile.txt.
with open('oldfile.txt','r') as a, open('newfile.txt','w') as b:
    content = a.read()
    b.write(content)
While oldfile.txt has e.g. 667 KB, newfile.txt has 681 KB.
Does anyone have an explanation for that?
There are various causes.
You are opening the file as a text file, so the bytes of the file are decoded on the way into Python and encoded again on the way out. So there can be changes.
From the open() documentation (https://docs.python.org/3/library/functions.html#open):
When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller.
So if the original file was generated on Windows (with \r\n line endings), the \r is removed on reading. When writing back, you either no longer have the original \r (if you are on Linux or macOS), or you always get \r\n (if you are on Windows, which seems to be your case, because your file grew in size).
Encoding can also change the text. E.g. a BOM mark could be removed (or added), and potentially (but AFAIK it is not done implicitly) unneeded code points could be removed: Unicode has some extra code points that change the behaviour of nearby code points; one could have more than one of them in a row, but only the last one is effective.
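If you want a byte-for-byte copy whose size always matches, copy in binary mode, so that no newline or encoding translation happens at all. A minimal sketch:

# Binary mode passes \r\n, BOMs, and everything else through untouched.
with open('oldfile.txt', 'rb') as a, open('newfile.txt', 'wb') as b:
    b.write(a.read())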
I tried it on Linux (Ubuntu), and it works as expected: the file size of both files is exactly equal.
At this point, I guess this behavior is not related to Python; maybe it depends on your filesystem (compression) or operating system.
I was concurrently manipulating a txt file (some read/write operations) with multiple processes, and now and then I saw traces of special signs such as ^#^#^#^#^#^#^#^#^#^#^#^#^# spreading across some lines. What does this suggest, and under what circumstances do these symbols appear? Does it mean some binary content was written in by mistake where there should be text?
UPDATE
I read through the documentation. Some sources suggest it's a newline issue between the Linux and Windows platforms, while others suggest it's because of big-endian/little-endian differences in a networked environment. The fact is that I was running multiple processes on a networked filesystem, all manipulating one common txt file. So I guess the encoding format might be the major reason. Can anyone suggest how to avoid this issue? I don't want to edit the files afterwards (like manually doing text substitution); a clean way of producing the right file without any null characters is preferred.
UPDATE2
This is the Python pseudo-code that implements my project. The fcntl.lockf call locks the shared file across the multiple machines that run processes on it.
import fcntl
import os

while os.path.getsize(manipulatedfile) > 0:
    with open(manipulatedfile, 'r+') as fh:
        fcntl.lockf(fh, fcntl.LOCK_EX)   # exclusive lock across processes
        all_lines = fh.readlines()
        listing = all_lines[:50]         # take the first 50 lines
        rest_lines = all_lines[50:]      # keep the remaining lines
        fh.seek(0)
        fh.truncate()
        fh.writelines(rest_lines)        # write the remaining lines back
        fcntl.lockf(fh, fcntl.LOCK_UN)
    listing = [s.strip() for s in listing]
    do_sth(listing)
Thanks
In ASCII, ^# is a binary zero (NUL) character.
Data containing ^# between each ASCII character can result from an incorrect translation between a multi-byte Unicode encoding (e.g. 4 bytes per character) and ASCII (1 byte per character), or vice versa.
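You can reproduce this effect directly. A small sketch showing how text in a wide encoding carries NUL bytes between the ASCII characters:

# "abc" in UTF-32 little-endian pads each character with NUL bytes,
# which a viewer expecting ASCII shows as ^@ / ^# runs.
data = "abc".encode("utf-32-le")
print(data)  # b'a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00'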
To remove the ^# characters, run vi file.txt, type :%s/ followed by Ctrl+V Ctrl+# and then //g, and hit Return.
These are "file holes" and contain null characters. The null character (or NUL char) has an ASCII code of 0 and appears as ^# when viewed in vi or less.
I usually see these when I am nearly out of disk space and processes are trying to write to log files.
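If you would rather check for stray NUL bytes from Python than in an editor, here is a minimal sketch (the filename is hypothetical):

# Count NUL bytes, which vi and less display as ^@ / ^# runs.
with open("common.txt", "rb") as f:
    data = f.read()
print(data.count(b"\x00"), "NUL bytes found")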
I am trying to parse some .txt files. These files serve as containers for a variable number of 'children' files that are set off, or identified, within the container with SGML tags. With Python I can easily separate the children files. However, I am having trouble writing the binary content back out as a binary file (say a gif or jpg). In the simplest case the container might have an embedded html file followed by a graphic that is called by the html. I am assuming that my problem arises because I am reading the original .txt file using open(filename, 'r'), but that seems to be the only option for finding the SGML tags to split the file.
I would appreciate any help to identify some relevant reading material.
I appreciate the suggestions, but I am still struggling with the most basic questions. For example, when I open the file with WordPad and scroll down to the section tagged as a gif, I see this:
<FILENAME>h65803h6580301.gif
<DESCRIPTION>GRAPHIC
<TEXT>
begin 644 h65803h6580301.gif
M1TE&.#EA(P)I`=4#`("`#,#`P$!`0+^_OW]_?_#P\*"#H.##X-#0T&!#8!`0
M$+"PL"`#('!P<)"0D#`P,%!04#\_/^_O[Y^?GZ^OK]_?WX^/C\_/SV]O;U]?
I can handle finding the section easily enough, but where does the gif file begin? Does the header start with 644, with the blanks after the word begin, or with the line beginning with M1TE?
Next, when the file is read into Python, does it do anything to the binary code that has to be undone when it is written back out?
I can find the lines where the graphics begin:
import re

filerefbin = open('myfile.txt', 'rb')
wholeFile = filerefbin.read()
graphicReg = re.compile('<DESCRIPTION>GRAPHIC')
locationGraphics = graphicReg.finditer(wholeFile)
graphicsTags = []
for match in locationGraphics:
    graphicsTags.append(match.span())
I can easily use the same process to get to the word begin, or to identify the filename and get to the end of the filename in the 'first' line. I have also successfully gotten to the end of the embedded gif file. But I can't seem to write out the correct combination of things, so that when I double-click on h65803h6580301.gif after it has been isolated and saved, I actually see the graphic.
Interestingly, when I open the file in rb mode, the line endings appear to still be present, even though they don't seem to have any effect in Notepad. So that is clearly one of my problems: I might need to use readlines and join the lines together after stripping out the \n.
I love this site and I love PYTHON
This was too easy once I read bendin's post. I just had to snip out the section that began with the word begin, save it in a txt file, and then run the following:
import uu
uu.decode(r'c:\test2.txt',r'c:\test.gif')
I have to work on some other stuff for the rest of the day, but I will post more here as I look at this more closely. The first thing I need to figure out is how to use something other than a file: since I read the whole .txt file into memory and clipped out the section that has the image, I need to work with the clipped section directly instead of writing it out to test2.txt. I am sure that can be done; it's just a matter of figuring out how.
What you're looking at isn't "binary", it's uuencoded. Python's standard library includes the module uu, to handle uuencoded data.
The module uu requires the use of temporary files for encoding and decoding. You can accomplish this without resorting to temporary files by using Python's codecs module like this:
import codecs
data = "Let's just pretend that this is binary data, ok?"
uuencode = codecs.getencoder("uu")
data_uu, n = uuencode(data)
uudecode = codecs.getdecoder("uu")
decoded, m = uudecode(data_uu)
print """* The initial input:
%(data)s
* Encoding these %(n)d bytes produces:
%(data_uu)s
* When we decode these %(m)d bytes, we get the original data back:
%(decoded)s""" % globals()
You definitely need to be reading in binary mode if the content includes JPEG images.
As well, Python includes an SGML parser: http://docs.python.org/library/sgmllib.html
There is no example there, but all you need to do is set up do_ methods to handle the SGML tags you wish, as in the sketch below.
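Here is a minimal Python 2 sketch of that approach (sgmllib was removed in Python 3, and the tag handling below is an assumption about the container format):

import sgmllib

class ContainerParser(sgmllib.SGMLParser):
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.expect_filename = False
        self.filenames = []

    def do_filename(self, attrs):      # called for each <FILENAME> tag
        self.expect_filename = True

    def handle_data(self, data):       # the filename follows the tag as plain text
        if self.expect_filename:
            self.filenames.append(data.strip())
            self.expect_filename = False

parser = ContainerParser()
parser.feed('<FILENAME>h65803h6580301.gif\n<DESCRIPTION>GRAPHIC\n')
parser.close()
print parser.filenames                 # ['h65803h6580301.gif']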
You need to open(filename, 'rb') to open the file in binary mode. Be aware that on some operating systems this will give you unfamiliar two-byte (\r\n) line endings.