I have a Python script aggregating data stored in several CSV files. After aggregating the data, I export it using
df.to_json("myfile.json", orient="table")
This is done locally on my Windows 10 computer.
I run the same code and functions on the same data on AWS Lambda, reading the data from an S3 bucket. I downloaded the exported JSON file and loaded both files with pandas.
Using
myLocaljson.equals(myAwsjson)
I verified that they are indeed equal. The dtypes are also equal. But the file sizes on disk differ: one is 259 KB and the other is 267 KB.
Do you have ideas why?
This is most probably caused by the whitespace characters.
Windows has the line endings \r\n, most other OSes only use \n. So that's one extra character per line.
Also, this JSON file:
{"foo":"bar","baz":[]}
is equal to this one:
{
"foo": "bar",
"baz": [
]
}
which is actually a nice representation of this:
{\n\t"foo": "bar",\n\t"baz": [\n\t]\n}
So JSON objects can be "equal", although the string representing them can be very different.
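To make that concrete, here is a minimal sketch using only the standard json module. The two strings are the ones shown above; nothing here is specific to pandas.

```python
import json

# The compact and the pretty-printed form from the example above.
compact = '{"foo":"bar","baz":[]}'
pretty = '{\n\t"foo": "bar",\n\t"baz": [\n\t]\n}'

# Parsing discards the insignificant whitespace, so the objects compare equal...
same_content = json.loads(compact) == json.loads(pretty)

# ...while the number of characters (and hence the file size) differs.
sizes = (len(compact), len(pretty))

print(same_content, sizes)
```

The same reasoning applies to two exported files: comparing the parsed data says nothing about how many bytes each serialization takes.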
I would guess there are non-printable characters at the end of the file; open both files in a hex editor and compare them.
The other option is the way small files are stored on a hard disk: 1024 files of 1 KB each will not take up 1024 KB of disk space but significantly more, depending on the cluster size of the drive.
How can I quickly create a large file on a Linux (Red Hat Linux) system?
dd will do the job, but reading from /dev/zero and writing to the drive can take a long time when you need a file several hundreds of GBs in size for testing... If you need to do that repeatedly, the time really adds up.
I don't care about the contents of the file, I just want it to be created quickly. How can this be done?
Using a sparse file won't work for this. I need the file to be allocated disk space.
dd from the other answers is a good solution, but it is slow for this purpose. In Linux (and other POSIX systems), we have fallocate, which reserves the desired space without actually writing to it. It works with most modern disk-based file systems and is very fast.
For example:
fallocate -l 10G gentoo_root.img
This is a common question -- especially in today's environment of virtual environments. Unfortunately, the answer is not as straight-forward as one might assume.
dd is the obvious first choice, but dd is essentially a copy and that forces you to write every block of data (thus, initializing the file contents)... And that initialization is what takes up so much I/O time. (Want to make it take even longer? Use /dev/random instead of /dev/zero! Then you'll use CPU as well as I/O time!) In the end though, dd is a poor choice (though essentially the default used by the VM "create" GUIs). E.g:
dd if=/dev/zero of=./gentoo_root.img bs=4k iflag=fullblock,count_bytes count=10G
truncate is another choice -- and is likely the fastest... But that is because it creates a "sparse file". Essentially, a sparse file is a section of disk that has a lot of the same data, and the underlying filesystem "cheats" by not really storing all of the data, but just "pretending" that it's all there. Thus, when you use truncate to create a 20 GB drive for your VM, the filesystem doesn't actually allocate 20 GB, but it cheats and says that there are 20 GB of zeros there, even though as little as one track on the disk may actually (really) be in use. E.g.:
truncate -s 10G gentoo_root.img
fallocate is the final -- and best -- choice for use with VM disk allocation, because it essentially "reserves" (or "allocates") all of the space you're seeking, but it doesn't bother to write anything. So, when you use fallocate to create a 20 GB virtual drive space, you really do get a 20 GB file (not a "sparse file"), and you won't have bothered to write anything to it -- which means virtually anything could be in there -- kind of like a brand new disk! E.g.:
fallocate -l 10G gentoo_root.img
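If you need the same thing from a script, a rough Python counterpart is os.posix_fallocate. This is only a sketch: it assumes a Unix system where Python exposes that call (it is not available everywhere, for example not on Windows), and the helper name preallocate is made up for illustration.

```python
import os


def preallocate(path, size):
    """Create `path` and reserve `size` bytes for it without writing data from Python."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Ask the filesystem to allocate the byte range [0, size).
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)
    st = os.stat(path)
    # st_blocks is reported in 512-byte units on Linux.
    return st.st_size, st.st_blocks * 512
```

The function returns the logical size and the number of bytes the filesystem reports as allocated, so you can check that the file is not sparse.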
Linux & all filesystems
xfs_mkfile 10240m 10Gigfile
Linux & some filesystems (ext4, xfs, btrfs and ocfs2)
fallocate -l 10G 10Gigfile
OS X, Solaris, SunOS and probably other UNIXes
mkfile 10240m 10Gigfile
HP-UX
prealloc 10Gigfile 10737418240
Explanation
Try mkfile <size> myfile as an alternative to dd. With the -n option the size is noted, but disk blocks aren't allocated until data is written to them. Without the -n option, the space is zero-filled, which means writing to the disk, which means taking time.
mkfile is derived from SunOS and is not available everywhere. Most Linux systems have xfs_mkfile which works exactly the same way, and not just on XFS file systems despite the name. It's included in xfsprogs (for Debian/Ubuntu) or similar named packages.
Most Linux systems also have fallocate, which only works on certain file systems (such as btrfs, ext4, ocfs2, and xfs), but is the fastest, as it allocates all the file space (creates non-holey files) but does not initialize any of it.
truncate -s 10M output.file
will create a 10 M file instantaneously (M stands for 1024*1024 bytes, MB stands for 1000*1000 - same with K, KB, G, GB...)
EDIT: as many have pointed out, this will not physically allocate the file on your device. With this you could actually create an arbitrarily large file, regardless of the available space on the device, as it creates a "sparse" file.
For example, notice that no HDD space is consumed by this command:
### BEFORE
$ df -h | grep lvm
/dev/mapper/lvm--raid0-lvm0
7.2T 6.6T 232G 97% /export/lvm-raid0
$ truncate -s 500M 500MB.file
### AFTER
$ df -h | grep lvm
/dev/mapper/lvm--raid0-lvm0
7.2T 6.6T 232G 97% /export/lvm-raid0
So, when doing this, you will be deferring physical allocation until the file is accessed. If you're mapping this file to memory, you may not have the expected performance.
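The same effect can be observed from Python. The sketch below is an illustration only: it assumes a Unix filesystem that supports sparse files, and make_sparse is a hypothetical helper name.

```python
import os


def make_sparse(path, size):
    """Extend a new file to `size` bytes without writing any data blocks."""
    with open(path, "wb") as f:
        f.truncate(size)
    st = os.stat(path)
    # Logical size versus bytes actually allocated (st_blocks is in 512-byte units on Linux).
    return st.st_size, st.st_blocks * 512
```

On a filesystem with sparse-file support, the first value is the requested size while the second stays close to zero, which matches the unchanged df output above.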
But this is still a useful command to know. For example, when benchmarking transfers using files, the specified size of the file will still get moved.
$ rsync -aHAxvP --numeric-ids --delete --info=progress2 \
root@mulder.bub.lan:/export/lvm-raid0/500MB.file \
/export/raid1/
receiving incremental file list
500MB.file
524,288,000 100% 41.40MB/s 0:00:12 (xfr#1, to-chk=0/1)
sent 30 bytes received 524,352,082 bytes 38,840,897.19 bytes/sec
total size is 524,288,000 speedup is 1.00
Where seek is the size of the file you want in bytes - 1.
dd if=/dev/zero of=filename bs=1 count=1 seek=1048575
Examples where seek is the size of the file you want in bytes
#kilobytes
dd if=/dev/zero of=filename bs=1 count=0 seek=200K
#megabytes
dd if=/dev/zero of=filename bs=1 count=0 seek=200M
#gigabytes
dd if=/dev/zero of=filename bs=1 count=0 seek=200G
#terabytes
dd if=/dev/zero of=filename bs=1 count=0 seek=200T
From the dd manpage:
BLOCKS and BYTES may be followed by the following multiplicative suffixes: c=1, w=2, b=512, kB=1000, K=1024, MB=1000*1000, M=1024*1024, GB=1000*1000*1000, G=1024*1024*1024, and so on for T, P, E, Z, Y.
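To check what a given suffix works out to, here is a small sketch that applies the multipliers quoted above. It is not part of dd; parse_dd_size is a hypothetical helper, and it only covers the suffixes listed explicitly (up to G and GB).

```python
# Multipliers as quoted from the dd manpage above.
DD_SUFFIXES = {
    "c": 1, "w": 2, "b": 512,
    "kB": 1000, "K": 1024,
    "MB": 1000 * 1000, "M": 1024 * 1024,
    "GB": 1000 * 1000 * 1000, "G": 1024 * 1024 * 1024,
}


def parse_dd_size(text):
    """Convert a string such as '200M' or '200MB' into a number of bytes."""
    # Try the two-letter suffixes first so 'MB' is not mistaken for 'M'.
    for suffix in sorted(DD_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * DD_SUFFIXES[suffix]
    return int(text)
```

For instance, parse_dd_size("200M") and parse_dd_size("200MB") give different byte counts, which is exactly the K/KB, M/MB, G/GB distinction.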
To make a 1 GB file:
dd if=/dev/zero of=filename bs=1G count=1
I don't know a whole lot about Linux, but here's the C Code I wrote to fake huge files on DC Share many years ago.
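#include <stdio.h>
#include <stdlib.h>
int main(void) {
    int i;
    FILE *fp;
    fp = fopen("bigfakefile.txt", "w");
    if (fp == NULL) {
        return EXIT_FAILURE;
    }
    /* Skip 1 MiB, then write a single byte, so most of the file is a hole. */
    for (i = 0; i < (1024 * 1024); i++) {
        fseek(fp, (1024 * 1024), SEEK_CUR);
        fprintf(fp, "C");
    }
    fclose(fp);
    return EXIT_SUCCESS;
}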
You can also use the yes command. The syntax is fairly simple:
# yes >> myfile
Press Ctrl+C to stop it, or else it will eat up all the available space.
To empty the file afterwards, run:
# > myfile
I don't think you're going to get much faster than dd. The bottleneck is the disk; writing hundreds of GB of data to it is going to take a long time no matter how you do it.
But here's a possibility that might work for your application. If you don't care about the contents of the file, how about creating a "virtual" file whose contents are the dynamic output of a program? Instead of open()ing the file, use popen() to open a pipe to an external program. The external program generates data whenever it's needed. Once the pipe is open, it can be read like a regular file stream, with one caveat: a pipe is sequential, so fseek() and rewind() will not work on it. You'll need to use pclose() instead of fclose() when you're done with the pipe.
If your application needs the file to be a certain size, it will be up to the external program to keep track of where in the "file" it is and send an eof when the "end" has been reached.
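Here is a rough Python sketch of that idea, using subprocess in place of popen(). Everything in it is an assumption for illustration: the generator is a tiny inline Python program that emits a fixed number of 'x' bytes and then exits, which the reader sees as end of file.

```python
import subprocess
import sys

# Hypothetical generator program: writes exactly N bytes to stdout, then exits.
GENERATOR = """
import sys
remaining = int(sys.argv[1])
chunk = b"x" * 65536
while remaining > 0:
    n = min(remaining, len(chunk))
    sys.stdout.buffer.write(chunk[:n])
    remaining -= n
"""


def read_virtual_file(size):
    """Read a 'file' of `size` bytes that exists only as the output of a child process."""
    proc = subprocess.Popen(
        [sys.executable, "-c", GENERATOR, str(size)],
        stdout=subprocess.PIPE,
    )
    total = 0
    with proc.stdout as pipe:
        while True:
            data = pipe.read(65536)
            if not data:  # the child closed its end: this is the "eof"
                break
            total += len(data)
    proc.wait()
    return total
```

Nothing touches the disk here, so the speed is limited by the generator rather than by the drive. The reader can only consume the data sequentially.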
One approach: if you can guarantee unrelated applications won't use the files in a conflicting manner, just create a pool of files of varying sizes in a specific directory, then create links to them when needed.
For example, have a pool of files called:
/home/bigfiles/512M-A
/home/bigfiles/512M-B
/home/bigfiles/1024M-A
/home/bigfiles/1024M-B
Then, if you have an application that needs a 1G file called /home/oracle/logfile, execute a "ln /home/bigfiles/1024M-A /home/oracle/logfile".
If it's on a separate filesystem, you will have to use a symbolic link.
The A/B/etc files can be used to ensure there's no conflicting use between unrelated applications.
The link operation is about as fast as you can get.
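A minimal sketch of the linking step, assuming a Unix system. The function name link_from_pool is hypothetical; the fallback to a symbolic link covers the separate-filesystem case mentioned above.

```python
import errno
import os


def link_from_pool(pool_file, target):
    """Make `target` refer to `pool_file`: hard link if possible, symlink otherwise."""
    try:
        os.link(pool_file, target)
    except OSError as exc:
        # EXDEV means the two paths are on different filesystems.
        if exc.errno != errno.EXDEV:
            raise
        os.symlink(os.path.abspath(pool_file), target)
```

With the pool above this would be called as link_from_pool("/home/bigfiles/1024M-A", "/home/oracle/logfile").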
The GPL mkfile is just a (ba)sh script wrapper around dd; BSD's mkfile just memsets a buffer with non-zero and writes it repeatedly. I would not expect the former to out-perform dd. The latter might edge out dd if=/dev/zero slightly since it omits the reads, but anything that does significantly better is probably just creating a sparse file.
Absent a system call that actually allocates space for a file without writing data (and Linux and BSD lack this, probably Solaris as well) you might get a small improvement in performance by using ftruncate(2)/truncate(1) to extend the file to the desired size, mmap the file into memory, then write non-zero data to the first bytes of every disk block (use fgetconf to find the disk block size).
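The extend, mmap and touch-every-block approach could look roughly like this in Python. It is a sketch under assumptions: a Unix system, a size greater than zero, and os.fstatvfs(...).f_bsize used as the block size (my substitution for the lookup mentioned above). allocate_by_touching is a made-up name.

```python
import mmap
import os


def allocate_by_touching(path, size):
    """Extend `path` to `size` bytes, then dirty one byte per block so space gets allocated."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.ftruncate(fd, size)              # extend the file (sparse at this point)
        block = os.fstatvfs(fd).f_bsize     # assumed to be the allocation block size
        with mmap.mmap(fd, size) as mapped:
            for offset in range(0, size, block):
                mapped[offset] = 1          # a non-zero byte at the start of each block
            mapped.flush()
    finally:
        os.close(fd)
    return block
```

Only one byte per block is written from the program, but the filesystem still has to allocate every block, so this is a modest saving at best.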
This is the fastest I could do (which is not fast) with the following constraints:
The goal of the large file is to fill a disk, so can't be compressible.
Using ext3 filesystem. (fallocate not available)
This is the gist of it...
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
    int32_t buf[256]; // Block size (1 KiB).
    for (int i = 0; i < 256; ++i)
    {
        buf[i] = rand(); // random to be non-compressible.
    }
    FILE* file = fopen("/file/on/your/system", "wb");
    if (file == NULL)
        return 1;
    int blocksToWrite = 1024 * 1024; // 1 GB
    for (int i = 0; i < blocksToWrite; ++i)
    {
        fwrite(buf, sizeof(int32_t), 256, file);
    }
    fclose(file);
    return 0;
}
In our case this is for an embedded linux system and this works well enough, but would prefer something faster.
FYI the command dd if=/dev/urandom of=outputfile bs=1024 count=XX was so slow as to be unusable.
Shameless plug: OTFFS provides a file system providing arbitrarily large (well, almost. Exabytes is the current limit) files of generated content. It is Linux-only, plain C, and in early alpha.
See https://github.com/s5k6/otffs.
So I wanted to create a large file with repeated ascii strings. "Why?" you may ask. Because I need to use it for some NFS troubleshooting I'm doing. I need the file to be compressible because I'm sharing a tcpdump of a file copy with the vendor of our NAS. I had originally created a 1g file filled with random data from /dev/urandom, but of course since it's random, it means it won't compress at all and I need to send the full 1g of data to the vendor, which is difficult.
So I created a file with all the printable ascii characters, repeated over and over, to a limit of 1g in size. I was worried it would take a long time. It actually went amazingly quickly, IMHO:
cd /dev/shm
date
time yes $(for ((i=32;i<127;i++)) do printf "\\$(printf %03o "$i")"; done) | head -c 1073741824 > ascii1gfile.txt
date
Wed Apr 20 12:30:13 CDT 2022
real 0m0.773s
user 0m0.060s
sys 0m1.195s
Wed Apr 20 12:30:14 CDT 2022
Copying it from an nfs partition to /dev/shm took just as long as with the random file (which one would expect, I know, but I wanted to be sure):
cp ascii1gfile.txt /home/greygnome/
uptime; free -m; sync; echo 1 > /proc/sys/vm/drop_caches; free -m; date; dd if=/home/greygnome/ascii1gfile.txt of=/dev/shm/outfile bs=16384 2>&1; date; rm -f /dev/shm/outfile
But while doing that I ran a simultaneous tcpdump:
tcpdump -i em1 -w /dev/shm/dump.pcap
I was able to compress the pcap file down to 12M in size! Awesomesauce!
Edit: Before you ding me because the OP said, "I don't care about the contents," know that I posted this answer because it's one of the first replies to "how to create a large file linux" in a Google search. And sometimes, disregarding the contents of a file can have unforeseen side effects.
Edit 2: And fallocate seems to be unavailable on a number of filesystems, and creating a 1GB compressible file in 1.2s seems pretty decent to me (aka, "quickly").
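For anyone who wants the same kind of file without the shell one-liner, here is a rough Python sketch. It is not byte-for-byte identical to the yes output above; it simply repeats the printable ASCII characters (codes 32 to 126) followed by a newline until the requested size is reached. write_ascii_pattern is a hypothetical name.

```python
def write_ascii_pattern(path, size, chunk_lines=4096):
    """Fill `path` with `size` bytes of repeated printable ASCII (highly compressible)."""
    line = bytes(range(32, 127)) + b"\n"   # 95 printable characters plus a newline
    chunk = line * chunk_lines
    written = 0
    with open(path, "wb") as f:
        while written < size:
            piece = chunk[: size - written]  # the last piece may be shorter
            f.write(piece)
            written += len(piece)
    return written
```

Writing to /dev/shm, as in the commands above, keeps the disk out of the picture entirely.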
You could use https://github.com/flew-software/trash-dump
It can create a file of any size filled with random data.
Here's a command you can run after installing trash-dump (it creates a 1 GB file):
$ trash-dump --filename="huge" --seed=1232 --noBytes=1000000000
BTW, I created it.
How do I truncate a csv log file that is being used as std out pipe destination from another process without generating a _csv.Error: line contains NULL byte error?
I have one process running rtlamr > log/readings.txt that is piping radio signal data to readings.txt. I don't think it matters what is piping to the file--any long-running pipe process will do.
I have a file watcher using watchdog (Python file watcher) on that file, which triggers a function when the file is changed. The function reads the file and updates a database.
Then I try to truncate readings.txt so that it doesn't grow infinitely (or back it up).
file = open(dir_path+'/log/readings.txt', "w")
file.truncate()
file.close()
This corrupts readings.txt and generates the error (the start of the file contains garbage characters).
I tried moving the file instead of truncating it, in the hopes that rtlamr will recreate a fresh file, but that only has the effect of stopping the pipe.
EDIT
I noticed that the charset changes from us-ascii to binary but attempting to truncate the file with file = open(dir_path+'/log/readings.log', "w",encoding="us-ascii") does not do anything.
If you truncate a file[1] while another process has it open in w mode, that process will continue to write at the same offsets, making the file sparse. The low offsets will thus be read as zero bytes.
As per x11 - Concurrent writing to a log file from many processes - Unix & Linux Stack Exchange and Can two Unix processes simultaneous write to different positions in a single file?, each process that has a file open has its own offset in it, and a ftruncate() doesn't change that.
If you want the other process to react to truncation, it needs to have the file open in "a" (append) mode; with a shell redirection that means >> instead of >.
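The difference between the two modes can be reproduced in a few lines. This sketch plays both roles in one process, which is an assumption made purely to keep it self-contained: the writer handle stands in for the long-running producer, and os.truncate stands in for the consumer that empties the log. It assumes a Unix system.

```python
import os
import tempfile


def write_truncate_write(mode):
    """Write, truncate the file behind the writer's back, write again; return the contents."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # mode is "w" or "a"; unbuffered binary so each write hits the file immediately.
        with open(path, mode + "b", buffering=0) as writer:
            writer.write(b"first line\n")
            os.truncate(path, 0)        # what the log-trimming code does
            writer.write(b"second\n")
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)
```

In "w" mode the second write lands at the writer's old offset, so the result starts with a run of NUL bytes (the garbage seen at the start of readings.txt). In "a" mode every write goes to the current end of the file, so the result is just the second line.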
Your approach has fundamental bugs, too. E.g. it's not atomic: you may (and eventually will) truncate the file after the producer has added data but before you have read it, so that data would be lost.
Consider using dedicated data buffering utilities instead like buffer or pv as per Add a big buffer to a pipe between two commands.
[1] Which is superfluous because open(mode='w') already does that. Either truncate or reopen, no need to do both.
I'm trying to get the size of a file just downloaded using urllib in Python 3. I tried to get the file size with len(response.read()) and the value I got was 653856. However, after doing this I found that the size of the saved file was 0 (did response.read() consume the data, maybe?), so I discarded this option and switched to checking the size with os:
if strFecha in dicc or fecha.year < 2010:
    # Download the file from `url` and save it locally under `nomFichero`:
    with urllib.request.urlopen(url) as response, open(nomFichero, 'wb') as outFile:
        # Following one returns 653856
        size1 = len(response.read())
        shutil.copyfileobj(response, outFile)
        # Following two lines return 524288
        size2 = os.path.getsize(nomFichero)
        size2 = os.stat(nomFichero).st_size
I got the same result for both of them (524288), but after doing an 'ls -l' on the command line I got 653856, and I don't know why this value does not match the one I got from Python.
Can anyone tell me which value is the good one and how to get it with python3?
I would suspect this has to do with block sizing and empty blocks.
If you have, say, a 2,324,020-byte file, there will be two values: size on disk (how much room is being used to store it, given the file system's minimum allocation unit) and actual file size, or logical file size (2,324,020 bytes). For instance, on Windows 7, I have a 2,324,020-byte file; its size on disk is 2,326,528 bytes.
Python's OS commands
os.path.getsize #Returns real size of file in bytes
and
os.stat #Returns real size of file in bytes
Will return the exact size of the file in bytes.
Something like this:
lSize = os.stat(filename).st_size
bSize = os.statvfs(filename).f_bsize
sizeOnDisk = ((lSize + bSize - 1) // bSize) * bSize  # round up to a whole number of blocks
will return the size on disk but is not supported on Windows at this time.
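As a worked example of that rounding, the sketch below assumes a 4096-byte block, which is my assumption but happens to reproduce the Windows 7 figures quoted earlier. Both function names are made up; the second one reads the allocation the filesystem reports and is Unix-only.

```python
import os


def size_on_disk_estimate(logical_size, block_size):
    """Round a logical size up to a whole number of blocks."""
    return ((logical_size + block_size - 1) // block_size) * block_size


def allocated_bytes(path):
    """Bytes the filesystem reports as allocated (st_blocks is in 512-byte units on Linux)."""
    return os.stat(path).st_blocks * 512


print(size_on_disk_estimate(2324020, 4096))  # 2326528
```

The estimate only accounts for rounding up to the block size; sparse files or filesystem compression can make the real allocation smaller, which is why allocated_bytes asks the filesystem directly.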
EDIT
This post says ls does NOT show disk usage size but rather logical size; however, it notes that "empty blocks" may alter the file size.
The output of the ls command displays only the size of a file and does
not include indirect blocks used by the file. Any empty blocks of the
file also get included in the output of the command.
It is worth noting if you are looking for Logical File Size, Python stat would be ideal.
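One plausible reading of the 524288 figure, which this thread does not confirm, is that the size was checked while the output file was still open and partly buffered. A minimal sketch that avoids both that and the double read: copy the response exactly once, let the with block close the file, and only then ask for the size. save_stream is a hypothetical helper; it accepts any binary file-like object, such as the response returned by urllib.request.urlopen.

```python
import os
import shutil


def save_stream(response, path):
    """Copy a binary stream to `path` and return the size of the finished file."""
    with open(path, "wb") as out_file:
        # Read the stream only once; a second read() would return nothing.
        shutil.copyfileobj(response, out_file)
    # The file is closed here, so all buffered data has been written out.
    return os.path.getsize(path)
```

With urllib this would be used as: with urllib.request.urlopen(url) as response: size = save_stream(response, nomFichero).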
This is tested on Linux:
Passing the buffering argument to open doesn't really alter the buffer size:
buffering is an optional integer used to set the buffering policy. Pass 0 to switch buffering off (only allowed in binary mode), 1 to select line buffering (only usable in text mode), and an integer > 1 to indicate the size in bytes of a fixed-size chunk buffer.
Here's what I tried, for text files:
>>> f = open("/home/user/Desktop/data.txt", "w+", buffering=30)
The anatomy of the f object returned by open:
>>> f
<_io.TextIOWrapper name='/home/user/Desktop/data.txt' mode='w+' encoding='UTF-8'>
>>> f.buffer
<_io.BufferedRandom name='/home/user/Desktop/data.txt'>
>>> f.buffer.raw
<_io.FileIO name='/home/user/Desktop/data.txt' mode='rb+' closefd=True>
_io.TextIOWrapper --> handles encoding, decoding, universal newline translations...
_io.BufferedRandom --> read and write buffer.
_io.FileIO --> This is the file on disk.
If I write to f.buffer.raw, everything goes straight to the file on disk. This bypasses f.buffer, and the operating system flushes everything straight into the disk file; there's no need to call os.fsync with the file descriptor to flush the data from the OS's buffer to disk.
The strange thing:
>>> f.write("xyz" * 80)
240
I've written far more than twice the size of the buffer, which is specified to be 30 bytes. I expected the buffer to be flushed. I assumed Python had flushed its buffer but the OS's buffer was still holding the data, so I called os.fsync, but nothing was flushed and the file remained empty!
The buffer is not flushed and nothing is written to disk. But when I write to the buffer itself and the amount of written data exceeds the buffer size, everything gets flushed and the data is written to the file on disk:
>>> f.buffer.write(b"x" * 30)
30
>>> f.buffer.write(b"x") # now buffer empties itself
1
This is strange because type(f) should just encode, add some bells and whistles, and then send the bytes to f.buffer; when f.buffer exceeds 30 bytes, as specified to open, the buffer should empty itself by flushing. It doesn't seem to work like that. The problem is that f relies on its f._CHUNK_SIZE: only when the written data exceeds _CHUNK_SIZE does everything get written to disk.
So does f act as a buffer of str strings, with _CHUNK_SIZE as its size, which it flushes to f.buffer?
And f.buffer is the bytes buffer, whose size is specified by the buffering argument of open, and which flushes its data to f.buffer.raw?
My initial thinking was that type(f) handles certain tasks like encoding and decoding and then hands its data to the buffer; it's really a wrapper around the buffer. Is it also a buffer of str strings?
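For reference, the observation can be reproduced with a short script. This is only a sketch of the experiment, not an explanation of the internals: it uses a temporary file instead of the Desktop path, and what it shows before the flush is a CPython implementation detail that may differ on other interpreters or versions.

```python
import os
import tempfile


def observe_text_buffering():
    """Write 240 characters through a text file opened with buffering=30."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w+", buffering=30) as f:
            f.write("xyz" * 80)
            before = os.path.getsize(path)   # size on disk before an explicit flush
            f.flush()                        # pushes pending text down through f.buffer
            after = os.path.getsize(path)    # size on disk after the flush
        return before, after
    finally:
        os.remove(path)
```

On the setup described above, before stays at 0 even though 240 characters is far more than the 30-byte buffer, and after is 240 once flush() has been called.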