How to blow up disk space fast [duplicate] - python

How can I quickly create a large file on a Linux (Red Hat Linux) system?
dd will do the job, but reading from /dev/zero and writing to the drive can take a long time when you need a file several hundreds of GBs in size for testing... If you need to do that repeatedly, the time really adds up.
I don't care about the contents of the file, I just want it to be created quickly. How can this be done?
Using a sparse file won't work for this. I need the file to be allocated disk space.

dd from the other answers is a good solution, but it is slow for this purpose. In Linux (and other POSIX systems), we have fallocate, which reserves the desired space without actually writing to it. It works with most modern disk-based file systems and is very fast.
For example:
fallocate -l 10G gentoo_root.img

This is a common question -- especially in today's environment of virtual environments. Unfortunately, the answer is not as straight-forward as one might assume.
dd is the obvious first choice, but dd is essentially a copy and that forces you to write every block of data (thus, initializing the file contents)... And that initialization is what takes up so much I/O time. (Want to make it take even longer? Use /dev/random instead of /dev/zero! Then you'll use CPU as well as I/O time!) In the end though, dd is a poor choice (though essentially the default used by the VM "create" GUIs). E.g:
dd if=/dev/zero of=./gentoo_root.img bs=4k iflag=fullblock,count_bytes count=10G
truncate is another choice -- and is likely the fastest... But that is because it creates a "sparse file". Essentially, a sparse file is a section of disk that has a lot of the same data, and the underlying filesystem "cheats" by not really storing all of the data, but just "pretending" that it's all there. Thus, when you use truncate to create a 20 GB drive for your VM, the filesystem doesn't actually allocate 20 GB, but it cheats and says that there are 20 GB of zeros there, even though as little as one track on the disk may actually (really) be in use. E.g.:
truncate -s 10G gentoo_root.img
fallocate is the final -- and best -- choice for use with VM disk allocation, because it essentially "reserves" (or "allocates") all of the space you're seeking, but it doesn't bother to write anything. So, when you use fallocate to create a 20 GB virtual drive space, you really do get a 20 GB file (not a "sparse file"), and you won't have bothered to write anything to it -- which means virtually anything could be in there -- kind of like a brand new disk! E.g.:
fallocate -l 10G gentoo_root.img
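If you need the same kind of preallocation from Python rather than from the shell, here is a minimal sketch. It assumes a Unix system where os.posix_fallocate is available; the function name preallocate and the example path are made up for illustration. On filesystems without native preallocation the C library may fall back to writing zeros, so it is not guaranteed to be fast everywhere.

```python
import os


def preallocate(path, size):
    """Create path (if needed) and ask the OS to reserve size bytes for it."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        # Reserve the byte range [0, size); the file length becomes at least size.
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical file name, 10 MiB to keep the demo small.
    preallocate("gentoo_root.img", 10 * 1024 * 1024)
    print(os.stat("gentoo_root.img").st_size)
```

Comparing st_size with st_blocks * 512 from os.stat afterwards is one way to check whether the space was really allocated on your filesystem.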

Linux & all filesystems
xfs_mkfile 10240m 10Gigfile
Linux & some filesystems (ext4, xfs, btrfs and ocfs2)
fallocate -l 10G 10Gigfile
OS X, Solaris, SunOS and probably other UNIXes
mkfile 10240m 10Gigfile
HP-UX
prealloc 10Gigfile 10737418240
Explanation
Try mkfile <size> myfile as an alternative to dd. With the -n option the size is noted, but disk blocks aren't allocated until data is written to them. Without the -n option, the space is zero-filled, which means writing to the disk, which means taking time.
mkfile is derived from SunOS and is not available everywhere. Most Linux systems have xfs_mkfile, which works exactly the same way, and not just on XFS file systems despite the name. It's included in xfsprogs (for Debian/Ubuntu) or similarly named packages.
Most Linux systems also have fallocate, which only works on certain file systems (such as btrfs, ext4, ocfs2, and xfs), but is the fastest, as it allocates all the file space (creates non-holey files) but does not initialize any of it.

truncate -s 10M output.file
will create a 10 M file instantaneously (M stands for 1024*1024 bytes, MB stands for 1000*1000 - same with K, KB, G, GB...)
EDIT: as many have pointed out, this will not physically allocate the file on your device. With this you could actually create an arbitrarily large file, regardless of the available space on the device, as it creates a "sparse" file.
For example, notice that no HDD space is consumed by this command:
### BEFORE
$ df -h | grep lvm
/dev/mapper/lvm--raid0-lvm0
7.2T 6.6T 232G 97% /export/lvm-raid0
$ truncate -s 500M 500MB.file
### AFTER
$ df -h | grep lvm
/dev/mapper/lvm--raid0-lvm0
7.2T 6.6T 232G 97% /export/lvm-raid0
So, when doing this, you will be deferring physical allocation until the file is accessed. If you're mapping this file to memory, you may not have the expected performance.
But this is still a useful command to know. For example, when benchmarking transfers using files, the specified size of the file will still get moved.
$ rsync -aHAxvP --numeric-ids --delete --info=progress2 \
root@mulder.bub.lan:/export/lvm-raid0/500MB.file \
/export/raid1/
receiving incremental file list
500MB.file
524,288,000 100% 41.40MB/s 0:00:12 (xfr#1, to-chk=0/1)
sent 30 bytes received 524,352,082 bytes 38,840,897.19 bytes/sec
total size is 524,288,000 speedup is 1.00

Where seek is the size of the file you want in bytes - 1.
dd if=/dev/zero of=filename bs=1 count=1 seek=1048575

Examples where seek is the size of the file you want in bytes
#kilobytes
dd if=/dev/zero of=filename bs=1 count=0 seek=200K
#megabytes
dd if=/dev/zero of=filename bs=1 count=0 seek=200M
#gigabytes
dd if=/dev/zero of=filename bs=1 count=0 seek=200G
#terabytes
dd if=/dev/zero of=filename bs=1 count=0 seek=200T
From the dd manpage:
BLOCKS and BYTES may be followed by the following multiplicative suffixes: c=1, w=2, b=512, kB=1000, K=1024, MB=1000*1000, M=1024*1024, GB =1000*1000*1000, G=1024*1024*1024, and so on for T, P, E, Z, Y.

To make a 1 GB file:
dd if=/dev/zero of=filename bs=1G count=1

I don't know a whole lot about Linux, but here's the C Code I wrote to fake huge files on DC Share many years ago.
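#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int i;
    FILE *fp = fopen("bigfakefile.txt", "w");
    if (fp == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }
    /* Skip 1 MiB forward, write a single byte, repeat; the skipped ranges become holes. */
    for (i = 0; i < (1024 * 1024); i++) {
        fseek(fp, (1024 * 1024), SEEK_CUR);
        fprintf(fp, "C");
    }
    fclose(fp);
    return 0;
}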

You can also use the "yes" command. The syntax is fairly simple:
#yes >> myfile
Press "Ctrl + C" to stop this, else it will eat up all your space available.
To empty the file afterwards, run:
#>myfile

I don't think you're going to get much faster than dd. The bottleneck is the disk; writing hundreds of GB of data to it is going to take a long time no matter how you do it.
But here's a possibility that might work for your application. If you don't care about the contents of the file, how about creating a "virtual" file whose contents are the dynamic output of a program? Instead of open()ing the file, use popen() to open a pipe to an external program. The external program generates data whenever it's needed. Once the pipe is open, the program that opened it can read from it like a regular file stream, with the caveat that a pipe is not seekable, so fseek() and rewind() will not work on it. You'll need to use pclose() instead of fclose() when you're done with the pipe.
If your application needs the file to be a certain size, it will be up to the external program to keep track of where in the "file" it is and send an eof when the "end" has been reached.

One approach: if you can guarantee unrelated applications won't use the files in a conflicting manner, just create a pool of files of varying sizes in a specific directory, then create links to them when needed.
For example, have a pool of files called:
/home/bigfiles/512M-A
/home/bigfiles/512M-B
/home/bigfiles/1024M-A
/home/bigfiles/1024M-B
Then, if you have an application that needs a 1G file called /home/oracle/logfile, execute a "ln /home/bigfiles/1024M-A /home/oracle/logfile".
If it's on a separate filesystem, you will have to use a symbolic link.
The A/B/etc files can be used to ensure there's no conflicting use between unrelated applications.
The link operation is about as fast as you can get.
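A rough Python sketch of the linking step, in case you want to script it. The function name claim_pool_file is hypothetical; the paths in the usage line are the ones from the example above. It tries a hard link first and falls back to a symbolic link only when the target is on a different filesystem.

```python
import errno
import os


def claim_pool_file(pool_path, target_path):
    """Link target_path to a pre-created pool file; return 'hard' or 'symlink'."""
    try:
        os.link(pool_path, target_path)  # same filesystem: hard link
        return "hard"
    except OSError as exc:
        if exc.errno != errno.EXDEV:     # anything but "cross-device" is a real error
            raise
        os.symlink(os.path.abspath(pool_path), target_path)
        return "symlink"


if __name__ == "__main__":
    print(claim_pool_file("/home/bigfiles/1024M-A", "/home/oracle/logfile"))
```

Either way no file data is copied, which is why this is about as fast as it gets.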

The GPL mkfile is just a (ba)sh script wrapper around dd; BSD's mkfile just memsets a buffer with non-zero and writes it repeatedly. I would not expect the former to out-perform dd. The latter might edge out dd if=/dev/zero slightly since it omits the reads, but anything that does significantly better is probably just creating a sparse file.
Absent a system call that actually allocates space for a file without writing data (and Linux and BSD lack this, probably Solaris as well) you might get a small improvement in performance by using ftruncate(2)/truncate(1) to extend the file to the desired size, mmap the file into memory, then write non-zero data to the first bytes of every disk block (use fgetconf to find the disk block size).
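For what it's worth, the extend, mmap, and touch-every-block idea can be sketched in Python like this. It is only an illustration under assumptions: the block size is taken from os.fstatvfs().f_bsize, which is my guess at what "disk block size" should mean here, the function name is made up, and I have not measured whether it beats dd.

```python
import mmap
import os


def fill_by_touching_blocks(path, size):
    """Extend path to size bytes, then dirty one byte in every filesystem block."""
    with open(path, "w+b") as f:
        f.truncate(size)  # extend the file; at this point it is still sparse
        block = os.fstatvfs(f.fileno()).f_bsize
        with mmap.mmap(f.fileno(), size) as m:
            for offset in range(0, size, block):
                m[offset] = 0x43  # one non-zero byte ('C') at the start of each block
    return block


if __name__ == "__main__":
    print(fill_by_touching_blocks("touched.bin", 16 * 1024 * 1024))
```

The size must be greater than zero, since an empty file cannot be mapped.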

This is the fastest I could do (which is not fast) with the following constraints:
The goal of the large file is to fill a disk, so can't be compressible.
Using ext3 filesystem. (fallocate not available)
This is the gist of it...
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int32_t buf[256]; // Block size: 256 * 4 bytes = 1 KiB.
    for (int i = 0; i < 256; ++i)
    {
        buf[i] = rand(); // random to be non-compressible.
    }
    FILE* file = fopen("/file/on/your/system", "wb");
    if (file == NULL)
    {
        return EXIT_FAILURE;
    }
    int blocksToWrite = 1024 * 1024; // 1 Mi blocks of 1 KiB = 1 GB
    for (int i = 0; i < blocksToWrite; ++i)
    {
        fwrite(buf, sizeof(int32_t), 256, file);
    }
    fclose(file);
    return 0;
}
In our case this is for an embedded Linux system and this works well enough, but I would prefer something faster.
FYI the command dd if=/dev/urandom of=outputfile bs=1024 count=XX was so slow as to be unusable.

Shameless plug: OTFFS is a file system that provides arbitrarily large files of generated content (well, almost arbitrarily: exabytes is the current limit). It is Linux-only, plain C, and in early alpha.
See https://github.com/s5k6/otffs.

So I wanted to create a large file with repeated ascii strings. "Why?" you may ask. Because I need to use it for some NFS troubleshooting I'm doing. I need the file to be compressible because I'm sharing a tcpdump of a file copy with the vendor of our NAS. I had originally created a 1g file filled with random data from /dev/urandom, but of course since it's random, it means it won't compress at all and I need to send the full 1g of data to the vendor, which is difficult.
So I created a file with all the printable ascii characters, repeated over and over, to a limit of 1g in size. I was worried it would take a long time. It actually went amazingly quickly, IMHO:
cd /dev/shm
date
time yes $(for ((i=32;i<127;i++)) do printf "\\$(printf %03o "$i")"; done) | head -c 1073741824 > ascii1gfile.txt
date
Wed Apr 20 12:30:13 CDT 2022
real 0m0.773s
user 0m0.060s
sys 0m1.195s
Wed Apr 20 12:30:14 CDT 2022
Copying it from an nfs partition to /dev/shm took just as long as with the random file (which one would expect, I know, but I wanted to be sure):
cp ascii1gfile.txt /home/greygnome/
uptime; free -m; sync; echo 1 > /proc/sys/vm/drop_caches; free -m; date; dd if=/home/greygnome/ascii1gfile.txt of=/dev/shm/outfile bs=16384 2>&1; date; rm -f /dev/shm/outfile
But while doing that I ran a simultaneous tcpdump:
tcpdump -i em1 -w /dev/shm/dump.pcap
I was able to compress the pcap file down to 12M in size! Awesomesauce!
Edit: Before you ding me because the OP said, "I don't care about the contents," know that I posted this answer because it's one of the first replies to "how to create a large file linux" in a Google search. And sometimes, disregarding the contents of a file can have unforeseen side effects.
Edit 2: And fallocate seems to be unavailable on a number of filesystems, and creating a 1GB compressible file in 1.2s seems pretty decent to me (aka, "quickly").

You could use https://github.com/flew-software/trash-dump
It can create a file of any size filled with random data.
Here's a command you can run after installing trash-dump (it creates a 1 GB file):
$ trash-dump --filename="huge" --seed=1232 --noBytes=1000000000
BTW, I created it.

Related

Parallel bzip2 decompression in Python [duplicate]

I am using Python's bz2 module to generate (and compress) a large jsonl file (17 GB after bzip2 compression).
However, when I later try to decompress it using pbzip2, it only seems to use one CPU core for decompression, which is quite slow.
When I compress it with pbzip2, it can leverage multiple cores on decompression. Is there a way to compress within Python in the pbzip2-compatible format?
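import bz2, sys, traceback
from queue import Empty  # Python 3 name of the Python 2 Queue module
#...
compressor = bz2.BZ2Compressor(9)
f = open(path, 'ab')  # binary append: the compressor emits bytes
try:
    while 1:
        m = queue.get(True, 1*60)
        # assumes m is a str; encode it before compressing
        f.write(compressor.compress((m + "\n").encode("utf-8")))
except Empty:
    pass
except Exception:
    traceback.print_exc()
finally:
    sys.stderr.write("flushing")
    f.write(compressor.flush())
    f.close()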
A pbzip2 stream is nothing more than the concatenation of multiple bzip2 streams.
An example using the shell:
bzip2 < /usr/share/dict/words > words_x_1.bz2
cat words_x_1.bz2{,,,,,,,,,} > words_x_10.bz2
time bzip2 -d < words_x_10.bz2 > /dev/null
time pbzip2 -d < words_x_10.bz2 > /dev/null
I've never used python's bz2 module, but it should be easy to close/reopen a stream in 'a'ppend mode, every so-many bytes, to get the same result. Note that if BZ2File is constructed from an existing file-like object, closing the BZ2File will not close the underlying stream (which is what you want here).
I haven't measured how many bytes is optimal for chunking, but I would guess every 1-20 megabytes - it definitely needs to be larger than the bzip2 block size (900k) though.
Note also that if you record the compressed and uncompressed offsets of each chunk, you can do fairly efficient random access. This is how the dictzip program works, though that is based on gzip.
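Here is a minimal sketch of that chunking idea using Python's bz2 module directly. The function name compress_chunked and the 4 MiB default are my own choices (the default is just a value inside the 1-20 megabyte range guessed above, not something I measured), and I have only checked that the result round-trips, not how well pbzip2 parallelizes it.

```python
import bz2
import io


def compress_chunked(lines, out, chunk_bytes=4 * 1024 * 1024):
    """Write byte strings to out as concatenated, independent bzip2 streams.

    Returns the number of streams written.
    """
    compressor = bz2.BZ2Compressor(9)
    pending = 0   # uncompressed bytes fed into the current stream
    streams = 0
    for line in lines:
        out.write(compressor.compress(line))
        pending += len(line)
        if pending >= chunk_bytes:
            out.write(compressor.flush())      # finish this stream
            compressor = bz2.BZ2Compressor(9)  # start a fresh one
            pending = 0
            streams += 1
    if pending:
        out.write(compressor.flush())
        streams += 1
    return streams


if __name__ == "__main__":
    buf = io.BytesIO()
    n = compress_chunked((b'{"n": %d}\n' % i for i in range(100000)), buf, 256 * 1024)
    print(n, len(buf.getvalue()))
```

bz2.decompress() in Python 3 accepts such multi-stream data, so the same file stays readable from Python as well.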
If you absolutely must use pbzip2 on decompression this won't help you, but the alternative lbzip2 can perform multicore decompression of "normal" .bz2 files, such as those generated by Python's BZ2File or a traditional bzip2 command. This avoids the limitation of pbzip2 you're describing, where it can only achieve parallel decompression if the file is also compressed using pbzip2. See https://lbzip2.org/.
As a bonus, benchmarks suggest lbzip2 is substantially faster than pbzip2, both on decompression (by 30%) and compression (by 40%) while achieving slightly superior compression ratios. Further, its peak RAM usage is less than 50% of the RAM used by pbzip2. See https://vbtechsupport.com/1614/.

python os.read(fd, n) requires parameter n, why?

I need to read a text file with the os module as such:
t = os.open('te.txt', os.O_RDONLY)
r = os.read(t, 20)
rs = r.decode('utf-8')
print(rs)
What if I don't know the byte size of the file? I could put a very large number instead of 20, as a value seems to be required, but perhaps there is a more Pythonic way.
The second argument isn't supposed to hold the size of the file in bytes; it's only supposed to hold the maximum amount of content you're prepared to read at a time (which should typically be divisible by both your operating system's block size and page size; 64 KiB is not a bad default).
The "why" of this is because memory has to be allocated in userspace before the kernel can be instructed to write content into that memory. This isn't the kind of detail that Python developers need to think about often, but you're using a low-level interface built for use from C; it accordingly has implementation details leaking out of that underlying layer.
The operating system is free to give you less than the number of bytes you indicate as a maximum (for example, if it gets interrupted, or the filesystem driver isn't written to provide that much data at a time), so no matter what, you need to be prepared to call it repeatedly; only when it returns an empty bytes object (as opposed to throwing an exception or returning a shorter-than-requested result) are you certain to have reached the end of the file.
os.read() isn't a Pythonic interface, and it isn't supposed to be. It's a thin wrapper around the syscall provided by the operating system kernel. If you want a Pythonic interface, don't use os.read(), but instead use Python's native file objects.
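To make the "call it repeatedly" point concrete, here is a small sketch of such a loop. The helper name read_all and the 64 KiB chunk size are illustrative choices, not anything mandated by the os module.

```python
import os


def read_all(path, chunk_size=64 * 1024):
    """Read a whole file with os.read(), looping until end of file."""
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        while True:
            chunk = os.read(fd, chunk_size)  # may return fewer bytes than requested
            if not chunk:                    # empty result means end of file
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)


if __name__ == "__main__":
    print(read_all("te.txt").decode("utf-8"))
```

With a native file object the equivalent is simply open('te.txt', encoding='utf-8').read(), which does the looping for you.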
If you wanted to load the whole file and you have to use os, you could use os.stat(filename).st_size or os.path.getsize(filename) to get the size of the file in bytes.
import os

filename = 'te.txt'
t = os.open(filename, os.O_RDONLY)
b = os.stat(filename).st_size
r = os.read(t, b)  # may still return fewer than b bytes
os.close(t)
rs = r.decode('utf-8')
print(rs)

Why is numpy.memmap initialization so fast? [duplicate]

What is a sparse file and why do we need it?
The only thing that I am able to get is that it is a very large file (in gigabytes) and that it is efficient. How is it efficient?
Say you have a file with many empty bytes \x00. These runs of empty bytes are called holes. Storing empty bytes is just not efficient: we know there are many of them in the file, so why store them on the storage device? We could instead store metadata describing those zeros. When a process reads the file, those zero-byte blocks get generated dynamically as opposed to being read from physical storage.
This is why a sparse file is efficient, because it does not store the zeros on disk, instead it holds enough data describing the zeros that will be generated.
Note: the logical file size is greater than the physical file size for sparse files. This is because we have not stored the zeros physically on a storage device.
Edit:
When you run:
$ dd if=/dev/zero of=output bs=1G count=4
The command here copies four 1 GiB blocks of null bytes to output. To see that:
$ stat output
File: output
Size: 4294967296 Blocks: 8388616 IO Block: 4096 regular file
--omitted--
You can see that this file has 8388616 blocks allocated to it. These blocks store nothing but empty bytes copied from /dev/zero, and they do occupy physical disk space: the zeros that could have been holes are physically stored on disk. dd did what you asked for, copying blocks of data from one file to another.
Now, run this command to detect the holes and make the file sparse in-place:
$ fallocate -d output
$ stat output
File: output
Size: 4294967296 Blocks: 0 IO Block: 4096 regular file
--omitted--
Do you notice something? The number of blocks is now 0 because the blocks that were storing only empty bytes were de-allocated. Remember, output's blocks store nothing but zeros. fallocate -d detected the blocks that contain only zeros and deallocated them, and since all the blocks of this file contain zeros, they were all de-allocated.
Also notice how the size remained the same. This is the logical (virtual) size of the file, not its size on disk. It's crucial to know that output doesn't occupy physical storage space now: it has 0 blocks allocated to it and thus it doesn't really use disk space. The size is preserved after running fallocate -d, so when you later read from the file, you get the empty bytes generated for you by the filesystem at runtime. The physical size of output, however, is zero; it uses no data blocks.
Remember, when you read output file the empty bytes are generated by the filesystem at runtime dynamically, they're not really physically stored on disk, and the file's size as reported by stat is the logical size, and the physical size is zero for output. In this case the filesystem has to generate 4G of empty bytes when a process reads the file.
To generate a sparse file using dd:
$ dd if=/dev/zero of=output2 bs=1G seek=4 count=0
$ stat output2
File: output2
Size: 4294967296 Blocks: 0 IO Block: 4096 regular file
GNU dd internally uses lseek and ftruncate, so check truncate(2) and lseek(2).
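The same logical-versus-physical distinction can be observed from Python. This is just a sketch for Unix-like systems: the helper names make_sparse and sizes are made up, and it relies on st_blocks being counted in 512-byte units, which is what the Python documentation states for os.stat.

```python
import os


def make_sparse(path, size):
    """Create a file of the given logical size without writing any data."""
    with open(path, "wb") as f:
        f.truncate(size)  # extends the file; no data blocks are written


def sizes(path):
    """Return (logical size in bytes, bytes actually allocated on disk)."""
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


if __name__ == "__main__":
    make_sparse("output3", 64 * 1024 * 1024)
    print(sizes("output3"))
```

On a filesystem that supports sparse files, the second number should be far smaller than the first.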
A sparse file is a file that is mostly empty, i.e. it contains large blocks of bytes whose value is 0 (zero).
On the disk, the content of a file is stored in blocks of fixed size (usually 4 KiB or more). When all the bytes contained in such a block are 0, a file system that implements sparse files does not store the block on disk, instead it keeps the information somewhere in the file meta-data.
Advantages of using sparse files:
empty blocks of data do not occupy disk space; they are not stored as the regular blocks of data, their identifiers (that use only several bytes) are stored instead in the file meta-data; this way 4 KiB of disk space (or more) are saved for each empty block;
reading an empty block of data from a sparse file does not take time; this happens because no data is read from disk; since the file system knows all the bytes in the block are 0, it just sets to 0 all the bytes in the input buffer and the data is ready; there is no need to access the slow storage device;
writing an empty block of data into a sparse file does not take time; on writing, the file system detects that the block is empty (all its bytes are 0) and puts the block ID into the list of empty blocks (in the file meta-data); no data is written to the disk.
More information about sparse files can be found on the Wikipedia page.

is there a built-in python analog to unix 'wc' for sniffing a file?

Everyone's done this--from the shell, you need some details about a text file (more than just ls -l gives you), in particular, that file's line count, so:
# > wc -l iris.txt
149 iris.txt
I know that I can access shell utilities from Python, but I am looking for a Python built-in, if there is one.
The crux of my question is getting this information without opening the file (hence my reference to the unix utility wc -l).
(Is 'sniffing' the correct term for this, i.e., peeking at a file w/o opening it?)
You can always scan through it quickly, right?
lc = sum(1 for l in open('iris.txt'))
No, I would not call this "sniffing". Sniffing typically refers to looking at data as it passes through, like Ethernet packet capture.
You cannot get the number of lines from a file without opening it. This is because the number of lines in the file is actually the number of newline characters ("\n" on linux) in the file, which you have to read after open()ing it.
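If the file is large, counting newline bytes in fixed-size binary chunks avoids decoding every line. A small sketch follows; the function name count_lines and the 1 MiB chunk size are arbitrary choices. Like wc -l, it counts newline characters, so a final line without a trailing newline is not counted.

```python
def count_lines(path, chunk_size=1024 * 1024):
    """Count newline bytes in a file, reading it in binary chunks."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count


if __name__ == "__main__":
    print(count_lines("iris.txt"))
```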

How to append EOF to file using Perl or Python?

I’m trying to bulk insert data to SQL server express database. When doing bcp from Windows XP command prompt, I get the following error:
C:\temp>bcp in -T -f -S
Starting copy...
SQLState = S1000, NativeError = 0
Error = [Microsoft][SQL Native Client]Unexpected EOF encountered in BCP data-file
0 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total : 4391
So, there is a problem with EOF. How to append a correct EOF character to this file using Perl or Python?
EOF is End Of File. What probably occurred is that the file is not complete; the software expects data, but there is none to be had anymore.
These kinds of things happen when:
the export is interrupted (quit dump software while dumping)
the copy of the dump file is aborted partway through
disk full during dump
these kinds of things.
By the way, though EOF is usually just an end of file, there does exist an EOF character. This is used because terminal (command line) input doesn't really end like a file does, but it sometimes is necessary to pass an EOF to such a utility. I don't think it's used in real files, at least not to indicate an end of file. The file system knows perfectly well when the file has ended, it doesn't need an indicator to find that out.
EDIT shamelessly copied from a comment provided by John Machin
It can happen (unintentionally) in real files. All it needs is (1) a data-entry user to type Ctrl-Z by mistake, see nothing on the screen, type the intended Shift-Z, and keep going and (2) validation software (written by e.g. the company president's nephew) which happily accepts Ctrl-anykey in text fields and your database has a little bomb in it, just waiting for someone to produce a query to a flat file.
Unexpected EOF means that the bcp reader found an EOF when it was expecting more data. This EOF can be:
(1) the actual physical end-of-file (no more bytes to be read). This means that you have mis-formatted data. Check near the end of your file for an incomplete record.
OR
(2) on Windows, where you are, programs reading a file in text mode honour the ancient convention inherited via MS-DOS from CP/M of regarding Ctrl-Z (aka ^Z aka '\x1A' aka SUB aka SUBSTITUTE) as an end-of-file marker when reading from ANY file, not just a terminal. This includes Python -- the behaviour is determined by the C stdlib. Check for '\x1A' in your data.
Update responding to comments in a legible fashion:
In Notepad++, you can make it display unusual characters by doing View / Show Symbol / Show All Characters. You can search by doing Ctrl-F, typing \x1a in the Find What box, and selecting the Extended radio button in the Search panel.
Or you can with a little bit of Python get the line number of the first Ctrl-Z:
data = open('bcp.dat', 'rb').read()
zpos = data.find(b'\x1a')
# if zpos is -1, no Ctrl-Z in file
if zpos != -1:
    print(1 + data[:zpos].count(b'\r\n'))
Where your .dat was created doesn't matter. An unintentional Ctrl-Z can happen anywhere in a file created on any operating system. It is where it is being read as a text file that matters -- Windows? Bang!
This is not a problem with missing EOF, but with EOF that is there and is not expected by bcp.
I am not a bcp tool expert, but it looks like there is some problem with format of your data files.
