I am trying to create a file of unique data: if the file size is meant to be 1 MB, the file should contain the hex equivalents of the values from 1 to 1<<20. However, the output file's size is 6.9 MB. I tried replacing 'wb' with 'w', but it didn't affect the file size.
f = open('/tmp/hex_data', 'wb')
for i in xrange(1 << 20):
    f.write(hex(i))
f.close()
$ ls -lrth hex_data
-rw-r--r-- 1 aameershaikh wheel 6.9M Apr 5 16:43 hex_data
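The 6.9 MB comes from the fact that hex(i) returns an ASCII string, not raw bytes: for most values below 1<<20 that string is '0x' plus five hex digits, i.e. about 7 characters (7 bytes) per value, and 2**20 values at roughly 7 bytes each is about 6.9 MiB of text. If the goal is a compact file of unique values, one option is to write each counter as fixed-width binary instead; here is a minimal sketch (the output path is a placeholder of mine), which produces exactly 4 MiB for 2**20 values:
import struct

# Sketch: write each value as a fixed-width 4-byte little-endian unsigned int,
# so 2**20 values take exactly 4 MiB. The output path is just a placeholder.
with open('/tmp/bin_data', 'wb') as f:
    for i in range(1 << 20):
        f.write(struct.pack('<I', i))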
I am trying to split large CSV files into smaller CSV files of about 125 MB to 1 GB each. The split command will do it if I give it a number of records per file, but I want to work out that row count dynamically from the file size. Loading a whole 20 GB file into a Redshift table with the COPY command takes a lot of time, so if I chunk the 20 GB file into files of the mentioned size I will get much better results.
For example, a 20 GB file could be split at 6,000,000 records per file, so that each chunk comes out at around 125 MB; I want to determine that row count dynamically, depending on the file size.
You can get the file size in MB and divide by some ideal size that you need to predetermine (for my example I picked your minimum of 125MB), and that will give you the number of chunks.
You then get the row count (wc -l, assuming your CSV has no line breaks inside a cell) and divide that by the number of chunks to give your rows per chunk.
Rows per chunk is your "lines per chunk" count that you can finally pass to split.
Because we are doing integer division, which will most likely leave a remainder, you'll probably get one extra file containing a relatively small number of leftover rows (you can see this in the example below).
Here's how I coded this up. I'm using shellcheck, so I think this is pretty POSIX compliant:
#!/bin/sh

csvFile=$1
maxSizeMB=125

# Remove chunks from any previous run
rm -f chunked_*

fSizeMB=$(du -ms "$csvFile" | cut -f1)
echo "File size is ${fSizeMB}MB, max size per new file is ${maxSizeMB}MB"

nChunks=$(( fSizeMB / maxSizeMB ))
# Guard against inputs smaller than maxSizeMB (avoids division by zero below)
[ "$nChunks" -lt 1 ] && nChunks=1
echo "Want $nChunks chunks"

# Redirect into wc so it prints only the count; tr strips any padding spaces
nRows=$(wc -l < "$csvFile" | tr -d ' ')
echo "File row count is $nRows"

nRowsPerChunk=$(( nRows / nChunks ))
echo "Need $nChunks files at around $nRowsPerChunk rows per file (plus one more file, maybe, for remainder)"

split -d -a 4 -l "$nRowsPerChunk" "$csvFile" "chunked_"

echo "Row (line) counts per file:"
wc -l chunked_00*
echo
echo "Size (MB) per file:"
du -ms chunked_00*
I created a mock CSV with 60_000_000 rows that is about 5GB:
ll -h gen_60000000x11.csv
-rw-r--r-- 1 zyoung staff 4.7G Jun 24 15:21 gen_60000000x11.csv
When I ran that script I got this output:
./main.sh gen_60000000x11.csv
File size is 4801MB, max size per new file is 125MB
Want 38 chunks
File row count is 60000000
Need 38 files at around 1578947 rows per file (plus one more file, maybe, for remainder)
Row (line) counts per file:
1578947 chunked_0000
1578947 chunked_0001
1578947 chunked_0002
...
1578947 chunked_0036
1578947 chunked_0037
14 chunked_0038
60000000 total
Size (MB) per file:
129 chunked_0000
129 chunked_0001
129 chunked_0002
...
129 chunked_0036
129 chunked_0037
1 chunked_0038
I have a binary file called "input.bin".
I am practicing how to work with such files (read them, change the content and write into a new binary file).
The contents of the input file:
03 fa 55 12 20 66 67 50 e8 ab
which is in hexadecimal notation.
I want to make an output file which is simply the input file with the value of each byte incremented by one.
Here is the expected output:
04 fb 56 13 21 67 68 51 e9 ac
which will also be in hexadecimal notation.
I am trying to do that in Python 3 using the following code:
with open("input.bin", "rb") as binary_file:
data = binary_file.read()
for item in data:
item2 = item+1
with open("output.bin", "wb") as binary_file2:
binary_file2.write(item2)
But it does not produce what I want. Do you know how to fix it?
There are two problems: you reopen (and truncate) the output file on every loop iteration, and write() expects bytes, not an int. Open the output file once, increment every byte, and write the result in a single call:
with open("input.bin", "rb") as binary_file:
    data = binary_file.read()

with open("output.bin", "wb") as binary_file2:
    # bytes() only accepts values 0-255, so wrap 0xff around to 0x00
    binary_file2.write(bytes((item + 1) % 256 for item in data))
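To check the result on the sample input, a quick sketch of my own (not part of the original answer):
with open("output.bin", "rb") as f:
    print(" ".join("%02x" % b for b in f.read()))
# expected: 04 fb 56 13 21 67 68 51 e9 ac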
I have a text file listing the sizes of all files with the extension *.AAA on different servers. I would like to extract the filename and size of every file bigger than 20 GB, for each server. I know how to extract a line from a file and display it, but here is my example and what I would like to achieve.
The example of the file itself:
Pad 1001
Volume in drive \\192.168.0.101\c$ has no label.
Volume Serial Number is XXXX-XXXX
Directory of \\192.168.0.101\c$\TESTUSER\
02/11/2016 02:07 AM 894,889,984 File1.AAA
05/25/2015 07:18 AM 25,673,969,664 File2.AAA
02/11/2016 02:07 AM 17,879,040 File3.AAA
05/25/2015 07:18 AM 12,386,304 File4.AAA
10/13/2008 10:29 AM 1,186,988,032 File3.AAA_oct13
02/15/2016 11:15 AM 2,799,263,744 File5.AAA
6 File(s) 30,585,376,768 bytes
0 Dir(s) 28,585,127,936 bytes free
Pad 1002
Volume in drive \\192.168.0.101\c$ has no label.
Volume Serial Number is XXXX-XXXX
Directory of \\192.168.0.101\c$\TESTUSER\
02/11/2016 02:08 AM 1,379,815,424 File1.AAA
02/11/2016 02:08 AM 18,542,592 File3.AAA
02/15/2016 12:41 AM 853,659,648 File5.AAA
3 File(s) 2,252,017,664 bytes
0 Dir(s) 49,306,902,528 bytes free
Here is what I would like as my output: the Pad number and the file that is bigger than 20 GB:
Pad 1001 05/25/2015 07:18 AM 25,673,969,664 File2.AAA
I will eventually put this in an Excel spreadsheet, but that part I know how to do.
Any ideas?
Thank you
The following should get you started:
import re

output = []

with open('input.txt') as f_input:
    text = f_input.read()

for pad, block in re.findall(r'(Pad \d+)(.*?)(?=Pad|\Z)', text, re.M + re.S):
    file_list = re.findall(r'^(.*? +([0-9,]+) +.*?\.AAA\w*?)$', block, re.M)
    for line, length in file_list:
        length = int(length.replace(',', ''))
        if length > 2e10:   # Or your choice of what 20GB is
            output.append((pad, line))

print output
This would display a list with one tuple entry as follows:
[('Pad 1001', '05/25/2015 07:18 AM 25,673,969,664 File2.AAA')]
[EDIT] Here is my approach:
import re

result = []

with open('txtfile.txt', 'r') as f:
    content = [line.strip() for line in f.readlines()]

for line in content:
    m = re.findall(r'\d{2}/\d{2}/\d{4}\s+\d{2}:\d{2}\s+(A|P)M\s+([0-9,]+)\s+((?!.AAA).)*.AAA((?!.AAA).)*', line)
    if line.startswith('Pad') or m and int(m[0][1].replace(',', '')) > 20 * 1024 ** 3:
        result.append(line)

print re.sub(r'Pad\s+\d+$', '', ' '.join(result))
Output is:
Pad 1001 05/25/2015 07:18 AM 25,673,969,664 File2.AAA
I have an HDF5 file containing a very large EARRAY that I would like to truncate in order to save disk space and process it more quickly. I am using the truncate method on the node containing the EARRAY. pytables reports that the array has been truncated but it still takes up the same amount of space on disk.
Directory listing before truncation:
$ ll
total 3694208
-rw-rw-r-- 1 chris 189 Aug 27 13:03 main.py
-rw-rw-r-- 1 chris 3782858816 Aug 27 13:00 original.hdf5
The script I am using to truncate (main.py):
import tables
filename = 'original.hdf5'
h5file = tables.open_file(filename, 'a')
print h5file
node = h5file.get_node('/recordings/0/data')
node.truncate(30000)
print h5file
h5file.close()
Output of the script. As expected, the EARRAY goes from very large to much smaller.
original.hdf5 (File) ''
Last modif.: 'Thu Aug 27 13:00:12 2015'
Object Tree:
/ (RootGroup) ''
/recordings (Group) ''
/recordings/0 (Group) ''
/recordings/0/data (EArray(43893300, 43)) ''
/recordings/0/application_data (Group) ''
original.hdf5 (File) ''
Last modif.: 'Thu Aug 27 13:00:12 2015'
Object Tree:
/ (RootGroup) ''
/recordings (Group) ''
/recordings/0 (Group) ''
/recordings/0/data (EArray(30000, 43)) ''
/recordings/0/application_data (Group) ''
Yet the file takes up almost exactly the same amount of space on disk:
ll
total 3693196
-rw-rw-r-- 1 chris 189 Aug 27 13:03 main.py
-rw-rw-r-- 1 chris 3781824064 Aug 27 13:03 original.hdf5
What am I doing wrong? How can I reclaim this disk space?
If there were a way to directly modify the contents of the earray, instead of using the truncate method, this would be even more useful for me. Something like node = node[idx1:idx2, :], so that I could select which chunk of data I want to keep. But when I use this syntax, the variable node simply becomes a numpy array and the hdf5 file is not modified.
As discussed in this question you can't really deallocate disk space from an existing hdf5 file. It's just not a part of how hdf5 is designed, and therefore it's not really a part of pytables. You can either load the data from the file, then rewrite it all as a new file (potentially with the same name), or you can use the command line utility h5repack to do that for you.
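For the rewrite-as-a-new-file route, here is a minimal sketch using PyTables' copy_file (the file names are placeholders); the command-line equivalent would be something like h5repack original.hdf5 repacked.hdf5:
import tables

# Sketch: after truncating, copy the file node-by-node into a fresh file.
# The copy is rebuilt without the dead space, so the disk space is reclaimed.
# File names are placeholders; rename or swap the files afterwards as needed.
with tables.open_file('original.hdf5', 'r') as h5file:
    h5file.copy_file('repacked.hdf5', overwrite=True)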
You can use ftplib for full FTP support in Python. However the preferred way of getting a directory listing is:
# File: ftplib-example-1.py
import ftplib
ftp = ftplib.FTP("www.python.org")
ftp.login("anonymous", "ftplib-example-1")
data = []
ftp.dir(data.append)
ftp.quit()
for line in data:
    print "-", line
Which yields:
$ python ftplib-example-1.py
- total 34
- drwxrwxr-x 11 root 4127 512 Sep 14 14:18 .
- drwxrwxr-x 11 root 4127 512 Sep 14 14:18 ..
- drwxrwxr-x 2 root 4127 512 Sep 13 15:18 RCS
- lrwxrwxrwx 1 root bin 11 Jun 29 14:34 README -> welcome.msg
- drwxr-xr-x 3 root wheel 512 May 19 1998 bin
- drwxr-sr-x 3 root 1400 512 Jun 9 1997 dev
- drwxrwxr-- 2 root 4127 512 Feb 8 1998 dup
- drwxr-xr-x 3 root wheel 512 May 19 1998 etc
...
I guess the idea is to parse the results to get the directory listing. However, this listing depends directly on the FTP server's way of formatting the list. It would be very messy to write code for this, having to anticipate all the different ways FTP servers might format the listing.
Is there a portable way to get an array filled with the directory listing?
(The array should only have the folder names.)
Try using ftp.nlst(dir).
However, note that if the folder is empty, it might throw an error:
files = []

try:
    files = ftp.nlst()
except ftplib.error_perm as resp:
    if str(resp) == "550 No files found":
        print "No files in this directory"
    else:
        raise

for f in files:
    print f
The reliable/standardized way to parse an FTP directory listing is to use the MLSD command, which by now should be supported by all recent/decent FTP servers.
import ftplib
f = ftplib.FTP()
f.connect("localhost")
f.login()
ls = []
f.retrlines('MLSD', ls.append)
for entry in ls:
    print entry
The code above will print:
modify=20110723201710;perm=el;size=4096;type=dir;unique=807g4e5a5; tests
modify=20111206092323;perm=el;size=4096;type=dir;unique=807g1008e0; .xchat2
modify=20111022125631;perm=el;size=4096;type=dir;unique=807g10001a; .gconfd
modify=20110808185618;perm=el;size=4096;type=dir;unique=807g160f9a; .skychart
...
Starting from python 3.3, ftplib will provide a specific method to do this:
http://bugs.python.org/issue11072
http://hg.python.org/cpython/file/67053b135ed9/Lib/ftplib.py#l535
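That method is FTP.mlsd(), which issues MLSD and parses each response line into a (name, facts) pair for you. A minimal sketch on Python 3.3+ (the host is a placeholder):
import ftplib

ftp = ftplib.FTP("ftp.example.com")   # placeholder host
ftp.login()
# mlsd() yields (name, facts) tuples; facts is a dict such as
# {'type': 'dir', 'size': '4096', 'modify': '20110723201710', ...}
for name, facts in ftp.mlsd():
    if facts.get("type") == "dir":
        print(name)
ftp.quit()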
I found my way here while trying to get filenames, last-modified stamps, file sizes, etc., and wanted to add my code. It only took a few minutes to write a loop that parses the output of ftp.dir(dir_list.append), using standard-library string methods like strip() (to clean up each line of text) and split() (to create an array).
from ftplib import FTP

ftp = FTP('sick.domain.bro')
ftp.login()
ftp.cwd('path/to/data')

dir_list = []
ftp.dir(dir_list.append)

# main thing is identifying which char marks the start of the good stuff:
# '-rw-r--r-- 1 ppsrt ppsrt 545498 Jul 23 12:07 FILENAME.FOO
#  (the good stuff starts at line[29])
for line in dir_list:
    print line[29:].strip().split(' ')  # got yerself an array there bud!
    # EX: ['545498', 'Jul', '23', '12:07', 'FILENAME.FOO']
There's no standard for the layout of the LIST response. You'd have to write code to handle the most popular layouts. I'd start with Linux ls and Windows Server DIR formats. There's a lot of variety out there, though.
Fall back to the nlst method (returning the result of the NLST command) if you can't parse the longer list. For bonus points, cheat: perhaps the longest number in the line containing a known file name is its length.
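To illustrate that last trick, here is a small sketch of my own (not part of the original answer): take the LIST line that mentions a known file name and treat the largest integer on it as a guess at the size.
import re

def guess_size(list_lines, filename):
    # Heuristic sketch: assume the largest integer on the LIST line that
    # mentions `filename` is that file's size in bytes.
    for line in list_lines:
        if filename in line:
            numbers = [int(n) for n in re.findall(r'\d+', line)]
            if numbers:
                return max(numbers)
    return None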
I happen to be stuck with an FTP server (Rackspace Cloud Sites virtual server) that doesn't seem to support MLSD. Yet I need several fields of file information, such as size and timestamp, not just the filename, so I have to use the DIR command. On this server, the output of DIR looks very much like the OP's. In case it helps anyone, here's a little Python class that parses a line of such output to obtain the filename, size and timestamp.
import datetime

class FtpDir:
    def parse_dir_line(self, line):
        words = line.split()
        self.filename = words[8]
        self.size = int(words[4])
        t = words[7].split(':')
        ts = words[5] + '-' + words[6] + '-' + datetime.datetime.now().strftime('%Y') + ' ' + t[0] + ':' + t[1]
        self.timestamp = datetime.datetime.strptime(ts, '%b-%d-%Y %H:%M')
Not very portable, I know, but easy to extend or modify to deal with various different FTP servers.
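For example, feeding it a LIST line like the one shown in an earlier answer (this usage example is mine, not the original poster's):
d = FtpDir()
d.parse_dir_line('-rw-r--r--   1 ppsrt  ppsrt   545498 Jul 23 12:07 FILENAME.FOO')
print(d.filename)   # FILENAME.FOO
print(d.size)       # 545498
print(d.timestamp)  # e.g. <current year>-07-23 12:07:00 (the class assumes the current year)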
This is from the Python docs:
>>> from ftplib import FTP_TLS
>>> ftps = FTP_TLS('ftp.python.org')
>>> ftps.login()           # login anonymously before securing control channel
>>> ftps.prot_p() # switch to secure data connection
>>> ftps.retrlines('LIST') # list directory content securely
total 9
drwxr-xr-x 8 root wheel 1024 Jan 3 1994 .
drwxr-xr-x 8 root wheel 1024 Jan 3 1994 ..
drwxr-xr-x 2 root wheel 1024 Jan 3 1994 bin
drwxr-xr-x 2 root wheel 1024 Jan 3 1994 etc
d-wxrwxr-x 2 ftp wheel 1024 Sep 5 13:43 incoming
drwxr-xr-x 2 root wheel 1024 Nov 17 1993 lib
drwxr-xr-x 6 1094 wheel 1024 Sep 13 19:07 pub
drwxr-xr-x 3 root wheel 1024 Jan 3 1994 usr
-rw-r--r-- 1 root root 312 Aug 1 1994 welcome.msg
That helped me with my code. When I wanted to filter only one type of file and show those on screen, I added a condition that tests each line.
Like this:
elif command == 'ls':
    print("directory of ", ftp.pwd())
    data = []
    ftp.dir(data.append)
    formats = ["gz", "zip", "rar", "tar", "bz2", "xz"]
    for line in data:
        x = line.split(".")
        if x[-1] in formats:
            print("-", line)