I'm looking for the fastest way to generate fake files for a stress test, where it's possible to pass the file size. Currently I'm using a simple
with open("{}".format(i), 'wb') as f: f.write(os.urandom(FILE_SIZE))
but in my case, creating each file takes too long. It seems to me that the Faker library does not have a method to generate fake files.
EDIT: The code above is just part of a whole script, so any CMD/OS commands are not a solution to my problem.
You can follow the link below for the command and an explanation, and then run the same command in a for loop:
How to create a file with ANY given size in Linux?
Wouldn't it be better to use OS commands for that?
dd if=/dev/urandom of=/tmp/x bs=1M count=1
You could start it using the subprocess module:
subprocess.check_call("dd if=/dev/urandom of=/tmp/y bs=1M count=1".split(" "))
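For example, a small sketch that runs the same dd command in a loop and parameterizes the size (file count, size, and output paths here are placeholders):

import subprocess

FILE_SIZE = 1024 * 1024   # bytes per file (placeholder)
NUM_FILES = 10            # number of files to create (placeholder)

for i in range(NUM_FILES):
    # one dd call per file: a single block of FILE_SIZE random bytes
    subprocess.check_call([
        "dd", "if=/dev/urandom",
        "of=/tmp/fake_{}".format(i),
        "bs={}".format(FILE_SIZE),
        "count=1",
    ])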
Related
How can I convert file.docx to file.doc using Python? I have code that outputs a file in .docx format, but this program is for someone who can only use Word 2003, so I need to convert that file to .doc using Python. How can I do it? Thank you in advance.
A bit clunky, but you could use pywinauto to programmatically open your .docx documents in Word, then save them as .doc. It'd be using Word to do the conversion, so it should be as clean as you can get.
This is a snippet of what I've used for converting to PDF within Word (it was just a test). You'd have to follow the keystrokes necessary to save as a .doc.
import pywinauto
from pywinauto.application import Application
app1 = Application(backend="uia").connect(path="C:\\Program Files (x86)\\Microsoft Office\\root\\Office16\\WINWORD.EXE")
wordhndl = app1.top_window()
wordhndl.type_keys('^o')
wordhndl.type_keys('%f')
wordhndl.type_keys('^o')
wordhndl.type_keys('^o')
#Now that we're in a sub-window, using the top_window() handle doesn't work...
#Instead refer to absolute (using friendly_class_name())
app1.Dialog.Open.type_keys("Y:\\996.Software\\04.Python\\Test\\SampleDoc1.docx")
app1.Dialog.Open.type_keys('~')
#Publish it to pdf
app1.SampleDoc1docx.type_keys('%f')
app1.SampleDoc1docx.type_keys('e')
app1.SampleDoc1docx.type_keys('p')
app1.SampleDoc1docx.type_keys('a')
app1.SampleDoc1docx.PublishasPDForXPS.Publish.type_keys('~')
#Deal with popups & prompts
if app1.Dialog.PublishasPDForXPS.ConfirmSaveAs.exists():
    app1.Dialog.PublishasPDForXPS.ConfirmSaveAs.Yes.click() #This line can take some time...
I think the .doc keystrokes would be (not tested)
app1.SampleDoc1docx.type_keys('%f')
app1.SampleDoc1docx.type_keys('a')
app1.SampleDoc1docx.type_keys('y')
app1.SampleDoc1docx.type_keys('4')
app1.SampleDoc1docx.type_keys('{DOWN}')
app1.SampleDoc1docx.type_keys('{DOWN}')
app1.SampleDoc1docx.type_keys('~')
app1.SampleDoc1docx.type_keys('{RIGHT}')
app1.SampleDoc1docx.type_keys('~')
But... the better solution is to use Word itself. I've used VBA within Word to do this exact thing before. I don't have the code to hand, but a good pointer would be:
https://www.datanumen.com/blogs/3-quick-ways-to-batch-convert-word-doc-to-docx-files-and-vice-versa/
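As a rough sketch of that "use Word itself" route, COM automation via pywin32 can drive the conversion directly (this assumes Word and pywin32 are installed; the paths are placeholders, and FileFormat=0 corresponds to wdFormatDocument, the Word 97-2003 .doc format):

import win32com.client

word = win32com.client.Dispatch("Word.Application")
word.Visible = False
try:
    doc = word.Documents.Open(r"C:\path\to\input.docx")
    # FileFormat=0 is wdFormatDocument, i.e. the legacy .doc format
    doc.SaveAs(r"C:\path\to\output.doc", FileFormat=0)
    doc.Close()
finally:
    word.Quit()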
You can do something along the lines of:
import docx

doc = docx.Document("myWordxFile.docx")
# Note: python-docx always writes the .docx format, so this only changes the
# file extension, not the actual format.
doc.save('myNewWordFile.doc')
check this too. Good luck!
I have seen some installation files (huge ones, install.sh for Matlab or Mathematica, for example) for Unix-like systems; they must have quite a lot of binary data, such as icons, sound, and graphics, embedded in the script. I am wondering how that can be done, since it could potentially be useful for simplifying file structure.
I am particularly interested in doing this with Python and/or Bash.
Existing methods that I know of in Python:
Just use a byte string: x = b'\x23\xa3\xef' ... — terribly inefficient; it takes half a MB of source for a 100 KB wav file.
base64 — better than option 1, but it enlarges the size by a factor of 4/3.
I am wondering if there are other (better) ways to do this?
You can use base64 + compression (using bz2 for instance) if that suits your data (e.g., if you're not embedding already compressed data).
For instance, to create your data (say your data consist of 100 null bytes followed by 200 bytes with value 0x01):
>>> import bz2, base64
>>> base64.b64encode(bz2.compress(b'\x00' * 100 + b'\x01' * 200))
b'QlpoOTFBWSZTWcl9Q1UAAABBBGAAQAAEACAAIZpoM00SrccXckU4UJDJfUNV'
And to use it (in your script) to write the data to a file:
import bz2, base64

data = b'QlpoOTFBWSZTWcl9Q1UAAABBBGAAQAAEACAAIZpoM00SrccXckU4UJDJfUNV'
with open('/tmp/testfile', 'wb') as fdesc:
    fdesc.write(bz2.decompress(base64.b64decode(data)))
Here's a quick and dirty way. Create the following script called myInstaller:
#!/bin/bash
dd if="$0" of=payload bs=1 skip=54
exit
Then append your binary to the script, and make it executable:
cat myBinary >> myInstaller
chmod +x myInstaller
When you run the script, it will copy the binary portion to the new file specified by of=. This could be a tar file or whatever, so you can do additional processing (unarchiving, setting execute permissions, etc.) after the dd command. Just adjust the number in "skip" to reflect the total length of the script before the binary data starts.
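If you'd rather assemble the installer from Python, so the skip offset is computed instead of counted by hand, a sketch could look like this (myBinary and myInstaller are placeholder names):

import os
import stat

stub_template = '#!/bin/bash\ndd if="$0" of=payload bs=1 skip={skip}\nexit\n'

# The stub's length depends on how many digits the skip value has, so iterate
# until the value is consistent with the stub that contains it.
skip = 0
while len(stub_template.format(skip=skip)) != skip:
    skip = len(stub_template.format(skip=skip))

with open('myBinary', 'rb') as payload, open('myInstaller', 'wb') as out:
    out.write(stub_template.format(skip=skip).encode('ascii'))
    out.write(payload.read())

# Equivalent of chmod +x myInstaller
st = os.stat('myInstaller')
os.chmod('myInstaller', st.st_mode | stat.S_IEXEC)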
Here is my problem: I have a file in HDFS which can potentially be huge (i.e., too big to fit entirely in memory).
What I would like to do is avoid having to cache this file in memory, and only process it line by line like I would do with a regular file:
for line in open("myfile", "r"):
    # do some processing
I am looking to see if there is an easy way to get this done right without using external libraries. I can probably make it work with libpyhdfs or python-hdfs, but I'd like, if possible, to avoid introducing new dependencies and untested libraries into the system, especially since both of these don't seem heavily maintained and state that they shouldn't be used in production.
I was thinking of doing this using the standard "hadoop" command-line tools via the Python subprocess module, but I can't seem to do what I need, since there is no command-line tool that would do my processing and I would like to execute a Python function for every line in a streaming fashion.
Is there a way to apply Python functions as right operands of the pipes using the subprocess module? Or even better, open it like a file as a generator so I could process each line easily?
cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)
If there is another way to achieve what I described above without using an external library, I'm also pretty open.
Thanks for any help!
You want xreadlines; it reads lines from a file without loading the whole file into memory.
Edit:
Now that I see your question: you just need to get the stdout pipe from your Popen object:
cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)
for line in cat.stdout:
    print(line)
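If you want it wrapped up as a generator, as the question asks, a small sketch around the same idea (assuming the hadoop CLI is on the PATH) could be:

import subprocess

def hdfs_lines(path):
    """Yield lines from an HDFS file by streaming `hadoop fs -cat` output."""
    cat = subprocess.Popen(["hadoop", "fs", "-cat", path],
                           stdout=subprocess.PIPE)
    try:
        for line in cat.stdout:
            yield line
    finally:
        cat.stdout.close()
        cat.wait()

for line in hdfs_lines("/path/to/myfile"):
    pass  # do some processing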
If you want to avoid adding external dependencies at any cost, Keith's answer is the way to go. Pydoop, on the other hand, could make your life much easier:
import pydoop.hdfs as hdfs

with hdfs.open('/user/myuser/filename') as f:
    for line in f:
        do_something(line)
Regarding your concerns, Pydoop is actively developed and has been used in production for years at CRS4, mostly for computational biology applications.
Simone
In the last two years, there has been a lot of movement around Hadoop Streaming. It is pretty fast according to Cloudera: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/ I've had good success with it.
You can use the WebHDFS Python Library (built on top of urllib3):
from hdfs import InsecureClient
from json import dump  # or pickle.dump; both were tested by the author

client_hdfs = InsecureClient('http://host:port', user='root')
with client_hdfs.write(access_path) as writer:
    dump(records, writer)  # tested for pickle and json (doesn't work for joblib)
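Since the original question is about reading, the same client can also stream a file back. A rough sketch (the read() context manager and its encoding/delimiter arguments are taken from the HdfsCLI documentation as I remember it, so verify against the version you install):

# Read the file back line by line with the same client
with client_hdfs.read(access_path, encoding='utf-8', delimiter='\n') as reader:
    for line in reader:
        do_something(line)  # your per-line processing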
Or you can use the requests package in Python:
import requests
from json import dumps

params = (('op', 'CREATE'), ('buffersize', 256))
data = dumps(file)  # some file or object - also tested for the pickle library
response = requests.put('http://host:port/path', params=params, data=data)  # response 200 = successful
Hope this helps!
Is it possible to take an ISO file and edit a file in it directly, i.e. not by unpacking it, changing the file, and repacking it?
Is it possible to do this from Python? How would I do it?
You can use either of the following libraries for listing and extracting; I tested the first one.
https://github.com/barneygale/iso9660/blob/master/iso9660.py
import iso9660

cd = iso9660.ISO9660("/Users/murat/Downloads/VisualStudio6Enterprise.ISO")
for path in cd.tree():
    print(path)
https://github.com/barneygale/isoparser
import isoparser

iso = isoparser.parse("http://www.microsoft.com/linux.iso")
print(iso.record("boot", "grub").children)
print(iso.record("boot", "grub", "grub.cfg").content)
Have you seen Hachoir, a Python library to "view and edit a binary stream field by field"? I haven't had a need to try it myself, but ISO 9660 is listed as a supported parser format.
PyCdlib can read and edit ISO-9660 files, as well as Joliet, Rock Ridge, UDF, and El Torito extensions. It has a bunch of detailed examples in its documentation, including one showing how to edit a file in-place. At the time of writing, it cannot add or remove files, or edit directories. However, it is still actively maintained, in contrast to the libraries linked in older answers.
Of course, as with any file.
It can be done with open/read/write/seek/tell/close operations on a file, packing and unpacking the data with struct or ctypes. It would require serious knowledge of the contents of an ISO, but I presume you already know what to do. If you're lucky, you can try using mmap, which gives you a string-like interface to the file contents.
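For instance, a minimal sketch of the mmap approach, assuming you already know (from parsing the ISO structures) the offset of the bytes you want to change and that the replacement has the same length:

import mmap

OFFSET = 12345            # hypothetical byte offset of the data to replace
NEW_BYTES = b"HELLO"      # must be the same length as what it overwrites

with open("image.iso", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    mm[OFFSET:OFFSET + len(NEW_BYTES)] = NEW_BYTES
    mm.flush()
    mm.close()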
Is there any library to show progress when adding files to a tar archive in Python, or alternatively, would it be possible to extend the functionality of the tarfile module to do this?
In an ideal world I would like to show the overall progress of the tar creation as well as an ETA as to when it will be complete.
Any help on this would be really appreciated.
Unfortunately it doesn't look like there is an easy way to get byte-by-byte numbers.
Are you adding really large files to this tar file? If not, I would update progress on a file-by-file basis so that as files are added to the tar, the progress is updated based on the size of each file.
Suppose all your filenames are in the variable toadd and tarfile is a TarFile object. How about:
import sys

# Build the TarInfo objects up front so we can total up their sizes
# (gettarinfo stats each file on disk).
tarobjs = [tarfile.gettarinfo(name) for name in toadd]
total = sum(obj.size for obj in tarobjs)

complete = 0.0
for name, tarobj in zip(toadd, tarobjs):
    sys.stdout.write("\rPercent Complete: {0:3.0f}%".format(complete))
    tarfile.add(name)
    complete += tarobj.size / float(total) * 100
sys.stdout.write("\rPercent Complete: {0:3.0f}%\n".format(complete))
sys.stdout.write("Job Done!\n")
Find or write a file-like object that wraps a real file and provides progress reporting, and pass it to TarFile.addfile() so that you can know how many bytes have been requested for inclusion in the archive. You may have to use/implement throttling in case TarFile tries to read the whole file at once.
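A minimal sketch of that idea (ProgressFile is an illustrative name, not something tarfile provides):

import os
import sys
import tarfile

class ProgressFile(object):
    """File-like wrapper that reports how many bytes have been read so far."""
    def __init__(self, path, callback):
        self._f = open(path, 'rb')
        self._callback = callback
        self._read = 0

    def read(self, size=-1):
        data = self._f.read(size)
        self._read += len(data)
        self._callback(self._read)
        return data

    def close(self):
        self._f.close()

# Usage: report progress for one file as it is added
path = 'big_file.bin'
size = os.path.getsize(path) or 1  # avoid division by zero for empty files

def show(done):
    sys.stdout.write("\r{0:3.0f}%".format(done * 100.0 / size))

with tarfile.open('archive.tar', 'w') as tar:
    info = tar.gettarinfo(path)
    wrapped = ProgressFile(path, show)
    tar.addfile(info, fileobj=wrapped)
    wrapped.close()
sys.stdout.write("\n")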
I have recently written a wrapper library that provides a progress callback. Have a look at it on GitHub:
https://github.com/thomaspurchas/tarfile-Progress-Reporter
Feel free to ask for help if you need it.
It seems like you can use the filter parameter of tarfile.add():
def my_filter(tarinfo):
    # increment some count
    # add tarinfo.size to some byte counter
    return tarinfo

with tarfile.open(<tarball path>, 'w') as tarball:
    tarball.add(<some file>, filter=my_filter)
All information you can get from a TarInfo object is available to you.
How are you adding files to the tar file? Is it through "add" with recursive=True? You could build the list of files yourself and call "add" one by one, showing the progress as you go. If you're building from a stream/file then it looks like you could wrap that fileobj to see the read status and pass that into addfile.
It does not look like you will need to modify tarfile.py at all.