python - ftp not transferring full file

tl;dr, is there any reason that one file would fail to transfer completely over ftp, while every other file uploaded in the exact same way works fine?
Every week, I use python's ftplib to update my website – usually this consists of transferring around 30-31 files total - 4 that overwrite existing files, and the rest that are completely new. For basically all of these files, my code looks like:
myFile = open('[fileURL]', 'rb')
ftp.storbinary(cmd='STOR [fileURLonServer]', fp=myFile)
myFile.close()
This works completely fine for almost all of my files. Except for one: the top-level index.html file. This file is usually around 7.8-8.1 kb in size, depending on its contents from week to week. It seems to be the case that only the first 4096 bytes of the file are transferred to the server – I have to manually go in and upload the full version of the index every week. Why is it doing this, and how can I get it to stop? Is there some metadata in the file that could be causing the problem?
StackOverflow recommended me this question which doesn't solve my problem – I'm already using the rb mode for opening every file I'm trying to transfer, and each one except this one is working perfectly fine.
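For reference, here is a minimal sketch of the kind of check that can confirm whether the whole file arrived, assuming the bracketed placeholders are replaced with real paths (the host and credentials below are likewise placeholders):

import os
from ftplib import FTP

local_path = 'index.html'                  # placeholder local file
remote_path = 'public_html/index.html'     # placeholder remote path

ftp = FTP('ftp.example.com')               # hypothetical host
ftp.login('user', 'password')              # hypothetical credentials

with open(local_path, 'rb') as f:
    ftp.storbinary('STOR ' + remote_path, f)

# SIZE is widely (though not universally) supported; compare against the local
# size to catch a short upload like the 4096-byte truncation described above.
remote_size = ftp.size(remote_path)
local_size = os.path.getsize(local_path)
if remote_size != local_size:
    print('Upload incomplete: %d of %d bytes on the server' % (remote_size, local_size))

ftp.quit()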

Related

Python: Why opening an XFA pdf file takes longer than a txt file of same size?

I am currently developing some Python code to extract data from 14,000 PDFs (7 MB per PDF). They are dynamic XFAs made with Adobe LiveCycle Designer 11.0, so they contain streams that need to be decoded later (so there are some non-ASCII characters, if it makes any difference).
My problem is that calling open() on those files takes around 1 second each, if not more.
I tried the same operation on 13 MB text files created by copy-pasting a character, and they take less than 0.01 sec to open. Where does this time increase come from when I am opening the dynamic PDFs with open()? Can I avoid this bottleneck?
I got those timings using cProfile like this:
from cProfile import Profile
profiler = Profile()
profiler.enable()
f = open('test.pdf', 'rb')
f.close()
profiler.disable()
profiler.print_stats('tottime')
The result of print_stats for a given XFA PDF shows that io.open() takes around 1 second to execute once.
Additional information:
I have noticed that opening the same PDF is around 10x faster if it was already opened in the last 15 or 30 minutes, even if I delete the __pycache__ directory inside my project. A solution that makes this speed-up apply regardless of the elapsed time could be worth it, though I only have 50 GB left on my PC.
Also, parallel processing of the pdfs is not an option since I only have 1 free core to run my implementation...
To solve this problem, you can do one of the following:
specify files/directories/extensions to exclude from real-time scanning in the Windows Defender settings
temporarily turn off real-time protection in Windows Defender
save the files in an encoded format in which Windows Defender can't detect links to other files/websites, and decode them on read (I have not tried this)
As "user2357112 supports monica" said in the comments, the culprit is the anti-virus software scanning the files before making them available to Python.
I was able to verify this by calling open() on a list of files while having the task manager open. Python used almost 0% of the CPU while the Microsoft Defender antivirus service was maxing out one of my cores.
I compared the results to another run of my script where I opened the same file multiple times, and Python was maxing out the core while the antivirus stayed at 0%.
I also tried running a quick scan of a single PDF file two times with Windows Defender. The first execution resulted in 800 files being scanned in 1 second (hence the 1 second delay of the open() call), and the second scan resulted in a single file being scanned instantly.
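For anyone who wants to reproduce the cold-versus-warm difference without the profiler, a rough sketch (the folder pattern is a placeholder) is to time open() directly and run the loop twice:

import glob
import time

# Hypothetical folder of XFA PDFs; adjust the pattern to your data.
paths = glob.glob(r'C:\pdfs\*.pdf')

for path in paths:
    start = time.perf_counter()
    with open(path, 'rb') as f:
        pass
    print('%s opened in %.3f s' % (path, time.perf_counter() - start))

# Run the script twice: while the files are still in Defender's scan cache,
# the second run should be roughly 10x faster.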
Explanation:
Windows Defender scans through all the file and internet links written in the files, which is why the scan takes so long and why around 800 files appear in the first report. Windows Defender keeps a cache of files scanned since the PC was powered on. Files not linked to the internet don't need to be rescanned. But XFAs contain links to websites, and since it is impossible to tell whether a website has been maliciously modified, files that contain such links need to be rescanned periodically to make sure they are still safe.
Here is a link to the official Microsoft forum thread.

Getting the time when a file is copied to a folder (Python)

I'm trying to write a Python script that runs on Windows. Files are copied to a folder every few seconds, and I'm polling that folder every 30 seconds for names of the new files that were copied to the folder after the last poll.
What I have tried is to use one of the os.path.getXtime(folder_path) functions and compare that with the timestamp of my previous poll. If the getXtime value is larger than the timestamp, then I work on those files.
I have tried to use the function os.path.getctime(folder_path), but that didn't work because the files were created before I wrote the script. I tried os.path.getmtime(folder_path) too but the modified times are usually smaller than the poll timestamp.
Finally, I tried os.path.getatime(folder_path), which works for the first time the files were copied over. The problem is I also read the files once they were in the folder, so the access time keeps getting updated and I end up reading the same files over and over again.
I'm not sure what is a better way or function to do this.
You've got a bit of an XY problem here. You want to know when files in a folder change, you tried a homerolled solution, it didn't work, and now you want to fix your homerolled solution.
Can I suggest that instead of terrible hackery, you use an existing package designed for monitoring for file changes? One that is not a polling loop, but actually gets notified of changes as they happen? While inotify is Linux-only, there are other options for Windows.
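For example, the third-party watchdog package is one such option. A minimal sketch, with the watched folder as a placeholder, could look like this:

import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

WATCHED_FOLDER = r'C:\incoming'   # hypothetical folder the files are copied into

class NewFileHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            print('New file: %s' % event.src_path)  # process the new file here

observer = Observer()
observer.schedule(NewFileHandler(), WATCHED_FOLDER, recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()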

How to modify a large file remotely

I have a large XML file, ~30 MB.
Every now and then I need to update some of the values. I am using the ElementTree module to modify the XML. I am currently fetching the entire file, updating it and then placing it again, so there is ~60 MB of data transfer every time. Is there a way I can update the file remotely?
I am using the following code to update the file.
import xml.etree.ElementTree as ET

tree = ET.parse("feed.xml")
root = tree.getroot()

skus = ["RUSSE20924", "PSJAI22443"]
qtys = [2, 3]

for child in root:
    sku = child.find("Product_Code").text.encode("utf-8")
    if sku in skus:
        print "found"
        i = skus.index(sku)
        child.find("Quantity").text = str(qtys[i])
        child.set('updated', 'yes')

tree.write("feed.xml")
Modifying a file directly via FTP without uploading the entire thing is not possible except when appending to a file.
The reason is that there are only three commands in FTP that actually modify a file (Source):
APPE: Appends to a file
STOR: Uploads a file
STOU: Creates a new file on the server with a unique name
What you could do
Track changes
Cache the remote file locally and track changes to the file using the MDTM command (a sketch follows the pros and cons below).
Pros:
Will halve the required data transfer in many cases.
Hardly requires any change to existing code.
Almost zero overhead.
Cons:
Other clients will have to download the entire thing every time something changes (no change from the current situation)
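A rough sketch of this idea, assuming the server supports the MDTM extension (host, credentials and file names are placeholders):

import os
from ftplib import FTP

REMOTE = 'feed.xml'
LOCAL = 'feed.xml'
STAMP = 'feed.xml.mdtm'            # hypothetical file holding the last seen timestamp

ftp = FTP('ftp.example.com')       # hypothetical host
ftp.login('user', 'password')      # hypothetical credentials

# MDTM returns something like '213 20240101120000'
remote_stamp = ftp.sendcmd('MDTM ' + REMOTE).split()[1]

last_stamp = None
if os.path.exists(STAMP):
    with open(STAMP) as f:
        last_stamp = f.read().strip()

if remote_stamp != last_stamp:
    # Only download when the remote file actually changed.
    with open(LOCAL, 'wb') as f:
        ftp.retrbinary('RETR ' + REMOTE, f.write)
    with open(STAMP, 'w') as f:
        f.write(remote_stamp)

ftp.quit()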
Split up into several files
Split up your XML into several files. (One per product code?)
This way you only have to download the data that you actually need (a rough sketch of such a split follows the pros and cons below).
Pros:
Less data to transfer
Allows all scripts that access the data to only download what they need
Combinable with suggestion #1
Cons:
All existing code has to be adapted
Additional overhead when downloading or updating all the data
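A rough sketch of such a split, reusing the element names from your snippet (the output directory and file naming are just assumptions):

import os
import xml.etree.ElementTree as ET

tree = ET.parse("feed.xml")
root = tree.getroot()

if not os.path.isdir("products"):
    os.mkdir("products")

for child in root:
    code = child.find("Product_Code").text
    # One small file per product; the naming scheme is hypothetical.
    ET.ElementTree(child).write("products/%s.xml" % code)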
Switch to a delta-sync protocol
If the storage server supports it, switching to a delta-synchronization protocol like rsync would help a lot, because such protocols only transmit the changes (with little overhead).
Pros:
Less data transfer
Requires little change to existing code
Cons:
Might not be available
Do it remotely
You already pointed out that you can't, but it would still be the best solution.
What won't help
Switch to a network filesystem
As somebody in the comments already pointed out, switching to a network file system (like NFS or CIFS/SMB) would not really help, because you cannot actually change parts of the file unless the new data has the exact same length.
What to do
Unless you can do delta synchronization, I'd suggest implementing some caching on the client side first, and if that doesn't help enough, splitting up your files.

Beginner Python: Saving an excel file while it is open

I have a simple problem that I hope will have a simple solution.
I am writing Python (2.7) code using the xlwt package to write Excel files. The program takes data and writes it out to a file that is being saved constantly. The problem is that whenever I have the file open to check the data and Python tries to save the file, the program crashes.
Is there any way to make python save the file when I have it open for reading?
My experience is that sashkello is correct, Excel locks the file. Even OpenOffice/LibreOffice do this. They lock the file on disk and create a temp version as a working copy. ANY program trying to access the open file will be denied by the OS. The reason for this is because many corporations treat Excel files as databases but the users have no understanding of the issues involved in concurrency and synchronisation.
I am on Linux and I get this behaviour (at least when the file is on a SAMBA share). Look in the same directory as your file: if a file called .~lock.[filename]# exists, then you will be unable to read your file from another program. I'm not sure what enforces this lock, but I suspect it's an NTFS attribute. Note that even a simple cp or cat fails: cp: error reading ‘CATALOGUE.ods’: Input/output error
UPDATE: The actual locking mechanism appears to be "oplocks", a concept connected to Windows shares: http://oreilly.com/openbook/samba/book/ch05_05.html . If the share is managed by Samba, the workaround is to disable locks on certain file types, e.g.:
veto oplock files = /*.xlsx/
If you aren't using a share or NTFS on Linux, then I guess you should be able to read and write the file as long as your script has write permissions. By default only the user who created the file has write access.
WORKAROUND 2: The restriction only seems to apply if you have the file open in Excel/LO as writable, however LO at least allows you to open a file as read-only (Go to File -> Properties -> Security, set Read-Only, Save and re-open the file). I don't know if this will also make it RO for xlwt though.
Hah, funny I ran across your post. I actually just implemented this tonight.
The issue is that with these libraries an Excel file object is either for reading or for writing, not both; you cannot read and write off the same object. So if you have another way to save your data, please use it. I'm in a position where I don't have an option, and so might you.
You're going to need xlutils; it's the bread and butter for this.
Here's some example code:
import xlrd
from xlutils.copy import copy

wb_filename = 'example.xls'
wb_object = xlrd.open_workbook(wb_filename)
# And then you can read this file to your heart's content.

# Now when it comes to writing to this, you need to copy the object and work off that.
write_object = copy(wb_object)
# Write to it all you want and then save that object.
And that's it: if you read the original object, write to the copy, and then read the original again, it won't be updated. You either need to recreate wb_object or keep some sort of table in memory that you can track while working through it.
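To round the example off, here is a self-contained sketch of writing to the copy and saving it (the sheet index, cell coordinates and output name are arbitrary):

import xlrd
from xlutils.copy import copy

wb_object = xlrd.open_workbook('example.xls')
write_object = copy(wb_object)

sheet = write_object.get_sheet(0)        # first worksheet of the writable copy
sheet.write(0, 0, 'updated value')       # row 0, column 0
write_object.save('example_out.xls')     # write the modified copy to disk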

How to determine if a file has finished downloading in Python

I have a folder called raw_files. Very large files (~100GB files) from several sources will be uploaded to this folder.
I need to get file information from videos that have finished uploading to the folder. What is the best way to determine if a file is currently being downloaded to the folder (pass) or if the video has finished downloading (run the script)? Thank you.
The most reliable way is to modify the uploading software if you can.
A typical scheme would be to first upload each file into a temporary directory on the same filesystem, and move to the final location when the upload is finished. Such a "move" operation is cheap and atomic.
A variation on this theme is to upload each file under a temporary name (e.g. file.dat.incomplete instead of file.dat) and then rename it. Your script will simply need to skip files called *.incomplete.
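On the reader side that only takes a filter; a minimal sketch (the folder name is a placeholder):

import glob

# Only pick up files whose upload has finished, i.e. skip *.incomplete
finished = [p for p in glob.glob('raw_files/*') if not p.endswith('.incomplete')]
for path in finished:
    pass  # run the processing script on path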
If you check those files, store the size of each file somewhere. When you are in the next round and the file size is still the same, you can pretty much consider it finished (depending on how much time passes between the first and second check). The interval could, for example, be set to the timeout interval of your uploading service (FTP, whatever).
There is no special sign or content showing that a file is complete.
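A minimal sketch of that size-stability check (the folder and poll interval are placeholders):

import glob
import os
import time

FOLDER = 'raw_files'       # hypothetical upload folder
INTERVAL = 60              # seconds between checks; match it to your upload timeout

previous = {}
while True:
    for path in glob.glob(os.path.join(FOLDER, '*')):
        size = os.path.getsize(path)
        if previous.get(path) == size:
            print('%s looks complete (%d bytes)' % (path, size))
            # run the processing script here, then remember or remove the file
        previous[path] = size
    time.sleep(INTERVAL)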
