How to download part of a file over SFTP connection? - python

So I have a Python program that pulls access logs from remote servers and processes them. There are separate log files for each day. The files on the servers are in this format:
access.log
access.log-20130715
access.log-20130717
The file "access.log" is the log file for the current day, and is modified throughout the day with new data. The files with the timestamp appended are archived log files, and are not modified. If any of the files in the directory are ever modified, it is either because (1) data is being added to the "access.log" file, or (2) the "access.log" file is being archived, and an empty file takes its place. Every minute or so, my program checks for the most recent modification time of any files in the directory, and if it changes it pulls down the "access.log" file and any newly archived files
All of this currently works fine. However, if a lot of data is added to the log file throughout the day, downloading the whole thing over and over just to get some of the data at the end of the file will create a lot of traffic on the network, and I would like to avoid that. Is there any way to only download a part of the file? If I have already processed, say 1 GB of the file, and another 500 bytes suddenly get added to the log file, is there a way to only download the 500 bytes at the end?
I am using Python 3.2, my local machine is running Windows, and the remote servers all run Linux. I am using Chilkat for making SSH and SFTP connections. Any help would be greatly appreciated!

Call ResumeDownloadFileByName. Here's the description of the method in the Chilkat reference documentation:
Resumes an SFTP download. The size of the localFilePath is checked and
the download begins at the appropriate position in the remoteFilePath.
If localFilePath is empty or non-existent, then this method is
identical to DownloadFileByName. If the localFilePath is already fully
downloaded, then no additional data is downloaded and the method will
return True.
See http://www.chilkatsoft.com/refdoc/pythonCkSFtpRef.html
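A minimal sketch of that approach with the Chilkat Python API might look like the following; the host, credentials, and paths are placeholders, and you should double-check the exact argument order of ResumeDownloadFileByName against the reference page linked above:

import chilkat

sftp = chilkat.CkSFtp()
sftp.UnlockComponent("anything for trial mode")  # or your unlock code

connected = (sftp.Connect("server.example.com", 22)      # placeholder host/port
             and sftp.AuthenticatePw("username", "password")
             and sftp.InitializeSftp())
if not connected:
    raise RuntimeError("SFTP connection failed")

# Only the bytes beyond the current size of the local copy are transferred;
# if the local file does not exist, the whole remote file is downloaded.
if not sftp.ResumeDownloadFileByName("/var/log/httpd/access.log",
                                     r"C:\logs\access.log"):
    raise RuntimeError("resume download failed")

Because the archived files never change, only "access.log" itself needs the resumed download; the newly archived files can still be fetched once with DownloadFileByName.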

You could do that, or you could massively reduce your complexity by splitting the latest log file down into hours, or tens of minutes.

Related

How do I transfer files from s3 to my ec2 instance whenever I add a new file to s3?

I have a py script which is in my ec2 instance. That requires a video file as input which is in an S3 bucket. How do I automate the process where the ec2 instance starts running every time a new file is added to that bucket? I want the ec2 instance to recognize this new file and then add it to its local directory where the py script can use it and process it and create the output file. I want to then send this output file back to the bucket and store it there.
I know the boto3 library is used to connect S3 and EC2; however, I am unclear how to trigger this automatically and look for new files without having to manually start my instance and copy everything.
Edit:
I have a python program which basically takes a video file (mp4), breaks it into frames, and stitches them together to create a bunch of small panorama images, storing them in a folder named 'Output'. As the program needs a video as input, in the program I refer to a particular directory from which it is supposed to pick up the mp4 file and read it as input. What I now want is this: there is going to be an S3 bucket that receives a video file from elsewhere; it is going to be inside a folder inside a particular bucket. I want any new mp4 file entering that bucket to be copied or sent to the input directory on my instance. Also, when this happens, I want the python program stored on that instance to be executed automatically, find this new video file in the input directory, process it into the small panoramas, and then store them in the output directory or, even better, send them to an output folder in the same S3 bucket.
There are many ways you could design a solution for this. They will vary depending on how often you get new videos, whether the solution needs to be scalable and fault tolerant, how many videos you want to process in parallel, and more. I will provide just one, on the assumption that new videos are uploaded only occasionally and that no auto-scaling groups are needed to process a large number of videos at the same time.
On the above assumption, one way could be as follows:
Upload of a new video triggers a lambda function using S3 event notifications.
Lambda gets the video details (e.g. the S3 path) from the S3 event, submits them to an SQS queue, and starts your instance.
Your application on the instance, once started, polls the SQS queue for the details of the video file to process. This requires your application to be designed so that it runs at instance start, which can be done using modified user data, systemd unit files, and more.
It's a very basic solution, and as I mentioned, many other approaches are possible, involving Auto Scaling groups, scaling policies based on SQS queue size, SSM Run Command, and more.
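A minimal sketch of such a Lambda handler, using boto3, might look like this; the queue URL and instance ID are placeholder environment variables, and error handling is omitted:

import json
import os
import boto3

sqs = boto3.client("sqs")
ec2 = boto3.client("ec2")

QUEUE_URL = os.environ["QUEUE_URL"]        # placeholder: your SQS queue URL
INSTANCE_ID = os.environ["INSTANCE_ID"]    # placeholder: your EC2 instance id

def handler(event, context):
    # S3 event notifications deliver one or more records per invocation.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Hand the video's S3 path to the queue for the instance to pick up.
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"bucket": bucket, "key": key}),
        )
    # Start the processing instance; this is a no-op if it is already running.
    ec2.start_instances(InstanceIds=[INSTANCE_ID])

On the instance side, your startup script would receive messages from the same queue (sqs.receive_message), download the object with s3.download_file, run the panorama program, and upload the results back with s3.upload_file.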

How to solve OdbError in Abaqus Python script?

I am running a 3D solid model in an Abaqus Python script, which is supposed to be analyzed 200 times, as the model is wrapped in a for loop (for i in range(0,199):). Sometimes I receive the following error and the analysis terminates. I can't figure out the reason.
Odb_0 = session.openOdb(name='Job-1' + '.odb')
OdbError: The .lck file for the output database D:/abaqus/Model/Job-1.odb indicates that the analysis Input File Processor is currently modifying the database. The database cannot be opened at this time.
Note that all the variables, including "Odb_0" and so on, are deleted at the end of each loop iteration before the next one starts.
I don't believe your problem will be helped by a change in element type.
The message and the .lck file say that there's an access deadlock in the database. The output file lost out and cannot update the .odb database.
I'm not sure what database Abaqus uses. I would have guessed that the input stream would have scanned the input file and written whatever records were necessary to the database before the solution and output processing began.
From the Abaqus documentation:
The lock file (job_name.lck) is written whenever an output database file is opened with write access, including when an analysis is running and writing output to an output database file. The lock file prevents you from having simultaneous write permission to the output database from multiple sources. It is deleted automatically when the output database file is closed or when the analysis that creates it ends.
When deleting your previous analysis, you should be sure that all processes connected with that simulation have terminated. There are several possibilities for doing so:
Launching the simulation through subprocess.Popen could give you much more control over the process (e.g. waiting until it ends, writing a specific log, etc.);
Naming your simulations differently (e.g. 'Job-1', 'Job-2', etc.) and deleting old ones with a delay (e.g. deleting 'Job-1' while 'Job-3' has started);
Less preferable: using the time module to wait before reopening the output database (a minimal sketch is shown below).
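As a minimal sketch of combining the lock-file check with the time module (the odb path is taken from the error message in the question, and the polling interval is arbitrary):

import os
import time

odb_path = 'D:/abaqus/Model/Job-1.odb'
lck_path = odb_path.replace('.odb', '.lck')

# Wait until the analysis has released the output database before opening it.
while os.path.exists(lck_path):
    time.sleep(10)

Odb_0 = session.openOdb(name=odb_path)

This does not remove the underlying race (the safest route is still waiting on the solver process itself, e.g. via subprocess.Popen(...).wait()), but it avoids calling openOdb while the lock file is present.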

AWS S3 continuous byte stream upload?

I am doing some batch processing and occasionally end up with a corrupt line of (string) data. I would like to upload these to an S3 file.
Now, I would really like to add all the lines to a single file and upload it all at once after my script finishes executing, but my client asked me to use a socket connection instead and add each line one by one as they come up, simulating a single slow upload.
It sounds like he's done this before, but I couldn't find any reference for anything like it (not talking about multi-part uploads). Has anyone done something like this before?

rsync - write delta (new data) to new file

I have a growing web server log file that rotates once a week, and I need to feed 1 hour's worth of logs to some python script. So, rsync is the solution, right?
I know that rsync would transfer (add) only the new (changed) data in the file. But how do I make it write the new lines from the remote log (from the last time it did so, an hour ago) to a separate file for inspection?
So, the difference from the normal behaviour is that it would not append the new (changed) lines to the local file, but instead write just the difference to a separate one.

Python to check if file status is being uploading

Python 2.6
My script needs to monitor some 1G files on the ftp; whenever one is changed/modified, the script downloads it to another place. The file names remain unchanged: people delete the original file on the ftp first, then upload a newer version. My script checks file metadata like file size and date modified to see if there is any difference.
The question is that when the script checks the metadata, the new file may still be being uploaded. How do I handle this situation? Is there any file attribute that indicates upload status (like the file being locked)? Thanks.
There is no such attribute. You may be unable to GET such a file, but it depends on the server software. Also, file access flags may be set one way while the file is being uploaded and then changed when the upload is complete; or the incomplete file may have a modified name (e.g. original_filename.ext.part) -- it all depends on the server-side software used for the upload.
If you control the server, make your own metadata, e.g. create an empty flag file alongside the newly uploaded file when upload is finished.
In the general case, I'm afraid, the best you can do is monitor file size and consider the file completely uploaded if its size is not changing for a while. Make this interval sufficiently large (on the order of minutes).
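A minimal sketch of that heuristic with the standard ftplib module; the host, credentials, file name, and polling interval are placeholders:

import time
from ftplib import FTP

def wait_until_stable(ftp, filename, interval=60, required_checks=3):
    # Block until the reported size stays the same for several consecutive checks.
    last_size = None
    stable = 0
    while stable < required_checks:
        time.sleep(interval)
        size = ftp.size(filename)        # FTP SIZE command
        if size is not None and size == last_size:
            stable += 1
        else:
            stable = 0
        last_size = size

ftp = FTP('ftp.example.com')             # placeholder host
ftp.login('user', 'password')            # placeholder credentials
ftp.voidcmd('TYPE I')                    # some servers only answer SIZE in binary mode
wait_until_stable(ftp, 'bigfile.dat')    # placeholder file name
# The file is now unlikely to still be mid-upload; download it here.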
Your question leaves out a few details, but I'll try to answer.
If you're running your status checker program on the same server that's running ftp:
1) Depending on your operating system: if you're using Linux and you've built inotify into your kernel, you could use pyinotify to watch your upload directory -- inotify distinguishes between open, modify, and close events and lets you watch filesystem events asynchronously, so you're not polling constantly (a minimal sketch is shown after this list). OSX and Windows both have similar but differently implemented facilities.
2) You could pythonically tail -f to see when a new file is put on the server (if you're even logging that) and just update when you see related update messages.
If you're running your program remotely:
3) If your status checking utility has to run on a remote host from the FTP server, you'd have to poll the file for status and build in some logic to detect size changes. You can use the FTP 'SIZE' command for this, which returns an easily parseable string.
You'd have to put some logic into it such that if the file size gets smaller you assume it's being replaced, and then wait for it to grow until it stops growing and stays the same size for some duration. If the archive is compressed in a way that lets you verify the sum, you could then download it, checksum it, and then re-upload to the remote site.
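For option 1) above (watching the upload directory on the FTP server itself), a minimal pyinotify sketch could look like this; the watched directory is a placeholder, and IN_CLOSE_WRITE fires when the server closes the file it was writing, i.e. when the upload finishes:

import pyinotify

WATCH_DIR = '/srv/ftp/uploads'           # placeholder upload directory

class UploadHandler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # The FTP server closed the file, so the upload has completed.
        print("upload finished: %s" % event.pathname)

wm = pyinotify.WatchManager()
wm.add_watch(WATCH_DIR, pyinotify.IN_CLOSE_WRITE)
notifier = pyinotify.Notifier(wm, UploadHandler())
notifier.loop()                          # blocks and dispatches events as they arrive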
