I wish to monitor the CSS of a few websites (these websites aren't my own) for changes and receive a notification of some sort when they change. If you could share any experience you've had with this and point me in the right direction on how to code it, I'd appreciate it greatly.
I'd like this script/app to notify a Slack group on change, which I assume will require a webhook.
Not asking for code, just any advice about particular APIs and other tools that may be of benefit.
I would suggest a modification of tschaefermedia's answer.
Crawl website for .css files, save.
Take an md5 of each file.
Then compare the md5 of the new file with that of the old file.
If the md5 is different then the file changed.
Below is a function to take md5 of large files.
import hashlib
import math
import os

def md5(file_name):
    # make an md5 hash object
    hash_md5 = hashlib.md5()
    # open the file as binary, read-only
    with open(file_name, 'rb') as f:
        i = 0
        # read 4096 bytes at a time and feed each chunk into the running hash
        # (b'' is the bytes sentinel that ends the iteration)
        for chunk in iter(lambda: f.read(4096), b''):
            i += 1
            # m.update(a); m.update(b) is equivalent to m.update(a+b)
            hash_md5.update(chunk)
    # sanity check: the number of chunks should match the file size
    file_size = os.path.getsize(file_name)
    expected_i = int(math.ceil(float(file_size) / 4096))
    assert i == expected_i
    return hash_md5.hexdigest()
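For example, here is a minimal sketch of how the comparison and the Slack notification could fit together. It assumes a hypothetical incoming-webhook URL, a plain dict as the hash cache, and the requests library; adapt the names to your setup.

import requests

# Hypothetical webhook URL -- create an incoming webhook in your Slack workspace
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def notify_slack(message):
    # Slack incoming webhooks accept a JSON payload with a "text" field
    requests.post(SLACK_WEBHOOK_URL, json={"text": message})

def check_for_change(css_path, hash_cache):
    # hash_cache maps file path -> last known md5 hex digest
    new_hash = md5(css_path)  # md5() as defined above
    old_hash = hash_cache.get(css_path)
    if old_hash is not None and old_hash != new_hash:
        notify_slack("CSS changed: %s" % css_path)
    hash_cache[css_path] = new_hash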
I would suggest using GitHub in your workflow. That gives you a good view of the changes and a way to revert to older versions.
One possible solution:
Crawl website for .css files, save change dates and/or filesize.
After each crawl, compare the information and, if changes are detected, use the Slack API to notify. I haven't worked with Slack yet, so for this part of the solution maybe someone else can give advice.
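As a rough sketch of that approach (assuming the list of stylesheet URLs is already known, and using the requests library), you could record each file's Last-Modified header and content length from a HEAD request and compare them on every run. Note that not every server sends Last-Modified, so the md5 approach above is more reliable.

import requests

def css_fingerprint(url):
    # HEAD avoids downloading the whole stylesheet
    response = requests.head(url, allow_redirects=True)
    return (response.headers.get("Last-Modified"),
            response.headers.get("Content-Length"))

def has_changed(url, previous_fingerprints):
    # previous_fingerprints maps url -> (last_modified, content_length)
    current = css_fingerprint(url)
    changed = previous_fingerprints.get(url) not in (None, current)
    previous_fingerprints[url] = current
    return changed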
I'm working on a PDF file dedupe project and have analyzed many libraries in Python that read files, generate a hash value, and then compare it with the next file for duplication - similar to the logic below, or using Python's filecmp lib. But the issue I found with this logic is that if a PDF is generated from a source DOCX (Save to PDF), the outputs are not considered duplicates - even when the content is exactly the same. Why does this happen? Is there any other logic to read the content and then create a unique hash value based on the actual content?
import hashlib

def calculate_hash_val(path, blocks=65536):
    hasher = hashlib.md5()
    with open(path, 'rb') as file:
        # read the file in blocks so large files do not need to fit in memory
        data = file.read(blocks)
        while len(data) > 0:
            hasher.update(data)
            data = file.read(blocks)
    return hasher.hexdigest()
One of the things that happens is that you save metadata to the file including the time of creation. It is invisible in the PDF, but that will make the hash different.
Here is an explanation of how to find and strip out that data with at least one tool. I am sure that there are many others.
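If you want a digest based on the visible content rather than the raw bytes, another option is to extract the text first and hash that. A minimal sketch using the third-party pypdf library (assuming plain text extraction is good enough for your documents; it ignores images and exact layout):

import hashlib
from pypdf import PdfReader  # pip install pypdf

def content_hash(pdf_path):
    # Hash the extracted text instead of the raw file bytes, so metadata
    # such as creation timestamps does not change the digest.
    reader = PdfReader(pdf_path)
    hasher = hashlib.md5()
    for page in reader.pages:
        hasher.update((page.extract_text() or "").encode("utf-8"))
    return hasher.hexdigest()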
I used the ftputil module for this, but ran into a problem: it doesn't support opening a file in 'a' (append) mode, and if you write via 'w' it overwrites the contents.
This is what I tried, and where I'm stuck:
with ftputil.FTPHost(host, ftp_user, ftp_pass) as ftp_host:
    with ftp_host.open("my_path_to_file_on_the_server", "a") as fobj:
        cupone_wr = input('Enter coupons with a space: ')
        cupone_wr = cupone_wr.split(' ')
        for x in range(0, len(cupone_wr)):
            cupone_str = '<p>Your coupon %s</p>\n' % cupone_wr[x]
            data = fobj.write(cupone_str)
            print(data)
The goal is to leave the old entries in the file and add fresh entries to the end of the file every time the script is called again.
Indeed, ftputil does not support appending. So either you will have to download the complete file and re-upload it with the appended records, or you will have to use another FTP library.
For example the built-in Python ftplib supports appending. On the other hand, it does not (at least not easily) support streaming. Instead, it's easier to construct the new records in-memory and upload/append them at once:
from ftplib import FTP
from io import BytesIO

flo = BytesIO()

cupone_wr = input('Enter coupons with a space: ')
cupone_wr = cupone_wr.split(' ')
for x in range(0, len(cupone_wr)):
    cupone_str = '<p>Your coupon %s</p>\n' % cupone_wr[x]
    # BytesIO expects bytes, so encode the string before writing
    flo.write(cupone_str.encode('utf-8'))

ftp = FTP('ftp.example.com', 'username', 'password')
flo.seek(0)
ftp.storbinary('APPE my_path_to_file_on_the_server', flo)
ftputil author here :-)
Martin is correct in that there's no explicit append mode. That said, you can open file-like objects with a rest argument. In your case, rest would need to be the original length of the file you want to append to.
The documentation warns against using a rest argument that points after the file because I'm quite sure rest isn't expected to be used that way. However, if you use your program only against a specific server and can verify its behavior, it might be worthwhile to experiment with rest. I'd be interested whether it works for you.
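For completeness, here is a rough, untested sketch of that rest idea (my own assumption of how it would look; whether your server accepts a restart point at the end of the file for an upload is exactly what you would need to verify):

import ftputil

with ftputil.FTPHost(host, ftp_user, ftp_pass) as ftp_host:
    remote_path = "my_path_to_file_on_the_server"
    # length of the existing remote file; writing should resume there
    offset = ftp_host.path.getsize(remote_path)
    with ftp_host.open(remote_path, "wb", rest=offset) as fobj:
        for coupon in input('Enter coupons with a space: ').split(' '):
            fobj.write(('<p>Your coupon %s</p>\n' % coupon).encode('utf-8'))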
I'm trying to write a script in Python for sorting through files (photos, videos), checking the metadata of each, and finding and moving all duplicates to a separate directory. Got stuck with the metadata checking part. Tried os.stat - it doesn't return True for duplicate files. Ideally, I should be able to do something like:
if os.stat("original.jpg")== os.stat("duplicate.jpg"):
shutil.copy("duplicate.jpg","C:\\Duplicate Folder")
Pointers anyone?
There are a few things you can do. You can compare the contents or hash of each file, or you can check a few select properties from the os.stat result, e.g.:
import os

def is_duplicate(file1, file2):
    stat1, stat2 = os.stat(file1), os.stat(file2)
    return stat1.st_size == stat2.st_size and stat1.st_mtime == stat2.st_mtime
A basic loop using a set to keep track of already encountered files:
import glob
import hashlib
uniq = set()
for fname in glob.glob('*.txt'):
    with open(fname, "rb") as f:
        sig = hashlib.sha256(f.read()).digest()
    if sig not in uniq:
        uniq.add(sig)
        print(fname)
    else:
        print(fname, "(duplicate)")
Please note that, as with any hash function, there is a slight chance of collision, that is, two different files having the same digest. Depending on your needs, this may or may not be acceptable.
According to Thomas Pornin in another answer:
"For instance, with SHA-256 (n=256) and one billion messages (p=10^9) the probability [of collision] is about 4.3*10^-60."
Given your needs, if you have to check for additional properties in order to identify "true" duplicates, change the sig = ... line to whatever suits you. For example, if you need to check for "same content" and "same owner" (st_uid as returned by os.stat()), write:
sig = ( hashlib.sha256(f.read()).digest(),
os.stat(fname).st_uid )
If two files have the same md5 they are exact duplicates.
from hashlib import md5

# open in binary mode so the raw bytes are hashed
with open(file1, "rb") as original:
    original_md5 = md5(original.read()).hexdigest()
with open(file2, "rb") as duplicate:
    duplicate_md5 = md5(duplicate.read()).hexdigest()

if original_md5 == duplicate_md5:
    do_stuff()
Note that since your example uses jpg files (binary data), open is called with its second argument equal to "rb"; for details see the documentation for open.
os.stat offers information about a file's metadata and features, including the creation time. That is not a good approach for finding out whether two files are the same.
For instance: two files can be identical and still have different creation times. Hence, comparing stats will fail here. Sylvain Leroux's approach is the best one when combining performance and accuracy, since it is very rare for two different files to have the same hash.
So, unless you have an incredibly large amount of data and a repeated file would cause a system fatality, this is the way to go.
If that is your case (it doesn't seem to be), well... the only way you can be 100% sure two files are the same is to iterate over them and perform a byte-by-byte comparison.
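If you ever do need that certainty, the standard library can do the byte-by-byte comparison for you; for example, a small sketch using filecmp (file names taken from the question):

import filecmp
import shutil

# shallow=False forces a full content comparison instead of
# only comparing the os.stat() signatures
if filecmp.cmp("original.jpg", "duplicate.jpg", shallow=False):
    shutil.copy("duplicate.jpg", "C:\\Duplicate Folder")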
I want to get the size of an uploaded image to check whether it is greater than the max file upload limit. I tried this:
#app.route("/new/photo",methods=["POST"])
def newPhoto():
form_photo = request.files['post-photo']
print form_photo.content_length
It printed 0. What am I doing wrong? Should I find the size of this image from its temp path? Is there anything like PHP's $_FILES['foo']['size'] in Python?
There are a few things to be aware of here - the content_length property will be the content length of the file upload as reported by the browser, but unfortunately many browsers don't send this, as noted in the docs and source.
As for your TypeError, the next thing to be aware of is that file uploads under 500KB are stored in memory as a StringIO object, rather than spooled to disk (see those docs again), so your stat call will fail.
MAX_CONTENT_LENGTH is the correct way to reject file uploads larger than you want, and if you need it, the only reliable way to determine the length of the data is to figure it out after you've handled the upload - either stat the file after you've .save()d it:
request.files['file'].save('/tmp/foo')
size = os.stat('/tmp/foo').st_size
Or if you're not using the disk (for example storing it in a database), count the bytes you've read:
blob = request.files['file'].read()
size = len(blob)
Though obviously, be careful that you're not reading too much data into memory if your MAX_CONTENT_LENGTH is very large.
If you don't want to save the file to disk first, use the following code; it works on an in-memory stream:
import os
file = request.files['file']
# os.SEEK_END == 2
# seek() return the new absolute position
file_length = file.seek(0, os.SEEK_END)
# also can use tell() to get current position
# file_length = file.tell()
# seek back to start position of stream,
# otherwise save() will write a 0 byte file
# os.SEEK_SET == 0
file.seek(0, os.SEEK_SET)
Otherwise, this will be better:
request.files['file'].save('/tmp/file')
file_length = os.stat('/tmp/file').st_size
The proper way to set a max file upload limit is via the MAX_CONTENT_LENGTH app configuration. For example, if you wanted to set an upload limit of 16 megabytes, you would do the following to your app configuration:
app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024
If the uploaded file is too large, Flask will automatically return status code 413 Request Entity Too Large - this should be handled on the client side.
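If you also want the server itself to return a friendlier response than the default 413 page, here is a minimal sketch of an error handler (assuming the existing app object from the question; the message text is just an example):

from flask import jsonify

app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024

@app.errorhandler(413)
def handle_file_too_large(error):
    # Flask aborts with 413 when the request body exceeds MAX_CONTENT_LENGTH
    return jsonify(error="File is larger than the 16 MB limit"), 413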
The following section of code should meet your purpose:
form_photo.seek(0, 2)      # move to the end of the stream (os.SEEK_END)
size = form_photo.tell()   # position at the end == total size in bytes
form_photo.seek(0)         # rewind so the file can still be read or saved
As someone else already suggested, you should use
app.config['MAX_CONTENT_LENGTH']
to restrict file sizes. But since you specifically want to find out the image size, you can do:
import os

# os.stat needs a path on disk, so save the upload first (example path)
photo = request.files['post-photo']
photo.save('/tmp/post-photo')
photo_size = os.stat('/tmp/post-photo').st_size
print(photo_size)
You can use popen from the os module.
Save it first:
photo = request.files['post-photo']
photo.save('tmp')
Now, just get the size:
os.popen('ls -l tmp | cut -d " " -f5').read()
This is in bytes.
For megabytes or gigabytes, use ls with --block-size=M or --block-size=G.
Is there any library to show progress when adding files to a tar archive in Python, or alternatively would it be possible to extend the functionality of the tarfile module to do this?
In an ideal world I would like to show the overall progress of the tar creation as well as an ETA as to when it will be complete.
Any help on this would be really appreciated.
Unfortunately it doesn't look like there is an easy way to get byte by byte numbers.
Are you adding really large files to this tar file? If not, I would update progress on a file-by-file basis so that as files are added to the tar, the progress is updated based on the size of each file.
Supposing that all your filenames are in the variable toadd and tar is a TarFile object, how about:

import os
import sys

# total number of bytes that will be added
total = sum(os.path.getsize(name) for name in toadd)

complete = 0.0
for name in toadd:
    sys.stdout.write("\rPercent Complete: {0:3.0f}%".format(complete))
    sys.stdout.flush()
    tar.add(name)
    complete += os.path.getsize(name) / total * 100
sys.stdout.write("\rPercent Complete: {0:3.0f}%\n".format(complete))
sys.stdout.write("Job Done!\n")
Find or write a file-like object that wraps a real file and provides progress reporting, and pass it to TarFile.addfile() so that you can know how many bytes have been requested for inclusion in the archive. You may have to use/implement throttling in case TarFile tries to read the whole file at once.
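A rough sketch of that wrapper idea (the class and file names here are made up for illustration; it simply counts bytes as tarfile reads them through addfile):

import os
import sys
import tarfile

class ProgressFile:
    """File-like wrapper that reports how many bytes have been read so far."""

    def __init__(self, path, callback):
        self._file = open(path, 'rb')
        self._callback = callback
        self._read_so_far = 0

    def read(self, size=-1):
        data = self._file.read(size)
        self._read_so_far += len(data)
        self._callback(self._read_so_far)
        return data

    def close(self):
        self._file.close()

# example: add one file and print how much of it has been archived
path = "some_large_file.bin"  # hypothetical file name
total_size = os.path.getsize(path)

def report(done):
    sys.stdout.write("\r%d / %d bytes" % (done, total_size))
    sys.stdout.flush()

with tarfile.open("archive.tar", "w") as tar:
    info = tar.gettarinfo(path)
    progress_file = ProgressFile(path, report)
    tar.addfile(info, fileobj=progress_file)
    progress_file.close()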
I have recently written a wrapper library that provides a progress callback. Have a look at it on GitHub:
https://github.com/thomaspurchas/tarfile-Progress-Reporter
Feel free to ask for help if you need it.
Seems like you can use the filter parameter of tarfile.add()
with tarfile.open(<tarball path>, 'w') as tarball:
    tarball.add(<some file>, filter=my_filter)

def my_filter(tarinfo):
    # increment some count
    # add tarinfo.size to some byte counter
    return tarinfo
All information you can get from a TarInfo object is available to you.
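For instance, a small sketch of a filter callback that keeps a running byte total as members are added (the directory name and the global counter are just for illustration):

import tarfile

bytes_added = 0

def my_filter(tarinfo):
    # called once per member; tarinfo.size is the member's size in bytes
    global bytes_added
    bytes_added += tarinfo.size
    print("added %d bytes so far (%s)" % (bytes_added, tarinfo.name))
    return tarinfo

with tarfile.open("archive.tar", "w") as tarball:
    tarball.add("some_directory", filter=my_filter)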
How are you adding files to the tar file? Is it through "add" with recursive=True? You could build the list of files yourself and call "add" one-by-one, showing the progress as you go. If you're building from a stream/file then it looks like you could wrap that fileobj to see the read status and pass that into addfile.
It does not look like you will need to modify tarfile.py at all.