I'm working on a script that autonomously scrapes historical data from several websites and saves it to the same Excel file for each past date within a specified date range. Each function accesses several webpages from a different website, formats the data, and writes it to a separate sheet of that file. Because I am continuously making requests to these sites, I make sure to add ample sleep time between requests. Instead of running these functions one after another, is there a way I could run them together?
I want to make one request with Function 1, then one request with Function 2, and so on until every function has made one request. Once they all have, I would like the script to loop back and make the second request within each function (and so on) until all requests for a given date are complete. Doing this would keep the same amount of sleep time between requests to each website while greatly reducing the total run time. One point to note is that each function makes a slightly different number of HTTP requests. For instance, Function 1 may make 10 requests on a given date while Function 2 makes 8, Function 3 makes 8, Function 4 makes 7, and Function 5 makes 10.
I've read up on this topic, including multithreading, but I'm unsure how to apply it to my specific scenario. If there is no way to do this, I could run each function as its own script and run them all at the same time, but then I would have to merge five separate Excel files for each date, which is why I am trying to do it this way.
import random
import time

import pandas as pd

start_date = 'YYYY-MM-DD'
end_date = 'YYYY-MM-DD'
idx = pd.date_range(start_date, end_date)
date_range = [d.strftime('%Y-%m-%d') for d in idx]

max_retries_min_sleeptime = 300
max_retries_max_sleeptime = 600
min_sleeptime = 150
max_sleeptime = 250

for date in date_range:
    writer = pd.ExcelWriter('Daily Data -' + date + '.xlsx')
    Function1()
    Function2()
    Function3()
    Function4()
    Function5()
    writer.save()
    print('Date Complete: ' + date)
    time.sleep(random.randrange(min_sleeptime, max_sleeptime, 1))
Using Python 3.6.
Here is a minimal example of concurrent requests with aiohttp to get you started (docs). This example runs 3 downloaders at the same time, appending each response to responses. I believe you will be able to adapt this idea to your problem.
import asyncio
from aiohttp.client import ClientSession


async def downloader(session, iter_url, responses):
    while True:
        try:
            url = next(iter_url)
        except StopIteration:
            return
        rsp = await session.get(url)
        if rsp.status != 200:
            continue  # <- or raise an error
        responses.append(rsp)


async def run(urls, responses):
    async with ClientSession() as session:
        # The downloaders share one iterator, so each URL is fetched exactly once.
        iter_url = iter(urls)
        await asyncio.gather(*[downloader(session, iter_url, responses) for _ in range(3)])


urls = [
    'https://stackoverflow.com/questions/tagged/python',
    'https://aiohttp.readthedocs.io/en/stable/',
    'https://docs.python.org/3/library/asyncio.html'
]

responses = []
loop = asyncio.get_event_loop()
loop.run_until_complete(run(urls, responses))
Result:
>>> responses
[<ClientResponse(https://docs.python.org/3/library/asyncio.html) [200 OK]>
<CIMultiDictProxy('Server': 'nginx', 'Content-Type': 'text/html', 'Last-Modified': 'Sun, 28 Jan 2018 05:08:54 GMT', 'ETag': '"5a6d5ae6-6eae"', 'X-Clacks-Overhead': 'GNU Terry Pratchett', 'Strict-Transport-Security': 'max-age=315360000; includeSubDomains; preload', 'Via': '1.1 varnish', 'Fastly-Debug-Digest': '79eb68156ce083411371cd4dbd0cb190201edfeb12e5d1a8a1e273cc2c8d0e41', 'Content-Length': '28334', 'Accept-Ranges': 'bytes', 'Date': 'Sun, 28 Jan 2018 23:48:17 GMT', 'Via': '1.1 varnish', 'Age': '66775', 'Connection': 'keep-alive', 'X-Served-By': 'cache-iad2140-IAD, cache-mel6520-MEL', 'X-Cache': 'HIT, HIT', 'X-Cache-Hits': '1, 1', 'X-Timer': 'S1517183297.337465,VS0,VE1')>
, <ClientResponse(https://stackoverflow.com/questions/tagged/python) [200 OK]>
<CIMultiDictProxy('Content-Type': 'text/html; charset=utf-8', 'Content-Encoding': 'gzip', 'X-Frame-Options': 'SAMEORIGIN', 'X-Request-Guid': '3fb98f74-2a89-497d-8d43-322f9a202775', 'Strict-Transport-Security': 'max-age=15552000', 'Content-Length': '23775', 'Accept-Ranges': 'bytes', 'Date': 'Sun, 28 Jan 2018 23:48:17 GMT', 'Via': '1.1 varnish', 'Age': '0', 'Connection': 'keep-alive', 'X-Served-By': 'cache-mel6520-MEL', 'X-Cache': 'MISS', 'X-Cache-Hits': '0', 'X-Timer': 'S1517183297.107658,VS0,VE265', 'Vary': 'Accept-Encoding,Fastly-SSL', 'X-DNS-Prefetch-Control': 'off', 'Set-Cookie': 'prov=8edb36d8-8c63-bdd5-8d56-19bf14916c93; domain=.stackoverflow.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly', 'Cache-Control': 'private')>
, <ClientResponse(https://aiohttp.readthedocs.io/en/stable/) [200 OK]>
<CIMultiDictProxy('Server': 'nginx/1.10.3 (Ubuntu)', 'Date': 'Sun, 28 Jan 2018 23:48:18 GMT', 'Content-Type': 'text/html', 'Last-Modified': 'Wed, 17 Jan 2018 08:45:22 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'ETag': 'W/"5a5f0d22-578a"', 'X-Subdomain-TryFiles': 'True', 'X-Served': 'Nginx', 'X-Deity': 'web01', 'Content-Encoding': 'gzip')>
]
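To keep the per-site delay from the question, you could run one worker per website instead of three workers sharing one URL list, and let each worker sleep between its own requests without blocking the others. A minimal sketch, assuming a hypothetical site_worker coroutine per website and the question's min_sleeptime/max_sleeptime values:
import asyncio
import random

async def site_worker(session, urls, min_sleeptime, max_sleeptime):
    """Hypothetical per-website coroutine: fetch that site's URLs in order."""
    pages = []
    for url in urls:
        async with session.get(url) as rsp:
            pages.append(await rsp.text())
        # Only this site's worker pauses; the other sites keep making progress.
        await asyncio.sleep(random.randrange(min_sleeptime, max_sleeptime))
    return pages
Gathering five of these with asyncio.gather(), one per website, interleaves the requests roughly the way the question describes while each site still sees the full sleep interval between its own requests.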
Here is a minimal example demonstrating how to use concurrent.futures for parallel processing. It does not include the actual scraping logic, which you can add yourself, but it demonstrates the pattern to follow:
from concurrent import futures
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def scrape_func(*args, **kwargs):
    """Stub function to use with futures - your scraping logic goes here."""
    print("Do something in parallel")
    return "result scraped"


def main():
    start_date = 'YYYY-MM-DD'
    end_date = 'YYYY-MM-DD'
    idx = pd.date_range(start_date, end_date)
    date_range = [d.strftime('%Y-%m-%d') for d in idx]
    max_retries_min_sleeptime = 300
    max_retries_max_sleeptime = 600
    min_sleeptime = 150
    max_sleeptime = 250

    # The important part - concurrent futures
    # - set the number of workers to the number of jobs to process
    with ThreadPoolExecutor(len(date_range)) as executor:
        # Use the list jobs for the submitted futures
        # Use the list scraped_results for the results
        jobs = []
        scraped_results = []
        for date in date_range:
            # Pass some keyword arguments if needed - per job
            kw = {"some_param": "value"}
            # Here we iterate 'number of dates' times; it could be different
            # We're adding scrape_func; it could be a different function per call
            jobs.append(executor.submit(scrape_func, **kw))

        # Once parallel processing is complete, iterate over the results
        for job in futures.as_completed(jobs):
            # Read the result from the future
            scraped_result = job.result()
            # Append to the list of results
            scraped_results.append(scraped_result)

    # Iterate over the scraped results and do whatever is needed
    for result in scraped_results:
        print("Do something with me {}".format(result))


if __name__ == "__main__":
    main()
As mentioned, this just demonstrates the pattern to follow; the rest should be straightforward.
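For the question's concrete case, here is a hedged sketch of how the pattern might wrap the existing functions, one worker per website per date, reusing the names from the question (Function1..Function5, date_range, pd). It assumes each function keeps its own sleeps and writes only its own sheet; if the shared ExcelWriter turns out not to be thread-safe, have the functions return DataFrames and write them after the futures complete.
from concurrent.futures import ThreadPoolExecutor, as_completed

site_functions = [Function1, Function2, Function3, Function4, Function5]

for date in date_range:
    writer = pd.ExcelWriter('Daily Data -' + date + '.xlsx')
    # One worker per website: all five sites are scraped at once for this date,
    # and each function still sleeps between its own requests.
    with ThreadPoolExecutor(max_workers=len(site_functions)) as executor:
        jobs = [executor.submit(func) for func in site_functions]
        for job in as_completed(jobs):
            job.result()  # re-raises any exception from the scraping thread
    writer.save()
    print('Date Complete: ' + date)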
Thanks for the responses, guys! As it turns out, a pretty simple block of code from this other question (Make 2 functions run at the same time) seems to do what I want.
from threading import Thread


def func1():
    print('Working')


def func2():
    print('Working')


if __name__ == '__main__':
    Thread(target=func1).start()
    Thread(target=func2).start()
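For what it's worth, here is a hedged sketch of how that snippet could be fitted into the original per-date loop. It assumes each FunctionN can be changed to return a DataFrame for its sheet instead of writing to the shared writer directly, so the Excel writes stay in the main thread after join(); the sheet names are placeholders.
from threading import Thread

def run_date(writer):
    results = {}

    def worker(sheet_name, func):
        # Each FunctionN keeps its own sleeps between requests and returns a DataFrame.
        results[sheet_name] = func()

    threads = [Thread(target=worker, args=(name, func)) for name, func in
               [('Site1', Function1), ('Site2', Function2), ('Site3', Function3),
                ('Site4', Function4), ('Site5', Function5)]]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every site to finish before touching the workbook

    for sheet_name, df in results.items():
        df.to_excel(writer, sheet_name=sheet_name)
Inside the question's loop this would replace the five direct calls: run_date(writer), then writer.save().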
I'm getting this message when I'm trying to test my Python 3.8 Lambda function:
Logs are:
soc-connect
contacts.csv
{'ResponseMetadata': {'RequestId': '9D7D7F0C5CB79984', 'HostId': 'wOd6HvIm+BpLOMKF2beRvqLiW0NQt5mK/kzjCjYxQ2kHQZY0MRCtGs3l/rqo4o0r4xAPuV1QpGM=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': 'wOd6HvIm+BpLOMKF2beRvqLiW0NQt5mK/kzjCjYxQ2kHQZY0MRCtGs3l/rqo4o0r4xAPuV1QpGM=', 'x-amz-request-id': '9D7D7F0C5CB79984', 'date': 'Thu, 26 Mar 2020 11:21:35 GMT', 'last-modified': 'Tue, 24 Mar 2020 16:07:30 GMT', 'etag': '"8a3785e750475af3ca25fa7eab159dab"', 'accept-ranges': 'bytes', 'content-type': 'text/csv', 'content-length': '52522', 'server': 'AmazonS3'}, 'RetryAttempts': 0}, 'AcceptRanges': 'bytes', 'LastModified': datetime.datetime(2020, 3, 24, 16, 7, 30, tzinfo=tzutc()), 'ContentLength': 52522, 'ETag': '"8a3785e750475af3ca25fa7eab159dab"', 'ContentType': 'text/csv', 'Metadata': {}, 'Body': <botocore.response.StreamingBody object at 0x7f858dc1e6d0>}
1153
<_csv.reader object at 0x7f858ea76970>
[ERROR] Error: iterator should return strings, not bytes (did you open the file in text mode?)
The code snippet is:
import boto3
import csv


def digest_csv(bucket_name, key_name):
    # Let's use Amazon S3
    s3 = boto3.client('s3')

    print(bucket_name)
    print(key_name)

    s3_object = s3.get_object(Bucket=bucket_name, Key=key_name)
    print(s3_object)

    # read the contents of the file and split it into a list of lines
    lines = s3_object['Body'].read().splitlines(True)
    print(len(lines))

    contacts = csv.reader(lines, delimiter=';')
    print(contacts)

    # now iterate over those contacts
    for contact in contacts:
        # here you get a sequence of dicts
        # do whatever you want with each line here
        print('-*-'.join(contact))
I think the problem is with csv.reader.
I'm passing an array of lines as the first parameter... should it be modified?
Any ideas?
Instead of using csv.reader, the following worked for me (adjusted for your delimiter and variables):
for line in lines:
    contact = ''.join(line.decode().split(';'))
    print(contact)
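Along the same lines, csv.reader itself also works if the bytes coming back from S3 are decoded first. A sketch, assuming the file is UTF-8 encoded:
# The Body stream returns bytes, but csv.reader wants strings,
# so decode once before splitting into lines.
lines = s3_object['Body'].read().decode('utf-8').splitlines(True)
contacts = csv.reader(lines, delimiter=';')
for contact in contacts:
    print('-*-'.join(contact))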
I'm using a Python script to scrape Google; this is what I get when the script finishes. Imagine I have 100 results (I'm showing 2 as an example).
{'query_num_results_total': 'Око 64 резултата (0,54 секунде/и)\xa0', 'query_num_results_page': 77, 'query_page_number': 1, 'query': 'example', 'serp_rank': 1, 'serp_type': 'results', 'serp_url': 'example2.com', 'serp_rating': None, 'serp_title': '', 'serp_domain': 'example2.com', 'serp_visible_link': 'example2.com', 'serp_snippet': '', 'serp_sitelinks': None, 'screenshot': ''}
{'query_num_results_total': 'Око 64 резултата (0,54 секунде/и)\xa0', 'query_num_results_page': 77, 'query_page_number': 1, 'query': 'example', 'serp_rank': 2, 'serp_type': 'results', 'serp_url': 'example.com', 'serp_rating': None, 'serp_title': 'example', 'serp_domain': 'example.com', 'serp_visible_link': 'example.com', 'serp_snippet': '', 'serp_sitelinks': None, 'screenshot': ''}
This is the script usage code:
import serpscrap
import pprint
import sys

config = serpscrap.Config()
config_new = {
    'cachedir': '/tmp/.serpscrap/',
    'clean_cache_after': 24,
    'sel_browser': 'chrome',
    'chrome_headless': True,
    'database_name': '/tmp/serpscrap',
    'do_caching': True,
    'num_pages_for_keyword': 2,
    'scrape_urls': False,
    'search_engines': ['google'],
    'google_search_url': 'https://www.google.com/search?num=100',
    'executable_path': '/usr/local/bin/chromedriver',
    'headers': {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'de-DE,de;q=0.8,en-US;q=0.6,en;q=0.4',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Connection': 'keep-alive',
    },
}

arr = sys.argv
keywords = ['example']
config.apply(config_new)

scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=keywords)
results = scrap.run()

for result in results:
    print(result)
I want to stop the script if the results contain some URL I'm looking for, for example "example.com".
If I have https here, 'serp_url': 'https://example2.com', I want it to match and stop the script even when I give the argument without https, just example2.com. If it's not possible to check while the script is working, I will need an explanation of how to find serp_url by the argument I provided.
I'm not familiar with Python, but I'm building a PHP application that will run this Python script and output the results. I don't want to work with the results in PHP (extracting by serp_url, etc.); I want everything to be done in Python.
You can do it with something like this:
import sys

for result in results:
    if my_url in result['serp_url']:
        # this matches 'example2.com' inside 'https://example2.com',
        # and also longer URLs like 'https://example2.com/whatever',
        # whether or not the argument begins with 'https'
        sys.exit()
Using any() is another solution:
if any(my_url in result['serp_url'] for result in results):
    sys.exit()
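A possible usage sketch, assuming my_url is the domain passed on the command line (the question already reads sys.argv), e.g. python script.py example2.com:
import sys

# Hypothetical wiring: take the wanted domain from the first argument.
my_url = sys.argv[1]

for result in results:
    print(result)
    if my_url in result['serp_url']:
        sys.exit(0)  # stop as soon as the wanted domain appears in a result URL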
First of all, you need to access serp_url's value.
Since the result variable is a dictionary, result['serp_url'] will return each result's URL.
Inside the for-loop where you print your results, you should add an if-statement comparing result['serp_url'] with a variable that contains your desired URL (I don't think you provide that info in your code). Maybe it could be something like the following:
for result in results:
    print(result)
    if my_url == result['serp_url']:
        sys.exit()
The same thinking applies in the https case, but now we need the startswith() method to strip the scheme before comparing:
for result in results:
    print(result)
    url = result['serp_url']
    # drop a leading 'https://' so a bare 'example2.com' argument still matches
    if url.startswith('https://'):
        url = url[len('https://'):]
    if my_url == url:
        sys.exit()
Hope it helps.
I'm using the Shopify Python library to work with my store's orders, discounts, etc. I'm writing some tests that involve creating and deleting price rules and discounts.
def test_get_discount(self):
    random_number_string = str(randint(0, 10000))

    price_rule = shopify.PriceRule.create({
        'title': 'TEST-PRICERULE-' + random_number_string,
        'target_type': 'line_item',
        'target_selection': 'all',
        'allocation_method': 'across',
        'value_type': 'fixed_amount',
        'value': -1000,
        'once_per_customer': True,
        'customer_selection': 'all',
        'starts_at': datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    })

    discount = shopify.DiscountCode.create({
        'price_rule_id': price_rule.id,
        'code': 'TEST-DISCOUNT-' + random_number_string,
        'usage_count': 1,
        'value_type': 'fixed_amount',
        'value': -1000
    })

    fetched_price_rule = shopify_utils.get_discount_codes('TEST-PRICERULE-' + random_number_string)
    self.assertIsNotNone(fetched_price_rule)

    #deleted_discount = discount.destroy()
    #self.assertIsNone(deleted_discount)

    deleted_price_rule = price_rule.delete('discounts')
    self.assertIsNone(deleted_price_rule)
Everything up to the deleting works properly. I'm also able to delete a discount with no problem, but when I try to delete a price rule, it errors out with the following:
deleted_price_rule = price_rule.delete()
TypeError: _instance_delete() missing 1 required positional argument: 'method_name'
I believe it's asking me for the nested resource name, so I've tried passing things like discounts, discount_code, etc., but no luck. I see that pyactiveresource uses that method_name to construct a URL to delete the related resource; I just don't know what their API expects, as it's not very well documented for cases like this.
When I do specify a (wrong) method name, I get this error:
pyactiveresource.connection.ClientError: Response(code=406, body="b''", headers={'Server': 'nginx', 'Date': 'Sun, 10 Sep 2017 00:57:28 GMT', 'Content-Type': 'application/json', 'Transfer-Encoding': 'chunked', 'Connection': 'close', 'X-Sorting-Hat-PodId': '23', 'X-Sorting-Hat-PodId-Cached': '0', 'X-Sorting-Hat-ShopId': '13486411', 'X-Sorting-Hat-Section': 'pod', 'X-Sorting-Hat-ShopId-Cached': '0', 'Referrer-Policy': 'origin-when-cross-origin', 'X-Frame-Options': 'DENY', 'X-ShopId': '13486411', 'X-ShardId': '23', 'X-Shopify-Shop-Api-Call-Limit': '3/40', 'HTTP_X_SHOPIFY_SHOP_API_CALL_LIMIT': '3/40', 'X-Stats-UserId': '0', 'X-Stats-ApiClientId': '1807529', 'X-Stats-ApiPermissionId': '62125531', 'Strict-Transport-Security': 'max-age=7776000', 'Content-Security-Policy': "default-src 'self' data: blob: 'unsafe-inline' 'unsafe-eval' https://* shopify-pos://*; block-all-mixed-content; child-src 'self' https://* shopify-pos://*; connect-src 'self' wss://* https://*; script-src https://cdn.shopify.com https://checkout.shopifycs.com https://js-agent.newrelic.com https://bam.nr-data.net https://dme0ih8comzn4.cloudfront.net https://api.stripe.com https://mpsnare.iesnare.com https://appcenter.intuit.com https://www.paypal.com https://maps.googleapis.com https://stats.g.doubleclick.net https://www.google-analytics.com https://visitors.shopify.com https://v.shopify.com https://widget.intercom.io https://js.intercomcdn.com 'self' 'unsafe-inline' 'unsafe-eval'; upgrade-insecure-requests; report-uri /csp-report?source%5Baction%5D=error_404&source%5Bapp%5D=Shopify&source%5Bcontroller%5D=admin%2Ferrors&source%5Bsection%5D=admin_api&source%5Buuid%5D=72550a71-d55b-4fb2-961f-f0447c1d04c6", 'X-Content-Type-Options': 'nosniff', 'X-Download-Options': 'noopen', 'X-Permitted-Cross-Domain-Policies': 'none', 'X-XSS-Protection': '1; mode=block; report=/xss-report?source%5Baction%5D=error_404&source%5Bapp%5D=Shopify&source%5Bcontroller%5D=admin%2Ferrors&source%5Bsection%5D=admin_api&source%5Buuid%5D=72550a71-d55b-4fb2-961f-f0447c1d04c6', 'X-Dc': 'ash,chi2', 'X-Request-ID': '72550a71-d55b-4fb2-961f-f0447c1d04c6'}, msg="Not Acceptable")
Any ideas hugely appreciated!
As @Tim mentioned in his response, the destroy method worked for me and successfully deleted all the child instances as well as the parent instance.
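For reference, a minimal sketch of what that looks like in the test above (destroy() comes from pyactiveresource, which the Shopify resources build on; the verification step assumes the 404 surfaces as ResourceNotFound):
from pyactiveresource.connection import ResourceNotFound

# destroy() issues the DELETE for the price rule itself; as noted above,
# its discount codes were removed along with it.
price_rule.destroy()

# Optional check that it is really gone.
with self.assertRaises(ResourceNotFound):
    shopify.PriceRule.find(price_rule.id)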
I am working with an AWS Lambda function written in Python 2.7.x which downloads an image, saves it to /tmp, then uploads it back to the bucket.
My image metadata starts out in the original bucket with HTTP headers like Content-Type = image/jpeg, among others.
After saving my image with PIL, all the headers are gone and I am left with Content-Type = binary/octet-stream.
From what I can tell, image.save is losing the headers due to the way PIL works. How do I either preserve the metadata or at least apply it to the newly saved image?
I have seen posts suggesting that this metadata is in EXIF, but I tried to get the EXIF info from the original file and apply it to the saved file with no luck. I'm not clear that it's in the EXIF data anyway.
Partial code to give an idea of what I am doing:
def resize_image(image_path):
    with Image.open(image_path) as image:
        image.save(upload_path, optimize=True)


def handler(event, context):
    global upload_path
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key'].encode("utf8"))
        download_path = '/tmp/{}{}'.format(uuid.uuid4(), file_name)
        upload_path = '/tmp/resized-{}'.format(file_name)
        s3_client.download_file(bucket, key, download_path)
        resize_image(download_path)
        s3_client.upload_file(upload_path, '{}resized'.format(bucket), key)
Thanks to Sergey, I changed to using get_object(), but the response is missing Metadata:
response = s3_client.get_object(Bucket=bucket,Key=key)
response= {u'Body': , u'AcceptRanges': 'bytes', u'ContentType': 'image/jpeg', 'ResponseMetadata': {'HTTPStatusCode': 200, 'RetryAttempts': 0, 'HostId': 'au30hBMN37/ti0WCfDqlb3t9ehainumc9onVYWgu+CsrHtvG0u/zmgcOIvCCBKZgQrGoooZoW9o=', 'RequestId': '1A94D7F01914A787', 'HTTPHeaders': {'content-length': '84053', 'x-amz-id-2': 'au30hBMN37/ti0WCfDqlb3t9ehainumc9onVYWgu+CsrHtvG0u/zmgcOIvCCBKZgQrGoooZoW9o=', 'accept-ranges': 'bytes', 'expires': 'Sun, 01 Jan 2034 00:00:00 GMT', 'server': 'AmazonS3', 'last-modified': 'Fri, 23 Dec 2016 15:21:56 GMT', 'x-amz-request-id': '1A94D7F01914A787', 'etag': '"9ba59e5457da0dc40357f2b53715619d"', 'cache-control': 'max-age=2592000,public', 'date': 'Fri, 23 Dec 2016 15:21:58 GMT', 'content-type': 'image/jpeg'}}, u'LastModified': datetime.datetime(2016, 12, 23, 15, 21, 56, tzinfo=tzutc()), u'ContentLength': 84053, u'Expires': datetime.datetime(2034, 1, 1, 0, 0, tzinfo=tzutc()), u'ETag': '"9ba59e5457da0dc40357f2b53715619d"', u'CacheControl': 'max-age=2592000,public', u'Metadata': {}}
If I use:
metadata = response['ResponseMetadata']['HTTPHeaders']
metadata = {'content-length': '84053', 'x-amz-id-2': 'f5UAhWzx7lulo3cMVF8hdVRbHnhdnjHWRDl+LDFkYm9pubjL0A01L5yWjgDjWRE4TjRnjqDeA0U=', 'accept-ranges': 'bytes', 'expires': 'Sun, 01 Jan 2034 00:00:00 GMT', 'server': 'AmazonS3', 'last-modified': 'Fri, 23 Dec 2016 15:47:09 GMT', 'x-amz-request-id': '4C69DF8A58EF3380', 'etag': '"9ba59e5457da0dc40357f2b53715619d"', 'cache-control': 'max-age=2592000,public', 'date': 'Fri, 23 Dec 2016 15:47:10 GMT', 'content-type': 'image/jpeg'}
Saving with put_object
s3_client.put_object(Bucket=bucket+'resized',Key=key, Metadata=metadata, Body=downloadfile)
creates a whole lot of extra metadata in S3. It does not save the content type as image/jpeg but rather as binary/octet-stream, and it creates a metadata entry x-amz-meta-content-type = image/jpeg.
You are confusing S3 metadata, stored by AWS S3 along with an object, and EXIF metadata, stored inside the file itself.
download_file() doesn't get object attributes from S3. You should use get_object() instead: https://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.get_object
Then you can use put_object() with the same attributes to upload the new file: https://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.put_object
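For example, a rough sketch of that flow using the names from the question's handler (the resized file is assumed to have been written to upload_path by PIL as before):
# Fetch the original together with its S3 attributes.
original = s3_client.get_object(Bucket=bucket, Key=key)
content_type = original['ContentType']   # e.g. 'image/jpeg'
user_metadata = original['Metadata']     # only x-amz-meta-* values live here

# ...download, resize with PIL, write the result to upload_path as before...

# Re-upload with the same content type and user metadata. The content type
# goes in its own ContentType argument; anything passed via Metadata= becomes
# an x-amz-meta-* entry, which is why the earlier attempt ended up with
# x-amz-meta-content-type instead of a real Content-Type header.
with open(upload_path, 'rb') as body:
    s3_client.put_object(
        Bucket='{}resized'.format(bucket),
        Key=key,
        Body=body,
        ContentType=content_type,
        Metadata=user_metadata,
    )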
Content type information is not stored in the file you upload; it has to be guessed or extracted somehow. This is something you must do manually or with tools. With a fairly small dictionary you can guess most file types.
When you upload a file or object, you have the chance to specify its content type. Otherwise S3 defaults to application/octet-stream.
Using the boto3 python package for instance:
s3client.upload_file(
Filename=local_path,
Bucket=bucket,
Key=remote_path,
ExtraArgs={
"ContentType": "image/jpeg"
}
)
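If you'd rather not hard-code the type, the standard-library mimetypes module covers the common cases. A small sketch, falling back to S3's default when the extension isn't recognised:
import mimetypes

# Guess the content type from the file extension.
content_type, _ = mimetypes.guess_type(local_path)

s3client.upload_file(
    Filename=local_path,
    Bucket=bucket,
    Key=remote_path,
    ExtraArgs={"ContentType": content_type or "application/octet-stream"}
)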
When using the Python swiftclient module I can POST to an object with an X-Delete-At/After header and an epoch value, but how do I show the expiration time of the object? I was doing some testing and it seems that the file is always expired immediately, e.g. here, where I set the time 100 days in the future:
>>> swift.put_object('container1','test_file01.txt','This is line 1 in the file test_file01.txt done at: %s' % datetime.now().strftime('%Y-%m-%d %H:%M:%S,%f'))
'4b3faf0b79d97f5e478949e7d6c4c575'
>>> swift.head_object('container1','test_file01.txt')
{'content-length': '78', 'server': 'Jetty(7.6.4.v20120524)', 'last-modified': 'Wed, 23 Apr 2014 17:09:55 GMT', 'etag': '4b3faf0b79d97f5e478949e7d6c4c575', 'x-timestamp': '1398272995', 'date': 'Wed, 23 Apr 2014 17:09:59 GMT', 'content-type': 'application/octet-stream'}
>>> swift.post_object('container1','test_file01.txt',headers={'X-Delete-At':(datetime.now(pytz.timezone('GMT')) + timedelta(days=100)).strftime('%s')})
>>> swift.head_object('container1','test_file01.txt')
Traceback (most recent call last):
  File "<pyshell#121>", line 1, in <module>
    swift.head_object('container1','test_file01.txt')
  File "/usr/local/lib/python2.7/dist-packages/swiftclient/client.py", line 1279, in head_object
    return self._retry(None, head_object, container, obj)
  File "/usr/local/lib/python2.7/dist-packages/swiftclient/client.py", line 1189, in _retry
    rv = func(self.url, self.token, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/swiftclient/client.py", line 853, in head_object
    http_response_content=body)
ClientException: Object HEAD failed: http://10.249.238.135:9024:9024/v1/rjm-vnx-namespace01/container1/test_file01.txt 404 Not Found
So it seems it was expired immediately. My questions are:
Am I setting the expiration correctly? I would like to be able to do it on an existing object rather than at object creation time, but perhaps I HAVE to do it when the object is created?
Is there a way to see the expiration time? Obviously if it's not working correctly then there's no good way to see it, but if it were, does head_object() return that information?
Thanks,
Rob
Never mind, I figured it out. By setting an "after" value instead, I realized that the value apparently needs to be in milliseconds. So when I changed it to:
>>> swift.post_object('container1','test_file01.txt',headers={'X-Delete-At':int((datetime.now(pytz.timezone('GMT')) + timedelta(days=100)).strftime('%s'))*1000})
>>> swift.head_object('container1','test_file01.txt')
{'content-length': '78', 'x-delete-at':'1406932148000', 'server': 'Jetty(7.6.4.v20120524)', 'last-modified': 'Wed, 23 Apr 2014 17:29:06 GMT', 'etag': '0baf8b37f374c94e59a05a7f7b339811', 'x-timestamp': '1398274146', 'date': 'Wed, 23 Apr 2014 17:29:08 GMT', 'content-type': 'application/octet-stream'}
Then it worked as expected.
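And to answer the second question for anyone else: once the header sticks, head_object() returns it as x-delete-at, so it can be turned back into a readable timestamp. A sketch, assuming the backend reports the value in milliseconds as it does above:
from datetime import datetime

headers = swift.head_object('container1', 'test_file01.txt')
# This backend stores x-delete-at in milliseconds, so scale back to seconds.
expires_at = datetime.utcfromtimestamp(int(headers['x-delete-at']) / 1000)
print('Object expires at (UTC): %s' % expires_at)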
Rob