Compare string vs modified strings & partial strings - python

I have a list of links, and I want to see if they're listed in my disavow file.
My disavow file contains both URLs (e.g. http://getpaydayloan.org/blog/blog-how-to-apply-for-online-payday-loans-san) and whole domains, listed as domain:getpaydayloan.org.
The new URLs file holds URLs only, e.g. http://getpaydayloan.org/blog/blog-how-to-apply-for-online-payday-loans-san
I want to see if the new URLs are already in the disavow file. I am currently generating a diff using diff = set(url_set) - set(disavow_urls), but I also need to check whether they are covered by entries in the domain:url.com format.
How would I do something like that?
In case it helps, here is the whole script: https://github.com/growth-austen/disavow_automator

Here is a function to check if the url contains any of the disavowed domains.
def inDisavow(url, disavowDomainList):
    # return True as soon as any disavowed domain appears in the url
    for domain in disavowDomainList:
        if domain in url:
            return True
    return False

Some alternative definitions to David's function for fun:
return any(domain in url for domain in disavowDomainList)
return any(map(url.__contains__, disavowDomainList))
(replace map with itertools.imap in Python 2 for memory efficiency)
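To cover the domain: entries as well, here is a rough sketch of how the check could plug into the existing set-diff approach (the disavow_lines and url_set sample data below are invented for illustration, not taken from the repo):

# illustrative sketch -- sample data is made up, only the domain:/URL split matters
disavow_lines = [
    'http://getpaydayloan.org/blog/blog-how-to-apply-for-online-payday-loans-san',
    'domain:getpaydayloan.org',
]
url_set = ['http://getpaydayloan.org/blog/some-new-post', 'http://example.com/ok']

def inDisavow(url, disavowDomainList):
    return any(domain in url for domain in disavowDomainList)

# split the disavow file into exact URLs and domain: entries
disavow_urls = {line for line in disavow_lines if not line.startswith('domain:')}
disavow_domains = [line[len('domain:'):] for line in disavow_lines if line.startswith('domain:')]

# keep only URLs that are neither listed verbatim nor covered by a domain: entry
diff = {url for url in set(url_set) - disavow_urls
        if not inDisavow(url, disavow_domains)}
print(diff)  # {'http://example.com/ok'}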


Is there a way in Zapier to convert a python output to separate lines?

I have a Python output from a Zapier Code step that looks like this:
I want to be able to use this in the body of a Gmail message as separate lines. However, presently, it looks like this when I use that Python output in the step. The screenshot below is the email returned after I test.
Is there a filter or a pythonic way to do this within Zapier?
The output would ideally look like this:
https://hectv.sharefile.com/dxxxxxxxfcebd247d09
https://hectv.sharefile.com/dxxxxxxx729cd9494
https://hectv.sharefile.com/d-xxxxxx84622a
Thank you.
William from Zapier answered this for me.
If we're generating a line item array with the code step and we need individual items, you'll want to add a Formatter - Utility - Line Item to Text action.
This action will go just after the Run Python step and should take the Sharefile Output as the input for the formatter. From there, the formatter can break the line item array down into individual text strings that you can assign in the zap's remaining steps. :)
For more information on Formatter check out our article here: https://zapier.com/help/create/format/get-started-with-formatter
For more info on Line Item to Text, check out this article: https://zapier.com/help/create/format/convert-line-items-into-text-strings
Using Line Item to Text, the zap won't care if there are 10 items or a single item; it should still return the same individual items. The main concern is that the test done with Line Item to Text should include the maximum number of items. That way those items can be assigned in the following steps and used when they're present, or ignored when they are not.
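If you would rather handle it inside the Code step itself, here is a rough sketch (assuming the links arrive in input_data as a comma-separated string under a key named sharefile_output, which is how mapped line items typically come into a Code step, and that a newline-joined string is acceptable for the Gmail body):

# Code by Zapier (Python) step -- the key names here are assumptions, adjust to your zap
links = [l.strip() for l in input_data.get('sharefile_output', '').split(',') if l.strip()]

# join with real newlines so each link renders on its own line in the email body
output = {'links_text': '\n'.join(links)}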

Check if string is in certain format in Python

I have a string as below.
/customer/v1/123456789/account/
The id in the url is dynamic.
Given that string, how can I be sure that its first and second parts match the structure below? /customer/v1/<customer_id>/account
What I have done so far is this; however, I want to check whether the endpoint matches the structure completely.
endpoint_structure = '/customer/v1/'
endpoint = '/customer/v1/123456789/account/'
if endpoint_structure in endpoint:
    return True
return False
Endpoint structure might change as well.
For example: /customer/v1/<customer_id>/documents/<document_id>/ and there will be again given endpoint and I need to check if given endpoint fits with the structure.
You can use a regular expression:
import re
return re.match(r'^/customer/v1/\d+/account/$', endpoint)
or you can examine the beginning and the end:
return endpoint.startswith('/customer/v1/') and endpoint.endswith('/account/')
... though this doesn't attempt to verify that the stuff between the beginning and the end is numeric.
You can solve this using a regular expression:
^(/customer/v1/)(\d)+(/account/)$
If you also want to specify a minimum length for customer_id
(/customer/v1/<customer_id>/account), then use the following regexp:
^(/customer/v1/)(\d){5,}(/account/)$
Here the customer_id must be at least 5 digits long.
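Since the structure itself can change, one option is to turn the structure string into a regex once and reuse it. A small sketch (the function names are illustrative, only the <placeholder> syntax comes from the question):

import re

def structure_to_regex(structure):
    # split on <placeholder> tokens, escape the literal pieces,
    # and join them with "one or more non-slash characters"
    parts = re.split(r'<[^>]+>', structure)
    return '[^/]+'.join(re.escape(p) for p in parts)

def matches_structure(endpoint, structure):
    # re.fullmatch needs Python 3.4+; on older versions anchor with ^...$ and re.match
    return re.fullmatch(structure_to_regex(structure), endpoint) is not None

print(matches_structure('/customer/v1/123456789/account/',
                        '/customer/v1/<customer_id>/account/'))                  # True
print(matches_structure('/customer/v1/123/documents/55/',
                        '/customer/v1/<customer_id>/documents/<document_id>/'))  # True

If the ids must be numeric, swap '[^/]+' for r'\d+'.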

Regex to parse out a part of URL using python

I have data as follows:
data['url']
http://hostname.com/aaa/uploads/2013/11/a-b-c-d.jpg https://www.aaa.com/
http://hostname.com/bbb/uploads/2013/11/e-f-g-h.gif https://www.aaa.com/
http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html
http://hostname.com/ddd/uploads/2013/11/w-e-r-t.ico
http://hostname.com/ddd/uploads/2013/11/r-t-y-u.aspx https://www.aaa.com/
http://hostname.com/bbb/uploads/2013/11/t-r-w-q.jpeg https://www.aaa.com/
I want to find the formats such as .jpg, .gif, .png, .ico, .aspx, .html, .jpeg and parse backwards from each one until a "/" is reached. I also want to check for several occurrences throughout the string. My output should be:
data['parsed']
a-b-c-d
e-f-g-h
e-f-g-h a-a-a-a
w-e-r-t
r-t-y-u
t-r-w-q
Instead of writing individual commands for each of the formats, is there a way to do everything with a single command?
Can anybody help me write these commands? I am new to regex and any help would be appreciated.
This builds a list of (name, extension) pairs:
import re

results = []
for link in data['url']:
    # a line may contain more than one URL, so collect every match
    for match in re.finditer(r'/(\w-\w-\w-\w)\.(\w{2,})\b', link):
        results.append((match.group(1), match.group(2)))
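With the sample data from the question, results would come out as:
[('a-b-c-d', 'jpg'), ('e-f-g-h', 'gif'), ('e-f-g-h', 'png'), ('a-a-a-a', 'html'), ('w-e-r-t', 'ico'), ('r-t-y-u', 'aspx'), ('t-r-w-q', 'jpeg')]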
This pattern returns the file names. I have just used one of your urls to demonstrate; for more, you could simply append the matches to a list of results:
import re
url = "http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html"
p = r'((?:[a-z]-){3}[a-z]).'
matches = re.findall(p, url)
>>> print('\n'.join(matches))
e-f-g-h
a-a-a-a
There is the assumption that the urls all have the general form you provided.
You might try this:
data['parse'] = re.findall(r'[^/]+\.[a-z]+(?=\s|$)', data['url'])
That will pick out all of the file names with their extensions. If you want to remove the extensions, the code above returns a list which you can then process with list comprehension and re.sub like so:
[re.sub(r'\.[a-z]+$', '', exp) for exp in data['parse']]
Use the .join function to create a string as demonstrated in Totem's answer
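Putting the findall/sub/join pieces together, a small end-to-end sketch (assuming data['url'] holds the lines shown in the question; only three of them are repeated here):

import re

lines = [
    "http://hostname.com/aaa/uploads/2013/11/a-b-c-d.jpg https://www.aaa.com/",
    "http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html",
    "http://hostname.com/ddd/uploads/2013/11/w-e-r-t.ico",
]

parsed = []
for line in lines:
    # grab every "name.ext" token that sits between the last "/" and the end of a URL
    names = re.findall(r'[^/]+\.[a-z]+(?=\s|$)', line)
    # strip the extensions and keep the names space-separated, one entry per input line
    parsed.append(' '.join(re.sub(r'\.[a-z]+$', '', n) for n in names))

print(parsed)  # ['a-b-c-d', 'e-f-g-h a-a-a-a', 'w-e-r-t']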

How can I get a list of all file names having same prefix on Amazon S3?

I am using boto and Python to store and retrieve files to and from Amazon S3.
I need to get the list of files present in a directory. I know there is no real concept of directories in S3, so I am phrasing my question as: how can I get a list of all file names having the same prefix?
For example, let's say I have the following files:
Brad/files/pdf/abc.pdf
Brad/files/pdf/abc2.pdf
Brad/files/pdf/abc3.pdf
Brad/files/pdf/abc4.pdf
mybucket/files/pdf/new/
mybucket/files/pdf/new/abc.pdf
mybucket/files/pdf/2011/
When I call foo("Brad"), it should return a list like this-
files/pdf/abc.pdf
files/pdf/abc2.pdf
files/pdf/abc3.pdf
files/pdf/abc4.pdf
What is the best way to do it?
user3's approach is a purely client-side solution. I think it works well at small scale. If you have millions of objects in one bucket, you may pay for many requests and bandwidth fees.
Alternatively, you can use the delimiter and prefix parameters provided by the GET Bucket API to achieve your requirement. There are many examples in the documentation; see http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html
Needless to say, you can use boto to achieve this.
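For instance, a minimal boto 2 sketch (the bucket name and exact key layout are assumptions; S3 does the prefix filtering server-side, so only matching keys come back over the wire):

import boto

def foo(prefix, bucket_name='mybucket'):
    conn = boto.connect_s3()              # picks up credentials from the environment/boto config
    bucket = conn.get_bucket(bucket_name)
    # list only the keys under "<prefix>/" and strip that prefix from the result
    return [key.name[len(prefix) + 1:] for key in bucket.list(prefix=prefix + '/')]

print foo('Brad')   # ['files/pdf/abc.pdf', 'files/pdf/abc2.pdf', ...]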
You can use startswith and list comprehension for this purpose as below:
paths=['Brad/files/pdf/abc.pdf','Brad/files/pdf/abc2.pdf','Brad/files/pdf/abc3.pdf','Brad/files/pdf/abc4.pdf','mybucket/files/pdf/new/','mybucket/files/pdf/new/abc.pdf','mybucket/files/pdf/2011/']
def foo(m):
    return [p for p in paths if p.startswith(m + '/')]

print foo('Brad')
output:
['Brad/files/pdf/abc.pdf', 'Brad/files/pdf/abc2.pdf', 'Brad/files/pdf/abc3.pdf', 'Brad/files/pdf/abc4.pdf']
Using split and filter:
def foo(m):
    return filter(lambda x: x.split('/')[0] == m, paths)

How to pass multiple values for a single URL parameter?

Is it possible to pass multiple values for a single URL parameter without using your own separator?
The backend expects an input parameter urls to have one or more values; it can be set to a single URL or to multiple URLs. How can I set the urls parameter so it can carry multiple values? I can't use my own separator because it could be part of a value itself.
Example:
http://example.com/?urls=[value,value2...]
The urls parameter can be set to just http://google.com or to http://google.com http://yahoo.com .... In the backend, I want to process each url as a separate value.
http://.../?urls=foo&urls=bar&...
...
request.GET.getlist('urls')
The following is probably the best way of doing it: rather than specifying a delimited list of URLs, use the fact that you can specify the same param name multiple times, e.g.:
http://example.com/?url=http://google.co.uk&url=http://yahoo.com
The URL list can then be retrieved via request.GET.getlist('url')
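A minimal sketch of both sides, assuming Django on the backend (which is what request.GET.getlist implies) and the standard library for building the query string; the view and parameter names are illustrative:

# client side: repeat the same parameter name for each value
try:
    from urllib.parse import urlencode    # Python 3
except ImportError:
    from urllib import urlencode          # Python 2

urls = ['http://google.co.uk', 'http://yahoo.com']
query = urlencode([('url', u) for u in urls])
full_url = 'http://example.com/?' + query
# -> http://example.com/?url=http%3A%2F%2Fgoogle.co.uk&url=http%3A%2F%2Fyahoo.com

# server side (Django view): getlist returns every value sent under the repeated key
from django.http import HttpResponse

def my_view(request):
    url_list = request.GET.getlist('url')   # ['http://google.co.uk', 'http://yahoo.com']
    return HttpResponse('\n'.join(url_list))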
