I use tldextract (version 2.2.2) to extract subdomain/domain/suffix from URLs.
I recently noticed a result that I was surprised by:
>>> from tldextract import extract
>>> extract('http://althawrah.ye/archives/597366')
ExtractResult(subdomain='', domain='', suffix='althawrah.ye')
Instead of being picked up as the domain, althawrah is picked up as part of the suffix. Why is this?
Snooping around a bit, I notice in the Public Suffix List itself that .ye is one of a small number of suffixes that use a leading asterisk, e.g.
// fj : https://en.wikipedia.org/wiki/.fj
*.fj
// ye : http://www.y.net.ye/services/domain_name.htm
*.ye
The implication here is that these suffixes do not allow domain names to be registered directly under the suffix, but instead must be registered as a third level name. However, this is not the case with http://althawrah.ye/; that is, althawrah is not listed as a second-level domain of .ye. So, what is going on here?
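The wildcard semantics can be illustrated with a small self-contained sketch of the matching algorithm (simplified: no "!" exception rules; split_domain is a hypothetical helper for illustration, not tldextract's API):

```python
def split_domain(host, rules):
    """Split host into (registrable_domain, suffix) using simplified
    Public Suffix List semantics: a rule '*.ye' matches any single
    label directly under .ye, so the suffix consumes two labels."""
    labels = host.split('.')
    match_len = 1  # implicit default rule '*': the TLD alone is the suffix
    for rule in rules:
        rule_labels = rule.split('.')
        n = len(rule_labels)
        if n > len(labels):
            continue
        tail = labels[-n:]
        if all(r == '*' or r == l for r, l in zip(rule_labels, tail)):
            match_len = max(match_len, n)
    suffix = '.'.join(labels[-match_len:])
    domain = labels[-match_len - 1] if len(labels) > match_len else ''
    return domain, suffix

# With the current '*.ye' rule, 'althawrah' is swallowed by the suffix:
print(split_domain('althawrah.ye', ['*.ye']))          # ('', 'althawrah.ye')
# With explicit rules like 'ye' and 'com.ye' it would be the domain:
print(split_domain('althawrah.ye', ['ye', 'com.ye']))  # ('althawrah', 'ye')
```

This reproduces the surprising ExtractResult above: under a wildcard rule the second-level label is part of the suffix, so no registrable domain is left over.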
Based on the history of the list and the description of the process for updating, it looks like the Yemen entry is simply wrong or out of date. The entry was added before 2007 (when the list was migrated from CVS to git), while the list guidelines state that:
Changes [for ICANN Domains] need to either come from a representative of the registry (authenticated in a similar manner to below) or be from public sources such as a registry website.
The website linked in the list (which hasn't changed since 2002) gives little detail but does mention URLs of the format www.yourcompany.com.ye, which is presumably where the *.ye rule came from. IANA's root zone database specifies TeleYemen as the current TLD manager, but there is no mention of domain registration on their site. The Wikipedia list of supposed "second level domains" was added in 2008 by a Canadian user linking to a since-deleted website of a company called phpcomet (archived here) which claimed to sell domains in the listed second level domains. However, a Google search for "site:ye" reveals plenty of sites outside those domains (e.g. press24.ye, ndc.ye) and fails to give any result for many of them (me.ye, co.ye, ltd.ye, plc.ye).
I'm not sure what could be done to update the official list, but I wouldn't be surprised if the correct entry would read something like:
ye
com.ye
edu.ye
gov.ye
org.ye
These changes were merged into publicsuffix/list in pull request 1189, thanks to TeleYemen and the project maintainers.
The list now specifies the second-level names explicitly and drops the * wildcard.
Related
I want to implement a program that basically detects to see if input from the user is not a genuine domain name or registered name on the ICANN.
Is there a library in Python that has a full list of registered domains on the internet?
No, not in Python or any other language, and for obvious reasons: don't you think such a list would get misused if it existed?
First two points:
see if input from the user is not a genuine domain name or registered name on the ICANN.
ICANN has nothing to do with your needs here. It plays no operational role in the day-to-day life of domain names, whether they are being registered or deleted. ICANN just defines the list of current TLDs, which are basically the topmost registries maintaining central databases with the lists of domain names.
What is a "genuine" domain? More importantly, what exactly do you need to test about domains? Is it, say, that people enter an email address and you want to check whether it is plausible? If so, testing the domain name is not the correct approach. But saying more really depends on your use case, which you do not describe; that also makes your question off-topic as long as it has no clearer relationship with programming.
So just some generic ideas:
you can use the DNS to query domain names, but this has edge cases (and you need to understand how the DNS works), among them that not all registered domain names are published in the DNS, for perfectly normal reasons
you can use whois, as John said, or better RDAP where it is available (at least for all gTLDs); this has drawbacks too: without a solid library to parse the replies, you don't even have a standard way of finding out whether a name exists, as registries return different free-form strings for such cases; it is also not suitable for high-volume queries, as it is heavily rate limited
if you are really interested in something closer to a list of domain names, all gTLDs are required to publish their zonefile at least daily, which is basically the list of all resolving domains (a subset of all registered domains; expect a few percent difference), see https://czds.icann.org/ ; some ccTLDs do so too, each under its own policies and rules: some have an "open data" feature providing something similar (often with a delay), or a list of "recently" registered domain names (so if you collect that day after day, at some point you have something close to the list of all domains)
Any good programming language has already polished libraries for DNS queries, whois or RDAP queries, or parsing zonefiles.
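As a minimal illustration of the DNS approach, with all the caveats above, the standard library can already answer "does this name currently resolve?", which is not the same thing as "is this name registered" (the function name resolves is my own):

```python
import socket

def resolves(name):
    """Return True if `name` currently resolves in the DNS.
    This is only a plausibility check: registered names may not
    resolve, and wildcard DNS or captive portals can make
    seemingly unregistered names resolve."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False
```

For anything beyond a rough sanity check (e.g. validating email domains at scale), a dedicated DNS library that can query MX records directly would be more appropriate than getaddrinfo.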
While working with the Enterprise Architect API I have noticed that when you export an EA project to XMI, several different kinds of elements get an attribute called ea_localid. That means in the XMI you'll find a tag that has the ea_localid as an attribute. This attribute seems to be used to reference source and target of connecting elements (at least this is valid for 'transitions', as we are working with State Machine diagrams).
So far, so good. Now the problem for my intended usage is that these values seem to be newly distributed every time you do an import and an export. EDIT: it is not quite clear to me when exactly this happens during the process. EDIT #2: it seems to happen on import.
That means after having exported your project, reimporting it, changing nothing and then exporting it again gives the generated XMI document a set of different ea_localid values. Moreover, it seems that some values that used to belong to one element can now be used for an entirely different one.
Does anybody know anything about the distribution mechanism? Or, even better, a way of emulating it? Or a way to reset all counters?
As far as I've seen, generally there seem to be different classes of elements and within these classes a new ea_localid for the next element is generated by counting +1. So the first one has the value 1, then the next one 2 and so on.
My goal here is doing 'roundtrips' (XMI --> project --> XMI ...) and always getting the same ea_localid values, possibly by editing the XMI document after export. Any help or suggestions would be highly appreciated. Cheers
The ea_localid represents the ElementID of elements (or the AttributeID of attributes, etc.).
In EA each "thing" has two IDs: a numeric ID and a GUID.
The numeric ID (e.g. t_object.Object_ID) is used as the key in most relations, but it is not stable.
Things like importing XMI files can reset the numeric IDs. This explains why the ea_localid changes.
If you are looking for a stable ID you should use the GUID. This one is guaranteed to stay the same, even after exporting and importing into other models (as long as you don't set the flag Strip GUIDs when importing).
In the XMI file you'll find those stable IDs in the attribute xmi.id,
e.g.
<UML:Class name="Aannemer" xmi.id="EAID_04A526DF_7F07_4475_8E65_16D2D88CEECD" visibility="public" namespace="EAPK_0345C8A9_9E8F_42c5_9931_CB842233B11B" isRoot="false" isLeaf="false" isAbstract="false" isActive="false">
This value corresponds to the ea_guid columns in each of the tables.
So, after some testing I have found out that regarding the aforementioned goal of doing a roundtrip (xmi --> import to EA --> xmi) and always getting the exact same document, the easiest solution is ...
running a filter over the xmi that just deletes all nodes containing ea_localid, ea_sourceID (sic!) and ea_targetID values.
On reimport EA will just assign them new values. The information regarding source and target of 'transitions' and other connecting elements is also stored with the GUID, so there is no loss of information.
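Such a filter is easy to script. Here is a sketch in Python, assuming the volatile ids appear as TaggedValue elements carrying a tag attribute of ea_localid, ea_sourceID or ea_targetID (check this against your own EA export before relying on it):

```python
import xml.etree.ElementTree as ET

DROP = {'ea_localid', 'ea_sourceID', 'ea_targetID'}

def strip_local_ids(xmi_in, xmi_out):
    """Remove nodes carrying volatile EA ids so that a reimport
    assigns fresh ones. Matches on the 'tag' attribute, so it works
    regardless of element namespaces."""
    tree = ET.parse(xmi_in)
    root = tree.getroot()
    # ElementTree has no parent pointers, so build a child->parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    for el in list(root.iter()):
        if el.get('tag') in DROP:
            parents[el].remove(el)
    tree.write(xmi_out, encoding='utf-8', xml_declaration=True)
```

Run it over the exported XMI before comparing roundtrip outputs; the GUID-based references (xmi.id / ea_guid) are untouched, so no structural information is lost.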
I am writing a Python script that will manage multiple Oracle databases on a single box. Each database has its own OracleService, but they all run under one TNSListener. Because each computer's install might name things differently I want to make this as dynamic as possible.
First, I need to start the TNSListener service. Most of these are on local laptops that only start the listener when we are going to use an Oracle database. In addition, some laptops run different versions of Oracle so the actual service name is different. For this I need to be able to find the full service name or names that contains the string 'TNSListener'.
Second, all of the OracleService names will be appended by the instance name (i.e., OracleServiceTESTING1). So I need to get a list of all the OracleServices on the machine and then display a selection of the instances based on the appended portion of the service names.
I thought about accessing the registry and trying to pull services from there, but the overhead to parse through that seems excessive. I'm just looking for some general guidance on how to find all services that match the string 'TNSListener' and 'OracleService'.
I would recommend a library like pywinservicemanager. A short code example to check whether a particular service exists would look like this:
from pywinservicemanager.WindowsServiceConfigurationManager import ServiceExists

serviceName = 'TestService'
serviceExists = ServiceExists(serviceName)
print(serviceExists)
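ServiceExists checks one name at a time; for the matching part of the task you can enumerate the installed service names (e.g. with psutil.win_service_iter() on Windows, or by parsing "sc query" output) and filter them. The helpers below (find_listeners and oracle_instances are hypothetical names of my own) show the filtering and the instance-name extraction:

```python
def find_listeners(service_names):
    """All services whose name contains 'TNSListener' (case-insensitive),
    which covers version-specific names like
    'OracleOraDb11g_home1TNSListener'."""
    return [n for n in service_names if 'tnslistener' in n.lower()]

def oracle_instances(service_names):
    """Instance names appended to 'OracleService', e.g.
    'OracleServiceTESTING1' -> 'TESTING1'."""
    prefix = 'OracleService'
    return [n[len(prefix):] for n in service_names
            if n.startswith(prefix) and len(n) > len(prefix)]

services = ['OracleOraDb11g_home1TNSListener',
            'OracleServiceTESTING1', 'OracleServicePROD', 'Spooler']
print(find_listeners(services))    # ['OracleOraDb11g_home1TNSListener']
print(oracle_instances(services))  # ['TESTING1', 'PROD']
```

Keeping the enumeration and the filtering separate also makes the matching logic trivially unit-testable without a Windows service manager.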
We've got a Django (1.5.5) based application using FeinCMS (1.7.4).
For one page, formerly only the (general) en-based version was configured. Later, specific configurations for en-us and en-ca were added, with different url-names than the en version used. As a consequence, (en-based) links that had been distributed (via marketing channels) prior to that change didn't work anymore.
Playing around with the url-names, I noticed that Django/FeinCMS only honours the url-name that was edited last. That is, only one url-name is ever recognised across all contexts (en, en-us and en-ca): the one which was edited/created last.
Does someone know a way to fix this? I've tried to find the "responsible" code, but without success.
Creating manual redirects is no option as there are too many links to specific stories/articles.
[EDIT 17-10-2016 17:53]
Based on Jonas' comments I investigated the cms_page table in the DB a little. I noticed...
That there is no row in cms_page which represents the country-specific page configurations (e.g. for en-us and en-ca).
Although the last-edited url-name and title are the ones of the country-specific configuration, i.e. the one which "works", they don't show up in the table.
Data:
['<p>Work! please work.img:0\xc3\x82\xc2\xa0Will you?img:1</p>img:2img:3\xc3\x82\xc2\xa0ascasdacasdadasdaca HAHAHAHAHA! BAND!\n', '\n', "<p>Random test.</p><p><br />If you want to start a flame war, mention lines of code per day or hour in a developer\xc3\xa2€™s public forum. At least that is what I found when I started investigating how many lines of code are written per day per programmer. Lines of code, or loc for short, are supposedly a terrible metric for measuring programmer productivity and empirically I agree with this. There are too many variables involved starting with the definition of a line of code and going all the way up to the complexity of the requirements. There are single lines that take a long time to get right and there many lines which are mindless boilerplate code. All the same this measurement does have information encoded in it; the hard part is extracting that information and drawing the correct conclusions. Unfortunately I don\xc3\xa2€™t have access to enough data about software projects to provide a statistically sound analysis but I got a very interesting result from measuring two very different projects that I would like to share.</p><p>The first project is a traditional client server data mining tool for a vertical market mostly built in VB.NET and WinForms. This project started in 2003 and has been through several releases and an upgrade from .NET 1.1 to .NET 2.0. It has server components but most of the half a million lines of code lives in the client side. The team has always had around four developers although not always the same people. The average lines of code for this project came in at around ninety lines of code per day per developer. I wasn\xc3\xa2€™t able to measure the SQL in the stored procedures so this number is slightly inflated.</p><p><em>The second project is much smaller adding up to ten thousand lines of C# plus seven thousand lines of XAML c</em>reated by a team of four that also worked on the first project. 
This project lasted three months and it is a WPF point of sale application thus very different in scope from the first project. <strong>It was built around a number of web services in SOA fashion and does not have a database per se. Its average came up around seventy lines of code per developer per day.</strong></p><p>I am very surprised with the closeness of these numbers, especially given the difference in size and scope of the products. The commonality between them are the .NET framework and the team and one of them may be the key. Of these two, I am leaning to the .NET framework being the unifier because although the developers worked on both projects, three of elements on the team of the second project have spent less than a year on the first project and did not belong to the core team that wrote the vast majority of that first product. Or maybe there is something more general at work here?</p><p>The first step in using the WP_Filesystem is requesting credentials from the user. The normal way this is accomplished is at the time when you're saving the results of a form input, or you have otherwise determined that you need to write to a file.</p><p>The credentials form can be displayed onto an admin page by using the following code:</p><pre>$url = wp_nonce_url('themes.php?page=example','example-theme-options');\n</pre>", "if (false === ($creds = request_filesystem_credentials($url, '', false, false, null) ) ) {\n", '\treturn; // stop processing here\n', '}\n', '<p>The request_filesystem_credentials() call takes five arguments.</p><ul><li>The URL to which the form should be submitted (a nonced URL to a theme page was used in the example above)</li><li>A method override (normally you should leave this as the empty string: "")</li><li>An error flag (normally false unless an error is detected, see below)</li><li>A context directory (false, or a specific directory path that you want to test for access)</li><li>Form fields (an array of form field names from your 
previous form that you wish to "pass-through" the resulting credentials form, or null if there are none)</li></ul><p>The request_filesystem_credentials call will test to see if it is capable of writing to the local filesystem directly without credentials first. If this is the case, then it will return true and not do anything. Your code can then proceed to use the WP_Filesystem class.</p><p>The request_filesystem_credentials call also takes into account hardcoded information, such as hostname or username or password, which has been inserted into the wp-config.php file using defines. If these are pre-defined in that file, then this call will return that information instead of displaying a form, bypassing the form for the user.</p><p>If it does need credentials from the user, then it will output the FTP information form and return false. In this case, you should stop processing further, in order to allow the user to input credentials. Any form fields names you specified will be included in the resulting form as hidden inputs, and will be returned when the user resubmits the form, this time with FTP credentials.</p><p>Note: Do not use the reserved names of hostname, username, password, public_key, or private_key for your own inputs. These are used by the credentials form itself. Alternatively, if you do use them, the request_filesystem_credentials function will assume that they are the incoming FTP credentials.</p><p>When the credentials form is submitted, it will look in the incoming POST data for these fields, and if found, it will return them in an array suitable for passing to WP_Filesystem, which is the next step.</p><p><a id="Initializing_WP_Filesystem_Base" name="Initializing_WP_Filesystem_Base"></a>']
I use ReportLab to convert it to pdf but it fails.
This is my ReportLab code:
for page in self.pagelist:
    self.image_parser(page)
    print page.content
    for i in range(0, len(page.content)):
        bogustext = page.content[i]
        while len(re.findall(r'img:?', bogustext)) > 0:
            for m in re.finditer(r'img:?', bogustext):
                image_tag = bogustext[m.start():m.end()+1]
                print image_tag.split(':')[1]
                im = Image(page.images[int(image_tag.split(':')[1])], width=2*inch, height=2*inch)
                Story.append(Paragraph(bogustext[0:m.start()], style))
                bogustext = bogustext.replace(bogustext[0:m.start()], '')
                Story.append(im)
                bogustext = bogustext.replace(image_tag, '')
                break
        p = Paragraph(bogustext, style)
        Story.append(p)
        Story.append(Spacer(1, 0.2*inch))
page is a class whose page.content attribute contains the data shown above.
self.image_parser(page) is a function that removes all the image URLs from page.content (the data).
Error:
xml parser error (invalid attribute name id) in paragraph beginning
'<p>The request_filesystem_cred'
I don't get this error if I produce a separate PDF for every element of the list, but I do get one if I try to make a single complete PDF out of it. Where am I going wrong?