How to Extract Versions from Software Packages

How to Extract Versions from Software Packages - python

I'm trying to extract the version number from software packages hosted on SourceForge based on this Stack Overflow post. Specifically, I'm using the Release API and the "best_release.json" call. I have the following examples:
7-zip: https://sourceforge.net/projects/sevenzip/best_release.json
KeePass: https://sourceforge.net/projects/keepass/best_release.json
OpenOffice.org:
https://sourceforge.net/projects/openofficeorg.mirror/best_release.json
Using the following code snippet:
import requests
"""
Un/comment the following lines to change the project name and test
different responses.
"""
proj = "keepass"
# proj = "sevenzip"
# proj = "openofficeorg.mirror"
r = requests.get(f'https://sourceforge.net/projects/{proj}/best_release.json')
json_resp = r.json()
print(json_resp['release']['filename'])
I receive the respective results for each package:
7-Zip: /7-Zip/22.00/7z2200-linux-x86.tar.xz
KeePass: /KeePass 2.x/2.51.1/KeePass-2.51.1.zip
Openoffice.org: /extended/iso/en/OOo_3.3.0_Win_x86_install_en-US_20110219.iso
I'm wondering how I can extract the file versions from these disparate packages. Looking at the results, one can see that there are different naming conventions. For example, 7-Zip puts the file version as "22.00" in the second directory level. KeePass, however, puts it in the second directory level as well as the filename itself. OpenOffice.org puts it inside the filename.
Is there a way to do some sort of fuzzy match that can attempt to extract a "best guess" file version given a filename?
I thought of using regular expressions, re. For example, I can use the (\d+) capture group to capture one or more digits, as demonstrated here. However, this would also capture text such as "x86," which I don't want. I just desire some text that looks closest to a version number, but I'm unsure how to do this.

Related

Using error names rather than error codes in pydocstyle's ignore file

My team's repo using pydocstyle to manage doc string standardization. I've been tasked with reformatting the relevant configuration file. Currently it looks something like this:
[pydocstyle]
ignore = D100,D203,D405
match = *.py
but we want something more like:
[pydocstyle]
ignore = "error-name1", "error-name2", "error-name3"
match = *.py
so that it's more obvious to readers what rules are being ignored. However from what I can see this is not mentioned in the documentation (here).
Instead I thought I'd do something like this:
[pydocstyle]
ignore =
D100, # corresponds to "error-name1"
D203, # corresponds to "error-name2"
D405 # corresponds to "error-name3"
match = *.py
But I don't know if this is valid syntax in this file, or how to check.
So my questions are:
Is there a way to the original task I was given (ignore="error-name1" etc)?
If not, would my pseudo-solution work?
Really appreciate any help!

The pseudo-solution is fine. You can test it by just running
pydocstyle --config=name_of_config_file
in console. Note name_of_config_file must conform to one of the standards listed here.

How to filter command line output?

I need to filter the output of a command executed in network equipment in order to bring only the lines that match a text like '10.13.32.34'.
I created a python code that brings all the output of the command but I need only part of this.
I am using Python 3.7.3 running on a Windows 10 Pro.
The code I used is below and I need the filtering part because I am a network engineer without the basic notion of python programming. (till now...)
from steelscript.steelhead.core import steelhead
from steelscript.common.service import UserAuth
auth = UserAuth(username='admin', password='password')
sh = steelhead.SteelHead(host='em01r001', auth=auth)
from steelscript.cmdline.cli import CLIMode
sh.cli.exec_command("show connections optimized", mode=CLIMode.CONFIG)
output = (sh.cli.exec_command("show connections optimized"))

I have no idea what your output looks like, so used the text in your question as example data. Anyway, for a simple pattern such as what's shown in your question, you could do it like this:
output = '''\
I need to filter the output of a command executed
in network equipment in order to bring only the
lines that match a text like '10.13.32.34'. I
created a python code that brings all the output
of the command but I need only part of this.
I am using Python 3.7.3 running on a Windows 10
Pro.
The code I used is below and I need the filtering
part because I am a network engineer without the
basic notion of python programming. (till now...)
'''
# Filter the lines of text in output.
filtered = ''.join(line for line in output.splitlines()
if '10.13.32.34' in line)
print(filtered) # -> lines that match a text like '10.13.32.34'. I
You could do something similar for more complex patterns by using the re.search() function in Python's built-in regular expression re module. Using regular expressions is more complicated, but extremely powerful. There are many tutorials on using them, including one in Python's own documentation titled the Regular Expression HOWTO.

search for all variations of a package name with python

given a package name, i want to search known websites to see if that package exists. the problem is, some packages have capitalized letters, titled letters, small case letters and so on.
for instance, suppose i have a package called: lp-Solve-1.0.tar.gz
and generally, i can find this package on a site like this:
https://www.example.com/packages/l/lp-Solve/lp-Solve-1.0.tar.gz
other times, the website(s) may have the package named with a different case letter:
https://www.example.com/packages/L/lp-solve/lp-solve-1.0.tar.gz
or like this:
https://www.example.com/packages/L/LP-SOLVE/LP-SOLVE-1.0.tar.gz
As you can see here, there are many different ways the package can be named on a website.
I have no clue where to begin with this.
I was using this code:
#!/usr/bin/env python2.7
listOfWebsites = [ website1, website2, website3, website4, and so on ]
goodWebsites = []
for eachWebsite in listOfWebsites:
genURL = eachWebsite + "/" + packageName
res = requests.head(genUrl)
if res.ok:
goodWebsites.append(genURL)
but where Im stuck at is how to search the websites for all the possible names a package could be called.

Reading xbrl with python

I am trying find particular tag in an xbrl file. I originally tried using python-xbrl package, but it is not exactly what I want, so I based my code on the one available from the package.
Here's the part of xbrl that I am interested in
<us-gaap:LiabilitiesCurrent contextRef="eol_PE2035----1510-Q0008_STD_0_20150627_0" unitRef="iso4217_USD" decimals="-6" id="id_5025426_6FEF05CB-B19C-4D84-AAF1-79B431731049_1_24">65285000000</us-gaap:LiabilitiesCurrent>
<us-gaap:Liabilities contextRef="eol_PE2035----1510-Q0008_STD_0_20150627_0" unitRef="iso4217_USD" decimals="-6" id="id_5025426_6FEF05CB-B19C-4D84-AAF1-79B431731049_1_28">147474000000</us-gaap:Liabilities>
Here is the code
python-xbrl package is based on beautifulsoup4 and several other packages.
liabilities = xbrl.find_all(name=re.compile("(us-gaap:Liabilities)",
re.IGNORECASE | re.MULTILINE))
I get the value for us-gaap:LiabilitiesCurrent, but I want value for us-gaap:Liabilities.
Right now as soon as it finds a match it, stores it. But in many cases its the wrong match due to the tag format in xbrl. I believe I need to change re.compile() part to make it work correctly.

I'd be very wary about using this approach to parsing XBRL (or indeed, any XML with namespaces in it). "us-gaap:Liabilities" is a QName, consisting of a prefix ("us-gaap") and a local name ("Liabilities"). The prefix is just a shorthand for a full namespace URI such as "http://fasb.org/us-gaap/2015-01-31", which is defined by a namespace declaration, usually at the top of the document. If you look at the top of the document you'll see something like:
xmlns:us-gaap="http://fasb.org/us-gaap/2015-01-31"
This means that within the scope of this document, "us-gaap" is taken to mean that full namespace URI.
XML creators are free to use whatever prefixes they want, so there is no guarantee that the element will actually be called "us-gaap:Liabilities" across all documents that you encounter.
beautifulsoup4 has very limited support for namespaces, so I wouldn't recommend it as a starting point for building an XBRL processor. It may be worth taking a look at the Arelle project, which is a full XBRL processor, and will make it easier to do other tasks such as finding the labels and other information associated with facts in the taxonomy.

Try it with a $ dollar sign at the end to indicate not to match anything else following the dollar sign:
liabilities = xbrl.find_all(name=re.compile("(us-gaap:Liabilities$)",
re.IGNORECASE | re.MULTILINE))

Working with Parameters containing Escaped Characters in Python Config file

I have a config file that I'm reading using the following code:
import configparser as cp
config = cp.ConfigParser()
config.read('MTXXX.ini')
MT=identify_MT(msgtext)
schema_file = config.get(MT,'kbfile')
fold_text = config.get(MT,'fold')
The relevant section of the config file looks like this:
[536]
kbfile=MT536.kb
fold=:16S:TRANSDET\n
Later I try to find text contained in a dictionary that matches the 'fold' parameter, I've found that if I find that text using the following function:
def test (find_text)
return {k for k, v in dictionary.items() if find_text in v}
I get different results if I call that function in one of two ways:
test(fold_text)
Fails to find the data I want, but:
test(':16S:TRANSDET\n')
returns the results I know are there.
And, if I print the content of the dictionary, I can see that it is, as expected, shown as
:16S:TRANSDET\n
So, it matches when I enter the search text directly, but doesn't find a match when I load the same text in from a config file.
I'm guessing that there's some magic being applied here when reading/handling the \n character pattern in from the config file, but don't know how to get it to work the way I want it to.
I want to be able to parameterise using escape characters but it seems I'm blocked from doing this due to some internal mechanism.
Is there some switch I can apply to the config reader, or some extra parsing I can do to get the behavior I want? Or perhaps there's an alternate solution. I do find the configparser module convenient to use, but perhaps this is a limitation that requires an alternative, or even self-built module to lift data out of a parameter file.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.