search for all variations of a package name with python

search for all variations of a package name with python - python

given a package name, i want to search known websites to see if that package exists. the problem is, some packages have capitalized letters, titled letters, small case letters and so on.
for instance, suppose i have a package called: lp-Solve-1.0.tar.gz
and generally, i can find this package on a site like this:
https://www.example.com/packages/l/lp-Solve/lp-Solve-1.0.tar.gz
other times, the website(s) may have the package named with a different case letter:
https://www.example.com/packages/L/lp-solve/lp-solve-1.0.tar.gz
or like this:
https://www.example.com/packages/L/LP-SOLVE/LP-SOLVE-1.0.tar.gz
As you can see here, there are many different ways the package can be named on a website.
I have no clue where to begin with this.
I was using this code:
#!/usr/bin/env python2.7
listOfWebsites = [ website1, website2, website3, website4, and so on ]
goodWebsites = []
for eachWebsite in listOfWebsites:
genURL = eachWebsite + "/" + packageName
res = requests.head(genUrl)
if res.ok:
goodWebsites.append(genURL)
but where Im stuck at is how to search the websites for all the possible names a package could be called.

Related

How to Extract Versions from Software Packages

I'm trying to extract the version number from software packages hosted on SourceForge based on this Stack Overflow post. Specifically, I'm using the Release API and the "best_release.json" call. I have the following examples:
7-zip: https://sourceforge.net/projects/sevenzip/best_release.json
KeePass: https://sourceforge.net/projects/keepass/best_release.json
OpenOffice.org:
https://sourceforge.net/projects/openofficeorg.mirror/best_release.json
Using the following code snippet:
import requests
"""
Un/comment the following lines to change the project name and test
different responses.
"""
proj = "keepass"
# proj = "sevenzip"
# proj = "openofficeorg.mirror"
r = requests.get(f'https://sourceforge.net/projects/{proj}/best_release.json')
json_resp = r.json()
print(json_resp['release']['filename'])
I receive the respective results for each package:
7-Zip: /7-Zip/22.00/7z2200-linux-x86.tar.xz
KeePass: /KeePass 2.x/2.51.1/KeePass-2.51.1.zip
Openoffice.org: /extended/iso/en/OOo_3.3.0_Win_x86_install_en-US_20110219.iso
I'm wondering how I can extract the file versions from these disparate packages. Looking at the results, one can see that there are different naming conventions. For example, 7-Zip puts the file version as "22.00" in the second directory level. KeePass, however, puts it in the second directory level as well as the filename itself. OpenOffice.org puts it inside the filename.
Is there a way to do some sort of fuzzy match that can attempt to extract a "best guess" file version given a filename?
I thought of using regular expressions, re. For example, I can use the (\d+) capture group to capture one or more digits, as demonstrated here. However, this would also capture text such as "x86," which I don't want. I just desire some text that looks closest to a version number, but I'm unsure how to do this.

Get all names from wikipedia-site?

i try to extract all names from this site -
https://en.wikipedia.org/wiki/Category:Masculine_given_names
(and i want to have all names which are listed on this site and the following pages - but also the subcategories which are listed at the top like Afghan masculine given names, African masculine given names, etc.)
I tried this with the following code:
import pywikibot
from pywikibot import pagegenerators
site = pywikibot.Site()
cat = pywikibot.Category(site,'Category:Masculine_given_names')
gen = pagegenerators.CategorizedPageGenerator(cat)
for idx,page in enumerate(gen):
text = page.text
print(idx)
print(text)
Which generally works fine and gave me at least the detail-page of a single name page. But how can i get all the names / from all the subpages on this site but also from the subcategories?

How to find subcategories and subpages on wikipedia using pywikibot?
This is already answered here using Category methods but you can also use pagegenerators CategorizedPageGenerator function. All you need is setting the recurse option:
>>> gen = pagegenerators.CategorizedPageGenerator(cat, recurse=True)
Refer the documentation for it. You may also include pagegenerators options within your script in such a way decribed in this example and call your script with -catr option:
pwb.py <yourscript> -catr:Masculine_given_names

bibtex to html with pybtex, python 3

I want to take a file of one or more bibtex entries and output it as an html-formatted string. The specific style is not so important, but let's just say APA. Basically, I want the functionality of bibtex2html but with a Python API since I'm working in Django. A few people have asked similar questions here and here. I also found someone who provided a possible solution here.
The first issue I'm having is pretty basic, which is that I can't even get the above solutions to run. I keep getting errors similar to ModuleNotFoundError: No module named 'pybtex.database'; 'pybtex' is not a package. I definitely have pybtex installed and can make basic API calls in the shell no problem, but whenever I try to import pybtex.database.whatever or pybtex.plugin I keep getting ModuleNotFound errors. Is it maybe a python 2 vs python 3 thing? I'm using the latter.
The second issue is that I'm having trouble understanding the pybtex python API documentation. Specifically, from what I can tell it looks like the format_from_string and format_from_file calls are designed specifically for what I want to do, but I can't seem to get the syntax correct. Specifically, when I do
pybtex.format_from_file('foo.bib',style='html')
I get pybtex.plugin.PluginNotFound: plugin pybtex.style.formatting.html not found. I think I'm just not understanding how the call is supposed to work, and I can't find any examples of how to do it properly.

Here's a function I wrote for a similar use case--incorporating bibliographies into a website generated by Pelican.
from pybtex.plugin import find_plugin
from pybtex.database import parse_string
APA = find_plugin('pybtex.style.formatting', 'apa')()
HTML = find_plugin('pybtex.backends', 'html')()
def bib2html(bibliography, exclude_fields=None):
exclude_fields = exclude_fields or []
if exclude_fields:
bibliography = parse_string(bibliography.to_string('bibtex'), 'bibtex')
for entry in bibliography.entries.values():
for ef in exclude_fields:
if ef in entry.fields.__dict__['_dict']:
del entry.fields.__dict__['_dict'][ef]
formattedBib = APA.format_bibliography(bibliography)
return "<br>".join(entry.text.render(HTML) for entry in formattedBib)
Make sure you've installed the following:
pybtex==0.22.2
pybtex-apa-style==1.3

Reading xbrl with python

I am trying find particular tag in an xbrl file. I originally tried using python-xbrl package, but it is not exactly what I want, so I based my code on the one available from the package.
Here's the part of xbrl that I am interested in
<us-gaap:LiabilitiesCurrent contextRef="eol_PE2035----1510-Q0008_STD_0_20150627_0" unitRef="iso4217_USD" decimals="-6" id="id_5025426_6FEF05CB-B19C-4D84-AAF1-79B431731049_1_24">65285000000</us-gaap:LiabilitiesCurrent>
<us-gaap:Liabilities contextRef="eol_PE2035----1510-Q0008_STD_0_20150627_0" unitRef="iso4217_USD" decimals="-6" id="id_5025426_6FEF05CB-B19C-4D84-AAF1-79B431731049_1_28">147474000000</us-gaap:Liabilities>
Here is the code
python-xbrl package is based on beautifulsoup4 and several other packages.
liabilities = xbrl.find_all(name=re.compile("(us-gaap:Liabilities)",
re.IGNORECASE | re.MULTILINE))
I get the value for us-gaap:LiabilitiesCurrent, but I want value for us-gaap:Liabilities.
Right now as soon as it finds a match it, stores it. But in many cases its the wrong match due to the tag format in xbrl. I believe I need to change re.compile() part to make it work correctly.

I'd be very wary about using this approach to parsing XBRL (or indeed, any XML with namespaces in it). "us-gaap:Liabilities" is a QName, consisting of a prefix ("us-gaap") and a local name ("Liabilities"). The prefix is just a shorthand for a full namespace URI such as "http://fasb.org/us-gaap/2015-01-31", which is defined by a namespace declaration, usually at the top of the document. If you look at the top of the document you'll see something like:
xmlns:us-gaap="http://fasb.org/us-gaap/2015-01-31"
This means that within the scope of this document, "us-gaap" is taken to mean that full namespace URI.
XML creators are free to use whatever prefixes they want, so there is no guarantee that the element will actually be called "us-gaap:Liabilities" across all documents that you encounter.
beautifulsoup4 has very limited support for namespaces, so I wouldn't recommend it as a starting point for building an XBRL processor. It may be worth taking a look at the Arelle project, which is a full XBRL processor, and will make it easier to do other tasks such as finding the labels and other information associated with facts in the taxonomy.

Try it with a $ dollar sign at the end to indicate not to match anything else following the dollar sign:
liabilities = xbrl.find_all(name=re.compile("(us-gaap:Liabilities$)",
re.IGNORECASE | re.MULTILINE))

Incremental Saves

I am trying to write up a script on incremental saves but there are a few hiccups that I am running into.
If the file name is "aaa.ma", I will get the following error - ValueError: invalid literal for int() with base 10: 'aaa' # and it does not happens if my file is named "aaa_0001"
And this happens if I wrote my code in this format: Link
As such, to rectify the above problem, I input in an if..else.. statement - Link, it seems to have resolved the issue on hand, but I was wondering if there is a better approach to this?
Any advice will be greatly appreciated!

Use regexes for better flexibility especially for file rename scripts like these.
In your case, since you know that the expected filename format is "some_file_name_<increment_number>", you can use regexes to do the searching and matching for you. The reason we should do this is because people/users may are not machines, and may not stick to the exact naming conventions that our scripts expect. For example, the user may name the file aaa_01.ma or even aaa001.ma instead of aaa_0001 that your script currently expects. To build this flexibility into your script, you can use regexes. For your use case, you could do:
# name = lastIncFile.partition(".")[0] # Use os.path.split instead
name, ext = os.path.splitext(lastIncFile)
import re
match_object = re.search("([a-zA-Z]*)_*([0-9]*)$", name)
# Here ([a-zA-Z]*) would be group(1) and would have "aaa" for ex.
# and ([0-9]*) would be group(2) and would have "0001" for ex.
# _* indicates that there may be an _, or not.
# The $ indicates that ([0-9]*) would be the LAST part of the name.
padding = 4 # Try and parameterize as many components as possible for easy maintenance
default_starting = 1
verName = str(default_starting).zfill(padding) # Default verName
if match_object: # True if the version string was found
name = match_object.group(1)
version_component = match_object.group(2)
if version_component:
verName = str(int(version_component) + 1).zfill(padding)
newFileName = "%s_%s.%s" % (name, verName, ext)
incSaveFilePath = os.path.join(curFileDir, newFileName)
Check out this nice tutorial on Python regexes to get an idea what is going on in the above block. Feel free to tweak, evolve and build the regex based on your use cases, tests and needs.
Extra tips:
Call cmds.file(renameToSave=True) at the beginning of the script. This will ensure that the file does not get saved over itself accidentally, and forces the script/user to rename the current file. Just a safety measure.
If you want to go a little fancy with your regex expression and make them more readable, you could try doing this:
match_object = re.search("(?P<name>[a-zA-Z]*)_*(?P<version>[0-9]*)$", name)
name = match_object.group('name')
version_component = match_object('version')
Here we use the ?P<var_name>... syntax to assign a dict key name to the matching group. Makes for better readability when you access it - mo.group('version') is much more clearer than mo.group(2).
Make sure to go through the official docs too.
Save using Maya's commands. This will ensure Maya does all it's checks while and before saving:
cmds.file(rename=incSaveFilePath)
cmds.file(save=True)
Update-2:
If you want space to be checked here's an updated regex:
match_object = re.search("(?P<name>[a-zA-Z]*)[_ ]*(?P<version>[0-9]*)$", name)
Here [_ ]* will check for 0 - many occurrences of _ or (space). For more regex stuff, trying and learn on your own is the best way. Check out the links on this post.
Hope this helps.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.