bibtex to html with pybtex, python 3 - python

I want to take a file of one or more bibtex entries and output it as an html-formatted string. The specific style is not so important, but let's just say APA. Basically, I want the functionality of bibtex2html but with a Python API since I'm working in Django. A few people have asked similar questions here and here. I also found someone who provided a possible solution here.
The first issue I'm having is pretty basic, which is that I can't even get the above solutions to run. I keep getting errors similar to ModuleNotFoundError: No module named 'pybtex.database'; 'pybtex' is not a package. I definitely have pybtex installed and can make basic API calls in the shell no problem, but whenever I try to import pybtex.database.whatever or pybtex.plugin I keep getting ModuleNotFound errors. Is it maybe a python 2 vs python 3 thing? I'm using the latter.
The second issue is that I'm having trouble understanding the pybtex python API documentation. Specifically, from what I can tell it looks like the format_from_string and format_from_file calls are designed specifically for what I want to do, but I can't seem to get the syntax correct. Specifically, when I do
pybtex.format_from_file('foo.bib',style='html')
I get pybtex.plugin.PluginNotFound: plugin pybtex.style.formatting.html not found. I think I'm just not understanding how the call is supposed to work, and I can't find any examples of how to do it properly.

Here's a function I wrote for a similar use case--incorporating bibliographies into a website generated by Pelican.
from pybtex.plugin import find_plugin
from pybtex.database import parse_string
APA = find_plugin('pybtex.style.formatting', 'apa')()
HTML = find_plugin('pybtex.backends', 'html')()
def bib2html(bibliography, exclude_fields=None):
exclude_fields = exclude_fields or []
if exclude_fields:
bibliography = parse_string(bibliography.to_string('bibtex'), 'bibtex')
for entry in bibliography.entries.values():
for ef in exclude_fields:
if ef in entry.fields.__dict__['_dict']:
del entry.fields.__dict__['_dict'][ef]
formattedBib = APA.format_bibliography(bibliography)
return "<br>".join(entry.text.render(HTML) for entry in formattedBib)
Make sure you've installed the following:
pybtex==0.22.2
pybtex-apa-style==1.3

Related

python lxml xpath AttributeError (NoneType) with correct xpath and usually working

I am trying to migrate a forum to phpbb3 with python/xpath. Although I am pretty new to python and xpath, it is going well. However, I need help with an error.
(The source file has been downloaded and processed with tagsoup.)
Firefox/Firebug show xpath: /html/body/table[5]/tbody/tr[position()>1]/td/a[3]/b
(in my script without tbody)
Here is an abbreviated version of my code:
forumfile="morethread-alte-korken-fruchtweinkeller-89069-6046822-0.html"
XPOSTS = "/html/body/table[5]/tr[position()>1]"
t = etree.parse(forumfile)
allposts = t.xpath(XPOSTS)
XUSER = "td[1]/a[3]/b"
XREG = "td/span"
XTIME = "td[2]/table/tr/td[1]/span"
XTEXT = "td[2]/p"
XSIG = "td[2]/i"
XAVAT = "td/img[last()]"
XPOSTITEL = "/html/body/table[3]/tr/td/table/tr/td/div/h3"
XSUBF = "/html/body/table[3]/tr/td/table/tr/td/div/strong[position()=1]"
for p in allposts:
unreg=0
username = None
username = p.find(XUSER).text #this is where it goes haywire
When the loop hits user "tompson" / position()=11 at the end of the file, I get
AttributeError: 'NoneType' object has no attribute 'text'
I've tried a lot of try except else finallys, but they weren't helpful.
I am getting much more information later in the script such as date of post, date of user registry, the url and attributes of the avatar, the content of the post...
The script works for hundreds of other files/sites of this forum.
This is no encode/decode problem. And it is not "limited" to the XUSER part. I tried to "hardcode" the username, then the date of registry will fail. If I skip those, the text of the post (code see below) will fail...
#text of getpost
text = etree.tostring(p.find(XTEXT),pretty_print=True)
Now, this whole error would make sense if my xpath would be wrong. However, all the other files and the first numbers of users in this file work. it is only this "one" at position()=11
Is position() uncapable of going >10 ? I don't think so?
Am I missing something?
Question answered!
I have found the answer...
I must have been very tired when I tried to fix it and came here to ask for help. I did not see something quite obvious...
The way I posted my problem, it was not visible either.
the HTML I downloaded and processed with tagsoup had an additional tag at position 11... this was not visible on the website and screwed with my xpath
(It probably is crappy html generated by the forum in combination with tagsoups attempt to make it parseable)
out of >20000 files less than 20 are afflicted, this one here just happened to be the first...
additionally sometimes the information is in table[4], other times in table[5]. I did account for this and wrote a function that will determine the correct table. Although I tested the function a LOT and thought it working correctly (hence did not inlcude it above), it did not.
So I made a better xpath:
'/html/body/table[tr/td[#width="20%"]]/tr[position()>1]'
and, although this is not related, I ran into another problem with unxpected encoding in the html file (not utf-8) which was fixed by adding:
parser = etree.XMLParser(encoding='ISO-8859-15')
t = etree.parse(forumfile, parser)
I am now confident that after adjusting for strange additional and multiple , and tags my code will work on all files...
Still I will be looking into lxml.html, as I mentioned in the comment, I have never used it before, but if it is more robust and may allow for using the files without tagsoup, it might be a better fit and save me extensive try/except statements and loops to fix the few files screwing with my current script...

Python xgoogle library - site specific search

I am using the xgoogle python library to try to search as specific site. The code works for me when I do not use the "site:" indicator in the keyword search. If I do used it, the result set is empty. Does anyone have any thoughts how to get the code below to work?
from xgoogle.search import GoogleSearch, SearchError
gs = GoogleSearch("site:reddit.com fun")
gs.results_per_page = 50
results = gs.get_results()
print results
for res in results:
print res.title.encode("utf8")
print
A simple url with the "q" parameter (e.g. "http://www.google.com/search?&q=site:reddit.com+fun") works, so I assume it's some other problem.
If you are using pkrumins/xgoogle, a quick (and dirty) fix is to modify search.py line 240 as follows:
if not title or not url:
This is because Google changes their SERP layout, which breaks the _extract_description() function.
You can also take a look at this fork.
Put keyword before site:XX. It works for me.

MX Record lookup and check

I need to create a tool that will check a domains live mx records against what should be expected (we have had issues with some of our staff fiddling with them and causing all incoming mail to redirected into the void)
Now I won't lie, I'm not a competent programmer in the slightest! I'm about 40 pages into "dive into python" and can read and understand the most basic code. But I'm willing to learn rather than just being given an answer.
So would anyone be able to suggest which language I should be using?
I was thinking of using python and starting with something along the lines of using 0s.system() to do a (dig +nocmd domain.com mx +noall +answer) to pull up the records, I then get a bit confused about how to compare this to a existing set of records.
Sorry if that all sounds like nonsense!
Thanks
Chris
With dnspython module (not built-in, you must pip install it):
>>> import dns.resolver
>>> domain = 'hotmail.com'
>>> for x in dns.resolver.resolve(domain, 'MX'):
... print(x.to_text())
...
5 mx3.hotmail.com.
5 mx4.hotmail.com.
5 mx1.hotmail.com.
5 mx2.hotmail.com.
Take a look at dnspython, a module that should do the lookups for you just fine without needing to resort to system calls.
the above solutions are correct. some things I would like to add and update.
the dnspython has been updated to be used with python3 and it has superseeded the dnspython3 library so use of dnspython is recommended
the domain will strictly take in the domain and nothing else.
for example: dnspython.org is valid domain, not www.dnspython.org
here's a function if you want to get the mail servers for a domain.
def get_mx_server(domain: str = "dnspython.org") -> str:
mail_servers = resolver.resolve(domain, 'MX')
mail_servers = list(set([data.exchange.to_text()
for data in mail_servers]))
return ",".join(mail_servers)

How can I use Microsoft Word's spelling/grammar checker programmatically?

I want to process a medium to large number of text snippets using a spelling/grammar checker to get a rough approximation and ranking of their "quality." Speed is not really of concern either, so I think the easiest way is to write a script that passes off the snippets to Microsoft Word (2007) and runs its spelling and grammar checker on them.
Is there a way to do this from a script (specifically, Python)? What is a good resource for learning about controlling Word programmatically?
If not, I suppose I can try something from Open Source Grammar Checker (SO).
Update
In response to Chris' answer, is there at least a way to a) open a file (containing the snippet(s)), b) run a VBA script from inside Word that calls the spelling and grammar checker, and c) return some indication of the "score" of the snippet(s)?
Update 2
I've added an answer which seems to work, but if anyone has other suggestions I'll keep this question open for some time.
It took some digging, but I think I found a useful solution. Following the advice at http://www.nabble.com/Edit-a-Word-document-programmatically-td19974320.html I'm using the win32com module (if the SourceForge link doesn't work, according to this Stack Overflow answer you can use pip to get the module), which allows access to Word's COM objects. The following code demonstrates this nicely:
import win32com.client, os
wdDoNotSaveChanges = 0
path = os.path.abspath('snippet.txt')
snippet = 'Jon Skeet lieks ponies. I can haz reputashunz? '
snippet += 'This is a correct sentence.'
file = open(path, 'w')
file.write(snippet)
file.close()
app = win32com.client.gencache.EnsureDispatch('Word.Application')
doc = app.Documents.Open(path)
print "Grammar: %d" % (doc.GrammaticalErrors.Count,)
print "Spelling: %d" % (doc.SpellingErrors.Count,)
app.Quit(wdDoNotSaveChanges)
which produces
Grammar: 2
Spelling: 3
which match the results when invoking the check manually from Word.

Is it possible to encode (<em>asdf</em>) in python Textile?

I'm using python Textile to store markup in the database. I would like to yield the following HTML snippet:
(<em>asdf</em>)
The obvious doesn't get encoded:
(_asdf_) -> <p>(_asdf_)</p>
The following works, but yields an ugly space:
( _asdf_) -> <p>( <em>asdf</em>)
Am I missing something obvious or is this just not possible using python Textile?
It's hard to say if this is a bug or not; in the form on the Textile website, (_foo_) works as you want, but in the downloadable PHP implementation, it doesn't.
You should be able to do this:
([_asdf_]) -> <p>(<em>asdf</em>)</p>
However, this doesn't work, which is a bug in py-textile. You either need to use this:
(]_asdf_])
or patch textile.py by changing line 918 (in the Textile.span() method) to:
(?:^|(?<=[\s>%(pnct)s])|([{[]))
(the difference is in the final group; the brackets are incorrectly reversed.)
You could also change the line to:
(?:^|(?<=[\s>(%(pnct)s])|([{[]))
(note the added parenthesis) to get the behavior you desire for (_foo_), but I'm not sure if that would break anything else.
Follow up: the latest version of the PHP Textile class does indeed make a similar change to the one I suggested.

Categories

Resources