I am building a simple Python API to search addresses in a database using free-form text. I would like to use the libpostal library to convert the inputs into something specific (structured) that makes lookups in my DB quick.
However, I am hoping to do this very lean on AWS or GCP (read: cheap), and since the libpostal trained model seems to be almost 4 GB because it covers the whole world, I was wondering if anyone knows a way to do this for a specific country only (the U.K. in my case).
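For reference, here is roughly what the structured output looks like with libpostal's Python bindings (pypostal); this is just a sketch with a made-up address, and it still loads the full worldwide model:

from postal.parser import parse_address

# Free-form input -> list of (value, label) pairs such as house_number, road, city, postcode
components = parse_address("10 Downing Street, London SW1A 2AA")
structured = {label: value for value, label in components}
print(structured)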
I hope you are feeling good and safe.
I'm working on a natural language processing project for my master's degree, and I need to translate
my local dialect to English. I noticed that Facebook's translation engine did very well with my local dialect.
So my question is: is there any way to use Facebook's translation service in my project, like an API or Python module that uses it?
Which language is your local language?
Facebook has many machine translation models, so it depends on how good it has to be and how much computing power you have. I am not sure if they offer their latest state-of-the-art ones that they use in their products as an independent translation tool as well.
First Option: Run full models locally
One way would be using one of their models on huggingface (see the "Generation" part):
https://huggingface.co/docs/transformers/model_doc/m2m_100#training-and-generation
They also have some easy-to-use pretrained models in their torch.hub module (but that probably doesn't cover your local language):
https://github.com/pytorch/fairseq/blob/main/examples/translation/README.md
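To illustrate the first option, here is a minimal sketch of running the m2m_100 checkpoint from the linked Hugging Face docs locally; the source-language code "ar" is just a placeholder for whatever code is closest to your dialect:

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "ar"  # placeholder source-language code
encoded = tokenizer("some text in your dialect", return_tensors="pt")
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("en"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))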
Second Option: APIs
As I said, it depends on what quality you need. You could try out some easy-to-use (non-Facebook) APIs and see how far that gets you; this is much easier and you can use them online:
e.g. https://libretranslate.com/
Or check out this comparison of APIs: https://rapidapi.com/collection/google-translate-api-alternatives
APIs are usually limited to a maximum number of characters/words/requests per month/day/minute so you'll have to see if that is enough for your case and if the quality is acceptable.
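As a rough sketch of what calling such an API looks like (using LibreTranslate's /translate endpoint; the public instance may require an API key and rate-limits requests):

import requests

resp = requests.post(
    "https://libretranslate.com/translate",
    json={"q": "text in your dialect", "source": "auto", "target": "en", "format": "text"},
    timeout=30,
)
print(resp.json()["translatedText"])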
Third Option: pip packages which use APIs
For example check out: https://pypi.org/project/deep-translator/
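For example, with deep-translator a call looks roughly like this (it wraps the Google Translate web endpoint, not Facebook's models):

from deep_translator import GoogleTranslator

translated = GoogleTranslator(source="auto", target="en").translate("text in your dialect")
print(translated)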
Fourth Option: pip wrapper packages which run translation locally
A great package which actually has some pretty strong Facebook MT models is: https://pypi.org/project/EasyNMT/ (it also has their strong m2m_100 models; see the sketch after this list)
More lightweight but probably not as strong: https://github.com/argosopentech/argos-translate
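A minimal EasyNMT sketch, as referenced above (the model name comes from its docs; whether it actually covers your dialect is a separate question):

from easynmt import EasyNMT

# m2m_100_418M is the smaller checkpoint; m2m_100_1.2B is stronger but needs more memory
model = EasyNMT("m2m_100_418M")
print(model.translate("text in your dialect", target_lang="en"))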
Conclusion:
Since I assume your local language is not supported by that many models, I would first try the fourth option (start with the biggest models and, if they don't work, try smaller ones).
If that doesn't work out, you can see whether you can get the APIs to work for your case.
If you have a lot of computing power and want to go a bit deeper you can run the full model inference locally using huggingface or fairseq.
I have collected a large Twitter dataset (>150GB) that is stored in some text files. Currently I retrieve and manipulate the data using custom Python scripts, but I am wondering whether it would make sense to use a database technology to store and query this dataset, especially given its size. If anybody has experience handling twitter datasets of this size, please share your experiences, especially if you have any suggestions as to what database technology to use and how long the import might take. Thank you
I recommend using a database for this, especially considering its size (this is without knowing anything about what the dataset holds). That being said, I suggest, now or for future questions of this nature, using the software recommendations site, plus adding more about what the dataset would look like.
As for suggesting a specific database, I recommend doing some research about what each does, but for something that just holds data with no relations any will do, and you could see a great query improvement vs. plain txt files, as queries can be cached and data is faster to retrieve due to how databases store and look up records, whether via hashed values or whatever they use.
Some popular databases:
MySQL, PostgreSQL - Relational databases (simple, fast, and easy to use/set up, but you need some knowledge of SQL)
MongoDB - NoSQL database (also easy to use and set up, with no SQL needed; it relies more on dicts to access the DB through the API. It is also memory-mapped, so it can be faster than a relational database, but you need enough RAM for the indexes.)
ZODB - Pure-Python NoSQL database (kind of like MongoDB but written in Python)
These are very light and brief explanations of each DB; be sure to do your research before using them, as they each have their pros and cons. Also, remember these are just a few of the many popular and highly used databases; there are also TinyDB, SQLite (comes with Python), and PickleDB, which are pure Python but generally for small applications.
My experience is mainly with PostgreSQL, TinyDB, and MongoDB, my favorites being MongoDB and PGSQL. For you, I'd look at either of those, but don't limit yourself; there's a slew of them, plus many drivers that help you write easier/less code if that's what you want. Remember, Google is your friend! And welcome to Stack Overflow!
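A minimal sketch of the MongoDB route, assuming a local instance and one JSON tweet per line (the file name is hypothetical):

import json
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
tweets = client["twitter"]["tweets"]

# Load one file's worth of tweets in a batch
with open("tweets_part_001.txt") as f:
    tweets.insert_many(json.loads(line) for line in f)

# Index the fields you query on so lookups don't scan the whole collection
tweets.create_index("user.screen_name")
print(tweets.count_documents({"user.screen_name": "some_user"}))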
Edit
If your dataset is and will remain fairly simple, just large, and you want to stick with txt files, consider pandas with a JSON or CSV format and library. It can greatly help and increase efficiency when querying/managing data like this from txt files, with less memory usage, as it won't always (or ever) need the entire dataset in memory.
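For example, a sketch of chunked reading with pandas, assuming line-delimited JSON tweets (so the whole 150GB never has to sit in memory at once):

import pandas as pd

matches = []
for chunk in pd.read_json("tweets_part_001.txt", lines=True, chunksize=100_000):
    # Keep only English tweets from this chunk (the "lang" field is standard in Twitter data)
    matches.append(chunk[chunk["lang"] == "en"])

result = pd.concat(matches)
print(len(result))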
You can try using any NoSQL DB. MongoDB would be a good place to start.
Can someone help me out with some suggestion for a full-text searching engine that supports Python?
Right now we have a MySQL database in place and I'd like to add the ability to have a full-text search engine index some of the text in some of the tables in this database. This text data would be used by a web application to search for the corresponding records in the database. For instance, index the customer name information in our customer table, full text search that with the web application to get the MySQL record for the customer.
I've looked (briefly) at Lucene, Swish-E, MongoDB, and a few others, but I'm not sure what would be a good choice for me considering a couple of things:
I'm not a Java guy (though I've been programming for a long time),
we only want to search a relatively small set of data,
we're looking to index text in a MySQL database,
and would like that index to be updated in semi-realtime.
Any hints, tips or pointers would be greatly appreciated!
Have a look at Whoosh. I've heard it doesn't scale up terribly well (maybe that's fixed now) but for small collections, it might be useful.
For a scalable solution, consider using Lucene with PyLucene or Jython.
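A minimal Whoosh sketch for the customer-name case from the question (the field names and values here are made up):

import os
from whoosh.fields import Schema, ID, TEXT
from whoosh.index import create_in
from whoosh.qparser import QueryParser

# Store the MySQL primary key alongside the indexed name so hits map back to DB rows
schema = Schema(customer_id=ID(stored=True, unique=True), name=TEXT(stored=True))
os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(customer_id="42", name="Acme Widgets Ltd")
writer.commit()

with ix.searcher() as searcher:
    query = QueryParser("name", ix.schema).parse("acme")
    for hit in searcher.search(query):
        print(hit["customer_id"], hit["name"])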
Building pylucene a few months ago was one of the most painful experiences I had. The project won't get any traction IMHO if it's so hard to build.
With a few other folks having the same itch to scratch, we started https://code.google.com/a/apache-extras.org/p/pylucene-extra/ to gather prebuilt pylucene and jcc eggs on several operating systems, Python versions and Java runtimes combos. It is not very active lately, though.
Whoosh might be a good fit, or you may want to have a look at Sphinx, ElasticSearch or HaystackSearch (CAVEAT: I did not work on any of these).
Or maybe try to access Solr via python (there are a few APIs), which might be much easier than using pylucene. Consider that lucene will still need a JVM to run, of course.
Since you don't have huge scalability needs, I would focus on simple usage and community support rather than performance and scale. Hope it helps.
Solr is a great wrapper to Lucene, it greatly simplifies things. It doesn't require any Java tinkering for most things, you just need to configure some XML files. It does run as another process, so this may complicate your deployment.
I have had great results with pysolr, but really, you could write your own python communication library since Solr uses REST, so it is really simple to send and retrieve data in either xml or json.
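For example, a minimal pysolr sketch, assuming a local Solr core named "customers" (the core name and fields are made up):

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/customers", timeout=10)

# Push (or re-push) documents whenever the corresponding MySQL rows change
solr.add([{"id": "42", "name": "Acme Widgets Ltd"}], commit=True)

# Full-text query against the indexed name field
for doc in solr.search("name:acme"):
    print(doc["id"], doc["name"])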
I'm about to start the development of a web analytics tool for an e-commerce website.
I'm going to log several different events, basically clicks on various elements of the page and page views.
These events carry metadata (username of the loggedin user, his country, his age, etc...) and the page itself carries other metadata (category, subcategory, product etc...).
My company would like something like an OLAP cube, to be able to answer questions like:
How many customers from country x visited category y?
How many pageviews for category x in January 2012?
My understanding is that I should use an OLAP engine to record these events, and then build a reporting interface to allow my colleagues to use it.
Am I right? Do you have advice on the engine and frontend/reporting tool I should use? I'm a Python programmer, so anything Python-friendly would be nice.
Thank you!
The main question is how big your cube is going to be and if you need an open source OLAP solution or not.
If you're dealing with big cubes and want room for future features, you might go for a real OLAP server. A few are open source (Mondrian) and others have a 'limited' community edition (Palo, icCube). The important point here is compatibility with MDX and XMLA, the de facto OLAP standards, so you can plug in different reporting tools and/or use existing libraries. To my understanding there is no Python XMLA library like there is for Java or .NET, so I'm not sure this is the way to go.
If your cubes are small, you can develop something on your own or go for other, quicker solutions, as Charlax's comment indicates.
As mentioned in the selected answer, it depends on your data volume. However, if you run into a case where a lightweight Python OLAP framework would be sufficient, you might try Cubes; the sources are on GitHub. It contains a SQL backend (others could be implemented as well) and provides a light HTTP OLAP server. An example of an application using it (a PHP front-end with the HTTP Slicer OLAP server as the backend) can be found here. It does not contain a visualization layer or support complex queries, though, but that is the trade-off for being small.
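A rough sketch of using Cubes, based on its tutorial; the store URL, model.json, cube name and dimension names are all assumptions you would adapt to your event schema:

from cubes import Workspace

workspace = Workspace()
workspace.register_default_store("sql", url="sqlite:///events.sqlite")
workspace.import_model("model.json")

browser = workspace.browser("pageviews")
result = browser.aggregate(drilldown=["country"])
print(result.summary)      # overall totals
for record in result:      # one row per country in the drilldown
    print(record)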
I need to develop a graph database in python (I would enjoy if anybody can join me in the development. I already have a bit of code, but I would gladly discuss about it).
I did my research on the internet. In Java, neo4j is a candidate, but I was not able to find anything about actual disk storage. In Python, there are many graph data models (see this pre-PEP proposal), but none of them satisfy my need to store and retrieve from disk.
I do know about triplestores, however. Triplestores are basically RDF databases, so a graph data model could be mapped to RDF and stored, but I am generally uneasy (mainly due to lack of experience) about this solution. One example is Sesame. The fact is that, in any case, you have to convert from the in-memory graph representation to the RDF representation and vice versa, unless the client code wants to hack on the RDF document directly, which is mostly unlikely. It would be like handling DB tuples directly, instead of creating an object.
What is the state-of-the-art for storage and retrieval (a la DBMS) of graph data in python, at the moment? Would it make sense to start developing an implementation, hopefully with the help of someone interested in it, and in collaboration with the proposers for the Graph API PEP ? Please note that this is going to be part of my job for the next months, so my contribution to this eventual project is pretty damn serious ;)
Edit: I also found directededge, but it appears to be a commercial product.
I have used both Jena, which is a Java framework, and Allegrograph (Lisp, Java, Python bindings). Jena has sister projects for storing graph data and has been around a long, long time. Allegrograph is quite good and has a free edition; I think I would suggest it because it is easy to install, free, fast, and you could be up and going in no time. The power you would get from learning a little RDF and SPARQL may very well be worth your while. If you know SQL already then you are off to a great start. Being able to query your graph using SPARQL would yield some great benefits to you. Serializing to RDF triples would be easy, and some of the file formats are super easy (NT for instance). I'll give an example. Let's say you have the following graph node-edge-node ids:
1 --2--> 3
3 --4--> 5
These are already in subject-predicate-object form, so just slap some URI notation on them, load them into the triple store and query at will via SPARQL. Here it is in NT format:
<http://mycompany.com#1> <http://mycompany.com#2> <http://mycompany.com#3> .
<http://mycompany.com#3> <http://mycompany.com#4> <http://mycompany.com#5> .
Now query for all nodes two hops from node 1:
SELECT ?node
WHERE {
  <http://mycompany.com#1> ?p1 ?o1 .
  ?o1 ?p2 ?node .
}
This would of course yield <http://mycompany.com#5>.
Another candidate would be Mulgara, written in pure Java. Since you seem more interested in Python though I think you should take a look at Allegrograph first.
I think the solution really depends on exactly what it is you want to do with the graph once you have managed to store it on disk/in database, and this is a little unclear in your question. However, a couple of things you might wish to consider are:
if you just want to persist the graph without using any of the features or properties you might expect from an RDBMS solution (such as ACID), then how about just pickling the objects into a flat file? Very rudimentary, but like I say, it depends on exactly what you want to achieve (a minimal sketch follows after this list).
ZODB is an object database for Python (a spin off from the Zope project I think). I can't say I've had much experience of it in a high performance environment, but bar a few restrictions does allow you to store Python objects natively.
if you wish to pursue RDF, there is an RDF Alchemy project which might help to alleviate some of your concerns about converting from your graph to RDF structures, and I think it has Sesame as part of its stack.
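Here is the minimal pickle sketch mentioned in the first point, assuming the graph is an ordinary in-memory Python structure such as a dict of adjacency lists:

import pickle

graph = {"a": ["b", "c"], "b": ["c"]}

# Persist to disk...
with open("graph.pickle", "wb") as f:
    pickle.dump(graph, f)

# ...and load it back later
with open("graph.pickle", "rb") as f:
    graph = pickle.load(f)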
There are some other persistence tools detailed on the python site which may be of interest, however I spent quite a while looking into this area last year, and ultimately I found there wasn't a native Python solution that met my requirements.
The most success I had was using MySQL with a custom ORM, and I posted a couple of relevant links in an answer to this question. Additionally, if you want to contribute to an RDBMS project, when I spoke to someone from Open Query about a graph storage engine for MySQL, they seemed interested in getting active participation in their project.
Sorry I can't give a more definitive answer, but I don't think there is one... If you do start developing your own implementation, I'd be interested to keep up-to-date with how you get on.
Greetings from your Serius Cybernetics Intelligent Agent!
Some useful links...
Programming the Semantic Web
SEMANTIC PROGRAMMING
RDFLib Python Library for RDF
Hmm, maybe you should take a look at CubicWeb
Regarding Neo4j, did you notice the existing Python bindings? As for the disk storage, take a look at this thread on the mailing list.
For graphdbs in Python, the Hypergraph Database Management System project was recently started on SourceForge by Maurice Ling.
Redland (http://librdf.org) is probably the solution you're looking for. It has Python bindings too.
RDFLib is a Python library that you can use. Using harschware's example:
Create a test.nt file like below:
<http://mycompany.com#1> <http://mycompany.com#2> <http://mycompany.com#3> .
<http://mycompany.com#3> <http://mycompany.com#4> <http://mycompany.com#5> .
To query for all nodes two hops from node 1 in RDFLib:
from rdflib import Graph

# Load the NT file into an in-memory graph
g = Graph()
g.parse("test.nt", format="nt")

# Find all nodes two hops away from node 1
qres = g.query(
    """SELECT ?node
       WHERE {
         <http://mycompany.com#1> ?p1 ?o1 .
         ?o1 ?p2 ?node .
       }"""
)
for row in qres:
    print(row.node)
Should return the answer <http://mycompany.com#5>.