This is quite a general question, though I’ll give the specific use case for context.
I'm using a FileMaker Pro database to record personal bird observations. For each bird on the national list, I have extracted quite a lot of base data by website scraping in Python, for example conservation status, geographical range, scientific name and so on. In day-to-day use of the database, this base data remains fixed and unchanging. However, once a year or so I will want to re-scrape the base data to pick up the most recent published information on status, range, and even changes in scientific name (that happens).
I know there are options such as PyFilemaker or bBox which should allow me to write to the FileMaker database from Python, so the update mechanism itself shouldn't be a problem.
It would be rather dangerous simply to overwrite all of last year's base data with the newly scraped data, and I'm looking for general advice on how best to make the changes visible before manually importing them. What I have in mind is to use pandas to generate a spreadsheet from the base data and highlight the changed cells, roughly as sketched below. Does that sound like a sensible way of doing it? I suspect that this may be a very standard requirement, and any comments on an approach that is straightforward to implement in Python would be most helpful.
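Here is the rough sketch (the file names, the species_code key and the column layout are invented for illustration; writing the styled sheet needs openpyxl):

import numpy as np
import pandas as pd

# Last year's base data and the freshly scraped data, keyed on a species code
old = pd.read_csv("base_data_2023.csv").set_index("species_code")
new = pd.read_csv("scraped_2024.csv").set_index("species_code")

# Align rows and columns so added/removed species and columns line up
old, new = old.align(new, join="outer")

# Cells whose value changed (treat NaN == NaN as unchanged)
changed = ~((old == new) | (old.isna() & new.isna()))

def highlight_changes(data):
    # Return a same-shaped DataFrame of CSS styles for the Excel writer
    return pd.DataFrame(
        np.where(changed, "background-color: yellow", ""),
        index=data.index, columns=data.columns,
    )

# Review the highlighted cells before importing anything into FileMaker
new.style.apply(highlight_changes, axis=None).to_excel("base_data_diff.xlsx")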
This is not a standard requirement, and there is no easy way of doing it. The best way to track changes is a source control system like Git, but that is not applicable to FileMaker Pro because its files are binary.
You can try your approach, or you can add the new records in FileMaker instead of updating the existing ones, flagging them as current or always using the most recent record.
There are some amazing people here, but you might want to take this to one of the FileMaker forums, as the FileMaker audience there is much larger than on SO.
Using DynamoDB.
An online store where you see a snapshot of the data for 20 items on the first screen, including price, picture, product name.
When you click on the product it reveals more data, such as description, any deals, more images, etc.
Would you make another call to the database for the second view or just get all the data on the first call?
When considering "best practices", I wouldn't consider one approach a clear winner over another. Which approach works best for you will depend on the specifics of your application. For example, are you in a high latency/low bandwidth environment? Are you on a mobile device? How often is this query run? Etc.
I would recommend starting by pulling back all the data you need in a single request because it's simpler. If you run into performance issues along the way, you can circle back and try fetching additional data on an as-needed basis.
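For illustration with boto3 (the table, key and attribute names here are made up): fetch the whole item in one call first, and only add a projection later if the listing payload turns out to be a problem.

import boto3

dynamodb = boto3.resource("dynamodb")
products = dynamodb.Table("Products")  # hypothetical table name

# Simple approach: one call that returns everything about the product
# (assumes the item exists)
item = products.get_item(Key={"product_id": "abc-123"})["Item"]

# If profiling later shows the listing payload is too heavy, trim it with a
# projection so the first screen only pulls the attributes it actually renders
summary = products.get_item(
    Key={"product_id": "abc-123"},
    ProjectionExpression="#id, #n, #p, #img",
    ExpressionAttributeNames={
        "#id": "product_id", "#n": "name", "#p": "price", "#img": "thumbnail",
    },
)["Item"]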
If there is a "best practice" here, it's to avoid pre-optimizing your code. Now if only I could take my own advice :)
I am building an image mosaic that detects whether the user's selected spots are taken or not.
My idea is to store the available_spots in a list, and I would just have to look through the list to check whether a spot is available or not.
The problem is that when I reload the website, available_spots also gets reset to an empty list,
so I want to store this array somewhere that is fast to read and write to.
I am currently thinking about a text file to store this, but that might take forever to read since the array length is over 1.4 million. Are there any other solutions that might be better?
You can't store the data in a file for a few reasons: (1) GAE standard won't let you, (2) the data is lost when your server is restarted, and (3) different instances will have different data.
Of course you can and should store the data in a database of your choice. Firestore is likely a better and cheaper option than SQL. It should be fast enough for you and you can implement caching if needed.
You might be able to store the data in a single Firestore entity and consider using compression if you are getting close to the max entity size.
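A rough sketch of that idea, assuming the google-cloud-firestore client (the collection, document and field names are made up): 1.4 million spots fit comfortably in one document if you store them as a compressed bitmap rather than a Python list.

import zlib
from google.cloud import firestore

db = firestore.Client()
doc = db.collection("mosaic").document("spots")  # hypothetical names

def save_spots(taken_flags):
    # taken_flags: iterable of ~1.4M booleans; pack into bytes and compress
    # (the ~1 MiB document limit applies to the compressed blob, not the raw list)
    raw = bytes(1 if taken else 0 for taken in taken_flags)
    doc.set({"bitmap": zlib.compress(raw)})

def load_spots():
    snapshot = doc.get()
    raw = zlib.decompress(snapshot.get("bitmap"))
    return [bool(b) for b in raw]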
If you want to store the data in a database you can use the "sqlite3" module (a minimal sketch follows below).
It is a simple database that gets stored in a single file, so you don't have to install a database server. It is great for small projects.
If you want to do more complex things with databases you can use "SQLAlchemy".
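Here is the minimal sqlite3 sketch (file, table and column names are just examples): one row per taken spot, with the primary key keeping lookups fast even with over a million spots.

import sqlite3

conn = sqlite3.connect("mosaic.db")  # stored as a single file on disk
conn.execute("CREATE TABLE IF NOT EXISTS taken_spots (spot_id INTEGER PRIMARY KEY)")

def take_spot(spot_id):
    with conn:  # commits automatically on success
        conn.execute("INSERT OR IGNORE INTO taken_spots (spot_id) VALUES (?)", (spot_id,))

def is_available(spot_id):
    row = conn.execute(
        "SELECT 1 FROM taken_spots WHERE spot_id = ?", (spot_id,)
    ).fetchone()
    return row is None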
I just need pointers on where to begin. I have some experience with Python, but nothing to brag about.
My end goal is to create a website that will allow multiple users to access it from different computers to fill a table with simple data, very similar to what Google Sheets allows, and then print it on a single sheet of paper. Ideally I want my program to intelligently determine the width of rows and columns so that the table looks decent and fills the page accordingly.
Right now all I need is some pointers on where to begin. For example, can I use SQL to create these tables and provide online functionality for users to access and fill the spreadsheet, and how do I go about printing it?
I know this is a very noob question, but I can't seem to find anything relevant just by using Google.
Thank you.
I don't think this is a very good Stack Overflow question because it is very broad and not programming specific. You are asking how to start a new software project, which in my opinion belongs more on Software Engineering Stack Exchange: https://softwareengineering.stackexchange.com/
Anyhow, how I would take on such a project:
1. First I would define the project scope. What is the functionality of the end product? What must it be able to do, and what not? Who are the end users and what do they expect? These are the so-called functional requirements.
2. In which way does the product deliver value? Is it fast, modifiable, distributed...? These are the so-called non-functional requirements.
3. Develop a basic software architecture based on the previous requirements, using patterns and tactics, and identify the different subsystems. Off the top of my head I would divide it into a frontend component (a web application), a backend component in your favourite language, and a database component for persistence.
4. Research possible languages and frameworks for each component, decide, and start coding!
For the 4th step I suggest you have a look at Python's Django framework, which includes all of this stuff out of the box.
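For instance, the shared table could be sketched as a Django model like this (the model and field names are placeholders); Django's admin, forms and ORM then give multiple users a web UI over it.

from django.db import models

class SheetCell(models.Model):
    # One record per filled-in cell of the shared table (placeholder schema)
    row = models.PositiveIntegerField()
    column = models.PositiveIntegerField()
    value = models.CharField(max_length=200, blank=True)
    updated_by = models.CharField(max_length=100)
    updated_at = models.DateTimeField(auto_now=True)

    class Meta:
        unique_together = ("row", "column")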
Well, I might be doing some work in Python that would end up with hundreds of thousands, maybe millions of rows of data, each with entries in maybe 50 or more columns. I want a way to keep track of this data and work with it. Since I also want to learn Microsoft Access, I suggest putting the data in there. Is there any easy way to do this? I also want to learn SAS, so that would be fine too. Or, is there some other program/method I should know for such a situation?
Thanks for any help!
Yes, you can talk to any ODBC database from Python, and that should include Access. You'll want the "windows" version of Python (which includes stuff like ODBC) from ActiveState.
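For example, here is a minimal sketch using the pyodbc package, which is one way to do ODBC from Python (the path, table and column names are invented, and it assumes the Microsoft Access ODBC driver is installed):

import pyodbc

# Hypothetical database path and schema
conn = pyodbc.connect(
    r"Driver={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\data\results.accdb;"
)
cur = conn.cursor()
cur.execute("CREATE TABLE results (id INT, col1 DOUBLE, col2 TEXT(50))")
cur.executemany(
    "INSERT INTO results (id, col1, col2) VALUES (?, ?, ?)",
    [(1, 3.14, "first"), (2, 2.72, "second")],  # replace with your generated rows
)
conn.commit()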
I'd be more worried about the "millions of rows" in Access, it can get a bit slow on retrieval if you're actually using it for relational tasks (that is, JOINing different tables together).
I'd also take a look at your 50 column tables — sometimes you need 50 columns but more often it means you haven't decomposed your data sufficiently to get it in normal form.
Finally, if you use Python to read and write an Access database I don't know if I'd count that as "learning Access". Really learning Access would be using the front end to create and maintain the database, creating forms and reports in Access (which would not be available from Python) and programming in Visual Basic for Applications (VBA).
I really like SQLite as an embedded database solution, especially from Python, and its SQL dialect is probably "purer" than Access's.
Since I also want to learn Microsoft Access,
Don't waste your time learning Access.
I suggest putting the data in there. Is there any easy way to do this?
ODBC.
Or, is there some other program/method I should know for such a situation?
SQLite and MySQL are far, far better choices than MS-Access.
I need to develop a graph database in Python (I would be glad if anybody joined me in the development; I already have a bit of code and would gladly discuss it).
I did my research on the internet. In Java, Neo4j is a candidate, but I was not able to find anything about actual disk storage. In Python, there are many graph data models (see this pre-PEP proposal), but none of them satisfies my need to store and retrieve from disk.
I do know about triplestores, however. Triplestores are basically RDF databases, so a graph data model could be mapped to RDF and stored, but I am generally uneasy (mainly due to lack of experience) about this solution. One example is Sesame. The fact is that you have to convert between the in-memory graph representation and the RDF representation in any case, unless the client code wants to hack on the RDF document directly, which is mostly unlikely. It would be like handling DB tuples directly, instead of creating an object.
What is the state of the art for storage and retrieval (a la DBMS) of graph data in Python at the moment? Would it make sense to start developing an implementation, hopefully with the help of someone interested in it, and in collaboration with the proposers of the Graph API PEP? Please note that this is going to be part of my job for the next months, so my contribution to this eventual project is pretty damn serious ;)
Edit: I also found DirectedEdge, but it appears to be a commercial product.
I have used both Jena, which is a Java framework, and AllegroGraph (Lisp, with Java and Python bindings). Jena has sister projects for storing graph data and has been around a long, long time. AllegroGraph is quite good and has a free edition; I think I would suggest it because it is easy to install, free, fast, and you could be up and running in no time. The power you would get from learning a little RDF and SPARQL may very well be worth your while. If you know SQL already, then you are off to a great start. Being able to query your graph using SPARQL would yield some great benefits. Serializing to RDF triples is easy, and some of the file formats are super easy (NT, for instance). I'll give an example. Let's say you have the following graph, as node-edge-node ids:
1 --2--> 3
3 --4--> 5
These are already in subject-predicate-object form, so just slap some URI notation on them, load them into the triple store, and query at will via SPARQL. Here it is in NT format:
<http://mycompany.com#1> <http://mycompany.com#2> <http://mycompany.com#3> .
<http://mycompany.com#3> <http://mycompany.com#4> <http://mycompany.com#5> .
Now query for all nodes two hops from node 1:
SELECT ?node
WHERE {
<http://mycompany.com#1> ?p1 ?o1 .
?o1 ?p2 ?node .
}
This would of course yield <http://mycompany.com#5>.
Another candidate would be Mulgara, written in pure Java. Since you seem more interested in Python, though, I think you should take a look at AllegroGraph first.
I think the solution really depends on exactly what you want to do with the graph once you have managed to store it on disk/in a database, and this is a little unclear in your question. However, a couple of things you might wish to consider are:
if you just want to persist the graph without using any of the features or properties you might expect from an RDBMS solution (such as ACID), then how about just pickling the objects into a flat file? Very rudimentary, but like I say, it depends on exactly what you want to achieve.
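A minimal sketch of that, assuming the graph is held as a plain adjacency dict (no transactions; the whole graph is rewritten on every save):

import pickle

# Toy in-memory representation: node -> list of (edge_id, target_node)
graph = {
    1: [(2, 3)],
    3: [(4, 5)],
}

with open("graph.pickle", "wb") as f:
    pickle.dump(graph, f)

with open("graph.pickle", "rb") as f:
    restored = pickle.load(f)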
ZODB is an object database for Python (a spin-off from the Zope project, I think). I can't say I've had much experience of it in a high-performance environment, but bar a few restrictions it does allow you to store Python objects natively.
if you wish to pursue RDF, there is an RDF Alchemy project which might help to alleviate some of your concerns about converting from your graph to RDF structures, and I think it has Sesame as part of its stack.
There are some other persistence tools detailed on the python site which may be of interest, however I spent quite a while looking into this area last year, and ultimately I found there wasn't a native Python solution that met my requirements.
The most success I had was using MySQL with a custom ORM, and I posted a couple of relevant links in an answer to this question. Additionally, if you want to contribute to an RDBMS project, when I spoke to someone from Open Query about a Graph storage engine for MySQL they seemed interested in getting active participation in their project.
Sorry I can't give a more definitive answer, but I don't think there is one... If you do start developing your own implementation, I'd be interested to keep up-to-date with how you get on.
Greetings from your Sirius Cybernetics Intelligent Agent!
Some useful links...
Programming the Semantic Web
SEMANTIC PROGRAMMING
RDFLib Python Library for RDF
Hmm, maybe you should take a look at CubicWeb
Regarding Neo4j, did you notice the existing Python bindings? As for the disk storage, take a look at this thread on the mailing list.
For graphdbs in Python, the Hypergraph Database Management System project was recently started on SourceForge by Maurice Ling.
Redland (http://librdf.org) is probably the solution you're looking for. It has Python bindings too.
RDFLib is a Python library that you can use. Using harschware's example:
Create a test.nt file like below:
<http://mycompany.com#1> <http://mycompany.com#2> <http://mycompany.com#3> .
<http://mycompany.com#3> <http://mycompany.com#4> <http://mycompany.com#5> .
To query for all nodes two hops from node 1 in RDFLib:
from rdflib import Graph

g = Graph()
g.parse("test.nt", format="nt")

qres = g.query(
    """SELECT ?node
       WHERE {
           <http://mycompany.com#1> ?p1 ?o1 .
           ?o1 ?p2 ?node .
       }"""
)

for row in qres:
    print(row.node)
Should return the answer <http://mycompany.com#5>.