I am using Orange (in Python) for some data mining tasks. More specifically, for clustering. Although I have gone through the tutorial and read most of the documentation, I still have a problem.
All the examples in the docs and tutorials assume that I already have a tab-delimited file with data in it. However, nothing explains how to go about creating a new table from scratch. For example, I want to create a table of word frequencies across different documents.
Maybe I am missing something, so if anyone has any insight it'd be appreciated.
Thanks
George
EDIT:
This is how I create my table:
# First construct the domain object (top row)
vars = []
for var in variables:
    vars.append(Orange.data.variable.Continuous(str(var)))
domain = Orange.data.Domain(vars, classed)  # The second argument indicates whether the last attribute is a class
# Add data rows assuming we have a matrix
t = Orange.data.Table(domain, matrix)
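For reference, here is a small, untested end-to-end sketch of how I call this; variables and matrix are just placeholder inputs standing in for my real word-frequency data:
import Orange

# Placeholder inputs: one column per word, one row per document
variables = ['apple', 'banana', 'cherry']
matrix = [[3.0, 0.0, 1.0],
          [0.0, 2.0, 5.0]]

vars = [Orange.data.variable.Continuous(str(v)) for v in variables]
domain = Orange.data.Domain(vars, False)  # False: the last variable is not a class attribute
table = Orange.data.Table(domain, matrix)
table.save('word_freq.tab')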
This took me hours to figure out. In Python, do this:
import Orange
List, Of, Column, Variables = [Orange.feature.Discrete(x) for x in ['What','Theyre','Called','AsStrings']]
Domain = Orange.data.Domain([List, Of, Column, Variables])
Table = Orange.data.Table(Domain)
Table.save('NewTable.tab')
I'd tell you what each bit of code does, but as of now I'm not really sure. It's funny that such a powerful toolkit should have such hard-to-understand documentation, but I suspect it's because its entire user base has doctorates.
The documentation is indeed insufficient if you ask me. This may not be the answer to the question, but it could be helpful to someone else. I tried for hours to create a Table using constructors and Domains and whatnot, just for an association rule mining task, and finally found out that the easiest way to create a table is simply to write your data to a file with the extension .tab or .basket and create a table from that.
Orange.data.Table("yourFile.basket")
Of course the structure of the file needs to be correct. See the provided example files located in the Orange package directory inside datasets/
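For reference, my understanding of Orange's native .tab format (double-check against the bundled example files) is that it has three header rows: attribute names, types (c for continuous, d for discrete), and optional flags such as class. A minimal file, with columns separated by tabs, could look like this:
length  width   color
c       c       d
                class
1.2     4.3     red
3.1     0.5     blue
You can then load it with:
data = Orange.data.Table("example.tab")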
I am looking for a more efficient way to update a bunch of model objects. Every night I have background jobs creating 'NCAABGame' objects from an API once the scores are final.
In the morning I have to update all the fields in the model with the stats that the API did not provide.
As of right now I get the stats from an Excel file, then copy and paste each update and run it like this:
NCAABGame.objects.filter(
    name__name='San Francisco', updated=False).update(
        field_goals=38,
        field_goal_attempts=55,
        three_points=11,
        three_point_attempts=24,
        ...
)
The other day there were 183 games; most days there are between 20 and 30, so doing it this way can be very time-consuming. I've looked into bulk_update and a few other things, but I can't really find a solution. I'm sure there is something simple that I'm just not seeing.
I appreciate any ideas or solutions you can offer.
If you need to manually update each object that gets created via the API anyway, I would not even bother going through Django. Just load your games from the API directly into Excel, make your edits there, and save as a CSV file. Then I would load the CSV directly into the database table, unless there is a specific reason the objects must be created via Django? You can of course do that with something like the code below. It could also be adapted to your current update-based approach, but then you first need to retrieve the correct pk of the object you want to update.
import csv

with open("my_data.csv", 'r') as my_data_file:
    reader = csv.reader(my_data_file)
    for row in reader:
        # get_or_create returns a tuple. 'created' is a boolean that indicates
        # if a new object was created or not, with game holding the object that
        # was either retrieved or created
        game, created = NCAABGame.objects.get_or_create(
            name=row[0],
            field_goals=row[1],
            field_goal_attempts=row[2],
            ....,
        )
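If you would rather keep your current update-based workflow, here is a rough sketch using bulk_update (available since Django 2.2). The file name, row layout, and the assumption of one pending game per team are all mine, so adjust to your data:
import csv

games = []
with open("morning_stats.csv") as f:  # hypothetical file name
    for row in csv.reader(f):
        # Assumes one row per game: team name followed by the stat columns,
        # and at most one not-yet-updated game per team.
        game = NCAABGame.objects.get(name__name=row[0], updated=False)
        game.field_goals = int(row[1])
        game.field_goal_attempts = int(row[2])
        game.three_points = int(row[3])
        game.three_point_attempts = int(row[4])
        game.updated = True
        games.append(game)

# One UPDATE statement per batch instead of one query per game.
NCAABGame.objects.bulk_update(
    games,
    ['field_goals', 'field_goal_attempts', 'three_points',
     'three_point_attempts', 'updated'],
)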
I have a SQLAlchemy class R which implements an m:n relation between two other classes A and B. So R has two integer columns, source_id and target_id, which hold the ids of the referenced instances, and two properties, source_obj and target_obj, which are defined via relationship(). It's more or less the same as described here in the documentation.
What I want to do is retrieve the referenced classes from R. I'm using SQLAlchemy 0.8 and tried to use the inspect() method on R.source_obj, but I only get back an InstrumentedAttribute, which does not seem to be of much help. At least I was not able to extract any useful information from it or find any documentation about it.
Any help would be much appreciated! How do I get A and B from R?
Try something like this. I'm also dealing with this and found no documentation; I think this can help you get started.
from sqlalchemy import inspect

i = inspect(model)
for relation in i.relationships:
    print(relation.direction.name)
    print(relation.remote_side)
    print(relation._reverse_property)
    dir(relation)
I spent the majority of the day working on this same problem, and I was able to write a list comprehension that takes in a table and then spits out a list of the table names which are connected via a relationship or a foreign key. You need to convert that string into a reference to the actual class, but otherwise it works just fine.
relationship_list = [str(list(column.remote_side)[0]).split('.')[0]
                     for column in inspect(table).relationships]
By removing the .split('.')[0], you can get a list of the actual columns which are referred to by the connections. The comprehension is pretty ugly, but it works. Hope this helps anyone else who is looking for the same thing I was!
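If you want the actual mapped classes rather than table-name strings, each relationship returned by the inspector also exposes its target mapper. A short sketch, with R standing in for the association class from the question:
from sqlalchemy import inspect

# Map each relationship attribute on R to the class it targets,
# e.g. {'source_obj': A, 'target_obj': B}
targets = {rel.key: rel.mapper.class_ for rel in inspect(R).relationships}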
First off, this is my first project using SQLAlchemy, so I'm still fairly new.
I am making a system to work with GTFS data. I have a back end that seems to be able to query the data quite efficiently.
What I am trying to do, though, is allow the GTFS files to update the database with new data. The problem I am hitting is pretty obvious: if the data I'm trying to insert is already in the database, we have a conflict on the uniqueness of the primary keys.
For efficiency reasons, I decided to use the following code for insertions, where model is the model class I would like to insert the data into, and data is a precomputed, cleaned list of dictionaries to insert.
for chunk in [data[i:i + chunk_size] for i in xrange(0, len(data), chunk_size)]:
    engine.execute(model.__table__.insert(), chunk)
There are two solutions that come to mind:
1. Find a way to do the insert such that if there is a collision, we don't care and don't fail. I believe the code above is using the TableClause, so I checked there first, hoping to find a suitable replacement or flag, with no luck.
2. Before we clean the data, get the list of existing primary key values, and if a given element matches on the primary keys, skip cleaning and inserting it. I found that I can get the PrimaryKeyConstraint from Table.primary_key, but I can't seem to get the Columns out of it, or find a way to query for only specific columns (in my case, the primary keys).
Either should be sufficient, if I can find a way to do it.
After looking into both of these for the last few hours, I can't seem to find either. I was hoping that someone might have done this previously, and point me in the right direction.
Thanks in advance for your help!
Update 1: There is a 3rd option I failed to mention above: purge all the data from the database and reinsert it. I would prefer not to do this, as even with small GTFS files there are easily hundreds of thousands of elements to insert, and that seems to take about half an hour, which would mean a lot of downtime for updates if this makes it to production.
With SQLAlchemy, you simply create a new instance of the model class, and merge it into the current session. SQLAlchemy will detect if it already knows about this object (from cache or the database) and will add a new row to the database if needed.
for row in chunk:
    newentry = model(**row)
    session.merge(newentry)
session.commit()
Also see this question for context: Fastest way to insert object if it doesn't exist with SQLAlchemy
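If merging row by row turns out to be too slow, the second option from the question is also workable. Here is a rough, untested sketch that pulls the primary-key columns off the mapped table and filters the data before the bulk insert:
# The primary-key Column objects of the mapped table
pk_cols = list(model.__table__.primary_key.columns)

# Query only those columns and collect the existing key tuples
existing = set(tuple(row) for row in session.query(*pk_cols))

# Drop rows whose primary-key values are already present
data = [row for row in data
        if tuple(row[c.name] for c in pk_cols) not in existing]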
Apologies for the longish description.
I want to run a transform on every doc in a large-ish MongoDB collection: about 10 million records, approximately 10 GB. Specifically, I want to apply a geoip transform to the ip field in every doc and either append the result to that doc or create a separate record linked to it by, say, id (the linking is not critical; I can just create a whole separate record). Then I want to count and group by, say, city (I do know how to do that last part).
The major reason I believe I can't use map-reduce is that I can't call out to the geoip library in my map function (or at least that's the constraint I believe exists).
So the central question is: how do I run through each record in the collection and apply the transform in the most efficient way?
Batching via limit/skip is out of the question, as it does a "table scan" and gets progressively slower.
Any suggestions?
Python or JS preferred, just because I have these geoip libraries, but code examples in other languages are welcome.
Since you have to go over each record, you'll do one full table scan anyway, so a simple cursor (find()), perhaps fetching only a few fields (_id, ip), should do it. The Python driver will do the batching under the hood, so you can give it a hint about the optimal batch size (batch_size) if the default is not good enough.
If you add a new field and the document no longer fits in its previously allocated space, Mongo will have to move it to another place, so you might be better off creating a new document.
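A minimal sketch of that with pymongo; the database and collection names and the lookup_city helper are placeholders, so swap in your actual geoip call:
from pymongo import MongoClient

client = MongoClient()
db = client['mydb']  # assumed database name

# Fetch only _id and ip, with an explicit batch size hint
cursor = db.records.find({}, {'ip': 1}).batch_size(1000)
for doc in cursor:
    city = lookup_city(doc['ip'])  # stand-in for your geoip library call
    db.geo.insert_one({'record_id': doc['_id'],
                       'ip': doc['ip'],
                       'city': city})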
Actually, I am also attempting another approach in parallel (as plan B), which is to use mongoexport with --csv to dump a large CSV file containing just the (id, ip) fields. The plan is then to use a Python script to do the geoip lookup and post the results back to Mongo as new docs, on which map-reduce can be run for the counts. Not sure whether this or the cursor is faster. We'll see.
I'm working on a Python application involving the use of a GTK table. The application requires that widgets of various sizes be added to a table dynamically. Because of this, I need to be able to ask the table what cells are in use (more accurately, NOT in use) so that I know where I can place a new widget without overlapping.
Based on the information in the reference manual (http://www.pygtk.org/docs/pygtk/) I have been unable to find a way to get that information directly from the table. The only other option I can think of is to create a map object that holds used cell information, and have it updated upon changes to the table.
Since I'm sure someone has dealt with this before me, and I would hope GTK would provide a better way, it seemed wise to ask around before trying to implement the map.
Help would be greatly appreciated.
This function should give you a set of the free cells in the table:
def free_cells(table):
    free_cells = set([(x, y)
                      for x in range(table.props.n_columns)
                      for y in range(table.props.n_rows)])

    def func(child):
        (l, r, t, b) = table.child_get(child, 'left-attach', 'right-attach',
                                       'top-attach', 'bottom-attach')
        used_cells = set([(x, y) for x in range(l, r) for y in range(t, b)])
        free_cells.difference_update(used_cells)

    table.foreach(func)
    return free_cells
It starts with a set of all the table cells, then iterates over the children of the table, removing the cells occupied by each child.
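For example, to drop a 1x1 widget into the first free cell (left-most column, then top-most row); widget here is whatever you want to add:
free = free_cells(table)
if free:
    x, y = min(free)  # left-most, then top-most free cell
    table.attach(widget, x, x + 1, y, y + 1)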
I'm the original poster, was logged into wrong account when posting question.
Anyway, this appears to be exactly what I'm looking for! Thanks Geoff!