Determine Categorical Hierarchy Level of Freebase MID Value - python

After using the Google Cloud Vision API, I received MID values in the format of /m/XXXXXXX (not necessarily 7 characters at the end though). What I would like to do is determine how specific one MID value is compared to the others. Essentially how broad vs. refined a term is. For example, the term Vehicle might be level 1 while the term Van might be level 2.
I have tried to run the MID values through the Google Knowledge Graph API but unfortunately these MIDs are not in that database and return no information. For example, a few MIDs and descriptions I have are as follows:
/m/07s6nbt = text
/m/03gq5hm = font
/m/01n5jq = poster
/m/067408 = album cover
My initial thought on why these MIDs return nothing in the Knowledge Graph API is that they were not carried over after the discontinuation of Freebase. I understand that Google provides an RDF dump of Freebase but I'm not sure how to read that data in Python and use it to determine the depth of a MID in the hierarchy.
If it's not possible to determine the category level of the MID value, the number of connections a term had would also be an appropriate proxy, assuming broader terms have more connections to other terms than more refined terms. I found an article that discusses the number of "edges" a MID has, which I believe means the number of connections. However, they do some converting of MID values to long values and use various scripts that keep giving me numerous errors in Python. I was hoping for a simple table with MID values in one column and the number of connections in another, but I'm lost in their code, the value conversions, and the Python errors.
If you have any suggestions for easily determining the number of connections a MID has or its hierarchical level, it would be greatly appreciated. Thank you!

Those MIDs look like they're for pretty common things, so I'm surprised they're not in the Knowledge Graph. Do you prefix the MIDs to form URIs?
"kg": "http://g.co/kg"
"kg:/m/067408"
Freebase and the Knowledge Graph aren't organized as hierarchies, so your level-finding idea doesn't really work. I'm also dubious about your idea that degree (i.e. # of edges) correlates with broader vs. narrower, but you should be able to use the dump that you've found to test it.
The Freebase ExQ Data Dump that you found is super confusing because they rename Freebase types as topics (not to be confused with Freebase topics), but I think their freebase-nodes-in-out-name.tsv contains the information you're looking for (# of edges == degree). You can use either the inDegree, the outDegree, or the sum of the two.
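For what it's worth, here is a minimal sketch of how that file could be turned into the simple two-column table you're after, assuming it loads as a plain tab-separated file; the column names below are my guess, so check the actual header first:

import pandas as pd

# Column names are assumptions; inspect the first line of the TSV and adjust.
df = pd.read_csv("freebase-nodes-in-out-name.tsv", sep="\t",
                 names=["node", "inDegree", "outDegree", "name"])
df["degree"] = df["inDegree"] + df["outDegree"]   # total number of connections

# 'node' holds their encoded integer form of the MID, so convert your MIDs
# with a compatible encoder first (see the conversion sketch further down).
lookup = df.set_index("node")["degree"]
print(lookup.head())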
Their MID to integer conversion code doesn't look right to me (and doesn't match the comments) but you'll need to use a compatible implementation to match up with what they've done.
Looking at
/m/02w0000 "Clibadium subsessilifolium"#en
it's encoded as
48484848875048
or
48 48 48 48 87 50 48
0 0 0 0 w 2 0
So, just take the ASCII values from right to left and concatenate them left to right. Confusing, inefficient, and wrong all in one! (It's actually a base 36 (or 37?) coding)
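If it helps, here is a minimal sketch of that conversion as I read it from the example; the function name is mine, and note that the worked example implies the character is upper-cased first (87 is the ASCII code for 'W', while 'w' is 119), so verify against a few known rows before relying on it:

def mid_to_long(mid):
    suffix = mid.rsplit('/', 1)[-1]          # '/m/02w0000' -> '02w0000'
    # Reverse the characters, then concatenate their decimal ASCII codes.
    return int(''.join(str(ord(c)) for c in reversed(suffix.upper())))

print(mid_to_long('/m/02w0000'))             # 48484848875048, matching the dump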

Related

Trying to work out how to produce a synthetic data set using python or javascript in a repeatable way

I have a reasonably technical background and have done a fair bit of node development, but I’m a bit of a novice when it comes to statistics and a complete novice with python, so any advice on a synthetic data generation experiment I’m trying my hand at would be very welcome :)
I’ve set myself the problem of generating some realistic(ish) sales data for a bricks and mortar store (old school, I know).
I’ve got a smallish real-world transactional dataset (~500k rows) from the internet that I was planning on analysing with a tool of some sort, to provide the input to a PRNG.
Hopefully if I explain my thinking across a couple of broad problem domains, someone(s?!) can help me:
PROBLEM 1
I think I should be able to use the real data I have to either:
a) generate a probability distribution curve or
b) identify an ‘out of the box’ distribution that’s the closest match to the actual data
I’m assuming there’s a tool or library in Python or Node that will do one or both of those things if fed the data and, further, give me the right values to plug into a PRNG to produce a series of data points that are not only distributed like the original's, but also fall within the same sort of ranges.
I suspect b) would be less expensive computationally and, also, better supported by tools - my need for absolute ‘realness’ here isn’t that high - it’s only an experiment :)
Which leads me to…
QUESTION 1: What tools could I use to do the analysis and generate the data points? As I said, my maths is ok, but my statistics isn't great (and the docs for the tools I’ve seen are a little dense and, to me at least, somewhat impenetrable), so some guidance on using the tool would also be welcome :)
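To make QUESTION 1 a bit more concrete, here's the kind of thing I imagine, sketched with SciPy; the column names and the shortlist of candidate distributions are just placeholders on my part, not something I'm confident about:

import pandas as pd
import numpy as np
from scipy import stats

df = pd.read_csv("transactions.csv", parse_dates=["InvoiceDate"])
# Seconds between consecutive transactions, as the quantity to model.
gaps = df["InvoiceDate"].sort_values().diff().dt.total_seconds().dropna()
gaps = gaps[gaps > 0]

# Fit a few off-the-shelf distributions and keep the closest by KS statistic.
best_name, best_dist, best_params, best_ks = None, None, None, np.inf
for name, dist in {"lognorm": stats.lognorm, "gamma": stats.gamma,
                   "expon": stats.expon}.items():
    params = dist.fit(gaps)
    ks = stats.kstest(gaps, name, args=params).statistic
    if ks < best_ks:
        best_name, best_dist, best_params, best_ks = name, dist, params, ks

print("closest fit:", best_name)
# Draw synthetic inter-arrival times from the fitted distribution.
synthetic_gaps = best_dist.rvs(*best_params, size=10_000, random_state=42)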
And then there’s my next, I think more fundamental, problem, which I’m not even sure how to approach…
PROBLEM 2
While I think the approach above will work well for generating timestamps for each row, I’m going round in circles a little bit on how to model what the transaction is actually for.
I’d like each transaction to be relatable to a specific product from a list of products.
Now the products don’t need to be ‘real’ (I reckon I can just use something like Faker to generate random words for the brand, product name etc), but ideally the distribution of what is being purchased should be a bit real-ey (if that’s a word).
My first thought was just to do the same analysis for price as I’m doing for timestamp and then ‘make up’ a product for each price that’s generated, but I discarded that for a couple of reasons: it might be consistent ‘within’ a produced dataset, but not ‘across’ datasets, and I imagine on largish sets it would double count quite a bit.
So my next thought was I would create some sort of lookup table with a set of pre-defined products that persists across generation jobs, but I’m struggling with two aspects of that:
I’d need to generate the list itself. I would imagine I could filter the original dataset to unique products (it has stock codes) and then use the spread of unit costs in that list to do the same thing as I would have done with the timestamp (i.e. generate a set of products that have a similar spread of unit cost to the original data and then Faker the rest of the data).
QUESTION 2: Is that a sensible approach? Is there something smarter I could do?
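Sketched out, the catalogue generation I have in mind would look something like this; the column names and the idea of resampling the real unit prices are my own assumptions, with Faker filling in the made-up fields:

import pandas as pd
from faker import Faker

fake = Faker()
Faker.seed(42)

def build_catalogue(real, n_products=500):
    # One row per real stock code, then resample its unit prices so the
    # synthetic catalogue has a similar price spread to the original data.
    unique = real.drop_duplicates("StockCode")
    prices = unique["UnitPrice"].sample(n_products, replace=True, random_state=42)
    return pd.DataFrame({
        "product_id": [fake.ean13() for _ in range(n_products)],
        "name": [" ".join(fake.words(3)).title() for _ in range(n_products)],
        "unit_price": prices.to_numpy(),
    })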
When generating the transactions, I would also need some way to work out what product to select. I thought maybe I could generate some sort of bucketed histogram to work out what the frequency of purchases was within a range of costs (say $0–1, $1–2, etc.). I could then use that frequency to define the probability that a given transaction's cost would fall within one of those ranges, and then randomly select a product whose cost falls within that range...
QUESTION 3: Again, is that a sensible approach? Is there a way I could do that lookup with a reasonably easy to understand tool (or at least one that’s documented in plain English :))
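And a sketch of the bucket idea from QUESTION 3, just to show the shape of it; the $1-wide buckets and the column names are placeholders:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def pick_products(real, products, n):
    # Bucket the real unit prices and use the bucket frequencies as weights.
    bin_edges = np.arange(0, real["UnitPrice"].max() + 1, 1.0)   # $1-wide buckets
    counts, _ = np.histogram(real["UnitPrice"], bins=bin_edges)
    probs = counts / counts.sum()

    chosen = rng.choice(len(probs), size=n, p=probs)
    picks = []
    for b in chosen:
        lo, hi = bin_edges[b], bin_edges[b + 1]
        in_bucket = products[(products["unit_price"] >= lo) &
                             (products["unit_price"] < hi)]
        if len(in_bucket):                    # skip buckets with no products
            picks.append(in_bucket.sample(1).iloc[0])
    return pd.DataFrame(picks)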
This is all quite high level I know, but any help anyone could give me would be greatly appreciated as I’ve hit a wall with this.
Thanks in advance :)
The synthesised dataset would simply have timestamp, product_id and item_cost columns.
The source dataset looks like this:
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850,United Kingdom
536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,22752,SET 7 BABUSHKA NESTING BOXES,2,12/1/2010 8:26,7.65,17850,United Kingdom
536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,12/1/2010 8:26,4.25,17850,United Kingdom
536366,22633,HAND WARMER UNION JACK,6,12/1/2010 8:28,1.85,17850,United Kingdom

Python - Database for small scale static data

Been studying Python on my own for some months. I'm about to venture into the field of databases. I am currently aiming to create a very small-scale application which would feature retrieving data based on a simple search (keyword) and reflecting the static data linked to it. Some numbers to put it in perspective:
About 150-200 "key" values
About 5-10 values to be displayed per "key" value
Editable (although the info tends to remain same most of time - maybe 1 or 2 amendments/month in total on all data stored)
An example would be the cards you see when you do a simple google search. For example you search for an actor (key value) and your query generates a "card" with values (age, length, wage, brothers,...).
As it would be my first attempt at creating something which works with such a (for me personally) larger amount of data, I am a bit puzzled by the options available to me. I have been reading up on different database models (relational, ...). Ultimately I came down to the 3 possibilities below:
XML
database
hardcode
I intend on making the data amendable by 1 privileged user. This alone makes hardcoding it not really an option. Although, as I am a novice, I could be interpreting this wrong.
I'd be happy if you could point me in the right direction and, if you'd go for a database, tell me which one you'd recommend going for (MySQL, ...).
Many thanks in advance. If I made any error with the post (as it is my initial post), do not hesitate to point this out.
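In case a concrete example helps, here is roughly the shape of the lookup I'm after, sketched with the sqlite3 module from the standard library; the table layout and field names are made up for illustration:

import json
import sqlite3

con = sqlite3.connect("cards.db")
con.execute("""CREATE TABLE IF NOT EXISTS cards (
                   key  TEXT PRIMARY KEY,   -- e.g. the actor's name
                   data TEXT                -- JSON blob with the 5-10 values
               )""")

con.execute("INSERT OR REPLACE INTO cards VALUES (?, ?)",
            ("Some Actor", json.dumps({"age": 50, "height": "1.80 m"})))
con.commit()

row = con.execute("SELECT data FROM cards WHERE key LIKE ?", ("%Actor%",)).fetchone()
if row:
    print(json.loads(row[0]))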

Algorithm in Python to store and search daily occurrence for thousands of numbered events?

I'm investigating solutions of storing and querying a historical record of event occurrences for a large number of items.
This is the simplified scenario: I'm getting a daily log of 200,000 streetlamps (labeled sl1 to sl200000) which shows whether the lamp was operational on the day or not. It does not matter for how long the lamp was in service, only that it was on a given calendar day.
Other bits of information are stored for each lamp as well and the beginning of the Python class looks something like this:
class Streetlamp(object):
    """Class for streetlamp record"""
    def __init__(self, **args):
        self.location = args['location']
        self.power = args['power']
        self.inservice = ???
My py-foo is not too great and I would like to avoid a solution which is too greedy on disk/memory storage. So a solution with a dict of (year, month, day) tuples could be one solution, but I'm hoping to get pointers for a more efficient solution.
A record could be stored as a bit stream with each bit representing a day of a year starting with Jan 1. Hence, if a lamp was operational the first three days of 2010, then the record could be:
sl1000_up = {'2010': '11100000000000...', '2011': '11111100100...'}
Search across year boundaries would need a merge, leap years are a special case, plus I'd need to code/decode a fair bit with this home-grown solution. It seems not quite right. speed-up-bitstring-bit-operations, how-do-i-find-missing-dates-in-a-list and finding-data-gaps-with-bit-masking were interesting postings I came across. I also investigated python-bitstring and did some googling, but nothing seems to really fit.
Additionally, I'd like searching for 'gaps' to be possible, e.g. 'three or more days out of action', and it is essential that a flagged day can be converted into a real calendar date.
I would appreciate ideas or pointers to possible solutions. To add further detail, it might be of interest that the back-end DB used is ZODB and pure Python objects which can be pickled are preferred.
Create a 2D array in NumPy:
import numpy as np

nbLamps = 200000
nbDays = 365

# One boolean flag per lamp per day.
arr = np.zeros((nbLamps, nbDays), dtype=bool)
It will be very memory-efficient and you can easily aggregate over days and lamps.
In order to manipulate the days even better, have a look at scikits.timeseries. It will allow you to access the dates as datetime objects.
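As an illustration of the kind of aggregation this makes easy, here is a sketch of the 'three or more days out of action' search on such a boolean array; the sliding-window trick is just one way to do it, not something this answer prescribes:

import numpy as np

nbLamps, nbDays = 200000, 365
up = np.ones((nbLamps, nbDays), dtype=bool)   # True = operational that day
up[123, 40:45] = False                        # example outage of five days

off = ~up
# A window sum equal to 3 marks the start of three consecutive off-days.
windows = np.lib.stride_tricks.sliding_window_view(off, 3, axis=1)
starts = windows.sum(axis=2, dtype=np.uint8) == 3
lamps_with_gaps = np.unique(np.nonzero(starts)[0])
print(lamps_with_gaps)                        # -> [123]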
I'd probably dictionary the lamps and have each of them contain a list of state changes where the first element is the time of the change and the second the value that's valid since that time.
This way when you get to the next sample you do nothing unless the state changed compared to the last item.
Searching is quick and efficient as you can use binary search approaches on the times.
Persisting it is also easy, and you can append data to an existing, running system without any problems, as well as dictionary the lamp state lists to further reduce resource usage.
If you want to search for a gap, you just go over all the items and compare the next and previous times. If you decided to dictionary the state lists, you'll be able to do it just once for every different list rather than for every lamp, and then get all the lamps that had the same "offline" states in a single iteration, which may sometimes help.
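A sketch of that idea, just to make it concrete; the names are illustrative, and bisect from the standard library does the binary search:

import bisect
from datetime import date

changes = {}   # lamp id -> [(date, operational), ...] kept sorted by date

def record(lamp, day, up):
    lst = changes.setdefault(lamp, [])
    if not lst or lst[-1][1] != up:       # store only actual state changes
        lst.append((day, up))

def state_on(lamp, day):
    lst = changes[lamp]
    # (day, True) sorts after any entry with the same date, so index i is the
    # last change on or before 'day'.
    i = bisect.bisect_right(lst, (day, True)) - 1
    return lst[i][1] if i >= 0 else False     # assume "off" before first record

record("sl1000", date(2010, 1, 1), True)
record("sl1000", date(2010, 1, 4), False)     # went out on Jan 4
record("sl1000", date(2010, 1, 7), True)
print(state_on("sl1000", date(2010, 1, 5)))   # False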

Good algorithm to find themes in tweets ranked by follower counts?

I'm new to data mining and experimenting a bit.
Let's say I have N twitter users and what I want to find
is the overall theme they're writing about (based on tweets).
Then I want to give higher weight to each theme if that user has more followers.
Then I want to merge all themes if they're similar enough, but still retain the weighting by follower count.
So basically a list of "important" themes ranked by authority (the user's follower count).
For instance, like news.google.com, but the ranking would be based on the twitter followers of the users responsible for the theme.
I'd prefer something in python since that's the language I'm most familiar with.
Any ideas?
Thanks
EDIT:
Here's a good example of what I'm trying to do (but with diff data)
http://www.facebook.com/notes/facebook-data-team/whats-on-your-mind/477517358858
Basically analyzing various data and their correlation to each other: work categories and each person's age, or word categories and friend count as in this example.
Where would I begin to solve this and generate such graphs?
Generally speaking: R has some packages specifically aimed at text mining and data mining, offering a wide range of techniques. I have no knowledge of that kind of package in Python, but that doesn't mean they don't exist. I just wouldn't implement it all myself; it's a bit more complicated than it looks at first sight.
Some things you have to consider :
define "theme" : Is that the tags they use? Do you group tags? Do you have a small list with a limited set, or is the set unlimited?
define "general theme" : Is that the most used theme? How do you deal with ties? If a user writes about 10 themes about as much, what then?
define "weight" : Is that equivalent to the number of users? The square root? Some category?
If you have a general idea about this, you can start using the tm package for extracting all the information in a workable format. The package is based on matrices, and metadata objects. These allow you to get weighted frequencies for the different themes, provided you have defined what you consider a theme. You can also use different weighting functions to obtain what you want. The manual is here. But please also visit crossvalidated.com for extra guidance if you're not sure about what you're doing. This is actually more a question about data mining than it is about programming.
I have no specific code but I believe the methodology you want to employ is TF-IDF. It is explained here: http://en.wikipedia.org/wiki/Tf%E2%80%93idf and is used quite often to classify text.
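A rough sketch of that direction with scikit-learn, in case it helps; the log-scaling of follower counts is my own naive reading of "weight by authority", not a standard recipe:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# One document per user: all their tweets concatenated, plus a follower count.
users = {
    "alice": ("python data mining tweets analysis python", 15000),
    "bob":   ("football scores football league", 300),
}
docs = [tweets for tweets, _ in users.values()]
followers = np.array([f for _, f in users.values()], dtype=float)

vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(docs).toarray()          # shape: (users, terms)

weighted = tfidf * np.log1p(followers)[:, None]    # damp huge follower counts
scores = weighted.sum(axis=0)
for term, score in sorted(zip(vec.get_feature_names_out(), scores),
                          key=lambda t: -t[1])[:5]:
    print(term, round(score, 3))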

App Engine: 13 StringPropertys vs. 1 StringListProperty (w.r.t. indexing/storage and query performance)

A bit of background first: GeoModel is a library I wrote that adds very basic geospatial indexing and querying functionality to App Engine apps. It is similar in approach to geohashing. The equivalent location hash in GeoModel is called a 'geocell.'
Currently, the GeoModel library adds 13 properties (location_geocell_n, n = 1..13) to each location-aware entity. For example, an entity can have property values such as:
location_geocell_1 = 'a'
location_geocell_2 = 'a3'
location_geocell_3 = 'a3f'
...
This is required in order to not use up an inequality filter during spatial queries.
The problem with the 13-properties approach is that, for any geo query an app would like to run, 13 new indexes must be defined and built. This is definitely a maintenance hassle, as I've just painfully realized while rewriting the demo app for the project. This leads to my first question:
QUESTION 1: Is there any significant storage overhead per index? i.e. if I have 13 indexes with n entities in each, versus 1 index with 13n entities in it, is the former much worse than the latter in terms of storage?
It seems like the answer to (1) is no, per this article, but I'd just like to see if anyone has had a different experience.
Now, I'm considering adjusting the GeoModel library so that instead of 13 string properties, there'd only be one StringListProperty called location_geocells, i.e.:
location_geocells = ['a', 'a3', 'a3f']
This results in a much cleaner index.yaml. But, I do question the performance implications:
QUESTION 2: If I switch from 13 string properties to 1 StringListProperty, will query performance be adversely affected; my current filter looks like:
query.filter('location_geocell_%d =' % len(search_cell), search_cell)
and the new filter would look like:
query.filter('location_geocells =', search_cell)
Note that the first query has a search space of n entities, whereas the second query has a search space of 13n entities.
It seems like the answer to (2) is that both result in equal query performance, per tip #6 in this blog post, but again, I'd like to see if anyone has any differing real-world experiences with this.
Lastly, if anyone has any other suggestions or tips that can help improve storage utilization, query performance and/or ease of use (specifically w.r.t. index.yaml), please do let me know! The source can be found here: geomodel & geomodel.py
You're correct that there's no significant overhead per-index - 13n entries in one index is more or less equivalent to n entries in 13 indexes. There's a total index count limit of 100, though, so this eats up a fair chunk of your available indexes.
That said, using a ListProperty is definitely a far superior approach from usability and index consumption perspectives. There is, as you supposed, no performance difference between querying a small index and a much larger index, supposing both queries return the same number of rows.
The only reason I can think of for using separate properties is if you knew you only needed to index on certain levels of detail - but that could be accomplished better at insert-time by specifying the levels of detail you want added to the list in the first place.
Note that in either case you only need the indexes if you intend to query the geocell properties in conjunction with a sort order or inequality filter, though - in all other cases, the automatic indexing will suffice.
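For reference, a minimal sketch of the list-based model with the (old) google.appengine.ext.db API; the class and property names are illustrative:

from google.appengine.ext import db

class Location(db.Model):
    name = db.StringProperty()
    location_geocells = db.StringListProperty()   # e.g. ['a', 'a3', 'a3f']

# An equality filter on a list property matches if ANY element equals the
# value, so the single list replaces the 13 separate properties:
q = Location.all().filter('location_geocells =', 'a3f')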
Lastly, if anyone has any other suggestions or tips that can help improve storage utilization, query performance and/or ease of use
The StringListProperty is the way to go for the reasons mentioned above, but in actual usage one might want to add the geocells to one's own pre-existing StringListProperty so one could query against multiple properties.
So, if you were to provide a lower-level API, it could work with full-text search implementations like Bill Katz's:
def point2StringList(Point, stub="blah"):
    .....
    return ["blah_1:a", "blah_2:a3", "blah_3:a3f", ....]

def boundingbox2Wheresnippet(Box, stringlist="words", stub="blah"):
    .....
    return "words='%s_3:a3f' AND words='%s_3:b4g' ..." % (stub, stub)
etc.
Looks like you ended up with 13 indices because you encoded in hex (for human readability / map levels?).
If you had utilized the full potential of a byte (ByteString), you'd have 256 cells instead of 16 cells per character (byte), thereby reducing to a far smaller number of indices for the same precision.
ByteString is just a subclass of str and is indexed similarly if less than 500 bytes in length.
However number of levels might be lower; to me 4 or 5 levels is practically good enough for most situations on 'the Earth'. For a larger planet or when cataloging each sand particle, more divisions might anyway need to be introduced irrespective of encoding used. In either case ByteString is better than hex encoding. And helps reduce indexing substantially.
For representing 4 billion low(est) level cells, all we need is 4 bytes or just 4 indices (from basic computer architecture or memory addressing).
For representing the same in hex, we'd need 8 hex digits or 8 indices.
I could be wrong. Maybe the number of index levels matching map zoom levels is more important. Please correct me. Am planning to try this instead of hex if just one (other) person here finds this meaningful :)
Or a solution that has fewer large cells (16) but more (128,256) as we go down the hierarchy.
Any thoughts?
e.g.:
[0-15][0-31][0-63][0-127][0-255] gives 1G low level cells with 5 indices with log2 decrement in size.
[0-15][0-63][0-255][0-255][0-255] gives 16G low level cells with 5 indices.
