I am stuck at a problem that might be easy, so all I'm asking for is ideas to get started:
In Python, I have generated links between fileNames. Each fileName is associated with at least two numbers in a dictionary nameNumber {fileName: [list of numbers]}, and different fileNames can have some associated numbers in common. To see that, I created a dictionary numberName {number: [list of associated fileNames]}. What I want to do is some sort of single linkage: regroup all the fileNames that have at least one number in common. This grouping has to be efficient, as I have millions of fileNames.
You could try using graphs with networkx.
Each fileName would be a node of your graph (G.add_node()) and you could link the fileNames that have common numbers with edges. networkx can then give you the connected components of that graph, which is exactly the single-linkage grouping you describe.
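A minimal sketch of that idea (the nameNumber data below is made up; your real dicts replace it). Chaining the fileNames that share a number adds len(names) - 1 edges per number instead of every pair, which matters with millions of fileNames, and connected_components() then returns the groups:

import networkx as nx

# hypothetical example data in the shape described above
nameNumber = {"a.txt": [1, 2], "b.txt": [2, 3], "c.txt": [7, 8]}

# invert into numberName: {number: [fileNames sharing that number]}
numberName = {}
for name, numbers in nameNumber.items():
    for n in numbers:
        numberName.setdefault(n, []).append(name)

G = nx.Graph()
G.add_nodes_from(nameNumber)               # one node per fileName
for names in numberName.values():
    # chain the fileNames sharing this number rather than adding all pairs
    G.add_edges_from(zip(names, names[1:]))

groups = list(nx.connected_components(G))  # the single-linkage groups
print(groups)                              # e.g. [{'a.txt', 'b.txt'}, {'c.txt'}]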
I have a dataset of postal codes for each store and nearby postal codes for each. It looks like the following:
PostalCode    nearPC                                                Travel Time
L2L 3J9       ['N1K 0A1', 'N1K 0A2', 'N1K 0A3', 'N1K 0A4', ...]     [nan, nan, 9, 5, nan, ...]
I know I can explode the data, but that would result in many more rows (~40M). Another preprocessing step I could perform is to remove the values in each list where the travel time is not available; however, I would then also need to remove the corresponding entries from the nearPC list.
Is there a way to incorporate networkx to create this graph? I've tried using
G = nx.from_pandas_edgelist(df,'near_PC','PostalCode',['TravelTime'])
but I don't think it allows lists as the source or targets.
TypeError: unhashable type: 'list'
Is there a way around this? If not, how can I efficiently remove the same indices from each row's lists based on a conditional?
You've already identified the core problem, although you may not realize it: you have a graph with ~40M edges. You are right to avoid exploding the table, but you do have to produce that expansion in some form, because the graph needs all 40M edges.
For what little trouble it might save you, I suggest writing a simple generator expression for the edges: take one node from PostalCode and iterate through the nearPC list for the other node. Let Python and NetworkX worry about the in-line expansion.
Switch the nx build method you call to match the format you generate. This slows the processing somewhat, but the explosion details stay hidden in the language syntax. Also, if there is any built-in parallelization between that generator and the nx method, you'll get that advantage implicitly.
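A rough sketch of that generator approach (the column names PostalCode, nearPC and TravelTime are assumptions; rename them to match your frame, which should hold parallel lists per row as in your sample). The missing travel times are filtered out in the same pass, and add_edges_from consumes the generator lazily:

import numpy as np
import pandas as pd
import networkx as nx

# toy frame in the shape shown in the question
df = pd.DataFrame({
    "PostalCode": ["L2L 3J9"],
    "nearPC": [["N1K 0A1", "N1K 0A2", "N1K 0A3", "N1K 0A4"]],
    "TravelTime": [[np.nan, np.nan, 9, 5]],
})

# one (store, neighbour) edge per list entry, skipping missing travel times
edges = (
    (row.PostalCode, near, {"TravelTime": t})
    for row in df.itertuples(index=False)
    for near, t in zip(row.nearPC, row.TravelTime)
    if pd.notna(t)
)

G = nx.Graph()
G.add_edges_from(edges)        # the ~40M-edge expansion happens here, lazily
print(G.edges(data=True))      # [('L2L 3J9', 'N1K 0A3', {'TravelTime': 9}), ...]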
I have multiple node- and edgelists which form a large graph; let's call it the maingraph. My current strategy is to first read all the nodelists and import them with add_vertices. Every node then gets an internal id which depends on the order in which the nodes are ingested and is therefore not very reliable (as I've read it, if you delete one node, all ids higher than the deleted one change). I assign every node a name attribute, which corresponds to the external ID I use so I can keep track of my nodes between frameworks, and a type attribute.
Now, how do I add the edges? When I read an edgelist, it starts a new graph (subgraph) and hence starts the internal IDs at 0 again. Therefore, "merging" the graphs with maingraph.add_edges(subgraph.get_edgelist()) inevitably fails.
It is possible to work around this and use the name attribute from both maingraph and subgraph to find out which internal ID each edge's incident nodes have in the maingraph:
def _get_real_source_and_target_id(edge):
    '''Takes an edge from the to-be-added subgraph and gets the ids of the
    corresponding nodes in the maingraph by their name.'''
    source_id = maingraph.vs.select(name_eq=subgraph.vs[edge[0]]["name"])[0].index
    target_id = maingraph.vs.select(name_eq=subgraph.vs[edge[1]]["name"])[0].index
    return (source_id, target_id)
And then I tried
edgelist = [_get_real_source_and_target_id(x) for x in subgraph.get_edgelist()]
maingraph.add_edges(edgelist)
But that is hoooooorribly slow. The graph has millions of nodes and edges. It takes about 10 seconds to load with the fast but incorrect maingraph.add_edges(subgraph.get_edgelist()) approach; with the correct approach explained above, it takes minutes (I usually stop it after 5 minutes or so). I will have to do this tens of thousands of times. I switched from NetworkX to igraph because of the fast loading, but that doesn't really help if I have to do it like this.
Does anybody have a more clever way to do this? Any help much appreciated!
Thanks!
Never mind, I figured out that the mistake was elsewhere. I used numpy.loadtxt() to read the nodes' names as strings, which somehow did funny stuff when the names were incrementing numbers with more than five figures (see my issue report here). Therefore the above solution got stuck when it tried to get the nodes whose names numpy had messed up: maingraph.vs.select(name_eq=subgraph.vs[edge[0]]["name"])[0].index simply sat there when it couldn't find the node. Now I use pandas to read the node names and it works fine.
The solution above is still ~10x faster than my previous NetworkX solution, so I will just leave it here in case it helps someone. Feel free to delete it otherwise.
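For reference, here is roughly what the pandas-based reading looks like (the file name and one-column layout are just placeholders, not my actual data):

import pandas as pd
import igraph as ig

# dtype=str keeps numeric-looking names from being reinterpreted
names = pd.read_csv("nodelist.txt", header=None, dtype=str)[0].tolist()

maingraph = ig.Graph()
maingraph.add_vertices(len(names))
maingraph.vs["name"] = names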
I'm building a Python program to compress/decompress a text file using a Huffman tree. Previously, I would store the frequency table in a .json file alongside the compressed file. When I read in the compressed data and the .json, I would rebuild the decompression tree from the frequency table. I thought this was a pretty elegant solution.
However, I was running into an odd issue with files of medium length, where they would decompress into strings of seemingly random characters. I found that the issue occurred when two characters occurred the same number of times: when I rebuilt my tree, any of the characters with matching frequencies had a chance of getting swapped. For the majority of files, particularly large and small ones, this wasn't a problem; most letters occurred slightly more or slightly less often than others. But for some medium-sized files, a large portion of the characters occurred the same number of times as another character, resulting in gibberish.
Is there a unique identifier for my nodes that I can use instead to easily rebuild my tree? Or should I be approaching the tree writing completely differently?
In the Huffman algorithm you need to pick the lowest two frequencies in a deterministic way that is the same on both sides. If there is a tie, you need to use the symbol to break the tie. Without that, you have no assurance that the sorting on both sides will pick the same symbols when faced with equal frequencies.
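For example, one way to make the tie-break deterministic is to carry the smallest symbol of each subtree as a secondary sort key in the heap (this is only a sketch, not your exact tree-building code, and the counts are made up):

import heapq

freqs = {"a": 3, "b": 3, "c": 2, "d": 2}          # made-up counts with ties

# heap entries are (frequency, tie_key, subtree); tie_key is the smallest
# symbol in the subtree, so equal frequencies always resolve the same way
heap = [(f, sym, sym) for sym, f in freqs.items()]
heapq.heapify(heap)
while len(heap) > 1:
    f1, k1, t1 = heapq.heappop(heap)
    f2, k2, t2 = heapq.heappop(heap)
    heapq.heappush(heap, (f1 + f2, min(k1, k2), (t1, t2)))

tree = heap[0][2]
print(tree)        # the same nested structure on every run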
You don't need to send the frequencies. All you need to send is the bit lengths for the symbols. The lengths can be coded much more compactly than the frequencies. You can build a canonical code from just the lengths, using the symbols to order the codes unambiguously.
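A minimal sketch of building such a canonical code from just the per-symbol bit lengths (the lengths dict here is made up; your real lengths come from the tree you build at compression time):

def canonical_codes(lengths):
    """lengths: dict mapping symbol -> code length in bits."""
    codes = {}
    code = 0
    prev_len = 0
    # sorting by (length, symbol) breaks ties on the symbol itself, so the
    # decompressor rebuilds exactly the same codes from the lengths alone
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes

print(canonical_codes({"a": 2, "b": 2, "c": 3, "d": 3}))
# {'a': '00', 'b': '01', 'c': '100', 'd': '101'}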
I have coded a network using igraph (undirected), and I want to obtain the list of pairs of nodes that are not connected in the network.
Looking through igraph's documentation (Python), I haven't found a method that does this. Do I have to do it manually?
A related question: given any pair of nodes in the network, how do I find the list of common neighbors of these two nodes using igraph? Again, there seems to be no such method readily available in igraph.
Re the first question (listing pairs of disconnected nodes): yes, you have to do this manually, but it is fairly easy:
from itertools import product
all_nodes = set(range(g.vcount()))
disconnected_pairs = [list(product(cluster, all_nodes.difference(cluster)))
                      for cluster in g.clusters()]
But beware, this could be a fairly large list if your graph is large and consists of a lot of disconnected components.
Re the second question (listing common neighbors): again, you have to do this manually but it only takes a single set intersection operation in Python:
set(g.neighbors(v1)).intersection(set(g.neighbors(v2)))
If you find that you need to do this for many pairs of nodes, you should probably create the neighbor sets first:
neighbor_sets = [set(neis) for neis in g.get_adjlist()]
Then you can simply write neighbor_sets[i] instead of set(g.neighbors(i)).
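A tiny self-contained example of both lookups (the toy graph here is made up):

from itertools import product
import igraph as ig

g = ig.Graph(edges=[(0, 1), (1, 2), (0, 2), (3, 4)])

# common neighbors of vertices 0 and 1
neighbor_sets = [set(neis) for neis in g.get_adjlist()]
print(neighbor_sets[0] & neighbor_sets[1])        # {2}

# pairs of vertices lying in different connected components
all_nodes = set(range(g.vcount()))
disconnected_pairs = [list(product(cluster, all_nodes.difference(cluster)))
                      for cluster in g.clusters()]
print(disconnected_pairs)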
[
    [
        [2, 33, 64, 276, 1],
        [234, 5, 234, 7, 34, 36, 7, 2],
        []
    ],
    [
        [2, 4, 5]
    ],
    ...
    etc.
]
I'm not looking for an exact solution to this, as the structure above is just an example. I'm trying to search for an ID that can be nested several levels deep within a group of IDs ordered randomly.
Currently I'm just doing a linear search, which takes a few minutes to get a result when each of the deepest levels has a couple hundred IDs. I was wondering if anyone could suggest a faster algorithm for searching through multiple levels of random data? I am doing this in Python, if that matters.
Note: The IDs are always at the deepest level and the number of levels is consistent for each branch down. Not sure if that matters or not.
Also to clarify the data points are unique and cannot be repeated. My example has some repeats because I was just smashing the keyboard.
The fastest search through random, unindexed data is linear. Even ignoring the nesting, the data is still random, so flattening it won't help.
To decrease the time complexity, you can increase the space complexity: keep a dict with the IDs as keys and whatever information you want as values (for example, the indices locating the ID at each level), and update it every time you create, update, or delete an element.
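A rough sketch of such an index (assuming, as you describe, a fixed number of levels with unique IDs only at the deepest level; the data here is made up):

def build_index(data):
    """Map each ID to the (group, subgroup, position) indices where it lives."""
    index = {}
    for i, group in enumerate(data):
        for j, subgroup in enumerate(group):
            for k, id_ in enumerate(subgroup):
                index[id_] = (i, j, k)
    return index

data = [
    [[2, 33, 64, 276, 1], [234, 5, 61, 7, 34, 36], []],
    [[8, 4, 9]],
]
index = build_index(data)
print(index.get(34))     # (0, 1, 4) -- constant-time lookup
print(index.get(999))    # None -- not present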