There are 3 columns, for levels 1-3. A file is read, and each line of the file contains various data, including the level to which it belongs, located at the end of the string.
Sample lines from file being read:
thing_1 - level 1
thing_17 - level 3
thing_22 - level 2
I want to assign each "thing" to its corresponding column. I have looked into pandas, but it seems that DataFrame columns won't work as-is: the passed data would need one attribute per column, whereas in my case I have 3 columns but each piece of data carries only a single data point.
How could I approach this problem?
Desired output:
level 1    level 2     level 3
thing_1    thing_22    thing_17
Edit:
In looking at the suggestions, I can refine my question further. I have up to 3 columns, and each line from the file needs to be assigned to one of the 3 columns. Most solutions seem to need something like:
data = [['Mary', 20], ['John', 57]]
columns = ['Name', 'Age']
This does not work for me, since there are 3 columns, and each piece of data goes into only one.
There's an additional wrinkle here that I didn't notice at first. If each of your levels has the same number of things, then you can build a dictionary and then use it to supply the table's columns to PrettyTable:
from prettytable import PrettyTable

# Create an empty dictionary.
levels = {}

with open('data.txt') as f:
    for line in f:
        # Remove the trailing \n and split into the parts we want.
        thing, level = line.rstrip('\n').split(' - ')
        # If this is a new level, set it to a list containing its thing.
        if level not in levels:
            levels[level] = [thing]
        # Otherwise, add the new thing to the level's list.
        else:
            levels[level].append(thing)

# Create the table, and add each level as a column.
table = PrettyTable()
for level, things in levels.items():
    table.add_column(level, things)

print(table)
For the example data you showed, this prints:
+---------+----------+----------+
| level 1 | level 3 | level 2 |
+---------+----------+----------+
| thing_1 | thing_17 | thing_22 |
+---------+----------+----------+
The Complication
I probably wouldn't have posted an answer (believing it was covered sufficiently in this answer), except that I realized there's an unintuitive hurdle here. If your levels each contain a different number of things, you get an error like this:
Exception: Column length 2 does not match number of rows 1!
Because none of the readily available solutions handle this in an obvious, "automatic" way, here is a simple approach. Build the dictionary as before, then:
# Find the length of the longest list of things.
longest = max(len(things) for things in levels.values())

table = PrettyTable()
for level, things in levels.items():
    # Pad out the list if it's shorter than the longest.
    things += ['-'] * (longest - len(things))
    table.add_column(level, things)

print(table)
This will print something like this:
+---------+----------+----------+
| level 1 | level 3 | level 2 |
+---------+----------+----------+
| thing_1 | thing_17 | thing_22 |
| - | - | thing_5 |
+---------+----------+----------+
Extra
If all of that made sense and you'd like to know about a way part of it can be streamlined a little, take a look at Python's defaultdict. It can take care of the "check if this key already exists" process, providing a default (in this case a new list) if nothing's already there.
from collections import defaultdict

levels = defaultdict(list)

with open('data.txt') as f:
    for line in f:
        # Remove the trailing \n and split into the parts we want.
        thing, level = line.rstrip('\n').split(' - ')
        # Automatically handles adding a new key if needed:
        levels[level].append(thing)
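And since the question mentions pandas: once the lists are padded to the same length, the very same levels dictionary can also feed a DataFrame. A minimal sketch, assuming the levels dict built above:

import pandas as pd

# Pad the shorter lists so every column has the same length.
longest = max(len(things) for things in levels.values())
padded = {level: things + ['-'] * (longest - len(things))
          for level, things in levels.items()}

df = pd.DataFrame(padded)
print(df)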
I am new to Python and stuck with building a hierarchy out of a relational dataset.
It would be of immense help if someone has an idea on how to proceed with this.
I have a relational data-set with data like
currentnode, childnode
root, child1
child1, leaf2
child1, child3
child1, leaf4
child3, leaf5
child3, leaf6
and so on. I am looking for some Python or PySpark code to build a hierarchy dataframe like the one below:
level1, level2, level3, level4
root, child1, leaf2, null
root, child1, child3, leaf5
root, child1, child3, leaf6
root, child1, leaf4, null
The data is alphanumeric and the dataset is huge (~50 million records).
Also, the root of the hierarchy is known and can be hardwired in the code.
So in the example, above, the root of the hierarchy is 'root'.
Shortest Path with Pyspark
The input data can be interpreted as a graph, with the connections between currentnode and childnode as edges. The question then is: what is the shortest path between the root node and all leaf nodes? This is known as the single-source shortest path (SSSP) problem.
Spark has GraphX to handle parallel computations of graphs. Unfortunately, GraphX does not provide a Python API (more details can be found here). A graph library with Python support is GraphFrames, which uses parts of GraphX.
Both GraphX and GraphFrames provide a solution for SSSP. Unfortunately again, both implementations return only the length of the shortest paths, not the paths themselves (GraphX and GraphFrames). But this answer provides an implementation of the algorithm for GraphX and Scala that also returns the paths. All three solutions use Pregel.
Translating the aforementioned answer to GraphFrames/Python:
1. Data preparation
Provide unique IDs for all nodes and change the column names so that they fit the names described here:
import pyspark.sql.functions as F

df = ...

vertices = df.select("currentnode").withColumnRenamed("currentnode", "node") \
    .union(df.select("childnode")) \
    .distinct() \
    .withColumn("id", F.monotonically_increasing_id()) \
    .cache()

edges = df.join(vertices, df.currentnode == vertices.node).drop(F.col("node")) \
    .withColumnRenamed("id", "src") \
    .join(vertices, df.childnode == vertices.node).drop(F.col("node")) \
    .withColumnRenamed("id", "dst") \
    .cache()
Nodes Edges
+------+------------+ +-----------+---------+------------+------------+
| node| id| |currentnode|childnode| src| dst|
+------+------------+ +-----------+---------+------------+------------+
| leaf2| 17179869184| | child1| leaf4| 25769803776|249108103168|
|child1| 25769803776| | child1| child3| 25769803776| 68719476736|
|child3| 68719476736| | child1| leaf2| 25769803776| 17179869184|
| leaf6|103079215104| | child3| leaf6| 68719476736|103079215104|
| root|171798691840| | child3| leaf5| 68719476736|214748364800|
| leaf5|214748364800| | root| child1|171798691840| 25769803776|
| leaf4|249108103168| +-----------+---------+------------+------------+
+------+------------+
2. Create the GraphFrame
from graphframes import GraphFrame
graph = GraphFrame(vertices, edges)
3. Create UDFs that will form the single parts of the Pregel algorithm
The message type:
from pyspark.sql.types import *
vertColSchema = StructType() \
    .add("dist", DoubleType()) \
    .add("node", StringType()) \
    .add("path", ArrayType(StringType(), True))
The vertex program:
def vertexProgram(vd, msg):
    # Keep the current vertex state if there is no message or if the
    # current distance is already shorter; otherwise adopt the message's
    # distance and path.
    if msg is None or vd[0] < msg[0]:
        return (vd[0], vd[1], vd[2])
    else:
        return (msg[0], vd[1], msg[2])

vertexProgramUdf = F.udf(vertexProgram, vertColSchema)
The outgoing messages:
def sendMsgToDst(src, dst):
    srcDist = src[0]
    dstDist = dst[0]
    # Only send a message if it would improve the destination's distance.
    if srcDist < (dstDist - 1):
        return (srcDist + 1, src[1], src[2] + [dst[1]])
    else:
        return None

sendMsgToDstUdf = F.udf(sendMsgToDst, vertColSchema)
Message aggregation:
def aggMsgs(agg):
    # Keep the message with the smallest distance (the first element
    # of each message tuple).
    shortest_dist = sorted(agg, key=lambda tup: tup[0])[0]
    return (shortest_dist[0], shortest_dist[1], shortest_dist[2])

aggMsgsUdf = F.udf(aggMsgs, vertColSchema)
4. Combine the parts
from graphframes.lib import Pregel
result = graph.pregel \
    .withVertexColumn(
        colName="vertCol",
        initialExpr=F.when(
            F.col("node") == F.lit("root"),
            F.struct(F.lit(0.0), F.col("node"), F.array(F.col("node")))
        ).otherwise(
            F.struct(F.lit(float("inf")), F.col("node"), F.array(F.lit("")))
        ).cast(vertColSchema),
        updateAfterAggMsgsExpr=vertexProgramUdf(F.col("vertCol"), Pregel.msg())
    ) \
    .sendMsgToDst(sendMsgToDstUdf(F.col("src.vertCol"), Pregel.dst("vertCol"))) \
    .aggMsgs(aggMsgsUdf(F.collect_list(Pregel.msg()))) \
    .setMaxIter(10) \
    .setCheckpointInterval(2) \
    .run()

result.select("vertCol.path").show(truncate=False)
Remarks:
maxIter should be set to a value at least as large as the longest path. If the value is higher, the result stays unchanged, but the computation time grows. If the value is too small, the longer paths will be missing from the result. The current version of GraphFrames (0.8.0) does not support stopping the loop when no more new messages are sent.
checkpointInterval should be set to a value smaller than maxIter. The actual value depends on the data and the available hardware. If OutOfMemory exceptions occur or the Spark session hangs for some time, the value could be reduced.
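One more practical remark (an assumption about your setup, not shown above): GraphFrames' Pregel checkpoints intermediate results, so a checkpoint directory has to be configured on the Spark context before calling run, for example:

# The path is just an example; any location Spark can write to works.
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")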
The final result is a regular dataframe with the content
+-----------------------------+
|path |
+-----------------------------+
|[root, child1] |
|[root, child1, leaf4] |
|[root, child1, child3] |
|[root] |
|[root, child1, child3, leaf6]|
|[root, child1, child3, leaf5]|
|[root, child1, leaf2] |
+-----------------------------+
If necessary, the non-leaf nodes could be filtered out here.
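A sketch of that filtering step, reusing df and result from above (F.element_at needs Spark 2.4+; the column names are the ones introduced in step 1): a leaf is a node that never appears as currentnode, so keep only the paths whose last element is such a node, then spread the path array into the level columns the question asked for.

# Leaves are nodes that never appear as a parent.
leaves = df.select("childnode").subtract(df.select("currentnode"))

paths = result.select(F.col("vertCol.path").alias("path"))
leaf_paths = paths.join(
    leaves, F.element_at(F.col("path"), -1) == F.col("childnode")
).drop("childnode")

# Spread the path arrays into level1..level4 columns (4 levels are
# assumed here, matching the example; the value could also be computed
# as the maximum path length).
level_cols = [F.col("path").getItem(i).alias("level%d" % (i + 1)) for i in range(4)]
leaf_paths.select(*level_cols).show()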
I have an application which writes/concatenates data into JSON, and then displays/graphs it via dygraphs. At times, various events can cause the values to go out of range. That range is user-subjective, so clamping that range at run-time is not the direction I am wishing to go.
I believe jq can help here - ideally I would be able to search for a field > x and if it is > x, replace it with x. I've gone searching for jq examples and not really found anything that's making sense to me yet.
I have spent a bit of time on this but haven't been able to make anything do what I think it should do... at all. I don't have bad code to show you because I haven't made it do anything yet. I sincerely hope what I am asking is narrowed down enough for someone to show me, in context, so I can extend it for the larger project.
Here's a line which I would expect to be able to modify:
{"cols":[{"type":"datetime","id":"Time","label":"Time"},{"type":"number","id":"Room1Temp","label":"Room One Temp"},{"type":"number","id":"Room1Set","label":"Room One Set"},{"type":"string","id":"Annot1","label":"Room One Note"},{"type":"number","id":"Room2Temp","label":"Room Two Temp"},{"type":"number","id":"Room2Set","label":"Room Two Set"},{"type":"string","id":"Annot2","label":"Room Two Note"},{"type":"number","id":"Room3Temp","label":"Room Three Temp"},{"type":"number","id":"State","label":"State"},{"type":"number","id":"Room4Temp","label":"Room Four Temp"},{"type":"number","id":"Quality","label":"Quality"}],"rows":[
{"c":[{"v":"Date(2019,6,4,20,31,13)"},{"v":68.01},{"v":68.0},null,{"v":62.02},{"v":55.89},null,null,{"v":4},{"v":69.0},{"v":1.052}]}]}
I'd want to do something like:
if JSONFile.Room2Set < 62
set Room2Set = 62
Here's a larger block of JSON which is the source of the chart shown below:
[Example chart image]
With clamp functions defined like so (in your ~/.jq file or inline):
def clamp_min($minInc): if . < $minInc then $minInc else . end;
def clamp_max($maxInc): if . > $maxInc then $maxInc else . end;
def clamp($minInc; $maxInc): clamp_min($minInc) | clamp_max($maxInc);
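For example, a quick standalone check of those definitions with made-up numbers (if the functions live in your ~/.jq file, the defs can be omitted):

$ echo '[58, 65, 80]' | jq -c '
    def clamp_min($minInc): if . < $minInc then $minInc else . end;
    def clamp_max($maxInc): if . > $maxInc then $maxInc else . end;
    def clamp($minInc; $maxInc): clamp_min($minInc) | clamp_max($maxInc);
    map(clamp(62; 72))'
[62,65,72]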
And with that data, you'll want to find the corresponding cells for each row and modify the value.
$ jq --arg col "Room2Set" --argjson max '62' '
def clamp_max($maxInc): if . > $maxInc then $maxInc else . end;
(INDEX(.cols|to_entries[]|{id:.value.id,index:.key};.id)) as $cols
| .rows[].c[$cols[$col].index] |= (objects.v |= clamp_max($max))
' input.json
With an invocation such as:
jq --arg col Room2Set --argjson mx 72 --argjson mn 62 -f clamp.jq input.json
where clamp.jq contains:
def clamp: if . > $mx then $mx elif . < $mn then $mn else . end;
(.cols | map(.id) | index($col)) as $ix
| .rows[].c[$ix].v |= clamp
the selected cells should be "clamped".
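One caveat, since the application keeps appending to this file: jq does not edit files in place, so the usual pattern is to write to a temporary file and move it over the original, for example:

jq --arg col Room2Set --argjson mx 72 --argjson mn 62 -f clamp.jq input.json > tmp.json \
  && mv tmp.json input.json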
I have a .csv with several million rows. The first column is the id of each entry, and each id only occurs one time. The first column is sorted. Intuitively I'd say that it might be pretty easy to query this file efficiently using a divide and conquer algorithm. However, I couldn't find anything related to this.
Sample .csv file:
+----+------------------+-----+
| id | name | age |
+----+------------------+-----+
| 1 | John Cleese | 34 |
+----+------------------+-----+
| 3 | Mary Poppins | 35 |
+----+------------------+-----+
| .. | ... | .. |
+----+------------------+-----+
| 87 | Barry Zuckerkorn | 45 |
+----+------------------+-----+
I don't want to load the file into memory (it's too big), and I prefer not to use databases. I know I can just import this file into sqlite, but then I have multiple copies of this data, and I'd prefer to avoid that for multiple reasons.
Is there a good package I'm overlooking? Or is it something that I'd have to write myself?
Ok, my understanding is that you want some of the functionalities of a light database, but are constrained to use a csv text file to hold the data. IMHO, this is probably a questionable design: past several hundred rows, I would only see a csv file as an intermediate or exchange format.
As it is a very uncommon design, it is unlikely that a package for it already exists - for my part, I know of none. So I would imagine 2 possible ways: scan the file once and build an index id -> row_position, and then use that index for your queries. Depending on the actual length of your rows, you could index only every n-th row to trade speed for memory. But it costs an index file.
An alternative would be a direct divide-and-conquer algorithm: use stat/fstat to get the file size, and search for the next end of line starting at the middle of the file. Immediately after it you get an id. If that is the id you want, fine, you have won; if it is greater, recurse into the upper part; if it is smaller, recurse into the lower part. But because of the need to search for ends of lines, be prepared for corner cases like never finding an end of line in the expected range, or finding it at the very end.
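To illustrate the first (index) approach, here is a minimal sketch, assuming a plain csv with a header row and integer ids in the first column (the function names are made up):

import os

def build_index(fname, every=1):
    # One pass over the file, recording id -> byte offset of its row.
    # Indexing only every n-th row trades lookup speed for memory
    # (you would then seek to the nearest indexed id and scan forward).
    index = {}
    with open(fname, 'rb') as f:
        f.readline()  # skip the header row
        offset = f.tell()
        for n, line in enumerate(iter(f.readline, b'')):
            if n % every == 0:
                index[int(line.split(b',')[0])] = offset
            offset = f.tell()
    return index

def lookup(fname, index, row_id):
    with open(fname, 'rb') as f:
        f.seek(index[row_id])
        return f.readline().decode().rstrip('\n')

index = build_index('data.csv')
print(lookup('data.csv', index, 505))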
After Serge's answer I decided to write my own implementation; here it is. It doesn't allow newlines within fields and doesn't deal with a lot of details regarding the .csv format. It assumes that the .csv is sorted on the first column, and that the first column contains integer values.
import os

def query_sorted_csv(fname, id):
    filesize = os.path.getsize(fname)
    with open(fname) as fin:
        row = look_for_id_at_location(fin, 0, filesize, id)
        if not row:
            raise Exception('id not found!')
        return row

def look_for_id_at_location(fin, location_lower, location_upper, id, sep=',', id_column=0):
    location = int((location_upper + location_lower) / 2)
    if location_upper - location_lower < 2:
        return False
    fin.seek(location)
    # Skip the (probably partial) line we landed in, then read a full one.
    next(fin)
    try:
        full_line = next(fin)
    except StopIteration:
        return False
    id_at_location = int(full_line.split(sep)[id_column])
    if id_at_location == id:
        return full_line
    if id_at_location > id:
        return look_for_id_at_location(fin, location_lower, location, id)
    else:
        return look_for_id_at_location(fin, location, location_upper, id)

row = query_sorted_csv('data.csv', 505)
You can look up about 4000 ids per second in a 2 million row 250MB .csv file. In comparison, you can look up 3 ids per second whilst looping over the entire file line by line.
I have a list of words. For example:
reel
road
root
curd
I would like to store this data in a manner that reflects the following structure:
Start -> r -> e -> reel
           -> o -> a -> road
                -> o -> root
      -> c -> curd
It is apparent to me that I need to implement a tree. From this tree, I must be able to easily obtain statistics such as the height of a node, the number of descendants of a node, searching for a node and so on. Adding a node should 'automatically' add it to the correct position in the tree, since this position is unique.
I would also like to be able to visualize the data in the form of an actual graphical tree. Since the tree is going to be huge, I would need zoom / pan controls on the visualization. And of course, a pretty visualization is always better than an ugly one.
Does anyone know of a Python package which would allow me to achieve all this simply? Writing the code myself will take quite a while. Do you think http://packages.python.org/ete2/ would be appropriate for this task?
I'm on Python 2.x, btw.
I discovered that NLTK has a trie class - nltk.containers.trie. This is convenient for me, since I already use NLTK. Does anyone know how to use this class? I can't find any examples anywhere! For example, how do I add words to the trie?
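For reference, the core idea of a trie is only a few lines with plain dicts. This is a sketch, not the NLTK API (its interface isn't documented well enough to guess at here): every node is a dict of child letters, and a special key marks complete words.

def make_trie(words):
    trie = {}
    for word in words:
        node = trie
        for letter in word:
            node = node.setdefault(letter, {})
        node['$'] = word  # mark a complete word
    return trie

def contains(trie, word):
    node = trie
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    return '$' in node

trie = make_trie(['reel', 'road', 'root', 'curd'])
print contains(trie, 'road')   # True
print contains(trie, 'roam')   # False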
ETE2 is an environment for tree exploration, in principle made for browsing, building and exploring phylogenetic trees, and I used it a long time ago for those purposes.
But it's possible that if you set up your data properly, you could get this done.
You just have to place parentheses wherever you need to split your tree and create a branch. See the following example, taken from the ETE documentation.
If you swap "(A,B,(C,D));" for your words/letters, it should work.
from ete2 import Tree
unrooted_tree = Tree( "(A,B,(C,D));" )
print unrooted_tree
output:
    /-A
   |
---|--B
   |
   |    /-C
    \--|
       \-D
...and this package will let you do most of the operations you want, giving you the chance to select every branch individually and operate on it in an easy way.
I recommend you take a look at the tutorial anyway; it's not too difficult :)
I think the following example does pretty much what you want, using the ETE toolkit.
from ete2 import Tree

words = ["reel", "road", "root", "curd", "curl", "whatever", "whenever", "wherever"]

# Create an empty tree
tree = Tree()
tree.name = ""
# Let's keep the tree structure indexed
name2node = {}
# Make sure there are no duplicates
words = set(words)
# Populate tree
for wd in words:
    # If no similar words exist, add it to the base of the tree
    target = tree
    # Find relatives in the tree
    for pos in xrange(len(wd), -1, -1):
        root = wd[:pos]
        if root in name2node:
            target = name2node[root]
            break
    # Add new nodes as necessary
    fullname = root
    for letter in wd[pos:]:
        fullname += letter
        new_node = target.add_child(name=letter, dist=1.0)
        name2node[fullname] = new_node
        target = new_node

# Print structure
print tree.get_ascii()

# You can also use all the visualization machinery from ETE
# (http://packages.python.org/ete2/tutorial/tutorial_drawing.html)
# tree.show()

# You can find, isolate and operate with a specific node using the index
wh_node = name2node["whe"]
print wh_node.get_ascii()

# You can rebuild words under a given node
def reconstruct_fullname(node):
    name = []
    while node.up:
        name.append(node.name)
        node = node.up
    name = ''.join(reversed(name))
    return name

for leaf in wh_node.iter_leaves():
    print reconstruct_fullname(leaf)
                    /n-- /e-- /v-- /e-- /-r
               /e--|
     /w-- /h--|     \r-- /e-- /v-- /e-- /-r
    |          |
    |           \a-- /t-- /e-- /v-- /e-- /-r
    |
    |     /e-- /e-- /-l
----|-r--|
    |    |     /o-- /-t
    |     \o--|
    |          \a-- /-d
    |
    |               /-d
     \c-- /u-- /r--|
                    \-l