I'm trying to parallelize some heavier computations like so:
inputs = (p | "Read" >> beam.io.ReadFromAvro('/mypath/myavrofiles*')
            | "Generate Key" >> beam.Map(lambda row: (gen_key(row), row)))

calc1_results = inputs | "perform calc1" >> beam.ParDo(Calc1())
calc2_results = inputs | "perform calc2" >> beam.ParDo(Calc2())

combined = ({"calc1": calc1_results, "calc2": calc2_results}
            | beam.CoGroupByKey()
            | beam.Values())

final = combined | "Use Grouped results" >> beam.ParDo(PerformFinalCalculation())
Each heavy calc emits a (key, result) pair.
Each key is unique to its input: one input, one result, one key.
Is there some way to emit from the CoGroupByKey once a single result1/result2 has been collected for each key?
Ultimately I'd like to achieve something along the lines of:
+------------+
| |
| Input |
| +-----------------+
+------------+ |
| |
v-------------------v |
+------------+ +------------+ |
| | | | |
| Heavy | | Heavy | |
| Calc 1 | | Calc 2 | |
| | | | |
+------------+ +------------+ |
| | |
| | |
| | |
+--v------------v--+ |
| Merged | |
| original dict, +<---------------+
|result 1, result2 |
| |
+------------------+
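In other words, something like this untested sketch of what I have in mind, where the original rows are fed into the CoGroupByKey alongside the two result streams so that each key's group holds one original dict, one calc1 result, and one calc2 result (gen_key, Calc1, Calc2 and PerformFinalCalculation are the same placeholders as above):
import apache_beam as beam

with beam.Pipeline() as p:
    inputs = (p
              | "Read" >> beam.io.ReadFromAvro('/mypath/myavrofiles*')
              | "Generate Key" >> beam.Map(lambda row: (gen_key(row), row)))

    calc1_results = inputs | "perform calc1" >> beam.ParDo(Calc1())
    calc2_results = inputs | "perform calc2" >> beam.ParDo(Calc2())

    # Each key appears exactly once in every branch, so each grouped value
    # should look like {'original': [row], 'calc1': [result1], 'calc2': [result2]}.
    final = ({"original": inputs, "calc1": calc1_results, "calc2": calc2_results}
             | beam.CoGroupByKey()
             | beam.Values()
             | "Use Grouped results" >> beam.ParDo(PerformFinalCalculation()))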
Can anyone help me sort rows into the order in which pages were viewed?
I have a dataframe that I am attempting to sort by the previous page viewed, and I am having a really hard time coming up with an efficient method using Pandas.
For example from this:
+------------+------------------+----------+
| Customer | previousPagePath | pagePath |
+------------+------------------+----------+
| 1051471580 | A | D |
| 1051471580 | C | B |
| 1051471580 | A | exit |
| 1051471580 | B | A |
| 1051471580 | D | A |
| 1051471580 | entrance | C |
+------------+------------------+----------+
To this:
+------------+------------------+----------+
| Customer | previousPagePath | pagePath |
+------------+------------------+----------+
| 1051471580 | entrance | C |
| 1051471580 | C | B |
| 1051471580 | B | A |
| 1051471580 | A | D |
| 1051471580 | D | A |
| 1051471580 | A | exit |
+------------+------------------+----------+
However, it could be millions of rows long for thousands of different customers, so I really need to think about how to make this efficient.
pd.DataFrame({
    'Customer': '1051471580',
    'previousPagePath': ['E', 'C', 'B', 'A', 'D', 'A'],
    'pagePath': ['C', 'B', 'A', 'D', 'A', 'F']
})
Thanks!
What you're trying to do is topological sorting, which can be achieved with networkx. Note that I had to change some values in your dataframe to prevent it from throwing a cycle error, so I hope the data you work with contains unique values:
import networkx as nx
import pandas as pd

data = [[1051471580, "Z", "D"], [1051471580, "C", "B"], [1051471580, "A", "exit"],
        [1051471580, "B", "Z"], [1051471580, "D", "A"], [1051471580, "entrance", "C"]]
df = pd.DataFrame(data, columns=['Customer', 'previousPagePath', 'pagePath'])

# Build a directed graph of page transitions and sort it topologically.
edges = df[df.pagePath != df.previousPagePath].reset_index()
dg = nx.from_pandas_edgelist(edges, source='previousPagePath', target='pagePath',
                             create_using=nx.DiGraph())
order = list(nx.lexicographical_topological_sort(dg))

# Reorder the original rows to follow the topological order of previousPagePath.
result = df.set_index('previousPagePath').loc[order[:-1], :].dropna().reset_index()
result = result[['Customer', 'previousPagePath', 'pagePath']]
Output:
| | Customer | previousPagePath | pagePath |
|---:|-----------:|:-------------------|:-----------|
| 0 | 1051471580 | entrance | C |
| 1 | 1051471580 | C | B |
| 2 | 1051471580 | B | Z |
| 3 | 1051471580 | Z | D |
| 4 | 1051471580 | D | A |
| 5 | 1051471580 | A | exit |
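Since you mention millions of rows across thousands of customers, a rough, untested sketch of applying the same idea per customer might look like the following (order_pages is just a helper name I made up, and the key argument of sort_values needs pandas >= 1.1):
import networkx as nx
import pandas as pd

def order_pages(group):
    # Topologically sort one customer's page transitions (assumes no cycles).
    dg = nx.from_pandas_edgelist(group, source='previousPagePath',
                                 target='pagePath', create_using=nx.DiGraph())
    position = {page: i for i, page in enumerate(nx.lexicographical_topological_sort(dg))}
    return group.sort_values(by='previousPagePath', key=lambda s: s.map(position))

# Apply the ordering independently to every customer.
result = df.groupby('Customer', group_keys=False).apply(order_pages)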
You can sort your DataFrame by a column like this:
df = pd.DataFrame({'Customer': '1051471580',
                   'previousPagePath': ['E', 'C', 'B', 'A', 'D', 'A'],
                   'pagePath': ['C', 'B', 'A', 'D', 'A', 'F']})
df.sort_values(by='previousPagePath')
You can find the documentation here: pandas.DataFrame.sort_values
I have the following two tables in PySpark:
Table A - dfA
| ip_4 | ip |
|---------------|--------------|
| 10.10.10.25 | 168430105 |
| 10.11.25.60 | 168499516 |
And table B - dfB
| net_cidr | net_ip_first_4 | net_ip_last_4 | net_ip_first | net_ip_last |
|---------------|----------------|----------------|--------------|-------------|
| 10.10.10.0/24 | 10.10.10.0 | 10.10.10.255 | 168430080 | 168430335 |
| 10.10.11.0/24 | 10.10.11.0 | 10.10.11.255 | 168430336 | 168430591 |
| 10.11.0.0/16 | 10.11.0.0 | 10.11.255.255 | 168493056 | 168558591 |
I have joined both tables in PySpark using the following command:
dfJoined = dfB.alias('b').join(F.broadcast(dfA).alias('a'),
                               (F.col('a.ip') >= F.col('b.net_ip_first')) &
                               (F.col('a.ip') <= F.col('b.net_ip_last')),
                               how='right').select('a.*', 'b.*')
So I obtain:
| ip | net_cidr | net_ip_first_4 | net_ip_last_4| ...
|---------------|---------------|----------------|--------------| ...
| 10.10.10.25 | 10.10.10.0/24 | 10.10.10.0 | 10.10.10.255 | ...
| 10.11.25.60 | 10.10.11.0/24 | 10.10.11.0 | 10.10.11.255 | ...
The size of the tables makes this option suboptimal because of the two join conditions, so I had thought of sorting table B so that the join only needs one condition.
Is there any way to limit the join and take only the first record that matches the join condition? Or some other way to make the join more efficient?
Table A (number of records) << Table B (number of records)
Thank you!
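For reference, one idea I had been considering (an untested sketch, reusing dfJoined and the column names from above) is to keep only the narrowest matching network per IP with a window function, rather than limiting the join itself:
from pyspark.sql import Window
from pyspark.sql import functions as F

# Rank the candidate networks per IP by range size and keep the narrowest one.
w = Window.partitionBy('ip').orderBy(F.col('net_ip_last') - F.col('net_ip_first'))

dfFirstMatch = (dfJoined
                .withColumn('rn', F.row_number().over(w))
                .filter(F.col('rn') == 1)
                .drop('rn'))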
I need to use vlookup functionality in pandas.
DataFrame 2: (FEED_NAME has no duplicate rows)
+-----------+--------------------+---------------------+
| FEED_NAME | Name | Source |
+-----------+--------------------+---------------------+
| DMSN | DMSN_YYYYMMDD.txt | Main hub |
| PCSUS | PCSUS_YYYYMMDD.txt | Basement |
| DAMJ      | DAMJ_YYYYMMDD.txt  | Eiffel Tower router |
+-----------+--------------------+---------------------+
DataFrame 1:
+-------------+
| SYSTEM_NAME |
+-------------+
| DMSN |
| PCSUS |
| DAMJ |
| : |
| : |
+-------------+
DataFrame 1 contains many more rows; it is actually a column in a much larger table. I need to merge df1 with df2 to make it look like this:
+-------------+--------------------+---------------------+
| SYSTEM_NAME | Name | Source |
+-------------+--------------------+---------------------+
| DMSN | DMSN_YYYYMMDD.txt | Main Hub |
| PCSUS | PCSUS_YYYYMMDD.txt | Basement |
| DAMJ | DAMJ_YYYYMMDD.txt | Eiffel Tower router |
| : | | |
| : | | |
+-------------+--------------------+---------------------+
In Excel I would just have done VLOOKUP(,,1,TRUE) and then copied it across all cells.
I have tried various combinations of merge and join, but I keep getting KeyError: 'SYSTEM_NAME'.
Code:
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"})
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()
You missed the inplace=True argument in the line _df2.rename(columns={'FEED_NAME': "SYSTEM_NAME"}), so the _df2 column names haven't changed. Try this instead:
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"}, inplace=True)
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()
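As a side note, an equivalent, untested variant that avoids in-place mutation is to assign the renamed frame back before merging:
# Rename by assignment instead of inplace=True, then merge as before.
_df2 = df2[['FEED_NAME', 'Name', 'Source']].rename(columns={'FEED_NAME': 'SYSTEM_NAME'})
_df3 = pd.merge(df1[['SYSTEM_NAME']], _df2, how='left', on='SYSTEM_NAME')
_df3.head()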
s = [0,2,6,4,7,1,5,3]
def row_top():
print("|--|--|--|--|--|--|--|--|")
def cell_left():
print("| ", end = "")
def solution(s):
for i in range(8):
row(s[i])
def cell_data(isQ):
if isQ:
print("X", end = "")
return ()
else:
print(" ", end = "")
def row_data(c):
for i in range(9):
cell_left()
cell_data(i == c)
def row(c):
row_top()
row_data(c)
print("\n")
solution(s)
My output has a blank line every two rows when there shouldn't be, and I'm not sure where that extra line is coming from.
The output is supposed to look like this:
|--|--|--|--|--|--|--|--|
| | | | | | X| | |
|--|--|--|--|--|--|--|--|
| | | X| | | | | |
|--|--|--|--|--|--|--|--|
| | | | | X| | | |
|--|--|--|--|--|--|--|--|
| | | | | | | | X|
|--|--|--|--|--|--|--|--|
| X| | | | | | | |
|--|--|--|--|--|--|--|--|
| | | | X| | | | |
|--|--|--|--|--|--|--|--|
| | X| | | | | | |
|--|--|--|--|--|--|--|--|
| | | | | | | X| |
|--|--|--|--|--|--|--|--|
I know this chess board isn't very square but this is only a rough draft at the moment.
Here is an alternative implementation:
def make_row(rowdata, col, empty, full):
items = [col] * (2*len(rowdata) + 1)
items[1::2] = (full if d else empty for d in rowdata)
return ''.join(items)
def make_board(queens, col="|", row="---", empty=" ", full=" X "):
size = len(queens)
bar = make_row(queens, col, row, row)
board = [bar] * (2*size + 1)
board[1::2] = (make_row([i==q for i in range(size)], col, empty, full) for q in queens)
return '\n'.join(board)
queens = [0,2,6,4,7,1,5,3]
print(make_board(queens))
which results in
|---|---|---|---|---|---|---|---|
| X | | | | | | | |
|---|---|---|---|---|---|---|---|
| | | X | | | | | |
|---|---|---|---|---|---|---|---|
| | | | | | | X | |
|---|---|---|---|---|---|---|---|
| | | | | X | | | |
|---|---|---|---|---|---|---|---|
| | | | | | | | X |
|---|---|---|---|---|---|---|---|
| | X | | | | | | |
|---|---|---|---|---|---|---|---|
| | | | | | X | | |
|---|---|---|---|---|---|---|---|
| | | | X | | | | |
|---|---|---|---|---|---|---|---|
It is now very easy to change the width of the board by changing the strings passed to row, empty, full; I added an extra char to each, resulting in a (somewhat) squarer board.
You are still printing an extra newline:
def row(c):
row_top()
row_data(c)
print("\n")
Remove the explicit '\n' character:
def row(c):
row_top()
row_data(c)
print()
or better still, follow my previous answer more closely and print a closing | bar:
def row(c):
row_top()
row_data(c)
print('|')
In Python, how do I turn RDF/SKOS taxonomy data into a dictionary that represents the concept hierarchy only?
The dictionary must have this format:
{ 'term1': [ 'term2', 'term3'], 'term3': [{'term4' : ['term5', 'term6']}, 'term6']}
I tried using RDFLib with JSON plugins, but did not get the result I want.
I'm not much of a Python user, and I haven't worked with RDFLib, but I just pulled the SKOS vocabulary from the SKOS vocabularies page. I wasn't sure what concepts (RDFS or OWL classes) were in the vocabulary, nor what their hierarchy was, so I ran a SPARQL query using Jena's ARQ to select classes and their subclasses. I didn't get any results. (There were classes defined, of course, but none had subclasses.) Then I decided to use both the SKOS and SKOS-XL vocabularies, and to ask for properties and subproperties as well as classes and subclasses. This is the SPARQL query I used:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?property ?subproperty ?class ?subclass WHERE {
{ ?subclass rdfs:subClassOf ?class }
UNION
{ ?subproperty rdfs:subPropertyOf ?property }
}
ORDER BY ?class ?property
The results I got were
-------------------------------------------------------------------------------------------------------------------
| property | subproperty | class | subclass |
===================================================================================================================
| rdfs:label | skos:altLabel | | |
| rdfs:label | skos:hiddenLabel | | |
| rdfs:label | skos:prefLabel | | |
| skos:broader | skos:broadMatch | | |
| skos:broaderTransitive | skos:broader | | |
| skos:closeMatch | skos:exactMatch | | |
| skos:inScheme | skos:topConceptOf | | |
| skos:mappingRelation | skos:broadMatch | | |
| skos:mappingRelation | skos:closeMatch | | |
| skos:mappingRelation | skos:narrowMatch | | |
| skos:mappingRelation | skos:relatedMatch | | |
| skos:narrower | skos:narrowMatch | | |
| skos:narrowerTransitive | skos:narrower | | |
| skos:note | skos:changeNote | | |
| skos:note | skos:definition | | |
| skos:note | skos:editorialNote | | |
| skos:note | skos:example | | |
| skos:note | skos:historyNote | | |
| skos:note | skos:scopeNote | | |
| skos:related | skos:relatedMatch | | |
| skos:semanticRelation | skos:broaderTransitive | | |
| skos:semanticRelation | skos:mappingRelation | | |
| skos:semanticRelation | skos:narrowerTransitive | | |
| skos:semanticRelation | skos:related | | |
| | | _:b0 | <http://www.w3.org/2008/05/skos-xl#Label> |
| | | skos:Collection | skos:OrderedCollection |
-------------------------------------------------------------------------------------------------------------------
It looks like there's not much concept hierarchy in SKOS at all. Could that explain why you didn't get the results you wanted before?
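If it helps, a rough, untested rdflib sketch along these lines could collect skos:broader/skos:narrower links from your own data into a flat parent-to-children dictionary (my_taxonomy.rdf is a placeholder file name):
from collections import defaultdict
from rdflib import Graph, Namespace

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

g = Graph()
g.parse("my_taxonomy.rdf")  # placeholder: your own SKOS data

hierarchy = defaultdict(list)

# skos:broader points from the narrower concept to its parent.
for narrower, broader in g.subject_objects(SKOS.broader):
    hierarchy[str(broader)].append(str(narrower))

# Also follow explicit skos:narrower links, if the data uses them.
for broader, narrower in g.subject_objects(SKOS.narrower):
    if str(narrower) not in hierarchy[str(broader)]:
        hierarchy[str(broader)].append(str(narrower))

print(dict(hierarchy))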