Create new Data frame from an existing one in pyspark

Create new Data frame from an existing one in pyspark - python

I created this dataframe with pySpark from txt file that includes searches queries and user ID.
`spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", "true") \
.option("delimiter", "\t") \
.option("inferSchema", "true") \
.csv("/content/drive/MyDrive/my_data.txt")
df.select("AnonID","Query").show()`
And it look like that:
+------+--------------------+
|AnonID| Query|
+------+--------------------+
| 142| rentdirect.com|
| 142|www.prescriptionf...|
| 142| staple.com|
| 142| staple.com|
| 142|www.newyorklawyer...|
| 142|www.newyorklawyer...|
| 142| westchester.gov|
| 142| space.comhttp|
| 142| dfdf|
| 142| dfdf|
| 142| vaniqa.comh|
| 142| www.collegeucla.edu|
| 142| www.elaorg|
| 142| 207 ad2d 530|
| 142| 207 ad2d 530|
| 142| broadway.vera.org|
| 142| broadway.vera.org|
| 142| vera.org|
| 142| broadway.vera.org|
| 142| frankmellace.com|
| 142| ucs.ljx.com|
| 142| attornyleslie.com|
| 142|merit release app...|
| 142| www.bonsai.wbff.org|
| 142| loislaw.com|
| 142| rapny.com|
| 142| whitepages.com|
| 217| lottery|
| 217| lottery|
| 217| ameriprise.com|
| 217| susheme|
| 217| united.com|
| 217| mizuno.com|
| 217|p; .; p;' p; ' ;'...|
| 217|p; .; p;' p; ' ;'...|
| 217|asiansexygoddess.com|
| 217| buddylis|
| 217|bestasiancompany.com|
| 217| lottery|
| 217| lottery|
| 217| ask.com|
| 217| weather.com|
| 217| wellsfargo.com|
| 217|www.tabiecummings...|
| 217| wanttickets.com|
| 217| yahoo.com|
| 217| -|
| 217| www.ngo-quen.org|
| 217| -|
| 217| vietnam|
+------+--------------------+
What I want to do is that each user ID will be a row and each query will be in a column.
+------+------------+---------
|ID | 1 | 2 | 3 .......
+------+------------+---------
|142| query1|query2| query3
|217| query1|query2| query3
|993| query1|query2| query3
|1268| query1|query2| query3
|1326| query1|query2| query3
.
.
.
I tried to switch between rows and columns with the help of a search I did on Google, but I didn't succeed.

You can group the dataframe by AnonID, and then pivot the Query column to create new columns for each unique query:
import pyspark.sql.functions as F
df = df.groupBy("AnonID").pivot("Query").agg(F.first("Query"))
If you have a lot of distinct values try
df = df.groupBy("AnonID").agg(F.collect_list("Query").alias("Queries"))
You can then rename the columns to 1, 2, 3, etc.
df = df.selectExpr("AnonID", *[f"`{i+1}` as `{i+1}`" for i in range(len(df.columns)-1)])

Related

Spark solution for below problem (Transform , Pivot, CrossJoin)

This is the input dataframe:
df1_input = spark.createDataFrame([ \
("P1","A","B","C"), \
("P1","D","E","F"), \
("P1","G","H","I"), \
("P1","J","K","L") ], ["Person","L1B","B2E","J3A"])
df1_input.show()
+------+---+---+---+
|Person|L1B|B2E|J3A|
+------+---+---+---+
| P1| A| B| C|
| P1| D| E| F|
| P1| G| H| I|
| P1| J| K| L|
+------+---+---+---+
Below gives the corresponding descriptions:
df1_item_details = spark.createDataFrame([ \
("L1B","item Desc1","A","Detail Desc1"), \
("L1B","item Desc1","D","Detail Desc2"), \
("L1B","item Desc1","G","Detail Desc3"), \
("L1B","item Desc1","J","Detail Desc4"), \
("B2E","item Desc2","B","Detail Desc5"), \
("B2E","item Desc2","E","Detail Desc6"), \
("B2E","item Desc2","H","Detail Desc7"), \
("B2E","item Desc2","K","Detail Desc8"), \
("J3A","item Desc3","C","Detail Desc9"), \
("J3A","item Desc3","F","Detail Desc10"), \
("J3A","item Desc3","I","Detail Desc11"), \
("J3A","item Desc3","L","Detail Desc12")], ["Item","Item Desc","Detail","Detail Desc"])
df1_item_details.show()
+----+----------+------+-------------+
|Item| Item Desc|Detail| Detail Desc|
+----+----------+------+-------------+
| L1B|item Desc1| A| Detail Desc1|
| L1B|item Desc1| D| Detail Desc2|
| L1B|item Desc1| G| Detail Desc3|
| L1B|item Desc1| J| Detail Desc4|
| B2E|item Desc2| B| Detail Desc5|
| B2E|item Desc2| E| Detail Desc6|
| B2E|item Desc2| H| Detail Desc7|
| B2E|item Desc2| K| Detail Desc8|
| J3A|item Desc3| C| Detail Desc9|
| J3A|item Desc3| F|Detail Desc10|
| J3A|item Desc3| I|Detail Desc11|
| J3A|item Desc3| L|Detail Desc12|
+----+----------+------+-------------+
Below are some standard information that need to be plastered on the final output:
df1_stdColumns = spark.createDataFrame([ \
("School","BMM"), \
("College","MSRIT"), \
("Workplace1","Blr"), \
("Workplace2","Chn")], ["StdKey","StdVal"])
df1_stdColumns.show()
+----------+------+
| StdKey|StdVal|
+----------+------+
| School| BMM|
| College| MSRIT|
|Workplace1| Blr|
|Workplace2| Chn|
+----------+------+
The expected output would look like below:
+--------+-----+---------------+-----+---------------+-----+----------------+--------+---------+------------+------------+
| Person | L1B | Item Desc1 | B2E | Item Desc2 | J3A | Item Desc3 | School | College | WorkPlace1 | WorkPlace2 |
+--------+-----+---------------+-----+---------------+-----+----------------+--------+---------+------------+------------+
| P1 | A | Detail Desc 1 | B | Detail Desc 5 | C | Detail Desc 9 | Bmm | MSRIT | Blr | Chn |
| P1 | D | Detail Desc 2 | E | Detail Desc 6 | F | Detail Desc 10 | Bmm | MSRIT | Blr | Chn |
| P1 | G | Detail Desc 3 | H | Detail Desc 7 | I | Detail Desc 11 | Bmm | MSRIT | Blr | Chn |
| P1 | J | Detail Desc 4 | K | Detail Desc 8 | L | Detail Desc 12 | Bmm | MSRIT | Blr | Chn |
+--------+-----+---------------+-----+---------------+-----+----------------+--------+---------+------------+------------+
Could someone suggest an optimal Spark way of doing this ? The input dataset size is in millions. Current code that i have runs for around 10hrs and it not optimal.. looking for some good performant spark(Python \ Scala \ Sql) code if possible
edit : Below is the code that i have that works but takes forever to finish when input volume is in millions
from pyspark.sql.functions import monotonically_increasing_id
import pyspark.sql.functions as F
from pyspark.sql.functions import array, col, explode, lit, struct
from pyspark.sql.types import StructType, StructField, LongType
from pyspark.sql import DataFrame
from typing import Iterable
#Databricks runtime 7.3 on spark 3.0.1 which supports AQE
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "1")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "10000")
spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionNum","1")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "1")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "10KB")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "1B")
df1_input=df1_input.withColumn("RecordId", monotonically_increasing_id())
df1_input_2=df1_input
#Custom function to do transpose
def melt(df: DataFrame, id_vars: Iterable[str], value_vars: Iterable[str], var_name: str="variable", value_name: str="Value") -> DataFrame:
# Create array<struct<variable: str, value: ...>>
_vars_and_vals = array(*(
struct(lit(c).alias(var_name), col(c).alias(value_name))
for c in value_vars))
# Add to the DataFrame and explode
_tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
cols = id_vars + [
col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
return _tmp.select(*cols)
df9_ColsList=melt(df1_stdColumns, id_vars=['StdKey'], value_vars=df1_stdColumns.columns).filter("variable <>'StdKey'")
df9_ColsList=df9_ColsList.groupBy("variable").pivot("StdKey").agg(F.first("value")).drop("variable")
df1_input_2=melt(df1_input_2, id_vars=['RecordId','Person'], value_vars=df1_input_2.columns).filter("variable != 'Person'").filter("variable != 'RecordId'").withColumnRenamed('variable','Name')
df_prevStepInputItemDets=(df1_input_2.join(df1_item_details,(df1_input_2.Name == df1_item_details.Item) & (df1_input_2.Value == df1_item_details.Detail)))
#Since pivot performs better if the columns are know in advance, sacrificing a collect to do it. (Since Pivot without providing this was performing worse)
CurrStagePivotCols_tmp = df1_item_details.select("Item Desc").rdd.flatMap(lambda x: x).collect()
CurrStagePivotCols = []
[CurrStagePivotCols.append(x) for x in CurrStagePivotCols_tmp if x not in CurrStagePivotCols]
df_prevStepInputItemDets=(df_prevStepInputItemDets \
.groupBy('RecordId',"Person") \
.pivot("Item Desc",CurrStagePivotCols) \
#.pivot("Item Desc") \
.agg(F.first("Detail Desc"))).drop("RecordId")
#combine codes and descriptions
#Add rowNumber to both dataframes so that they can be merged side-by-side
def add_rowNum(sdf):
new_schema = StructType(sdf.schema.fields + [StructField("RowNum", LongType(), False),])
return sdf.rdd.zipWithIndex().map(lambda row: row[0] + (row[1],)).toDF(schema=new_schema)
ta = df1_input.alias('ta')
tb = df_prevStepInputItemDets.alias('tb')
ta = add_rowNum(ta)
tb = add_rowNum(tb)
df9_code_desc = tb.join(ta.drop("Katashiki"), on="RowNum",how='inner').drop("RowNum")
#CrossJoin to plaster standard columns
df9_final=df9_code_desc.crossJoin(df9_ColsList).drop("RecordId")
display(df9_final)

CoGroupByKey, emit after n records have been grouped

I'm trying to parallelize some heavier computations like so:
inputs = (p | "Read" >> beam.io.ReadFromAvro('/mypath/myavrofiles*')
| "Generate Key" >> beam.Map(lambda row: (gen_key(row), row)))
calc1_results = inputs | "perform calc1" >> beam.Pardo(Calc1())
calc2_results = inputs | "perform calc2" >> beam.Pardo(Calc2())
combined = (({"calc1": calc1_results, "calc2": calc2_results})
| beam.CoGroupByKey()
| beam.Values())
final = combined | "Use Grouped results" >> beam.ParDo(PerformFinalCalculation())
Each heavy calc emits (key, result)
Each key is unique for each input. One input, One result, one Key
Is there some way to emit from the CoGroupByKey after a single result1/result2 has been collected for each key?
Ultimately I'd like to achieve something along the lines of:
+------------+
| |
| Input |
| +-----------------+
+------------+ |
| |
v-------------------v |
+------------+ +------------+ |
| | | | |
| Heavy | | Heavy | |
| Calc 1 | | Calc 2 | |
| | | | |
+------------+ +------------+ |
| | |
| | |
| | |
+--v------------v--+ |
| Merged | |
| original dict, +<---------------+
|result 1, result2 |
| |
+------------------+

vlookup on text field using pandas

I need to use vlookup functionality in pandas.
DataFrame 2: (FEED_NAME has no duplicate rows)
+-----------+--------------------+---------------------+
| FEED_NAME | Name | Source |
+-----------+--------------------+---------------------+
| DMSN | DMSN_YYYYMMDD.txt | Main hub |
| PCSUS | PCSUS_YYYYMMDD.txt | Basement |
| DAMJ | DAMJ_YYYYMMDD.txt | Effiel Tower router |
+-----------+--------------------+---------------------+
DataFrame 1:
+-------------+
| SYSTEM_NAME |
+-------------+
| DMSN |
| PCSUS |
| DAMJ |
| : |
| : |
+-------------+
DataFrame 1 contains lot more number of rows. It is acutally a column in much larger table. I need to merger df1 with df2 to make it look like:
+-------------+--------------------+---------------------+
| SYSTEM_NAME | Name | Source |
+-------------+--------------------+---------------------+
| DMSN | DMSN_YYYYMMDD.txt | Main Hub |
| PCSUS | PCSUS_YYYYMMDD.txt | Basement |
| DAMJ | DAMJ_YYYYMMDD.txt | Eiffel Tower router |
| : | | |
| : | | |
+-------------+--------------------+---------------------+
in excel I just would have done VLOOKUP(,,1,TRUE) and then copied the same across all cells.
I have tried various combinations with merge and join but I keep getting KeyError:'SYSTEM_NAME'
Code:
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"})
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()

You missed the inplace=True argument in the line _df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"}) so the _df2 columns name haven't changed. Try this instead :
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"}, inplace=True)
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()

Psycopg2 Insert into with conditions

I have this table in Postgres
ID| | IP | Remote-as | IRR-Record |
1 | | 192.168.1.1 |100 | |
2 | | 192.168.2.1 |200 | |
3 | | 192.168.3.1 |300 | |
4 | | 192.168.4.1 |400 | |
I want to add for each ip address the IIR-Record.
The IIR-record is inside a variable.
c = conn.cursor()
query = 'select * From "peers"'
c.execute(query)
for row in c:
c.execute('''INSERT INTO "peers" ("IRR-
Record") VALUES(variable)
conn.commit()
This code doesn't work because i gat the IIR-Record at the end of my table.
ID| | IP | Remote-as | IRR-Record |
1 | | 192.168.1.1 |100 | |
2 | | 192.168.2.1 |200 | |
3 | | 192.168.3.1 |300 | |
4 | | 192.168.4.1 |400 |
|Variable
|Variable
|Varibale
any Idea!!!!

You need to use an UPDATE query instead of INSERT
UPDATE peers
SET IRR-Record = <<MYVALUE>>
WHERE ID = <<MYID>>

I think it's something like this:
for row in c:
c.execute('UPDATE peers SET "IRR-Record"=%s WHERE ID=%s', (record_var, id_var))
Edit: Use UPDATE sql statement as per #Devasta's answer.

RDF/SKOS concept hierarchy as Python dictionary

In Python, how do I turn RDF/SKOS taxonomy data into a dictionary that represents the concept hierarchy only?
The dictionary must have this format:
{ 'term1': [ 'term2', 'term3'], 'term3': [{'term4' : ['term5', 'term6']}, 'term6']}
I tried using RDFLib with JSON plugins, but did not get the result I want.

I'm not much of a Python user, and I haven't worked with RDFLib, but I just pulled the SKOS and vocabulary from the SKOS vocabularies page. I wasn't sure what concepts (RDFS or OWL classes) were in the vocabulary, nor what their hierarchy was, so I ran this a SPARQL query using Jena's ARQ to select classes and their subclasses. I didn't get any results. (There were classes defined of course, but none had subclasses.) Then I decided to use both the SKOS and SKOS-XL vocabularies, and to ask for properties and subproperties as well as classes and subclasses. This is the SPARQL query I used:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?property ?subproperty ?class ?subclass WHERE {
{ ?subclass rdfs:subClassOf ?class }
UNION
{ ?subproperty rdfs:subPropertyOf ?property }
}
ORDER BY ?class ?property
The results I got were
-------------------------------------------------------------------------------------------------------------------
| property | subproperty | class | subclass |
===================================================================================================================
| rdfs:label | skos:altLabel | | |
| rdfs:label | skos:hiddenLabel | | |
| rdfs:label | skos:prefLabel | | |
| skos:broader | skos:broadMatch | | |
| skos:broaderTransitive | skos:broader | | |
| skos:closeMatch | skos:exactMatch | | |
| skos:inScheme | skos:topConceptOf | | |
| skos:mappingRelation | skos:broadMatch | | |
| skos:mappingRelation | skos:closeMatch | | |
| skos:mappingRelation | skos:narrowMatch | | |
| skos:mappingRelation | skos:relatedMatch | | |
| skos:narrower | skos:narrowMatch | | |
| skos:narrowerTransitive | skos:narrower | | |
| skos:note | skos:changeNote | | |
| skos:note | skos:definition | | |
| skos:note | skos:editorialNote | | |
| skos:note | skos:example | | |
| skos:note | skos:historyNote | | |
| skos:note | skos:scopeNote | | |
| skos:related | skos:relatedMatch | | |
| skos:semanticRelation | skos:broaderTransitive | | |
| skos:semanticRelation | skos:mappingRelation | | |
| skos:semanticRelation | skos:narrowerTransitive | | |
| skos:semanticRelation | skos:related | | |
| | | _:b0 | <http://www.w3.org/2008/05/skos-xl#Label> |
| | | skos:Collection | skos:OrderedCollection |
-------------------------------------------------------------------------------------------------------------------
It looks like there's not much concept hierarchy in SKOS at all. Could that explain why you didn't get the results you wanted before?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Create new Data frame from an existing one in pyspark - python

Related

Spark solution for below problem (Transform , Pivot, CrossJoin)

CoGroupByKey, emit after n records have been grouped

vlookup on text field using pandas

Psycopg2 Insert into with conditions

RDF/SKOS concept hierarchy as Python dictionary

Categories

Resources