How do I turn each row's dict values into separate columns? - python

The columns currently look like this:
| index | nutrition per 100g |
|:------:| -----:|
| 0 | {'energy': 81.12946131894766, 'fat': 2.14013926351589, 'protein': 6.914436593565535, 'salt': 0.055978167389859, 'saturates': 0.36534716195613903, 'sugars': 5.086341034361439} |
| 1 | {'energy': 477.09640393594606, 'fat': 23.412485931109796, 'protein': 7.625491714677334, 'salt': 0.5486205522805531, 'saturates': 3.425053768233838, 'sugars': 14.298442949953758}|
I want it to look like this:
| index | energy | fat | protein | salt | saturates | sugars |
|:----|:---- |:------:| -----:|----:|----:|----:|
|0 | 81.12946131894766 | 2.14013926351589 | 6.914436593565535 | 0.055978167389859 | 0.36534716195613903 | 5.086341034361439 |
| 1| 477.09640393594606 | 23.412485931109796 | 7.625491714677334 | 0.5486205522805531 | 3.425053768233838 | 14.298442949953758 |

If your initial data is already a dataframe, then you could try something like this:
import pandas as pd

# Use the first row's dict to create a new df
new_df = pd.DataFrame(initial_df.iloc[0]["nutrition per 100g"], index=[0])
# Then iterate over the remaining rows and append each one to the new dataframe
for index, row in initial_df.iterrows():
    if index == 0:
        continue
    temp_df = pd.DataFrame(row["nutrition per 100g"], index=[index])
    new_df = pd.concat([new_df, temp_df], ignore_index=True)
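A shorter alternative, assuming the cells already hold real Python dicts rather than strings (if they are strings, parse them first with ast.literal_eval), is to build the new frame in one call:
import pandas as pd

# One column per dict key, rows aligned with the original index
# (equivalently: initial_df["nutrition per 100g"].apply(pd.Series))
new_df = pd.DataFrame(initial_df["nutrition per 100g"].tolist(), index=initial_df.index)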

Related

how to find the sum of a dataframe?

While computing the sum as follows:
g.loc[g.index[0], 'sum'] = g[RDM].sum()
where RDM is
RDM = [f"R_Dist_meas_{i}" for i in range(48)]
the error was as follows:
KeyError: "None of [Index(['R_Dist_meas_0', 'R_Dist_meas_1', 'R_Dist_meas_2', ... 'R_Dist_meas_45', 'R_Dist_meas_46', 'R_Dist_meas_47'], dtype='object')] are in the [columns]"
The sample dataframe is as follows; it has many other columns besides the distance ones (angle, velocity, etc.).
The column layout of the dataframe is A0 B0 C0 A1 B1 C1 A2 B2 C2 A3 B3 C3 ....... A47 B47 C47:
| R_Dist_meas_0 | R_vel_meas_0 | R_Dist_meas_1 | R_vel_meas_1 | R_Dist_meas_2 | R_vel_meas_2 | ... | R_Dist_meas_47 | R_vel_meas_47 |
|---------------|--------------|---------------|--------------|---------------|--------------|-----|----------------|---------------|
| 5             |              |               |              |               |              |     |                |               |
|               |              |               |              | 10            |              |     |                |               |
|               |              |               |              | 8             |              |     |                |               |
| 2             |              | 8             |              |               |              |     |                |               |
The expected sum is 33.
How can I solve this?
Your list comprehension builds 48 column names, but the sample dataframe only contains a few of them (up to R_Dist_meas_2, plus R_Dist_meas_47), so indexing g with the full RDM list raises the KeyError: every name in RDM is looked up as a column label, and most of them are missing. Select only the distance columns that actually exist and sum those:
# Keep only the distance columns that are present in g, then total them
dist_cols = [c for c in RDM if c in g.columns]
sum(g[dist_cols].sum())
The inner .sum() totals each column separately (skipping the empty cells), and the outer sum() adds those column totals together for the final result. This should give you the sum you are looking for.
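For reference, a minimal reproduction of the sample table above (with NaN standing in for the blank cells) confirms the total comes out to 33:
import numpy as np
import pandas as pd

# Rebuild the sample dataframe shown in the question
g = pd.DataFrame({
    "R_Dist_meas_0": [5, np.nan, np.nan, 2],
    "R_vel_meas_0":  [np.nan] * 4,
    "R_Dist_meas_1": [np.nan, np.nan, np.nan, 8],
    "R_vel_meas_1":  [np.nan] * 4,
    "R_Dist_meas_2": [np.nan, 10, 8, np.nan],
    "R_vel_meas_2":  [np.nan] * 4,
})

RDM = [f"R_Dist_meas_{i}" for i in range(48)]
dist_cols = [c for c in RDM if c in g.columns]
print(sum(g[dist_cols].sum()))  # 33.0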

Pyspark: match columns from two different dataframes and add value

I am trying to compare the values of two columns that exist in different dataframes, in order to create a new dataframe based on whether the values match:
df1=
| id |
| -- |
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
df2 =
| id |
| -- |
| 2 |
| 5 |
| 1 |
So I want to put an 'X' in the is_used field when the id from df1 also exists in df2, and 'NA' otherwise, generating a result dataframe like this:
df3 =
| id | is_used |
| -- | ------- |
| 1 | X |
| 2 | X |
| 3 | NA |
| 4 | NA |
| 5 | X |
I have tried this way, but the selection criteria places an 'X' in every row:
df3 = df3.withColumn('is_used', F.when(
condition = (F.arrays_overlap(F.array(df1.id), F.array(df2.id))) == False,
value = 'NA'
).otherwise('X'))
I would appreciate any help
Try a full outer join:
df3 = (
df1.join(df2.alias("df2"), df1.id == df2.id, "fullouter")
.withColumn(
"is_used",
F.when(F.col("df2.id").isNotNull(), F.lit("X")).otherwise(F.lit("NA")),
)
.drop(F.col("df2.id"))
.orderBy(F.col("id"))
)
Result:
+---+-------+
|id |is_used|
+---+-------+
|1 |X |
|2 |X |
|3 |NA |
|4 |NA |
|5 |X |
+---+-------+
Try the following code; it will give you a similar result and you can make the rest of the changes:
df3 = df1.alias("df1") \
    .join(df2.alias("df2"), df1.id == df2.id, how='left') \
    .withColumn('is_true', F.when(df1.id == df2.id, F.lit("X")).otherwise(F.lit("NA"))) \
    .select("df1.*", "is_true")
df3.show()
First of all, I want to thank the people who contributed their code; it was very useful for understanding what was happening.
The problem was that when doing df1.id == df2.id, Spark resolved both columns as the same one because they had the same name, so the comparison was always True.
Renaming the field I wanted to compare was enough, and it worked for me.
Here is the code:
df2 = df2.withColumnRenamed("id", "id1")
df3 = df1.alias("df1").join(df2.alias("df2"), df1.id == df2.id1, "left")
df3 = df3.withColumn("is_used", F.when(df1.id == df2.id1, "X").otherwise("NA"))
df3 = df3.drop("id1")
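For comparison, a sketch of the same left-join idea that keeps the original column names and disambiguates them through dataframe aliases instead of renaming (same df1 and df2 as above; the alias names "a" and "b" are arbitrary):
from pyspark.sql import functions as F

df3 = (
    df1.alias("a").join(df2.alias("b"), F.col("a.id") == F.col("b.id"), "left")
    # b.id is null exactly when the id from df1 has no match in df2
    .withColumn("is_used", F.when(F.col("b.id").isNotNull(), "X").otherwise("NA"))
    .select("a.id", "is_used")
    .orderBy("id")
)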

Create a dataframe by iterating over column of list in another dataframe

In pyspark, I have a DataFrame with a column that contains a list of ordered nodes to go through:
osmDF.schema
Out[1]:
StructType(List(StructField(id,LongType,true),
StructField(nodes,ArrayType(LongType,true),true),
StructField(tags,MapType(StringType,StringType,true),true)))
osmDF.head(3)
Out[2]:
| id | nodes | tags |
|-----------|-----------------------------------------------------|---------------------|
| 62960871  | [783186590,783198852]                               | {"foo":"bar"} |
| 211528816 | [2215187080,2215187140,2215187205,2215187256]       | {"foo":"boo"} |
| 62960872  | [783198772,783183397,783167527,783169067,783198772] | {"foo":"buh"} |
I need to create a dataframe with a row for each consecutive pair of nodes in the nodes list, and then save it as parquet.
For each input row the result has n-1 rows, where n is len(nodes). It would look like this (with other columns that I'll add):
| id | from | to | tags |
|-----------------------|------------|------------|---------------------|
| 783186590_783198852   | 783186590  | 783198852  | {"foo":"bar"}       |
| 2215187080_2215187140 | 2215187080 | 2215187140 | {"foo":"boo"}       |
| 2215187140_2215187205 | 2215187140 | 2215187205 | {"foo":"boo"}       |
| 2215187205_2215187256 | 2215187205 | 2215187256 | {"foo":"boo"}       |
| 783198772_783183397   | 783198772  | 783183397  | {"foo":"buh"}       |
| 783183397_783167527   | 783183397  | 783167527  | {"foo":"buh"}       |
| 783167527_783169067   | 783167527  | 783169067  | {"foo":"buh"}       |
| 783169067_783198772   | 783169067  | 783198772  | {"foo":"buh"}       |
I tried to start with the following:
from pyspark.sql.functions import udf

def split_ways_into_arcs(row):
    arcs = []
    for node in range(len(row['nodes']) - 1):
        arc = dict()
        arc['id'] = str(row['nodes'][node]) + "_" + str(row['nodes'][node + 1])
        arc['from'] = row['nodes'][node]
        arc['to'] = row['nodes'][node + 1]
        arc['tags'] = row['tags']
        arcs.append(arc)
    return arcs

# Declare the function as a udf
split = udf(lambda row: split_ways_into_arcs(row.asDict()))
The issue I'm having is I don't know how many nodes there are in each row of the original DataFrame.
I know how to apply a udf to add a column to an existing DataFrame, but not to create a new one from lists of dicts.
Iterate over the nodes array using transform and explode the array afterwards:
from pyspark.sql import functions as F
df = ...
df.withColumn("nodes", F.expr("transform(nodes, (n,i) -> named_struct('from', nodes[i], 'to', nodes[i+1]))")) \
.withColumn("nodes", F.explode("nodes")) \
.filter("not nodes.to is null") \
.selectExpr("concat_ws('_', nodes.from, nodes.to) as id", "nodes.*", "tags") \
.show(truncate=False)
Output:
+---------------------+----------+----------+-------------+
|id                   |from      |to        |tags         |
+---------------------+----------+----------+-------------+
|783186590_783198852  |783186590 |783198852 |{"foo":"bar"}|
|2215187080_2215187140|2215187080|2215187140|{"foo":"boo"}|
|2215187140_2215187205|2215187140|2215187205|{"foo":"boo"}|
|2215187205_2215187256|2215187205|2215187256|{"foo":"boo"}|
|783198772_783183397  |783198772 |783183397 |{"foo":"buh"}|
|783183397_783167527  |783183397 |783167527 |{"foo":"buh"}|
|783167527_783169067  |783167527 |783169067 |{"foo":"buh"}|
|783169067_783198772  |783169067 |783198772 |{"foo":"buh"}|
+---------------------+----------+----------+-------------+
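Since the stated goal is to save the result as parquet, the same chain can end with a write instead of show (a sketch; the output path is only a placeholder):
from pyspark.sql import functions as F

df = ...
arcs = df.withColumn("nodes", F.expr("transform(nodes, (n,i) -> named_struct('from', nodes[i], 'to', nodes[i+1]))")) \
    .withColumn("nodes", F.explode("nodes")) \
    .filter("not nodes.to is null") \
    .selectExpr("concat_ws('_', nodes.from, nodes.to) as id", "nodes.*", "tags")
arcs.write.mode("overwrite").parquet("/path/to/arcs.parquet")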

vlookup on text field using pandas

I need to use vlookup functionality in pandas.
DataFrame 2: (FEED_NAME has no duplicate rows)
+-----------+--------------------+---------------------+
| FEED_NAME | Name | Source |
+-----------+--------------------+---------------------+
| DMSN | DMSN_YYYYMMDD.txt | Main hub |
| PCSUS | PCSUS_YYYYMMDD.txt | Basement |
| DAMJ      | DAMJ_YYYYMMDD.txt  | Eiffel Tower router |
+-----------+--------------------+---------------------+
DataFrame 1:
+-------------+
| SYSTEM_NAME |
+-------------+
| DMSN |
| PCSUS |
| DAMJ |
| : |
| : |
+-------------+
DataFrame 1 contains a lot more rows. It is actually a column in a much larger table. I need to merge df1 with df2 to make it look like:
+-------------+--------------------+---------------------+
| SYSTEM_NAME | Name | Source |
+-------------+--------------------+---------------------+
| DMSN | DMSN_YYYYMMDD.txt | Main Hub |
| PCSUS | PCSUS_YYYYMMDD.txt | Basement |
| DAMJ | DAMJ_YYYYMMDD.txt | Eiffel Tower router |
| : | | |
| : | | |
+-------------+--------------------+---------------------+
In Excel I would just have done VLOOKUP(,,1,TRUE) and then copied it across all cells.
I have tried various combinations of merge and join, but I keep getting KeyError: 'SYSTEM_NAME'.
Code:
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"})
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()
You missed the inplace=True argument in the line _df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"}), so the _df2 column names haven't changed. Try this instead:
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"}, inplace=True)
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()
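An equivalent sketch that assigns the renamed frame back instead of renaming in place (same df1 and df2 as above), which also avoids a possible SettingWithCopyWarning on the slice:
import pandas as pd

# Rename the lookup key, then left-merge onto the larger table's column
_df2 = df2[['FEED_NAME', 'Name', 'Source']].rename(columns={'FEED_NAME': 'SYSTEM_NAME'})
_df3 = df1[['SYSTEM_NAME']].merge(_df2, how='left', on='SYSTEM_NAME')
_df3.head()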

Merge two spark dataframes based on a column

I have 2 dataframes which I need to merge based on a column (Employee code). Please note that the dataframe has about 75 columns, so I am providing a sample dataset to get some suggestions/sample solutions. I am using databricks, and the datasets are read from S3.
Following are my 2 dataframes:
DATAFRAME - 1
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | | | | | | | |
|-----------------------------------------------------------------------------------|
DATAFRAME - 2
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | | | | | C | | | | |
|B10001 | | | | | | | | |T2 |
|A10001 | | | | | | | | B | |
|A10001 | | | C | | | | | | |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
I need to merge the 2 dataframes based on EMP_CODE, basically joining dataframe1 with dataframe2 on emp_code. I am getting duplicate columns when I do the join, and I am looking for some help.
Expected final dataframe:
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | C | | C | | | B | |
|B10001 | | | | | | | | |T2 |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
There are 3 rows with emp_code A10001 in dataframe2 and 1 row in dataframe1. All the data should be merged into one record without any duplicate columns.
Thanks much
You can use an inner join:
output = df1.join(df2,['EMP_CODE'],how='inner')
You can also apply distinct at the end to remove duplicates:
output = df1.join(df2,['EMP_CODE'],how='inner').distinct()
You can do that in Scala, if both dataframes have the same columns, with:
output = df1.union(df2)
First you need to aggregate the individual dataframes.
from pyspark.sql import functions as F
df1 = df1.groupBy('EMP_CODE').agg(F.concat_ws(" ", F.collect_list(df1.COLUMN1)).alias('COLUMN1'))
You have to write this for all columns and for both dataframes.
Then you'll have to use the union function on the two dataframes:
df1.union(df2)
and then repeat the same aggregation on the unioned dataframe.
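A compact sketch of that stack-then-aggregate idea for the nine sample columns, using F.max instead of concat_ws to pick the single non-empty value per EMP_CODE (an assumption here: the blank cells are nulls or empty strings, so the maximum of each column is its non-empty entry):
from pyspark.sql import functions as F

cols = [f"COLUMN{i}" for i in range(1, 10)]

# Stack both dataframes by column name, then collapse to one row per EMP_CODE,
# keeping the non-empty value of each column (max ignores nulls, and '' sorts
# before any non-empty string).
combined = df1.unionByName(df2)
output = combined.groupBy("EMP_CODE").agg(*[F.max(c).alias(c) for c in cols])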
What you need is a union.
If both dataframes have the same number of columns and the columns that are to be "union-ed" are positionally the same (as in your example), this will work:
output = df1.union(df2).dropDuplicates()
If both dataframes have the same number of columns and the columns that need to be "union-ed" have the same name (as in your example as well), this would be better:
output = df1.unionByName(df2).dropDuplicates()
