How can I find the combination of columns in a data set (PySpark) that can be considered the primary key?
I tried generating all combinations of columns and then comparing the number of distinct records of each subset with the total row count, but it is very expensive.
from itertools import combinations

l_key = []
for i in range(len(df.columns)+1):
    print(f'Iter:{i+2}..{len(df.columns)+1}')
    for c in list(combinations(df.columns, i+2)):
        if df.select(*c).distinct().count() == df.count():
            l_key.append(c)
            print(f'Key:{c}')
Are there any functions or libraries that can generate this type of analysis?
You can try it by creating the combinations as concatenated columns and then grouping.
import pyspark.sql.functions as F
import pandas as pd
from itertools import combinations

pdf = pd.DataFrame([[1, 1, 1], [2, 1, 1], [3, 2, 1]], columns=['x', 'y', 'z'])
df = spark.createDataFrame(pdf)

initCols = df.columns
for i in range(len(initCols)+1):
    for c in list(combinations(initCols, i+2)):
        df = df.withColumn(','.join(c), F.concat_ws(',', *c))

finalCols = df.columns
exprs = [F.size(F.collect_set(x)).alias(x) for x in finalCols]
df = df\
    .withColumn("aggCol", F.lit("a"))\
    .groupBy("aggCol")\
    .agg(*exprs)
df.show()
df.show()
Output:
+------+---+---+---+---+---+---+-----+
|aggCol| x| y| z|x,y|x,z|y,z|x,y,z|
+------+---+---+---+---+---+---+-----+
| a| 3| 2| 1| 3| 3| 2| 3|
+------+---+---+---+---+---+---+-----+
I believe this should be less expensive. I tested it quickly on a small dataframe (~20k rows, 7 cols) and it didn't take too much time. Let me know how this works out for your dataset.
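If you want something reusable along the same lines, here is a minimal sketch. The function name find_candidate_keys and its max_size parameter are my own invention, not from any library, and the number of combinations still explodes quickly, so keep max_size small:
import pyspark.sql.functions as F
from itertools import combinations

def find_candidate_keys(df, max_size=2):
    """Return the column combinations whose distinct count equals the row count,
    computed in a single aggregation pass."""
    combos = [c for n in range(1, max_size + 1)
              for c in combinations(df.columns, n)]
    # One countDistinct expression per combination, plus the total row count.
    # Note: countDistinct ignores nulls, so nullable columns may be under-counted.
    exprs = [F.countDistinct(*c).alias(','.join(c)) for c in combos]
    row = df.agg(F.count(F.lit(1)).alias('__total'), *exprs).collect()[0]
    return [c for c in combos if row[','.join(c)] == row['__total']]

print(find_candidate_keys(df, max_size=2))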
I'm new to Spark and am trying to use it the way I have used pandas for data analysis.
In pandas, to see a variable, I would write the following:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(df.head())
In Spark, my print statements are not printed to the terminal. Based on David's comment on this answer, print statements are sent to stdout/stderr, and there is a way to retrieve them with YARN, but he doesn't say how. I can't find anything that makes sense by Googling "how to capture stdout spark".
What I want is a way to see bits of my data to troubleshoot my data analysis. "Did adding that column work?" That sort of thing. I'd also welcome new ways to troubleshoot that are better for huge datasets.
Yes, there are different ways to print your DataFrames:
>>> l = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]
>>> spark.createDataFrame(l, ["a", 'b']).show()
+---+---+
| a| b|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 5| 5|
+---+---+
>>> print(spark.createDataFrame(l, ['a', 'b']).limit(5).toPandas())
   a  b
0  1  1
1  2  2
2  3  3
3  4  4
4  5  5
df.show() prints the top 20 rows by default, but you can pass a number to it to show the first n rows instead.
You can also use df.limit(n).toPandas() to get a pandas-style df.head().
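A few other built-in inspection calls are handy for the "did adding that column work?" kind of check. This is just a sketch of standard DataFrame methods applied to your df, nothing specific to your job:
# Show 5 rows without truncating long string values.
df.show(5, truncate=False)

# Print the schema to confirm a new column (and its type) is really there.
df.printSchema()

# Pull a handful of rows back to the driver as plain Python Row objects.
print(df.take(3))

# For huge data, sample a small fraction before converting to pandas.
print(df.sample(False, 0.001, 42).limit(5).toPandas())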
I am trying to increase all values in a dataframe by 1, except for one column, which is the ID column.
This is what I have so far, but it gets a bit long when I have a lot of columns to handle (e.g. 50).
df_add = df.select(
    'Id',
    (df['col_a'] + 1).alias('col_a'),
    ..
    ..
)
Is there a more pythonic way of achieving the same results?
EDIT (based on @Daniel's comment):
You can use the lit function directly:
from pyspark.sql.functions import col, lit

plus_one_cols = [x for x in df.columns if x != "Id"]
for column in plus_one_cols:
    df = df.withColumn(column, col(column) + lit(1))
PREVIOUS ANSWER:
Adding 1 to columns is a columnar operation, which is better suited to a pandas_udf:
from pyspark.sql.functions import col, pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def plus_one(v):
    return v + 1

plus_one_cols = [x for x in df.columns if x != "Id"]
for column in plus_one_cols:
    df = df.withColumn(column, plus_one(col(column)))
This will work much faster than row-wise operations. You can also refer to Introducing Pandas UDF for PySpark - Databricks.
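If you are on Spark 3.0 or later (an assumption; the answer above targets older versions), the same scalar UDF can be written with Python type hints instead of PandasUDFType, which is the preferred style there:
import pandas as pd
from pyspark.sql.functions import col, pandas_udf

@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    # Vectorized: pandas adds 1 to the whole series at once.
    return v + 1

for column in [x for x in df.columns if x != "Id"]:
    df = df.withColumn(column, plus_one(col(column)))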
If there are a lot of columns, you can use the one-liner below:
from pyspark.sql.functions import lit, col
df.select('Id', *[(col(i) + lit(1)) for i in df.columns if i != 'Id']).toDF(*df.columns).show()
Output:
+---+-----+-----+-----+
| Id|col_a|col_b|col_c|
+---+-----+-----+-----+
| 1| 4| 21| 6|
| 5| 6| 1| 1|
| 6| 10| 2| 1|
+---+-----+-----+-----+
You can use the withColumn method and then iterate over the columns as follows:
from pyspark.sql.functions import expr

df_add = df
for column in ["col_a", "col_b", "col_c"]:
    df_add = df_add.withColumn(column, expr(f"{column} + 1").cast("integer"))
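If you prefer to avoid the loop, the same result can be built in one pass with selectExpr. This is just a sketch reusing the same hard-coded column list as the loop above:
# One SQL expression per column; Id is passed through unchanged.
df_add = df.selectExpr(
    "Id",
    *[f"cast({c} + 1 as int) as {c}" for c in ["col_a", "col_b", "col_c"]]
)
df_add.show()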
Use pyspark.sql.functions.lit to add values to columns
Ex:
from pyspark.sql import functions as psf
df = spark.sql("""select 1 as test""")
df.show()
# +----+
# |test|
# +----+
# | 1|
# +----+
df_add = df.select(
    'test',
    (df['test'] + psf.lit(1)).alias('col_a'),
)
df_add.show()
# +----+-----+
# |test|col_a|
# +----+-----+
# | 1| 2|
# +----+-----+
###
# If you want to do it for all columns then:
###
list_of_columns = ["col1", "col2", ...]
df_add = df.select(
    [(df[col] + psf.lit(1)).alias(col) for col in list_of_columns]
)
df_add.show()
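Note that the select above returns only the incremented columns. If you also want to carry untouched columns (such as an Id column) through, a sketch like this keeps them, still using the hypothetical list_of_columns from above:
df_add = df.select(
    # Columns that should stay as they are.
    *[df[c] for c in df.columns if c not in list_of_columns],
    # Columns that get 1 added to them.
    *[(df[c] + psf.lit(1)).alias(c) for c in list_of_columns],
)
df_add.show()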
How is it possible to calculate the number of unique elements in each column of a PySpark dataframe:
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = pd.DataFrame([[1, 100], [1, 200], [2, 300], [3, 100], [4, 100], [4, 300]], columns=['col1', 'col2'])
df_spark = spark.createDataFrame(df)
df_spark.show()
# +----+----+
# |col1|col2|
# +----+----+
# | 1| 100|
# | 1| 200|
# | 2| 300|
# | 3| 100|
# | 4| 100|
# | 4| 300|
# +----+----+
# Some transformations on df_spark here
# How to get a number of unique elements (just a number) in each columns?
The only solution I know is the following, which is very slow; both of these lines take about the same amount of time:
col1_num_unique = df_spark.select('col1').distinct().count()
col2_num_unique = df_spark.select('col2').distinct().count()
There are about 10 million rows in df_spark.
Try this:
from pyspark.sql.functions import col, countDistinct
df_spark.agg(*(countDistinct(col(c)).alias(c) for c in df_spark.columns))
EDIT:
As @pault suggested, this is an expensive operation and you can use approx_count_distinct() instead; the function he originally suggested is deprecated (Spark >= 2.1).
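For reference, a sketch of the approximate version; the optional rsd argument sets the allowed relative error (it defaults to 0.05):
from pyspark.sql.functions import approx_count_distinct, col

df_spark.agg(
    *(approx_count_distinct(col(c), rsd=0.01).alias(c) for c in df_spark.columns)
).show()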
@Manrique's answer solved the problem, but only a slightly modified solution worked for me:
from pyspark.sql.functions import countDistinct

expression = [countDistinct(c).alias(c) for c in df.columns]
df.select(*expression).show()
This is much faster:
import pyspark.sql.functions as F

df_spark.select(F.countDistinct("col1")).show()
I'm attempting to cast multiple string columns to integers in a dataframe using PySpark 2.1.0. The data set starts as an RDD; when it is turned into a dataframe, it generates the following error:
TypeError: StructType can not accept object 3 in type <class 'int'>
A sample of what I'm trying to do:
import pyspark.sql.types as typ
from pyspark.sql.functions import *

labels = [
    ('A', typ.StringType()),
    ('B', typ.IntegerType()),
    ('C', typ.IntegerType()),
    ('D', typ.IntegerType()),
    ('E', typ.StringType()),
    ('F', typ.IntegerType())
]

rdd = sc.parallelize(["1", 2, 3, 4, "5", 6])
schema = typ.StructType([typ.StructField(e[0], e[1], False) for e in labels])
df = spark.createDataFrame(rdd, schema)
df.show()

cols_to_cast = [dt[0] for dt in df.dtypes if dt[1] == 'string']
# df2 = df.select(*(c.cast("integer").alias(c) for c in cols_to_cast))
df2 = df.select(*(df[dt[0]].cast("integer").alias(dt[0])
                  for dt in df.dtypes if dt[1] == 'string'))
df2.show()
The first problem is that the dataframe is not being created from the RDD at all.
After that, I have tried two ways to do the cast (df2); the first is commented out.
Any suggestions?
Alternatively, is there any way I could use the .withColumn function to cast all columns in one go, instead of specifying each column?
The actual dataset, although not large, has many columns.
The problem isn't your code, it's your data. You are passing a single flat list, which will be treated as a single column instead of the six you want.
Try the rdd line as below and it should work fine (notice the extra brackets around the list):
rdd = sc.parallelize([["1", 2, 3, 4, "5", 6]])
Your code with the corrected line above shows the following output:
+---+---+---+---+---+---+
| A| B| C| D| E| F|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6|
+---+---+---+---+---+---+
+---+---+
| A| E|
+---+---+
| 1| 5|
+---+---+
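The answer above fixes the dataframe creation. For the second part of the question (casting every string column in one go rather than listing each one), here is a minimal sketch built from df.dtypes; it is my own addition, not part of the answer above:
from pyspark.sql.functions import col

# Cast each string-typed column to integer; pass the other columns through unchanged.
df2 = df.select(*[
    col(name).cast("integer").alias(name) if dtype == "string" else col(name)
    for name, dtype in df.dtypes
])
df2.show()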
I am working with IPython and Spark, and I have an RDD from which I form a list. Now from this list I want to form a dataframe that has multiple columns from the parent list, but these columns are not contiguous. I wrote this, but it seems to be working wrong:
list1 = rdd.collect()
columns_num = [1,8,11,17,21,24]
df2 = [list1[i] for i in columns_num]
The above code only selects 6 rows (just the column 1 data) from the parent list and forms the new dataframe from that data.
How can I form a new dataframe with multiple non-contiguous columns from another list?
For example like this:
rdd = sc.parallelize([("a", 1, 2, 4.0, "foo"), ("b", 3, 4, 5.0, "bar")])
columns_num = [0, 3]
df = rdd.toDF()
df2 = df.select(*(df.columns[i] for i in columns_num))
df2.show()
## +---+---+
## | _1| _4|
## +---+---+
## | a|4.0|
## | b|5.0|
## +---+---+
or like this:
df = rdd.map(lambda row: [row[i] for i in columns_num]).toDF()
df.show()
## +---+---+
## | _1| _4|
## +---+---+
## | a|4.0|
## | b|5.0|
## +---+---+
On a side note, you should never collect data just to reshape it. In the best-case scenario it will be slow; in the worst-case scenario it will simply crash.
With Optimus this is really easy. You just need to install it with:
pip install optimuspyspark
Then you import it (it will start Spark for you):
import optimus as op
Let's create the DF:
rdd = sc.parallelize([("a", 1, 2, 4.0, "foo"), ("b", 3, 4, 5.0, "bar")])
df = rdd.toDF()
And start the transformer:
transformer = op.DataFrameTransformer(df)
And select your columns:
df_new = transformer.select_idx([0,2]).df
And you have it now:
df_new.show()
+---+---+
| _1| _3|
+---+---+
| a| 2|
| b| 4|
+---+---+