pyspark - if statement inside select - python

The following code finds the maximum length of all columns in the dataframe df.
Question: in the code below, how can we check the max length of only the string columns?
from pyspark.sql.functions import col, length, max
df = df.select([max(length(col(name))) for name in df.schema.names])

Instead of using schema.names, you can use schema.fields, which returns a list of StructField objects that you can iterate through to get the name and type of each field.
df.select([max(length(col(field.name))) for field in df.schema.fields if field.dataType.typeName() == "string"])
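Note that typeName is a method, so it has to be called. A quick way to see what it returns for each field (a minimal sketch, assuming an active SparkSession named spark and a hypothetical sample frame):
# Hypothetical sample frame, only to illustrate what typeName() yields.
sample = spark.createDataFrame([(1, 'a', 'bb')], ['col1', 'col2', 'col3'])

for field in sample.schema.fields:
    print(field.name, field.dataType.typeName())
# col1 long
# col2 string
# col3 string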

You can add a condition that tests the dataType of each field in df.schema. For example:
from pyspark.sql.types import StringType
df = spark.createDataFrame(
    [
        (1, '2', '1'),
        (1, '4', '2'),
        (1, '2', '3'),
    ],
    ['col1', 'col2', 'col3']
)

df.select([
    max(length(col(schema.name))).alias(f'{schema.name}_max_length')
    for schema in df.schema
    if schema.dataType == StringType()
]).show()
+---------------+---------------+
|col2_max_length|col3_max_length|
+---------------+---------------+
|              1|              1|
+---------------+---------------+

df = df.select([max(length(col(name))) for (name, type) in df.dtypes if type == 'string'])

Related

Get column names with corresponding index in python pandas

I have this dataframe df where
>>> df = pd.DataFrame({'Date': ['10/2/2011', '11/2/2011', '12/2/2011', '13/2/11'],
                       'Event': ['Music', 'Poetry', 'Theatre', 'Comedy'],
                       'Cost': [10000, 5000, 15000, 2000],
                       'Name': ['Roy', 'Abraham', 'Blythe', 'Sophia'],
                       'Age': ['20', '10', '13', '17']})
I want to determine the column index with the corresponding name. I tried it with this:
>>> list(df.columns)
But the solution above only returns the column names without index numbers.
How can I code it so that it returns the column names and the corresponding index for each column? Like this:
0 Date
1 Event
2 Cost
3 Name
4 Age
Simplest is to wrap the columns in the pd.Series constructor:
pd.Series(list(df.columns))
Or convert the columns to a Series and create a default index:
df.columns.to_series().reset_index(drop=True)
Or:
df.columns.to_series(index=False)
You can use a loop like this:
myList = list(df.columns)

index = 0
for value in myList:
    print(index, value)
    index += 1
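The same loop could also be written with enumerate, which keeps the counter for you (a small sketch):
for index, value in enumerate(df.columns):
    print(index, value)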
A nice short way to get a dictionary:
d = dict(enumerate(df))
output: {0: 'Date', 1: 'Event', 2: 'Cost', 3: 'Name', 4: 'Age'}
For a Series, pd.Series(list(df)) is sufficient, since iterating over a DataFrame yields the column names directly.
In addition to using enumerate, you can also pair the numbers with the column names using zip, as follows:
import pandas as pd

df = pd.DataFrame({'Date': ['10/2/2011', '11/2/2011', '12/2/2011', '13/2/11'],
                   'Event': ['Music', 'Poetry', 'Theatre', 'Comedy'],
                   'Cost': [10000, 5000, 15000, 2000],
                   'Name': ['Roy', 'Abraham', 'Blythe', 'Sophia'],
                   'Age': ['20', '10', '13', '17']})

result = list(zip(range(len(df.columns)), df.columns.values))
for r in result:
    print(r)
#(0, 'Date')
#(1, 'Event')
#(2, 'Cost')
#(3, 'Name')
#(4, 'Age')

How to get the missing record Row number and column names using python?

Using Python and pandas, I would like to achieve the output below. Whenever there are Null or NaN values present in the file, it needs to print both the row number and the column name.
import pandas as pd

# List of Tuples
employees = [('Stuti', 'Null', 'Varanasi', 20000),
             ('Saumya', 'NAN', 'NAN', 35000),
             ('Saumya', 32, 'Delhi', 30000),
             ('Aaditya', 40, 'Dehradun', 24000),
             ('NAN', 45, 'Delhi', 70000)]

# Create a DataFrame object from the list
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City', 'Salary'])
print(df)
Expected Output:
Row 0: column Age missing
Row 1: Column Age, column City missing
Row 4: Column Name missing
Try isin to mask the missing values, then matrix-multiply (@) the mask with the column names to concatenate them:
s = df.isin(['Null', 'NAN'])
missing = s.loc[s.any(axis=1)] @ ('column ' + df.columns + ', ')

for r, val in missing.str[:-2].items():
    print(f'Row {r}: {val} is missing')
Output:
Row 0: column Age is missing
Row 1: column Age, column City is missing
Row 4: column Name is missing
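If the file contains real NaN values rather than the literal strings 'Null' and 'NAN', a similar report can be built from isna() and a plain loop. A minimal sketch, assuming the placeholder strings should be treated as missing values first:
import numpy as np

# Treat the placeholder strings as real missing values (assumption: 'Null'
# and 'NAN' are the only placeholders used in the file).
df_clean = df.replace(['Null', 'NAN'], np.nan)

# For every row with at least one missing cell, list the offending columns.
for r, row in df_clean.iterrows():
    missing_cols = row[row.isna()].index.tolist()
    if missing_cols:
        print(f"Row {r}: " + ", ".join(f"column {c}" for c in missing_cols) + " missing")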

How to check whether key or value exist in Pyspark Map

I have a Map column in a Spark DF and would like to filter this column on a particular key (i.e. keep the row if a key in the map matches the desired value).
For example, my schema is defined as:
df_schema = StructType(
    [StructField('id', StringType()),
     StructField('rank', MapType(StringType(), IntegerType()))]
)
My sample data is:
{ "id": "0981850006", "rank": {"a": 1} }
Is there any way to filter my df on rows where "a" is in "rank" without using explode()?
Is there a better schema representation for the given json than what I have defined?
Accessing the key with rank.key would only work if rank were a StructType(). Although explode is probably the best solution, let's build a UDF to assess whether or not k is a key of rank.
First let's create our dataframe:
from pyspark.sql.types import *

df_schema = StructType(
    [StructField('id', StringType()),
     StructField('rank', MapType(StringType(), IntegerType()))]
)

df = spark.createDataFrame([
    ["0981850006", {"a": 1}],
    ["0981850006", {"b": 2, "c": 3}],
], df_schema)
Now our UDF, with the functions module imported as psf:
import pyspark.sql.functions as psf

def isKey(k, d):
    return k in d.keys()

isKey_udf = lambda k: psf.udf(lambda d: isKey(k, d), BooleanType())
Which gives:
df.withColumn(
    "is_key",
    isKey_udf('a')(df.rank)
).show()
+----------+-------------------+------+
|        id|               rank|is_key|
+----------+-------------------+------+
|0981850006|        Map(a -> 1)|  true|
|0981850006|Map(b -> 2, c -> 3)| false|
+----------+-------------------+------+
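As a side note, on Spark 2.3+ the same filter can be written without a UDF (and without explode) using the built-in map_keys and array_contains functions. A sketch, not part of the original answer:
from pyspark.sql import functions as F

# Keep rows whose rank map contains the key 'a'.
df.filter(F.array_contains(F.map_keys("rank"), "a")).show()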

Convert pyspark.sql.dataframe.DataFrame type Dataframe to Dictionary

I have a PySpark DataFrame and I need to convert it into a Python dictionary.
The code below is reproducible:
from pyspark.sql import Row
rdd = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Alice', age=5, height=80),Row(name='Alice', age=10, height=80)])
df = rdd.toDF()
Once I have this dataframe, I need to convert it into a dictionary.
I tried this:
df.set_index('name').to_dict()
But it gives an error. How can I achieve this?
Please see the example below:
>>> from pyspark.sql.functions import col
>>> df = (sc.textFile('data.txt')
...       .map(lambda line: line.split(","))
...       .toDF(['name', 'age', 'height'])
...       .select(col('name'), col('age').cast('int'), col('height').cast('int')))
>>> df.show()
+-----+---+------+
| name|age|height|
+-----+---+------+
|Alice|  5|    80|
|  Bob|  5|    80|
|Alice| 10|    80|
+-----+---+------+
>>> list_persons = map(lambda row: row.asDict(), df.collect())
>>> list_persons
[
{'age': 5, 'name': u'Alice', 'height': 80},
{'age': 5, 'name': u'Bob', 'height': 80},
{'age': 10, 'name': u'Alice', 'height': 80}
]
>>> dict_persons = {person['name']: person for person in list_persons}
>>> dict_persons
{u'Bob': {'age': 5, 'name': u'Bob', 'height': 80}, u'Alice': {'age': 10, 'name': u'Alice', 'height': 80}}
The input file data.txt that I'm using to test:
Alice,5,80
Bob,5,80
Alice,10,80
First we do the loading by using pyspark to read the lines. Then we convert the lines to columns by splitting on the comma. Then we convert the native RDD to a DataFrame and add names to the columns. Finally we cast the columns to the appropriate types.
Then we collect everything to the driver, and using some python list comprehension we convert the data to the preferred form. We convert each Row object to a dictionary using the asDict() method. In the output we can observe that Alice appears only once, but this is of course because the key 'Alice' gets overwritten.
Please keep in mind that you want to do all the processing and filtering inside pyspark before returning the result to the driver.
Hope this helps, cheers.
You need to first convert to a pandas.DataFrame using toPandas(), then you can use the to_dict() method on the transposed dataframe with orient='list':
df.toPandas().set_index('name').T.to_dict('list')
# Out[1]: {u'Alice': [10, 80]}
Row objects have a built-in asDict() method that represents each row as a dict.
If you have a dataframe df, then you need to convert it to an rdd and apply asDict() to each row.
new_rdd = df.rdd.map(lambda row: row.asDict(True))
One can then use the new_rdd to perform normal python map operations like:
# You can define normal python functions like below and plug them when needed
def transform(row):
    # Add a new key to each row
    row["new_key"] = "my_new_value"
    return row

new_rdd = new_rdd.map(lambda row: transform(row))
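To finish the conversion the question asks for, one option is to collect the per-row dicts and key them by name (a sketch; as in the other answers, later rows with the same name overwrite earlier ones):
result = {row['name']: row for row in new_rdd.collect()}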
One easy way is to collect the rows of the RDD and iterate over them with a dictionary comprehension. Here I will try to demonstrate something similar:
Let's assume a movie dataframe:
movie_df

movieId    avg_rating
1          3.92
10         3.5
100        2.79
100044     4.0
100068     3.5
100083     3.5
100106     3.5
100159     4.5
100163     2.9
100194     4.5
We can use a dictionary comprehension and iterate over the collected rows like below:
movie_dict = {int(row.asDict()['movieId']): row.asDict()['avg_rating'] for row in movie_df.collect()}
print(movie_dict)
{1: 3.92,
10: 3.5,
100: 2.79,
100044: 4.0,
100068: 3.5,
100083: 3.5,
100106: 3.5,
100159: 4.5,
100163: 2.9,
100194: 4.5}
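An equivalent one-liner maps each row to a (key, value) pair and uses the RDD's collectAsMap() method. A sketch based on the same movie_df:
movie_dict = movie_df.rdd.map(
    lambda row: (int(row['movieId']), row['avg_rating'])
).collectAsMap()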

Sorting column label pandas pivot table

Another pandas sort question (tried all current SO ones and didn't find any solutions).
I have a pandas pivot_table like so:
rows = ['Tool_Location']
cols = ['shift_date', 'Part_Number', 'shift']
pt = df.pivot_table(index='Tool_Location', values='Number_of_parts', dropna=False,
                    fill_value='0', columns=cols, aggfunc='count')
produces:
Shift date           10/19
Part_number          40001
shift        first second third
tool_loc
T01              0      1     0
T02              2      1     0
I'd like to switch the order of the shift labels so it is third, first, second.
EDIT:
Getting closer to a solution but not seeing it.
Using:
col_list = pt.columns.tolist()
output:
[('10/20/16', 'first'), ('10/20/16', 'second'), ('10/20/16', 'third'), ('10/21/16', 'first'), ('10/21/16', 'second'), ('10/21/16', 'third')]
Anyone know how to dynamically reorder the items so it's:
[('10/20/16', 'third'), ('10/20/16', 'first'), ('10/20/16', 'second'), ('10/21/16', 'third'), ('10/21/16', 'first'), ('10/21/16', 'second')]
Because then we could reorder the columns by using pt = pt[col_list]
df.pivot_table produces a dataframe. What if you do something like this after your lines:
pt = pt[["third","first","second"]]
You can use sorted with a lambda and a dict that declares the needed order of the shifts:
your_dict = {1: 'third', 2: 'first', 3: 'second'}
pt = pt[sorted(pt.columns.tolist(),
               key=lambda tup: (tup[:-1], list(your_dict.keys())[list(your_dict.values()).index(tup[-1])]))]
The key keeps the leading levels (date, part number) grouped and orders the shift level, the last element of each column tuple, by its position in your_dict.
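A sketch of another option, assuming the shift level of the MultiIndex columns is named 'shift' as in the pivot_table call above: reindex that level directly.
# Reorder only the 'shift' level, keeping the other column levels grouped.
pt = pt.reindex(['third', 'first', 'second'], axis=1, level='shift')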
