I have been attempting to replace strings in a data set for specific columns: 'Y' becomes 1, otherwise 0.
I have managed to identify which columns to target by converting the dataframe to an rdd with a lambda, but it is taking a while to process.
A switch to an rdd is done for each column and then a distinct is performed, which is slow.
If a 'Y' exists in the distinct result set, the column is identified as requiring a transformation.
Can anyone suggest how I can use pyspark sql functions exclusively to obtain the same result, instead of having to switch to an rdd for each column?
The code, on sample data, is as follows:
import pyspark.sql.types as typ
import pyspark.sql.functions as func
col_names = [
('ALIVE', typ.StringType()),
('AGE', typ.IntegerType()),
('CAGE', typ.IntegerType()),
('CNT1', typ.IntegerType()),
('CNT2', typ.IntegerType()),
('CNT3', typ.IntegerType()),
('HE', typ.IntegerType()),
('WE', typ.IntegerType()),
('WG', typ.IntegerType()),
('DBP', typ.StringType()),
('DBG', typ.StringType()),
('HT1', typ.StringType()),
('HT2', typ.StringType()),
('PREV', typ.StringType())
]
schema = typ.StructType([typ.StructField(c[0], c[1], False) for c in col_names])
df = spark.createDataFrame([('Y',22,56,4,3,65,180,198,18,'N','Y','N','N','N'),
('N',38,79,3,4,63,155,167,12,'N','N','N','Y','N'),
('Y',39,81,6,6,60,128,152,24,'N','N','N','N','Y')]
,schema=schema)
cols = [(col.name, col.dataType) for col in df.schema]
transform_cols = []
for s in cols:
    if s[1] == typ.StringType():
        distinct_result = df.select(s[0]).distinct().rdd.map(lambda row: row[0]).collect()
        if 'Y' in distinct_result:
            transform_cols.append(s[0])
print(transform_cols)
The output is:
['ALIVE', 'DBG', 'HT2', 'PREV']
I managed to use udf to do the task. First, pick the columns with Y or N values (here I use func.first to skim through the first row):
cols_sel = df.select([func.first(col).alias(col) for col in df.columns]).collect()[0].asDict()
cols = [col_name for (col_name, v) in cols_sel.items() if v in ['Y', 'N']]
# return ['HT2', 'ALIVE', 'DBP', 'HT1', 'PREV', 'DBG']
Next, you can create a udf function to map Y, N to 1, 0.
def map_input(val):
    map_dict = dict(zip(['Y', 'N'], [1, 0]))
    return map_dict.get(val)

udf_map_input = func.udf(map_input, returnType=typ.IntegerType())
for col in cols:
    df = df.withColumn(col, udf_map_input(col))
df.show()
Finally, you can sum the columns. I then transform the output into a dictionary and check which columns have a value greater than 0 (i.e. contain a Y):
out = df.select([func.sum(col).alias(col) for col in cols]).collect()
out = out[0]
print([col_name for (col_name, val) in out.asDict().items() if val > 0])
Output
['DBG', 'HT2', 'ALIVE', 'PREV']
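To address the original question of using pyspark sql functions exclusively, here is a hedged sketch (not part of the answer above) that flags and transforms the columns with built-in functions only, with no rdd switch and no udf. It starts again from the original df and col_names built in the question:
string_cols = [name for name, dtype in col_names if dtype == typ.StringType()]

# One pass over the data: a string column is flagged if it contains at least one 'Y'.
flags = df.agg(*[
    func.max(func.when(func.col(c) == 'Y', 1).otherwise(0)).alias(c)
    for c in string_cols
]).collect()[0].asDict()
transform_cols = [c for c, flag in flags.items() if flag == 1]

# Transform only the flagged columns: 'Y' -> 1, anything else -> 0.
df = df.select([
    func.when(func.col(c) == 'Y', 1).otherwise(0).alias(c) if c in transform_cols else func.col(c)
    for c in df.columns
])
df.show()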
I'm creating a dictionary where the keys are the row numbers and the values are lists of column numbers, ordered by that row's values in descending order.
My code so far is:
from openpyxl import load_workbook
vWB = load_workbook(filename="voting.xlsx")
vSheet = vWB.active
d = {}
for idx, row in enumerate(vSheet.values, start=1):
    row = sorted(row, reverse=True)
    d[idx] = row
output:
{1: [0.758968208500514, 0.434362232763003, 0.296177589742431, 0.0330331941352554], 2: [0.770423104229537, 0.770423104229537, 0.559322784244604, 0.455791535747786] etc..}
What I want:
{1: [4,2,1,3], 2: [3,4,1,2], etc..}
I've been trying to create a number to represent the column number of each value:
genKey = [i for i in range(vSheet.max_column)]
And then using the zip function to associate the column number with each value:
dict(zip(column_key_list, values_list))
So I would have a dict with column numbers 1, 2, ..., n as keys and the row's values as values; then I could sort the keys by their values in descending order, iterate over the rows, and zip again with the row number as the key.
I'm unsure how to use this zip approach to get to my desired endpoint. Any help is welcome.
Simply use enumerate:
dictionary = {
    row_id: [cell.value for cell in row]
    for row_id, row in enumerate(vSheet.rows)
}
enumerate can be applied to any iterable and returns an iterator of (index, value) tuples. For example:
x = ['a', 'b', 'c']
print(list(enumerate(x)))
yields [(0, 'a'), (1, 'b'), (2, 'c')].
If you want to start with 1, then use enumerate(x, 1).
You can use enumerate
dictionary = {
    i: [cell.value for cell in row[0:]]
    for i, row in enumerate(vSheet.rows)
}
If you just want the rank of each cell value within its row, this is easy:
d = {}
for idx, row in enumerate(vSheet.values, start=1):
    row_s = sorted(row)  # create a sorted copy for comparison
    d[idx] = [row_s.index(v) + 1 for v in row]
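For the exact output asked for in the question (column numbers ordered by each row's values, descending), here is a minimal sketch along the same lines, not taken from the answers above, reusing vSheet from the question:
d = {}
for idx, row in enumerate(vSheet.values, start=1):
    # pair each value with its 1-based column number, sort the pairs by value
    # in descending order, then keep only the column numbers
    ranked = sorted(enumerate(row, start=1), key=lambda pair: pair[1], reverse=True)
    d[idx] = [col for col, _ in ranked]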
I have a dictionary with geohash as keys and a value associated with them. I am looking up values from the dict to create a new column in my pandas dataframe.
geo_dict = {'9q5dx': 10, '9q9hv': 15, '9q5dv': 20}
df = pd.DataFrame({'geohash': ['9q5dx','9qh0g','9q9hv','9q5dv'],
'label': ['a', 'b', 'c', 'd']})
df['value'] = df.apply(lambda x : geo_dict[x.geohash], axis=1)
I need to be able to handle non-matches, i.e. geohashes that do not exist in the dictionary. The expected handling is:
Find the k nearest geohashes and compute the mean of their values
Assign that mean of the neighboring geohashes to the pandas column
Questions -
Is there a library I can use to find nearby geohashes?
How do I code up this solution?
The module pygeodesy has several functions to calculate the distance between geohashes. We can wrap this in a function that first checks if a match exists in the dict and otherwise returns the mean value of the n closest geohashes:
import pygeodesy as pgd
import pandas as pd
geo_dict = {'9q5dx': 10, '9q9hv': 15, '9q5dv': 20}
geo_df = pd.DataFrame(zip(geo_dict.keys(), geo_dict.values()), columns=['geohash', 'value'])
df = pd.DataFrame({'geohash': ['9q5dx','9qh0g','9q9hv','9q5dv'],
'label': ['a', 'b', 'c', 'd']})
def approximate_distance(geohash1, geohash2):
    return pgd.geohash.distance_(geohash1, geohash2)
    # return pgd.geohash.equirectangular_(geohash1, geohash2)  # alternative ways to calculate distance
    # return pgd.geohash.haversine_(geohash1, geohash2)

def get_value(x, n=2):  # set the number of closest geohashes to use for the approximation with n
    val = geo_df.loc[geo_df['geohash'] == x]
    if not val.empty:
        return val['value'].iloc[0]
    else:
        geo_df['tmp_dist'] = geo_df['geohash'].apply(lambda y: approximate_distance(y, x))
        # nsmallest picks the n smallest distances, i.e. the n closest geohashes
        return geo_df.nsmallest(n, 'tmp_dist')['value'].mean()
df['value'] = df['geohash'].apply(get_value)
result:
  geohash label value
0   9q5dx     a    10
1   9qh0g     b  12.5
2   9q9hv     c    15
3   9q5dv     d    20
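As a side note (a sketch under assumptions rather than part of the answer above): for rows that do have an exact match, a plain Series.map lookup is usually faster than applying get_value to every row, and the fallback can then be limited to the misses. Reusing df, geo_dict and get_value from above:
df['value'] = df['geohash'].map(geo_dict)  # exact dictionary matches, NaN otherwise
missing = df['value'].isna()
df.loc[missing, 'value'] = df.loc[missing, 'geohash'].apply(get_value)  # nearest-geohash fallback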
I have the following code:
import pandas as pd

data = [[11001218, 'Value', 93483.37, 'G', '', 93483.37, '', '56117J100', 'FRA', 'Equity'],
        [11001218, 'Value', 3572.73, 'G', 3572.73, '', '56117J100', '', 'LUM', 'Equity'],
        [11001218, 'Value', 89910.64, 'G', 89910.64, '', '56117J100', '', 'WAR', 'Equity'],
        [11005597, 'Value', 72640313.34, 'L', '', 72640313.34, 'REVR21964', '', 'IN2', 'Repo']
        ]
df = pd.DataFrame(data, columns = ['ID', 'Type', 'Diff', 'Group', 'Amount','Amount2', 'Id2', 'Id3', 'Executor', 'Name'])
def logic_builder(row, row2, row3):
    if row['Name'] == 'Repo' and row['Group'] == 'L':
        return 'Fine resultant'
    elif (row['ID'] == row2['ID']) and (row['ID'] == row3['ID']) and (row['Group'] == row2['Group']) and (row['Group'] == row3['Group']) and (row['Executor'] != row2['Executor']) and (row['Executor'] != row3['Executor']):
        return 'Difference in Executor'
df['Results'] = df.apply(lambda row: logic_builder(row, row2, row3), axis=1)
If you look at the first 3 rows, they are all technically the same. They contain the same ID, Type, Group, and Name. The only difference is the Executor, hence I would like my if-then statement to return "Difference in Executor". I am having trouble figuring out how to write the if-then so that it looks at all the rows with similar attributes for the fields I mentioned above.
Thank you.
You can pass a single row, then determine its index and look for the other rows with df.iloc[index].
Here is an example:
def logic_builder(row):
    global df  # you need to access the df
    i = row.name  # row index
    # get the next two rows
    try:
        row2 = df.iloc[i+1]
        row3 = df.iloc[i+2]
    except IndexError:
        return
    if row['Name'] == 'Repo' and row['Group'] == 'L':
        return 'Fine resultant'
    elif (row['ID'] == row2['ID']) and (row['ID'] == row3['ID']) and (row['Group'] == row2['Group']) and (row['Group'] == row3['Group']) and (row['Executor'] != row2['Executor']) and (row['Executor'] != row3['Executor']):
        return 'Difference in Executor'
df['Results'] = df.apply(logic_builder, axis=1)
Of course, since the result depends on the next two rows, it can't produce a result for the last 2 rows of the dataframe.
You can modify the function a bit so that it operates on a chunk/slice of the dataframe per group, using groupby, since you are performing the action per group. A modified version of the function you have written would look something like this:
def logic_builder(group):
    if group['Name'].eq('Repo').all() and group['Group'].eq('L').all():
        return 'Fine resultant'
    elif group['Group'].nunique() == 1 and group['Executor'].nunique() > 1:
        return 'Difference in Executor'
Passing row1, row2, row3, ..., rowN is not going to work, because a group may contain one or more rows, so a better strategy is to express the if/else with all and nunique (which essentially gives the number of unique values in the selected column) for the logic that you have.
Then apply the function on groupby object:
df.groupby('ID').apply(logic_builder)
ID
11001218 Difference in Executor
11005597 Fine resultant
dtype: object
You can finally join the values above back to the actual dataframe if needed, for example as sketched below.
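A hedged sketch of that final join, assuming the groupby output is kept in a variable named results (a name introduced here purely for illustration):
results = df.groupby('ID').apply(logic_builder).rename('Results')
df = df.merge(results, left_on='ID', right_index=True, how='left')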
In one column, I have 4 possible (non-sequential) values: A, 2, +, ?, and I want to order the rows according to the custom sequence 2, ?, A, +. I followed some code I found online:
order_by_custom = pd.CategoricalDtype(['2', '?', 'A', '+'], ordered=True)
df['column_name'].astype(order_by_custom)
df.sort_values('column_name', ignore_index=True)
But for some reason, although it does sort, it still does so according to alphabetical (or binary value) position rather than the order I've entered them in the order_by_custom object.
Any ideas?
.astype does return a Series after conversion, but you did not do anything with it. Try assigning it back to your df. Consider the following example:
import pandas as pd
df = pd.DataFrame({'orderno':[1,2,3],'custom':['X','Y','Z']})
order_by_custom = pd.CategoricalDtype(['Z', 'Y', 'X'], ordered=True)
df['custom'] = df['custom'].astype(order_by_custom)
print(df.sort_values('custom'))
output:
   orderno custom
2        3      Z
1        2      Y
0        1      X
You can use a custom dictionary to sort it. For example, the dictionary would be:
my_custom_dict = {'2': 0, '?': 1, 'A': 2, '+' : 3}
If your column name is "my_column_name" then,
df.sort_values(by=['my_column_name'], key=lambda x: x.map(my_custom_dict))
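For instance, a small hedged demonstration on data shaped like the question's column (the values below are assumed for illustration; note that the key argument of sort_values requires pandas 1.1 or newer):
import pandas as pd

df = pd.DataFrame({'my_column_name': ['A', '+', '2', '?', 'A']})  # illustrative data
my_custom_dict = {'2': 0, '?': 1, 'A': 2, '+': 3}
print(df.sort_values(by=['my_column_name'], key=lambda x: x.map(my_custom_dict)))
#   my_column_name
# 2              2
# 3              ?
# 0              A
# 4              A
# 1              +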
I have a pandas df that contains 2 columns 'start' and 'end' (both are integers). I would like an efficient method to search for rows such that the range that is represented by the row [start,end] contains a specific value.
Two additional notes:
It is possible to assume that ranges don't overlap
The solution should support a batch mode: given a list of inputs, the output should be a mapping (dictionary or whatever) from each input to the index of the row whose range contains it.
For example:
   start   end
0   7216  7342
1   7343  7343
2   7344  7471
3   7472  8239
4   8240  8495
and the query of
[7215,7217,7344]
will result in
{7217: 0, 7344: 2}
Thanks!
Brute force solution, could use lots of improvements though.
df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
                   'end': [7342, 7343, 7471, 8239, 8495]})
search = [7215, 7217, 7344]
res = {}
for i in search:
    mask = (df.start <= i) & (df.end >= i)
    idx = df[mask].index.values
    if len(idx):
        res[i] = idx[0]
print(res)
Yields
{7344: 2, 7217: 0}
Selected solution
This new solution could have better performance. But there is a limitation: it only works if there are no gaps between ranges, as in the example provided.
# Test data
df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
                   'end': [7342, 7343, 7471, 8239, 8495]}, columns=['start','end'])
query = [7215,7217,7344]
# Reshaping the original DataFrame
df = df.reset_index()
df = pd.concat([df['start'], df['end']]).reset_index()
df = df.set_index(0).sort_index()
# Creating a DataFrame with a continuous index
max_range = max(df.index) + 1
min_range = min(df.index)
s = pd.DataFrame(index=range(min_range,max_range))
# Joining them
s = s.join(df)
# Filling the gaps
s = s.fillna(method='backfill')
# Then a simple selection gives the result
# (reindex is used so query values outside the index are dropped instead of raising a KeyError)
s.reindex(query).dropna().to_dict()['index']
# Result
{7217: 0.0, 7344: 2.0}
Previous proposal
# Test data (numpy is needed below for np.tile)
import numpy as np

df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
                   'end': [7342, 7343, 7471, 8239, 8495]}, columns=['start','end'])
# Constructing a DataFrame containing the query numbers
query = [7215,7217,7344]
result = pd.DataFrame(np.tile(query, (len(df), 1)), columns=query)
# Merging the data and the query
df = pd.concat([df, result], axis=1)
# Making the test
df = df.apply(lambda x: (x >= x['start']) & (x <= x['end']), axis=1).loc[:,query]
# Keeping only values found
df = df[df==True]
df = df.dropna(how='all', axis=0).dropna(how='all', axis=1)
# Extracting to the output format
result = df.to_dict('split')
result = dict(zip(result['columns'], result['index']))
# The result
{7217: 0, 7344: 2}
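As a further hedged alternative, not from the original answers: pandas' IntervalIndex handles this kind of containment lookup directly, assuming the ranges do not overlap (which the question allows); get_indexer returns -1 for query values that fall in no range:
import pandas as pd

df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
                   'end': [7342, 7343, 7471, 8239, 8495]})
query = [7215, 7217, 7344]

intervals = pd.IntervalIndex.from_arrays(df['start'], df['end'], closed='both')
positions = intervals.get_indexer(query)   # -1 where no interval contains the value
result = {q: int(pos) for q, pos in zip(query, positions) if pos != -1}
# {7217: 0, 7344: 2}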