I am new to Spark and am trying to optimize code written by another developer. The scenario is as follows:
There is a list of dictionaries, each with three key-value pairs: source:value, target:value, and column:value.
A CASE WHEN ... THEN expression is generated from these three key-value pairs. For instance, the list of dictionaries is as follows:
values = [{'target': 'Unknown', 'source': '', 'column': 'gender'},
{'target': 'F', 'source': '0', 'column': 'gender'},
{'target': 'M', 'source': '1', 'column': 'gender'},
{'target': 'F', 'source': 'F', 'column': 'gender'},
{'target': 'F', 'source': 'Fe', 'column': 'gender'}]
The following code generates the CASE WHEN THEN statement that follows.
for value in values:
    source_value = value.get("source")
    op = op.when(df[column] == source_value, value.get("target"))
Column<'CASE WHEN (gender = ) THEN Unknown
WHEN (gender = 0) THEN F
WHEN (gender = 1) THEN M
WHEN (gender = F) THEN F
WHEN (gender = Fe) THEN F END'>
This CASE WHEN THEN is then used to select data from a dataframe.
Question: Is the usage of CASE WHEN ... THEN valid here (is it optimized)? Some of the generated CASE WHEN expressions are very lengthy (around 1000+ branches). Is there a better way to redo the code (regex, perhaps)?
I looked at the questions below, but they were not relevant to my case.
CASE WHEN ... THEN
SPARK SQL - case when then
Thanks.
Two alternatives:
Use a UDF, in which you can access a dictionary of values
Build a lookup table and perform a broadcast join
The way to know which is better is to examine the execution plan, the job duration, and the total shuffle.
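A minimal sketch of the first option, assuming the mapping list is first inverted into a plain dict so one lookup replaces the 1000+ branches. The function name map_gender is mine, and the pyspark registration lines are commented out because they need a running SparkSession:

```python
values = [{'target': 'Unknown', 'source': '', 'column': 'gender'},
          {'target': 'F', 'source': '0', 'column': 'gender'},
          {'target': 'M', 'source': '1', 'column': 'gender'},
          {'target': 'F', 'source': 'F', 'column': 'gender'},
          {'target': 'F', 'source': 'Fe', 'column': 'gender'}]

# One dict lookup replaces the whole chain of CASE WHEN branches.
mapping = {v['source']: v['target'] for v in values}

def map_gender(value):
    # None when no branch matches, mirroring CASE ... END with no ELSE
    return mapping.get(value)

# from pyspark.sql.functions import udf
# from pyspark.sql.types import StringType
# map_gender_udf = udf(map_gender, StringType())
# df = df.withColumn('gender', map_gender_udf(df['gender']))
```

Note that a plain Python UDF blocks Catalyst optimizations, which is exactly why the execution plan and job duration should be compared against the chained when() version before committing to it.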
Is it possible to "explode" an array that contains multiple dictionaries using pandas or python?
I am developing code that returns these two arrays (simplified version):
data_for_dataframe = ["single nucleotide variant",
[{'assembly': 'GRCh38',
'start': '11016874',
'end': '11016874',
'ref': 'C',
'alt': 'T',
'risk_allele': 'T'},
{'assembly': 'GRCh37',
'start': '11076931',
'end': '11076931',
'ref': 'C',
'alt': 'T',
'risk_allele': 'T'}]]
columns = ["variant_type", "assemblies"]
So I created a pandas dataframe using these two arrays, "data_for_dataframe" and "columns":
import pandas as pd
df = pd.DataFrame(data_for_dataframe, columns).transpose()
And the output was:
The type of the "variant_type" column is string and the type of the "assemblies" column is array. My question is whether it is possible, and if so, how, to "explode" the "assemblies" column and create a dataframe as shown in the following image:
Could you help me?
It's possible with a combination of apply() and explode().
exploded = df['assemblies'].explode().apply(pd.Series)
exploded['variant_type'] = df['variant_type']
Output:
assembly start end ref alt risk_allele variant_type
0 GRCh38 11016874 11016874 C T T single nucleotide variant
0 GRCh37 11076931 11076931 C T T single nucleotide variant
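On pandas 0.25+ the same result can be reached with DataFrame.explode() plus pd.json_normalize() (pandas 1.0+), which also avoids the duplicated 0 index; a sketch rebuilding the question's inputs:

```python
import pandas as pd

data_for_dataframe = ["single nucleotide variant",
                      [{'assembly': 'GRCh38', 'start': '11016874', 'end': '11016874',
                        'ref': 'C', 'alt': 'T', 'risk_allele': 'T'},
                       {'assembly': 'GRCh37', 'start': '11076931', 'end': '11076931',
                        'ref': 'C', 'alt': 'T', 'risk_allele': 'T'}]]
columns = ["variant_type", "assemblies"]
df = pd.DataFrame(data_for_dataframe, columns).transpose()

# explode() turns the list column into one row per dict; json_normalize()
# then expands each dict into its own columns.
exploded = df.explode('assemblies').reset_index(drop=True)
out = pd.json_normalize(exploded['assemblies'].tolist())
out['variant_type'] = exploded['variant_type']
```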
How to add a column based on one of the values of a list of lists in python?
I have the following list and I need to add a new column based on the value of Currency:
If Pound: Euro = Amount * 0.9
If USD: Euro = Amount * 1.2
I need to code this without libraries.
[['Buyer', 'Seller', 'Amount', 'Property_Type', 'Currency'],
 ['100', '200', '4923', 'c', 'Pound'],
 ['600', '429', '838672', 'a', 'USD'],
 ['650', '400', '8672', 'a', 'Euro']]
Result
[['Buyer', 'Seller', 'Amount', 'Property_Type', 'Currency', 'Euro'],
 ['100', '200', '5000', 'c', 'Livre', '6000'],
 ['600', '429', '10000', 'a', 'USD', '9000'],
 ['650', '400', '8600', 'a', 'Euro', '8600']]
Thank you very much; any readings on how to import a csv and manipulate it without libraries would be much appreciated.
Assuming the columns are always in the same order...
from decimal import Decimal

EXCH_RATES = {
    'Pound': Decimal('0.9'),
    'USD': Decimal('1.2'),
    'Euro': Decimal('1'),
}

rows[0].append('Euro')
for row in rows[1:]:
    exch_rate = EXCH_RATES[row[4]]
    row.append(str(exch_rate * Decimal(row[2])))
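For illustration, here is a self-contained run over the sample rows from the question (note that it needs `from decimal import Decimal`, and `Decimal(row[2])` with parentheses, not square brackets):

```python
from decimal import Decimal

EXCH_RATES = {
    'Pound': Decimal('0.9'),
    'USD': Decimal('1.2'),
    'Euro': Decimal('1'),
}

rows = [['Buyer', 'Seller', 'Amount', 'Property_Type', 'Currency'],
        ['100', '200', '4923', 'c', 'Pound'],
        ['600', '429', '838672', 'a', 'USD'],
        ['650', '400', '8672', 'a', 'Euro']]

rows[0].append('Euro')
for row in rows[1:]:
    exch_rate = EXCH_RATES[row[4]]           # Currency is at index 4
    row.append(str(exch_rate * Decimal(row[2])))

print(rows[1][-1])  # 4430.7
```

Decimal avoids the floating-point noise you would get from `4923 * 0.9`, which matters when the amounts are money.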
Check the last item of each inner list to see what its currency is, then append the converted amount, like so (note that Currency is at index 4, and the loop must skip the header row):
lst = [['Buyer', 'Seller', 'Amount', 'Property_Type', 'Currency'],
       ['100', '200', '4923', 'c', 'Pound'],
       ['600', '429', '838672', 'a', 'USD'],
       ['650', '400', '8672', 'a', 'Euro']]
lst[0].append('Euro')
for i in range(1, len(lst)):
    if lst[i][4] == 'Pound':
        lst[i].append(str(int(lst[i][2]) * 0.9))
    elif lst[i][4] == 'USD':
        lst[i].append(str(int(lst[i][2]) * 1.2))
    else:
        lst[i].append(lst[i][2])
You would be better off storing the data in a csv file, but then you would have to use the csv library.
Tell me if this helps, and if you want to use the csv library, let me know and I can show you how.
I have a csv containing ~45,000 rows, which equates to seven days' worth of data. It has been sorted by datetime, with the oldest record first.
This is a sample row once the csv has been passed into the csv module's DictReader:
{'end': '423', 'g': '2', 'endid': '17131', 'slat': '40.7', 'endname': 'Horchata', 'cid': '1', 'startname': 'Sriracha', 'startid': '521', 'slon': '-73.9', 'usertype': 'Sub', 'stoptime': '2015-02-01 00:14:00+00', 'elong': '-73.9', 'starttime': '2015-02-01 00:00:00+00', 'elat': '40.7', 'dur': '801', 'meppy': '', 'birth_year': '1978'}
...and another:
{'end': '418', 'g': '1', 'endid': '17108', 'slat': '40.7', 'endname': 'Guacamole', 'cid': '1', 'startname': 'Cerveza', 'startid': '519', 'slon': '-73.9', 'usertype': 'Sub', 'stoptime': '2015-02-01 00:14:00+00', 'elong': '-73.9', 'starttime': '2015-02-02 00:00:00+00', 'elat': '40.7', 'dur': '980', 'meppy': '', 'birth_year': '1983'}
I recently wrote the code below. It runs through the csv (after it's been passed to DictReader). The code yields the first row of each new day, i.e. whenever the day changes, based on starttime:
dayList = []

def first_ride(reader):
    for row in reader:
        starttime = dateutil.parser.parse(row['starttime'])
        if starttime.day not in dayList:
            dayList.append(starttime.day)
            yield row
My goal now is to produce a single list containing the value associated with birth_year from each of the seven records, i.e.:
[1992, 1967, 1988, 1977, 1989, 1953, 1949]
The catch is that I want to understand how to do it using Python's higher-order functions to the maximum extent possible (i.e. map/reduce, and likely filter), without the generator currently used in my code, and without global variables. To eliminate the global variable, my guess is that each starttime's day will have to be compared to the one before, but not using the list as I currently have it set up. As a final FYI, I run Python 2.7.
I greatly appreciate any expertise donated.
You can just reduce the list of first-of-day rows (your dayList) into a list of birth_years:
reduce(lambda r, d: r + [d['birth_year']], dayList, [])
Or you can use a comprehension (preferred):
[d['birth_year'] for d in dayList]
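The day-boundary part can itself be done with reduce, so there is no generator and no global list. This is a sketch under the assumption that starttime always begins 'YYYY-MM-DD', so a string slice can stand in for dateutil; the accumulator carries both the days seen so far and the first row of each day:

```python
from functools import reduce  # reduce is also a builtin on Python 2.7

rows = [
    {'starttime': '2015-02-01 00:00:00+00', 'birth_year': '1978'},
    {'starttime': '2015-02-01 00:14:00+00', 'birth_year': '1990'},
    {'starttime': '2015-02-02 00:00:00+00', 'birth_year': '1983'},
]

def keep_first_of_day(acc, row):
    seen, firsts = acc
    day = row['starttime'][:10]            # the 'YYYY-MM-DD' prefix
    if day in seen:
        return acc                         # not the first ride of this day
    return (seen | {day}, firsts + [row])  # record the day, keep the row

_, firsts = reduce(keep_first_of_day, rows, (frozenset(), []))
birth_years = [int(r['birth_year']) for r in firsts]
print(birth_years)  # [1978, 1983]
```

Because the accumulator is threaded through reduce, the "which days have I seen" state lives entirely inside the fold rather than in a module-level list.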
I am trying to match the values of two dicts on two separate keys by looping over them, hoping that if line_aum['id_class'] == line_investor['id_class'] becomes True, the following sum function will work.
However, it kicks out a different result.
so far I have:
for line_aum in aum_obj:
    for line_investor in investor_obj:
        if i in line_aum['id_class'] == line_investor['id_class']:
            total = (sum, line_investor['amount'], line_aum['value'])
            amount = line['id_class']
            print(amount, total)
Example data:
{'fund_name': '', 'fund_code': 'LFE', 'aumc': '406.37', 'value': '500', 'ddate': '2013-01-01', 'id_fund': '165', 'currency': 'EUR', 'nav': '24.02', 'shares': '16.918', 'estimate': '0', 'id_class': '4526', 'class_name': 'LTD - CLASS B (EUR)'}
Use itertools.product instead of nested loops if both aum_obj and investor_obj are lists:
from itertools import product

for line_aum, line_investor in product(aum_obj, investor_obj):
    if line_aum['id_class'] == line_investor['id_class']:
        # `line_aum` and `line_investor` have matching values for the `id_class` keys.
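To complete the sketch with the sum the question was after, here is one hedged way. The field names come from the sample data, and I assume amount and value are numeric strings; note that sum() takes an iterable, which is the bug in the original `(sum, ...)` tuple:

```python
from itertools import product

# Minimal stand-ins for aum_obj and investor_obj, shaped like the sample data.
aum_obj = [{'id_class': '4526', 'value': '500'}]
investor_obj = [{'id_class': '4526', 'amount': '406.37'},
                {'id_class': '9999', 'amount': '1.00'}]

totals = []
for line_aum, line_investor in product(aum_obj, investor_obj):
    if line_aum['id_class'] == line_investor['id_class']:
        # sum() wants an iterable, so wrap the two floats in a list
        total = sum([float(line_investor['amount']), float(line_aum['value'])])
        totals.append((line_aum['id_class'], total))

print(totals)
```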
Total Python noob here, probably missing something obvious. I've searched everywhere and haven't found a solution yet, so I thought I'd ask for some help.
I'm trying to write a function that will build a nested dictionary from a large csv file. The input file is in the following format:
Product,Price,Cost,Brand,
blue widget,5,4,sony,
red widget,6,5,sony,
green widget,7,5,microsoft,
purple widget,7,6,microsoft,
etc...
The output dictionary I need would look like:
projects = { `<Brand>`: { `<Product>`: { 'Price': `<Price>`, 'Cost': `<Cost>` },},}
But obviously with many different brands containing different products. In the input file, the data is ordered alphabetically by brand name, but I know that it becomes unordered as soon as DictReader executes, so I definitely need a better way to handle the duplicates. The if statement as written is redundant and unnecessary.
Here's the non-working, useless code I have so far:
def build_dict(source_file):
    projects = {}
    headers = ['Product', 'Price', 'Cost', 'Brand']
    reader = csv.DictReader(open(source_file), fieldnames=headers, dialect='excel')
    current_brand = 'None'
    for row in reader:
        if Brand != current_brand:
            current_brand = Brand
            projects[Brand] = {Product: {'Price': Price, 'Cost': Cost}}
    return projects
source_file = 'merged.csv'
print build_dict(source_file)
I have of course imported the csv module at the top of the file.
What's the best way to do this? I feel like I'm way off course, but there is very little information available about creating nested dicts from a CSV, and the examples that are out there are highly specific and tend not to go into detail about why the solution actually works, so as someone new to Python, it's a little hard to draw conclusions.
Also, the input csv file doesn't normally have headers, but for the sake of trying to get a working version of this function, I manually inserted a header row. Ideally, there would be some code that assigns the headers.
Any help/direction/recommendation is much appreciated, thanks!
import csv
from collections import defaultdict

def build_dict(source_file):
    projects = defaultdict(dict)
    headers = ['Product', 'Price', 'Cost', 'Brand']
    with open(source_file, 'rb') as fp:
        reader = csv.DictReader(fp, fieldnames=headers, dialect='excel',
                                skipinitialspace=True)
        for rowdict in reader:
            if None in rowdict:
                del rowdict[None]
            brand = rowdict.pop("Brand")
            product = rowdict.pop("Product")
            projects[brand][product] = rowdict
    return dict(projects)

source_file = 'merged.csv'
print build_dict(source_file)
produces
{'microsoft': {'green widget': {'Cost': '5', 'Price': '7'},
'purple widget': {'Cost': '6', 'Price': '7'}},
'sony': {'blue widget': {'Cost': '4', 'Price': '5'},
'red widget': {'Cost': '5', 'Price': '6'}}}
from your input data (where merged.csv doesn't have the headers, only the data.)
I used a defaultdict here, which is just like a dictionary, except that when you refer to a key that doesn't exist, instead of raising an exception it simply makes a default value, in this case a dict. Then I get out (and remove) Brand and Product, and store the remainder.
All that's left I think would be to turn the cost and price into numbers instead of strings.
[modified to use DictReader directly rather than reader]
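That last step, turning cost and price into numbers, could look like the sketch below, applied to the returned dict. The function name numify is mine:

```python
def numify(projects):
    # Convert the Price/Cost strings in the nested dict to floats.
    return {brand: {product: {key: float(val) for key, val in fields.items()}
                    for product, fields in products.items()}
            for brand, products in projects.items()}

projects = {'sony': {'blue widget': {'Cost': '4', 'Price': '5'}}}
print(numify(projects))
```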
Here I offer another way to satisfy your requirement (different from DSM's answer).
First, here is my code:
import csv

new_dict = {}
with open('merged.csv', 'rb') as csv_file:
    data = csv.DictReader(csv_file, delimiter=",")
    for row in data:
        dict_brand = new_dict.get(row['Brand'], dict())
        dict_brand[row['Product']] = {k: row[k] for k in ('Cost', 'Price')}
        new_dict[row['Brand']] = dict_brand
print new_dict
Briefly speaking, the main point is to figure out what the key-value pairs in your requirement are. What you want can be called a 3-level dict: the key of the first level is the Brand value from the original rows, so I pull it out of the csv file with
dict_brand = new_dict.get(row['Brand'], dict())
which checks whether our new dict already has an entry for that Brand value: if yes, it reuses it; if not, it creates one. The middle level is perhaps the trickiest part: each Product value becomes a key under its Brand, and its value is the third-level dict holding the Price and Cost of the original row, which I extract with
dict_brand[row['Product']] = {k: row[k] for k in ('Cost', 'Price')}
Finally, all we need to do is set the 'middle dict' we just built as the value in our new dict, keyed by Brand.
Finally, the output is
{'sony': {'blue widget': {'Price': '5', 'Cost': '4'},
'red widget': {'Price': '6', 'Cost': '5'}},
'microsoft': {'purple widget': {'Price': '7', 'Cost': '6'},
'green widget': {'Price': '7', 'Cost': '5'}}}
That's that.
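The get-then-reassign pair in that code can also be collapsed with dict.setdefault, which does the "create the inner dict if missing" check in one call. A sketch on inline rows rather than the csv file:

```python
# Rows shaped like what csv.DictReader would yield for merged.csv.
rows = [
    {'Product': 'blue widget', 'Price': '5', 'Cost': '4', 'Brand': 'sony'},
    {'Product': 'green widget', 'Price': '7', 'Cost': '5', 'Brand': 'microsoft'},
]

new_dict = {}
for row in rows:
    # setdefault returns the existing inner dict, or inserts and returns {}
    new_dict.setdefault(row['Brand'], {})[row['Product']] = {
        k: row[k] for k in ('Cost', 'Price')}

print(new_dict)
```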