Using Python's Higher Order Functions on a CSV

I have a csv containing ~45,000 rows, which equates to seven days' worth of data. It has been sorted by datetime, with the oldest record first.
This is a sample row once the csv has been passed into the csv module's DictReader:
{'end': '423', 'g': '2', 'endid': '17131', 'slat': '40.7', 'endname': 'Horchata', 'cid': '1', 'startname': 'Sriracha', 'startid': '521', 'slon': '-73.9', 'usertype': 'Sub', 'stoptime': '2015-02-01 00:14:00+00', 'elong': '-73.9', 'starttime': '2015-02-01 00:00:00+00', 'elat': '40.7', 'dur': '801', 'meppy': '', 'birth_year': '1978'}
...and another:
{'end': '418', 'g': '1', 'endid': '17108', 'slat': '40.7', 'endname': 'Guacamole', 'cid': '1', 'startname': 'Cerveza', 'startid': '519', 'slon': '-73.9', 'usertype': 'Sub', 'stoptime': '2015-02-01 00:14:00+00', 'elong': '-73.9', 'starttime': '2015-02-02 00:00:00+00', 'elat': '40.7', 'dur': '980', 'meppy': '', 'birth_year': '1983'}
I recently wrote the code below. It runs through the csv (after it's been passed to DictReader). The code yields the first row of each new day, i.e. whenever the day changes, based on starttime:
import dateutil.parser

dayList = []

def first_ride(reader):
    for row in reader:
        starttime = dateutil.parser.parse(row['starttime'])
        if starttime.day not in dayList:
            dayList.append(starttime.day)  # global list tracks days already seen
            yield row
My goal now is to produce a single list containing the value associated with birth_year from each of the seven records, i.e.:
[1992, 1967, 1988, 1977, 1989, 1953, 1949]
The catch is that I want to understand how to do it using Python's HOFs to the maximum extent possible (i.e. map/reduce, and likely filter), without the generator (currently used in my code), and without global variables. To eliminate the global variable, my guess is that each starttime's day will have to be compared to the one before, but not using the list, as I currently have it set up. As a final FYI, I run Python 2.7.
I majorly appreciate any expertise donated.

You can just collect the rows your generator yields (e.g. first_rows = list(first_ride(reader))) and reduce them into a list of birth_years:
reduce(lambda r, d: r + [d['birth_year']], first_rows, [])
Or you can use a comprehension (preferred):
[d['birth_year'] for d in first_rows]
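If you also want to drop the generator and the global list, here is a minimal reduce-based sketch (Python 2.7). The helper name keep_first_of_day is illustrative; it assumes reader is your DictReader, sorted by starttime, and that birth_year is always populated. It compares full dates rather than day numbers so it also works across month boundaries:
import dateutil.parser

def keep_first_of_day(acc, row):
    # acc is a (days already seen, first rows collected) pair
    seen, firsts = acc
    day = dateutil.parser.parse(row['starttime']).date()
    return (seen | {day}, firsts + [row]) if day not in seen else acc

seen_days, first_rows = reduce(keep_first_of_day, reader, (frozenset(), []))
birth_years = map(lambda r: int(r['birth_year']), first_rows)
print birth_years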

Related

Replacement for Spark's CASE WHEN THEN

I am new to Spark and am trying to optimize code written by another developer. The scenario is as follows:
There is a list of dictionaries with three key-value pairs: one is source:value, the second is target:value and the third is column:value.
A CASE WHEN THEN statement is generated based on the above three key-value pairs. For instance, the list of dictionaries is as follows:
values = [{'target': 'Unknown', 'source': '', 'column': 'gender'},
{'target': 'F', 'source': '0', 'column': 'gender'},
{'target': 'M', 'source': '1', 'column': 'gender'},
{'target': 'F', 'source': 'F', 'column': 'gender'},
{'target': 'F', 'source': 'Fe', 'column': 'gender'}]
The following code generates the CASE WHEN THEN statement that follows.
for value in values:
    source_value = value.get("source")
    # `op` (the Column being built up) and `column` are defined earlier in the original code
    op = op.when(df[column] == source_value, value.get("target"))
Column<'CASE WHEN (gender = ) THEN Unknown
WHEN (gender = 0) THEN F
WHEN (gender = 1) THEN M
WHEN (gender = F) THEN F
WHEN (gender = Fe) THEN F END'>
This CASE WHEN THEN is then used to select data from a dataframe.
Question: Is the usage of CASE WHEN THEN valid here (is it optimized)? Some of the CASE WHEN statements are very very lengthy (around 1000+). Is there a better way to redo the code (regex perhaps)?
I looked at the questions below, but they were not relevant to my case.
CASE WHEN ... THEN
SPARK SQL - case when then
Thanks.
Two alternatives:
Use a UDF, in which you can access a dictionary of values
Build a mapping table and perform a broadcast join
The way to know which is better is by examining the execution plan, job duration and total shuffle.
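For the second alternative, here is a minimal PySpark sketch, assuming a SparkSession named spark, the dataframe df from the question, and the values list above; the renamed column gender_source and the 'Unknown' fallback are illustrative choices, not part of the original answer:
from pyspark.sql import functions as F

# Build a small lookup dataframe from the list of dicts and rename its join key
# so it does not clash with the target dataframe's own column.
mapping = spark.createDataFrame(values).withColumnRenamed("source", "gender_source")

result = (
    df.join(F.broadcast(mapping), df["gender"] == mapping["gender_source"], "left")
      .withColumn("gender_mapped", F.coalesce(mapping["target"], F.lit("Unknown")))
      .drop("gender_source", "target", "column")
)
With 1000+ mappings, the broadcast join ships the small lookup to every executor and avoids building one enormous CASE WHEN expression; comparing the two plans with explain() is the way to confirm which wins for your data.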

Filtering through a list with embedded dictionaries

I've got a json format list with some dictionaries within each list, it looks like the following:
[{"id":13, "name":"Albert", "venue":{"id":123, "town":"Birmingham"}, "month":"February"},
{"id":17, "name":"Alfred", "venue":{"id":456, "town":"London"}, "month":"February"},
{"id":20, "name":"David", "venue":{"id":14, "town":"Southampton"}, "month":"June"},
{"id":17, "name":"Mary", "venue":{"id":56, "town":"London"}, "month":"December"}]
The number of entries within the list can be up to 100. I plan to present the 'name' for each entry, one result at a time, for those that have London as the town. The rest are of no use to me. I'm a beginner at Python, so I would appreciate a suggestion on how to go about this efficiently. I initially thought it would be best to remove all entries that don't have London, and then I can go through them one by one.
I also wondered if it might be quicker to not filter but to cycle through the entire json and select the names of entries that have the town as London.
You can use filter:
data = [{"id":13, "name":"Albert", "venue":{"id":123, "town":"Birmingham"}, "month":"February"},
{"id":17, "name":"Alfred", "venue":{"id":456, "town":"London"}, "month":"February"},
{"id":20, "name":"David", "venue":{"id":14, "town":"Southampton"}, "month":"June"},
{"id":17, "name":"Mary", "venue":{"id":56, "town":"London"}, "month":"December"}]
london_dicts = filter(lambda d: d['venue']['town'] == 'London', data)
for d in london_dicts:
    print(d)
This is as efficient as it can get because:
The loop is written in C (in the case of CPython)
filter returns an iterator (in Python 3), which means that the results are loaded into memory one by one, as required
One way is to use list comprehension:
>>> data = [{"id":13, "name":"Albert", "venue":{"id":123, "town":"Birmingham"}, "month":"February"},
{"id":17, "name":"Alfred", "venue":{"id":456, "town":"London"}, "month":"February"},
{"id":20, "name":"David", "venue":{"id":14, "town":"Southampton"}, "month":"June"},
{"id":17, "name":"Mary", "venue":{"id":56, "town":"London"}, "month":"December"}]
>>> [d for d in data if d['venue']['town'] == 'London']
[{'id': 17,
'name': 'Alfred',
'venue': {'id': 456, 'town': 'London'},
'month': 'February'},
{'id': 17,
'name': 'Mary',
'venue': {'id': 56, 'town': 'London'},
'month': 'December'}]
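Since the stated goal is to present just the 'name' of each London entry, one result at a time, here is a small follow-up sketch building on either answer above; it assumes the same data list, and the generator expression keeps memory use low:
# Pull out only the names of the London entries, lazily.
london_names = (d["name"] for d in data if d["venue"]["town"] == "London")
for name in london_names:
    print(name)  # Alfred, then Mary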

How to take the specific details out in Python that are separated by a semicolon or a slash?

I have the following results from a vet analyser
result{type:PT/APTT;error:0;PT:32.3 s;INR:0.0;APTT:119.2;code:470433200;lot:405
4H0401;date:20/01/2017 06:47;PID:TREKKER20;index:015;C1:-0.1;C2:-0.1;qclock:0;ta
rget:2;name:;Sex:;BirthDate:;operatorID:;SN:024000G0900046;version:V2.8.0.09}
Using Python, how do I separate the date, the time, and the type (PT and APTT)? Please note that the results will be different every time, so I need to write code that will find the date using the / and will get the time from its four digits and the : ... Do I use a for loop?
This code makes further usage of the fields easier by converting them to a dict.
from pprint import pprint

result = "result{type:PT/APTT;error:0;PT:32.3 s;INR:0.0;APTT:119.2;code:470433200;lot:405 4H0401;date:20/01/2017 06:47;PID:TREKKER20;index:015;C1:-0.1;C2:-0.1;qclock:0;ta rget:2;name:;Sex:;BirthDate:;operatorID:;SN:024000G0900046;version:V2.8.0.09}"

if result.startswith("result{") and result.endswith("}"):
    result = result[(result.index("{") + 1):result.index("}")]
# else:
#     raise ValueError("Invalid data '" + result + "'")

# Separate fields
fields = result.split(";")

# Separate field names and values.
# The first part is always the field name, but any additional ":" must not be
# split, e.g. "date:dd/mm/yyyy HH:MM" -> "date": "dd/mm/yyyy HH:MM"
fields = [field.split(":", 1) for field in fields]
fields = {field[0]: field[1] for field in fields}

a = fields['type'].split("/")

print(fields)
pprint(fields)
print(a)
The result:
{'type': 'PT/APTT', 'error': '0', 'PT': '32.3 s', 'INR': '0.0', 'APTT': '119.2', 'code': '470433200', 'lot': '405 4H0401', 'date': '20/01/2017 06:47', 'PID': 'TREKKER20', 'index': '015', 'C1': '-0.1', 'C2': '-0.1', 'qclock': '0', 'ta rget': '2', 'name': '', 'Sex': '', 'BirthDate': '', 'operatorID': '', 'SN': '024000G0900046', 'version': 'V2.8.0.09'}
{'APTT': '119.2',
'BirthDate': '',
'C1': '-0.1',
'C2': '-0.1',
'INR': '0.0',
'PID': 'TREKKER20',
'PT': '32.3 s',
'SN': '024000G0900046',
'Sex': '',
'code': '470433200',
'date': '20/01/2017 06:47',
'error': '0',
'index': '015',
'lot': '405 4H0401',
'name': '',
'operatorID': '',
'qclock': '0',
'ta rget': '2',
'type': 'PT/APTT',
'version': 'V2.8.0.09'}
['PT', 'APTT']
Note that dictionaries are not sorted (they don't need to be in most cases as you access the fields by the keys).
If you want to split the results by semicolon:
result_array = result.split(';')
In result_array you'll get all the strings separated by semicolons; then you can access the date there: result_array[index]
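For example, with the raw result string above (hypothetical index; the date happens to be the eighth field, index 7, in this sample):
result_array = result.split(';')
print(result_array[7])  # 'date:20/01/2017 06:47'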
That's quite a bad format to store data in, as fields might have colons in their values, but if you have to, you can strip away the surrounding result{...} wrapper, split the rest on semicolons, then do a single split on a colon to get dict key-value pairs and build a dict from that, e.g.:
data = "result{type:PT/APTT;error:0;PT:32.3 s;INR:0.0;APTT:119.2;code:470433200;lot:405 " \
"4H0401;date:20/01/2017 06:47;PID:TREKKER20;index:015;C1:-0.1;C2:-0.1;qclock:0;ta " \
"rget:2;name:;Sex:;BirthDate:;operatorID:;SN:024000G0900046;version:V2.8.0.09}"
parsed = dict(e.split(":", 1) for e in data[7:-1].split(";"))
print(parsed["APTT"]) # 119.2
print(parsed["PT"]) # 32.3 s
print(parsed["date"]) # 20/01/2017 06:47
If you need to further separate the date field to date and time, you can just do date, time = parsed["date"].split(), although if you're going to manipulate the object I'd suggest you to use the datetime module and parse it e.g.:
import datetime
date = datetime.datetime.strptime(parsed["date"], "%d/%m/%Y %H:%M")
print(date) # 2017-01-20 06:47:00
print(date.year) # 2017
print(date.hour) # 6
# etc.
To go straight to the point and get your type, PT, APTT, date and time, use re:
import re
from source import result_gen

result = result_gen()

def from_result(*vars):
    regex = re.compile('|'.join([f'{re.escape(var)}:.*?;' for var in vars]))
    matches = dict(f.group().split(':', 1) for f in re.finditer(regex, result))
    return tuple(matches[v][:-1] for v in vars)  # drop the trailing ';'

type, PT, APTT, datetime = from_result('type', 'PT', 'APTT', 'date')
date, time = datetime.split()
Notice that this can be easily extended in the event you become suddenly interested in some other 'var' in the string.
In short, you can optimize this further (to avoid the split step) by capturing groups in the regex search, as sketched below.
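A minimal sketch of that capturing-group variant (not part of the original answer), assuming the same result string; the function name and the empty-string fallback are illustrative:
import re

def fields_from_result(result, *names):
    # One search per requested field; the group captures everything up to the
    # next ';' or '}'. For extra robustness you could anchor on a preceding
    # ';' or '{' so a name can never match inside a longer field name.
    found = {}
    for name in names:
        m = re.search(re.escape(name) + r':([^;}]*)', result)
        if m:
            found[name] = m.group(1)
    return tuple(found.get(name, '') for name in names)

rtype, pt, aptt, when = fields_from_result(result, 'type', 'PT', 'APTT', 'date')
date, time = when.split()  # '20/01/2017', '06:47'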

python key value two dict matches

I am trying to match the values of two dicts on two separate keys by looping over them, hoping that if i in line_aum['id_class'] == line_investor['id_class'] becomes True, the sum function that follows will work.
Though it kicks out a different result.
So far I have:
for line_aum in aum_obj:
    for line_investor in investor_obj:
        if i in line_aum['id_class'] == line_investor['id_class']:
            total = (sum, line_investor['amount'], line_aum['value'])
            amount = line['id_class']
            print(amount, total)
Example data:
{'fund_name': '', 'fund_code': 'LFE', 'aumc': '406.37', 'value': '500', 'ddate': '2013-01-01', 'id_fund': '165', 'currency': 'EUR', 'nav': '24.02', 'shares': '16.918', 'estimate': '0', 'id_class': '4526', 'class_name': 'LTD - CLASS B (EUR)'}
Use itertools.product instead of nested loops if both aum_obj and investor_obj are lists:
from itertools import product

for line_aum, line_investor in product(aum_obj, investor_obj):
    if line_aum['id_class'] == line_investor['id_class']:
        # `line_aum` and `line_investor` have matching values for the `id_class` key.
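        # Hedged completion (not in the original answer): it is unclear whether the
        # goal is a per-pair sum or a running total; this assumes a per-pair sum of
        # the two numeric string fields.
        total = float(line_investor['amount']) + float(line_aum['value'])
        print(line_aum['id_class'], total)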

Using Python CSV DictReader to create multi-level nested dictionary

Total Python noob here, probably missing something obvious. I've searched everywhere and haven't found a solution yet, so I thought I'd ask for some help.
I'm trying to write a function that will build a nested dictionary from a large csv file. The input file is in the following format:
Product,Price,Cost,Brand,
blue widget,5,4,sony,
red widget,6,5,sony,
green widget,7,5,microsoft,
purple widget,7,6,microsoft,
etc...
The output dictionary I need would look like:
projects = { `<Brand>`: { `<Product>`: { 'Price': `<Price>`, 'Cost': `<Cost>` },},}
But obviously with many different brands containing different products. In the input file, the data is ordered alphabetically by brand name, but I know that it becomes unordered as soon as DictReader executes, so I definitely need a better way to handle the duplicates. The if statement as written is redundant and unnecessary.
Here's the non-working, useless code I have so far:
def build_dict(source_file):
    projects = {}
    headers = ['Product', 'Price', 'Cost', 'Brand']
    reader = csv.DictReader(open(source_file), fieldnames=headers, dialect='excel')
    current_brand = 'None'
    for row in reader:
        if Brand != current_brand:
            current_brand = Brand
            projects[Brand] = {Product: {'Price': Price, 'Cost': Cost}}
    return projects

source_file = 'merged.csv'
print build_dict(source_file)
I have of course imported the csv module at the top of the file.
What's the best way to do this? I feel like I'm way off course, but there is very little information available about creating nested dicts from a CSV, and the examples that are out there are highly specific and tend not to go into detail about why the solution actually works, so as someone new to Python, it's a little hard to draw conclusions.
Also, the input csv file doesn't normally have headers, but for the sake of trying to get a working version of this function, I manually inserted a header row. Ideally, there would be some code that assigns the headers.
Any help/direction/recommendation is much appreciated, thanks!
import csv
from collections import defaultdict

def build_dict(source_file):
    projects = defaultdict(dict)
    headers = ['Product', 'Price', 'Cost', 'Brand']
    with open(source_file, 'rb') as fp:
        reader = csv.DictReader(fp, fieldnames=headers, dialect='excel',
                                skipinitialspace=True)
        for rowdict in reader:
            if None in rowdict:
                del rowdict[None]
            brand = rowdict.pop("Brand")
            product = rowdict.pop("Product")
            projects[brand][product] = rowdict
    return dict(projects)

source_file = 'merged.csv'
print build_dict(source_file)
produces
{'microsoft': {'green widget': {'Cost': '5', 'Price': '7'},
'purple widget': {'Cost': '6', 'Price': '7'}},
'sony': {'blue widget': {'Cost': '4', 'Price': '5'},
'red widget': {'Cost': '5', 'Price': '6'}}}
from your input data (where merged.csv doesn't have the headers, only the data.)
I used a defaultdict here, which is just like a dictionary, but when you refer to a key that doesn't exist, instead of raising an exception it simply creates a default value, in this case a dict. Then I get out -- and remove -- Brand and Product, and store the remainder.
All that's left I think would be to turn the cost and price into numbers instead of strings.
[modified to use DictReader directly rather than reader]
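As a follow-up to the note above about turning cost and price into numbers, a minimal sketch, assuming projects is the dict returned by build_dict(source_file) and every Price/Cost string is a valid number:
projects = build_dict(source_file)
for products in projects.values():
    for info in products.values():
        # convert the string values in place
        info['Price'] = float(info['Price'])
        info['Cost'] = float(info['Cost'])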
Here I offer another way to satisfy your requirement (different from DSM's).
Firstly, this is my code:
import csv

new_dict = {}
with open('merged.csv', 'rb') as csv_file:
    data = csv.DictReader(csv_file, delimiter=",")
    for row in data:
        dict_brand = new_dict.get(row['Brand'], dict())
        dict_brand[row['Product']] = {k: row[k] for k in ('Cost', 'Price')}
        new_dict[row['Brand']] = dict_brand
print new_dict
Briefly speaking, the main point is to figure out what the key-value pairs are in your requirement. What you describe is a 3-level dict: the key of the first level is the value of Brand in the original dictionary, so I extract it from the original csv file as
dict_brand = new_dict.get(row['Brand'], dict())
which checks whether our new dict already has an entry for that Brand value: if yes, it is reused; if not, a new dict is created. The most complicated part is the second (middle) level: the value of Product in the original dict becomes a key under Brand, and its value is the third-level dict holding the Price and Cost of the original dict, which I extract like:
dict_brand[row['Product']] = {k: row[k] for k in ('Cost', 'Price')}
Finally, we just set the created 'middle dict' as the value of our new dict, with Brand as the key.
Finally, the output is
{'sony': {'blue widget': {'Price': '5', 'Cost': '4'},
'red widget': {'Price': '6', 'Cost': '5'}},
'microsoft': {'purple widget': {'Price': '7', 'Cost': '6'},
'green widget': {'Price': '7', 'Cost': '5'}}}
That's that.
