How to write dictionaries to BigQuery in Dataflow using Python

I am trying to read a CSV in GCP Storage, convert it into dictionaries and then write to a BigQuery table as follows:
p | ReadFromText("gs://bucket/file.csv")
  | (beam.ParDo(BuildAdsRecordFn()))
  | WriteToBigQuery('ads_table', dataset='dds', project='doubleclick-2', schema=ads_schema)
where 'doubleclick-2' and 'dds' are an existing project and dataset, and ads_schema is defined as follows:
ads_schema='Advertiser_ID:INTEGER,Campaign_ID:INTEGER,Ad_ID:INTEGER,Ad_Name:STRING,Click_through_URL:STRING,Ad_Type:STRING'
BuildAdsRecordFn() is defined as follows:
class AdsRecord:
    dict = {}

    def __init__(self, line):
        record = line.split(",")
        self.dict['Advertiser_ID'] = record[0]
        self.dict['Campaign_ID'] = record[1]
        self.dict['Ad_ID'] = record[2]
        self.dict['Ad_Name'] = record[3]
        self.dict['Click_through_URL'] = record[4]
        self.dict['Ad_Type'] = record[5]
class BuildAdsRecordFn(beam.DoFn):
    def __init__(self):
        super(BuildAdsRecordFn, self).__init__()

    def process(self, element):
        text_line = element.strip()
        ads_record = AdsRecord(text_line).dict
        return ads_record
However, when I run the pipeline, I get the following error:
"dataflow_job_18146703755411620105-B" failed., (6c011965a92e74fa): BigQuery job "dataflow_job_18146703755411620105-B" in project "doubleclick-2" finished with error(s): errorResult: JSON table encountered too many errors, giving up. Rows: 1; errors: 1., error: JSON table encountered too many errors, giving up. Rows: 1; errors: 1., error: JSON parsing error in row starting at position 0: Value encountered without start of object
Here is the sample testing data I used:
100001,1000011,10000111,ut,https://bloomberg.com/aliquam/lacus/morbi.xml,Brand-neutral
100001,1000011,10000112,eu,http://weebly.com/sed/vel/enim/sit.jsp,Dynamic Click
I'm new to both Dataflow and Python, so I could not figure out what is wrong in the above code. I'd greatly appreciate any help!

I just ran your code and it didn't work for me either, but I got a different error message (something like "you can't return a dict as the result of a ParDo").
The code below worked for me. Notice that I'm no longer using the class attribute dict, and that a list is now returned:
ads_schema = 'Advertiser_ID:INTEGER,Campaign_ID:INTEGER,Ad_ID:INTEGER,Ad_Name:STRING,Click_through_URL:STRING,Ad_Type:STRING'

class BuildAdsRecordFn(beam.DoFn):
    def __init__(self):
        super(BuildAdsRecordFn, self).__init__()

    def process(self, element):
        text_line = element.strip()
        ads_record = self.process_row(text_line)
        return ads_record

    def process_row(self, row):
        dict_ = {}
        record = row.split(",")
        dict_['Advertiser_ID'] = int(record[0]) if record[0] else None
        dict_['Campaign_ID'] = int(record[1]) if record[1] else None
        dict_['Ad_ID'] = int(record[2]) if record[2] else None
        dict_['Ad_Name'] = record[3]
        dict_['Click_through_URL'] = record[4]
        dict_['Ad_Type'] = record[5]
        return [dict_]
with beam.Pipeline() as p:
    (p | ReadFromText("gs://bucket/file.csv")
       | beam.Filter(lambda x: x[0] != 'A')
       | (beam.ParDo(BuildAdsRecordFn()))
       | WriteToBigQuery('ads_table', dataset='dds',
                         project='doubleclick-2', schema=ads_schema))
       #| WriteToText('test.csv'))
This is the data I simulated:
Advertiser_ID,Campaign_ID,Ad_ID,Ad_Name,Click_through_URL,Ad_Type
1,1,1,name of ad,www.url.com,sales
1,1,2,name of ad2,www.url2.com,sales with sales
I also filtered out the header line that I added to my file (in the Filter step); if you don't have a header, this is not necessary.
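As a side note (not part of the original answer), process can also yield each dict instead of returning a one-element list, since Beam only requires an iterable of outputs. A minimal sketch of the same DoFn written that way:
import apache_beam as beam

class BuildAdsRecordFn(beam.DoFn):
    def process(self, element):
        record = element.strip().split(",")
        # yielding turns process into a generator, which Beam iterates over
        yield {
            'Advertiser_ID': int(record[0]) if record[0] else None,
            'Campaign_ID': int(record[1]) if record[1] else None,
            'Ad_ID': int(record[2]) if record[2] else None,
            'Ad_Name': record[3],
            'Click_through_URL': record[4],
            'Ad_Type': record[5],
        }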

Related

Why does Map work and ParDo doesn't?

I am trying to figure out the performance difference between Map and ParDo, but somehow I cannot get the ParDo version to run.
I have already looked for resources that address this problem, but did not find one.
ParDo Method (This does not work):
class ci(beam.DoFn):
    def compute_interest(self, data_item):
        cust_id, cust_data = data_item
        if(cust_data['basic'][0]['acc_opened_date']=='2010-10-10'):
            new_data = {}
            new_data['new_balance'] = (cust_data['account'][0]['cur_bal'] * cust_data['account'][0]['roi']) / 100
            new_data.update(cust_data['account'][0])
            new_data.update(cust_data['basic'][0])
            del new_data['cur_bal']
            return new_data
Map Method (This works):
def compute_interest(data_item):
    cust_id, cust_data = data_item
    if(cust_data['basic'][0]['acc_opened_date']=='2010-10-10'):
        new_data = {}
        new_data['new_balance'] = (cust_data['account'][0]['cur_bal'] * cust_data['account'][0]['roi']) / 100
        new_data.update(cust_data['account'][0])
        new_data.update(cust_data['basic'][0])
        del new_data['cur_bal']
        return new_data
ERROR:
raise NotImplementedError
RuntimeError: NotImplementedError [while running 'PIPELINE NAME']
beam.DoFn expects a process method instead:
def process(self, element):
As explained in section 4.2.1.2 of the Beam programming guide:
Inside your DoFn subclass, you'll write a method process where you provide the actual processing logic. You don't need to manually extract the elements from the input collection; the Beam SDKs handle that for you. Your process method should accept an argument element, which is the input element; output is emitted by using a yield or return statement inside the process method.
As an example we'll define both Map and ParDo functions:
def compute_interest_map(data_item):
    return data_item + 1

class compute_interest_pardo(beam.DoFn):
    def process(self, element):
        yield element + 2
If you rename process to another method name, you'll get the NotImplementedError.
And the main pipeline will be:
events = (p
          | 'Create' >> beam.Create([1, 2, 3])
          | 'Add 1' >> beam.Map(lambda x: compute_interest_map(x))
          | 'Add 2' >> beam.ParDo(compute_interest_pardo())
          | 'Print' >> beam.ParDo(log_results()))
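log_results is not defined in the answer; a minimal sketch of a logging DoFn that would produce output like the lines below (the name and message format are assumptions on my part):
import logging

class log_results(beam.DoFn):
    def process(self, element):
        # log each element, e.g. INFO:root:>> Interest: 4
        logging.info('>> Interest: %s', element)
        yield element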
Output:
INFO:root:>> Interest: 4
INFO:root:>> Interest: 5
INFO:root:>> Interest: 6

Python: Automatic sharding in FileSystems.create() for a writing channel in Dataflow

I was using FileSystems inside a ParDo to be able to write to dynamic destinations in data storage.
However, I was not able to get automatic sharding as in TextIO, using a wildcard in the filename.
Is there a way to do automatic sharding with FileSystems.create()?
Edited:
This is the pipeline I ran; the part of the code in question is WriteToStorage, where I want to write the result to date{week}at{year}/results*.json:
with beam.Pipeline(options=pipeline_options) as p:
    pcoll = (p | ReadFromText(known_args.input)
               | beam.ParDo(WriteDecompressedFile())
               | beam.Map(lambda x: ('{week}at{year}'.format(week=x['week'], year=x['year']), x))
               | beam.GroupByKey()
               | beam.ParDo(WriteToStorage()))
Here's the current version of WriteToStorage()
class WriteToStorage(beam.DoFn):
    def __init__(self):
        self.writer = None

    def process(self, element):
        (key, val) = element
        week, year = [int(x) for x in key.split('at')]
        if self.writer == None:
            path = known_args.output + 'date-{week}at{year}/revisions-from-{rev}.json'.format(week=week, year=year, rev=element['rev_id'])
            self.writer = filesystems.FileSystems.create(path)
            logging.info('USERLOG: Write to path %s.' % path)
        logging.info('TESTLOG: %s.' % type(val))
        for output in val:
            self.writer.write(json.dumps(output) + '\n')

    def finish_bundle(self):
        if not(self.writer == None):
            self.writer.close()
Thank you.
You can use the start_bundle() method of your DoFn to open a new file for each bundle on each worker.
However, you have to figure out a way to name your files independently. TextIO seems to do this with _RoundRobinKeyFn.
A simpler way would be to use a timestamp for name generation, but I don't know how foolproof that would be.
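For illustration only, a rough sketch of that idea: one file per bundle, named with a uuid so names don't collide across workers (the output prefix and JSON serialization are placeholders, not the asker's actual code):
import json
import uuid

import apache_beam as beam
from apache_beam.io import filesystems

class WritePerBundle(beam.DoFn):
    def __init__(self, output_prefix):
        # e.g. 'gs://bucket/date-40at2017/'
        self.output_prefix = output_prefix
        self.writer = None

    def start_bundle(self):
        # open one file per bundle; the uuid keeps the name unique
        path = '{}results-{}.json'.format(self.output_prefix, uuid.uuid4().hex)
        self.writer = filesystems.FileSystems.create(path)

    def process(self, element):
        self.writer.write(json.dumps(element) + '\n')

    def finish_bundle(self):
        if self.writer is not None:
            self.writer.close()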

Search via Python Search API timing out intermittently

We have an application that is basically just a form submission for requesting a team drive to be created. It's hosted on Google App Engine.
This timeout error is coming from a single field in the form that simply does typeahead for an email address. All of the names on the domain are indexed in the Datastore, about 300k entities; nothing is being pulled directly from the Directory API. After 10 seconds of searching (via the Python Google Search API), it times out. This is currently intermittent, but the errors have been increasing in frequency.
Error: line 280, in get_result raise _ToSearchError(e) Timeout: Failed to complete request in 9975ms
Essentially, speeding up the searches would resolve the issue. I looked at the code and I don't believe there is any room for improvement there. I am not sure if increasing the instance class would help; it is currently an F2. Perhaps there is another way to improve index efficiency, but I'm not entirely sure how one would do that. Any thoughts would be appreciated.
Search Code:
class LookupUsersorGrpService(object):
    '''
    lookupUsersOrGrps accepts various params and performs search
    '''
    def lookupUsersOrGrps(self, params):
        search_results_json = {}
        search_results = []
        directory_users_grps = GoogleDirectoryUsers()
        error_msg = 'Technical error'
        query = ''
        try:
            # Default few values if not present
            if ('offset' not in params) or (params['offset'] is None):
                params['offset'] = 0
            else:
                params['offset'] = int(params['offset'])
            if ('limit' not in params) or (params['limit'] is None):
                params['limit'] = 20
            else:
                params['limit'] = int(params['limit'])
            # Search related to field name
            query = self.appendQueryParam(q=query, p=params, qname='search_name', criteria=':', pname='query', isExactMatch=True, splitString=True)
            # Search related to field email
            query = self.appendQueryParam(q=query, p=params, qname='search_email', criteria=':', pname='query', isExactMatch=True, splitString=True)
            # Perform search
            log.info('Search initialized :"{}"'.format(query))
            # sort results by name ascending
            expr_list = [search.SortExpression(expression='name', default_value='', direction=search.SortExpression.ASCENDING)]
            # construct the sort options
            sort_opts = search.SortOptions(expressions=expr_list)
            # Prepare the search index
            index = search.Index(name="GoogleDirectoryUsers", namespace="1")
            search_query = search.Query(
                query_string=query.strip(),
                options=search.QueryOptions(
                    limit=params['limit'],
                    offset=params['offset'],
                    sort_options=sort_opts,
                    returned_fields=directory_users_grps.get_search_doc_return_fields()
                ))
            # Execute the search query
            search_result = index.search(search_query)
            # Start collecting the values
            total_cnt = search_result.number_found
            params['limit'] = len(search_result.results)
            # Prepare the response object
            for teamdriveDoc in search_result.results:
                teamdriveRecord = GoogleDirectoryUsers.query(GoogleDirectoryUsers.email == teamdriveDoc.doc_id).get()
                if teamdriveRecord:
                    if teamdriveRecord.suspended == False:
                        search_results.append(teamdriveRecord.to_dict())
            search_results_json.update({"users": search_results})
            search_results_json.update({"limit": params['limit'] if len(search_results) > 0 else '0'})
            search_results_json.update({"total_count": total_cnt if len(search_results) > 0 else '0'})
            search_results_json.update({"status": "success"})
        except Exception as e:
            log.exception("Error in performing search")
            search_results_json.update({"status": "failed"})
            search_results_json.update({"description": error_msg})
        return search_results_json

    ''' Retrieves the given param from dict and adds to query if exists
    '''
    def appendQueryParam(self, q='', p=[], qname=None, criteria='=', pname=None,
                         isExactMatch=False, splitString=False, defaultValue=None):
        if (pname in p) or (defaultValue is not None):
            if len(q) > 0:
                q += ' OR '
            q += qname
            if criteria:
                q += criteria
            if defaultValue is None:
                val = p[pname]
            else:
                val = defaultValue
            if splitString:
                val = val.replace("", "~")[1:-1]
            # Helps to retain passed argument as it is, example email
            if isExactMatch:
                q += "\"" + val + "\""
            else:
                q += val
        return q
An Index instance's search method accepts a deadline parameter, so you could use that to increase the time that you are willing to wait for the search to respond:
search_result = index.search(search_query, deadline=30)
The documentation doesn't specify an acceptable range of values for deadline, but other App Engine services tend to accept values up to 60 seconds.
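If longer deadlines still time out occasionally, wrapping the call in a small retry helper may also smooth over the intermittent failures; a rough sketch, assuming the generic search.Error base class covers the timeout you are seeing:
from google.appengine.api import search

def search_with_retry(index, query, retries=1, deadline=30):
    # retry transient search errors such as the intermittent timeout
    for attempt in range(retries + 1):
        try:
            return index.search(query, deadline=deadline)
        except search.Error:
            if attempt == retries:
                raise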

Having trouble parsing a .CSV file into a dict

I've done some simple .csv parsing in Python, but I have a new file structure that's giving me trouble. The input file is from a spreadsheet converted into a .CSV file. Here is an example of the input:
(example spreadsheet not reproduced here; each row has a set, layout, layer, and object column, with the set/layout cells left blank on continuation rows)
Each set can have many layouts, and each layout can have many layers. Each layer has only one name and object.
Here is the code I am using to parse it. I suspect it's a logic/flow-control problem, because I've parsed things like this before, just not this deeply nested. The first header row is skipped in the code. Any help appreciated!
import csv
import pprint

def import_layouts_schema(layouts_schema_file_name='C:\\layouts\\LAYOUT1.csv'):

    class set_template:
        def __init__(self):
            self.set_name = ''
            self.layout_name = ''
            self.layer_name = ''
            self.obj_name = ''

    def check_layout(st, row, layouts_schema):
        c = 0
        if st.layout_name == '':
            st.layer_name = row[c+2]
            st.obj_name = row[c+3]
            layer = {st.layer_name: st.obj_name}
            layout = {st.layout_name: layer}
            layouts_schema.update({st.set_name: layout})
        else:
            st.layout_name = row[c+1]
            st.layer_name = row[c+2]
            st.obj_name = row[c+3]
            layer = {st.layer_name: st.obj_name}
            layout = {st.layout_name: layer}
            layouts_schema.update({st.set_name: layout})
        return layouts_schema

    def layouts_schema_parsing(obj_list_raw1):  #, location_categories, image_schema, set_location):
        #------ init -----------------------------------
        skipfirst = True
        c = 0
        firstrow = True
        layouts_schema = {}
        end_flag = ''
        st = set_template()
        #---------- start parsing here -----------------
        print('Now parsing layouts schema list')
        for row in obj_list_raw1:
            #print('This is the row: ', row)
            if skipfirst == True:
                skipfirst = False
                continue
            if row[c] != '':
                st.set_name = row[c]
                st.layout_name = row[c+1]
                st.layer_name = row[c+2]
                st.obj_name = row[c+3]
                print('FOUND A NEW SET. SET details below:')
                print('Set name:', st.set_name, 'Layout name:', st.layout_name, 'Layer name:', st.layer_name, 'Object name:', st.obj_name)
            if firstrow == True:
                print('First row of layouts import!')
                layer = {st.layer_name: st.obj_name}
                layout = {st.layout_name: layer}
                layouts_schema = {st.set_name: layout}
                firstrow = False
                check_layout(st, row, layouts_schema)
                continue
            elif firstrow == False:
                print('Not the first row of layout import')
                layer = {st.layer_name: st.obj_name}
                layout = {st.layout_name: layer}
                layouts_schema.update({st.set_name: layout})
                check_layout(st, row, layouts_schema)
        return layouts_schema

    #begin subroutine main
    layouts_schema_file_name = 'C:\\Users\\jason\\Documents\\RAY\\layout_schemas\\ANIBOT_LAYOUTS_SCHEMA.csv'
    full_path_to_file = layouts_schema_file_name
    print('============ Importing LAYOUTS schema from: ', full_path_to_file, ' ==============')
    openfile = open(full_path_to_file)
    reader_ob = csv.reader(openfile)
    layout_list_raw1 = list(reader_ob)
    layouts_schema = layouts_schema_parsing(layout_list_raw1)
    print('=========== End of layouts schema import =========')
    return layouts_schema

layouts_schema = import_layouts_schema()
Feel free to throw away any part that doesn't work. I suspect I've gotten inside my own head a little bit here; a for loop or another while loop may do the trick. Ultimately I just want to parse the file into a dict with the same key structure shown, i.e. the final dict's first entry would look like:
{'RESTAURANT': {'RR_FACING1': {'BACKDROP': 'restaurant1'}}}
And the rest follow on from there. Ultimately I am going to use this key structure and the dict for other purposes. I just can't get the parsing down!
Wow, that's a lot of code!
Maybe try something simpler:
with open('file.csv') as f:
    keys = f.readline().split(';')  # assuming ";" is your csv fields separator
    for line in f:
        vals = line.split(';')
        d = dict(zip(keys, vals))
        print(d)
Then either make a better data file (without blanks), or have the parser remember the previous values.
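A rough sketch of that second idea (not from the original answer): carry the last non-empty set/layout cells forward with the csv module. The column order set, layout, layer, object is assumed from the question's code:
import csv

def parse_layouts(path):
    layouts = {}
    last_set, last_layout = '', ''
    with open(path) as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            set_name, layout_name, layer_name, obj_name = row[:4]
            # a blank cell means "same as the row above"
            last_set = set_name or last_set
            last_layout = layout_name or last_layout
            layouts.setdefault(last_set, {}).setdefault(last_layout, {})[layer_name] = obj_name
    return layouts

# e.g. {'RESTAURANT': {'RR_FACING1': {'BACKDROP': 'restaurant1'}}, ...}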
While I agree with @AK47 that the code review site may be the better approach, I've received so much help from SO that I'll try to give back a little: IMHO you are overthinking the problem. Please find below an approach that should get you going in the right direction and doesn't even require converting from Excel to CSV (I like the xlrd module, it's very easy to use). If you already have a CSV, just swap out the loop in the process_sheet() function. Basically, I just store the last value seen for "SET" and "LAYOUT", and if the current cell is different (and not empty), I set the new value. Hope that helps. And yes, you should think about a better data structure (redundancy is not always bad, if you can avoid empty cells :-) ).
import xlrd

def process_sheet(sheet: xlrd.sheet.Sheet):
    curr_set = ''
    curr_layout = ''
    for rownum in range(1, sheet.nrows):
        row = sheet.row(rownum)
        set_val = row[0].value.strip()
        layout_val = row[1].value.strip()
        if set_val != '' and set_val != curr_set:
            curr_set = set_val
        if layout_val != '' and layout_val != curr_layout:
            curr_layout = layout_val
        result = {curr_set: {curr_layout: {row[2].value: row[3].value}}}
        print(repr(result))

def main():
    # open a workbook (adapt your filename)
    # then get the first sheet (index 0)
    # and call the process function
    wbook = xlrd.open_workbook('/tmp/test.xlsx')
    sheet = wbook.sheet_by_index(0)
    process_sheet(sheet)

if __name__ == '__main__':
    main()

Managing exception handling

I have a class named ExcelFile; its job is to manage Excel files (read, extract data, and various other things for the stack).
I want to implement a system for managing errors/exceptions.
For example, ExcelFile has a method load(), which acts as a "setup":
def load(self):
    """
    Setup for excel file
    Load workbook, worksheet and other characteristics (data lines, header...)

    :return: Setup successfully or not
    :rtype: bool

    Current usage
    :Example:

    >>> excefile = ExcelFile('test.xls')
    >>> excefile.load()
    True
    >>> excefile.nb_rows()
    4
    """
    self.workbook = xlrd.open_workbook(self.url)
    self.sheet = self.workbook.sheet_by_index(0)
    self.header_row_index = self.get_header_row_index()
    if self.header_row_index == None:  # If file doesn't have header (or not valid)
        return False
    self.header_fields = self.sheet.row_values(self.header_row_index)
    self.header_fields_col_ids = self.get_col_ids(self.header_fields)  # Mapping between header fields and col ids
    self.nb_rows = self.count_rows()
    self.row_start_data = self.header_row_index + self.HEADER_ROWS
    return True
As you can see, I can encounter two different errors:
The file is not an Excel file (raises xlrd.XLRDError)
The file has an invalid header (so I return False)
I want to implement a good error-management system for ExcelFile, because this class is used a lot in the stack.
This is my first idea for handling that:
Implement a standard exception
class ExcelFileException(Exception):
    def __init__(self, message, type=None):
        self.message = message
        self.type = type

    def __str__(self):
        return "{} : {} ({})".format(self.__class__.__name__, self.message, self.type)
Rewrite the load method:
def load(self):
    """
    Setup for excel file
    Load workbook, worksheet and other characteristics (data lines, header...)

    :return: Setup successfully or not
    :rtype: bool

    Current usage
    :Example:

    >>> excefile = ExcelFile('test.xls')
    >>> excefile.load()
    True
    >>> excefile.nb_rows()
    4
    """
    try:
        self.workbook = xlrd.open_workbook(self.url)
    except xlrd.XLRDError as e:
        raise ExcelFileException("Unsupported file type", e.__class__.__name__)
    self.sheet = self.workbook.sheet_by_index(0)
    self.header_row_index = self.get_header_row_index()
    if self.header_row_index == None:  # If file doesn't have header (or not valid)
        raise ExcelFileException("Invalid or empty header")
    self.header_fields = self.sheet.row_values(self.header_row_index)
    self.header_fields_col_ids = self.get_col_ids(self.header_fields)  # Mapping between header fields and col ids
    self.nb_rows = self.count_rows()
    self.row_start_data = self.header_row_index + self.HEADER_ROWS
    return True
And this is an example in a calling method. A big constraint is that I have to manage a dict named "report" with long error descriptions in French, for customer success and others.
...
def foo():
    ...
    file = ExcelFile(location)
    try:
        file.load()
    except ExcelFileException as e:
        log.warn(e.__str__())
        if e.type == 'XLRDError':
            self.report['errors'] = 'Long description of the error, in French (error is about invalid file type)'
        else:
            self.report['errors'] = 'Long description of the error, in French (error is about invalid header)'
    ...
What do you think about that? Do you have a better way?
Thank you
You could change your exception to log the errors in your dict:
class ExcelFileException(Exception):
    def __init__(self, message, report, type=None):
        report['errors'].append(message)
        self.message = message
        self.type = type

    def __str__(self):
        return "{} : {} ({})".format(self.__class__.__name__, self.message, self.type)
When you raise an exception:
raise ExcelFileException("Invalid or empty header", report)
The errors will then be present in report['errors'].
Also, the error can sometimes be fixed by installing the missing optional dependency xlrd:
pip install xlrd
There are more Python packages available for working with Excel.
