I need to recursively walk through JSON files (POST responses from an API), extracting the strings that have ["text"] as a key: {"text": "this is a string"}.
I need to start parsing from the source that has the oldest date in its metadata, extract the strings from that source, then move to the second-oldest source, and so on. The JSON can be deeply nested, and the level where the strings sit can change from file to file.
Problem:
There are many keys called ["text"] and I don't need all of them; I need ONLY the ones whose values are strings. Better: the "text": "string" pairs I need are ALWAYS in the same object {} as a "type": "sentence" pair.
What I am asking
Modify the 2nd code below in order to recursively walk the file and extract ONLY the ["text"] values when they are in the same object {} together with "type":"sentence".
Below is a snippet of the JSON file (the parts I need are the sentence text and the metadata; everything else should not be extracted):
Link to full JSON sample: http://pastebin.com/0NS5BiDk
What I have done so far:
1) The easy way: convert the JSON to a string and search for the content between double quotes (""), because in all the JSON post responses the "strings" I need are the only ones that come between double quotes. However, this option prevents me from ordering the sources first, so it is not good enough.
r1 = s.post(url2, data=payload1)
j = str(r1.json())
sentences_list = re.findall(r'\"(.+?)\"', j)
numentries = 0
for sentence in sentences_list:
    numentries += 1
    print(sentence)
print(numentries)
2) The smarter way: recursively walk through the JSON and extract the ["text"] values.
def get_all(myjson, key):
    if type(myjson) is dict:
        for jsonkey in myjson:
            if type(myjson[jsonkey]) in (list, dict):
                get_all(myjson[jsonkey], key)
            elif jsonkey == key:
                print(myjson[jsonkey])
    elif type(myjson) is list:
        for item in myjson:
            if type(item) in (list, dict):
                get_all(item, key)

print(get_all(r1.json(), "text"))
It extracts all the values that have ["text"] as a key. Unfortunately the file contains other content (that I don't need) that also has ["text"] as a key, so it returns text I don't want.
Please advise.
UPDATE
I have written two pieces of code to sort the list of objects by a certain key. The first sorts by the 'text' of the XML, the second by the 'Comprising period from' value.
The first one works, but a few of the XMLs, even though they are higher in number, actually contain documents older than I expected.
For the second, the format of 'Comprising period from' is not consistent, and sometimes the value is missing entirely. It also gives me an error I cannot figure out: string indices must be integers.
# 1st code (it works but is not ideal)
j = r1.json()
rows = []
for row in j["tree"]["children"][0]["children"]:
    rows.append(row)
newlist = sorted(rows, key=lambda k: k['text'][-9:])
print(newlist)
# 2nd code: I need something that tolerates missing values and fixes
# the "string indices must be integers" error
rows = []
for row in j["tree"]["children"][0]["children"]:
    rows.append(row)

def date(key):
    return dparser.parse(' '.join(key.split(' ')[-3:]), fuzzy=True)

def order(list_to_order):
    try:
        return sorted(list_to_order,
                      key=lambda k: k[date(["metadata"][0]["value"])])
    except ValueError:
        return 0

print(order(rows))
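As an aside, the "string indices must be integers" error comes from the lambda indexing the literal list ["metadata"] instead of each row k. Below is a minimal sketch of a more defensive sort key; the path row["metadata"][0]["value"] is an assumption taken from that lambda, so adjust it to the real JSON:

from datetime import datetime
import dateutil.parser as dparser

def sort_key(row):
    # Assumed layout: row["metadata"][0]["value"] holds the
    # 'Comprising period from ...' string; adjust to your data.
    try:
        value = row["metadata"][0]["value"]
        return dparser.parse(' '.join(value.split(' ')[-3:]), fuzzy=True)
    except (KeyError, IndexError, TypeError, ValueError):
        return datetime.max  # rows with missing/unparseable dates sort last

ordered = sorted(rows, key=sort_key)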
I think this will do what you want as far as selecting the right strings goes. I also changed the type checking to use isinstance(), which is considered the better approach because it supports object-oriented polymorphism.
import json

_NUL = object()  # unique value guaranteed to never be in JSON data

def get_all(myjson, kind, key):
    """Recursively find all the values of key in all the dictionaries in
    myjson that have a "type" key equal to kind.
    """
    if isinstance(myjson, dict):
        key_value = myjson.get(key, _NUL)  # _NUL if key not present
        if key_value is not _NUL and myjson.get("type") == kind:
            yield key_value
        for jsonkey in myjson:
            jsonvalue = myjson[jsonkey]
            for v in get_all(jsonvalue, kind, key):  # recursive
                yield v
    elif isinstance(myjson, list):
        for item in myjson:
            for v in get_all(item, kind, key):  # recursive
                yield v

with open('json_sample.txt', 'r') as f:
    data = json.load(f)

numentries = 0
for text in get_all(data, "sentence", "text"):
    print(text)
    numentries += 1
print('\nNumber of "text" entries found: {}'.format(numentries))
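If the ordering matters too, get_all() being a generator makes it easy to combine with the sort from the UPDATE: order the top-level sources first, then pull the sentence texts out of each. A sketch, reusing the hypothetical sort_key from above and assuming the same "tree"/"children" layout as the question's snippets:

# Oldest source first, then its sentence texts (all paths are assumptions).
for source in sorted(data["tree"]["children"][0]["children"], key=sort_key):
    for text in get_all(source, "sentence", "text"):
        print(text)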
Related
I'm trying to parse a JSON response I get from a web API. The problem is that the JSON can have varying levels, which translate into dictionaries of dictionaries, with an occasional list in the mix.
Example (this works):
for r in json_doc['results']:
    yield r.get('lastLoginLocation', {}).get('coordinates', {}).get('lat', {})
Can I do the same thing when a list of dictionaries is in there? I'd like it to return the specified key value from the first dictionary in the list if the list is populated, or return '{}' if the list is empty.
Example (this does NOT work):
yield r.get('profile',{}).get('phones',{})[0].get('default',{})
Quite simply: in `get('phones')`, use a list containing one empty dict as the default, i.e.:
yield r.get('profile',{}).get('phones',[{}])[0].get('default',{})
Note that this will still break with an IndexError if r["profile"]["phones"] is an empty list. You could work around that with an `or`, i.e.:
yield (r.get('profile',{}).get('phones',[{}]) or [{}])[0].get('default',{})
but it's getting really messy (and builds two empty dicts and a list for no good reason), so you'd probably be better off with more explicit code, cf. Pankaj Singhal's answer.
Your approach is quite suboptimal: a missing profile key in the root dictionary does not terminate the search but keeps descending unnecessarily, even though an empty dict obviously contains no further keys.
You could instead use a try/except:
def get_value(container, keys=None):
    if keys is None:
        raise ValueError
    for r in container:
        item = r
        for i in keys:
            try:
                item = item[i]
            except (IndexError, KeyError):
                yield {}
                break
        else:  # finished cleanly
            yield item
get_value(json_doc['results'], keys=['profile', 'phones', 0, 'default'])
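get_value() is a generator, so you iterate over what it yields. A small usage sketch with made-up data standing in for json_doc['results']:

# Hypothetical sample standing in for json_doc['results'].
results = [
    {'profile': {'phones': [{'default': '555-0100'}]}},
    {'profile': {'phones': []}},  # empty list -> IndexError -> yields {}
]
for value in get_value(results, keys=['profile', 'phones', 0, 'default']):
    print(value)
# 555-0100
# {}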
This get_nested helper function might be what you want. I used a similar technique in some XML parsing code in the past. It keeps the implementation details from obscuring what your code is actually trying to achieve.
from contextlib import suppress

def get_nested(list_or_dict, keys, default={}):
    """Get value from nested list_or_dict using keys. If the current
    level is a dict, look up the current key in it. If the current
    level is a list, look up the current key in the first element of
    the list. Return default for any errors.
    """
    def get(item, key):
        if hasattr(item, 'get') and key in item:
            return item[key]
        raise KeyError

    for key in keys:
        with suppress(KeyError):
            list_or_dict = get(list_or_dict, key)
            continue
        with suppress(IndexError, KeyError):
            list_or_dict = get(list_or_dict[0], key)
            continue
        break
    else:
        return list_or_dict
    return default
Your code to call it would be like this:
for r in json_doc['results']:
    yield get_nested(r, ('lastLoginLocation', 'coordinates', 'lat'))
    yield get_nested(r, ('profile', 'phones', 'default'))
I am working on collecting all the text from several .yaml files into a single new YAML file that will contain the English translations, which someone can then translate into Spanish.
Each YAML file has a lot of nested text. I want to print the full 'path', i.e. all the keys, along with the value, for each value in the YAML file. Here's an example input for a .yaml file that lives in the myproject.section.more_information file:
default:
  heading: Here’s A Title
  learn_more:
    title: Title of Thing
    url: www.url.com
    description: description
    opens_new_window: true
and here's the desired output:
myproject.section.more_information.default.heading: Here’s a Title
myproject.section.more_information.default.learn_more.title: Title of Thing
myproject.section.more_information.default.learn_more.url: www.url.com
myproject.section.more_information.default.learn_more.description: description
myproject.section.more_information.default.learn_more.opens_new_window: true
This seems like a good candidate for recursion, so I've looked at examples such as this answer
However, I want to preserve all of the keys that lead to a given value, not just the last key before the value. I'm currently using PyYAML to read/write YAML.
Any tips on how to keep track of each key as I check whether the item is a dictionary, so that I can return all the keys associated with each value?
What you want to do is flatten nested dictionaries. This would be a good place to start: Flatten nested Python dictionaries, compressing keys
In fact, I think the code snippet in the top answer would work for you if you just changed the `sep` argument to `'.'`.
edit:
Check this for a working example based on the linked SO answer: http://ideone.com/Sx625B
import collections

some_dict = {
    'default': {
        'heading': 'Here’s A Title',
        'learn_more': {
            'title': 'Title of Thing',
            'url': 'www.url.com',
            'description': 'description',
            'opens_new_window': 'true'
        }
    }
}

def flatten(d, parent_key='', sep='_'):
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, collections.MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

results = flatten(some_dict, parent_key='', sep='.')
for item in results:
    print(item + ': ' + results[item])
If you want it in order, you'll need an OrderedDict though.
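For example, a common PyYAML recipe (a sketch, not part of the linked answer; the loader subclass and file name are my own) makes PyYAML construct mappings as OrderedDict, so flatten() sees the keys in file order:

import collections
import yaml  # PyYAML

def ordered_load(stream):
    """Load YAML with mappings as OrderedDict, preserving key order."""
    class OrderedLoader(yaml.SafeLoader):
        pass

    def construct_mapping(loader, node):
        loader.flatten_mapping(node)
        return collections.OrderedDict(loader.construct_pairs(node))

    OrderedLoader.add_constructor(
        yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, construct_mapping)
    return yaml.load(stream, OrderedLoader)

with open('more_information.yaml') as f:  # hypothetical file name
    some_dict = ordered_load(f)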
Walking over nested dictionaries begs for recursion, and handing the "prefix" of the "path" down into each call saves you from having to do any manipulation on the segments of your path afterwards (as @Prune suggests doing).
There are a few things to keep in mind that make this problem interesting:
- Because you are using multiple files, the same path can occur in more than one file, which you need to handle (at least by throwing an error, as otherwise you might silently lose data). In my example I gather the values into a list.
- Dealing with special keys: non-string keys (convert them?), empty strings, and keys containing a '.'. My example reports these and exits.
Example code using ruamel.yaml ¹:
import sys
import glob
import ruamel.yaml
from ruamel.yaml.comments import CommentedMap, CommentedSeq
from ruamel.yaml.compat import string_types, ordereddict

class Flatten:
    def __init__(self, base):
        self._result = ordereddict()  # key to list of tuples of (value, comment)
        self._base = base

    def add(self, file_name):
        data = ruamel.yaml.round_trip_load(open(file_name))
        self.walk_tree(data, self._base)

    def walk_tree(self, data, prefix=None):
        """
        this is based on ruamel.yaml.scalarstring.walk_tree
        """
        if prefix is None:
            prefix = ""
        if isinstance(data, dict):
            for key in data:
                full_key = self.full_key(key, prefix)
                value = data[key]
                if isinstance(value, (dict, list)):
                    self.walk_tree(value, full_key)
                    continue
                # value is a scalar
                comment_token = data.ca.items.get(key)
                comment = comment_token[2].value if comment_token else None
                self._result.setdefault(full_key, []).append((value, comment))
        elif isinstance(data, list):
            print("don't know how to handle lists", prefix)
            sys.exit(1)

    def full_key(self, key, prefix):
        """
        check here for valid keys
        """
        if not isinstance(key, string_types):
            print('key has to be string', repr(key), prefix)
            sys.exit(1)
        if '.' in key:
            print('dot in key not allowed', repr(key), prefix)
            sys.exit(1)
        if key == '':
            print('empty key not allowed', repr(key), prefix)
            sys.exit(1)
        return prefix + '.' + key

    def dump(self, out):
        res = CommentedMap()
        for path in self._result:
            values = self._result[path]
            if len(values) == 1:  # single value for path
                res[path] = values[0][0]
                if values[0][1]:
                    res.yaml_add_eol_comment(values[0][1], key=path)
                continue
            res[path] = seq = CommentedSeq()
            for index, value in enumerate(values):
                seq.append(value[0])
                if value[1]:
                    seq.yaml_add_eol_comment(value[1], key=index)
        ruamel.yaml.round_trip_dump(res, out)

flatten = Flatten('myproject.section.more_information')
for file_name in glob.glob('*.yaml'):
    flatten.add(file_name)
flatten.dump(sys.stdout)
If you have an additional input file:
default:
  learn_more:
    commented: value    # this value has a comment
    description: another description
then the result is:
myproject.section.more_information.default.heading: Here’s A Title
myproject.section.more_information.default.learn_more.title: Title of Thing
myproject.section.more_information.default.learn_more.url: www.url.com
myproject.section.more_information.default.learn_more.description:
- description
- another description
myproject.section.more_information.default.learn_more.opens_new_window: true
myproject.section.more_information.default.learn_more.commented: value # this value has a comment
Of course if your input doesn't have double paths, your output won't have any lists.
Using string_types and ordereddict from ruamel.yaml makes this Python 2 and Python 3 compatible (you don't indicate which version you are using).
The ordereddict preserves the original key ordering, but this is of course dependent on the processing order of the files. If you want the paths sorted, just change dump() to use:
for path in sorted(self._result):
Also note that the comment on the 'commented' dictionary entry is preserved.
¹ ruamel.yaml is a YAML 1.2 parser that preserves comments and other data on round-tripping (PyYAML does most parts of YAML 1.1). Disclaimer: I am the author of ruamel.yaml
Keep a simple list of strings: the most recent key at each indentation depth. When you move from one line to the next with no change in depth, simply replace the item at the end of the list. When you "out-dent", pop the last item off the list. When you indent, append to the list.
Then, each time you hit a colon, the corresponding key is the concatenation of the strings in the list, something like:
'.'.join(key_list)
Does that get you moving at an honorable speed?
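A rough sketch of that idea (my own illustration, not Prune's code; it assumes two-space indentation and simple `key: value` lines, with no lists or multi-line scalars):

def flatten_lines(lines, base='myproject.section.more_information'):
    key_list = [base]                 # most recent key at each depth
    for line in lines:
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip())) // 2 + 1
        key, _, value = line.strip().partition(':')
        del key_list[depth:]          # pop on out-dent (or same depth)
        key_list.append(key)          # push the current key
        if value.strip():             # a leaf line: emit the full path
            print('.'.join(key_list) + ':' + value)

with open('more_information.yaml') as f:  # hypothetical file name
    flatten_lines(f)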
I am using the following set of generators to parse XML into CSV:
import xml.etree.cElementTree as ElementTree
from xml.etree.ElementTree import XMLParser
import csv

def flatten_list(aList, prefix=''):
    for i, element in enumerate(aList, 1):
        eprefix = "{}{}".format(prefix, i)
        if element:
            # treat like dict
            if len(element) == 1 or element[0].tag != element[1].tag:
                yield from flatten_dict(element, eprefix)
            # treat like list
            elif element[0].tag == element[1].tag:
                yield from flatten_list(element, eprefix)
        elif element.text:
            text = element.text.strip()
            if text:
                yield eprefix.rstrip('.'), element.text

def flatten_dict(parent_element, prefix=''):
    prefix = prefix + parent_element.tag
    if parent_element.items():
        for k, v in parent_element.items():
            yield prefix + k, v
    for element in parent_element:
        eprefix = element.tag
        if element:
            # treat like dict - we assume that if the first two tags
            # in a series are different, then they are all different.
            if len(element) == 1 or element[0].tag != element[1].tag:
                yield from flatten_dict(element, prefix=prefix)
            # treat like list - we assume that if the first two tags
            # in a series are the same, then the rest are the same.
            else:
                # here, we put the list in a dictionary; the key is the
                # tag name the list elements all share in common, and
                # the value is the list itself
                yield from flatten_list(element, prefix=eprefix)
            # if the tag has attributes, add those to the dict
            if element.items():
                for k, v in element.items():
                    yield eprefix + k, v
        # this assumes that if you've got an attribute in a tag,
        # you won't be having any text. This may or may not be a
        # good idea -- time will tell. It works for the way we are
        # currently doing XML configuration files...
        elif element.items():
            for k, v in element.items():
                yield eprefix + k, v
        # finally, if there are no child tags and no attributes, extract
        # the text
        else:
            yield eprefix, element.text

def makerows(pairs):
    headers = []
    columns = {}
    for k, v in pairs:
        if k in columns:
            columns[k].extend((v,))
        else:
            headers.append(k)
            columns[k] = [k, v]
    m = max(len(c) for c in columns.values())
    for c in columns.values():
        c.extend(' ' for i in range(len(c), m))
    L = [columns[k] for k in headers]
    rows = list(zip(*L))
    return rows

def main():
    with open('2-Response_duplicate.xml', 'r', encoding='utf-8') as f:
        xml_string = f.read()
    xml_string = xml_string.replace('&', '')  # optional, to remove ampersands
    root = ElementTree.XML(xml_string)
    # for key, value in flatten_dict(root):
    #     key = key.rstrip('.').rsplit('.', 1)[-1]
    #     print(key, value)
    writer = csv.writer(open("try5.csv", 'wt'))
    writer.writerows(makerows(flatten_dict(root)))

if __name__ == "__main__":
    main()
One column of the CSV, when opened in Excel, looks like this:
ObjectGuid
2adeb916-cc43-4d73-8c90-579dd4aa050a
2e77c588-56e5-4f3f-b990-548b89c09acb
c8743bdd-04a6-4635-aedd-684a153f02f0
1cdc3d86-f9f4-4a22-81e1-2ecc20f5e558
2c19d69b-26d3-4df0-8df4-8e293201656f
6d235c85-6a3e-4cb3-9a28-9c37355c02db
c34e05de-0b0c-44ee-8572-c8efaea4a5ee
9b0fe8f5-8ec4-4f13-b797-961036f92f19
1d43d35f-61ef-4df2-bbd9-30bf014f7e10
9cb132e8-bc69-4e4f-8f29-c1f503b50018
24fd77da-030c-4cb7-94f7-040b165191ce
0a949d4f-4f4c-467e-b0a0-40c16fc95a79
801d3091-c28e-44d2-b9bd-3bad99b32547
7f355633-426d-464b-bab9-6a294e95c5d5
This is due to the fact that there are 14 tags with name ObjectGuid. For example, one of these tags looks like this:
<ObjectGuid>2adeb916-cc43-4d73-8c90-579dd4aa050a</ObjectGuid>
My question: is there an efficient method to enumerate the headers (the keys) so that each repeated key is numbered, together with its corresponding value (the text in the XML data structure)? It would be displayed in Excel as follows:
ObjectGuid_1 ObjectGuid_2 ObjectGuid_3 etc.
Please let me know if there is any other information that you need from me (such as sample XML). Thank you for your help.
It is a mistake to add an element, attribute, or annotative descriptor to the data set itself for the purpose of identity. Normalizing the data should only be done if you own that data and can guarantee that doing so will not have any negative effect on other consumers (ones relying on attribute order to manipulate the DOM). Besides, what is the point of using a dict or nested dicts (which I don't quite get either) if the efficiency of the hash-table lookup is taken right back by making O(n) checks for this new attribute? The point of hashing is random lookup.
If it's simply structured (key, value) pairs you need, which makes sense here, why not use some other contiguous data structure but treat it like a dictionary, say a named tuple?
A second solution, if you want to add additional state, is to throw your generator into a class:
class Order:
    def __init__(self, lines):
        self.lines = lines
        self.order = []          # accumulated (line number, line) state

    def __iter__(self):
        for i, line in enumerate(self.lines, 1):
            self.order.append((i, line))
            yield line

with open('somefile.csv') as f:
    lines = Order(f)
Is messing with the data a harmless conversion? For example, suppose we create a conversion table (see below).
Well, that's fine; that is, until one of the values is blank...
import csv

field_types = [('x', float),
               ('y', float)]

with open('some.csv') as f:
    for row in csv.DictReader(f):
        row.update((key, conversion(row[key]))
                   for key, conversion in field_types)

{'x': ..., 'y': 2.2} -- that works, until there is an empty data point (float('') raises ValueError). Kaboom.
So my suggestion would be not to change or add to the data, but to change the algorithm that deals with it. If the problem is order, why not simply treat, say, a tuple as a named tuple, similar to a dictionary? The caveat is immutability, which however makes sense with uniform data.
(I don't understand the nested dictionary; that is for the header values, yes?) Or you could just skip the first row :p.
So just skip the first row:

rows = iter(lines)
next(rows)  # header consumed; problem solved
*** Notables
- To iterate over multiple sequences in parallel:

h = ['a', 'b', 'c']
x = [1, 2, 3]
for i in zip(h, x):
    print(i)
# ('a', 1)
# ('b', 2)
# ('c', 3)
- Chaining:

from itertools import chain
a = [1, 2, 3]
b = ['x', 'y', 'z']
for item in chain(a, b):
    print(item)  # 1, 2, 3, 'x', 'y', 'z' -- one per line
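Circling back to the original ObjectGuid_1, ObjectGuid_2, ... question: one way to change the algorithm rather than the data, as suggested above, is to number repeated keys as they stream out of flatten_dict(), so makerows() sees each occurrence as its own column. A hedged sketch; enumerate_keys is my own hypothetical helper:

from collections import Counter

def enumerate_keys(pairs):
    """Suffix each repeated key with a running count: ObjectGuid_1, ..."""
    seen = Counter()
    for k, v in pairs:
        seen[k] += 1
        yield '{}_{}'.format(k, seen[k]), v

# usage (hypothetical), replacing the original writerows() call:
# writer.writerows(makerows(enumerate_keys(flatten_dict(root))))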
I am new to Python and I am using Python 2.7. I have two dictionaries that have the same keys. One dictionary is always the same. The other may not have all the keys that are in the first dictionary. I have tried many variations based on other Stack Overflow questions on this topic.
I have even tried testing with the following code. The part that doesn't work is comparing the keys across the two dictionaries.
Sample Dict:
clubDict = {'001': 'Alabama', '066': 'MountainWest', '602': 'The Auto Club Group'}
data = {'001': 6021, '066': 1134}
As you can see, there is no key 602 in the data dictionary. The data dict is built in this code from a CSV file whose numbers get summed into totals. Much of this code came from earlier Stack Overflow answers.
The code worked when I didn't have the if/elifs in it, but the print statement would give different results; this is due to missing keys in data{}. I added an if/else to try to compare the keys using pass, but it still wouldn't compare, so I have tried what you see now.
Here is part of my code:
def getTotals():
    result = defaultdict(int)
    regexp = re.compile(r'(?:ttp_ws_sm|ttpv1)_(\d+)_')
    with open(os.path.join(source, 'ttp_13_08.csv'), 'r') as f:
        rows = csv.reader(f)
        # adds total values for each club code (from csv file)
        for row in rows:
            match = regexp.search(row[1])
            if match:
                result[match.group(1)] += int(row[13])
    for key, value in result.items():
        data.update(result.items())
    for value, key in clubDict.items():
        #f = open(output_path + filename, 'a')
        shared_keys = set(clubDict.keys()).union(data.keys())
        if key not in data:
            print "No counts avialable"
        elif key not in clubDict:
            print "Check for Club code"
        elif data[key] == clubDict[key]:
            print 'match'  #, '{0}, {1}, {2}'.format(key, value, data[value])
        else:
            print '{0}, {1}, {2}'.format(key, value, data[value])
    file.close

def main():
    try:
        getTotals()
    except:
        print "No more results"
The results aren't what I need. This is the desired result:
Alabama 001 6021
MountainWest 066 1134
I have reviewed many Q&As on Stack Overflow and cannot seem to get these results; I could just be phrasing my search incorrectly.
You swapped the value and key in your loop:
for value, key in clubDict.items():
.items() gives you (key, value) tuples.
Reworking your code a little to remove redundancies:
def getTotals():
    result = defaultdict(int)
    regexp = re.compile(r'(?:ttp_ws_sm|ttpv1)_(\d+)_')
    with open(os.path.join(source, 'ttp_13_08.csv'), 'r') as f:
        rows = csv.reader(f)
        # adds total values for each club code (from csv file)
        for row in rows:
            match = regexp.search(row[1])
            if match:
                result[match.group(1)] += int(row[13])
    data.update(result)
    for key in clubDict.viewkeys() & data:
        club_value, data_value = clubDict[key], data[key]
        if club_value == data_value:
            print 'match'
        else:
            print '{0}, {1}, {2}'.format(key, club_value, data_value)
You already computed a set of keys, but you used union() (keys present in either dict) where you want the intersection (keys present in both), and you need to loop over that intersection itself, not over clubDict.
I used dict.viewkeys() to get a set-like object directly, which can be intersected with another iterable, like the data dictionary, very efficiently, without intermediary results.
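A quick sanity check with the question's sample dictionaries (Python 2.7, per the question). Note that the club names and the totals can never be equal, so this variation just formats the desired "name code total" lines:

clubDict = {'001': 'Alabama', '066': 'MountainWest', '602': 'The Auto Club Group'}
data = {'001': 6021, '066': 1134}

for key in sorted(clubDict.viewkeys() & data):
    # clubDict stores names and data stores totals, so instead of
    # comparing them, print them in the desired order.
    print '{0} {1} {2}'.format(clubDict[key], key, data[key])
# Alabama 001 6021
# MountainWest 066 1134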