Assume you have a data set, something like a CSV file, that contains mildly sensitive information: say, who passed a note to whom in a 12th-grade English class. While it's not a crisis if this data got out, it would be nice to strip out the identifying information so the data could be made public, shared with collaborators, etc. The data looks something like this:
Giver,Recipient:
Anna,Joe
Anna,Mark
Mark,Mindy
Mindy,Joe
How would you process this list, assign each name a unique but arbitrary identifier, then strip out the names and replace them with that identifier in Python, such that you end up with something like:
1,2
1,3
3,4
4,2
You can use hash() to generate a unique arbitrary identifier; it will always return the same integer for a particular string (within one run of the interpreter; note that in Python 3, string hashes are randomized between runs unless PYTHONHASHSEED is set):
with open("data1.txt") as f:
    lis = [x.split(",") for x in f]
    items = [map(lambda y: hash(y.strip()), x) for x in lis]
    for x in items:
        print ",".join(map(str, x))
-1319295970,1155173045
-1319295970,-1963774321
-1963774321,-1499251772
-1499251772,1155173045
Or you can also use itertools.count:
In [79]: from itertools import count, chain

In [80]: c=count(1)

In [81]: with open("data1.txt") as f:
    ...:     lis=[map(str.strip,x.split(",")) for x in f]
    ...:     dic={}
    ...:     for x in set(chain(*lis)):
    ...:         dic.setdefault(x.strip(),next(c))
    ...:     for x in lis:
    ...:         print ",".join(str(dic[y.strip()]) for y in x)
    ...:
3,2
3,4
4,1
1,2
Or, improving my previous answer using the unique_everseen recipe from the itertools docs, you can get the exact answer:
In [84]: c=count(1)

In [85]: def unique_everseen(iterable, key=None):
    ...:     # needs: from itertools import ifilterfalse (filterfalse in Python 3)
    ...:     seen = set()
    ...:     seen_add = seen.add
    ...:     if key is None:
    ...:         for element in ifilterfalse(seen.__contains__, iterable):
    ...:             seen_add(element)
    ...:             yield element
    ...:     else:
    ...:         for element in iterable:
    ...:             k = key(element)
    ...:             if k not in seen:
    ...:                 seen_add(k)
    ...:                 yield element
In [86]: with open("data1.txt") as f:
    ...:     lis=[map(str.strip,x.split(",")) for x in f]
    ...:     dic={}
    ...:     for x in unique_everseen(chain(*lis)):
    ...:         dic.setdefault(x.strip(),next(c))
    ...:     for x in lis:
    ...:         print ",".join(str(dic[y.strip()]) for y in x)
    ...:
1,2
1,3
3,4
4,2
names = """
Anna,Joe
Anna,Mark
Mark,Mindy
Mindy,Joe
"""
nameset = set((",".join(names.strip().splitlines())).split(","))
for i, name in enumerate(nameset):
    # note: str.replace will also hit any name that is a substring of another name
    names = names.replace(name, str(i))
print names
2,1
2,3
3,0
0,1
You could use hash to get a unique ID for each name, or you could use a dictionary mapping names to their values (if you want the numbers to be as in your example):
data = [("Anna", "Joe"), ("Anna", "Mark"), ("Mark", "Mindy"), ("Mindy", "Joe")]
names = {}

def anon(name):
    if name not in names:
        names[name] = len(names) + 1
    return names[name]

result = []
for n1, n2 in data:
    result.append((anon(n1), anon(n2)))

print names
print result
When run, this will give:
{'Mindy': 4, 'Joe': 2, 'Anna': 1, 'Mark': 3}
[(1, 2), (1, 3), (3, 4), (4, 2)]
First, read your file into a list of rows:
import csv

with open('myFile.csv') as f:
    rows = [row for row in csv.reader(f)]
At this point, you could build a dict to hold the mapping:
nameSet = set()
for row in rows:
for name in row:
nameSet.add(name)
map = dict((name, i) for i, name in enumerate(nameSet))  # note: "map" shadows the builtin of the same name
Alternatively, you could build the dict directly:
nextID = 0
map = {}
for row in rows:
    for name in row:
        if name not in map:
            map[name] = nextID
            nextID += 1
Either way, you go through the rows again and apply the mapping:
output = [[map[name] for name in row] for row in rows]
To genuinely anonymize the data, you need random aliases for the names. Hashes are good for that, but if you just want to map each name to an integer, you could do something like this:
from random import shuffle
data = [("Anna", "Joe"), ("Anna", "Mark"), ("Mark", "Mindy"), ("Mindy", "Joe")]
names = list(set(x for pair in data for x in pair))
shuffle(names)
aliases = dict((k, v) for v, k in enumerate(names))
munged = [(aliases[a], aliases[b]) for a, b in data]
That'll give you something like:
>>> data
[('Anna', 'Joe'), ('Anna', 'Mark'), ('Mark', 'Mindy'), ('Mindy', 'Joe')]
>>> names
['Mindy', 'Joe', 'Anna', 'Mark']
>>> aliases
{'Mindy': 0, 'Joe': 1, 'Anna': 2, 'Mark': 3}
>>> munged
[(2, 1), (2, 3), (3, 0), (0, 1)]
You can then (if you need to) get the name from the alias, and vice versa:
>>> aliases["Joe"]
1
>>> names[2]
'Anna'
I want to be able to print the top three values in a dictionary created in another function, where there may be repeating values.
For example, if I have a dictionary d = { a:1, b:2, c:3, d:3, e:4 } I would only want a, b, and c returned.
This is what I currently have, but the output would be a, b, c, d. I don't want to remove d from the dictionary, I just don't want it returned when I run this function.
def top3(filename: str):
    """
    Takes dict defined in wd_inventory, identifies top 3 words in dict
    :param filename:
    :return:
    """
    d = max_frequency(filename)
    x = list(d.values())
    x.sort(reverse=True)
    y = set(x)
    x = x[0:3]
    for i in x:
        for j in d.keys():
            if d[j] == i:
                print(str(j) + " : " + str(d[j]))
    return
One solution could be the following:
d = { "a":3, "b":4, "c":2, "d":5, "e":1}
print(sorted(d.items(), key=lambda x: x[1])[:3])
OUTPUT
[('e', 1), ('c', 2), ('a', 3)]
Note that this returns the true top 3 entries by value (the three smallest values), not necessarily the entries whose values are 1, 2 and 3.
EDIT
I don't know what repeating value means exactly, but let's assume that in a dictionary like:
d = {"a":1, "b": 2, "c": 3, "d": 1, "e": 1}
You would like to print just a, b and c (given that d and e repeat the same value as a)
You could use the following approach:
from collections import defaultdict

res = defaultdict(list)
for key, val in sorted(d.items()):
    res[val].append(key)

print([y[0] for x, y in res.items()])
OUTPUT
['a', 'b', 'c']
You can use heapq.nsmallest() to get the n smallest values in an iterable. This might be especially useful if the dict is very large, because it saves sorting a whole list only to select just three elements of it.
from heapq import nsmallest
from operator import itemgetter

def top3(dct):
    return nsmallest(3, dct.items(), key=itemgetter(1))

dct = {'a': 1, 'b': 2, 'c': 3, 'd': 3, 'e': 4}
for k, v in top3(dct):
    print(f"{k}: {v}")
Output
a: 1
b: 2
c: 3
Due credit: I copied parts of j1-lee's code to use as a template.
[edited]
Sorry, I had overlooked that the smallest number has the highest status.
The code now sorts the dictionary, which creates a list of tuples.
dic = {'aaa':3, 'xxx':1, 'ccc':8, 'yyy': 4, 'kkk':12}
res = sorted(dic.items(), key=lambda x: x[1])
print(res[:3])
result is:
[('xxx', 1), ('aaa', 3), ('yyy', 4)]
I would like to read CSV data which contains list entries with a label and two corresponding data points. In all, there are three labels: N, M, U.
I would like to create a dict with a key for each label and all the corresponding data points in a list as the value for that key. I tried the code below, but have the problem that it returns a dict like {"N": [all data points]}, i.e. it assigns every data point to the label N and doesn't create new keys for M and U.
Does anybody see the problem here?
with open('./data.csv', 'r') as i:
    D = {}
    for line in i:
        datatuple = tuple(line[2:-1].split(","))
        floattuple = (float(datatuple[0]), float(datatuple[1]))
        label = line[:1]
        if label in D:
            D[label].append(floattuple)
        else:
            D[label] = [floattuple]
    return D
Example data from the csv:
Thanks!
Your problem is the exact reason Python's dict has the .setdefault() method.
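As a minimal sketch of that idea (with made-up rows standing in for the file), setdefault replaces the whole if/else dance:

```python
rows = ["N,1.0,2.0", "M,3.5,4.5", "N,5.0,6.0"]  # stand-in for lines read from the csv

D = {}
for line in rows:
    label, x, y = line.split(",")
    # setdefault returns the existing list for label, or inserts and
    # returns a new empty list, which we can append to immediately
    D.setdefault(label, []).append((float(x), float(y)))

print(D)  # {'N': [(1.0, 2.0), (5.0, 6.0)], 'M': [(3.5, 4.5)]}
```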
First, let's define a generator to generate some random data
In [28]: def lines():
...: from random import random, randrange
...: for _ in range(12):
...: key = {0:'M', 1:'N', 2:'U'}[randrange(3)]
...: yield ','.join((
...: key,
...: "%+5.3f"%(random()*10-5),
...: "%+5.3f"%(random()*10-5)
...: ))
Then, just as you would read lines from a file, we read lines from the generator and update our dictionary using the setdefault() method: if the key is new, it provides a default value (here an empty list) that you can immediately append the x, y point to. (I have placed some prints in the code so that you can check its correctness.)
In [29]: d = {}
...:
...: for line in lines():
...: print(line)
...: key, x, y = line.split(',')
...: d.setdefault(key, []).append((float(x), float(y)))
...: print(*((k+': '+', '.join(str(t) for t in d[k])) for k in d), sep='\n')
M,-0.141,+1.755
M,+0.088,+3.354
N,+3.295,-3.847
U,+1.771,-3.268
M,-4.215,-4.499
U,-2.647,+1.218
U,-0.039,-0.357
U,+3.311,-3.312
N,-0.015,+2.039
N,-0.157,+3.319
N,-4.088,-0.914
U,+4.266,+4.863
M: (-0.141, 1.755), (0.088, 3.354), (-4.215, -4.499)
N: (3.295, -3.847), (-0.015, 2.039), (-0.157, 3.319), (-4.088, -0.914)
U: (1.771, -3.268), (-2.647, 1.218), (-0.039, -0.357), (3.311, -3.312), (4.266, 4.863)
This should do the job:
# i = ["N,1,2", "U,3,4", "U,5,6"]
D = {}
with open('./data.csv', 'r') as i:
    for line in i:
        line_list = line.split(",")
        datatuple = tuple(map(float, line_list[1:]))
        label = line_list[0]
        D[label] = D.get(label, list()) + [datatuple]
return D
Using the example data i = ["N,1,2", "U,3,4", "U,5,6"] this results in {'N': [(1.0, 2.0)], 'U': [(3.0, 4.0), (5.0, 6.0)]}.
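A variant of the same loop using collections.defaultdict avoids rebuilding the list on every append (D.get(label, list()) + [datatuple] creates a fresh list each time); a sketch with the same stand-in data:

```python
from collections import defaultdict

i = ["N,1,2", "U,3,4", "U,5,6"]  # stand-in for the file's lines

D = defaultdict(list)
for line in i:
    # split off the label; the remaining fields are the data points
    label, *values = line.split(",")
    D[label].append(tuple(map(float, values)))

print(dict(D))  # {'N': [(1.0, 2.0)], 'U': [(3.0, 4.0), (5.0, 6.0)]}
```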
An arguably better option would be to use pandas read_csv. Depending on the size of your data, this will also be much faster:
import numpy as np
import pandas as pd
np.random.seed(3)
# Create example data (same structure as in the OP) and write to disk
pd.DataFrame({"label": np.random.choice(["M", "N", "U"], 10),
"x": map("{:.3f}".format, np.random.normal(size=10)),
"y": map("{:.3f}".format, np.random.normal(size=10))}
).to_csv("./data.csv", header=False, index=False)
# read data to dataframe, convert to tuple, groupby and convert to dict
D = (pd.read_csv("./data.csv", header=None, names=["label", "x", "y"])
.set_index("label")
.apply(tuple, axis=1)
.groupby("label")
.apply(list)
.to_dict())
# Output:
{'M': [(-0.581, -1.69), (-1.147, -1.73), (-0.611, 0.696), (-1.19, 0.565)],
'N': [(-0.152, -0.349),
(0.872, 0.48),
(-0.016, -0.29600000000000004),
(-2.1590000000000003, -0.86)],
'U': [(0.278, 0.7559999999999999), (1.167, -0.42)]}
The long decimals (0.29600000000000004 etc.) are floating-point representation artifacts from parsing the decimal strings, not errors introduced by reading the csv file.
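If the long decimals bother you, rounding each float back to the three decimal places the file was written with cleans them up; a minimal sketch on one of the parsed tuples:

```python
vals = (-2.1590000000000003, -0.86)  # one tuple as parsed from the csv
# round back to the 3 decimal places the file was written with
cleaned = tuple(round(v, 3) for v in vals)
print(cleaned)  # (-2.159, -0.86)
```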
I'm constructing a dictionary in Python from many elements, some of which are NaNs, and I don't want to add those to the dictionary at all (because I'll be inserting it into a database and I don't want fields which don't make sense).
At the moment I'm doing something like this:
data = pd.read_csv("data.csv")
for i in range(len(data)):
    mydict = OrderedDict([("type", "mydata"), ("field2", data.ix[i, 2]), ("field5", data.ix[i, 5])])
    if not math.isnan(data.ix[i, 3]):
        mydict['field3'] = data.ix[i, 3]
    if not math.isnan(data.ix[i, 4]):
        mydict['field4'] = data.ix[i, 4]
    if not math.isnan(data.ix[i, 8]):
        mydict['field8'] = data.ix[i, 8]
    # etc....
Can it be done in a flatter structure, i.e., defining an array of field names and field numbers I'd like to conditionally insert?
>>> fields = [float('nan'),2,3,float('nan'),5]
>>> {"field%d"%i:v for i,v in enumerate(fields) if not math.isnan(v)}
{'field2': 3, 'field1': 2, 'field4': 5}
Or an ordered dict:
>>> OrderedDict(("field%d"%i,v) for i,v in enumerate(fields) if not math.isnan(v))
OrderedDict([('field1', 2), ('field2', 3), ('field4', 5)])
Is this what you were looking for?
data = pd.read_csv("data.csv")
for i in range(len(data)):
    mydict = OrderedDict([("type", "mydata"), ("field2", data.ix[i, 2]), ("field5", data.ix[i, 5])])
    # field numbers
    fields = [3, 4, 8]
    for f in fields:
        if not math.isnan(data.ix[i, f]):
            mydict['field' + str(f)] = data.ix[i, f]
conditional_fields = ((3, 'field3'), (4, 'field4'), (8, 'field8'))
for i in range(len(data)):
    mydict = OrderedDict([("type", "mydata"), ("field2", data.ix[i, 2]), ("field5", data.ix[i, 5])])
    for (index, fieldname) in conditional_fields:
        if not math.isnan(data.ix[i, index]):
            mydict[fieldname] = data.ix[i, index]
I am assuming the actual field names are not literally 'field8' etc.
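If the real names differ, the same flat structure works with a mapping from column index to field name; a sketch with hypothetical names and a stand-in row:

```python
import math

row = [0.0, 0.0, 1.5, float("nan"), 2.5, 0.0, 0.0, 0.0, 3.5]  # stand-in for one row of data
conditional_fields = {3: "age", 4: "height", 8: "score"}  # hypothetical real field names

mydict = {"type": "mydata"}
for index, fieldname in conditional_fields.items():
    # only keep fields whose value is not NaN
    if not math.isnan(row[index]):
        mydict[fieldname] = row[index]

print(mydict)  # {'type': 'mydata', 'height': 2.5, 'score': 3.5}
```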
I want to find duplicate values in one column of a csv (which has multiple columns) and replace them with the value of another column. So first I put two columns from the csv into a dictionary. Then I want to find duplicate values in the dictionary, which has string keys and values. I tried solutions for removing duplicates from a dictionary but got an error (not hashable) or no result. Here is the first part of the code.
import csv
from collections import defaultdict
import itertools as it

mydict = {}
index = 0
reader = csv.reader(open(r"computing.csv", "rb"))
for i, rows in enumerate(reader):
    if i == 0:
        continue
    if len(rows) == 0:
        continue
    k = rows[3].strip()
    v = rows[2].strip()
    if k in mydict:
        mydict[k].append(v)
    else:
        mydict[k] = [v]
#mydict = hash(frozenset(mydict))
print mydict

d = {}
while True:
    try:
        d = defaultdict(list)
        for k, v in mydict.iteritems():
            #d[frozenset(mydict.items())]
            d[v].append(k)
    except:
        continue

writer = csv.writer(open(r"OLD.csv", 'wb'))
for key, value in d.items():
    writer.writerow([key, value])
Your question is unclear, so I hope I got it right.
Please give an example of the input columns and the desired output columns.
Please give a printout of the error and let us know which line caused it.
If column1 = [1,2,3,1,4] and column2 = [a,b,c,d,e], do you want the output to be n_column1 = [a,2,3,d,4] and column2 = [1,b,c,d,e]?
I imagine the exception was in d[v].append(k), since clearly v is a list; you cannot use a list as a key in a dictionary.
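A quick demonstration of that point, and the usual fix (convert the list to a tuple, which is hashable):

```python
d = {}
try:
    d[["a", "b"]] = 1          # lists are mutable, hence unhashable
except TypeError as e:
    print(e)                   # unhashable type: 'list'

d[tuple(["a", "b"])] = 1       # tuples are hashable and work as dict keys
print(d)  # {('a', 'b'): 1}
```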
In [1]: x = [1,2,3,1,4]
In [2]: y = ['a','b','c','d','e']
In [5]: from collections import defaultdict
In [6]: d = defaultdict(int)
In [7]: for a in x:
...: d[a] += 1
In [8]: d
Out[8]: defaultdict(<type 'int'>, {1: 2, 2: 1, 3: 1, 4: 1})
In [9]: x2 = []
In [10]: for a,b in zip(x,y):
....: x2.append(a if d[a]==1 else b)
....:
In [11]: x
Out[11]: [1, 2, 3, 1, 4]
In [12]: x2
Out[12]: ['a', 2, 3, 'd', 4]
In that case, if I had to change your code to fit, I'd do something like this:
import csv
from collections import defaultdict

reader = csv.reader(open(r"computing.csv", "rb"))
histogram = defaultdict(int)
k = []
v = []
for i, rows in enumerate(reader):
    if i == 0:
        continue
    if len(rows) == 0:
        continue
    k.append(rows[3].strip())
    v.append(rows[2].strip())
    item = k[-1]
    histogram[item] += 1

output_column = []
for first_item, second_item in zip(k, v):
    output_column.append(first_item if histogram[first_item] == 1 else second_item)

writer = csv.writer(open(r"OLD.csv", 'wb'))
for c1, c2 in zip(output_column, v):
    writer.writerow([c1, c2])
Basically if given a list
events = [123,123,456,456,456,123]
I expect it returns 456 because 456 was last seen earlier than 123 was last seen.
I made lists of the counts and indices of the numbers in the initial list.
I also made a dictionary in which the key is the element from events (original part) and the value is the .count() of the key.
I don't really know where to go from here and could use some help.
Approach
Find the most frequently occurring items (Counter.most_common), then find the item among those candidates that has the minimum index (enumerate into a dictionary of indexes, then take the min of the {index: key} items).
Code
Stealing liberally from @gnibbler and @Jeff:
from collections import Counter

def most_frequent_first(events):
    frequencies = Counter(events)
    indexes = {event: i for i, event in enumerate(events)}
    most_frequent_with_indexes = {indexes[key]: key for key, _ in frequencies.most_common()}
    return min(most_frequent_with_indexes.items())[1]

events = [123, 123, 456, 456, 456, 123, 1, 2, 3, 2, 3]
print(most_frequent_first(events))
Result
>>> print(most_frequent_first(events))
456
Code
A better piece of code would provide you with the frequency and the index, showing that the code is working correctly. Here is an implementation that uses a namedtuple:
from collections import Counter, namedtuple

frequent_first = namedtuple("frequent_first", ["frequent", "first"])

def most_frequent_first(events):
    frequencies = Counter(events)
    indexes = {event: i for i, event in enumerate(events)}
    combined = {key: frequent_first(value, indexes[key]) for key, value in frequencies.items()}
    return min(combined.items(), key=lambda t: (-t[1].frequent, t[1].first))

events = [123, 123, 456, 456, 456, 123, 1, 2, 3, 2, 3]
print(most_frequent_first(events))
Result
>>> print(most_frequent_first(events))
(456, frequent_first(frequent=3, first=4))
Use collections.Counter:
>>> import collections
>>> events = [123,123,456,456,456,123]
>>> counts = collections.Counter(events)
>>> print counts
Counter({456: 3, 123: 3})
>>> mostCommon = counts.most_common()
>>> print mostCommon
[(456, 3), (123, 3)]
Breaking the tie between equally common items is the hard part.
>>> from collections import Counter
>>> events = [123,123,456,456,456,123]
>>> c = Counter(events)
>>> idxs = {k: v for v,k in enumerate(events)}
>>> sorted(c.items(), key=lambda (k,v): (-v, idxs[k]))
[(456, 3), (123, 3)]
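Note that the lambda (k,v): ... tuple-unpacking syntax only works in Python 2; the same tie-break (most frequent first, then earliest last-seen position) in Python 3 would be:

```python
from collections import Counter

events = [123, 123, 456, 456, 456, 123]
c = Counter(events)
# enumerate overwrites earlier indexes, so idxs holds each event's *last* position
idxs = {k: v for v, k in enumerate(events)}
# sort by descending count, then by ascending last-seen index
result = sorted(c.items(), key=lambda kv: (-kv[1], idxs[kv[0]]))
print(result)  # [(456, 3), (123, 3)]
```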