I want to use https://github.com/datamade/dedupe to deduplicate some records in python. Looking at their examples
data_d = {}
for row in data:
    clean_row = [(k, preProcess(v)) for (k, v) in row.items()]
    row_id = int(row['id'])
    data_d[row_id] = dict(clean_row)
the dictionary consumes quite a lot of memory compared to, e.g., a dictionary created by pandas from a pd.DataFrame, or even a normal pd.DataFrame.
If this format is required, how can I efficiently convert a pd.DataFrame to such a dictionary?
Edit:
Example what pandas generates
{'column1': {0: 1389225600000000000,
1: 1388707200000000000,
2: 1388707200000000000,
3: 1389657600000000000,....
Example what dedupe expects
{'1': {'column1': 1389225600000000000, 'column2': "ddd"},
 '2': {'column1': 1111, 'column2': "ddd"}, ...}
It appears that df.to_dict(orient='index') will produce the representation you are looking for:
import pandas
data = [[1, 2, 3], [4, 5, 6]]
columns = ['a', 'b', 'c']
df = pandas.DataFrame(data, columns=columns)
df.to_dict(orient='index')
results in
{0: {'a': 1, 'b': 2, 'c': 3}, 1: {'a': 4, 'b': 5, 'c': 6}}
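If dedupe really needs string keys, as in the second example in the question, one extra comprehension converts them (a sketch; whether dedupe requires str keys is an assumption taken from the question):
# Convert the integer index keys to strings, matching the
# {'1': {...}, '2': {...}} shape shown in the question.
data_d = {str(k): v for k, v in df.to_dict(orient='index').items()}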
You can try something like this:
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [6,7,8,9,10]})
A B
0 1 6
1 2 7
2 3 8
3 4 9
4 5 10
print(df.T.to_dict())
{0: {'A': 1, 'B': 6}, 1: {'A': 2, 'B': 7}, 2: {'A': 3, 'B': 8}, 3: {'A': 4, 'B': 9}, 4: {'A': 5, 'B': 10}}
This is the same output as in @chthonicdaemon's answer, so that answer is probably better. I am using pandas.DataFrame.T to transpose index and columns.
A Python dictionary is not required; you just need an object that allows indexing by column name, i.e. row['col_name'].
So, assuming data is a pandas DataFrame, you should just be able to do something like:
data_d = {}
for row_id, row in data.iterrows():
    data_d[row_id] = row
That said, the memory overhead of python dicts is not going to be where you have memory bottlenecks in dedupe.
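If you want to verify that for your own data, here is a rough measurement sketch (DataFrame.memory_usage is pandas API; the dict sizing below is only a lower bound, since sys.getsizeof does not follow references):
import sys

# Deep memory usage of the DataFrame, in bytes.
df_bytes = data.memory_usage(deep=True).sum()

# Rough size of the dict: container overhead plus the per-row
# objects; nested values are not followed, so this undercounts.
dict_bytes = sys.getsizeof(data_d) + sum(sys.getsizeof(v) for v in data_d.values())
print(df_bytes, dict_bytes)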
Data input:
[
    {'a': 1, 'b': 2, 'c': 3},
    {'b': 2, 'd': 4, 'e': 5, 'a': 1},
    {'b': 2, 'd': 4, 'a': 1}
]
CSV output (column order does not matter):
a, b, c, d, e
1, 2, 3
1, 2, , 4, 5
1, 2, , 4
The standard library csv module does not seem to cover this kind of input.
Is there some package or library for a single-method export?
Or a good solution to deal with column discrepancies?
It can be done fairly easily using the included csv module with a little preliminary processing.
import csv

data = [
    {'a': 1, 'b': 2, 'c': 3},
    {'b': 2, 'd': 4, 'e': 5, 'a': 1},
    {'b': 2, 'd': 4, 'a': 1}
]

fields = sorted(set.union(*(set(d) for d in data)))  # Determine columns.

with open('output.csv', 'w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=fields)
    writer.writeheader()
    writer.writerows(data)

print('-fini-')
Contents of file produced:
a,b,c,d,e
1,2,3,,
1,2,,4,5
1,2,,4,
Straightforward with pandas:
import pandas as pd

lst = [
    {'a': 1, 'b': 2, 'c': 3},
    {'b': 2, 'd': 4, 'e': 5, 'a': 1},
    {'b': 2, 'd': 4, 'a': 1}
]

df = pd.DataFrame(lst)
print(df.to_csv(index=False))
Output (columns containing missing values are upcast to float, hence the .0):
a,b,c,d,e
1,2,3.0,,
1,2,,4.0,5.0
1,2,,4.0,
You can pass a restval argument to DictWriter, which is the default value written for keys that are missing from a dictionary:
writer = csv.DictWriter(file, fieldnames=list('abcde'), restval='')
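For completeness, a self-contained version of that (the file name and the data are just taken from the question):
import csv

data = [
    {'a': 1, 'b': 2, 'c': 3},
    {'b': 2, 'd': 4, 'e': 5, 'a': 1},
    {'b': 2, 'd': 4, 'a': 1}
]

with open('output.csv', 'w', newline='') as file:
    # restval fills in columns that are missing from a given dict.
    writer = csv.DictWriter(file, fieldnames=list('abcde'), restval='')
    writer.writeheader()
    writer.writerows(data)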
I have
df.shape
> (12702, 27)
df['x'][0]
>{'a': '123',
'b': '214',
'c': '654',}
I try:
df['x'].unique()
>TypeError: unhashable type: 'list'
Is it possible to get the unique values among the keys in the dictionaries?
Or should I use dummies?
(Providing an answer similar to https://stackoverflow.com/a/12897477/15744261)
Edited:
Sorry for the multiple edits, I misunderstood your question.
From your snippet it looks like your df['x'] is returning a list of dictionaries. If what you're asking is to get all unique values across some of the dictionaries, you can get the keys using list(my_dict) (which will return a list of the keys). Then use a set on the list to get the unique values. Example:
values = set(list(df['x'][0]) + list(df['x'][1]) + ... )
If you need unique keys across all of these dictionaries, you could get a little more creative with list comprehension to compile all the keys and then wrap that in a set for the unique values.
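For instance, a sketch assuming every entry in df['x'] is a dict:
# Flatten the keys of every dict in the column, then deduplicate.
unique_keys = {key for row in df['x'] for key in row}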
Old answer:
For a list you can simply convert to a set which will remove the duplicates:
values = set(df['x'][0])
If you want to use these values as a list you can convert that set into a list as well:
list_values = list(values)
Or in one line:
values = list(set(df['x'][0]))
Keep in mind, this is certainly not the most efficient way to do this. I'm sure there are better ways to do it if you're dealing with a large amount of data.
It seems that you want to find the unique keys across all the dictionaries in this column. This can be done easily with functools.reduce. I've generated some sample data:
import pandas as pd
import random
possible_keys = 'abcdefg'
df = pd.DataFrame({'x': [{key: 1 for key in random.choices(possible_keys, k=3)} for _ in range(10)]})
This dataframe looks like this:
x
0 {'c': 1, 'a': 1}
1 {'b': 1, 'd': 1, 'c': 1}
2 {'d': 1, 'b': 1}
3 {'b': 1, 'f': 1, 'e': 1}
4 {'a': 1, 'd': 1, 'c': 1}
5 {'g': 1, 'b': 1}
6 {'d': 1}
7 {'e': 1}
8 {'c': 1, 'd': 1, 'f': 1}
9 {'b': 1, 'a': 1, 'f': 1}
Now the actual meat of the answer:
from functools import reduce
reduce(set.union, df['x'], set())
Results in:
{'a', 'b', 'c', 'd', 'e', 'f', 'g'}
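If you prefer to avoid the import, set().union accepts any number of iterables and iterating a dict yields its keys, so this one-liner gives the same result:
set().union(*df['x'])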
I am trying to create a dictionary from a DataFrame where the key sometimes has multiple values.
For example:
df
ID value
A 10
B 45
C 20
C 30
D 20
E 10
E 70
E 110
F 20
And I want the dictionary to look like:
dic = {'A': 10,
       'B': 45,
       'C': [20, 30],
       'D': 20,
       'E': [10, 70, 110],
       'F': 20}
I tried using the following code:
dic = df.set_index('ID').T.to_dict('list')
But it returned a dictionary with only one value per ID:
{'A': 10,
'B': 45,
'C': 30,
'D': 20,
'E': 110,
'F': 20}
I'm assuming the right way to go about it is with some kind of loop appending to an empty dictionary but I'm not sure what the proper syntax would be.
My actual DataFrame is much longer than this, so what would I use to convert the DataFrame to the dictionary?
Thanks!
Example dataframe:
df = pd.DataFrame({'ID': ['A', 'B', 'B'], 'value': [1, 2, 3]})
df_tmp = df.groupby('ID')['value'].apply(list).reset_index()
dict(zip(df_tmp['ID'], df_tmp['value']))
outputs
{'A': [1], 'B': [2, 3]}
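If you want scalars for IDs that only have a single value, matching the expected output in the question, one more dict comprehension over the result does it (a sketch):
d = dict(zip(df_tmp['ID'], df_tmp['value']))
# Unwrap single-element lists so lone values become scalars.
d = {k: v[0] if len(v) == 1 else v for k, v in d.items()}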
Here's my starting dataframe:
StartDF = pd.DataFrame({'A': {0: 1, 1: 1, 2: 2, 3: 4, 4: 5, 5: 5, 6: 5, 7: 5}, 'B': {0: 2, 1: 2, 2: 4, 3: 2, 4: 2, 5: 4, 6: 4, 7: 5}, 'C': {0: 10, 1: 1000, 2: 250, 3: 100, 4: 550, 5: 100, 6: 3000, 7: 250}})
I need to create a list of individual dataframes based on duplicate values in columns A and B, so it should look like this:
df1 = pd.DataFrame({'A': {0: 1, 1: 1}, 'B': {0: 2, 1: 2}, 'C': {0: 10, 1: 1000}})
df2 = pd.DataFrame({'A': {0: 2}, 'B': {0: 4}, 'C': {0: 250}})
df3 = pd.DataFrame({'A': {0: 4}, 'B': {0: 2}, 'C': {0: 100}})
df4 = pd.DataFrame({'A': {0: 5}, 'B': {0: 2}, 'C': {0: 550}})
df5 = pd.DataFrame({'A': {0: 5, 1: 5}, 'B': {0: 4, 1: 4}, 'C': {0: 100, 1: 3000}})
df6 = pd.DataFrame({'A': {0: 5}, 'B': {0: 5}, 'C': {0: 250}})
I've seen a lot of answers that explain how to DROP duplicates, but I need to keep the duplicate values because the information in column C will usually be different between rows regardless of duplicates in columns A and B. All of the row data needs to be preserved in the new dataframes.
Additional note, the starting dataframe (StartDF) will change in length, so each time this is run, the number of individual dataframes created will be variable. Ultimately, I need to print the newly created dataframes to their own csv files (I know how to do this part). Just need to know how to break out the data from the original dataframe in an elegant way.
You can use a groupby, iterate over each group and build a list using a list comprehension.
df_list = [g for _, g in df.groupby(['A', 'B'])]
print(*df_list, sep='\n\n')
A B C
0 1 2 10
1 1 2 1000
A B C
2 2 4 250
A B C
3 4 2 100
A B C
4 5 2 550
A B C
5 5 4 100
6 5 4 3000
A B C
7 5 5 250
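Since you ultimately want each dataframe in its own csv file, you can also write them straight from the same groupby instead of materializing the list first (a sketch; the file naming scheme is just an assumption):
for (a, b), group in df.groupby(['A', 'B']):
    # One file per unique (A, B) pair, e.g. group_1_2.csv.
    group.to_csv('group_{}_{}.csv'.format(a, b), index=False)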
Let's say I have a list of dictionaries:
>>> d = [{'a': 2, 'b': 3, 'c': 4}, {'a': 5, 'b': 6, 'c': 7}]
And I want to perform a map operation where I change just one value in each dictionary. One possible way to do that is to create a new dictionary which simply contains the original values along with the changed ones:
>>> map(lambda x: {'a': x['a'], 'b': x['b'] + 1, 'c': x['c']}, d)
[{'a': 2, 'c': 4, 'b': 4}, {'a': 5, 'c': 7, 'b': 7}]
This can get unruly if the dictionaries have many items.
Another way might be to define a function which copies the original dictionary and only changes the desired values:
>>> def change_b(x):
... new_x = x.copy()
... new_x['b'] = x['b'] + 1
... return new_x
...
>>> map(change_b, d)
[{'a': 2, 'c': 4, 'b': 4}, {'a': 5, 'c': 7, 'b': 7}]
This, however, requires writing a separate function and loses the elegance of a lambda expression.
Is there a better way?
This works (and is compatible with python2 and python3¹):
>>> map(lambda x: dict(x, b=x['b']+1), d)
[{'a': 2, 'c': 4, 'b': 4}, {'a': 5, 'c': 7, 'b': 7}]
With that said, I think that more often than not, lambda-based solutions are less elegant than non-lambda counterparts... The rationale behind this statement is that I can immediately look at the non-lambda solution that you proposed and know exactly what it does. The lambda-based solution that I just wrote would take a bit of thinking to parse and then more thinking to actually understand...
¹Though, map will give you an iterable object on python3.x that isn't a list...
First, writing a function doesn't seem that inelegant to me in the first place. That said, welcome to the brave new world of Python 3.5 and PEP 448:
>>> d = [{'a': 2, 'b': 3, 'c': 4}, {'a': 5, 'b': 6, 'c': 7}]
>>> d
[{'b': 3, 'a': 2, 'c': 4}, {'b': 6, 'a': 5, 'c': 7}]
>>> [{**x, 'b': x['b']+1} for x in d]
[{'b': 4, 'a': 2, 'c': 4}, {'b': 7, 'a': 5, 'c': 7}]
From how your map is behaving, it's clear you're using Python 2, but that's easy enough to fix. :-)
You can use a for loop with an update call. Here is a hacky one-liner:
dcts = [{'a': 2, 'b': 3, 'c': 4}, {'a': 5, 'b': 6, 'c': 7}]
dcts = [d.update({'b': d['b']+1}) or d for d in dcts]
Edit: To preserve original dicts:
from copy import copy
dcts = [d.update({'b': d['b']+1}) or d for d in map(copy, dcts)]
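Note that copy.copy is shallow; if the dicts contain nested mutable values that you also mutate, deepcopy is the safer variant (a sketch):
from copy import deepcopy

# deepcopy also duplicates nested values, so mutating them in the
# new dicts cannot leak back into the originals.
dcts = [d.update({'b': d['b'] + 1}) or d for d in map(deepcopy, dcts)]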