Comparing 2 dataframes by ID

I am very new to Python. I want to compare two dataframes. Both have the same columns, and the first column is the key variable (ID). My goal is to print the differences.
For example:
import pandas as pd
import numpy as np
dframe1 = {'ID': [1, 2, 3, 4, 5], 'Apple': ['C', 'B', 'C', 'A', 'E'], 'Pear': [2, 3, 5, 6, 7]}
dframe2 = {'ID': [4, 2, 1, 3], 'Apple': ['A', 'C', 'C', 'C'], 'Pear': [6, 'NA', 'NA', 5]}
df1 = pd.DataFrame(dframe1)
df2 = pd.DataFrame(dframe2)
import datacompy
compare = datacompy.Compare(
    df1,
    df2,
    df1_name='Reference',
    df2_name='Test',
    on_index=True
)
print(compare.report())
This produces a comparison report but I want my output to be like the following. Columns of my desired output:
out1 = {'var.x': ['Apple', 'Pear', 'Pear'], 'var.Y': ['Apple', 'Pear', 'Pear'], 'ID': [2, 1, 2],'values.x': ['B', '2', '3'], 'values.Y': ['C','NA','NA'],'row.x': [2, 1, 4], 'row.y': [2, 3, 1]}
outp = pd.DataFrame(out1)
print(outp)
Thanks a lot for your support.
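One possible way to get a long-format list of differences with plain pandas, as a minimal sketch: reshape both frames with melt, merge on ID and variable name, and keep the rows whose values differ. The helper name diff_by_id and the column names var, value and row are assumptions made for illustration. Values are compared as strings, row is the 1-based position within each frame, and IDs missing from one frame (ID 5 here) are dropped by the inner merge.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                    'Apple': ['C', 'B', 'C', 'A', 'E'],
                    'Pear': [2, 3, 5, 6, 7]})
df2 = pd.DataFrame({'ID': [4, 2, 1, 3],
                    'Apple': ['A', 'C', 'C', 'C'],
                    'Pear': [6, 'NA', 'NA', 5]})


def diff_by_id(left, right, key='ID'):
    """Hypothetical helper: list cell-level differences between two frames."""

    def to_long(df):
        # One row per (ID, variable), remembering the 1-based row position
        df = df.assign(row=range(1, len(df) + 1))
        return df.melt(id_vars=[key, 'row'], var_name='var', value_name='value')

    # Pair up the same (ID, variable) cell from both frames
    merged = to_long(left).merge(to_long(right), on=[key, 'var'],
                                 suffixes=('.x', '.y'))
    # Compare as strings so that 5 and '5' count as equal
    differs = merged['value.x'].astype(str) != merged['value.y'].astype(str)
    return merged.loc[differs].sort_values(['var', key]).reset_index(drop=True)


result = diff_by_id(df1, df2)
print(result)
```

With the sample data this should report three differing cells: Apple for ID 2, and Pear for IDs 1 and 2.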

Related

Can I make 4 new columns aggregating 4 previous ones?

I have a data set like this:
data = ({'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5], 'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12]})
Where Dan in A has the corresponding number 3 in B, and where Dan in C has the corresponding number 6 in D.
I would like to create 2 new columns, one with the name Dan and the other with 9 (3+6).
Desired Output
data = ({'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5], 'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12], 'E': ['Dan', 'Tom', 'Mary'], 'F': [9, 7, 9], 'G': ['John', 'Mike'], 'H': [1, 12]})
For the names that appear only once (John and Mike), I want 2 separate columns with their values unchanged.
I have tried using some for loops and .loc, but I am not anywhere close.
Thanks!

import pandas as pd
data = pd.DataFrame(data)  # the question defines `data` as a plain dict
df = data[['A', 'B']]
_df = data[['C', 'D']].rename(columns={'C': 'A', 'D': 'B'})
# Stack both name/value pairs, then sum the values per name
df = pd.concat([df, _df]).groupby('A', as_index=False)['B'].sum()
df.columns = ['E', 'F']
data = data.merge(df, how='left', left_on='A', right_on='E')
You could join on column C instead; that choice is yours. If you only want columns E and F, skip the last line.

You can try this:
import pandas as pd
data = {'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5], 'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12]}
df=pd.DataFrame(data)
df=df.rename(columns={"C": "A", "D": "B"})
df=df.stack().reset_index(0, drop=True).rename_axis("index").reset_index()
df=df.pivot(index=df.index//2, columns="index")
df.columns=map(lambda x: x[1], df.columns)
df=df.groupby("A", as_index=False).sum()
Outputs:
>>> df
      A   B
0   Dan   9
1  John   1
2  Mary   9
3  Mike  12
4   Tom   7
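If the goal is the split shown in the desired output (names found in both name columns versus names found in only one), a sketch along these lines might help. The column names name, value, total and n are assumptions for illustration, and the split assumes a name never repeats within the same column.

```python
import pandas as pd

data = pd.DataFrame({'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5],
                     'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12]})

# Stack the (A, B) and (C, D) pairs into one long name/value table
pairs = pd.concat([
    data[['A', 'B']].set_axis(['name', 'value'], axis=1),
    data[['C', 'D']].set_axis(['name', 'value'], axis=1),
], ignore_index=True)

# Sum the values per name and count how often each name occurs
totals = pairs.groupby('name')['value'].agg(total='sum', n='size').reset_index()

both = totals[totals['n'] > 1]     # names in both columns (E and F in the question)
single = totals[totals['n'] == 1]  # names in only one column (G and H)

print(both[['name', 'total']])
print(single[['name', 'total']])
```

The two resulting frames have different lengths, so keeping them separate is usually simpler than forcing them into extra columns of the original frame.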

pythonic way to reverse a dict where values are lists?

I have a dictionary that looks something like this:
letters_by_number = {
1: ['a', 'b', 'c', 'd'],
2: ['b', 'd'],
3: ['a', 'c'],
4: ['a', 'd'],
5: ['b', 'c']
}
I want to reverse it to look something like this:
numbers_by_letter = {
'a': [1, 3, 4],
'b': [1, 2, 5],
'c': [1, 3, 5],
'd': [1, 2, 4]
}
I know that I could do this by looping over the (key, value) pairs in letters_by_number, looping through each value (which is a list), and adding (val, key) to a list in the new dictionary. This is cumbersome and I feel like there must be a more "pythonic" way to do this. Any suggestions?

This is well-suited for collections.defaultdict:
>>> from collections import defaultdict
>>> numbers_by_letter = defaultdict(list)
>>> for k, seq in letters_by_number.items():
...     for letter in seq:
...         numbers_by_letter[letter].append(k)
...
>>> dict(numbers_by_letter)
{'a': [1, 3, 4], 'b': [1, 2, 5], 'c': [1, 3, 5], 'd': [1, 2, 4]}
Note that you don't really need the final dict() call (a defaultdict will already give you the behavior you probably want), but I included it here because the result from your question is type dict.

Use setdefault:
letters_by_number = {
1: ['a', 'b', 'c', 'd'],
2: ['b', 'd'],
3: ['a', 'c'],
4: ['a', 'd'],
5: ['b', 'c']
}
inv = {}
for k, vs in letters_by_number.items():
    for v in vs:
        inv.setdefault(v, []).append(k)
print(inv)
Output
{'a': [1, 3, 4], 'b': [1, 2, 5], 'c': [1, 3, 5], 'd': [1, 2, 4]}

A (trivial) subclass of dict would make this very easy:
class ListDict(dict):
    def __missing__(self, key):
        value = self[key] = []
        return value
letters_by_number = {
1: ['a', 'b', 'c', 'd'],
2: ['b', 'd'],
3: ['a', 'c'],
4: ['a', 'd'],
5: ['b', 'c']
}
numbers_by_letter = ListDict()
for key, values in letters_by_number.items():
    for value in values:
        numbers_by_letter[value].append(key)
from pprint import pprint
pprint(numbers_by_letter, width=40)
Output:
{'a': [1, 3, 4],
 'b': [1, 2, 5],
 'c': [1, 3, 5],
 'd': [1, 2, 4]}

Here's a solution using a dict comprehension, without adding list elements in a loop. Build a set of keys by joining all the lists together, then build each list using a list comprehension. To be more efficient, I've first built another dictionary containing sets instead of lists, so that k in v is an O(1) operation.
from itertools import chain
def invert_dict_of_lists(d):
    d = {i: set(v) for i, v in d.items()}
    return {
        k: [i for i, v in d.items() if k in v]
        for k in set(chain.from_iterable(d.values()))
    }
Strictly, dictionaries in modern versions of Python 3 retain the order that keys are inserted in. Here the keys are inserted while iterating over a set, whose iteration order is arbitrary, so the result is not guaranteed to be in alphabetical order like in your example. If you do want the keys in sorted order, change for k in set(...) to for k in sorted(set(...)).
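For reference, the sorted variant mentioned above could look like the sketch below. The function name invert_dict_of_lists_sorted is only an illustrative choice, and sorting assumes the list elements are mutually comparable (all strings here).

```python
from itertools import chain


def invert_dict_of_lists_sorted(d):
    # Sets make the membership test `k in v` an average O(1) operation
    sets = {i: set(v) for i, v in d.items()}
    return {
        k: [i for i, v in sets.items() if k in v]
        # sorted() fixes the key order of the resulting dict
        for k in sorted(set(chain.from_iterable(sets.values())))
    }


letters_by_number = {
    1: ['a', 'b', 'c', 'd'],
    2: ['b', 'd'],
    3: ['a', 'c'],
    4: ['a', 'd'],
    5: ['b', 'c'],
}
numbers_by_letter = invert_dict_of_lists_sorted(letters_by_number)
print(numbers_by_letter)
```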

Transpose or Pivot multiple columns in Pandas

I would like to transpose multiple columns in a dataframe. I have looked through most of the transpose and pivot pandas posts but could not get it to work.
Here is what my dataframe looks like.
df = pd.DataFrame()
df['L0'] = ['fruit', 'fruit', 'fruit', 'fruit', 'fruit', 'fruit', 'vegetable', 'vegetable', 'vegetable', 'vegetable', 'vegetable', 'vegetable']
df['L1'] = ['apple', 'apple', 'apple', 'banana', 'banana', 'banana', 'tomato', 'tomato', 'tomato', 'lettuce', 'lettuce', 'lettuce']
df['Type'] = ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z']
df['A'] = [3, 0, 4, 3, 1, 3, 2, 2, 2, 4, 2, 4]
df['B'] = [3, 1, 0, 4, 1, 4, 4, 4, 2, 1, 2, 1]
df['C'] = [0, 4, 1, 0, 2, 4, 1, 1, 2, 3, 2, 3]
I would like to transpose/pivot columns A, B and C and replace them with values from column "Type". Resulting dataframe should look like this.
df2 = pd.DataFrame()
df2['L0'] = ['fruit', 'fruit', 'fruit', 'fruit', 'fruit', 'fruit', 'vegetable', 'vegetable', 'vegetable', 'vegetable', 'vegetable', 'vegetable']
df2['L1'] = ['apple', 'apple', 'apple', 'banana', 'banana', 'banana', 'tomato', 'tomato', 'tomato', 'lettuce', 'lettuce', 'lettuce']
df2['Type2'] = ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C']
df2['X'] = [3, 3, 0, 3, 4, 0, 2, 4, 1, 4, 1, 3]
df2['Y'] = [0, 1, 4, 1, 1, 2, 2, 4, 1, 2, 2, 2]
df2['Z'] = [4, 0, 1, 3, 4, 4, 2, 2, 2, 4, 1, 3]
The best I could do was this
df.groupby(['L0', 'L1', 'Type'])[['A', 'B', 'C']].sum().unstack('Type')
But this is not really what I want. Thank you!

Add stack before unstack:
df = (df.groupby(['L0', 'L1', 'Type'])[['A', 'B', 'C']]
        .sum()
        .stack()
        .unstack('Type')
        .reset_index()
        .rename_axis(None, axis=1)
        .rename(columns={'level_2': 'Type2'}))
print(df)
           L0       L1 Type2  X  Y  Z
0       fruit    apple     A  3  0  4
1       fruit    apple     B  3  1  0
2       fruit    apple     C  0  4  1
3       fruit   banana     A  3  1  3
4       fruit   banana     B  4  1  4
5       fruit   banana     C  0  2  4
6   vegetable  lettuce     A  4  2  4
7   vegetable  lettuce     B  1  2  1
8   vegetable  lettuce     C  3  2  3
9   vegetable   tomato     A  2  2  2
10  vegetable   tomato     B  4  4  2
11  vegetable   tomato     C  1  1  2

Slicing flat list into multi-level nested list efficiently

For example, I have a flat list
[1, 2, 3, 4, 5, 6, 7, 8, 9, 'A', 'B', 'C', 'D', 'E', 'F', 'G']
I want to transform it into 4-deep list
[[[[1, 2], [3, 4]], [[5, 6], [7, 8]]], [[[9, 'A'], ['B', 'C']], [['D', 'E'], ['F', 'G']]]]
Is there a way to do it without creating a separate variable for every level? What is the most memory- and performance-efficient way?
UPDATE:
Also, is there a way to do it in a non-symmetrical fashion?
[[[[1, 2, 3], 4], [[5, 6, 7], 8]], [[[9, 'A', 'B'], 'C'], [['D', 'E', 'F'], 'G']]]

Note that your first list has 15 elements instead of 16. Also, what should A be? Is it a constant you've defined somewhere else? I'll just assume it's a string : 'A'.
If you work with np.arrays, you could simply reshape your array:
import numpy as np
r = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 'A', 'B', 'C', 'D', 'E', 'F', 'G'])
r.reshape(2,2,2,2)
It outputs:
array([[[['1', '2'],
         ['3', '4']],
        [['5', '6'],
         ['7', '8']]],
       [[['9', 'A'],
         ['B', 'C']],
        [['D', 'E'],
         ['F', 'G']]]],
      dtype='<U11')
This should be really efficient because numpy doesn't change the underlying data format. It's still a flat array, displayed differently.
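A quick way to check the "same data, different view" claim, as a small sketch: np.shares_memory reports whether two arrays overlap in memory, and a write through the reshaped array shows up in the flat one. The variable names are arbitrary.

```python
import numpy as np

flat = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 'A', 'B', 'C', 'D', 'E', 'F', 'G'])
nested = flat.reshape(2, 2, 2, 2)

# For a contiguous array, reshape returns a view on the same buffer
shares = np.shares_memory(flat, nested)
print(shares)

# Writing through the view changes the original flat array as well
nested[0, 0, 0, 0] = 'X'
first = flat[0]
print(first)
```

Note that mixing integers and strings makes NumPy store every element as a string, which is why the numbers appear quoted in the output above.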
Numpy doesn't support irregular shapes. You'll have to work with standard python lists then:
i = iter([1, 2, 3, 4, 5, 6, 7, 8, 9, 'A', 'B', 'C', 'D', 'E', 'F', 'G'])
l1 = []
for _ in range(2):
    l2 = []
    for _ in range(2):
        l3 = []
        l4 = []
        for _ in range(3):
            l4.append(next(i))
        l3.append(l4)
        l3.append(next(i))
        l2.append(l3)
    l1.append(l2)
print(l1)
# [[[[1, 2, 3], 4], [[5, 6, 7], 8]], [[[9, 'A', 'B'], 'C'], [['D', 'E', 'F'], 'G']]]
As you said, you'll have to define a temporary variable for each level. I guess you could use list comprehensions, but they wouldn't be pretty.
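For the regular (symmetrical) case, plain lists can also be nested without a named variable per level by recursing over the target shape. This is only a sketch: the helper name nest is hypothetical, and it assumes the flat list holds exactly as many items as the shape requires.

```python
def nest(flat, shape):
    """Nest a flat sequence into lists following `shape`, e.g. (2, 2, 2, 2)."""
    it = iter(flat)

    def build(dims):
        if len(dims) == 1:
            # Innermost level: pull the next dims[0] items from the iterator
            return [next(it) for _ in range(dims[0])]
        # Outer levels: build dims[0] sublists of the remaining shape
        return [build(dims[1:]) for _ in range(dims[0])]

    return build(tuple(shape))


flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 'A', 'B', 'C', 'D', 'E', 'F', 'G']
nested = nest(flat, (2, 2, 2, 2))
print(nested)
```

Because all levels draw from one shared iterator, the items are consumed in order and no intermediate slices of the flat list are created.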

Looping through a dictionary

I'm trying to loop through a dictionary. Starting from the first key, the code should look at its value, loop through each element in the list, and double it (repeat it twice). Once it is done with the list, it adds the key and the new list to a new dictionary, then continues with the next key. The value attached to each key will always be a list. Preferably without importing any modules.
Some inputs and outputs to better show what the code is supposed to do (the key order will vary, so sometimes 'b' comes first and sometimes 'a'):
>>> create_dict({'a': [1, 2], 'b': [3, 4]})
{'a': ['1', '1', '2', '2'], 'b': ['3', '3', '4', '4']}
>>> create_dict({'a': ['c', 'd'], 'b': ['d', 'e']})
{'a': ['c', 'c', 'd', 'd'], 'b': ['d', 'd', 'e', 'e']}
>>> create_dict({'a': ['e', 'f'], 'b': ['g', 'h']})
{'a': ['e', 'e', 'f', 'f'], 'b': ['g', 'g', 'h', 'h']}
What I've written so far:
def create_dict(sample_dict):
    '''(dict) -> dict
    Given a dictionary, loop through the value in the first key and double
    each element in the list and add the result to a new dictionary, move on
    to the next key and continue the process.
    >>> create_dict({'a': [1, 2], 'b': [3, 4]})
    {'a': ['1', '1', '2', '2'], 'b': ['3', '3', '4', '4']}
    >>> create_dict({'a': ['c', 'd'], 'b': ['d', 'e']})
    {'a': ['c', 'c', 'd', 'd'], 'b': ['d', 'd', 'e', 'e']}
    >>> create_dict({'name': ['bob', 'smith'], 'last': ['jones', 'li']})
    {'name': ['bob', 'bob', 'smith', 'smith'], 'last': ['jones', 'jones', 'li', 'li']}
    '''
    new_dict = {}
    new_list = []
    for index in sample_dict.values():
        for element in index:
            new_list.extend([element] * 2)
    return new_dict
However the result I'm getting does not quite match what I had in mind:
>>> create_dict({'name': ['bob', 'smith'], 'last': ['jones', 'li']})
{'last': ['jones', 'jones', 'li', 'li', 'bob', 'bob', 'smith', 'smith'], 'name': ['jones', 'jones', 'li', 'li', 'bob', 'bob', 'smith', 'smith']}
>>> create_dict({'a': [1, 2], 'b': [3, 4]})
{'b': [3, 3, 4, 4, 1, 1, 2, 2], 'a': [3, 3, 4, 4, 1, 1, 2, 2]}
Thank you for those who help :)

I think you're initializing new_list too soon. Because it is created only once, outside the loop, every key ends up pointing at the same list, which keeps growing.
So, try this:
def create_dict(sample_dict):
    new_dict = {}
    for key in sample_dict:
        new_list = []  # fresh list for every key
        for val in sample_dict[key]:
            new_list.extend([val] * 2)
        new_dict[key] = new_list
    return new_dict
print(create_dict({'a': [1, 2], 'b': [3, 4]}))
It returns {'a': [1, 1, 2, 2], 'b': [3, 3, 4, 4]}

This can be a lot simpler with dictionary comprehensions
d = {'a': ['c', 'd'], 'b': ['d', 'e']}
{key:[y for z in zip(value, value) for y in z] for (key, value) in d.items()}
{'a': ['c', 'c', 'd', 'd'], 'b': ['d', 'd', 'e', 'e']}
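The same idea can be wrapped in the create_dict function from the question. This sketch uses a nested comprehension in place of zip, and it leaves the elements as they are rather than converting numbers to strings as the question's first example shows.

```python
def create_dict(sample_dict):
    # For each key, emit every element twice, preserving the original order
    return {
        key: [element for element in values for _ in range(2)]
        for key, values in sample_dict.items()
    }


doubled = create_dict({'name': ['bob', 'smith'], 'last': ['jones', 'li']})
print(doubled)
```

Since a new list is built for every key, the shared-list problem from the original attempt cannot occur here.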

def create_dict(sample_dict):
    new_dict = {}  # build the dict straight away
    for key, value in sample_dict.items():  # .items() yields (key, value) tuples
        new_list = []  # start with a new list for each pair in the dict
        for element in value:  # go over each element in 'value'
            new_list.extend([element, element])
        new_dict[key] = new_list
    return new_dict
print(create_dict({'name': ['bob', 'smith'], 'last': ['jones', 'li']}))
Outputs:
>>>
{'last': ['jones', 'jones', 'li', 'li'], 'name': ['bob', 'bob', 'smith', 'smith']}
