Eliminate duplicates in dictionary Python - python

I have a csv file separated by tabs. I only need to focus on the first two columns and find, for example, whether the pair A-B appears again in the document as B-A, and print A-B if B-A appears. The same goes for the rest of the pairs.
For the example proposed, the output is:
A-B
C-D
import pandas as pd

colnames = ['col1', 'col2', 'col3', 'col4', 'col5']
data = pd.read_csv('koko.csv', names=colnames, delimiter='\t')
col1 = data.col1.tolist()
col2 = data.col2.tolist()
dataset = list(zip(col1, col2))

dic = {}
for a, b in dataset:
    # record the pair if the swapped pair also appears in the file
    if (b, a) in dataset:
        dic[a] = b
print(dic)
This prints:
{'A': 'B', 'B': 'A', 'D': 'C', 'C': 'D'}
How can I avoid duplicated (or swapped) results in the dictionary?

Does this work?
import pandas as pd
import numpy as np

col_1 = ['A', 'B', 'C', 'B', 'D']
col_2 = ['B', 'C', 'D', 'A', 'C']
df = pd.DataFrame(np.column_stack([col_1, col_2]), columns=['Col1', 'Col2'])
df['combined'] = list(zip(df['Col1'], df['Col2']))
final_set = set(tuple(sorted(t)) for t in df['combined'])
final_set looks like this:
{('C', 'D'), ('A', 'B'), ('B', 'C')}
The output contains more than just A-B and C-D because the second row has B-C: sorting the pairs removes swapped duplicates, but it does not check that the swapped pair actually appears.
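To keep only the pairs that really occur in both orders, one option (a minimal sketch, not part of the answer above) is to count the sorted pairs and keep those seen at least twice:
from collections import Counter

# a pair that appears as both A-B and B-A (or twice in the same order)
# ends up with a count of at least 2
pair_counts = Counter(tuple(sorted(t)) for t in zip(col_1, col_2))
mutual_pairs = {pair for pair, n in pair_counts.items() if n >= 2}
print(mutual_pairs)  # {('A', 'B'), ('C', 'D')}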

The below should work. Example df used:
df = pd.DataFrame({'Col1' : ['A','C','D','B','D','A'], 'Col2' : ['B','D','C','A','C','B']})
This is the function I used:
temp = df[['Col1', 'Col2']].apply(lambda row: sorted(row), axis=1, result_type='broadcast')  # sort each row; 'broadcast' keeps the original columns
print(temp.drop_duplicates())
Useful links:
checking if a string is in alphabetical order in python
Difference between map, applymap and apply methods in Pandas

Here is one way.
df = pd.DataFrame({'Col1': ['A','C','D','B','D','A','E'],
                   'Col2': ['B','D','C','A','C','B','F']})
df = df.drop_duplicates().apply(sorted, axis=1, result_type='broadcast')
df = df.loc[df.duplicated(subset=['Col1', 'Col2'], keep=False)].drop_duplicates()
#   Col1 Col2
# 0    A    B
# 1    C    D
Explanation
The steps are:
Remove exact duplicate rows.
Sort each row, so that B-A becomes A-B.
Keep only the rows that now appear more than once, i.e. pairs that occur in both orders.
Remove duplicate rows again.
The same pipeline is unrolled step by step in the sketch below.
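Here is the chain above split into one statement per step (a sketch, re-creating the example df so it runs on its own):
df = pd.DataFrame({'Col1': ['A','C','D','B','D','A','E'],
                   'Col2': ['B','D','C','A','C','B','F']})
step1 = df.drop_duplicates()                                  # remove exact duplicate rows
step2 = step1.apply(sorted, axis=1, result_type='broadcast')  # sort each row
step3 = step2.loc[step2.duplicated(subset=['Col1', 'Col2'], keep=False)]  # keep pairs seen twice
step4 = step3.drop_duplicates()                               # one row per pair
print(step4)
#   Col1 Col2
# 0    A    B
# 1    C    D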

Related

DataFrame to a dictionary with some columns in a list as keys and one as value

I have a pandas dataframe df that looks like this:
col1 col2 col3
A X 1
B Y 2
C Z 3
I want to convert this into a dictionary with col1 and col2 in a list as key and col3 as value. So, the output would look like this:
{
['A', 'X']: 1,
['B', 'Y']: 2,
['C', 'Z']: 3
}
How do I get my desired output?
There are several restrictions on what can be a dictionary key. Lists and dictionaries cannot be dictionary keys because they are mutable, and therefore unhashable: their contents can change, so they cannot provide a stable hash code. Thus, you cannot use lists as dictionary keys. You can, however, use tuples, which are immutable but otherwise very similar to lists. A quick demonstration:
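# a list key raises TypeError: unhashable type: 'list'
try:
    d = {['A', 'X']: 1}
except TypeError as e:
    print(e)

# a tuple key works fine
d = {('A', 'X'): 1}
print(d)  # {('A', 'X'): 1}
With that in mind, let's make your dataframe again: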
import pandas as pd
data = {'col1':['A','B','C'],'col2':['X','Y','Z'],'col3':[1,2,3]}
df = pd.DataFrame(data)
Now we have the same dataframe. Next, let's use a dict comprehension to iterate through all the rows of the dataframe, taking column 1 and column 2 as tuple keys and column 3 as values:
my_dict = {(df.iloc[i,0],df.iloc[i,1]): df.iloc[i,2] for i in range(len(df))}
Now, you have the following output:
my_dict = {('A', 'X'): 1, ('B', 'Y'): 2, ('C', 'Z'): 3}
Here's a way to do this using pandas methods:
After creating the dataframe:
import pandas as pd
df = pd.DataFrame([['A', 'X', 1], ['B', 'Y', 2], ['C', 'Z', 3]],
                  columns=['col1', 'col2', 'col3'])
Set the index to the columns that form the dictionary keys, select the column for the values and convert it to a dictionary:
df.set_index(['col1', 'col2']).col3.to_dict()
To get the required result:
{('A', 'X'): 1, ('B', 'Y'): 2, ('C', 'Z'): 3}
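A one-line alternative (a minimal sketch, not part of the answer above) zips the two key columns into tuples and pairs them with the value column:
my_dict = dict(zip(zip(df['col1'], df['col2']), df['col3']))
print(my_dict)  # {('A', 'X'): 1, ('B', 'Y'): 2, ('C', 'Z'): 3}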

Get counts of unique lists in Pandas

I have a pandas Dataframe where one of the columns is full of lists:
import pandas
df = pandas.DataFrame([[1, ['a', 'b', 'c']],
                       [2, ['d', 'e', 'f']],
                       [3, ['a', 'b', 'c']]])
And I'd like to make a pivot table that shows the list and a count of occurrences
List Count
[a,b,c] 2
[d,e,f] 1
Because list is a non-hashable type, what aggregate functions could do this?
You can zip a list of rows and a list of counts, then make a dataframe from the zip object:
import pandas

df = pandas.DataFrame([[1, ['a', 'b', 'c']],
                       [2, ['d', 'e', 'f']],
                       [3, ['a', 'b', 'c']]])

rows = []
counts = []
for index, row in df.iterrows():
    if row[1] not in rows:
        # first occurrence of this list: record it with a count of 1
        rows.append(row[1])
        counts.append(1)
    else:
        # seen before: increment its count
        counts[rows.index(row[1])] += 1
df = pandas.DataFrame(zip(rows, counts))
print(df)
The solution I ended up using was:
import pandas

df = pandas.DataFrame([[1, ['a', 'b', 'c']],
                       [2, ['d', 'e', 'f']],
                       [3, ['a', 'b', 'c']]])
print(df[1])
df[1] = df[1].map(tuple)  # thanks Ch3steR: tuples are hashable, lists are not
df2 = pandas.pivot_table(df, index=df[1], aggfunc='count')
print(df2)
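A shorter alternative (a sketch, not from the answer above): once the lists are converted to tuples, value_counts gives the counts directly:
counts = df[1].map(tuple).value_counts()
print(counts)
# ('a', 'b', 'c')    2
# ('d', 'e', 'f')    1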

Normalize multiple columns of list/tuple data

I have a dataframe with multiple columns of tuple data. I'm trying to normalize the data within the tuple for each row, per column. This is an example with lists, but it should be the same concept for tuples as well:
import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame(np.random.randn(5, 10), columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
df['arr1'] = df[['a', 'b', 'c', 'd', 'e']].values.tolist()
df['arr2'] = df[['f', 'g', 'h', 'i', 'j']].values.tolist()
If I wish to normalize each list row for a few columns, I would do this:
df['arr1'] = [preprocessing.scale(row) for row in df['arr1']]
df['arr2'] = [preprocessing.scale(row) for row in df['arr2']]
However, since I have about 100 such columns in my original dataset, I obviously don't want to manually normalize per column. How can I loop across all columns?
You can loop through the columns in a DataFrame like this to process each column:
for col in df.columns:
    df[col] = [preprocessing.scale(row) for row in df[col]]
Of course, this only works if you want to process all of the columns in the DataFrame. If you only want a subset, you could create a list of columns first, or you could drop the other columns.
# Here's an example where you manually specify the columns
cols_to_process = ['arr1', 'arr2']
for col in cols_to_process:
    df[col] = [preprocessing.scale(row) for row in df[col]]

# Here's an example where you drop the unwanted columns first
# (with the example df above, that is the ten scalar columns a-j)
cols_to_drop = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = df.drop(columns=cols_to_drop)
for col in df.columns:
    df[col] = [preprocessing.scale(row) for row in df[col]]

# Or, if you didn't want to actually drop the columns
# from the original DataFrame you could do it like this:
for col in df.drop(columns=cols_to_drop):
    df[col] = [preprocessing.scale(row) for row in df[col]]

Creating a variable number of lists from pandas dataframes

I have a pandas dataframe being generated by some other piece of code - the dataframe may have different number of columns each time it is generated: let's call them col1,col2,...,coln where n is not fixed. Please note that col1,col2,... are just placeholders, the actual names of columns can be arbitrary like TimeStamp or PrevState.
From this, I want to convert each column into a list, with the name of the list being the same as the column. So, I want a list named col1 with the entries in the first column of the dataframe and so on till coln.
How do I do this?
Thanks
This is not recommended; it is better to create a dictionary:
d = df.to_dict('list')
Then select each list from the dict by key, using the column name:
print (d['col'])
Sample:
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
})
d = df.to_dict('list')
print (d)
{'A': ['a', 'b', 'c', 'd', 'e', 'f'], 'B': [4, 5, 4, 5, 5, 4], 'C': [7, 8, 9, 4, 2, 3]}
print (d['A'])
['a', 'b', 'c', 'd', 'e', 'f']
import pandas as pd

df = pd.DataFrame()
df["col1"] = [1, 2, 3, 4, 5]
df["colTWO"] = [6, 7, 8, 9, 10]
for col_name in df.columns:
    # build a variable named after the column; repr() of a plain list
    # evaluates back to a list, so exec can recreate it
    exec(col_name + " = " + repr(df[col_name].tolist()))
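If you really need dynamically named variables, a slightly safer sketch (a suggestion, not part of the answer above) writes into globals() instead of using exec:
for col_name in df.columns:
    # dynamically created globals are still hard to track; a dict is preferable
    globals()[col_name] = df[col_name].tolist()
print(col1)  # [1, 2, 3, 4, 5]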

Efficient way to replace column of lists by matches with another data frame in Pandas

I have a pandas data frame that looks like:
col11 col12
X ['A']
Y ['A', 'B', 'C']
Z ['C', 'A']
And another one that looks like:
col21 col22
'A' 'alpha'
'B' 'beta'
'C' 'gamma'
I would like to replace col12 based on col22 in an efficient way and get, as a result:
col31 col32
X ['alpha']
Y ['alpha', 'beta', 'gamma']
Z ['gamma', 'alpha']
One solution is to use an indexed series as a mapper with a list comprehension:
import pandas as pd

df1 = pd.DataFrame({'col1': ['X', 'Y', 'Z'],
                    'col2': [['A'], ['A', 'B', 'C'], ['C', 'A']]})
df2 = pd.DataFrame({'col21': ['A', 'B', 'C'],
                    'col22': ['alpha', 'beta', 'gamma']})
s = df2.set_index('col21')['col22']
df1['col2'] = [list(map(s.get, i)) for i in df1['col2']]
Result:
col1 col2
0 X [alpha]
1 Y [alpha, beta, gamma]
2 Z [gamma, alpha]
I'm not sure it's the most efficient way, but you can turn your DataFrame into a dict and then use apply to map the keys to the values:
Assuming your first DataFrame is df1 and the second is df2:
df_dict = dict(zip(df2['col21'], df2['col22']))
df3 = pd.DataFrame({"31":df1['col11'], "32": df1['col12'].apply(lambda x: [df_dict[y] for y in x])})
or as #jezrael suggested with nested list comprehension:
df3 = pd.DataFrame({"31":df1['col11'], "32": [[df_dict[y] for y in x] for x in df1['col12']]})
note: df3 has a default index
31 32
0 X [alpha]
1 Y [alpha, beta, gamma]
2 Z [gamma, alpha]
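Another option worth sketching (it assumes pandas 0.25+ for explode, and reuses df_dict and the col11/col12 names from above): explode the list column, map each value, and group back into lists:
out = (df1.explode('col12')
          .assign(col12=lambda d: d['col12'].map(df_dict))
          .groupby('col11', sort=False, as_index=False)['col12']
          .agg(list))
print(out)
#   col11                 col12
# 0     X               [alpha]
# 1     Y  [alpha, beta, gamma]
# 2     Z        [gamma, alpha]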
