How can I compare the lists in two columns of a DataFrame, identify which elements of one list are missing from the other, and create another column with the missing elements?
The dataframe looks something like this:
df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
'D': ['d1', 'd2', 'd3']})
I want to find the elements of column B that are not in column C and write them to column E. The desired output is:
df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
'D': ['d1', 'd2', 'd3'],
'E': ['b2', ['b1','b2'],'']})
Like your previous related question, you can use a list comprehension. As a general rule, you shouldn't force multiple different types of output, e.g. list or str, depending on result. Therefore, I have chosen lists throughout in this solution.
df['E'] = [list(set(x) - set(y)) for x, y in zip(df['B'], df['C'])]
print(df)
A B C D E
0 a1 [b1, b2] [c1, b1] d1 [b2]
1 a2 [b1, b2, b3] [b3] d2 [b1, b2]
2 a3 [b2] [b2, b1] d3 []
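Note that a set difference does not preserve the original order of the elements in B and collapses duplicates. If order matters, a minimal sketch along the following lines keeps every element of B that is absent from C, in B's order (the helper name missing_in_order and the variable lookup are illustrative, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
                   'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
                   'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
                   'D': ['d1', 'd2', 'd3']})

def missing_in_order(b, c):
    """Return the items of b that are not in c, keeping b's order."""
    lookup = set(c)  # set membership tests are fast
    return [item for item in b if item not in lookup]

df['E'] = [missing_in_order(b, c) for b, c in zip(df['B'], df['C'])]
print(df['E'].tolist())  # [['b2'], ['b1', 'b2'], []]
```

Every row still gets a list, so the same function can be applied again to column E.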
def Desintersection(i):
    Output = [b for b in df['B'][i] if b not in df['C'][i]]
    if(len(Output) == 0):
        return ''
    elif(len(Output) == 1):
        return Output[0]
    else:
        return Output
df['E'] = df.index.map(Desintersection)
df
As in my previous answer:
(df.B.map(set)-df.C.map(set)).map(list)
Out[112]:
0 [b2]
1 [b2, b1]
2 []
dtype: object
I agree with @jpp that you shouldn't mix types like this: if you try to apply the same function to the new E column, it will fail, because it expects each element to be a list.
This would work on E, as it converts single str values to [str] before comparison.
import pandas as pd
df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
'D': ['d1', 'd2', 'd3']})
def difference(df, A, B):
    elements_to_list = lambda x: [n if isinstance(n, list) else [n] for n in x]
    diff = [list(set(a).difference(set(b))) for a, b in zip(elements_to_list(df[A]), elements_to_list(df[B]))]
    diff = [d if d else "" for d in diff] # replace empty lists with empty strings
    return [d if len(d) != 1 else d[0] for d in diff] # return with single values extracted from the list
df['E'] = difference(df, "B", "C")
df['F'] = difference(df, "B", "E")
print(list(df['E']))
print(list(df['F']))
['b2', ['b2', 'b1'], '']
['b1', 'b3', 'b2']
I am trying to zip my dictionary into a pandas DataFrame, using the keys that are in the dictionary rather than writing them out manually:
import pandas as pd
dict = {'A': ['a1', 'a2', 'a3'], 'B': ['b1', 'b2', 'b3']}
columns = list(dict.keys()) # ['A', 'B']
manual_results = list(zip(dict['A'], dict['B'])) # [('a1', 'b1'), ('a2', 'b2'), ('a3', 'b3')]
df = pd.DataFrame(manual_results, columns=columns)
I wish to create the results without having to write out each key explicitly (dict['A'], dict['B'], etc.). Any ideas?
There is no need to zip it. Pandas can create a dataframe directly from a dict:
import pandas as pd
d = {'A': ['a1', 'a2', 'a3'], 'B': ['b1', 'b2', 'b3']}
df = pd.DataFrame.from_dict(d)
print(df)
A B
0 a1 b1
1 a2 b2
2 a3 b3
reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_dict.html
Note: You can also orient it the other way (so the dict keys become the row index instead of columns):
import pandas as pd
d = {'A': ['a1', 'a2', 'a3'], 'B': ['b1', 'b2', 'b3']}
df = pd.DataFrame.from_dict(d,orient='index')
print(df)
0 1 2
A a1 a2 a3
B b1 b2 b3
There is no need to use zip(), as pd.DataFrame natively accepts a dict for its data parameter; the dict can contain Series, arrays, lists, etc.
You can simply do as follows:
import pandas as pd
d = {'A': ['a1', 'a2', 'a3'], 'B': ['b1', 'b2', 'b3']}
df = pd.DataFrame(d)
Which outputs:
A B
0 a1 b1
1 a2 b2
2 a3 b3
I want to append lists of DataFrames to the inner lists of an existing list of lists:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
fr_list = [[] for x in range(2)]
fr_list[0].append(df1)
fr_list[0].append(df1)
fr_list[1].append(df1)
fr2 = [[] for x in range(2)]
fr2[0].append(df1)
fr2[1].append(df1)
fr_list.append(fr2) # <-- here is the problem
Output: fr_list = [[df1, df1], [df1], [fr2[0], fr2[1]]] List contains 3 elements
Expected: fr_list = [[df1, df1, fr2[0]],[df1, fr2[1]]] List contains 2 elements
fr_list=[a+b for a,b in zip(fr_list,fr2)]
Replace fr_list.append(fr2) with the code above.
Explanation: zip pairs up the corresponding inner lists of fr_list and fr2, and the list comprehension concatenates each pair. What you did was append the outer list fr2 to the outer list fr_list as a single element, rather than extending the inner lists.
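To make the difference concrete, here is a small sketch that uses strings as stand-ins for the DataFrames (the placeholder values 'x1', 'y1', and so on are assumptions for illustration only). It also shows an in-place variant using extend:

```python
# Strings stand in for DataFrames; the list mechanics are identical.
fr_list = [['x1', 'x2'], ['x3']]
fr2 = [['y1'], ['y2']]

# Appending the outer list adds ONE new element (a nested list).
appended = fr_list + [fr2]

# Concatenating corresponding inner lists keeps the outer length at 2.
merged = [a + b for a, b in zip(fr_list, fr2)]

# In-place alternative: extend each inner list of fr_list.
for inner, extra in zip(fr_list, fr2):
    inner.extend(extra)

print(len(appended), merged, fr_list == merged)
```

Keep in mind that zip stops at the shorter of the two lists, so both outer lists should have the same length.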
I'd like to remove from each list in column B the value found in column A of the same row. How can I do this?
Given:
df = pd.DataFrame({
'A': ['a1', 'a2', 'a3', 'a4'],
'B': [['a1', 'a2'], ['a1', 'a2', 'a3'], ['a1', 'a3'], []]
})
I want:
result = pd.DataFrame({
'A': ['a1', 'a2', 'a3', 'a4'],
'B': [['a1', 'a2'], ['a1', 'a2', 'a3'], ['a1', 'a3'], []],
'Output': [['a2'], ['a1', 'a3'], ['a1'], []]
})
One way of doing that is applying a filtering function to each row via DataFrame.apply:
df['Output'] = df.apply(lambda x: [i for i in x.B if i != x.A], axis=1)
Another solution using iterrows():
for i, value in df.iterrows():
    try:
        value['B'].remove(value['A'])
    except ValueError:
        pass
print(df)
Output:
A B
0 a1 [a2]
1 a2 [a1, a3]
2 a3 [a1]
3 a4 []
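Note that the iterrows() version mutates the lists in column B in place, and list.remove drops only the first matching occurrence. If B should stay untouched, a minimal sketch with zip and a list comprehension writes the filtered lists to a separate Output column, as in the desired result:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a1', 'a2', 'a3', 'a4'],
    'B': [['a1', 'a2'], ['a1', 'a2', 'a3'], ['a1', 'a3'], []]
})

# Build a new list per row; the original lists in B are not modified.
df['Output'] = [[item for item in b if item != a]
                for a, b in zip(df['A'], df['B'])]

print(df['Output'].tolist())  # [['a2'], ['a1', 'a3'], ['a1'], []]
```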
Having a dataframe which looks like this:
import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
I wonder how to rearrange the DataFrame when one column has a different order that I want to apply to all the others, for example after changing the order of column A as below?
df2 = pd.DataFrame({'A': ['A3', 'A0', 'A2', 'A1'],
'B': ['B3', 'B0', 'B2', 'B1'],
'C': ['C3', 'C0', 'C2', 'C1'],
'D': ['D3', 'D0', 'D2', 'D1']},
index=[0, 1, 2, 3])
You can use indexing via set_index, reindex and reset_index. Assumes your values in A are unique, which is the only case where such a transformation would make sense.
L = ['A3', 'A0', 'A2', 'A1']
res = df1.set_index('A').reindex(L).reset_index()
print(res)
A B C D
0 A3 B3 C3 D3
1 A0 B0 C0 D0
2 A2 B2 C2 D2
3 A1 B1 C1 D1
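One property of reindex worth knowing: labels that do not exist in A are not an error; they produce rows filled with NaN. The sketch below uses a made-up label 'A9' (an assumption for illustration) to show this:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

# 'A9' is a hypothetical label that is not present in df1['A'].
L = ['A3', 'A9', 'A0']
res = df1.set_index('A').reindex(L).reset_index()

print(res)
print(res['B'].isna().tolist())  # [False, True, False]
```

If a missing label should be treated as a mistake instead, check L against df1['A'] before reindexing.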
Did you mean to sort one specific row? If so, use:
df1.iloc[:1] = df1.iloc[:1].sort_index(axis=1,ascending=False)
print(df1)
For all columns use:
df1 = df1.sort_index(axis=0,ascending=False)
For specific columns, use iloc.
You can use the key parameter from the sorted function:
import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
key = {'A3': 0, 'A0': 1, 'A2' : 2, 'A1': 3}
df1['A'] = sorted(df1.A, key=lambda e: key.get(e, 4))
print(df1)
Output
A B C D
0 A3 B0 C0 D0
1 A0 B1 C1 D1
2 A2 B2 C2 D2
3 A1 B3 C3 D3
By changing the values of key, you can set whatever order you want.
UPDATE
If what you want is to alter the order of the other columns based on the new order of A, you could try something like this:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A3', 'A0', 'A2', 'A1'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
key = [df1.A.values.tolist().index(k) for k in df2.A]
df2.B = df2['B'][key].tolist()
print(df2)
Output
A B C D
0 A3 B3 C0 D0
1 A0 B0 C1 D1
2 A2 B2 C2 D2
3 A1 B1 C3 D3
To alter all the columns, just apply the above to each column. Something like this:
for column in df2.columns.values:
    if column != 'A':
        df2[column] = df2[column][key].tolist()
print(df2)
Output
A B C D
0 A3 B3 C3 D3
1 A0 B0 C0 D0
2 A2 B2 C2 D2
3 A1 B1 C1 D1
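As an alternative to looping over the columns, the same reordering can be sketched as a single left merge, which keeps the key order of the left frame. The one-column frame named order is an assumption introduced here for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

# The desired order of column A, held in a one-column frame.
order = pd.DataFrame({'A': ['A3', 'A0', 'A2', 'A1']})

# A left merge keeps the row order of `order` and pulls in the
# matching B, C and D values from df1.
res = order.merge(df1, on='A', how='left')
print(res)
```

As with the other approaches, this assumes the values in A are unique; duplicates in df1['A'] would produce extra rows.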
I have created a list that contains all the data in the CSV file.
How do I separately access the data in rows and columns?
For instance:
   a   b   c
1  a1  b1  c1
2  a2  b2  c2
How can I identify a single cell within the list?
Try the code below:
l = ['a', 'b', 'c','1','a1', 'b1', 'c1', '2', 'a2', 'b2','c2']
columns = 3
result = list(zip(*[iter(l[columns:])]*(columns+1)))
result2 = {i[0]:i[1:] for i in result}
item_id = '2'
result2[item_id]
output:
('a2', 'b2', 'c2')
Or you could try the code below:
l = ['a', 'b', 'c','1','a1', 'b1', 'c1', '2', 'a2', 'b2','c2']
columns = 3
item_id = '2'
index = l.index(item_id)  # position of the row label itself
l[index + 1:index + 1 + columns]  # skip the label, then take the row's cells
output:
['a2', 'b2', 'c2']
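If the data can be read as rows rather than as one flat list, addressing a single cell becomes simpler. The sketch below assumes the file looks like the table in the question, with an empty top-left header cell and the row label in the first column; the csv_text string is a stand-in for the real file:

```python
import csv
import io

# Stand-in for the real file; the layout is an assumption.
csv_text = ",a,b,c\n1,a1,b1,c1\n2,a2,b2,c2\n"

# Each CSV line becomes one list of cells.
rows = list(csv.reader(io.StringIO(csv_text)))

header = rows[0][1:]  # column names: ['a', 'b', 'c']

# Map row label -> {column name -> cell value}.
table = {row[0]: dict(zip(header, row[1:])) for row in rows[1:]}

print(rows[2][2])       # positional access: third line, third cell
print(table['2']['b'])  # label-based access: row '2', column 'b'
```

With a real file, replace io.StringIO(csv_text) with a file object opened with newline=''.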