I have been struggling with this for a few days now. I have read a lot online and found some similar questions, such as Pandas counting occurrence of list contained in column of lists or pandas: count string criteria across down rows, but neither fully works in this case.
I have two dataframes: df1 consists of a column of strings, and df2 consists of a column of lists (each list is a combination of the strings from df1, and the elements within one list are unique).
I would like to know how many lists in df2 contain each combination of strings. So, how many lists have "a" and "b" as elements? How many lists have "a" and "c" as elements, and so forth?
This is what df1 looks like (simplified):
df1 = pd.DataFrame({"subject": ["a", "b", "c"]})
df1
  subject
0       a
1       b
2       c
This is what df2 looks like (simplified):
df2 = pd.DataFrame({"subject_list": [["a", "b", "c"], ["b", "c"], ["a", "b"], ["b", "c"], ["c"]]})
df2
  subject_list
0    [a, b, c]
1       [b, c]
2       [a, b]
3       [b, c]
4          [c]
I have two pieces of code that both run but aren't quite right:
This code looks for the combination of two rows of df1 (as intended). However, df1 includes more rows than df2, so it stops at the last row of df2 even though there are still some string combinations left to test.
df1["combination_0"] = df2["subject_list"].apply(lambda x: x.count(x and df.subject[0]))
This code counts the occurrence of one "list". However, I can't figure out how to change it so that it does it for each value combination.
df1["list_a_b"] = df2["subject_list"].apply(lambda x: x.count(df1.subject[0] and df1.subject[1]))
df1.list_a_b.sum()
Here is one way to solve it.
Starting with the two dataframes that you have, you can use itertools to get all the possible combinations of the elements of df1 two by two:
import itertools
df1 = pd.DataFrame({"subject": ["a", "b", "c"]})
df2 = pd.DataFrame({"subject_list": [["a", "b", "c"], ["b", "c"], ["a", "b"], ["b", "c"], ["c"]]})
# Create a new dataframe with one column that has the possible two by two combinations from `df1`
df_combinations = pd.DataFrame({'combination': list(itertools.combinations(df1.subject, 2))})
Then loop through the new dataframe, df_combinations in this case, to find out how many times each combination occurs in df2:
for index, row in df_combinations.iterrows():
    df_combinations.at[index, "number of occurrences"] = (
        df2["subject_list"].apply(lambda x: all(i in x for i in row["combination"])).sum()
    )
The main difference with respect to your original solution is that I am not using x.count but rather all, since all guarantees that only rows where both values are present get counted.
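To see the difference concretely: Python's and returns its second operand, so x.count(a and b) only ever counts b. A quick illustration on a plain list:
lst = ["a", "b", "c"]
lst.count("a" and "b")             # 1 -- "a" and "b" evaluates to "b", so only "b" is counted
all(i in lst for i in ("a", "b"))  # True -- checks that both elements are present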
Finally df_combinations is:
combination number of occurrences
0 (a, b) 2.0
1 (a, c) 1.0
2 (b, c) 3.0
This problem is somewhat difficult because, depending on how many values you have, there can be a lot of pair-wise comparisons. I think you may want to create a dummy DataFrame with an indicator column for each value; then you can use .all to easily query whatever pair-wise combination you want. It's also easy to generalize if you later want combinations of any number of elements.
First create df_dummy, which indicates whether each value is contained in the list:
df_dummy = df2.subject_list.str.join(sep='?').str.get_dummies(sep='?')
# a b c
#0 1 1 1
#1 0 1 1
#2 1 1 0
#3 0 1 1
#4 0 0 1
Then create the list of all pair-wise combinations you need to check (ignoring order and pairs of the same value):
vals = df1.subject.unique()
combos = [(vals[j], vals[i]) for i in range(len(vals)) for j in range(len(vals)) if i > j]
# equivalently: combos = list(itertools.combinations(vals, 2))
print(combos)
#[('a', 'b'), ('a', 'c'), ('b', 'c')]
Now check for all pair-wise combinations:
for x, y in combos:
    df2[x + '_and_' + y] = df_dummy[[x, y]].all(axis=1)
df2 is:
subject_list a_and_b a_and_c b_and_c
0 [a, b, c] True True True
1 [b, c] False False True
2 [a, b] True False False
3 [b, c] False False True
4 [c] False False False
If you want the totals, just use sum, ignoring the first column:
df2[df2.columns[1:]].sum()
#a_and_b 2
#a_and_c 1
#b_and_c 3
#dtype: int64
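Since df_dummy has one indicator column per value, the same pattern extends to combinations of any size. A minimal sketch for triples, reusing vals and df_dummy from above:
from itertools import combinations
for combo in combinations(vals, 3):
    df2["_and_".join(combo)] = df_dummy[list(combo)].all(axis=1)
df2["a_and_b_and_c"].sum()
#1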
Here is my attempt to solve your problem.
There are two main steps:
generate all the possible lists to check from the values of df1
count how many rows in df2 contain each combination
Code:
import itertools
def all_in(elements, a_list):
    # Check if all values in `elements` are present in a_list
    return all(el in a_list for el in elements)
# All the (unique) values in df1
all_values = sorted(df1.subject.unique())
result = pd.Series(dtype=int)
# For each sequence length (1, 2, 3)
for length in range(1, len(all_values)+1):
    # For each sequence of fixed length
    for comb in itertools.combinations(all_values, length):
        # Count how many rows of df2 contain the sequence
        result["_".join(comb)] = df2.squeeze().apply(lambda x: all_in(comb, x)).sum()
which gives:
result
a 2
b 4
c 4
a_b 2
a_c 1
b_c 3
a_b_c 1
Depending on the size of the actual data and on your requirements, you could make things smarter. For example, if you know that 'a' is not in a row, then you can automatically assign False to any combination including 'a'.
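A minimal sketch of that pruning idea, reusing all_values and df2 from above: restrict each row to the values it actually contains, and enumerate only those combinations:
from collections import Counter
from itertools import combinations
counts = Counter()
for row in df2.squeeze():
    present = [v for v in all_values if v in row]  # absent values are skipped up front
    for length in range(1, len(present) + 1):
        for comb in combinations(present, length):
            counts["_".join(comb)] += 1
# counts matches the result above, e.g. counts["a_b"] == 2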
Here is a non-Pandas solution using collections.defaultdict and itertools.combinations. There are 2 parts to the logic:
Calculate all combinations from df1['subject'].
Iterate df2['subject_list'] and increment dictionary counts.
frozenset is used on purpose, since frozensets are hashable and reflect that, as in your question, order is not relevant.
from collections import defaultdict
from itertools import combinations
df1 = pd.DataFrame({"subject": ["a", "b", "c"]})
df2 = pd.DataFrame({"subject_list": [["a", "b" ,"c"], ["b", "c"], ["a", "b"], ["b", "c"], ["c"]]})
# calculate all combinations
combs = (frozenset(c) for i in range(1, len(df1.index) + 1)
         for c in combinations(df1['subject'], i))
# initialise defaultdict
d = defaultdict(int)
# iterate combinations and lists
for comb in combs:
    for lst in df2['subject_list']:
        if set(lst) >= comb:
            d[comb] += 1
print(d)
defaultdict(int,
{frozenset({'a'}): 2,
frozenset({'b'}): 4,
frozenset({'c'}): 4,
frozenset({'a', 'b'}): 2,
frozenset({'a', 'c'}): 1,
frozenset({'b', 'c'}): 3,
frozenset({'a', 'b', 'c'}): 1})
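If a pandas object is preferred for the result, the dictionary converts directly; a small sketch:
result = pd.Series({"_".join(sorted(k)): v for k, v in d.items()})
# a        2
# b        4
# c        4
# a_b      2
# a_c      1
# b_c      3
# a_b_c    1
# dtype: int64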
Here is yet another approach. The two main insights are as follows:
We can start by intersecting each list in df2 with the values of df1. This way we avoid considering redundant subsets of each row of df2.
After step 1, df2 may contain duplicated sets. Collapsing the duplicates may speed up the remaining computation.
The remaining task is to consider every subset of df1 and count the number of occurrences.
import pandas as pd
import numpy as np
from itertools import combinations
from collections import Counter
df1 = pd.DataFrame({"subject": ["a", "b", "c"]})
df2 = pd.DataFrame(
{
"subject_list": [
["a", "b", "c", "x", "y", "z", "1", "2", "3"],
["b", "c"],
["a", "b"],
["b", "c"],
["c"],
]
}
)
s1 = set(df1.subject.values)
def all_combs(xs):
    for k in range(1, len(xs) + 1):
        yield from combinations(xs, k)

def count_combs(xs):
    return Counter(all_combs(xs))
res = (
    df2.subject_list.apply(s1.intersection)
    .apply(frozenset)
    .value_counts()
    .reset_index()
)
# note: the column names used below ("index", "subject_list") assume pandas < 2.0;
# in pandas >= 2.0 reset_index yields "subject_list" and "count" instead
# (b, c) 2
# (c, b, a) 1
# (c) 1
# (b, a) 1
res2 = res["index"].apply(df1.subject.isin).mul(res.subject_list, axis=0)
res2.columns = df1.subject
# subject a b c
# 0 0 2 2
# 1 1 1 1
# 2 0 0 1
# 3 1 1 0
res3 = pd.Series(
{
"_".join(comb): res2[comb][(res2[comb] > 0).all(1)].sum(0).iloc[0]
for comb in map(list, all_combs(df1.subject.values))
}
)
# a 2
# b 4
# c 4
# a_b 2
# a_c 1
# b_c 3
# a_b_c 1
# dtype: int64
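As a side note, the all_combs helper also suggests a simpler final step: expand each deduplicated set from res into all of its subsets, weighted by its count (a sketch, assuming the pre-2.0 "index"/"subject_list" column names used above):
total = Counter()
for fs, n in zip(res["index"], res["subject_list"]):
    for comb in all_combs(sorted(fs)):
        total["_".join(comb)] += n
# Counter({'b': 4, 'c': 4, 'b_c': 3, 'a': 2, 'a_b': 2, 'a_c': 1, 'a_b_c': 1})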
Given a DataFrame, how can I add a new level to the columns based on an iterable given by the user? In other words, how do I append a new level?
The question How to simply add a column level to a pandas dataframe shows how to add a new level given a single value, so it doesn't cover this case.
Here is the expected behaviour:
>>> df = pd.DataFrame(0, columns=["A", "B"], index=range(2))
>>> df
A B
0 0 0
1 0 0
>>> append_level(df, ["C", "D"])
A B
C D
0 0 0
1 0 0
The solution should also work with MultiIndex columns, so
>>> append_level(append_level(df, ["C", "D"]), ["E", "F"])
A B
C D
E F
0 0 0
1 0 0
If the columns are not a MultiIndex, you can just do:
df.columns = pd.MultiIndex.from_arrays([df.columns.tolist(), ['C','D']])
If it's a MultiIndex:
if isinstance(df.columns, pd.MultiIndex):
    df.columns = pd.MultiIndex.from_arrays(
        [df.columns.get_level_values(i) for i in range(df.columns.nlevels)] + [['E', 'F']]
    )
Here get_level_values is used rather than df.columns.levels: levels holds the unique values of each level and may not line up with the actual column order, whereas get_level_values returns the labels as they appear.
def append_level(df, new_level):
    new_df = df.copy()
    # get_level_values works for both flat and MultiIndex columns
    levels = [df.columns.get_level_values(i) for i in range(df.columns.nlevels)]
    new_df.columns = pd.MultiIndex.from_arrays(levels + [list(new_level)])
    return new_df
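A quick check against the behaviour asked for in the question:
df = pd.DataFrame(0, columns=["A", "B"], index=range(2))
print(append_level(df, ["C", "D"]))
#    A  B
#    C  D
# 0  0  0
# 1  0  0
print(append_level(append_level(df, ["C", "D"]), ["E", "F"]))
#    A  B
#    C  D
#    E  F
# 0  0  0
# 1  0  0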
Let's say I have a DataFrame indexed on a unique Code. Each entry may inherit from another (unique) entry: the parent's Code is given in the Herit column.
I need a new column giving the list of children for every entry. I can obtain it for a given Code, but I can't manage to set up the whole column at once.
Here is my M(non)WE:
import pandas as pd
data = pd.DataFrame({
"Code": ["a", "aa", "ab", "b", "ba", "c"],
"Herit": ["", "a", "a", "", "b", ""],
"C": [12, 15, 13, 12, 14, 10]
}
)
data.set_index("Code", inplace=True)
print(data)
child_a = data[data.Herit == "a"].index.values
print(child_a)
data["child"] = data.apply(lambda x: data[data.Herit == x.index].index.values, axis=1)
print(data)
You can group by the Herit column and aggregate the corresponding Codes into lists. This operates on the frame before Code is set as the index (call it df, i.e. df = data.reset_index()):
>>> herits = df.groupby("Herit").Code.agg(list)
>>> herits
Herit
[a, b, c]
a [aa, ab]
b [ba]
Then you can map the Code column of your frame with this, assign the result to a new column, and fill the slots that don't have any children with "":
>>> df["Children"] = df.Code.map(herits).fillna("")
>>> df
Code Herit C Children
0 a 12 [aa, ab]
1 aa a 15
2 ab a 13
3 b 12 [ba]
4 ba b 14
5 c 10
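For completeness, a sketch of the same idea when Code has already been set as the index, as in the question's MWE:
herits = data.reset_index().groupby("Herit").Code.agg(list)
data["child"] = data.index.to_series().map(herits).fillna("")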
Given the following data:
data = pd.DataFrame(
{
"A": ["a", "a", "b", "b"],
"B": ["x", "y", "p", "q"],
"C": ["one", "two", "one", "two"],
}
)
which looks like:
A B C
0 a x one
1 a y two
2 b p one
3 b q two
I would like to create the following:
data_out = pd.DataFrame(
{
"A": ["a", "b"],
"one": ["x", "p"],
"two": ["y", "q"],
}
)
which looks like:
A one two
0 a x y
1 b p q
I'm aware that I could do something along the lines of:
d_piv = pd.pivot_table(
data,
index=["A"],
columns=["C"],
values=["B"],
aggfunc=lambda x: x,
).reset_index()
which gives:
A B
C one two
0 a x y
1 b p q
from which the columns could be cleaned up, but I'm wondering how I'd go about solving this using melt and unstack?
I have tried:
print(data.set_index("C", append=True).unstack())
which gives:
A B
C one two one two
0 a NaN x NaN
1 NaN a NaN y
2 b NaN p NaN
3 NaN b NaN q
The NaN values aren't wanted here, so I could instead try:
data.index = [0, 0, 1, 1]
data.set_index(["A", "C"], append=True).unstack(-1).reset_index(level=-1)
which gives:
A B
C one two
0 a x y
1 b p q
So that's closer, but it still feels as though there are some unnecessary bits there, particularly hard-coding the index like that.
Edit
The suggested solution:
df.set_index('A').pivot(columns='C', values='B').reset_index().rename_axis(None, axis=1)
is good, but I am wondering whether unstack can be used here instead of pivot?
First, set the A column as the index, then use df.pivot. To get the exact output, we have to reset the index and rename the axis:
(df.set_index("A").pivot(columns="C", values="B")
.reset_index()
.rename_axis(None, axis=1))
A one two
0 a x y
1 b p q
Using df.unstack:
df.set_index(["A", "C"])["B"].unstack().reset_index().rename_axis(None, axis=1)
A one two
0 a x y
1 b p q
This seems like it should be a simple question about how to create a new dataframe by selecting specific columns from other dataframes.
Let's illustrate it with three dummy dataframes df1, df2, df3, where "Position" is a common column in all of them:
df1 = pd.DataFrame({"Position": ["A", "B", "C"], "Team1": ["xyz", "xyy", "xxy"],"Team2": ["xxz", "yyx", "yxy"],"Team3": ["xzy", "zzy", "zxz"]})
df2 = pd.DataFrame({"Position": ["A", "B", "C"],"T1": ["1", "2", "4"],"T2": ["3", "5", "2"],"T3":["2","1","4"] }, index=[0, 1, 2], )
df3 = pd.DataFrame({"Position": ["A", "B", "C"],"T_1": ["IN", "IN", "OUT"],"T2": ["IN", "OUT", "OUT"],"T3":["OUT","IN","IN"] }, index=[0, 1, 2], )
I need to create, in this instance, three dataframes where I merge Team1, T1 and T_1 on "Position". Now the catch: I do not know how many teams there are. df1, df2 and df3 will all have the same number of teams, but that number can vary (here I made it three, but in the actual scenario it can be any N). Can some iteration be performed to create the dataframes based on the number of teams?
Here is a graphical way of looking at the inputs (the teams are fixed for this example, but in reality their number is variable) and the expected output.
You could just concat the relevant columns horizontally:
new_dfs = [
    pd.concat([df.set_index('Position').iloc[:, i] for df in (df1, df2, df3)], axis=1).reset_index()
    for i in range(3)
]
It gives:
for i in new_dfs:
    print(i)
Position Team1 T1 T_1
0 A xyz 1 IN
1 B xyy 2 IN
2 C xxy 4 OUT
Position Team2 T2 T2
0 A xxz 3 IN
1 B yyx 5 OUT
2 C yxy 2 OUT
Position Team3 T3 T3
0 A xzy 2 OUT
1 B zzy 1 IN
2 C zxz 4 IN
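Since the number of teams is variable, the range can be derived from the frames themselves instead of being hard-coded. A sketch, assuming every frame has the same number of team columns after Position is set as the index:
dfs = [df.set_index("Position") for df in (df1, df2, df3)]
n_teams = dfs[0].shape[1]  # number of team columns, identical in every frame
new_dfs = [
    pd.concat([d.iloc[:, i] for d in dfs], axis=1).reset_index()
    for i in range(n_teams)
]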
Note: my question isn't this one, but something a little more subtle.
Say I have a dataframe that looks like this
df =
A B C
0 3 3 1
1 2 1 9
df[["A", "B", "D"]] will raise a KeyError.
Is there a pandas way to let df[["A", "B", "D"]] == df[["A", "B"]]? (I.e., just select the columns that exist.)
One solution might be
good_columns = list(set(df.columns).intersection(["A", "B", "D"]))
mydf = df[good_columns]
But this has two problems:
It's clunky and inelegant.
The ordering of mydf.columns could be ["A", "B"] or ["B", "A"].
You can use filter, which will just ignore any extra keys:
df.filter(["A","B","D"])
A B
0 3 3
1 2 1
You can use a conditional list comprehension:
target_cols = ['A', 'B', 'D']
df[[c for c in target_cols if c in df]]
A B
0 3 3
1 2 1
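Another small sketch using Index.isin; note that the resulting column order follows df.columns rather than the target list:
df.loc[:, df.columns.isin(["A", "B", "D"])]
#    A  B
# 0  3  3
# 1  2  1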