Creating new dataframe by selecting specific columns from other dataframe - python

This seems like it should be a simple question: how do I create a new dataframe by selecting specific columns from other dataframes?
Let's illustrate with three dummy dataframes df1, df2, df3, where "Position" is a column common to all of them:
import pandas as pd

df1 = pd.DataFrame({"Position": ["A", "B", "C"],
                    "Team1": ["xyz", "xyy", "xxy"],
                    "Team2": ["xxz", "yyx", "yxy"],
                    "Team3": ["xzy", "zzy", "zxz"]})
df2 = pd.DataFrame({"Position": ["A", "B", "C"],
                    "T1": ["1", "2", "4"],
                    "T2": ["3", "5", "2"],
                    "T3": ["2", "1", "4"]})
df3 = pd.DataFrame({"Position": ["A", "B", "C"],
                    "T_1": ["IN", "IN", "OUT"],
                    "T2": ["IN", "OUT", "OUT"],
                    "T3": ["OUT", "IN", "IN"]})
I need to create, in this instance, three dataframes where Team1, T1, and T_1 are merged on "Position". The catch is that I do not know how many teams there are: df1, df2, and df3 will all have the same number of teams, but that number can vary (here I made it three, but in the actual scenario it can be any N). Can some iteration be performed to build the dataframes based on the number of teams?
Here is a graphical way of looking at the inputs (the teams are fixed for this example, but in reality their number is variable) and the expected output.

You could just concatenate the relevant columns horizontally:
new_dfs = [pd.concat((df.set_index('Position').iloc[:, i] for df in (df1, df2, df3)),
                     axis=1).reset_index()
           for i in range(3)]
It gives:
for i in new_dfs:
    print(i)
Position Team1 T1 T_1
0 A xyz 1 IN
1 B xyy 2 IN
2 C xxy 4 OUT
Position Team2 T2 T2
0 A xxz 3 IN
1 B yyx 5 OUT
2 C yxy 2 OUT
Position Team3 T3 T3
0 A xzy 2 OUT
1 B zzy 1 IN
2 C zxz 4 IN
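If the number of teams is variable, the range can be derived from the column count instead of being hard-coded. A minimal sketch, assuming every frame has "Position" plus the same number of team columns in matching order:
frames = (df1, df2, df3)
n_teams = df1.shape[1] - 1  # every column except "Position"
new_dfs = [
    pd.concat((df.set_index("Position").iloc[:, i] for df in frames), axis=1).reset_index()
    for i in range(n_teams)
]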

Related

How to repeat a row n times inside iterrows

For each row of a dataframe I want to repeat the row n times, inside iterrows, in a new dataframe. Basically I'm doing this:
df = pd.DataFrame(
    [
        ("abcd", "abcd", "abcd")  # create your data here, be consistent in the types.
    ],
    ["A", "B", "C"]  # add your column names here
)
n_times = 2
for index, row in df.iterrows():
    new_df = row.loc[row.index.repeat(n_times)]
new_df
and I get the following output:
0 abcd
0 abcd
1 abcd
1 abcd
2 abcd
2 abcd
Name: C, dtype: object
while it should be:
A B C
0 abcd abcd abcd
1 abcd abcd abcd
How should I proceed to get the desired output?
The df.T attribute in Pandas is used to transpose a DataFrame. Transposing a DataFrame means to flip its rows and columns, so that the rows become columns and the columns become rows.
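For illustration, a tiny sketch of the transpose, using the question's one-row frame defined correctly:
import pandas as pd

df = pd.DataFrame([["abcd", "abcd", "abcd"]], columns=["A", "B", "C"])
print(df.T)
#       0
# A  abcd
# B  abcd
# C  abcd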
I don't think you defined your df the right way.
df = pd.DataFrame(data=[["abcd", "abcd", "abcd"]],
                  columns=["A", "B", "C"])
n_times = 2
# repeat the one-row frame n_times and renumber the index
new_df = pd.concat([df] * n_times, ignore_index=True)
Is that how it should look?
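As an aside, a loop-free idiom for repeating every row n times (a sketch, not part of the original answers) is to repeat the index labels and select with .loc:
new_df = df.loc[df.index.repeat(n_times)].reset_index(drop=True)
#       A     B     C
# 0  abcd  abcd  abcd
# 1  abcd  abcd  abcd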

Add empty row after column that contains specific text

I have a large dataframe where I need to add an empty row after any instance where colA contains a colon.
To be honest I have absolutely no clue how to do this; my guess is that a function or for loop needs to be written, but I have had no luck...
I think you are looking for this. Say you have a dataframe like this:
df = pd.DataFrame({"cola": ["a", "b", ":", "c", "d", ":", "e"]})
# wherever ":" appears in column "cola", append a new empty row after it
idx = [0] + (df[df.cola.str.match(':')].index + 1).tolist()
df1 = pd.DataFrame()
for i in range(len(idx) - 1):
    df1 = pd.concat([df1, df.iloc[idx[i]:idx[i + 1]]], ignore_index=True)
    df1.loc[len(df1)] = ""
df1 = pd.concat([df1, df.iloc[idx[-1]:]], ignore_index=True)
print(df1)
# df1 is your result dataframe; this also handles the case where the colon is in the last row
Resultant dataframe
cola
0 a
1 b
2 :
3
4 c
5 d
6 :
7
8 e
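A more vectorized alternative (a sketch, assuming the blank rows should hold empty strings): give each blank row a fractional index just after its colon row, then sort and renumber.
mask = df["cola"].str.contains(":")
# one empty row at position +0.5 after each matching row
blanks = pd.DataFrame({"cola": ""}, index=df[mask].index + 0.5)
out = pd.concat([df, blanks]).sort_index().reset_index(drop=True)
print(out)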

Pandas new col with indexes of rows sharing a code in another col

Say I have a DataFrame indexed on a unique Code. Each entry may inherit from another (unique) entry: the parent's Code is given in the Herit column.
I need a new column giving the list of children for every entry. I can obtain it for a given Code, but I can't manage to build the whole column.
Here is my M(non)WE:
import pandas as pd

data = pd.DataFrame({
    "Code": ["a", "aa", "ab", "b", "ba", "c"],
    "Herit": ["", "a", "a", "", "b", ""],
    "C": [12, 15, 13, 12, 14, 10],
})
data.set_index("Code", inplace=True)
print(data)

child_a = data[data.Herit == "a"].index.values
print(child_a)

# this is the attempt that fails: x.index is the row's column labels, not its Code
data["child"] = data.apply(lambda x: data[data.Herit == x.index].index.values, axis=1)
print(data)
You can group by the Herit column and then reduce the corresponding Codes into lists:
>>> herits = df.groupby("Herit").Code.agg(list)
>>> herits
Herit
     [a, b, c]
a     [aa, ab]
b         [ba]
Name: Code, dtype: object
Then you can map the Code column of your frame with this, assign it to a new column, and fill the slots that don't have any children with "":
>>> df["Children"] = df.Code.map(herits).fillna("")
>>> df
Code Herit C Children
0 a 12 [aa, ab]
1 aa a 15
2 ab a 13
3 b 12 [ba]
4 ba b 14
5 c 10
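Putting it together, a minimal end-to-end sketch (assuming Code is kept as a regular column rather than set as the index, as in the question):
df = pd.DataFrame({
    "Code": ["a", "aa", "ab", "b", "ba", "c"],
    "Herit": ["", "a", "a", "", "b", ""],
    "C": [12, 15, 13, 12, 14, 10],
})
herits = df.groupby("Herit").Code.agg(list)
df["Children"] = df.Code.map(herits).fillna("")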

Pivot data in pandas using melt and unstack [duplicate]

This question already has answers here: How can I pivot a dataframe?
Given the following data:
data = pd.DataFrame(
    {
        "A": ["a", "a", "b", "b"],
        "B": ["x", "y", "p", "q"],
        "C": ["one", "two", "one", "two"],
    }
)
which looks like:
A B C
0 a x one
1 a y two
2 b p one
3 b q two
I would like to create the following:
data_out = pd.DataFrame(
    {
        "A": ["a", "b"],
        "one": ["x", "p"],
        "two": ["y", "q"],
    }
)
which looks like:
A one two
0 a x y
1 b p q
I'm aware that I could do something along the lines of:
d_piv = pd.pivot_table(
    data,
    index=["A"],
    columns=["C"],
    values=["B"],
    aggfunc=lambda x: x,
).reset_index()
which gives:
A B
C one two
0 a x y
1 b p q
from which the columns could be cleaned up, but I'm wondering how I'd go about solving this using melt and unstack?
I have tried:
print(data.set_index("C", append=True).unstack())
which gives:
A B
C one two one two
0 a NaN x NaN
1 NaN a NaN y
2 b NaN p NaN
3 NaN b NaN q
The NaN values aren't wanted here, so I could instead try:
data.index = [0, 0, 1, 1]
data.set_index(["A", "C"], append=True).unstack(-1).reset_index(level=-1)
which gives:
A B
C one two
0 a x y
1 b p q
So that's closer, but it still feels as though there are some unnecessary bits, particularly hard-coding the index like that.
Edit
The solution
df.set_index('A').pivot(columns='C', values='B').reset_index().rename_axis(None, axis=1)
is good, but I am wondering whether unstack can be used here instead of pivot?
First, set column A as the index, then use df.pivot. To get the exact output we have to reset the index and rename the axis.
(df.set_index("A").pivot(columns="C", values="B")
.reset_index()
.rename_axis(None, axis=1))
A one two
0 a x y
1 b p q
Using df.unstack
df.set_index(["A", "C"])["B"].unstack().reset_index().rename_axis(None, axis=1)
A one two
0 a x y
1 b p q
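As for melt: it goes the other way (wide to long), so it isn't needed to build the wide result, but it can serve as a sanity check. A sketch, calling the wide frame above out: melting it recovers the original long data.
out = df.set_index(["A", "C"])["B"].unstack().reset_index().rename_axis(None, axis=1)
# melt undoes the unstack: back to one (A, C, B) row per observation
out.melt(id_vars="A", var_name="C", value_name="B")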

Count occurrence of two elements in column of list

I have been struggling with this for a few days now. I read a lot online and found some similar questions, such as Pandas counting occurrence of list contained in column of lists or pandas: count string criteria across down rows, but neither fully works in this case.
I have two dataframes: df1 consists of a column of strings; df2 consists of a column of lists (the lists are combinations of the strings from df1, and each element within one list is unique).
I would like to know how many lists in df2 contain each combination of strings. So, how many lists have "a" and "b" as elements? How many have "a" and "c"? And so forth.
This is what df1 looks like (simplified):
df1 = pd.DataFrame({"subject": ["a", "b", "c"]})
df1
  subject
0       a
1       b
2       c
This is what df2 looks like (simplified):
df2 = pd.DataFrame({"subject_list": [["a", "b", "c"], ["b", "c"], ["a", "b"], ["b", "c"], ["c"]]})
df2
  subject_list
0  [a, b, c]
1     [b, c]
2     [a, b]
3     [b, c]
4        [c]
I have two pieces of code that both work but aren't quite right:
This code looks for the combination of two rows in df1 (as wanted). However, df1 includes more rows than df2, so it stops at the last row of df2 while there are still some string combinations left to test.
df1["combination_0"] = df2["subject_list"].apply(lambda x: x.count(x and df1.subject[0]))
This code counts the occurrences of one "list". However, I can't figure out how to change it so that it does this for each combination of values.
df1["list_a_b"] = df2["subject_list"].apply(lambda x: x.count(df1.subject[0] and df1.subject[1]))
df1.list_a_b.sum()
Here is the solution I attempted.
Starting with the two dataframes that you have, you can use itertools to get all the possible combinations of the elements of df1 two by two:
import itertools
import pandas as pd

df1 = pd.DataFrame({"subject": ["a", "b", "c"]})
df2 = pd.DataFrame({"subject_list": [["a", "b", "c"], ["b", "c"], ["a", "b"], ["b", "c"], ["c"]]})

# Create a new dataframe with one column holding the possible two-by-two combinations from df1
df_combinations = pd.DataFrame({'combination': list(itertools.combinations(df1.subject, 2))})
Then loop through the new dataframe, df_combinations in this case, to find out how many times each combination occurs in df2:
for index, row in df_combinations.iterrows():
    df_combinations.at[index, "number of occurrences"] = df2["subject_list"].apply(
        lambda x: all(i in x for i in row['combination'])
    ).sum()
The main difference from your original solution is that I am not using x.count but rather all, since it guarantees that only instances where both values are present are counted.
Finally df_combinations is:
combination number of occurrences
0 (a, b) 2.0
1 (a, c) 1.0
2 (b, c) 3.0
This problem is somewhat difficult because depending upon how many values you have, there can be a lot of pair-wise comparisons. I think you may want to create a dummy df with dummies for each value, and then you can use .all to easily query whatever pair-wise combination you want. It's also easy to generalize if you then want combinations of any number of elements.
First create df_dummy, which indicates whether each value is contained within the list:
df_dummy = df2.subject_list.str.join(sep='?').str.get_dummies(sep='?')
# a b c
#0 1 1 1
#1 0 1 1
#2 1 1 0
#3 0 1 1
#4 0 0 1
Then create the list of all the pair-wise combinations you need to check (ignoring order and identical pairs):
vals = df1.subject.unique()
combos = list((vals[j], vals[i]) for i in range(len(vals)) for j in range(len(vals)) if i>j)
print(combos)
#[('a', 'b'), ('a', 'c'), ('b', 'c')]
Now check for all pair-wise combinations:
for x, y in combos:
    df2[x + '_and_' + y] = df_dummy[[x, y]].all(axis=1)
df2 is:
subject_list a_and_b a_and_c b_and_c
0 [a, b, c] True True True
1 [b, c] False False True
2 [a, b] True False False
3 [b, c] False False True
4 [c] False False False
If you want to count the total, then just use sum, ignoring the first column
df2[df2.columns[1:]].sum()
#a_and_b 2
#a_and_c 1
#b_and_c 3
#dtype: int64
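To generalize to combinations of any length, the same .all trick works over any subset of the dummy columns. A sketch, reusing df_dummy from above:
from itertools import combinations

counts = {
    "_".join(combo): df_dummy[list(combo)].all(axis=1).sum()
    for k in range(1, len(df_dummy.columns) + 1)
    for combo in combinations(df_dummy.columns, k)
}
# {'a': 2, 'b': 4, 'c': 4, 'a_b': 2, 'a_c': 1, 'b_c': 3, 'a_b_c': 1}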
Here is my attempt to solve your problem.
There are two main steps:
generate all the possible lists to check from the values of df1
count how many rows in df2 contain each combination
Code:
import itertools

def all_in(elements, a_list):
    # Check whether all values in `elements` are present in `a_list`
    return all(el in a_list for el in elements)

# All the (unique) values in df1
all_values = sorted(set(df1.sum()['subject']))

result = pd.Series(dtype=int)
# For each sequence length (1, 2, 3)
for length in range(1, len(all_values) + 1):
    # For each sequence of fixed length
    for comb in itertools.combinations(all_values, length):
        # Count how many rows of df2 contain the sequence
        result["_".join(comb)] = df2.squeeze().apply(lambda x: all_in(comb, x)).sum()
which gives:
result
a 2
b 4
c 4
a_b 2
a_c 1
b_c 3
a_b_c 1
Depending on the size of the actual data and on your requirements, you could make things smarter. For example, if you know that 'a' is not in a row, then you can automatically assign False to any combination including 'a'. A rough sketch of that idea follows.
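A rough sketch of the pruning idea (assuming all_values and df2 from above): enumerate only the subsets of the values actually present in each row, so combinations with absent values are never tested.
from collections import Counter
from itertools import combinations

tally = Counter()
for lst in df2.squeeze():
    present = sorted(set(lst) & set(all_values))
    for length in range(1, len(present) + 1):
        tally.update("_".join(c) for c in combinations(present, length))
# Counter({'b': 4, 'c': 4, 'b_c': 3, 'a': 2, 'a_b': 2, 'a_c': 1, 'a_b_c': 1})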
Here is a non-Pandas solution using collections.defaultdict and itertools.combinations. There are 2 parts to the logic:
Calculate all combinations from df1['subject'].
Iterate df2['subject_list'] and increment dictionary counts.
frozenset is used on purpose, since frozensets are hashable and indicate, as in your question, that order is not relevant.
from collections import defaultdict
from itertools import combinations

df1 = pd.DataFrame({"subject": ["a", "b", "c"]})
df2 = pd.DataFrame({"subject_list": [["a", "b", "c"], ["b", "c"], ["a", "b"], ["b", "c"], ["c"]]})

# calculate all combinations
combs = (frozenset(c) for i in range(1, len(df1.index) + 1)
         for c in combinations(df1['subject'], i))

# initialise defaultdict
d = defaultdict(int)

# iterate combinations and lists
for comb in combs:
    for lst in df2['subject_list']:
        if set(lst) >= comb:
            d[comb] += 1

print(d)
defaultdict(int,
{frozenset({'a'}): 2,
frozenset({'b'}): 4,
frozenset({'c'}): 4,
frozenset({'a', 'b'}): 2,
frozenset({'a', 'c'}): 1,
frozenset({'b', 'c'}): 3,
frozenset({'a', 'b', 'c'}): 1})
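If a pandas object is preferred at the end, the dictionary converts directly (a small usage sketch):
result = pd.Series({"_".join(sorted(k)): v for k, v in d.items()})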
Here is yet another approach. The two main insights are as follows:
We can start by intersecting each list in df2 with values of df1. This way we can avoid considering redundant subsets of each row of df2.
After step 1, df2 may contain duplicated sets. Collecting the duplicates may speed up the remaining computation.
The remaining task is to consider every subset of df1 and count the number of occurrences.
import pandas as pd
import numpy as np
from itertools import combinations
from collections import Counter
df1 = pd.DataFrame({"subject": ["a", "b", "c"]})
df2 = pd.DataFrame(
    {
        "subject_list": [
            ["a", "b", "c", "x", "y", "z", "1", "2", "3"],
            ["b", "c"],
            ["a", "b"],
            ["b", "c"],
            ["c"],
        ]
    }
)

s1 = set(df1.subject.values)

def all_combs(xs):
    for k in range(1, len(xs) + 1):
        yield from combinations(xs, k)

def count_combs(xs):
    return Counter(all_combs(xs))
res = (
    df2.subject_list.apply(s1.intersection)
    .apply(frozenset)
    .value_counts()
    .reset_index()
)
# (b, c)       2
# (c, b, a)    1
# (c)          1
# (b, a)       1

# note: this assumes the older reset_index() column names ("index", "subject_list");
# recent pandas versions name the columns after the Series and "count" instead
res2 = res["index"].apply(df1.subject.isin).mul(res.subject_list, axis=0)
res2.columns = df1.subject
# subject  a  b  c
# 0        0  2  2
# 1        1  1  1
# 2        0  0  1
# 3        1  1  0
res3 = pd.Series(
    {
        "_".join(comb): res2[comb][(res2[comb] > 0).all(1)].sum(0).iloc[0]
        for comb in map(list, all_combs(df1.subject.values))
    }
)
# a 2
# b 4
# c 4
# a_b 2
# a_c 1
# b_c 3
# a_b_c 1
# dtype: int64
