Python Pandas: Select data, ignoring KeyErrors

Note: my question isn't this one, but something a little more subtle.
Say I have a dataframe that looks like this
df =
A B C
0 3 3 1
1 2 1 9
df[["A", "B", "D"]] will raise a KeyError.
Is there a pandas way to make df[["A", "B", "D"]] behave like df[["A", "B"]]? (I.e., just select the columns that exist.)
One solution might be
good_columns = list(set(df.columns).intersection(["A", "B", "D"]))
mydf = df[good_columns]
But this has two problems:
It's clunky and inelegant.
The ordering of mydf.columns could be ["A", "B"] or ["B", "A"].

You can use filter; this will just ignore any extra keys:
df.filter(["A","B","D"])
A B
0 3 3
1 2 1
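filter also accepts an axis argument, so the same trick works for row labels; a small sketch (the label 5 doesn't exist and is simply ignored):
df.filter([0, 5], axis=0)
A B C
0 3 3 1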

You can use a conditional list comprehension:
target_cols = ['A', 'B', 'D']
df[[c for c in target_cols if c in df]]
A B
0 3 3
1 2 1
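If you prefer something closer to the set-intersection idea from the question, Index.intersection avoids both problems, since with sort=False it keeps the columns in df's own order; a small sketch:
good_columns = df.columns.intersection(["A", "B", "D"], sort=False)
mydf = df[good_columns]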

Related

Append new level to DataFrame columns

Given a DataFrame, how can I add a new level to the columns based on an iterable given by the user? In other words, how do I append a new level?
The question How to simply add a column level to a pandas dataframe shows how to add a new level given a single value, so it doesn't cover this case.
Here is the expected behaviour:
>>> df = pd.DataFrame(0, columns=["A", "B"], index=range(2))
>>> df
A B
0 0 0
1 0 0
>>> append_level(df, ["C", "D"])
A B
C D
0 0 0
1 0 0
The solution should also work with MultiIndex columns, so
>>> append_level(append_level(df, ["C", "D"]), ["E", "F"])
A B
C D
E F
0 0 0
1 0 0
If the columns are not a MultiIndex, you can just do:
df.columns = pd.MultiIndex.from_arrays([df.columns.tolist(), ['C','D']])
If it's already a MultiIndex:
if isinstance(df.columns, pd.MultiIndex):
    df.columns = pd.MultiIndex.from_arrays([*df.columns.levels, ['E', 'F']])
pd.MultiIndex.levels gives a FrozenList of level values, which you need to unpack to form the list of lists that from_arrays expects.
def append_level(df, new_level):
    new_df = df.copy()
    # Treat flat columns as 1-tuples so this also works for column names
    # longer than one character (zipping plain strings would split them
    # into characters).
    cols = df.columns if isinstance(df.columns, pd.MultiIndex) else [(c,) for c in df.columns]
    new_df.columns = pd.MultiIndex.from_tuples([(*c, lvl) for c, lvl in zip(cols, new_level)])
    return new_df
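A quick sanity check of this function against the example from the question (a hypothetical session; comments show the resulting column tuples):
import pandas as pd

df = pd.DataFrame(0, columns=["A", "B"], index=range(2))
df2 = append_level(df, ["C", "D"])    # columns: ('A', 'C'), ('B', 'D')
df3 = append_level(df2, ["E", "F"])   # columns: ('A', 'C', 'E'), ('B', 'D', 'F')
print(df3.columns.tolist())           # [('A', 'C', 'E'), ('B', 'D', 'F')]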

Pivot data in pandas using melt and unstack [duplicate]

Given the following data:
data = pd.DataFrame(
{
"A": ["a", "a", "b", "b"],
"B": ["x", "y", "p", "q"],
"C": ["one", "two", "one", "two"],
}
)
which looks as:
A B C
0 a x one
1 a y two
2 b p one
3 b q two
I would like to create the following:
data_out = pd.DataFrame(
{
"A": ["a", "b"],
"one": ["x", "p"],
"two": ["y", "q"],
}
)
which looks as:
A one two
0 a x y
1 b p q
I'm aware that I could do something along the lines of:
d_piv = pd.pivot_table(
data,
index=["A"],
columns=["C"],
values=["B"],
aggfunc=lambda x: x,
).reset_index()
which gives:
A B
C one two
0 a x y
1 b p q
from which the columns could be cleaned up, but I'm wondering how I'd go about solving this using melt and unstack?
I have tried:
print(data.set_index("C", append=True).unstack())
which gives:
A B
C one two one two
0 a NaN x NaN
1 NaN a NaN y
2 b NaN p NaN
3 NaN b NaN q
The NaN values aren't wanted here, so I could instead try:
data.index = [0, 0, 1, 1]
data.set_index(["A", "C"], append=True).unstack(-1).reset_index(level=-1)
which gives:
A B
C one two
0 a x y
1 b p q
So that's closer, but it still feels as though there are some unnecessary steps, particularly hard-coding the index like that.
Edit: the suggested solution
df.set_index('A').pivot(columns='C', values='B').reset_index().rename_axis(None, axis=1)
is good, but I am wondering whether unstack can be used here instead of pivot?
First, set column A as the index, then use df.pivot. To get the exact output, we have to reset the index and rename the axis:
(df.set_index("A").pivot(columns="C", values="B")
.reset_index()
.rename_axis(None, axis=1))
A one two
0 a x y
1 b p q
Using df.unstack:
df.set_index(["A", "C"])["B"].unstack().reset_index().rename_axis(None, axis=1)
A one two
0 a x y
1 b p q
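As a side note, the set_index step isn't strictly required, since DataFrame.pivot accepts index directly; a minor variation on the above:
(df.pivot(index="A", columns="C", values="B")
 .reset_index()
 .rename_axis(None, axis=1))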

Create a new category by using a value from another column

My dataset currently has one column with different opportunity types, and another column with a dummy variable indicating whether or not the opportunity is a first-time client.
import pandas as pd
df = pd.DataFrame(
{"col_opptype": ["a", "b", "c", "d"],
"col_first": [1,0,1,0] }
)
I would like to create new categories within col_opptype based on col_first, where only one category (i.e., a) is split according to its corresponding col_first value.
I.e.,
col_opptype = {a_first, a_notfirst, b, c, d}
col_first = {1, 0}
where:
a_first is when col_opptype = a and col_first = 1
a_notfirst is when col_opptype = a and col_first = 0
desired output:
col_opptype col_first
0 a_first 1
1 b 0
2 c 1
3 d 0
I am working on Python and am a relatively new user so I hope the above makes sense. Thank you!
This should solve your problem :)
Please add your code attempt and at least an example dataframe definition to your next question, so we do not have to invent examples to help you. An exact example of what the final result should look like would also have been great :)
Edit: I adjusted the code to your changed question.
import pandas as pd
df = pd.DataFrame(
{"col_opptype": ["a", "b", "c", "d"],
"col_first": [1,0,1,0] }
)
def is_first_opptype(opptype: str, wanted_type: str, first: int):
    if first and opptype == wanted_type:
        return opptype + "_first"
    elif not first and opptype == wanted_type:
        return opptype + "_notfirst"
    else:
        return opptype

# Note the argument order: the wanted type "a" is the second argument
df["col_opptype"] = df.apply(
    lambda x: is_first_opptype(x["col_opptype"], "a", x["col_first"]),
    axis=1)
print(df)
output:
col_opptype col_first
0 a_first 1
1 b 0
2 c 1
3 d 0
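For larger frames, a vectorized variant without apply should give the same result; a sketch using np.where (note the extra numpy import):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"col_opptype": ["a", "b", "c", "d"],
     "col_first": [1, 0, 1, 0]}
)
# Build the per-row suffix, then attach it only where the type is "a"
suffix = np.where(df["col_first"] == 1, "_first", "_notfirst")
df["col_opptype"] = np.where(df["col_opptype"] == "a",
                             df["col_opptype"] + suffix,
                             df["col_opptype"])
print(df)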

Pandas if statement in vectorized operation

df = pd.DataFrame([["a", "d"], ["", ""], ["", "3"]],
columns=["a", "b"])
df
a b
0 a d
1
2 3
I'm looking to do a vectorized string concatenation with an if statement like this:
df["c"] = df["a"] + "()" + df["b"] if df["a"].item != "" else ""
But it doesn't work: .item() only works on single-element Series, so the condition can't be evaluated per row this way. Is it possible to do this without an apply or lambda method that goes through each row? In a vectorized operation pandas can concatenate many cells at a time, which is faster...
Desired output:
df
a b c
0 a d a()d
1
2 3
Try this, using np.where():
import numpy as np

df = pd.DataFrame([["a", "d"], ["", ""], ["", "3"]],
                  columns=["a", "b"])
df['c'] = np.where(df['a'] != '', df['a'] + '()' + df['b'], '')
print(df)
output:
a b c
0 a d a()d
1
2 3
IIUC you could use mask to concatenate both columns, separated by some string using str.cat, whenever a condition holds:
df['c'] = df.a.mask(df.a.ne(''), df.a.str.cat(df.b, sep='()'))
print(df)
a b c
0 a d a()d
1
2 3
Since nobody already mentioned it, you can also use the apply method:
df['c'] = df.apply(lambda r: r['a']+'()'+r['b'] if r['a']!='' else '', axis=1)
If anyone checks the performance, please comment below :)
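If you want to compare them yourself, here is a minimal timing sketch with timeit (the frame size and repeat count are arbitrary; results will depend on your data):
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame([["a", "d"], ["", ""], ["", "3"]] * 10_000,
                  columns=["a", "b"])
t_where = timeit.timeit(
    lambda: np.where(df["a"] != "", df["a"] + "()" + df["b"], ""), number=10)
t_apply = timeit.timeit(
    lambda: df.apply(lambda r: r["a"] + "()" + r["b"] if r["a"] != "" else "",
                     axis=1), number=10)
print(f"np.where: {t_where:.2f}s, apply: {t_apply:.2f}s")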

Count occurrence of two elements in column of list

I have been struggling with this for a few days now. I read a lot online, found some similar questions such as: Pandas counting occurrence of list contained in column of lists or pandas: count string criteria across down rows but neither fully work in this case.
I have two dataframes: df1 consists of a column of strings. df2 consists of a column of lists (the lists are a combination of the strings from df1, each element within one list is unique).
I would like to know in how many lists of df2 occur each combination of strings. So, how many lists have "a" and "b" as elements? How many lists have "a" and "c" as elements and so forth.
This is what df1 looks like (simplified):
df1 = pd.DataFrame({"subject": ["a", "b", "c"]})
df1
subject
0 a
1 b
2 c
This is what df2 looks like (simplified):
df2 = pd.DataFrame({"subject_list": [["a", "b" ,"c"], ["b", "c"], ["a", "b"], ["b", "c"], ["c"]]})
df2
subject_list
0 ["a", "b" ,"c"]
1 ["a", "b"]
2 ["b", "c"]
3 ["c"]
4 ["b", "c"]
I have two pieces of code which both work to some extent but aren't quite right:
This code looks for the combination of two rows in df1 (as wanted). However, df1 includes more rows than df2 so it stops with the last row of df2. But there are still some "string-combinations" to test.
df1["combination_0"] = df2["subject_list"].apply(lambda x: x.count(x and df.subject[0]))
This code counts the occurrence of one "list". However, I can't figure out how to change it so that it does it for each value combination.
df1["list_a_b"] = df2["subject_list"].apply(lambda x: x.count(df1.subject[0] and df1.subject[1]))
df1.list_a_b.sum()
Here is the solution I came up with.
Starting with the two dataframes that you have, you can use itertools to get all the possible combinations of the elements of df1 two by two:
import itertools
df1 = pd.DataFrame({"subject": ["a", "b", "c"]})
df2 = pd.DataFrame({"subject_list": [["a", "b", "c"], ["b", "c"], ["a", "b"], ["b", "c"], ["c"]]})
# Create a new dataframe with one column that has the possible two by two combinations from `df1`
df_combinations = pd.DataFrame({'combination': list(itertools.combinations(df1.subject, 2))})
Then loop through the new dataframe, df_combinations in this case, to find out how many times each combination occurs in df2:
for index, row in df_combinations.iterrows():
    df_combinations.at[index, "number of occurrences"] = (
        df2["subject_list"].apply(lambda x: all(i in x for i in row["combination"])).sum()
    )
The main difference from your original solution in this step is that I am not using x.count but rather all, since all guarantees that a row is counted only when both values are present.
Finally df_combinations is:
combination number of occurrences
0 (a, b) 2.0
1 (a, c) 1.0
2 (b, c) 3.0
This problem is somewhat difficult because depending upon how many values you have, there can be a lot of pair-wise comparisons. I think you may want to create a dummy df with dummies for each value, and then you can use .all to easily query whatever pair-wise combination you want. It's also easy to generalize if you then want combinations of any number of elements.
First create the df_dummy which indicates whether that value is contained within the list.
df_dummy = df2.subject_list.str.join(sep='?').str.get_dummies(sep='?')
# a b c
#0 1 1 1
#1 0 1 1
#2 1 1 0
#3 0 1 1
#4 0 0 1
Then create your list of all pair-wise combinations you need to check (ignoring order) from the unique values:
vals = df1.subject.unique()
combos = list((vals[j], vals[i]) for i in range(len(vals)) for j in range(len(vals)) if i>j)
print(combos)
#[('a', 'b'), ('a', 'c'), ('b', 'c')]
Now check for all pair-wise combinations:
for x, y in combos:
    df2[x + '_and_' + y] = df_dummy[[x, y]].all(axis=1)
df2 is:
subject_list a_and_b a_and_c b_and_c
0 [a, b, c] True True True
1 [b, c] False False True
2 [a, b] True False False
3 [b, c] False False True
4 [c] False False False
If you want to count the total, then just use sum, ignoring the first column
df2[df2.columns[1:]].sum()
#a_and_b 2
#a_and_c 1
#b_and_c 3
#dtype: int64
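As mentioned, this generalizes to combinations of any number of elements; for example, triples via itertools.combinations (a sketch building on df_dummy and vals from above):
from itertools import combinations

for combo in combinations(vals, 3):
    df2["_and_".join(combo)] = df_dummy[list(combo)].all(axis=1)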
Here is my attempt to solve your problem.
There are two main steps:
generate all the possible lists to check from the values of df1
count how many rows in df2 contains each combination
Code:
import itertools
def all_in(elements, a_list):
    # Check if all values in the list elements are present in a_list
    return all(el in a_list for el in elements)

# All the (unique) values in df1
all_values = sorted(df1['subject'].unique())
result = pd.Series(dtype=int)
# For each sequence length (1, 2, 3)
for length in range(1, len(all_values) + 1):
    # For each sequence of fixed length
    for comb in itertools.combinations(all_values, length):
        # Count how many rows of df2 contain the sequence
        result["_".join(comb)] = df2.squeeze().apply(lambda x: all_in(comb, x)).sum()
which gives:
result
a 2
b 4
c 4
a_b 2
a_c 1
b_c 3
a_b_c 1
Depending on the size of the actual data and on your requirements, you could make things smarter. For example, if you know that 'a' is not in a row, you can immediately assign False to any combination that includes 'a'.
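One way to realize that pruning idea is to intersect each row with the values of interest once, then test subsets against those (usually much smaller) sets; a rough sketch reusing all_values, df2 and itertools from above:
row_sets = [set(lst) & set(all_values) for lst in df2.squeeze()]
result_pruned = pd.Series({
    "_".join(comb): sum(set(comb) <= s for s in row_sets)
    for length in range(1, len(all_values) + 1)
    for comb in itertools.combinations(all_values, length)
})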
Here is a non-Pandas solution using collections.defaultdict and itertools.combinations. There are 2 parts to the logic:
Calculate all combinations from df1['subject'].
Iterate df2['subject_list'] and increment dictionary counts.
frozenset is used purposely since frozensets are hashable and capture, as in your question, the fact that order is not relevant.
from collections import defaultdict
from itertools import combinations
df1 = pd.DataFrame({"subject": ["a", "b", "c"]})
df2 = pd.DataFrame({"subject_list": [["a", "b" ,"c"], ["b", "c"], ["a", "b"], ["b", "c"], ["c"]]})
# calculate all combinations
combs = (frozenset(c)
         for i in range(1, len(df1.index) + 1)
         for c in combinations(df1['subject'], i))
# initialise defaultdict
d = defaultdict(int)
# iterate combinations and lists
for comb in combs:
    for lst in df2['subject_list']:
        if set(lst) >= comb:
            d[comb] += 1
print(d)
defaultdict(int,
{frozenset({'a'}): 2,
frozenset({'b'}): 4,
frozenset({'c'}): 4,
frozenset({'a', 'b'}): 2,
frozenset({'a', 'c'}): 1,
frozenset({'b', 'c'}): 3,
frozenset({'a', 'b', 'c'}): 1})
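To get the same kind of readable summary as the other answers, the dictionary converts straightforwardly into a Series (a small follow-up sketch):
summary = pd.Series({"_".join(sorted(k)): v for k, v in d.items()})
print(summary.sort_index())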
Here is yet another approach. The two main insights are as follows:
We can start by intersecting each list in df2 with values of df1. This way we can avoid considering redundant subsets of each row of df2.
After step 1, df2 may contain duplicated sets. Collecting the duplicates may speed up the remaining computation.
The remaining task is to consider every subset of df1 and count the number of occurrences.
import pandas as pd
import numpy as np
from itertools import combinations
from collections import Counter
df1 = pd.DataFrame({"subject": ["a", "b", "c"]})
df2 = pd.DataFrame(
{
"subject_list": [
["a", "b", "c", "x", "y", "z", "1", "2", "3"],
["b", "c"],
["a", "b"],
["b", "c"],
["c"],
]
}
)
s1 = set(df1.subject.values)
def all_combs(xs):
    for k in range(1, len(xs) + 1):
        yield from combinations(xs, k)

def count_combs(xs):
    return Counter(all_combs(xs))

res = (
    df2.subject_list.apply(s1.intersection)
    .apply(frozenset)
    .value_counts()
    .reset_index()
)
# (b, c) 2
# (c, b, a) 1
# (c) 1
# (b, a) 1
res2 = res["index"].apply(df1.subject.isin).mul(res.subject_list, axis=0)
res2.columns = df1.subject
# subject a b c
# 0 0 2 2
# 1 1 1 1
# 2 0 0 1
# 3 1 1 0
res3 = pd.Series(
    {
        "_".join(comb): res2[comb][(res2[comb] > 0).all(axis=1)].sum(0).iloc[0]
        for comb in map(list, all_combs(df1.subject.values))
    }
)
# a 2
# b 4
# c 4
# a_b 2
# a_c 1
# b_c 3
# a_b_c 1
# dtype: int64
