Creating a column in a pandas DataFrame based on a condition - Python

I have two lists
A = ['a','b','c','d','e']
B = ['c','e']
and a DataFrame with a single column:
A
0 a
1 b
2 c
3 d
4 e
I wish to create an additional column marking the rows where the value in A also appears in B:
A M
0 a
1 b
2 c match
3 d
4 e match

You can use loc or numpy.where with a condition built from isin:
df.loc[df.A.isin(B), 'M'] = 'match'
print (df)
A M
0 a NaN
1 b NaN
2 c match
3 d NaN
4 e match
Or:
df['M'] = np.where(df.A.isin(B),'match','')
print (df)
A M
0 a
1 b
2 c match
3 d
4 e match
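For reference, a self-contained sketch of both approaches above (the DataFrame is rebuilt from the question's lists; note the np.where variant overwrites the loc result here, so unmatched rows end up as empty strings rather than NaN):

```python
import numpy as np
import pandas as pd

A = ['a', 'b', 'c', 'd', 'e']
B = ['c', 'e']
df = pd.DataFrame({'A': A})

# loc variant: unmatched rows stay NaN
df.loc[df.A.isin(B), 'M'] = 'match'
loc_result = df['M'].copy()

# np.where variant: unmatched rows get an empty string instead
df['M'] = np.where(df.A.isin(B), 'match', '')
print(df)
```

The choice between the two mostly comes down to whether NaN or an empty string is more convenient downstream.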

Related

Pandas extract columns and rows by ID from a distance matrix

I have a distance matrix with IDs as column and row names:
A B C D
A 0 1 2 3
B 1 0 4 5
C 2 4 0 6
D 3 5 6 0
How to efficiently extract values from a large matrix, e.g. for IDs A and C to get this matrix:
A C
A 0 2
C 2 0
Edit, missing IDs in the matrix should be ignored.
Use DataFrame.loc to select values by label:
vals = ['A','C']
df = df.loc[vals, vals]
print (df)
A C
A 0 2
C 2 0
EDIT: If some values are missing from the matrix and need to be omitted, add Index.intersection:
vals = ['J','A','C']
new = df.columns.intersection(vals, sort=False)
df = df.loc[new, new]
print (df)
A C
A 0 2
C 2 0
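A runnable sketch of the intersection approach, with the matrix rebuilt from the question and 'J' as a deliberately missing ID (the names keep and sub are illustrative):

```python
import pandas as pd

# distance matrix rebuilt from the question
df = pd.DataFrame([[0, 1, 2, 3],
                   [1, 0, 4, 5],
                   [2, 4, 0, 6],
                   [3, 5, 6, 0]],
                  index=list('ABCD'), columns=list('ABCD'))

vals = ['J', 'A', 'C']                            # 'J' is not in the matrix
keep = df.columns.intersection(vals, sort=False)  # silently drops 'J'
sub = df.loc[keep, keep]
print(sub)
```

Because the matrix is symmetric with identical row and column labels, intersecting against df.columns is enough to slice both axes.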

Grouping the columns and identifying values which are not part of this group

I have a DataFrame which looks like this:
df:-
A B
1 a
1 a
1 b
2 c
3 d
Using this DataFrame, I want to get the following new_df:
new_df:-
item val_not_present
1 c    # 1 doesn't have values c and d (values not part of group 1)
1 d
2 a    # 2 doesn't have values a, b and d (values not part of group 2)
2 b
2 d
3 a    # 3 doesn't have values a, b and c (values not part of group 3)
3 b
3 c
or an individual DataFrame for each items like:
df1:
item val_not_present
1 c
1 d
df2:-
item val_not_present
2 a
2 b
2 d
df3:-
item val_not_present
3 a
3 b
3 c
I want to get all the values which are not part of that group.
You can use np.setdiff1d and explode:
values_b = df.B.unique()
pd.DataFrame(
    df.groupby("A")["B"].unique()
      .apply(lambda x: np.setdiff1d(values_b, x))
      .rename("val_not_present")
      .explode()
)
Output:
val_not_present
A
1 c
1 d
2 a
2 b
2 d
3 a
3 b
3 c
Another approach uses crosstab/pivot_table to get the counts, filters where the count is 0, and converts the result back to a DataFrame:
m = pd.crosstab(df['A'],df['B'])
pd.DataFrame(m.where(m.eq(0)).stack().index.tolist(),columns=['A','val_not_present'])
A val_not_present
0 1 c
1 1 d
2 2 a
3 2 b
4 2 d
5 3 a
6 3 b
7 3 c
You could convert B to a categorical dtype and then compute the value counts. Categorical variables show categories with a frequency count of zero, so you could do something like this:
df['B'] = df['B'].astype('category')
new_df = (
    df.groupby('A')
      .apply(lambda x: x['B'].value_counts())
      .reset_index()
      .query('B == 0')
      .drop(labels='B', axis=1)
      .rename(columns={'level_1': 'val_not_present',
                       'A': 'item'})
)
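For comparison, a self-contained, version-robust variant of the crosstab idea: stacking the boolean m.eq(0) mask avoids relying on stack() dropping NaN values (behavior that has changed across pandas versions). Variable names here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 3], 'B': ['a', 'a', 'b', 'c', 'd']})

m = pd.crosstab(df['A'], df['B'])  # counts per (group, value)
mask = m.eq(0).stack()             # True where a value never occurs in a group
new_df = pd.DataFrame(mask[mask].index.tolist(),
                      columns=['item', 'val_not_present'])
print(new_df)
```

Since the stacked mask is boolean rather than NaN-laden, selecting the True entries works the same on old and new stack() implementations.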

How to simply add a column level to a pandas dataframe

Let's say I have a DataFrame that looks like this:
df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df
Out[92]:
A B
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
Assuming this DataFrame already exists, how can I simply add a level 'C' to the column index so I get this:
df
Out[92]:
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
I saw SO answers like this one: python/pandas: how to combine two dataframes into one with hierarchical column index? But that concats different DataFrames instead of adding a column level to an already existing DataFrame.
As suggested by @StevenG himself, a better answer:
df.columns = pd.MultiIndex.from_product([df.columns, ['C']])
print(df)
# A B
# C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
option 1
set_index and T
df.T.set_index(np.repeat('C', df.shape[1]), append=True).T
option 2
pd.concat, keys, and swaplevel
pd.concat([df], axis=1, keys=['C']).swaplevel(0, 1, 1)
A solution which adds a name to the new level and is easier on the eyes than other answers already presented:
df['newlevel'] = 'C'
df = df.set_index('newlevel', append=True).unstack('newlevel')
print(df)
# A B
# newlevel C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
You could just assign the columns like:
>>> df.columns = [df.columns, ['C', 'C']]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
>>>
Or for unknown length of columns:
>>> df.columns = [df.columns.get_level_values(0), np.repeat('C', df.shape[1])]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
>>>
Another way for MultiIndex columns (appending a level 'E'):
df.columns = pd.MultiIndex.from_tuples(map(lambda x: (x[0], 'E', x[1]), df.columns))
A B
E E
C D
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
I like it explicit (using MultiIndex) and chain-friendly (.set_axis):
df.set_axis(pd.MultiIndex.from_product([df.columns, ['C']]), axis=1)
This is particularly convenient when merging DataFrames with different column level numbers, where Pandas (1.4.2) raises a FutureWarning (FutureWarning: merging between different levels is deprecated and will be removed ... ):
import pandas as pd
df1 = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df2 = pd.DataFrame(index=list('abcde'), data=range(10, 15), columns=pd.MultiIndex.from_tuples([("C", "x")]))
# df1:
A B
a 0 0
b 1 1
# df2:
C
x
a 10
b 11
# merge while giving df1 another column level:
pd.merge(df1.set_axis(pd.MultiIndex.from_product([df1.columns, ['']]), axis=1),
         df2,
         left_index=True, right_index=True)
# result:
A B C
x
a 0 0 10
b 1 1 11
Another method, but using a list comprehension of tuples as the arg to pandas.MultiIndex.from_tuples():
df.columns = pd.MultiIndex.from_tuples([(col, 'C') for col in df.columns])
df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
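Pulling the accepted from_product approach into one runnable sketch (the DataFrame is rebuilt from the question):

```python
import pandas as pd

df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})

# pair every existing column label with the new level 'C'
df.columns = pd.MultiIndex.from_product([df.columns, ['C']])
print(df)
```

After the assignment each column is addressed by a tuple, e.g. df[('A', 'C')].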

pandas: dataframe from dict with comma separated values

I'm trying to create a DataFrame from a nested dictionary, where the values are in comma separated strings.
Each value is nested in a dict, such as:
dict = {"1": {"event": "A, B, C"},
        "2": {"event": "D, B, A, C"},
        "3": {"event": "D, B, C"}}
My desired output is:
A B C D
0 A B C NaN
1 A B C D
2 NaN B C D
All I have so far is converting the dict to dataframe and splitting the items in each list. But I'm not sure this is getting me any closer to my objective.
df = pd.DataFrame(dict)
Out[439]:
1 2 3
event A, B, C D, B, A, C D, B, C
In [441]: df.loc['event'].str.split(',').apply(pd.Series)
Out[441]:
0 1 2 3
1 A B C NaN
2 D B A C
3 D B C NaN
Any help is appreciated. Thanks
You can use a couple of comprehensions to massage the nested dict into a better format for building a DataFrame that flags whether an entry exists for each column:
the_dict = {"1": {"event": "A, B, C"},
            "2": {"event": "D, B, A, C"},
            "3": {"event": "D, B, C"}}
df = pd.DataFrame([[{z:1 for z in y.split(', ')} for y in x.values()][0] for x in the_dict.values()])
>>> df
A B C D
0 1.0 1 1 NaN
1 1.0 1 1 1.0
2 NaN 1 1 1.0
Once you've made the DataFrame you can simply loop through the columns and convert the existence flags into letters using the where method (below, where the mask is NaN it stays NaN; otherwise the column's letter is inserted):
for col in df.columns:
    df_mask = df[col].isnull()
    df[col] = df[col].where(df_mask, col)
>>> df
A B C D
0 A B C NaN
1 A B C D
2 NaN B C D
Based on @merlin's suggestion, you can go straight to the answer within the comprehension:
df = pd.DataFrame([[{z:z for z in y.split(', ')} for y in x.values()][0] for x in the_dict.values()])
>>> df
A B C D
0 A B C NaN
1 A B C D
2 NaN B C D
Starting from what you already have as df1 (with the split modified slightly to strip the extra spaces), you can stack the result and use pd.crosstab() on the index and value columns:
df1 = df.loc['event'].str.split('\s*,\s*').apply(pd.Series)
df2 = df1.stack().rename('value').reset_index()
pd.crosstab(df2.level_0, df2.value)
# value A B C D
# level_0
# 1 1 1 1 0
# 2 1 1 1 1
# 3 0 1 1 1
This is not exactly as you asked for, but I imagine you may prefer this to your desired output.
To get exactly what you are looking for, you can add an extra column which is equal to the value column above and then unstack the index that contains the values:
df2 = df1.stack().rename('value').reset_index()
df2['value2'] = df2.value
df2.set_index(['level_0', 'value']).drop('level_1', axis = 1).unstack(level = 1)
# value2
# value A B C D
# level_0
# 1 A B C None
# 2 A B C D
# 3 None B C D
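One more alternative, not taken from the answers above: pandas' Series.str.get_dummies can build the indicator matrix in a single step, after which NumPy can swap each 1 for its column letter. The helper names (s, dummies, out) are illustrative:

```python
import numpy as np
import pandas as pd

the_dict = {"1": {"event": "A, B, C"},
            "2": {"event": "D, B, A, C"},
            "3": {"event": "D, B, C"}}

# one string per key -> 0/1 indicator matrix, one column per letter
s = pd.Series({k: v["event"] for k, v in the_dict.items()})
dummies = s.str.get_dummies(sep=", ")

# replace each 1 with its column's letter and each 0 with NaN
cols = dummies.columns.to_numpy(dtype=object)
out = pd.DataFrame(np.where(dummies.astype(bool), cols, np.nan),
                   index=dummies.index, columns=dummies.columns)
print(out)
```

Converting the column labels to an object array keeps np.where from coercing NaN into the string 'nan'.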

Convert N by N Dataframe to 3 Column Dataframe

I am using Python 2.7 with Pandas on a Windows 10 machine.
I have an n by n Dataframe where:
1) The index represents peoples names
2) The column headers are the same peoples names in the same order
3) Each cell of the DataFrame is the average number of times they email each other each day.
How would I transform that DataFrame into a DataFrame with 3 columns, where:
1) Column 1 would be the index of the n by n DataFrame
2) Column 2 would be the column headers of the n by n DataFrame
3) Column 3 would be the cell value corresponding to those two names, i.e. the index and column header combination from the n by n DataFrame
Edit
Apologies for not providing an example of what I am looking for. I would like to take df1 and turn it into rel_df, as in the code below.
import pandas as pd
from itertools import permutations
df1 = pd.DataFrame()
df1['index'] = ['a', 'b','c','d','e']
df1.set_index('index', inplace = True)
df1['a'] = [0,1,2,3,4]
df1['b'] = [1,0,2,3,4]
df1['c'] = [4,1,0,3,4]
df1['d'] = [5,1,2,0,4]
df1['e'] = [7,1,2,3,0]
## df of all relationships to build (names taken from df1's index)
flds = pd.Series(df1.index)
combos = []
for L in range(0, len(flds) + 1):
    for subset in permutations(flds, L):
        if len(subset) == 2:
            combos.append(subset)
        if len(subset) > 2:
            break
rel_df = pd.DataFrame.from_records(data=combos, columns=['fld1', 'fld2'])
rel_df['value'] = [1, 4, 5, 7, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4]
>>> print df1
a b c d e
index
a 0 1 4 5 7
b 1 0 1 1 1
c 2 2 0 2 2
d 3 3 3 0 3
e 4 4 4 4 0
>>> print rel_df
fld1 fld2 value
0 a b 1
1 a c 4
2 a d 5
3 a e 7
4 b a 1
5 b c 1
6 b d 1
7 b e 1
8 c a 2
9 c b 2
10 c d 2
11 c e 2
12 d a 3
13 d b 3
14 d c 3
15 d e 3
16 e a 4
17 e b 4
18 e c 4
19 e d 4
Use melt:
df1 = df1.reset_index()
pd.melt(df1, id_vars='index', value_vars=df1.columns.tolist()[1:])
(If in your actual code you're explicitly setting the index as you do here, just skip that step rather than doing the reset_index; melt doesn't work on an index.)
# Flatten your dataframe.
df = df1.stack().reset_index()
# Remove duplicates (e.g. fld1 = 'a' and fld2 = 'a').
df = df.loc[df.iloc[:, 0] != df.iloc[:, 1]]
# Rename columns.
df.columns = ['fld1', 'fld2', 'value']
>>> df
fld1 fld2 value
1 a b 1
2 a c 4
3 a d 5
4 a e 7
5 b a 1
7 b c 1
8 b d 1
9 b e 1
10 c a 2
11 c b 2
13 c d 2
14 c e 2
15 d a 3
16 d b 3
17 d c 3
19 d e 3
20 e a 4
21 e b 4
22 e c 4
23 e d 4
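Putting the melt answer together as one runnable sketch (df1 is rebuilt from the question; rename_axis names the index so melt can keep it as fld1 — the column names fld1/fld2/value follow the question's desired output):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [0, 1, 2, 3, 4],
                    'b': [1, 0, 2, 3, 4],
                    'c': [4, 1, 0, 3, 4],
                    'd': [5, 1, 2, 0, 4],
                    'e': [7, 1, 2, 3, 0]}, index=list('abcde'))

rel_df = (df1.rename_axis('fld1')   # name the index so melt keeps it
             .reset_index()
             .melt(id_vars='fld1', var_name='fld2', value_name='value'))

# drop self-relationships such as (a, a)
rel_df = rel_df[rel_df.fld1 != rel_df.fld2].reset_index(drop=True)
print(rel_df)
```

The row order differs from rel_df in the question (melt emits column by column), but the (fld1, fld2, value) triples are the same.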
