Groupby contains two specific values - pandas - python

I'm aiming to return rows in a pandas DataFrame based on two specific values within groups defined by a separate column. In the code below, I'm grouping by Num and want to return the rows of each group where B is present but A is not. If neither A nor B appears in a group, skip it; I only want the rows where B is present but not A.
import pandas as pd
df = pd.DataFrame({
    'Num' : [1,1,2,2,2,2,3,3,4,4,4,4],
    'Label' : ['X','Y','X','B','B','B','A','B','B','A','B','X'],
})
df = df.loc[(df['Label'] == 'A') | (df['Label'] == 'B')]
# these two chained filters keep the groups that contain both A and B,
# not the intended groups with B but no A
df = df.groupby('Num').filter(lambda x: any(x['Label'] == 'A'))
df = df.groupby('Num').filter(lambda x: any(x['Label'] == 'B'))
intended output:
   Num Label
2    2     X
3    2     B
4    2     B
5    2     B

You can filter for groups where all values are B by using GroupBy.transform with 'all':
df1 = df.loc[(df['Label'] == 'A') | (df['Label'] == 'B')]
df1 = df1[(df1['Label'] == 'B').groupby(df1['Num']).transform('all')]
print (df1)
   Num Label
3    2     B
4    2     B
5    2     B
If you need to filter the original DataFrame by the matching Num values, use:
df = df[df['Num'].isin(df1['Num'])]
print (df)
   Num Label
2    2     X
3    2     B
4    2     B
5    2     B
Another approach is to filter with numpy.setdiff1d:
import numpy as np

num = np.setdiff1d(df.loc[(df['Label'] == 'B'), 'Num'],
                   df.loc[(df['Label'] == 'A'), 'Num'])
df = df[df['Num'].isin(num)]
print (df)
   Num Label
2    2     X
3    2     B
4    2     B
5    2     B
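For completeness, the same B-without-A logic can also be written as a single GroupBy.filter over the original frame; a minimal, self-contained sketch:
import pandas as pd

df = pd.DataFrame({
    'Num' : [1,1,2,2,2,2,3,3,4,4,4,4],
    'Label' : ['X','Y','X','B','B','B','A','B','B','A','B','X'],
})

# keep whole groups that contain at least one B and no A
out = df.groupby('Num').filter(
    lambda g: (g['Label'] == 'B').any() and not (g['Label'] == 'A').any()
)
print(out)
#    Num Label
# 2    2     X
# 3    2     B
# 4    2     B
# 5    2     B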

Related

lookup value in a pandas dataframe using multiple values in a row of another dataframe

I have dataframes:
df1:
   A  B  C  D  E
0  1  2  3  4  5
1  1  3  4  5  0
2  3  1  2  3  5
3  2  3  1  2  6
4  2  5  1  2  3
df2:
   K  L  M  N
0  1  3  4  2
1  1  2  5  3
2  3  2  3  1
3  1  4  5  0
4  2  2  3  6
5  2  1  2  7
What I need to do is match column A of df1 with column K of df2, column C of df1 with column L of df2, and column D of df1 with column M of df2. Where all three values match, the corresponding value of N in df2 should be assigned to a new column F in df1. The output should be:
   A  B  C  D  E  F
0  1  2  3  4  5  2
1  1  3  4  5  0  0
2  3  1  2  3  5  1
3  2  3  1  2  6  7
4  2  5  1  2  3  7
Use DataFrame.merge with a left join, renaming the columns of df2 so they match:
df = df1.merge(df2.rename(columns={'K':'A','L':'C','M':'D', 'N':'F'}), how='left')
print (df)
   A  B  C  D  E  F
0  1  2  3  4  5  2
1  1  3  4  5  0  0
2  3  1  2  3  5  1
3  2  3  1  2  6  7
4  2  5  1  2  3  7
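In this example every row of df1 finds a match, but if some didn't, the left join would leave NaN in F. Assuming 0 is the desired fill value for non-matches (as in the loop-based answer below), a minimal sketch:
df = df1.merge(df2.rename(columns={'K':'A','L':'C','M':'D', 'N':'F'}),
               on=['A', 'C', 'D'], how='left')
# rows with no match in df2 get NaN in F; fill with 0 and restore integer dtype
df['F'] = df['F'].fillna(0).astype(int)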
An alternative joins the two frames on the row index and compares them position by position with iterrows (note that unlike the merge above, this only matches rows sitting at the same index):
df3 = df1.join(df2)
F = []
for _, row in df3.iterrows():
    if row['A'] == row['K'] and row['C'] == row['L'] and row['D'] == row['M']:
        F.append(row['N'])
    else:
        F.append(0)
df1['F'] = F
df1

How do I delete a column that contains only zeros from a given row in pandas

I've found how to remove columns that contain only zeros across all rows using df.loc[:, (df != 0).any(axis=0)], and I need to do the same but for a given row.
For example, for the following df
In [75]: df = pd.DataFrame([[1,1,0,0], [1,0,1,0]], columns=['a','b','c','d'])
In [76]: df
Out[76]:
   a  b  c  d
0  1  1  0  0
1  1  0  1  0
Give me the columns with non-zero values for row 0 and I would expect the result:
   a  b
0  1  1
And for row 1:
   a  c
1  1  1
I tried a lot of combinations of commands but I couldn't find a solution.
UPDATE:
I have a 300x300 matrix and I need to visualize its results better.
Below is pseudo-code showing what I need:
for i in range(len(df[rows])):
    _df = df.iloc[i]
    _df = _df.filter(remove_zeros_columns)
    print('Row: ', i)
    print(_df)
Result:
Row: 0
   a  b
0  1  1
Row: 1
   a  c   f
1  1  5  10
Row: 2
    e
2  20
You can change the data structure:
df = df.reset_index().melt('index', var_name='columns').query('value != 0')
print (df)
   index columns  value
0      0       a      1
1      1       a      1
2      0       b      1
5      1       c      1
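The melted frame can then be grouped back by the original index to produce the per-row printout from the question; a minimal sketch continuing from the frame above:
# one small table per original row, zeros already removed
for i, g in df.groupby('index'):
    print('Row:', i)
    print(g[['columns', 'value']].to_string(index=False))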
If you need a new column with the non-zero column names joined by ', ', compare the values for inequality with 0 using DataFrame.ne and then use matrix multiplication with DataFrame.dot:
df['new'] = df.ne(0).dot(df.columns + ', ').str.rstrip(', ')
print (df)
   a  b  c  d   new
0  1  1  0  0  a, b
1  1  0  1  0  a, c
EDIT:
for i in df.index:
    row = df.loc[[i]]
    a = row.loc[:, (row != 0).any()]
    print ('Row {}'.format(i))
    print (a)
Or:
def f(x):
    print ('Row {}'.format(x.name))
    print (x[x!=0].to_frame().T)

df.apply(f, axis=1)
Row 0
   a  b
0  1  1
Row 1
   a  c
1  1  1
df = pd.DataFrame([[1, 1, 0, 0], [1, 0, 1, 0]], columns=['a', 'b', 'c', 'd'])

def get(row):
    return list(df.columns[row.ne(0)])

df['non zero column'] = df.apply(get, axis=1)
print(df)
Also, if you want a one-liner, use this:
df['non zero column'] = [list(df.columns[i]) for i in df.ne(0).values]
output
   a  b  c  d non zero column
0  1  1  0  0          [a, b]
1  1  0  1  0          [a, c]
I think this answers your question more strictly.
Just change the value of given_row as needed.
given_row = 1
mask_all_rows = df.apply(lambda x: x!=0, axis=0)
mask_row = mask_all_rows.loc[given_row]
cols_to_keep = mask_row.index[mask_row].tolist()
df_filtered = df[cols_to_keep]
# And if you only want to keep the given row
df_filtered = df_filtered[df_filtered.index == given_row]

Compare values from two pandas data frames, order-independent

I am new to data science. I want to check which elements from one data frame exist in another data frame, e.g.
df1 = [1,2,8,6]
df2 = [5,2,6,9]
# for 1 output should be False
# for 2 output should be True
# for 6 output should be True
etc.
Note: I have matrices, not vectors.
I have tried using the following code:
import pandas as pd
import numpy as np

priority_dataframe = pd.read_excel(prioritylist_file_path, sheet_name='Sheet1', index=None)
priority_dict = {column: np.array(priority_dataframe[column].dropna(axis=0, how='all').str.lower())
                 for column in priority_dataframe.columns}
keys_found_per_sheet = []
if file_path.lower().endswith('.csv'):
    file_dataframe = pd.read_csv(file_path)
else:
    file_dataframe = pd.read_excel(file_path, sheet_name=sheet, index=None)
file_cell_array = list()
for column in file_dataframe.columns:
    for file_cell in np.array(file_dataframe[column].dropna(axis=0, how='all')):
        if isinstance(file_cell, str):
            file_cell_array.append(file_cell)
        else:
            file_cell_array.append(str(file_cell))
converted_file_cell_array = np.array(file_cell_array)
for key, values in priority_dict.items():
    for priority_cell in values:
        if priority_cell in converted_file_cell_array[:]:
            keys_found_per_sheet.append(key)
            break
Am I doing something wrong in if priority_cell in converted_file_cell_array[:]?
Is there any more efficient way to do this?
You can take the .values from each dataframe, flatten them, convert them to a set(), and take the set intersection:
set1 = set(df1.values.reshape(-1).tolist())
set2 = set(df2.values.reshape(-1).tolist())
common = set1 & set2
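If you instead want an element-wise answer (True/False per element, as in the question's example), DataFrame.isin accepts any iterable; a minimal sketch, assuming df1 and df2 are DataFrames:
import pandas as pd

df1 = pd.DataFrame([[1, 2], [8, 6]])
df2 = pd.DataFrame([[5, 2], [6, 9]])

# True where the element of df1 appears anywhere in df2
mask = df1.isin(set(df2.values.ravel()))
print(mask)
#        0     1
# 0  False  True
# 1  False  True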
You can flatten all values of the DataFrames with numpy.ravel and then use set.intersection():
df1 = pd.DataFrame({'A':list('abcdef'),
                    'B':[4,5,4,5,5,4],
                    'C':[7,8,9,4,2,3],
                    'D':[1,3,5,7,1,0],
                    'E':[5,3,6,9,2,4],
                    'F':list('aaabbb')})
print (df1)
   A  B  C  D  E  F
0  a  4  7  1  5  a
1  b  5  8  3  3  a
2  c  4  9  5  6  a
3  d  5  4  7  9  b
4  e  5  2  1  2  b
5  f  4  3  0  4  b
df2 = pd.DataFrame({'A':[2,3,13,4], 'Z':list('abfr')})
print (df2)
    A  Z
0   2  a
1   3  b
2  13  f
3   4  r
L = list(set(df1.values.ravel()).intersection(df2.values.ravel()))
print (L)
['f', 2, 3, 4, 'a', 'b']

pandas - select and pivot column names into values

Starting with this dataframe df:
df = pd.DataFrame({'id':[1,2,3,4],'a':['on','on','off','off'], 'b':['on','off','on','off']})
     a    b  id
0   on   on   1
1   on  off   2
2  off   on   3
3  off  off   4
what I would like to achieve is a column result with results from the 'on' and 'off' selection of the columns. Expected output is:
     a    b  id result
0   on   on   1  [a,b]
1   on  off   2    [a]
2  off   on   3    [b]
3  off  off   4     []
so basically I have to select the 'on' values in the columns (except id) and then keep the resulting column names in lists. My first attempt was using pivot_table:
d = pd.pivot_table(df, index='id', columns=?, values=?)
but I am stuck on how to put the selection into the values and the new column into the columns args.
One approach that works is to create nested lists and then select the first value of each list with str[0]:
df['res'] = df[['a','b']].eq('on').apply(lambda x: [x.index.values[x]], axis=1).str[0]
print (df)
     a    b  id     res
0   on   on   1  [a, b]
1   on  off   2     [a]
2  off   on   3     [b]
3  off  off   4      []
Or create tuples first and then cast them to lists:
df['res'] = (df[['a','b']].eq('on')
               .apply(lambda x: tuple(x.index.values[x]), axis=1).apply(list))
print (df)
     a    b  id     res
0   on   on   1  [a, b]
1   on  off   2     [a]
2  off   on   3     [b]
3  off  off   4      []
Instead of a pivot table, you can also use:
df['result'] = df.iloc[:,0:2].eq('on').apply(lambda x: tuple(df.columns[0:2][x]), axis=1)
Output:
     a    b  id  result
0   on   on   1  (a, b)
1   on  off   2    (a,)
2  off   on   3    (b,)
3  off  off   4      ()
Or you can use eq and mul:
df['res']=(df[['a','b']].eq('on').mul(['a','b'])).values.tolist()
Out[824]:
     a    b  id     res
0   on   on   1  [a, b]
1   on  off   2   [a, ]
2  off   on   3   [, b]
3  off  off   4    [, ]
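Note the empty strings the multiplication leaves behind; assuming you want them dropped, a minimal sketch building on the same eq/mul idea:
df['res'] = [[c for c in lst if c]  # drop the '' placeholders
             for lst in df[['a','b']].eq('on').mul(['a','b']).values.tolist()]
print(df[['id', 'res']])
#    id     res
# 0   1  [a, b]
# 1   2     [a]
# 2   3     [b]
# 3   4      []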
Try this:
import pandas as pd

df = pd.DataFrame({'id':[1,2,3,4],'a':['on','on','off','off'], 'b':['on','off','on','off']})
stringList = []
for i in range(0, df.shape[0]):
    if df['a'][i] == 'on' and df['b'][i] == 'on':
        stringList.append('[a,b]')
    elif df['a'][i] == 'on' and df['b'][i] == 'off':
        stringList.append('[a]')
    elif df['a'][i] == 'off' and df['b'][i] == 'on':
        stringList.append('[b]')
    else:
        stringList.append('[]')
df['result'] = stringList
print(df)

np.where multiple return values

Using pandas and numpy, I am trying to process a column in a dataframe and want to create a new column with values derived from it: if column x holds the value 1, the new column should hold a; for value 2 it should hold b, etc.
I can do this for a single condition, i.e.
df['new_col'] = np.where(df['col_1'] == 1, a, n/a)
And I can find examples of multiple conditions mapping to a single value, i.e. if x = 3 or x = 4 the value should be a; but not something like: if x = 3 the value should be a and if x = 4 the value should be c.
I tried simply running two lines of code, such as:
df['new_col'] = np.where(df['col_1'] == 1, a, n/a)
df['new_col'] = np.where(df['col_1'] == 2, b, n/a)
But obviously the second line overwrites the first. Am I missing something crucial?
I think you can use loc:
df.loc[df['col_1'] == 1, 'new_col'] = a
df.loc[df['col_1'] == 2, 'new_col'] = b
Or:
df['new_col'] = np.where(df['col_1'] == 1, a, np.where(df['col_1'] == 2, b, np.nan))
Or numpy.select:
df['new_col'] = np.select([df['col_1'] == 1, df['col_1'] == 2],[a, b], default=np.nan)
Or use Series.map; values with no match become NaN by default:
d = {1 : 'a', 2 : 'b'}
df['new_col'] = df['col_1'].map(d)
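For reference, the overwrite in the question happens because each np.where call rebuilds the entire column, replacing previously assigned values with the fallback. A minimal sketch that passes the existing column as the fallback, assuming string labels and None for unmatched values:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_1': [1, 2, 4, 2]})
df['new_col'] = None
# each np.where keeps the current column as its fallback,
# so earlier assignments survive the later lines
df['new_col'] = np.where(df['col_1'] == 1, 'a', df['new_col'])
df['new_col'] = np.where(df['col_1'] == 2, 'b', df['new_col'])
print(df)
#    col_1 new_col
# 0      1       a
# 1      2       b
# 2      4    None
# 3      2       b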
I think numpy choose() is the best option for you.
import numpy as np
choices = 'abcde'
N = 10
np.random.seed(0)
data = np.random.randint(1, len(choices) + 1, size=N)
print(data)
print(np.choose(data - 1, choices))
Output:
[5 1 4 4 4 2 4 3 5 1]
['e' 'a' 'd' 'd' 'd' 'b' 'd' 'c' 'e' 'a']
You could define a dict with your desired transformations, then loop through the DataFrame column and fill a new column. There may be more elegant ways, but this will work:
import numpy as np
import pandas as pd

# create a dummy DataFrame
df = pd.DataFrame( np.random.randint(2, size=(6,4)), columns=['col_1', 'col_2', 'col_3', 'col_4'], index=range(6) )

# create a dict with your desired substitutions:
swap_dict = { 0 : 'a',
              1 : 'b',
              999 : 'zzz', }

# introduce new column and fill with swapped information:
for i in df.index:
    df.loc[i, 'new_col'] = swap_dict[ df.loc[i, 'col_1'] ]

print(df)
returns something like:
   col_1  col_2  col_3  col_4 new_col
0      1      1      1      1       b
1      1      1      1      1       b
2      0      1      1      0       a
3      0      1      0      0       a
4      0      0      1      1       a
5      0      0      1      0       a
Use the pandas Series.map instead of where.
import pandas as pd
df = pd.DataFrame({'col_1' : [1,2,4,2]})
print(df)
def ab_ify(v):
    if v == 1:
        return 'a'
    elif v == 2:
        return 'b'
    else:
        return None
df['new_col'] = df['col_1'].map(ab_ify)
print(df)
# output:
#
#    col_1
# 0      1
# 1      2
# 2      4
# 3      2
#
#    col_1 new_col
# 0      1       a
# 1      2       b
# 2      4    None
# 3      2       b
