Remove multivalued columns - python

I have a dataframe:
A B
0 2 3
1 2 4
2 3 5
If a column has more than 2 distinct values, I want to remove it.
Expected output:
A
0 2
1 2
2 3

You can use .nunique() and .loc, passing a boolean mask over the columns:
df = pd.DataFrame({'A': {0: 2, 1: 2, 2: 3}, 'B': {0: 3, 1: 4, 2: 5}})
df.loc[:, (df.nunique() <= 2)]
A
0 2
1 2
2 3
An alternative approach (credit to this answer):
criteria = df.nunique() <= 2
df[criteria.index[criteria]]
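The same idea can be wrapped in a small helper so the threshold is not hard-coded. This is only a sketch: the function name drop_multivalued and the max_unique parameter are made up for illustration and are not part of pandas.

```python
import pandas as pd


def drop_multivalued(df: pd.DataFrame, max_unique: int = 2) -> pd.DataFrame:
    """Return only the columns with at most `max_unique` distinct values."""
    # nunique() counts distinct values per column (NaN is ignored by default)
    keep = df.nunique() <= max_unique
    return df.loc[:, keep]


df = pd.DataFrame({'A': [2, 2, 3], 'B': [3, 4, 5]})
result = drop_multivalued(df)
print(result)
```

Unlike the drop-in-place loop, this returns a new frame and leaves the original untouched.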

Use a for loop and value_counts to get the result:
df = pd.DataFrame(data={'A': [2, 2, 3], 'B': [3, 4, 5]})
for var in df.columns:
    # value_counts() returns one entry per distinct value
    result = df[var].value_counts()
    if len(result) > 2:
        df.drop(var, axis=1, inplace=True)
df
Output:
A
0 2
1 2
2 3

Related

combine multiple column into one in pandas [duplicate]

I have a table like the one below:
Column 1 Column 2 Column 3 ...
0 a 1 2
1 b 1 3
2 c 2 1
and I want to convert it to look like this:
Column 1 Column 2
0 a 1
1 a 2
2 b 1
3 b 3
4 c 2
5 c 1
...
I want to take each value from Column 2 (and so on) and pair it with the value in Column 1 from the same row. I have no idea how to do this in pandas or even where to start.
You can use pd.melt to do this:
>>> df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
... 'B': {0: 1, 1: 3, 2: 5},
... 'C': {0: 2, 1: 4, 2: 6}})
>>> df
A B C
0 a 1 2
1 b 3 4
2 c 5 6
>>> pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])
A variable value
0 a B 1
1 b B 3
2 c B 5
3 a C 2
4 b C 4
5 c C 6
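pd.melt returns the rows grouped by source column, while the question asks for them grouped by the original row. One possible way to get that ordering, assuming pandas 1.1 or later (for the ignore_index argument), is to keep the original index during the melt and then sort on it with a stable algorithm. The frame below mirrors the question's table; the output column name 'value' is just melt's default.

```python
import pandas as pd

df = pd.DataFrame({'Column 1': ['a', 'b', 'c'],
                   'Column 2': [1, 1, 2],
                   'Column 3': [2, 3, 1]})

long_df = (
    # ignore_index=False keeps each melted row tagged with its original row label
    df.melt(id_vars='Column 1', ignore_index=False)
      # mergesort is stable, so 'Column 2' values stay ahead of 'Column 3' values
      .sort_index(kind='mergesort')
      .drop(columns='variable')
      .reset_index(drop=True)
)
print(long_df)
```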
Here's my approach, hope it helps:
import pandas as pd
df = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': [1, 1, 2], 'col3': [2, 3, 1]})
new_df = pd.DataFrame(columns=['col1', 'col2'])
for index, row in df.iterrows():
    # pair the first column's value with each remaining value in the row
    for element in row.values[1:]:
        new_df.loc[len(new_df)] = [row.iloc[0], element]
print(new_df)
Output:
col1 col2
0 a 1
1 a 2
2 b 1
3 b 3
4 c 2
5 c 1
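Growing a frame row by row gets slow on large inputs. A loop-free sketch of the same pairing, using set_index and stack on the same sample data, might look like this (the intermediate column label is simply discarded):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': [1, 1, 2], 'col3': [2, 3, 1]})

new_df = (
    df.set_index('col1')
      .stack()                          # one entry per (col1, original column) pair
      .reset_index(level=1, drop=True)  # discard the original column label
      .reset_index(name='col2')         # back to a two-column DataFrame
)
print(new_df)
```

Note that stack may drop missing values depending on the pandas version, so this assumes the value columns contain no NaN.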

Taking rows from dataframe until a condition is met

I have a dataframe with two columns:
A B
0 1 3
1 2 2
2 3 2
3 9 3
4 1 1
...
For a given index i, I want the rows from row i up to and including the first row j for which df.at[j, 'A'] - df.at[i, 'B'] > 5. I don't want any rows after row j.
For example, let i=1, the output should be:
[out]
A B
2 2
3 2
9 3
Is there a simple way to do this without using loops?
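import pandas as pd
df = pd.DataFrame({'A': [10, 1, 2, 3, 9], 'B': [1, 3, 2, 2, 3]})
i = 2
base = df.at[i, 'B']
df = df.iloc[i:]
# rows from i onward where A exceeds B at row i by more than 5
j = df[df['A'] - base > 5]
if not j.empty:
    # .loc slices by label and includes the end label
    print(df.loc[:j.index[0]])
else:
    print('Condition not found')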
Prints:
A B
2 2 2
3 3 2
4 9 3
You could try as follows:
import pandas as pd
data = {'A': {0: 10, 1: 2, 2: 3, 3: 9}, 'B': {0: 3, 1: 2, 2: 2, 3: 3}}
df = pd.DataFrame(data)
i = 1
s = df.loc[i:, 'A'] - df.loc[i, 'B'] > 5
trues = s[s]
if not trues.empty:
    # slice by label, from i through the first row that meets the condition
    subset = df.loc[i:trues.index[0]]
else:
    subset = pd.DataFrame()
print(subset)
A B
1 2 2
2 3 2
3 9 3
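Both answers can be folded into one positional helper. The sketch below is an illustration only: the name rows_until and its threshold parameter are hypothetical, i is treated as a row position rather than a label, and returning an empty frame when the condition is never met is an arbitrary choice.

```python
import pandas as pd


def rows_until(df: pd.DataFrame, i: int, threshold: float = 5) -> pd.DataFrame:
    """Rows from position i through the first row j with A[j] - B[i] > threshold."""
    tail = df.iloc[i:]
    hit = (tail['A'] - df['B'].iloc[i] > threshold).to_numpy()
    if not hit.any():
        return tail.iloc[0:0]  # condition never met: return an empty frame
    # argmax on a boolean array gives the position of the first True
    return tail.iloc[:hit.argmax() + 1]


df = pd.DataFrame({'A': [1, 2, 3, 9, 1], 'B': [3, 2, 2, 3, 1]})
subset = rows_until(df, 1)
print(subset)
```

Because everything is positional, this does not depend on the frame having a default integer index.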

How do you filter duplicate columns in a dataframe based on a value in another column

I would like to filter duplicate rows in a DataFrame based on the columns "NID", "Lact" and "Code", but only when "Code" equals 10.
The following dictionary provides example data:
data_list = {'NID': {1: '1', 2: '1', 3: '1', 4: '1', 5: '2', 6: '2', 7: '1'},
             'Lact': {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 1},
             'Code': {1: 10, 2: 0, 3: 10, 4: 0, 5: 0, 6: 10, 7: 0}}
df = pd.DataFrame(data_list)
The DataFrame appears below:
NID Lact Code
1 1 1 10
2 1 1 0
3 1 1 10
4 1 2 0
5 2 2 0
6 2 2 10
7 1 1 0
The following filter identifies duplicate rows based on "NID", "Lact" and "Code":
df[(df.duplicated(['NID', 'Lact', 'Code'], keep=False))]
The output is provided below
NID Lact Code
1 1 1 10
2 1 1 0
3 1 1 10
7 1 1 0
I would like to make this filter conditional on Code == 10: the first instance of a set of duplicate rows should be deleted when Code is 10, but not when Code is anything else.
Is there a way to add a condition for Code == 10 to this filter?
IIUC, you want to keep all rows if Code is not equal to 10 but drop the first of duplicates otherwise, right? Then you could add that into the boolean mask:
cols = ['NID', 'Lact', 'Code']
out = df[~df.duplicated(cols, keep=False) | df.duplicated(cols) | df['Code'].ne(10)]
Output:
NID Lact Code
2 1 1 0
3 1 1 10
4 1 2 0
5 2 2 0
6 2 2 10
7 1 1 0
If you need to remove the first duplicated row only when Code == 10, combine that condition with DataFrame.duplicated using its default keep='first' parameter. If you also want to keep only the duplicated rows, chain m2 with & (bitwise AND):
m1 = df['Code'].eq(10)
m2 = df.duplicated(['NID', 'Lact', 'Code'], keep=False)
m3 = df.duplicated(['NID', 'Lact', 'Code'])
df = df[(~m1 | m3) & m2]
print (df)
NID Lact Code
2 1 1 0
3 1 1 10
7 1 1 0
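It can be easier to read when the mask names the rows to drop rather than the rows to keep. The following sketch rebuilds the question's data and reproduces the first answer's result that way; the variable names first_of_dupes and to_drop are chosen here for illustration.

```python
import pandas as pd

data_list = {'NID': {1: '1', 2: '1', 3: '1', 4: '1', 5: '2', 6: '2', 7: '1'},
             'Lact': {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 1},
             'Code': {1: 10, 2: 0, 3: 10, 4: 0, 5: 0, 6: 10, 7: 0}}
df = pd.DataFrame(data_list)
cols = ['NID', 'Lact', 'Code']

# member of a duplicate group AND not flagged by keep='first' => first occurrence
first_of_dupes = df.duplicated(cols, keep=False) & ~df.duplicated(cols)
# only drop that first occurrence when Code is 10
to_drop = first_of_dupes & df['Code'].eq(10)
out = df[~to_drop]
print(out)
```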

How to sum up pandas columns where the number is determined by the values of another column? [duplicate]

I have a data frame as:
a b c d......
1 1
3 3 3 5
4 1 1 4 6
1 0
I want to select a number of columns based on the value given in column "a". In this case, for the first row it would select only column b.
How can I achieve something like:
df.iloc[:,column b:number of columns corresponding to value in column a]
My expected output would be:
a b c d e
1 1 0 0 1 # 'e' contains the value in column b because column a = 1
3 3 3 5 335 # 'e' contains the values of columns b, c, d because column a = 3
4 1 1 4 1
1 0 NAN
Define a little function for this:
def select(df, r):
    # int() guards against column a being stored as float (e.g. when the frame has NaNs)
    return df.iloc[r, 1:1 + int(df.iat[r, 0])]
The function uses iat to query the a column for that row, and iloc to select columns from the same row.
Call it as such:
select(df, 0)
b 1.0
Name: 0, dtype: float64
And,
select(df, 1)
b 3.0
c 3.0
d 5.0
Name: 1, dtype: float64
Based on your edit, consider this -
df
a b c d e
0 1 1 0 0 0
1 3 3 3 5 0
2 4 1 1 4 6
3 1 0 0 0 0
Use where/mask (with numpy broadcasting) + agg here -
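import numpy as np
df['e'] = (
    df.iloc[:, 1:]
      .astype(str)
      # keep only the first `a` columns of each row; blank out the rest
      .where(np.arange(df.shape[1] - 1) < df['a'].to_numpy()[:, None], '')
      .agg(''.join, axis=1)
)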
df
a b c d e
0 1 1 0 0 1
1 3 3 3 5 335
2 4 1 1 4 1146
3 1 0 0 0 0
If nothing matches, then those entries in e will have an empty string. Just use replace -
df['e'] = df['e'].replace('', np.nan)
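The title asks about summing, while the expected output concatenates digits. If a numeric sum of the first a columns is what is actually wanted (an assumption, since the question is ambiguous), the same broadcasting mask can feed sum instead of join. The column name total is made up for this sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 4, 1],
                   'b': [1, 3, 1, 0],
                   'c': [0, 3, 1, 0],
                   'd': [0, 5, 4, 0],
                   'e': [0, 0, 6, 0]})

values = df.iloc[:, 1:]
# mask[r, k] is True when column k (counted from 'b') is among the first a[r] columns
mask = np.arange(values.shape[1]) < df['a'].to_numpy()[:, None]
# zero out everything outside the mask, then sum across each row
df['total'] = values.where(mask, 0).sum(axis=1)
print(df)
```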
A numpy slicing approach:
v = df.values  # assumed: `v` is the frame's underlying NumPy array (not defined in the original)
a = v[:, 0]
b = v[:, 1:]
n, m = b.shape
b = b.ravel()
b = np.where(b == 0, '', b.astype(str))
r = np.arange(n) * m
f = lambda t: b[t[0]:t[1]]
df.assign(g=list(map(''.join, map(f, zip(r, r + a)))))
a b c d e g
0 1 1 0 0 0 1
1 3 3 3 5 0 335
2 4 1 1 4 6 1146
3 1 0 0 0 0
Edit: one line solution with slicing.
df["f"] = df.astype(str).apply(lambda r: "".join(r[1:int(r["a"])+1]), axis=1)
# df["f"] = df["f"].astype(int) if you need `f` to be integer
df
a b c d e f
0 1 1 X X X 1
1 3 3 3 5 X 335
2 4 1 1 4 6 1146
3 1 0 X X X 0
Dataset used:
df = pd.DataFrame({'a': {0: 1, 1: 3, 2: 4, 3: 1},
'b': {0: 1, 1: 3, 2: 1, 3: 0},
'c': {0: 'X', 1: '3', 2: '1', 3: 'X'},
'd': {0: 'X', 1: '5', 2: '4', 3: 'X'},
'e': {0: 'X', 1: 'X', 2: '6', 3: 'X'}})
Suggestion for improvement would be appreciated!

how to convert header row into new columns in python pandas?

I have the following dataframe:
A,B,C
1,2,3
I need to convert it to the following format:
cols,vals
A,1
B,2
C,3
How can I turn the column names into a new column in pandas?
You can transpose with T:
import pandas as pd
df = pd.DataFrame({'A': {0: 1}, 'C': {0: 3}, 'B': {0: 2}})
print (df)
A B C
0 1 2 3
print (df.T)
0
A 1
B 2
C 3
df1 = df.T.reset_index()
df1.columns = ['cols','vals']
print (df1)
cols vals
0 A 1
1 B 2
2 C 3
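For the single-row case, melt gives the same two-column result in one call. A minimal sketch on the same data, using melt's var_name and value_name arguments to set the output headers:

```python
import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [2], 'C': [3]})

# with no id_vars, every column is unpivoted into (name, value) pairs
df1 = df.melt(var_name='cols', value_name='vals')
print(df1)
```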
If the DataFrame has more rows, you can use:
import pandas as pd
df = pd.DataFrame({'A': {0: 1, 1: 9, 2: 1},
'C': {0: 3, 1: 6, 2: 7},
'B': {0: 2, 1: 4, 2: 8}})
print (df)
A B C
0 1 2 3
1 9 4 6
2 1 8 7
df.index = 'vals' + df.index.astype(str)
print (df.T)
vals0 vals1 vals2
A 1 9 1
B 2 4 8
C 3 6 7
df1 = df.T.reset_index().rename(columns={'index':'cols'})
print (df1)
cols vals0 vals1 vals2
0 A 1 9 1
1 B 2 4 8
2 C 3 6 7
