I have a Pandas dataframe df in the following format:
   ColumnA.      ColumnB.    IsMatching
0  sadasdsad.    asdsadsad   True
1  asdsadsadas.  asdsadasd.  False
2  asdsadasd.    asdsadsad.  False
3  dfsdfsdfi     ijijiiijj.  False
4  sdasdsads.    asdsadsad   True
5  dfsdfsdfi     ijijiiijj.  False
6  jijijijij.    ijijijiji.  False
7  assdssads.    asd222sad   True
I would like to create a new dataframe, say new_df, which contains n rows with IsMatching == False randomly sampled from each stretch of the original df between two True instances. For example, randomly select n rows between indices 0 and 4, then randomly select n rows between indices 4 and 7, and so on.
A sample desired output for new_df (sampling 2 rows randomly between the True instances in df) would be as follows. Note that it is possible that there are fewer than 2 rows between True instances; in that case I would like new_df to contain whatever rows are there.
1  asdsadsadas.  asdsadasd.  False
3  dfsdfsdfi     ijijiiijj.  False
5  dfsdfsdfi     ijijiiijj.  False
6  jijijijij.    ijijijiji.  False
I looked into the df.sample() method in Pandas, but it doesn't seem to have a provision for sampling between 2 rows. Any help and suggestions would be appreciated.
This should work:
import pandas as pd
from io import StringIO

data = StringIO("""
ColumnA. ColumnB. IsMatching
0 sadasdsad. asdsadsad True
1 asdsadsadas. asdsadasd. False
2 asdsadasd. asdsadsad. False
3 dfsdfsdfi ijijiiijj. False
4 sdasdsads. asdsadsad True
5 dfsdfsdfi ijijiiijj. False
6 jijijijij. ijijijiji. False
7 assdssads. asd222sad True
""")
df = pd.read_csv(data, sep='\s+')
# every True row starts a new group, so the False rows between two
# True rows share a group id
df['between_rows_group'] = df['IsMatching'].cumsum()
# take 1 sample per group
df.query('IsMatching==False').groupby('between_rows_group').sample(1)
# take 5 samples per group, with replacement
df.query('IsMatching==False').groupby('between_rows_group').sample(5, replace=True)
# take up to 5 samples per group (as many as each group has, without replacement)
df.query('IsMatching==False').groupby('between_rows_group').apply(lambda x: x.sample(min(5, len(x)))).reset_index(drop=True)
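If you'd rather avoid the apply, a shuffle-then-head pattern gives the same "up to n rows per group" behaviour; a minimal sketch with n=2, as in the question:
n = 2
# shuffle all the False rows, then keep the first n of each group;
# groups with fewer than n rows simply contribute all their rows
new_df = (df.query('IsMatching==False')
            .sample(frac=1)
            .groupby('between_rows_group')
            .head(n)
            .sort_index())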
I have the following dataframe:
0 1 2 3 4 5 6 7
True False False False False False False False
[1 rows x 8 columns]
As you can see, there is one True value, and it is in the first column.
Therefore, I want to get 0, the index of the column whose element is True.
In another case, where the True is in the 4th column, I would like to get 4, since the 4th column holds the True value in the dataframe below.
0 1 2 3 4 5 6 7
False False False False True False False False
[1 rows x 8 columns]
I tried to google it but failed to find what I want.
Also, assume that the columns have no designated names in this case.
Look forward to your help.
Thanks.
IIUC, you are looking for idxmax:
>>> df
0 1 2 3 4 5 6 7
0 True False False False False False False False
>>> df.idxmax(axis=1)
0 0
dtype: object
>>> df
0 1 2 3 4 5 6 7
0 False False False False True False False False
>>> df.idxmax(axis=1)
0 4
dtype: object
Caveat: if all values are False, Pandas returns the first column label, because index 0 is the lowest index of the highest value:
>>> df
0 1 2 3 4 5 6 7
0 False False False False False False False False
>>> df.idxmax(axis=1)
0 0
dtype: object
Workaround: replace False with np.nan (this needs import numpy as np):
>>> df.replace(False, np.nan).idxmax(axis=1)
0 NaN
dtype: float64
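Another way to guard against the all-False case, as a sketch, is to mask the idxmax result wherever a row contains no True at all; where() replaces the label with NaN for rows that have no True value:
>>> df.idxmax(axis=1).where(df.any(axis=1))
0    NaN
dtype: object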
If you want every column that is True:
cols_true = []
for idx, row in df.iterrows():
    for col in df.columns:
        if row[col]:
            cols_true.append(col)
print(cols_true)
Use boolean indexing:
df.columns[df.iloc[0]]
output:
Index(['0'], dtype='object')
Or numpy.where
np.where(df)[1]
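For completeness, a minimal runnable version of the np.where approach (the frame here is the one from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame([[False, False, False, False, True, False, False, False]])
# np.where on a boolean frame returns (row_positions, column_positions)
# of the True cells; the second array holds the column positions
rows, cols = np.where(df)
print(cols)  # [4]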
You may want to index the dataframe's index by a boolean column itself (column 0 in this case), as follows:
df.index[df[0]]
You'll get:
Int64Index([0], dtype='int64')
df.loc[:, df.any()].columns[0]
# 4
If you have several True values, you can also get them all with columns.
Generalization
Imagine we have the following dataframe (several True values in positions 4, 6 and 7):
0 1 2 3 4 5 6 7
0 False False False False True False True True
With the expression above:
df.loc[:, df.any()].columns
# Int64Index([4, 6, 7], dtype='int64')
df1.apply(lambda ss: ss.loc[ss].index.min(), axis=1).squeeze()
output:
0
or
df1.loc[:, df1.iloc[0]].columns.min()
I am trying to run this code.
import pandas as pd
df = pd.DataFrame({'A': ['1', '2'],
                   'B': ['1', '2'],
                   'C': ['1', '2']})
print(df.duplicated())
It gives me this output:
0 False
1 False
dtype: bool
I want to know why it shows index 1 as False and not True.
I'm expecting this output:
0 False
1 True
dtype: bool
I'm using Python 3.11.1 and Pandas 1.4.4.
duplicated works on full rows (or on a subset of the columns if the subset parameter is used).
Here you don't have any duplicates:
A B C
0 1 1 1 # this row is unique
1 2 2 2 # this one is also unique
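To illustrate the subset parameter mentioned above (df2 is a hypothetical frame for demonstration):
df2 = pd.DataFrame({'A': ['1', '1'], 'B': ['x', 'y']})
# compared on column 'A' alone, the second row is a duplicate
df2.duplicated(subset=['A'])
0    False
1     True
dtype: bool
# compared on full rows, nothing matches
df2.duplicated()
0    False
1    False
dtype: bool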
I believe you might want duplication column-wise?
df.T.duplicated()
Output:
A False
B True
C True
dtype: bool
You are not getting the expected output because you don't have duplicates to begin with. I added duplicate rows to the end of your dataframe, and this is closer to what you are looking for:
import pandas as pd
df = pd.DataFrame({'A': ['1', '2'],
                   'B': ['1', '2'],
                   'C': ['1', '2']})
df = pd.concat([df]*2)
df
A B C
0 1 1 1
1 2 2 2
0 1 1 1
1 2 2 2
df.duplicated(keep='first')
Output:
0 False
1 False
0 True
1 True
dtype: bool
And if you want to mark duplicates the other way around:
df.duplicated(keep='last')
0 True
1 True
0 False
1 False
dtype: bool
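And keep=False marks every occurrence, which is handy when you want to drop all copies of a duplicated row:
df.duplicated(keep=False)
0    True
1    True
0    True
1    True
dtype: bool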
I have a DataFrame with 2 columns, A and B. Using pandas, I want to highlight all the duplicate values across the two columns.
For Example
My dataFrame looks like this
A  B
1  1
2  3
4  4
8  8
5  6
4  7
Then the output should be
A  B
1  1  <--- both values highlighted
2  3
4  4  <--- both values highlighted
8  8  <--- both values highlighted
5  6
4  7  <--- value in column A highlighted
How do I do that?
Thanks in advance.
You can use this:
import numpy as np
import pandas as pd

def color_dupes(x):
    c1 = 'background-color:red'
    c2 = ''
    # mark every value that occurs more than once anywhere in the frame
    cond = x.stack().duplicated(keep=False).unstack()
    df1 = pd.DataFrame(np.where(cond, c1, c2), columns=x.columns, index=x.index)
    return df1

df.style.apply(color_dupes, axis=None)
# if df has many columns: df.style.apply(color_dupes, axis=None, subset=['A','B'])
Explanation:
First we stack the dataframe to bring all the columns into a single series, then call duplicated with keep=False to mark every duplicate as True:
df.stack().duplicated(keep=False)
0 A True
B True
1 A False
B False
2 A True
B True
3 A True
B True
4 A False
B False
5 A True
B False
dtype: bool
After this we unstack(), which gives a boolean dataframe with the same structure as the original:
df.stack().duplicated(keep=False).unstack()
A B
0 True True
1 False False
2 True True
3 True True
4 False False
5 True False
Once we have this, we use np.where to assign the background color where the mask is True and no color otherwise.
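The Styler renders in a Jupyter notebook; to persist the highlighting you can export it, for example to an Excel file (a sketch, assuming the openpyxl package is installed):
styled = df.style.apply(color_dupes, axis=None)
styled.to_excel('highlighted.xlsx', engine='openpyxl')  # keeps the cell colors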
I am trying to clean a dataset with pandas/Python; basically, I want to get rid of all the features that have 100 or more empty values. I am using the following command:
train.isnull().sum()>=100
which gets me:
Id False
Feature 1 False
Feature 2 False
Feature 3 True
Feature 4 False
Feature 5 True
I would like to return a new dataframe without features 3 and 5.
Thank you.
In your case, just run:
train[train.columns[train.isnull().sum()<100]]
Full example:
import pandas as pd
df = pd.DataFrame([[1, None, 2], [3, 4, None], [7, 8, 9]], columns=['A', 'B', 'C'])
You'll get:
A B C
0 1 NaN 2.0
1 3 4.0 NaN
2 7 8.0 9.0
then running:
df.isnull().sum()
which will give the null count per column:
A 0
B 1
C 1
then just select the wanted columns:
df.columns[df.isnull().sum()<100]
and filter your data frame:
df[ df.columns[df.isnull().sum()<100]]
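An equivalent one-liner, as a sketch, uses dropna with the thresh parameter, which is the minimum number of non-null values a column must have in order to be kept:
# keep only columns with fewer than 100 nulls,
# i.e. with at least len(df) - 99 non-null values
df.dropna(axis=1, thresh=len(df) - 99)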
How do I drop a row if any of the values in the row equal zero?
I would normally use df.dropna() for NaN values but not sure how to do it with "0" values.
I think the easiest way is to look at the rows where all values are not equal to 0:
df[(df != 0).all(1)]
You could make a boolean frame and then use any:
>>> df = pd.DataFrame([[1,0,2],[1,2,3],[0,1,2],[4,5,6]])
>>> df
0 1 2
0 1 0 2
1 1 2 3
2 0 1 2
3 4 5 6
>>> df == 0
0 1 2
0 False True False
1 False False False
2 True False False
3 False False False
>>> df = df[~(df == 0).any(axis=1)]
>>> df
0 1 2
1 1 2 3
3 4 5 6
Although it is late, someone else might find it helpful.
I had a similar issue, and the following worked best for me.
df = pd.read_csv(r'your file')
df = df[df['your column name'] != 0]
reference:
Drop rows with all zeros in pandas data frame
see #ikbel benabdessamad
Assume a simple DataFrame as below:
df = pd.DataFrame([1, 2, 0, 3, 4, 0, 9])
Picking the non-zero values turns all zeros into NaN, and dropna then removes those rows:
df = df[df != 0].dropna()
df
Output:
0
0 1.0
1 2.0
3 3.0
4 4.0
6 9.0
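Note that the values come out as floats (1.0, 2.0, ...) because introducing NaN forces a dtype conversion; as a sketch, chaining .astype(int) restores integers once the NaN rows are gone:
df = df[df != 0].dropna().astype(int)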