Delete all rows from a dataframe containing question marks (?) - python

I have a Pandas DataFrame where some of the values are missing (denoted by ?). Is there an easy way to delete all rows where at least one column has the value '?'?
Normally, I would do boolean indexing but I have many columns. One way is as follows:
for index, row in df.iterrows():
    for col in df.columns:
        if '?' in str(row[col]):
            # delete this row
But this seems unPythonic...
Any ideas?

Or just replace '?' with NaN and use dropna:
df.replace({'?':np.nan}).dropna()
Out[126]:
col1 col2 col3 col4
row4 24 12 52 17
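A minimal end-to-end sketch of this approach (assuming the four-row sample frame shown in the next answer):
import numpy as np
import pandas as pd

# Sample frame from the question, with '?' marking missing values
df = pd.DataFrame({'col1': [65, 33, '?', 24],
                   'col2': [24, 48, 34, 12],
                   'col3': [47, '?', 67, 52],
                   'col4': ['?', 89, '?', 17]},
                  index=['row1', 'row2', 'row3', 'row4'])

# Replace '?' with NaN, then drop any row that contains a NaN
print(df.replace({'?': np.nan}).dropna())  # only row4 survives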

Option 1a
boolean indexing and any
df
col1 col2 col3 col4
row1 65 24 47 ?
row2 33 48 ? 89
row3 ? 34 67 ?
row4 24 12 52 17
(df.astype(str) == '?').any(axis=1)
row1 True
row2 True
row3 True
row4 False
dtype: bool
df = df[~(df.astype(str) == '?').any(axis=1)]
df
col1 col2 col3 col4
row4 24 12 52 17
Here, the astype(str) check is to prevent a TypeError: Could not compare ['?'] with block values from being raised if you have a mixture of string and numeric columns in your dataframe.
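To see the issue in isolation, here is a small sketch with a hypothetical mixed frame (one numeric column, one string column):
# Hypothetical mixed-dtype frame
mixed = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', '?', 'z']})

# Casting to str first makes the elementwise comparison safe
mask = (mixed.astype(str) == '?').any(axis=1)
print(mixed[~mask])  # keeps the rows without any '?'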
Option 1b
Direct comparison with values
(df.values == '?').any(1)
array([ True, True, True, False], dtype=bool)
df = df[~(df.values == '?').any(1)]
df
col1 col2 col3 col4
row4 24 12 52 17
Option 2
df.replace and df.notnull
df.replace('?', np.nan).notnull().all(axis=1)
row1 False
row2 False
row3 False
row4 True
dtype: bool
df = df[df.replace('?', np.nan).notnull().all(axis=1)]
col1 col2 col3 col4
row4 24 12 52 17
Which avoids the astype(str) call. Alternatively, you might do as Wen suggested and just drop them:
df.replace('?', np.nan).dropna()
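One caveat worth noting (my addition, not part of the original answer): after any of these filters, columns that held '?' strings are still object dtype, so a numeric conversion may be wanted afterwards:
# Restore numeric dtypes after filtering (sketch)
clean = df[df.replace('?', np.nan).notnull().all(axis=1)]
clean = clean.apply(pd.to_numeric)  # or clean.astype(int)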

You can use boolean indexing with all to check that the values do not contain ?.
If the types are mixed (numeric columns containing ? strings):
df = pd.DataFrame({'B':[4,5,'?',5,5,4],
                   'C':[7,'?',9,4,2,3],
                   'D':[1,3,5,7,'?',0],
                   'E':[5,3,'?',9,2,4]})
print (df)
B C D E
0 4 7 1 5
1 5 ? 3 3
2 ? 9 5 ?
3 5 4 7 9
4 5 2 ? 2
5 4 3 0 4
df = df[(df.astype(str) != '?').all(axis=1)].astype(int)
print (df)
B C D E
0 4 7 1 5
3 5 4 7 9
5 4 3 0 4
Or compare with numpy array created by values:
df = df[(df.values != '?').all(axis=1)]
print (df)
B C D E
0 4 7 1 5
3 5 4 7 9
5 4 3 0 4
If all values are strings, the solution can be simplified:
df = pd.DataFrame({'B':[4,5,'?',5,5,4],
                   'C':[7,'?',9,4,2,3],
                   'D':[1,3,5,7,'?',0],
                   'E':[5,3,'?',9,2,4]}).astype(str)
df = df[(df != '?').all(axis=1)].astype(int)
print (df)
B C D E
0 4 7 1 5
3 5 4 7 9
5 4 3 0 4

Related

How to insert a row of values in a table with Python (not null values)?

I have a table with multiple rows as follows:
table1  col1  col2  col2
row1       1     2     3
row2       3     4     6
row3       4     5     7
row4       5     4     6
row5       6     2     3
row6       7     4     6
I want to change it like this:
table1  col1  col2  col2
row1       1     2     3
row2       3     4     6

table1  col1  col2  col2
row3       4     5     7
row4       5     4     6

table1  col1  col2  col2
row5       6     2     3
row6       7     4     6
That is, I just want to insert a header row to separate them, because they belong to different subtables.
I have tried using the insert function to insert a title row:
pd.DataFrame(np.insert(df.values, 0, values=["col1", "col1", "col3"], axis=0))
but in a given column of a DataFrame, all values must be of the same type.
I also tried xlwings (insert function) and openpyxl (insert_rows function) to insert one row, but it seems they can only insert a blank row, not one with specific values.
After constructing this table, I will use it to set some styles.
In Excel I would just copy and paste; is there a flexible way to do this here?
Or maybe inserting is not a good approach, and I should instead split and recombine the tables (with subtitles, keeping the format)?
Addition:
[data link][1]
[1]: https://cowtransfer.com/s/a160ccec698a48, you need to input code 454008
you can try:
# One header row for every 2 data rows
s = pd.DataFrame([df.columns] * int(len(df) / 2), columns=df.columns)
# Place the headers at index positions 1, 3, 6 so they interleave on sort
s.index = pd.Series(s.index + 1).cumsum()
# Stable sort keeps each header after its data rows; drop the trailing one
df = pd.concat([df, s]).sort_index().iloc[:-1].reset_index(drop=True)
output of df:
table1 col1 col2 col2
0 row1 1 2 3
1 row2 3 4 6
2 table1 col1 col2 col2
3 row3 4 5 7
4 row4 5 4 6
5 table1 col1 col2 col2
6 row5 6 2 3
7 row6 7 4 6
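Why the index arithmetic works (my annotation): the separator frame s starts with index [0, 1, 2]; adding 1 and taking the cumulative sum yields [1, 3, 6], and because sort_index is stable (the data rows were concatenated first), each header lands right after the data row that shares its index value. iloc[:-1] then trims the trailing header at index 6.
import pandas as pd

# Index arithmetic for the original 6-row df: s.index is [0, 1, 2]
print(list((pd.Series([0, 1, 2]) + 1).cumsum()))  # [1, 3, 6]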
Update:
you can try:
s = pd.DataFrame([df.columns] * int(len(df) / 16), columns=df.columns)
s.index = [x + 1 for x in range(16, 16 * (len(s) + 1), 16)]
df = pd.concat([df, s]).sort_index().reset_index(drop=True)
# If you need to remove the last row, use:
# df = df.iloc[:-1]
Sample dataframe used by me for testing:
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['table1'] = [f'row {x}' for x in range(65)]
df['col1'] = np.random.randint(1, 10, 65)
df['col2'] = np.random.randint(1, 10, 65)
df['col3'] = np.random.randint(1, 10, 65)
# Rename the last column so the header matches the question's duplicate 'col2'
df.columns = [*df.columns[:-1], 'col2']

In Pandas, how do I compare values in a dataframe with others in its row and column at the same time?

So I have a dataframe and I would like to be able to compare each value with other values in its row and column at the same time. For example, I have something like this
Col1 Col2 Col3 NumCol
Row1 1 4 7 16
Row2 2 5 8 13
Row3 3 6 9 30
NumRow 28 14 10
For each value that isn't in the NumRow or NumCol, I would like to compare the NumCol and NumRow values in the same column/row as it.
I would like it to return the value of the first instance where NumCol is larger than NumRow in each row.
So the result would be this:
Row1 4
Row2 8
Row3 3
I have no clue on how to even begin this, but is there a way to do this elegantly without using for loops to loop through the whole dataframe to find these values?
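For reference, here is a sketch reconstructing that frame (my reconstruction; the NumRow row has no NumCol entry, hence the NaN):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, 3, 28],
                   'Col2': [4, 5, 6, 14],
                   'Col3': [7, 8, 9, 10],
                   'NumCol': [16, 13, 30, np.nan]},
                  index=['Row1', 'Row2', 'Row3', 'NumRow'])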
First we flatten the dataframe (here df is your original dataframe):
df2 = (df.fillna('NumRow')
         .set_index('NumCol')
         .transpose()
         .set_index('NumRow')
         .stack()
         .reset_index(name='value')
       )
df2
output
NumRow NumCol value
0 28 16.0 1
1 28 13.0 2
2 28 30.0 3
3 14 16.0 4
4 14 13.0 5
5 14 30.0 6
6 10 16.0 7
7 10 13.0 8
8 10 30.0 9
Now, for each row of the new dataframe df2, we have the corresponding number from NumRow, the corresponding number from NumCol, and the number from within the 'body' of the original dataframe df.
Next we apply the condition, group by NumCol, and within each group find the first row where the condition is satisfied. We report the corresponding value:
df3 = (df2.assign(cond=df2['NumCol'] > df2['NumRow'])
          .groupby('NumCol')
          .apply(lambda d: d[d['cond']].iloc[0])['value']
       )
df3.index = df3.index.map(dict(zip(df['NumCol'],df.index)))
df3.sort_index()
Output
NumCol
Row1 4
Row2 8
Row3 3
Name: value, dtype: int64

Split the data frame based on consecutive row values differences

I have a data frame like this,
df
col1 col2 col3
1 2 3
2 5 6
7 8 9
10 11 12
11 12 13
13 14 15
14 15 16
Now I want to create multiple data frames from the above whenever the col1 difference of two consecutive rows is more than 1.
So the result data frames will look like,
df1
col1 col2 col3
1 2 3
2 5 6
df2
col1 col2 col3
7 8 9
df3
col1 col2 col3
10 11 12
11 12 13
df4
col1 col2 col3
13 14 15
14 15 16
I can do this using a for loop and storing the indices, but that increases execution time; I'm looking for a pandas shortcut or Pythonic way to do this efficiently.
You could define a custom grouper by taking the diff, checking when it is greater than 1, and taking the cumsum of the boolean series. Then group by the result and build a dictionary from the groupby object:
d = dict(tuple(df.groupby(df.col1.diff().gt(1).cumsum())))
print(d[0])
col1 col2 col3
0 1 2 3
1 2 5 6
print(d[1])
col1 col2 col3
2 7 8 9
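If you don't need the dict at all, you can also iterate the groupby object directly (a usage sketch with the same grouper):
# Iterate the split frames without materializing a dictionary
for key, frame in df.groupby(df.col1.diff().gt(1).cumsum()):
    print(f'group {key}:')
    print(frame)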
A more detailed break-down:
df.assign(difference=(diff := df.col1.diff()),
          condition=(gt1 := diff.gt(1)),
          grouper=gt1.cumsum())
col1 col2 col3 difference condition grouper
0 1 2 3 NaN False 0
1 2 5 6 1.0 False 0
2 7 8 9 5.0 True 1
3 10 11 12 3.0 True 2
4 11 12 13 1.0 False 2
5 13 14 15 2.0 True 3
6 14 15 16 1.0 False 3
You can also peel off the target column and work with it as a Series, rather than using the approach above. That keeps everything smaller. It runs faster on the example, but I don't know how the two will scale, depending on how many times you're splitting.
row_bool = df['col1'].diff() > 1
split_inds, = np.where(row_bool)
split_inds = np.insert(arr=split_inds, obj=[0, len(split_inds)], values=[0, len(df)])

df_list = []
for n in range(len(split_inds) - 1):
    tempdf = df.iloc[split_inds[n]:split_inds[n+1], :]
    df_list.append(tempdf)
(Just collecting them in a list of dataframes afterward, but the dictionary approach might be better?)
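The resulting list can then be unpacked into the four frames from the question (a usage sketch assuming the list built above):
df1, df2, df3, df4 = df_list
print(df2)
#    col1  col2  col3
# 2     7     8     9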

Subset rows in df depending on conditions

Hello, I have a df and I wondered how I can subset the rows where:
COL1 contains a string "ok"
COL2 > 4
COL3 < 4
Here is an example:
COL1 COL2 COL3
AB_ok_7 5 2
AB_ok_4 2 5
AB_uy_2 5 2
AB_ok_2 2 2
U_ok_7 12 3
The result should display only:
COL1 COL2 COL3
AB_ok_7 5 2
U_ok_7 12 3
Like this:
In [2288]: df[df['COL1'].str.contains('ok') & df['COL2'].gt(4) & df['COL3'].lt(4)]
Out[2288]:
COL1 COL2 COL3
0 AB_ok_7 5 2
4 U_ok_7 12 3
You can use boolean indexing and chaining all the conditions.
m = df['COL1'].str.contains('ok')
m1 = df['COL2'].gt(4)
m2 = df['COL3'].lt(4)
df[m & m1 & m2]
COL1 COL2 COL3
0 AB_ok_7 5 2
4 U_ok_7 12 3
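For reproducibility, a sketch that rebuilds the sample frame from the question and applies the chained mask (the frame below is my reconstruction):
import pandas as pd

df = pd.DataFrame({'COL1': ['AB_ok_7', 'AB_ok_4', 'AB_uy_2', 'AB_ok_2', 'U_ok_7'],
                   'COL2': [5, 2, 5, 2, 12],
                   'COL3': [2, 5, 2, 2, 3]})

# Chain all three conditions with boolean indexing
out = df[df['COL1'].str.contains('ok') & df['COL2'].gt(4) & df['COL3'].lt(4)]
print(out)  # rows 0 (AB_ok_7) and 4 (U_ok_7)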

position or move pandas column to a specific column index

I have a DF mydataframe and it has multiple columns (over 75 columns) with default numeric index:
Col1 Col2 Col3 ... Coln
I need to arrange/change position to as follows:
Col1 Col3 Col2 ... Coln
I can get the index of Col2 using:
mydataframe.columns.get_loc("Col2")
but I don't seem to be able to figure out how to swap, without manually listing all columns and then manually rearrange in a list.
Try:
new_cols = ['Col1', 'Col3', 'Col2'] + df.columns[3:].tolist()
df = df[new_cols]
How to proceed:
store the names of columns in a list;
swap the names in that list;
apply the new order on the dataframe.
code:
l = list(df)
i1, i2 = l.index('Col2'), l.index('Col3')
l[i2], l[i1] = l[i1], l[i2]
df = df[l]
I'm imagining you want what @sentence is assuming: you want to swap the positions of two columns regardless of where they are.
This is a creative approach:
Create a dictionary that defines which columns get switched with what.
Define a function that takes a column name and returns an ordering.
Use that function as a key for sorting.
d = {'Col3': 'Col2', 'Col2': 'Col3'}
k = lambda x: df.columns.get_loc(d.get(x, x))
df[sorted(df, key=k)]
Col0 Col1 Col3 Col2 Col4
0 0 1 3 2 4
1 5 6 8 7 9
2 10 11 13 12 14
3 15 16 18 17 19
4 20 21 23 22 24
Setup
df = pd.DataFrame(
    np.arange(25).reshape(5, 5)
).add_prefix('Col')
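A quick annotation of the sort key (my gloss, using the setup above): sorted(df, key=k) iterates over the column names, and k maps each name to the position of its swap partner (names not in d map to themselves), so Col2 sorts into Col3's slot and vice versa.
# Each column name maps to its partner's location, others to their own
print([k(c) for c in df.columns])  # [0, 1, 3, 2, 4]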
Using np.r_ to create an array of column indices:
Given a sample as follows:
df:
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
i, j = df.columns.slice_locs('col2', 'col10')
df[df.columns[np.r_[:i, i+1, i, i+2:j]]]
Out[142]:
col1 col3 col2 col4 col5 col6 col7 col8 col9 col10
0 0 2 1 3 4 5 6 7 8 9
1 10 12 11 13 14 15 16 17 18 19
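If this reordering comes up often, a small helper along these lines may be handy (a hypothetical utility, not taken from the answers above):
import pandas as pd

def move_column(df: pd.DataFrame, name: str, pos: int) -> pd.DataFrame:
    # Return a copy of df with column `name` moved to position `pos`
    cols = list(df.columns)
    cols.insert(pos, cols.pop(cols.index(name)))
    return df[cols]

# Hypothetical usage: move 'Col2' so it lands after 'Col3'
# df = move_column(df, 'Col2', 2)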
