Dropping rows in python pandas

I have the following DataFrame:
2010-01-03 2010-01-04 2010-01-05 2010-01-06 2010-01-07
1560 0.002624 0.004992 -0.011085 -0.007508 -0.007508
14 0.000000 -0.000978 -0.016960 -0.016960 -0.009106
2920 0.000000 0.018150 0.018150 0.002648 0.025379
1502 0.000000 0.018150 0.011648 0.005963 0.005963
78 0.000000 0.018150 0.014873 0.014873 0.007564
I have a list of indices corresponding to rows that I want to drop from my DataFrame. For simplicity, assume my list is idx_to_drop = [1560, 1502], which correspond to the 1st and 4th rows in the dataframe above.
I tried to run df2 = df.drop(df.index[idx_to_drop]), but that expects row positions rather than the .ix()-style index labels. I have many more rows and many more columns, and getting row positions via the where() function takes a while.
How can I drop rows whose index labels match my list?

I would tackle this by breaking the problem into two pieces. Mask what you are looking for, then sub-select the inverse.
Short answer:
df[~df.index.isin([1560, 1502])]
Explanation with runnable example, using isin:
import pandas as pd
df = pd.DataFrame({'index': [1, 2, 3, 1500, 1501],
'vals': [1, 2, 3, 4, 5]}).set_index('index')
bad_rows = [1500, 1501]
mask = df.index.isin(bad_rows)
print(mask)
[False False False True True]
df[mask]
vals
index
1500 4
1501 5
print(~mask)
[ True True True False False]
df[~mask]
vals
index
1 1
2 2
3 3
You can see that we've identified the two bad rows, and then we choose all the rows that aren't the bad ones. Our mask is for the bad rows, and everything else is selected by inverting that mask (~mask).
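As an aside, DataFrame.drop also accepts index labels directly, so the mask is not strictly required. A minimal sketch, reusing the df and idx_to_drop from the question (errors='ignore' simply skips labels that are not present in the index):
idx_to_drop = [1560, 1502]
# Drop by index label (axis=0 is the default); errors='ignore' skips missing labels.
df2 = df.drop(idx_to_drop, errors='ignore')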

Related

Find greater row between two dataframes of same shape

I have two dataframes of the same shape and am trying to find all the rows in df A where every value is greater than the corresponding row in df B.
Mini-example:
df_A = pd.DataFrame({'one':[20,7,2],'two':[11,9,1]})
df_B = pd.DataFrame({'one':[1,8,12],'two':[10,5,3]})
I'd like to return only row 0.
one two
0 20 11
I realise that df_A > df_B gets me most of the way, but I just can't figure out how to return only those rows where everything is True.
(I tried merging the two, but that didn't seem to make it simpler.)
IIUIC, you can use all
In [633]: m = (df_A > df_B).all(1)
In [634]: m
Out[634]:
0 True
1 False
2 False
dtype: bool
In [635]: df_A[m]
Out[635]:
one two
0 20 11
In [636]: df_B[m]
Out[636]:
one two
0 1 10
In [637]: pd.concat([df_A[m], df_B[m]])
Out[637]:
one two
0 20 11
0 1 10
Or, if you just need the row indices:
In [642]: m.index[m]
Out[642]: Int64Index([0], dtype='int64')
Alternatively, as a one-liner:
df_A.loc[(df_A > df_B).all(axis=1)]
import pandas as pd
df_A = pd.DataFrame({"one": [20, 7, 2], "two": [11, 9, 1]})
df_B = pd.DataFrame({"one": [1, 8, 12], "two": [10, 5, 3]})
row_indices = (df_A > df_B).apply(min, axis=1)
print(df_A[row_indices])
print()
print(df_B[row_indices])
Output is:
one two
0 20 11
one two
0 1 10
Explanation:
df_A > df_B compares element-wise; this is the result:
one two
0 True True
1 False True
2 False False
Python's built-in min and max treat True as greater than False, so applying min row-wise (this is why I used axis=1) only yields True if every value in a row is True:
0 True
1 False
2 False
This is now a boolean index to extract rows from df_A resp. df_B.
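For what it's worth, the row-wise min over booleans is equivalent to the vectorised all used in the earlier answer; a quick sketch assuming the df_A and df_B defined above:
mask_apply = (df_A > df_B).apply(min, axis=1)  # row-wise min of booleans
mask_all = (df_A > df_B).all(axis=1)           # idiomatic vectorised equivalent
print((mask_apply == mask_all).all())          # True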
It can be done in one line of code, if you are interested.
df_A[(df_A > df_B)].dropna(axis=0, how='any')
Here df_A[(df_A > df_B)] keeps each value where the comparison is True and inserts NaN where it is False:
one two
0 20.0 11.0
1 NaN 9.0
2 NaN NaN
Then we drop along axis 0 any row that contains at least one NaN value.

Efficient pairwise comparison of rows in pandas DataFrame

I am currently working with a smallish dataset (about 9 million rows). Unfortunately, most of the entries are strings, and even with coercion to categories, the frame sits at a few GB in memory.
What I would like to do is compare each row with other rows and do a straight comparison of contents. For example, given
A B C D
0 cat blue old Saturday
1 dog red old Saturday
I would like to compute
d_A d_B d_C d_D
0, 0 True True True True
0, 1 False False True True
1, 0 False False True True
1, 1 True True True True
Obviously, combinatorial explosion will preclude a comparison of every record with every other record. So we can instead use blocking, by applying groupby, say on column A.
My question is, is there a way to do this in either pandas or dask that is faster than the following sequence:
Group by index
Outer join each group to itself to produce pairs
dataframe.apply comparison function on each row of pairs
For reference, assume I have access to a good number of cores (hundreds), and about 200G of memory.
The solution turned out to be using numpy in place of step 3). While we cannot create an outer join of every row, we can group by values in column A and create smaller groups to outer join.
The trick is then to use numpy.equal.outer(df1, df2).ravel(). When dataframes are passed as inputs to a numpy function in this way, the result is a much faster (at least 30x) vectorized result. For example:
>>> df = pd.DataFrame({"A": ["cat", "dog"], "B": ["blue", "red"],
...                    "C": ["old", "old"], "D": ["Saturday", "Saturday"]})
>>> df
     A     B    C         D
0  cat  blue  old  Saturday
1  dog   red  old  Saturday
>>> result = pd.DataFrame(columns=["A", "B", "C", "D"],
...                       index=pd.MultiIndex.from_product([df.index, df.index]))
>>> result["A"] = np.equal.outer(df["A"], df["A"]).ravel()
>>> result
A B C D
0, 0 True NaN NaN NaN
0, 1 False NaN NaN NaN
1, 0 False NaN NaN NaN
1, 1 True NaN NaN NaN
You can repeat for each column, or just automate the process with columnwise apply on result.
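A rough sketch of that per-column automation, assuming the two-row df above and that plain equality is the comparison wanted (the d_ column names simply mirror the desired output in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["cat", "dog"], "B": ["blue", "red"],
                   "C": ["old", "old"], "D": ["Saturday", "Saturday"]})
result = pd.DataFrame(index=pd.MultiIndex.from_product([df.index, df.index]))
for col in df.columns:
    # Pairwise equality of every row against every other row for this column.
    result["d_" + col] = np.equal.outer(df[col].to_numpy(), df[col].to_numpy()).ravel()
print(result)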
You might consider phrasing your problem as a join operation
You might consider using categoricals to reduce memory use
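A very rough sketch of those two suggestions combined, assuming the hypothetical columns A-D from the example and that column A is the blocking key; this is one possible phrasing, not the only one:
import pandas as pd

df = pd.DataFrame({"A": ["cat", "dog"], "B": ["blue", "red"],
                   "C": ["old", "old"], "D": ["Saturday", "Saturday"]})

# Shrink repetitive string columns by converting them to categoricals.
for col in df.columns:
    df[col] = df[col].astype("category")

# Phrase the blocked pairwise comparison as a self-join on the blocking key A.
pairs = df.reset_index().merge(df.reset_index(), on="A", suffixes=("_l", "_r"))
pairs["d_B"] = pairs["B_l"] == pairs["B_r"]  # repeat for columns C and D as needed
print(pairs)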

Pandas.read_excel sometimes incorrectly reads Boolean values as 1's/0's

I need to read a very large Excel file into a DataFrame. The file has string, integer, float, and Boolean data, as well as missing data and totally empty rows. It may also be worth noting that some of the cell values are derived from cell formulas and/or VBA - although theoretically that shouldn't affect anything.
As the title says, pandas sometimes reads Boolean values as float or int 1's and 0's, instead of True and False. It appears to have something to do with the number of empty rows and the types of the other data. For simplicity's sake, I'm just linking a 2-sheet Excel file where the issue is replicated.
Boolean_1.xlsx
Here's the code:
import pandas as pd
df1 = pd.read_excel('Boolean_1.xlsx','Sheet1')
df2 = pd.read_excel('Boolean_1.xlsx','Sheet2')
print(df1, '\n' *2, df2)
Here's the printed output. Note row ZBA in particular: it has the same values in both sheets, but different values in the two DataFrames:
Name stuff Unnamed: 1 Unnamed: 2 Unnamed: 3
0 AFD a dsf ads
1 DFA 1 2 3
2 DFD 123.3 41.1 13.7
3 IIOP why why why
4 NaN NaN NaN NaN
5 ZBA False False True
Name adslfa Unnamed: 1 Unnamed: 2 Unnamed: 3
0 asdf 6.0 3.0 6.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 ZBA 0.0 0.0 1.0
I was also able to get integer 1's and 0's output in the large file I'm actually working on (yay), but wasn't able to easily replicate it.
What could be causing this inconsistency, and is there a way to force pandas to read Booleans as they should be read?
Pandas type-casting is applied by column / series. In general, Pandas doesn't work well with mixed types, or object dtype. You should expect internalised logic to determine the most efficient dtype for a series. In this case, Pandas has chosen float dtype as applicable for a series containing float and bool values. In my opinion, this is efficient and neat.
However, as you noted, this won't work when you have a transposed input dataset. Let's set up an example from scratch:
import pandas as pd, numpy as np
df = pd.DataFrame({'A': [True, False, True, True],
'B': [np.nan, np.nan, np.nan, False],
'C': [True, 'hello', np.nan, True]})
df = df.astype({'A': bool, 'B': float, 'C': object})
print(df)
A B C
0 True NaN True
1 False NaN hello
2 True NaN NaN
3 True 0.0 True
Option 1: change "row dtype"
You can, without transposing your data, change the dtype for objects in a row. This will force series B to have object dtype, i.e. a series storing pointers to arbitrary types:
df.iloc[3] = df.iloc[3].astype(bool)
print(df)
A B C
0 True NaN True
1 False NaN hello
2 True NaN NaN
3 True False True
print(df.dtypes)
A bool
B object
C object
dtype: object
Option 2: transpose and cast to Boolean
In my opinion, this is the better option, as a data type is being attached to a specific category / series of input data.
df = df.T # transpose dataframe
df[3] = df[3].astype(bool) # convert series to Boolean
print(df)
0 1 2 3
A True False True True
B NaN NaN NaN False
C True hello NaN True
print(df.dtypes)
0 object
1 object
2 object
3 bool
dtype: object
read_excel will determine the dtype for each column based on the first row in that column that has a value. If the first row of the column is empty, read_excel continues to the next row until a value is found.
In Sheet1, the first row with values in columns B, C, and D contains strings. Therefore, all subsequent rows are treated as strings for these columns. In this case, FALSE = False.
In Sheet2, the first row with values in columns B, C, and D contains integers. Therefore, all subsequent rows are treated as integers for these columns. In this case, FALSE = 0.
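If you need to override that inference, read_excel also accepts a dtype argument that pins a column's type up front. A hedged sketch, assuming the 'Unnamed: 1' column name from the printout above; whether the original True/False values survive still depends on how the engine reports the cells, so verify against your own file:
import pandas as pd

# Force the column to object dtype so values are kept as read (booleans, numbers, strings)
# instead of being upcast to a single numeric dtype for the whole column.
df2 = pd.read_excel('Boolean_1.xlsx', 'Sheet2', dtype={'Unnamed: 1': object})
print(df2.dtypes)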

How to select all elements greater than a given values in a dataframe

I have a csv that is read by my python code and a dataframe is created using pandas.
CSV file is in following format
1 1.0
2 99.0
3 20.0
7 63
My code calculates the percentile, and I want to find all rows whose value in the 2nd column is greater than 60.
df = pd.read_csv(io.BytesIO(body), error_bad_lines=False, header=None, encoding='latin1', sep=',')
percentile = df.iloc[:, 1:2].quantile(0.99) # Selecting 2nd column and calculating percentile
criteria = df[df.iloc[:, 1:2] >= 60.0]
While my percentile code works fine, the criteria expression meant to find all rows whose 2nd-column value is greater than 60 returns
NaN NaN
NaN NaN
NaN NaN
NaN NaN
Can you please help me find the error?
Just correct the condition inside criteria. Since the second column has positional index 1, you should write df.iloc[:, 1].
Example:
import pandas as pd
import numpy as np
b = np.array([[1, 2, 3, 7], [1, 99, 20, 63]])
df = pd.DataFrame(b.T)  # just creating the dataframe
criteria = df[ df.iloc[:,1]>= 60 ]
print(criteria)
Why?
It seems like the cause lies in the type of the object produced by the condition. Let's inspect:
Case 1:
type( df.iloc[:,1]>= 60 )
Returns pandas.core.series.Series, so it gives
df[ df.iloc[:,1]>= 60 ]
#out:
0 1
1 2 99
3 7 63
Case 2:
type( df.iloc[:,1:2]>= 60 )
Returns a pandas.core.frame.DataFrame, and gives
df[ df.iloc[:,1:2]>= 60 ]
#out:
0 1
0 NaN NaN
1 NaN 99.0
2 NaN NaN
3 NaN 63.0
Therefore the selection is processed differently: indexing with a boolean DataFrame masks the frame cell by cell (inserting NaN where the mask is False) instead of selecting whole rows.
Always keep in mind that 3 is a scalar index, while 3:4 is a slice.
For more info, it is always good to take a look at the official docs on pandas indexing.
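If you do want to keep the slice-style selection, one workaround (a sketch, not the only option, reusing the df built in the example above) is to collapse the single-column boolean DataFrame back into a Series before indexing:
# .squeeze() turns the one-column boolean DataFrame into a Series,
# so the selection behaves like the scalar-index version above.
criteria = df[(df.iloc[:, 1:2] >= 60.0).squeeze()]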
Your indexing is a bit off: you only have two columns, [0, 1], and you are interested in selecting just the one with index 1. As @applesoup mentioned, the following is enough:
criteria = df[df.iloc[:, 1] >= 60.0]
However, I would consider naming columns and just referencing based on name. This will allow you to avoid any mistakes in case your df structure changes, e.g.:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 7], 'b': [1.0, 99.0, 20.0, 63.]})
criteria = df[df['b'] >= 60.0]
People here seem to be more interested in coming up with alternative solutions instead of digging into his code in order to find out what's really wrong. I will adopt a diametrically opposed strategy!
The problem with your code is that you are indexing your DataFrame df by another DataFrame. Why? Because you use slices instead of integer indexing.
df.iloc[:, 1:2] >= 60.0 # Return a DataFrame with one boolean column
df.iloc[:, 1] >= 60.0 # Return a Series
df.iloc[:, [1]] >= 60.0 # Return a DataFrame with one boolean column
So correct your code by using:
criteria = df[df.iloc[:, 1] >= 60.0]  # Don't slice!

Randomly insert NA's values in a pandas dataframe - with no rows completely missing

How can I randomly make some values missing in a pandas DataFrame, as in Randomly insert NA's values in a pandas dataframe, but make sure no row ends up with all of its values missing?
Edit: Sorry for not stating this explicitly again (it was in the question I referenced, though): I need to be able to specify what percentage of the cells, for example 10%, is supposed to be NaN (or rather, as close to 10% as the DataFrame's size allows), as opposed to, say, clearing cells independently with a marginal per-cell probability of 10%.
You can use DataFrame.mask with a NumPy boolean mask; the fix for the problematic all-True rows comes from an answer to a question of mine:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})
print (df)
A B C
0 1 4 7
1 2 5 8
2 3 6 9
np.random.seed(100)
mask = np.random.choice([True, False], size=df.shape)
print (mask)
[[ True True False]
[False False False]
[ True True True]] -> problematic values - all True
mask[mask.all(1),-1] = 0
print (mask)
[[ True True False]
[False False False]
[ True True False]]
print (df.mask(mask))
A B C
0 NaN NaN 7
1 2.0 5.0 8
2 NaN NaN 9
Here is an answer based on Randomly insert NA's values in a pandas dataframe:
import collections
import random

import numpy as np

# df is the DataFrame to modify in place; replace roughly 10% of its cells
# with NaN, but never blank out a whole row.
replaced = collections.defaultdict(set)
ix = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]
random.shuffle(ix)
to_replace = int(round(.1 * len(ix)))
for row, col in ix:
    if len(replaced[row]) < df.shape[1] - 1:
        df.iloc[row, col] = np.nan
        to_replace -= 1
        replaced[row].add(col)
        if to_replace == 0:
            break
The shuffle puts the cell positions in a random order, and the if clause prevents any row from being replaced entirely.
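A quick sanity check of the result (a sketch, assuming df is the frame the loop above just modified):
print(df.isna().mean().mean())      # overall fraction of NaN cells, should be close to 0.1
print(df.isna().all(axis=1).any())  # False means no row is completely missing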
How about applying a function that replaces the values of random columns in each row? To avoid replacing an entire row, the number of values to replace is drawn so that it stays below the row length.
import random

import numpy as np

def add_random_na(row):
    # row is a Series; its values must be able to hold NaN (float or object dtype)
    vals = row.values
    for _ in range(random.randint(0, len(vals) - 2)):
        i = random.randint(0, len(vals) - 1)
        vals[i] = np.nan
    return vals

df = df.apply(add_random_na, axis=1)
