Get rows from DataFrame based on array of indices - python

I have an array of numbers that correspond to the row positions that need to be selected from a DataFrame.
For example, arr = np.array([0,0,1,1]) and the DataFrame is shown below. arr holds row positions, not index labels.
        A  B  C  D
Index
3      10  0  0  0
4       5  2  0  0
Using arr I would like to produce a DataFrame that looks like this:
        A  B  C  D
Index
3      10  0  0  0
3      10  0  0  0
4       5  2  0  0
4       5  2  0  0

You can use iloc with integer indexing:
df.iloc[[0,0,1,1], :]  # or df.iloc[arr, :]
#         A  B  C  D
# Index
# 3      10  0  0  0
# 3      10  0  0  0
# 4       5  2  0  0
# 4       5  2  0  0
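For reference, here is a minimal runnable sketch of the same selection; the frame is reconstructed from the printed example above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 5], 'B': [0, 2], 'C': [0, 0], 'D': [0, 0]},
                  index=pd.Index([3, 4], name='Index'))
arr = np.array([0, 0, 1, 1])

# iloc selects by position, so repeated positions repeat rows,
# regardless of what the index labels are.
print(df.iloc[arr])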

Related

ApplyMap function on Multiple columns pandas

I have this dataframe
dd = pd.DataFrame({'a':[1,5,3],'b':[3,2,3],'c':[2,4,5]})
   a  b  c
0  1  3  2
1  5  2  4
2  3  3  5
I want to replace the values in columns a and b that are smaller than the value in column c with 0, operating row-wise.
I tried this:
dd.applymap(lambda x: 0 if x < x['c'] else x)
and got this error:
TypeError: 'int' object is not subscriptable
I understand that x is an int, but how can I get the value of column c for that row?
I want this output:
   a  b  c
0  0  3  2
1  5  0  4
2  0  0  5
Use DataFrame.mask with DataFrame.lt:
df = dd.mask(dd.lt(dd['c'], axis=0), 0)
print(df)
   a  b  c
0  0  3  2
1  5  0  4
2  0  0  5
Or you can set values by comparing against column c, broadcast across the columns:
dd[dd < dd['c'].to_numpy()[:, None]] = 0
print(dd)
   a  b  c
0  0  3  2
1  5  0  4
2  0  0  5
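As a side note, an alternative sketch using numpy.where gives the same result without mutating the original frame (column c is compared against itself, which is never strictly less, so it passes through unchanged):
import numpy as np
import pandas as pd

dd = pd.DataFrame({'a': [1, 5, 3], 'b': [3, 2, 3], 'c': [2, 4, 5]})

# Wherever a value is less than that row's c, substitute 0.
out = pd.DataFrame(np.where(dd.lt(dd['c'], axis=0), 0, dd),
                   columns=dd.columns, index=dd.index)
print(out)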

How to merge one numpy array onto multiple dataframes

I have a bunch of data frames. They all have the same columns but different numbers of rows. They look like this:
df_1
   0
0  1
1  0
2  0
3  1
4  1
5  0
df_2
   0
0  1
1  0
2  0
3  1
df_3
   0
0  1
1  0
2  0
3  1
4  1
I have them all stored in a list.
Then, I have a numpy array where each item maps to a row in each individual df. The numpy array looks like this:
[3 1 1 2 4 0 6 7 2 1 3 2 5 5 5]
If I were to pd.concat my list of dataframes, then I could merge the np array onto the concatenated df. However, I want to preserve the individual df structure, so it should look like this:
   0  1
0  1  3
1  0  1
2  0  1
3  1  2
4  1  4
5  0  0

   0  1
0  1  6
1  0  7
2  0  2
3  1  1

   0  1
0  1  3
1  0  2
2  0  5
3  1  5
4  1  5
Considering the given dataframes and array as:
df1 = pd.DataFrame([1,0,0,1,1,0])
df2 = pd.DataFrame([1,0,0,1])
df3 = pd.DataFrame([1,0,0,1,1])
arr = np.array([3, 1, 1, 2, 4, 0, 6, 7, 2, 1, 3, 2, 5, 5, 5])
You can use numpy.split to split the array into sub-arrays according to the lengths of the given dataframes, then append each sub-array as a column to its respective dataframe.
Use:
dfs = [df1, df2, df3]

def get_indices(dfs):
    """
    Returns the split indices inside the array.
    """
    indices = [0]
    for df in dfs:
        indices.append(len(df) + indices[-1])
    return indices[1:-1]

# split the given arr into multiple sections
sections = np.split(arr, get_indices(dfs))

for df, s in zip(dfs, sections):
    df[1] = s  # append the section of the array to the dataframe
    print(df)
This results in:
# df1
   0  1
0  1  3
1  0  1
2  0  1
3  1  2
4  1  4
5  0  0
# df2
   0  1
0  1  6
1  0  7
2  0  2
3  1  1
# df3
   0  1
0  1  3
1  0  2
2  0  5
3  1  5
4  1  5
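As a compact variant (a sketch, not part of the answer above, reusing the dfs and arr defined there), the split points can be computed directly with numpy.cumsum over the dataframe lengths:
import numpy as np

# Cumulative lengths mark the boundaries between sections; the final
# total is dropped because np.split only needs the interior split points.
split_points = np.cumsum([len(df) for df in dfs])[:-1]
for df, s in zip(dfs, np.split(arr, split_points)):
    df[1] = s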

How to efficiently remove leading rows containing only zeros?

I have a pandas dataframe whose first rows contain only zeros as values, and I would like to remove those leading rows.
Denoting my dataframe df and its columns ['a', 'b', 'c'], I tried the following code:
df[(df[['a', 'b', 'c']] != 0).all(axis=1)]
But it also turns the following dataframe:
a b c
0 0 0
0 0 0
1 0 0
0 0 0
2 3 5
4 5 6
0 0 0
1 1 1
Into this one:
a b c
1 0 0
2 3 5
4 5 6
1 1 1
That's not what I want; I only want to drop the leading rows. So, I would like to have:
a b c
1 0 0
0 0 0
2 3 5
4 5 6
0 0 0
1 1 1
It would be great to have a simple and efficient solution using pandas functions. Thanks
General solution that works even if all rows are 0: first use cumsum for a cumulative sum, then test for any True per row:
df1 = df[(df[['a', 'b', 'c']] != 0).cumsum().any(axis=1)]
print(df1)
   a  b  c
2  1  0  0
3  0  0  0
4  2  3  5
5  4  5  6
6  0  0  0
7  1  1  1
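To make the mechanics visible, here is a sketch of the intermediate steps (the frame is reconstructed from the example above):
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 0, 2, 4, 0, 1],
                   'b': [0, 0, 0, 0, 3, 5, 0, 1],
                   'c': [0, 0, 0, 0, 5, 6, 0, 1]})

nonzero = df[['a', 'b', 'c']] != 0   # True where a value is non-zero
seen = nonzero.cumsum()              # running count of non-zeros per column
keep = seen.any(axis=1)              # True from the first non-zero row onward
print(df[keep])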
Solution if there is at least one non-zero row: get the position of the first non-zero row with Series.idxmax (note this relies on a default RangeIndex, since idxmax returns a label that iloc then treats as a position):
df1 = df.iloc[(df[['a', 'b', 'c']] != 0).any(axis=1).idxmax():]
print(df1)
   a  b  c
2  1  0  0
3  0  0  0
4  2  3  5
5  4  5  6
6  0  0  0
7  1  1  1
Here is an example that finds the first row that is not all zeros and then selects everything from that point on. It should solve the problem you are describing:
ix_first_valid = df[(df != 0).any(axis=1)].index[0]
df[ix_first_valid:]
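A position-based variant (a sketch, useful when the index is not a default RangeIndex) computes the first non-zero position with numpy's argmax on the boolean mask:
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 0, 2, 4, 0, 1],
                   'b': [0, 0, 0, 0, 3, 5, 0, 1],
                   'c': [0, 0, 0, 0, 5, 6, 0, 1]})

# argmax on a boolean array returns the position of the first True
# (this assumes at least one non-zero row exists).
first_nonzero = (df != 0).any(axis=1).to_numpy().argmax()
print(df.iloc[first_nonzero:])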

Python pandas cumsum with reset every time there is a 0

I have a matrix with 0s and 1s, and want to do a cumsum on each column that resets to 0 whenever a zero is observed. For example, if we have the following:
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]],columns = ['a','b'])
print(df)
   a  b
0  0  1
1  1  1
2  0  1
3  1  0
4  1  1
5  0  1
The result I desire is:
print(df)
   a  b
0  0  1
1  1  2
2  0  3
3  1  0
4  2  1
5  0  2
However, when I try df.cumsum() * df, I am able to correctly identify the 0 elements, but the counter does not reset:
print(df.cumsum() * df)
   a  b
0  0  1
1  1  2
2  0  3
3  2  0
4  3  4
5  0  5
You can subtract from the running count of non-zeros the count recorded at the most recent zero (forward-filled), which makes the sum restart after every zero:
a = df != 0
df1 = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int)
print(df1)
   a  b
0  0  1
1  1  2
2  0  3
3  1  0
4  2  1
5  0  2
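To see why this resets, here is a sketch of the intermediate frames (the df from the question is reconstructed for self-containment):
import pandas as pd

df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns=['a','b'])

a = df != 0
running = a.cumsum()                 # running count of non-zeros per column
frozen = running.where(~a)           # that count, kept only at the zeros
baseline = frozen.ffill().fillna(0)  # last count at a zero, carried forward
print((running - baseline).astype(int))  # count since the most recent zero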
Try this:
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns=['a','b'])
df['groupId1'] = df.a.eq(0).cumsum()
df['groupId2'] = df.b.eq(0).cumsum()
New = pd.DataFrame()
New['a'] = df.groupby('groupId1').a.transform('cumsum')
New['b'] = df.groupby('groupId2').b.transform('cumsum')
New
Out[1184]:
   a  b
0  0  1
1  1  2
2  0  3
3  1  0
4  2  1
5  0  2
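The same grouping idea fits in one line across all columns (a sketch, equivalent to the above since each zero opens a new group):
import pandas as pd

df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns=['a','b'])

# Each zero increments the group id, and cumsum restarts within each group.
out = df.apply(lambda s: s.groupby(s.eq(0).cumsum()).cumsum())
print(out)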
You may also try the following naive but reliable approach.
For every column, create groups to count within. A new group starts whenever the value changes from one row to the next, and lasts while the value stays constant: (x != x.shift()).cumsum().
Example of the resulting group ids for the df above:
   a  b
0  1  1
1  2  1
2  3  1
3  4  2
4  4  3
5  5  3
Calculate cumulative sums within those groups per column using DataFrame.apply and groupby, and you get the cumsum with zero reset in one line:
import pandas as pd
df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns = ['a','b'])
cs = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumsum())
print(cs)
   a  b
0  0  1
1  1  2
2  0  3
3  1  0
4  2  1
5  0  2
A slightly hacky way is to identify the indices of the zeros and set the corresponding values to the negative of those indices before doing the cumsum. Note this only works when the running sum just before each zero equals that zero's positional index (true for column b here, whose single zero is preceded only by ones), so it does not generalize to arbitrary columns:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0,1],[1,1],[0,1],[1,0],[1,1],[0,1]], columns=['a','b'])
z = np.where(df['b'] == 0)
df.loc[z[0], 'b'] = -z[0]  # offset the zero so the cumsum resets there
df['b'] = np.cumsum(df['b'])
df
   a  b
0  0  1
1  1  2
2  0  3
3  1  0
4  1  1
5  0  2

Convert pandas dataframe to series

Is there a way to convert pandas dataframe to series with multiindex? The dataframe's columns could be multi-indexed too.
The approach below works, but only for a MultiIndex whose levels are named.
In [163]: d
Out[163]:
a  0     1
b  0  1  0  1
a  0  0  0  0
b  1  2  3  4
c  2  4  6  8
In [164]: d.stack(d.columns.names)
Out[164]:
   a  b
a  0  0    0
      1    0
   1  0    0
      1    0
b  0  0    1
      1    2
   1  0    3
      1    4
c  0  0    2
      1    4
   1  0    6
      1    8
dtype: int64
I think you can use nlevels to find the number of levels in the MultiIndex, then create a range to pass to stack:
print(d.columns.nlevels)
2
# in Python 3, wrap range in list
print(list(range(d.columns.nlevels)))
[0, 1]
print(d.stack(list(range(d.columns.nlevels))))
   a  b
a  0  0    0
      1    0
   1  0    0
      1    0
b  0  0    1
      1    2
   1  0    3
      1    4
c  0  0    2
      1    4
   1  0    6
      1    8
dtype: int64
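For completeness, here is a self-contained sketch that reconstructs the example frame above (the values are read off the printed output, so treat the construction as illustrative):
import pandas as pd

# Two-level columns named 'a' and 'b', as in the question.
cols = pd.MultiIndex.from_product([[0, 1], [0, 1]], names=['a', 'b'])
d = pd.DataFrame([[0, 0, 0, 0], [1, 2, 3, 4], [2, 4, 6, 8]],
                 index=['a', 'b', 'c'], columns=cols)

# Stacking every column level turns the frame into a Series whose
# MultiIndex combines the row index with both column levels.
s = d.stack(list(range(d.columns.nlevels)))
print(s)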
