Subset df by last valid item

I can return the index of the last valid item but I'm hoping to subset a df using the same method. For instance, the code below returns the last time 2 appears in the df. But I want to return the df using this index.
import pandas as pd
df = pd.DataFrame({
'Number' : [2,3,2,4,2,1],
'Code' : ['x','a','b','c','f','y'],
})
df_last = df[df['Number'] == 2].last_valid_index()
print(df_last)
4
Intended Output:
Number Code
0 2 x
1 3 a
2 2 b
3 4 c
4 2 f

You can use loc, but this solution works only if the column contains at least one value 2:
df = df.loc[:df[df['Number'] == 2].last_valid_index()]
print (df)
Number Code
0 2 x
1 3 a
2 2 b
3 4 c
4 2 f
A general solution, which also handles a column without any 2, is:
df = df[(df['Number'] == 2)[::-1].cumsum().ne(0)[::-1]]
print (df)
Number Code
0 2 x
1 3 a
2 2 b
3 4 c
4 2 f
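The difference between the two approaches shows up when the value never occurs. Below is a minimal sketch; the helper name `rows_up_to_last` is hypothetical and not part of the answer. It wraps the reversed-cumsum mask and returns an empty frame for a missing value, whereas with `.loc` a missing value makes `last_valid_index()` return `None`, so the slice keeps every row.

```python
import pandas as pd

def rows_up_to_last(df, column, value):
    """Keep rows from the start through the last row where df[column] == value."""
    hits = df[column] == value
    # The reversed cumulative sum is non-zero from the last hit back to the first row.
    mask = hits[::-1].cumsum().ne(0)[::-1]
    return df[mask]

df = pd.DataFrame({
    'Number': [2, 3, 2, 4, 2, 1],
    'Code': ['x', 'a', 'b', 'c', 'f', 'y'],
})

found = rows_up_to_last(df, 'Number', 2)    # rows with index 0 through 4
missing = rows_up_to_last(df, 'Number', 9)  # empty, because 9 never occurs
```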

Related

sort headers by specific cols - pandas

I'm trying to sort the column headers of the last 3 columns only. With the code below, sort_index works on the whole DataFrame but not when I select only the last 3 columns.
Note: I can't hard-code the sorting because I don't know the column headers beforehand.
import pandas as pd
df = pd.DataFrame({
'Z' : [1,1,1,1,1],
'B' : ['A','A','A','A','A'],
'C' : ['B','A','A','A','A'],
'A' : [5,6,6,5,5],
})
# sorts all cols
df = df.sort_index(axis = 1)
# aim to sort by last 3 cols
#df.iloc[:,1:3] = df.iloc[:,1:3].sort_index(axis=1)
Intended Out:
Z A B C
0 1 A B 5
1 1 A A 6
2 1 A A 6
3 1 A A 5
4 1 A A 5

Try with reindex:
out = df.reindex(columns=df.columns[[0]].tolist()+sorted(df.columns[1:].tolist()))
Out[66]:
Z A B C
0 1 5 A B
1 1 6 A A
2 1 6 A A
3 1 5 A A
4 1 5 A A
Method two, with insert:
newdf = df.iloc[:,1:].sort_index(axis=1)
newdf.insert(loc=0, column='Z', value=df.Z)
newdf
Out[74]:
Z A B C
0 1 5 A B
1 1 6 A A
2 1 6 A A
3 1 5 A A
4 1 5 A A
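Both methods above hard-code the first column. A minimal sketch of the same idea with the number of fixed leading columns as a parameter follows; the name `n_fixed` is an assumption introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Z': [1, 1, 1, 1, 1],
    'B': ['A', 'A', 'A', 'A', 'A'],
    'C': ['B', 'A', 'A', 'A', 'A'],
    'A': [5, 6, 6, 5, 5],
})

n_fixed = 1  # assumption: number of leading columns to leave in place
cols = list(df.columns)
# Keep the first n_fixed headers as they are and sort the remaining ones.
new_order = cols[:n_fixed] + sorted(cols[n_fixed:])
out = df[new_order]
```

Selecting with a reordered list of labels moves each column together with its data, so only the header order changes.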

Returning dataframe of multiple rows/columns per one row of input

I am using apply to leverage one dataframe to manipulate a second dataframe and return results. Here is a simplified example that I realize could be more easily answered with "in" logic, but for now let's keep the use of .apply() as a constraint:
import pandas as pd
df1 = pd.DataFrame({'Name':['A','B'],'Value':range(1,3)})
df2 = pd.DataFrame({'Name':['A']*3+['B']*4+['C'],'Value':range(1,9)})
def filter_df(x, df):
    return df[df['Name']==x['Name']]
df1.apply(filter_df, axis=1, args=(df2, ))
Which is returning:
0 Name Value
0 A 1
1 A 2
2 ...
1 Name Value
3 B 4
4 B 5
5 ...
dtype: object
What I would like to see instead is one formatted DataFrame with Name and Value headers. All advice appreciated!
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7

In my opinion, this cannot be done with apply alone; you also need pandas.concat to combine the per-row results:
result = pd.concat(df1.apply(filter_df, axis=1, args=(df2,)).to_list())
print(result)
Output
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
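For comparison, here is a sketch of the "in" logic the question mentions, for cases where the `.apply()` constraint does not hold. It assumes the names in df1 are unique and that the row order of df2 should be kept, which is true for this example but may not be in general.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['A', 'B'], 'Value': range(1, 3)})
df2 = pd.DataFrame({'Name': ['A'] * 3 + ['B'] * 4 + ['C'], 'Value': range(1, 9)})

# Keep the rows of df2 whose Name appears anywhere in df1.
result = df2[df2['Name'].isin(df1['Name'])]
```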

How can I operate with the output of a DataFrame?

I have a DataFrame and I'm grouping by some keys and counting the results. The problem is that I want to replace one of the grouping columns with a ratio between the counts.
df.groupby(['A','B', 'C'])['C'].count().apply(f).reset_index()
I'm looking for an f that replaces the column C by the value of #timesC==1 / #timesC==0 for each value of A and B.

Is this what you want?
import pandas as pd
import numpy as np
df = pd.DataFrame(
{'A':[1,2,3,1,2,3],
'B':[2,0,1,2,0,1],
'C':[1,1,0,1,1,1]
})
print(df)
def f(x):
    if np.count_nonzero(x==0)==0:
        return np.nan
    else:
        return np.count_nonzero(x==1)/np.count_nonzero(x==0)
result = df.groupby(['A','B'])['C'].apply(f).reset_index()
print(result)
Result:
#df
A B C
0 1 2 1
1 2 0 1
2 3 1 0
3 1 2 1
4 2 0 1
5 3 1 1
#result
A B C
0 1 2 NaN
1 2 0 NaN
2 3 1 1.0
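The same ratio can also be computed without a Python-level function per group. The following is a sketch rather than the answer's method; the helper column names `ones` and `zeros` are assumptions introduced here.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 1, 2, 3],
    'B': [2, 0, 1, 2, 0, 1],
    'C': [1, 1, 0, 1, 1, 1],
})

# Count the ones and zeros of C within each (A, B) group.
flags = df.assign(ones=df['C'].eq(1), zeros=df['C'].eq(0))
grouped = flags.groupby(['A', 'B'])[['ones', 'zeros']].sum()

# Groups without zeros get NaN instead of a division by zero.
ratio = grouped['ones'] / grouped['zeros'].where(grouped['zeros'] > 0)
result = ratio.rename('C').reset_index()
```

Summing boolean columns counts the True values, so the division works on per-group counts.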

Duplicate row of low occurrence in pandas dataframe

In the following dataset, what's the best way to duplicate rows so that every Type with a groupby(['Type']) count below 3 ends up with 3 rows? df is the input and df1 is my desired outcome: row 3 from df is duplicated 2 times at the end. This is only an example; the real data has approximately 20 million rows and 400K unique Types, so a method that does this efficiently is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Thought about using something like the following but do not know the best way to write the func.
df.groupby('Type').apply(func)
Thank you in advance.

Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)[['Type','Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note: the sort=False argument of append is available in pandas>=0.23.0; remove it if using a lower version. DataFrame.append itself was removed in pandas 2.0, where pd.concat takes its place.
EDIT: If the data contains multiple value columns, set all columns except one as the index, repeat, and then reset_index:
df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)
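On recent pandas versions the same idea can be written with `pd.concat` and `Index.repeat`. This is a sketch under the same assumptions as the answer, not a drop-in replacement verified on the 20-million-row data; the names `repeat_num`, `extra`, and `out` are introduced here.

```python
import pandas as pd

df = pd.DataFrame({
    'Type': ['a', 'a', 'a', 'b', 'c', 'c', 'c'],
    'Val': [1, 2, 3, 1, 3, 2, 1],
})

counts = df['Type'].value_counts()
# Extra copies needed per Type to reach 3 rows; Types with 3 or more rows are absent.
repeat_map = 3 - counts[counts < 3]
repeat_num = df['Type'].map(repeat_map).fillna(0).astype(int)

# Repeat each row label repeat_num times, then append the copies at the end.
extra = df.loc[df.index.repeat(repeat_num.to_numpy())]
out = pd.concat([df, extra], ignore_index=True)
```

Because whole rows are repeated, this works unchanged with several value columns. Like the original, it repeats every row of a short group, so a Type that starts with two rows would end with four rather than three.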

Select rows of pandas dataframe from list, in order of list

The question was originally asked here as a comment but could not get a proper answer as the question was marked as a duplicate.
For a given pandas.DataFrame, let us say
df = pd.DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3, 5]})
df
A B
0 5 1
1 6 2
2 3 3
3 4 5
How can we select rows from a list, based on values in a column ('A' for instance)
For instance
# from
list_of_values = [3,4,6]
# we would like, as a result
# A B
# 2 3 3
# 3 4 5
# 1 6 2
Using isin as mentioned here is not satisfactory as it does not keep order from the input list of 'A' values.
How can the abovementioned goal be achieved?

One way to overcome this is to make the 'A' column the index and use loc on the newly generated pandas.DataFrame. Afterwards, the index of the resulting dataframe can be reset.
Here is how:
ret = df.set_index('A').loc[list_of_values].reset_index(inplace=False)
# ret is
# A B
# 0 3 3
# 1 4 5
# 2 6 2
Note that the drawback of this method is that the original indexing has been lost in the process.
More on pandas indexing: What is the point of indexing in pandas?

Use merge with helper DataFrame created by list and with column name of matched column:
df = pd.DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3,5]})
list_of_values = [3,6,4]
df1 = pd.DataFrame({'A':list_of_values}).merge(df)
print (df1)
A B
0 3 3
1 6 2
2 4 5
For more general solution:
df = pd.DataFrame({'A' : [5,6,5,3,4,4,6,5], 'B':range(8)})
print (df)
A B
0 5 0
1 6 1
2 5 2
3 3 3
4 4 4
5 4 5
6 6 6
7 5 7
list_of_values = [6,4,3,7,7,4]
#create df from list
list_df = pd.DataFrame({'A':list_of_values})
print (list_df)
A
0 6
1 4
2 3
3 7
4 7
5 4
#column for original index values
df1 = df.reset_index()
#helper column for count duplicates values
df1['g'] = df1.groupby('A').cumcount()
list_df['g'] = list_df.groupby('A').cumcount()
#merge together, create index from column and remove g column
df = list_df.merge(df1).set_index('index').rename_axis(None).drop('g', axis=1)
print (df)
A B
1 6 1
4 4 4
3 3 3
5 4 5

1] Generic approach for list_of_values.
In [936]: dff = df[df.A.isin(list_of_values)]
In [937]: dff.reindex(dff.A.map({x: i for i, x in enumerate(list_of_values)}).sort_values().index)
Out[937]:
A B
2 3 3
3 4 5
1 6 2
2] If list_of_values is sorted, you can use
In [926]: df[df.A.isin(list_of_values)].sort_values(by='A')
Out[926]:
A B
2 3 3
3 4 5
1 6 2
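A variant of the generic approach that keeps the original index is to sort with a key function. This sketch assumes pandas 1.1 or later (where `sort_values` accepts `key`) and unique entries in list_of_values; the dictionary name `rank` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
list_of_values = [3, 6, 4]

# The position of each value in the list defines the sort order.
rank = {value: pos for pos, value in enumerate(list_of_values)}
out = df[df['A'].isin(list_of_values)].sort_values('A', key=lambda s: s.map(rank))
```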
