Description
Long story short, I need a way to sort a DataFrame by a specific column, given a specific function, analogous to the key parameter of Python's built-in sorted() function. Yet there is no such key parameter in the pd.DataFrame.sort_values() method.
The approach used for now
I have to create a new column to store the "scores" of each row and delete it at the end. The problem with this approach is the necessity to generate a column name that does not already exist in the DataFrame, and it could become even more troublesome when sorting by multiple columns.
I wonder if there is a more suitable way for this purpose, one that does not require coming up with a new column name, just like calling sorted() with its key parameter.
Update: I changed my implementation to use a new object() instead of generating a string not present in the columns to avoid collisions, as shown in the code below.
Code
Here is the example code. In this sample the DataFrame needs to be sorted according to the length of the data in the column "snippet". Please don't make additional assumptions about the type of the objects in each row of that column. The only things given are the column itself and a function object/lambda expression (in this example: len) that takes each object in the column as input and produces a value used for comparison.
def sort_table_by_key(self, ascending=True, key=len):
    """
    Sort the table in place.
    """
    # column_tmp = "".join(self._table.columns)
    column_tmp = object()  # Create a new object to avoid column name collisions.
    # Calculate the scores of the objects.
    self._table[column_tmp] = self._table["snippet"].apply(key)
    self._table.sort_values(by=column_tmp, ascending=ascending, inplace=True)
    del self._table[column_tmp]
Right now this is not implemented; check GitHub issue 3942.
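(Update: this has since been implemented; pandas 1.1.0 added a key parameter to sort_values, where the key callable receives each column as a Series. On a new enough pandas the question's method body collapses to one line; on older versions the workaround below still applies.)
# pandas >= 1.1.0: the key receives the whole column as a Series, so wrap len:
self._table.sort_values(by="snippet", key=lambda s: s.map(len), inplace=True)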
I think you need argsort and then select by iloc:
df = pd.DataFrame({
'A': ['assdsd','sda','affd','asddsd','ffb','sdb','db','cf','d'],
'B': list(range(9))
})
print (df)
A B
0 assdsd 0
1 sda 1
2 affd 2
3 asddsd 3
4 ffb 4
5 sdb 5
6 db 6
7 cf 7
8 d 8
def sort_table_by_length(column, ascending=True):
    if ascending:
        return df.iloc[df[column].str.len().argsort()]
    else:
        return df.iloc[df[column].str.len().argsort()[::-1]]
print (sort_table_by_length('A'))
A B
8 d 8
6 db 6
7 cf 7
1 sda 1
4 ffb 4
5 sdb 5
2 affd 2
0 assdsd 0
3 asddsd 3
print (sort_table_by_length('A', False))
A B
3 asddsd 3
0 assdsd 0
2 affd 2
5 sdb 5
4 ffb 4
1 sda 1
7 cf 7
6 db 6
8 d 8
How it works:
First, get the lengths as a new Series:
print (df['A'].str.len())
0 6
1 3
2 4
3 6
4 3
5 3
6 2
7 2
8 1
Name: A, dtype: int64
Then get the indices of the sorted values with argsort; for descending ordering the result is reversed with [::-1]:
print (df['A'].str.len().argsort())
0 8
1 6
2 7
3 1
4 4
5 5
6 2
7 0
8 3
Name: A, dtype: int64
Finally, change the ordering with iloc:
print (df.iloc[df['A'].str.len().argsort()])
A B
8 d 8
6 db 6
7 cf 7
1 sda 1
4 ffb 4
5 sdb 5
2 affd 2
0 assdsd 0
3 asddsd 3
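The same idea generalizes to an arbitrary key function, matching the original requirement of not assuming anything about the objects in the column. A sketch (sort_df_by_key is a made-up name; it scores elements with apply instead of .str.len):
def sort_df_by_key(df, column, key, ascending=True):
    # Score each element with the key, then reorder the rows by the scores.
    order = df[column].apply(key).argsort()
    if not ascending:
        order = order[::-1]
    return df.iloc[order]

print(sort_df_by_key(df, 'A', len))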
Related
I just started learning pandas and I was trying to figure out the easiest possible solution to the problem mentioned below.
Suppose I have a dataframe like this:
A B
6 7
8 9
5 6
7 8
Here, I'm selecting the cell with the minimum value from column 'A' as the starting point and updating the sequence in the new column 'C'. After sequencing, the dataframe must look like this:
A B C
5 6 0
6 7 1
7 8 2
8 9 3
Is there any easy way to pick a cell from column 'A', match it with the matching cell in column 'B', and update the sequence accordingly in column 'C'?
Some extra conditions:
If 5 is present in column 'B' then I need to add another row like this:
A B C
0 5 0
5 6 1
......
Try sort_values:
import numpy as np

df.sort_values('A').assign(C=np.arange(len(df)))
Output:
A B C
2 5 6 0
0 6 7 1
3 7 8 2
1 8 9 3
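If you also want the fresh 0-based index shown in the expected output, a minor variant of the same line:
df.sort_values('A').reset_index(drop=True).assign(C=np.arange(len(df)))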
I'm not sure what you mean by the extra conditions, though.
I want to find the values in one column such that no value in another column is greater than 3.
For example, in the following dataframe
df = pd.DataFrame({'a':[1,2,3,1,2,3,1,2,3], 'b':[4,5,6,4,5,6,4,5,6], 'c':[4,3,5,4,3,5,4,3,3]})
I want the values of the column 'a' for which all the values of 'c' are greater than 3.
I think groupby is the correct way to do it. My code below comes close to it.
df.groupby('a')['c'].max()>3
a
1     True
2    False
3     True
Name: c, dtype: bool
The above code gives me a boolean Series. How can I get the values of 'a' for which it is True?
I want my output to be [1, 3].
Is there a better and more efficient way to get this on a very large data frame (with more than 30 million rows)?
From your code I see that you actually want to output:
group keys for each group (df grouped by a),
where no value in the c column (within the current group) is greater than 3.
In order to get some non-empty result, let's change the source DataFrame to:
a b c
0 1 4 4
1 2 5 1
2 3 6 5
3 1 4 4
4 2 5 2
5 3 6 5
6 1 4 4
7 2 5 2
8 3 6 3
For readability, let's group df by a and print each group.
The code to do it:
for key, grp in df.groupby('a'):
    print(f'\nGroup: {key}\n{grp}')
gives this result:
Group: 1
a b c
0 1 4 4
3 1 4 4
6 1 4 4
Group: 2
a b c
1 2 5 1
4 2 5 2
7 2 5 2
Group: 3
a b c
2 3 6 5
5 3 6 5
8 3 6 3
Now take a look at each group. Only group 2 meets the condition that each element in the c column is less than 3.
So you actually need groupby and filter, passing through only the groups that meet the above condition. To get the full rows from the "good" groups, you can run:
df.groupby('a').filter(lambda grp: grp.c.lt(3).all())
getting:
a b c
1 2 5 1
4 2 5 2
7 2 5 2
But you want only the values from the a column, without repetitions. So extend the above code to:
df.groupby('a').filter(lambda grp: grp.c.lt(3).all()).a.unique().tolist()
getting:
[2]
Note that your code df.groupby('a')['c'].max() > 3 is wrong, as it marks with True the groups whose max is greater than 3 (instead of ">" there should be "<").
So an alternative solution is:
res = df.groupby('a')['c'].max()<3
res[res].index.tolist()
giving the same result.
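Since the question mentions a frame with more than 30 million rows, a fully vectorized variant using transform may also be worth trying (a sketch, not benchmarked):
# Broadcast each group's max back onto the rows, then mask and deduplicate.
mask = df.groupby('a')['c'].transform('max') < 3
print(df.loc[mask, 'a'].unique().tolist())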
Yet another solution can be based on a list comprehension:
[ key for key, grp in df.groupby('a') if grp.c.lt(3).all() ]
Details:
for key, grp in df.groupby('a') - creates the groups,
if grp.c.lt(3).all() - filters the groups,
key (at the start) - adds the particular group key to the result.
import pandas as pd

# Create DataFrame
df = pd.DataFrame({'a':[1,2,3,1,2,3,1,2,3], 'b':[4,5,6,4,5,6,4,5,6], 'c':[4,3,5,4,3,5,4,3,3]})

# Write a function that returns the first value greater than 3, if found.
def grt(x):
    for i in x:
        if i > 3:
            return i

# Group by column a and aggregate column c with grt
p = {'c': grt}
grp = df.groupby(['a']).agg(p)
print(grp)
I am using apply to leverage one dataframe to manipulate a second dataframe and return results. Here is a simplified example that I realize could be more easily handled with "in" logic, but for now let's keep the use of .apply() as a constraint:
import pandas as pd
df1 = pd.DataFrame({'Name':['A','B'],'Value':range(1,3)})
df2 = pd.DataFrame({'Name':['A']*3+['B']*4+['C'],'Value':range(1,9)})
def filter_df(x, df):
    return df[df['Name'] == x['Name']]

df1.apply(filter_df, axis=1, args=(df2,))
This returns:
0 Name Value
0 A 1
1 A 2
2 ...
1 Name Value
3 B 4
4 B 5
5 ...
dtype: object
What I would like to see instead is one formatted DataFrame with Name and Value headers, as shown below. All advice appreciated!
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
In my opinion, this cannot be done solely with apply; you need pandas.concat:
result = pd.concat(df1.apply(filter_df, axis=1, args=(df2,)).to_list())
print(result)
Output
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
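For completeness, dropping the apply constraint gives the "in" logic the question alludes to, as a one-liner (a sketch):
# Keep only the rows of df2 whose Name occurs in df1.
result = df2[df2['Name'].isin(df1['Name'])]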
I am creating a dataframe like this:
np.random.seed(2)
df=pd.DataFrame(np.random.randint(1,6,(6,6)))
Output:
0 1 1 4 3 4 1
1 3 2 4 3 5 5
2 5 4 5 3 4 4
3 3 2 3 5 4 1
4 5 4 2 3 1 5
5 5 3 5 3 2 1
Splitting the dataframe into 3x3 matrices like below, there will be 16 matrices:
dfs = []
for col in range(df.shape[1]-2):
    for row in range(df.shape[0]-2):
        dfs.append(df.iloc[row:row+3, col:col+3])
Let's print:
dfs[0]
1 1 4
3 2 4
5 4 5
dfs[1]
3 2 4
5 4 5
3 2 3
.
.
.
dfs[15]
5 4 1
3 1 5
3 2 1
Writing a function to change the values at locations [1,0] and [1,2] of each matrix to zero, so that my output will look like:
dfs[0]
1 1 4
0 2 0
5 4 5
def process(x):
    new = []
    for d in x:
        d.iloc[1, 0] = 0
        d.iloc[1, 2] = 0
        new.append(d)
        print(d)
    return new

dfs = process(dfs.copy())
My expected output is
dfs[0]
1 1 4
0 2 0
5 4 5
but what my function returns is,
dfs[0]
1 1 4
0 0 0
0 0 0
dfs[1]
0 0 0
0 0 0
0 0 0
It produces more zeros in all the matrices. I don't know why it works unexpectedly or what I am doing wrong with my process function; please help. Thanks.
Long story short, you are a victim of chained indexing, which can lead to bad things happening.
When you slice the original DataFrame, you get overlapping views.
Modifying one changes the others too, since the second row of one chunk is the first row of another, and the third row of the first chunk is the first row of yet another, and so on...which is why you see non-zero values only at the "edges", since those are unique to a single chunk.
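You can see the overlap directly. A quick check (assuming an older pandas without copy-on-write; with copy-on-write enabled the write would not propagate):
dfs[0].iloc[1, 0] = 99    # row 1 of chunk 0 is row 0 of chunk 1
print(dfs[1].iloc[0, 0])  # 99 as well (this may emit a SettingWithCopyWarning)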
You can make copies of each slice, like this:
def process(x):
    new = []
    for d in x:
        d = d.copy()  # each one is now a copy
        d.iloc[1, 0] = 0
        d.iloc[1, 2] = 0
        new.append(d)
    return new
Lastly, note that dfs = process(dfs) is actually fine; you don't need to make a copy of the enclosing list.
Change your code and the process function call as below to get your required output. I used copy inside the for loop to make each subset dataframe independent of future changes; in your case the changes hit the original df, which is why all the other dataframes in the dfs list showed the extra zeros:
for col in range(df.shape[1]-2):
    for row in range(df.shape[0]-2):
        dfs.append(df.iloc[row:row+3, col:col+3].copy())

dfs = process(dfs)
I have a pandas column that contains a lot of strings that appear fewer than 5 times. I do not want to remove these values; however, I do want to replace them with a placeholder string called "pruned". What is the best way to do this?
df = pd.DataFrame(['a','a','b','c'], columns=["x"])
# Get the value counts and set "pruned"; I want something that works like this:
df[df[count < 2]] = "pruned"
I suspect there is a more efficient way to do this, but a simple way is to build a dict of counts and then prune the values that fall below the count threshold. Consider the example df:
df= pd.DataFrame([12,11,4,15,6,12,4,7],columns=['foo'])
foo
0 12
1 11
2 4
3 15
4 6
5 12
6 4
7 7
# make a dict with counts
count_dict = {d:(df['foo']==d).sum() for d in df.foo.unique()}
# assign that dict to a column
df['bar'] = [count_dict[d] for d in df.foo]
# loc in the 'pruned' tag
df.loc[df.bar < 2, 'foo']='pruned'
Returns as desired:
foo bar
0 12 2
1 pruned 1
2 4 2
3 pruned 1
4 pruned 1
5 12 2
6 4 2
7 pruned 1
(and of course you would change 2 to 5 and dump that bar column if you want).
UPDATE
Per a request for an in-place version, here is a one-liner that can do it without assigning another column or creating that dict directly (and thanks @TrigonaMinima for the value_counts() tip):
df= pd.DataFrame([12,11,4,15,6,12,4,7],columns=['foo'])
print(df)
df.foo = df.foo.apply(lambda row: 'pruned' if (df.foo.value_counts() < 2)[row] else row)
print(df)
which returns again as desired:
foo
0 12
1 11
2 4
3 15
4 6
5 12
6 4
7 7
foo
0 12
1 pruned
2 4
3 pruned
4 pruned
5 12
6 4
7 pruned
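One caveat: that lambda recomputes value_counts for every single row, which gets expensive on big frames. Computing the counts once keeps it nearly as short (starting again from the fresh df):
counts = df.foo.value_counts()  # computed once instead of per row
df.foo = df.foo.apply(lambda v: 'pruned' if counts[v] < 2 else v)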
This is the solution I ended up using based on the answer above.
import pandas as pd
df= pd.DataFrame([12,11,4,15,6,12,4,7],columns=['foo'])
# make a dict with counts
count_dict = dict(df.foo.value_counts())
# assign that dict to a column
df['temp_count'] = [count_dict[d] for d in df.foo]
# loc in the 'pruned' tag
df.loc[df.temp_count < 2, 'foo']='pruned'
df = df.drop(["temp_count"], axis=1)
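The same thing can also be done without the temporary column, by mapping the counts straight into a boolean mask (a small variant of the above):
counts = df.foo.value_counts()
df.loc[df.foo.map(counts) < 2, 'foo'] = 'pruned'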