Getting lowest valued duplicated columns only - python

I have a dataframe with 2 columns: value and product. There will be duplicated products, but with different values. What I want to do is to get all products, but remove any duplication. The condition to remove duplication will be to get the row with the lowest value and drop the rest. For example, I want something like this:
Before:
product  value
A        25
B        45
C        15
C        14
C        13
B        22
After:
product  value
A        25
B        22
C        13
How can I make it so that only the lowest valued duplicated rows get added to the new dataframe?

df.sort_values('value').groupby('product').first()
#          value
# product
# A           25
# B           22
# C           13
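Note that groupby('product') moves product into the index, which is why it prints below value. If you'd rather keep product as an ordinary column, as_index=False (or a trailing reset_index()) does it:
df.sort_values('value').groupby('product', as_index=False).first()
#   product  value
# 0       A     25
# 1       B     22
# 2       C     13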

You can sort_values and then drop_duplicates:
res = df.sort_values('value').drop_duplicates('product')
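For context, a self-contained run on the example data (the frame construction here is an assumption based on the question's table). drop_duplicates keeps the first occurrence by default, so sorting ascending first guarantees the kept row is the cheapest one:
import pandas as pd

df = pd.DataFrame({"product": list("ABCCCB"),
                   "value": [25, 45, 15, 14, 13, 22]})
# sort ascending so the lowest value comes first within each product,
# then keep only that first occurrence of each product
res = df.sort_values('value').drop_duplicates('product')
print(res.sort_values('product'))
#   product  value
# 0       A     25
# 5       B     22
# 4       C     13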

Looking at the requirement, you don't actually need drop_duplicates or sort_values at all, since we are simply after the minimum value per product. There are a couple of ways to do it, as follows.
I believe one of the shortest is to look up the row index of each group's minimum using idxmin:
>>> df
product value
0 A 25
1 B 45
2 C 15
3 C 14
4 C 13
5 B 22
>>> df.loc[df.groupby('product')['value'].idxmin()]
product value
0 A 25
5 B 22
4 C 13
Alternatively, another short and elegant way is to compute the minimum of the group values with groupby.min():
>>> df
product value
0 A 25
1 B 45
2 C 15
3 C 14
4 C 13
5 B 22
>>> df.groupby('product').min()
value
product
A 25
B 22
C 13
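Both approaches give the same frame here, but they differ once more columns are involved: df.loc[df.groupby('product')['value'].idxmin()] returns the original rows intact, so any extra columns come along unchanged, whereas groupby('product').min() takes the minimum of every remaining column independently within each group, which can mix values from different rows.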

Related

Sort CSV file by count on column values

May I know how I can sort a csv file by a certain column, not by the values in that column, but by their frequency, so that the rows whose column value occurs most often appear first (or last)?
Is it possible to do this using the csv package or pandas? If I can see both, that would be great.
I hope I have described the problem in an understandable manner.
With pandas you can combine the key parameter of sort_values() with a lambda that computes each value's frequency:
import numpy as np
import pandas as pd

# 20 random letters, heavily skewed towards 'a' and 'b'
df = pd.DataFrame({"col": np.random.choice(list("abcd"), 20, p=(.46, .46, .04, .04))})
# sort rows by how often their value occurs (least frequent first)
df.sort_values("col", key=lambda s: s.groupby(s).transform("size"))
output
   col
0    c
2    d
1    a
16   a
5    a
15   a
8    a
13   a
11   a
17   b
14   b
12   b
9    b
18   b
7    b
6    b
4    b
3    b
10   b
19   b
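Note that this sorts the least frequent values first; pass ascending=False to sort_values to put the most frequent first. For the csv side of the question, a stdlib-only version is also possible. A minimal sketch, assuming an input.csv with a header row and a column named col (both names are placeholders):
import csv
from collections import Counter

with open("input.csv", newline="") as f:
    reader = csv.DictReader(f)
    rows = list(reader)

# count how many rows share each value of 'col', then sort by that count
counts = Counter(row["col"] for row in rows)
rows.sort(key=lambda row: counts[row["col"]], reverse=True)  # most frequent first

with open("sorted.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(rows)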

How to find the standard deviation of a pandas dataframe column containing a list in every row?

I have a pandas dataframe dd1:
 A   B   C   D   E   F  Result
10  18  13  11   9  25  []
 6  32  27   3  18  28  [6,32]
 4   6   3  29   2  23  [29,35,87]
Now I want to compute the standard deviation of the Result column by pairing the C column value with the 1st value of the Result list, then the C value with the 2nd value, and so on. Then I want to add those standard deviations together and store the total in another column.
I want to pass the values to the std function like this:
For the 1st row: nothing to do, because the list is empty.
For the 2nd row: std([6,27]) = 14.84, std([32,27]) = 3.53;
after finding the stds, add them and store the sum in the output column: 14.84 + 3.53 = 18.37.
For the 3rd row: std([29,3]) = 18.38, std([35,3]) = 22.62, std([87,3]) = 59.39.
The desired output is:
 A   B   C   D   E   F  Result      output
10  18  13  11   9  25  []          []
 6  32  27   3  18  28  [6,32]      18.37
 4   6   3  29   2  23  [29,35,87]  100.39
Try using a lambda with apply:
import numpy as np

# for each row, compute std([C, x], ddof=1) for every x in Result, then sum
l = lambda x: sum([np.std([x['C'], i], ddof=1) for i in x['Result']])
dd1['output'] = dd1.apply(l, axis=1)
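A quick end-to-end check (the frame construction below is an assumption based on the example; note that for the empty list the sum comes out as 0 rather than []):
import numpy as np
import pandas as pd

dd1 = pd.DataFrame({'A': [10, 6, 4], 'B': [18, 32, 6], 'C': [13, 27, 3],
                    'D': [11, 3, 29], 'E': [9, 18, 2], 'F': [25, 28, 23],
                    'Result': [[], [6, 32], [29, 35, 87]]})
l = lambda x: sum([np.std([x['C'], i], ddof=1) for i in x['Result']])
dd1['output'] = dd1.apply(l, axis=1)
# dd1['output'] is now approximately [0.0, 18.38, 100.41]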

Pandas: group by two columns, sum up the first value in the first column group

In Python, I have a pandas data frame df.
ID Ref Dist
A 0 10
A 0 10
A 1 20
A 1 20
A 2 30
A 2 30
A 3 5
A 3 5
B 0 8
B 0 8
B 1 40
B 1 40
B 2 7
B 2 7
I want to group by ID and Ref, and take the first row of the Dist column in each group.
ID Ref Dist
A 0 10
A 1 20
A 2 30
A 3 5
B 0 8
B 1 40
B 2 7
And I want to sum up the Dist column in each ID group.
ID Sum
A 65
B 55
I tried this for the first step, but it gives me just the row index and Dist, so I cannot move on to the second step.
df.groupby(['ID', 'Ref'])['Dist'].head(1)
It'd be wonderful if somebody could help me with this. Thank you!
I believe this is what you're looking for.
For the first step, use first() since you want the first row of each group. Once you've done that, use reset_index() so you can group again by ID and sum it up.
df.groupby(['ID','Ref'])['Dist'].first()\
.reset_index().groupby(['ID'])['Dist'].sum()
ID
A 65
B 55
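As a side note, the intermediate reset_index() can be skipped by grouping directly on the ID level of the resulting MultiIndex:
df.groupby(['ID', 'Ref'])['Dist'].first().groupby(level='ID').sum()
# ID
# A    65
# B    55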
Just drop_duplicates before the groupby. The default behavior is to keep the first duplicate row, which is what you want.
df.drop_duplicates(['ID', 'Ref']).groupby('ID').Dist.sum()
#A 65
#B 55
#Name: Dist, dtype: int64
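If you'd rather end up with a two-column frame matching the ID / Sum layout in the question instead of a Series, a reset_index with a name does it:
df.drop_duplicates(['ID', 'Ref']).groupby('ID').Dist.sum().reset_index(name='Sum')
#   ID  Sum
# 0  A   65
# 1  B   55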

Using apply to quickly fill in a column of a pandas dataframe using a specific function?

I have the following pandas dataframe as an input:
df= pd.DataFrame({"C": [2,4,7,17,39], "D": [0,0,0,0,0]})
Output:
C D
0 2 0
1 4 0
2 7 0
3 17 0
4 39 0
I want to apply a function to column D such that it takes the current C value and subtracts the previous C value and adds this to the previous D value. The first element in D is necessarily 0.
Ex. For the last row (index 4), the column D value will be 39 - 17 + 15 = 37.
The desired output would be as shown below:
C D
0 2 0
1 4 2
2 7 5
3 17 15
4 39 37
I can get the desired result using a for loop that goes through every row and performs the calculation, but my actual dataframe is several thousand lines and the calculation pulls on several columns. Is there a more efficient, simpler routine I could employ, using apply or shift or something similar, rather than a for loop?
You can do a cumsum on the difference (current - previous) of column C:
df['D'] = df['C'].diff().fillna(0).cumsum()
df
# C D
#0 2 0.0
#1 4 2.0
#2 7 5.0
#3 17 15.0
#4 39 37.0
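Why this works: the recurrence D[i] = D[i-1] + (C[i] - C[i-1]) telescopes to D[i] = C[i] - C[0], so an equivalent one-liner subtracts the first C value directly, and it also keeps the integer dtype:
df['D'] = df['C'] - df['C'].iloc[0]
df
#     C   D
# 0   2   0
# 1   4   2
# 2   7   5
# 3  17  15
# 4  39  37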

Pandas - Create new DataFrame column using dot product of elements in each row

I'm trying to take an existing DataFrame and append a new column.
Let's say I have this DataFrame (just some random numbers):
a b c d e
0 2.847674 0.890958 -1.785646 -0.648289 1.178657
1 -0.865278 0.696976 1.522485 -0.248514 1.004034
2 -2.229555 -0.037372 -1.380972 -0.880361 -0.532428
3 -0.057895 -2.193053 -0.691445 -0.588935 -0.883624
And I want to create a new column 'f' that multiplies each row by a 'costs' vector, for instance [1,0,0,0,0]. So, for row zero, the output in column f should be 2.847674.
Here's the function I currently use:
def addEstimate(df, costs):
    row_iterator = df.iterrows()
    for i, row in row_iterator:
        df.ix[i, 'f'] = np.dot(costs, df.ix[i])
I'm doing this with a 15-element vector, over ~20k rows, and I'm finding that this is super-duper slow (half an hour). I suspect that using iterrows and ix is inefficient, but I'm not sure how to correct this.
Is there a way that I can apply this to the entire DataFrame at once, rather than looping through rows? Or do you have other suggestions to speed this up?
You can create the new column with df['f'] = df.dot(costs).
dot is already a DataFrame method: applying it to the DataFrame as a whole will be much quicker than looping over the DataFrame and applying np.dot to individual rows.
For example:
>>> df # an example DataFrame
a b c d e
0 0 1 2 3 4
1 12 13 14 15 16
2 24 25 26 27 28
3 36 37 38 39 40
>>> costs = [1, 0, 0, 0, 2]
>>> df['f'] = df.dot(costs)
>>> df
a b c d e f
0 0 1 2 3 4 8
1 12 13 14 15 16 44
2 24 25 26 27 28 80
3 36 37 38 39 40 116
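One caveat: with a plain list, dot matches costs to the columns positionally, so its length must equal the number of columns. If the frame may carry extra columns (such as f itself after the first run), passing a Series with an explicit index makes the alignment by column name explicit; a small sketch:
costs = pd.Series([1, 0, 0, 0, 2], index=list('abcde'))
df['f'] = df[costs.index].dot(costs)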
Pandas has a dot function as well. Does
df['dotproduct'] = df.dot(costs)
do what you are looking for?
