Handling duplicate rows in python

I have a data frame df, let's say with 5 columns: a, b, c, d, e.
a b c d e
1 6 x 8 3
2 3 y 2 3
3 5 d 1 1
3 4 g 3 4
5 3 z 3 1
This is what I want to do: for all rows that share the same value in column a, I want to drop the duplicates, but the values of column b should be summed across those rows, and for the rest of the columns I want to keep the first value.
The final data frame will be:
a b c d e
1 6 x 8 3
2 3 y 2 3
3 9 d 1 1
5 3 z 3 1
How to do this?

I'd assign to column 'b' the result of grouping on 'a' and summing; you can then drop the duplicates:
In [171]:
df['b'] = df.groupby('a')['b'].transform('sum')
df
Out[171]:
a b c d e
0 1 6 x 8 3
1 2 3 y 2 3
2 3 9 d 1 1
3 3 9 g 3 4
4 5 3 z 3 1
In [172]:
df.drop_duplicates('a')
Out[172]:
a b c d e
0 1 6 x 8 3
1 2 3 y 2 3
2 3 9 d 1 1
4 5 3 z 3 1
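As an alternative, the same result can probably be reached in one step with groupby and agg, giving each column its own aggregation. This is only a sketch: the frame is rebuilt from the question's data, and the `rules` and `result` names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 3, 5],
    "b": [6, 3, 5, 4, 3],
    "c": ["x", "y", "d", "g", "z"],
    "d": [8, 2, 1, 3, 3],
    "e": [3, 3, 1, 4, 1],
})

# Keep the first value of every non-key column, except 'b', which is summed.
rules = {col: "first" for col in df.columns if col != "a"}
rules["b"] = "sum"

# as_index=False keeps 'a' as a regular column instead of the index.
result = df.groupby("a", as_index=False).agg(rules)
print(result)
```

Unlike the transform approach, this does not modify df in place, and the result gets a fresh 0..n-1 index.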

Related

I want to groupby and drop groups if the shape is 3 and none of the values in a column is zero

I want to groupby and drop groups if it satisfies two conditions (the shape is 3 and column A doesn't contain zeros).
My df
ID value
A 3
A 2
A 0
B 1
B 1
C 3
C 3
C 4
D 0
D 5
D 5
E 6
E 7
E 7
F 3
F 2
my desired df would be
ID value
A 3
A 2
A 0
B 1
B 1
D 0
D 5
D 5
F 3
F 2
You can use boolean indexing with groupby operations:
g = df['value'].eq(0).groupby(df['ID'])
# group contains a 0
m1 = g.transform('any')
# group doesn't have size 3
m2 = g.transform('size').ne(3)
# keep a group if either of the conditions above is met;
# this is equivalent to dropping groups that have size 3 AND contain no 0
out = df[m1|m2]
Output:
ID value
0 A 3
1 A 2
2 A 0
3 B 1
4 B 1
8 D 0
9 D 5
10 D 5
14 F 3
15 F 2
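If readability matters more than speed, the same rule can be written with groupby and filter, which keeps or drops whole groups based on a function returning True or False. A minimal sketch, assuming the data from the question; the `keep` helper is a hypothetical name:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": list("AAABBCCCDDDEEEFF"),
    "value": [3, 2, 0, 1, 1, 3, 3, 4, 0, 5, 5, 6, 7, 7, 3, 2],
})

def keep(group):
    # Keep the group unless it has exactly 3 rows and no zero in 'value'.
    return len(group) != 3 or bool(group["value"].eq(0).any())

out = df.groupby("ID").filter(keep)
print(out)
```

filter calls the function once per group, so on large frames the boolean-mask version above is likely to be faster.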

How to create a new column that increments within a subgroup of a group in Python?

I have a problem where I need to group the data by two groups, and attach a column that sort of counts the subgroup.
Example dataframe looks like this:
colA colB
1 a
1 a
1 c
1 c
1 f
1 z
1 z
1 z
2 a
2 b
2 b
2 b
3 c
3 d
3 k
3 k
3 m
3 m
3 m
Expected output after attaching the new column is as follows:
colA colB colC
1 a 1
1 a 1
1 c 2
1 c 2
1 f 3
1 z 4
1 z 4
1 z 4
2 a 1
2 b 2
2 b 2
2 b 2
3 c 1
3 d 2
3 k 3
3 k 3
3 m 4
3 m 4
3 m 4
I tried the following but I cannot get this trivial looking problem solved:
Solution 1 I tried that does not give what I am looking for:
df['ONES']=1
df['colC']=df.groupby(['colA','colB'])['ONES'].cumcount()+1
df.drop(columns='ONES', inplace=True)
I also played with transform, and cumsum functions, and apply, but I cannot seem to solve this. Any help is appreciated.
Edit: minor error on dataframes.
Edit 2: For simplicity purposes, I showed similar values for column B, but the problem is within a larger group (indicated by colA), colB may be different and therefore, it needs to be grouped by both at the same time.
Edit 3: Updated dataframes to emphasize what I meant by my second edit. Hope this makes it clearer and reproducible.
You could use groupby + ngroup:
df['colC'] = df.groupby('colA').apply(lambda x: x.groupby('colB').ngroup()+1).droplevel(0)
Output:
colA colB colC
0 1 a 1
1 1 a 1
2 1 c 2
3 1 c 2
4 1 f 3
5 1 z 4
6 1 z 4
7 1 z 4
8 2 a 1
9 2 b 2
10 2 b 2
11 2 b 2
12 3 c 1
13 3 d 2
14 3 k 3
15 3 k 3
16 3 m 4
17 3 m 4
18 3 m 4
Categorically, factorize
df['colC'] =df['colB'].astype('category').cat.codes+1
colA colB colC
0 1 a 1
1 1 a 1
2 1 b 2
3 1 b 2
4 1 c 3
5 1 d 4
6 1 d 4
7 1 d 4
8 2 a 1
9 2 b 2
10 2 b 2
11 2 b 2
12 3 a 1
13 3 b 2
14 3 c 3
15 3 c 3
16 3 d 4
17 3 d 4
18 3 d 4
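Note that taking category codes over the whole column numbers the letters globally, so it only matches the expected output when every colA group uses the same letters. A sketch of a per-group variant, assuming the updated data from the question, is to factorize colB inside each colA group; pd.factorize numbers values in order of first appearance:

```python
import pandas as pd

df = pd.DataFrame({
    "colA": [1] * 8 + [2] * 4 + [3] * 7,
    "colB": list("aaccfzzz") + list("abbb") + list("cdkkmmm"),
})

# Within each colA group, give every distinct colB value a running number
# (1 for the first distinct value seen, 2 for the next, and so on).
df["colC"] = df.groupby("colA")["colB"].transform(
    lambda s: pd.factorize(s)[0] + 1
)
print(df)
```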

Determine reverse order of data given X/Y coordinates

Imagine an electrical connector. It has pins. Each pin has a corresponding X/Y location in space. I am trying to figure out how to mirror, or 'flip', each pin on the connector given its X/Y coordinate. Note: I am using pandas version 23.4. We can assume that x, y, and pin are not unique but connector is. Connectors can be any size: two rows of 5, three rows of 6, etc.
x y pin connector
1 1 A 1
2 1 B 1
3 1 C 1
1 2 D 1
2 2 E 1
3 2 F 1
1 1 A 2
2 1 B 2
3 1 C 2
1 2 D 2
2 2 E 2
3 2 F 2
The dataframe column, 'flip', is the solution I am trying to get to. Notice the pins that would be in the same row are now in reverse order.
x y pin flip connector
1 1 A C 1
2 1 B B 1
3 1 C A 1
1 2 D F 1
2 2 E E 1
3 2 F D 1
1 1 A C 2
2 1 B B 2
3 1 C A 2
1 2 D F 2
2 2 E E 2
3 2 F D 2
IIUC, try reversing each group with [::-1] inside groupby with transform:
df['flip'] = df.groupby(['connector','y'])['pin'].transform(lambda x: x[::-1])
Output:
x y pin connector flip
0 1 1 A 1 C
1 2 1 B 1 B
2 3 1 C 1 A
3 1 2 D 1 F
4 2 2 E 1 E
5 3 2 F 1 D
6 1 1 A 2 C
7 2 1 B 2 B
8 3 1 C 2 A
9 1 2 D 2 F
10 2 2 E 2 E
11 3 2 F 2 D
import io
import pandas as pd
data = """
x y pin connector
1 1 A 1
2 1 B 1
3 1 C 1
1 2 D 1
2 2 E 1
3 2 F 1
1 1 A 2
2 1 B 2
3 1 C 2
1 2 D 2
2 2 E 2
3 2 F 2
"""
#strip blank lines at the beginning and end
data = data.strip()
#make it quack like a file
data_file = io.StringIO(data)
#read data from a "wsv" file (whitespace separated values)
df = pd.read_csv(data_file, sep=r'\s+')
Make the new column:
flipped = []
for name, group in df.groupby(['connector','y']):
    flipped.extend(group.loc[::-1,'pin'])
df = df.assign(flip=flipped)
df
Final DataFrame:
x y pin connector flip
0 1 1 A 1 C
1 2 1 B 1 B
2 3 1 C 1 A
3 1 2 D 1 F
4 2 2 E 1 E
5 3 2 F 1 D
6 1 1 A 2 C
7 2 1 B 2 B
8 3 1 C 2 A
9 1 2 D 2 F
10 2 2 E 2 E
11 3 2 F 2 D
You can create a map between the original coordinates and the coordinates of the 'flipped' component. Then you can select the flipped values.
import numpy as np
midpoint = 2
coordinates_of_flipped = pd.MultiIndex.from_arrays([df['x'].map(lambda x: x - midpoint * np.sign(x - midpoint )), df['y'], df['connector']])
df['flipped'] = df.set_index(['x', 'y', 'connector']).loc[coordinates_of_flipped].reset_index()['pin']
which gives
Out[30]:
x y pin connector flipped
0 1 1 A 1 C
1 2 1 B 1 B
2 3 1 C 1 A
3 1 2 D 1 F
4 2 2 E 1 E
5 3 2 F 1 D
6 1 1 A 2 C
7 2 1 B 2 B
8 3 1 C 2 A
9 1 2 D 2 F
10 2 2 E 2 E
11 3 2 F 2 D
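The hard-coded midpoint only fits rows that are three pins wide. One way to generalise it, sketched here under the assumption that pins in a row are evenly spaced and that (connector, y, x) identifies a single pin, is to mirror each x around its own row with x' = min + max - x and then look up the pin at the mirrored position. The names `x_mirror`, `lookup` and `out` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 1, 2, 3],
    "y": [1, 1, 1, 2, 2, 2],
    "pin": list("ABCDEF"),
    "connector": [1] * 6,
})

keys = ["connector", "y"]
row_x = df.groupby(keys)["x"]

# Mirror every x around the centre of its own row.
df["x_mirror"] = row_x.transform("min") + row_x.transform("max") - df["x"]

# Table that answers "which pin sits at this (connector, y, x)?"
lookup = df[keys + ["x", "pin"]].rename(columns={"x": "x_mirror", "pin": "flip"})

# A left merge keeps the original row order; drop the helper column afterwards.
out = df.merge(lookup, on=keys + ["x_mirror"], how="left").drop(columns="x_mirror")
print(out)
```

Because the lookup is by coordinate rather than by row position, this does not depend on the rows being sorted by x.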

Pandas reverse column values groupwise

I want to reverse a column's values in my dataframe, but only on an individual "groupby" level. Below you can find a minimal demonstration example, where I want to "flip" values that belong to the same letter A, B or C:
df = pd.DataFrame({"group":["A","A","A","B","B","B","B","C","C"],
"value": [1,3,2,4,4,2,3,2,5]})
group value
0 A 1
1 A 3
2 A 2
3 B 4
4 B 4
5 B 2
6 B 3
7 C 2
8 C 5
My desired output looks like this (a column is added instead of replaced only for the sake of brevity):
group value value_desired
0 A 1 2
1 A 3 3
2 A 2 1
3 B 4 3
4 B 4 2
5 B 2 4
6 B 3 4
7 C 2 5
8 C 5 2
As always, when I don't see a proper vector-style approach, I end up messing with loops just for the sake of the final output, but my current code hurts me very much:
for i in list(set(df["group"].values.tolist())):
    reversed_group = df.loc[df["group"]==i,"value"].values.tolist()[::-1]
    df.loc[df["group"]==i,"value_desired"] = reversed_group
Pandas gurus, please show me the way :)
You can use transform
In [900]: df.groupby('group')['value'].transform(lambda x: x[::-1])
Out[900]:
0 2
1 3
2 1
3 3
4 2
5 4
6 4
7 5
8 2
Name: value, dtype: int64
Details
In [901]: df['value_desired'] = df.groupby('group')['value'].transform(lambda x: x[::-1])
In [902]: df
Out[902]:
group value value_desired
0 A 1 2
1 A 3 3
2 A 2 1
3 B 4 3
4 B 4 2
5 B 2 4
6 B 3 4
7 C 2 5
8 C 5 2
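A slightly more defensive variant, offered as a sketch rather than a correction, reverses the underlying NumPy array instead of the Series. That way the result is purely positional and cannot be affected by index alignment, whatever the pandas version does with a returned Series:

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "B", "C", "C"],
    "value": [1, 3, 2, 4, 4, 2, 3, 2, 5],
})

# to_numpy() drops the index, so [::-1] reverses the values by position only.
df["value_desired"] = df.groupby("group")["value"].transform(
    lambda s: s.to_numpy()[::-1]
)
print(df)
```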

Add multiple columns to DataFrame and set them equal to an existing column

I want to add multiple columns to a pandas DataFrame and set them equal to an existing column. Is there a simple way of doing this? In R I would do:
df <- data.frame(a=1:5)
df[c('b','c')] <- df$a
df
a b c
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
In pandas this results in KeyError: "['b' 'c'] not in index":
df = pd.DataFrame({'a': np.arange(1,6)})
df[['b','c']] = df.a
You can use the .assign() method:
In [31]: df.assign(b=df['a'], c=df['a'])
Out[31]:
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
or a little bit more creative approach:
In [41]: cols = list('bcdefg')
In [42]: df.assign(**{col:df['a'] for col in cols})
Out[42]:
a b c d e f g
0 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2
2 3 3 3 3 3 3 3
3 4 4 4 4 4 4 4
4 5 5 5 5 5 5 5
another solution:
In [60]: pd.DataFrame(np.repeat(df.values, len(cols)+1, axis=1), columns=['a']+cols)
Out[60]:
a b c d e f g
0 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2
2 3 3 3 3 3 3 3
3 4 4 4 4 4 4 4
4 5 5 5 5 5 5 5
NOTE: as @Cpt_Jauchefuerst mentioned in the comments, DataFrame.assign(z=1, a=1) will add columns in alphabetical order, i.e. first a will be added to the existing columns and then z.
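As far as I know, the alphabetical ordering applies to older setups only: with Python 3.6 or later and a reasonably recent pandas, assign keeps the keyword arguments in the order they are written. A quick check you can run on your own installation (the column names z and b are arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(1, 6)})

# New columns are appended in keyword order on recent pandas versions.
out = df.assign(z=1, b=df["a"])
print(list(out.columns))
```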
A pd.concat approach
df = pd.DataFrame(dict(a=range(5)))
pd.concat([df.a] * 5, axis=1, keys=list('abcde'))
a b c d e
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
Turns out you can use a loop to do this:
for i in ['b','c']: df[i] = df.a
You can set them individually if you're only dealing with a few columns:
df['b'] = df['a']
df['c'] = df['a']
or you can use a loop as you discovered.
