Pandas: merging rows with a condition - python

I have a pandas DataFrame with the columns "A", "B", "C", "D". I want to merge pairs of rows that satisfy the following condition (where df is my DataFrame):
(df.at[i, "A"] == df.at[j, "B"]) and (df.at[j, "A"] == df.at[i, "B"])
For example -
For example -
df = pd.DataFrame([[1,2,10,0.55],[3,4,5,0.3],[2,1,2,0.7]], columns=["A","B","C","D"])
Which gives -
In [93]: df
Out[93]:
   A  B   C     D
0  1  2  10  0.55
1  3  4   5  0.30
2  2  1   2  0.70
In the example above, rows 0 and 2 satisfy the condition. I know for certain that at most two rows can satisfy this condition together. For such a pair of rows I would like to sum the "C" values, average the "D" values, and remove the redundant row. In the example above I would like to get -
In [95]: result
Out[95]:
   A  B   C      D
0  1  2  12  0.625
1  3  4   5  0.300
Or
In [95]: result
Out[95]:
   A  B   C      D
0  2  1  12  0.625
1  3  4   5  0.300
I tried the following code, but it was very slow:
def remove_dups(path_to_df: str):
    df = pd.read_csv(path_to_df)
    for i in range(len(df)):
        a = df.at[i, "A"]
        b = df.at[i, "B"]
        # look for the mirrored row (A and B swapped)
        same_row = df[(df["A"] == b) & (df["B"] == a)]
        if same_row.empty:
            continue
        c = df.at[i, "C"]
        d = df.at[i, "D"]
        df.drop(i, inplace=True)
        new_ind = same_row.index[0]
        df.at[new_ind, "C"] += c
        df.at[new_ind, "D"] = (df.at[new_ind, "D"] + d) / 2
    return df
Is there a way to accomplish this using only built-in Pandas functions?

Use numpy.sort first and then GroupBy.agg:
import numpy as np

# Sort each (A, B) pair row-wise so reversed duplicates share a group key.
df[['A','B']] = np.sort(df[['A','B']], axis=1)
df = df.groupby(['A','B'], as_index=False).agg({'C':'sum', 'D':'mean'})
print(df)
   A  B   C      D
0  1  2  12  0.625
1  3  4   5  0.300
If the original values must stay unchanged:
arr = np.sort(df[['A','B']], axis=1)
df = (df.groupby([arr[:, 0], arr[:, 1]])
        .agg({'C':'sum', 'D':'mean'})
        .rename_axis(('A','B'))
        .reset_index())
print(df)
   A  B   C      D
0  1  2  12  0.625
1  3  4   5  0.300
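
Why the row-wise sort works: it canonicalizes each (A, B) pair, so a reversed duplicate produces the same group key. A quick check on the question's data (a sketch, assuming the usual imports):

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 10, 0.55], [3, 4, 5, 0.3], [2, 1, 2, 0.7]],
                  columns=["A", "B", "C", "D"])

# Each pair is sorted independently, so (2, 1) becomes (1, 2).
print(np.sort(df[['A', 'B']], axis=1))
# [[1 2]
#  [3 4]
#  [1 2]]  <- rows 0 and 2 now share the group key (1, 2)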

Related

How do you generate a rolling count of the number of rows that are duplicated in Pandas? [duplicate]

I come from a SQL background, and I use the following data-processing step frequently:
1. Partition a table of data by one or more fields.
2. For each partition, add a row number to each of its rows that ranks the row by one or more other fields, where the analyst specifies ascending or descending.
Example:
df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
                   'data1' : [1,2,2,3,3],
                   'data2' : [1,10,2,3,30]})
df
   data1  data2 key1
0      1      1    a
1      2     10    a
2      2      2    a
3      3      3    b
4      3     30    a
I'm looking for the pandas equivalent of this SQL window function:
RN = ROW_NUMBER() OVER (PARTITION BY Key1 ORDER BY Data1 ASC, Data2 DESC)
   data1  data2 key1  RN
0      1      1    a   1
1      2     10    a   2
2      2      2    a   3
3      3      3    b   1
4      3     30    a   4
I've tried the following, which I've gotten to work where there are no 'partitions':
def row_number(frame, orderby_columns, orderby_direction, name):
    frame.sort_values(by=orderby_columns, ascending=orderby_direction, inplace=True)
    frame[name] = list(range(len(frame.index)))
I tried to extend this idea to work with partitions (groups in pandas) but the following didn't work:
df1 = df.groupby('key1').apply(lambda t: t.sort_values(by=['data1', 'data2'], ascending=[True, False], inplace=True)).reset_index()

def nf(x):
    x['rn'] = list(range(len(x.index)))

df1['rn1'] = df1.groupby('key1').apply(nf)
But I just got a lot of NaNs when I do this.
Ideally, there'd be a succinct way to replicate the window-function capability of SQL (I've figured out the window-based aggregates; those are one-liners in pandas). Can someone share the most idiomatic way to number rows like this in pandas?
You can also use sort_values(), groupby(), and finally cumcount() + 1:
df['RN'] = df.sort_values(['data1','data2'], ascending=[True,False]) \
             .groupby(['key1']) \
             .cumcount() + 1
print(df)
yields:
   data1  data2 key1  RN
0      1      1    a   1
1      2     10    a   2
2      2      2    a   3
3      3      3    b   1
4      3     30    a   4
P.S. Tested with pandas 0.18.
Use the groupby.rank function. Here is a working example.
df = pd.DataFrame({'C1':['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df
  C1  C2
0  a   1
1  a   2
2  a   3
3  b   4
4  b   5
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df
  C1  C2  RANK
0  a   1   1.0
1  a   2   2.0
2  a   3   3.0
3  b   4   1.0
4  b   5   2.0
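
A note on matching the question's two-column ORDER BY (this sketch is my addition, not part of the original answer): method="first" breaks ties by row position, so pre-sorting by the secondary column makes the tie-break follow data2 descending:

import pandas as pd

df = pd.DataFrame({'key1': ['a','a','a','b','a'],
                   'data1': [1,2,2,3,3],
                   'data2': [1,10,2,3,30]})

# Sort by data2 descending first; rows tied on data1 are then ranked
# in ORDER BY data2 DESC order. rank() aligns back on the index.
df['RN'] = (df.sort_values('data2', ascending=False)
              .groupby('key1')['data1']
              .rank(method='first')
              .astype(int))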
You can do this by using groupby twice along with the rank method:
In [11]: g = df.groupby('key1')
Use the min method argument to give values which share the same data1 the same RN:
In [12]: g['data1'].rank(method='min')
Out[12]:
0    1
1    2
2    2
3    1
4    4
dtype: float64
In [13]: df['RN'] = g['data1'].rank(method='min')
And then groupby these results and add the rank with respect to data2:
In [14]: g1 = df.groupby(['key1', 'RN'])
In [15]: g1['data2'].rank(ascending=False) - 1
Out[15]:
0    0
1    0
2    1
3    0
4    0
dtype: float64
In [16]: df['RN'] += g1['data2'].rank(ascending=False) - 1
In [17]: df
Out[17]:
   data1  data2 key1  RN
0      1      1    a   1
1      2     10    a   2
2      2      2    a   3
3      3      3    b   1
4      3     30    a   4
It feels like there ought to be a native way to do this (there may well be!...).
You can use transform and rank together. Here is an example:
df = pd.DataFrame({'C1' : ['a','a','a','b','b'],
                   'C2' : [1,2,3,4,5]})
df['Rank'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.rank())
df
Have a look at the pandas rank method for more information.
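
A side note (my observation, not part of the original answer): the transform wrapper is not strictly required here, because a grouped rank already returns a Series aligned to the original index:

df['Rank'] = df.groupby('C1')['C2'].rank()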
pandas.lib.fast_zip() can create a tuple array from a list of arrays. You can use this function to create a tuple Series and then rank it:
values = {'key1' : ['a','a','a','b','a','b'],
          'data1' : [1,2,2,3,3,3],
          'data2' : [1,10,2,3,30,20]}
df = pd.DataFrame(values, index=list("abcdef"))

def rank_multi_columns(df, cols, **kw):
    data = []
    for col in cols:
        if col.startswith("-"):
            flag = -1
            col = col[1:]
        else:
            flag = 1
        data.append(flag * df[col])
    values = pd.lib.fast_zip(data)
    s = pd.Series(values, index=df.index)
    return s.rank(**kw)

rank = df.groupby("key1").apply(lambda df: rank_multi_columns(df, ["data1", "-data2"]))
print(rank)
The result:
a    1
b    2
c    3
d    2
e    4
f    1
dtype: float64
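
Note that pandas.lib.fast_zip was removed from the public API in later pandas releases. A minimal sketch of the same idea using plain zip (tuples compare lexicographically; assuming the column values are numeric so the sign trick works):

import pandas as pd

values = {'key1': ['a','a','a','b','a','b'],
          'data1': [1,2,2,3,3,3],
          'data2': [1,10,2,3,30,20]}
df = pd.DataFrame(values, index=list("abcdef"))

def rank_multi_columns(g, cols, **kw):
    data = []
    for col in cols:
        flag = -1 if col.startswith("-") else 1  # "-col" means descending
        data.append(flag * g[col.lstrip("-")])
    # Plain zip replaces pd.lib.fast_zip; each row becomes one tuple.
    s = pd.Series(list(zip(*data)), index=g.index)
    return s.rank(**kw)

rank = df.groupby("key1", group_keys=False).apply(
    lambda g: rank_multi_columns(g, ["data1", "-data2"]))
print(rank)  # a 1, b 2, c 3, e 4, d 2, f 1 (group order may vary)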

Map a Pandas Series with duplicate keys to a DataFrame

Env: Python 3.9.6, Pandas 1.3.5
I have a DataFrame and a Series like the ones below:
df = pd.DataFrame({"C1" : ["A", "B", "C", "D"]})
sr = pd.Series(data = [1, 2, 3, 4, 5],
index = ["A", "A", "B", "C", "D"])
"""
[DataFrame]
C1
0 A
1 B
2 C
3 D
[Series]
A 1
A 2
B 3
C 4
D 5
"""
What I tried:
df["C2"] = df["C1"].map(sr)
But an InvalidIndexError occurred because the Series has duplicate keys ("A"):
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Is there any method to make a DataFrame like the one below?
  C1  C2
0  A   1
1  A   2
2  B   3
3  C   4
4  D   5
or
  C1  C2
0  A   1
1  B   3
2  C   4
3  D   5
4  A   2
Row indices do not matter.
The question was heavily edited and now has a very different meaning.
You want a simple merge:
df.merge(sr.rename('C2'),
         left_on='C1', right_index=True)
Output:
  C1  C2
0  A   1
0  A   2
1  B   3
2  C   4
3  D   5
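
If a clean RangeIndex is wanted afterwards (note that both matched "A" rows keep df's original index 0 above), a reset can be chained on, e.g.:

df.merge(sr.rename('C2'), left_on='C1', right_index=True).reset_index(drop=True)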
Old answer:
First, I can't reproduce your issue (tested with 3M rows on pandas 1.3.5).
Then why do you use slicing and not map? This would have the advantage of systematically outputting the correct number of rows (NaN if the key is absent):
Example:
import numpy as np

sr = pd.Series({10: "A", 13: "B", 16: "C", 18: "D"})
df = pd.DataFrame({"C1": np.random.randint(10, 20, size=3000000)})
df['C2'] = df['C1'].map(sr)
print(df.head())
Output:
   C1   C2
0  10    A
1  18    D
2  10    A
3  13    B
4  15  NaN

How do I replace pandas rows with values of another dataframe for all instances of the value in the first df?

I have two dataframes:
df1 =
  A  B  C
  a  1  3
  b  2  3
  c  2  2
  a  1  4
df2 =
  A  B    C
  a  1  3.5
Now I need to replace all occurrences of 'a' in df1 (two rows in this case) with the single 'a' row from df2, leaving 'b' and 'c' unchanged. The final dataframe should be:
df_final =
  A  B    C
  b  2    3
  c  2    2
  a  1  3.5
Do you mean:
df_final = pd.concat((df1[df1['A'].ne('a')], df2))
Or, if you have several values like 'a':
list_special = ['a']
df_final = pd.concat((df1[~df1['A'].isin(list_special)], df2))
If df2 just has the average of duplicated values, you can do df1.groupby(["A", "B"]).mean().reset_index()
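
A quick check of that one-liner on the question's data (a sketch; the frame is reconstructed from the example above):

import pandas as pd

df1 = pd.DataFrame({"A": ["a", "b", "c", "a"],
                    "B": [1, 2, 2, 1],
                    "C": [3, 3, 2, 4]})
print(df1.groupby(["A", "B"]).mean().reset_index())
#    A  B    C
# 0  a  1  3.5
# 1  b  2  3.0
# 2  c  2  2.0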
Otherwise, you can do something like this:
In [27]: df = df1.groupby(["A", "B"]).first().merge(df2, how="left", on=["A", "B"])
    ...: df["C"] = df["C_y"].fillna(df["C_x"])
    ...: df = df[["A", "B", "C"]]
    ...: df
Out[27]:
   A  B    C
0  a  1  3.5
1  b  2  3.0
2  c  2  2.0
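
For completeness, an alternative sketch for the same task (my addition, not from the answers above) using combine_first, which lets df2's rows override df1's on matching (A, B) keys:

import pandas as pd

df1 = pd.DataFrame({"A": ["a", "b", "c", "a"],
                    "B": [1, 2, 2, 1],
                    "C": [3, 3, 2, 4]})
df2 = pd.DataFrame({"A": ["a"], "B": [1], "C": [3.5]})

# Collapse df1 to one row per key, then let df2 take precedence on overlap.
base = df1.drop_duplicates(["A", "B"]).set_index(["A", "B"])
result = df2.set_index(["A", "B"]).combine_first(base).reset_index()
# Row order may differ from the question's expected output.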

Pandas DataFrame - Normalisation [duplicate]

I have:
df = pd.DataFrame({'col1': ['asdf', 'xy', 'q'], 'col2': [1, 2, 3]})
   col1  col2
0  asdf     1
1    xy     2
2     q     3
I'd like to take the "combinatoric product" of each letter from the strings in col1 with the corresponding int in col2, i.e.:
  col1  col2
0    a     1
1    s     1
2    d     1
3    f     1
4    x     2
5    y     2
6    q     3
Current method:
from itertools import product

pieces = []
for _, s in df.iterrows():
    letters = list(s.col1)
    prods = list(product(letters, [s.col2]))
    pieces.append(pd.DataFrame(prods))
pd.concat(pieces)
Any more efficient workarounds?
Using list + str.join and np.repeat -
pd.DataFrame({
    'col1': list(''.join(df.col1)),
    'col2': df.col2.values.repeat(df.col1.str.len(), axis=0)
})
  col1  col2
0    a     1
1    s     1
2    d     1
3    f     1
4    x     2
5    y     2
6    q     3
A generalised solution for any number of columns is easily achievable, without much change to the solution -
i = list(''.join(df.col1))
j = df.drop('col1', axis=1).values.repeat(df.col1.str.len(), axis=0)

df = pd.DataFrame(j, columns=df.columns.difference(['col1']))
df.insert(0, 'col1', i)
df
  col1  col2
0    a     1
1    s     1
2    d     1
3    f     1
4    x     2
5    y     2
6    q     3
Performance
df = pd.concat([df] * 100000, ignore_index=True)
# MaxU's solution
%%timeit
df.col1.str.extractall(r'(.)') \
  .reset_index(level=1, drop=True) \
  .join(df['col2']) \
  .reset_index(drop=True)

1 loop, best of 3: 1.98 s per loop
# piRSquared's solution
%%timeit
pd.DataFrame(
    [[x] + b for a, *b in df.values for x in a],
    columns=df.columns
)

1 loop, best of 3: 1.68 s per loop
# Wen's solution
%%timeit
v = df.col1.apply(list)
pd.DataFrame({'col1': np.concatenate(v.values),
              'col2': df.col2.repeat(v.apply(len))})

1 loop, best of 3: 835 ms per loop
# Alexander's solution
%%timeit
pd.DataFrame([(letter, i)
              for letters, i in zip(df['col1'], df['col2'])
              for letter in letters],
             columns=df.columns)

1 loop, best of 3: 316 ms per loop
# This answer's solution
%%timeit
pd.DataFrame({
    'col1': list(''.join(df.col1)),
    'col2': df.col2.values.repeat(df.col1.str.len(), axis=0)
})

10 loops, best of 3: 124 ms per loop
I tried timing Vaishali's, but it took too long on this dataset.
pd.DataFrame([(letter, i)
              for letters, i in zip(df['col1'], df['col2'])
              for letter in letters],
             columns=df.columns)
Trick from the list :-)
df.col1 = df.col1.apply(list)
df
Out[489]:
           col1  col2
0  [a, s, d, f]     1
1        [x, y]     2
2           [q]     3
pd.DataFrame({'col1': np.concatenate(df.col1.values),
              'col2': df.col2.repeat(df.col1.apply(len))})
Out[490]:
  col1  col2
0    a     1
0    s     1
0    d     1
0    f     1
1    x     2
1    y     2
2    q     3
In [86]: df.col1.str.extractall(r'(.)') \
    ...:   .reset_index(level=1, drop=True) \
    ...:   .join(df['col2']) \
    ...:   .reset_index(drop=True)
Out[86]:
   0  col2
0  a     1
1  s     1
2  d     1
3  f     1
4  x     2
5  y     2
6  q     3
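
The capture group comes back as a column literally named 0; if the original header is wanted, a rename can be chained on (a sketch, using the same df):

out = (df.col1.str.extractall(r'(.)')
         .reset_index(level=1, drop=True)
         .join(df['col2'])
         .reset_index(drop=True)
         .rename(columns={0: 'col1'}))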
One more:)
df.set_index('col2').col1.apply(lambda x: pd.Series(list(x))).stack() \
  .reset_index(1, drop=True).reset_index(name='col1')

   col2 col1
0     1    a
1     1    s
2     1    d
3     1    f
4     2    x
5     2    y
6     3    q
General solution with a list comprehension and clever unpacking:
pd.DataFrame(
    # a, *b unpacks each row into col1 (a) and the remaining columns (b)
    [[x] + b for a, *b in df.values for x in a],
    columns=df.columns
)
  col1  col2
0    a     1
1    s     1
2    d     1
3    f     1
4    x     2
5    y     2
6    q     3
Using Explode (pandas>=0.25)
df = pd.DataFrame({'col1': ['asdf', 'xy', 'q'], 'col2': [1, 2, 3]})
df.col1 = df.col1.apply(list)
df = df.explode('col1')
Result:
  col1  col2
0    a     1
0    s     1
0    d     1
0    f     1
1    x     2
1    y     2
2    q     3
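
A side note, assuming pandas >= 1.1: explode accepts ignore_index=True for a fresh 0..n-1 index, and the list conversion can be inlined:

df = pd.DataFrame({'col1': ['asdf', 'xy', 'q'], 'col2': [1, 2, 3]})
out = df.assign(col1=df['col1'].map(list)).explode('col1', ignore_index=True)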
You can also use the itertools.chain and itertools.repeat functions to achieve a similar result. An example:
import pandas as pd
from itertools import chain, repeat

d = {'col1': ['asdf', 'xy', 'q'], 'col2': [1, 2, 3]}
expanded_d = {
    "col1": list(chain(*[list(item) for item in d["col1"]])),
    "col2": list(chain(*[list(repeat(d["col2"][idx], len(d["col1"][idx])))
                         for idx in range(len(d["col1"]))])),
}
result = pd.DataFrame(data=expanded_d)
  col1  col2
0    a     1
1    s     1
2    d     1
3    f     1
4    x     2
5    y     2
6    q     3
Hope it helps.

Create binary columns in a dataframe from a condition on its values

I have a dataframe that looks like this one:
df = pd.DataFrame(np.nan, index=[0,1,2,3], columns=['A','B','C'])
df.iloc[0,0] = 'a'
df.iloc[1,0] = 'b'
df.iloc[1,1] = 'c'
df.iloc[2,0] = 'b'
df.iloc[3,0] = 'c'
df.iloc[3,1] = 'b'
df.iloc[3,2] = 'd'
df
Out:
   A    B    C
0  a  NaN  NaN
1  b    c  NaN
2  b  NaN  NaN
3  c    b    d
I would like to add new columns whose names are the values that appear inside the dataframe (here 'a', 'b', 'c', and 'd'). Those columns are binary and reflect whether each of the values 'a', 'b', 'c', and 'd' appears in the row.
In one picture, the output I'd like is:
   A    B    C  a  b  c  d
0  a  NaN  NaN  1  0  0  0
1  b    c  NaN  0  1  1  0
2  b  NaN  NaN  0  1  0  0
3  c    b    d  0  1  1  1
To do this I first create the columns filled with zeros:
cols = pd.Series(df.values.ravel()).value_counts().index
for col in cols:
    df[col] = 0
(It doesn't create the columns in the right order, but that doesn't matter)
Then I...use a loop over the rows and columns...
for row in df.index:
    for col in cols:
        if col in df.loc[row].values:
            df.loc[row, col] = 1
You'll see why I'm looking for another way to do it: even though my dataframe is relatively small (76k rows), this takes around 8 minutes, which is far too long.
Any idea?
You're looking for get_dummies. Here I choose to use the .str version:
df.fillna('', inplace=True)
(df.A + '|' + df.B + '|' + df.C).str.get_dummies()
Output:
   a  b  c  d
0  1  0  0  0
1  0  1  1  0
2  0  1  0  0
3  0  1  1  1
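
To end up with the full frame from the question (original columns plus the indicators), the dummies can be joined back, a minimal sketch; note that the fillna('') above has already replaced the NaNs with empty strings:

dummies = (df.A + '|' + df.B + '|' + df.C).str.get_dummies()
result = df.join(dummies)
print(result)
#    A  B  C  a  b  c  d
# 0  a        1  0  0  0
# 1  b  c     0  1  1  0
# 2  b        0  1  0  0
# 3  c  b  d  0  1  1  1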
