Duplicate rows in dataframe based on column value [duplicate]

Let's say I have a data frame called df
x count
d 2
e 3
f 2
count is the counter column: the number of times I want each row to repeat.
How would I expand it to get
x count
d 2
d 2
e 3
e 3
e 3
f 2
f 2
I've already tried
numpy.repeat(df, df.iloc['count']) but it raises an error.

You can use np.repeat()
import pandas as pd
import numpy as np
# your data
# ========================
df
x count
0 d 2
1 e 3
2 f 2
# processing
# ==================================
np.repeat(df.values, df['count'].values, axis=0)
array([['d', 2],
['d', 2],
['e', 3],
['e', 3],
['e', 3],
['f', 2],
['f', 2]], dtype=object)
pd.DataFrame(np.repeat(df.values, df['count'].values, axis=0), columns=['x', 'count'])
x count
0 d 2
1 d 2
2 e 3
3 e 3
4 e 3
5 f 2
6 f 2

You could use .loc with Index.repeat:
In [295]: df.loc[df.index.repeat(df['count'])].reset_index(drop=True)
Out[295]:
x count
0 d 2
1 d 2
2 e 3
3 e 3
4 e 3
5 f 2
6 f 2
Or, using pd.Series.repeat, you can do:
In [278]: df.set_index('x')['count'].repeat(df['count']).reset_index()
Out[278]:
x count
0 d 2
1 d 2
2 e 3
3 e 3
4 e 3
5 f 2
6 f 2
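As a self-contained sketch of the .loc approach above (assuming the index labels are unique, since .loc with duplicated labels would return extra rows), the following rebuilds the question's frame and expands it. Unlike the df.values route, which passes through an object array, this keeps the integer dtype of count.

```python
import pandas as pd

# The frame from the question.
df = pd.DataFrame({'x': ['d', 'e', 'f'], 'count': [2, 3, 2]})

# Repeat each index label `count` times, then select rows by those labels.
expanded = df.loc[df.index.repeat(df['count'])].reset_index(drop=True)

print(expanded)
print(expanded.dtypes)  # 'count' is still an integer column
```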

Related

Pandas row replication python

My dataframe has multiple columns; one of them, named "multiple", is a boolean flag containing only 1s and 0s. I want to replicate 4 times every row where df.multiple == 1, and only those rows. How can I do that? (I don't want to replicate indexes.)
example input:
df=
index strings multiple
0 A 0
1 B 1
2 C 1
3 D 0
4 E 1
Expected output:
index strings multiple
0 A 0
1 B 1
2 B 1
3 B 1
4 B 1
5 B 1
6 C 1
7 C 1
8 C 1
9 C 1
10 C 1
11 D 0
12 E 1
13 E 1
14 E 1
15 E 1
16 E 1

Here is another alternative, based on @Vinzent's answer.
It constructs the repeats the same way but does not need to rebuild the full dataframe; it relies on indexing instead. This solution is ~30% faster on the provided dataset and on larger datasets.
df.loc[np.repeat(df.multiple, df.multiple.values*4+1).index].reset_index(drop=True)

This is what numpy.repeat is for:
import pandas as pd
import numpy as np
df = pd.DataFrame([['A', 0],
['B', 1],
['C', 1],
['D', 0],
['E', 1]],
columns=['strings', 'multiple'])
df = pd.DataFrame(np.repeat(df.values, df['multiple']*4+1, axis=0), columns=df.columns)
print(df)
# strings multiple
# 0 A 0
# 1 B 1
# 2 B 1
# 3 B 1
# 4 B 1
# 5 B 1
# 6 C 1
# 7 C 1
# 8 C 1
# 9 C 1
# 10 C 1
# 11 D 0
# 12 E 1
# 13 E 1
# 14 E 1
# 15 E 1
# 16 E 1
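The same result can be sketched with index-based selection, which avoids the round trip through df.values and so keeps the column dtypes. The name extra_copies is a hypothetical parameter introduced here for illustration, and the sketch assumes a unique index.

```python
import pandas as pd

df = pd.DataFrame({'strings': list('ABCDE'), 'multiple': [0, 1, 1, 0, 1]})

extra_copies = 4  # hypothetical: number of additional copies for flagged rows

# Unflagged rows get 1 repeat; flagged rows get 1 + extra_copies.
repeats = df['multiple'] * extra_copies + 1
out = df.loc[df.index.repeat(repeats)].reset_index(drop=True)

print(out)
```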

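You can do it with pandas alone. To match the expected output, each flagged row has to appear 5 times (the original plus 4 copies), so the group is concatenated 5 times:
(df.groupby('multiple')
   .apply(lambda x: pd.concat([x]*5) if x.name else x)
   .droplevel(level=0)
   .sort_index()
   .reset_index(drop=True)
)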

Filling a column with its header value

How can I create a new column D and fill it with its own header value (i.e. not hard-coded as D, but whatever value is passed as the column header)?
import pandas as pd
df = pd.DataFrame({'B': [1, 2, 3], 'C': [4, 5, 6]})
Output:
index B C D
0 1 4 D
1 2 5 D
2 3 6 D

One way is the following (if you know the header in advance):
df['D'] = 'D'
>>> df
B C D
0 1 4 D
1 2 5 D
2 3 6 D
Or if your 'D' column is initially empty, e.g.
>>> df
B C D
0 1 4
1 2 5
2 3 6
then the following works too:
header = list(df.columns)[-1]
df[header] = header
>>> df
B C D
0 1 4 D
1 2 5 D
2 3 6 D
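To make the header a parameter rather than a literal, one possible sketch wraps the assignment in a small helper. The function name fill_with_header is hypothetical, and the keyword-unpacking form assumes the header is a string.

```python
import pandas as pd

def fill_with_header(df, header):
    """Return a copy of df with a column `header` whose every value is `header`.

    Hypothetical helper; assumes `header` is a string so that it can be
    passed to DataFrame.assign as a keyword argument.
    """
    return df.assign(**{header: header})

df = pd.DataFrame({'B': [1, 2, 3], 'C': [4, 5, 6]})
out = fill_with_header(df, 'D')
print(out)
```

DataFrame.assign returns a new frame, so the original df is left unchanged; use df[header] = header to modify it in place.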

How can I add a column to a pandas DataFrame that uniquely identifies grouped data? [duplicate]

Given the following data frame:
import pandas as pd
import numpy as np
df=pd.DataFrame({'A':['A','A','A','B','B','B'],
'B':['a','a','b','a','a','a'],
})
df
A B
0 A a
1 A a
2 A b
3 B a
4 B a
5 B a
I'd like to create column 'C', which numbers the rows within each group in columns A and B like this:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
I've tried this so far:
df['C']=df.groupby(['A','B'])['B'].transform('rank')
...but it doesn't work!

Use groupby/cumcount:
In [25]: df['C'] = df.groupby(['A','B']).cumcount()+1; df
Out[25]:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3

Use the groupby.rank function.
Here is a working example.
df = pd.DataFrame({'C1':['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df
C1 C2
a 1
a 2
a 3
b 4
b 5
# rank returns floats; append .astype(int) if integer labels are needed
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df
C1 C2 RANK
a 1 1.0
a 2 2.0
a 3 3.0
b 4 1.0
b 5 2.0
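The two answers are not interchangeable: cumcount numbers rows by their position within the group, while rank numbers them by value. A small sketch with deliberately unsorted values (the column names by_position and by_value are made up for this illustration) shows the difference.

```python
import pandas as pd

df = pd.DataFrame({'C1': ['a', 'a', 'a', 'b', 'b'],
                   'C2': [3, 1, 2, 5, 4]})

# Position within each group, in row order (1-based).
df['by_position'] = df.groupby('C1').cumcount() + 1

# Rank of C2 within each group; rank returns floats, so cast to int.
df['by_value'] = df.groupby('C1')['C2'].rank(method='first').astype(int)

print(df)
```

When the ranked column is already sorted within each group, as in the example above, both columns come out identical.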

Cannot re-add column to pandas multi-index dataframe after deletion

It seems odd that after deleting a column, I cannot add it back with the same name. So I create a simple dataframe with multi labeled columns and add a new column with level0 name only, and then I delete it.
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3],[4,5,6]])
>>> df.columns=[['a','b','c'],['e','f','g']]
>>> print(df)
a b c
e f g
0 1 2 3
1 4 5 6
>>> df['d'] = df.c+2
>>> print(df)
a b c d
e f g
0 1 2 3 5
1 4 5 6 8
>>> del df['d']
>>> print(df)
a b c
e f g
0 1 2 3
1 4 5 6
Now I try to add it again, and it seems like it has no effect and no error or warning is shown.
>>> df['d'] = df.c+2
>>> print(df)
a b c
e f g
0 1 2 3
1 4 5 6
Is this expected behaviour? Should I file a bug report with the pandas project? There is no such issue if I add the 'd' column with both levels specified, like this:
df['d', 'x'] = df.c+2
Thanks,
PS: Python is 2.7.14 and pandas 0.20.1

The problem is that the MultiIndex levels are not removed after calling del:
del df['d']
print(df)
a b c
e f g
0 1 2 3
1 4 5 6
Check columns:
print (df.columns)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['e', 'f', 'g', '']],
labels=[[0, 1, 2], [0, 1, 2]])
The solution for removing them is MultiIndex.remove_unused_levels:
df.columns = df.columns.remove_unused_levels()
print (df.columns)
MultiIndex(levels=[['a', 'b', 'c'], ['e', 'f', 'g']],
labels=[[0, 1, 2], [0, 1, 2]])
df['d'] = df.c+2
print (df)
a b c d
e f g
0 1 2 3 5
1 4 5 6 8
Another solution is to assign to the MultiIndex directly; a tuple is needed to select a MultiIndex column:
df[('d', '')] = df.c+2
print (df)
a b c d
e f g
0 1 2 3 5
1 4 5 6 8
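The leftover levels can be observed on a bare MultiIndex, independently of the DataFrame assignment behaviour, which may differ between pandas versions. A minimal sketch, using slicing as a stand-in for the column deletion:

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [('a', 'e'), ('b', 'f'), ('c', 'g'), ('d', '')]
)

# Dropping the last entry by slicing keeps the original levels around.
trimmed = cols[:3]
print(list(trimmed.levels[0]))   # still contains 'd'

# remove_unused_levels rebuilds the levels from the labels actually in use.
cleaned = trimmed.remove_unused_levels()
print(list(cleaned.levels[0]))
print(list(cleaned.levels[1]))
```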

insert a list as row in a dataframe at a specific position

I have a list l=['a', 'b' ,'c']
and a dataframe with columns d,e,f and values are all numbers
How can I insert list l in my dataframe just below the columns.

Setup
df = pd.DataFrame(np.ones((2, 3), dtype=int), columns=list('def'))
l = list('abc')
df
d e f
0 1 1 1
1 1 1 1
Option 1
I'd accomplish this task by adding a level to the columns object
df.columns = pd.MultiIndex.from_tuples(list(zip(df.columns, l)))
df
d e f
a b c
0 1 1 1
1 1 1 1
Option 2
Use a dictionary comprehension passed to the dataframe constructor
pd.DataFrame({(i, j): df[i] for i, j in zip(df, l)})
d e f
a b c
0 1 1 1
1 1 1 1
But if you insist on putting it in the dataframe proper... (keep in mind, this turns the dataframe into dtype object and we lose significant computational efficiencies.)
Alternative 1
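# Note: DataFrame.append was removed in pandas 2.0; on newer versions use
# pd.concat([pd.DataFrame([l], columns=df.columns), df], ignore_index=True)
pd.DataFrame([l], columns=df.columns).append(df, ignore_index=True)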
d e f
0 a b c
1 1 1 1
2 1 1 1
Alternative 2
pd.DataFrame([l] + df.values.tolist(), columns=df.columns)
d e f
0 a b c
1 1 1 1
2 1 1 1
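The dtype cost mentioned above can be checked directly. This sketch rebuilds the setup and compares the column dtypes of the MultiIndex variant with those of the row-insertion variant; the names labeled and as_row are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3), dtype=int), columns=list('def'))
l = list('abc')

# Labels stored as a second column level: the data stays numeric.
labeled = df.copy()
labeled.columns = pd.MultiIndex.from_arrays([df.columns, l])

# Labels stored as a data row: every column now mixes strings and integers.
as_row = pd.DataFrame([l] + df.values.tolist(), columns=df.columns)

print(labeled.dtypes)
print(as_row.dtypes)
print(labeled.sum())  # numeric operations still work on the labeled frame
```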

Use pd.concat
In [1112]: df
Out[1112]:
d e f
0 0.517243 0.731847 0.259034
1 0.318821 0.551298 0.773115
2 0.194192 0.707525 0.804102
3 0.945842 0.614033 0.757389
In [1113]: pd.concat([pd.DataFrame([l], columns=df.columns), df], ignore_index=True)
Out[1113]:
d e f
0 a b c
1 0.517243 0.731847 0.259034
2 0.318821 0.551298 0.773115
3 0.194192 0.707525 0.804102
4 0.945842 0.614033 0.757389

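Are you looking for append? (Note that DataFrame.append was removed in pandas 2.0; pd.concat is the replacement there.) For example: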
df = pd.DataFrame([[1,2,3]],columns=list('def'))
I = ['a','b','c']
ndf = df.append(pd.Series(I,index=df.columns.tolist()),ignore_index=True)
Output:
d e f
0 1 2 3
1 a b c

If you want to add the list to the columns as a MultiIndex level:
df.columns = [df.columns, l]
print (df)
d e f
a b c
0 4 7 1
1 5 8 3
2 4 9 5
3 5 4 7
4 5 2 1
5 4 3 0
print (df.columns)
MultiIndex(levels=[['d', 'e', 'f'], ['a', 'b', 'c']],
labels=[[0, 1, 2], [0, 1, 2]])
If you want to add the list at a specific position pos:
pos = 0
df1 = pd.DataFrame([l], columns=df.columns)
print (df1)
d e f
0 a b c
df = pd.concat([df.iloc[:pos], df1, df.iloc[pos:]], ignore_index=True)
print (df)
d e f
0 a b c
1 4 7 1
2 5 8 3
3 4 9 5
4 5 4 7
5 5 2 1
6 4 3 0
But if you append this list to a numeric dataframe, you get mixed types (numbers with strings), so some pandas functions may fail.
Setup:
df = pd.DataFrame({'d':[4,5,4,5,5,4],
'e':[7,8,9,4,2,3],
'f':[1,3,5,7,1,0]})
print (df)
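The position-based insertion can be wrapped in a small helper. This is only a sketch: the function name insert_row is hypothetical, and it assumes a positional (iloc-style) pos and a default RangeIndex that may be rebuilt.

```python
import pandas as pd

def insert_row(df, row, pos):
    """Return a new frame with `row` inserted before positional index `pos`.

    Hypothetical helper; the index is reset, so existing index labels are lost.
    """
    new = pd.DataFrame([row], columns=df.columns)
    return pd.concat([df.iloc[:pos], new, df.iloc[pos:]], ignore_index=True)

df = pd.DataFrame({'d': [4, 5, 4], 'e': [7, 8, 9], 'f': [1, 3, 5]})
out = insert_row(df, ['a', 'b', 'c'], 1)
print(out)
```

As noted above, the resulting columns hold mixed strings and numbers, so numeric operations on them may no longer work.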
