How to embed a Series into specific rows in a Dataframe?

For instance, I start with this data frame df:
df = pd.DataFrame()
df['A'] = pd.Series([1, 1, 2, 2, 1, 3, 3])
df['B'] = pd.Series(['a', 'b', 'c', 'd', 'e', 'f', 'g'])
df
# A B
1 a
1 b
2 c
2 d
1 e
3 f
3 g
Now I'd like to replace the values of column B, in the rows where column A equals 1, with the list [0, 1, 2]. Here is what I expect after embedding:
df
# A B
1 0
1 1
2 c
2 d
1 2
3 f
3 g
How can I achieve this?

df.loc[df['A']==1, 'B'] = [0, 1, 2]
print(df)
Prints:
A B
0 1 0
1 1 1
2 2 c
3 2 d
4 1 2
5 3 f
6 3 g
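Since the question title mentions a Series rather than a list, here is a minimal sketch of that variant. It assumes the Series has exactly as many values as there are matching rows; the names new_values and mask are made up for illustration. Assigning a Series through .loc aligns on index labels, so the sketch converts it to a plain array first so that the values are placed by position.

```python
import pandas as pd

# Column B is created with object dtype so it can hold both strings and integers.
df = pd.DataFrame({
    'A': [1, 1, 2, 2, 1, 3, 3],
    'B': pd.Series(['a', 'b', 'c', 'd', 'e', 'f', 'g'], dtype=object),
})

# Hypothetical replacement values; the length must equal the number of rows where A == 1.
new_values = pd.Series([0, 1, 2])

mask = df['A'] == 1

# to_numpy() drops the Series index, so the values are written in positional order
# instead of being aligned with the row labels 0, 1 and 4.
df.loc[mask, 'B'] = new_values.to_numpy()

print(df)
```

If the number of values does not match the number of selected rows, pandas raises an error instead of silently truncating or repeating them.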

Related

Join two columns of sequentially values

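I have a dataframe where column A holds the messages and column B marks the sender: 1 is the client, 2 is the admin.
I need to join each run of consecutive client messages (B equal to 1) into one string, and likewise join each run of consecutive admin responses (B equal to 2), across the whole dataframe.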
df1 = pd.DataFrame({'A' : ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'j', 'de', 'be'],
'B' : [1, 1, 2, 1, 1, 1, 2, 2, 1, 2]})
df1
A B
0 a 1
1 b 1
2 c 2
3 d 1
4 e 1
5 f 1
6 h 2
7 j 2
8 de 1
9 be 2
I need to get in the end this dataframe:
df2 = pd.DataFrame({'A' : ['a,b', 'd,e,f', 'de'],
'B' : ['c', 'h,j', 'be']})
Out:
A B
0 a,b c
1 d,e,f h,j
2 de be
I do not know how to do this.

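Create groups from the runs of consecutive equal values in B. The trick is to compare B with its shifted values and take the cumulative sum of the result, which gives each run its own label. Then aggregate each group with first for B and join for A. Finally, create a helper column C that numbers each client/admin pair, so the next step can pivot on it with DataFrame.pivot.
The solution works if the 1, 2 pairs appear in sequential order; repeated values within a run are fine.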
df = (df1.groupby(df1['B'].ne(df1['B'].shift()).cumsum())
.agg(B = ('B','first'), A= ('A', ','.join))
.assign(C = lambda x: x['B'].eq(1).cumsum()))
print (df)
B A C
B
1 1 a,b 1
2 2 c 1
3 1 d,e,f 2
4 2 h,j 2
5 1 de 3
6 2 be 3
df = (df.pivot(index='C', columns='B', values='A')
.rename(columns={1:'A',2:'B'})
.reset_index(drop=True).rename_axis(None, axis=1))
print (df)
A B
0 a,b c
1 d,e,f h,j
2 de be
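The two steps above can be put together as one runnable sketch. The intermediate names run_id, runs and out are illustrative and not from the answer. The grouping key is renamed to run (an assumption made here) so that the index of the intermediate frame does not share the name B with a column, and pivot is called with keyword arguments because newer pandas versions do not accept them positionally.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'j', 'de', 'be'],
                    'B': [1, 1, 2, 1, 1, 1, 2, 2, 1, 2]})

# A new run starts wherever B differs from the previous row;
# the cumulative sum of those change flags labels each run.
run_id = df1['B'].ne(df1['B'].shift()).cumsum().rename('run')

# One row per run: keep the sender and join the messages.
# C counts the client runs, so each client run and the admin run after it share a number.
runs = (df1.groupby(run_id)
           .agg(B=('B', 'first'), A=('A', ','.join))
           .assign(C=lambda x: x['B'].eq(1).cumsum()))

# Spread the sender values 1 and 2 into columns, one row per pair.
out = (runs.pivot(index='C', columns='B', values='A')
           .rename(columns={1: 'A', 2: 'B'})
           .reset_index(drop=True)
           .rename_axis(None, axis=1))

print(out)
```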

Cannot re-add column to pandas multi-index dataframe after deletion

It seems odd that after deleting a column, I cannot add it back with the same name. I create a simple dataframe with multi-level column labels, add a new column giving only its level-0 name, and then delete it.
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3],[4,5,6]])
>>> df.columns=[['a','b','c'],['e','f','g']]
>>> print(df)
a b c
e f g
0 1 2 3
1 4 5 6
>>> df['d'] = df.c+2
>>> print(df)
a b c d
e f g
0 1 2 3 5
1 4 5 6 8
>>> del df['d']
>>> print(df)
a b c
e f g
0 1 2 3
1 4 5 6
Now I try to add it again, and it seems like it has no effect and no error or warning is shown.
>>> df['d'] = df.c+2
>>> print(df)
a b c
e f g
0 1 2 3
1 4 5 6
Is this expected behaviour? Should I file a bug report with the pandas project? There is no such issue if I add the 'd' column with both levels specified, like this:
df['d', 'x'] = df.c+2
Thanks,
PS: Python is 2.7.14 and pandas 0.20.1

The problem is that the MultiIndex levels are not removed after calling del:
del df['d']
print(df)
a b c
e f g
0 1 2 3
1 4 5 6
Check columns:
print (df.columns)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['e', 'f', 'g', '']],
labels=[[0, 1, 2], [0, 1, 2]])
The solution is to remove them with MultiIndex.remove_unused_levels:
df.columns = df.columns.remove_unused_levels()
print (df.columns)
MultiIndex(levels=[['a', 'b', 'c'], ['e', 'f', 'g']],
labels=[[0, 1, 2], [0, 1, 2]])
df['d'] = df.c+2
print (df)
a b c d
e f g
0 1 2 3 5
1 4 5 6 8
Another solution is to assign to the full MultiIndex key; a tuple is needed to select a MultiIndex column:
df[('d', '')] = df.c+2
print (df)
a b c d
e f g
0 1 2 3 5
1 4 5 6 8
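The leftover-levels effect can be seen in isolation with a small sketch. It does not reproduce the del case from the question, whose behaviour may differ between pandas versions; instead it selects a subset of columns, which is a documented situation where a MultiIndex keeps levels that are no longer used. The names sub, stale, trimmed and fresh are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
df.columns = pd.MultiIndex.from_arrays([['a', 'b', 'c'], ['e', 'f', 'g']])

# Keep only the columns under 'a' and 'b'.
sub = df[['a', 'b']]

# The level values still list 'c', although no column uses it any more.
stale = list(sub.columns.levels[0])

# remove_unused_levels() returns a new MultiIndex with only the values in use.
trimmed = sub.columns.remove_unused_levels()
fresh = list(trimmed.levels[0])

print(stale)
print(fresh)
```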

in Pandas, how to create a variable that is n for the nth observation within a group?

Consider this:
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6,2]})
df
Out[128]:
B C
0 a 1
1 a 2
2 b 6
3 b 2
I want to create a variable that simply corresponds to the ordering of observations after sorting by 'C' within each groupby('B') group.
df.sort_values(['B','C'])
Out[129]:
B C order
0 a 1 1
1 a 2 2
3 b 2 1
2 b 6 2
How can I do that? I am thinking about creating a column that is one, and using cumsum but that seems too clunky...

I think you can use range with len(df):
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3],
'B': ['a', 'a', 'b'],
'C': [5, 3, 2]})
print(df)
A B C
0 1 a 5
1 2 a 3
2 3 b 2
df.sort_values(by='C', inplace=True)
#or without inplace
#df = df.sort_values(by='C')
print(df)
A B C
2 3 b 2
1 2 a 3
0 1 a 5
df['order'] = range(1,len(df)+1)
print(df)
A B C order
2 3 b 2 1
1 2 a 3 2
0 1 a 5 3
EDIT by comment:
I think you can use groupby with cumcount:
import pandas as pd
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6,2]})
df.sort_values(['B','C'], inplace=True)
#or without inplace
#df = df.sort_values(['B','C'])
print(df)
B C
0 a 1
1 a 2
3 b 2
2 b 6
df['order'] = df.groupby('B', sort=False).cumcount() + 1
print(df)
B C order
0 a 1 1
1 a 2 2
3 b 2 1
2 b 6 2

Nothing wrong with Jezrael's answer but there's a simpler (though less general) method in this particular example. Just add groupby to JohnGalt's suggestion of using rank.
>>> df['order'] = df.groupby('B')['C'].rank()
B C order
0 a 1 1.0
1 a 2 2.0
2 b 6 2.0
3 b 2 1.0
In this case, you don't really need the ['C'] but it makes the ranking a little more explicit and if you had other unrelated columns in the dataframe then you would need it.
But if you are ranking by more than 1 column, you should use Jezrael's method.
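One detail of the rank approach: by default rank returns floats and gives tied values their average rank. A small variation, offered here as a sketch rather than as part of either answer, passes method='first' so ties are numbered in order of appearance, and converts the result to integers to match the cumcount output.

```python
import pandas as pd

df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6, 2]})

# method='first' breaks ties by order of appearance, so every row in a group
# gets a distinct rank; astype(int) is safe here because C has no missing values.
df['order'] = df.groupby('B')['C'].rank(method='first').astype(int)

print(df)
```

Unlike the cumcount version, this does not require sorting the frame first; the rows keep their original order.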

How to save a file to a csv file after groupby().size()

Data after groupby(['Id', 'event']).size():
1 A 3
B 1
C 6
2 A 3
B 1
data.to_csv('data.csv', index=False) does not give the format I need: data.csv ends up with only the count column.
What I need in data.csv is
1 A 3
1 B 1
1 C 6
2 A 3
2 B 1
Any idea?

df.to_csv(index=False) omits the index from the output. Use index=True, and header=False (to omit the column names):
import pandas as pd
df = pd.DataFrame({0: [1, 1, 1, 2, 2], 1: ['A', 'B', 'C', 'A', 'B'], 2: [3, 1, 6, 3, 1]})
df = df.set_index([0,1])
df.to_csv('data.csv', index=True, header=False, sep='\t')
This writes the following to data.csv:
1 A 3
1 B 1
1 C 6
2 A 3
2 B 1
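When starting from the groupby itself, another option (an assumption added here, not part of the answer above) is to turn the size() result into ordinary columns with reset_index before writing. The sample data and the column name count are made up for illustration, and the sketch writes to an in-memory buffer instead of data.csv so that it leaves no file behind.

```python
import io

import pandas as pd

# Hypothetical raw data: one row per event occurrence.
data = pd.DataFrame({'Id': [1, 1, 1, 2],
                     'event': ['A', 'A', 'B', 'A']})

# size() returns a Series indexed by (Id, event);
# reset_index moves both index levels into columns and names the counts.
counts = data.groupby(['Id', 'event']).size().reset_index(name='count')

# With Id and event as regular columns, index=False no longer loses them.
buf = io.StringIO()
counts.to_csv(buf, index=False)
csv_text = buf.getvalue()

print(csv_text)
```

Passing a file path such as 'data.csv' in place of buf writes the same content to disk.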

Repeat data frame, with varying column value

I have the following data frame and need to repeat the whole frame once for each value in a set of groups. That is, given
import numpy as np
import pandas as pd
test3 = pd.DataFrame(data={'x': [1, 2, 3, 4, np.nan], 'y': ['a', 'a', 'a', 'b', 'b']})
test3
x y
0 1 a
1 2 a
2 3 a
3 4 b
4 NaN b
I need to do something like this, but more performant:
test3['group'] = np.nan
groups = ['a', 'b']
dfs = []
# Copy the whole frame once per group and label each copy.
for group in groups:
    temp = test3.copy()
    temp['group'] = group
    dfs.append(temp)
pd.concat(dfs)
That is, the expected output is:
x y group
0 1 a a
1 2 a a
2 3 a a
3 4 b a
4 NaN b a
0 1 a b
1 2 a b
2 3 a b
3 4 b b
4 NaN b b
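No answer is recorded for this question. One possible approach, given only as a sketch, is to avoid the Python-level loop by repeating row positions with np.tile and the group labels with np.repeat. The names n and out are illustrative, and whether this is actually faster on real data would need to be measured.

```python
import numpy as np
import pandas as pd

test3 = pd.DataFrame({'x': [1, 2, 3, 4, np.nan],
                      'y': ['a', 'a', 'a', 'b', 'b']})
groups = ['a', 'b']
n = len(test3)

# np.tile repeats the positions 0..n-1 once per group, so the frame is stacked
# on top of itself; copy() makes the result safe to modify.
out = test3.iloc[np.tile(np.arange(n), len(groups))].copy()

# np.repeat emits each group label n times in a row, matching the stacked blocks.
out['group'] = np.repeat(groups, n)

print(out)
```

The original index values are kept, so the result has the same repeated 0 to 4 index as the expected output above.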
