I want to separate the values in the "alpha" column like this:
Start:
alpha  beta  gamma
A      1     0
A      1     1
B      1     0
B      1     1
B      1     0
C      1     1
End:
alpha  beta  gamma
A      1     0
A      1     1
X      X     X
B      1     0
B      1     1
B      1     0
X      X     X
C      1     1
Thanks for the help <3
You can try
out = (df.groupby('alpha')
         .apply(lambda g: pd.concat([g, pd.DataFrame([['X', 'X', 'X']],
                                                     columns=df.columns)]))
         .reset_index(drop=True)[:-1])
print(out)
alpha beta gamma
0 A 1 0
1 A 1 1
2 X X X
3 B 1 0
4 B 1 1
5 B 1 0
6 X X X
7 C 1 1
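A more explicit, loop-based sketch of the same idea (assuming df holds the example frame above): append a separator row after each group, then drop the trailing one.
sep = pd.DataFrame([['X', 'X', 'X']], columns=df.columns)  # separator row
parts = []
for _, g in df.groupby('alpha', sort=False):
    parts.append(g)
    parts.append(sep)
out = pd.concat(parts[:-1], ignore_index=True)  # drop the separator after the last group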
Assuming a range index as in the example, you can use:
# flag the last row of each group (except the very last group)
idx = df['alpha'].ne(df['alpha'].shift(-1).ffill())
df2 = pd.concat([df, df[idx].assign(**{c: 'X' for c in df})]).sort_index(kind='stable')
Or without groupby and sort_index:
idx = df['alpha'].ne(df['alpha'].shift(-1).ffill())
df2 = df.loc[df.index.repeat(idx+1)]
df2.loc[df2.index.duplicated()] = 'X'
output:
alpha beta gamma
0 A 1 0
1 A 1 1
1 X X X
2 B 1 0
3 B 1 1
4 B 1 0
4 X X X
5 C 1 1
NB. add reset_index(drop=True) to get a new index
You can do:
dfx = pd.DataFrame({'alpha':['X'],'beta':['X'],'gamma':['X']})
df = df.groupby('alpha',as_index=False).apply(lambda x:x.append(dfx)).reset_index(drop=True)
Output:
alpha beta gamma
0 A 1 0
1 A 1 1
2 X X X
3 B 1 0
4 B 1 1
5 B 1 0
6 X X X
7 C 1 1
8 X X X
To avoid adding an [X, X, X] row at the end, you can check the index first:
df.groupby('alpha', as_index=False).apply(
    lambda x: x.append(dfx) if x.index[-1] != df.index[-1] else x
).reset_index(drop=True)
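Note that DataFrame.append was removed in pandas 2.0; a minimal sketch of the same approach using pd.concat instead:
df.groupby('alpha', as_index=False).apply(
    lambda x: pd.concat([x, dfx]) if x.index[-1] != df.index[-1] else x
).reset_index(drop=True)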
I have two dfs.
df1 = pd.DataFrame(["bazzar","dogsss","zxvfzx","anythi"], columns = [0], index = [0,1,2,3])
df2 = pd.DataFrame(["baar","maar","cats","$%&*"], columns = [0], index = [0,1,2,3])
df1 = df1[0].apply(lambda x: pd.Series(list(x)))
df2 = df2[0].apply(lambda x: pd.Series(list(x)))
which look like
df1
0 1 2 3 4 5
0 b a z z a r
1 d o g s s s
2 z x v f z x
3 a n y t h i
df2
0 1 2 3
0 b a a r
1 m a a r
2 c a t s
3 $ % & *
I want to compare their first rows and make them identical by inserting new columns containing the character z into df2, so that df2 becomes:
0 1 2 3 4 5
0 b a z z a r
1 m a z z a r
2 c a z z t s
3 $ % z z & *
An additional example:
df3 = pd.DataFrame(["aazzbbzcc","bbbbbbbbb","ccccccccc","ddddddddd"], columns = [0], index = [0,1,2,3])
df4 = pd.DataFrame(["aabbcc","111111","222222","333333"], columns = [0], index = [0,1,2,3])
df3 = df3[0].apply(lambda x: pd.Series(list(x)))
df4 = df4[0].apply(lambda x: pd.Series(list(x)))
df3
0 1 2 3 4 5 6 7 8
0 a a z z b b z c c
1 b b b b b b b b b
2 c c c c c c c c c
3 d d d d d d d d d
df4
0 1 2 3 4 5
0 a a b b c c
1 1 1 1 1 1 1
2 2 2 2 2 2 2
3 3 3 3 3 3 3
You can see an important relationship between the first rows of the two dataframes: they become the same once the character z is inserted into the latter dataframe (i.e. df2 and df4) at the right positions, so the expected output for this example is:
0 1 2 3 4 5 6 7 8
0 a a z z b b z c c
1 1 1 z z 1 1 z 1 1
2 2 2 z z 2 2 z 2 2
3 3 3 z z 3 3 z 3 3
Any idea how to do that?
Because the first rows contain duplicated values, create a MultiIndex from the first row and GroupBy.cumcount for both DataFrames:
a = df1.iloc[[0]].T
df1.columns = [a[0], a.groupby(a[0]).cumcount()]
b = df2.iloc[[0]].T
df2.columns = [b[0], b.groupby(b[0]).cumcount()]
print (df1)
0  b  a  z     a  r
   0  0  0  1  1  0
0  b  a  z  z  a  r
1  d  o  g  s  s  s
2  z  x  v  f  z  x
3  a  n  y  t  h  i
print (df2)
0  b  a     r
   0  0  1  0
0  b  a  a  r
1  m  a  a  r
2  c  a  t  s
3  $  %  &  *
Then use DataFrame.reindex and fill the missing values from the first row of df1:
df = df2.reindex(df1.columns, axis=1).fillna(df1.iloc[0])
print (df)
0  b  a  z     a  r
   0  0  0  1  1  0
0  b  a  z  z  a  r
1  m  a  z  z  a  r
2  c  a  z  z  t  s
3  $  %  z  z  &  *
Finally, set a plain range as the columns:
df.columns = range(len(df.columns))
print (df)
0 1 2 3 4 5
0 b a z z a r
1 m a z z a r
2 c a z z t s
3 $ % z z & *
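Putting the steps together, a hypothetical helper (the name align_to_long is mine) that, starting from the original frames, labels both frames' columns by (first-row character, occurrence number), reindexes, and fills:
def align_to_long(df_long, df_short):
    df_long, df_short = df_long.copy(), df_short.copy()
    # label columns by (char, occurrence) pairs built from the first rows
    a = df_long.iloc[[0]].T
    df_long.columns = [a[0], a.groupby(a[0]).cumcount()]
    b = df_short.iloc[[0]].T
    df_short.columns = [b[0], b.groupby(b[0]).cumcount()]
    # align the short frame to the long one; holes come from df_long's first row
    out = df_short.reindex(df_long.columns, axis=1).fillna(df_long.iloc[0])
    out.columns = range(len(out.columns))
    return out

align_to_long(df1, df2)  # also works for align_to_long(df3, df4)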
Check where to add (difflib is in the standard library):
import difflib
list(difflib.ndiff(df2[0][0], df1[0][0]))
[' b', ' a', '+ z', '+ z', ' a', ' r']
Add manually:
df2[0].str.replace('(.){2}', '\\1zz', regex = True).str.split('(?<=\\S)(?=\\S)', expand = True)
Out[1557]:
0 1 2 3 4 5
0 a z z r z z
1 a z z r z z
2 a z z s z z
3 % z z * z z
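For a reusable version of the diff idea, a sketch (the helper name align_with_diff is mine) that walks the ndiff output and assembles the padded frame column by column:
import difflib
import pandas as pd

def align_with_diff(df_long, df_short, fill='z'):
    cols = []
    it = iter(df_short.columns)
    # diff the first rows character by character
    for d in difflib.ndiff(''.join(df_short.iloc[0]), ''.join(df_long.iloc[0])):
        if d.startswith('+ '):       # char only in df_long: insert a fill column
            cols.append(pd.Series(fill, index=df_short.index))
        elif d.startswith('  '):     # char in both: keep df_short's column
            cols.append(df_short[next(it)])
        elif d.startswith('- '):     # char only in df_short: drop its column
            next(it)
    return pd.concat(cols, axis=1, ignore_index=True)

align_with_diff(df1, df2)  # matches the expected output in the question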
I have the following dataframe:
import pandas as pd
import numpy as np
ds = pd.DataFrame({'z': np.random.binomial(n=1, p=0.5, size=10),
                   'x': np.random.binomial(n=1, p=0.5, size=10),
                   'u': np.random.binomial(n=1, p=0.5, size=10),
                   'y': np.random.binomial(n=1, p=0.5, size=10)})
ds
z x u y
0 0 1 0 0
1 0 1 1 1
2 1 1 1 1
3 0 0 1 1
4 0 0 1 1
5 0 0 0 0
6 1 0 1 1
7 0 1 1 1
8 1 1 0 0
9 0 1 1 1
How do I select rows that have the values (0,1) for variable names specified in a list?
This is what I have thus far:
zs = ['z','x']
tf = ds[ds[zs].values == (0,1)]
tf
Now that prints:
z x u y
0 0 1 0 0
0 0 1 0 0
1 0 1 1 1
1 0 1 1 1
2 1 1 1 1
3 0 0 1 1
4 0 0 1 1
5 0 0 0 0
7 0 1 1 1
7 0 1 1 1
8 1 1 0 0
9 0 1 1 1
9 0 1 1 1
This shows duplicates and also includes an incorrect row (row #2: 1,1,1,1). Any thoughts or ideas? Of course, I am assuming there is a pythonic way of doing this without nested loops and brute-forcing it.
You can use broadcasted numpy comparison:
df[(df[['z','x']].values == [0, 1]).all(1)]
z x u y
0 0 1 0 0
1 0 1 1 1
7 0 1 1 1
9 0 1 1 1
You can also use np.logical_and.reduce:
cols = ['z', 'x']
vals = [0, 1]
df[np.logical_and.reduce([df[c] == v for c, v in zip(cols, vals)])]
z x u y
0 0 1 0 0
1 0 1 1 1
7 0 1 1 1
9 0 1 1 1
Lastly, assuming your column names are compatible, dynamically generate query expression strings for use with query:
querystr = ' and '.join([f'{c} == {v!r}' for c, v in zip(cols, vals)])
df.query(querystr)
z x u y
0 0 1 0 0
1 0 1 1 1
7 0 1 1 1
9 0 1 1 1
Where {v!r} is the same as {repr(v)}.
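If a column name is not a valid Python identifier, pandas 0.25+ lets you quote it with backticks inside the query string (hypothetical column name below):
df.query('`z score` == 0 and x == 1')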
You can do:
cols = ['u','x']
bools = ds[cols].apply(lambda x: all(x == (0,1)), axis=1)
ds[bools]
u x y z
0 0 1 1 1
7 0 1 0 1
8 0 1 1 0
Using eq, very similar to coldspeed's numpy method:
df[df[zs].eq(pd.Series([0,1],index=zs),1).all(1)]
z x u y
0 0 1 0 0
1 0 1 1 1
7 0 1 1 1
9 0 1 1 1
A simpler way is to use boolean indexing:
f = ds['z'] == 0
g = ds['x'] == 1
ds[f & g]
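A sketch generalizing this to any list of columns and values (same cols/vals as above):
from functools import reduce

cols, vals = ['z', 'x'], [0, 1]
masks = [ds[c] == v for c, v in zip(cols, vals)]
ds[reduce(lambda m1, m2: m1 & m2, masks)]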
I want to change the shape of a dataframe from (x, y) to (1, x, y), (x, 1, y), or (x, y, 1). I know that in numpy I can do something like arr[np.newaxis, ...]; I wonder how I can achieve the same for a dataframe.
The pandas.Panel object is deprecated (and was removed in pandas 1.0). We use pandas.MultiIndex to handle higher-dimensional data.
Consider the data frame df
df = pd.DataFrame(1, list('abc'), list('xyz'))
df
x y z
a 1 1 1
b 1 1 1
c 1 1 1
Add Level
The following are various ways to add a level and dimensionality.
axis=0, level=0
pd.concat([df], keys=['A'])
x y z
A a 1 1 1
b 1 1 1
c 1 1 1
df.set_index(pd.MultiIndex.from_product([['B'], df.index]))
x y z
B a 1 1 1
b 1 1 1
c 1 1 1
axis=0, level=1
pd.concat([df], keys=['A']).swaplevel(0, 1)
x y z
a A 1 1 1
b A 1 1 1
c A 1 1 1
df.set_index(pd.MultiIndex.from_product([df.index, ['B']]))
x y z
a B 1 1 1
b B 1 1 1
c B 1 1 1
axis=1, level=0
pd.concat([df], axis=1, keys=['A'])
A
x y z
a 1 1 1
b 1 1 1
c 1 1 1
df.set_axis(pd.MultiIndex.from_product([['B'], df.columns]), axis=1)  # inplace= was removed in pandas 2.0
B
x y z
a 1 1 1
b 1 1 1
c 1 1 1
axis=1, level=1
pd.concat([df], axis=1, keys=['A']).swaplevel(0, 1, 1)
x y z
A A A
a 1 1 1
b 1 1 1
c 1 1 1
df.set_axis(pd.MultiIndex.from_product([df.columns, ['B']]), axis=1)
x y z
B B B
a 1 1 1
b 1 1 1
c 1 1 1
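If what you actually need is the extra axis on an array rather than a labeled pandas object, you can drop to NumPy as hinted in the question (a minimal sketch):
import numpy as np

arr = df.to_numpy()           # shape (3, 3)
arr3 = arr[np.newaxis, ...]   # shape (1, 3, 3); use arr[:, np.newaxis, :] for (3, 1, 3), etc.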
I have a dataframe (df_temp) which is like the following:
ID1 ID2
0 A X
1 A X
2 A Y
3 A Y
4 A Z
5 B L
6 B L
What I need is to add a column which shows the cumulative number of unique values of ID2 for each ID1, something like:
ID1 ID2 CumUniqueIDs
0 A X 1
1 A X 1
2 A Y 2
3 A Y 2
4 A Z 3
5 B L 1
6 B L 1
I've tried:
df_temp['CumUniqueIDs'] = df_temp.groupby(by=['ID1'])['ID2'].nunique().cumsum() + 1
But this simply fills CumUniqueIDs with NaN.
Not sure what I'm doing wrong here! Any help much appreciated!
You can use groupby() + transform() + factorize():
In [12]: df['CumUniqueIDs'] = df.groupby('ID1')['ID2'].transform(lambda x: pd.factorize(x)[0]+1)
In [13]: df
Out[13]:
ID1 ID2 CumUniqueIDs
0 A X 1
1 A X 1
2 A Y 2
3 A Y 2
4 A Z 3
5 B L 1
6 B L 1
By using category:
df.groupby(['ID1']).ID2.apply(lambda x : x.astype('category').cat.codes.add(1))
Out[551]:
0 1
1 1
2 2
3 2
4 3
5 1
6 1
Name: ID2, dtype: int8
Then assign it back. (Note: cat.codes numbers categories in sorted order, not order of first appearance, so this matches factorize only when each group's values happen to first appear in sorted order, as they do here.)
df['CumUniqueIDs']=df.groupby(['ID1']).ID2.apply(lambda x : x.astype('category').cat.codes.add(1))
df
Out[553]:
ID1 ID2 CumUniqueIDs
0 A X 1
1 A X 1
2 A Y 2
3 A Y 2
4 A Z 3
5 B L 1
6 B L 1
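A hypothetical apply-free alternative: flag the first occurrence of each (ID1, ID2) pair and take a cumulative sum within each ID1 group:
# 1 on the first occurrence of each (ID1, ID2) pair, 0 on repeats
first = (~df.duplicated(['ID1', 'ID2'])).astype(int)
df['CumUniqueIDs'] = first.groupby(df['ID1']).cumsum()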
I would like to transform the below pandas dataframe:
dd = pd.DataFrame({ "zz":[1,3], "y": ["a","b"], "x": [[1,2],[1]]})
x y z
0 [1, 2] a 1
1 [1] b 3
into :
x y z
0 1 a 1
1 1 b 3
2 2 a 1
As you can see, the first row is elaborated in columns X into its individual elements while repeating the other columns y, z. Can I do this without using a for loop?
Use:
#get lengths of lists
l = dd['x'].str.len()
df = dd.loc[dd.index.repeat(l)].assign(x=np.concatenate(dd['x'])).reset_index(drop=True)
print (df)
x y zz
0 1 a 1
1 2 a 1
2 1 b 3
But if order is important:
df1 = (pd.DataFrame(dd['x'].values.tolist())
         .stack()
         .sort_index(level=[1, 0])
         .reset_index(name='x'))
print (df1)
level_0 level_1 x
0 0 0 1.0
1 1 0 1.0
2 0 1 2.0
df = df1.join(dd.drop('x', axis=1), on='level_0').drop(['level_0', 'level_1'], axis=1)
print (df)
x y zz
0 1.0 a 1
1 1.0 b 3
2 2.0 a 1
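On pandas 0.25+ there is also a built-in DataFrame.explode that does this in one step (row order, like the first solution; ignore_index needs pandas 1.1+):
dd.explode('x', ignore_index=True)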
Using join and stack, you can:
In [655]: dd.drop('x', axis=1).join(
              dd.apply(lambda x: pd.Series(x.x), axis=1)
                .stack().reset_index(level=1, drop=True).to_frame('x'))
Out[655]:
y z x
0 a 1 1.0
0 a 1 2.0
1 b 3 1.0
Details
In [656]: dd.apply(lambda x: pd.Series(x.x), axis=1).stack().reset_index(level=1,drop=True)
Out[656]:
0 1.0
0 2.0
1 1.0
dtype: float64
In [657]: dd
Out[657]:
x y z
0 [1, 2] a 1
1 [1] b 3
new_dd = pd.DataFrame(dd.apply(lambda x: pd.Series(x['x']),axis=1).stack().reset_index(level=1, drop=True))
new_dd.columns = ['x']
new_dd.merge(dd[['y','zz']], left_index=True, right_index=True)
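For reference, with the example dd this should give:
     x  y  zz
0  1.0  a   1
0  2.0  a   1
1  1.0  b   3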