Reverse a Cross Tabulation or Frequency Table - python

Suppose I have a frequency table df defined as:
dat = [[0, 2, 1], [1, 0, 3], [4, 1, 1]]
idx = pd.Index([*'abc'], name='One')
col = pd.Index([*'xyz'], name='Two')
df = pd.DataFrame(dat, idx, col)
df
Two x y z
One
a 0 2 1
b 1 0 3
c 4 1 1
How do I "invert" this to get a dataframe pre_df
One Two
0 a y
1 a y
2 a z
3 b x
4 b z
5 b z
6 b z
7 c x
8 c x
9 c x
10 c x
11 c y
12 c z
Such that pd.crosstab(pre_df.One, pre_df.Two) would get me back to df
Two x y z
One
a 0 2 1
b 1 0 3
c 4 1 1

Try stack and repeat:
s = df.stack()
s.index.repeat(s).to_frame().reset_index(drop=True)
Output:
One Two
0 a y
1 a y
2 a z
3 b x
4 b z
5 b z
6 b z
7 c x
8 c x
9 c x
10 c x
11 c y
12 c z
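As a quick check, here is a minimal, self-contained sketch of the round trip using the data from the question. The name `roundtrip` is only illustrative, and `to_frame(index=False)` is used here as an alternative to `reset_index(drop=True)`.

```python
import pandas as pd

dat = [[0, 2, 1], [1, 0, 3], [4, 1, 1]]
idx = pd.Index([*'abc'], name='One')
col = pd.Index([*'xyz'], name='Two')
df = pd.DataFrame(dat, idx, col)

# Long format: one (One, Two) pair per cell, with the count as the value.
s = df.stack()

# Repeat each index pair as many times as its count; zero counts vanish.
pre_df = s.index.repeat(s).to_frame(index=False)

# Rebuild the frequency table from the expanded rows.
roundtrip = pd.crosstab(pre_df.One, pre_df.Two)
print(roundtrip)
```

This relies on every row and column label having at least one non-zero count; a label whose counts are all zero would not reappear in the crosstab.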

Related

Separating values between rows with pandas

I want to separate values in "alpha" column like this
Start:
alpha
beta
gamma
A
1
0
A
1
1
B
1
0
B
1
1
B
1
0
C
1
1
End:
alpha
beta
gamma
A
1
0
A
1
1
X
X
X
B
1
0
B
1
1
B
1
0
X
X
X
C
1
1
Thanks for help <3
You can try
out = (df.groupby('alpha')
.apply(lambda g: pd.concat([g, pd.DataFrame([['X', 'X', 'X']], columns=df.columns)]))
.reset_index(drop=True)[:-1])
print(out)
alpha beta gamma
0 A 1 0
1 A 1 1
2 X X X
3 B 1 0
4 B 1 1
5 B 1 0
6 X X X
7 C 1 1
Assuming a range index as in the example, you can use:
# get indices in between 2 groups
idx = df['alpha'].ne(df['alpha'].shift(-1).ffill())
df2 = pd.concat([df, df[idx].assign(**{c: 'X' for c in df})]).sort_index(kind='stable')
Or without concat and sort_index:
idx = df['alpha'].ne(df['alpha'].shift(-1).ffill())
df2 = df.loc[df.index.repeat(idx+1)]
df2.loc[df2.index.duplicated()] = 'X'
output:
alpha beta gamma
0 A 1 0
1 A 1 1
1 X X X
2 B 1 0
3 B 1 1
4 B 1 0
4 X X X
5 C 1 1
NB. add reset_index(drop=True) to get a new index
You can do:
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
dfx = pd.DataFrame({'alpha':['X'],'beta':['X'],'gamma':['X']})
df = df.groupby('alpha',as_index=False).apply(lambda x: pd.concat([x, dfx])).reset_index(drop=True)
Output:
alpha beta gamma
0 A 1 0
1 A 1 1
2 X X X
3 B 1 0
4 B 1 1
5 B 1 0
6 X X X
7 C 1 1
8 X X X
To avoid adding an [X, X, X] row at the end, you can check the index first (run this on the original df, with pd.concat in place of the removed DataFrame.append):
df.groupby('alpha',as_index=False).apply(
lambda x: pd.concat([x, dfx])
if x.index[-1] != df.index[-1] else x).reset_index(drop=True)
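The same idea can also be sketched without `groupby.apply`, by collecting the groups and a separator row in a list and concatenating once. This is only a sketch: the names `sep` and `pieces` are illustrative, and it assumes that equal `alpha` values should form one block.

```python
import pandas as pd

df = pd.DataFrame({
    'alpha': list('AABBBC'),
    'beta': [1, 1, 1, 1, 1, 1],
    'gamma': [0, 1, 0, 1, 0, 1],
})

# One-row separator with the same columns as df.
sep = pd.DataFrame([['X'] * df.shape[1]], columns=df.columns)

# Collect each group followed by a separator, in order of first appearance.
pieces = []
for _, group in df.groupby('alpha', sort=False):
    pieces.extend([group, sep])

# Drop the trailing separator, then glue the pieces together.
out = pd.concat(pieces[:-1], ignore_index=True)
print(out)
```

Slicing off the last element of `pieces` avoids the trailing separator without having to inspect the index.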

pandas compare first rows and make identical

I have two dfs.
df1 = pd.DataFrame(["bazzar","dogsss","zxvfzx","anythi"], columns = [0], index = [0,1,2,3])
df2 = pd.DataFrame(["baar","maar","cats","$%&*"], columns = [0], index = [0,1,2,3])
df1 = df1[0].apply(lambda x: pd.Series(list(x)))
df2 = df2[0].apply(lambda x: pd.Series(list(x)))
which look like
df1
0 1 2 3 4 5
0 b a z z a r
1 d o g s s s
2 z x v f z x
3 a n y t h i
df2
0 1 2 3
0 b a a r
1 m a a r
2 c a t s
3 $ % & *
I want to compare their first rows and make them identical by inserting new columns containing the character z to df2, so that df2 becomes
0 1 2 3 4 5
0 b a z z a r
1 m a z z a r
2 c a z z t s
3 $ % z z & *
An additional example:
df3 = pd.DataFrame(["aazzbbzcc","bbbbbbbbb","ccccccccc","ddddddddd"], columns = [0], index = [0,1,2,3])
df4 = pd.DataFrame(["aabbcc","111111","222222","333333"], columns = [0], index = [0,1,2,3])
df3 = df3[0].apply(lambda x: pd.Series(list(x)))
df4 = df4[0].apply(lambda x: pd.Series(list(x)))
df3
0 1 2 3 4 5 6 7 8
0 a a z z b b z c c
1 b b b b b b b b b
2 c c c c c c c c c
3 d d d d d d d d d
df4
0 1 2 3 4 5
0 a a b b c c
1 1 1 1 1 1 1
2 2 2 2 2 2 2
3 3 3 3 3 3 3
You can see an important relationship between the first rows of the two dataframes: they become identical once the character z is inserted into the latter dataframe (i.e. df2 and df4), so the expected output for this example is:
0 1 2 3 4 5 6 7 8
0 a a z z b b z c c
1 1 1 z z 1 1 z 1 1
2 2 2 z z 2 2 z 2 2
3 3 3 z z 3 3 z 3 3
Any idea how to do that?
Because the first rows contain duplicated values, create a MultiIndex for the columns of both DataFrames from the first row combined with a GroupBy.cumcount counter:
a = df1.iloc[[0]].T
df1.columns = [a[0], a.groupby(a[0]).cumcount()]
b = df2.iloc[[0]].T
df2.columns = [b[0], b.groupby(b[0]).cumcount()]
print (df1)
0 b a z a r
0 0 0 1 1 0
0 b a z z a r
1 d o g s s s
2 z x v f z x
3 a n y t h i
print (df2)
0 b a r
0 0 1 0
0 b a a r
1 m a a r
2 c a t s
3 $ % & *
Then use DataFrame.reindex and replace the missing values with the first row of df1:
df = df2.reindex(df1.columns, axis=1).fillna(df1.iloc[0])
print (df)
0 b a z a r
0 0 0 1 1 0
0 b a z z a r
1 m a z z a r
2 c a z z t s
3 $ % z z & *
Finally, set the columns to a range:
df.columns = range(len(df.columns))
print (df)
0 1 2 3 4 5
0 b a z z a r
1 m a z z a r
2 c a z z t s
3 $ % z z & *
Check where to add (this needs import difflib and the original single-column string frames):
list(difflib.ndiff(df2[0][0], df1[0][0]))
['  b', '  a', '+ z', '+ z', '  a', '  r']
Add manually (insert zz after the first two characters, then split into one character per column):
df2[0].str.replace('^(.{2})', '\\1zz', regex = True).str.split('(?<=\\S)(?=\\S)', expand = True)
Out[1557]:
0 1 2 3 4 5
0 b a z z a r
1 m a z z a r
2 c a z z t s
3 $ % z z & *
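The manual step could be automated along the following lines. This is a sketch rather than the answerer's method: it uses `difflib.SequenceMatcher` on the two first rows, assumes the first row of df2 can be turned into the first row of df1 by insertions only, and fills each inserted column with the corresponding character from df1's first row. The names `source`, `target` and `positions` are illustrative.

```python
import difflib
import pandas as pd

# Original single-column string frames from the question.
df1 = pd.DataFrame(["bazzar", "dogsss", "zxvfzx", "anythi"])
df2 = pd.DataFrame(["baar", "maar", "cats", "$%&*"])

target, source = df1[0][0], df2[0][0]

# For every position in target, store the matching position in source,
# or None where a character has to be inserted.
positions = []
matcher = difflib.SequenceMatcher(None, source, target)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == 'equal':
        positions.extend(range(i1, i2))
    elif tag == 'insert':
        positions.extend([None] * (j2 - j1))
    else:
        raise ValueError('first rows differ by more than insertions')

# Apply the same layout to every row of df2.
rows = [
    [target[k] if p is None else s[p] for k, p in enumerate(positions)]
    for s in df2[0]
]
out = pd.DataFrame(rows)
print(out)
```

Raising on `replace` or `delete` opcodes keeps the sketch from silently producing a misaligned result when the assumption does not hold.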

Index match with python

I have two dfs
df1
Len Bar
x a
y a
z a
x b
y b
z b
x c
y c
z c
df2
Len/Bar a b c
x 4 2 8
y 2 7 7
z 6 3 9
Need output to be
Len Bar Amount
x a 4
y a 2
z a 6
x b 2
y b 7
z b 3
x c 8
y c 7
z c 9
In excel I use index match formula =INDEX($B$2:$D$4,MATCH(A19,$A$2:$A$4,0),MATCH(B19,$B$1:$D$1,0))
But is there any way to do the same using map or merge
I think you first need to reshape df2 and then left-merge it with df1:
df2 =df2.set_index('Len/Bar').unstack().rename_axis(('Bar','Len')).reset_index(name='Amount')
df2 = df1.merge(df2, how='left', on=['Len', 'Bar'])
print (df2)
Len Bar Amount
0 x a 4
1 y a 2
2 z a 6
3 x b 2
4 y b 7
5 z b 3
6 x c 8
7 y c 7
8 z c 9
Another solution:
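# stack() puts the row labels (Len) in the first index level and the column labels (Bar) in the second
df2 = df2.set_index('Len/Bar').stack().rename_axis(('Len','Bar')).rename('Amount')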
df2 = df1.join(df2, on=['Len', 'Bar'])
print (df2)
Len Bar Amount
0 x a 4
1 y a 2
2 z a 6
3 x b 2
4 y b 7
5 z b 3
6 x c 8
7 y c 7
8 z c 9
EDIT:
Whether you need merge/join at all depends on whether the reshaped df2 should be filtered by df1 or not.
See difference:
#removed some rows
print (df1)
Len Bar
0 x a
1 y a
2 z a
3 x b
4 y b
print (df2)
Bar Len Amount
0 a x 4
1 a y 2
2 a z 6
3 b x 2
4 b y 7
5 b z 3
6 c x 8
7 c y 7
8 c z 9
After the merge, only the Len and Bar combinations present in df1 remain:
print (df3)
Len Bar Amount
0 x a 4
1 y a 2
2 z a 6
3 x b 2
4 y b 7
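For comparison, the reshape step can also be written with `melt`, which may read closer to the Excel lookup. This is a minimal sketch that rebuilds the example frames by hand; the name `long` is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Len': list('xyz') * 3, 'Bar': list('aaabbbccc')})
df2 = pd.DataFrame({'Len/Bar': list('xyz'),
                    'a': [4, 2, 6], 'b': [2, 7, 3], 'c': [8, 7, 9]})

# Wide to long: one row per (Len, Bar) pair with its Amount.
long = (df2.rename(columns={'Len/Bar': 'Len'})
           .melt(id_vars='Len', var_name='Bar', value_name='Amount'))

# A left merge keeps df1's rows and their order, one lookup per row.
out = df1.merge(long, how='left', on=['Len', 'Bar'])
print(out)
```

Pairs in df1 that have no match in df2 would get NaN in `Amount`, much like a failed MATCH in the spreadsheet.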
Incidentally, you do not seem to need df1 at all:
df3 = df2.set_index('Len/Bar').stack().reset_index()
df3.columns = "Len", "Bar", "Amount"
# Len Bar Amount
#0 x a 4
#1 x b 2
#2 x c 8
#3 y a 2
#4 y b 7
#5 y c 7
#6 z a 6
#7 z b 3
#8 z c 9
Unless you want to borrow the column names from it:
df3.columns = [*df1.columns, "Amount"]

Pandas self-join on non-unique values

I have the following table:
ind_ID pair_ID orig_data
0 A 1 W
1 B 1 X
2 C 2 Y
3 D 2 Z
4 A 3 W
5 C 3 X
6 B 4 Y
7 D 4 Z
Each row has an individual_ID, and a pair_ID that it shares with exactly one other row. I want to do a self join, so that every row has its original data, and the data of the row it shares a pair_ID with:
ind_ID pair_ID orig_data partner_data
0 A 1 W X
1 B 1 X W
2 C 2 Y Z
3 D 2 Z Y
4 A 3 W X
5 C 3 X W
6 B 4 Y Z
7 D 4 Z Y
I have tried:
df.join(df, on='pair_ID')
But obviously since pair_ID values are not unique I get:
ind_ID pair_ID orig_data partner_data
0 A 1 W NaN
1 B 1 X NaN
2 C 2 Y NaN
3 D 2 Z NaN
4 A 3 W NaN
5 C 3 X NaN
6 B 4 Y NaN
7 D 4 Z NaN
I've also thought about creating a new column that concatenates ind_ID+pair_ID which would be unique, but then the join would not know what to match on.
Is it possible to do a self-join on pair_ID where each row is joined with the matching row that is not itself?
In your case (with exactly two rows per pair_ID) you can probably just groupby and transform on pair_ID, reversing the order of the values within each group. Returning a plain array means the reversed values are assigned by position instead of being matched back on the index:
df['partner_data'] = df.groupby('pair_ID').orig_data.transform(lambda s: s.to_numpy()[::-1])
Which gives you:
ind_ID pair_ID orig_data partner_data
0 A 1 W X
1 B 1 X W
2 C 2 Y Z
3 D 2 Z Y
4 A 3 W X
5 C 3 X W
6 B 4 Y Z
7 D 4 Z Y
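If an explicit self-join is preferred, a merge on pair_ID followed by dropping the self-matches might look like the sketch below. It assumes the two rows of a pair always have different ind_ID values, as in the example; the helper names `left`, `right` and `partner_ID` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ind_ID': list('ABCDACBD'),
    'pair_ID': [1, 1, 2, 2, 3, 3, 4, 4],
    'orig_data': list('WXYZWXYZ'),
})

# Keep the original row position so the result can be restored to it.
left = df.reset_index()
right = df.rename(columns={'ind_ID': 'partner_ID',
                           'orig_data': 'partner_data'})

# Pair every row with all rows sharing its pair_ID, then drop self-matches.
merged = left.merge(right, on='pair_ID')
merged = merged[merged['ind_ID'] != merged['partner_ID']]

out = (merged.sort_values('index')
             .set_index('index')
             .rename_axis(None)
             .drop(columns='partner_ID'))
print(out)
```

Unlike the reversal trick, this also extends to groups with more than two rows, where each row would be repeated once per partner.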

Get first version of a line with duplicate values versus one column

Hello I'm looking for a way to get from this dataframe df::
df = pd.DataFrame(dict(X=list('abbcccddef'),
Y=list('ABCDEFGHIJ'),
Z=list('1234123412')))
df
# X Y Z
# 0 a A 1
# 1 b B 2
# 2 b C 3
# 3 c D 4
# 4 c E 1
# 5 c F 2
# 6 d G 3
# 7 d H 4
# 8 e I 1
# 9 f J 2
Only the first lines for each X value, so this one::
# X Y Z
# 0 a A 1
# 1 b B 2
# 3 c D 4
# 6 d G 3
# 8 e I 1
# 9 f J 2
I'm looking for a more elegant way than this::
x_unique = df.X.unique()
x_unique
# array(['a', 'b', 'c', 'd', 'e', 'f'], dtype=object)
res = df[df.X == x_unique[0]].iloc[0]
for u in x_unique[1:]:
res = pd.concat([res, df[df.X==u].iloc[0]], axis=1)
res
# 0 1 3 6 8 9
# X a b c d e f
# Y A B D G I J
# Z 1 2 4 3 1 2
res = res.transpose()
res
# X Y Z
# 0 a A 1
# 1 b B 2
# 3 c D 4
# 6 d G 3
# 8 e I 1
# 9 f J 2
You could use the drop_duplicates() method on X:
In [60]: df.drop_duplicates('X')
Out[60]:
X Y Z
0 a A 1
1 b B 2
3 c D 4
6 d G 3
8 e I 1
9 f J 2
You can also do:
In [3]: import pandas as pd
In [4]: df = pd.DataFrame(dict(X=list('abbcccddef'),
Y=list('ABCDEFGHIJ'),
Z=list('1234123412')))
In [5]: df.groupby('X').first()
Out[5]:
Y Z
X
a A 1
b B 2
c D 4
d G 3
e I 1
f J 2
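The two answers differ slightly in what they return, which the following sketch tries to make visible. The variable names are illustrative; `groupby(...).head(1)` is included as a third option that keeps whole rows.

```python
import pandas as pd

df = pd.DataFrame(dict(X=list('abbcccddef'),
                       Y=list('ABCDEFGHIJ'),
                       Z=list('1234123412')))

# Keeps whole rows together with their original index labels.
first_rows = df.drop_duplicates('X')

# head(1) also selects the first row of each group, in original order.
head_rows = df.groupby('X').head(1)

# first() takes the first non-null value per column and moves X into
# the index unless as_index=False is passed.
first_vals = df.groupby('X', as_index=False).first()

print(first_rows)
print(first_vals)
```

With missing values in Y or Z, `first()` could combine values from different rows of a group, whereas `drop_duplicates` and `head(1)` always return an existing row.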
