Index match with python

I have two dfs
df1
Len Bar
x a
y a
z a
x b
y b
z b
x c
y c
z c
df2
Len/Bar a b c
x 4 2 8
y 2 7 7
z 6 3 9
I need the output to be:
Len Bar Amount
x a 4
y a 2
z a 6
x b 2
y b 7
z b 3
x c 8
y c 7
z c 9
In excel I use index match formula =INDEX($B$2:$D$4,MATCH(A19,$A$2:$A$4,0),MATCH(B19,$B$1:$D$1,0))
But is there any way to do the same using map or merge?
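For reference, here is a minimal reconstruction of the two frames as shown above (the constructors are my own, built from the displayed data):
import pandas as pd

df1 = pd.DataFrame({'Len': ['x', 'y', 'z'] * 3,
                    'Bar': ['a'] * 3 + ['b'] * 3 + ['c'] * 3})
df2 = pd.DataFrame({'Len/Bar': ['x', 'y', 'z'],
                    'a': [4, 2, 6],
                    'b': [2, 7, 3],
                    'c': [8, 7, 9]})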

I think you need to first reshape df2 and then merge it with df1 using a left join:
df2 = df2.set_index('Len/Bar').unstack().rename_axis(('Bar','Len')).reset_index(name='Amount')
df2 = df1.merge(df2, how='left', on=['Len', 'Bar'])
print(df2)
Len Bar Amount
0 x a 4
1 y a 2
2 z a 6
3 x b 2
4 y b 7
5 z b 3
6 x c 8
7 y c 7
8 z c 9
Another solution:
df2 = df2.set_index('Len/Bar').stack().rename_axis(('Len','Bar')).rename('Amount')
df2 = df1.join(df2, on=['Len', 'Bar'])
print(df2)
Len Bar Amount
0 x a 4
1 y a 2
2 z a 6
3 x b 2
4 y b 7
5 z b 3
6 x c 8
7 y c 7
8 z c 9
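A third option, starting again from the original df2, is melt, which builds the long format in one step; a sketch under the same assumptions as above:
df2_long = (df2.melt(id_vars='Len/Bar', var_name='Bar', value_name='Amount')
               .rename(columns={'Len/Bar': 'Len'}))
print(df1.merge(df2_long, on=['Len', 'Bar'], how='left'))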
EDIT:
Whether you need merge/join at all depends on whether the reshaped df2 has to be filtered by the rows of df1 or not.
See the difference:
#removed some rows
print(df1)
Len Bar
0 x a
1 y a
2 z a
3 x b
4 y b
print(df2)
Bar Len Amount
0 a x 4
1 a y 2
2 a z 6
3 b x 2
4 b y 7
5 b z 3
6 c x 8
7 c y 7
8 c z 9
And after the merge, rows are filtered by the columns Len and Bar from df1:
print(df3)
Len Bar Amount
0 x a 4
1 y a 2
2 z a 6
3 x b 2
4 y b 7

Incidentally, you do not seem to need df1 at all:
df3 = df2.set_index('Len/Bar').stack().reset_index()
df3.columns = "Len", "Bar", "Amount"
# Len Bar Amount
#0 x a 4
#1 x b 2
#2 x c 8
#3 y a 2
#4 y b 7
#5 y c 7
#6 z a 6
#7 z b 3
#8 z c 9
Unless you want to borrow the column names from it:
df3.columns = list(df1.columns) + ["Amount"]
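If you want something closer in spirit to INDEX/MATCH, a per-row lookup also works. A minimal sketch, assuming the original df2 from the question:
lookup = df2.set_index('Len/Bar')  # rows keyed by Len, columns are Bar
df1['Amount'] = [lookup.at[len_, bar] for len_, bar in zip(df1['Len'], df1['Bar'])]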

Related

pandas compare first rows and make identical

I have two dfs.
df1 = pd.DataFrame(["bazzar","dogsss","zxvfzx","anythi"], columns = [0], index = [0,1,2,3])
df2 = pd.DataFrame(["baar","maar","cats","$%&*"], columns = [0], index = [0,1,2,3])
df1 = df1[0].apply(lambda x: pd.Series(list(x)))
df2 = df2[0].apply(lambda x: pd.Series(list(x)))
which look like
df1
0 1 2 3 4 5
0 b a z z a r
1 d o g s s s
2 z x v f z x
3 a n y t h i
df2
0 1 2 3
0 b a a r
1 m a a r
2 c a t s
3 $ % & *
I want to compare their first rows and make them identical by inserting new columns containing the character z into df2, so that df2 becomes
0 1 2 3 4 5
0 b a z z a r
1 m a z z a r
2 c a z z t s
3 $ % z z & *
An additional example:
df3 = pd.DataFrame(["aazzbbzcc","bbbbbbbbb","ccccccccc","ddddddddd"], columns = [0], index = [0,1,2,3])
df4 = pd.DataFrame(["aabbcc","111111","222222","333333"], columns = [0], index = [0,1,2,3])
df3 = df3[0].apply(lambda x: pd.Series(list(x)))
df4 = df4[0].apply(lambda x: pd.Series(list(x)))
df3
0 1 2 3 4 5 6 7 8
0 a a z z b b z c c
1 b b b b b b b b b
2 c c c c c c c c c
3 d d d d d d d d d
df4
0 1 2 3 4 5
0 a a b b c c
1 1 1 1 1 1 1
2 2 2 2 2 2 2
3 3 3 3 3 3 3
You can see an important relationship between the first rows of the two dataframes: they eventually become the same once characters z are added to the latter dataframe (i.e. df2 and df4). The expected output for this example is:
0 1 2 3 4 5 6 7 8
0 a a z z b b z c c
1 1 1 z z 1 1 z 1 1
2 2 2 z z 2 2 z 2 2
3 3 3 z z 3 3 z 3 3
Any idea how to do that?
Because the first rows contain duplicated values, create a MultiIndex from the first row plus GroupBy.cumcount for both DataFrames:
a = df1.iloc[[0]].T  # first row of df1 as a one-column frame
df1.columns = [a[0], a.groupby(a[0]).cumcount()]  # (char, occurrence) pairs
b = df2.iloc[[0]].T
df2.columns = [b[0], b.groupby(b[0]).cumcount()]
print(df1)
0 b a z a r
0 0 0 1 1 0
0 b a z z a r
1 d o g s s s
2 z x v f z x
3 a n y t h i
print(df2)
0 b a r
0 0 1 0
0 b a a r
1 m a a r
2 c a t s
3 $ % & *
Then DataFrame.reindex is used, with the missing values replaced by the first row of df1:
df = df2.reindex(df1.columns, axis=1).fillna(df1.iloc[0])
print(df)
0 b a z a r
0 0 0 1 1 0
0 b a z z a r
1 m a z z a r
2 c a z z t s
3 $ % z z & *
Last, set a plain range as the columns:
df.columns = range(len(df.columns))
print(df)
0 1 2 3 4 5
0 b a z z a r
1 m a z z a r
2 c a z z t s
3 $ % z z & *
Check where to add (using the original one-column frames, before the split into characters):
import difflib
list(difflib.ndiff(df2[0][0], df1[0][0]))
['  b', '  a', '+ z', '+ z', '  a', '  r']
Add manually, inserting 'zz' after the first two characters (n=1 limits str.replace to the first match):
df2[0].str.replace('(..)', '\\1zz', n=1, regex=True).str.split('(?<=\\S)(?=\\S)', expand=True)
0 1 2 3 4 5
0 b a z z a r
1 m a z z a r
2 c a z z t s
3 $ % z z & *
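The two steps can also be combined into a small helper. This is my own sketch, not taken from the answers above: it assumes df1 and df2 are the character frames as freshly built in the question, and that the reference row only ever inserts characters (no deletions), as here:
import difflib
import pandas as pd

def align_to_first_row(src, ref):
    # Diff the first rows character by character.
    out = {}
    j = 0  # next unused column of src
    for tag in difflib.ndiff(''.join(src.iloc[0]), ''.join(ref.iloc[0])):
        if tag.startswith('+ '):    # char only in ref: insert a constant column
            out[len(out)] = tag[2:]
        elif tag.startswith('  '):  # char in both: copy the next src column
            out[len(out)] = src.iloc[:, j]
            j += 1
    return pd.DataFrame(out)

print(align_to_first_row(df2, df1))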

How to compute nested group proportions using python pandas without losing original row count

Given the following data:
df = pd.DataFrame({
    'where': ['a','a','a','a','a','a'] + ['b','b','b','b','b','b'],
    'what':  ['x','y','z','x','y','z'] + ['x','y','z','x','y','z'],
    'val':   [1,3,2,5,4,3] + [5,6,3,4,5,3]
})
Which looks like:
where what val
0 a x 1
1 a y 3
2 a z 2
3 a x 5
4 a y 4
5 a z 3
6 b x 5
7 b y 6
8 b z 3
9 b x 4
10 b y 5
11 b z 3
I would like to compute the proportion of what in where, and create a new
column that represents this.
The column will have duplicates. If I consider what = x in the above and
add that column in, the data would be as follows:
where what val what_where_prop
0 a x 1 6/18
1 a y 3
2 a z 2
3 a x 5 6/18
4 a y 4
5 a z 3
6 b x 5 9/26
7 b y 6
8 b z 3
9 b x 4 9/26
10 b y 5
11 b z 3
Here 6/18 is computed by finding the total of x in a (6 = 1 + 5) over the total of val in a (18). The same process yields 9/26.
The final column is filled in the same way for y and z.
IIUC,
df['what_where_group'] = (df.groupby(['where', 'what'], as_index=False)['val']
                            .transform('sum')
                            .div(df.groupby('where')['val'].transform('sum'),
                                 axis=0))
df
Output:
where what val what_where_group
0 a x 1 0.333333
1 a y 3 0.388889
2 a z 2 0.277778
3 a x 5 0.333333
4 a y 4 0.388889
5 a z 3 0.277778
6 b x 5 0.346154
7 b y 6 0.423077
8 b z 3 0.230769
9 b x 4 0.346154
10 b y 5 0.423077
11 b z 3 0.230769
Details:
First, group by the two levels where and what; with as_index=False I am not setting the groups as the index, and transform the sum. Next, group by where only and transform the sum. Lastly, divide the first result by the second using div, aligning along the rows with axis=0.
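As a quick sanity check against the hand-computed 6/18 (a sketch, assuming df as defined above):
x_in_a = df.loc[(df['where'] == 'a') & (df['what'] == 'x'), 'val'].sum()  # 1 + 5 = 6
total_a = df.loc[df['where'] == 'a', 'val'].sum()                        # 18
print(x_in_a / total_a)  # 0.3333..., matching rows 0 and 3 above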
Another way:
g = df.set_index(['where', 'what'])['val']
num = g.groupby(level=[0, 1]).sum()  # sum per (where, what)
denom = g.groupby(level=0).sum()     # sum per where
ww_group = num.div(denom, level=0).rename('what_where_group')
df.merge(ww_group, left_on=['where', 'what'], right_index=True)
Output:
where what val what_where_group
0 a x 1 0.333333
3 a x 5 0.333333
1 a y 3 0.388889
4 a y 4 0.388889
2 a z 2 0.277778
5 a z 3 0.277778
6 b x 5 0.346154
9 b x 4 0.346154
7 b y 6 0.423077
10 b y 5 0.423077
8 b z 3 0.230769
11 b z 3 0.230769
Details:
Basically the same as before, just in explicit steps. Then merge the result back onto df to apply the division to each line.

Reverse a Cross Tabulation or Frequency Table

Suppose I have a frequency table df defined as:
dat = [[0, 2, 1], [1, 0, 3], [4, 1, 1]]
idx = pd.Index([*'abc'], name='One')
col = pd.Index([*'xyz'], name='Two')
df = pd.DataFrame(dat, idx, col)
df
Two x y z
One
a 0 2 1
b 1 0 3
c 4 1 1
How do I "invert" this to get a dataframe pre_df
One Two
0 a y
1 a y
2 a z
3 b x
4 b z
5 b z
6 b z
7 c x
8 c x
9 c x
10 c x
11 c y
12 c z
Such that pd.crosstab(pre_df.One, pre_df.Two) would get me back to df
Two x y z
One
a 0 2 1
b 1 0 3
c 4 1 1
Try stack and repeat:
s = df.stack()
s.index.repeat(s).to_frame().reset_index(drop=True)
Output:
One Two
0 a y
1 a y
2 a z
3 b x
4 b z
5 b z
6 b z
7 c x
8 c x
9 c x
10 c x
11 c y
12 c z
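To confirm the round trip the question asks for, re-tabulate the result; a quick check assuming s and df from above:
pre_df = s.index.repeat(s).to_frame().reset_index(drop=True)
print(pd.crosstab(pre_df.One, pre_df.Two).equals(df))  # True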

Python: how to reshape a Pandas dataframe and keep the information?

I have a dataframe containing the geographical information of points.
df:
A B ax ay bx by
0 x y 5 7 3 2
1 z w 2 0 7 4
2 k x 5 7 2 0
3 v y 2 3 3 2
I would like to create a dataframe with the geographical info of the unique points
df1:
ID x y
0 x 5 7
1 y 3 2
2 z 2 0
3 w 7 4
4 k 5 7
5 v 2 3
First flatten the values in the columns with numpy.ravel, create the DataFrame with the constructor, and last add drop_duplicates (thanks @zipa):
a = df[['A','B']].values.ravel()
b = df[['ax','bx']].values.ravel()
c = df[['ay','by']].values.ravel()
df = pd.DataFrame({'ID':a, 'x':b, 'y':c}).drop_duplicates('ID').reset_index(drop=True)
print(df)
ID x y
0 x 5 7
1 y 3 2
2 z 2 0
3 w 7 4
4 k 5 7
5 v 2 3
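An alternative sketch with pd.concat, starting from the original df of the question: stack the (name, x, y) triples of the two point columns, then drop duplicates. Note the order of the unique IDs can differ from the ravel version:
pairs = pd.concat([
    df[['A', 'ax', 'ay']].set_axis(['ID', 'x', 'y'], axis=1),
    df[['B', 'bx', 'by']].set_axis(['ID', 'x', 'y'], axis=1),
])
print(pairs.drop_duplicates('ID').reset_index(drop=True))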

Pandas self-join on non-unique values

I have the following table:
ind_ID pair_ID orig_data
0 A 1 W
1 B 1 X
2 C 2 Y
3 D 2 Z
4 A 3 W
5 C 3 X
6 B 4 Y
7 D 4 Z
Each row has an individual_ID, and a pair_ID that it shares with exactly one other row. I want to do a self join, so that every row has its original data, and the data of the row it shares a pair_ID with:
ind_ID pair_ID orig_data partner_data
0 A 1 W X
1 B 1 X W
2 C 2 Y Z
3 D 2 Z Y
4 A 3 W X
5 C 3 X W
6 B 4 Y Z
7 D 4 Z Y
I have tried:
df.join(df, on='pair_ID')
But obviously since pair_ID values are not unique I get:
ind_ID pair_ID orig_data partner_data
0 A 1 W NaN
1 B 1 X NaN
2 C 2 Y NaN
3 D 2 Z NaN
4 A 3 W NaN
5 C 3 X NaN
6 B 4 Y NaN
7 D 4 Z NaN
I've also thought about creating a new column that concatenates ind_ID+pair_ID which would be unique, but then the join would not know what to match on.
Is it possible to do a self-join on pair_ID where each row is joined with the matching row that is not itself?
In your case (with exactly two rows per pair_ID) you can probably just group by pair_ID and transform, reversing the order of the values within each group, e.g.:
df.loc[:, 'partner_data'] = df.groupby('pair_ID').orig_data.transform(lambda L: L[::-1])
Which gives you:
ind_ID pair_ID orig_data partner_data
0 A 1 W X
1 B 1 X W
2 C 2 Y Z
3 D 2 Z Y
4 A 3 W X
5 C 3 X W
6 B 4 Y Z
7 D 4 Z Y
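If a pair_ID could ever hold more than two rows, a self-merge is a more general sketch (assuming the original df, before partner_data is added): join the table to itself on pair_ID and drop the self-matches:
m = df.merge(df, on='pair_ID', suffixes=('', '_partner'))
m = m[m['ind_ID'] != m['ind_ID_partner']].reset_index(drop=True)
print(m[['ind_ID', 'pair_ID', 'orig_data', 'orig_data_partner']])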
