pandas.DataFrame - how to reindex by group?

Can a new index be applied to a DataFrame based on a grouping made with groupby? More precisely: is there an elegant way to do that, and can the original DataFrame be modified through the groupby groups at all?
UPD:
My data looks like this:
A B C
0 a x 0.903343
1 a z 0.982050
2 g x 0.274823
3 g y 0.334491
4 c z 0.756728
5 f z 0.697841
6 d z 0.505845
7 b z 0.768199
8 b y 0.743012
9 e x 0.697212
I am grouping by columns 'A' and 'B', and I want every unique pair of values in those columns to get the same index value in the original DF. Also, the original DF can be big, and I'm trying to figure out how to do such a reindex without inefficiently building a whole new DF.
Currently I'm using this solution:
import random
from string import ascii_lowercase

import pandas as pd

df = pd.DataFrame({'A': [random.choice(ascii_lowercase[:5]) for _ in range(10)],
                   'B': [random.choice(['x', 'y']) for _ in range(10)],
                   'C': [random.random() for _ in range(10)]})
df['id'] = None
new_df = pd.DataFrame()
for i, (n, g) in enumerate(df.groupby(['A', 'B'])):
    g['id'] = i
    new_df = new_df.append(g)  # DataFrame.append is removed in pandas >= 2.0
new_df.set_index('id', inplace=True)

You can do this quickly with pandas' factorization machinery:
Create test DataFrame first:
import random
from string import ascii_lowercase

import pandas as pd

random.seed(1)
df = pd.DataFrame({'A': [random.choice(ascii_lowercase[:5]) for _ in range(10)],
                   'B': [random.choice(['x', 'y']) for _ in range(10)],
                   'C': [random.random() for _ in range(10)]})
If you want the new ids to follow the sort order of columns A & B:
# pd.lib.fast_zip and MultiIndex.labels were removed in later pandas versions;
# factorizing the (A, B) tuples directly gives the same sorted ids.
df.index = pd.factorize(list(zip(df.A, df.B)), sort=True)[0]
print(df)
The output is:
A B C
1 a y 0.025446
7 e x 0.541412
6 d y 0.939149
2 b x 0.381204
3 c x 0.216599
4 c y 0.422117
5 d x 0.029041
6 d y 0.221692
1 a y 0.437888
0 a x 0.495812
If you don't care about the order of the new ids:
m = pd.MultiIndex.from_arrays((df.A, df.B))
la, lb = m.codes  # m.labels in older pandas versions
df.index = pd.factorize(la * len(lb) + lb)[0]
print(df)
The output is:
A B C
0 a y 0.025446
1 e x 0.541412
2 d y 0.939149
3 b x 0.381204
4 c x 0.216599
5 c y 0.422117
6 d x 0.029041
2 d y 0.221692
0 a y 0.437888
7 a x 0.495812
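On current pandas releases, where `pd.lib.fast_zip` and `MultiIndex.labels` no longer exist, the same sorted group id is available directly from `groupby(...).ngroup()`. A minimal sketch (my own variant, not part of the original answer):

```python
import random
from string import ascii_lowercase

import pandas as pd

random.seed(1)
df = pd.DataFrame({'A': [random.choice(ascii_lowercase[:5]) for _ in range(10)],
                   'B': [random.choice(['x', 'y']) for _ in range(10)],
                   'C': [random.random() for _ in range(10)]})

# ngroup() assigns one id per unique (A, B) pair, numbered in sorted key order,
# without materializing a new DataFrame.
df.index = df.groupby(['A', 'B']).ngroup()
print(df)
```

This avoids both internal APIs and the row-by-row append loop from the question.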


How to re-arrange the pandas dataframe based on frequency?

I have a dataframe like this:
import numpy as np
import pandas as pd
from collections import Counter
df = pd.DataFrame({'c0': ['app','e','i','owl','u'],'c1': ['p','app','i','g',''],'c2': ['g','p','app','owl','']})
df
c0 c1 c2
0 app p g
1 e app p
2 i i app
3 owl g owl
4 u
I would like to align the rows based on the frequency of the items.
Required dataframe:
c0 c1 c2
0 app app app
1 i i
2 owl owl
3 e p p
4 u g g
My attempt
all_cols = df.values.flatten()
all_cols = [i for i in all_cols if i]
freq = Counter(all_cols)
freq
I can get you this far:
import pandas as pd

df = pd.DataFrame({'c0': list('aeiou'),
                   'c1': ['p', 'a', 'i', 'g', ''],
                   'c2': ['g', 'p', 'a', 'o', '']})
allLetters = set(x for x in df.to_numpy().flatten() if x)
binaryIncidence = []
for letter in allLetters:
    binaryIncidence.append(tuple(int(letter in df[col].tolist()) for col in df.columns))
x = list(zip(allLetters, binaryIncidence))
x.sort(key=lambda y: (y[1], -ord(y[0])), reverse=True)
x = [[y[0] if b else '' for b in y[1]] for y in x]
df_results = pd.DataFrame(x, columns=df.columns)
print(df_results)
... with this output:
c0 c1 c2
0 a a a
1 i i
2 o o
3 e
4 u
5 g g
6 p p
However, in the sample output from your question, you show 'e' getting paired up with 'p', 'p', and 'u' getting paired up with 'g', 'g'. It's not clear to me how that pairing would be chosen.
UPDATE: generalize to strings of arbitrary length
This will work not just with strings of length <= 1 but with strings of arbitrary length:
import pandas as pd

df = pd.DataFrame({'c0': ['app', 'e', 'i', 'owl', 'u'],
                   'c1': ['p', 'app', 'i', 'g', ''],
                   'c2': ['g', 'p', 'app', 'owl', '']})
allStrings = set(x for x in df.to_numpy().flatten() if x)
binaryIncidence = []
for s in allStrings:
    binaryIncidence.append(tuple(int(s in df[col].tolist()) for col in df.columns))
x = list(zip(allStrings, binaryIncidence))
x.sort(key=lambda y: (tuple(-b for b in y[1]), y[0]))
x = [[y[0] if b else '' for b in y[1]] for y in x]
df_results = pd.DataFrame(x, columns=df.columns)
print(df_results)
Output:
c0 c1 c2
0 app app app
1 i i
2 owl owl
3 e
4 u
5 g g
6 p p
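For comparison, the same incidence-and-sort idea can be sketched with `melt` and `crosstab` in place of the explicit loop. This is my own variant, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'c0': ['app', 'e', 'i', 'owl', 'u'],
                   'c1': ['p', 'app', 'i', 'g', ''],
                   'c2': ['g', 'p', 'app', 'owl', '']})

# Long form: one row per (column, string), dropping empty cells.
long = df.melt(var_name='col', value_name='s')
long = long[long['s'] != '']

# 0/1 incidence of each string in each column.
inc = pd.crosstab(long['s'], long['col']).clip(upper=1)

# Strings present in more/earlier columns come first; ties stay alphabetical.
inc = inc.sort_values(list(df.columns), ascending=False, kind='stable')

# Rebuild the aligned frame: the string where present, '' elsewhere.
aligned = pd.DataFrame(
    [[s if v else '' for v in row] for s, row in inc.iterrows()],
    columns=df.columns)
print(aligned)
```

The sort key here is the incidence tuple itself, so the output matches the loop-based version above.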

Replace contents of cell with another cell if condition on a separate cell is met

I have the following data frame:
import pandas as pd

A = [1, 2, 5, 4, 3, 1]
B = ['yes', 'No', 'hello', 'yes', 'no', 'why']
C = [1, 0, 1, 1, 0, 0]
D = ['y', 'n', 'y', 'y', 'n', 'n']
test_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D': D})
We can see 4 columns A, B, C, D. The intended outcome is to replace the contents of B with the contents of D if a condition on C is met; for this example the condition is C == 1.
the intended output is
A = [1,2,5,4,3,1]
B = ["y","No","y","y","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
output_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
output_df = output_df.drop('D', axis=1)
What is the best way to apply this logic to a data frame?
There are many ways to solve this; here is one:
test_df['B'] = test_df['B'].mask(test_df['C'] == 1, test_df['D'])
Output:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
This can be done with np.where:
import numpy as np

test_df['B'] = np.where(test_df['C'] == 1, test_df['D'], test_df['B'])
Output:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
The desired output is achieved using .loc with column 'C' as the mask.
test_df.loc[test_df['C']==1,'B'] = test_df.loc[test_df['C']==1,'D']
UPDATE: Just found out a similar answer was posted by @QuangHoang. This answer is slightly different in that it does not require numpy.
I don't know if "inverse" is the right word here, but I noticed recently that mask and where are "inverses" of each other: if you negate the condition of a .where statement with ~, you get the same result as mask:
import pandas as pd

A = [1, 2, 5, 4, 3, 1]
B = ['yes', 'No', 'hello', 'yes', 'no', 'why']
C = [1, 0, 1, 1, 0, 0]
D = ['y', 'n', 'y', 'y', 'n', 'n']
test_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D': D})
test_df['B'] = test_df['B'].where(~(test_df['C'] == 1), test_df['D'])
# test_df['B'] = test_df['B'].mask(test_df['C'] == 1, test_df['D'])  # Scott Boston's answer
test_df
Out[1]:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
You can also use df.where:
test_df['B'] = test_df['D'].where(test_df.C.eq(1), test_df.B)
Output:
In [875]: test_df
Out[875]:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n

Append a DataFrame as a new column

I have the following piece of code:
import pandas as pd

rev = input('rev: ')

def ds():
    data = pd.read_excel(r'H:\sysfile.xls', skiprows=2)
    dataset1 = pd.DataFrame(data, columns=['Col_1', 'Col_2', 'Col_3', 'Col_4'])
    dataset2 = pd.DataFrame(data, columns=['Col_5'])
    dataset2['Col_5'] = dataset2['Col_5'].fillna(rev)

ds()
Col_5 is an existing column in the xls file. I want to give every cell in that column the input from rev.
If I print() dataset1, I get the content of the existing DataFrame (read from the xls file):
A B C
0 x y z
1 x y z
2 x y z
Now I want to write the user input from rev = input() into DataFrame dataset2 and append it to dataset1.
INPUT:
>>>rev: h
Should become this (dataset2):
D
0 h
1 h
2 h
and appended to dataset1:
A B C D
0 x y z h
1 x y z h
2 x y z h
I need help!
If you already managed to fill dataset2 with the user input, what you need is to join both datasets:
dataset1 = pd.DataFrame({'A': ['X', 'X', 'X', 'X'], 'B': ['Y', 'Y', 'Y', 'Y'], 'C': ['z', 'z', 'z', 'z']})
Out[5]:
A B C
0 X Y z
1 X Y z
2 X Y z
3 X Y z
dataset2 = pd.DataFrame({'D': ['h', 'h', 'h', 'h']})
Out[8]:
D
0 h
1 h
2 h
3 h
At this point you need to join them:
result = pd.concat([dataset1, dataset2], axis=1, sort=False)
Out[10]:
A B C D
0 X Y z h
1 X Y z h
2 X Y z h
3 X Y z h
You can read it in here for further details:
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
You can take the transpose of your data frame and add the new column as a row. After adding that row, you can transpose again to bring the data frame back to its normal form.
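If every cell of the new column holds the same user input, you may not need a second DataFrame at all: assigning a scalar to a column broadcasts it to every row. A minimal sketch, assuming `rev` already holds the input:

```python
import pandas as pd

rev = 'h'  # stands in for input('rev: ')
dataset1 = pd.DataFrame({'A': ['x', 'x', 'x'],
                         'B': ['y', 'y', 'y'],
                         'C': ['z', 'z', 'z']})

# A scalar on the right-hand side is broadcast to every row of the new column.
dataset1['D'] = rev
print(dataset1)
```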

How to repeat certain rows of a dataframe?

I have a dataframe like this
import pandas as pd

df1 = pd.DataFrame({
    'key': list('AAABBC'),
    'prop1': list('xyzuuy'),
    'prop2': list('mnbnbb')
})
key prop1 prop2
0 A x m
1 A y n
2 A z b
3 B u n
4 B u b
5 C y b
and a dictionary like this (user input):
d = {
    'A': 2,
    'B': 1,
    'C': 3,
}
The keys of d refer to entries in column key in df1; the values indicate how often the rows of df1 that belong to the respective keys should be present: 1 means that nothing has to be done, 2 means all rows should be copied once, and 3 means they should be copied twice.
For the example above, the expected output looks as follows:
key prop1 prop2
0 A x m
1 A y n
2 A z b
3 B u n
4 B u b
5 C y b
6 A x m # <-- copied, copy 1
7 A y n # <-- copied, copy 1
8 A z b # <-- copied, copy 1
9 C y b # <-- copied, copy 1
10 C y b # <-- copied, copy 2
So, the rows that belong to A have been copied once and added to df1, nothing had to be done about the rows the belong to B and the rows that belong to C have been copied twice and were also added to df1.
I currently implement this as follows:
dfs_to_add = []
for el, val in d.items():
    if val > 1:
        _temp_df = pd.concat(
            [df1[df1['key'] == el]] * (val - 1)
        )
        dfs_to_add.append(_temp_df)
df_to_add = pd.concat(dfs_to_add)
df_final = pd.concat([df1, df_to_add]).reset_index(drop=True)
which gives me the desired output.
The code is rather ugly; does anyone see a more straightforward option to get to the same output?
The order is important, so in case of A, I would need
0 A x m
1 A y n
2 A z b
0 A x m
1 A y n
2 A z b
and not
0 A x m
0 A x m
1 A y n
1 A y n
2 A z b
2 A z b
We can use concat + groupby:
df = pd.concat([pd.concat([y] * d[x]) for x, y in df1.groupby('key')])
key prop1 prop2
0 A x m
1 A y n
2 A z b
0 A x m
1 A y n
2 A z b
3 B u n
4 B u b
5 C y b
5 C y b
5 C y b
One way, using Index.repeat with loc[] and Series.map:
m = df1.set_index('key',append=True)
out = m.loc[m.index.repeat(df1['key'].map(d))].reset_index('key')
print(out)
key prop1 prop2
0 A x m
0 A x m
1 A y n
1 A y n
2 A z b
2 A z b
3 B u n
4 B u b
5 C y b
5 C y b
5 C y b
You can try repeat:
df1.loc[df1.index.repeat(df1['key'].map(d))]
Output:
key prop1 prop2
0 A x m
0 A x m
1 A y n
1 A y n
2 A z b
2 A z b
3 B u n
4 B u b
5 C y b
5 C y b
5 C y b
If order is not important, use the other solutions.
If order is important, get the indices of the repeated values, repeat them with loc and add them to the original:
idx = [x for k, v in d.items() for x in df1.index[df1['key'] == k].repeat(v - 1)]
df = pd.concat([df1, df1.loc[idx]], ignore_index=True)  # df1.append(...) in older pandas
print (df)
key prop1 prop2
0 A x m
1 A y n
2 A z b
3 B u n
4 B u b
5 C y b
6 A x m
7 A y n
8 A z b
9 C y b
10 C y b
Using DataFrame.merge and np.repeat:
import numpy as np

df = df1.merge(
    pd.Series(np.repeat(list(d.keys()), list(d.values())), name='key'),
    on='key')
Result:
# print(df)
key prop1 prop2
0 A x m
1 A x m
2 A y n
3 A y n
4 A z b
5 A z b
6 B u n
7 B u b
8 C y b
9 C y b
10 C y b

Change the values of column after having used groupby on another column (pandas dataframe)

I have two data frames, one with the coordinates of places
import numpy as np
import pandas as pd

coord = pd.DataFrame()
coord['Index'] = ['A', 'B', 'C']
coord['x'] = np.random.random(coord.shape[0])
coord['y'] = np.random.random(coord.shape[0])
coord
Index x y
0 A 0.888025 0.376416
1 B 0.052976 0.396243
2 C 0.564862 0.30138
and one with several values measured in the places
df = pd.DataFrame()
df['Index'] = ['A','A','B','B','B','C','C','C','C']
df['Value'] = np.random.random(df.shape[0])
df
Index Value
0 A 0.930298
1 A 0.144550
2 B 0.393952
3 B 0.680941
4 B 0.657807
5 C 0.704954
6 C 0.733328
7 C 0.099785
8 C 0.871678
I want to find an efficient way of assigning the coordinates to the df data frame. For the moment I have tried
df['x'] = np.zeros(df.shape[0])
df['y'] = np.zeros(df.shape[0])
for i in df.Index.unique():
    df.loc[df.Index == i, 'x'] = coord.loc[coord.Index == i, 'x'].values
    df.loc[df.Index == i, 'y'] = coord.loc[coord.Index == i, 'y'].values
which works and yields
Index Value x y
0 A 0.220323 0.983739 0.121289
1 A 0.115075 0.983739 0.121289
2 B 0.432688 0.809586 0.639811
3 B 0.106178 0.809586 0.639811
4 B 0.259465 0.809586 0.639811
5 C 0.804018 0.827192 0.156095
6 C 0.552053 0.827192 0.156095
7 C 0.412345 0.827192 0.156095
8 C 0.235106 0.827192 0.156095
but this is quite sloppy, and highly inefficient. I tried to use the groupby operation like this
df['x'] = np.zeros(df.shape[0])
df['y'] = np.zeros(df.shape[0])
gb = df.groupby('Index')
for k in gb.groups.keys():
    gb.get_group(k)['x'] = coord.loc[coord.Index == k, 'x']
    gb.get_group(k)['y'] = coord.loc[coord.Index == k, 'y']
but I get this error here
/anaconda/lib/python2.7/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I understand the problem, but I don't know how to overcome it.
Any suggestions?
merge is what you're looking for.
df
Index Value
0 A 0.930298
1 A 0.144550
2 B 0.393952
3 B 0.680941
4 B 0.657807
5 C 0.704954
6 C 0.733328
7 C 0.099785
8 C 0.871678
coord
Index x y
0 A 0.888025 0.376416
1 B 0.052976 0.396243
2 C 0.564862 0.301380
df.merge(coord, on='Index')
Index Value x y
0 A 0.930298 0.888025 0.376416
1 A 0.144550 0.888025 0.376416
2 B 0.393952 0.052976 0.396243
3 B 0.680941 0.052976 0.396243
4 B 0.657807 0.052976 0.396243
5 C 0.704954 0.564862 0.301380
6 C 0.733328 0.564862 0.301380
7 C 0.099785 0.564862 0.301380
8 C 0.871678 0.564862 0.301380
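If you'd rather not materialize a merged frame, `Series.map` against an indexed lookup fills the columns one by one. A sketch with hypothetical seeded data standing in for the random values above:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
coord = pd.DataFrame({'Index': ['A', 'B', 'C'],
                      'x': np.random.random(3),
                      'y': np.random.random(3)})
df = pd.DataFrame({'Index': list('AABBBCCCC'),
                   'Value': np.random.random(9)})

# Look up each row's coordinates by key; no join, no loop.
lookup = coord.set_index('Index')
df['x'] = df['Index'].map(lookup['x'])
df['y'] = df['Index'].map(lookup['y'])
print(df)
```

Unlike merge, this keeps df's original index untouched, which can matter if the index carries meaning.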
