I have a dataframe like this:
mainid pidx pidy score
1 a b 2
1 a c 5
1 c a 7
1 c b 2
1 a e 8
2 x y 1
2 y z 3
2 z y 5
2 x w 12
2 x v 1
2 y x 6
I want to group by column 'pidx', sort 'score' in descending order within each group (i.e. for each pidx), and then select head(2), i.e. the top 2 rows from each group.
The result I am looking for is like this:
mainid pidx pidy score
1 a e 8
1 a c 5
1 c a 7
1 c b 2
2 x w 12
2 x y 1
2 y x 6
2 y z 3
2 z y 5
What I tried was:
df.sort(['pidx','score'],ascending = False).groupby('pidx').head(2)
and this seems to work, but I don't know if it's the right approach when working on a huge dataset. What better method can I use to get such a result?
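(For reference, the sample frame above can be rebuilt like this to experiment with the answers that follow; this reconstruction is not part of the original post.)
import pandas as pd
df = pd.DataFrame({'mainid': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                   'pidx': list('aacca') + list('xyzxxy'),
                   'pidy': list('bcabe') + list('yzywvx'),
                   'score': [2, 5, 7, 2, 8, 1, 3, 5, 12, 1, 6]})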
There are 2 solutions:
1. sort_values and aggregate head:
df1 = df.sort_values('score',ascending = False).groupby('pidx').head(2)
print (df1)
mainid pidx pidy score
8 2 x w 12
4 1 a e 8
2 1 c a 7
10 2 y x 6
1 1 a c 5
7 2 z y 5
6 2 y z 3
3 1 c b 2
5 2 x y 1
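Note that this result is ordered by score across the whole frame rather than grouped like the desired output. If you want the grouped ordering, a small follow-up sort should do it for this data (a sketch, not part of the original answer):
df1 = df1.sort_values(['pidx', 'score'], ascending=[True, False])
print (df1)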
2. set_index and aggregate nlargest:
df = df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index()
print (df)
pidx mainid pidy score
0 a 1 e 8
1 a 1 c 5
2 c 1 a 7
3 c 1 b 2
4 x 2 w 12
5 x 2 y 1
6 y 2 x 6
7 y 2 z 3
8 z 2 y 5
Timings:
np.random.seed(123)
N = 1000000
L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'mainid':np.random.randint(1000, size=N),
                   'pidx': np.random.randint(10000, size=N),
                   'pidy': np.random.choice(L2, N),
                   'score':np.random.randint(1000, size=N)})
#print (df)
def epat(df):
    grouped = df.groupby('pidx')
    new_df = pd.DataFrame([], columns = df.columns)
    for key, values in grouped:
        new_df = pd.concat([new_df, grouped.get_group(key).sort_values('score', ascending=False)[:2]], axis=0)
    return (new_df)
print (epat(df))
In [133]: %timeit (df.sort_values('score',ascending = False).groupby('pidx').head(2))
1 loop, best of 3: 309 ms per loop
In [134]: %timeit (df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index())
1 loop, best of 3: 7.11 s per loop
In [147]: %timeit (epat(df))
1 loop, best of 3: 22 s per loop
A simple solution would be:
grouped = DF.groupby('pidx')
new_df = pd.DataFrame([], columns = DF.columns)
for key, values in grouped:
    new_df = pd.concat([new_df, grouped.get_group(key).sort_values('score', ascending=False)[:2]], axis=0)
Hope it helps!
Another method is to rank scores in each group and filter the rows where the scores are ranked top 2 in each group.
df1 = df[df.groupby('pidx')['score'].rank(method='first', ascending=False) <= 2]
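A side note on method='first': it breaks ties by row position, so each group keeps exactly 2 rows. With this data, pidx 'x' has two scores of 1, so a tie-aware method such as 'min' would keep three rows for 'x'. A minimal sketch of the difference:
ranks_first = df.groupby('pidx')['score'].rank(method='first', ascending=False)
ranks_min = df.groupby('pidx')['score'].rank(method='min', ascending=False)
print (df[ranks_first <= 2])  # exactly 2 rows per pidx
print (df[ranks_min <= 2])    # pidx 'x' keeps 3 rows because of the tied scores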
I have the following DataFrame. It has the columns Id, columns, rows, 1, 2, 3, 4, 5, 6, 7, 8, and 9.
Id columns rows 1 2 3 4 5 6 7 8 9
1 3 3 A B C D E F G H Z
2 3 2 I J K
Taking Id, the number of rows, and the number of columns into account, I would like to restructure the table as follows.
Id columns rows col_1 col_2 col_3
1 3 3 A B C
1 3 3 D E F
1 3 3 G H Z
2 3 2 I J K
2 3 2 - - -
Can anyone help to do this in Python Pandas?
Here's a solution using MultiIndex and .iterrows():
df
Id columns rows 1 2 3 4 5 6 7 8 9
0 1 3 3 A B C D E F G H Z
1 2 3 2 I J K None None None None None None
You can set n to any length, in your case 3:
n = 3
df = df.set_index(['Id', 'columns', 'rows'])
new_index = []
new_rows = []
for index, row in df.iterrows():
    max_rows = index[-1] * (len(index)-1) # read amount of rows
    for i in range(0, len(row), n):
        if i > max_rows: # max rows reached, stop appending
            continue
        new_index.append(index)
        new_rows.append(row.values[i:i+n])

df2 = pd.DataFrame(new_rows, index=pd.MultiIndex.from_tuples(new_index))
df2
0 1 2
1 3 3 A B C
3 D E F
3 G H Z
2 3 2 I J K
2 None None None
And if you are keen on getting your old index and headers back:
new_headers = ['Id', 'columns', 'rows'] + list(range(1, n+1))
df2.reset_index().set_axis(new_headers, axis=1)
Id columns rows 1 2 3
0 1 3 3 A B C
1 1 3 3 D E F
2 1 3 3 G H Z
3 2 3 2 I J K
4 2 3 2 None None None
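If you also want the '-' placeholders from the desired output instead of None, one option (a small addition to the answer above) is to fill the missing cells afterwards:
df2.reset_index().set_axis(new_headers, axis=1).fillna('-')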
Using melt and str.split, with floor division on the index to create groups of 3:
s = pd.melt(df,id_vars=['Id','columns','rows'])
s1 = (
s.sort_values(["Id", "variable"])
.assign(idx=s.index // 3)
.fillna("-")
.groupby(["idx", "Id"])
.agg(
columns=("columns", "first"), rows=("rows", "first"), value=("value", ",".join)
)
)
s2 = s1["value"].str.split(",", expand=True).rename(
columns=dict(zip(s1["value"].str.split(",", expand=True).columns,
[f'col_{i+1}' for i in range(s1["value"].str.split(',').apply(len).max())]
))
)
df1 = pd.concat([s1.drop('value',axis=1),s2],axis=1)
print(df1)
columns rows col_1 col_2 col_3
idx Id
0 1 3 3 A B C
1 1 3 3 D E F
2 1 3 3 G H Z
3 2 3 2 I J K
4 2 3 2 - - -
5 2 3 2 - - -
I modified unutbu's solution to create an array for each row with the length given by the expected new rows and columns counts, then build a DataFrame per row in a list comprehension and join them together with concat:
def f(x):
    c, r = x.name[1], x.name[2]
    #print (c, r)
    arr = np.empty(c * r, dtype='O')
    vals = x.iloc[:len(arr)]
    arr[:len(vals)] = vals
    idx = pd.MultiIndex.from_tuples([x.name] * r, names=df.columns[:3])
    cols = [f'col_{i+1}' for i in range(c)]
    return pd.DataFrame(arr.reshape((r, c)), index=idx, columns=cols).fillna('-')

df1 = (pd.concat([x for x in df.set_index(['Id', 'columns', 'rows'])
                              .apply(f, axis=1)])
         .reset_index())
print (df1)
Id columns rows col_1 col_2 col_3
0 1 3 3 A B C
1 1 3 3 D E F
2 1 3 3 G H Z
3 2 3 2 I J K
4 2 3 2 - - -
Consider the following hdfstore and dataframes df and df2
import pandas as pd
store = pd.HDFStore('test.h5')
midx = pd.MultiIndex.from_product([range(2), list('XYZ')], names=list('AB'))
df = pd.DataFrame(dict(C=range(6)), midx)
df
C
A B
0 X 0
Y 1
Z 2
1 X 3
Y 4
Z 5
midx2 = pd.MultiIndex.from_product([range(2), list('VWX')], names=list('AB'))
df2 = pd.DataFrame(dict(C=range(6)), midx2)
df2
C
A B
0 V 0
W 1
X 2
1 V 3
W 4
X 5
I want to first write df to the store.
store.append('df', df)
store.get('df')
C
A B
0 X 0
Y 1
Z 2
1 X 3
Y 4
Z 5
At a later point in time I will have another dataframe that I want to update the store with. I want to overwrite the rows whose index values appear in the new dataframe while keeping the old ones.
When I do
store.append('df', df2)
store.get('df')
C
A B
0 X 0
Y 1
Z 2
1 X 3
Y 4
Z 5
0 V 0
W 1
X 2
1 V 3
W 4
X 5
This isn't at all what I want. Notice that (0, 'X') and (1, 'X') are repeated. I can manipulate the combined dataframe and overwrite, but I expect to be working with a lot of data, where this wouldn't be feasible.
How do I update the store to get?
C
A B
0 V 0
W 1
X 2
Y 1
Z 2
1 V 3
W 4
X 5
Y 4
Z 5
You'll see that for each level of 'A', 'Y' and 'Z' are the same, 'V' and 'W' are new, and 'X' is updated.
What is the correct way to do this?
Idea: remove matching rows (with matching index values) from the HDF first and then append df2 to HDFStore.
Problem: I couldn't find a way to use where="index in df2.index" with MultiIndex indexes.
Solution: first convert the MultiIndexes to plain ones:
df.index = df.index.get_level_values(0).astype(str) + '_' + df.index.get_level_values(1).astype(str)
df2.index = df2.index.get_level_values(0).astype(str) + '_' + df2.index.get_level_values(1).astype(str)
this yields:
In [348]: df
Out[348]:
C
0_X 0
0_Y 1
0_Z 2
1_X 3
1_Y 4
1_Z 5
In [349]: df2
Out[349]:
C
0_V 0
0_W 1
0_X 2
1_V 3
1_W 4
1_X 5
Make sure that you use format='t' and data_columns=True when you create/append HDF5 files (format='t' stores the data as a queryable table, and data_columns=True indexes all the columns so they can be used in the where clause):
store = pd.HDFStore('d:/temp/test1.h5')
store.append('df', df, format='t', data_columns=True)
store.close()
Now we can first remove the rows with matching indexes from the HDFStore:
store = pd.HDFStore('d:/temp/test1.h5')
In [345]: store.remove('df', where="index in df2.index")
Out[345]: 2
and append df2:
In [346]: store.append('df', df2, format='t', data_columns=True, append=True)
Result:
In [347]: store.get('df')
Out[347]:
C
0_Y 1
0_Z 2
1_Y 4
1_Z 5
0_V 0
0_W 1
0_X 2
1_V 3
1_W 4
1_X 5
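If you later need the original MultiIndex back after reading from the store, a minimal sketch (assuming the '<A>_<B>' labels built above) is to split the flattened index again:
result = store.get('df')
# split the '0_X'-style labels back into the two original levels
result.index = result.index.str.split('_', expand=True)
result.index.names = ['A', 'B']
# note: level 'A' comes back as strings here; cast it back to int if needed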
I have a large DataFrame of observations, i.e.:
value 1,value 2
a,1
a,1
a,2
b,3
a,3
I now have an external DataFrame of values
_ ,a,b
1 ,10,20
2 ,30,40
3 ,50,60
What would be an efficient way to add the values from the indexed table to the first DataFrame? i.e.:
value 1,value 2, new value
a,1,10
a,1,10
a,2,30
b,3,60
a,3,50
An alternative solution uses .lookup(). It's a one-line, vectorized solution, suitable for large datasets.
import pandas as pd
import numpy as np
# generate some artificial data
# ================================
np.random.seed(0)
df1 = pd.DataFrame(dict(value1=np.random.choice('a b'.split(), 10), value2=np.random.randint(1, 10, 10)))
df2 = pd.DataFrame(dict(a=np.random.randn(10), b=np.random.randn(10)), columns=['a', 'b'], index=np.arange(1, 11))
df1
Out[178]:
value1 value2
0 a 6
1 b 3
2 b 5
3 a 8
4 b 7
5 b 9
6 b 9
7 b 2
8 b 7
9 b 8
df2
Out[179]:
a b
1 2.5452 0.0334
2 1.0808 0.6806
3 0.4843 -1.5635
4 0.5791 -0.5667
5 -0.1816 -0.2421
6 1.4102 1.5144
7 -0.3745 -0.3331
8 0.2752 0.0474
9 -0.9608 1.4627
10 0.3769 1.5350
# processing: one liner lookup function
# =======================================================
# df1.value2 is the index and df1.value1 is the column
df1['new_values'] = df2.lookup(df1.value2, df1.value1)
df1
Out[181]:
value1 value2 new_values
0 a 6 1.4102
1 b 3 -1.5635
2 b 5 -0.2421
3 a 8 0.2752
4 b 7 -0.3331
5 b 9 1.4627
6 b 9 1.4627
7 b 2 0.6806
8 b 7 -0.3331
9 b 8 0.0474
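One caveat worth noting: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so on recent versions you need an equivalent. A sketch using label-to-position indexing on the same df1/df2 as above:
# map row labels (value2) and column labels (value1) to integer positions
rows = df2.index.get_indexer(df1.value2)
cols = df2.columns.get_indexer(df1.value1)
df1['new_values'] = df2.to_numpy()[rows, cols]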
Assuming your first and second dfs are df and df1 respectively, you can merge on the matching columns and then mask the 'a' and 'b' conditions:
In [9]:
df = df.merge(df1, left_on=['value 2'], right_on=['_'])
a_mask = (df['value 2'] == df['_']) & (df['value 1'] == 'a')
b_mask = (df['value 2'] == df['_']) & (df['value 1'] == 'b')
df.loc[a_mask, 'new value'] = df['a'].where(a_mask)
df.loc[b_mask, 'new value'] = df['b'].where(b_mask)
df
Out[9]:
value 1 value 2 _ a b new value
0 a 1 1 10 20 10
1 a 1 1 10 20 10
2 a 2 2 30 40 30
3 b 3 3 50 60 60
4 a 3 3 50 60 50
You can then drop the additional columns:
In [11]:
df = df.drop(['_','a','b'], axis=1)
df
Out[11]:
value 1 value 2 new value
0 a 1 10
1 a 1 10
2 a 2 30
3 b 3 60
4 a 3 50
Another way is to define a func to perform the lookup:
In [15]:
def func(x):
    row = df1[(df1['_'] == x['value 2'])]
    return row[x['value 1']].values[0]

df['new value'] = df.apply(lambda x: func(x), axis = 1)
df
Out[15]:
value 1 value 2 new value
0 a 1 10
1 a 1 10
2 a 2 30
3 b 3 60
4 a 3 50
EDIT
Using @Jianxun Li's lookup works, but you have to offset the index as your index is 0-based:
In [20]:
df['new value'] = df1.lookup(df['value 2'] - 1, df['value 1'])
df
Out[20]:
value 1 value 2 new value
0 a 1 10
1 a 1 10
2 a 2 30
3 b 3 60
4 a 3 50
I have 2 csv files. Each contains a data set with multiple columns and an ASSET_ID column. I used pandas to read each csv file in as df1 and df2. My problem has been trying to define a function that iterates over the ASSET_ID values in df1 and compares each value against all the ASSET_ID values in df2. From there I want to return all the rows from df1 whose ASSET_IDs matched df2. Any help would be appreciated; I've been working on this for hours with little to show for it. dtypes are float or int.
My configuration: Windows XP, Python 2.7, Anaconda distribution.
Create a boolean mask of the values; this will index the rows where the 2 dfs match. There's no need to iterate, and it's much faster.
Example:
# define a list of values
a = list('abcdef')
b = range(6)
df = pd.DataFrame({'X':pd.Series(a),'Y': pd.Series(b)})
# c has x values for 'a' and 'd' so these should not match
c = list('xbcxef')
df1 = pd.DataFrame({'X':pd.Series(c),'Y': pd.Series(b)})
print(df)
print(df1)
X Y
0 a 0
1 b 1
2 c 2
3 d 3
4 e 4
5 f 5
[6 rows x 2 columns]
X Y
0 x 0
1 b 1
2 c 2
3 x 3
4 e 4
5 f 5
[6 rows x 2 columns]
In [4]:
# now index your df using boolean condition on the values
df[df.X == df1.X]
Out[4]:
X Y
1 b 1
2 c 2
4 e 4
5 f 5
[4 rows x 2 columns]
EDIT:
So if you have series of different lengths then that won't work, in which case you can use isin:
So create 2 dataframes of different lengths:
a = list('abcdef')
b = range(6)
d = range(10)
df = pd.DataFrame({'X':pd.Series(a),'Y': pd.Series(b)})
c = list('xbcxefxghi')
df1 = pd.DataFrame({'X':pd.Series(c),'Y': pd.Series(d)})
print(df)
print(df1)
X Y
0 a 0
1 b 1
2 c 2
3 d 3
4 e 4
5 f 5
[6 rows x 2 columns]
X Y
0 x 0
1 b 1
2 c 2
3 x 3
4 e 4
5 f 5
6 x 6
7 g 7
8 h 8
9 i 9
[10 rows x 2 columns]
Now use isin to select rows from df1 where the ids exist in df:
In [7]:
df1[df1.X.isin(df.X)]
Out[7]:
X Y
1 b 1
2 c 2
4 e 4
5 f 5
[4 rows x 2 columns]
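Applied to the original question, the same isin idea should return every row of df1 whose ASSET_ID also appears in df2 (a sketch assuming the column is literally named ASSET_ID in both frames):
matched = df1[df1['ASSET_ID'].isin(df2['ASSET_ID'])]
print(matched)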