Say I have a dataframe with columns A, B, C, and data.
I would like to:
Convert it to a multi-index dataframe with indices A, B and C
Sort the rows by the indices A and B of this dataframe.
Within each A B pair of the index, sort the rows (i.e. the C index) by the value on the column data.
Get the top 20 rows within each such A B pair, according to the previous sorting on data.
This shouldn't be hard, but I have tried all sorts of approaches, and none of them give me what I want. The following, for example, is close, but it gives me only values for the first group of A B indices.
temp = mdf.set_index(['A', 'B','C']).sort_index()
# Sorting by value and retrieving the top 20 entries:
func = lambda x: x.sort('data', ascending=False).head(20)
temp = temp.groupby(level=['A','B'],as_index=False).apply(func)
# Drop the dummy index (?) introduced in the line above
temp = temp.reset_index(level=0)['data']
Update:
def create_random_multi_index():
    df = pd.DataFrame({'A' : [np.random.random_integers(10) for x in xrange(500)],
                       'B' : [np.random.random_integers(10) for x in xrange(500)],
                       'C' : [np.random.random_integers(10) for x in xrange(500)],
                       'data' : randn(500) })
    return df
E.g. of what I am looking for (showing top 3 elements, note how the data is sorted within each A-B pair) :
data
A B C
1 1 10 2.057864
5 1.234252
7 0.235246
2 7 1.309126
6 0.450208
8 0.397360
2 2 2 1.609126
1 0.250208
4 0.597360
...
I'm not sure I understand 100% what you want, but I think this will do it. When you reset the index, the rows stay in the same order. The key is sortlevel(), which sorts the levels lexicographically (and the remaining levels on ties). In 0.14 (coming soon) there is an option sort_remaining which you can play with, I think.
In [48]: np.random.seed(1234)
In [49]: df = pd.DataFrame({'A' : [np.random.random_integers(10) for x in xrange(500)],
....: 'B' : [np.random.random_integers(10) for x in xrange(500)],
....: 'C' : [np.random.random_integers(10) for x in xrange(500)],
....: 'data' : randn(500) })
First set the index, then sort it and reset.
Then groupby A,B and pull out the first 20 biggest elements.
df.set_index(['A','B','C']).sortlevel().reset_index().groupby(
['A','B']).apply(lambda x: x.sort(columns='data',ascending=False).head(20)).set_index(['A','B','C'])
Out[8]:
data
A B C
1 1 1 0.959688
2 0.918230
2 0.731919
10 0.212463
1 0.103644
1 -0.035266
2 8 1.459579
8 1.277935
5 -0.075886
2 -0.684101
3 -0.928110
3 5 0.675987
4 0.065301
5 -0.800067
7 -1.349503
4 4 1.167308
8 1.148327
9 0.417590
6 -1.274146
10 -2.656304
5 2 -0.962994
1 -0.982679
6 2 1.410920
6 1.352527
10 0.510330
4 0.033275
1 -0.679686
10 -0.896797
1 -2.858669
7 8 -0.219342
8 -0.591054
2 -0.773227
1 -0.781850
3 -1.259089
10 -1.387992
10 -1.891734
8 7 1.578855
2 -0.498898
9 3 0.644277
8 0.572177
2 0.058431
9 -0.146912
4 -0.334690
10 9 0.795346
8 -0.137661
10 -1.335385
2 1 9 1.309405
3 0.328546
5 0.198422
1 -0.561974
3 -0.578069
2 5 0.645426
1 -0.138808
5 -0.400199
5 -0.513738
10 -0.667343
9 -1.983470
3 3 1.210882
6 0.894201
3 0.743652
...
[500 rows x 1 columns]
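For readers on a recent pandas, where sortlevel() and sort(columns=...) are no longer available, the same idea can be sketched with sort_values and groupby(...).head. This is a minimal sketch, not the answer's original code: the random data is generated with np.random.default_rng (so the values differ from the output above), and top_n is a name introduced here for illustration (3 here, as in the question's example; use 20 for the actual request).

```python
import numpy as np
import pandas as pd

# Assumption: stand-in data with the same shape as in the question.
rng = np.random.default_rng(1234)
df = pd.DataFrame({
    "A": rng.integers(1, 11, size=500),
    "B": rng.integers(1, 11, size=500),
    "C": rng.integers(1, 11, size=500),
    "data": rng.standard_normal(500),
})

top_n = 3  # hypothetical parameter: rows to keep per (A, B) pair

top = (
    # Sort by A and B ascending, then by data descending within each pair.
    df.sort_values(["A", "B", "data"], ascending=[True, True, False])
      # head() keeps the first top_n rows of each group in the sorted order.
      .groupby(["A", "B"], sort=False)
      .head(top_n)
      .set_index(["A", "B", "C"])
)
print(top.head(9))
```

Because groupby(...).head preserves the row order of the sorted frame, no apply with a lambda is needed and no extra index level has to be dropped afterwards.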
Try this
df.sort('data', ascending=False).set_index('C').groupby(['A', 'B']).data.head(3)
It's not the most readable syntax, but it will get the job done
A B C
1 1 9 1.380526
1 0.903524
7 -0.112363
2 2 0.284057
5 0.131392
1 0.111512
Assume I have the following pandas data frame:
my_class value
0 1 1
1 1 2
2 1 3
3 2 4
4 2 5
5 2 6
6 2 7
7 2 8
8 2 9
9 3 10
10 3 11
11 3 12
I want to identify the indices of "my_class" where the class changes and remove n rows after and before this index. The output of this example (with n=2) should look like:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
My approach:
# where class changes happen
s = df['my_class'].ne(df['my_class'].shift(-1).fillna(df['my_class']))
# mask with `bfill` and `ffill`
df[~(s.where(s).bfill(limit=1).ffill(limit=2).eq(1))]
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
One possible solution is to:
Make use of the fact that the index contains consecutive integers.
Find index values where class changes.
For each such index i, generate a sequence of indices from i-2
to i+1 (that is, from i-n to i+n-1 with n=2) and concatenate them.
Retrieve rows with indices not in this list.
The code to do it is:
ind = df[df['my_class'].diff().fillna(0, downcast='infer') == 1].index
df[~df.index.isin([item for sublist in
[ range(i-2, i+2) for i in ind ] for item in sublist])]
import numpy as np
import pandas as pd

my_class = np.array([1] * 3 + [2] * 6 + [3] * 3)
cols = np.c_[my_class, np.arange(len(my_class)) + 1]
df = pd.DataFrame(cols, columns=['my_class', 'value'])
# A non-zero diff marks the first row of a new class
df['diff'] = df['my_class'].diff().fillna(0)
idx2drop = []
for i in df[df['diff'] == 1].index:
    # Collect the 2 rows before and the 2 rows from the change onwards
    idx2drop += range(i - 2, i + 2)
print(df.drop(idx2drop)[['my_class', 'value']])
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
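The answers above hard-code n=2 and detect a change as a difference of exactly 1. A possible generalisation, offered only as a sketch, works on positions instead of index labels and compares neighbouring values for inequality. The function name drop_around_changes and its parameters are made up for this illustration.

```python
import numpy as np
import pandas as pd

def drop_around_changes(df, col, n):
    """Drop the n rows before and the n rows starting at each change in `col`."""
    values = df[col].to_numpy()
    # Positions i (>= 1) where the value differs from the previous row.
    change_pos = np.flatnonzero(values[1:] != values[:-1]) + 1
    # For each change position i, mark positions i-n .. i+n-1.
    drop_pos = (change_pos[:, None] + np.arange(-n, n)).ravel()
    # Discard positions that fall outside the frame.
    drop_pos = drop_pos[(drop_pos >= 0) & (drop_pos < len(df))]
    keep = np.ones(len(df), dtype=bool)
    keep[drop_pos] = False
    return df[keep]

df = pd.DataFrame({
    "my_class": [1] * 3 + [2] * 6 + [3] * 3,
    "value": range(1, 13),
})
result = drop_around_changes(df, "my_class", n=2)
print(result)
```

Since it uses positions, this version does not rely on the index being consecutive integers, and it also treats a jump from class 1 to class 3 as a change.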
I am new to python and the last time I coded was in the mid-80's so I appreciate your patient help.
It seems .rolling(window) requires the window to be a fixed integer. I need a rolling window where the window or lookback period is dynamic and given by another column.
In the table below, I am looking for LookbackSum, which is the rolling sum of Data over the window given by the Lookback column.
d={'Data':[1,1,1,2,3,2,3,2,1,2],
'Lookback':[0,1,2,2,1,3,3,2,3,1],
'LookbackSum':[1,2,3,4,5,8,10,7,8,3]}
df=pd.DataFrame(data=d)
eg:
Data Lookback LookbackSum
0 1 0 1
1 1 1 2
2 1 2 3
3 2 2 4
4 3 1 5
5 2 3 8
6 3 3 10
7 2 2 7
8 1 3 8
9 2 1 3
You can create a custom function for use with df.apply, eg:
def lookback_window(row, values, lookback, method='sum', *args, **kwargs):
    loc = values.index.get_loc(row.name)
    lb = lookback.loc[row.name]
    return getattr(values.iloc[loc - lb: loc + 1], method)(*args, **kwargs)
Then use it as:
df['new_col'] = df.apply(lookback_window, values=df['Data'], lookback=df['Lookback'], axis=1)
There may be some corner cases but as long as your indices align and are unique - it should fulfil what you're trying to do.
Here is one with a list comprehension that iterates over the position and value of the column df['Lookback'], then gets the slice by reversing the values and slicing according to the column value:
df['LookbackSum'] = [sum(df.loc[:e,'Data'][::-1].to_numpy()[:i+1])
for e,i in enumerate(df['Lookback'])]
print(df)
Data Lookback LookbackSum
0 1 0 1
1 1 1 2
2 1 2 3
3 2 2 4
4 3 1 5
5 2 3 8
6 3 3 10
7 2 2 7
8 1 3 8
9 2 1 3
An exercise in pain, if you want to try an almost fully vectorized approach. Sidenote: I don't think it's worth it here. At all.
Inspired by Divakar's answer here
Given:
import numpy as np
import pandas as pd
d={'Data':[1,1,1,2,3,2,3,2,1,2],
'Lookback':[0,1,2,2,1,3,3,2,3,1],
'LookbackSum':[1,2,3,4,5,8,10,7,8,3]}
df=pd.DataFrame(data=d)
Using the function from Divakar's answer, but slightly modified
from skimage.util.shape import view_as_windows as viewW
def strided_indexing_roll(a, r, fill_value=np.nan):
    # Concatenate with sliced to cover all rolls
    p = np.full((a.shape[0],a.shape[1]-1),fill_value)
    a_ext = np.concatenate((p,a,p),axis=1)
    # Get sliding windows; use advanced-indexing to select appropriate ones
    n = a.shape[1]
    return viewW(a_ext,(1,n))[np.arange(len(r)), -r + (n-1),0]
Now, we just need to prepare a 2d array for the data and independently shift the rows according to our desired lookback values.
arr = df['Data'].to_numpy().reshape(1, -1).repeat(len(df), axis=0)
shifter = np.arange(len(df) - 1, -1, -1) #+ d['Lookback'] - 1
temp = strided_indexing_roll(arr, shifter, fill_value=0)
out = strided_indexing_roll(temp, (len(df) - 1 - df['Lookback'])*-1, 0).sum(-1)
Output:
array([ 1, 2, 3, 4, 5, 8, 10, 7, 8, 3], dtype=int64)
We can then just assign it back to the dataframe as needed and check.
df['out'] = out
#output:
Data Lookback LookbackSum out
0 1 0 1 1
1 1 1 2 2
2 1 2 3 3
3 2 2 4 4
4 3 1 5 5
5 2 3 8 8
6 3 3 10 10
7 2 2 7 7
8 1 3 8 8
9 2 1 3 3
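If only the sum is needed, a cumulative sum gives a shorter vectorized route than any of the above. This is a sketch for that special case, not a general variable-window rolling: the sum over rows start..i equals csum[i+1] - csum[start], where csum is the cumulative sum with a leading zero. The names csum, pos and start are introduced here, and the window is assumed to be clipped at the first row when the lookback reaches past it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Data": [1, 1, 1, 2, 3, 2, 3, 2, 1, 2],
    "Lookback": [0, 1, 2, 2, 1, 3, 3, 2, 3, 1],
})

data = df["Data"].to_numpy()
lookback = df["Lookback"].to_numpy()

# csum[k] holds the sum of the first k values, so csum[0] == 0.
csum = np.concatenate(([0], np.cumsum(data)))
pos = np.arange(len(df))
# First row of each window, clipped so it never goes before row 0.
start = np.clip(pos - lookback, 0, None)

df["LookbackSum"] = csum[pos + 1] - csum[start]
print(df)
```

The same trick does not carry over directly to methods such as max or median, for which the apply-based function above remains the more flexible choice.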
I have a dataframe with a lot of tweets and I want to remove the duplicates. The tweets are stored in fh1.df['Tweets']. i counts the number of non-duplicates, j the number of duplicates. In the else branch I remove the rows of the duplicates, and in the if branch I build a new list "tweetChecklist" where I put all the good tweets.
OK, if I do i + j, I get the number of original tweets, so that's good. But in the else branch, I don't know why, it removes too many rows: the shape of my dataframe is much smaller after the for loop (1/10).
How does the "fh1.df = fh1.df[fh1.df.Tweets != current_tweet]" line remove too many rows?
tweetChecklist = []
for current_tweet in fh1.df['Tweets']:
    if current_tweet not in tweetChecklist:
        i = i + 1
        tweetChecklist.append(current_tweet)
    else:
        j = j + 1
        fh1.df = fh1.df[fh1.df.Tweets != current_tweet]
fh1.df['Tweets'] = pd.Series(tweetChecklist)
NOTE
Graipher's solution tells you how to generate a unique dataframe. My answer tells you why your current operation removes too many rows (per your question).
END NOTE
When you enter the "else" statement to remove the duplicated tweet you are removing ALL of the rows that have the specified tweet. Let's demonstrate:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.randint(0, 10, (10, 5)), columns=list('ABCDE'))
What does this make:
Out[118]:
A B C D E
0 2 7 0 5 4
1 2 8 8 3 7
2 9 7 4 6 2
3 9 7 7 9 2
4 6 5 7 6 8
5 8 8 7 6 7
6 6 1 4 5 3
7 1 4 7 8 7
8 3 2 5 8 5
9 5 8 9 2 4
In your method (assume you want to remove duplicates from "A" instead of "Tweets") you would end up with only the rows whose value in "A" occurs exactly once; every row with a repeated value is dropped, including its first occurrence:
Out[118]:
A B C D E
5 8 8 7 6 7
7 1 4 7 8 7
8 3 2 5 8 5
9 5 8 9 2 4
If you just want to make this unique, implement Graipher's suggestion. If you want to count how many duplicates you have you can do this:
total = df.shape[0]
duplicates = total - df.A.unique().size
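To make the difference concrete, here is a small sketch on made-up data (the six one-letter "tweets" are an assumption for illustration). It contrasts the != filter from the question, which removes every copy of a value, with DataFrame.duplicated, which flags only the second and later occurrences.

```python
import pandas as pd

tweets = pd.DataFrame({"Tweets": ["a", "b", "a", "c", "b", "a"]})

# The filter from the question: drops ALL rows equal to "a", first one included.
filtered = tweets[tweets.Tweets != "a"]

# duplicated() marks only repeats, so the first occurrence survives.
is_dup = tweets.duplicated(subset="Tweets", keep="first")
deduped = tweets[~is_dup]
n_duplicates = int(is_dup.sum())

print(len(filtered))              # rows left after the != filter
print(deduped.Tweets.tolist())    # one row per distinct tweet
print(n_duplicates)               # how many rows were repeats
```

Here is_dup.sum() plays the role of j in the question's loop, and len(deduped) the role of i.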
In pandas there is usually a better way than iterating over the dataframe with a for loop.
In this case, what you really want is to group equal tweets together and just retain the first one. This can be achieved with pandas.DataFrame.groupby:
import random
import string
import pandas as pd
# some random one character tweets, so there are many duplicates
df = pd.DataFrame({"Tweets": random.choices(string.ascii_lowercase, k=100),
"Data": [random.random() for _ in range(100)]})
df.groupby("Tweets", as_index=False).first()
# Tweets Data
# 0 a 0.327766
# 1 b 0.677697
# 2 c 0.517186
# 3 d 0.925312
# 4 e 0.748902
# 5 f 0.353826
# 6 g 0.991566
# 7 h 0.761849
# 8 i 0.488769
# 9 j 0.501704
# 10 k 0.737816
# 11 l 0.428117
# 12 m 0.650945
# 13 n 0.530866
# 14 o 0.337835
# 15 p 0.567097
# 16 q 0.130282
# 17 r 0.619664
# 18 s 0.365220
# 19 t 0.005407
# 20 u 0.905659
# 21 v 0.495603
# 22 w 0.511894
# 23 x 0.094989
# 24 y 0.089003
# 25 z 0.511532
Even better, there is a method explicitly for that, pandas.DataFrame.drop_duplicates, which is about twice as fast:
df.drop_duplicates(subset="Tweets", keep="first")
Here is my data:
import numpy as np
import pandas as pd
z = pd.DataFrame({'a':[1,1,1,2,2,3,3],'b':[3,4,5,6,7,8,9], 'c':[10,11,12,13,14,15,16]})
z
a b c
0 1 3 10
1 1 4 11
2 1 5 12
3 2 6 13
4 2 7 14
5 3 8 15
6 3 9 16
Question:
How can I do a calculation on different elements of each subgroup? For example, for each group, I want to extract every element in column 'c' whose corresponding element in column 'b' is between 4 and 9, and sum them all.
Here is the code I wrote: (It runs but I cannot get the correct result)
gbz = z.groupby('a')
# For displaying the groups:
gbz.apply(lambda x: print(x))
list = []
def f(x):
    list_new = []
    for row in range(0,len(x)):
        if (x.iloc[row,0] > 4 and x.iloc[row,0] < 9):
            list_new.append(x.iloc[row,1])
    list.append(sum(list_new))
results = gbz.apply(f)
The output result should be something like this:
a c
0 1 12
1 2 27
2 3 15
It might just be easiest to change the order of operations, and filter against your criteria first - it does not change after the groupby.
z.query('4 < b < 9').groupby('a', as_index=False).c.sum()
which yields
a c
0 1 12
1 2 27
2 3 15
Use
In [2379]: z[z.b.between(4, 9, inclusive=False)].groupby('a', as_index=False).c.sum()
Out[2379]:
a c
0 1 12
1 2 27
2 3 15
Or
In [2384]: z[(4 < z.b) & (z.b < 9)].groupby('a', as_index=False).c.sum()
Out[2384]:
a c
0 1 12
1 2 27
2 3 15
You could also groupby first.
z = z.groupby('a').apply(lambda x: x.loc[x['b']\
.between(4, 9, inclusive=False), 'c'].sum()).reset_index(name='c')
z
a c
0 1 12
1 2 27
2 3 15
Or you can use
z.groupby('a').apply(lambda x : sum(x.loc[(x['b']>4)&(x['b']<9),'c']))\
.reset_index(name='c')
Out[775]:
a c
0 1 12
1 2 27
2 3 15
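A note for newer pandas versions: the boolean form inclusive=False used with between in the answers above was deprecated in pandas 1.3 in favour of string values, and "neither" is the equivalent of False. The sketch below assumes pandas 1.3 or later and reuses the question's data.

```python
import pandas as pd

z = pd.DataFrame({
    "a": [1, 1, 1, 2, 2, 3, 3],
    "b": [3, 4, 5, 6, 7, 8, 9],
    "c": [10, 11, 12, 13, 14, 15, 16],
})

# Strictly between 4 and 9: both endpoints excluded.
mask = z["b"].between(4, 9, inclusive="neither")

# Filter first, then sum column c within each group of a.
result = z[mask].groupby("a", as_index=False)["c"].sum()
print(result)
```

One design difference between the two families of answers: filtering before the groupby drops any group that has no matching rows, whereas the groupby-then-apply variants keep such a group with a sum of 0.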
I have got the following pandas data frame
Y X id WP_NER
0 35.973496 -2.734554 1 WP_01
1 35.592138 -2.903913 2 WP_02
2 35.329853 -3.391070 3 WP_03
3 35.392608 -3.928513 4 WP_04
4 35.579265 -3.942995 5 WP_05
5 35.519728 -3.408771 6 WP_06
6 35.759485 -3.078903 7 WP_07
I'd like to round the Y and X columns using pandas.
How can I do that?
You can now use round on a dataframe
Option 1
In [661]: df.round({'Y': 2, 'X': 2})
Out[661]:
Y X id WP_NER
0 35.97 -2.73 1 WP_01
1 35.59 -2.90 2 WP_02
2 35.33 -3.39 3 WP_03
3 35.39 -3.93 4 WP_04
4 35.58 -3.94 5 WP_05
5 35.52 -3.41 6 WP_06
6 35.76 -3.08 7 WP_07
Option 2
In [662]: cols = ['Y', 'X']
In [663]: df[cols] = df[cols].round(2)
In [664]: df
Out[664]:
Y X id WP_NER
0 35.97 -2.73 1 WP_01
1 35.59 -2.90 2 WP_02
2 35.33 -3.39 3 WP_03
3 35.39 -3.93 4 WP_04
4 35.58 -3.94 5 WP_05
5 35.52 -3.41 6 WP_06
6 35.76 -3.08 7 WP_07
You can apply round:
In [142]:
df[['Y','X']].apply(pd.Series.round)
Out[142]:
Y X
0 36 -3
1 36 -3
2 35 -3
3 35 -4
4 36 -4
5 36 -3
6 36 -3
If you want to apply to a specific number of places:
In [143]:
df[['Y','X']].apply(lambda x: pd.Series.round(x, 3))
Out[143]:
Y X
0 35.973 -2.735
1 35.592 -2.904
2 35.330 -3.391
3 35.393 -3.929
4 35.579 -3.943
5 35.520 -3.409
6 35.759 -3.079
EDIT
You assign the above to the columns you want to modify like the following:
In [144]:
df[['Y','X']] = df[['Y','X']].apply(lambda x: pd.Series.round(x, 3))
df
Out[144]:
Y X id WP_NER
0 35.973 -2.735 1 WP_01
1 35.592 -2.904 2 WP_02
2 35.330 -3.391 3 WP_03
3 35.393 -3.929 4 WP_04
4 35.579 -3.943 5 WP_05
5 35.520 -3.409 6 WP_06
6 35.759 -3.079 7 WP_07
round is smart enough to affect only float columns, so the simplest solution is just:
df = df.round(2)
You can do the following:
df['column_name'] = df['column_name'].apply(lambda x: round(x,2) if isinstance(x, float) else x)
This also checks whether the value of the cell is a float; if it is not, the same value is returned. That matters because a cell value can be a string or NaN.
You can also first check which columns are of type float, then round those columns:
import math
for col in df.select_dtypes(include=['float']).columns:
    df[col] = df[col].apply(lambda x: x if(math.isnan(x)) else round(x,1))
This also avoids potential errors when trying to round NaN values, thanks to the math.isnan(x) check.
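The column selection can also be combined with the vectorized DataFrame.round, which avoids the per-cell apply and leaves NaN as NaN. A minimal sketch, assuming a two-row frame modelled on the question's data with one NaN added for illustration:

```python
import numpy as np
import pandas as pd

# Assumption: a shortened copy of the question's frame, with a NaN inserted.
df = pd.DataFrame({
    "Y": [35.973496, np.nan],
    "X": [-2.734554, -2.903913],
    "id": [1, 2],
    "WP_NER": ["WP_01", "WP_02"],
})

# Pick the float columns, then round them all in one vectorized call.
float_cols = df.select_dtypes(include="float").columns
df[float_cols] = df[float_cols].round(2)
print(df)
```

The integer and string columns are never passed to round here, so their values and dtypes stay as they were.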