Pandas DataFrame group by value and get column & row indexes - python

I have a pandas DataFrame like following.
df = pandas.DataFrame(np.random.randn(5,5), columns=['1','2','3','4','5'])

          1         2         3         4         5
0  0.877455 -1.215212 -0.453038 -1.825135  0.440646
1  1.640132 -0.031353  1.159319 -0.615796  0.763137
2  0.132355 -0.762932 -0.909496 -1.012265 -0.695623
3 -0.257547 -0.844019  0.143689 -2.079521  0.796985
4  2.536062 -0.730392  1.830385  0.694539 -0.654924
I need to get row and column indexes for following three groups. (In my original dataset there are no negative values)
value is greater than 2.0
value is between 1.0 - 2.0
value is less than 1.0
For example, for "value is greater than 2.0" it should return [1,4]. I have tried the following, but it only gives a boolean result.
df.values > 2

You can use np.where on the boolean result to extract the indices:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5,5),columns=['1','2','3','4','5'])
condition = df.values > 2
print(np.column_stack(np.where(condition)))
For a df like this:
          1         2         3         4         5
0  0.057347  0.722251  0.263292 -0.168865 -0.111831
1 -0.765375  1.040659  0.272883 -0.834273 -0.126997
2 -0.023589  0.046002  1.206445  0.381532 -1.219399
3  2.290187  2.362249 -0.748805 -1.217048 -0.973749
4  0.100084  0.671120 -0.211070  0.903264 -0.312815
Output:
[[3 0]
[3 1]]
Or get a list of row-column index pairs if necessary:
print(list(map(list, np.column_stack(np.where(condition)))))
Output:
[[3, 0], [3, 1]]
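An alternative sketch (not from the answer above): np.argwhere returns the (row, column) index pairs directly, one call per requested range. The small frame below is an illustrative stand-in for the original data.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in frame; values chosen so each bucket is non-empty.
df = pd.DataFrame([[0.5, 2.5], [1.5, 0.1]], columns=['1', '2'])

# One boolean mask per group; np.argwhere collects the index pairs.
over_2  = np.argwhere(df.values > 2.0)
between = np.argwhere((df.values >= 1.0) & (df.values <= 2.0))
under_1 = np.argwhere(df.values < 1.0)
```

Each result is an (n, 2) array of [row, column] positions, equivalent to `np.column_stack(np.where(...))`.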

Related

How to find the number of unique values in comma-separated strings stored in a pandas DataFrame column?

            x  Unique_in_x
5,5,6,7,8,6,8            4
      5,9,8,0            4
      5,9,8,0            4
          3,2            2
5,5,6,7,8,6,8            4
Unique_in_x is my expected column. Sometimes the x column might be a string as well.
You can use a list comprehension with a set:
df['Unique_in_x'] = [len(set(x.split(','))) for x in df['x']]
Or use str.split with nunique:
df['Unique_in_x'] = df['x'].str.split(',', expand=True).nunique(axis=1)
Output:
               x  Unique_in_x
0  5,5,6,7,8,6,8            4
1        5,9,8,0            4
2        5,9,8,0            4
3            3,2            2
4  5,5,6,7,8,6,8            4
You can find the unique values with np.unique() and then take the length:
import pandas as pd
import numpy as np
df['Unique_in_x'] = df['x'].apply(lambda x: len(np.unique(x.split(','))))
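If the column can hold non-string values, a sketch that casts to str first (the two-row frame below is an illustrative stand-in, not the asker's data):

```python
import pandas as pd

# astype(str) guards against rows where 'x' arrives as a non-string value.
df = pd.DataFrame({'x': ['5,5,6,7,8,6,8', '3,2']})
df['Unique_in_x'] = df['x'].astype(str).apply(lambda s: len(set(s.split(','))))
```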

Collapsing identical adjacent rows in a Pandas Series

Basically if a column of my pandas dataframe looks like this:
[1 1 1 2 2 2 3 3 3 1 1]
I'd like it to be turned into the following:
[1 2 3 1]
You can write a simple function that loops through the elements of your series only storing the first element in a run.
As far as I know, there is no tool built in to pandas to do this. But it is not a lot of code to do it yourself.
import pandas

example_series = pandas.Series([1, 1, 1, 2, 2, 3])

def collapse(series):
    """Keep only the first element of each run of repeated values."""
    last = ""
    seen = []
    for element in series:
        if element != last:
            last = element
            seen.append(element)
    return seen

collapse(example_series)
In the code above, you will iterate through each element of a series and check if it is the same as the last element seen. If it is not, store it. If it is, ignore the value.
If you need to handle the return value as a series you can change the last line of the function to:
return pandas.Series(seen)
You could write a function that does the following:
x = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1])
y = x - x.shift(1)
y[0] = 1          # the first element has no predecessor, so always keep it
result = x[y != 0]
You can use DataFrame's diff and indexing:
>>> df = pd.DataFrame([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df[0].diff()!=0]
0
0 1
2 2
6 3
10 1
>>> df[df[0].diff()!=0].values.ravel() # If you need an array
array([1, 2, 3, 1])
Same works for Series:
>>> df = pd.Series([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df.diff()!=0].values
array([1, 2, 3, 1])
You can use shift to create a boolean mask to compare the row against the previous row:
In [67]:
s = pd.Series([1,1,2,2,2,2,3,3,3,3,4,4,5])
s[s!=s.shift()]
Out[67]:
0 1
2 2
6 3
10 4
12 5
dtype: int64
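A further sketch, using the standard library instead of vectorised pandas: itertools.groupby batches runs of identical adjacent values, so keeping one key per run collapses the series.

```python
from itertools import groupby
import pandas as pd

# groupby yields one (key, run) pair per run of equal adjacent values.
s = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1])
collapsed = [key for key, _ in groupby(s)]  # -> [1, 2, 3, 1]
```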

simplifying routine in python with numpy array or pandas

The initial problem is the following: I have an initial matrix with, say, 10 rows and 12 columns. For each row, I want to sum every two adjacent columns together. At the end I must still have 10 rows, but only 6 columns. Currently, I am doing the following for loop in Python (using initial, which is a pandas DataFrame):
for i in range(0, 12, 2):
    coarse[i] = initial.iloc[:, i:i+2].sum(axis=1)
In fact, I am quite sure that something more efficient is possible. I am thinking of something like a list comprehension, but for a DataFrame or a numpy array. Does anybody have an idea?
Moreover I would want to know if it is better to manipulate large numpy arrays or pandas DataFrame.
Let's create a small sample dataframe to illustrate the solution:
np.random.seed(0)
df = pd.DataFrame(np.random.rand(6, 3))
>>> df
0 1 2
0 0.548814 0.715189 0.602763
1 0.544883 0.423655 0.645894
2 0.437587 0.891773 0.963663
3 0.383442 0.791725 0.528895
4 0.568045 0.925597 0.071036
5 0.087129 0.020218 0.832620
You can use slice notation to select every other row starting from the first row (::2) and starting from the second row (1::2). iloc is for integer indexing. You need to select the values at these locations, and add them together. The result is a numpy array that you could then convert back into a DataFrame if required.
>>> df.iloc[::2].values + df.iloc[1::2].values
array([[ 1.09369669, 1.13884417, 1.24865749],
[ 0.82102873, 1.68349804, 1.49255768],
[ 0.65517386, 0.94581504, 0.9036559 ]])
You use values to remove the indexing. This is what happens otherwise:
>>> df.iloc[::2] + df.iloc[1::2].values
0 1 2
0 1.093697 1.138844 1.248657
2 0.821029 1.683498 1.492558
4 0.655174 0.945815 0.903656
>>> df.iloc[::2].values + df.iloc[1::2]
0 1 2
1 1.093697 1.138844 1.248657
3 0.821029 1.683498 1.492558
5 0.655174 0.945815 0.903656
For a more general solution:
df = pd.DataFrame(np.random.rand(9, 3))
n = 3 # Number of consecutive rows to group.
df['group'] = [idx // n for idx in range(len(df.index))]
df.groupby('group').sum()
0 1 2
group
0 1.531284 2.030617 2.212320
1 1.038615 1.737540 1.432551
2 1.695590 1.971413 1.902501
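A further sketch (not from the answer above), assuming an even number of columns: reshaping the underlying array turns each adjacent pair of columns into its own axis, so one vectorised sum replaces the loop. The small frame below is a stand-in for the original 10x12 matrix.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the 10x12 frame: 2 rows, 6 columns.
initial = pd.DataFrame(np.arange(12).reshape(2, 6))

values = initial.values
# (rows, cols) -> (rows, cols // 2, 2), then sum the pairs away.
coarse = values.reshape(values.shape[0], -1, 2).sum(axis=2)
```

The result is a numpy array with half as many columns; wrap it in `pd.DataFrame(coarse)` if a frame is needed.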

Python Pandas: Get row by median value

I'm trying to get the row of the median value for a column.
I'm using data.median() to get the median value for 'column'.
id 30444.5
someProperty 3.0
numberOfItems 0.0
column 70.0
And data.median()['column'] subsequently gives:
data.median()['column']
>>> 70.0
How can get the row or index of the median value?
Is there anything similar to idxmax / idxmin?
I tried filtering but it's not reliable in cases multiple rows have the same value.
Thanks!
You can use rank and idxmin and apply it to each column:
import numpy as np
import pandas as pd

def get_median_index(d):
    ranks = d.rank(pct=True)
    close_to_median = abs(ranks - 0.5)
    return close_to_median.idxmin()
df = pd.DataFrame(np.random.randn(13, 4))
df
0 1 2 3
0 0.919681 -0.934712 1.636177 -1.241359
1 -1.198866 1.168437 1.044017 -2.487849
2 1.159440 -1.764668 -0.470982 1.173863
3 -0.055529 0.406662 0.272882 -0.318382
4 -0.632588 0.451147 -0.181522 -0.145296
5 1.180336 -0.768991 0.708926 -1.023846
6 -0.059708 0.605231 1.102273 1.201167
7 0.017064 -0.091870 0.256800 -0.219130
8 -0.333725 -0.170327 -1.725664 -0.295963
9 0.802023 0.163209 1.853383 -0.122511
10 0.650980 -0.386218 -0.170424 1.569529
11 0.678288 -0.006816 0.388679 -0.117963
12 1.640222 1.608097 1.779814 1.028625
df.apply(get_median_index, 0)
0 7
1 7
2 3
3 4
Maybe just: data[data['column'] == data.median()['column']]. Note this only matches rows where the median value actually occurs in the data, which can fail for an even number of rows.
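To see the ranking idea in action, a minimal sketch on an odd-length column (where the true median is guaranteed to be present in the data):

```python
import pandas as pd

def get_median_index(d):
    """Index of the value whose percentile rank is closest to 0.5."""
    ranks = d.rank(pct=True)
    return (ranks - 0.5).abs().idxmin()

# Illustrative series: the median is 30, stored at index 1.
s = pd.Series([10, 30, 20, 50, 40])
median_idx = get_median_index(s)   # -> 1
```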

How do I convert a row from a pandas DataFrame from a Series back to a DataFrame?

I am iterating through the rows of a pandas DataFrame, expanding each one out into N rows with additional info on each one (for simplicity I've made it a random number here):
from pandas import DataFrame
import pandas as pd
from numpy import random, arange

N = 3
x = DataFrame.from_dict({'farm': ['A', 'B', 'A', 'B'],
                         'fruit': ['apple', 'apple', 'pear', 'pear']})
out = DataFrame()
for i, row in x.iterrows():
    rows = pd.concat([row]*N).reset_index(drop=True)  # requires row to be a DataFrame
    out = out.append(rows.join(DataFrame({'iter': arange(N), 'value': random.uniform(size=N)})))
In this loop, row is a Series object, so the call to pd.concat doesn't work as intended. How do I convert it to a DataFrame? (e.g. the difference between x.ix[0:0] and x.ix[0])
Thanks!
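To answer the conversion question directly, a minimal sketch: a Series row becomes a one-row DataFrame via to_frame().T, where the transpose turns the Series index back into columns.

```python
import pandas as pd

x = pd.DataFrame({'farm': ['A', 'B'], 'fruit': ['apple', 'pear']})
row = x.iloc[0]               # a pandas Series (index: farm, fruit)
row_df = row.to_frame().T     # a 1-row DataFrame with the same columns
```

Selecting with a list of labels, e.g. `x.iloc[[0]]`, also yields a one-row DataFrame without any conversion.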
Given what you commented, I would try
def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

results = x.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
This should give you a separate result dataframe. I have assumed that every farm-fruit combination is unique... there might be other ways, if we'd know more about your data.
Update
Running code example
def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

N = 3
df = pd.DataFrame(arange(0, 8).reshape(4, 2), columns=['low', 'high'])
df['farm'] = 'a'
df['fruit'] = arange(0, 4)
results = df.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
df
low high farm fruit
0 0 1 a 0
1 2 3 a 1
2 4 5 a 2
3 6 7 a 3
results
farm fruit
a 0 [0.176124290969, 0.459726835079, 0.999564934689]
1 [2.42920143009, 2.37484506501, 2.41474002256]
2 [4.78918572452, 4.25916442343, 4.77440617104]
3 [6.53831891152, 6.23242754976, 6.75141668088]
If instead you want a dataframe, you can update the function to
def giveMeSomeRows(group):
    return pandas.DataFrame(random.uniform(low=group.low, high=group.high, size=N))
results
0
farm fruit
a 0 0 0.281088
1 0.020348
2 0.986269
1 0 2.642676
1 2.194996
2 2.650600
2 0 4.545718
1 4.486054
2 4.027336
3 0 6.550892
1 6.363941
2 6.702316
