Get integer row index of a MultiIndex Series in pandas

I have a pandas Series with a MultiIndex, and I want to get the integer row numbers that belong to one level of the MultiIndex.
For example, if I have sample data s
s = pandas.Series([10, 23, 2, 19],
                  index=pandas.MultiIndex.from_product([['a', 'b'], ['c', 'd']]))
which looks like this:
a  c    10
   d    23
b  c     2
   d    19
I want to get the row numbers that correspond to the level b. So here, I'd get [2, 3] as the output, because the last two rows are under b. Also, I really only need the first row that belongs under b.
I wanted to get the numbers so that I can compare across Series. Say I have five Series objects with a b level. These are time-series data, and b corresponds to a condition that was present during some of the observations (and c is a sub-condition, etc). I want to see which Series had the conditions present at the same time.
Edit: To clarify, I don't need to compare the values themselves, just the indices. For example, in R if I had this dataframe:
d = data.frame(col_1 = c('a','a','b','b'), col_2 = c('c','d','c','d'), col_3 = runif(4))
Then the command which(d$col_1 == 'b') would produce the results I want.

If the level you want to select on is the outermost one, you can use loc:
s.loc['b']
To get the first row, I find the head method the easiest:
s.loc['b'].head(1)
The idiomatic way to do the second part of your question is as follows. Say your series are named series1, series2 and series3.
big_series = pd.concat([series1, series2, series3])
big_series.loc['b']
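Note that loc returns the matching sub-series, not the integer positions. If you need the row numbers themselves (the equivalent of R's which), one option is to compare against the values of the outer level; a minimal sketch using the sample series from the question:
import numpy as np
import pandas as pd

s = pd.Series([10, 23, 2, 19],
              index=pd.MultiIndex.from_product([['a', 'b'], ['c', 'd']]))

# Boolean mask over the outermost level, then the positions where it is True.
positions = np.where(s.index.get_level_values(0) == 'b')[0]
print(positions)      # [2 3]
print(positions[0])   # 2 -- the first row under 'b'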

Related

DataFrame: remove similar rows based on two columns

I have the following dataset, which lists the correlation between pairs of columns:
[Screenshot of the dataset, showing the correlation of two columns at left]
If you look at rows 3 and 42, you will find they are the same; only the column positions are swapped, which does not affect the correlation. I want to remove row 42. The dataset has many such rows of mirrored values, so I need a general algorithm to remove them and keep only unique pairs.
As the correlation value is the same either way, the relation is commutative, so whatever the value, you only have to focus on the first two columns. Sort each pair and remove the duplicates:
# You can probably replace 'sorted' by 'set'
key = df[['source_column', 'destination_column']] \
        .apply(lambda x: tuple(sorted(x)), axis='columns')
out = df.loc[~key.duplicated()]
>>> out
  source_column destination_column  correlation_Value
0             A                  B                  1
2             C                  E                  2
3             D                  F                  4
You could also try a self-join. Without a reproducible example it's hard to be precise, but something like this may work:
df.merge(df, left_on="source_column", right_on="destination_column")
You can follow that up with a call to drop_duplicates.
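For reference, a self-contained sketch of the sorted-pair approach (the data below is made up to mirror the screenshot, so treat the values as placeholders):
import numpy as np
import pandas as pd

# Invented sample: rows 0 and 1 are mirror images of each other.
df = pd.DataFrame({'source_column':      ['A', 'B', 'C', 'D'],
                   'destination_column': ['B', 'A', 'E', 'F'],
                   'correlation_Value':  [1, 1, 2, 4]})

# np.sort along axis=1 canonicalises each pair, so mirrored rows compare equal.
pair = pd.DataFrame(np.sort(df[['source_column', 'destination_column']].values, axis=1),
                    index=df.index)
out = df.loc[~pair.duplicated()]
print(out)   # row 1 (B, A) is dropped as a mirror of row 0 (A, B)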

How to iterate and calculate over a pandas multi-index dataframe

I have a pandas multi-index dataframe:
>>> df
                     0         1
first second
A     one     0.991026  0.734800
      two     0.582370  0.720825
B     one     0.795826 -1.155040
      two     0.013736 -0.591926
C     one    -0.538078  0.291372
      two     1.605806  1.103283
D     one    -0.617655 -1.438617
      two     1.495949 -0.936198
I'm trying to find an efficient way to divide each number in column 0 by the maximum value in column 1 within the same group of index level "first", and make this into a third column. Is there a simple, efficient method for doing this that doesn't require multiple for loops?
Use Series.div with max to get the maximal value per first level:
print(df[1].max(level=0))
first
A    0.734800
B   -0.591926
C    1.103283
D   -0.936198
Name: 1, dtype: float64
df['new'] = df[0].div(df[1].max(level=0))
print(df)
                     0         1       new
first second
A     one     0.991026  0.734800  1.348702
      two     0.582370  0.720825  0.792556
B     one     0.795826 -1.155040 -1.344469
      two     0.013736 -0.591926 -0.023206
C     one    -0.538078  0.291372 -0.487706
      two     1.605806  1.103283  1.455480
D     one    -0.617655 -1.438617  0.659748
      two     1.495949 -0.936198 -1.597898
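A caveat for newer pandas: the level= keyword on max was deprecated in the 1.x series and removed in 2.0. A sketch of the equivalent using groupby with transform, which broadcasts the per-group maximum back to the original rows (same frame assumed):
df['new'] = df[0] / df[1].groupby(level='first').transform('max')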

Is there a way to allocate sorted values in a dataframe to groups based on alternating elements?

I have a Pandas DataFrame like:
   COURSE BIB#  COURSE 1  COURSE 2  STRAIGHT-GLIDING     MEAN  PRESTASJON
1            2    20.220    22.535             19.91  21.3775    1.073707
0            1    21.235    23.345             20.69  22.2900    1.077332
This is from a pilot study, and the DataFrame may be much longer when we run the real experiment. Now that I have calculated the performance for each BIB#, I want to allocate them into two groups based on their performance. I have therefore written the following code:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
This sorts values in the DataFrame. Now I want to assign even rows to one group and odd rows to another. How can I do this?
I'm not sure what to look for. I have looked at the documentation for Python's random module, but that is not exactly what I need. I have seen some questions pointing to a scikit-learn stratification function, but I don't know if that is a good choice. Alternatively, is there a way to write a loop that accomplishes this? I appreciate your help.
[Figure: illustration of the desired alternating group assignment]
How about this:
threshold = 0.5
df1['group'] = df1['PRESTASJON'] > threshold
Or if you want labelled values for your groups:
import numpy as np

df1['group'] = np.where(df1['PRESTASJON'] > threshold, 'A', 'B')
Here, 'A' is assigned in the 'group' column where PRESTASJON exceeds the threshold, and 'B' otherwise.
UPDATE: Per OP's update on the post, if you want to split the rows alternately into two groups:
# sort the dataframe by the PRESTASJON column
df1 = df1.sort_values(by='PRESTASJON')
# create the new column with default value 'A', then set every second row
# (odd positions after sorting) to 'B'; 'group' is the last column, hence -1
df1['group'] = 'A'
df1.iloc[1::2, -1] = 'B'
Are you splitting the dataframe alternately? If so, you can do:
import numpy as np

df1 = df1.sort_values(by='PRESTASJON', ascending=True)
for i, d in df1.groupby(np.arange(len(df1)) % 2):
    print(f'group {i}')
    print(d)
Another way, without groupby:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
mask = np.arange(len(df1)) % 2
group1 = df1.loc[mask == 0]
group2 = df1.loc[mask == 1]
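For what it's worth, a compact end-to-end sketch combining the sort with the alternating assignment; the BIB#/PRESTASJON values below are invented for illustration:
import numpy as np
import pandas as pd

# Hypothetical pilot data -- values made up.
df1 = pd.DataFrame({'BIB#': [1, 2, 3, 4],
                    'PRESTASJON': [1.077332, 1.073707, 1.081201, 1.069455]})

df1 = df1.sort_values(by='PRESTASJON').reset_index(drop=True)
# Alternate A/B down the sorted frame so both groups are balanced on performance.
df1['group'] = np.where(np.arange(len(df1)) % 2 == 0, 'A', 'B')
print(df1)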

How to get the mean of a subset of rows after using groupby?

I want to get the average of a particular subset of rows in one particular column in my dataframe.
I can use
df['C'].iloc[2:9].mean()
to get the mean of just the rows I want from my original DataFrame, but my problem is that I want to perform this operation after a groupby.
I am building on
df.groupby(["A", "B"])['C'].mean()
whereby 11 values of 'C' fall into each group once I group by columns A and B, and I get the average of all 11. I actually only want the average of the 3rd through 9th values, so ideally what I would do is
df.groupby(["A", "B"])['C'].iloc[2:9].mean()
This would take the 11 values of column C within every (A, B) group and return the mean of the 3rd through 9th, but I know I can't chain iloc like this. The error suggests using the apply method, but I can't seem to figure it out.
Any help would be appreciated.
You can use the agg function after the groupby, then subset within each group and take the mean:
# A dummy data frame to demonstrate
df = pd.DataFrame({'A': ['a']*22, 'B': ['b1']*11 + ['b2']*11, 'C': list(range(11))*2})

df.groupby(['A', 'B'])['C'].agg(lambda g: g.iloc[2:9].mean())
# A  B
# a  b1    5.0
#    b2    5.0
# Name: C, dtype: float64
Try this variant:
for key, grp in df.groupby(["A", "B"]):
    print(grp['C'].iloc[2:9].mean())

Splitting multi-index indices around a given level

Say I have a MultiIndex DataFrame like the following:
                   X         Y
A    B
bar  one    0.717822 -0.421127
     three -0.763407 -0.306909
flux six   -1.504799  0.977983
     three -0.202268  1.971939
foo  five   1.715336 -0.157881
     one    0.942614 -1.529973
     two   -1.918896 -0.989882
     two    0.434202  1.438424
I would like to create a new column, new, such that within each value of A the column is H for half of the B entries and L for the other half.
I am looking for an answer that makes no assumptions about the location of the levels in the index (i.e. the solution should refer to levels by names).
In the example above, one possible such assignment would look like the following:
                   X         Y new
A    B
bar  one    0.717822 -0.421127   H
     three -0.763407 -0.306909   L
flux six   -1.504799  0.977983   H
     three -0.202268  1.971939   L
foo  five   1.715336 -0.157881   H
     one    0.942614 -1.529973   H
     two   -1.918896 -0.989882   L
     two    0.434202  1.438424   L
How can I do this in Pandas?
I first created a series with a relative cumulative count within each group (grouped on level A), and then assigned "H"/"L" to the values below/above 0.5:
In [118]: s = df.groupby(level='A').cumcount() / df.groupby(level='A').size()
In [119]: df['new'] = 'H'
In [120]: df.loc[s>=0.5, 'new'] = 'L'
Update: the division does not seem to work with pandas 0.13.1 (but does with master/0.14). Instead you can use the div method and explicitly specify the level:
s = df.groupby(level='A').cumcount().div(df.groupby(level='A').size(), level='A')
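On current pandas the level-aware div is no longer needed: groupby with transform('size') returns the group sizes already aligned to the original rows, so the fraction can be built directly (a sketch, same frame assumed):
import numpy as np

frac = df.groupby(level='A').cumcount() / df.groupby(level='A')['X'].transform('size')
df['new'] = np.where(frac < 0.5, 'H', 'L')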
