Say I have a MultiIndex DataFrame like the following:
                   X         Y
A    B
bar  one    0.717822 -0.421127
     three -0.763407 -0.306909
flux six   -1.504799  0.977983
     three -0.202268  1.971939
foo  five   1.715336 -0.157881
     one    0.942614 -1.529973
     two   -1.918896 -0.989882
     two    0.434202  1.438424
I would like to create a new column new so that, within each value of A, half of the B entries get new = H and the other half get new = L.
I am looking for an answer that makes no assumptions about the position of the levels in the index (i.e. the solution should refer to the levels by name).
In the example above, one possible such assignment would look like the following:
                   X         Y new
A    B
bar  one    0.717822 -0.421127   H
     three -0.763407 -0.306909   L
flux six   -1.504799  0.977983   H
     three -0.202268  1.971939   L
foo  five   1.715336 -0.157881   H
     one    0.942614 -1.529973   H
     two   -1.918896 -0.989882   L
     two    0.434202  1.438424   L
How can I do this in Pandas?
I first created a series with a relative cumulative count within each group (grouped on level A), and then assigned "H"/"L" to the values below/above 0.5:
In [118]: s = df.groupby(level='A').cumcount() / df.groupby(level='A').size()
In [119]: df['new'] = 'H'
In [120]: df.loc[s>=0.5, 'new'] = 'L'
Update: the division does not seem to work with pandas 0.13.1 (but it does with master/0.14). Instead you can use the div method and explicitly specify the level:
s = df.groupby(level='A').cumcount().div(df.groupby(level='A').size(), level='A')
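On current pandas the plain division aligns on the level again, and the two assignment steps can be collapsed with numpy.where; a minimal sketch, reusing the s computed above:

import numpy as np

# The first half of each 'A' group (relative cumcount < 0.5) gets 'H', the rest 'L'.
df['new'] = np.where(s < 0.5, 'H', 'L')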
Related
I have the following dataset, which lists the correlation of two columns at left:
[dataset not shown]
If you look at rows 3 and 42, you will find they are the same; only the column positions are swapped, which does not affect the correlation. I want to remove row 42. But the dataset has many such rows of mirrored values, so I need a general algorithm to remove them and keep only the unique pairs.
As the correlation_value seems to be the same, the operation should be commutative, so whatever the value, you just have to focus on the first two columns: sort each pair and remove the duplicates.
# You could also use frozenset(x) instead of tuple(sorted(x));
# a plain set would not work, since sets are unhashable.
key = df[['source_column', 'destination_column']] \
        .apply(lambda x: tuple(sorted(x)), axis='columns')
out = df.loc[~key.duplicated()]
>>> out
  source_column destination_column  correlation_Value
0             A                  B                  1
2             C                  E                  2
3             D                  F                  4
You could try a self join. Without a code example it's hard to answer, but maybe something like this:
df.merge(df, left_on="source_column", right_on="destination_column")
You can follow that up with a call to drop_duplicates.
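Getting the self-join bookkeeping right can be fiddly, though; for reference, here is a vectorized variant of the sorted-pair idea above, as a sketch on hypothetical data matching the question's column names:

import numpy as np
import pandas as pd

# Hypothetical data; row B->A mirrors row A->B and should be dropped.
df = pd.DataFrame({'source_column':      ['A', 'C', 'D', 'B'],
                   'destination_column': ['B', 'E', 'F', 'A'],
                   'correlation_Value':  [1, 2, 4, 1]})

# np.sort along axis=1 makes each pair order-independent without a Python-level apply.
pairs = pd.DataFrame(np.sort(df[['source_column', 'destination_column']].to_numpy(), axis=1),
                     index=df.index)
out = df.loc[~pairs.duplicated()]
print(out)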
I'd need a little suggestion on a procedure using pandas. I have a 2-column dataset that looks like this:
A 0.4533
B 0.2323
A 1.2343
A 1.2353
B 4.3521
C 3.2113
C 2.1233
.. ...
where the first column contains strings and the second one floats. Looking at each subgroup given by the keys 'A', 'B', 'C', I would like to filter out the values in each subgroup that differ by more than 20% from the minimum of that subgroup.
What I've tried to do has been:
out = df[df.groupby(stringcol)[floatcol].apply(lambda x: x <= x.min() * 1.2)]
But I'm not sure whether this considers the global minimum over the entire dataset, or the minimum of each subgroup, which is what I want.
Many thanks,
James
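For what it's worth, in the apply above each x is one subgroup, so x.min() is the per-group minimum, not the global one. A transform-based version makes that explicit; a minimal sketch where 'key' and 'value' stand in for stringcol and floatcol:

import pandas as pd

df = pd.DataFrame({'key':   ['A', 'B', 'A', 'A', 'B', 'C', 'C'],
                   'value': [0.4533, 0.2323, 1.2343, 1.2353, 4.3521, 3.2113, 2.1233]})

# transform('min') broadcasts each group's minimum back to its rows,
# so the comparison is always per-subgroup.
group_min = df.groupby('key')['value'].transform('min')
out = df[df['value'] <= group_min * 1.2]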
I have a pandas multi-index dataframe:
>>> df
                     0         1
first second
A     one     0.991026  0.734800
      two     0.582370  0.720825
B     one     0.795826 -1.155040
      two     0.013736 -0.591926
C     one    -0.538078  0.291372
      two     1.605806  1.103283
D     one    -0.617655 -1.438617
      two     1.495949 -0.936198
I'm trying to find an efficient way to divide each number in column 0 by the maximum number in column 1 that shares the same group under index "first", and make this into a third column. Is there a simple, efficient method for doing something like this that doesn't require multiple for loops?
Use Series.div with max to get the maximal value per the first level:
print (df[1].max(level=0))
first
A    0.734800
B   -0.591926
C    1.103283
D   -0.936198
Name: 1, dtype: float64
df['new'] = df[0].div(df[1].max(level=0))
print (df)
                     0         1       new
first second
A     one     0.991026  0.734800  1.348702
      two     0.582370  0.720825  0.792556
B     one     0.795826 -1.155040 -1.344469
      two     0.013736 -0.591926 -0.023206
C     one    -0.538078  0.291372 -0.487706
      two     1.605806  1.103283  1.455480
D     one    -0.617655 -1.438617  0.659748
      two     1.495949 -0.936198 -1.597898
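Note that Series.max(level=...) was deprecated in pandas 1.3 and removed in 2.0; on newer versions a groupby/transform equivalent works, as a sketch:

df['new'] = df[0] / df.groupby(level='first')[1].transform('max')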
I am trying to find the 3 nearest neighbours of each row within a set of 10 rows (each 10 rows is a class), and then average those 3 neighbours.
I need to do this over an array of 400 rows, where each consecutive 10 rows belong to one class.
I think I have managed to capture the 3 nearest neighbours for each row within 'indices' below.
In the output below, 'indices' is a 10x3 matrix.
I'm just not sure how to reference the particular 3 rows of the original xclass that each row of 'indices' refers to, then add them (the challenge) and divide by 3 to get the average (I assume the division is straightforward).
Updated this para after the responses below:
Basically, X has dimensions 400x4096
Indices could be for example [[1,3,5],[2,4,8].....]
What I need to do is average out rows 1,3 and 5 of X and obtain a resultant row of shape 1x4096.
Similarly average out rows 2,4,8 of X and obtain a new row for this set and so on for each row in indices.
So basically each element in a particular row of indices refers to a specific row in X.
for counter in range(0, 400, 10):
    #print(counter)
    xclass = X[counter:counter+10]   # counter+9 would grab only 9 of the 10 rows
    yclass = y[counter:counter+10]
    #print(xclass)
    nbrs = NearestNeighbors(n_neighbors=3, algorithm='brute').fit(xclass)
    distances, indices = nbrs.kneighbors(xclass)
    #print(indices)
I'd appreciate any insight.
You can index lists in python using a word as such...
a = ['aaa', 'bbb', 'ccc']
b = a[a.index('aaa')]
print(b)
output: aaa
and also like...
a = ['aaa', 'bbb', 'ccc']
word = 'aaa'
b = a[a.index(word)]
print(b)
output: aaa
so you can do something like...
a = ['aaa', 'bbb', 'ccc']
word = 'aaa'
b = a[a.index(word) + 1]
print(b)
output: bbb
I assume you are using numpy (or something similar). In general, you can take any indexing array and use that to capture the particular entries of interest in another array. For example,
import numpy as np
#Accessing an array by an indexing array.
X = np.arange(30).reshape(6,5)  # 6 vectors, each 5 long
I = [[0,1,2],[3,4,5]]           # collect vectors 0,1,2 together and vectors 3,4,5 together
C = X[I,:]                      # shape (2, 3, 5): 2 collections of 3 5-long vectors
print(C)
#Computations for averaging those collected arrays.
#Note that C is shape (2,3,5), we wish to average the 3s together, hence we need to
#average along the middle axis (axis=1).
A = np.average(C,axis=1)
print(A)
More detail about X[I,:]: in general, we specified all the row indices and all the column indices to capture from our array X. Since we wanted the full vectors in X, we didn't care about the columns and captured all of them, hence the :. Likewise, we wanted to pull the rows 3 at a time, so we specified [0,1,2] and then [3,4,5]. You could change those to any indexes you wish.
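Applied to the shapes in the question, a sketch (the indices values are hypothetical, e.g. two rows of a kneighbors result):

import numpy as np

X = np.random.rand(400, 4096)               # the asker's data shape
indices = np.array([[1, 3, 5], [2, 4, 8]])  # hypothetical kneighbors rows

# One averaged 4096-long row per row of 'indices'.
averaged = X[indices, :].mean(axis=1)       # shape (2, 4096)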
I have a pandas Series with a MultiIndex, and I want to get the integer row numbers that belong to one level of the MultiIndex.
For example, if I have sample data s
s = pandas.Series([10, 23, 2, 19],
                  index=pandas.MultiIndex.from_product([['a', 'b'], ['c', 'd']]))
which looks like this:
a  c    10
   d    23
b  c     2
   d    19
I want to get the row numbers that correspond to the level b. So here, I'd get [2, 3] as the output, because the last two rows are under b. Also, I really only need the first row that belongs under b.
I wanted to get the numbers so that I can compare across Series. Say I have five Series objects with a b level. These are time-series data, and b corresponds to a condition that was present during some of the observations (and c is a sub-condition, etc). I want to see which Series had the conditions present at the same time.
Edit: To clarify, I don't need to compare the values themselves, just the indices. For example, in R if I had this dataframe:
d = data.frame(col_1 = c('a','a','b','b'), col_2 = c('c','d','c','d'), col_3 = runif(4))
Then the command which(d$col_1 == 'b') would produce the results I want.
If the level you want to index by is the outermost one, you can use loc:
s.loc['b']
To get the first row, I find the head method the easiest:
s.loc['b'].head(1)
The idiomatic way to do the second part of your question is as follows. Say your series are named series1, series2 and series3.
big_series = pd.concat([series1, series2, series3])
big_series.loc['b']
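If you specifically need the integer row positions (the [2, 3] from the question), one sketch using get_level_values:

import numpy as np

positions = np.where(s.index.get_level_values(0) == 'b')[0]
print(positions)         # [2 3]
first_b = positions[0]   # 2, the first row under 'b'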