With a dataframe called mrna, as follows:
id Cell_1 Cell_2 Cell_3
CDH3 8.006 5.183 10.144
ERBB2 9.799 12.355 8.571
...
How can I select the ERBB2 row as a pandas series (if I don't know its index)?
I tried:
mrna.iloc['ERBB2'] but that only takes an integer position, not a string label
I also tried:
mrna_t = mrna.transpose()
mrna_t['ERBB2']
but I get KeyError: 'ERBB2'
Pass a boolean condition to generate a boolean mask; this mask is then applied to the rows and returns just the rows where the condition is met:
In [116]:
df[df['id']=='ERBB2']
Out[116]:
id Cell_1 Cell_2 Cell_3
1 ERBB2 9.799 12.355 8.571
Output from boolean condition:
In [117]:
df['id']=='ERBB2'
Out[117]:
0 False
1 True
Name: id, dtype: bool
As for your error: mrna_t['ERBB2'] will attempt to look for a column with that name, which doesn't exist, hence the KeyError.
If it was your index then you could just do:
df.loc['ERBB2']
to select the index values matching the passed label. It's worth checking the docs, including the section on index selection by position and label.
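For example, a minimal sketch of the label-based approach, assuming the mrna frame shown above (values copied from the question):
import pandas as pd

mrna = pd.DataFrame({'id': ['CDH3', 'ERBB2'],
                     'Cell_1': [8.006, 9.799],
                     'Cell_2': [5.183, 12.355],
                     'Cell_3': [10.144, 8.571]})

# make 'id' the index, then label-based selection returns the row as a Series
row = mrna.set_index('id').loc['ERBB2']
print(row)   # Cell_1 9.799, Cell_2 12.355, Cell_3 8.571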
Just figured it out. I set the index to the id column and then transposed. This allowed me to index by "ERBB2":
mrna.set_index('id').T
Is there a way to get the first value from a filtered dataframe without having to copy and reindex the whole dataframe?
Let's say I have a dataframe df:
index  statement  name
1      True       123
2      True       456
3      True       789
4      False      147
5      False      258
6      True       369
and I want to get the name of the first row whose statement is False.
I would do:
filtered_df = df[df.statement == False]
filtered_df = filtered_df.reset_index(drop=True)
name = filtered_df.loc[0, "name"]
but is there an easier/faster solution to this?
If it is for the name of the first statement that is False, then use
df[~df.statement].name.iloc[0]
The ~ inverts the selection ("negates"), so only the rows where df.statement equals False are selected.
Then the name column of that selection is selected, then the first item by position (not index) is selected using .iloc[0].
The best approach to make your code neat and more readable is to use pandas method chaining.
df.query('statement == False').name.iloc[0]
Generally the .query() method improves the code readability while performing filtering operations.
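Both expressions can be checked on a small sketch of the frame from the question (values assumed from the table above):
import pandas as pd

df = pd.DataFrame({'statement': [True, True, True, False, False, True],
                   'name': [123, 456, 789, 147, 258, 369]},
                  index=[1, 2, 3, 4, 5, 6])

print(df[~df.statement].name.iloc[0])                 # 147
print(df.query('statement == False').name.iloc[0])    # 147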
Using Python Pandas I am trying to find the Country & Place with the maximum value.
This returns the maximum value:
data.groupby(['Country','Place'])['Value'].max()
But how do I get the corresponding Country and Place name?
Assuming df has a unique index, this gives the row with the maximum value:
In [34]: df.loc[df['Value'].idxmax()]
Out[34]:
Country US
Place Kansas
Value 894
Name: 7
Note that idxmax returns index labels. So if the DataFrame has duplicates in the index, the label may not uniquely identify the row, so df.loc may return more than one row.
Therefore, if df does not have a unique index, you must make the index unique before proceeding as above. Depending on the DataFrame, sometimes you can use stack or set_index to make the index unique. Or, you can simply reset the index (so the rows become renumbered, starting at 0):
df = df.reset_index()
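A self-contained sketch of the idxmax approach (the data here is made up, not the original set):
import pandas as pd

df = pd.DataFrame({'Country': ['US', 'UK', 'US'],
                   'Place': ['Kansas', 'London', 'NewYork'],
                   'Value': [894, 778, 562]})

# idxmax gives the index label of the largest Value; .loc turns it into the full row
print(df.loc[df['Value'].idxmax()])   # Country US, Place Kansas, Value 894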
df[df['Value']==df['Value'].max()]
This will return the entire row with the max value.
I think the easiest way to return a row with the maximum value is by getting its position. argmax() can be used to return the integer position of the row with the largest value.
index = df.Value.argmax()
Now this position can be used to get the features of that particular row:
df.iloc[df.Value.argmax(), 0:2]
The country and place are the index of the series; if you don't need the index, you can set as_index=False:
df.groupby(['country','place'], as_index=False)['value'].max()
Edit:
It seems that you want the place with the max value for every country; the following code will do what you want:
df.groupby("country").apply(lambda g: g.iloc[g.value.argmax()])
Use the index attribute of DataFrame. Note that I don't show all the rows in the example.
In [14]: df = data.groupby(['Country','Place'])['Value'].max()
In [15]: df.index
Out[15]:
MultiIndex
[Spain Manchester, UK London , US Mchigan , NewYork ]
In [16]: df.index[0]
Out[16]: ('Spain', 'Manchester')
In [17]: df.index[1]
Out[17]: ('UK', 'London')
You can also get the value by that index:
In [21]: for index in df.index:
   ....:     print index, df[index]
   ....:
('Spain', 'Manchester') 512
('UK', 'London') 778
('US', 'Mchigan') 854
('US', 'NewYork') 562
Edit
Sorry for misunderstanding what you want; try the following:
In [52]: s=data.max()
In [53]: print '%s, %s, %s' % (s['Country'], s['Place'], s['Value'])
US, NewYork, 854
In order to print the Country and Place with maximum value, use the following line of code.
print(df[['Country', 'Place']][df.Value == df.Value.max()])
You can use:
print(df[df['Value']==df['Value'].max()])
Using DataFrame.nlargest.
The dedicated method for this is nlargest, which uses algorithm.SelectNFrame under the hood, a performant way of doing sort_values().head(n).
x y a b
0 1 2 a x
1 2 4 b x
2 3 6 c y
3 4 1 a z
4 5 2 b z
5 6 3 c z
df.nlargest(1, 'y')
x y a b
2 3 6 c y
import pandas
Here df is the dataframe you created. Use the command:
df1 = df[['Country', 'Place']][df.Value == df['Value'].max()]
This will display the country and place whose value is maximum.
My solution for finding maximum values in columns:
df.loc[df.idxmax()]
and also for minimum values:
df.loc[df.idxmin()]
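A small sketch of what df.loc[df.idxmax()] does (column names and values here are assumed): df.idxmax() yields the index label of each column's maximum, and .loc then pulls those rows.
import pandas as pd

df = pd.DataFrame({'Value': [512, 778, 854], 'Other': [9, 3, 1]},
                  index=['a', 'b', 'c'])

print(df.idxmax())          # Value -> 'c', Other -> 'a'
print(df.loc[df.idxmax()])  # rows 'c' and 'a'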
I'd recommend using nlargest for better performance and shorter code.
import pandas
df[col_name].value_counts().nlargest(n=1)
I encountered a similar error while trying to import data using pandas. The first column of my dataset had spaces before the start of the words. I removed the spaces and it worked like a charm!
I have a dataframe in Python like this:
st se st_min st_max se_min se_max
42 922444 923190 922434 922454 923180 923200
24 922445 923190 922435 922455 923180 923200
43 928718 929456 928708 928728 929446 929466
37 928718 929459 928708 928728 929449 929469
As you can see, I have a range in the first two columns and a variation of 10 positions around the initial range.
I know that the drop_duplicates function can remove duplicate rows based on an exact match of values.
But I want to remove rows based on a range of values: for example, indexes 42 and 24 are in the same range (if I consider a range of 10), and indexes 43 and 37 are in the same situation.
How can I do this?
PS: I can't remove rows based on only one column (e.g. st or se); I need to remove redundancy based on both columns (st and se), using the min and max columns as filters...
I assume you want to combine all ranges, so that all ranges that overlap are reduced to one row. I think you need to do that iteratively, because there could be multiple ranges that form one big range, not just two. You could do it like this (just replace df with the variable you use to store your dataframe):
import pandas as pd

# create a dummy key column to produce a cartesian product
df['fake_key'] = 0
right_df = pd.DataFrame(df, copy=True)
right_df.rename({col: col + '_r' for col in right_df if col != 'fake_key'}, axis='columns', inplace=True)
# this variable indicates that we need to perform the loop once more
change = True
# diff and new_diff are used to see if the loop iteration changed something
# (it's monotonically increasing, by the way)
new_diff = (right_df['se_r'] - right_df['st_r']).sum()
while change:
    diff = new_diff
    joined_df = df.merge(right_df, on='fake_key')
    invalid_indexer = joined_df['se'] < joined_df['st_r']
    joined_df.drop(joined_df[invalid_indexer].index, axis='index', inplace=True)
    right_df = joined_df.groupby('st').aggregate({col: 'max' if '_min' not in col else 'min' for col in joined_df})
    # update the ..._min / ..._max fields in the combined range
    for col in ['st_min', 'se_min', 'st_max', 'se_max']:
        col_r = col + '_r'
        col1, col2 = (col, col_r) if 'min' in col else (col_r, col)
        right_df[col_r] = right_df[col1].where(right_df[col1] <= right_df[col2], right_df[col2])
    right_df.drop(['se', 'st_r', 'st_min', 'se_min', 'st_max', 'se_max'], axis='columns', inplace=True)
    right_df.rename({'st': 'st_r'}, axis='columns', inplace=True)
    right_df['fake_key'] = 0
    # now check if we need to iterate once more
    new_diff = (right_df['se_r'] - right_df['st_r']).sum()
    change = diff <= new_diff
# now all ranges which overlap have the same value for se_r,
# so we just need to aggregate on se_r to remove them
result = right_df.groupby('se_r').aggregate({col: 'min' if '_max' not in col else 'max' for col in right_df})
result.rename({col: col[:-2] if col.endswith('_r') else col for col in result}, axis='columns', inplace=True)
result.drop('fake_key', axis='columns', inplace=True)
If you execute this on your data, you get:
st se st_min st_max se_min se_max
se_r
923190 922444 923190 922434 922455 923180 923200
929459 928718 929459 922434 928728 923180 929469
Note, if your data set is larger than a few thousand records, you might need to change the join logic above which produces a cartesian product. So in the first iteration, you get a joined_df of the size n^2, where n is the number of records in your input dataframe. Then later in each iteration the joined_df will get smaller due to the aggregation.
I just ignored that because I don't know how large your dataset is. Avoiding this would make the code a bit more complex. But if you need it, you could just create an auxiliary dataframe which allows you to "bin" the se values on both dataframes and use the binned value as the fake_key. It's not quite regular binning; you would have to create a dataframe that contains, for each fake_key, all values in the range (0...fake_key). So e.g. if you define your fake key to be fake_key=se//1000, your dataframe would contain
fake_key fake_key_join
922 922
922 921
922 920
... ...
922 0
If you replace the merge in the loop above with code that merges such a dataframe on fake_key with right_df, and then the result on fake_key_join with df, you can use the rest of the code and get the same result as above, but without having to produce a full cartesian product.
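As a rough sketch of building that auxiliary frame (using the fake_key = se // 1000 binning assumed in the example above, and the df from the question):
import pandas as pd

# for every binned key, list all join keys from 0 up to that key,
# so that the later merge can match any smaller bin
keys = sorted((df['se'] // 1000).unique())
aux = pd.DataFrame([(k, j) for k in keys for j in range(int(k) + 1)],
                   columns=['fake_key', 'fake_key_join'])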
Note that e.g. the st values for keys 42 and 24 are different, so you cannot use just the st values.
If e.g. your range can be defined as st / 100 (rounded down to integer),
you can create a column with this value:
df['rng'] = df.st.floordiv(100)
Then use drop_duplicates with subset set to just this column and
drop rng column:
df.drop_duplicates(subset='rng').drop(columns=['rng'])
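For instance, with just the st and se columns of the sample data, st // 100 is 9224 for rows 42 and 24 and 9287 for rows 43 and 37, so only the first row of each pair survives (a sketch; the min/max columns are omitted):
import pandas as pd

df = pd.DataFrame({'st': [922444, 922445, 928718, 928718],
                   'se': [923190, 923190, 929456, 929459]},
                  index=[42, 24, 43, 37])

df['rng'] = df.st.floordiv(100)   # 9224, 9224, 9287, 9287
print(df.drop_duplicates(subset='rng').drop(columns=['rng']))   # keeps rows 42 and 43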
Or maybe the st value for key 24 should be the same as above (for key 42), and the same for se in the second pair of rows?
In this case you could use:
df.drop_duplicates(subset=['st', 'se'])
without any auxiliary column.
What's the code to get the index of a value in a pandas Series data structure?
animals=pd.Series(['bear','dog','mammoth','python'],
index=['canada','germany','iran','brazil'])
What's the code to extract the index of "mammoth"?
You can just use boolean indexing:
In [8]: animals == 'mammoth'
Out[8]:
canada False
germany False
iran True
brazil False
dtype: bool
In [9]: animals[animals == 'mammoth'].index
Out[9]: Index(['iran'], dtype='object')
Note, indexes aren't necessarily unique for pandas data structures.
You have two options:
1) If you make sure that value is unique, or just want to get the first one, take the first matching index label:
animals[animals == 'mammoth'].index[0]  # retrieves index of first occurrence of value
2) If you would like to get all indices matching that value, as per #juanpa.arrivillaga's post:
animals[animals == 'mammoth'].index # retrieves indices of all matching values
You can also pick any particular occurrence of the value by indexing into the result like a list:
animals[animals == 'mammoth'].index[1]  # retrieves index of second occurrence of value
I want to add a column to a Dataframe that will contain a number derived from the number of NaN values in the row, specifically: one less than the number of non-NaN values in the row.
I tried:
for index, row in df.iterrows():
    count = row.value_counts()
    val = sum(count) - 1
    df['Num Hits'] = val
Which returns an error:
-c:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
and puts the first val value into every cell of the new column. I've tried reading about .loc and indexing in the Pandas documentation and failed to make sense of it. I gather that .loc wants a row_index and a column_index, but I don't know whether these are pre-defined in every dataframe and I just have to specify them somehow, or whether I need to "set" an index on the dataframe before telling the loop where to place the new value, val.
You can totally do it in a vectorized way without using a loop, which is likely to be faster than the loop version:
In [89]:
print df
0 1 2 3
0 0.835396 0.330275 0.786579 0.493567
1 0.751678 0.299354 0.050638 0.483490
2 0.559348 0.106477 0.807911 0.883195
3 0.250296 0.281871 0.439523 0.117846
4 0.480055 0.269579 0.282295 0.170642
In [90]:
#number of valid numbers - 1
df.apply(lambda x: np.isfinite(x).sum()-1, axis=1)
Out[90]:
0 3
1 3
2 3
3 3
4 3
dtype: int64
#DSM brought up a good point that the above solution is still not fully vectorized. A fully vectorized form is simply (~df.isnull()).sum(axis=1) - 1.
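On a frame that actually contains NaNs, this agrees with the loop; df.count(axis=1) - 1 is an equivalent spelling, since count() already skips NaN. A small sketch with assumed values:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, np.nan, 6.0],
                   'c': [7.0, 8.0, 9.0]})

print((~df.isnull()).sum(axis=1) - 1)   # 1, 0, 2
print(df.count(axis=1) - 1)             # same result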
You can use the index variable that you define as part of the for loop as the row_index that .loc is looking for:
for index, row in df.iterrows():
    count = row.value_counts()
    val = sum(count) - 1
    df.loc[index, 'Num Hits'] = val