Let's say I have an unordered DataFrame:
df = pandas.DataFrame({'A': [6, 2, 3, 5]})
I have an input:
input = 3
I want to find the rank of my input in the column. Here:
expected_rank_in_df(input) = 2
# Because 2 < 3 < 5 < 6
Assumption: the input is always included in the DataFrame, so, for example, I will not look up the position of "4" in this df.
The first idea was to use rank, like here: Pandas rank by column value:
df.rank()
But it seems overkill to me as I don't need to rank the whole column. Maybe it's not ?
If you know for sure that the input is in the column, its ascending rank is the number of values less than or equal to it:
df[df <= input].count()
Does that make sense? If you intend on calling this multiple times, it may be worth it to just sort the column. But this is probably faster if you only care about a few inputs.
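A runnable sketch of this counting idea (the helper name `rank_of` is just illustrative): counting the values less than or equal to the input gives its ascending rank, since the input is guaranteed to be present.

```python
import pandas as pd

df = pd.DataFrame({'A': [6, 2, 3, 5]})

def rank_of(value):
    # Number of values <= value; this equals the ascending rank
    # because the input is guaranteed to be present in the column.
    return int((df['A'] <= value).sum())

print(rank_of(3))  # 2
print(rank_of(5))  # 3
```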
You can get the first position of the matched value with numpy.where on a boolean mask over the sorted values (the first True marks the match):
import numpy as np

a = 3
print(np.where(np.sort(df['A']) == a)[0][0] + 1)
2
The same idea in pure pandas: sort the values, reset the index so positions line up, and take the position of the first match:
a = 3
print(df['A'].sort_values().reset_index(drop=True).eq(a).idxmax() + 1)
2
Another idea is to count the values less than or equal to the input by sum:
print(df['A'].le(a).sum())
2
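To rule out coincidences with this particular sample, here is a sketch of the sort-based approaches checked against a second input:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [6, 2, 3, 5]})
a = 5  # expected rank: 3, because 2 < 3 < 5 < 6

# 1-based position of the first match in the sorted values
rank_np = np.where(np.sort(df['A']) == a)[0][0] + 1

# Pure-pandas version: sort, then drop the old index so positions line up
rank_pd = df['A'].sort_values().reset_index(drop=True).eq(a).idxmax() + 1
print(rank_np, rank_pd)  # 3 3
```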
Related
I have an Excel file that has a file name and a value, for example:
file     count
001.txt 1
002.txt 2
003.txt 2
004.txt 3
005.txt 1
006.txt 2
I'm using the following code to find how many 2s are in the count column, but somehow the result is 0:
df = pd.read_excel('report.xlsx')
df.columns = ['file', 'count']
count = df['count'].tolist().count('2')
print(count)
>>0
Did I do something wrong in the code?
First, check whether the dtype of the 'count' column is numeric (such as int64, int32, or float64) or object:
df['count'].dtype  # check the result
If the dtype is numeric, the code you wrote is almost right; just correct it like this:
df['count'].to_list().count(2)
When you call .to_list(), the elements are all numbers, and count('2') counts occurrences of the string '2'. Since there are no string elements, you got a zero result.
Here is a simple example.
lis = [0, 1, 2, 2, 3]
lis.count('2') # it returns 0
lis.count(2) # it returns 2
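A self-contained sketch of that check, with an in-memory frame standing in for report.xlsx:

```python
import pandas as pd

# In-memory stand-in for report.xlsx
df = pd.DataFrame({'file': ['001.txt', '002.txt', '003.txt',
                            '004.txt', '005.txt', '006.txt'],
                   'count': [1, 2, 2, 3, 1, 2]})

print(df['count'].dtype)                 # int64: the elements are numbers
n_str = df['count'].tolist().count('2')  # 0: no string '2' in a list of ints
n_int = df['count'].tolist().count(2)    # 3
# Vectorized alternative that skips the list round-trip:
n_vec = int((df['count'] == 2).sum())    # 3
print(n_str, n_int, n_vec)
```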
My data consists of a Latitude column of object type:
0 4.620881605
1 4.620124518
2 4.619367709
3 4.618609512
4 4.61784758
Then I split on the decimal point using this code:
marker['Latitude'].str.split('.')
Resulting in:
0 [4, 620881605]
1 [4, 620124518]
2 [4, 619367709]
3 [4, 618609512]
4 [4, 61784758]
which is good but not quite there yet. I want to access the second element of the list for every row and the end result I am expecting is this :
0 620881605
1 620124518
2 619367709
3 618609512
4 61784758
I was looking for an answer to the same question. A fast option is operator.itemgetter(), which is implemented in native code and should perform fine with Series.apply():
from operator import itemgetter
series = pd.Series(["%s|%s" % (-x, x) for x in range(100)])
pairs = series.str.split('|')
# Fetch all the negative numbers
negatives = pairs.apply(itemgetter(0)).astype(int)
# Fetch all the positive numbers
positives = pairs.apply(itemgetter(1)).astype(int)
Note that Series.str.split() also accepts an expand=True argument, which returns a new DataFrame containing columns 0..n rather than a Series of lists. This probably should be the default behaviour; it's much easier to work with:
series = pd.Series(["%s|%s" % (-x, x) for x in range(100)])
pairs = series.str.split('|', expand=True)
# Fetch all the negative numbers
negatives = pairs[0]
# Fetch all the positive numbers
positives = pairs[1]
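For what it's worth, the lists produced by str.split() can also be indexed element-wise with the built-in .str accessor, which avoids apply() entirely; a sketch on the same example:

```python
import pandas as pd

series = pd.Series(["%s|%s" % (-x, x) for x in range(100)])
pairs = series.str.split('|')

# .str[i] (equivalently .str.get(i)) indexes into each list element-wise
negatives = pairs.str[0].astype(int)
positives = pairs.str[1].astype(int)
print(negatives.iloc[1], positives.iloc[1])  # -1 1
```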
You can use pd.DataFrame.iterrows() to iterate by row and then select the proper index for your list.
import pandas as pd
x = pd.DataFrame({'a':[[1,2],[3,4],[5,6]]})
for index, row in x.iterrows():
    print(row['a'][1])
2
4
6
marker['Latitude'].apply(lambda x: x.strip(',').split('.')[1])
I am trying to return a specific item from a Pandas DataFrame via conditional selection (and do not want to have to reference the index to do so).
Here is an example:
I have the following dataframe:
Code Colour Fruit
0 1 red apple
1 2 orange orange
2 3 yellow banana
3 4 green pear
4 5 blue blueberry
I enter the following code to search for the code for blueberries:
df[df['Fruit'] == 'blueberry']['Code']
This returns:
4 5
Name: Code, dtype: int64
which is of type:
pandas.core.series.Series
but what I actually want to return is the number 5 of type:
numpy.int64
which I can do if I enter the following code:
df[df['Fruit'] == 'blueberry']['Code'][4]
i.e. referencing the index to give the number 5, but I do not want to have to reference the index!
Is there another syntax that I can deploy here to achieve the same thing?
Thank you!...
Update:
One further idea is this code:
df[df['Fruit'] == 'blueberry']['Code'][df[df['Fruit']=='blueberry'].index[0]]
However, this does not seem particularly elegant (and it references the index). Is there a more concise and precise method that does not need to reference the index or is this strictly necessary?
Thanks!...
Let's try this:
df.loc[df['Fruit'] == 'blueberry','Code'].values[0]
Output:
5
First, use .loc to access the values in your dataframe, using boolean indexing for the row selection and the column label for the column selection. Then convert the returned Series to an array of values; since there is only one value in that array, you can use index [0] to get the scalar from that single-element array.
Referencing the index is a requirement (unless you use next()^), since a pd.Series is not guaranteed to have only one value.
You can use pd.Series.values to extract the values as an array. This also works if you have multiple matches:
res = df.loc[df['Fruit'] == 'blueberry', 'Code'].values
# array([5], dtype=int64)
df2 = pd.concat([df]*5)
res = df2.loc[df2['Fruit'] == 'blueberry', 'Code'].values
# array([5, 5, 5, 5, 5], dtype=int64)
To get a list from the numpy array, you can use .tolist():
res = df.loc[df['Fruit'] == 'blueberry', 'Code'].values.tolist()
Both the array and the list versions can be indexed intuitively, e.g. res[0] for the first item.
^ If you are really opposed to using the index, you can use next() to iterate:
next(iter(res))
You can also set your 'Fruit' column as an index:
df_fruit_index = df.set_index('Fruit')
and extract the value from the 'Code' column based on the fruit you choose
df_fruit_index.loc['blueberry','Code']
Easiest solution: convert the pandas.core.series.Series to an integer (note this only works when the selection contains a single element):
my_code = int(df[df['Fruit'] == 'blueberry']['Code'])
print(my_code)
Outputs:
5
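On recent pandas versions, Series.item() is a stricter variant of the same idea: it returns the scalar directly and raises if the selection does not contain exactly one element. A sketch on the example data:

```python
import pandas as pd

df = pd.DataFrame({'Code': [1, 2, 3, 4, 5],
                   'Colour': ['red', 'orange', 'yellow', 'green', 'blue'],
                   'Fruit': ['apple', 'orange', 'banana', 'pear', 'blueberry']})

# .item() raises a ValueError unless exactly one row matches
code = df.loc[df['Fruit'] == 'blueberry', 'Code'].item()
print(code)  # 5
```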
I have a dataframe with multiple columns. I would like to replace the value in a column called Discriminant, but only for the few rows where a condition is met in another column called ids. I tried various methods; the most common seems to be the .loc method, but for some reason it doesn't work for me.
Here are the variations that I am unsuccessfully trying:
encodedid - variable used for condition checking
indices - variable used for subsetting the dataframe (starts from zero)
Variation 1:
df[df.ids == encodedid].loc[df.ids==encodedid, 'Discriminant'].values[indices] = 'Y'
Variation 2:
df[df['ids'] == encodedid].iloc[indices,:].set_value('questionid','Discriminant', 'Y')
Variation 3:
df.loc[df.ids==encodedid, 'Discriminant'][indices] = 'Y'
Variation 3 particularly has been disappointing in that most posts on SO tend to say it should work but it gives me the following error:
ValueError: [ 0 1 2 3 5 6 7 8 10 11 12 13 14 16 17 18 19 20 21 22 23] not contained in the index
Any pointers will be highly appreciated.
You are slicing too much. Try something like this:
indexer = df[df.ids == encodedid].index
df.loc[indexer, 'Discriminant'] = 'Y'
.loc[] takes a row indexer and a column indexer, and you can set the value of that slice directly with = 'whatever you need'.
Looking at your problem, you might want to set that for 2 columns at the same time, such as:
indexer = df[df.ids == encodedid].index
column_list = ['Discriminant', 'questionid']
df.loc[indexer, column_list] = 'Y'
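A runnable sketch of that pattern on hypothetical stand-in data (the column values here are made up):

```python
import pandas as pd

# Hypothetical stand-in for the question's frame
df = pd.DataFrame({'ids': ['a', 'b', 'a', 'c'],
                   'Discriminant': ['N', 'N', 'N', 'N']})
encodedid = 'a'

# Collect the matching row labels, then assign in one .loc call
indexer = df[df.ids == encodedid].index
df.loc[indexer, 'Discriminant'] = 'Y'
print(df['Discriminant'].tolist())  # ['Y', 'N', 'Y', 'N']
```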
Maybe something like this. I don't have a dataframe to test it, but...
df['Discriminant'] = np.where(df['ids'] == 'some_condition', 'replace', df['Discriminant'])
I want to count the number of pipe symbol (|) occurrences in a column of a DataFrame, and if the count equals 5, append another pipe (|) to the existing value.
df2['smartexpenseid']
0 878497|253919815?HOTEL?141791520780|||305117||
1 362593||||35068||
2 |231931871509?CARRT?231940968972||||177849|
3 955304|248973233?HOTEL?154687992630||||93191|
4 27984||||5883|3242|
5 3579321|253872763?HOTEL?128891721799|92832814|||
6 127299|248541768?HOTEL?270593355555|||||
7 |231931871509?CARRT?231940968972||||177849|
8 831665||||80658||
9 |247132692?HOTEL?141790728905||||6249|
For example, for row number 5 the pipe count is 5, so another (|) should be appended to the existing value; for the other rows, since the count is 6, we just leave them as they are. Can somebody help me with this?
I tried these
if df2['smartexpenseid'].str.count('\|')==5:
    df2['smartexpenseid'].append('\|')
This is throwing me error saying "The truth value of a Series is ambiguous"
and also
a = df2['smartexpenseid'].str.count('\|')
if 5 in a:
    a.index(5)
So you have the vectorized str methods down. Now you need to conditionally append an extra '|' character. See the pandas documentation on boolean masking for more info.
m = df2['smartexpenseid'].str.count(r'\|') == 5
df2.loc[m, 'smartexpenseid'] = df2.loc[m, 'smartexpenseid'] + '|'
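Putting that together on two of the sample values (rows 0 and 5 of the question's data):

```python
import pandas as pd

df2 = pd.DataFrame({'smartexpenseid': [
    '878497|253919815?HOTEL?141791520780|||305117||',    # 6 pipes: unchanged
    '3579321|253872763?HOTEL?128891721799|92832814|||',  # 5 pipes: gets one more
]})

m = df2['smartexpenseid'].str.count(r'\|') == 5
df2.loc[m, 'smartexpenseid'] = df2.loc[m, 'smartexpenseid'] + '|'
print(df2['smartexpenseid'].str.count(r'\|').tolist())  # [6, 6]
```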