How to get the second largest value in Pandas Python [duplicate]

This question already has answers here:
Get first and second highest values in pandas columns
(7 answers)
Closed 4 years ago.
This is my code:
maxData = all_data.groupby(['Id'])[features].agg('max')
all_data = pd.merge(all_data, maxData.reset_index(), suffixes=["", "_max"], how='left', on=['Id'])
Now, instead of getting the max value, how can I fetch the second-largest value in the code above (grouped by Id)?

Try using nlargest:
# .iloc[-1] selects the second-largest by position rather than by label
maxData = all_data.groupby(['Id'])[features].apply(lambda x: x.nlargest(2).iloc[-1]).reset_index(drop=True)

You can use the nth method right after sorting the values:
maxData = all_data.sort_values(features, ascending=False).groupby(['Id']).nth(1)
Avoid the apply method where possible, as it degrades performance.
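For comparison, a minimal self-contained sketch of both approaches on made-up data; the column names Id and Value are assumptions standing in for the question's Id and features:
import pandas as pd

# Toy stand-in for all_data; 'Id' and 'Value' are assumed column names
all_data = pd.DataFrame({'Id': [1, 1, 1, 2, 2],
                         'Value': [10, 30, 20, 5, 7]})

# apply + nlargest: second-largest value per group
second_apply = all_data.groupby('Id')['Value'].apply(lambda x: x.nlargest(2).iloc[-1])

# sort_values + nth: take the row in position 1 of each group after sorting descending
second_nth = all_data.sort_values('Value', ascending=False).groupby('Id').nth(1)

print(second_apply)
print(second_nth)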

Related

PYTHON sort a column conditionally by putting special characters on the top [duplicate]

This question already has answers here:
Custom sorting in pandas dataframe
(5 answers)
Closed 11 months ago.
I am working on my dataset. I need to sort one of its columns from the smallest to the largest, like:
However, when I use:
count20 = count20.sort_values(by = ['Month Year', 'Age'])
I got:
Can anyone help me with this?
Thank you very much!
Define a function like this:
def fn(x):
    output = []
    # items() replaces the deprecated iteritems(); keep only alphanumeric characters
    for item, value in x.items():
        output.append(''.join(e for e in value if e.isalnum()))
    return output
and pass this function as the key while sorting values:
count20 = count20.sort_values(by=['Month Year', 'Age'], key=fn)
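As a rough illustration, here is a self-contained sketch of the same idea on made-up data (the Age values such as '<1' are invented to stand in for the question's special character; the key parameter of sort_values requires pandas 1.1 or newer):
import pandas as pd

# Invented data; 'Month Year' and 'Age' mirror the question's column names
count20 = pd.DataFrame({'Month Year': ['Jan 2020', 'Jan 2020', 'Jan 2020'],
                        'Age': ['<1', '5', '2']})

def fn(x):
    # drop everything that is not alphanumeric before comparing
    output = []
    for item, value in x.items():
        output.append(''.join(e for e in value if e.isalnum()))
    return output

# '<1' now compares as '1', so it sorts to the top of the Age column
print(count20.sort_values(by=['Month Year', 'Age'], key=fn))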

Equivalent R and Python with a DataFrame [duplicate]

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 2 years ago.
I'm stuck on finding the Python equivalent of some R code.
Code in R
library(datasets)
data <- airquality
data2 <- data[data$Ozone < 63,]
I downloaded the airquality file and used the pd.read_csv() function to load the .csv file into Python, but I don't know how to get the equivalent of the line data[data$Ozone < 63,].
data2 = data.loc[data["Ozone"] < 63,:]
This should do the trick.
data["Ozone"] < 63 returns a boolean mask that is True on the rows where the condition holds.
data.loc[mask, :] returns a copy of the dataframe data restricted to those rows, with all columns (:).
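A minimal end-to-end sketch, assuming the dataset was exported to a local file named airquality.csv (the filename is an assumption):
import pandas as pd

# "airquality.csv" is assumed to be a local export of R's airquality dataset
data = pd.read_csv("airquality.csv")

# Equivalent of R's data[data$Ozone < 63, ]
data2 = data.loc[data["Ozone"] < 63, :]
print(data2.head())
One behavioral difference to keep in mind: R keeps rows where Ozone is NA (they come through as NA rows), while the pandas comparison treats NaN < 63 as False, so those rows are simply dropped.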

Create a new column which is the value_counts of another column in Python [duplicate]

This question already has answers here:
pandas add column to groupby dataframe
(3 answers)
Closed 2 years ago.
I have a pandas DataFrame df that contains a column, say x, and I would like to create another column out of x which holds the value count of each item in x.
Here is my approach:
x_counts = []
for item in df['x']:
    item_count = len(df[df['x'] == item])
    x_counts.append(item_count)
df['x_count'] = x_counts
This works, but it is far from efficient. I am looking for a more efficient way to handle this. Your approaches and recommendations are highly appreciated.
It sounds like you are looking for the groupby function, since you are trying to get the count of items in x.
There are many other function-driven approaches, but they may differ between pandas versions.
I suppose that you are looking to group the same elements and find their sum:
df.loc[:, 'x_count'] = 1  # add an x_count column with the value 1 in every row
aggregate_functions = {"x_count": "sum"}
# as_index=False and sort=False keep x as a regular column; otherwise x would become the index
df = df.groupby(["x"], as_index=False, sort=False).aggregate(aggregate_functions)
Hope it helps.
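Note that the groupby/aggregate above collapses the frame to one row per unique x. If the goal is a count aligned with every original row, as the question asks, a transform-based sketch keeps the original shape:
import pandas as pd

# Toy frame; the column name 'x' follows the question
df = pd.DataFrame({'x': ['a', 'b', 'a', 'c', 'a', 'b']})

# transform('count') returns one value per original row, aligned with df
df['x_count'] = df.groupby('x')['x'].transform('count')

# Equivalent alternative: df['x_count'] = df['x'].map(df['x'].value_counts())
print(df)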

Performance of Pandas string contains for column [duplicate]

This question already has answers here:
Pandas filtering for multiple substrings in series
(3 answers)
Closed 4 years ago.
I have a DataFrame of 83k rows and a column "Text" of text that I have to search for ~200 masks. Is there a way to pass a column to .str.contains()?
I'm able to do it like this:
import time

start = time.time()
counts = [a["Text"].str.contains(m).sum() for m in b["mask"].values]
print(time.time() - start)
But it's taking 34.013s. Is there any faster way?
Edit:
b["mask"] looks like:
'PR347856|P5478'
'BS7623|B5763'
and I want the count of occurrences for each mask, so I can't join them.
Edit:
a["Text"] contains strings roughly three sentences long.
Maybe you can vectorize the containment operation.
text_contains = a['Text'].str.contains
b['mask'].map(lambda m: text_contains(m).sum())
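For reference, a self-contained sketch of that idea on made-up data; the frames a and b below are stand-ins for the ones in the question, and each mask is already a regex alternation, which str.contains interprets as a pattern:
import time
import pandas as pd

# Invented stand-ins for the question's frames
a = pd.DataFrame({'Text': ['order PR347856 shipped', 'nothing to see here', 'B5763 was delayed']})
b = pd.DataFrame({'mask': ['PR347856|P5478', 'BS7623|B5763']})

start = time.time()
text_contains = a['Text'].str.contains                     # bind the method once
counts = b['mask'].map(lambda m: text_contains(m).sum())   # match count per mask
print(counts)
print(time.time() - start)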

Partial Indexing Error in Python Series [duplicate]

This question already has answers here:
key error and MultiIndex lexsort depth
(1 answer)
What exactly is the lexsort_depth of a multi-index Dataframe?
(1 answer)
Closed 5 years ago.
I have created a hierarchically indexed Series and I wanted to partially index some values of the Series. But when I change the alphabetical order of the index, the partial indexing stops working. Can anybody explain why this is happening, with a better and more logical explanation?
import numpy as np
from pandas import Series

sr = Series(np.arange(11), index=[['a','b','b','c','d','d','e','e','f','f','f'],
                                  [1,2,1,3,1,2,1,2,1,2,3]])
print (sr['a':'c'])
This gives the expected output, but when I change the alphabetical order of the indexes, the partial indexing gives an error.
hs = Series(np.arange(10),index=[['a','a','b','b','c','c','d','e','e','a'],[1,0,2,1,0,1,1,3,2,3]])
print(hs['a':'c'])
pandas.errors.UnsortedIndexError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'
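The linked answers explain the cause: label-based slicing on a MultiIndex requires the index to be lexsorted, and the second Series' first level ('a', 'a', 'b', ..., 'a') is not, so its lexsort depth is 0. A minimal sketch of the usual fix, sorting the index before slicing:
import numpy as np
from pandas import Series

hs = Series(np.arange(10),
            index=[['a','a','b','b','c','c','d','e','e','a'],
                   [1,0,2,1,0,1,1,3,2,3]])

# sort_index() restores a fully lexsorted MultiIndex, so partial slicing works again
print(hs.sort_index()['a':'c'])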
