This question already has answers here:
pyspark: groupby and then get max value of each group
(2 answers)
Closed 1 year ago.
I have a DataFrame like this (I am working with the tables as RDDs).
I would like to get a table with the maximum order value for each country, along with the customer number of the customer who placed it.
I have no idea how to construct the map function. Or is there a better way?
With PySpark:
df.groupBy('customernumber', 'city').max('sum_of_orders')
With Pandas:
df.groupby(['customernumber', 'city'])['sum_of_orders'].max()
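A minimal runnable sketch of the pandas version of this answer, using invented sample data (the column names follow the question; the values are made up for illustration):

```python
import pandas as pd

# Invented sample data: several orders per customer
df = pd.DataFrame({
    "customernumber": [101, 101, 102, 102],
    "city": ["Paris", "Paris", "Berlin", "Berlin"],
    "sum_of_orders": [250.0, 400.0, 150.0, 320.0],
})

# Largest order total per (customernumber, city) group
result = df.groupby(["customernumber", "city"])["sum_of_orders"].max()
print(result)
```

The result is a Series indexed by the group keys, with one max per group.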
This question already has an answer here:
Reversing the order of values in a single column of a Dataframe
(1 answer)
Closed 1 year ago.
I need to reverse a series for correct plotting.
So I wrote the code below:
dataFrame["close"] = dataFrame["close"][::-1]
But nothing changes. Why?
You are reversing only a single column and assigning it back; pandas aligns the assignment on the index, so every value lands right back where it started. Reverse the entire dataframe instead:
dataFrame.reindex(index=dataFrame.index[::-1])
or
dataFrame.iloc[::-1]
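A small self-contained demonstration of why the single-column assignment appears to do nothing, and how `iloc[::-1]` actually reverses the rows (data values invented):

```python
import pandas as pd

df = pd.DataFrame({"close": [10, 20, 30]})

# Assigning df["close"][::-1] back realigns on the index,
# so the values do not move at all
unchanged = df.copy()
unchanged["close"] = unchanged["close"][::-1]

# Reversing the whole frame reverses the row order for real
reversed_df = df.iloc[::-1]
print(reversed_df)
```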
This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 2 years ago.
I have a dataframe with more than one column, and I want to see additional data alongside the maximum value of one column.
For example, given the following code, show the country where the number is the highest per year and per cause. What I did was:
var=data.groupby(["Year","Causes"])["number"].max()
But this only shows the max value for each year and each cause. I would like to know which country is associated with that max value of number.
This code shows the highest number per cause per year, but I need to show the country associated with the highest number per cause per year
I tried using idxmax() instead of max() but it did not work
try this:
idx = data.groupby(["Year", "Causes"])["number"].idxmax()
print(data.loc[idx])
There's probably a more efficient way, but if I've understood what you'd like to achieve and your data structure, then this works:
var = data.groupby(["Year", "Causes"])["number"].max().reset_index()
new = var.merge(data, how='inner', on=["Year", "Causes", "number"]).drop_duplicates()
new
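A runnable sketch of this merge idea with invented data, resetting the index so that Year and Causes become ordinary columns and can serve as join keys:

```python
import pandas as pd

# Invented data: two countries reported per (Year, Causes) pair
data = pd.DataFrame({
    "Year":    [2000, 2000, 2001, 2001],
    "Causes":  ["A", "A", "A", "A"],
    "Country": ["X", "Y", "X", "Y"],
    "number":  [5, 9, 7, 3],
})

# Per-group maximum, with the group keys back as ordinary columns
grouped = data.groupby(["Year", "Causes"])["number"].max().reset_index()

# Join back to the original frame to recover the Country of each max row
new = grouped.merge(data, how="inner",
                    on=["Year", "Causes", "number"]).drop_duplicates()
print(new)
```

The `idxmax` route from the question (`data.loc[data.groupby(["Year", "Causes"])["number"].idxmax()]`) selects the same rows without a join.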
This question already has answers here:
Pandas get topmost n records within each group
(6 answers)
Closed 3 years ago.
Given a pandas dataframe with company purchases across various months in a year, how do I find the "N" highest each month?
Currently have:
df.groupby(df['Transaction Date'].dt.strftime('%B'))['Amount'].max()
This returns the highest value for each month, but I would like to see the highest four values.
Am I getting close here, or is there a more efficient approach? Thanks in advance
With sort_values then tail:
yourdf=df.sort_values('Amount').groupby(df['Transaction Date'].dt.strftime('%B'))['Amount'].tail(4)
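A quick sketch of the sort-then-tail pattern with invented transactions (only January has more than four rows here):

```python
import pandas as pd

# Invented sample transactions
df = pd.DataFrame({
    "Transaction Date": pd.to_datetime([
        "2021-01-05", "2021-01-12", "2021-01-20", "2021-01-25", "2021-01-30",
        "2021-02-03", "2021-02-10",
    ]),
    "Amount": [100, 500, 300, 200, 400, 50, 75],
})

# Sort ascending by Amount, group by month name, and keep the last
# four rows of each group, i.e. the four largest amounts per month
top4 = (df.sort_values("Amount")
          .groupby(df["Transaction Date"].dt.strftime("%B"))["Amount"]
          .tail(4))
print(top4)
```

An alternative with the same values is `df.groupby(df["Transaction Date"].dt.strftime("%B"))["Amount"].nlargest(4)`, which skips the explicit sort.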
This question already has answers here:
How to get the number of times a word appears inside a particular column in pandas?
(2 answers)
Closed 3 years ago.
I have a dataframe with ~150k rows:
Dataframe: Information about Salaries and Employees
I need to count specific values in the Job Title column of the dataframe, but it has to be a count of the values that include 'chief' somewhere within the job title.
I tried bringing up all the unique job titles with value_counts, but there are still too many for me to count.
print("%s employees have 'chief' in their job title." % salaries['JobTitle'].value_counts())
How can I create the specific condition I need to count the values correctly?
salaries['JobTitle'].str.contains('chief').sum()
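A self-contained sketch of this answer with invented job titles; `case=False` is worth adding, since titles in salary datasets are often upper-cased:

```python
import pandas as pd

# Invented titles; real data mixes upper- and lower-case
salaries = pd.DataFrame({
    "JobTitle": ["CHIEF OF POLICE", "Deputy Chief", "Firefighter", "Clerk"],
})

# str.contains builds a boolean mask; summing it counts the matches.
# case=False catches 'CHIEF', 'Chief', and 'chief' alike.
n = salaries["JobTitle"].str.contains("chief", case=False).sum()
print("%s employees have 'chief' in their job title." % n)
```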
This question already has answers here:
Find the list values not in pandas dataframe data
(3 answers)
Closed 5 years ago.
I have a data frame and I need to get the rows which are "not in" a given list.
I know that to get the rows that are in the list we can use .isin(list), so my question is whether there is a contrary "notin" function?
You can put ~ in front of the condition to negate it.
~df['Col1'].isin(list)
df['Col1'].isin(list) returns a boolean Series; flipping it with ~ gives True where Col1 is not in the list, which you can use to filter: df[~df['Col1'].isin(list)].
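A minimal demonstration with invented values (the name `exclude` is used instead of `list` to avoid shadowing the built-in):

```python
import pandas as pd

# Invented data and exclusion list
df = pd.DataFrame({"Col1": ["a", "b", "c", "d"]})
exclude = ["b", "d"]

# isin gives True for rows in the list; ~ flips it to "not in"
filtered = df[~df["Col1"].isin(exclude)]
print(filtered)
```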