Grouping Data with pandas [duplicate] - python

This question already has answers here:
Can pandas groupby aggregate into a list, rather than sum, mean, etc?
(7 answers)
Closed 5 years ago.
I'm using pandas for a thesis assignment and got stuck on the following.
My data is shown below: there are multiple rows per Full_Name, each with a different author_ID in the second column.
Full_Name author_ID
SVANTE ARRHENIUS 5C5007F5
SVANTE ARRHENIUS 76E05190
I'm trying to update the data so I have one row per author, with all corresponding author_IDs collected in the second column, like this:
Full_Name author_ID
SVANTE ARRHENIUS [5C5007F5,76E05190]
Sorry if this is a very basic question. I've been stuck on it for a while and can't figure it out :(

Let's say you have a DataFrame created as:
import pandas as pd

# Sample data with a duplicated name
DF_obj = pd.DataFrame([['Ravi', 1234], ['Ragh', 12345], ['Ravi', 14567]])
DF_obj.columns = ['Full_Name', 'Author_ID']

# Collect every Author_ID belonging to the same name into a list
group_by = DF_obj.groupby('Full_Name')['Author_ID'].apply(list)
group_by
Out[]
Full_Name
Ragh [12345]
Ravi [1234, 14567]
Name: Author_ID, dtype: object
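If the result should be a DataFrame again rather than a Series (one row per author, IDs gathered into a list), chaining reset_index() onto the grouped result restores the two-column layout from the question. A minimal sketch using the question's own data:

```python
import pandas as pd

# The original two-row example from the question
df = pd.DataFrame({'Full_Name': ['SVANTE ARRHENIUS', 'SVANTE ARRHENIUS'],
                   'author_ID': ['5C5007F5', '76E05190']})

# groupby + apply(list) yields a Series; reset_index turns it back
# into a DataFrame with Full_Name and author_ID columns
result = df.groupby('Full_Name')['author_ID'].apply(list).reset_index()
print(result.loc[0, 'author_ID'])  # → ['5C5007F5', '76E05190']
```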

Related

Max value from several rows [duplicate]

This question already has answers here:
pyspark: grouby and then get max value of each group
(2 answers)
Closed 1 year ago.
I have a DataFrame like the one below; I am working with the tables in the form of an RDD.
I would like to get a table with the maximum order value for each country, together with the customer number for that order.
I have no idea how to construct the map function. Is there a better way?
With PySpark:
df.groupBy('customernumber', 'city').max('sum_of_orders')
With Pandas:
df.groupby(['customernumber', 'city'])['sum_of_orders'].max()
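If the goal is to keep the whole row holding each group's maximum (so the customer number comes along with the max order value), idxmax on the grouped column gives the row labels of the maxima, and .loc recovers those full rows. A sketch with made-up data, grouping by city as in the answer above:

```python
import pandas as pd

# Hypothetical orders data: one row per customer
df = pd.DataFrame({
    'customernumber': [1, 2, 3],
    'city': ['Oslo', 'Oslo', 'Bergen'],
    'sum_of_orders': [100, 250, 80],
})

# idxmax returns the row label of the maximum within each city group,
# so .loc pulls back the entire row, customer number included
top_rows = df.loc[df.groupby('city')['sum_of_orders'].idxmax()]
print(sorted(top_rows['customernumber']))  # → [2, 3]
```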

How to find an index with maximum number of rows in Pandas? [duplicate]

This question already has answers here:
The first three max value in a column in python
(1 answer)
Count and Sort with Pandas
(5 answers)
Closed 3 years ago.
I am doing an online course which has a problem like "Find the name of the state with the maximum number of counties". The problem dataframe is shown in the image below
Problem Dataframe
Now, I have given the dataframe two new indexes (hierarchical indexing), after which the dataframe looks like the image below
Modified Dataframe
I have used this code to get the modified dataframe:
def answer_five():
    # Keep only county-level records (SUMLEV == 50)
    new_df = census_df[census_df['SUMLEV'] == 50]
    # Index hierarchically by state name, then county name
    new_df = new_df.set_index(['STNAME', 'CTYNAME'])
    return new_df

answer_five()
What I want to do now is find the name of the state with the largest number of counties, i.e. the index with the maximum number of rows. How can I do that?
I know this can be done with something like the groupby() method, but I'm not familiar with that method yet, so I'd rather not use it. Can anyone help? I have searched for this but failed. Sorry if the problem is rudimentary. Thanks in advance.
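Since the state name is the first level of the hierarchical index, one way to avoid groupby() entirely is to count how often each first-level label occurs and take the most frequent one. A sketch with made-up stand-in data (the real frame comes from census_df):

```python
import pandas as pd

# Hypothetical stand-in for census_df: each row is one county
census_df = pd.DataFrame({
    'STNAME': ['Texas', 'Texas', 'Texas', 'Ohio', 'Ohio'],
    'CTYNAME': ['A', 'B', 'C', 'D', 'E'],
    'SUMLEV': [50, 50, 50, 50, 50],
})
new_df = census_df[census_df['SUMLEV'] == 50].set_index(['STNAME', 'CTYNAME'])

# Count rows per first index level; idxmax picks the most frequent label
counts = new_df.index.get_level_values('STNAME').value_counts()
state = counts.idxmax()
print(state)  # → 'Texas'
```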

Filter dataframe by first order per customer [duplicate]

This question already has answers here:
Groupby first two earliest dates, then average time between first two dates - pandas
(3 answers)
Closed 3 years ago.
I would like some help to solve the following problem using Pandas in Python.
I have a dataframe about the customers' transactions - in random order, which contains the following columns, along with datatypes:
user_id object;
transaction_date datetime64[ns];
account_creation_date datetime64[ns];
transaction_id object;
I need to find a dataframe that contains all the first (chronological) transactions for every customer. The final dataframe should contain the same columns as the original one.
So far I have tried some "group by" together with aggregate functions, but I keep getting the first transaction in order of appearance instead of the first in chronological order.
Any thoughts?
This will get you the earliest observation per customer:
# Sort chronologically first, then keep the first row of each user's group
df_first = df.sort_values('transaction_date').groupby('user_id').head(1)
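A quick check of that one-liner on made-up transactions, using the column names described in the question:

```python
import pandas as pd

# Hypothetical transactions, deliberately out of chronological order
df = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u1'],
    'transaction_date': pd.to_datetime(['2020-03-01', '2020-01-15', '2020-01-05']),
    'account_creation_date': pd.to_datetime(['2019-12-01'] * 3),
    'transaction_id': ['t1', 't2', 't3'],
})

# Sorting by date then taking head(1) per group keeps each user's
# earliest transaction; all original columns are preserved
df_first = df.sort_values('transaction_date').groupby('user_id').head(1)
print(sorted(df_first['transaction_id']))  # → ['t2', 't3']
```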

groupby and extract data [duplicate]

This question already has answers here:
Pandas groupby: How to get a union of strings
(8 answers)
Closed 3 years ago.
I'm new to pandas and was able to create a dataframe from a CSV file. I was also able to sort it.
What I am struggling with now is the following. I give an image from a pandas dataframe as an example:
the first column is the index,
the second column is a group number,
the third column is what happened.
Based on the second column, I want to pull out the third column's values for each unique group number.
I highlight few examples: For the number 9 return back the sequence
[60,61,70,51]
For the number 6 get back the sequence
[65,55,56]
For the number 8 get back the single element 8.
How groupby can be used to do this extraction?
Thanks a lot
Regards
Alex
Starting from the answers to that question, we can use the following code to get the desired result:
import pandas as pd

dataframe = pd.DataFrame({'index': [0, 1, 2, 3, 4],
                          'groupNumber': [9, 9, 9, 9, 9],
                          'value': [12, 13, 14, 15, 16]})
# Collect the 'value' entries of each group number into a list
grouped = dataframe.groupby('groupNumber')['value'].apply(list)

Count values in dataframe column that include specific parameters [duplicate]

This question already has answers here:
How to get the number of times a piece if word is inside a particular column in pandas?
(2 answers)
Closed 3 years ago.
I have a dataframe with ~150k rows:
Dataframe: Information about Salaries and Employees
I need to count specific values in the Job Title column of the dataframe, but only those values that include 'chief' somewhere within the job title.
I tried bringing up all the unique job titles with value_counts, but there are still too many for me to count.
print("%s employees have 'chief' in their job title." % salaries['JobTitle'].value_counts())
How can I create the specific condition I need to count the values correctly?
# contains() marks each title that has 'chief' in it; summing the booleans counts them
salaries['JobTitle'].str.contains('chief').sum()
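One caveat: if the job titles vary in capitalization (e.g. 'CHIEF OF POLICE'), a plain contains('chief') misses the upper-case rows; passing case=False makes the match case-insensitive. A sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical salaries frame; the column name 'JobTitle' follows the question
salaries = pd.DataFrame({'JobTitle': ['CHIEF OF POLICE', 'Deputy Chief', 'Nurse']})

# case=False matches 'chief' regardless of capitalization
count = salaries['JobTitle'].str.contains('chief', case=False).sum()
print(count)  # → 2
```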
