Pandas GroupBy query - python

I have a dataframe in pandas which looks like the following:
Snapshot of my pandas dataframe
I want to transform it as shown below: for each customerid, the 'category' values are concatenated with a delimiter, ordered by date (format %m/%d/%Y), so that the category of the earliest order is listed first.
Desired/Transformed data frame

First convert the Date column with to_datetime, then sort with sort_values, and finally group with groupby and concatenate with join:
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df = (df.sort_values(['customerid','Age','Date'])
        .groupby(['customerid','Age'])['category']
        .agg(', '.join)
        .reset_index())
print (df)
customerid Age category
0 1 10 Electronics, Clothing
1 2 25 Grocery, Clothing
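The original frame was only posted as an image, so the following is a minimal self-contained sketch of the same steps on made-up rows. The sample values are assumptions chosen to reproduce the output above; only the column names come from the answer.

```python
import pandas as pd

# Hypothetical sample rows (the real data was only shown as an image).
df = pd.DataFrame({
    "customerid": [1, 1, 2, 2],
    "Age": [10, 10, 25, 25],
    "Date": ["03/15/2017", "01/20/2017", "02/11/2017", "01/05/2017"],
    "category": ["Clothing", "Electronics", "Clothing", "Grocery"],
})

# Parse the strings so that sorting is chronological rather than alphabetical.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

out = (
    df.sort_values(["customerid", "Age", "Date"])
      .groupby(["customerid", "Age"])["category"]
      .agg(", ".join)   # groupby keeps the sorted row order within each group
      .reset_index()
)
print(out)
```

Sorting before grouping matters here: join concatenates the categories in the order the rows reach it, so the earliest date comes first.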

Related

Assign counts from .count() to a dataframe + column names - pandas python

Hoping someone can help me here; I believe I am close to the solution.
I have a dataframe on which I am using .count() to return a series of all column names, each with its count of non-NaN values.
Example dataframe:
feature_1  feature_2
1          1
2          NaN
3          2
4          NaN
5          3
Example result for .count() here would output a series that looks like:
feature_1 5
feature_2 3
I am now trying to get this data into a dataframe with the column names "Feature" and "Count", so that the output looks like this:
Feature    Count
feature_1  5
feature_2  3
I am using .to_frame() to push the series to a dataframe in order to add column names. Full code:
df = data.count()
df = df.to_frame()
df.columns = ['Feature', 'Count']
However, I receive this error: "ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements", as though the feature names are not recognised as a column of values.
How can I get both Feature and Count recognised as columns so that I can name them?
Use Series.reset_index instead of Series.to_frame to get a two-column DataFrame: the first column comes from the index and the second from the values of the Series:
df = data.count().reset_index()
df.columns = ['Feature', 'Count']
print (df)
Feature Count
0 feature_1 5
1 feature_2 3
Another solution with name parameter and Series.rename_axis or with DataFrame.set_axis:
df = data.count().rename_axis('Feature').reset_index(name='Count')
#alternative
df = data.count().reset_index().set_axis(['Feature', 'Count'], axis=1)
print (df)
Feature Count
0 feature_1 5
1 feature_2 3
This happens because your new dataframe has only one column: the column names become the index of the Series returned by count(), and to_frame() keeps them as the DataFrame index. To assign a two-element list to df.columns, reset the index first:
df = data.count()
df = df.to_frame().reset_index()
df.columns = ['Feature', 'Count']
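To make the difference concrete, here is a small sketch that rebuilds the example frame from the question and compares the shapes produced by to_frame() and reset_index(). The variable names one_col and two_col are only illustrative.

```python
import numpy as np
import pandas as pd

# Example frame from the question.
data = pd.DataFrame({
    "feature_1": [1, 2, 3, 4, 5],
    "feature_2": [1, np.nan, 2, np.nan, 3],
})

counts = data.count()          # Series: index = column names, values = non-NaN counts

one_col = counts.to_frame()    # names stay in the index, so there is one column
two_col = counts.rename_axis("Feature").reset_index(name="Count")

print(one_col.shape)
print(two_col.shape)
print(two_col)
```

Because to_frame() leaves the feature names in the index, assigning two column names to its result raises the length mismatch error, while reset_index() moves them into a real column.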

pandas increment row based on how many times a date is in a dataframe

I have a list, for example dates = ["2020-2-1", "2020-2-3", "2020-5-8"]. I want to build a dataframe that contains only the year and month, along with a count of how many times each appears. The output should look like:
Date    Count
2020-2  2
2020-5  1
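Shorter code:
import pandas as pd
dates = ["2020-2-1", "2020-2-3", "2020-5-8"]
df = pd.DataFrame({'dates': pd.to_datetime(dates)})  # .dt requires a datetime column
df['month_year'] = df['dates'].dt.to_period('M')
df1 = df.groupby('month_year')['dates'].count().reset_index(name="count")
print(df1)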
month_year count
0 2020-02 2
1 2020-05 1
import pandas as pd
dates = ["2020-2-1", "2020-2-3", "2020-5-8"]
df = pd.DataFrame({'Date':dates})
df['Date'] = df['Date'].str.slice(0,6)
df['Count'] = 1
df = df.groupby('Date').sum().reset_index()
Note: you might want to use the format "2020-02-01" with padded zeros, so that the first 7 characters are always the year and month (and slice with str.slice(0,7) in that case).
Alternatively, the following gives separate "Month" and "Year" columns with a count for each month/year pair.
You could combine the two columns afterwards; the result has the counts you expect, though it is not fully cleaned up.
df = pd.DataFrame({'Column1' : ["2020-2-1", "2020-2-3", "2020-5-8"]})
df['Month'] = pd.to_datetime(df['Column1']).dt.month
df['Year'] = pd.to_datetime(df['Column1']).dt.year
df.groupby(['Month', 'Year']).agg('count').reset_index()
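As a compact variant of the approaches above, the counts can also be obtained with value_counts on monthly periods. This is a sketch, not the only way to do it; the column names "Date" and "Count" follow the desired output in the question.

```python
import pandas as pd

dates = ["2020-2-1", "2020-2-3", "2020-5-8"]

# Parse to datetimes, then truncate each one to its year-month period.
periods = pd.to_datetime(pd.Series(dates)).dt.to_period("M")

counts = (
    periods.value_counts()      # occurrences of each year-month
           .sort_index()        # chronological order
           .rename_axis("Date")
           .reset_index(name="Count")
)
print(counts)
```

Working with periods avoids the string-slicing pitfall with unpadded months, since parsing handles "2020-2-1" and "2020-02-01" alike.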

combining dataframes and adding values on common date index

I have many dataframes with one column (same name in all) whose indexes are date ranges - I want to merge/combine these dataframes into one, summing the values where any dates are common. below is a simplified example
range1 = pd.date_range('2021-10-01','2021-11-01')
range2 = pd.date_range('2021-11-01','2021-12-01')
df1 = pd.DataFrame(np.random.rand(len(range1),1), columns=['value'], index=range1)
df2 = pd.DataFrame(np.random.rand(len(range2),1), columns=['value'], index=range2)
Here '2021-11-01' appears in both df1 and df2 with different values.
I would like to obtain a single dataframe of 62 rows (32+31-1) in which the row for 2021-11-01 contains the sum of its values in df1 and df2.
We can use pd.concat() on the two dataframes, then df.reset_index() to get a regular integer index, rename the date column, and finally use df.groupby().sum().
df = pd.concat([df1,df2]) # this gives 63 rows by 1 column, where the column is the values and the dates are the index
df = df.reset_index() # moves the dates to a column, now called 'index', and makes a new integer index
df = df.rename(columns={'index':'Date'}) #renames the column
df.groupby('Date').sum()
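The same result can be sketched without moving the dates into a column, by grouping directly on the index with groupby(level=0). The setup repeats the question's example; the name combined is only illustrative.

```python
import numpy as np
import pandas as pd

range1 = pd.date_range("2021-10-01", "2021-11-01")
range2 = pd.date_range("2021-11-01", "2021-12-01")
df1 = pd.DataFrame({"value": np.random.rand(len(range1))}, index=range1)
df2 = pd.DataFrame({"value": np.random.rand(len(range2))}, index=range2)

# Stack the frames (63 rows), then sum rows that share the same date.
combined = pd.concat([df1, df2]).groupby(level=0).sum()

shared = pd.Timestamp("2021-11-01")
print(len(combined))
print(combined.loc[shared, "value"], df1.loc[shared, "value"] + df2.loc[shared, "value"])
```

Since pd.concat accepts a list of any length, this extends to many dataframes by passing them all in one list.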

How to map one dataframe to another dataframe for cross-sectional panel data?

I have df1 and df2, where df1 is a balanced panel of 20 stocks with daily datetime data. Because some days are missing (weekends, holidays), I number each available day with an integer (1-252). df2 is a two-column table that maps each date to its integer.
df2
date integer
2020-06-26, 1
2020-06-29, 2
2020-06-30, 3
2020-07-01, 4
2020-07-02, 5
...
2021-06-25, 252
I would like to map these integers onto every asset in df1 by date, giving a single column of 1-252 repeated for each asset.
So far I have tried this:
df3 = (df1.merge(df2, left_on='date', right_on='integer'))
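which returns an empty dataframe. I don't think I fully understand the logic here.
Assuming df1 and df2 both have a column labelled date, merge on that shared column; the attempt above compared the dates in df1 with the integers in df2, so no rows matched: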
df3 = df1.merge(df2)
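A minimal sketch of that merge on a made-up miniature panel follows. The asset names, prices, and the "asset" and "price" columns are hypothetical; only the "date" and "integer" columns come from the question. Passing on="date" and how="left" explicitly is a design choice to keep every row of df1 in its original order.

```python
import pandas as pd

# Hypothetical miniature panel: 2 assets x 3 trading days.
days = ["2020-06-26", "2020-06-29", "2020-06-30"]
df1 = pd.DataFrame({
    "date": pd.to_datetime(days * 2),
    "asset": ["AAA"] * 3 + ["BBB"] * 3,
    "price": [10.0, 10.5, 10.2, 20.0, 19.8, 20.4],
})

# Lookup table mapping each trading day to its running integer.
df2 = pd.DataFrame({"date": pd.to_datetime(days), "integer": [1, 2, 3]})

# Match on the shared date column; each asset receives the same day numbers.
df3 = df1.merge(df2, on="date", how="left")
print(df3)
```

Both date columns need the same dtype for the keys to match, which is why both are parsed with pd.to_datetime here.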

dataframe operations - column attributes to new columns in a new subset dataframe with conditions

I have the dataframe df1 with the columns type, Date and amount.
My goal is to create a Dataframe df2 with a subset of dates from df1, in which each type has a column with the amounts of the type as values for the respective date.
Input Dataframe:
df1 =
,type,Date,amount
0,42,2017-02-01,4
1,42,2017-02-02,5
2,42,2017-02-03,7
3,42,2017-02-04,2
4,48,2017-02-01,6
5,48,2017-02-02,8
6,48,2017-02-03,3
7,48,2017-02-04,6
8,46,2017-02-01,3
9,46,2017-02-02,8
10,46,2017-02-03,3
11,46,2017-02-04,4
Desired Output, if the subset of Dates are 2017-02-02 and 2017-02-04:
df2 =
,Date,42,48,46
0,2017-02-02,5,8,8
1,2017-02-04,2,6,4
I tried it like this:
types = list(df1["type"].unique())
dates = ["2017-02-02","2017-02-04"]
df2 = pd.DataFrame()
df2["Date"]=dates
for t in types:
    df2[t] = df1[(df1["type"]==t)&(df1[df1["type"]==t][["Date"]]==df2["Date"])][["amount"]]
but with this solution I get a lot of NaNs; it seems my comparison condition is wrong.
This is the output I get:
,Date,42,48,46
0,2017-02-02,,,
1,2017-02-04,,,
You can use .pivot_table() and then filter data:
df2 = df1.pivot_table(
    index="Date", columns="type", values="amount", aggfunc="sum"
)
dates = ["2017-02-02", "2017-02-04"]
print(df2.loc[dates].reset_index())
Prints:
type Date 42 46 48
0 2017-02-02 5 8 8
1 2017-02-04 2 4 6
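Note that the pivoted columns come out sorted (42, 46, 48), whereas the desired output lists them as 42, 48, 46. The sketch below rebuilds the input from the question, filters the dates first, and then reindexes the columns to the order in which the types first appear. It uses pivot rather than pivot_table, which assumes each (Date, type) pair occurs only once.

```python
import pandas as pd

# Input frame from the question.
df1 = pd.DataFrame({
    "type": [42] * 4 + [48] * 4 + [46] * 4,
    "Date": ["2017-02-01", "2017-02-02", "2017-02-03", "2017-02-04"] * 3,
    "amount": [4, 5, 7, 2, 6, 8, 3, 6, 3, 8, 3, 4],
})
dates = ["2017-02-02", "2017-02-04"]

df2 = (
    df1[df1["Date"].isin(dates)]                      # keep only the wanted dates
    .pivot(index="Date", columns="type", values="amount")
    .reindex(columns=df1["type"].unique())            # order of first appearance
    .reset_index()
)
print(df2)
```

If a (Date, type) pair can repeat, pivot raises an error; pivot_table with an aggregation function, as in the answer above, is the safer choice in that case.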
