Pandas: convert to time series with frequency counts + maintaining index - python

I currently have pandas df that looks like:
Company  Date      Title
Apple    1/2/2020  Sr. Exec
Google   2/2/2020  Manager
Google   2/2/2020  Analyst
How do I get it to maintain the Company index while counting the frequency of 'Title' per date (as shown below)?
Company  1/2/2020  2/2/2020
Apple    1         0
Google   0         2
I've tried using groupby() on the date, but it doesn't break the dates out into columns along the top row, and I need to export the resulting df to CSV, so groupby alone didn't work.

It looks like what you want is a pivot table
pivot = df.pivot_table(
    index="Company",
    columns="Date",
    values="Title",
    aggfunc=len,
    fill_value=0
).reset_index()
A quick explanation of what is happening here:
Rows will be made for each unique value in the 'Company' column
Values from the 'Date' column will become column headers
We want to count how frequently a title occurs on a given date for a given company, so we set 'Title' as the value and pass aggfunc=len (the aggregation function) to tell pandas to count the values
Since there could be a company with no titles recorded on a given date (for example, Apple on 2/2/2020), we supply fill_value=0 to prevent empty (NaN) values
Finally, we reset the index so that 'Company' is just a column rather than the index of the dataframe.
You will end up with a new index, but this is inevitable since you no longer have rows with duplicated values in the 'Company' column.
The pivot_table method is extremely powerful; see the pandas documentation for DataFrame.pivot_table for the full details.

Like this:
(pd.pivot_table(df, index='Company', columns='Date',
                values='Title', aggfunc='count')
   .reset_index()
   .rename_axis(None, axis=1)
   .fillna(0))
Output:
  Company  1/2/2020  2/2/2020
0   Apple       1.0       0.0
1  Google       0.0       2.0
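If all you need are the counts, pd.crosstab is another option: it tabulates frequencies directly and fills missing combinations with 0 (as integers, so no trailing .0). A minimal sketch, with the sample frame reconstructed from the question (the DataFrame construction and the CSV file name are assumptions for illustration):
import pandas as pd

df = pd.DataFrame({
    "Company": ["Apple", "Google", "Google"],
    "Date": ["1/2/2020", "2/2/2020", "2/2/2020"],
    "Title": ["Sr. Exec", "Manager", "Analyst"],
})

# Count Company/Date combinations; missing pairs default to 0
counts = pd.crosstab(df["Company"], df["Date"]).reset_index().rename_axis(None, axis=1)
print(counts)
#   Company  1/2/2020  2/2/2020
# 0   Apple         1         0
# 1  Google         0         2

counts.to_csv("title_counts.csv", index=False)  # hypothetical output path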

Related

Pandas - How to drop rows based on a unique column value where another column value is a minimum and handling nulls?

I have a pandas dataframe with something like the following:
index  order_id  cost
123a   123       5
123b   123       None
123c   123       3
124a   124       None
124b   124       None
For each unique value of order_id, I'd like to drop any row that isn't the lowest cost. For any order_id that only contains nulls for the cost, any one of its rows can be retained.
I've been struggling with this for a while now.
ol3 = ol3.loc[ol3.groupby('Order_ID').cost.idxmin()]
This code doesn't play nicely with the order_ids that have only nulls. So, I tried to figure out how to drop the nulls I don't want with
ol4 = ol3.loc[ol3['cost'].isna()].drop_duplicates(subset=['Order_ID', 'cost'], keep='first')
This gives me the list of all-null order_ids I want to retain. Not sure where to go from here. I'm pretty sure I'm looking at this the wrong way. Any help would be appreciated!
You can use transform to get the indexes with the min cost per order_id. We additionally need an isna check for the special order_ids that contain only NaNs:
order_mins = df.groupby('order_id').cost.transform('min')
df[(df.cost == order_mins) | (order_mins.isna())]
You can (temporarily) fill the NA/None with np.inf before getting the idxmin:
import numpy as np

ol3.loc[ol3['cost'].fillna(np.inf).groupby(ol3['order_id']).idxmin()]
You will have exactly one row per order_id
output:
  index order_id  cost
2  123c      123   3.0
3  124a      124   NaN
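For reference, a self-contained sketch of the fillna + idxmin approach on the question's sample data (lowercase column names are assumed, and the 'index' column from the question is used as the actual DataFrame index here):
import numpy as np
import pandas as pd

ol3 = pd.DataFrame(
    {"order_id": [123, 123, 123, 124, 124],
     "cost": [5, None, 3, None, None]},
    index=["123a", "123b", "123c", "124a", "124b"],
)

# Treat missing costs as +inf so idxmin can still pick one row per order_id
keep = ol3["cost"].fillna(np.inf).groupby(ol3["order_id"]).idxmin()
print(ol3.loc[keep])
#       order_id  cost
# 123c       123   3.0
# 124a       124   NaN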

Pandas: groupby first column and apply that to rest of df (store values in list)

I know there are a lot of similar questions already, but none seem to apply to my problem (or I don't know how to apply them)
So I have a Pandas DataFrame with duplicated entries in the first column, but different values in the rest of the columns.
Like so:
  location  year  GDP ...
0      AUT  1998  14...
1      AUT  2018  98...
2      AUS  1998  24...
3      AUS  2018  83...
...
I would like to get only unique entries in 'location' but keep all the columns and their values in a list:
  location          year             GDP ...
0      AUT  (1998, 2018)  (14..., 98...)
1      AUS  (1998, 2018)  (24..., 83...)
...
I tried:
grouped = df_gdp_pop_years.groupby("Location")['Year'].apply(list).reset_index(name = "Year")
and I tried to do the same with a lambda function, but I always end up with the same problem: I only keep the years.
How can I get the results I want?
You can try something like
df_gdp_pop_years.groupby("Location").agg({"Year": list, "GDP": list})
If your other columns may change or more may be added later, you can accomplish this with a generalized .agg() across all of them:
df_gdp_pop_years.groupby('location').agg(lambda x: x.tolist())
I found this by searching for 'opposite of pandas explode', which led me to a different SO question: How to implode (reverse of pandas explode) based on a column
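For reference, a minimal sketch of the .agg(list) approach on a toy frame (the GDP numbers are placeholders, since the real values are truncated in the question):
import pandas as pd

df_gdp_pop_years = pd.DataFrame({
    "location": ["AUT", "AUT", "AUS", "AUS"],
    "year": [1998, 2018, 1998, 2018],
    "GDP": [14.0, 98.0, 24.0, 83.0],  # placeholder values
})

# Collapse every non-key column into a list per location
grouped = df_gdp_pop_years.groupby("location", as_index=False).agg(list)
print(grouped)
#   location          year           GDP
# 0      AUS  [1998, 2018]  [24.0, 83.0]
# 1      AUT  [1998, 2018]  [14.0, 98.0]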

Pandas - pivoting multiple columns into fewer columns with some level of detail kept

Say I have the following code that generates a dataframe:
df = pd.DataFrame({"customer_code": ['1234','3411','9303'],
"main_purchases": [3,10,5],
"main_revenue": [103.5,401.5,99.0],
"secondary_purchases": [1,2,4],
"secondary_revenue": [43.1,77.5,104.6]
})
df.head()
There's the customer_code column that's the unique ID for each client.
And then there are 2 columns to indicate the purchases that took place and revenue generated from main branches by those clients.
And another 2 columns to indicate the purchases/revenue from secondary branches by those clients.
I want to pivot the data into a format where there is a new column differentiating main vs. secondary, while the purchase and revenue numbers stay in their own columns (one row per customer_code per type, with 'purchases' and 'revenue' columns).
The obvious solution is just to split this into two dataframes and then concatenate them, but I'm wondering whether there's a built-in way to do this in a line or two - this strikes me as the kind of thing someone might have thought to bake in a solution for.
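For reference, that split-and-concatenate route might look roughly like this (just a sketch of the manual approach; the column names and the df come from the snippet above):
main = (df[["customer_code", "main_purchases", "main_revenue"]]
        .rename(columns={"main_purchases": "purchases", "main_revenue": "revenue"})
        .assign(type="main"))
secondary = (df[["customer_code", "secondary_purchases", "secondary_revenue"]]
             .rename(columns={"secondary_purchases": "purchases",
                              "secondary_revenue": "revenue"})
             .assign(type="secondary"))
long_df = (pd.concat([main, secondary], ignore_index=True)
             .sort_values("customer_code", ignore_index=True))
The answers below (pd.wide_to_long, pivot_longer, and melt) express the same reshape more declaratively.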
With a little column renaming to get "purchases" and "revenue" first in the column names (using a regular expression and str.replace), we can use pd.wide_to_long to convert these now-stubnames from columns to rows:
# Reorder column names so stubnames are first
df.columns = [df.columns[0],
              *df.columns[1:].str.replace(r'(.*)_(.*)', r'\2_\1', regex=True)]

# Convert wide_to_long
df = (
    pd.wide_to_long(
        df,
        i='customer_code',
        stubnames=['purchases', 'revenue'],
        j='type',
        sep='_',
        suffix='.*'
    )
    .sort_index()    # Optional sort to match expected output
    .reset_index()   # retrieve customer_code from the index
)
df:
  customer_code       type  purchases  revenue
0          1234       main          3    103.5
1          1234  secondary          1     43.1
2          3411       main         10    401.5
3          3411  secondary          2     77.5
4          9303       main          5     99.0
5          9303  secondary          4    104.6
What does reordering the column headers do?
df.columns = [df.columns[0],
              *df.columns[1:].str.replace(r'(.*)_(.*)', r'\2_\1', regex=True)]
Produces:
Index(['customer_code', 'purchases_main', 'revenue_main',
       'purchases_secondary', 'revenue_secondary'],
      dtype='object')
The "type" column is now the suffix of the column header which allows wide_to_long to process the table as expected.
You can abstract the reshaping process with pivot_longer from pyjanitor; it is essentially a collection of wrapper functions around pandas:
# pip install pyjanitor
import pandas as pd
import janitor

df.pivot_longer(index='customer_code',
                names_to=('type', '.value'),
                names_sep='_',
                sort_by_appearance=True)
  customer_code       type  purchases  revenue
0          1234       main          3    103.5
1          1234  secondary          1     43.1
2          3411       main         10    401.5
3          3411  secondary          2     77.5
4          9303       main          5     99.0
5          9303  secondary          4    104.6
The .value in names_to signifies to the function that you want that part of the column name to remain as a header; the other part goes under the type column. The split is determined in this case by names_sep (there is also a names_pattern option that allows splitting with a regular expression); if you do not care about the order of appearance, you can set sort_by_appearance to False.
You can also use the melt() and concat() functions to solve this problem.
import pandas as pd

df1 = df.melt(
    id_vars='customer_code',
    value_vars=['main_purchases', 'secondary_purchases'],
    var_name='type',
    value_name='purchases',
    ignore_index=True)

df2 = df.melt(
    id_vars='customer_code',
    value_vars=['main_revenue', 'secondary_revenue'],
    var_name='type',
    value_name='revenue',
    ignore_index=True)
Then we use concat() with the parameter axis=1 to join side by side and use sort_values(by='customer_code') to sort data by customer.
result = pd.concat([df1, df2['revenue']],
                   axis=1,
                   ignore_index=False).sort_values(by='customer_code')
Using replace() with a regex to clean up the type names:
result['type'] = result['type'].replace(r'_.*$', '', regex=True)
The above code will output the below dataframe:
  customer_code       type  purchases  revenue
0          1234       main          3    103.5
3          1234  secondary          1     43.1
1          3411       main         10    401.5
4          3411  secondary          2     77.5
2          9303       main          5     99.0
5          9303  secondary          4    104.6

Python pandas column filtering substring

I have a dataframe in python3 using pandas which has a column containing a string with a date.
This is the subset of the column
ColA
"2021-04-03"
"2021-04-08"
"2020-04-12"
"2020-04-08"
"2020-04-12"
I would like to remove the rows that have the same month and day twice and keep the one with the newest year.
This would be what I would expect as a result from this subset
ColA
"2021-04-03"
"2021-04-08"
"2020-04-12"
The last two rows were removed: 2020-04-08 because 04-08 already appears in 2021, and the second 2020-04-12 because it duplicates an earlier row.
I thought of doing this with an apply and a lambda, but my real dataframe has hundreds of rows and tens of columns, so it would not be efficient. Is there a more efficient way of doing this?
There are a couple of ways you can do this. One of them would be to extract the year, sort by it, and drop rows with duplicate month-day pairs.
# separate year and month-day pairs
df['year'] = df['ColA'].apply(lambda x: x[:4])
df['mo-day'] = df['ColA'].apply(lambda x: x[5:])
df.sort_values('year', inplace=True)
print(df)
This is what it would look like after separation and sorting:
         ColA  year mo-day
2  2020-04-12  2020  04-12
3  2020-04-08  2020  04-08
4  2020-04-12  2020  04-12
0  2021-04-03  2021  04-03
1  2021-04-08  2021  04-08
Afterwards, we can simply drop the duplicates and remove the additional columns:
# drop duplicate month-day pairs, keeping the last occurrence
# (the newest year, since the frame is sorted by year ascending)
df.drop_duplicates('mo-day', keep='last', inplace=True)
# get rid of the two columns
df.drop(['year','mo-day'], axis=1, inplace=True)
# since we dropped duplicate, reset the index
df.reset_index(drop=True, inplace=True)
print(df)
Final result:
         ColA
0  2020-04-12
1  2021-04-03
2  2021-04-08
This would be much faster than if you were to convert the entire column to datetime and extract dates, as you're working with the string as is.
I'm not sure you can get away from using an 'apply' to extract the relevant part of the date for grouping, but this is much easier if you first convert that column to a pandas datetime type:
df = pd.DataFrame({'colA':
                   ["2021-04-03",
                    "2021-04-08",
                    "2020-04-12",
                    "2020-04-08",
                    "2020-04-12"]})
df['colA'] = df.colA.apply(pd.to_datetime)
Then you can group by the (day, month) and keep the highest value like so:
df.groupby(df.colA.apply(lambda x: (x.day, x.month))).max()
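If avoiding apply matters, a fully vectorized sketch is also possible: since ISO-formatted date strings sort chronologically, you can sort descending and drop duplicates on the month-day part (the column name is taken from the question; the helper column name is arbitrary):
import pandas as pd

df = pd.DataFrame({"ColA": ["2021-04-03", "2021-04-08", "2020-04-12",
                            "2020-04-08", "2020-04-12"]})

# Month-day key extracted without apply
key = pd.to_datetime(df["ColA"]).dt.strftime("%m-%d")

result = (df.assign(monthday=key)
            .sort_values("ColA", ascending=False)  # newest year first
            .drop_duplicates("monthday")
            .drop(columns="monthday")
            .sort_index())
print(result)
#          ColA
# 0  2021-04-03
# 1  2021-04-08
# 2  2020-04-12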

How to handle records in dataframe with same ID but some different values in columns in python

I am working on a dataframe using pandas with bank (loan) details for customers. There is a problem because some unique loan IDs have been recorded twice, with different values for some of the features. I am attaching a screenshot to be more specific.
Now you see, for instance, this unique Loan ID has been recorded 2 times. I want to drop the second one with NaN values, but I can't do it manually because there are 4900 similar cases. Any idea?
The problem is not the NaN values, the problem is the double records. I want to drop rows with NaN values only for double records, not for the entire dataframe.
Thanks in advance
Count the rows per ID, and then only drop the NaN rows where that count is greater than 1:
df['flag'] = df.groupby(['Loan ID', 'Credit ID'])['Loan ID'].transform('count')
# Keep single records as-is; for flagged duplicates, keep only rows with values present
mask = (df['flag'] == 1) | df[['Credit Score', 'Annual Income']].notna().all(axis=1)
df = df.loc[mask].drop('flag', axis=1)
Instead of dropping NaN rows, just keep the rows where the credit score is not NaN:
df = df[df['Credit Score'].notna()]
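Another way to express "drop NaN rows only where the ID is duplicated", as a hedged sketch (the column names 'Loan ID', 'Credit Score', and 'Annual Income' are assumed from the answers above and may differ in the real data):
# True for every row whose Loan ID appears more than once
dup = df.duplicated(subset="Loan ID", keep=False)

# True for rows missing either of the values that distinguish the duplicates
has_nan = df[["Credit Score", "Annual Income"]].isna().any(axis=1)

# Drop NaN rows only within the duplicated IDs; single-record loans are untouched
df = df[~(dup & has_nan)]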
