How to group phone numbers with and without country code - python

I am trying to detect phone numbers. My country code is +62, but some phone manufacturers or operators use 0 while others use +62, so after querying and pivoting, the same number shows up under two different column names.
Here's the pivoted data:
Id  +623684682  03684682  +623684684  03684684
1            1         0           1         1
2            1         1           2         1
Here's what I need: the columns grouped, but not manually (+623684682 and 03684682 are the same number, etc.):
Id  03684682  03684684
1          1         2
2          2         3

You can rename the columns on the fly inside groupby (grouping along axis=1) and aggregate with sum:
df = df.groupby(lambda x: x.replace('+62','0'), axis=1).sum()
Or replace the column names first and then sum the duplicates (str.replace needs a raw string and regex=True, since + is a regex metacharacter; df.sum(level=...) was removed in pandas 2.0, so group by the column level instead):
df.columns = df.columns.str.replace(r'\+62', '0', regex=True)
df = df.groupby(level=0, axis=1).sum()
print (df)
    03684682  03684684
Id
1          1         2
2          2         3
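
A self-contained sketch of the second approach, built from the sample data above (the transpose round-trip stands in for axis=1 grouping, which recent pandas versions deprecate):

import pandas as pd

# Pivoted sample data from the question
df = pd.DataFrame({'+623684682': [1, 1], '03684682': [0, 1],
                   '+623684684': [1, 2], '03684684': [1, 1]},
                  index=pd.Index([1, 2], name='Id'))

# Normalise the +62 prefix so both spellings share one column name,
# then sum columns with the same name (transpose, group rows, transpose back)
df.columns = df.columns.str.replace(r'\+62', '0', regex=True)
df = df.T.groupby(level=0).sum().T
print(df)
#     03684682  03684684
# Id
# 1          1         2
# 2          2         3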

Related

How to get most recent date based on a given date using python?

Consider the following two dataframes:
Dataframe1 contains a list of users and stop_dates
Dataframe2 contains a history of user transactions and dates
I want to get the last transaction date before the stop date for all users in Dataframe1 (some users in Dataframe1 have multiple stop dates)
I want the output to look like the following:
Please always provide data in a form that makes it easy to use as samples (i.e. as text, not as images).
You could try:
df1["Stop_Date"] = pd.to_datetime(df1["Stop_Date"], format="%m/%d/%y")
df2["Transaction_Date"] = pd.to_datetime(df2["Transaction_Date"], format="%m/%d/%y")
df = (
    df1.merge(df2, on="UserID", how="left")
    .loc[lambda df: df["Stop_Date"] >= df["Transaction_Date"]]
    .groupby(["UserID", "Stop_Date"])["Transaction_Date"].max()
    .to_frame().reset_index().drop(columns="Stop_Date")
)
Make datetimes out of the date columns.
Merge df2 on df1 along UserID.
Remove the rows which have a Transaction_Date greater than Stop_Date.
Group the result by UserID and Stop_Date, and fetch the maximum Transaction_Date.
Bring the result in shape.
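To make this reproducible, the sample frames shown below can be built like this (values copied from the tables that follow):

import pandas as pd

df1 = pd.DataFrame({'UserID': [1, 2, 3, 3],
                    'Stop_Date': ['2/2/22', '6/9/22', '7/25/22', '9/14/22']})
df2 = pd.DataFrame({'UserID': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4],
                    'Transaction_Date': ['1/2/22', '2/1/22', '2/3/22', '1/24/22', '3/22/22',
                                         '6/25/22', '7/20/22', '9/13/22', '9/14/22', '2/2/22']})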
Result for

df1:
   UserID Stop_Date
0       1    2/2/22
1       2    6/9/22
2       3   7/25/22
3       3   9/14/22

df2:
   UserID Transaction_Date
0       1           1/2/22
1       1           2/1/22
2       1           2/3/22
3       2          1/24/22
4       2          3/22/22
5       3          6/25/22
6       3          7/20/22
7       3          9/13/22
8       3          9/14/22
9       4           2/2/22

is

   UserID Transaction_Date
0       1       2022-02-01
1       2       2022-03-22
2       3       2022-07-20
3       3       2022-09-14
If you don't want to permanently change the dtype to datetime, and also want the result as a string formatted like the input (with zero-padding), then you could try:
df = (
    df1
    .assign(Stop_Date=pd.to_datetime(df1["Stop_Date"], format="%m/%d/%y"))
    .merge(
        df2.assign(Transaction_Date=pd.to_datetime(df2["Transaction_Date"], format="%m/%d/%y")),
        on="UserID", how="left"
    )
    .loc[lambda df: df["Stop_Date"] >= df["Transaction_Date"]]
    .groupby(["UserID", "Stop_Date"])["Transaction_Date"].max()
    .to_frame().reset_index().drop(columns="Stop_Date")
    .assign(Transaction_Date=lambda df: df["Transaction_Date"].dt.strftime("%m/%d/%y"))
)
Result:
   UserID Transaction_Date
0       1         02/01/22
1       2         03/22/22
2       3         07/20/22
3       3         09/14/22
Here is one way to accomplish it (make sure both date columns are already datetime). Note that the lookup must stay within each user's own transactions:
df = pd.merge(df1, df2, on="UserID")
df["Last_Before_Stop"] = df.apply(
    lambda r: df["Transaction_Date"][(df["UserID"] == r["UserID"]) & (df["Transaction_Date"] <= r["Stop_Date"])].max(),
    axis=1)

Pandas - Applying filter in groupby

I am trying to perform a groupby on a DataFrame. I need two aggregations: a total count, and a count filtered on one column.
product, count, type
prod_a,100,1
prod_b,200,2
prod_c,23,3
prod_d,23,1
I am trying to create a pivot with two columns: one with the count of products sold, and one with the count of products of type 1:
        sold  type_1
prod_a     1       1
prod_b     1       0
prod_c     1       0
prod_d     1       1
I am able to get the count of products sold, but I am not sure how to apply the filter to also get the count by type. My attempt:
df.groupby("product").agg({'count': [('sold', 'count')]})
If you need a count by only one condition, like type == 1, use GroupBy.agg with named aggregation:
df2 = df.groupby("product").agg(sold=('count', 'count'),
                                type_1=('type', lambda x: (x == 1).sum()))
print (df2)
         sold  type_1
product
prod_a      1       1
prod_b      1       0
prod_c      1       0
prod_d      1       1
To improve performance, first create the helper column and then aggregate with sum:
df2 = (df.assign(type_1=df['type'].eq(1).astype(int))
         .groupby("product").agg(sold=('count', 'count'),
                                 type_1=('type_1', 'sum')))
For counts of all type values at once, use crosstab with DataFrame.join:
df1 = pd.crosstab(df['product'], df['type']).add_prefix('type_')
df2 = df.groupby("product").agg(sold=('count', 'count')).join(df1)
print (df2)
         sold  type_1  type_2  type_3
product
prod_a      1       1       0       0
prod_b      1       0       1       0
prod_c      1       0       0       1
prod_d      1       1       0       0
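
A runnable sketch of the named-aggregation answer, starting from the CSV sample in the question:

import pandas as pd
from io import StringIO

data = '''product,count,type
prod_a,100,1
prod_b,200,2
prod_c,23,3
prod_d,23,1'''
df = pd.read_csv(StringIO(data))

df2 = df.groupby('product').agg(sold=('count', 'count'),
                                type_1=('type', lambda x: (x == 1).sum()))
print(df2)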

Enumerate rows by category

I have the following dataframe that I'm ordering by category and values:
d = {"cat":["a","b","a","c","c"],"val" :[1,2,3,1,4] }
df = pd.DataFrame(d)
df = df.sort_values(["cat","val"])
Now, from that dataframe, I want to enumerate the occurrences of each category, so that the result is as follows:
df["cat_count"] = [1,2,1,1,2]
Is there a way to automate this?
You can use cumcount like this (see the GroupBy.cumcount documentation for details):
df['count'] = df.groupby('cat').cumcount()+1
print (df)
Output:
  cat  val  count
0   a    1      1
2   a    3      2
1   b    2      1
3   c    1      1
4   c    4      2
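
Note that cumcount numbers rows in their current order, so the sort must come first. The whole example as one runnable block, with the column named cat_count as in the question:

import pandas as pd

df = pd.DataFrame({'cat': ['a', 'b', 'a', 'c', 'c'], 'val': [1, 2, 3, 1, 4]})
df = df.sort_values(['cat', 'val'])                 # order rows within each category
df['cat_count'] = df.groupby('cat').cumcount() + 1  # 1-based position within the group
print(df)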

Adding a column from the original data frame to a groupby data frame?

I have a data frame df1 with data that looks like this:
   Item  Store  Sales Dept
0     1      1      5    A
1     1      2      3    A
2     1      3      4    A
3     2      1      3    A
4     2      2      3    A
I then want to use group by to see the total sales by item:
df2 = df1.groupby(['Item']).agg({'Item':'first','Sales':'sum'})
Which gives me:
   Item  Sales
0     1     12
1     2      6
And then I add a column with the rank of the item in terms of number of sales:
df2['Item Rank'] = df2['Sales'].rank(ascending=False,method='min').astype(int)
So that I get:
   Item  Sales  Item Rank
0     1     12          1
1     2      6          2
I now want to add the Dept column to df2, so that I have
   Item  Sales  Item Rank Dept
0     1     12          1    A
1     2      6          2    A
But everything I have tried has failed.
I either get an empty column, when I try to add the column in from the beginning, or a df with the wrong size if I try to concatenate the new df with the column from the original df.
df.groupby(['Item']).agg({'Item': 'first', 'Sales': 'sum', 'Dept': 'first'}).\
    assign(Itemrank=lambda d: d['Sales'].rank(ascending=False, method='min').astype(int))
Out[64]:
      Item  Sales Dept  Itemrank
Item
1        1     12    A         1
2        2      6    A         2
This is unusual, but you can also add the Dept column while doing the groupby itself.
A simple option is just to hard-code the value if you already know what it needs to be:
df2 = df1.groupby(['Item']).agg({'Item': 'first',
                                 'Sales': 'sum',
                                 'Dept': lambda x: 'A'})
Or you could take it from the group itself (the first Dept within each Item):
df2 = df1.groupby(['Item']).agg({'Item': 'first',
                                 'Sales': 'sum',
                                 'Dept': lambda x: x.iloc[0]})
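
An end-to-end sketch of the combined approach (sample data as in the question; as_index=False keeps Item as a regular column, and the rank is computed on the aggregated sales rather than the original rows):

import pandas as pd

df1 = pd.DataFrame({'Item': [1, 1, 1, 2, 2],
                    'Store': [1, 2, 3, 1, 2],
                    'Sales': [5, 3, 4, 3, 3],
                    'Dept': ['A', 'A', 'A', 'A', 'A']})

df2 = (df1.groupby('Item', as_index=False)
          .agg({'Sales': 'sum', 'Dept': 'first'})
          .assign(ItemRank=lambda d: d['Sales'].rank(ascending=False, method='min').astype(int)))
print(df2)
#    Item  Sales Dept  ItemRank
# 0     1     12    A         1
# 1     2      6    A         2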

Column headers like pivot table

I am trying to find out the mix of member grades that visit my stores.
import pandas as pd
df = pd.DataFrame({'MbrID': ['M1','M2','M3','M4','M5','M6','M7'],
                   'Store': ['PAR','TPM','AMK','TPM','PAR','PAR','AMK'],
                   'Grade': ['A','A','B','A','C','A','C']})
df = df[['MbrID','Store','Grade']]
print(df)
df.groupby('Store').agg({'Grade':pd.Series.nunique})
Below is the dataframe and also the result of the groupby function.
How do I produce the result like Excel Pivot table, such that the categories of Grade (A,B,C) is the column headers? This is assuming that I have quite a wide range of member grades.
I think you can use groupby with size and reshaping by unstack:
df1 = df.groupby(['Store','Grade'])['Grade'].size().unstack(fill_value=0)
print (df1)
Grade  A  B  C
Store
AMK    0  1  1
PAR    2  0  1
TPM    2  0  0
Solution with crosstab:
df2 = pd.crosstab(df.Store, df.Grade)
print (df2)
Grade  A  B  C
Store
AMK    0  1  1
PAR    2  0  1
TPM    2  0  0
and with pivot_table:
df3 = df.pivot_table(index='Store',
                     columns='Grade',
                     values='MbrID',
                     aggfunc=len,
                     fill_value=0)
print (df3)
Grade  A  B  C
Store
AMK    0  1  1
PAR    2  0  1
TPM    2  0  0
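
Since the question is about the mix of grades per store, note that crosstab can also return row proportions directly via its normalize parameter; a small sketch:

# Share of each grade within a store (each row sums to 1)
mix = pd.crosstab(df.Store, df.Grade, normalize='index')
print(mix.round(2))
# Grade     A    B     C
# Store
# AMK    0.00  0.5  0.50
# PAR    0.67  0.0  0.33
# TPM    1.00  0.0  0.00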
