How to merge or join pandas pivot tables in Python

I am trying to merge or join two pandas pivot tables.
I tried using merge, and it does work, but the result wasn't what I was expecting: it gives me duplicates from either table.
df1.fillna('', inplace=True)   # blank out NaN so empty cells print cleanly
df1.reset_index(inplace=True)  # turn the pivot index levels into ordinary columns
df2.fillna('', inplace=True)
df2.reset_index(inplace=True)
df = pd.merge(df1, df2, how='left', on=['KEY1', 'KEY2'])
First Pivot Table:
KEY1 KEY2 KEY3 Column_A0 Column_A1 Column_A2
Row_X0 Row_Y0 Row_Z0 123 123
Row_Y1 Row_Z0 456
Row_Z1 789 789
Row_X1 Row_Y0 Row_Z0 123
Row_Z1 789
Row_Z2 456 789
Second Pivot Table:
KEY1 KEY2 KEY3 Column_B0 Column_B1 Column_B2 Column_B3
Row_X0 Row_Y0 Row_W0 1 234
Row_W1 2 345
Row_W2 3 456
Row_Y1 Row_W0 4 567 1 234
Row_W1 7 890 2 345
Row_W2 8 901 3 456
Row_W3 9 12 4 567
Row_X1 Row_Y0 Row_W0 7 890
Row_W1 8 901
Row_W2 9 12
The result I expect:
KEY1 KEY2 KEY3_X Column_A0 Column_A1 Column_A2 KEY3_Y Column_B0 Column_B1 Column_B2 Column_B3
Row_X0 Row_Y0 Row_Z0 123 123 Row_W0 1 234
Row_W1 2 345
Row_W2 3 456
Row_Y1 Row_Z0 456 Row_W0 4 567 1 234
Row_Z1 789 789 Row_W1 7 890 2 345
Row_W2 8 901 3 456
Row_W3 9 12 4 567
Row_W0 7 890
Row_X1 Row_Y0 Row_Z0 123 Row_W1 8 901
Row_Z1 789 Row_W2 9 12
Row_Z2 456 789
Is there anything I can do to make this happen? Thank you.

Concat by row or column. The pd.concat function lets you combine tables along either the row or the column axis.
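A minimal sketch of both modes, plus one way to approximate the expected side-by-side layout (df1 and df2 stand in for the two pivot tables above, and the positional pairing within each KEY1/KEY2 group is an assumption about the intent):
import pandas as pd

stacked = pd.concat([df1, df2])               # by row: appends df2 below df1
side_by_side = pd.concat([df1, df2], axis=1)  # by column: aligns on the index

# Positional pairing per KEY1/KEY2 group, mirroring the KEY3_X/KEY3_Y columns
# in the expected output (assumes df1/df2 were reset_index()-ed as above):
a = df1.rename(columns={'KEY3': 'KEY3_X'})
b = df2.rename(columns={'KEY3': 'KEY3_Y'})
a['pos'] = a.groupby(['KEY1', 'KEY2']).cumcount()
b['pos'] = b.groupby(['KEY1', 'KEY2']).cumcount()
out = pd.merge(a, b, on=['KEY1', 'KEY2', 'pos'], how='outer').drop(columns='pos')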

Related

Get max calls by a person Pandas Python

Let's say I have numbers in column A, and each of them calls several people in column B:
A B
123 987
123 987
123 124
435 567
435 789
653 876
653 876
999 654
999 654
999 654
999 123
I want to find whom each person in A has called the most, and also the number of times.
OUTPUT:
A B Count
123 987 2
435 567 or 789 1
653 876 2
999 654 3
One way to think of it is:
A B
123 987 2
124 1
435 567 1
789 1
653 876 2
999 654 3
123 1
Can somebody help me out on how to do this?
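For reference, a minimal reconstruction of the sample data above so the snippets in the answers below can run as-is:
import pandas as pd

df = pd.DataFrame({
    'A': [123, 123, 123, 435, 435, 653, 653, 999, 999, 999, 999],
    'B': [987, 987, 124, 567, 789, 876, 876, 654, 654, 654, 123],
})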
Try this
# count occurrences of each unique (A, B) pair
df.value_counts(['A','B']).sort_index()
A B
123 124 1
987 2
435 567 1
789 1
653 876 2
999 123 1
654 3
dtype: int64
To get the highest values for each unique A:
v = df.value_counts(['A','B'])
# keep only the first (i.e. highest) count for each A
v[~v.reset_index(level=0).duplicated('A').values]
A B
999 654 3
123 987 2
653 876 2
435 567 1
dtype: int64
Use SeriesGroupBy.value_counts, which sorts values by default, then take the first row per A with GroupBy.head:
df = df.groupby('A')['B'].value_counts().groupby(level=0).head(1).reset_index(name='Count')
print (df)
A B Count
0 123 987 2
1 435 567 1
2 653 876 2
3 999 654 3
Another idea:
df = df.value_counts(['A','B']).reset_index(name='Count').drop_duplicates('A')
print (df)
A B Count
0 999 654 3
1 123 987 2
2 653 876 2
4 435 567 1

How to join all columns in dataframe? [duplicate]

This question already has answers here:
Pandas: Multiple columns into one column (4 answers)
How to stack/append all columns into one column in Pandas? [duplicate] (4 answers)
Closed 10 months ago.
I would like one column to contain all the other columns in the data frame combined.
Here is what the dataframe looks like:
0 1 2
0 123 321 231
1 232 321 231
2 432 432 432
The dataframe's name is task_ba.
I would like it to look like this:
0
0 123
1 232
2 432
3 321
4 321
5 432
6 231
7 231
8 432
Easiest and fastest option: use the underlying NumPy array:
df2 = pd.DataFrame(df.values.ravel(order='F'))  # order='F' flattens column by column
NB: if you prefer a Series, use pd.Series instead (a sketch follows the output below).
Output:
0
0 123
1 232
2 432
3 321
4 321
5 432
6 231
7 231
8 432
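For the Series variant mentioned in the note above, a minimal sketch:
s = pd.Series(df.values.ravel(order='F'))  # same column-major flattening, as a Series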
You can use pd.DataFrame.melt() and then drop the variable column:
>>> df
0 1 2
0 123 321 231
1 232 321 231
2 432 432 432
>>> df.melt().drop("variable", axis=1) # Drops the 'variable' column
value
0 123
1 232
2 432
3 321
4 321
5 432
6 231
7 231
8 432
Or if you want 0 as your column name:
>>> df.melt(value_name=0).drop("variable", axis=1)
0
0 123
1 232
2 432
3 321
4 321
5 432
6 231
7 231
8 432
You can learn all this (and more!) in the official documentation.

Selecting Items in dataframe

Using Python 3
I have a dataframe sort of like this:
productCode productType storeCode salesAmount moreInfo
111 1 111 111 info
111 1 112 112 info
456 4 456 456 info
and so on for thousands of rows
I want to select (and get a list of the codes for) the top X best-selling unique products for each store.
How would I accomplish that?
Data:
df = pd.DataFrame({'productCode': [111, 111, 456, 123, 125],
                   'productType': [1, 1, 4, 3, 3],
                   'storeCode': [111, 112, 112, 456, 456],
                   'salesAmount': [111, 112, 34, 456, 1235]})
productCode productType storeCode salesAmount
0 111 1 111 111
1 111 1 112 112
2 456 4 112 34
3 123 3 456 456
4 125 3 456 1235
It sounds like you want the best selling product at each storeCode? In which case:
df.sort_values('salesAmount', ascending=False).groupby('storeCode').head(1)
productCode productType storeCode salesAmount
4 125 3 456 1235
1 111 1 112 112
0 111 1 111 111
If instead you want the best-selling product of each productType at each storeCode, then:
df.sort_values('salesAmount', ascending=False).groupby(['storeCode', 'productType']).head(1)
productCode productType storeCode salesAmount
4 125 3 456 1235
1 111 1 112 112
0 111 1 111 111
2 456 4 112 34
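For the general top-X case the question asks about, a minimal sketch (X is a hypothetical parameter; salesAmount is summed per product first, in case a product appears more than once in a store):
X = 2  # hypothetical: how many top products to keep per store
top_x = (df.groupby(['storeCode', 'productCode'], as_index=False)['salesAmount']
           .sum()                                        # total sales per product per store
           .sort_values('salesAmount', ascending=False)  # best sellers first
           .groupby('storeCode')
           .head(X))                                     # keep the top X rows per store
codes_per_store = top_x.groupby('storeCode')['productCode'].apply(list)  # codes per store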

Pandas Dataframe iteration and selecting the rows based on condition - Change in Requirements

I have a sorted data frame as shown below (Input DataFrame), and I need to iterate over the rows and select/retrieve rows into an output data frame based on the conditions below.
• Condition 1: For a given R1, R2, W, if we have two records with TYPE 'A' and 'B':
a) If (amount1 & amount2) of TYPE 'A' are > (amount1 & amount2) of TYPE 'B', we need to bring the TYPE 'A' record into the output.
b) If (amount1 & amount2) of TYPE 'B' are > (amount1 & amount2) of TYPE 'A', we need to bring the TYPE 'B' record into the output.
c) If (amount1 & amount2) of TYPE 'A' are = (amount1 & amount2) of TYPE 'B', we need to bring the TYPE 'A' record into the output.
• Condition 2: For a given R1, R2, W, if we have only a record with TYPE 'A', we need to bring the TYPE 'A' record into the output.
• Condition 3: For a given R1, R2, W, if we have only a record with TYPE 'B', we need to bring the TYPE 'B' record into the output.
Input Dataframe
R1 R2 W TYPE amount1 amount2
0 123 12 1 A 111 222
1 123 12 1 B 111 222
2 123 12 2 A 222 222
3 123 12 2 B 333 333
4 123 12 3 A 444 444
5 123 12 3 B 333 333
6 123 34 1 A 111 222
7 123 34 2 A 333 444
8 123 34 2 B 333 444
9 123 34 3 B 444 555
10 123 34 4 A 555 666
11 123 34 4 B 666 777
Output dataframe
R1 R2 W TYPE amount1 amount2
0 123 12 1 A 111 222
3 123 12 2 B 333 333
4 123 12 3 A 444 444
6 123 34 1 A 111 222
7 123 34 2 A 333 444
9 123 34 3 B 444 555
11 123 34 4 B 666 777
Selection based on your criteria:
def my_selection(idf):
    # If both 'A' and 'B' are present in 'TYPE', keep the 'A' row
    if idf['TYPE'].unique().shape[0] == 2:
        return idf[idf['TYPE'] == 'A']
    else:
        return idf

df2 = df.groupby(['R1', 'R2', 'W'], as_index=False).apply(my_selection)
df2.index = df2.index.droplevel(-1)  # drop the per-group inner index level
# R1 R2 W TYPE amount1 amount2
# 0 123 12 1 A 111 222
# 1 123 12 2 A 333 444
# 2 123 12 3 A 555 666
# 3 123 34 1 A 111 222
# 4 123 34 2 A 222 333
# 5 123 34 3 B 444 555
# 6 123 34 4 A 555 666
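Note that the selection above always keeps 'A' whenever both types are present. A minimal sketch that also applies the amount comparison from Condition 1 (ties and every other two-record case fall back to 'A', per Conditions 1a and 1c):
def pick_row(g):
    if len(g) == 1:                       # Conditions 2 and 3: only one TYPE present
        return g
    a = g[g['TYPE'] == 'A'].iloc[0]
    b = g[g['TYPE'] == 'B'].iloc[0]
    # Condition 1b: keep 'B' only when both of its amounts are larger
    if b['amount1'] > a['amount1'] and b['amount2'] > a['amount2']:
        return g[g['TYPE'] == 'B']
    return g[g['TYPE'] == 'A']            # Conditions 1a and 1c

out = df.groupby(['R1', 'R2', 'W'], group_keys=False).apply(pick_row)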
All you have to do is group by R1, R2, W and operate on the TYPE column as follows:
data.groupby(['R1','R2','W']).apply(lambda x: 'A' if 'A' in x['TYPE'].values else 'B').reset_index()
You can merge this output with the original DataFrame on the obtained columns to get the corresponding amount1 and amount2 values, as sketched below.
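A minimal sketch of that merge step (assuming the data frame is named data, as in the snippet above):
picked = (data.groupby(['R1', 'R2', 'W'])
              .apply(lambda x: 'A' if 'A' in x['TYPE'].values else 'B')
              .reset_index(name='TYPE'))
# join back to the original rows to recover amount1 and amount2
out = picked.merge(data, on=['R1', 'R2', 'W', 'TYPE'], how='left')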
This is what I would do:
categories = ['B','A'] #create a list of categories in ascending order of precedence
d={i:e for e,i in enumerate(categories)} #create a dictionary:{'A': 0, 'B': 1}
s=df['TYPE'].map(d) #map to df['TYPE'] and create a helper series
Then assign this series to the dataframe, group by and transform with max, and keep the rows where the transformed value equals the helper series:
out = df[s.eq(df.assign(TYPE=s).groupby(['R1','R2','W'])['TYPE'].transform('max'))]
print(out)
R1 R2 W TYPE amount1 amount2
0 123 12 1 A 111 222
2 123 12 2 A 333 444
4 123 12 3 A 555 666
6 123 34 1 A 111 222
7 123 34 2 A 222 333
9 123 34 3 B 444 555
10 123 34 4 A 555 666

How to calculate cumulative groupby counts in Pandas with point in time?

I have a df that contains multiple weekly snapshots of JIRA tickets. I want to calculate the YTD counts of tickets.
The df looks like this:
pointInTime ticketId
2008-01-01 111
2008-01-01 222
2008-01-01 333
2008-01-07 444
2008-01-07 555
2008-01-07 666
2008-01-14 777
2008-01-14 888
2008-01-14 999
So if I run df.groupby(['pointInTime'])['ticketId'].count(), I can get the count of IDs in every snapshot. But what I want to achieve is to calculate the cumulative sum
and have a df that looks like this:
pointInTime ticketId cumCount
2008-01-01 111 3
2008-01-01 222 3
2008-01-01 333 3
2008-01-07 444 6
2008-01-07 555 6
2008-01-07 666 6
2008-01-14 777 9
2008-01-14 888 9
2008-01-14 999 9
So for 2008-01-07, the number of tickets would be the count for 2008-01-07 plus the count for 2008-01-01.
Use GroupBy.count and cumsum, then map the result back to "pointInTime":
df['cumCount'] = (
    df['pointInTime'].map(df.groupby('pointInTime')['ticketId'].count().cumsum()))
df
pointInTime ticketId cumCount
0 2008-01-01 111 3
1 2008-01-01 222 3
2 2008-01-01 333 3
3 2008-01-07 444 6
4 2008-01-07 555 6
5 2008-01-07 666 6
6 2008-01-14 777 9
7 2008-01-14 888 9
8 2008-01-14 999 9
I am using value_counts:
df.pointInTime.map(df.pointInTime.value_counts().sort_index().cumsum())
Out[207]:
0 3
1 3
2 3
3 6
4 6
5 6
6 9
7 9
8 9
Name: pointInTime, dtype: int64
Or
pd.Series(np.arange(len(df))+1,index=df.index).groupby(df['pointInTime']).transform('last')
Out[216]:
0 3
1 3
2 3
3 6
4 6
5 6
6 9
7 9
8 9
dtype: int32
Here's an approach that transforms with the group size and multiplies by the 1-based codes from pd.factorize on pointInTime (this works here because each snapshot contains the same number of tickets):
df['cumCount'] = (df.groupby('pointInTime').ticketId
                    .transform('size')
                    .mul(pd.factorize(df.pointInTime)[0] + 1))
pointInTime ticketId cumCount
0 2008-01-01 111 3
1 2008-01-01 222 3
2 2008-01-01 333 3
3 2008-01-07 444 6
4 2008-01-07 555 6
5 2008-01-07 666 6
6 2008-01-14 777 9
7 2008-01-14 888 9
8 2008-01-14 999 9
