I am trying to merge or join two pandas pivot tables.
I tried using merge, and it does work, but the result wasn't what I was expecting: it gives me duplicate rows from both tables.
df1.fillna('', inplace=True)
df1.reset_index(inplace=True)
df2.fillna('', inplace=True)
df2.reset_index(inplace=True)
df = pd.merge(df1, df2, how='left', on=['KEY1', 'KEY2'])
First Pivot Table:
KEY1 KEY2 KEY3 Column_A0 Column_A1 Column_A2
Row_X0 Row_Y0 Row_Z0 123 123
Row_Y1 Row_Z0 456
Row_Z1 789 789
Row_X1 Row_Y0 Row_Z0 123
Row_Z1 789
Row_Z2 456 789
Second Pivot Table:
KEY1 KEY2 KEY3 Column_B0 Column_B1 Column_B2 Column_B3
Row_X0 Row_Y0 Row_W0 1 234
Row_W1 2 345
Row_W2 3 456
Row_Y1 Row_W0 4 567 1 234
Row_W1 7 890 2 345
Row_W2 8 901 3 456
Row_W3 9 12 4 567
Row_X1 Row_Y0 Row_W0 7 890
Row_W1 8 901
Row_W2 9 12
The result I expect:
KEY1 KEY2 KEY3_X Column_A0 Column_A1 Column_A2 KEY3_Y Column_B0 Column_B1 Column_B2 Column_B3
Row_X0 Row_Y0 Row_Z0 123 123 Row_W0 1 234
Row_W1 2 345
Row_W2 3 456
Row_Y1 Row_Z0 456 Row_W0 4 567 1 234
Row_Z1 789 789 Row_W1 7 890 2 345
Row_W2 8 901 3 456
Row_W3 9 12 4 567
Row_W0 7 890
Row_X1 Row_Y0 Row_Z0 123 Row_W1 8 901
Row_Z1 789 Row_W2 9 12
Row_Z2 456 789
Is there anything I can do to make this happen? Thank you.
Use concat() by row or column. The pd.concat function allows you to combine tables along either the row or the column axis.
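For example, a minimal sketch assuming df1 and df2 are the reset-index frames from the question; the groupby().cumcount() counter is something I'm adding so that rows with the same (KEY1, KEY2) pair line up one-to-one instead of forming a cross product:
import pandas as pd

# Rename KEY3 in each frame to match the KEY3_X / KEY3_Y columns expected above.
left = df1.rename(columns={'KEY3': 'KEY3_X'})
right = df2.rename(columns={'KEY3': 'KEY3_Y'})

# Index each frame by the join keys plus a per-group row counter,
# so equal keys align row by row rather than many-to-many.
left = left.set_index(['KEY1', 'KEY2', left.groupby(['KEY1', 'KEY2']).cumcount()])
right = right.set_index(['KEY1', 'KEY2', right.groupby(['KEY1', 'KEY2']).cumcount()])

# axis=1 places the tables side by side; rows present in only one
# table get NaN in the other table's columns.
out = pd.concat([left, right], axis=1).reset_index(level=2, drop=True).reset_index()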
Let's say I have numbers in column A, and each of them calls several people in column B.
A B
123 987
123 987
123 124
435 567
435 789
653 876
653 876
999 654
999 654
999 654
999 123
I want to find whom each number in A has called the most, and the number of times.
OUTPUT:
A B Count
123 987 2
435 567 or 789 1
653 876 2
999 654 3
One way to think of it is:
A B
123 987 2
124 1
435 567 1
789 1
653 876 2
999 654 3
123 1
Can somebody help me out on how to do this?
Try this
# count the unique values in rows
df.value_counts(['A','B']).sort_index()
A B
123 124 1
987 2
435 567 1
789 1
653 876 2
999 123 1
654 3
dtype: int64
To get the highest values for each unique A:
v = df.value_counts(['A','B'])
# remove duplicated rows
v[~v.reset_index(level=0).duplicated('A').values]
A B
999 654 3
123 987 2
653 876 2
435 567 1
dtype: int64
Use SeriesGroupBy.value_counts, which sorts values by default, then take the first row per A with GroupBy.head:
df = df.groupby('A')['B'].value_counts().groupby(level=0).head(1).reset_index(name='Count')
print (df)
A B Count
0 123 987 2
1 435 567 1
2 653 876 2
3 999 654 3
Another idea:
df = df.value_counts(['A','B']).reset_index(name='Count').drop_duplicates('A')
print (df)
A B Count
0 999 654 3
1 123 987 2
2 653 876 2
4 435 567 1
I would like one column to contain all the other columns of the data frame combined.
Here is what the dataframe looks like:
0 1 2
0 123 321 231
1 232 321 231
2 432 432 432
The dataframe is named task_ba.
I would like it to look like this
0
0 123
1 232
2 432
3 321
4 321
5 432
6 231
7 231
8 432
Easiest and fastest option: use the underlying NumPy array:
df2 = pd.DataFrame(df.values.ravel(order='F'))
NB. If you prefer a series, use pd.Series instead
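For instance, the same reshape producing a Series instead of a DataFrame:
s = pd.Series(df.values.ravel(order='F'))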
Output:
0
0 123
1 232
2 432
3 321
4 321
5 432
6 231
7 231
8 432
You can use pd.DataFrame.melt() and then drop the variable column:
>>> df
0 1 2
0 123 321 231
1 232 321 231
2 432 432 432
>>> df.melt().drop("variable", axis=1) # Drops the 'variable' column
value
0 123
1 232
2 432
3 321
4 321
5 432
6 231
7 231
8 432
Or if you want 0 as your column name:
>>> df.melt(value_name=0).drop("variable", axis=1)
0
0 123
1 232
2 432
3 321
4 321
5 432
6 231
7 231
8 432
You can learn all this (and more!) in the official documentation.
Using Python 3
I have a dataframe sort of like this:
productCode productType storeCode salesAmount moreInfo
111 1 111 111 info
111 1 112 112 info
456 4 456 456 info
and so on for thousands of rows
I want to select (and get a list of the codes for) the X best-selling unique products for each store.
How would I accomplish that?
Data:
df = pd.DataFrame({'productCode': [111,111,456,123,125],
'productType' : [1,1,4,3,3],
'storeCode' : [111,112,112,456,456],
'salesAmount' : [111,112,34,456,1235]})
productCode productType storeCode salesAmount
0 111 1 111 111
1 111 1 112 112
2 456 4 112 34
3 123 3 456 456
4 125 3 456 1235
It sounds like you want the best-selling product at each storeCode? In which case:
df.sort_values('salesAmount', ascending=False).groupby('storeCode').head(1)
productCode productType storeCode salesAmount
4 125 3 456 1235
1 111 1 112 112
0 111 1 111 111
Instead, if you want the best-selling product of each productType at each storeCode, then:
df.sort_values('salesAmount', ascending=False).groupby(['storeCode', 'productType']).head(1)
productCode productType storeCode salesAmount
4 125 3 456 1235
1 111 1 112 112
0 111 1 111 111
2 456 4 112 34
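If you also need the top X codes per store as a list, here is a sketch along the same lines (X = 2 is just an illustrative cutoff):
X = 2  # illustrative cutoff
top = df.sort_values('salesAmount', ascending=False).groupby('storeCode').head(X)
# collect the winning product codes per store as lists
codes_per_store = top.groupby('storeCode')['productCode'].apply(list)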
I have a sorted data frame, shown below (Input DataFrame), and I need to iterate over the rows and select rows into an output data frame based on the conditions below.
• Condition 1: For a given R1, R2, W, if we have two records, with TYPE 'A' and 'B':
a) If (amount1 & amount2) of TYPE 'A' is > (amount1 & amount2) of TYPE 'B', we need to bring the TYPE 'A' record into the output.
b) If (amount1 & amount2) of TYPE 'B' is > (amount1 & amount2) of TYPE 'A', we need to bring the TYPE 'B' record into the output.
c) If (amount1 & amount2) of TYPE 'A' is = (amount1 & amount2) of TYPE 'B', we need to bring the TYPE 'A' record into the output.
• Condition 2: For a given R1, R2, W, if we have only a record with TYPE 'A', we need to bring the TYPE 'A' record into the output.
• Condition 3: For a given R1, R2, W, if we have only a record with TYPE 'B', we need to bring the TYPE 'B' record into the output.
(A sketch implementing these conditions appears after the expected output below.)
Input Dataframe
R1 R2 W TYPE amount1 amount2
0 123 12 1 A 111 222
1 123 12 1 B 111 222
2 123 12 2 A 222 222
3 123 12 2 B 333 333
4 123 12 3 A 444 444
5 123 12 3 B 333 333
6 123 34 1 A 111 222
7 123 34 2 A 333 444
8 123 34 2 B 333 444
9 123 34 3 B 444 555
10 123 34 4 A 555 666
11 123 34 4 B 666 777
Output dataframe
R1 R2 W TYPE amount1 amount2
0 123 12 1 A 111 222
3 123 12 2 B 333 333
4 123 12 3 A 444 444
6 123 34 1 A 111 222
7 123 34 2 A 333 444
9 123 34 3 B 444 555
11 123 34 4 B 666 777
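For reference, a direct sketch of the three conditions (assuming "is >" means both amounts strictly greater, which reproduces the expected output above):
def pick_row(g):
    # g holds the rows of one (R1, R2, W) group
    a = g[g['TYPE'] == 'A']
    b = g[g['TYPE'] == 'B']
    if len(a) and len(b):
        a1, b1 = a.iloc[0], b.iloc[0]
        # Condition 1b: keep B only when both of its amounts are strictly greater
        if b1['amount1'] > a1['amount1'] and b1['amount2'] > a1['amount2']:
            return b
        # Conditions 1a and 1c: otherwise keep A
        return a
    # Conditions 2 and 3: only one TYPE present, keep it
    return g

out = df.groupby(['R1', 'R2', 'W'], group_keys=False).apply(pick_row)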
Selection based on your criteria:
def my_selection(idf):
    # If both 'A' and 'B' appear in 'TYPE', keep only the 'A' rows
    # (NB: this does not compare the amounts, so condition 1b is not handled)
    if idf['TYPE'].unique().shape[0] == 2:
        return idf[idf['TYPE'] == 'A']
    else:
        return idf

df2 = df.groupby(['R1', 'R2', 'W'], as_index=False).apply(my_selection)
df2.index = df2.index.droplevel(-1)
# R1 R2 W TYPE amount1 amount2
# 0 123 12 1 A 111 222
# 1 123 12 2 A 222 222
# 2 123 12 3 A 444 444
# 3 123 34 1 A 111 222
# 4 123 34 2 A 333 444
# 5 123 34 3 B 444 555
# 6 123 34 4 A 555 666
All you have to do is group by R1, R2, W and operate on the TYPE column as follows:
data.groupby(['R1','R2','W']).apply(lambda x: 'A' if 'A' in x['TYPE'].values else 'B').reset_index()
You can merge this output with the original DataFrame on the obtained columns to get the corresponding amount1 and amount2 values, as sketched below.
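A sketch of that merge-back step, naming the value column TYPE via reset_index(name='TYPE') so it joins directly:
chosen = (data.groupby(['R1','R2','W'])
              .apply(lambda x: 'A' if 'A' in x['TYPE'].values else 'B')
              .reset_index(name='TYPE'))
# join back to recover amount1/amount2 for the chosen TYPE in each group
out = data.merge(chosen, on=['R1', 'R2', 'W', 'TYPE'])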
This is what I would do:
categories = ['B','A']  # a list of categories in ascending order of precedence
d = {i: e for e, i in enumerate(categories)}  # create a dictionary: {'B': 0, 'A': 1}
s = df['TYPE'].map(d)  # map to df['TYPE'] and create a helper series
Then assign this series to the dataframe, group by and transform with max, check whether the result equals the helper series, and keep the rows where the values match:
out = df[s.eq(df.assign(TYPE=s).groupby(['R1','R2','W'])['TYPE'].transform('max'))]
print(out)
R1 R2 W TYPE amount1 amount2
0 123 12 1 A 111 222
2 123 12 2 A 222 222
4 123 12 3 A 444 444
6 123 34 1 A 111 222
7 123 34 2 A 333 444
9 123 34 3 B 444 555
10 123 34 4 A 555 666
I have a df that contains multiple weekly snapshots of JIRA tickets, and I want to calculate the YTD counts of tickets.
The df looks like this:
pointInTime ticketId
2008-01-01 111
2008-01-01 222
2008-01-01 333
2008-01-07 444
2008-01-07 555
2008-01-07 666
2008-01-14 777
2008-01-14 888
2008-01-14 999
So with df.groupby(['pointInTime'])['ticketId'].count() I can get the count of IDs in every snapshot. But what I want to achieve is the cumulative sum,
and to have a df that looks like this:
pointInTime ticketId cumCount
2008-01-01 111 3
2008-01-01 222 3
2008-01-01 333 3
2008-01-07 444 6
2008-01-07 555 6
2008-01-07 666 6
2008-01-14 777 9
2008-01-14 888 9
2008-01-14 999 9
So for 2008-01-07, the number of tickets would be the count for 2008-01-07 plus the count for 2008-01-01.
Use GroupBy.count and cumsum, then map the result back to "pointInTime":
df['cumCount'] = (
df['pointInTime'].map(df.groupby('pointInTime')['ticketId'].count().cumsum()))
df
pointInTime ticketId cumCount
0 2008-01-01 111 3
1 2008-01-01 222 3
2 2008-01-01 333 3
3 2008-01-07 444 6
4 2008-01-07 555 6
5 2008-01-07 666 6
6 2008-01-14 777 9
7 2008-01-14 888 9
8 2008-01-14 999 9
I am using value_counts:
df.pointInTime.map(df.pointInTime.value_counts().sort_index().cumsum())
Out[207]:
0 3
1 3
2 3
3 6
4 6
5 6
6 9
7 9
8 9
Name: pointInTime, dtype: int64
Or
pd.Series(np.arange(len(df)) + 1, index=df.index).groupby(df['pointInTime']).transform('last')
Out[216]:
0 3
1 3
2 3
3 6
4 6
5 6
6 9
7 9
8 9
dtype: int32
Here's an approach: transform with the group size and multiply by the result of pd.factorize on pointInTime (note this works here because every snapshot has the same number of rows):
df['cumCount'] = (df.groupby('pointInTime').ticketId
.transform('size')
.mul(pd.factorize(df.pointInTime)[0]+1))
pointInTime ticketId cumCount
0 2008-01-01 111 3
1 2008-01-01 222 3
2 2008-01-01 333 3
3 2008-01-07 444 6
4 2008-01-07 555 6
5 2008-01-07 666 6
6 2008-01-14 777 9
7 2008-01-14 888 9
8 2008-01-14 999 9