Select rows based on condition on rows and columns - python

Hello, I have a pandas DataFrame that I want to clean. Here is an example:
IDBILL  IDBUYER  BILL  DATE
001     768787   45    1897-07-24
002     768787   30    1897-07-24
005     786545   45    1897-08-19
008     657676   89    1989-09-23
009     657676   42    1989-09-23
010     657676   18    1989-09-23
012     657676   51    1990-03-10
016     892354   73    1990-03-10
018     892354   48    1765-02-14
I want to delete the highest bills and keep only the lowest, when the bills are made on the same day, by the same IDBUYER, and their IDBILL values follow each other.
To get this:
IDBILL  IDBUYER  BILL  DATE
002     768787   30    1897-07-24
005     786545   45    1897-08-19
010     657676   18    1989-09-23
012     657676   51    1990-03-10
016     892354   73    1990-03-10
018     892354   48    1765-02-14
Thank you in advance

First, convert the 'DATE' column to datetime dtype with the to_datetime() method:
df['DATE'] = pd.to_datetime(df['DATE'])
Try with groupby() method:
result = df.groupby(['IDBUYER', df['DATE'].dt.day], as_index=False)[['IDBILL', 'BILL', 'DATE']].min()
OR
result = df.groupby(['DATE', 'IDBUYER'], sort=False)[['IDBILL', 'BILL']].min().reset_index()
Output of result:
  IDBUYER  IDBILL  BILL        DATE
0  657676      12    51  1990-03-10
1  657676       8    18  1989-09-23
2  768787       1    30  1897-07-24
3  786545       5    45  1897-08-19
4  892354      16    73  1990-03-10
5  892354      18    48  1765-02-14
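Note that df['DATE'].dt.day extracts only the day of the month, so bills from the same day number in different months or years would fall into the same group. A safer variant (a sketch, using dt.normalize() so the grouping key is the full date) would be:
result = (df.groupby(['IDBUYER', df['DATE'].dt.normalize()])[['IDBILL', 'BILL']]
            .min()
            .reset_index())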

You could try this to keep only the minimum BILL within each run of consecutive IDBILL values per buyer and date:
# start a new run wherever IDBILL is not the previous IDBILL + 1
df['follow_up'] = df['IDBILL'].ne(df['IDBILL'].shift() + 1).cumsum()
# index of the minimum BILL within each (buyer, run, date) group
m = df.groupby(['IDBUYER', 'follow_up', df['DATE']])['BILL'].idxmin()
df.loc[sorted(m)]
# IDBILL IDBUYER BILL DATE follow_up
# 1 2 768787 30 1897-07-24 1
# 2 5 786545 45 1897-08-19 2
# 5 10 657676 18 1989-09-23 3
# 6 12 657676 51 1990-03-10 4
# 7 16 892354 73 1990-03-10 5
# 8 18 892354 48 1765-02-14 6
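For reference, the shift-plus-cumsum idiom on the first line is what labels runs of consecutive IDs; a minimal standalone sketch (IDBILL must be numeric for this, so zero-padded strings such as '001' would need an astype(int) first):
s = pd.Series([1, 2, 5, 8, 9, 10, 12])
runs = s.ne(s.shift() + 1).cumsum()
print(runs.tolist())
# [1, 1, 2, 3, 3, 3, 4]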

Related

Calculate mean from multiple columns

I have 12 columns filled with wages. I want to calculate the mean, but my output is 12 different means, one per column; I want a single mean calculated over the whole dataset.
This is how my df looks:
Month 1 Month 2 Month 3 Month 4 ... Month 9 Month 10 Month 11 Month 12
0 1429.97 2816.61 2123.29 2123.29 ... 2816.61 2816.61 1429.97 1776.63
1 3499.53 3326.20 3499.53 2112.89 ... 1939.56 2806.21 2632.88 2459.55
2 2599.95 3119.94 3813.26 3466.60 ... 3466.60 3466.60 2946.61 2946.61
3 2599.95 2946.61 3466.60 2773.28 ... 2253.29 3119.94 1906.63 2773.28
I used this code to calculate the mean:
mean = df.mean()
Do I have to convert these 12 columns into one column, or how else can I calculate a single mean?
Just call mean() again to take the mean of those 12 column means:
df.mean().mean()
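One caveat: df.mean().mean() averages the column means, which equals the overall mean only when every column has the same number of non-missing values. A quick sketch of where the two differ (the toy frame is just illustrative, with numpy imported as np):
df = pd.DataFrame({'a': [1.0, 3.0], 'b': [5.0, np.nan]})
print(df.mean().mean())   # 3.5, the mean of the column means 2.0 and 5.0
print(df.stack().mean())  # 3.0, the mean of all three observed values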
Use numpy.mean after converting the values to a 2D array:
mean = np.mean(df.to_numpy())
print(mean)
2914.254166666667
Or use DataFrame.melt:
mean = df.melt()['value'].mean()
print(mean)
2914.254166666666
You can also use stack:
df.stack().mean()
Suppose this dataframe:
>>> df
A B C D E F G H
0 60 1 59 25 8 27 34 43
1 81 48 32 30 60 3 90 22
2 66 15 21 5 23 36 83 46
3 56 42 14 86 41 64 89 56
4 28 53 89 89 52 13 12 39
5 64 7 2 16 91 46 74 35
6 81 81 27 67 26 80 19 35
7 56 8 17 39 63 6 34 26
8 56 25 26 39 37 14 41 27
9 41 56 68 38 57 23 36 8
>>> df.stack().mean()
41.6625

How to create MultiIndex DataFrame from other pandas DataFrames

I basically have two DataFrames from different dates and want to combine them into one.
Let's say this is the data from 25 Sep:
hour columnA columnB
0 12 24
1 45 87
2 10 58
3 12 13
4 12 20
and here is the data from 26 Sep:
hour columnA columnB
0 54 89
1 45 3
2 33 97
3 12 13
4 78 47
Now I want to join both DataFrames and get a MultiIndex DataFrame like this:
       hour  columnA  columnB
25sep  0     12       24
       1     45       87
       2     10       58
       3     12       13
       4     12       20
26sep  0     54       89
       1     45       3
       2     33       97
       3     12       13
       4     78       47
I read the docs about MultiIndex but am not sure how to apply it to my situation.
Use pandas.concat
https://pandas.pydata.org/docs/reference/api/pandas.concat.html
>>> df = pd.concat([df1.set_index('hour'), df2.set_index('hour')],
keys=["25sep", "26sep"])
>>> df
columnA columnB
hour
25sep 0 12 24
1 45 87
2 10 58
3 12 13
4 12 20
26sep 0 54 89
1 45 3
2 33 97
3 12 13
4 78 47
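If you also want the index levels labelled, pd.concat accepts a names argument; a small sketch (the level names 'day' and 'hour' are just illustrative):
df = pd.concat([df1.set_index('hour'), df2.set_index('hour')],
               keys=['25sep', '26sep'], names=['day', 'hour'])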
Let us try:
out = pd.concat({y: x.set_index('hour') for x, y in zip([df1, df2], ['25sep', '26sep'])})
columnA columnB
hour
25sep 0 12 24
1 45 87
2 10 58
3 12 13
4 12 20
26sep 0 54 89
1 45 3
2 33 97
3 12 13
4 78 47

how to change a multi-level header into a flat header dataframe

This is my data:
                    My Very First Column
Sr No  Col1                          Col2
       sub_col1  sub_col2  sub_col3  sub_col1  sub_col2  sub_col3
1      9         45        3         9         97        9
2      32        95        12        67        78        34
3      3         6         5         85        54        99
4      32        31        75        312       56        98
This is how I want it to be:
Sr No Col1-sub_col1 Col1-sub_col2 Col1-sub_col3 Col2-sub_col1 Col2-sub_col2 Col2-sub_col3
1 9 45 3 9 97 9
2 32 95 12 67 78 34
3 3 6 5 85 54 99
4 32 31 75 312 56 98
The problem is that the columns and sub-columns differ every time, so I can't hard-code any constant values.
Suppose you have a dataframe named df. Try using:
columns = []
for x in df.columns.to_list():
    columns.append('-'.join([x[0], x[1]]))
Finally:
df.columns = columns
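Assuming every column label is a tuple of strings, the same flattening can be written more compactly (a one-line sketch of the loop above):
df.columns = df.columns.map('-'.join)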

Sum row values of all columns where column names meet string match condition

I have the following dataset:
ID Length Width Range_CAP Capacity_CAP
0 1 33 25 16 50
1 2 34 22 11 66
2 3 22 12 15 42
3 4 46 45 66 54
4 5 16 6 23 75
5 6 21 42 433 50
I basically want to sum the row values of only those columns whose names match a string (in this case, all columns ending with _CAP), and store the result in a new column.
So that I end up with a dataframe that looks something like this:
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
I first tried the solution recommended in this question:
Summing columns in Dataframe that have matching column headers
However, that solution doesn't work for me: there the columns share the exact same name, so a simple groupby accomplishes the result, whereas I am trying to sum columns matched by a specific string pattern only.
Code to recreate above sample dataset:
data1 = [['1', 33, 25, 16, 50], ['2', 34, 22, 11, 66],
         ['3', 22, 12, 15, 42], ['4', 46, 45, 66, 54],
         ['5', 16, 6, 23, 75], ['6', 21, 42, 433, 50]]
df = pd.DataFrame(data1, columns=['ID', 'Length', 'Width', 'Range_CAP', 'Capacity_CAP'])
Let us do filter:
df['CAP_SUM'] = df.filter(like='CAP').sum(axis=1)
Out[86]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
If other column names contain CAP elsewhere, anchor the match at the end of the name:
df.filter(regex='_CAP$').sum(axis=1)
Out[92]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
One approach is:
df['CAP_SUM'] = df.loc[:, df.columns.str.endswith('_CAP')].sum(axis=1)
print(df)
Output
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
The expression:
df.columns.str.endswith('_CAP')
creates a boolean mask whose values are True if and only if the column name ends with _CAP. As an alternative, use filter with the following regex:
df['CAP_SUM'] = df.filter(regex='_CAP$').sum(axis=1)
print(df)
Output (of filter)
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
You may try this:
columnstxt = df.columns
df['sum'] = 0
for i in columnstxt:
    if i.find('_CAP') != -1:
        df['sum'] = df['sum'] + df[i]
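For completeness, the same accumulation can be done without an explicit loop; this sketch mirrors the loop's substring test (contains, rather than ends-with):
df['sum'] = df.loc[:, df.columns.str.contains('_CAP')].sum(axis=1)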

performing differences between rows in pandas based on column values

I have this dataframe and I'm trying to create a new column storing the difference in products sold, based on code and date.
For example, this is the starting dataframe:
date code sold
0 20150521 0 47
1 20150521 12 39
2 20150521 16 39
3 20150521 20 38
4 20150521 24 38
5 20150521 28 37
6 20150521 32 36
7 20150521 4 43
8 20150521 8 43
9 20150522 0 47
10 20150522 12 37
11 20150522 16 36
12 20150522 20 36
13 20150522 24 36
14 20150522 28 35
15 20150522 32 31
16 20150522 4 42
17 20150522 8 41
18 20150523 0 50
19 20150523 12 48
20 20150523 16 46
21 20150523 20 46
22 20150523 24 46
23 20150523 28 45
24 20150523 32 42
25 20150523 4 49
26 20150523 8 49
27 20150524 0 39
28 20150524 12 33
29 20150524 16 30
... ... ... ...
150 20150606 32 22
151 20150606 4 34
152 20150606 8 33
153 20150607 0 31
154 20150607 12 30
155 20150607 16 30
156 20150607 20 29
157 20150607 24 28
158 20150607 28 26
159 20150607 32 24
160 20150607 4 30
161 20150607 8 30
162 20150608 0 47
I think this could be a solution...
full_df1 = full_df[full_df.date == '20150609'].reset_index(drop=True)
full_df1['code'] = full_df1['code'].astype(float)
full_df1 = full_df1.sort_values('code', ascending=False)  # .sort() was removed in pandas 0.20
code date sold
8 32 20150609 33
7 28 20150609 36
6 24 20150609 37
5 20 20150609 39
4 16 20150609 42
3 12 20150609 46
2 8 20150609 49
1 4 20150609 49
0 0 20150609 50
full_df1.set_index('code')['sold'].diff().reset_index()
That gives me back this output for the single date 20150609:
code difference
0 32 NaN
1 28 3
2 24 1
3 20 2
4 16 3
5 12 4
6 8 3
7 4 0
8 0 1
Is there a better way to get the same result, in a more pythonic way?
I would like to create a new column [difference] and store the data there, ending up with four columns [date, code, sold, difference].
This is exactly the kind of thing that pandas' groupby functionality is built for, and I highly recommend reading and working through the pandas groupby documentation.
This code replicates what you are asking for, but for every date.
df = pd.DataFrame({'date': ['Mon', 'Mon', 'Mon', 'Tue', 'Tue', 'Tue'],
                   'code': [10, 21, 30, 10, 21, 30],
                   'sold': [12, 13, 34, 10, 15, 20]})
df['difference'] = df.groupby('date')['sold'].diff()
df
code date sold difference
0 10 Mon 12 NaN
1 21 Mon 13 1
2 30 Mon 34 21
3 10 Tue 10 NaN
4 21 Tue 15 5
5 30 Tue 20 5
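To reproduce the per-date, descending-code differences from the question, a sketch assuming the full frame is df with columns [date, code, sold]:
df = df.sort_values(['date', 'code'], ascending=[True, False])
df['difference'] = df.groupby('date')['sold'].diff()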
