Pandas .loc or .iloc to select columns from a dataset - python

I have been trying to select a particular set of columns from a dataset, keeping all of the rows. I tried something like the following:
train_features = train_df.loc[,[0,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]]
All rows should be included, but I only need the numbered columns.
Is there a better way to approach this?
sample data:
age job marital education default housing loan equities contact duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
56 housemaid married basic.4y 1 1 1 1 0 261 1 999 0 2 1.1 93.994 -36.4 3.299552287 5191 1
37 services married high.school 1 0 1 1 0 226 1 999 0 2 1.1 93.994 -36.4 0.743751247 5191 1
56 services married high.school 1 1 0 1 0 307 1 999 0 2 1.1 93.994 -36.4 1.28265179 5191 1
I'm trying to exclude the job, marital, education and y columns from my dataset. The y column is the target variable.

If you need to select by positions, use iloc:
train_features = train_df.iloc[:, [0,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]]
print (train_features)
age default housing loan equities contact duration campaign pdays \
0 56 1 1 1 1 0 261 1 999
1 37 1 0 1 1 0 226 1 999
2 56 1 1 0 1 0 307 1 999
previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m \
0 0 2 1.1 93.994 -36.4 3.299552
1 0 2 1.1 93.994 -36.4 0.743751
2 0 2 1.1 93.994 -36.4 1.282652
nr.employed
0 5191
1 5191
2 5191
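If the positions are mostly contiguous, np.r_ can build that list more compactly. A small sketch (assuming numpy is imported as np) selecting the same positions:
import numpy as np
# np.r_[0, 4:19] expands to [0, 4, 5, ..., 18], the same positions as the explicit list above
train_features = train_df.iloc[:, np.r_[0, 4:19]]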
Another solution is to drop the unnecessary columns:
cols= ['job','marital','education','y']
train_features = train_df.drop(cols, axis=1)
print (train_features)
age default housing loan equities contact duration campaign pdays \
0 56 1 1 1 1 0 261 1 999
1 37 1 0 1 1 0 226 1 999
2 56 1 1 0 1 0 307 1 999
previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m \
0 0 2 1.1 93.994 -36.4 3.299552
1 0 2 1.1 93.994 -36.4 0.743751
2 0 2 1.1 93.994 -36.4 1.282652
nr.employed
0 5191
1 5191
2 5191
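A small variation on the same drop (a sketch, not a different result): newer pandas versions also accept a columns= keyword, which reads a bit more clearly than axis=1, and errors='ignore' skips any names that are not present instead of raising a KeyError.
cols = ['job', 'marital', 'education', 'y']
train_features = train_df.drop(columns=cols, errors='ignore')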

You can access the column values via the underlying numpy array.
Consider the dataframe df
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(5, 20)))
df
You can slice the underlying array
slc = [0,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]
df.values[:, slc]
array([[1, 3, 9, 8, 3, 2, 1, 6, 6, 0, 3, 9, 8, 5, 9, 9],
[8, 0, 2, 3, 7, 8, 9, 2, 7, 2, 1, 3, 2, 5, 4, 9],
[1, 1, 9, 3, 5, 8, 8, 8, 8, 4, 8, 0, 5, 4, 9, 0],
[6, 3, 1, 8, 0, 3, 7, 9, 9, 0, 9, 7, 6, 1, 4, 8],
[3, 2, 3, 3, 9, 8, 3, 8, 3, 4, 1, 6, 4, 1, 6, 4]])
Or you can reconstruct a new dataframe from this slice
pd.DataFrame(df.values[:, slc], df.index, df.columns[slc])
This is not as clean and intuitive as
df.iloc[:, slc]
You could also use slc to slice the df.columns object and pass that to df.loc
df.loc[:, df.columns[slc]]
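One more consideration, shown with a small hypothetical mixed-dtype frame: on data like the asker's, which mixes numeric and string columns, .values coerces everything to a single object array, while iloc/loc keep the per-column dtypes.
import pandas as pd
mixed = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y'], 'c': [0.5, 1.5]})
print(mixed.values.dtype)            # object
print(mixed.iloc[:, [0, 2]].dtypes)  # a: int64, c: float64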

Related

Retain pandas multiindex after function across level

I'm looking to find a minimum value across level 1 of a multiindex, time in this example. But I'd like to retain all other labels of the index.
import numpy as np
import pandas as pd
stack = [
[0, 1, 1, 5],
[0, 1, 2, 6],
[0, 1, 3, 2],
[0, 2, 3, 4],
[0, 2, 2, 5],
[0, 3, 2, 1],
[1, 1, 0, 5],
[1, 1, 2, 6],
[1, 1, 3, 7],
[1, 2, 2, 8],
[1, 2, 3, 9],
[2, 1, 7, 1],
[2, 1, 8, 3],
[2, 2, 3, 4],
[2, 2, 8, 1],
]
df = pd.DataFrame(stack)
df.columns = ['self', 'time', 'other', 'value']
df.set_index(['self', 'time', 'other'], inplace=True)
df.groupby(level=1).min() doesn't return the correct values:
value
time
1 1
2 1
3 1
Doing something like df.groupby(level=[0,1,2]).min() returns the original dataframe unchanged.
I swear I used to be able to do this by calling .min(level=1), but that now gives me deprecation notices telling me to use the groupby form above, and the result seems different from what I remember. Am I missing something?
original:
value
self time other
0 1 1 5
2 6
3 2 #<-- min row
2 3 4 #<-- min row
2 5
3 2 1 #<-- min row
1 1 0 5 #<-- min row
2 6
3 7
2 2 8 #<-- min row
3 9
2 1 7 1 #<-- min row
8 3
2 3 4
8 1 #<-- min row
desired result:
value
self time other
0 1 3 2
2 3 4
3 2 1
1 1 0 5
2 2 8
2 1 7 1
2 8 1
Group by your first two levels, then take idxmin instead of min to get the full index of each minimum. Finally, use loc to filter your original dataframe:
out = df.loc[df.groupby(level=['self', 'time'])['value'].idxmin()]
print(out)
# Output
value
self time other
0 1 3 2
2 3 4
3 2 1
1 1 0 5
2 2 8
2 1 7 1
2 8 1
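To see why this works, it can help to look at what idxmin returns here (a sketch on the same df): each group's minimum is reported as its full (self, time, other) index label, which loc can consume directly.
idx = df.groupby(level=['self', 'time'])['value'].idxmin()
print(idx)
# self  time
# 0     1       (0, 1, 3)
#       2       (0, 2, 3)
#       3       (0, 3, 2)
# 1     1       (1, 1, 0)
#       2       (1, 2, 2)
# 2     1       (2, 1, 7)
#       2       (2, 2, 8)
# Name: value, dtype: object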
Why not just groupby the first two indexes, rather than all three?
out = df.groupby(level=[0,1]).min()
Output:
>>> out
value
self time
0 1 2
2 4
3 1
1 1 5
2 8
2 1 1
2 1
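The trade-off between the two answers, in a short sketch on the same df: the plain min drops the other level (which the question wanted to keep), while the idxmin/loc route preserves the full index.
via_min = df.groupby(level=[0, 1]).min()
via_idxmin = df.loc[df.groupby(level=['self', 'time'])['value'].idxmin()]
print(list(via_min.index.names))      # ['self', 'time']
print(list(via_idxmin.index.names))   # ['self', 'time', 'other']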

How to remove rows from DF as result of a groupby query?

I have this Pandas dataframe:
df = pd.DataFrame({'site': ['a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a'], 'day': [1, 1, 1, 1, 1, 1, 2, 2, 2],
'hour': [1, 2, 3, 1, 2, 3, 1, 2, 3], 'clicks': [100, 200, 50, 0, 0, 0, 10, 0, 20]})
# site day hour clicks
# 0 a 1 1 100
# 1 a 1 2 200
# 2 a 1 3 50
# 3 b 1 1 0
# 4 b 1 2 0
# 5 b 1 3 0
# 6 a 2 1 10
# 7 a 2 2 0
# 8 a 2 3 20
And I want to remove all rows for a site/day where there were 0 clicks in total. So in the example above, I would want to remove the rows with site='b' and day=1.
I can basically group them and show where the sum is 0 for a day/site:
print(df.groupby(['site', 'day'])['clicks'].sum() == 0)
But what would be a straightforward way to remove the rows from the original dataframe where that condition applies?
The solution I have so far is to iterate over the groups, save all site/day tuples in a list, and then separately remove all rows with those site/day combinations. That works, but I am sure there must be a more functional and elegant way to achieve this result.
Option 1
Using groupby, transform and boolean indexing:
df[df.groupby(['site', 'day'])['clicks'].transform('sum') != 0]
Output:
site day hour clicks
0 a 1 1 100
1 a 1 2 200
2 a 1 3 50
6 a 2 1 10
7 a 2 2 0
8 a 2 3 20
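The key step is the transform, which broadcasts each group's sum back onto the original row index, so it can be used directly as a boolean mask. A short sketch of the intermediate on the same df:
sums = df.groupby(['site', 'day'])['clicks'].transform('sum')
print(sums.tolist())
# [350, 350, 350, 0, 0, 0, 30, 30, 30]
df[sums != 0]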
Option 2
Using groupby and filter:
df.groupby(['site', 'day']).filter(lambda x: x['clicks'].sum() != 0)
Output:
site day hour clicks
0 a 1 1 100
1 a 1 2 200
2 a 1 3 50
6 a 2 1 10
7 a 2 2 0
8 a 2 3 20
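Option 3
If you prefer to start from the grouped sum the question already computes, here is a sketch of that route: collect the all-zero (site, day) pairs and keep the rows whose pair is not among them.
zero = df.groupby(['site', 'day'])['clicks'].sum() == 0
bad_pairs = zero[zero].index   # MultiIndex of (site, day) pairs whose clicks sum to 0
df[~df.set_index(['site', 'day']).index.isin(bad_pairs)]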

Pandas groupby cumcount starting on row with a certain column value

I'd like to create two cumcount columns, depending on the values of two columns.
In the example below, I'd like one cumcount starting when colA is at least 100, and another cumcount starting when colB is at least 10.
columns = ['ID', 'colA', 'colB', 'cumcountA', 'cumountB']
data = [['A', 3, 1, '',''],
['A', 20, 4, '',''],
['A', 102, 8, 1, ''],
['A', 117, 10, 2, 1],
['B', 75, 0, '',''],
['B', 170, 12, 1, 1],
['B', 200, 13, 2, 2],
['B', 300, 20, 3, 3],
]
pd.DataFrame(columns=columns, data=data)
ID colA colB cumcountA cumountB
0 A 3 1
1 A 20 4
2 A 102 8 1
3 A 117 10 2 1
4 B 75 0
5 B 170 12 1 1
6 B 200 13 2 2
7 B 300 20 3 3
How would I calculate cumcountA and cumcountB?
You can try clipping the columns at your threshold values (here 100 and 10) with df.clip(lower=...), comparing the originals against the clipped values, and then grouping by ID and taking the cumsum:
col_list = ['colA','colB']
val_list = [100,10]
df[['cumcountA','cumountB']] = (df[col_list].ge(df[col_list].clip(lower=val_list,axis=1))
.groupby(df['ID']).cumsum().replace(0,''))
print(df)
Or, maybe even better, compare directly against the threshold values:
df[['cumcountA','cumountB']] = (df[['colA','colB']].ge([100,10])
.groupby(df['ID']).cumsum().replace(0,''))
print(df)
ID colA colB cumcountA cumountB
0 A 3 1
1 A 20 4
2 A 102 8 1
3 A 117 10 2 1
4 B 75 0
5 B 170 12 1 1
6 B 200 13 2 2
7 B 300 20 3 3
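To make the logic easier to follow, here is the intermediate boolean frame before the cumsum (a sketch on the same df): ge marks each row at or above its column's threshold, and the per-ID cumsum then counts those marks.
above = df[['colA', 'colB']].ge([100, 10])
print(above)
#     colA   colB
# 0  False  False
# 1  False  False
# 2   True  False
# 3   True   True
# 4  False  False
# 5   True   True
# 6   True   True
# 7   True   True
above.groupby(df['ID']).cumsum()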

Is there a way to know how many items are sold until the end of the day after being considered a hit?

Let's imagine that we have this dataset.
import pandas as pd
import numpy as np
# create list
data = [['10/1/2019 08:12:09', np.nan, 0, 54], ['10/1/2019 09:12:09', '10/1/2019 08:52:09', 1, 54], ['10/1/2019 10:30:19','10/1/2019 10:10:09', 1, 3],
['10/1/2019 13:07:19', '10/1/2019 12:52:09', 1, 12], ['10/1/2019 13:25:09', np.nan, 0, 3],
['10/1/2019 17:52:09', np.nan, 0, 54], ['10/1/2019 18:21:09', np.nan, 0, 12],
['10/2/2019 10:52:09', np.nan, 0, 54], ['10/2/2019 12:59:19','10/2/2019 12:57:09', 1, 12],
['10/2/2019 13:52:19', '10/2/2019 13:39:09', 1, 54], ['10/2/2019 19:52:09', np.nan, 0, 12],
['10/2/2019 20:52:09', np.nan, 0, 54], ['10/2/2019 20:57:09', np.nan, 0, 12]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['first_timestamp', 'second_timestamp', 'hit', 'item'])
# print the dataframe
df
first_timestamp second_timestamp hit item
0 10/1/2019 08:12:09 NaN 0 54
1 10/1/2019 09:12:09 10/1/2019 08:52:09 1 54
2 10/1/2019 10:30:19 10/1/2019 10:10:09 1 3
3 10/1/2019 13:07:19 10/1/2019 12:52:09 1 12
4 10/1/2019 13:25:09 NaN 0 3
5 10/1/2019 17:52:09 NaN 0 54
6 10/1/2019 18:21:09 NaN 0 12
7 10/2/2019 10:52:09 NaN 0 54
8 10/2/2019 12:59:19 10/2/2019 12:57:09 1 12
9 10/2/2019 13:52:19 10/2/2019 13:39:09 1 54
10 10/2/2019 19:52:09 NaN 0 12
11 10/2/2019 20:52:09 NaN 0 54
12 10/2/2019 20:57:09 NaN 0 12
When the second_timestamp column is missing, the hit column has a value of 0; when both timestamp columns have a value, hit is 1. My goal is to know how many of each of my items (3, 12, and 54) were sold by the end of the respective day after (only after, not before) a hit equal to 1 occurred.
day          item  items_sold
             3     1
10/1/2019    12    1
             54    1
             3     0
10/2/2019    12    2
             54    1
IIUC, for your data, we can do this:
# it's good practice to have time as datetime type
# skip if already is
df.first_timestamp = pd.to_datetime(df.first_timestamp)
df.second_timestamp = pd.to_datetime(df.second_timestamp)
# dates
df['date'] = df.first_timestamp.dt.normalize()
# s counts the number of hit so far
(df.assign(s=df.groupby(['date', 'item'])['hit'].cumsum())
.query('hit != 1 & s>0') # after the first hit
.groupby('date') # groupby date
['item'].value_counts() # counts
)
Output:
date                item
2019-10-01 3 1
12 1
54 1
2019-10-02 12 2
54 1
Name: item, dtype: int64
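The desired output in the question also lists items with zero post-hit sales (item 3 on 10/2), which value_counts drops. If those are needed, a sketch (reusing the df and date column from above, with a hypothetical full_index name) that reindexes onto every date/item combination:
counts = (df.assign(s=df.groupby(['date', 'item'])['hit'].cumsum())
            .query('hit != 1 & s > 0')
            .groupby('date')['item'].value_counts())
# every (date, item) pair, filled with 0 where nothing was sold after a hit
full_index = pd.MultiIndex.from_product(
    [df['date'].unique(), df['item'].unique()], names=['date', 'item'])
counts.reindex(full_index, fill_value=0).sort_index()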

Add multiple dataframes based on one column

I have several hundred dataframes with same column names, like this:
df1
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28269 0.07365 22.16080 4050.311360
1 4208.98 5 0.48122 0.08765 44.90035 4208.972962
2 4374.94 9 0.71483 0.11429 86.96497 4374.927110
3 4379.74 9 0.31404 0.09107 30.44271 4379.760601
4 4398.01 14 0.50415 0.09845 52.83236 4398.007473
5 5520.50 1 0.06148 0.12556 8.21685 5520.484742
df2
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28616 0.07521 22.91064 4050.327388
1 4208.98 6 0.48781 0.08573 44.51609 4208.990029
2 4374.94 9 0.71548 0.11437 87.10152 4374.944513
3 4379.74 10 0.31338 0.09098 30.34791 4379.778009
4 4398.01 15 0.49950 0.08612 45.78707 4398.020367
5 4502.21 9 0.56362 0.10114 60.67868 4502.223123
6 4508.28 3 0.69554 0.11600 85.88428 4508.291777
7 4512.99 2 0.20486 0.08891 19.38745 4512.999332
8 5520.50 1 0.06148 0.12556 8.21685 5520.484742
That's how I'm reading them:
path_to_files = '/home/Desktop/computed_2d/'
lst = []
for filen in dir1:
df = pd.read_table(path_to_files+filen, skiprows=0, usecols=(0,1,2,3,4,8),names=['wave','num','stlines','fwhm','EWs','MeasredWave'],delimiter=r'\s+')
lst.append(df)
The desired result should look like this:
wave num stlines fwhm EWs MeasredWave
0 4050.32 3.0 0.284425 0.074430 22.535720 4050.319374
1 4208.98 5.5 0.484515 0.086690 44.708220 4208.981496
2 4374.94 9.0 0.715155 0.114330 87.033245 4374.935812
3 4379.74 9.5 0.313710 0.091025 30.395310 4379.769305
4 4398.01 14.5 0.501825 0.092285 49.309715 4398.013920
5 4502.21 9 0.56362 0.10114 60.67868 4502.223123
6 4508.28 3 0.69554 0.11600 85.88428 4508.291777
7 4512.99 2 0.20486 0.08891 19.38745 4512.999332
8 5520.50 1.0 0.061480 0.125560 8.216850 5520.484742
As you can see, the number of rows is not the same. Now I want to take the average of all the dataframes based on the wave column, and I want to make sure that each value of wave in df1 gets matched with the correct wave value in df2.
You can merge both dataframes on wave with an outer join, strip the _x/_y suffixes so that matching columns share a name again, and then take the average of each column:
df3 = pd.merge(df1, df2, on='wave', how='outer')
df4 = df3.rename(columns=lambda x: x.split('_')[0]).T
df4.groupby(df4.index).mean().T
Out:
EWs MeasredWave fwhm num stlines wave
0 22.535720 4050.319374 0.074430 3.0 0.284425 4050.32
1 44.708220 4208.981496 0.086690 5.5 0.484515 4208.98
2 87.033245 4374.935812 0.114330 9.0 0.715155 4374.94
3 30.395310 4379.769305 0.091025 9.5 0.313710 4379.74
4 49.309715 4398.013920 0.092285 14.5 0.501825 4398.01
5 8.216850 5520.484742 0.125560 1.0 0.061480 5520.50
6 60.678680 4502.223123 0.101140 9.0 0.563620 4502.21
7 85.884280 4508.291777 0.116000 3.0 0.695540 4508.28
8 19.387450 4512.999332 0.088910 2.0 0.204860 4512.99
Here is an example to do what you need:
import pandas as pd
df1 = pd.DataFrame({'A': [0, 1, 2, 3],
'B': [0, 1, 2, 3],
'C': [0, 1, 2, 3],
'D': [0, 1, 2, 3]},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': [4, 5, 6, 7],
'B': [4, 5, 6, 7],
'C': [4, 5, 6, 7],
'D': [4, 5, 6, 7]},
index=[0, 1, 2, 3])
df3 = pd.DataFrame({'A': [8, 9, 10, 11],
'B': [8, 9, 10, 11],
'C': [8, 9, 10, 11],
'D': [8, 9, 10, 11]},
index=[0, 1, 2, 3])
df4 = pd.concat([df1, df2, df3])
df5 = pd.concat([df1, df2, df3], ignore_index=True)
print(df4)
print('\n\n')
print(df5)
print(f"Average of column A = {df4['A'].mean()}")
You will have
A B C D
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
0 4 4 4 4
1 5 5 5 5
2 6 6 6 6
3 7 7 7 7
0 8 8 8 8
1 9 9 9 9
2 10 10 10 10
3 11 11 11 11
A B C D
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
10 10 10 10 10
11 11 11 11 11
Average of column A = 5.5
The answer from @Naga Kiran is great. I updated the whole solution here:
import pandas as pd
df1 = pd.DataFrame(
{'wave' : [4050.32, 4208.98, 4374.94, 4379.74, 4398.01, 5520.50],
'num' : [3, 5, 9, 9, 14, 1],
'stlines' : [0.28269, 0.48122, 0.71483, 0.31404, 0.50415, 0.06148],
'fwhm' : [0.07365, 0.08765, 0.11429, 0.09107, 0.09845, 0.12556],
'EWs' : [22.16080, 44.90035, 86.96497, 30.44271, 52.83236, 8.21685],
'MeasredWave' : [4050.311360, 4208.972962, 4374.927110, 4379.760601, 4398.007473, 5520.484742]},
index=[0, 1, 2, 3, 4, 5])
df2 = pd.DataFrame(
{'wave' : [4050.32, 4208.98, 4374.94, 4379.74, 4398.01, 4502.21, 4508.28, 4512.99, 5520.50],
'num' : [3, 6, 9, 10, 15, 9, 3, 2, 1],
'stlines' : [0.28616, 0.48781, 0.71548, 0.31338, 0.49950, 0.56362, 0.69554, 0.20486, 0.06148],
'fwhm' : [0.07521, 0.08573, 0.11437, 0.09098, 0.08612, 0.10114, 0.11600, 0.08891, 0.12556],
'EWs' : [22.91064, 44.51609, 87.10152, 30.34791, 45.78707, 60.67868, 85.88428, 19.38745, 8.21685],
'MeasredWave' : [4050.327388, 4208.990029, 4374.944513, 4379.778009, 4398.020367, 4502.223123, 4508.291777, 4512.999332, 5520.484742]},
index=[0, 1, 2, 3, 4, 5, 6, 7, 8])
df3 = pd.merge(df1, df2, on='wave', how='outer')
df4 = df3.rename(columns = lambda x: x.split('_')[0]).T
df5 = df4.groupby(df4.index).mean().T
df6 = df5[['wave', 'num', 'stlines', 'fwhm', 'EWs', 'MeasredWave']]
df7 = df6.sort_values('wave', ascending = True).reset_index(drop=True)
df7
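Since the question actually has several hundred frames collected in lst, merging them pairwise does not scale well. A sketch (assuming lst is the list built in the reading loop above) that averages all of them at once per wave value:
import pandas as pd
combined = pd.concat(lst, ignore_index=True)   # stack every frame vertically
result = (combined.groupby('wave', as_index=False)
                  .mean()
                  .sort_values('wave')
                  .reset_index(drop=True))
result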
