I am looking for a way to write an if condition that selects the row above any row where ColumnX is equal to the number 13.
Here is the code I have
if df.attrib.get("Column_Name") in ['13']:
I know this means: if column "Column_Name" equals 13, then ...
but I want it to be: if "Column_Name" one row below equals 13, then ...
You can use a simple pandas condition:
import pandas as pd
# I created my own dataframe for testing
df = pd.DataFrame({'numbers':[1,2,13,4,5,13,6]})
# use simple condition to get the index of the element then access the element by index
df.iloc[df[df["numbers"]==13].index-1]
Output:

   numbers
1        2
4        5
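One caveat worth adding (my note, not part of the original answer): if 13 can ever appear in the very first row, index - 1 yields -1, which .iloc interprets as the last row. A small guard:

# keep only matches that actually have a row above them
idx = df[df["numbers"] == 13].index
df.iloc[idx[idx > 0] - 1]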
You can find the rows using the pandas .shift function and .loc:
>>> import pandas
>>> df = pandas.DataFrame({'condition':[1,3,4,3,2,13,3]})
>>> df
condition
0 1
1 3
2 4
3 3
4 2
5 13
6 3
>>> df["condition_rolled"] = df['condition'].shift(-1)
>>> df
condition condition_rolled
0 1 3.0
1 3 4.0
2 4 3.0
3 3 2.0
4 2 13.0
5 13 3.0
6 3 NaN
>>> df.loc[(df["condition_rolled"] == 13.0)]
condition condition_rolled
4 2 13.0
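Applied to the original question (a sketch, assuming the column is literally named Column_Name), the helper column can be skipped entirely, since shift(-1) pulls the next row's value up:

# rows whose next row has Column_Name == 13
rows_above = df.loc[df["Column_Name"].shift(-1) == 13]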
I have a dataframe with a series of numbers. For example:
Index Column 1
1 10
2 12
3 24
4 NaN
5 20
6 15
7 NaN
8 NaN
9 2
I can't use bfill or ffill, as the rule is dynamic: take the last valid value and divide it by the number of consecutive NaNs plus 1. For example, rows 3 and 4 should both become 12 (24/2), and rows 6, 7 and 8 should all become 5 (15/3). All other numbers should remain unchanged.
How should I do that?
Note: Edited the dataframe to be more general by inserting a new row between rows 4 and 5 and another row at the end.
You can do:
m = (df["Column 1"].notna()) & (
(df["Column 1"].shift(-1).isna()) | (df["Column 1"].shift().isna())
)
out = df.groupby(m.cumsum()).transform(
lambda x: x.fillna(0).mean() if x.isna().any() else x
)
print(out)
Index Column 1
0 1 10.0
1 2 12.0
2 3 12.0
3 4 12.0
4 5 20.0
5 6 5.0
6 7 5.0
7 8 5.0
8 9 2.0
Explanation and intermediate values:
Basically, look for rows whose own value is not NaN but whose next or previous value is NaN. Those rows form the first row of each such group.
So m in the above code looks like:
0 True
1 False
2 True
3 False
4 True
5 True
6 False
7 False
8 True
Now I want to form groups of rows of the shape [True, <all Falses>], because those are the groups I want to average over. For that, use cumsum.
If you want to take a look at those groups, you can use ngroup() after groupby on m.cumsum():
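# label each group of consecutive rows with an integer id
df.groupby(m.cumsum()).ngroup()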
0 0
1 0
2 1
3 1
4 2
5 3
6 3
7 3
8 4
The above is only to show what the groups are.
Now, for each group, you can take the mean of the group if the group has any NaN value; x.isna().any() performs that check.
If the group has any NaN value, assign the mean after filling NaNs with 0; otherwise keep the group as is. This is accomplished by the lambda:
lambda x: x.fillna(0).mean() if x.isna().any() else x
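As a quick sanity check (my own addition, not from the original answer), filling NaNs with 0 and taking the mean is exactly the divide-by-count rule from the question:

import pandas as pd
import numpy as np

g = pd.Series([15, np.nan, np.nan])
print(g.fillna(0).mean())  # (15 + 0 + 0) / 3 = 5.0, i.e. 15 / 3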
Why not use interpolate? It has a method= argument that would probably fit your desire.
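For illustration (a sketch; note that linear interpolation fills gaps with evenly spaced values, which is not the same as the divide-by-count rule described in the question):

# assuming the column is named "Column 1" as in the question
df["Column 1"] = df["Column 1"].interpolate(method="linear")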
However, if you really want to do exactly as you described above, you can do something like this. (Note that iterating over rows in pandas is considered bad practice, but it does the job.)
import pandas as pd
import numpy as np
df = pd.DataFrame([10,
                   12,
                   24,
                   np.nan,
                   15,
                   np.nan,
                   np.nan])

for col in df:
    for idx in df.index:  # (iterating over rows is considered bad practice)
        local_idx = idx
        # walk forward over the run of NaNs that follows this row
        while local_idx + 1 < len(df) and np.isnan(df.at[local_idx + 1, col]):
            local_idx += 1
        if (local_idx - idx) > 0:
            # split this row's value evenly across itself and the NaNs
            fillvalue = df.loc[idx] / (local_idx - idx + 1)
            for fillidx in range(idx, local_idx + 1):
                df.loc[fillidx] = fillvalue
df
Output:
0
0 10.0
1 12.0
2 12.0
3 12.0
4 5.0
5 5.0
6 5.0
I have a dataset that looks like this:
ID date
1 01-01-2012
1 05-02-2012
1 25-06-2013
1 14-12-2013
1 10-04-2014
2 19-05-2012
2 07-08-2014
2 10-09-2014
2 27-11-2015
2 01-12-2015
3 15-04-2013
3 17-05-2015
3 22-05-2015
3 30-10-2016
3 02-11-2016
I am working with Python and I would like to select the last 3 dates for each ID. Here is the dataset I would like to have:
ID date
1 25-06-2013
1 14-12-2013
1 10-04-2014
2 10-09-2014
2 27-11-2015
2 01-12-2015
3 22-05-2015
3 30-10-2016
3 02-11-2016
I used this code to select the very last date for each ID:
df_2=df.sort_values(by=['date']).drop_duplicates(subset='ID',keep='last')
But how can I select more than one date (for example, the last 3 dates, or the last 4 dates, etc.)?
You might use groupby and tail in the following way to get the last 2 items from each group:
import pandas as pd
df = pd.DataFrame({'ID':[1,1,1,2,2,2,3,3,3],'value':['A','B','C','D','E','F','G','H','I']})
df2 = df.groupby('ID').tail(2)
print(df2)
Output:
ID value
1 1 B
2 1 C
4 2 E
5 2 F
7 3 H
8 3 I
Note that for simplicity's sake I used different (already sorted) data to build df.
You can try this:
df.sort_values(by=['date']).groupby('ID').tail(3).sort_values(['ID', 'date'])
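One caveat (my note, not in the original answer): if date is stored as a DD-MM-YYYY string, sort_values sorts lexicographically, so for example '01-12-2015' would sort before '07-08-2014'. Converting to datetime first avoids that:

df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y")
df.sort_values(by=["date"]).groupby("ID").tail(3).sort_values(["ID", "date"])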
I tried this, but with a non-datetime data type:
a = [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3]
b = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o']
import pandas as pd
import numpy as np
a = np.array([a,b])
df=pd.DataFrame(a.T,columns=['ID','Date'])
# the tail would give you the last n number of elements you are interested in
df_ = df.groupby('ID').tail(3)
df_
Output:
ID Date
2 1 c
3 1 d
4 1 e
7 2 h
8 2 i
9 2 j
12 3 m
13 3 n
14 3 o
The example below:
import pandas as pd
list1 = ['a','a','a','b','b','b','b','c','c','c']
list2 = range(len(list1))
df = pd.DataFrame(zip(list1, list2), columns= ['Item','Value'])
df
gives:
  Item  Value
0    a      0
1    a      1
2    a      2
3    b      3
4    b      4
5    b      5
6    b      6
7    c      7
8    c      8
9    c      9
Required: a GroupFirstValue column, as described below.
The idea is to use a lambda formula to get the 'first' value for each group; for example, "a"'s first value is 0, "b"'s first value is 3, and "c"'s first value is 7. That is why those numbers should appear in the GroupFirstValue column.
Note: I know that I can do this in 2 steps: one is the original df and the second is a grouped-by df, and then merge them together. The idea is to see if this can be done more efficiently in a single step. Many thanks in advance!
Use groupby and first:
df.groupby('Item')['Value'].first()
Or you can use transform and assign the result to a new column in your frame:
df['new_col'] = df.groupby('Item')['Value'].transform('first')
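With the question's example frame, the transform version should give (illustrative output):

  Item  Value  new_col
0    a      0        0
1    a      1        0
2    a      2        0
3    b      3        3
4    b      4        3
5    b      5        3
6    b      6        3
7    c      7        7
8    c      8        7
9    c      9        7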
Use mask and duplicated:
df['GroupFirstValue'] = df.Value.mask(df.Item.duplicated())
Output:
Item Value GroupFirstValue
0 a 0 0.0
1 a 1 NaN
2 a 2 NaN
3 b 3 3.0
4 b 4 NaN
5 b 5 NaN
6 b 6 NaN
7 c 7 7.0
8 c 8 NaN
9 c 9 NaN
Assume that we have the following pandas dataframe:
df = pd.DataFrame({'x':[0,0,1,0,0,0,0],'y':[1,1,1,1,1,1,0],'z':[0,1,1,1,0,0,1]})
x y z
0 0 1 0
1 0 1 1
2 1 1 1
3 0 1 1
4 0 1 0
5 0 1 0
6 0 0 1
The whole dataframe is filled with either 1 or 0. Looking at each column separately, if the current row value is different from the previous value, I need to count the number of previous consecutive values:
   x  y  z
0
1        1
2  2
3  1
4        3
5
6     6  2
I tried to write a lambda function and apply it to the entire dataframe, but I failed. Any idea?
Let's try this:
def f(col):
    x = (col != col.shift().bfill())
    s = x.cumsum()
    return s.groupby(s).transform('count').shift().where(x)
df.apply(f).fillna('')
Output:
   x  y  z
0
1        1
2  2
3  1
4        3
5
6     6  2
Details:
Use apply to run a custom function on each column of the dataframe.
Find the difference spots in the column, then use cumsum to create groups of consecutive values, then groupby and transform to attach each group's count to every record, and finally keep only the difference spots using where.
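To make those steps concrete, here are the intermediate values for column x (worked out from the code above):

col = df['x']                        # 0 0 1 0 0 0 0
x = col != col.shift().bfill()       # F F T T F F F  (the difference spots)
s = x.cumsum()                       # 0 0 1 2 2 2 2  (consecutive groups)
c = s.groupby(s).transform('count')  # 2 2 1 4 4 4 4  (size of each group)
c.shift().where(x)                   # NaN NaN 2.0 1.0 NaN NaN NaN

The shift moves each group's size down one row, so the row where a change happens receives the length of the run that just ended.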
You can try the following, where you first identify the "runs" and then get the run lengths. An entry is written only where the value switches, so the entries are the lengths of the runs except the last one.
import pandas as pd
import numpy as np
def func(x, missing=np.nan):
    # label each run of equal values with an increasing integer
    runs = np.cumsum(np.append(0, np.diff(x) != 0))
    # positions where the value switches
    switches = np.where(np.diff(x != 0))[0] + 1
    out = np.repeat(missing, len(x))
    # length of each run except the last one
    out[switches] = np.bincount(runs)[:-1]
    # thanks to Scott, see comments below:
    ##out[switches] = pd.value_counts(runs, sort=False)[:-1]
    return out
df.apply(func)
x y z
0 NaN NaN NaN
1 NaN NaN 1.0
2 2.0 NaN NaN
3 1.0 NaN NaN
4 NaN NaN 3.0
5 NaN NaN NaN
6 NaN 6.0 2.0
It might be faster with a good implementation of run-length encoding, but I am not too familiar with it in Python.
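For reference, a minimal run-length helper in NumPy (my own sketch, not benchmarked against the solutions above):

import numpy as np

def run_lengths(a):
    a = np.asarray(a)
    # indices where the value changes
    edges = np.flatnonzero(np.diff(a)) + 1
    starts = np.concatenate(([0], edges))
    ends = np.concatenate((edges, [len(a)]))
    return ends - starts  # length of each run, in order

run_lengths([0, 0, 1, 0, 0, 0, 0])  # array([2, 1, 4])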
I have a 21840x39 data frame. A few of my columns are numerically valued and I want to make sure they are all in the same data type (which I want to be a float).
Instead of naming all the columns out and converting them:
df[['A', 'B', 'C', ...]] = df[['A', 'B', 'C', ...]].astype(float)
Can I write a for loop that allows me to say something like "convert to float from column 18 to column 35"?
I know how to do one column: df['A'] = df['A'].astype(float)
But how can I do multiple columns? I tried with list slicing within a loop but couldn't get it right.
The first idea is to convert the selected columns; Python counts from 0, so for the 18th to 35th columns use:
df.iloc[:, 17:35] = df.iloc[:, 17:35].astype(float)
If that is not working (because of a possible bug), use another solution:
df = df.astype(dict.fromkeys(df.columns[17:35], float))
Sample - converting the 8th to 15th columns:
import pandas as pd
import numpy as np

np.random.seed(2020)
df = pd.DataFrame(np.random.randint(10, size=(3, 18)),
                  columns=list('abcdefghijklmnopqr')).astype(str)
print (df)
a b c d e f g h i j k l m n o p q r
0 0 8 3 6 3 3 7 8 0 0 8 9 3 7 2 3 6 5
1 0 4 8 6 4 1 1 5 9 5 6 6 6 5 4 6 4 2
2 3 4 7 1 4 9 3 2 0 9 1 2 7 1 0 2 8 8
df = df.astype(dict.fromkeys(df.columns[7:15], float))
print (df)
a b c d e f g h i j k l m n o p q r
0 0 8 3 6 3 3 7 8.0 0.0 0.0 8.0 9.0 3.0 7.0 2.0 3 6 5
1 0 4 8 6 4 1 1 5.0 9.0 5.0 6.0 6.0 6.0 5.0 4.0 6 4 2
2 3 4 7 1 4 9 3 2.0 0.0 9.0 1.0 2.0 7.0 1.0 0.0 2 8 8
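If some of those columns might contain values that do not parse as floats, pd.to_numeric is a more forgiving alternative (a sketch; errors='coerce' turns unparseable entries into NaN):

df.iloc[:, 7:15] = df.iloc[:, 7:15].apply(pd.to_numeric, errors='coerce')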
Tweaked @jezrael's code, as typing in column names is (I feel) a good option.
import pandas as pd
import numpy as np
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(10, size=(3, 18)),
                  columns=list('abcdefghijklmnopqr')).astype(str)
print(df)
columns = list(df.columns)
#change the first and last column names below as required
df = df.astype(dict.fromkeys(
    df.columns[columns.index('h'):(columns.index('o') + 1)], float))
print (df)
Leaving the original answer below, but note: never loop in pandas if vectorized alternatives exist.
If I had a dataframe and wanted to change columns 'col3' to 'col5' (human-readable names) to floats, I could...
import pandas as pd
import re
df = pd.read_csv('dummy_data.csv')
df
columns = list(df.columns)
#change the first and last column names below as required
start_column = columns.index('col3')
end_column = columns.index('col5')
for index, col in enumerate(columns):
    if start_column <= index <= end_column:
        df[col] = df[col].astype(float)
df
...by just changing the column names. Perhaps it's easier to work with column names, 'from this one' to 'that one' (inclusive).
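A vectorized way to express the same "from this name to that name (inclusive)" idea, using the same hypothetical col3/col5 names (a sketch; .loc label slices include both endpoints):

df.loc[:, 'col3':'col5'] = df.loc[:, 'col3':'col5'].astype(float)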