With a DataFrame that looks like this:
tra98 tra99 tra100 tra101 tra102
0 0.1880 0.345 0.1980 0.2090 0.2190
1 0.2510 0.585 0.2710 0.3240 0.2920
2 0.3240 0.741 0.2190 0.2090 0.2820
3 0.2820 0.825 0.1040 0.1880 0.2400
4 0.2190 1.150 0.0940 0.1360 0.1770
5 0.2300 1.210 0.0522 0.0209 0.0731
6 0.1670 1.290 0.0626 0.0104 0.0104
7 0.0835 1.400 0.0104 NaN NaN
8 0.0418 1.580 NaN NaN NaN
9 0.0209 NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN
How can I select the first and last valid values in each column?
Thank you for your help.
The following shows how you can iterate over the columns, call dropna(), and then access the first and last values with iloc:
In [21]:
for col in df:
    valid_col = df[col].dropna()
    print("column:", col, " first:", valid_col.iloc[0], " last:", valid_col.iloc[-1])
column: tra98 first: 0.188 last: 0.0209
column: tra99 first: 0.345 last: 1.58
column: tra100 first: 0.198 last: 0.0104
column: tra101 first: 0.209 last: 0.0104
column: tra102 first: 0.219 last: 0.0104
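If you'd rather avoid the Python-level loop, a minimal vectorized sketch (assuming the DataFrame above; a column that is entirely NaN would still come back as NaN):
# first valid value per column: back-fill, then take the first row
first_vals = df.bfill().iloc[0]
# last valid value per column: forward-fill, then take the last row
last_vals = df.ffill().iloc[-1]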
Imagine you have the following two dfs:
lines
line amount#1 line amount#2
0 18.20 0.82
1 NaN NaN
2 40.00 259.00
3 388.00 NaN
4 17.41 NaN
btws
btw-amount#1 btw-amount#2
0 0.0 0.14
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
I want to subtract these two dfs so that I get a new df like the following:
line amount#1 line amount#2
0 18.20 0.68
1 NaN NaN
2 40.00 259.00
3 388.00 NaN
4 17.41 NaN
I've tried:
lines.subtract(btws, axis=0)
However, everything turns to NaN.
Please help!
subtract() aligns on column labels, and since the column names of lines and btws are different, every aligned cell ends up NaN. Align the two frames by position instead:
import pandas as pd

# subtract by position (column names differ), treating missing btw amounts as 0
result = lines.to_numpy() - btws.fillna(0).to_numpy()
result = pd.DataFrame(result, columns=lines.columns)
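Alternatively, a sketch that stays label-based by first giving btws the same column names as lines (column names assumed as shown above):
# rename the btw columns so pandas can align them with lines by label
btws_aligned = btws.rename(columns=dict(zip(btws.columns, lines.columns)))
result = lines.sub(btws_aligned.fillna(0))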
I am working with Pandas and want to filter the columns with a regex. It returns something when I change the regex to rf"{c}(\.)?(\d)*", but when I require it to start with a certain letter it breaks and the filtered dataframe is empty.
for c in self.variables.split():
    reg = rf"^{c}(\.)?(\d)*$"
    print(reg)
    filtered = self.raw_data.filter(regex=reg)
What did I do wrong, and how can I fix it?
PS: This is a sample of the data
variable T T.1 T.2 T.3 T.4 ... T.8 T.9 l phi dl
0 29.63 27.87 26.95 26.64 26.25 ... 23.3 22.42 2.141 0.093551 0.002
1 29.70 NaN NaN NaN NaN ... NaN NaN 2.043 0.098052 0.002
2 29.62 NaN NaN NaN NaN ... NaN NaN 1.892 0.089973 0.002
3 29.65 NaN NaN NaN NaN ... NaN NaN 1.828 0.093132 0.002
And I would like it to return 4 dfs each only containing the data of a specific variable e.g.
variable T T.1 T.2 T.3 T.4 T.5 T.6 T.7 T.8 T.9
0 29.63 27.87 26.95 26.64 26.25 25.62 24.99 23.85 23.3 22.42
1 29.70 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 29.62 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 29.65 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 29.38 NaN NaN NaN NaN NaN NaN NaN NaN NaN
or only l without the dl (this is why I thought I needed to use ^ in my regex)
variable l
0 2.141
1 2.043
2 1.892
3 1.828
Thx in advance dear community
Details
variable - match the literal string variable, so the variable column appears in every output dataframe
| - logical or
^ - start of the string
{c} - the desired variable name, interpolated via the f-string
(\.\d+)? - an optional literal . followed by one or more digits
$ - end of the string
import pandas as pd

df = pd.read_csv("sample.csv", sep=r'\s+')
print(df)

variables = ['T', 'l', 'phi', 'dl']
for c in variables:
    ds = df.filter(regex=rf"variable|^{c}(\.\d+)?$")
    print(f'\n---Variable: [{c}] ---')
    print(ds)
---Variable: [T] ---
variable T T.1 T.2 T.3 T.4 T.5 T.6 T.7 T.8 T.9
0 0 29.63 27.87 26.95 26.64 26.25 25.62 24.99 23.85 23.3 22.42
1 1 29.70 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2 29.62 NaN NaN NaN NaN NaN NaN NaN NaN NaN
...
---Variable: [l] ---
variable l
0 0 2.141
1 1 2.043
2 2 1.892
...
---Variable: [phi] ---
variable phi
0 0 0.093551
1 1 0.098052
2 2 0.089973
...
---Variable: [dl] ---
variable dl
0 0 0.002
1 1 0.002
2 2 0.002
...
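If you need to keep the four dataframes around instead of just printing them, a small sketch (frames is just a hypothetical name):
# one filtered frame per variable, keyed by the variable name
frames = {c: df.filter(regex=rf"variable|^{c}(\.\d+)?$") for c in variables}
frames['l']   # the dataframe containing only the variable and l columns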
I'm looking to make a new column, MaxPriceBetweenEntries based on the max() of a slice of the dataframe
idx Price EntryBar ExitBar
0 10.00 0 1
1 11.00 NaN NaN
2 10.15 2 4
3 12.14 NaN NaN
4 10.30 NaN NaN
turned into
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 10.00 0 1 11.00
1 11.00 NaN NaN NaN
2 10.15 2 4 12.14
3 12.14 NaN NaN NaN
4 10.30 NaN NaN NaN
I can get all the rows with an EntryBar or ExitBar value with df.loc[df["EntryBar"].notnull()] and df.loc[df["ExitBar"].notnull()], but I can't use that to set a new column:
df.loc[df["EntryBar"].notnull(),"MaxPriceBetweenEntries"] = df.loc[df["EntryBar"]:df["ExitBar"]]["Price"].max()
but that's effectively a guess at this point, because nothing I'm trying works. Ideally the solution wouldn't involve a loop directly because there may be millions of rows.
You can group by the cumulative sum of non-null entries and take the max, using np.where() to apply the result only to non-null rows:
import numpy as np

df['MaxPriceBetweenEntries'] = np.where(df['EntryBar'].notnull(),
                                        df.groupby(df['EntryBar'].notnull().cumsum())['Price'].transform('max'),
                                        np.nan)
df
Out[1]:
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN
Let's try groupby() and where:
s = df['EntryBar'].notna()
df['MaxPriceBetweenEntries'] = df.groupby(s.cumsum())['Price'].transform('max').where(s)
Output:
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN
You can forward-fill the null values, group by EntryBar, and get the max of each group's Price. Use that as the right side of a left join and you should be in business.
df.merge(df.ffill().groupby('EntryBar')['Price'].max().reset_index(name='MaxPriceBetweenEntries'),
         on='EntryBar',
         how='left')
Try
df.loc[df['ExitBar'].notna(), 'Max'] = df.groupby(df['ExitBar'].ffill()).Price.max().values
df
Out[74]:
idx Price EntryBar ExitBar Max
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN
How can I combine this line with my pandas dataframe to drop the columns whose missing rate is over 90%?
This line shows every column and its missing rate:
percentage = (LoanStats_securev1_2018Q1.isnull().sum()/LoanStats_securev1_2018Q1.isnull().count()*100).sort_values(ascending = False)
Someone familiar with pandas please kindly help.
You can use dropna with a threshold:
newdf = df.dropna(axis=1, thresh=int(len(df) * 0.1))
axis=1 drops columns, and thresh is the minimum number of non-NA values a column must have to be kept. Keeping columns with at least 10% non-NA values is the same as dropping columns whose missing rate is over 90%; with 1000 rows, for example, the threshold works out to 100 non-NA values.
I think you need boolean indexing with the mean of a boolean mask:
df = df.loc[:, df.isnull().mean() < .9]
Sample:
np.random.seed(2018)
df = pd.DataFrame(np.random.randn(20,3), columns=list('ABC'))
df.iloc[3:8,0] = np.nan
df.iloc[:-1,1] = np.nan
df.iloc[1:,2] = np.nan
print (df)
A B C
0 -0.276768 NaN 2.148399
1 -1.279487 NaN NaN
2 -0.142790 NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 -0.172797 NaN NaN
9 -1.604543 NaN NaN
10 -0.276501 NaN NaN
11 0.704780 NaN NaN
12 0.138125 NaN NaN
13 1.072796 NaN NaN
14 -0.803375 NaN NaN
15 0.047084 NaN NaN
16 -0.013434 NaN NaN
17 -1.580231 NaN NaN
18 -0.851835 NaN NaN
19 -0.148534 0.133759 NaN
print(df.isnull().mean())
A 0.25
B 0.95
C 0.95
dtype: float64
df = df.loc[:, df.isnull().mean() < .9]
print (df)
A
0 -0.276768
1 -1.279487
2 -0.142790
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 -0.172797
9 -1.604543
10 -0.276501
11 0.704780
12 0.138125
13 1.072796
14 -0.803375
15 0.047084
16 -0.013434
17 -1.580231
18 -0.851835
19 -0.148534
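If you want to reuse the percentage series from the question directly, a minimal sketch (assuming percentage has been computed as above, in percent):
# keep only columns whose missing rate is at most 90%
cols_to_keep = percentage[percentage <= 90].index
LoanStats_securev1_2018Q1 = LoanStats_securev1_2018Q1[cols_to_keep]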
Hi I have the following dataframe:
>df1
code item01 item02 item03 item04 item05
0 1111 nan nan nan nan 440
1 1111 nan nan nan 650 nan
2 1111 nan nan nan nan nan
3 1111 nan nan nan nan nan
4 1111 32 nan nan nan nan
5 1111 nan nan nan nan nan
6 1111 nan nan nan nan nan
7 1111 nan nan nan nan nan
8 1111 nan nan nan nan nan
9 1111 nan nan nan nan nan
10 1111 nan nan nan nan nan
11 2222 20 nan nan nan nan
12 2222 nan nan nan nan nan
13 2222 nan nan nan 5 nan
14 2222 nan 7 nan nan nan
15 2222 nan nan nan nan nan
16 2222 nan nan nan nan nan
How can I combine the rows that share the same 'code' value to get df2, without a for loop or iterrows()?
>df2
code item01 item02 item03 item04 item05
0 1111 32 130 nan 650 440
1 2222 20 7 nan 5 nan
You can use:
If there is at most one non-NaN value per column within each group:
df.groupby('code').first()
If multiple values per column are possible, a more general solution:
cols = df.columns.difference(['code'])
df = (df.groupby('code')[cols]
        .apply(lambda x: x.apply(lambda y: pd.Series(y.dropna().values))))
print (df)
item01 item02 item03 item04 item05
code
1111 0 32.0 NaN NaN 650.0 440.0
2222 0 20.0 7.0 NaN 5.0 NaN
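To get back to the flat, one-row-per-code layout of df2, a small follow-up sketch (assuming the result above, which is indexed by code and position within the group):
df2 = df.reset_index(level=1, drop=True).reset_index()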
You can simply use a groupby:
df2 = df1.groupby('code').max().reset_index()
Be careful: if there are multiple values for an item under the same code, this keeps only the biggest one.
reset_index is only used to move code back into a column so the output matches the format of df2.