How to convert a particular dataframe into a series by combining columns - python

Given the following data, where 3 means yes and 2 means no
t = pd.DataFrame({"v_1": [2, 2, 3], "v_2": [2, 3, 2], "v_3": [3, 2, 2],})
which looks like this:
v_1 v_2 v_3
0 2 2 3
1 2 3 2
2 3 2 2
I would like to create the following series
0 v_3
1 v_2
2 v_1
All I can think of is the following:
t['V'] = t.sum().reset_index(drop=True)
which gives
v_1 v_2 v_3 V
0 2 2 3 v_3
1 2 3 2 v_2
2 3 2 2 v_1
I'm wondering if there's a nicer approach than this, or perhaps more general.
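For comparison, a compact sketch built only on the sample frame above also produces the desired series; it relies on every row containing exactly one 3:
import pandas as pd
t = pd.DataFrame({"v_1": [2, 2, 3], "v_2": [2, 3, 2], "v_3": [3, 2, 2]})
# idxmax(axis=1) returns, for each row, the label of the first column holding the row maximum (here the single 3)
s = t.idxmax(axis=1)
print(s)
0    v_3
1    v_2
2    v_1
dtype: object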

Perhaps this is what you need - keep the 3s and concatenate them into a series?
(
t.apply(lambda x: np.where(x.eq(3), x.name, None))
.stack().reset_index(drop=True)
)
0 v_3
1 v_2
2 v_1
dtype: object

Give this a whirl:
(t
.stack()
.droplevel(0)
.loc[lambda x: x.eq(3)]
.reset_index(name='temp')
.drop('temp',axis=1)
)
index
0 v_3
1 v_2
2 v_1

Use DataFrame.where to replace non-3 values with missing values, then reshape with DataFrame.stack, remove the first level of the MultiIndex, and finally create a Series from the index (useful if performance is important):
s = pd.Series(t.where(t.eq(3)).stack().droplevel(0).index)
#alternative
#s = pd.Series(t.where(t.eq(3)).stack().reset_index(0, drop=True).index)
print (s)
0 v_3
1 v_2
2 v_1
dtype: object
Details:
print (t.where(t.eq(3)))
v_1 v_2 v_3
0 NaN NaN 3.0
1 NaN 3.0 NaN
2 3.0 NaN NaN
print (t.where(t.eq(3)).stack())
0 v_3 3.0
1 v_2 3.0
2 v_1 3.0
dtype: float64
print (t.where(t.eq(3)).stack().droplevel(0))
v_3 3.0
v_2 3.0
v_1 3.0
dtype: float64
Performance for 1k rows and 10 columns:
np.random.seed(123)
t = pd.DataFrame(np.random.choice([2,3], (1000, 10))).add_prefix('v_')
#print (t)
In [25]: %timeit pd.Series(t.where(t.eq(3)).stack().droplevel(0).index)
2.66 ms ± 93.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [26]: %timeit pd.Series(t.where(t.eq(3)).stack().reset_index(0, drop=True).index)
2.61 ms ± 41.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [27]: %timeit t.apply(lambda x: np.where(x.eq(3), x.name, None)).stack().reset_index(drop=True)
5.98 ms ± 46.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [28]: %timeit t.stack().droplevel(0).loc[lambda x: x.eq(3)].reset_index(name='temp').drop('temp',axis=1)
3.48 ms ± 36.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Performance for 100k rows and 10 columns:
t = pd.DataFrame(np.random.choice([2,3], (100000, 10))).add_prefix('v_')
print (t)
In [30]: %timeit pd.Series(t.where(t.eq(3)).stack().droplevel(0).index)
84.7 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [31]: %timeit pd.Series(t.where(t.eq(3)).stack().reset_index(0, drop=True).index)
84.1 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [32]: %timeit t.apply(lambda x: np.where(x.eq(3), x.name, None)).stack().reset_index(drop=True)
147 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [33]: %timeit t.stack().droplevel(0).loc[lambda x: x.eq(3)].reset_index(name='temp').drop('temp',axis=1)
101 ms ± 635 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

You can build a new index from the row position of the 3 in each column, then assign that index to a Series of the column names.
import pandas as pd
t = pd.DataFrame({"v_1": [2, 2, 3], "v_2": [2, 3, 2], "v_3": [3, 2, 2],})
index_list = [t[t[col]==3].index[0] for col in t.columns] # create new index
series = pd.Series(t.columns) # series of column names
series.index = index_list # apply index to column names
print(series.sort_index())

Related

Python Pandas Fast Way to Divide Row Value by Previous Value

I want to calculate daily bond returns from clean prices based on the logarithm of the bond price in t divided by the bond price in t-1. So far, I calculate it like this:
import pandas as pd
import numpy as np
#create example data
col1 = np.random.randint(0,10,size=10)
df = pd.DataFrame()
df["col1"] = col1
df["result"] = [0]*len(df)
#slow computation
for i in range(len(df)):
    if i == 0:
        df["result"][i] = np.nan
    else:
        df["result"][i] = np.log(df["col1"][i]/df["col1"][i-1])
However, since I have a large sample this takes a lot of time to compute. Is there a way to improve the code in order to make it faster?
Use Series.shift on the col1 column together with Series.div for the division:
df["result1"] = np.log(df["col1"].div(df["col1"].shift()))
#alternative
#df["result1"] = np.log(df["col1"] / df["col1"].shift())
print (df)
col1 result result1
0 5 NaN NaN
1 0 -inf -inf
2 3 inf inf
3 3 0.000000 0.000000
4 7 0.847298 0.847298
5 9 0.251314 0.251314
6 3 -1.098612 -1.098612
7 5 0.510826 0.510826
8 2 -0.916291 -0.916291
9 4 0.693147 0.693147
I timed the solutions:
np.random.seed(0)
col1 = np.random.randint(0,10,size=10000)
df = pd.DataFrame({'col1':col1})
In [128]: %timeit df["result1"] = np.log(df["col1"] / df["col1"].shift())
865 µs ± 139 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [129]: %timeit df.assign(result=lambda x: np.log(x.col1.pct_change() + 1))
1.16 ms ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [130]: %timeit df["result1"] = np.log(df["col1"].pct_change() + 1)
1.03 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.random.seed(0)
col1 = np.random.randint(0,10,size=100000)
df = pd.DataFrame({'col1':col1})
In [132]: %timeit df["result1"] = np.log(df["col1"] / df["col1"].shift())
3.7 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [133]: %timeit df.assign(result=lambda x: np.log(x.col1.pct_change() + 1))
6.31 ms ± 545 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [134]: %timeit df["result1"] = np.log(df["col1"].pct_change() + 1)
3.75 ms ± 269 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
No need to use multiple functions, use Series.pct_change():
df = df.assign(
    result=lambda x: np.log(x.col1.pct_change() + 1)
)
print(df)
col1 result
0 3 NaN
1 5 0.510826
2 8 0.470004
3 7 -0.133531
4 9 0.251314
5 1 -2.197225
6 1 0.000000
7 2 0.693147
8 7 1.252763
9 0 -inf
This should be a much faster way to get the same results:
df["result_2"] = np.log(df["col1"] / df["col1"].shift())

Add a column to a df where if a certain value is 0, return 1 else return the original value of the column

The Python code with which I am trying to achieve this result is:
df['column2'] = np.where(df['column1'] == 0, 1, df['column1'])
For the sample dataframe it is fastest to use np.where.
You can also use pandas.DataFrame.where, which will replace values where the condition is False otherwise return the value in the dataframe column.
100 is used to make the update easier to see
import pandas as pd
# test dataframe
df = pd.DataFrame({'a': [2, 4, 1, 0, 2, 2, 0, 8, 4, 0], 'b': [2, 4, 0, 9, 2, 0, 2, 8, 0, 3]})
# replace 0 with 100 or leave the same number based on the same column
df['0 → 100 on a if a'] = df.a.where(df.a != 0, 100)
# replace 0 with 100 or leave the same number based on a different column
df['0 → 100 on a if b'] = df.a.where(df.b != 0, 100)
# display(df)
a b 0 → 100 on a if a 0 → 100 on a if b
0 2 2 2 2
1 4 4 4 4
2 1 0 1 100
3 0 9 100 0
4 2 2 2 2
5 2 0 2 100
6 0 2 100 0
7 8 8 8 8
8 4 0 4 100
9 0 3 100 0
%%timeit testing
Test Data
import pandas as pd
import numpy as np
# test dataframe with 1M rows
np.random.seed(365)
df = pd.DataFrame({'a': np.random.randint(0, 10, size=(1000000)), 'b': np.random.randint(0, 10, size=(1000000))})
Tests
%%timeit
np.where(df.a == 0, 1, df.a)
[out]:
161 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
np.where(df.b == 0, 1, df.a)
[out]:
164 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df.a.where(df.a != 0, 1)
[out]:
4.51 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.a.where(df.b != 0, 1)
[out]:
4.55 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
noah1(df)
[out]:
4.63 ms ± 58.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
noah2(df)
[out]:
15.3 s ± 205 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
paul(df)
[out]:
341 ms ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
karam(df)
[out]:
299 ms ± 4.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Functions
def noah1(d):
    return d.a.replace(0, 1)
def noah2(d):
    return d.apply(lambda x: 1 if x.a == 0 else x.b, axis=1)
def paul(d):
    return [1 if v==0 else v for v in d.a.values]
def karam(d):
    return d.a.apply(lambda x: 1 if x == 0 else x)
The apply example provided above should work or this works too:
df['column_2'] = [1 if v==0 else v for v in df['col'].values]
My example uses list comprehension: https://www.w3schools.com/python/python_lists_comprehension.asp
And the other answer uses lambda function: https://www.w3schools.com/python/python_lambda.asp
Personally, when writing scripts that others may use, I think list comprehension is more widely known and therefore more verbose; but I believe the lambda function performs faster and is in general a highly useful tool, so it is probably recommended over the list comprehension.
What you want is essentially to just copy the column and replace 0s with 1s:
df["Column2"] = df["Column1"].replace(0,1)
More generally, if you want the value from some other ColumnX, you can use the following lambda function:
df["Column2"] = df.apply(lambda x: 1 if x["Column1"]==0 else x['ColumnX'], axis=1)
You should be able to achieve that using an apply statement in this manner:
df['column2'] = df['column1'].apply(lambda x: 1 if x == 0 else x)
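For completeness, Series.mask is another spelling of the same idea; a sketch assuming the column names from the question:
df['column2'] = df['column1'].mask(df['column1'] == 0, 1)
mask replaces values where the condition is True, so it is the mirror image of the where-based approach shown further up.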

How to obtain the datatypes of objects in a mixed-datatype column?

Given a pandas.DataFrame with a column holding mixed datatypes, like e.g.
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string']})
I was wondering how to obtain the datatypes of the individual objects in the column (Series)? Suppose I want to modify all entries in the Series that are of a certain type, like multiply all integers by some factor.
I could iteratively derive a mask and use it in loc, like
m = np.array([isinstance(v, int) for v in df['mixed']])
df.loc[m, 'mixed'] *= 10
# df
# mixed
# 0 2020-10-04 00:00:00
# 1 9990
# 2 a string
That does the trick but I was wondering if there was a more pandastic way of doing this?
One idea is to test which values are numeric with to_numeric using errors='coerce' and then select the non-missing values:
m = pd.to_numeric(df['mixed'], errors='coerce').notna()
df.loc[m, 'mixed'] *= 10
print (df)
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
Unfortunately it is slow; here are some other ideas:
N = 1000000
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string'] * N})
In [29]: %timeit df.mixed.map(lambda x : type(x).__name__)=='int'
1.26 s ± 83.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [30]: %timeit np.array([isinstance(v, int) for v in df['mixed']])
1.12 s ± 77.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [31]: %timeit pd.to_numeric(df['mixed'], errors='coerce').notna()
3.07 s ± 55.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [34]: %timeit ([isinstance(v, int) for v in df['mixed']])
909 ms ± 8.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [35]: %timeit df.mixed.map(lambda x : type(x))=='int'
877 ms ± 8.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [36]: %timeit df.mixed.map(lambda x : type(x) =='int')
842 ms ± 6.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [37]: %timeit df.mixed.map(lambda x : isinstance(x, int))
807 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Pandas cannot use vectorization effectively here because of the mixed values, so element-wise approaches are necessary.
You still need to call type:
m = df.mixed.map(lambda x : type(x).__name__)=='int'
df.loc[m, 'mixed']*=10
df
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
If you want to multiply all 'numbers', then you can use the following.
Let's use pd.to_numeric with the parameter errors='coerce' and fillna:
df['mixed'] = (pd.to_numeric(df['mixed'], errors='coerce') * 10).fillna(df['mixed'])
df
Output:
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
Let's add a float to the column
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string', 100.3]})
Using #BenYo:
m = df.mixed.map(lambda x : type(x).__name__)=='int'
df.loc[m, 'mixed']*=10
df
Output (note only the integer 999 is multiplied by 10):
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
3 100.3
Using #jezrael and, similarly, this solution:
m = pd.to_numeric(df['mixed'], errors='coerce').notna()
df.loc[m, 'mixed'] *= 10
print(df)
# Or this solution
# df['mixed'] = (pd.to_numeric(df['mixed'], errors='coerce') * 10).fillna(df['mixed'])
Output (note all numbers are multiplied by 10):
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
3 1003
If you do many calculations and have a little more memory, I suggest adding a column that indicates the type of each entry in mixed, for better efficiency. After you construct this column, the calculations are much faster.
Here's the code:
N = 1000000
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string'] * N})
df["mixed_type"] = df.mixed.map(lambda x: type(x).__name__).astype('category')
m = df.mixed_type == 'int'
df.loc[m, "mixed"] *= 10
del df["mixed_type"] # after you finish all your calculation
The mixed_type column's repr is:
0 Timestamp
1 int
2 str
3 Timestamp
4 int
...
2999995 int
2999996 str
2999997 Timestamp
2999998 int
2999999 str
Name: mixed, Length: 3000000, dtype: category
Categories (3, object): [Timestamp, int, str]
and here's the timeit
>>> %timeit df.mixed_type == 'int'
472 µs ± 57.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df.mixed.map(lambda x : type(x).__name__)=='int'
1.12 s ± 87.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For data frames that are not very long, I can suggest this approach as well:
df = df.assign(mixed = lambda x: x.apply(lambda s: s['mixed']*10 if isinstance(s['mixed'], int) else s['mixed'],axis=1))
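A small sketch that wraps the isinstance idea from the timings above in a reusable helper (type_mask is a hypothetical name, not a pandas method):
import pandas as pd
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string']})
def type_mask(s, typ):
    # boolean mask marking the entries that are instances of the given type
    return s.map(lambda v: isinstance(v, typ))
df.loc[type_mask(df['mixed'], int), 'mixed'] *= 10
print(df)
mixed
0 2020-10-04 00:00:00
1 9990
2 a string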

Find Minimum without Zero and NaN in Pandas Dataframe

I have a pandas DataFrame and I want to find the minimum while ignoring zeros and NaNs.
I was trying to combine numpy's nonzero and nanmin, but it does not work.
Does someone have an idea?
If you want the minimum over the whole DataFrame, you can try this:
m = np.nanmin(df.replace(0, np.nan).values)
Use numpy.where with numpy.nanmin:
df = pd.DataFrame({'B':[4,0,4,5,5,np.nan],
                   'C':[7,8,9,np.nan,2,3],
                   'D':[1,np.nan,5,7,1,0],
                   'E':[5,3,0,9,2,4]})
print (df)
B C D E
0 4.0 7.0 1.0 5
1 0.0 8.0 NaN 3
2 4.0 9.0 5.0 0
3 5.0 NaN 7.0 9
4 5.0 2.0 1.0 2
5 NaN 3.0 0.0 4
Numpy solution:
arr = df.values
a = np.nanmin(np.where(arr == 0, np.nan, arr))
print (a)
1.0
Pandas solution - NaNs are removed by default:
a = df.mask(df==0).min().min()
print (a)
1.0
Performance - one NaN value is added to each row:
np.random.seed(123)
df = pd.DataFrame(np.random.rand(1000,1000))
np.fill_diagonal(df.values, np.nan)
print (df)
#joe answer
In [399]: %timeit np.nanmin(df.replace(0, np.nan).values)
15.3 ms ± 425 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [400]: %%timeit
...: arr = df.values
...: a = np.nanmin(np.where(arr == 0, np.nan, arr))
...:
6.41 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [401]: %%timeit
...: df.mask(df==0).min().min()
...:
23.9 ms ± 727 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Round up or down an entire column in a dataframe

A
0 31.353
1 28.945
2 17.377
I want to create a new df["B"] with the A column values rounded up to the nearest multiple of 5.
The desired output:
A B
0 31.353 35.0
1 28.945 30.0
2 17.377 20.0
I've tried:
def roundup5(x):
    return int(math.ceil(x / 5.0)) * 5
df["B"] = df["A"].apply(roundup5)
I get:
TypeError: unsupported operand type(s) for /: 'str' and 'float'
I think you need to convert the values to floats first, then divide, apply numpy.ceil, and multiply back:
df["B"] = df["A"].astype(float).div(5.0).apply(np.ceil).mul(5)
df["B"] = np.ceil(df["A"].astype(float).div(5.0)).mul(5)
Loop version:
def roundup5(x):
    return int(math.ceil(float(x) / 5.0)) * 5.0
df["B"] = df["A"].apply(roundup5)
print (df)
A B
0 31.353 35.0
1 28.945 30.0
2 17.377 20.0
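The title also asks about rounding down; the same pattern with numpy.floor in place of numpy.ceil is a sketch for that direction (the column name C is illustrative):
df["C"] = np.floor(df["A"].astype(float).div(5.0)).mul(5)
which gives 30.0, 25.0 and 15.0 for the three sample rows.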
Timings:
df = pd.concat([df] * 10000, ignore_index=True)
#[30000 rows x 1 columns]
In [327]: %timeit df["B1"] = df["A"].apply(roundup5)
35.7 ms ± 4.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [328]: %timeit df["B2"] = df["A"].astype(float).div(5.0).apply(np.ceil).mul(5)
1.25 ms ± 76.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [329]: %timeit df["B3"] = np.ceil(df["A"].astype(float).div(5.0)).mul(5)
1.19 ms ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
