Pandas how to not apply to whole column - python

self.df['Regular Price'] = self.df['Regular Price'].apply(
    lambda x: int(round(x)) if isinstance(x, (int, float)) else None
)
The above code assigns None to every value of the field Regular Price whenever it encounters a non-numeric value in the DataFrame. I want to assign None only to the cells whose value is non-numeric.
thanks

First, it is impossible to have NaN together with integers, because NaN is a float by design.
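A minimal illustration, not from the original answer - a single NaN forces the whole column to float:
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan])
print (s.dtype)
float64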
Your solution works if the column holds mixed types - numbers together with strings:
df = pd.DataFrame({
    'Regular Price': ['a', 1, 2.3, 'a', 7],
    'B': list(range(5))
})
print (df)
B Regular Price
0 0 a
1 1 1
2 2 2.3
3 3 a
4 4 7
df['Regular Price'] = df['Regular Price'].apply(
    lambda x: int(round(x)) if isinstance(x, (int, float)) else None
)
print (df)
B Regular Price
0 0 NaN
1 1 1.0
2 2 2.0
3 3 NaN
4 4 7.0
But if all the data are strings, you need to_numeric with errors='coerce' to convert the non-numeric values to NaN:
df = pd.DataFrame({
    'Regular Price': ['a', '1', '2.3', 'a', '7'],
    'B': list(range(5))
})
print (df)
B Regular Price
0 0 a
1 1 1
2 2 2.3
3 3 a
4 4 7
df['Regular Price'] = pd.to_numeric(df['Regular Price'], errors='coerce').round()
print (df)
B Regular Price
0 0 NaN
1 1 1.0
2 2 2.0
3 3 NaN
4 4 7.0
EDIT:
I also need to remove floating points and use int only
It is possible by converting the NaNs to None and casting the rest to int:
df['Regular Price'] = pd.to_numeric(df['Regular Price'], errors='coerce').round()
df['Regular Price'] = np.where(df['Regular Price'].isnull(),
                               None,
                               df['Regular Price'].fillna(0).astype(int))
print (df)
B Regular Price
0 0 None
1 1 1
2 2 2
3 3 None
4 4 7
print (df['Regular Price'].apply(type))
0 <class 'NoneType'>
1 <class 'int'>
2 <class 'int'>
3 <class 'NoneType'>
4 <class 'int'>
Name: Regular Price, dtype: object
But this hurts performance, so it is best not to use it. There can also be other problems - some functions fail on such an object column - so it is best to stay with floats when working with NaN.
Testing some functions like diff on a 50k-row DataFrame:
df = pd.DataFrame({
    'Regular Price': ['a', '1', '2.3', 'a', '7'],
    'B': list(range(5))
})
df = pd.concat([df]*10000).reset_index(drop=True)
print (df)
df['Regular Price'] = pd.to_numeric(df['Regular Price'], errors='coerce').round()
df['Regular Price1'] = np.where(df['Regular Price'].isnull(),
                                None,
                                df['Regular Price'].fillna(0).astype(int))
In [252]: %timeit df['Regular Price2'] = df['Regular Price1'].diff()
TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'
In [274]: %timeit df['Regular Price3'] = df['Regular Price'].diff()
1000 loops, best of 3: 301 µs per loop
In [272]: %timeit df['Regular Price2'] = df['Regular Price1'] * 1000
100 loops, best of 3: 4.48 ms per loop
In [273]: %timeit df['Regular Price3'] = df['Regular Price'] * 1000
1000 loops, best of 3: 469 µs per loop
EDIT:
df = pd.DataFrame({
    'Regular Price': ['a', '1', '2.3', 'a', '7'],
    'B': list(range(5))
})
print (df)
B Regular Price
0 0 a
1 1 1
2 2 2.3
3 3 a
4 4 7
df['Regular Price'] = pd.to_numeric(df['Regular Price'], errors='coerce').round()
print (df)
B Regular Price
0 0 NaN
1 1 1.0
2 2 2.0
3 3 NaN
4 4 7.0
First, it is possible to remove the NaN rows by column Regular Price and then convert to int.
df1 = df.dropna(subset=['Regular Price']).copy()
df1['Regular Price'] = df1['Regular Price'].astype(int)
print (df1)
B Regular Price
1 1 1
2 2 2
4 4 7
Process what you need, but don't change the index.
#e.g. some process
df1['Regular Price'] = df1['Regular Price'] * 100
Last, combine_first - it adds the NaN rows back to the Regular Price column.
df2 = df1.combine_first(df)
print (df2)
B Regular Price
0 0.0 NaN
1 1.0 100.0
2 2.0 200.0
3 3.0 NaN
4 4.0 700.0
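A side note not from the original answer: pandas 0.24+ provides the nullable Int64 dtype, which can hold integers together with missing values and avoids the float/None trade-off above. A sketch, starting again from the string data:
df = pd.DataFrame({
    'Regular Price': ['a', '1', '2.3', 'a', '7'],
    'B': list(range(5))
})
df['Regular Price'] = pd.to_numeric(df['Regular Price'], errors='coerce').round().astype('Int64')
# the column is now integer-typed, and the non-numeric cells are missing (<NA>)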

Related

Fill NA values by a two levels indexed Series

I have a dataframe with columns (A, B and value) where there are missing values in the value column. And there is a Series indexed by two columns (A and B) from the dataframe. How can I fill the missing values in the dataframe with corresponding values in the series?
I think you need fillna with set_index and reset_index:
df = pd.DataFrame({'A': [1, 1, 3],
                   'B': [2, 3, 4],
                   'value': [2, np.nan, np.nan]})
print (df)
A B value
0 1 2 2.0
1 1 3 NaN
2 3 4 NaN
idx = pd.MultiIndex.from_product([[1,3],[2,3,4]])
s = pd.Series([5,6,0,8,9,7], index=idx)
print (s)
1  2    5
   3    6
   4    0
3  2    8
   3    9
   4    7
dtype: int64
df = df.set_index(['A','B'])['value'].fillna(s).reset_index()
print (df)
A B value
0 1 2 2.0
1 1 3 6.0
2 3 4 7.0
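A related sketch, not from the answer above: starting again from the original df and s, the fill values can also be built by reindexing s to the (A, B) pairs instead of setting the index on df:
filler = pd.Series(s.reindex(pd.MultiIndex.from_arrays([df['A'], df['B']])).values,
                   index=df.index)
df['value'] = df['value'].fillna(filler)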
Consider the dataframe and series df and s
df = pd.DataFrame(dict(
    A=list('aaabbbccc'),
    B=list('xyzxyzxyz'),
    value=[1, 2, np.nan, 4, 5, np.nan, 7, 8, 9]
))
s = pd.Series(range(1, 10)[::-1])
s.index = [df.A, df.B]
We can fillna with a clever join
df.fillna(df.join(s.rename('value'), on=['A', 'B'], lsuffix='_'))
#                 \_______________/                 \_________/
#                  make series same                  get old
#                  name as column                    column out
#                  we are filling                    of the way
A B value
0 a x 1.0
1 a y 2.0
2 a z 7.0
3 b x 4.0
4 b y 5.0
5 b z 4.0
6 c x 7.0
7 c y 8.0
8 c z 9.0
Timing
join is cute, but #jezrael's set_index is quicker
%timeit df.fillna(df.join(s.rename('value'), on=['A', 'B'], lsuffix='_'))
100 loops, best of 3: 3.56 ms per loop
%timeit df.set_index(['A','B'])['value'].fillna(s).reset_index()
100 loops, best of 3: 2.06 ms per loop

pandas: Merge two columns with different names?

I am trying to concatenate two dataframes, above and below. Not concatenate side-by-side.
The dataframes contain the same data, however, in the first dataframe one column might have name "ObjectType" and in the second dataframe the column might have name "ObjectClass". When I do
df_total = pandas.concat ([df0, df1])
the df_total will have two column names, one with "ObjectType" and another with "ObjectClass". In each of these two columns, half of the values will be "NaN". So I have to manually merge these two columns into one which is a pain.
Can I somehow merge the two columns into one? I would like to have a function that does something like:
df_total = pandas.merge_many_columns(input=["ObjectType", "ObjectClass"], output=["MyObjectClasses"])
which merges the two columns and creates a new column. I have looked into melt() but it does not really do this?
(Maybe it would be nice if I could specify what will happen if there is a collision, say that two columns contain values, in that case I supply a lambda function that says "keep the largest value", "use an average", etc)
I think you can rename the columns first to align the data in both DataFrames:
df0 = pd.DataFrame({'ObjectType': [1, 2, 3],
                    'B': [4, 5, 6],
                    'C': [7, 8, 9]})
#print (df0)
df1 = pd.DataFrame({'ObjectClass': [1, 2, 3],
                    'B': [4, 5, 6],
                    'C': [7, 8, 9]})
#print (df1)
inputs= ["ObjectType","ObjectClass"]
output= "MyObjectClasses"
#dict comprehension
d = {x:output for x in inputs}
print (d)
{'ObjectType': 'MyObjectClasses', 'ObjectClass': 'MyObjectClasses'}
df0 = df0.rename(columns=d)
df1 = df1.rename(columns=d)
df_total = pd.concat([df0, df1], ignore_index=True)
print (df_total)
B C MyObjectClasses
0 4 7 1
1 5 8 2
2 6 9 3
3 4 7 1
4 5 8 2
5 6 9 3
EDIT:
Simpler is update (it works in place):
df = pd.concat([df0, df1])
df['ObjectType'].update(df['ObjectClass'])
print (df)
B C ObjectClass ObjectType
0 4 7 NaN 1.0
1 5 8 NaN 2.0
2 6 9 NaN 3.0
0 4 7 1.0 1.0
1 5 8 2.0 2.0
2 6 9 3.0 3.0
Or fillna, but then you need to drop the original column(s):
df = pd.concat([df0, df1])
df["ObjectType"] = df['ObjectType'].fillna(df['ObjectClass'])
df = df.drop('ObjectClass', axis=1)
print (df)
B C ObjectType
0 4 7 1.0
1 5 8 2.0
2 6 9 3.0
0 4 7 1.0
1 5 8 2.0
2 6 9 3.0
df = pd.concat([df0, df1])
df["MyObjectClasses"] = df['ObjectType'].fillna(df['ObjectClass'])
df = df.drop(['ObjectType','ObjectClass'], axis=1)
print (df)
B C MyObjectClasses
0 4 7 1.0
1 5 8 2.0
2 6 9 3.0
0 4 7 1.0
1 5 8 2.0
2 6 9 3.0
EDIT1:
Timings:
df0 = pd.DataFrame({'ObjectType': [1, 2, 3],
                    'B': [4, 5, 6],
                    'C': [7, 8, 9]})
#print (df0)
df1 = pd.DataFrame({'ObjectClass': [1, 2, 3],
                    'B': [4, 5, 6],
                    'C': [7, 8, 9]})
#print (df1)
df0 = pd.concat([df0]*1000).reset_index(drop=True)
df1 = pd.concat([df1]*1000).reset_index(drop=True)
inputs= ["ObjectType","ObjectClass"]
output= "MyObjectClasses"
#dict comprehension
d = {x:output for x in inputs}
In [241]: %timeit df_total = pd.concat([df0.rename(columns=d), df1.rename(columns=d)], ignore_index=True)
1000 loops, best of 3: 821 µs per loop
In [240]: %%timeit
...: df = pd.concat([df0, df1])
...: df['ObjectType'].update(df['ObjectClass'])
...: df = df.drop(['ObjectType','ObjectClass'], axis=1)
...:
100 loops, best of 3: 2.18 ms per loop
In [242]: %%timeit
...: df = pd.concat([df0, df1])
...: df['MyObjectClasses'] = df['ObjectType'].combine_first(df['ObjectClass'])
...: df = df.drop(['ObjectType','ObjectClass'], axis=1)
...:
100 loops, best of 3: 2.21 ms per loop
In [243]: %%timeit
...: df = pd.concat([df0, df1])
...: df['MyObjectClasses'] = df['ObjectType'].fillna(df['ObjectClass'])
...: df = df.drop(['ObjectType','ObjectClass'], axis=1)
...:
100 loops, best of 3: 2.28 ms per loop
You can merge two columns separated by NaNs into one using combine_first:
>>> import numpy as np
>>> import pandas as pd
>>>
>>> df0 = pd.DataFrame({'ObjectType': [1, 2, 3],
...                     'B': [4, 5, 6],
...                     'C': [7, 8, 9]})
>>> df1 = pd.DataFrame({'ObjectClass': [1, 2, 3],
...                     'B': [4, 5, 6],
...                     'C': [7, 8, 9]})
>>> df = pd.concat([df0, df1])
>>> df['ObjectType'] = df['ObjectType'].combine_first(df['ObjectClass'])
>>> df['ObjectType']
0    1.0
1    2.0
2    3.0
0    1.0
1    2.0
2    3.0
Name: ObjectType, dtype: float64
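As for the collision case raised in the question (both columns holding a value in the same row), none of the snippets above decide between them. One possible sketch, with a purely illustrative "keep the larger value" rule, uses np.where; the column names and the rule are only examples:
import numpy as np

df = pd.concat([df0, df1], ignore_index=True)
a, b = df['ObjectType'], df['ObjectClass']
# where both columns have a value keep the larger one, otherwise take whichever is present
df['MyObjectClasses'] = np.where(a.notnull() & b.notnull(), np.maximum(a, b), a.fillna(b))
df = df.drop(['ObjectType', 'ObjectClass'], axis=1)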

join corresponding column to dataframe pandas

I have an example dataframe that looks like the one below. I want to make a calculation and then append the result as a new column to the current dataframe.
A, B # this is my df, a csv file
1, 2
3, 3
7, 6
13, 14
Below is some code I have tried.
for i in range(0, len(df.index) + 1, 1):
    if len(df.index) - 1 == i:
        df['C'] = str(df.iloc[i]['A'] / df.iloc[i]['B'])
    else:
        df['C'] = str((df.iloc[i+1]['A'] - df.iloc[i]['A']) / (df.iloc[i+1]['B'] - df.iloc[i]['B']))  # I need string as dtype
df.to_csv(Out, index=False)
This only gives me the result of the final loop iteration, because df['C'] = ... assigns a single value to the whole column on every pass instead of filling the corresponding row.
A B C
1 2 2
3 3 1.33
7 6 0.75
13 14 0.93 # It is the result I'd like to see.
Does anyone know how to revise it? Thanks in advance!
UPDATE - a much more elegant one-liner solution from #root:
In [131]: df['C'] = (df.A.shift(-1).sub(df.A, fill_value=0) / df.B.shift(-1).sub(df.B, fill_value=0)).round(2).astype(str)
In [132]: df
Out[132]:
A B C
0 1 2 2.0
1 3 3 1.33
2 7 6 0.75
3 13 14 0.93
In [133]: df.dtypes
Out[133]:
A int64
B int64
C object
dtype: object
you can do it this way:
df['C'] = (df.A.shift(-1) - df.A) / (df.B.shift(-1) - df.B)
df.loc[df.index.max(), 'C'] = df.loc[df.index.max(), 'A'] / df.loc[df.index.max(), 'B']
df.round(2)
yields:
In [118]: df.round(2)
Out[118]:
A B C
0 1 2 2.00
1 3 3 1.33
2 7 6 0.75
3 13 14 0.93
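For reference, the original loop can also be made to work by writing one cell per iteration instead of reassigning the whole column. A sketch, assuming the default RangeIndex and starting again from the plain A/B frame; it is much slower than the vectorized answers above:
for i in range(len(df.index)):
    if i == len(df.index) - 1:
        # last row: there is no next row, so divide A by B directly
        df.loc[i, 'C'] = str(round(df.loc[i, 'A'] / df.loc[i, 'B'], 2))
    else:
        df.loc[i, 'C'] = str(round((df.loc[i + 1, 'A'] - df.loc[i, 'A']) /
                                   (df.loc[i + 1, 'B'] - df.loc[i, 'B']), 2))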

Extract first and last row of a dataframe in pandas

How can I extract the first and last rows of a given dataframe as a new dataframe in pandas?
I've tried to use iloc to select the desired rows and then concat as in:
df=pd.DataFrame({'a':range(1,5), 'b':['a','b','c','d']})
pd.concat([df.iloc[0,:], df.iloc[-1,:]])
but this does not produce a pandas dataframe:
a 1
b a
a 4
b d
dtype: object
I think the simplest way is .iloc[[0, -1]].
df = pd.DataFrame({'a':range(1,5), 'b':['a','b','c','d']})
df2 = df.iloc[[0, -1]]
print(df2)
a b
0 1 a
3 4 d
You can also use head and tail:
In [29]: pd.concat([df.head(1), df.tail(1)])
Out[29]:
a b
0 1 a
3 4 d
The accepted answer duplicates the first row if the frame only contains a single row. If that's a concern
df[0::len(df)-1 if len(df) > 1 else 1]
works even for single row-dataframes.
Example: For the following dataframe this will not create a duplicate:
df = pd.DataFrame({'a': [1], 'b':['a']})
df2 = df[0::len(df)-1 if len(df) > 1 else 1]
print (df2)
a b
0 1 a
whereas this does:
df3 = df.iloc[[0, -1]]
print (df3)
a b
0 1 a
0 1 a
because the single row is the first AND last row at the same time.
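An equivalent sketch that keeps the iloc style is to deduplicate the positional indexer first:
positions = sorted(set([0, len(df) - 1]))
df2 = df.iloc[positions]  # a single-row frame selects row 0 only once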
I think you can try adding the parameter axis=1 to concat, because the outputs of df.iloc[0,:] and df.iloc[-1,:] are Series, and then transpose with T:
print (df.iloc[0,:])
a 1
b a
Name: 0, dtype: object
print (df.iloc[-1,:])
a 4
b d
Name: 3, dtype: object
print (pd.concat([df.iloc[0,:], df.iloc[-1,:]], axis=1))
0 3
a 1 4
b a d
print (pd.concat([df.iloc[0,:], df.iloc[-1,:]], axis=1).T)
a b
0 1 a
3 4 d
Alternatively you can use take:
In [3]: df.take([0, -1])
Out[3]:
a b
0 1 a
3 4 d
Here is how to reproduce the same style pandas uses when displaying large datasets - the first and last rows with an ellipsis row in between:
x = df[:5]
y = pd.DataFrame([['...']*df.shape[1]], columns=df.columns, index=['...'])
z = df[-5:]
frame = [x, y, z]
result = pd.concat(frame)
print(result)
Output:
date temp
0 1981-01-01 00:00:00 20.7
1 1981-01-02 00:00:00 17.9
2 1981-01-03 00:00:00 18.8
3 1981-01-04 00:00:00 14.6
4 1981-01-05 00:00:00 15.8
... ... ...
3645 1990-12-27 00:00:00 14
3646 1990-12-28 00:00:00 13.6
3647 1990-12-29 00:00:00 13.5
3648 1990-12-30 00:00:00 15.7
3649 1990-12-31 00:00:00 13

Change values conditionally in Pandas DF with multilevel columns

Given the following DF with multilevel columns:
arrays = [['foo', 'foo', 'bar', 'bar'],
          ['A', 'B', 'C', 'D']]
tuples = list(zip(*arrays))
columnValues = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame(np.random.rand(6, 4), columns=columnValues)
df['txt'] = 'aaa'
print(df)
yields:
        foo                 bar            txt
          A         B         C         D
0  0.080029  0.710943  0.157265  0.774827  aaa
1  0.276949  0.923369  0.550799  0.758707  aaa
2  0.416714  0.440659  0.835736  0.130818  aaa
3  0.935763  0.908967  0.502363  0.677957  aaa
4  0.191245  0.291017  0.014355  0.762976  aaa
5  0.365464  0.286350  0.450263  0.509556  aaa
Question: how do I efficiently change values in the foo sub-columns to 100 if their values are < 0.5, for a huge DF?
the following works:
In [41]: df.foo < 0.5
Out[41]:
A B
0 True False
1 True False
2 True True
3 False False
4 True True
5 True True
In [42]: df.foo[df.foo < 0.5]
Out[42]:
A B
0 0.080029 NaN
1 0.276949 NaN
2 0.416714 0.440659
3 NaN NaN
4 0.191245 0.291017
5 0.365464 0.286350
But if I try to change the values it throws:
In [45]: df.foo[df.foo < 0.5] = 100
C:\Users\USER\AppData\Local\Programs\Python35\Scripts\ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
If I try to use locators (.loc):
In [46]: df.foo.loc[df.foo < 0.5] = 100
...
ValueError: cannot copy sequence with size 2 to array axis with dimension 6
the same error for df.foo.loc[df.foo < 0.5, 'foo'] = 100
If I try:
df.loc[df.foo < 0.5, 'foo']
I get:
KeyError: 'None of [ A B\n0 True False\n1 True False\n2 True True\n3 False False\n4 True True\n5 True True] are in the [index]'
Solutions - timeit comparison against DF with 10M rows:
In [19]: %timeit df.foo.applymap(lambda x: x if x >= 0.5 else 100)
1 loop, best of 3: 29.4 s per loop
In [20]: %timeit df.foo[df.foo >= 0.5].fillna(100)
1 loop, best of 3: 1.55 s per loop
John Galt:
In [21]: %timeit df.foo.where(df.foo >= 0.5, 100)
1 loop, best of 3: 1.12 s per loop
B. M.:
In [5]: %timeit u=df['foo'].values;u[u<.5]=100
1 loop, best of 3: 628 ms per loop
Here's one way using where -- df['foo'] = df['foo'].where(df['foo'] >= 0.5, 100). where keeps the values for which the condition is True and replaces the rest with 100, so everything below 0.5 becomes 100.
In [96]: df
Out[96]:
        foo                 bar            txt
          A         B         C         D
0  0.255309  0.237892  0.491065  0.930555  aaa
1  0.859998  0.008269  0.376213  0.984806  aaa
2  0.479928  0.761266  0.993970  0.266486  aaa
3  0.078284  0.009748  0.461687  0.653085  aaa
4  0.923293  0.642398  0.629140  0.561777  aaa
5  0.936824  0.526626  0.413250  0.732074  aaa
In [97]: df['foo'] = df['foo'].where(df['foo'] >= 0.5, 100)
In [98]: df
Out[98]:
          foo                   bar            txt
            A           B         C         D
0  100.000000  100.000000  0.491065  0.930555  aaa
1    0.859998  100.000000  0.376213  0.984806  aaa
2  100.000000    0.761266  0.993970  0.266486  aaa
3  100.000000  100.000000  0.461687  0.653085  aaa
4    0.923293    0.642398  0.629140  0.561777  aaa
5    0.936824    0.526626  0.413250  0.732074  aaa
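For completeness, mask is the inverse of where (it replaces values where the condition is True), so the same assignment can be written as this sketch:
df['foo'] = df['foo'].mask(df['foo'] < 0.5, 100)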
