I have an example dataframe that looks like the one below. I want to make a calculation and then append the result as a new column to the current dataframe.
A, B # this is my df, a csv file
1, 2
3, 3
7, 6
13, 14
Below is some code I have tried.
for i in range(0,len(df.index)+1,1):
    if len(df.index)-1 == i:
        df['C'] = str(df.iloc[i]['A'] / df.iloc[i]['B'])
    else:
        df['C'] = str((df.iloc[i+1]['A'] - df.iloc[i]['A']) / (df.iloc[i+1]['B'] - df.iloc[i]['B'])) # I need string as dtype
df.to_csv(Out, index = False)
This only gives me the result of the final loop iteration in every row, not the result of each row's own calculation.
A B C
1 2 2
3 3 1.33
7 6 0.75
13 14 0.93 # It is the result I'd like to see.
Does anyone know how to revise it? Thanks in advance!
UPDATE: a much more elegant solution (one-liner) from @root:
In [131]: df['C'] = (df.A.shift(-1).sub(df.A, fill_value=0) / df.B.shift(-1).sub(df.B, fill_value=0)).round(2).astype(str)
In [132]: df
Out[132]:
A B C
0 1 2 2.0
1 3 3 1.33
2 7 6 0.75
3 13 14 0.93
In [133]: df.dtypes
Out[133]:
A int64
B int64
C object
dtype: object
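The one-liner works because of how fill_value treats the gap that shift(-1) leaves in the last row. The following is a minimal sketch of the intermediate steps, assuming the four-row example frame from the question; the names next_a, num and den are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 3, 7, 13], "B": [2, 3, 6, 14]})

# shift(-1) moves every value up one row, leaving NaN in the last row
next_a = df["A"].shift(-1)

# sub(..., fill_value=0) treats that NaN as 0, so the last row becomes 0 - A
num = next_a.sub(df["A"], fill_value=0)
den = df["B"].shift(-1).sub(df["B"], fill_value=0)

# In the last row (-A) / (-B) equals A / B, the fallback the question asks for
df["C"] = (num / den).round(2).astype(str)
```

One caveat: a NaN that is already present in A or B would also be treated as 0 by fill_value, so this shortcut assumes complete data.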
You can do it this way:
df['C'] = (df.A.shift(-1) - df.A) / (df.B.shift(-1) - df.B)
df.loc[df.index.max(), 'C'] = df.loc[df.index.max(), 'A'] / df.loc[df.index.max(), 'B']
df.round(2)
yields:
In [118]: df.round(2)
Out[118]:
A B C
0 1 2 2.00
1 3 3 1.33
2 7 6 0.75
3 13 14 0.93
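For comparison, the loop from the question can also be repaired. It produced one value everywhere because df['C'] = value assigns a scalar to the whole column on every pass, so only the last assignment survives. A minimal sketch, assuming the same example frame, that collects one value per row and assigns the column once:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 3, 7, 13], "B": [2, 3, 6, 14]})

values = []
last = len(df) - 1
for i in range(len(df)):
    if i == last:
        # The last row has no successor: fall back to A / B
        value = df["A"].iloc[i] / df["B"].iloc[i]
    else:
        value = (df["A"].iloc[i + 1] - df["A"].iloc[i]) / (
            df["B"].iloc[i + 1] - df["B"].iloc[i]
        )
    # Store each result as a rounded string, one entry per row
    values.append(str(round(float(value), 2)))

# Assign the whole column once, after the loop
df["C"] = values
```

The vectorised versions above are likely to be faster on large frames; the loop is shown only to make the cause of the original problem visible.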
I have a dataframe with columns (A, B and value) where there are missing values in the value column. And there is a Series indexed by two columns (A and B) from the dataframe. How can I fill the missing values in the dataframe with corresponding values in the series?
I think you need fillna with set_index and reset_index:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1,1,3],
                   'B': [2,3,4],
                   'value': [2,np.nan,np.nan]})
print (df)
A B value
0 1 2 2.0
1 1 3 NaN
2 3 4 NaN
idx = pd.MultiIndex.from_product([[1,3],[2,3,4]])
s = pd.Series([5,6,0,8,9,7], index=idx)
print (s)
1  2    5
   3    6
   4    0
3  2    8
   3    9
   4    7
dtype: int64
df = df.set_index(['A','B'])['value'].fillna(s).reset_index()
print (df)
A B value
0 1 2 2.0
1 1 3 6.0
2 3 4 7.0
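Note that set_index(['A','B'])['value'] keeps only the three columns involved. A minimal sketch of a variant that leaves any other columns in place, assuming the series index is unique; the extra column other is hypothetical and only there to show that it survives:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 3],
                   "B": [2, 3, 4],
                   "value": [2, np.nan, np.nan],
                   "other": ["x", "y", "z"]})  # hypothetical extra column
idx = pd.MultiIndex.from_product([[1, 3], [2, 3, 4]])
s = pd.Series([5, 6, 0, 8, 9, 7], index=idx)

# Look up s once per row, in row order, using the (A, B) pairs as keys
keys = pd.MultiIndex.from_frame(df[["A", "B"]])
lookup = pd.Series(s.reindex(keys).to_numpy(), index=df.index)

# fillna aligns on df's own index, so only the missing entries are replaced
df["value"] = df["value"].fillna(lookup)
```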
Consider the dataframe and series df and s
df = pd.DataFrame(dict(
A=list('aaabbbccc'),
B=list('xyzxyzxyz'),
value=[1, 2, np.nan, 4, 5, np.nan, 7, 8, 9]
))
s = pd.Series(range(1, 10)[::-1])
s.index = [df.A, df.B]
We can fillna with a clever join
df.fillna(df.join(s.rename('value'), on=['A', 'B'], lsuffix='_'))
# s.rename('value'): make the series the same name as the column we are filling
# lsuffix='_':       get the old column out of the way
A B value
0 a x 1.0
1 a y 2.0
2 a z 7.0
3 b x 4.0
4 b y 5.0
5 b z 4.0
6 c x 7.0
7 c y 8.0
8 c z 9.0
Timing
join is cute, but @jezrael's set_index is quicker
%timeit df.fillna(df.join(s.rename('value'), on=['A', 'B'], lsuffix='_'))
100 loops, best of 3: 3.56 ms per loop
%timeit df.set_index(['A','B'])['value'].fillna(s).reset_index()
100 loops, best of 3: 2.06 ms per loop
I have a pandas dataframe A of size (1500,5) and a dictionary D containing:
D
Out[121]:
{'newcol1': 'a',
'newcol2': 2,
'newcol3': 1}
For each key in the dictionary, I would like to create a new column in dataframe A that holds the corresponding value (the same value in every row of that column).
At the end, A should be of size (1500,8).
Is there a Pythonic way to do this? Thanks!
You can use concat with DataFrame constructor:
D = {'newcol1': 'a',
'newcol2': 2,
'newcol3': 1}
df = pd.DataFrame({'A':[1,2],
'B':[4,5],
'C':[7,8]})
print (df)
A B C
0 1 4 7
1 2 5 8
print (pd.concat([df, pd.DataFrame(D, index=df.index)], axis=1))
A B C newcol1 newcol2 newcol3
0 1 4 7 a 2 1
1 2 5 8 a 2 1
Timings:
D = {'newcol1': 'a',
'newcol2': 2,
'newcol3': 1}
df = pd.DataFrame(np.random.rand(10000000, 5), columns=list('abcde'))
In [37]: %timeit pd.concat([df, pd.DataFrame(D, index=df.index)], axis=1)
The slowest run took 18.06 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 875 ms per loop
In [38]: %timeit df.assign(**D)
1 loop, best of 3: 1.22 s per loop
setup
A = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
d = {
'newcol1': 'a',
'newcol2': 2,
'newcol3': 1
}
solution
Use assign
A.assign(**d)
a b c d e newcol1 newcol2 newcol3
0 0.709249 0.275538 0.135320 0.939448 0.549480 a 2 1
1 0.396744 0.513155 0.063207 0.198566 0.487991 a 2 1
2 0.230201 0.787672 0.520359 0.165768 0.616619 a 2 1
3 0.300799 0.554233 0.838353 0.637597 0.031772 a 2 1
4 0.003613 0.387557 0.913648 0.997261 0.862380 a 2 1
5 0.504135 0.847019 0.645900 0.312022 0.715668 a 2 1
6 0.857009 0.313477 0.030833 0.952409 0.875613 a 2 1
7 0.488076 0.732990 0.648718 0.389069 0.301857 a 2 1
8 0.187888 0.177057 0.813054 0.700724 0.653442 a 2 1
9 0.003675 0.082438 0.706903 0.386046 0.973804 a 2 1
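As a quick check of the size requirement from the question, here is a minimal sketch assuming a placeholder frame of zeros with shape (1500, 5). It also shows a plain loop that adds the columns in place, since assign returns a new frame and leaves A untouched.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame(np.zeros((1500, 5)), columns=list("abcde"))  # placeholder data
D = {"newcol1": "a", "newcol2": 2, "newcol3": 1}

# assign returns a new DataFrame; each scalar is broadcast to every row
result = A.assign(**D)
shape_after_assign = result.shape
shape_of_original = A.shape  # the original frame is unchanged at this point

# In-place alternative: one scalar assignment per key
for name, value in D.items():
    A[name] = value
```

Because assign(**D) unpacks the dictionary into keyword arguments, it only works when every key is a string; the loop has no such restriction.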
I have these two data frames with different row indices, but some of their columns are the same.
What I want is a data frame that adds the numbers of the two data frames in the columns that share a name.
df1 = pd.DataFrame([(1,2,3),(3,4,5),(5,6,7)], columns=['a','b','d'], index = ['A', 'B','C'])
df1
a b d
A 1 2 3
B 3 4 5
C 5 6 7
df2 = pd.DataFrame([(10,20,30)], columns=['a','b','c'])
df2
a b c
0 10 20 30
Output dataframe:
a b d
A 11 22 3
B 13 24 5
C 15 26 7
What's the best way to do this? .add() doesn't seem to work with data frames that have different indices.
This one-liner does the trick:
In [30]: df1 + df2.loc[0].reindex(df1.columns).fillna(0)
Out[30]:
a b d
A 11 22 3
B 13 24 5
C 15 26 7
Here's one way to do it.
Extract common columns on which you want to add from df1 and df2.
In [153]: col1 = df1.columns
In [154]: col2 = df2.columns
In [155]: cols = list(set(col1) & set(col2))
In [156]: cols
Out[156]: ['a', 'b']
And, now add the values
In [157]: dff = df1.copy()  # copy, so that df1 itself is left unchanged
In [158]: dff[cols] = df1[cols].values+df2[cols].values
In [159]: dff
Out[159]:
a b d
A 11 22 3
B 13 24 5
C 15 26 7
I think this might be the shortest approach:
In [36]:
print(df1 + df2.reindex(columns=df1.columns).loc[0].fillna(0))
a b d
A 11 22 3
B 13 24 5
C 15 26 7
Step 1 results in a Series like this:
In [44]:
df2.reindex(columns=df1.columns).loc[0]
Out[44]:
a 10
b 20
d NaN
Name: 0, dtype: float64
Filling the NaN with 0 and adding the result to df1 is then enough, because the Series index is aligned with df1's columns:
In [45]:
df2.reindex(columns=df1.columns).loc[0].fillna(0)
Out[45]:
a 10
b 20
d 0
Name: 0, dtype: float64
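A minimal sketch of the same alignment idea written with iloc and reindex(..., fill_value=0), assuming df1 has exactly three rows labelled A to C as in the displayed output:

```python
import pandas as pd

df1 = pd.DataFrame([(1, 2, 3), (3, 4, 5), (5, 6, 7)],
                   columns=["a", "b", "d"], index=["A", "B", "C"])
df2 = pd.DataFrame([(10, 20, 30)], columns=["a", "b", "c"])

# Take the single row of df2 as a Series indexed by column name,
# keep only df1's columns, and use 0 where df2 has no such column
row = df2.iloc[0].reindex(df1.columns, fill_value=0)

# DataFrame + Series aligns the Series index with the DataFrame's columns
result = df1 + row
```

Column c of df2 is dropped by the reindex because df1 has no such column, and column d of df1 is left as it was because 0 is added to it.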
I have a data frame df1 and list x:
In [22]: import pandas as pd
In [23]: df1 = pd.DataFrame({'C': range(5), "B":range(10,20,2), "A":list('abcde')})
In [24]: df1
Out[24]:
A B C
0 a 10 0
1 b 12 1
2 c 14 2
3 d 16 3
4 e 18 4
In [25]: x = ["b","c","g","h","j"]
What I want to do is to select rows in data frame based on the list.
Returning
A B C
1 b 12 1
2 c 14 2
What's the way to do it?
I tried this but failed.
df1.join(pd.DataFrame(x),how="inner")
Use isin to return a boolean index for you to index into your df:
In [152]:
df1[df1['A'].isin(x)]
Out[152]:
A B C
1 b 12 1
2 c 14 2
This is what isin is returning:
In [153]:
df1['A'].isin(x)
Out[153]:
0 False
1 True
2 True
3 False
4 False
Name: A, dtype: bool
Use df[df["column"].isin(values)]
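A minimal sketch of two related selections, assuming the same df1 and x as in the question: the complement via ~, and the same filter written with DataFrame.query.

```python
import pandas as pd

df1 = pd.DataFrame({"C": range(5), "B": range(10, 20, 2), "A": list("abcde")})
x = ["b", "c", "g", "h", "j"]

mask = df1["A"].isin(x)   # boolean Series, True where A is in x
selected = df1[mask]      # rows whose A is in x
excluded = df1[~mask]     # rows whose A is not in x

# query can reference a local variable with the @ prefix
same_selection = df1.query("A in @x")
```

Entries of x that never occur in column A (here g, h and j) are simply ignored.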
Suppose I have this table:
A B C
2 1 4
1 8 2
...
I want to divide columns A and B in each row by that row's value in column C, so that I get:
A B C
0.5 0.25 4
0.5 4 2
How can I do this with a pandas dataframe?
Use the built-in div method and pass the parameter axis=0:
In [123]:
df[['A','B']] = df[['A','B']].div(df['C'],axis=0)
df
Out[123]:
A B C
0 0.5 0.25 4
1 0.5 4.00 2
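The axis=0 argument matters because, by default, div with a Series matches the Series index against the frame's column labels. A minimal sketch of the difference, assuming the two-row table from the question with a default integer index:

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 1], "B": [1, 8], "C": [4, 2]})

# Default axis: the Series labels 0 and 1 are matched against the column
# labels 'A' and 'B'; nothing matches, so A and B come back as NaN
wrong = df[["A", "B"]].div(df["C"])

# axis=0: the Series labels are matched against the row index instead
right = df[["A", "B"]].div(df["C"], axis=0)
```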
You can simply do df[col] = df[col] / df[col2] where col and col2 could be 'A' and 'C', for example.
The code below will divide A and B by C in turn. You can ignore the first part; that's just me setting the DataFrame up.
import pandas as pd
from io import StringIO
s = '''A B C
2 1 4
1 8 2'''
df = pd.read_csv(StringIO(s), sep=r'\s+')  # raw string, so that \s is not treated as a string escape
print(df)
# A B C
# 0 2 1 4
# 1 1 8 2
df['A'] = df['A'] / df['C'] # or df['A'] /= df['C']
df['B'] = df['B'] / df['C'] # or df['B'] /= df['C']
print(df)
# A B C
# 0 0.5 0.25 4
# 1 0.5 4.00 2
To make this easier (if you have multiple columns), you could keep a list of column names such as ['A', 'B'] and then iterate over it:
for x in ['A', 'B']:
    df[x] /= df['C']
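If the list of columns should not be typed out by hand, a minimal sketch, assuming every column other than C is numeric and should be divided:

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 1], "B": [1, 8], "C": [4, 2]})

# Every column label except the divisor column
cols = df.columns.drop("C")

# Divide all of them row-wise by C in one step
df[cols] = df[cols].div(df["C"], axis=0)
```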