My question is about performance only, not semantics.
Does adding a new column to a df cause the data in the existing DataFrame to be physically copied to a new memory location (to ensure that the DataFrame occupies contiguous memory, for example)?
# using pandas 0.18.1, python 3.5
import pandas as pd
df = pd.DataFrame({'a': range(100)})
b = pd.Series(range(100))
df['b'] = b # is this operation expensive?
# equivalently df.loc[:, 'b'] = b
I know (from experimentation; I couldn't find it in the documentation) that df['b'] = b will semantically create a copy of b, which obviously requires copying its underlying data. But I have no idea whether the data in the other columns can stay where it is, or sometimes needs to be moved.
Edit:
I know that adding a large number of columns is expensive. I'm only asking about adding a single column.
I also know that adding a row requires copying the data in some cases (or always? I'm not sure), for the obvious reason that the items in a single column have to be in contiguous memory.
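One rough way to probe this (a sketch I put together, not authoritative; block-consolidation details can vary across pandas versions) is to check whether column 'a' still shares memory with its original underlying array after the new column is inserted:
# a minimal check, assuming NumPy's np.shares_memory is available
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': range(100)})
a_before = df['a'].values          # underlying array of the existing column
df['b'] = pd.Series(range(100))    # add a single new column
a_after = df['a'].values
# True suggests column 'a' was not physically copied by the insertion
# (at least not immediately -- a later internal consolidation could still copy it)
print(np.shares_memory(a_before, a_after))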
From my experiments I think loc is slower, and aligning a new Series with a different index is the slowest:
But I have no idea whether the data in the other columns can stay where it is, or sometimes needs to be moved.
I think the data are not moved; new columns are added at the end (there may be some exceptions, but I don't know of any).
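A rough way to see this on the pandas version used here (0.18.x) is to print the internal BlockManager; this is a private attribute and is shown only for illustration:
import pandas as pd
df = pd.DataFrame({'a': range(100)})
df['b'] = pd.Series(range(100))
# df._data is the internal BlockManager (private API, may change in newer versions);
# right after the insertion, 'a' and 'b' typically live in separate blocks,
# so the data of 'a' did not have to move
print(df._data)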
# using pandas 0.18.1, python 3.5
import pandas as pd
#len(df) = 10m
df = pd.DataFrame({'a': range(10000000)})
b = pd.Series(range(10000000))
c = pd.Series(range(10000000), index=df.index)
df['b'] = b
df.loc[:, 'c'] = b
df['d'] = c
df.loc[:, 'e'] = c
print (df)
In [36]: %timeit df['b'] = b
10 loops, best of 3: 23.5 ms per loop
In [37]: %timeit df.loc[:, 'c'] = b
The slowest run took 5.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 40 ms per loop
In [38]: %timeit df['d'] = c
10 loops, best of 3: 22.3 ms per loop
In [39]: %timeit df.loc[:, 'e'] = c
10 loops, best of 3: 39.5 ms per loop
But if the index is changed:
# using pandas 0.18.1, python 3.5
import pandas as pd
df = pd.DataFrame({'a': range(10000000)})
df.index = df.index + 15
b = pd.Series(range(10000000))
c = pd.Series(range(10000000), index=df.index)
df['b'] = b
df.loc[:, 'c'] = b
df['d'] = c
df.loc[:, 'e'] = c
print (df)
In [41]: %timeit df['b'] = b
1 loop, best of 3: 656 ms per loop
In [42]: %timeit df.loc[:, 'c'] = b
1 loop, best of 3: 735 ms per loop
In [43]: %timeit df['d'] = c
10 loops, best of 3: 22.4 ms per loop
In [44]: %timeit df.loc[:, 'e'] = c
10 loops, best of 3: 56.6 ms per loop
Adding a single new row is fast; I think the time depends on the length of the Series:
In [68]: %timeit df.loc[10000015, :] = pd.Series([1,2,3,2,4], index=df.columns)
1000 loops, best of 3: 274 µs per loop
But adding many rows one by one is expensive, and I think it should be avoided.
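If many rows really have to be added, one common workaround (my own sketch, not part of the timings above) is to collect the new rows first and concatenate once, so the existing data is copied only a single time:
import pandas as pd
df = pd.DataFrame({'a': range(5), 'b': range(5)})
# build the new rows separately instead of assigning df.loc[...] in a loop,
# then pay for one copy of the existing data with a single concat
new_rows = pd.DataFrame({'a': range(5, 10), 'b': range(5, 10)})
df = pd.concat([df, new_rows], ignore_index=True)
print(len(df))   # 10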
I have a DataFrame from which I remove some rows. As a result, I get a DataFrame whose index looks like [1,5,6,10,11], and I would like to reset it to [0,1,2,3,4]. How can I do that?
The following seems to work:
df = df.reset_index()
del df['index']
The following does not work:
df = df.reindex()
DataFrame.reset_index is what you're looking for. If you don't want it saved as a column, then do:
df = df.reset_index(drop=True)
If you don't want to reassign:
df.reset_index(drop=True, inplace=True)
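A quick illustration with a gapped index like the one in the question:
import pandas as pd
df = pd.DataFrame({'x': [10, 20, 30, 40, 50]}, index=[1, 5, 6, 10, 11])
df = df.reset_index(drop=True)    # old index is discarded, not kept as a column
print(df.index.tolist())          # [0, 1, 2, 3, 4]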
Other solutions are to assign a RangeIndex or a range:
df.index = pd.RangeIndex(len(df.index))
df.index = range(len(df.index))
These are faster:
df = pd.DataFrame({'a':[8,7], 'c':[2,4]}, index=[7,8])
df = pd.concat([df]*10000)
print (df.head())
In [298]: %timeit df1 = df.reset_index(drop=True)
The slowest run took 7.26 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 105 µs per loop
In [299]: %timeit df.index = pd.RangeIndex(len(df.index))
The slowest run took 15.05 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.84 µs per loop
In [300]: %timeit df.index = range(len(df.index))
The slowest run took 7.10 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 14.2 µs per loop
data1.reset_index(inplace=True)
I have been playing with pandas for a few hours now, and I was wondering whether there is a faster way to add an extra column containing the average of each row. Currently I create a new list with the means and then incorporate it into the DataFrame.
This is my code:
import numpy as np
import pandas as pd
userdata={"A":[2,5],"B":[4,6]}
tab=pd.DataFrame((userdata), columns=["A","B"])
lst=[np.mean([tab.loc[i,"A"],tab.loc[i,"B"]]) for i in range(len(tab.index))]
tab["Average of A and B"]=pd.DataFrame(lst)
tab
Try df.mean(1) with assign. df.mean(1) tells pandas to calculate the mean along axis=1, i.e. one mean per row; axis=0 is the default.
df.assign(Mean=df.mean(1))
This produces a copy of df with the added column.
To alter the existing DataFrame:
df['Mean'] = df.mean(1)
Demo:
tab.assign(Mean=tab.mean(1))
A B Mean
0 2 4 3.0
1 5 6 5.5
A NumPy solution would be to work with the underlying array data for performance -
tab['average'] = tab.values.mean(1)
To choose specific columns, like 'A' and 'B' -
tab['average'] = tab[['A','B']].values.mean(1)
Runtime test -
In [41]: tab = pd.DataFrame(np.random.randint(0,9,(10000,10)))
# @piRSquared's soln
In [42]: %timeit tab.assign(Mean=tab.mean(1))
1000 loops, best of 3: 615 µs per loop
In [43]: tab = pd.DataFrame(np.random.randint(0,9,(10000,10)))
In [44]: %timeit tab['average'] = tab.values.mean(1)
1000 loops, best of 3: 297 µs per loop
In [37]: tab = pd.DataFrame(np.random.randint(0,9,(10000,100)))
# @piRSquared's soln
In [38]: %timeit tab.assign(Mean=tab.mean(1))
100 loops, best of 3: 4.71 ms per loop
In [39]: tab = pd.DataFrame(np.random.randint(0,9,(10000,100)))
In [40]: %timeit tab['average'] = tab.values.mean(1)
100 loops, best of 3: 3.6 ms per loop
I have a pandas DataFrame with 1.5 million rows and 8 columns. I want to combine a few columns to create a new column. I know how to do this, but wanted to know which way is faster and more efficient. I am reproducing my code here:
import pandas as pd
import numpy as np
df=pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])
Now here is what I want to achieve
df['D']=0.5*df['A']+0.3*df['B']+0.2*df['C']
The other alternative is to use the apply functionality of pandas
df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
I wanted to know which method takes less time when we have 1.5 million rows and have to combine 8 columns.
The first method is faster, because it is vectorized:
df=pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])
print (df)
#[30000 rows x 3 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
#similar timings with mul function
#df['D1']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
print (df)
In [54]: %timeit df['D2']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
The slowest run took 10.84 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 950 µs per loop
In [55]: %timeit df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
The slowest run took 4.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.2 ms per loop
In [56]: %timeit df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
1 loop, best of 3: 928 ms per loop
Another test, on a DataFrame with 1.5M rows; the apply method is very slow:
#[1500000 rows x 6 columns]
df = pd.concat([df]*500000).reset_index(drop=True)
In [62]: %timeit df['D2']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
10 loops, best of 3: 34.8 ms per loop
In [63]: %timeit df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
10 loops, best of 3: 31.5 ms per loop
In [64]: %timeit df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
1 loop, best of 3: 47.3 s per loop
Using @jezrael's setup
df=pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])
df = pd.concat([df]*30000).reset_index(drop=True)
Far more efficient to use a dot product.
np.array([[.5, .3, .2]]).dot(df.values.T).T
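For reference, assigning the dot-product result back as a column could look like this (an illustrative sketch; the column name D3 is my own):
import numpy as np
# one matrix-vector product over the three columns instead of three scaled additions
df['D3'] = df[['A', 'B', 'C']].values.dot(np.array([0.5, 0.3, 0.2]))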
I have a Pandas DataFrame which has a list of integers inside one of the columns. I'd like to access the individual elements within this list. I've found a way to do it by using tolist() and turning it back into a DataFrame, but I am wondering if there is a simpler/better way. In this example, I add Column A to the middle element of the list in Column B.
import pandas as pd
df = pd.DataFrame({'A' : (1,2,3), 'B': ([0,1,2],[3,4,5,],[6,7,8])})
df['C'] = df['A'] + pd.DataFrame(df['B'].tolist())[1]
df
Is there a better way to do this?
A bit more straightforward is:
df['C'] = df['A'] + df['B'].apply(lambda x:x[1])
One option is to use apply, which should be faster than creating a DataFrame out of the column:
df['C'] = df['A'] + df.apply(lambda row: row['B'][1], axis = 1)
Some speed tests:
%timeit df['C'] = df['A'] + pd.DataFrame(df['B'].tolist())[1]
# 1000 loops, best of 3: 567 µs per loop
%timeit df['C'] = df['A'] + df.apply(lambda row: row['B'][1], axis = 1)
# 1000 loops, best of 3: 406 µs per loop
%timeit df['C'] = df['A'] + df['B'].apply(lambda x:x[1])
# 1000 loops, best of 3: 250 µs per loop
OK, slightly better. @breucopter's answer is the fastest.
You can also simply try the following:
df['C'] = df['A'] + df['B'].str[1]
Performance of this method:
%timeit df['C'] = df['A'] + df['B'].str[1]
#1000 loops, best of 3: 445 µs per loop
I want to combine 2 separate DataFrames of the following shape in Python pandas:
Df1 =
A B
1 1 2
2 3 4
3 5 6
Df2 =
C D
1 a b
2 c d
3 e f
I want to have as follows:
df =
A B C D
1 1 2 a b
2 3 4 c d
3 5 6 e f
I am using the following code:
dat = df1.join(df2)
But the problem is that my actual DataFrames have more than 2 million rows, and the join takes too long and consumes a huge amount of memory.
Is there any way to do this faster and with less memory?
Thank you in advance for helping.
If I've read your question correctly, your indexes align exactly and you just need to combine the columns into a single DataFrame. If that's right, then copying a column over from one DataFrame to the other turns out to be the fastest way to go (In [92] and In [93] below). f is my DataFrame in the example below:
In [85]: len(f)
Out[85]: 343720
In [87]: a = f.loc[:, ['date_val', 'price']]
In [88]: b = f.loc[:, ['red_date', 'credit_spread']]
In [89]: %timeit c = pd.concat([a, b], axis=1)
100 loops, best of 3: 7.11 ms per loop
In [90]: %timeit c = pd.concat([a, b], axis=1, ignore_index=True)
100 loops, best of 3: 10.8 ms per loop
In [91]: %timeit c = a.join(b)
100 loops, best of 3: 6.47 ms per loop
In [92]: %timeit a['red_date'] = b['red_date']
1000 loops, best of 3: 1.17 ms per loop
In [93]: %timeit a['credit_spread'] = b['credit_spread']
1000 loops, best of 3: 1.16 ms per loop
I also tried to copy both columns at once but for some strange reason it was more than two times slower than copying each column individually.
In [94]: %timeit a[['red_date', 'credit_spread']] = b[['red_date', 'credit_spread']]
100 loops, best of 3: 5.09 ms per loop
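Putting the fastest option together: when the indexes of the two frames align exactly, copy the columns over one at a time (a small sketch based on the timings above, using made-up frames matching the question):
import pandas as pd
df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [2, 4, 6]})
df2 = pd.DataFrame({'C': ['a', 'c', 'e'], 'D': ['b', 'd', 'f']})
# both frames share the same index, so a plain column-by-column copy skips
# most of the alignment work that join/concat would do
for col in df2.columns:
    df1[col] = df2[col]
print(df1)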