I have been playing with pandas for a few hours now, and I was wondering whether there is a faster way to add an extra column to a table that holds the average of each row. Currently I build a new list containing the means and then incorporate it into the DataFrame.
This is my code:
import numpy as np
import pandas as pd
userdata={"A":[2,5],"B":[4,6]}
tab=pd.DataFrame((userdata), columns=["A","B"])
lst=[np.mean([tab.loc[i,"A"],tab.loc[i,"B"]]) for i in range(len(tab.index))]
tab["Average of A and B"]=pd.DataFrame(lst)
tab
Try df.mean(1) with assign. df.mean(1) tells pandas to compute the mean along axis=1, i.e. across each row; axis=0 (down each column) is the default.
df.assign(Mean=df.mean(1))
This produces a copy of df with the new column added.
To alter the existing DataFrame in place:
df['Mean'] = df.mean(1)
Demo:
tab.assign(Mean=tab.mean(1))
A B Mean
0 2 4 3.0
1 5 6 5.5
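For the in-place version on the question's frame, the result is the same:
tab['Mean'] = tab.mean(axis=1)
tab
   A  B  Mean
0  2  4   3.0
1  5  6   5.5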
A NumPy solution would be to work with the underlying array data for performance -
tab['average'] = tab.values.mean(1)
To choose specific columns, like 'A' and 'B' -
tab['average'] = tab[['A','B']].values.mean(1)
Runtime test -
In [41]: tab = pd.DataFrame(np.random.randint(0,9,(10000,10)))
# #piRSquared's soln
In [42]: %timeit tab.assign(Mean=tab.mean(1))
1000 loops, best of 3: 615 µs per loop
In [43]: tab = pd.DataFrame(np.random.randint(0,9,(10000,10)))
In [44]: %timeit tab['average'] = tab.values.mean(1)
1000 loops, best of 3: 297 µs per loop
In [37]: tab = pd.DataFrame(np.random.randint(0,9,(10000,100)))
# #piRSquared's soln
In [38]: %timeit tab.assign(Mean=tab.mean(1))
100 loops, best of 3: 4.71 ms per loop
In [39]: tab = pd.DataFrame(np.random.randint(0,9,(10000,100)))
In [40]: %timeit tab['average'] = tab.values.mean(1)
100 loops, best of 3: 3.6 ms per loop
Given a csv file loaded as a DataFrame using pandas and Python, I'd like to get the value that is in the same row as another value, as efficiently as possible.
To clarify this, I will give an example with the following csv:
STAID SOUID DATE TX Q_TX
162 100522 19010101 -31 0
162 100522 19010102 -13 0
162 100522 19010103 -5 0
162 100522 19010104 -10 0
162 100522 19010105 -18 0
So let's say I'm running the following code:
import pandas as pd
data = pd.read_csv("foo.csv")
max_val = data["TX"].max()
max_val will now be -5. The thing is that I would now like to know the value in 'DATE' that is in the same row as max_val, or in other words: the value in the 'DATE' column sharing the same index as the found value. The desired result is 19010103. What is the most efficient way to do this using only pandas?
UPDATE: I slipped up with the naming; it should obviously be max_val instead of min_val.
We can use idxmax:
df.DATE[df.TX.idxmax()]
Out[346]: 19010103
To enhance the speed, work with the underlying NumPy arrays ('DATE' is at column position 2 in this csv):
df.values[df.TX.values.argmax(), 2]
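A quick check on the sample data, tying argmax and the positional lookup together (this assumes the column order shown in the question; columns.get_loc looks up the position of 'DATE' instead of hard-coding 2):
date_pos = df.columns.get_loc('DATE')         # 2 for this csv
df.values[df.TX.values.argmax(), date_pos]    # 19010103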
Using loc is the "standard" approach and should be highly readable, but using idxmax with at is the fastest answer here (see Wen's answer for where I got the idea). You may want to test with your real data to make sure this small amount of data isn't providing red herrings. From the pandas docs on at:
Fast label-based scalar accessor
Similarly to loc, at provides label based scalar lookups. You can also set using these indexers.
Fastest answer here from my testing:
max_idx = data.TX.idxmax()  # with the index of the max computed ahead of time, outside the timing
%%timeit
data.at[max_idx, 'DATE']
# 100000 loops, best of 3: 6.73 µs per loop
Using %%timeit in Jupyter you can see the times (max_val = data['TX'].max(), as in the question):
%%timeit
data.loc[data['TX'] == max_val]['DATE']
# 1000 loops, best of 3: 604 µs per loop
%%timeit
data[data['TX'] == max_val].DATE  # from the comments, without loc
# 1000 loops, best of 3: 724 µs per loop
%%timeit
data[data['TX']==data['TX'].max()]['DATE']
#1000 loops, best of 3: 575 µs per loop
%%timeit
data.at[data.TX.idxmax(),'DATE'] #using at and idxmax <----
# 10000 loops, best of 3: 69.5 µs per loop
%%timeit
data.at[data.loc[data['TX'] == max_val].index[0], 'DATE']
# 1000 loops, best of 3: 560 µs per loop
I have a series s
s = pd.Series([1, 2])
What is an efficient way to make s look like
0 [1]
1 [2]
dtype: object
Here's one approach that extracts the underlying array and extends it to 2D by introducing a new axis with None/np.newaxis -
pd.Series(s.values[:,None].tolist())
Here's a similar one, but extends to 2D by reshaping -
pd.Series(s.values.reshape(-1,1).tolist())
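On the question's two-element Series, either approach reproduces the requested output:
s = pd.Series([1, 2])
pd.Series(s.values[:,None].tolist())
0    [1]
1    [2]
dtype: object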
Runtime test using #P-robot's setup -
In [43]: s = pd.Series(np.random.randint(1,10,1000))
In [44]: %timeit pd.Series(np.vstack(s.values).tolist()) # #Nickil Maveli's soln
100 loops, best of 3: 5.77 ms per loop
In [45]: %timeit pd.Series([[a] for a in s]) # #P-robot's soln
1000 loops, best of 3: 412 µs per loop
In [46]: %timeit s.apply(lambda x: [x]) # #mgc's soln
1000 loops, best of 3: 551 µs per loop
In [47]: %timeit pd.Series(s.values[:,None].tolist()) # Approach1
1000 loops, best of 3: 307 µs per loop
In [48]: %timeit pd.Series(s.values.reshape(-1,1).tolist()) # Approach2
1000 loops, best of 3: 306 µs per loop
If you want the result to still be a pandas Series you can use the apply method:
In [1]: import pandas as pd
In [2]: s = pd.Series([1, 2])
In [3]: s.apply(lambda x: [x])
Out[3]:
0 [1]
1 [2]
dtype: object
This does it:
import numpy as np
np.array([[a] for a in s],dtype=object)
array([[1],
[2]], dtype=object)
Adjusting atomh33ls' answer, here's a series of lists:
output = pd.Series([[a] for a in s])
type(output)
>> pandas.core.series.Series
type(output[0])
>> list
Timings for a selection of the suggestions:
import numpy as np, pandas as pd
s = pd.Series(np.random.randint(1,10,1000))
>> %timeit pd.Series(np.vstack(s.values).tolist())
100 loops, best of 3: 3.2 ms per loop
>> %timeit pd.Series([[a] for a in s])
1000 loops, best of 3: 393 µs per loop
>> %timeit s.apply(lambda x: [x])
1000 loops, best of 3: 473 µs per loop
I have a pandas DataFrame with 1.5 million rows and 8 columns. I want to combine a few columns into a new column. I know how to do this, but I wanted to know which approach is faster and more efficient. I am reproducing my code here:
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['A','B','C'], data=[[1,2,3],[4,5,6],[7,8,9]])
Now here is what I want to achieve
df['D']=0.5*df['A']+0.3*df['B']+0.2*df['C']
The alternative is to use the apply functionality of pandas:
df['D'] = df.apply(lambda row: 0.5*row['A'] + 0.3*row['B'] + 0.2*row['C'], axis=1)
I wanted to know which method takes less time when we have 1.5 million rows and have to combine 8 columns.
The first method is faster, because it is vectorized:
df = pd.DataFrame(columns=['A','B','C'], data=[[1,2,3],[4,5,6],[7,8,9]])
df = pd.concat([df]*10000).reset_index(drop=True)
print (df)
#[30000 rows x 3 columns]

df['D1'] = 0.5*df['A'] + 0.3*df['B'] + 0.2*df['C']
#similar timings with the mul function
#df['D1'] = df['A'].mul(0.5) + df['B'].mul(0.3) + df['C'].mul(0.2)
df['D'] = df.apply(lambda row: 0.5*row['A'] + 0.3*row['B'] + 0.2*row['C'], axis=1)
print (df)
In [54]: %timeit df['D2']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
The slowest run took 10.84 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 950 µs per loop
In [55]: %timeit df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
The slowest run took 4.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.2 ms per loop
In [56]: %timeit df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
1 loop, best of 3: 928 ms per loop
Another test on a 1.5M-row DataFrame; the apply method is very slow:
df = pd.concat([df]*500000).reset_index(drop=True)
#[1500000 rows x 6 columns]
In [62]: %timeit df['D2']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
10 loops, best of 3: 34.8 ms per loop
In [63]: %timeit df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
10 loops, best of 3: 31.5 ms per loop
In [64]: %timeit df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
1 loop, best of 3: 47.3 s per loop
Using #jezrael's setup
df=pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])
df = pd.concat([df]*30000).reset_index(drop=True)
It is far more efficient to use a dot product:
np.array([[.5, .3, .2]]).dot(df.values.T).T
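If the goal is to assign the result back as a column, a minimal sketch (the 'D3' name and the weights variable are just for illustration) might be:
weights = np.array([0.5, 0.3, 0.2])
df['D3'] = df[['A','B','C']].values.dot(weights)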
Timing
I'm trying to take a slice from a large numpy array as quickly as possible using fancy indexing. I would be happy returning a view, but advanced indexing returns a copy.
I've tried solutions from here and here with no joy so far.
Toy data:
data = np.random.randn(int(1e6), 50)
keep = np.random.rand(len(data))>0.5
Using the default method:
%timeit data[keep]
10 loops, best of 3: 86.5 ms per loop
Numpy take:
%timeit data.take(np.where(keep)[0], axis=0)
%timeit np.take(data, np.where(keep)[0], axis=0)
10 loops, best of 3: 83.1 ms per loop
10 loops, best of 3: 80.4 ms per loop
Method from here:
rows = np.where(keep)[0]
cols = np.arange(data.shape[1])
%timeit (data.ravel()[(cols + (rows * data.shape[1]).reshape((-1,1))).ravel()]).reshape(rows.size, cols.size)
10 loops, best of 3: 159 ms per loop
Whereas if you're taking a view of the same size:
%timeit data[1:-1:2, :]
1000000 loops, best of 3: 243 ns per loop
There's no way to do this with a view. A view needs constant strides, while the rows you want to keep are scattered irregularly throughout the original array.
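A quick way to see the difference is np.shares_memory: a regular slice can be described by strides and shares memory with the original, while a boolean selection cannot and always copies. A small sketch:
import numpy as np

data = np.random.randn(int(1e6), 50)
keep = np.random.rand(len(data)) > 0.5

regular = data[1:-1:2, :]   # regular stride pattern -> view
fancy = data[keep]          # boolean mask -> copy

print(np.shares_memory(regular, data))  # True
print(np.shares_memory(fancy, data))    # False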
My question is about performance only, not semantics.
Does adding a new column to a df cause the data in the existing DataFrame to be physically copied to a new memory location (to ensure that the DataFrame occupies contiguous memory, for example)?
# using pandas 0.18.1, python 3.5
import pandas as pd
df = pd.DataFrame({'a': range(100)})
b = pd.Series(range(100))
df['b'] = b # is this operation expensive?
# equivalently df.loc[:, 'b'] = b
I know (from experimentation; I couldn't find it in the documentation) that df['b'] = b will semantically create a copy of b, which obviously requires copying the underlying data. But I have no idea whether the data in the other columns can stay where it was, or sometimes needs to be moved.
Edit:
I know that adding a large number of columns is expensive. I'm only asking about adding a single column.
I also know that adding a row requires copying the data in some cases (or always? I'm not sure), for the obvious reason that the items in a single column have to be in contiguous memory.
From my experiments I think loc is slower, and aligning a new Series with a different index is the slowest:
But I have no idea if the data in the other columns can stay where it was, or need to be moved sometimes.
I think the data are not moved; new columns are added at the end (maybe there are exceptions, but I don't know of any).
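One way to probe this yourself is np.shares_memory on an existing column before and after the insert (a sketch only; the outcome may differ across pandas versions, since the internal BlockManager decides when blocks get consolidated and copied):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(100)})
a_before = df['a'].values        # view into the existing block

df['b'] = pd.Series(range(100))  # add a single new column

# True would mean the 'a' data was not physically moved by the insert;
# the result depends on the pandas version and internal consolidation.
print(np.shares_memory(a_before, df['a'].values))
The timings below then compare the different ways of adding a column: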
# using pandas 0.18.1, python 3.5
import pandas as pd
#len(df) = 10m
df = pd.DataFrame({'a': range(10000000)})
b = pd.Series(range(10000000))
c = pd.Series(range(10000000), index=df.index)
df['b'] = b
df.loc[:, 'c'] = b
df['d'] = c
df.loc[:, 'e'] = c
print (df)
In [36]: %timeit df['b'] = b
10 loops, best of 3: 23.5 ms per loop
In [37]: %timeit df.loc[:, 'c'] = b
The slowest run took 5.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 40 ms per loop
In [38]: %timeit df['d'] = c
10 loops, best of 3: 22.3 ms per loop
In [39]: %timeit df.loc[:, 'e'] = c
10 loops, best of 3: 39.5 ms per loop
But if the index is changed:
# using pandas 0.18.1, python 3.5
import pandas as pd
df = pd.DataFrame({'a': range(10000000)})
df.index = df.index + 15
b = pd.Series(range(10000000))
c = pd.Series(range(10000000), index=df.index)
df['b'] = b
df.loc[:, 'c'] = b
df['d'] = c
df.loc[:, 'e'] = c
print (df)
In [41]: %timeit df['b'] = b
1 loop, best of 3: 656 ms per loop
In [42]: %timeit df.loc[:, 'c'] = b
1 loop, best of 3: 735 ms per loop
In [43]: %timeit df['d'] = c
10 loops, best of 3: 22.4 ms per loop
In [44]: %timeit df.loc[:, 'e'] = c
10 loops, best of 3: 56.6 ms per loop
Adding a single new row is fast; I think it depends on the length of the Series:
In [68]: %timeit df.loc[10000015, :] = pd.Series([1,2,3,2,4], index=df.columns)
1000 loops, best of 3: 274 µs per loop
But adding many rows one by one is expensive, and I think it should be avoided.
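A minimal sketch of the usual workaround when many rows are needed: collect them first and concat once instead of growing the DataFrame row by row (the row values are just the ones from the timing above, repeated):
rows = [pd.Series([1, 2, 3, 2, 4], index=df.columns) for _ in range(1000)]
df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)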