Get value in csv file in same row - python

So given a CSV file read into a DataFrame using pandas, I'd like to get the value that sits in the same row as another value, as efficiently as possible.
To clarify, here is an example with the following CSV:
STAID SOUID DATE TX Q_TX
162 100522 19010101 -31 0
162 100522 19010102 -13 0
162 100522 19010103 -5 0
162 100522 19010104 -10 0
162 100522 19010105 -18 0
So let's say I'm running the following code:
import pandas as pd
data = pd.read_csv("foo.csv")
max_val = data["TX"].max()
max_val will now hold a value of -5. I would now like to know the value in 'DATE' that is in the same row as max_val, or in other words: the value in the 'DATE' column sharing the same index as the found value. The desired value I'm aiming for is 19010103. What is the most efficient way to do this using only pandas?
UPDATE: I derped a bit with min_val in the original post; it should obviously be max_val instead of min_val.

We can use idxmax:
df.DATE[df.TX.idxmax()]
Out[346]: 19010103
To squeeze out more speed, work on the underlying NumPy arrays (this assumes DATE is the third column, position 2):
df.values[df.TX.values.argmax(), 2]
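A slightly more readable equivalent that avoids hard-coding the column position, while still staying on the raw arrays, might be:
df['DATE'].values[df['TX'].values.argmax()]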

Using loc is the "standard" approach and should be highly readable, but using idxmax with at is your FASTEST answer here (see Wen's answer for where I got the idea). You may want to test with your real data to ensure this small amount of data isn't producing red herrings. From the docs on at:
Fast label-based scalar accessor
Similarly to loc, at provides label based scalar lookups. You can also set using these indexers.
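For illustration, a minimal sketch of .at getting and setting a single cell (the tiny frame and its values are made up for the example):
import pandas as pd
df = pd.DataFrame({'TX': [-31, -13, -5], 'DATE': [19010101, 19010102, 19010103]})
df.at[2, 'DATE']      # scalar lookup by row label and column label -> 19010103
df.at[0, 'TX'] = -30  # .at can also set a single value in place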
Fastest answer here from my testing:
max_idx = data.TX.idxmax()  # index of the max row, computed ahead of the timing
max_val = data['TX'].max()  # the max value itself, used by the boolean-mask timings below
%%timeit
data.at[max_idx,'DATE']
# 100000 loops, best of 3: 6.73 µs per loop
Using %%timeit in Jupyter you can see the times:
%%timeit
data.loc[data['TX'] == max_val]['DATE']
# 1000 loops, best of 3: 604 µs per loop
%%timeit
data[data['TX']==max_val].DATE # from the comments, without loc
# 1000 loops, best of 3: 724 µs per loop
%%timeit
data[data['TX']==data['TX'].max()]['DATE']
#1000 loops, best of 3: 575 µs per loop
%%timeit
data.at[data.TX.idxmax(),'DATE'] #using at and idxmax <----
# 10000 loops, best of 3: 69.5 µs per loop
%%timeit
data.at[data.loc[data['TX'] == max_val].index[0],'DATE']
# 1000 loops, best of 3: 560 µs per loop
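Putting the pieces together, a minimal end-to-end sketch (the sep argument assumes the file really is whitespace-separated as displayed above; drop it for a plain comma-separated file):
import pandas as pd
data = pd.read_csv("foo.csv", sep=r"\s+")  # whitespace-separated sample
max_idx = data['TX'].idxmax()              # index label of the row with the largest TX
date_of_max = data.at[max_idx, 'DATE']     # value of DATE in that same row -> 19010103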

Related

Averaging indices of table using pandas and numpy

I have been playing with pandas for a few hours now and was wondering whether there is a faster way to add an extra column to a table that holds the average of each row. Currently I am creating a new list containing the means and then incorporating it into the data frame.
This is my code:
import numpy as np
import pandas as pd
userdata={"A":[2,5],"B":[4,6]}
tab=pd.DataFrame((userdata), columns=["A","B"])
lst=[np.mean([tab.loc[i,"A"],tab.loc[i,"B"]]) for i in range(len(tab.index))]
tab["Average of A and B"]=pd.DataFrame(lst)
tab
Try df.mean(1) with assign: df.mean(1) tells pandas to calculate the mean along axis=1 (rows); axis=0 is the default.
df.assign(Mean=df.mean(1))
This produces a copy of df with the added column.
To alter the existing DataFrame in place:
df['Mean'] = df.mean(1)
demo
tab.assign(Mean=tab.mean(1))
A B Mean
0 2 4 3.0
1 5 6 5.5
A NumPy solution would be to work with the underlying array data for performance -
tab['average'] = tab.values.mean(1)
To choose specific columns, like 'A' and 'B' -
tab['average'] = tab[['A','B']].values.mean(1)
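A pandas-only equivalent for the column-subset case (a sketch; it is typically somewhat slower than going through .values):
tab['average'] = tab[['A','B']].mean(axis=1)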
Runtime test -
In [41]: tab = pd.DataFrame(np.random.randint(0,9,(10000,10)))
# @piRSquared's soln
In [42]: %timeit tab.assign(Mean=tab.mean(1))
1000 loops, best of 3: 615 µs per loop
In [43]: tab = pd.DataFrame(np.random.randint(0,9,(10000,10)))
In [44]: %timeit tab['average'] = tab.values.mean(1)
1000 loops, best of 3: 297 µs per loop
In [37]: tab = pd.DataFrame(np.random.randint(0,9,(10000,100)))
# @piRSquared's soln
In [38]: %timeit tab.assign(Mean=tab.mean(1))
100 loops, best of 3: 4.71 ms per loop
In [39]: tab = pd.DataFrame(np.random.randint(0,9,(10000,100)))
In [40]: %timeit tab['average'] = tab.values.mean(1)
100 loops, best of 3: 3.6 ms per loop

How to sum values of a row of a pandas dataframe efficiently

I have a pandas DataFrame with 1.5 million rows and 8 columns. I want to combine a few columns into a new column. I know how to do this, but wanted to know which approach is faster and more efficient. I am reproducing my code here:
import pandas as pd
import numpy as np
df=pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])
Now here is what I want to achieve
df['D']=0.5*df['A']+0.3*df['B']+0.2*df['C']
The other alternative is to use the apply functionality of pandas
df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
I wanted to know which method takes less time when we have 1.5 million rows and have to combine 8 columns.
The first method is faster, because it is vectorized:
df = pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])
df = pd.concat([df]*10000).reset_index(drop=True)
print (df)
#[30000 rows x 3 columns]
df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
#similar timings with mul function
#df['D1']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
print (df)
In [54]: %timeit df['D2']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
The slowest run took 10.84 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 950 µs per loop
In [55]: %timeit df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
The slowest run took 4.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.2 ms per loop
In [56]: %timeit df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
1 loop, best of 3: 928 ms per loop
Another test on a DataFrame of 1.5M rows; the apply method is very slow:
#[1500000 rows x 6 columns]
df = pd.concat([df]*500000).reset_index(drop=True)
In [62]: %timeit df['D2']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
10 loops, best of 3: 34.8 ms per loop
In [63]: %timeit df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
10 loops, best of 3: 31.5 ms per loop
In [64]: %timeit df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
1 loop, best of 3: 47.3 s per loop
Using @jezrael's setup
df=pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])
df = pd.concat([df]*30000).reset_index(drop=True)
Far more efficient to use a dot product.
np.array([[.5, .3, .2]]).dot(df.values.T).T
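To attach the result as a new column on the frame built above (a sketch; this dot-product route assumes all the selected columns are numeric):
import numpy as np  # already imported in the question
df['D'] = df[['A', 'B', 'C']].values.dot(np.array([0.5, 0.3, 0.2]))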

Fast advanced indexing in numpy

I'm trying to take a slice from a large NumPy array as quickly as possible using fancy indexing. I would be happy to get a view back, but advanced indexing returns a copy.
I've tried solutions from here and here with no joy so far.
Toy data:
data = np.random.randn(int(1e6), 50)
keep = np.random.rand(len(data))>0.5
Using the default method:
%timeit data[keep]
10 loops, best of 3: 86.5 ms per loop
Numpy take:
%timeit data.take(np.where(keep)[0], axis=0)
%timeit np.take(data, np.where(keep)[0], axis=0)
10 loops, best of 3: 83.1 ms per loop
10 loops, best of 3: 80.4 ms per loop
Method from here:
rows = np.where(keep)[0]
cols = np.arange(data.shape[1])
%timeit (data.ravel()[(cols + (rows * data.shape[1]).reshape((-1,1))).ravel()]).reshape(rows.size, cols.size)
10 loops, best of 3: 159 ms per loop
Whereas if you're taking a view of the same size:
%timeit data[1:-1:2, :]
1000000 loops, best of 3: 243 ns per loop
There's no way to do this with a view. A view needs consistent strides, while your data is randomly scattered throughout the original array.
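One way to see the copy-versus-view distinction directly is to check whether the result still references the original buffer (a quick sketch using the arrays above):
view = data[1:-1:2, :]     # basic slicing returns a view
print(view.base is data)   # True: shares memory with data
fancy = data[keep]         # boolean (advanced) indexing must allocate a copy
print(fancy.base is data)  # False: the result owns fresh memory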

Looping through pandas dataframe for speed

I'm trying to understand the fastest way to loop through a pandas DataFrame. I have read in many places that itertuples is much better than regular looping over the data, and that apply is best of all. If that is the case, why do regular loops come out fastest here? Maybe I'm not understanding the results; what does "10 loops, best of 3" mean?
%%timeit
xlist= []
for row in toMood.itertuples():
    xlist.append(row[1] + 1)
1 loop, best of 3: 266 ms per loop
In [54]:
%%timeit
zlist = []
for row in toMood['user_id']:
    zlist.append(row + 1)
10 loops, best of 3: 83 ms per loop
In [56]:
%%timeit
tlist = toMood['user_id'].apply(lambda x: x+1)
10 loops, best of 3: 138 ms per loop
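For comparison, the fully vectorized form (no Python-level loop at all) should be faster than any of the three; toMood is the questioner's DataFrame and is not reproduced here:
tlist = toMood['user_id'] + 1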

Comparison of Pandas lookup times

After experimenting with timing various types of lookups on a Pandas (0.17.1) DataFrame I am left with a few questions.
Here is the set up...
import pandas as pd
import numpy as np
import itertools
letters = [chr(x) for x in range(ord('a'), ord('z'))]
letter_combinations = [''.join(x) for x in itertools.combinations(letters, 3)]
df1 = pd.DataFrame({
    'value': np.random.normal(size=(1000000)),
    'letter': np.random.choice(letter_combinations, 1000000)
})
df2 = df1.sort_values('letter')
df3 = df1.set_index('letter')
df4 = df3.sort_index()
So df1 looks something like this...
print(df1.head(5))
>>>
letter value
0 bdh 0.253778
1 cem -1.915726
2 mru -0.434007
3 lnw -1.286693
4 fjv 0.245523
Here is the code to test differences in lookup performance...
print('~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / UNSORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df1[df1.letter == 'ben']
%timeit df1[df1.letter == 'amy']
%timeit df1[df1.letter == 'abe']
print('~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / SORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df2[df2.letter == 'ben']
%timeit df2[df2.letter == 'amy']
%timeit df2[df2.letter == 'abe']
print('~~~~~~~~~~~~~~~~~~~~~INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df3.loc['ben']
%timeit df3.loc['amy']
%timeit df3.loc['abe']
print('~~~~~~~~~~~~~~~~~~~~~SORTED INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df4.loc['ben']
%timeit df4.loc['amy']
%timeit df4.loc['abe']
And the results...
~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / UNSORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10 loops, best of 3: 59.7 ms per loop
10 loops, best of 3: 59.7 ms per loop
10 loops, best of 3: 59.7 ms per loop
~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / SORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10 loops, best of 3: 192 ms per loop
10 loops, best of 3: 192 ms per loop
10 loops, best of 3: 193 ms per loop
~~~~~~~~~~~~~~~~~~~~~INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slowest run took 4.66 times longer than the fastest. This could mean that an intermediate result is being cached
10 loops, best of 3: 40.9 ms per loop
10 loops, best of 3: 41 ms per loop
10 loops, best of 3: 40.9 ms per loop
~~~~~~~~~~~~~~~~~~~~~SORTED INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slowest run took 1621.00 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 259 µs per loop
1000 loops, best of 3: 242 µs per loop
1000 loops, best of 3: 243 µs per loop
Questions...
It's pretty clear why the lookup on the sorted index is so much faster: binary search gives O(log(n)) performance vs O(n) for a full array scan. But why is the lookup on the sorted, non-indexed df2 column SLOWER than the lookup on the unsorted, non-indexed df1 column?
What is up with "The slowest run took x times longer than the fastest. This could mean that an intermediate result is being cached"? Surely the results aren't actually being cached. Is it because the created index is lazy and isn't actually built until needed? That would explain why it only happens on the first call to .loc[].
Why isn't an index sorted by default? Is the fixed cost of the sort too much?
The disparity in these %timeit results
In [273]: %timeit df1[df1['letter'] == 'ben']
10 loops, best of 3: 36.1 ms per loop
In [274]: %timeit df2[df2['letter'] == 'ben']
10 loops, best of 3: 108 ms per loop
also shows up in the pure NumPy equality comparisons:
In [275]: %timeit df1['letter'].values == 'ben'
10 loops, best of 3: 24.1 ms per loop
In [276]: %timeit df2['letter'].values == 'ben'
10 loops, best of 3: 96.5 ms per loop
Under the hood, pandas' df1['letter'] == 'ben' calls a Cython function which loops through the values of the underlying NumPy array, df1['letter'].values. It is essentially doing the same thing as df1['letter'].values == 'ben', but with different handling of NaNs.
Moreover, notice that simply accessing the items in df1['letter'] in sequential order can be done more quickly than doing the same for df2['letter']:
In [11]: %timeit [item for item in df1['letter']]
10 loops, best of 3: 49.4 ms per loop
In [12]: %timeit [item for item in df2['letter']]
10 loops, best of 3: 124 ms per loop
The difference in times within each of these three sets of %timeit tests is roughly the same. I think that is because they all share the same cause.
Since the letter column holds strings, the NumPy arrays df1['letter'].values and df2['letter'].values have dtype object and therefore hold pointers to the memory locations of arbitrary Python objects (in this case strings).
Consider the memory locations of the strings stored in the DataFrames df1 and df2. In CPython, id returns the memory address of the object:
memloc = pd.DataFrame({'df1': list(map(id, df1['letter'])),
                       'df2': list(map(id, df2['letter']))})
df1 df2
0 140226328244040 140226299303840
1 140226328243088 140226308389048
2 140226328243872 140226317328936
3 140226328243760 140226230086600
4 140226328243368 140226285885624
The strings in df1 (after the first dozen or so) tend to appear sequentially in memory, while sorting causes the strings in df2 (taken in order) to be scattered in memory:
In [272]: diffs = memloc.diff(); diffs.head(30)
Out[272]:
df1 df2
0 NaN NaN
1 -952.0 9085208.0
2 784.0 8939888.0
3 -112.0 -87242336.0
4 -392.0 55799024.0
5 -392.0 5436736.0
6 952.0 22687184.0
7 56.0 -26436984.0
8 -448.0 24264592.0
9 -56.0 -4092072.0
10 -168.0 -10421232.0
11 -363584.0 5512088.0
12 56.0 -17433416.0
13 56.0 40042552.0
14 56.0 -18859440.0
15 56.0 -76535224.0
16 56.0 94092360.0
17 56.0 -4189368.0
18 56.0 73840.0
19 56.0 -5807616.0
20 56.0 -9211680.0
21 56.0 20571736.0
22 56.0 -27142288.0
23 56.0 5615112.0
24 56.0 -5616568.0
25 56.0 5743152.0
26 56.0 -73057432.0
27 56.0 -4988200.0
28 56.0 85630584.0
29 56.0 -4706136.0
Most of the strings in df1 are 56 bytes apart:
In [16]: diffs['df1'].value_counts()
Out[16]:
56.0 986109
120.0 13671
-524168.0 215
-56.0 1
-12664712.0 1
41136.0 1
-231731080.0 1
Name: df1, dtype: int64
In [20]: len(diffs['df1'].value_counts())
Out[20]: 7
In contrast the strings in df2 are scattered all over the place:
In [17]: diffs['df2'].value_counts().head()
Out[17]:
-56.0 46
56.0 44
168.0 39
-112.0 37
-392.0 35
Name: df2, dtype: int64
In [19]: len(diffs['df2'].value_counts())
Out[19]: 837764
When these objects (strings) are located sequentially in memory, their values can be retrieved more quickly. This is why the equality comparisons performed by df1['letter'].values == 'ben' can be done faster than those in df2['letter'].values == 'ben': the lookup time is smaller.
This memory-access issue also explains why there is no disparity in the %timeit results for the value column.
In [5]: %timeit df1[df1['value'] == 0]
1000 loops, best of 3: 1.8 ms per loop
In [6]: %timeit df2[df2['value'] == 0]
1000 loops, best of 3: 1.78 ms per loop
df1['value'] and df2['value'] are NumPy arrays of dtype float64. Unlike object arrays, their values are packed together contiguously in memory. Sorting df1 with df2 = df1.sort_values('letter') causes the values in df2['value'] to be reordered, but since the values are copied into a new NumPy array, they are still located sequentially in memory. So accessing the values in df2['value'] can be done just as quickly as those in df1['value'].
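A quick way to confirm the dtype difference behind this, against the frames defined above (just a sketch):
print(df1['letter'].values.dtype)  # object: an array of pointers to Python str objects
print(df1['value'].values.dtype)   # float64: the numbers themselves, packed contiguously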
(1) pandas currently has no knowledge of the sortedness of a column.
If you want to take advantage of sorted data, you could use df2.letter.searchsorted (a rough sketch follows below). See @unutbu's answer for an explanation of what's actually causing the difference in time here.
(2) The hash table that sits underneath the index is lazily created, then cached.
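A rough sketch of the searchsorted idea on the sorted df2 (these are positional indices; the two boundaries bracket every row whose letter equals 'ben'):
letters_sorted = df2['letter'].values                   # object array, sorted by construction
lo = letters_sorted.searchsorted('ben', side='left')    # first position >= 'ben'
hi = letters_sorted.searchsorted('ben', side='right')   # first position > 'ben'
matches = df2.iloc[lo:hi]                               # all rows where letter == 'ben'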
