efficiently find bit positions in binary strings in a numpy array - python

I have a large Pandas DataFrame (which behaves like a NumPy ndarray for most purposes) containing binary strings (0s and 1s). I need to find the positions of all the zeros in these strings and then label them. Also, I expect the positions of the zeros to be relatively sparse (~1% of all bit positions).
Basically, I want to run something like this:
import pandas as pd
x = pd.Series([ '11101110', '11111101' ], ) # start with strings
x = pd.Series([ 0b11101110, 0b11111101 ], ) # ... or integers of a known bit length
zero_positions = find_zero_positions( x )
Yielding zero_positions =...
         value
row bit
0   4        0
    0        0
1   1        0
I've tried a few different ways to do this, but haven't come up with anything better than looping through one row at a time. (EDIT: The actual strings I want to look at are much longer than the 8-bit examples here, so a lookup table won't work.)
I'm not sure whether it will be more efficient to approach this as a string problem (Pandas's Vectorized string methods don't offer a substring-position-finding method) or a numeric problem (using something like numpy.unpackbits, maybe?).

You could use numpy.unpackbits as follows, starting with an ndarray of this form:
In [1]: x = np.array([[0b11101110], [0b11111101]], dtype=np.uint8)
In [2]: x
Out[2]:
array([[238],
       [253]], dtype=uint8)
In [3]: df = pd.DataFrame(np.unpackbits(x, axis=1))
In [4]: df.columns = df.columns[::-1]
In [5]: df
Out[5]:
   7  6  5  4  3  2  1  0
0  1  1  1  0  1  1  1  0
1  1  1  1  1  1  1  0  1
Then from the DataFrame, just stack and find the zeros:
In [6]: s = df.stack()
In [7]: s.index.names = ['row', 'bit']
In [8]: s[s == 0]
Out[8]:
row  bit
0    4      0
     0      0
1    1      0
dtype: uint8
I think this would be a reasonably efficient method.
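Putting the pieces together, here is a sketch of a find_zero_positions helper along these lines, assuming the input is already a 2-D uint8 array with one byte per row as in the example above:

import numpy as np
import pandas as pd

def find_zero_positions(x):
    # Expand each byte into its 8 bits, one column per bit position
    df = pd.DataFrame(np.unpackbits(x, axis=1))
    # Relabel the columns so that bit 0 is the least significant bit
    df.columns = df.columns[::-1]
    # Stack into a (row, bit) -> value Series and keep only the zeros
    s = df.stack()
    s.index.names = ['row', 'bit']
    return s[s == 0]

x = np.array([[0b11101110], [0b11111101]], dtype=np.uint8)
print(find_zero_positions(x))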

One good solution would be to split the input into smallish chunks and use a memoized lookup table (computing each chunk's entry the first time you encounter it).
E.g., if each number/array is 128 bits, break it into eight 16-bit parts that are looked up in a table. At worst, the lookup table needs 2**16 = 65536 entries; but if zeros are very sparse (e.g., at most two zeros in any group of 8 bits), you only need about 64 of them. Depending on how sparse the zeros are, you can beef up the size of the chunks.
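A rough sketch of that idea follows; the names and chunk size are illustrative, functools.lru_cache stands in for the memoized table, and the overall bit width is assumed known:

from functools import lru_cache

CHUNK_BITS = 16  # assumed chunk size; increase it if the zeros are sparse enough

@lru_cache(maxsize=None)
def zero_positions_in_chunk(chunk):
    # Computed once per distinct chunk value, then served from the cache
    return tuple(i for i in range(CHUNK_BITS) if not (chunk >> i) & 1)

def zero_positions(num, total_bits=128):
    # Walk the number chunk by chunk and offset each chunk's zero positions
    mask = (1 << CHUNK_BITS) - 1
    positions = []
    for offset in range(0, total_bits, CHUNK_BITS):
        chunk = (num >> offset) & mask
        positions.extend(offset + p for p in zero_positions_in_chunk(chunk))
    return positions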

In the "yuck" department, I would like to enter the following contestant:
def numpyToBinString(numpyValue):
    return "".join(str((numpyValue[0] >> shiftLength) & 1) for shiftLength in range(numpyValue.dtype.itemsize * 8))
Works for shape (1,) ndarrays, but could be extended with the @vectorize decorator.

You can use a lookup table.
Create a table that has the 0 positions for each number from 0-255 and a function to access it, call it zeroBitPositions, this returns a list.
Then, assuming that you are storing your numbers as a Python int (long in Python 2, which has unlimited precision), you can do the following:
allZeroPositions = []
shift = 0
while (num >> shift) > 0:
    allZeroPositions += [x + shift for x in zeroBitPositions((num >> shift) & 0xFF)]
    shift += 8
Hopefully this is a good start.
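For completeness, here is a sketch of what the precomputed table and zeroBitPositions could look like; this is one possible implementation, not from the original answer:

# Precompute, for every byte value 0-255, the positions of its zero bits
ZERO_BIT_POSITIONS = [
    [bit for bit in range(8) if not (byte >> bit) & 1]
    for byte in range(256)
]

def zeroBitPositions(byte):
    # Table lookup: returns the list of zero-bit positions within one byte
    return ZERO_BIT_POSITIONS[byte]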

Related

How to replace two entire columns in a df by adding 5 to the previous value?

I'm new to Python and stackoverflow, so please forgive the bad edit on this question.
I have a df with 11 columns and 3 108 730 rows.
Columns 1 and 2 represent the X and Y (mathematical) coordinates, respectively and the other columns represent different frequencies in Hz.
The df looks like this:
df before adjustment
I want to plot this df in ArcGIS, but for that I need to replace the (mathematical) coordinates that currently exist with the real-life geographical coordinates.
The trick is that I was only given the first geographical coordinate which is x=1055000 and y=6315000.
The other rows in columns 1 and 2 should be replaced by adding 5 to the previous row value so for example, for the x coordinates it should be 1055000, 1055005, 1055010, 1055015, .... and so on.
I have written two for loops that replace the values accordingly, but my problem is that they take much too long to run because of the size of the df; I still haven't got a result after some hours, because I used the row number as the range like this:
for i in range(0, 3108729):
    if i == 0:
        df.at[i, 'IDX'] = 1055000
    else:
        df.at[i, 'IDX'] = df.at[i-1, 'IDX'] + 5
df.head()
and like this for the y coordinates:
for j in range(0, 3108729):
    if j == 0:
        df.at[j, 'IDY'] = 6315000
    else:
        df.at[j, 'IDY'] = df.at[j-1, 'IDY'] + 5
df.head()
I have run the loops as a test with range(0,5) and it works, but I'm sure there is a way to replace the coordinates in a more time-efficient manner without having to define a range. I appreciate any help!
You can just build a range series in one go, no need to iterate:
df.loc[:, 'IDX'] = 1055000 + pd.Series(range(len(df))) * 5
df.loc[:, 'IDY'] = 6315000 + pd.Series(range(len(df))) * 5
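An equivalent sketch with numpy.arange avoids building the intermediate Series; it assumes the DataFrame has a default 0..n-1 index:

import numpy as np

# Vectorized assignment; assumes df uses a plain RangeIndex
df['IDX'] = 1055000 + np.arange(len(df)) * 5
df['IDY'] = 6315000 + np.arange(len(df)) * 5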

Why does Vectorization fail with larger numbers but Map and Apply work?

I'm trying to further understand the difference between Map, Apply, and Vectorization, and just encountered a challenge I don't understand: for small numbers, these three functions achieve the same outcome, but for large numbers Vectorization appears to fail. Here's what I mean:
# get a simple dataframe set up
import numpy as np
import pandas as pd
x = range(10)
y = range(10,20)
df = pd.DataFrame(data = zip(x,y), columns = ['x','y'])
# define a simple function to test map, apply, and vectorization with
def simple_power(num1, num2):
    return num1 ** num2
# use Map, Apply, and Vectorization to apply the function to every row in the dataframe
df['map power'] = list(map(simple_power, *(df['x'], df['y'])))
df['apply power'] = df.apply(lambda row: simple_power(row['x'], row['y']), axis=1)
df['vectorized power'] = simple_power(df['x'], df['y'])
Everything works:
in: df.head()
out:
   x   y  map power  apply power  vectorized power
0  0  10          0            0                 0
1  1  11          1            1                 1
2  2  12       4096         4096              4096
3  3  13    1594323      1594323           1594323
4  4  14  268435456    268435456         268435456
Here's where things get confusing: if I replace my x and y with larger ranges, map and apply still work, but vectorization fails:
# set up dataframe with larger numbers to multiply together
x = range(100)
y = range(100,200)
df = pd.DataFrame(data = zip(x,y), columns = ['x','y'])
Then if I re-run map, apply, and vectorization, I get a wonky output for vectorization:
in: df.head()
out:
Map and Apply are consistent with each other, but Vectorization gives nonsense results.
Can anyone tell me what's going on? Thank you!
https://github.com/numpy/numpy/issues/8987 and https://github.com/numpy/numpy/issues/10964 describe the problem you are running into.
When you use ** in your function on these columns, you are implicitly using numpy.power. When the integers overflow, you don't see an error; the values silently wrap around.
This is a known bug and should be getting fixed.
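A minimal sketch of the silent wraparound (illustrative, not from the original post): Python ints have arbitrary precision, while NumPy's fixed-width integer arrays overflow without raising:

import numpy as np

print(3 ** 103)  # exact: Python ints have arbitrary precision

a = np.array([3], dtype=np.int64)
b = np.array([103], dtype=np.int64)
print(a ** b)    # wraps around silently, giving a "wonky" value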

Returning date that corresponds with maximum value in pandas dataframe [duplicate]

How can I find the row for which the value of a specific column is maximal?
df.max() will give me the maximal value for each column, I don't know how to get the corresponding row.
Use the pandas idxmax function. It's straightforward:
>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
A B C
0 1.232853 -1.979459 -0.573626
1 0.140767 0.394940 1.068890
2 0.742023 1.343977 -0.579745
3 2.125299 -0.649328 -0.211692
4 -0.187253 1.908618 -1.862934
>>> df['A'].idxmax()
3
>>> df['B'].idxmax()
4
>>> df['C'].idxmax()
1
Alternatively you could also use numpy.argmax, such as numpy.argmax(df['A']) -- it provides the same thing, and appears at least as fast as idxmax in cursory observations.
idxmax() returns index labels, not integers.
Example: if you have string values as your index labels, like rows 'a' through 'e', you might want to know that the max occurs in row 4 (not row 'd').
if you want the integer position of that label within the Index you have to get it manually (which can be tricky now that duplicate row labels are allowed).
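One way to do that manually is a sketch like the following; it assumes the max label occurs only once (with duplicate labels, get_loc returns a slice or boolean mask rather than a single integer):

# Integer position of the row holding the max of column 'A'
pos = df.index.get_loc(df['A'].idxmax())
row = df.iloc[pos]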
HISTORICAL NOTES:
idxmax() used to be called argmax() prior to 0.11
argmax was deprecated prior to 1.0.0 and removed entirely in 1.0.0
As of Pandas 0.16, argmax still existed and performed the same function (though it appeared to run more slowly than idxmax).
The argmax function returned the integer position within the index of the row holding the maximum element.
pandas moved to using row labels instead of integer indices. Positional integer indices used to be very common, more common than labels, especially in applications where duplicate row labels are common.
For example, consider this toy DataFrame with a duplicate row label:
In [19]: dfrm
Out[19]:
A B C
a 0.143693 0.653810 0.586007
b 0.623582 0.312903 0.919076
c 0.165438 0.889809 0.000967
d 0.308245 0.787776 0.571195
e 0.870068 0.935626 0.606911
f 0.037602 0.855193 0.728495
g 0.605366 0.338105 0.696460
h 0.000000 0.090814 0.963927
i 0.688343 0.188468 0.352213
i 0.879000 0.105039 0.900260
In [20]: dfrm['A'].idxmax()
Out[20]: 'i'
In [21]: dfrm.iloc[dfrm['A'].idxmax()] # .ix instead of .iloc in older versions of pandas
Out[21]:
A B C
i 0.688343 0.188468 0.352213
i 0.879000 0.105039 0.900260
So here a naive use of idxmax is not sufficient, whereas the old form of argmax would correctly provide the positional location of the max row (in this case, position 9).
This is exactly one of those nasty kinds of bug-prone behaviors in dynamically typed languages that makes this sort of thing so unfortunate, and worth beating a dead horse over. If you are writing systems code and your system suddenly gets used on some data sets that are not cleaned properly before being joined, it's very easy to end up with duplicate row labels, especially string labels like a CUSIP or SEDOL identifier for financial assets. You can't easily use the type system to help you out, and you may not be able to enforce uniqueness on the index without running into unexpectedly missing data.
So you're left with hoping that your unit tests covered everything (they didn't, or more likely no one wrote any tests) -- otherwise (most likely) you're just left waiting to see if you happen to smack into this error at runtime, in which case you probably have to go drop many hours worth of work from the database you were outputting results to, bang your head against the wall in IPython trying to manually reproduce the problem, finally figuring out that it's because idxmax can only report the label of the max row, and then being disappointed that no standard function automatically gets the positions of the max row for you, writing a buggy implementation yourself, editing the code, and praying you don't run into the problem again.
You might also try idxmax:
In [5]: df = pandas.DataFrame(np.random.randn(10,3),columns=['A','B','C'])
In [6]: df
Out[6]:
A B C
0 2.001289 0.482561 1.579985
1 -0.991646 -0.387835 1.320236
2 0.143826 -1.096889 1.486508
3 -0.193056 -0.499020 1.536540
4 -2.083647 -3.074591 0.175772
5 -0.186138 -1.949731 0.287432
6 -0.480790 -1.771560 -0.930234
7 0.227383 -0.278253 2.102004
8 -0.002592 1.434192 -1.624915
9 0.404911 -2.167599 -0.452900
In [7]: df.idxmax()
Out[7]:
A 0
B 8
C 7
e.g.
In [8]: df.loc[df['A'].idxmax()]
Out[8]:
A 2.001289
B 0.482561
C 1.579985
Both of the above answers only return one index if there are multiple rows that take the maximum value. If you want all the rows, there does not seem to be a built-in function.
But it is not hard to do. Below is an example for Series; the same can be done for DataFrame:
In [1]: from pandas import Series, DataFrame
In [2]: s=Series([2,4,4,3],index=['a','b','c','d'])
In [3]: s.idxmax()
Out[3]: 'b'
In [4]: s[s==s.max()]
Out[4]:
b 4
c 4
dtype: int64
df.iloc[df['columnX'].argmax()]
argmax() would provide the index corresponding to the max value for the columnX. iloc can be used to get the row of the DataFrame df for this index.
A more compact and readable solution using query() is like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
print(df)
# find row with maximum A
df.query('A == A.max()')
It also returns a DataFrame instead of Series, which would be handy for some use cases.
Very simple: we have df as below and we want to print a row with max value in C:
A B C
x 1 4
y 2 10
z 5 9
In:
df.loc[df['C'] == df['C'].max()] # condition check
Out:
A B C
y 2 10
If you want the entire row instead of just the id, you can use df.nlargest and pass in how many 'top' rows you want and you can also pass in for which column/columns you want it for.
df.nlargest(2,['A'])
will give you the rows corresponding to the top 2 values of A.
use df.nsmallest for min values.
The direct ".argmax()" solution does not work for me.
The previous example provided by @ely
>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
A B C
0 1.232853 -1.979459 -0.573626
1 0.140767 0.394940 1.068890
2 0.742023 1.343977 -0.579745
3 2.125299 -0.649328 -0.211692
4 -0.187253 1.908618 -1.862934
>>> df['A'].argmax()
3
>>> df['B'].argmax()
4
>>> df['C'].argmax()
1
returns the following message:
FutureWarning: 'argmax' is deprecated, use 'idxmax' instead. The behavior of 'argmax'
will be corrected to return the positional maximum in the future.
Use 'series.values.argmax' to get the position of the maximum now.
So my solution is:
df['A'].values.argmax()
mx.iloc[0].idxmax()
This one line of code will give you the column holding the maximum value in a row of the dataframe; here mx is the dataframe and iloc[0] selects the row at position 0.
Considering this dataframe
[In]: df = pd.DataFrame(np.random.randn(4,3),columns=['A','B','C'])
[Out]:
A B C
0 -0.253233 0.226313 1.223688
1 0.472606 1.017674 1.520032
2 1.454875 1.066637 0.381890
3 -0.054181 0.234305 -0.557915
Assuming one want to know the rows where column "C" is max, the following will do the work
[In]: df[df['C'] == df['C'].max()]
[Out]:
A B C
1 0.472606 1.017674 1.520032
The idxmax of the DataFrame returns the label index of the row with the maximum value, and the behavior of argmax depends on the version of pandas (right now it returns a warning). If you want to use the positional index, you can do the following:
max_row = df['A'].values.argmax()
or
import numpy as np
max_row = np.argmax(df['A'].values)
Note that np.argmax(df['A']) behaves the same as df['A'].argmax().
Use:
data.iloc[data['A'].idxmax()]
data['A'].idxmax() - finds the row location of the max value
data.iloc[...] - returns that row
If there are ties in the maximum values, then idxmax returns the index of only the first max value. For example, in the following DataFrame:
A B C
0 1 0 1
1 0 0 1
2 0 0 0
3 0 1 1
4 1 0 0
idxmax returns
A 0
B 3
C 0
dtype: int64
Now, if we want all indices corresponding to max values, then we could use max + eq to create a boolean DataFrame, then use it on df.index to filter out indexes:
out = df.eq(df.max()).apply(lambda x: df.index[x].tolist())
Output:
A [0, 4]
B [3]
C [0, 1, 3]
dtype: object
What worked for me is:
df[df['colX'] == df['colX'].max()]
You then get the row in your df with the maximum value of colX.
Then if you just want the index you can add .index at the end of the query.
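For example (a small sketch of the same idea):

# Row(s) holding the max of colX
df[df['colX'] == df['colX'].max()]
# Just the index labels of those rows
df[df['colX'] == df['colX'].max()].index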

Accessing the second element of a list for every row in pandas dataframe

My data consists of Latitude values stored as object (string) type:
0 4.620881605
1 4.620124518
2 4.619367709
3 4.618609512
4 4.61784758
Then, I split after the decimal point using this code:
marker['Latitude'].str.split('.')
Resulting in :
0 [4, 620881605]
1 [4, 620124518]
2 [4, 619367709]
3 [4, 618609512]
4 [4, 61784758]
which is good but not quite there yet. I want to access the second element of the list for every row and the end result I am expecting is this :
0 620881605
1 620124518
2 619367709
3 618609512
4 61784758
I was looking for an answer to the same question, it seems there is nothing built-in. The best option I can find is operator.itemgetter(), which is implemented in native code and should perform fine with Series.apply():
from operator import itemgetter
series = pd.Series(["%s|%s" % (-x, x) for x in range(100)])
pairs = series.str.split('|')
# Fetch all the negative numbers
negatives = pairs.apply(itemgetter(0)).astype(int)
# Fetch all the positive numbers
positives = pairs.apply(itemgetter(1)).astype(int)
Note Series.str.split() also accepts an expand=True argument, which returns a new DataFrame containing columns 0..n rather than a series of lists. This probably should be the default behaviour, it's much easier to work with:
series = pd.Series(["%s|%s" % (-x, x) for x in range(100)])
pairs = series.str.split('|', expand=True)
# Fetch all the negative numbers
negatives = pairs[0]
# Fetch all the positive numbers
positives = pairs[1]
You can use pd.DataFrame.iterrows() to iterate by row and then select the proper index for your list.
import pandas as pd
x = pd.DataFrame({'a': [[1, 2], [3, 4], [5, 6]]})
for index, row in x.iterrows():
    print(row['a'][1])
2
4
6
marker['Latitude'].apply(lambda x : x.strip(',').split('.')[1])
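Another idiomatic option (a sketch, not from the answers above) is the .str accessor, which can index into each list produced by str.split:

# .str[1] picks the second element of each split list
marker['Latitude'].str.split('.').str[1]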

Looping through a Pandas dataframe as a two dimensional array

At some point in a running process I have two dataframes containing standard deviation values from various sampling distributions.
dfbest keeps the smallest deviations as the best. dftmp records the current values.
So, the first time through, dftmp looks like this:
0 1 2 3 4
0 22.552408 7.299163 15.114379 5.214829 9.124144
with dftmp.shape (1, 5)
Ignoring for a moment the Pythonic constructs and treating the dataframes as 2D arrays, like a spreadsheet in VBA Excel, I write...
A:
if dfbest.empty:
    dfbest = dftmp
else:
    for R in range(dfbest.shape[0]):
        for C in range(dfbest.shape[1]):
            if dfbest[R][C] > dftmp[R][C]:
                dfbest[R][C] = dftmp[R][C]
B:
if dfbest.empty:
    dfbest = dftmp
else:
    for R in range(dfbest.shape[0]):
        for C in range(dfbest.shape[1]):
            if dfbest[C][R] > dftmp[C][R]:
                dfbest[C][R] = dftmp[C][R]
Code A fails while B works. I'd expect the opposite, but I'm new to Python, so who knows what I'm not seeing here. Any suggestions? I suspect there is a more appropriate .iloc solution to this.
When you access a dataframe like that (dfbest[x][y]), x selects a column and y a row. That's why code B works.
Here is more information: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#basics
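As for the more idiomatic solution the asker suspects exists, a vectorized sketch (assuming dfbest and dftmp have matching shape and labels) replaces the double loop with an element-wise minimum:

import numpy as np

if dfbest.empty:
    dfbest = dftmp.copy()
else:
    # Keep the smaller of the two values in every cell
    dfbest = np.minimum(dfbest, dftmp)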
