Python: Select Rows by value in large dataframe

Given a data frame df:
Column A: [0, 1, 3, 4, 6]
Column B: [0, 0, 0, 0, 0]
The goal is to conditionally replace values in column B: if a value in column A exists in the set assignedToA, the corresponding value in column B is replaced with a constant b.
For example: if b=1 and assignedToA={1,4}, the result would be
Column A: [0, 1, 3, 4, 6]
Column B: [0, 1, 0, 1, 0]
My code for finding the matching A values and writing the B values looks like this:
df.loc[df['A'].isin(assignedToA),'B']=b
This code works, but it is really slow for a huge dataframe.
Do you have any advice on how to speed this process up?
The dataframe df has around 5 million rows and assignedToA has at most 7 values.

You may find a performance improvement by dropping down to numpy:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 3, 4, 6],
                   'B': [0, 0, 0, 0, 0]})

def jp(df, vals, k):
    B = df['B'].values
    B[np.in1d(df['A'], list(vals))] = k
    df['B'] = B
    return df

def original(df, vals, k):
    df.loc[df['A'].isin(vals), 'B'] = k
    return df

df = pd.concat([df]*100000)

%timeit jp(df, {1, 4}, 1)        # 8.55 ms
%timeit original(df, {1, 4}, 1)  # 16.6 ms
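As a side note, np.isin (available since NumPy 1.13) is the documented successor to np.in1d, which NumPy 2.0 deprecates; a minimal sketch of the same approach using it:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 3, 4, 6],
                   'B': [0, 0, 0, 0, 0]})

# Same idea as jp() above, but with np.isin instead of np.in1d.
B = df['B'].to_numpy()
B[np.isin(df['A'].to_numpy(), list({1, 4}))] = 1
df['B'] = B
print(df)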

Related

How to efficiently find the inverse intersection between two large dataframes with pandas?

I am trying to find the inverse intersection between two large dataframes.
I got it to work with the code snippet below. Unfortunately, this approach is "too slow" on large dataframes, as described further below. Can you think of a quicker way to compute this outcome?
import pandas as pd

df_1 = pd.DataFrame({'a': [8, 2, 2],
                     'b': [0, 1, 3],
                     'c': [0, 2, 2],
                     'd': [0, 2, 2],
                     'e': [0, 2, 2]})

df_2 = pd.DataFrame({'a': [8, 2, 2, 2, 8, 2],
                     'b': [0, 1, 1, 6, 0, 1],
                     'c': [0, 3, 2, 2, 0, 2],
                     'd': [0, 4, 2, 2, 0, 4],
                     'e': [0, 1, 2, 2, 0, 2]})

l_columns = ['a', 'b', 'e']
def df_drop_df(df_1, df_2, l_columns):
    """
    Eliminates all equal rows present in dataframe 1 (df_1) from dataframe 2 (df_2),
    based on a subset of columns (l_columns).
    :param df_1: dataframe that defines which rows to be removed
    :param df_2: dataframe that is reduced
    :param l_columns: list of column names, present in df_1 and df_2, used for the comparison
    :return df_out: final dataframe
    """
    df_1r = df_1[l_columns]
    df_2r = df_2[l_columns].reset_index()
    df_m = pd.merge(df_1r, df_2r, on=l_columns, how='inner')
    row_indexes_m = df_m['index'].to_list()
    row_indexes_df_2 = df_2.index.to_list()
    row_indexes_out = [x for x in row_indexes_df_2 if x not in row_indexes_m]
    df_out = df_2.loc[row_indexes_out]
    return df_out
Giving the following correct result:
#row_indexes_out = [1,3]
df_output = df_drop_df(df_1, df_2, l_columns)
df_output
({'a': [2, 2],
  'b': [1, 6],
  'c': [3, 2],
  'd': [4, 2],
  'e': [1, 2]})
However, for the actual application, the dataframes have the following dimensions, which takes roughly 30 min to compute on my local machine:

variable     shape
df1          (3300, 77)
df2          (642000, 77)
l_columns    list of 12
df_out       (611000, 77)

(This means that each row present in df_1 appears roughly 10 times in df_2.)
Can you think of a quicker way to compute this outcome?
You can try substituting the following lines:
row_indexes_df_2 = df_2.index.to_list()
row_indexes_out = [x for x in row_indexes_df_2 if x not in row_indexes_m]
df_out = df_2.loc[row_indexes_out]
with the tilde (~) operator, which negates the boolean mask:
df_out = df_2.loc[~df_2.index.isin(row_indexes_m)]
It should considerably reduce the time.
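Another option (a sketch, not part of the original answer) is to do the whole anti-join in one step with pandas.merge and its indicator argument, which avoids building Python lists of indexes entirely:

import pandas as pd

def df_drop_df_merge(df_1, df_2, l_columns):
    # Left-merge df_2 against the distinct key rows of df_1 and keep only the
    # rows that found no match ('left_only'), i.e. the inverse intersection.
    keys = df_1[l_columns].drop_duplicates()
    merged = df_2.merge(keys, on=l_columns, how='left', indicator=True)
    return df_2[(merged['_merge'] == 'left_only').to_numpy()]

With the df_1, df_2 and l_columns defined above, df_drop_df_merge(df_1, df_2, l_columns) should return the same two rows (positions 1 and 3) as the original function.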

pandas: iterate over multiple columns with is_monotonic_increasing

I am trying to iterate over a time series with multiple columns and check whether the values within each column are monotonic increasing or decreasing.
The underlying issue is that I don't know how to iterate over the dataframe columns and treat the values in a way that lets is_monotonic_increasing work.
I have a dataset that looks like this:
Id          10000T  20000T
2020-04-30       0       7
2020-05-31       3       5
2020-06-30       5       6
and I have tried doing this:
trend_observation_period = new_df[-3:] #the dataset
trend = np.where((trend_observation_period.is_monotonic_increasing()==True), 'go', 'nogo')
which gives me the error:
AttributeError: 'DataFrame' object has no attribute 'is_monotonic_increasing'
I am confused because I thought that np.where would iterate over the columns and read them as np arrays. I have also tried this, which does not work either:
for i in trend_observation_period.iteritems():
    s = pd.Series(i)
    trend = np.where((s.is_monotonic_increasing()==True | s.is_monotonic_decreasing()==True),
                     'trending', 'not_trending')
It sounds like you're after something which will iterate over the columns and test whether each column is monotonic. See if this puts you on the right track.
Per the pandas docs, .is_monotonic is the same as .is_monotonic_increasing (note that .is_monotonic was deprecated in pandas 1.5 and removed in 2.0, so prefer .is_monotonic_increasing on current versions).
Example:
import pandas as pd

# Sample dataset setup.
df = pd.DataFrame({'a': [1, 1, 1, 2],
                   'b': [3, 2, 1, 0],
                   'c': [0, 1, 1, 0],
                   'd': [2, 0, 1, 0]})

# Loop through each column in the DataFrame and output whether it is monotonic.
for c in df:
    print(f'Column: {c} I', df[c].is_monotonic)
    print(f'Column: {c} D', df[c].is_monotonic_decreasing, end='\n\n')
Output:
Column: a I True
Column: a D False
Column: b I False
Column: b D True
Column: c I False
Column: c D False
Column: d I False
Column: d D False
You can use DataFrame.apply to apply a function to each of your columns. Since is_monotonic_increasing is a property of a Series and not a method of it, you'll need to wrap it in a function (you can use lambda for this):
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1],
                   'b': [1, 1, 1, 0],
                   'c': [0, 1, 1, 0],
                   'd': [0, 0, 0, 0]})

increasing_cols = df.apply(lambda s: s.is_monotonic_increasing)
print(increasing_cols)
a True
b False
c False
d True
dtype: bool
Use .apply and is_monotonic.
Example:
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4],
                   "B": [0, 1, 0, 1],
                   "C": [3, 5, 8, 9],
                   "D": [1, 2, 2, 1]})

df.apply(lambda x: x.is_monotonic)
A True
B False
C True
D False
dtype: bool
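Tying this back to the original goal of labelling each column as trending or not, here is a minimal sketch (not taken from the answers above, built on the sample data from the question) that combines both monotonicity checks with apply and np.where:

import numpy as np
import pandas as pd

df = pd.DataFrame({'10000T': [0, 3, 5],
                   '20000T': [7, 5, 6]},
                  index=['2020-04-30', '2020-05-31', '2020-06-30'])

# True for columns that are monotonic in either direction over the window.
monotonic = df.apply(lambda s: s.is_monotonic_increasing or s.is_monotonic_decreasing)
trend = np.where(monotonic, 'trending', 'not_trending')
print(dict(zip(df.columns, trend.tolist())))
# {'10000T': 'trending', '20000T': 'not_trending'}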

Aggregating time series data

I am no data scientist. I do know Python, and I currently have to manage time series data that comes in at a regular interval. Much of this data consists of zeros or values that stay the same for a long time, and to save memory I'd like to filter them out. Is there some standard method for this (which I am obviously unaware of), or should I implement my own algorithm?
What I want to achieve is the following:
interval  value  result (summed)
       1      0  0
       2      0  # removed
       3      0  0
       4      1  1
       5      2  2
       6      2  # removed
       7      2  # removed
       8      2  2
       9      0  0
      10      0  0
Any help appreciated !
You could use pandas query on dataframes to achieve this:
import pandas as pd
matrix = [[1, 0, 0],
          [2, 0, 0],
          [3, 0, 0],
          [4, 1, 1],
          [5, 2, 2],
          [6, 2, 0],
          [7, 2, 0],
          [8, 2, 2],
          [9, 0, 0],
          [10, 0, 0]]

df = pd.DataFrame(matrix, columns=list('abc'))
print(df.query("c != 0"))
There is no quick function call to do what you need. The following is one way:
import pandas as pd

df = pd.DataFrame({'interval': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'value':    [0, 0, 0, 1, 2, 2, 2, 2, 0, 0]})  # example dataframe

df['group'] = df['value'].ne(df['value'].shift()).cumsum()  # column that increments every time the value changes
df['key'] = 1                                               # create column of ones
df['key'] = df.groupby('group')['key'].transform('cumsum')  # get the cumulative sum within each group
df['key'] = df.groupby('group')['key'].transform(lambda x: x.isin([x.min(), x.max()]))  # flag the minimum and maximum key in each group
df = df[df['key'] == True].drop(columns=['group', 'key'])   # keep only relevant cases
df
Here is the code:
l = [0, 0, 0, 1, 2, 2, 2, 2, 0, 0]
for (i, ll) in enumerate(l):
    if i != 0 and ll == l[i-1] and i < len(l)-1 and ll == l[i+1]:
        continue
    print(i+1, ll)
It produces what you want. You haven't specified the format of your input data, so I assumed it is in a list. The conditions ll == l[i-1] and ll == l[i+1] are the key to skipping the repeated values.
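A more compact pandas alternative (a sketch of the same idea, not from the answers above): keep a row whenever its value differs from either neighbour, which preserves the first and last row of every run of repeated values:

import pandas as pd

df = pd.DataFrame({'interval': range(1, 11),
                   'value': [0, 0, 0, 1, 2, 2, 2, 2, 0, 0]})

# A row is kept if its value differs from the previous or the next row.
# The NaN produced by shift() at the edges compares as "different",
# so the first and last rows are always kept.
mask = df['value'].ne(df['value'].shift()) | df['value'].ne(df['value'].shift(-1))
print(df[mask])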
Thanks all!
Looking at the answers I guess I can conclude I'll need to roll my own. I'll be using your input as inspiration.
Thanks again !

Create 2-d array with column keys and row keys in Python

I am trying to create this data structure in Python:
[image: 2-d array structure]
There have to be column keys and row keys that I will be using later.
Column keys and row keys are random numbers.
For now I have this code:
import random

cols, rows = 5, 5
Matrix = [[0 for x in range(cols)] for y in range(rows)]

set_col = 0
for row in Matrix:
    row[set_col] = random.randint(1, 2)

columnKeys = random.sample(range(1, 5), 4)
Matrix[0] = columnKeys

for row in Matrix:
    print(row)
Output:
[3, 1, 2, 4]
[2, 0, 0, 0, 0]
[1, 0, 0, 0, 0]
[2, 0, 0, 0, 0]
[1, 0, 0, 0, 0]
This is not quite what I want. For now each cell value is zero, but later it will hold relevant data that I will use along with the corresponding row and column keys. I don't know how to organize this data structure correctly so that I can use cell values with their corresponding row/column keys.
How to do it without Pandas and Numpy so I can use column and row keys?
It depends on what you want.
The best way is probably not to use nested lists, but instead to use dictionaries. Since you mentioned pandas: pandas DataFrame objects have a to_dict method that will convert a DataFrame into a dictionary, and there are several options depending on what you prefer.
I see from your example that you are trying to create your data structure with duplicate indices. The best option here is likely to use the structure created by running df.to_dict("split").
Say your DataFrame (df) looks like this:
   3  1  2  4
2  0  0  0  0
1  0  0  0  0
2  0  0  0  0
1  0  0  0  0
Running df.to_dict("split") will then do this:
d = df.to_dict("split")
{
    'columns': [3, 1, 2, 4],
    'data': [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
    'index': [2, 1, 2, 1]
}
Accessing data in this scenario, and in the one shown by #Makiflow, is tricky. Even within pandas, having duplicate indices or columns on your DataFrame makes operations more interesting. In this case, selecting d['data'][3][1] picks the second element of the fourth list under the 'data' key, which is the 4th row and 2nd column of your matrix. If you want to be able to reference items by column name, you have to do a little more leg work.
You can run col_num = d['columns'].index(3), which will give you the position of the element 3, but doing d['index'].index(2) will always give you 0, even if you wanted a later occurrence of 2, because index() returns the position of the first match. You can of course simply select by (col, row) position tuples, but that defeats the purpose of having column names and index values in the first place.
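For illustration, here is a small hypothetical sketch (the variable names are mine, not from the answer) of that extra leg work with the "split" structure:

# The dict produced by df.to_dict("split") above.
d = {'columns': [3, 1, 2, 4],
     'index': [2, 1, 2, 1],
     'data': [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]}

# Translate a column name into its position, then index into the positional rows.
col_pos = d['columns'].index(3)   # position of the column named 3
row_pos = 1                       # positional row, since index labels repeat
value = d['data'][row_pos][col_pos]
print(value)  # 0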
If you want to generate this structure without pandas, you can run:
import random

COLS, ROWS = 5, 5
columns = [random.randint(0, COLS) for _ in range(COLS)]
rows = [random.randint(1, 2) for _ in range(ROWS)]
d = {"columns": columns,
     "index": rows,
     "data": [[0 for _ in range(COLS)] for _ in range(ROWS)]}
IMHO, a better solution would actually be to force your data structure to have unique index and column values. The default output of to_dict() is a very simple dictionary:
d = df.to_dict()  # also the same as df.to_dict("dict")
{
    1: {1: 0, 2: 0},
    2: {1: 0, 2: 0},
    3: {1: 0, 2: 0},
    4: {1: 0, 2: 0}
}
In this configuration, each key of the dictionary is the name of a column. Each of those keys points to another dictionary that represents that column: each key is an index value, mapped to the cell value.
This likely makes the most intuitive sense because if you wanted to get the value at the column named 3 at the index named 1, you would do:
d = df.to_dict()
d[3][1]
# 0
You can create this data structure without using Pandas quite simply:
COLS, ROWS = 5, 5
rows = [i for i in range(ROWS)]
columns = [i for i in range(COLS)]
{c: {i: 0 for i in rows} for c in columns}
# {
# 0: {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
# 1: {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
# 2: {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
# 3: {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
# 4: {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}
# }
It's really dependent on the constraints/requirements that you have.
import random

COLS, ROWS = 5, 5
Matrix = [[0 for x in range(COLS)] for y in range(ROWS)]

set_col = 0
for row in Matrix:
    row[set_col] = random.randint(1, 2)

columnKeys = random.sample(range(1, 5), 4)
Matrix[0] = [0] + columnKeys

for row in Matrix:
    print(row)
Output
[0, 3, 1, 2, 4]
[2, 0, 0, 0, 0]
[1, 0, 0, 0, 0]
[2, 0, 0, 0, 0]
[1, 0, 0, 0, 0]
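As an alternative sketch (not from the answers above): if the random row and column keys should themselves be the lookup keys, a dict of dicts keyed by those values lets you address a cell directly as grid[row_key][col_key]. random.sample keeps the keys unique, since duplicate dict keys would collide:

import random

ROWS, COLS = 5, 5
row_keys = random.sample(range(1, 100), ROWS)   # unique random row keys
col_keys = random.sample(range(1, 100), COLS)   # unique random column keys

# Every cell starts at 0 and is addressed by its row key and column key.
grid = {r: {c: 0 for c in col_keys} for r in row_keys}

# Write and read a cell by its keys.
grid[row_keys[0]][col_keys[2]] = 42
print(grid[row_keys[0]][col_keys[2]])  # 42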

Getting the indexes to the duplicate columns of a numpy array [duplicate]

This question already has answers here:
Find unique columns and column membership
(3 answers)
Closed 8 years ago.
I have a numpy array with duplicate columns:
import numpy as np

A = np.array([[1, 1, 1, 0, 1, 1],
              [1, 2, 2, 0, 1, 2],
              [1, 3, 3, 0, 1, 3]])
I need to find the indexes to those duplicates or something like that:
[0, 4]
[1, 2, 5]
I have a hard time dealing with indexes in Python. I really don't know how to approach it.
Thanks
I tried identifying the unique columns first with this function:
def unique_columns(data):
    ind = np.lexsort(data)
    return data.T[ind[np.concatenate(([True], np.any(data.T[ind[1:]] != data.T[ind[:-1]], axis=1)))]].T
But I can't figure out the indexes from there.
There is not a simple way to do this, unfortunately. This is an np.unique-based approach: it requires that the axis you want to find uniques over be contiguous in memory, and numpy's typical memory layout is C contiguous, i.e. contiguous in rows. Fortunately numpy makes this conversion simple:
A = np.array([[1, 1, 1, 0, 1, 1],
              [1, 2, 2, 0, 1, 2],
              [1, 3, 3, 0, 1, 3]])

def unique_columns2(data):
    dt = np.dtype((np.void, data.dtype.itemsize * data.shape[0]))
    dataf = np.asfortranarray(data).view(dt)
    u, uind = np.unique(dataf, return_inverse=True)
    u = u.view(data.dtype).reshape(-1, data.shape[0]).T
    return (u, uind)
Our result:
u,uind = unique_columns2(A)
u
array([[0, 1, 1],
       [0, 1, 2],
       [0, 1, 3]])
uind
array([1, 2, 2, 0, 1, 2])
I am not really sure what you want to do from here; for example, you can do something like this:
>>> [np.where(uind==x)[0] for x in range(u.shape[0])]
[array([3]), array([0, 4]), array([1, 2, 5])]
Some timings:
tmp = np.random.randint(0, 4, (30000, 500))

# BiRico's and OP's answer
%timeit unique_columns(tmp)
1 loops, best of 3: 2.91 s per loop

%timeit unique_columns2(tmp)
1 loops, best of 3: 208 ms per loop
Here is an outline of how to approach it. Use numpy.lexsort to sort the columns, that way all the duplicates will be grouped together. Once the duplicates are all together, you can easily tell which columns are duplicates and the indices that correspond with those columns.
Here's an implementation of the method described above.
import numpy as np

def duplicate_columns(data, minoccur=2):
    ind = np.lexsort(data)
    diff = np.any(data.T[ind[1:]] != data.T[ind[:-1]], axis=1)
    edges = np.where(diff)[0] + 1
    result = np.split(ind, edges)
    result = [group for group in result if len(group) >= minoccur]
    return result

A = np.array([[1, 1, 1, 0, 1, 1],
              [1, 2, 2, 0, 1, 2],
              [1, 3, 3, 0, 1, 3]])

print(duplicate_columns(A))
# [array([0, 4]), array([1, 2, 5])]
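As a hedged alternative sketch: NumPy 1.13 and later support np.unique with an axis argument, which avoids the void-dtype view trick shown earlier:

import numpy as np

A = np.array([[1, 1, 1, 0, 1, 1],
              [1, 2, 2, 0, 1, 2],
              [1, 3, 3, 0, 1, 3]])

u, uind = np.unique(A, axis=1, return_inverse=True)
uind = np.ravel(uind)  # flatten, in case the inverse comes back 2-D on some versions

# Group the original column indices by the unique column they map to,
# keeping only groups with at least two members (the duplicates).
groups = [np.where(uind == i)[0] for i in range(u.shape[1])]
print([g for g in groups if len(g) >= 2])
# [array([0, 4]), array([1, 2, 5])]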
