pandas - iterate over rows and calculate - faster - python

I already have a solution, but it is very slow (13 minutes for 800 rows). Here is an example of the dataframe:
import pandas as pd
d = {'col1': [20,23,40,41,48,49,50,50], 'col2': [39,32,42,50,63,68,68,69]}
df = pd.DataFrame(data=d)
df
In a new column, I want to calculate how many of the previous values (for example, the previous three) of col2 are greater than or equal to the value of col1 in the current row. I also skip the first rows, which do not have enough previous values, and mark them with "x".
This is my slow code:
start_at_nr = 3  # row at which to start calculating
df["overlap_count"] = ""  # create new column
for row in range(len(df)):
    if row <= start_at_nr - 1:
        df["overlap_count"].loc[row] = "x"
    else:
        df["overlap_count"].loc[row] = (
            df["col2"].loc[row - start_at_nr:row - 1] >=
            df["col1"].loc[row]).sum()
df
I hope to obtain a faster solution - thank you for your time!
This is the result I obtain:
col1 col2 overlap_count
0 20 39 x
1 23 32 x
2 40 42 x
3 41 50 1
4 48 63 1
5 49 68 2
6 50 68 3
7 50 69 3

IIUC, you can do:
import numpy as np

df['overlap_count'] = 0
for i in range(1, start_at_nr + 1):
    df['overlap_count'] += df['col1'].le(df['col2'].shift(i))

# mask the first few rows
df.iloc[:start_at_nr, -1] = np.nan
Output:
col1 col2 overlap_count
0 20 39 NaN
1 23 32 NaN
2 40 42 NaN
3 41 50 1.0
4 48 63 1.0
5 49 68 2.0
6 50 68 3.0
7 50 69 3.0
Takes about 11 ms for 800 rows and start_at_nr=3.
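If you would rather keep the "x" markers (and integer counts) from your desired output instead of NaN, one option is to replace the masking line with something like this sketch:
# Alternative to the np.nan masking above: keep integer counts and use
# the 'x' markers shown in the question's desired output.
df['overlap_count'] = df['overlap_count'].astype(object)
df.iloc[:start_at_nr, df.columns.get_loc('overlap_count')] = 'x'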

You basically compare the current value of col1 to the previous 3 rows of col2, starting the comparison from row 3. You may use shift as follows:
n = 3
s = ((pd.concat([df.col2.shift(x) for x in range(1,n+1)], axis=1) >= df.col1.values[:,None])
     .sum(1)[3:])
or
s = (pd.concat([df.col2.shift(x) for x in range(1,n+1)], axis=1).ge(df.col1, axis=0)
     .sum(1)[3:])
Out[65]:
3 1
4 1
5 2
6 3
7 3
dtype: int64
To get your desired output, assign it back to df and fill the missing values with fillna:
n = 3
s = (pd.concat([df.col2.shift(x) for x in range(1,n+1)], axis=1).ge(df.col1, axis=0)
     .sum(1)[3:])
df_final = df.assign(overlap_count=s).fillna('x')
Out[68]:
col1 col2 overlap_count
0 20 39 x
1 23 32 x
2 40 42 x
3 41 50 1
4 48 63 1
5 49 68 2
6 50 68 3
7 50 69 3

You could do it with .apply() in a single statement as follows. I have used a convenience function process_row(), which is also included below.
df.assign(OVERLAP_COUNT=(df.reset_index(drop=False).rename(
    columns={'index': 'ID'})).apply(
        lambda x: process_row(x, df, offset=3), axis=1))
For More Speed:
In case you need more speed and are processing a lot of rows, you may consider using the swifter library. All you have to do is:
install swifter: pip install swifter.
import the library with import swifter.
replace any .apply() with .swifter.apply() in the code block above (see the sketch below).
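For reference, the .apply() call from the code block above would then read like this (a sketch, assuming swifter has been installed and imported):
df.assign(OVERLAP_COUNT=(df.reset_index(drop=False).rename(
    columns={'index': 'ID'})).swifter.apply(
        lambda x: process_row(x, df, offset=3), axis=1))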
Solution in Detail
#!pip install -U swifter
#import swifter
import numpy as np
import pandas as pd
d = {'col1': [20,23,40,41,48,49,50,50], 'col2': [39,32,42,50,63,68,68,69]}
df = pd.DataFrame(data=d)
def process_row(x, df, offset=3):
    value = (df.loc[x.ID - offset:x.ID - 1, 'col2'] >= df.loc[x.ID, 'col1']).sum() if (x.ID >= offset) else 'x'
    return value
# Use df.swifter.apply() for faster processing, instead of df.apply()
df.assign(OVERLAP_COUNT=(df.reset_index(drop=False, inplace=False).rename(
    columns={'index': 'ID'}, inplace=False)).apply(
        lambda x: process_row(x, df, offset=3), axis=1))
Output:
col1 col2 OVERLAP_COUNT
0 20 39 x
1 23 32 x
2 40 42 x
3 41 50 1
4 48 63 1
5 49 68 2
6 50 68 3
7 50 69 3

Related

Loop sum excluding one column of pandas dataframe

I have a dataframe like the following:
import pandas as pd
df = pd.DataFrame([(1,2,3,4,5,6),
                   (1,2,3,4,5,6),
                   (1,2,3,4,5,6)], columns=['a','b','c','d','e','f'])
Out:
a b c d e f
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 1 2 3 4 5 6
and I want to get the sum of all the elements of the dataframe, always excluding one column. In this example the desired outcome would be:
60 57 54 51 48 45
I have found a solution that seems to do the job, but I'm pretty sure there must be a more efficient way to do the same:
for x in df.columns:
    df.drop(columns=x).sum().sum()
Use DataFrame.rsub to subtract the values from the row sums df.sum(axis=1) (subtraction from the right side), then call sum for the per-column totals:
s = df.rsub(df.sum(axis=1), axis=0).sum()
print (s)
a 60
b 57
c 54
d 51
e 48
f 45
dtype: int64
Simply subtract each column sum from the total sum. This is done efficiently using an assignment expression (the walrus operator), which computes df.sum() only once:
out = (s := df.sum()).sum() - s
output:
a 60
b 57
c 54
d 51
e 48
f 45
Yet another way: Subtract the column sums from the total sum of the DataFrame using broadcasting:
out = df.sum().sum() - df.sum()
Output:
a 60
b 57
c 54
d 51
e 48
f 45
dtype: int64

Pandas loop to numpy: count occurrences of string as nonzero in array

Suppose I have the following dataframe with element types in brackets
Column1(int) Column2(str) Column3(str)
0 2 02 34
1 2 34 02
2 2 80 85
3 2 91 09
4 2 09 34
When using pandas loops, I use the following code: if Column1 == 2, count how many times the value of Column2 occurs in Column3 and assign the count to Column4:
import pandas as pd
for index in df.index:
    if df.loc[index, "Column1"] == 2:
        df.loc[index, "Column4"] = df.loc[
            df.Column3 == df.loc[index, "Column2"], "Column3"
        ].count()
I am trying to use NumPy and array methods for efficiency. I have tried translating the method but no luck.
import numpy as np
# turn Column3 to array
array = df.loc[:, "Column3"].values
index = df.index
df.assign(
    Column4=lambda x: np.where(
        (x["Column1"] == 2), np.count_nonzero(array == df.loc[index, "Column2"]), "F"
    )
)
Expected output
Column1(int) Column2(str) Column3(str) Column4(int)
0 2 02 34 1
1 2 34 02 2
2 2 80 85 0
3 2 91 09 0
4 2 09 34 1
You can use pd.Series.value_counts on Column3 and use the result as a mapping for Column2: you can pass a Series object to pd.Series.map, then fill the missing values with 0 using pd.Series.fillna.
s = df['Column2'].map(df['Column3'].value_counts()).fillna(0)
df.loc[df['Column1'].eq(2), 'Column4'] = s
df['Column4'] = df['Column4'].fillna('F')
# Fills with 'F' where `Column1` is not equal to 2.
Column1 Column2 Column3 Column4
0 2 2 34 1.0
1 2 34 2 2.0
2 2 80 85 0.0
3 2 91 9 0.0
4 2 9 34 1.0
Or you can use np.where here.
s = df['Column2'].map(df['Column3'].value_counts()).fillna(0)
df['Column4'] = np.where(df['Column1'].eq(2), s, 'F')
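To try this end to end, the question's frame can be rebuilt as below (a sketch; Column2 and Column3 are kept as strings, as the bracketed dtypes in the question indicate), after which either snippet above can be applied:
import pandas as pd

# Rebuild the example from the question; Column2/Column3 are strings
df = pd.DataFrame({
    'Column1': [2, 2, 2, 2, 2],
    'Column2': ['02', '34', '80', '91', '09'],
    'Column3': ['34', '02', '85', '09', '34'],
})

s = df['Column2'].map(df['Column3'].value_counts()).fillna(0)
df.loc[df['Column1'].eq(2), 'Column4'] = s
df['Column4'] = df['Column4'].fillna('F')
print(df)  # Column4: 1.0, 2.0, 0.0, 0.0, 1.0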

How to sum columns from two different size datasets in pandas

I have two datasets. The first one (df1) contains more than 200,000 rows, and the second one (df2) only two. I need to create a new column df1['column_2'] which is the sum of df1['column_1'] and df2['column_1'].
When I try to make df1['column_2'] = df1['column_1'] + df2['column_1'] I get an error "A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead"
How can I sum values of different datasets with different amount of rows?
Will be thankful for any help!
Screenshot of my notebook: https://prnt.sc/p1d6ze
I tried your code and it works with no error, using Pandas 0.25.0
and Python 3.7.0.
If you use older versions, consider upgrading.
For the test I used df1 with 10 rows (shorter):
column_1
0 10
1 20
2 30
3 40
4 50
5 60
6 70
7 80
8 90
9 100
and df2 with 2 rows (just as in your post):
column_1
0 3
1 5
Your instruction df1['column_2'] = df1['column_1'] + df2['column_1']
gives the following result:
column_1 column_2
0 10 13.0
1 20 25.0
2 30 NaN
3 40 NaN
4 50 NaN
5 60 NaN
6 70 NaN
7 80 NaN
8 90 NaN
9 100 NaN
So that:
Elements with "overlapping" index values are summed.
Other elements (with no corresponding index in df2) are NaN.
Because of the presence of NaN values, this column is coerced to float.
Alternative form of this instruction, using .loc[...] is:
df1['column_2'] = df1.loc[:, 'column_1'] + df2.loc[:, 'column_1']
It works on my computer as well.
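If you would rather keep the df1 values where df2 has no matching index (treating the missing entries as 0 instead of getting NaN), Series.add with fill_value is one option; a minimal sketch using the frames above:
# Missing df2 positions are treated as 0 instead of producing NaN
df1['column_2'] = df1['column_1'].add(df2['column_1'], fill_value=0)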
Or maybe you want to "multiply" (replicate) df2 to the length of df1
before summing? If yes, run:
df1['column_2'] = df1.column_1 + df2.column_1.values.tolist() * 5
In this case 5 is the number of times df2 should be "multiplied".
This time no index alignment takes place and the result is:
column_1 column_2
0 10 13
1 20 25
2 30 33
3 40 45
4 50 53
5 60 65
6 70 73
7 80 85
8 90 93
9 100 105
Here, reindex is applied to the DataFrame that has fewer records than the other one (y in this example).
Subtraction:
import pandas as pd
x = pd.DataFrame([(100,200),(300,400),(100,111)], columns=['a','b'])
y = pd.DataFrame([(1,2),(3,4)], columns=['a','b'])
z = x - y.reindex_like(x).fillna(0)
Addition:
import pandas as pd
x = pd.DataFrame([(100,200),(300,400),(100,111)], columns=['a','b'])
y = pd.DataFrame([(1,2),(3,4)], columns=['a','b'])
z = x + y.reindex_like(x).fillna(0)
Multiplication:
import pandas as pd
x = pd.DataFrame([(100,200),(300,400),(100,111)], columns=['a','b'])
y = pd.DataFrame([(1,2),(3,4)], columns=['a','b'])
z = x * y.reindex_like(x).fillna(1)
I have discovered that I cannot do df_1['column_3'] = df_1['column_1'] + df_1['column_2'] if df_1 is a slice of the original dataframe df. So I have solved my question by writing a function:
def new_column(dataframe):
    if dataframe['column'] == 'value_1':
        dataframe['new_column'] = (dataframe['column_1']
                                   - df_2[df_2['column'] == 'value_1']
                                   ['column_1'].values[0])
    else:
        dataframe['new_column'] = (dataframe['column_1']
                                   - df_2[df_2['column'] == 'value_2']
                                   ['column_1'].values[0])
    return dataframe

dataframe = df_1.apply(new_column, axis=1)

Keep column order at DataFrame creation

I'd like to keep the columns in the order they were defined with pd.DataFrame. In the example below, df.info shows that GroupId is the first column and print also prints GroupId.
I'm using Python version 3.6.3
import numpy as np
import pandas as pd
df = pd.DataFrame({'Id' : np.random.randint(1,100,10),
                   'GroupId' : np.random.randint(1,5,10)})
df.info()
print(df.iloc[:,0])
One way is to use collections.OrderedDict, as below. Note that the OrderedDict object takes a list of tuples as an input.
from collections import OrderedDict
df = pd.DataFrame(OrderedDict([('Id', np.random.randint(1,100,10)),
                               ('GroupId', np.random.randint(1,5,10))]))
# Id GroupId
# 0 37 4
# 1 10 2
# 2 42 1
# 3 97 2
# 4 6 4
# 5 59 2
# 6 12 2
# 7 69 1
# 8 79 1
# 9 17 1
Unless you're using python-3.6+ where dictionaries are ordered, this just isn't possible with a (standard) dictionary. You will need to zip your items together and pass a list of tuples:
np.random.seed(0)
a = np.random.randint(1, 100, 10)
b = np.random.randint(1, 5, 10)
df = pd.DataFrame(list(zip(a, b)), columns=['Id', 'GroupId'])
Or,
data = [a, b]
df = pd.DataFrame(list(zip(*data)), columns=['Id', 'GroupId'])
df
Id GroupId
0 45 3
1 48 1
2 65 1
3 68 1
4 68 3
5 10 2
6 84 3
7 22 4
8 37 4
9 88 3
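Building on the previous answer's first sentence: on Python 3.6+ with pandas 0.23 or newer, the plain-dict constructor from the question already preserves insertion order, so the workarounds above are mainly needed on older setups. A small sketch:
import numpy as np
import pandas as pd

# On Python 3.6+ (pandas >= 0.23) dict insertion order is preserved,
# so 'Id' stays the first column.
df = pd.DataFrame({'Id': np.random.randint(1, 100, 10),
                   'GroupId': np.random.randint(1, 5, 10)})
print(df.columns.tolist())  # ['Id', 'GroupId']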

pandas split timeseries in groups

I have a pandas dataframe
>>> df = pd.DataFrame()
>>> df['a'] = np.random.choice(range(0,100), 200)
>>> df['b'] = np.random.choice([0,1], 200)
>>> df.head()
a b
0 69 1
1 49 1
2 79 1
3 88 0
4 57 0
>>>
Some of the variables (in this example 'a') have a lot of unique values.
I would like to replace 'a' with a2, where a2 has 5 unique values. In other words, I want to define 5 groups and assign each value of a to one of the groups.
For example a2=1 if 0<=df['a']<20 and a2=2 if 20<=df['a']<40 and so on.
Note:
I used groups of size 20 because 100/5 = 20
How can I do that using numpy or pandas or something else?
EDIT:
Possible solution
def group_array(a):
    a = a - a.min()
    a = 100 * a / a.max()
    a = (a.apply(int) // 20) + 1
    return a
You could use pd.cut to categorize the values in df['a']:
import pandas as pd
df = pd.DataFrame({'a':[69,49,79,88,57], 'b':[1,1,1,0,0]})
df['a2'] = pd.cut(df['a'], bins=range(0,101,20), labels=range(1,6))
print(df)
yields
a b a2
0 69 1 4
1 49 1 3
2 79 1 4
3 88 0 5
4 57 0 3
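One caveat worth noting (an observation about pd.cut defaults, not something raised in the answer): by default the bins are right-closed, so a value of exactly 0 would not fall into the first bin. If you want the half-open 0 <= a < 20 style groups described in the question, right=False is one option:
# Half-open bins [0, 20), [20, 40), ... to match the 0 <= a < 20 grouping
df['a2'] = pd.cut(df['a'], bins=range(0, 101, 20), labels=range(1, 6), right=False)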
