Remove subseries (rows in a data frame) which meet a condition - python

I have a data frame with a time series (column 1) and a column of values (column 2), which are features of each subseries of the time series.
How can I remove the subseries which meet a condition?
The picture illustrates what I want to do. I want to remove the orange rows:
I tried using a loop to create an additional column with flags indicating which rows to remove, but this solution is very computationally expensive (I have 10 million records in a column). Code (slow solution):
import numpy as np
import pandas as pd

# sample data (smaller than actual df)
# length of df = 100; should be 10000000 in the actual data frame
time_ser = 100*[25]
max_num = 20
distance = np.random.uniform(0, max_num, 100)
to_remove = 100*[np.nan]
data_dict = {'time_ser': time_ser,
             'distance': distance,
             'to_remove': to_remove
             }
df = pd.DataFrame(data_dict)

subser_size = 3
maxdist = 18

# loop which creates an additional column which indicates which indexes should be removed.
# Takes first value in a subseries and checks if it meets the condition.
# If it does, all values in subseries (i.e. rows) should be removed ('wrong').
for i, d in zip(range(len(df)), df.distance):
    if d >= maxdist:
        df.to_remove.iloc[i:i+subser_size] = 'wrong'
    else:
        df.to_remove.iloc[i] = 'good'

You can use a list comprehension to create arrays of indexes, then numpy.concatenate to join them and numpy.unique to remove duplicates.
Then use drop, or loc if you need a new column instead:
np.random.seed(123)

time_ser = 100*[25]
max_num = 20
distance = np.random.uniform(0, max_num, 100)
to_remove = 100*[np.nan]
data_dict = {'time_ser': time_ser,
             'distance': distance,
             'to_remove': to_remove
             }
df = pd.DataFrame(data_dict)
print (df)
distance time_ser to_remove
0 13.929384 25 NaN
1 5.722787 25 NaN
2 4.537029 25 NaN
3 11.026295 25 NaN
4 14.389379 25 NaN
5 8.462129 25 NaN
6 19.615284 25 NaN
7 13.696595 25 NaN
8 9.618638 25 NaN
9 7.842350 25 NaN
10 6.863560 25 NaN
11 14.580994 25 NaN
subser_size = 3
maxdist = 18
print (df.index[df['distance'] >= maxdist])
Int64Index([6, 38, 47, 84, 91], dtype='int64')
arr = [np.arange(i, min(i+subser_size,len(df))) for i in df.index[df['distance'] >= maxdist]]
idx = np.unique(np.concatenate(arr))
print (idx)
[ 6 7 8 38 39 40 47 48 49 84 85 86 91 92 93]
df = df.drop(idx)
print (df)
distance time_ser to_remove
0 13.929384 25 NaN
1 5.722787 25 NaN
2 4.537029 25 NaN
3 11.026295 25 NaN
4 14.389379 25 NaN
5 8.462129 25 NaN
9 7.842350 25 NaN
10 6.863560 25 NaN
11 14.580994 25 NaN
...
...
If you need the values in a column instead:
df['to_remove'] = 'good'
df.loc[idx, 'to_remove'] = 'wrong'
print (df)
distance time_ser to_remove
0 13.929384 25 good
1 5.722787 25 good
2 4.537029 25 good
3 11.026295 25 good
4 14.389379 25 good
5 8.462129 25 good
6 19.615284 25 wrong
7 13.696595 25 wrong
8 9.618638 25 wrong
9 7.842350 25 good
10 6.863560 25 good
11 14.580994 25 good
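As a side note (this sketch is not from the original answer), the same marking can be done without building an explicit index array by OR-ing a few shifted boolean masks, which stays cheap even for 10 million rows; it assumes the same df, maxdist and subser_size as above and a default RangeIndex:
import numpy as np

# Sketch only: any row within subser_size rows after a row with distance >= maxdist
# gets marked 'wrong'; the loop runs subser_size times, not once per row.
bad_start = df['distance'] >= maxdist
mask = bad_start.copy()
for k in range(1, subser_size):
    mask |= bad_start.shift(k, fill_value=False)

df['to_remove'] = np.where(mask, 'wrong', 'good')
# or, to drop the marked rows instead:
# df = df[~mask]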

Related

pandas: Create new column by comparing DataFrame rows with columns of another DataFrame

Assume I have df1:
import pandas as pd

df1 = pd.DataFrame({'alligator_apple': range(1, 11),
                    'barbadine': range(11, 21),
                    'capulin_cherry': range(21, 31)})
alligator_apple barbadine capulin_cherry
0 1 11 21
1 2 12 22
2 3 13 23
3 4 14 24
4 5 15 25
5 6 16 26
6 7 17 27
7 8 18 28
8 9 19 29
9 10 20 30
And a df2:
df2 = pd.DataFrame({'alligator_apple': [6, 7, 15, 5],
                    'barbadine': [3, 19, 25, 12],
                    'capulin_cherry': [1, 9, 15, 27]})
alligator_apple barbadine capulin_cherry
0 6 3 1
1 7 19 9
2 15 25 15
3 5 12 27
I'm looking for a way to create a new column in df2 that holds, for each of its rows, the number of rows in df1 whose values are greater than their counterparts in that row of df2 across all columns. For example:
alligator_apple barbadine capulin_cherry greater
0 6 3 1 4
1 7 19 9 1
2 15 25 15 0
3 5 12 27 3
To elaborate, at row 0 of df2, df1.alligator_apple has 4 rows whose values are higher than df2.alligator_apple's value of 6. df1.barbadine has 10 rows whose values are higher than df2.barbadine's value of 3, and similarly df1.capulin_cherry has 10 rows.
Finally, an 'and' across all the aforementioned conditions gives the number 4 in df2.greater for the first row. Repeat for the rest of the rows in df2.
Is there a simple way to do this?
I believe this does what you want:
df2['greater'] = df2.apply(
    lambda row:
        (df1['alligator_apple'] > row['alligator_apple']) &
        (df1['barbadine'] > row['barbadine']) &
        (df1['capulin_cherry'] > row['capulin_cherry']),
    axis=1,
).sum(axis=1)
print(df2)
output:
alligator_apple barbadine capulin_cherry greater
0 6 3 1 4
1 7 19 9 1
2 15 25 15 0
3 5 12 27 3
Edit: if you want to generalize and apply this logic to a given set of columns, we can use functools.reduce together with operator.and_:
import functools
import operator

columns = ['alligator_apple', 'barbadine', 'capulin_cherry']
df2['greater'] = df2.apply(
    lambda row: functools.reduce(
        operator.and_,
        (df1[column] > row[column] for column in columns),
    ),
    axis=1,
).sum(axis=1)
There's a general solution to this that should work well.
def gt_mask(row, df):
    mask = True
    for key, val in row.items():
        mask &= df[key] > val
    return len(df[mask])

df2['greater'] = df2.apply(gt_mask, df=df1, axis=1)
Output df2
,alligator_apple,barbadine,capulin_cherry,greater
0,6,3,1,4
1,7,19,9,1
2,15,25,15,0
3,5,12,27,3
This creates a mask by iterating through the key/val pairs of a given row.
Edit: this answer was a big help: Masking a DataFrame on multiple column conditions - inside a loop
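For illustration (a hypothetical snippet, not part of the original answer), this is the mask gt_mask builds for the first row of df2, whose values are 6, 3 and 1:
# Mask for df2's first row: count rows of df1 that exceed all three values.
mask = (df1['alligator_apple'] > 6) & (df1['barbadine'] > 3) & (df1['capulin_cherry'] > 1)
print(len(df1[mask]))  # 4, which becomes df2['greater'] at row 0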

pandas - iterate over rows and calculate - faster

I already have a solution, but it is very slow (13 minutes for 800 rows). Here is an example of the dataframe:
import pandas as pd
d = {'col1': [20,23,40,41,48,49,50,50], 'col2': [39,32,42,50,63,68,68,69]}
df = pd.DataFrame(data=d)
df
In a new column, I want to calculate how many of the previous values (for example, three) of col2 are greater than or equal to the row value of col1. I also just skip the first rows.
This is my slow code:
start_at_nr = 3  # row at which to start calculating
df["overlap_count"] = ""  # create new column
for row in range(len(df)):
    if row <= start_at_nr - 1:
        df["overlap_count"].loc[row] = "x"
    else:
        df["overlap_count"].loc[row] = (
            df["col2"].loc[row - start_at_nr:row - 1] >=
            (df["col1"].loc[row])).sum()
df
I would like to obtain a faster solution - thank you for your time!
This is the result I obtain:
col1 col2 overlap_count
0 20 39 x
1 23 32 x
2 40 42 x
3 41 50 1
4 48 63 1
5 49 68 2
6 50 68 3
7 50 69 3
IIUC, you can do:
import numpy as np

df['overlap_count'] = 0
for i in range(1, start_at_nr+1):
    df['overlap_count'] += df['col1'].le(df['col2'].shift(i))

# mask the first few rows
df.iloc[:start_at_nr, -1] = np.nan
Output:
col1 col2 overlap_count
0 20 39 NaN
1 23 32 NaN
2 40 42 NaN
3 41 50 1.0
4 48 63 1.0
5 49 68 2.0
6 50 68 3.0
7 50 69 3.0
Takes about 11 ms for 800 rows and start_at_nr=3.
You basically compare the current value of col1 to the previous 3 rows of col2, starting the comparison from row 3. You may use shift as follows:
n = 3
s = ((pd.concat([df.col2.shift(x) for x in range(1,n+1)], axis=1) >= df.col1.values[:,None])
     .sum(1)[3:])
or
s = (pd.concat([df.col2.shift(x) for x in range(1,n+1)], axis=1).ge(df.col1,axis=0)
     .sum(1)[3:])
Out[65]:
3 1
4 1
5 2
6 3
7 3
dtype: int64
To get your desired output, assign it back to df and fillna
n = 3
s = (pd.concat([df.col2.shift(x) for x in range(1,n+1)], axis=1).ge(df.col1,axis=0)
     .sum(1)[3:])
df_final = df.assign(overlap_count=s).fillna('x')
Out[68]:
col1 col2 overlap_count
0 20 39 x
1 23 32 x
2 40 42 x
3 41 50 1
4 48 63 1
5 49 68 2
6 50 68 3
7 50 69 3
You could do it with .apply() in a single statement as follows. I have used a convenience function process_row(), which is also included below.
df.assign(OVERLAP_COUNT = (df.reset_index(drop=False).rename(
    columns={'index': 'ID'})).apply(
        lambda x: process_row(x, df, offset=3), axis=1))
For More Speed:
In case you need more speed and are processing a lot of rows, you may consider using swifter library. All you have to do is:
install swifter: pip install swifter.
import the library as import swifter.
replace any .apply() with .swifter.apply() in the code-block above.
Solution in Detail
#!pip install -U swifter
#import swifter
import numpy as np
import pandas as pd
d = {'col1': [20,23,40,41,48,49,50,50], 'col2': [39,32,42,50,63,68,68,69]}
df = pd.DataFrame(data=d)
def process_row(x, df, offset=3):
    value = (df.loc[x.ID - offset:x.ID - 1, 'col2'] >= df.loc[x.ID, 'col1']).sum() if (x.ID >= offset) else 'x'
    return value

# Use df.swifter.apply() for faster processing, instead of df.apply()
df.assign(OVERLAP_COUNT = (df.reset_index(drop=False, inplace=False).rename(
    columns={'index': 'ID'}, inplace=False)).apply(
        lambda x: process_row(x, df, offset=3), axis=1))
Output:
col1 col2 OVERLAP_COUNT
0 20 39 x
1 23 32 x
2 40 42 x
3 41 50 1
4 48 63 1
5 49 68 2
6 50 68 3
7 50 69 3

Comparing rows values using shift function

I'm learning pandas and I came across the following method to compare rows in a dataframe.
Here I'm using the np.where and shift() functions to compare values within a column.
import pandas as pd
import numpy as np
# Initialise data to Dicts of series.
d = {'col' : pd.Series([10, 30, 20, 40, 70, 60])}
# creates Dataframe.
df = pd.DataFrame(d)
df['Relation'] = np.where(df['col'] > df['col'].shift(), "Grater", "Less")
df
Here the output appears as follows:
col Relation
0 10 Less
1 30 Grater
2 20 Less
3 40 Grater
4 70 Grater
5 60 Less
I am confused about row 3: why is it appearing as Grater? 40 is less than 70, so it should appear as Less. What am I doing wrong here?
Because 40 is compared with 20, since shift() moves the index by 1:
df['Relation'] = np.where(df['col'] > df['col'].shift(), "Grater", "Less")
df['shifted'] = df['col'].shift()
df['m'] = df['col'] > df['col'].shift()
print (df)
col Relation shifted m
0 10 Less NaN False
1 30 Grater 10.0 True
2 20 Less 30.0 False
3 40 Grater 20.0 True <- here
4 70 Grater 40.0 True
5 60 Less 70.0 False
Maybe you want to shift by -1:
df['Relation'] = np.where(df['col'] > df['col'].shift(-1), "Grater", "Less")
df['shifted'] = df['col'].shift(-1)
df['m'] = df['col'] > df['col'].shift(-1)
print (df)
col Relation shifted m
0 10 Less 30.0 False
1 30 Grater 20.0 True
2 20 Less 40.0 False
3 40 Less 70.0 False
4 70 Grater 60.0 True
5 60 Less NaN False

Merge two CSV's with unique columns in python

I have two CSV files representing data from two different years. I know how to do the basic merging using csvwriter and dictkeys, but the problem lies here: while the CSVs have mostly shared column headers, each may have unique columns. If a species was caught in one year but not the other, that column would only be present in that year. How can I merge the new data to the old data, creating new columns and padding the old data with zero in those columns?
File 1: "Date","Time","Species A","Species B", "Species X"
File 2: "Date","Time", "Species A", "Species B", "Species C"
I need the end result to be one csv with this header:
"Date","Time","Species A","Species B", "Species C", "Species X"
Someone else will probably post a solution using the csv module, so I'll give a pandas solution for comparison purposes:
import pandas as pd
df1 = pd.read_csv("fish1.csv")
df2 = pd.read_csv("fish2.csv")
df = pd.concat([df1, df2]).fillna(0)
df = df[["Date", "Time"] + list(df.columns[1:-1])]
df.to_csv("merged_fish.csv", index=False)
Explanation:
First, we read in the two files:
>>> df1 = pd.read_csv("fish1.csv")
>>> df2 = pd.read_csv("fish2.csv")
>>> df1
Date Time Species A Species B Species X
0 1 2 3 4 5
1 6 7 8 9 10
2 11 12 13 14 15
>>> df2
Date Time Species A Species B Species C
0 16 17 18 19 20
1 21 22 23 24 25
2 26 27 28 29 30
Then we simply concatenate them, which automatically fills the missing data with NaN:
>>> df = pd.concat([df1, df2])
>>> df
Date Species A Species B Species C Species X Time
0 1 3 4 NaN 5 2
1 6 8 9 NaN 10 7
2 11 13 14 NaN 15 12
0 16 18 19 20 NaN 17
1 21 23 24 25 NaN 22
2 26 28 29 30 NaN 27
You want them filled with 0 instead, so:
>>> df = pd.concat([df1, df2]).fillna(0)
>>> df
Date Species A Species B Species C Species X Time
0 1 3 4 0 5 2
1 6 8 9 0 10 7
2 11 13 14 0 15 12
0 16 18 19 20 0 17
1 21 23 24 25 0 22
2 26 28 29 30 0 27
This order isn't quite the one you asked for, though; you wanted Date and Time first, so:
>>> df = df[["Date", "Time"] + list(df.columns[1:-1])]
>>> df
Date Time Species A Species B Species C Species X
0 1 2 3 4 0 5
1 6 7 8 9 0 10
2 11 12 13 14 0 15
0 16 17 18 19 20 0
1 21 22 23 24 25 0
2 26 27 28 29 30 0
And then we save it as a CSV file:
>>> df.to_csv("merged_fish.csv", index=False)
producing
Date,Time,Species A,Species B,Species C,Species X
1,2,3,4,0.0,5.0
6,7,8,9,0.0,10.0
11,12,13,14,0.0,15.0
16,17,18,19,20.0,0.0
21,22,23,24,25.0,0.0
26,27,28,29,30.0,0.0
Here's a csv module solution in Python 3:
import csv
# Generate some data...
csv1 = '''\
Date,Time,Species A,Species B,Species C
04/01/2012,13:00,1,2,3
04/02/2012,13:00,1,2,3
04/03/2012,13:00,1,2,3
04/04/2012,13:00,1,2,3
'''
csv2 = '''\
Date,Time,Species A,Species B,Species X
04/01/2013,13:00,1,2,3
04/02/2013,13:00,1,2,3
04/03/2013,13:00,1,2,3
04/04/2013,13:00,1,2,3
'''
with open('2012.csv','w') as f:
    f.write(csv1)
with open('2013.csv','w') as f:
    f.write(csv2)
# The actual program
years = ['2012.csv','2013.csv']
lines = []
headers = set()
for year in years:
    with open(year,'r',newline='') as f:
        r = csv.DictReader(f)
        lines.extend(list(r))                  # Merge lines from all files.
        headers = headers.union(r.fieldnames)  # Collect unique column names.
# Sort the unique headers keeping Date,Time columns first.
new_headers = ['Date','Time'] + sorted(headers - set(['Date','Time']))
with open('result.csv','w',newline='') as f:
    # The 3rd parameter is the default if the key isn't present.
    w = csv.DictWriter(f,new_headers,0)
    w.writeheader()
    w.writerows(lines)
# View the result
with open('result.csv') as f:
    print(f.read())
Output:
Date,Time,Species A,Species B,Species C,Species X
04/01/2012,13:00,1,2,3,0
04/02/2012,13:00,1,2,3,0
04/03/2012,13:00,1,2,3,0
04/04/2012,13:00,1,2,3,0
04/01/2013,13:00,1,2,0,3
04/02/2013,13:00,1,2,0,3
04/03/2013,13:00,1,2,0,3
04/04/2013,13:00,1,2,0,3
According to the docs, it looks like you should be able to read out both files, merge the keys from the 2 extracted dictionaries, then use the fieldnames and restval params on the writer to achieve your 0 defaults.
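A minimal sketch of that approach, assuming the two input files from the example above; restval=0 supplies the zero default for any column a given row lacks:
import csv

files = ['2012.csv', '2013.csv']  # assumed input file names
rows = []
fieldnames = []
for name in files:
    with open(name, newline='') as f:
        reader = csv.DictReader(f)
        rows.extend(reader)
        # merge headers, keeping first-seen column order
        for col in reader.fieldnames:
            if col not in fieldnames:
                fieldnames.append(col)

with open('merged.csv', 'w', newline='') as f:
    # restval=0 is written for columns missing from a given row
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval=0)
    writer.writeheader()
    writer.writerows(rows)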

Pandas column addition/subtraction

I am using a pandas/python dataframe. I am trying to do a lag subtraction.
I am currently using:
newCol = df.col - df.col.shift()
This leads to a NaN in the first spot:
NaN
45
63
23
...
First question: Is this the best way to do a subtraction like this?
Second: If I want to add a column (same number of rows) to this new column, is there a way that I can make all the NaNs 0s for the calculation?
Ex:
col_1 =
Nan
45
63
23
col_2 =
10
10
10
10
new_col =
10
55
73
33
and NOT
NaN
55
73
33
Thank you.
I think your method of computing lags is just fine:
import pandas as pd
df = pd.DataFrame(range(4), columns = ['col'])
print(df['col'] - df['col'].shift())
# 0 NaN
# 1 1
# 2 1
# 3 1
# Name: col
print(df['col'] + df['col'].shift())
# 0 NaN
# 1 1
# 2 3
# 3 5
# Name: col
If you wish NaN plus (or minus) a number to be the number (not NaN), use the add (or sub) method with fill_value = 0:
print(df['col'].sub(df['col'].shift(), fill_value = 0))
# 0 0
# 1 1
# 2 1
# 3 1
# Name: col
print(df['col'].add(df['col'].shift(), fill_value = 0))
# 0 0
# 1 1
# 2 3
# 3 5
# Name: col
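Applied to the numbers in the question (a small sketch, with the two columns written out as plain Series), adding col_2 while treating the leading NaN as 0:
import numpy as np
import pandas as pd

col_1 = pd.Series([np.nan, 45, 63, 23])  # e.g. the lagged difference from above
col_2 = pd.Series([10, 10, 10, 10])
new_col = col_1.add(col_2, fill_value=0)
print(new_col.tolist())  # [10.0, 55.0, 73.0, 33.0]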
