Assume I have df1:
import pandas as pd

df1 = pd.DataFrame({'alligator_apple': range(1, 11),
                    'barbadine': range(11, 21),
                    'capulin_cherry': range(21, 31)})
   alligator_apple  barbadine  capulin_cherry
0                1         11              21
1                2         12              22
2                3         13              23
3                4         14              24
4                5         15              25
5                6         16              26
6                7         17              27
7                8         18              28
8                9         19              29
9               10         20              30
And a df2:
df2 = pd.DataFrame({'alligator_apple': [6, 7, 15, 5],
                    'barbadine': [3, 19, 25, 12],
                    'capulin_cherry': [1, 9, 15, 27]})
   alligator_apple  barbadine  capulin_cherry
0                6          3               1
1                7         19               9
2               15         25              15
3                5         12              27
I'm looking for a way to create a new column in df2 that holds, for each row, the number of rows in df1 whose values in all columns are greater than their counterparts in that df2 row. For example:
   alligator_apple  barbadine  capulin_cherry  greater
0                6          3               1        4
1                7         19               9        1
2               15         25              15        0
3                5         12              27        3
To elaborate: at row 0 of df2, df1.alligator_apple has 4 rows whose values are higher than df2.alligator_apple's value of 6. df1.barbadine has 10 rows whose values are higher than df2.barbadine's value of 3, and similarly df1.capulin_cherry has 10 rows.
Finally, combining all of these per-column conditions with 'and' yields the 4 in df2.greater for the first row. Repeat for the rest of the rows in df2.
Is there a simple way to do this?
I believe this does what you want:
df2['greater'] = df2.apply(
    lambda row:
        (df1['alligator_apple'] > row['alligator_apple']) &
        (df1['barbadine'] > row['barbadine']) &
        (df1['capulin_cherry'] > row['capulin_cherry']),
    axis=1,
).sum(axis=1)
print(df2)
output:
   alligator_apple  barbadine  capulin_cherry  greater
0                6          3               1        4
1                7         19               9        1
2               15         25              15        0
3                5         12              27        3
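A detail worth seeing: because the lambda returns a boolean Series indexed like df1, the apply call produces a boolean DataFrame with one column per df1 row, and the trailing .sum(axis=1) counts the Trues in each row. A quick way to inspect that intermediate mask (a sketch using the same df1/df2 as above):

mask = df2.apply(
    lambda row: (df1['alligator_apple'] > row['alligator_apple']) &
                (df1['barbadine'] > row['barbadine']) &
                (df1['capulin_cherry'] > row['capulin_cherry']),
    axis=1,
)
print(mask.shape)                 # (4, 10): one boolean per (df2 row, df1 row) pair
print(mask.sum(axis=1).tolist())  # [4, 1, 0, 3]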
Edit: if you want to generalize and apply this logic for a given column set, we can use functools.reduce together with operator.and_:
import functools
import operator

columns = ['alligator_apple', 'barbadine', 'capulin_cherry']
df2['greater'] = df2.apply(
    lambda row: functools.reduce(
        operator.and_,
        (df1[column] > row[column] for column in columns),
    ),
    axis=1,
).sum(axis=1)
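As a side note, the same counts can be computed without apply at all via NumPy broadcasting; here is a sketch under the same df1/df2 (it builds a len(df2) x len(df1) x k boolean array, trading memory for speed):

import numpy as np

cols = ['alligator_apple', 'barbadine', 'capulin_cherry']
a = df1[cols].to_numpy()   # shape (10, 3)
b = df2[cols].to_numpy()   # shape (4, 3)
# For each pair (df2 row i, df1 row j): is df1 row j greater in every column?
all_greater = (a[None, :, :] > b[:, None, :]).all(axis=2)   # shape (4, 10)
df2['greater'] = all_greater.sum(axis=1)                    # [4, 1, 0, 3]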
There's a general solution to this that should work well.
def gt_mask(row, df):
    # Accumulate a boolean mask over df: True where every column value exceeds this row's.
    mask = True
    for key, val in row.items():
        mask &= df[key] > val
    return len(df[mask])

df2['greater'] = df2.apply(gt_mask, df=df1, axis=1)
Output df2:
   alligator_apple  barbadine  capulin_cherry  greater
0                6          3               1        4
1                7         19               9        1
2               15         25              15        0
3                5         12              27        3
This creates a mask by iterating through the key/value pairs of the given row.
Edit: this answer was a big help: Masking a DataFrame on multiple column conditions - inside a loop
I already have a solution, but it is very slow (13 minutes for 800 rows). Here is an example of the dataframe:
import pandas as pd
d = {'col1': [20,23,40,41,48,49,50,50], 'col2': [39,32,42,50,63,68,68,69]}
df = pd.DataFrame(data=d)
df
In a new column, I want to calculate how many of the previous values (for example, three) of col2 are greater than or equal to the row's value in col1. The first rows, which don't yet have enough previous values, should just be marked with 'x'.
This is my slow code:
start_at_nr = 3  # row at which to start calculating
df["overlap_count"] = ""  # create new column
for row in range(len(df)):
    if row <= start_at_nr - 1:
        df["overlap_count"].loc[row] = "x"
    else:
        df["overlap_count"].loc[row] = (
            df["col2"].loc[row - start_at_nr:row - 1] >=
            (df["col1"].loc[row])).sum()
df
I hope to obtain a faster solution. Thank you for your time!
This is the result I obtain:
   col1  col2 overlap_count
0    20    39             x
1    23    32             x
2    40    42             x
3    41    50             1
4    48    63             1
5    49    68             2
6    50    68             3
7    50    69             3
IIUC, you can do:
import numpy as np

df['overlap_count'] = 0
for i in range(1, start_at_nr + 1):
    # compare col1 with col2 shifted down by i rows
    df['overlap_count'] += df['col1'].le(df['col2'].shift(i))
# mask the first few rows
df.iloc[:start_at_nr, -1] = np.nan
Output:
   col1  col2  overlap_count
0    20    39            NaN
1    23    32            NaN
2    40    42            NaN
3    41    50            1.0
4    48    63            1.0
5    49    68            2.0
6    50    68            3.0
7    50    69            3.0
Takes about 11 ms for 800 rows and start_at_nr=3.
You basically compare the current value of col1 to the previous 3 rows of col2, starting the comparison from row 3. You may use shift as follows:
n = 3
s = ((pd.concat([df.col2.shift(x) for x in range(1, n+1)], axis=1) >= df.col1.values[:, None])
     .sum(1)[3:])
or
s = (pd.concat([df.col2.shift(x) for x in range(1, n+1)], axis=1).ge(df.col1, axis=0)
     .sum(1)[3:])
Out[65]:
3    1
4    1
5    2
6    3
7    3
dtype: int64
To get your desired output, assign it back to df and fillna:
n = 3
s = (pd.concat([df.col2.shift(x) for x in range(1, n+1)], axis=1).ge(df.col1, axis=0)
     .sum(1)[3:])
df_final = df.assign(overlap_count=s).fillna('x')
Out[68]:
   col1  col2 overlap_count
0    20    39             x
1    23    32             x
2    40    42             x
3    41    50             1
4    48    63             1
5    49    68             2
6    50    68             3
7    50    69             3
You could do it with .apply() in a single statement as follows. I have used a convenience function process_row(), which is also included below.
df.assign(OVERLAP_COUNT=(df.reset_index(drop=False).rename(
    columns={'index': 'ID'})).apply(
        lambda x: process_row(x, df, offset=3), axis=1))
For More Speed:
In case you need more speed and are processing a lot of rows, you may consider using the swifter library. All you have to do is:
Install swifter: pip install swifter.
Import the library: import swifter.
Replace any .apply() with .swifter.apply() in the code block above.
Solution in Detail
#!pip install -U swifter
#import swifter
import numpy as np
import pandas as pd
d = {'col1': [20,23,40,41,48,49,50,50], 'col2': [39,32,42,50,63,68,68,69]}
df = pd.DataFrame(data=d)
def process_row(x, df, offset=3):
    # Count how many of the previous `offset` values of col2 are >= this row's col1;
    # rows that don't have enough history get 'x'.
    value = (df.loc[x.ID - offset:x.ID - 1, 'col2'] >= df.loc[x.ID, 'col1']).sum() if (x.ID >= offset) else 'x'
    return value
# Use df.swifter.apply() for faster processing, instead of df.apply()
df.assign(OVERLAP_COUNT=(df.reset_index(drop=False, inplace=False).rename(
    columns={'index': 'ID'}, inplace=False)).apply(
        lambda x: process_row(x, df, offset=3), axis=1))
Output:
   col1  col2 OVERLAP_COUNT
0    20    39             x
1    23    32             x
2    40    42             x
3    41    50             1
4    48    63             1
5    49    68             2
6    50    68             3
7    50    69             3
I'm learning pandas and I came across the following method to compare rows in a dataframe.
Here I'm using the np.where and shift() functions to compare values within a column.
import pandas as pd
import numpy as np
# Initialise data as a dict of Series.
d = {'col': pd.Series([10, 30, 20, 40, 70, 60])}
# Create the DataFrame.
df = pd.DataFrame(d)
df['Relation'] = np.where(df['col'] > df['col'].shift(), "Greater", "Less")
df
The output appears as follows:
   col Relation
0   10     Less
1   30  Greater
2   20     Less
3   40  Greater
4   70  Greater
5   60     Less
I'm confused about row 3: why does it appear as Greater? 40 is less than 70, so it should appear as Less. What am I doing wrong here?
Because 40 is compared with 20: shift() shifts the index by 1, so each value is compared with the previous row, not the next one:
df['Relation'] = np.where(df['col'] > df['col'].shift(), "Greater", "Less")
df['shifted'] = df['col'].shift()
df['m'] = df['col'] > df['col'].shift()
print(df)
   col Relation  shifted      m
0   10     Less      NaN  False
1   30  Greater     10.0   True
2   20     Less     30.0  False
3   40  Greater     20.0   True   <- here
4   70  Greater     40.0   True
5   60     Less     70.0  False
Maybe you want to shift by -1 instead, which compares each value with the next row:
df['Relation'] = np.where(df['col'] > df['col'].shift(-1), "Greater", "Less")
df['shifted'] = df['col'].shift(-1)
df['m'] = df['col'] > df['col'].shift(-1)
print(df)
   col Relation  shifted      m
0   10     Less     30.0  False
1   30  Greater     20.0   True
2   20     Less     40.0  False
3   40     Less     70.0  False
4   70  Greater     60.0   True
5   60     Less      NaN  False
I have two CSV files representing data from two different years. I know how to do the basic merging using csvwriter and dict keys, but the problem lies here: while the CSVs have mostly shared column headers, each may have unique columns. If a species was caught in one year but not the other, that column is only present in that year's file. How can I merge the new data with the old, creating the new columns and padding the old data with zeros in those columns?
File 1: "Date","Time","Species A","Species B", "Species X"
File 2: "Date","Time", "Species A", "Species B", "Species C"
I need the end result to be one CSV with this header:
"Date","Time","Species A","Species B","Species C","Species X"
Someone else will probably post a solution using the csv module, so I'll give a pandas solution for comparison purposes:
import pandas as pd

df1 = pd.read_csv("fish1.csv")
df2 = pd.read_csv("fish2.csv")
df = pd.concat([df1, df2]).fillna(0)
# Note: the column selection below relies on concat having sorted the columns
# alphabetically (Date, Species ..., Time); on newer pandas, where the default
# is sort=False, you may need to list the species columns explicitly instead.
df = df[["Date", "Time"] + list(df.columns[1:-1])]
df.to_csv("merged_fish.csv", index=False)
Explanation:
First, we read in the two files:
>>> df1 = pd.read_csv("fish1.csv")
>>> df2 = pd.read_csv("fish2.csv")
>>> df1
   Date  Time  Species A  Species B  Species X
0     1     2          3          4          5
1     6     7          8          9         10
2    11    12         13         14         15
>>> df2
   Date  Time  Species A  Species B  Species C
0    16    17         18         19         20
1    21    22         23         24         25
2    26    27         28         29         30
Then we simply concatenate them, which automatically fills the missing data with NaN:
>>> df = pd.concat([df1, df2])
>>> df
   Date  Species A  Species B  Species C  Species X  Time
0     1          3          4        NaN          5     2
1     6          8          9        NaN         10     7
2    11         13         14        NaN         15    12
0    16         18         19         20        NaN    17
1    21         23         24         25        NaN    22
2    26         28         29         30        NaN    27
You want them filled with 0 instead, so:
>>> df = pd.concat([df1, df2]).fillna(0)
>>> df
   Date  Species A  Species B  Species C  Species X  Time
0     1          3          4          0          5     2
1     6          8          9          0         10     7
2    11         13         14          0         15    12
0    16         18         19         20          0    17
1    21         23         24         25          0    22
2    26         28         29         30          0    27
This order isn't quite the one you asked for, though; you wanted Date and Time first, so:
>>> df = df[["Date", "Time"] + list(df.columns[1:-1])]
>>> df
   Date  Time  Species A  Species B  Species C  Species X
0     1     2          3          4          0          5
1     6     7          8          9          0         10
2    11    12         13         14          0         15
0    16    17         18         19         20          0
1    21    22         23         24         25          0
2    26    27         28         29         30          0
And then we save it as a CSV file:
>>> df.to_csv("merged_fish.csv", index=False)
producing
Date,Time,Species A,Species B,Species C,Species X
1,2,3,4,0.0,5.0
6,7,8,9,0.0,10.0
11,12,13,14,0.0,15.0
16,17,18,19,20.0,0.0
21,22,23,24,25.0,0.0
26,27,28,29,30.0,0.0
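Note the trailing .0 values: fillna leaves any column that contained NaN as floats. If you want plain integers in the CSV, one hedged fix (assuming the species counts are all whole numbers) is to cast those columns back before saving:

# Cast the species columns back to int after filling the NaNs with 0.
species = [c for c in df.columns if c.startswith("Species")]
df[species] = df[species].astype(int)
df.to_csv("merged_fish.csv", index=False)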
Here's a csv module solution in Python 3:
import csv
# Generate some data...
csv1 = '''\
Date,Time,Species A,Species B,Species C
04/01/2012,13:00,1,2,3
04/02/2012,13:00,1,2,3
04/03/2012,13:00,1,2,3
04/04/2012,13:00,1,2,3
'''
csv2 = '''\
Date,Time,Species A,Species B,Species X
04/01/2013,13:00,1,2,3
04/02/2013,13:00,1,2,3
04/03/2013,13:00,1,2,3
04/04/2013,13:00,1,2,3
'''
with open('2012.csv', 'w') as f:
    f.write(csv1)
with open('2013.csv', 'w') as f:
    f.write(csv2)

# The actual program
years = ['2012.csv', '2013.csv']
lines = []
headers = set()
for year in years:
    with open(year, 'r', newline='') as f:
        r = csv.DictReader(f)
        lines.extend(list(r))                  # Merge lines from all files.
        headers = headers.union(r.fieldnames)  # Collect unique column names.

# Sort the unique headers, keeping the Date and Time columns first.
new_headers = ['Date', 'Time'] + sorted(headers - set(['Date', 'Time']))

with open('result.csv', 'w', newline='') as f:
    # The 3rd parameter (restval) is the default used when a key isn't present.
    w = csv.DictWriter(f, new_headers, 0)
    w.writeheader()
    w.writerows(lines)

# View the result
with open('result.csv') as f:
    print(f.read())
Output:
Date,Time,Species A,Species B,Species C,Species X
04/01/2012,13:00,1,2,3,0
04/02/2012,13:00,1,2,3,0
04/03/2012,13:00,1,2,3,0
04/04/2012,13:00,1,2,3,0
04/01/2013,13:00,1,2,0,3
04/02/2013,13:00,1,2,0,3
04/03/2013,13:00,1,2,0,3
04/04/2013,13:00,1,2,0,3
According to the docs, it looks like you should be able to read out both files, merge the keys from the two extracted dictionaries, then use the fieldnames and restval params on the writer to achieve your 0 defaults.
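A minimal sketch of that idea (assuming the hypothetical input names fish1.csv/fish2.csv used in the pandas answer above; restval=0 supplies the zero default wherever a row lacks a column):

import csv

rows = []
headers = set()
for path in ('fish1.csv', 'fish2.csv'):
    with open(path, newline='') as f:
        reader = csv.DictReader(f)
        rows.extend(reader)                # keep every row from both files
        headers.update(reader.fieldnames)  # collect the union of column names

# Date and Time first, species columns sorted after them.
fieldnames = ['Date', 'Time'] + sorted(headers - {'Date', 'Time'})

with open('merged_fish.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames, restval=0)
    writer.writeheader()
    writer.writerows(rows)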
I am using a pandas/python dataframe. I am trying to do a lag subtraction.
I am currently using:
newCol = df.col - df.col.shift()
This leads to a NaN in the first spot:
NaN
45
63
23
...
First question: is this the best way to do a subtraction like this?
Second question: if I want to add a column (with the same number of rows) to this new column, is there a way to treat all the NaNs as 0s for the calculation?
Ex:
col_1 =
NaN
45
63
23
col_2 =
10
10
10
10
new_col =
10
55
73
33
and NOT
NaN
55
73
33
Thank you.
I think your method of computing lags is just fine:
import pandas as pd
df = pd.DataFrame(range(4), columns = ['col'])
print(df['col'] - df['col'].shift())
# 0 NaN
# 1 1
# 2 1
# 3 1
# Name: col
print(df['col'] + df['col'].shift())
# 0 NaN
# 1 1
# 2 3
# 3 5
# Name: col
If you wish NaN plus (or minus) a number to be the number (not NaN), use the add (or sub) method with fill_value = 0:
print(df['col'].sub(df['col'].shift(), fill_value = 0))
# 0 0
# 1 1
# 2 1
# 3 1
# Name: col
print(df['col'].add(df['col'].shift(), fill_value = 0))
# 0 0
# 1 1
# 2 3
# 3 5
# Name: col
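Applied to the question's two-column example, the same fill_value trick turns the leading NaN into 0 before the addition (a sketch using the question's col_1/col_2 values):

import numpy as np
import pandas as pd

col_1 = pd.Series([np.nan, 45, 63, 23])  # the lagged difference, with NaN in the first spot
col_2 = pd.Series([10, 10, 10, 10])

# fill_value=0 treats the missing side as 0, so row 0 becomes 0 + 10 = 10.
new_col = col_1.add(col_2, fill_value=0)
print(new_col.tolist())  # [10.0, 55.0, 73.0, 33.0]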