Compare 2 different columns of different dataframes - python

I have 2 dataframes let's say->
df1 =>
colA colB colC
0 1 2
3 4 5
6 7 8
df2 (same number of rows and columns) =>
colD colE colF
10 11 12
13 14 15
16 17 18
I want to compare columns from both dataframes , example ->
df1['colB'] < df2['colF']
Currently I am getting ->
ValueError: Can only compare identically-labeled Series objects
EDIT: the error is raised while comparing in ->
df1.loc[df1['colB'] < df2['colF'], 'set_something'] = 1
Any help on how I can implement this? Thanks

You get the error because your Series are not aligned (and might have duplicated indices).
If you only care about position, not index labels, use the underlying NumPy array:
df1['colB'] < df2['colF'].to_numpy()
If you want to assign the result back to a column, make sure to convert the column from the other DataFrame to an array:
df1['new'] = df1['colB'] < df2['colF'].to_numpy()
Or
df2['new'] = df1['colB'].to_numpy() < df2['colF']
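The following is a minimal sketch of the positional comparison applied to the .loc assignment from the question. The index given to df2 here is an assumption, chosen only to reproduce the misalignment; the question does not show the real index.

```python
import pandas as pd

df1 = pd.DataFrame({"colA": [0, 3, 6], "colB": [1, 4, 7], "colC": [2, 5, 8]})
# Assumption: df2 has different index labels, which is one way to trigger the error.
df2 = pd.DataFrame(
    {"colD": [10, 13, 16], "colE": [11, 14, 17], "colF": [12, 15, 18]},
    index=[10, 11, 12],
)

error_message = None
try:
    # Label-based comparison fails because the two indexes differ.
    df1["colB"] < df2["colF"]
except ValueError as exc:
    error_message = str(exc)

# Positional comparison: both sides become plain arrays, so labels are ignored.
mask = df1["colB"].to_numpy() < df2["colF"].to_numpy()

# A boolean array can be used directly as the row selector in .loc.
df1.loc[mask, "set_something"] = 1
```

Both DataFrames must have the same number of rows for this to work, since rows are matched purely by position.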

This is a non-equi join; you should get better performance with some form of binary search. conditional_join from pyjanitor does that under the hood:
# pip install pyjanitor
import pandas as pd
import janitor
df1.conditional_join(df2, ('colB', 'colF', '<'))
colA colB colC colD colE colF
0 0 1 2 10 11 12
1 0 1 2 13 14 15
2 0 1 2 16 17 18
3 3 4 5 10 11 12
4 3 4 5 13 14 15
5 3 4 5 16 17 18
6 6 7 8 10 11 12
7 6 7 8 13 14 15
8 6 7 8 16 17 18
If it is based on an equality (df1.colB == df2.colF), then pd.merge should suffice and is efficient
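For comparison, here is a sketch of both kinds of join in plain pandas. The cross merge is only an illustration of what a non-equi join returns; it builds every row pair before filtering, so it is not the binary-search approach mentioned above and will not scale as well. It assumes pandas 1.2 or later for how="cross".

```python
import pandas as pd

df1 = pd.DataFrame({"colA": [0, 3, 6], "colB": [1, 4, 7], "colC": [2, 5, 8]})
df2 = pd.DataFrame({"colD": [10, 13, 16], "colE": [11, 14, 17], "colF": [12, 15, 18]})

# Non-equi join: build all row pairs, then keep those where colB < colF.
pairs = df1.merge(df2, how="cross")
pairs = pairs[pairs["colB"] < pairs["colF"]].reset_index(drop=True)

# Equi join: rows where colB equals colF (none in this sample data).
equal = df1.merge(df2, left_on="colB", right_on="colF")
```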

Related

Compare even and odd rows in a Pandas Data Frame

I have a data frame like this:
Index Time Id
0 10:10:00 11
1 10:10:01 12
2 10:10:02 12
3 10:10:04 12
4 10:10:06 13
5 10:10:07 13
6 10:10:08 11
7 10:10:10 11
8 10:10:12 11
9 10:10:14 13
I want to compare the Id column within each pair of rows: between rows 0 and 1, between rows 2 and 3, etc.
In other words, I want to compare even rows with odd rows and keep only the pairs whose Ids match.
My ideal output would be :
Index Time Id
2 10:10:02 12
3 10:10:04 12
4 10:10:06 13
5 10:10:07 13
6 10:10:08 11
7 10:10:10 11
I tried that but it did not work :
df = df[
df[::2]["id"] ==df[1::2]["id"]
]
You can use a GroupBy.transform approach:
# for each pair, is there only one kind of Id?
out = df[df.groupby(np.arange(len(df))//2)['Id'].transform('nunique').eq(1)]
Or, more efficient, using the underlying numpy array:
# convert to numpy
a = df['Id'].to_numpy()
# are the odds equal to evens?
out = df[np.repeat((a[::2]==a[1::2]), 2)]
output:
Index Time Id
2 2 10:10:02 12
3 3 10:10:04 12
4 4 10:10:06 13
5 5 10:10:07 13
6 6 10:10:08 11
7 7 10:10:10 11
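The attempt in the question fails for the same alignment reason: df[::2] and df[1::2] carry different index labels, so the two Series cannot be compared label by label. Below is a self-contained sketch of the NumPy approach on the sample data. Ignoring a trailing unpaired row when the length is odd is an added assumption, not something the question specifies.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Time": ["10:10:00", "10:10:01", "10:10:02", "10:10:04", "10:10:06",
             "10:10:07", "10:10:08", "10:10:10", "10:10:12", "10:10:14"],
    "Id": [11, 12, 12, 12, 13, 13, 11, 11, 11, 13],
})

a = df["Id"].to_numpy()
n = len(a) - len(a) % 2          # number of rows that belong to a full pair

# One boolean per pair: does the even row match the odd row?
pair_equal = a[:n:2] == a[1:n:2]

# Expand back to one boolean per row; an unpaired last row stays False.
keep = np.zeros(len(a), dtype=bool)
keep[:n] = np.repeat(pair_equal, 2)

out = df[keep]
```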

Concatenating two Pandas DataFrames while maintaining index order

Basic question - I am trying to concatenate two DataFrames, with the resulting DataFrame preserving the index in order of the original two. For example:
df = pd.DataFrame({'Houses':[10,20,30,40,50], 'Cities':[3,4,7,6,1]}, index = [1,2,4,6,8])
df2 = pd.DataFrame({'Houses':[15,25,35,45,55], 'Cities':[1,8,11,14,4]}, index = [0,3,5,7,9])
Using pd.concat([df, df2]) simply appends df2 to the end of df. I am trying instead to concatenate them so that the result is in correct index order (0 through 9).
Use concat with the sort parameter to avoid a warning, then DataFrame.sort_index:
df = pd.concat([df, df2], sort=False).sort_index()
print(df)
Cities Houses
0 1 15
1 3 10
2 4 20
3 8 25
4 7 30
5 11 35
6 6 40
7 14 45
8 1 50
9 4 55
Try using:
print(df.T.join(df2.T).T.sort_index())
Output:
Cities Houses
0 1 15
1 3 10
2 4 20
3 8 25
4 7 30
5 11 35
6 6 40
7 14 45
8 1 50
9 4 55

Replace by previous values

I have a dataframe like the one shown below. The goal of this program is to replace a specific value with the previous one.
import pandas as pd
test = pd.DataFrame([2,2,3,1,1,2,4,6,43,23,4,1,3,3,1,1,1,4,5], columns = ['A'])
obtaining:
If one wants to replace every 1 with the previous value, a possible solution is:
for li in test[test['A'] == 1].index:
    test['A'].iloc[li] = test['A'].iloc[li-1]
However, it is very inefficient. Can you suggest a more efficient solution?
IIUC, replace 1 with np.nan, then ffill:
test.replace(1,np.nan).ffill().astype(int)
Out[881]:
A
0 2
1 2
2 3
3 3
4 3
5 2
6 4
7 6
8 43
9 23
10 4
11 4
12 3
13 3
14 3
15 3
16 3
17 4
18 5
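An equivalent sketch that works on the column directly and avoids the NumPy import uses Series.mask followed by ffill. It assumes the first value is not 1; otherwise a NaN would remain at the start and the final astype(int) would fail.

```python
import pandas as pd

test = pd.DataFrame(
    [2, 2, 3, 1, 1, 2, 4, 6, 43, 23, 4, 1, 3, 3, 1, 1, 1, 4, 5],
    columns=["A"],
)

col = test["A"]
# mask() turns every 1 into NaN, ffill() copies the last valid value forward,
# and astype(int) restores the integer dtype lost when NaN was introduced.
result = col.mask(col.eq(1)).ffill().astype(int)
```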

I want to get the relative index of a column in a pandas dataframe

I want to make a new column with the 5-day return for a stock, let's say. I am using a pandas dataframe. I computed a moving average using the rolling_mean function, but I'm not sure how to reference rows the way I would in a spreadsheet (B6-B1, for example). Does anyone know how I can do this index reference and subtraction?
sample data frame:
day price 5-day-return
1 10 -
2 11 -
3 15 -
4 14 -
5 12 -
6 18 i want to find this ((day 5 price) -(day 1 price) )
7 20 then continue this down the list
8 19
9 21
10 22
Is this what you want?
In [10]:
df['5-day-return'] = (df['price'] - df['price'].shift(5)).fillna(0)
df
Out[10]:
day price 5-day-return
0 1 10 0
1 2 11 0
2 3 15 0
3 4 14 0
4 5 12 0
5 6 18 8
6 7 20 9
7 8 19 4
8 9 21 7
9 10 22 10
shift returns the column moved by a given offset; subtracting the shifted column from the original gives the difference between each row and the row 5 positions before it. fillna fills the NaN values that occur before the first valid calculation.
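As a sketch of the same calculation, Series.diff with a period of 5 gives the same numbers as subtracting the shifted column. The sample prices are taken from the question; the leading NaN values are left in place here rather than filled with 0.

```python
import pandas as pd

df = pd.DataFrame({
    "day": range(1, 11),
    "price": [10, 11, 15, 14, 12, 18, 20, 19, 21, 22],
})

# Explicit form: current price minus the price 5 rows earlier.
by_shift = df["price"] - df["price"].shift(5)

# Shorthand: diff(5) performs the same subtraction.
by_diff = df["price"].diff(5)
```

Leaving the first five rows as NaN keeps "no value yet" distinct from a genuine return of 0.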

Efficiently adding calculated rows based on index values to a pandas DataFrame

I have a pandas DataFrame in the following format:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
I want to append a calculated row that performs some math based on each item's index value, e.g. adding a row that sums the values of all items with an index value < 2, with the new row having an index label of 'Red'. Ultimately, I am trying to add three rows that group the index values into categories:
A row with the sum of item values where index values are < 2, labeled as 'Red'
A row with the sum of item values where index values are 1 < x < 4, labeled as 'Blue'
A row with the sum of item values where index values are > 3, labeled as 'Green'
Ideal output would look like this:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
Red 3 5 7
Blue 15 17 19
Green 27 29 31
My current solution involves transposing the DataFrame, applying a map function for each calculated column and then re-transposing, but I would imagine pandas has a more efficient way of doing this, likely using .append().
EDIT:
My inelegant pre-set list solution (originally used .transpose(), but I improved it using .groupby() and .append()):
df = pd.DataFrame(np.arange(18).reshape((6,3)),columns=['a', 'b', 'c'])
df['x'] = ['Red', 'Red', 'Blue', 'Blue', 'Green', 'Green']
df2 = df.groupby('x').sum()
df = df.append(df2)
del df['x']
I much prefer the flexibility of BrenBarn's answer (see below).
Here is one way:
def group(ix):
    if ix < 2:
        return "Red"
    elif 2 <= ix < 4:
        return "Blue"
    else:
        return "Green"
>>> print d
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
>>> print d.append(d.groupby(d.index.to_series().map(group)).sum())
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
Blue 15 17 19
Green 27 29 31
Red 3 5 7
For the general case, you need to define a function (or dict) to handle the mapping to different groups. Then you can just use groupby and its usual abilities.
For your particular case, it can be done more simply by directly slicing on the index value as Dan Allan showed, but that will break down if you have a more complex case where the groups you want are not simply definable in terms of contiguous blocks of rows. The method above will also easily extend to situations where the groups you want to create are not based on the index but on some other column (i.e., group together all rows whose value in column X is within range 0-10, or whatever).
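DataFrame.append was removed in pandas 2.0, so on current versions the same idea can be written with pd.concat. The sketch below reuses the group function from this answer; reordering the totals with .loc to match the Red, Blue, Green order from the question is an addition, not part of the original answer.

```python
import numpy as np
import pandas as pd

def group(ix):
    # Map an index value to its category label.
    if ix < 2:
        return "Red"
    elif 2 <= ix < 4:
        return "Blue"
    else:
        return "Green"

d = pd.DataFrame(np.arange(18).reshape((6, 3)), columns=["a", "b", "c"])

# Sum the rows of each category; groupby sorts the labels alphabetically.
totals = d.groupby(d.index.to_series().map(group)).sum()

# Put the totals in the order asked for in the question.
totals = totals.loc[["Red", "Blue", "Green"]]

# pd.concat replaces the removed DataFrame.append.
result = pd.concat([d, totals])
```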
The role of "transpose," which you say you used in your unshown solution, might be played more naturally by the orient keyword argument, which is available when you construct a DataFrame from a dictionary.
In [23]: df
Out[23]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
In [24]: dict = {'Red': df.loc[:1].sum(),
                 'Blue': df.loc[2:3].sum(),
                 'Green': df.loc[4:].sum()}
In [25]: DataFrame.from_dict(dict, orient='index')
Out[25]:
a b c
Blue 15 17 19
Green 27 29 31
Red 3 5 7
In [26]: df.append(_)
Out[26]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
Blue 15 17 19
Green 27 29 31
Red 3 5 7
Based on the numbers in your example, I assume that by "> 4" you actually meant ">= 4".
