I have created a pandas dataframe using this code:
import numpy as np
import pandas as pd
ds = {'col1': [1,2,3,3,3,6,7,8,9,10]}
df = pd.DataFrame(data=ds)
The dataframe looks like this:
print(df)
col1
0 1
1 2
2 3
3 3
4 3
5 6
6 7
7 8
8 9
9 10
I need to create a field called col2 that contains in a list (for each record) the last 3 elements of col1 while iterating through each record. So, the resulting dataframe would look like this:
Does anyone know how to do it by any chance?
Here is a solution using rolling and list comprehension
df['col2'] = [x.tolist() for x in df['col1'].rolling(3)]
col1 col2
0 1 [1]
1 2 [1, 2]
2 3 [1, 2, 3]
3 3 [2, 3, 3]
4 3 [3, 3, 3]
5 6 [3, 3, 6]
6 7 [3, 6, 7]
7 8 [6, 7, 8]
8 9 [7, 8, 9]
9 10 [8, 9, 10]
Use a list comprehension:
N = 3
l = df['col1'].tolist()
df['col2'] = [l[max(0,i-N+1):i+1] for i in range(df.shape[0])]
Output:
col1 col2
0 1 [1]
1 2 [1, 2]
2 3 [1, 2, 3]
3 3 [2, 3, 3]
4 3 [3, 3, 3]
5 6 [3, 3, 6]
6 7 [3, 6, 7]
7 8 [6, 7, 8]
8 9 [7, 8, 9]
9 10 [8, 9, 10]
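The slicing logic above can be wrapped in a small helper so the window size is a parameter. This is only a sketch: the function name last_n is made up for illustration and is not part of pandas.

```python
import pandas as pd


def last_n(values, n):
    """For each position i, return the list values[max(0, i - n + 1) : i + 1].

    `last_n` is a hypothetical helper name used only for this example.
    """
    return [values[max(0, i - n + 1): i + 1] for i in range(len(values))]


df = pd.DataFrame({'col1': [1, 2, 3, 3, 3, 6, 7, 8, 9, 10]})
# Convert to a plain Python list once, then slice it for every row
df['col2'] = last_n(df['col1'].tolist(), 3)
print(df)
```

The first rows get shorter lists because fewer than n values are available there; max(0, ...) keeps the slice from wrapping around to the end of the list.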
Having seen the other answers, I realize my answer is rather clumsy.
Anyway, here it is.
import pandas as pd
ds = {'col1': [1,2,3,3,3,6,7,8,9,10]}
df = pd.DataFrame(data=ds)
df['col2'] = df['col1'].shift(1)
df['col3'] = df['col2'].shift(1)
df['col4'] = (df[['col3','col2','col1']]
.apply(lambda x:','.join(x.dropna().astype(str)),axis=1)
)
The last column contains the result as a comma-separated string rather than a list; the values appear as floats because shift introduces NaN.
col1 col4
0 1 1.0
1 2 1.0,2.0
2 3 1.0,2.0,3.0
3 3 2.0,3.0,3.0
4 3 3.0,3.0,3.0
5 6 3.0,3.0,6.0
6 7 3.0,6.0,7.0
7 8 6.0,7.0,8.0
8 9 7.0,8.0,9.0
9 10 8.0,9.0,10.0
lastThree = []
for x in range(len(df)):
    # Start the slice at max(0, x - 2) so the first rows do not wrap around to the end
    lastThree.append(df['col1'].iloc[max(0, x - 2):x + 1].tolist())
df['col2'] = lastThree
I would like to extend an existing pandas DataFrame and fill the new column successively:
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6], 'col2': [7, 8, 9, 10, 11, 12]})
df['col3'] = pd.Series(['a' for x in df[:3]])
df['col3'] = pd.Series(['b' for x in df[3:4]])
df['col3'] = pd.Series(['c' for x in df[4:]])
I would expect a result as follows:
col1 col2 col3
0 1 7 a
1 2 8 a
2 3 9 a
3 4 10 b
4 5 11 c
5 6 12 c
However, my code fails and I get:
col1 col2 col3
0 1 7 a
1 2 8 a
2 3 9 NaN
3 4 10 NaN
4 5 11 NaN
5 6 12 NaN
What is wrong?
Use the loc accessor:
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6], 'col2': [7, 8, 9, 10, 11, 12]})
df.loc[:2,'col3'] = 'a'
df.loc[3,'col3'] = 'b'
df.loc[4:,'col3'] = 'c'
df
col1 col2 col3
0 1 7 a
1 2 8 a
2 3 9 a
3 4 10 b
4 5 11 c
5 6 12 c
As #Amirhossein Kiani and #Emma note in the comments, you're never using df itself to assign values, so there is no need to slice it. Since you can assign a list to a DataFrame column, the following suffices:
df['col3'] = ['a'] * 3 + ['b'] + ['c'] * (len(df) - 4)
You can also use numpy.select to assign values. The idea is to build a list of boolean conditions over index ranges and pick the value belonging to the first condition that holds: if the index is less than 3, select 'a'; if the index is at least 3 and less than 4 (inclusive='left' excludes the right bound), select 'b'; otherwise fall back to the default 'c'.
import numpy as np
df['col3'] = np.select([df.index<3, df.index.to_series().between(3, 4, inclusive='left')], ['a','b'], 'c')
Output:
col1 col2 col3
0 1 7 a
1 2 8 a
2 3 9 a
3 4 10 b
4 5 11 c
5 6 12 c
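Every time you do something like df['col3'] = pd.Series(['a' for x in df[:3]]), you're assigning a new pd.Series to the whole column col3, which overwrites the previous assignment instead of filling only part of the column. In addition, iterating over a DataFrame yields its column labels, not its rows, so each comprehension produces one value per column; the short Series is then aligned on the index and the remaining rows become NaN. One alternative is to create the new column separately, then assign it to the df.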
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6], 'col2': [7, 8, 9, 10, 11, 12]})
new_col = ['a' for _ in range(3)] + ['b'] + ['c' for _ in range(4, len(df))]
df['col3'] = pd.Series(new_col)
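To see why the original attempt leaves NaN values, here is a small sketch that reproduces the first assignment from the question and inspects what the comprehension actually iterates over. The variable names labels and short are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6], 'col2': [7, 8, 9, 10, 11, 12]})

# Iterating over a DataFrame (even a row slice) yields column labels, not rows
labels = [x for x in df[:3]]

# So this Series has one entry per column (2), not one per sliced row (3)
short = pd.Series(['a' for x in df[:3]])

# Assignment aligns on the index: rows 0 and 1 get 'a', the rest become NaN
df['col3'] = short
print(labels)
print(df)
```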
The operation pandas.DataFrame.lookup is "Deprecated since version 1.2.0", and has since invalidated a lot of previous answers.
This post attempts to function as a canonical resource for looking up corresponding row col pairs in pandas versions 1.2.0 and newer.
Standard LookUp Values With Default Range Index
Given the following DataFrame:
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
Col A B
0 B 1 5
1 A 2 6
2 A 3 7
3 B 4 8
I would like to be able to look up, for each row, the value in the column specified in Col.
I would like my result to look like:
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 B 4 8 8
Standard LookUp Values With a Non-Default Index
Non-Contiguous Range Index
Given the following DataFrame:
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=[0, 2, 8, 9])
Col A B
0 B 1 5
2 A 2 6
8 A 3 7
9 B 4 8
I would like to preserve the index but still find the correct corresponding Value:
Col A B Val
0 B 1 5 5
2 A 2 6 2
8 A 3 7 3
9 B 4 8 8
MultiIndex
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))
Col A B
C E B 1 5
F A 2 6
D E A 3 7
F B 4 8
I would like to preserve the index but still find the correct corresponding Value:
Col A B Val
C E B 1 5 5
F A 2 6 2
D E A 3 7 3
F B 4 8 8
LookUp with Default For Unmatched/Not-Found Values
Given the following DataFrame
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
Col A B
0 B 1 5
1 A 2 6
2 A 3 7
3 C 4 8 # Column C does not correspond with any column
I would like to look up the corresponding value if one exists; otherwise I'd like it to default to 0:
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 C 4 8 0 # Default value 0 since C does not correspond
LookUp with Missing Values in the lookup Col
Given the following DataFrame:
Col A B
0 B 1 5
1 A 2 6
2 A 3 7
3 NaN 4 8 # <- Missing Lookup Key
I would like any NaN values in Col to result in a NaN value in Val
Col A B Val
0 B 1 5 5.0
1 A 2 6 2.0
2 A 3 7 3.0
3 NaN 4 8 NaN # NaN to indicate missing
Standard LookUp Values With Any Index
The documentation on Looking up values by index/column labels recommends using NumPy indexing via factorize and reindex as the replacement for the deprecated DataFrame.lookup.
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=[0, 2, 8, 9])
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
df
Col A B Val
0 B 1 5 5
2 A 2 6 2
8 A 3 7 3
9 B 4 8 8
factorize is used to encode the values of the column as an "enumerated type".
idx, col = pd.factorize(df['Col'])
# idx = array([0, 1, 1, 0], dtype=int64)
# col = Index(['B', 'A'], dtype='object')
Notice that B corresponds to 0 and A corresponds to 1. reindex is used to ensure that columns appear in the same order as the enumeration:
df.reindex(columns=col)
B A # B appears first (location 0), A appears second (location 1)
0 5 1
2 6 2
8 7 3
9 8 4
We need to create an appropriate range indexer compatible with NumPy indexing.
The standard approach is to use np.arange based on the length of the DataFrame:
np.arange(len(df))
[0 1 2 3]
Now NumPy indexing will work to select values from the DataFrame:
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
[5 2 3 8]
*Note: This approach will always work regardless of type of index.
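The steps above can be collected into one function. This is a minimal sketch, not an official replacement API: the name lookup_values and its parameters are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def lookup_values(frame, key):
    """Return, for each row, the value from the column named in frame[key].

    `lookup_values` is a hypothetical helper, not a pandas function.
    """
    idx, col = pd.factorize(frame[key])
    # Positional row indexer, independent of the index labels
    rows = np.arange(len(frame))
    return frame.reindex(columns=col).to_numpy()[rows, idx]


df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=[0, 2, 8, 9])
df['Val'] = lookup_values(df, 'Col')
print(df)
```

Because the row indexer is built from the length of the frame, the same call works unchanged with a default, non-contiguous, or MultiIndex index.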
MultiIndex
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
Col A B Val
C E B 1 5 5
F A 2 6 2
D E A 3 7 3
F B 4 8 8
Why use np.arange and not df.index directly?
Standard Contiguous Range Index
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
In this case only, there is no error as the result from np.arange is the same as the df.index.
df
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 B 4 8 8
Non-Contiguous Range Index Error
Raises IndexError:
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=[0, 2, 8, 9])
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
IndexError: index 8 is out of bounds for axis 0 with size 4
MultiIndex Error
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
Raises IndexError:
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
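The difference between labels and positions can be checked directly. The sketch below catches the error from label-based indexing and then repeats the lookup with positions; the variable names arr, label_indexing_failed and positional are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=[0, 2, 8, 9])
idx, col = pd.factorize(df['Col'])
arr = df.reindex(columns=col).to_numpy()  # shape (4, 2): valid row positions are 0..3

try:
    arr[df.index, idx]  # labels 8 and 9 are not valid row positions
    label_indexing_failed = False
except IndexError:
    label_indexing_failed = True

# Positions 0..len(df)-1 always exist, whatever the labels are
positional = arr[np.arange(len(df)), idx]
print(label_indexing_failed, positional)
```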
LookUp with Default For Unmatched/Not-Found Values
There are a few approaches.
First let's look at what happens by default if there is a non-corresponding value:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
# Col A B
# 0 B 1 5
# 1 A 2 6
# 2 A 3 7
# 3 C 4 8
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
Col A B Val
0 B 1 5 5.0
1 A 2 6 2.0
2 A 3 7 3.0
3 C 4 8 NaN # NaN Represents the Missing Value in C
If we look at why the NaN values are introduced, we will find that when factorize goes through the column it will enumerate all groups present regardless of whether they correspond to a column or not.
For this reason, when we reindex the DataFrame we will end up with the following result:
idx, col = pd.factorize(df['Col'])
# idx = array([0, 1, 1, 2], dtype=int64)
# col = Index(['B', 'A', 'C'], dtype='object')
df.reindex(columns=col)
B A C
0 5 1 NaN
1 6 2 NaN
2 7 3 NaN
3 8 4 NaN # Reindex adds the missing column with the Default `NaN`
If we want to specify a default value, we can specify the fill_value argument of reindex which allows us to modify the behaviour as it relates to missing column values:
idx, col = pd.factorize(df['Col'])
# idx = array([0, 1, 1, 2], dtype=int64)
# col = Index(['B', 'A', 'C'], dtype='object')
df.reindex(columns=col, fill_value=0)
B A C
0 5 1 0
1 6 2 0
2 7 3 0
3 8 4 0 # Notice reindex adds missing column with specified value `0`
This means that we can do:
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(
columns=col,
fill_value=0 # Default value for Missing column values
).to_numpy()[np.arange(len(df)), idx]
df:
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 C 4 8 0
*Notice the dtype of the column is int, since NaN was never introduced, and, therefore, the column type was not changed.
LookUp with Missing Values in the lookup Col
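factorize encodes missing values with the sentinel -1 by default (na_sentinel=-1 in older pandas versions; newer versions control this with use_na_sentinel=True), meaning that when NaN values appear in the column being factorized the resulting idx value is -1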
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
# Col A B
# 0 B 1 5
# 1 A 2 6
# 2 A 3 7
# 3 NaN 4 8 # <- Missing Lookup Key
idx, col = pd.factorize(df['Col'])
# idx = array([ 0, 1, 1, -1], dtype=int64)
# col = Index(['B', 'A'], dtype='object')
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
# Col A B Val
# 0 B 1 5 5
# 1 A 2 6 2
# 2 A 3 7 3
# 3 NaN 4 8 4 <- Value From A
This -1 means that NumPy indexes from the end, so we pull from the last column of the reindexed DataFrame. Notice that col still only contains the values B and A, so the last column is A and we end up with the value from A in Val for the last row.
The easiest way to handle this is to fillna Col with some value that cannot be found in the column headers.
Here I use the empty string '':
idx, col = pd.factorize(df['Col'].fillna(''))
# idx = array([0, 1, 1, 2], dtype=int64)
# col = Index(['B', 'A', ''], dtype='object')
Now when I reindex, the '' column will contain NaN values meaning that the lookup produces the desired result:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
idx, col = pd.factorize(df['Col'].fillna(''))
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
df:
Col A B Val
0 B 1 5 5.0
1 A 2 6 2.0
2 A 3 7 3.0
3 NaN 4 8 NaN # Missing as expected
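If the empty string could itself be a column header, a different technique from the fillna approach above is to keep the -1 codes and overwrite those rows afterwards. This is only a sketch of that alternative; the intermediate name vals is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

idx, col = pd.factorize(df['Col'])  # NaN keys are encoded as -1
vals = (df.reindex(columns=col)
          .to_numpy()[np.arange(len(df)), idx]
          .astype(float))           # float so the result can hold NaN
vals[idx == -1] = np.nan            # overwrite rows whose lookup key was missing
df['Val'] = vals
print(df)
```

The trade-off is that the result is always float, whereas the fillna approach only becomes float when a NaN is actually introduced.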
Other Approaches to LookUp
There are 2 other approaches to performing this operation:
apply (Intuitive, but quite slow)
apply can be used on axis=1 in order to use the Column values as the key:
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
df['Val'] = df.apply(lambda row: row[row['Col']], axis=1)
df
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 B 4 8 8
This operation will work regardless of index type:
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=[0, 2, 8, 9])
# Col A B
# 0 B 1 5
# 2 A 2 6
# 8 A 3 7
# 9 B 4 8
df['Val'] = df.apply(lambda row: row[row['Col']], axis=1)
df:
Col A B Val
0 B 1 5 5
2 A 2 6 2
8 A 3 7 3
9 B 4 8 8
When dealing with Missing/Non-Corresponding Values, Series.get can be used to remedy this issue:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'C', np.nan],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
# Col A B
# 0 B 1 5
# 1 A 2 6
# 2 C 3 7 <- Non Corresponding
# 3 NaN 4 8 <- Missing
df['Val'] = df.apply(lambda row: row.get(row['Col']), axis=1)
Col A B Val
0 B 1 5 5.0
1 A 2 6 2.0
2 C 3 7 NaN # Missing value
3 NaN 4 8 NaN # Missing value
With Default Value
df['Val'] = df.apply(lambda row: row.get(row['Col'], default=-1), axis=1)
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 C 3 7 -1 # Default -1
3 NaN 4 8 -1 # Default -1
apply is extremely flexible and modifications are straightforward. However, the row-by-row iteration, as well as all the individual Series lookups, can become extremely costly in large DataFrames.
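To compare the two approaches on your own data, a sketch like the following builds a larger frame and confirms that apply and the factorize-based lookup agree; each line can then be timed separately (for example with timeit). The frame big, its size and the random seed are arbitrary assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
n = 2_000                       # arbitrary size for the example
big = pd.DataFrame({'Col': rng.choice(['A', 'B'], size=n),
                    'A': rng.integers(0, 100, size=n),
                    'B': rng.integers(0, 100, size=n)})

# Row-by-row lookup: one Python-level call per row
via_apply = big.apply(lambda row: row[row['Col']], axis=1)

# Vectorised lookup: one NumPy fancy-indexing operation
idx, col = pd.factorize(big['Col'])
via_numpy = big.reindex(columns=col).to_numpy()[np.arange(n), idx]

print((via_apply.to_numpy() == via_numpy).all())
```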
get_indexer (limited)
Index.get_indexer can be used to convert the values in Col into integer column positions for the DataFrame. This means there is no need to reindex the DataFrame, as the indexer corresponds to the DataFrame as a whole.
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
df['Val'] = df.to_numpy()[df.index, df.columns.get_indexer(df['Col'])]
df
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 B 4 8 8
This approach is reasonably fast. However, get_indexer represents values that are not found by -1, meaning that if a value is missing the lookup will grab the value from the -1 column (the last column in the DataFrame).
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8],
'Col': ['B', 'A', 'A', 'C']})
# A B Col <- Col is now the Last Col
# 0 1 5 B
# 1 2 6 A
# 2 3 7 A
# 3 4 8 C <- Notice Col `C` does not correspond to a Valid Column Header
df['Val'] = df.to_numpy()[df.index, df.columns.get_indexer(df['Col'])]
df:
A B Col Val
0 1 5 B 5
1 2 6 A 2
2 3 7 A 3
3 4 8 C C # <- Value from the last column in the DataFrame (index -1)
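One way to guard against this is to mask the rows where get_indexer returned -1. The following is a sketch of that idea rather than an established recipe; the names pos and vals are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8],
                   'Col': ['B', 'A', 'A', 'C']})

pos = df.columns.get_indexer(df['Col'])  # -1 where the key is not a column header
vals = df.to_numpy()[np.arange(len(df)), pos]

# Replace the values picked up through position -1 with NaN
df['Val'] = pd.Series(vals, index=df.index).mask(pos == -1)
print(df)
```

Because the whole DataFrame is converted to one array here, the resulting column has object dtype whenever the frame mixes strings and numbers.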
It is also notable that not reindexing the DataFrame means converting the entire DataFrame to numpy. This can be very costly if there are many unrelated columns that all need to be converted:
import numpy as np
import pandas as pd
df = pd.DataFrame({1: 10,
2: 20,
3: 't',
4: 40,
5: np.nan,
'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
df['Val'] = df.to_numpy()[df.index, df.columns.get_indexer(df['Col'])]
df.to_numpy()
[[10 20 't' 40 nan 'B' 1 5 5]
[10 20 't' 40 nan 'A' 2 6 2]
[10 20 't' 40 nan 'A' 3 7 3]
[10 20 't' 40 nan 'B' 4 8 8]]
Compared to the reindexing approach which only contains columns relevant to the column values:
df.reindex(columns=['B', 'A']).to_numpy()
[[5 1]
[6 2]
[7 3]
[8 4]]
Another option is to build a tuple of the lookup columns, pivot the dataframe, and select the relevant columns with the tuples:
cols = [(ent, ent) for ent in df.Col.unique()]
df.assign(Val = df.pivot(index = None, columns = 'Col')
.reindex(columns = cols)
.ffill(axis=1)
.iloc[:, -1])
Col A B Val
0 B 1 5 5.0
2 A 2 6 2.0
8 A 3 7 3.0
9 B 4 8 8.0
Another possible method is to use melt:
df['value'] = (df.melt('Col', ignore_index=False)
.loc[lambda x: x['Col'] == x['variable'], 'value'])
print(df)
# Output:
Col A B value
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 B 4 8 8
This method also works with Missing/Non-Corresponding Values:
df['value'] = (df.melt('Col', ignore_index=False)
.loc[lambda x: x['Col'] == x['variable'], 'value'])
print(df)
# Output
Col A B value
0 B 1 5 5.0
1 A 2 6 2.0
2 C 3 7 NaN
3 NaN 4 8 NaN
You can replace .loc[...] with query(...), which is a little slower but more expressive:
df['value'] = df.melt('Col', ignore_index=False).query('Col == variable')['value']
import pandas as pd
d = {
'one': [1, 2, 3, 4, 5],
'one': [9, 8, 7, 6, 5],
'three': ['a', 'b', 'c', 'd', 'e']
}
df = pd.DataFrame(d)
I have a bigger dataframe with multiple columns that have the same name.
I want to change a column name by its position, as in R,
e.g. colnames(df)[2]='two'
I want to rename the second column from 'one' to 'two'. How can I do
that in Python?
I think the simplest way is to assign new column names with np.arange or range:
import numpy as np
import pandas as pd
# a valid dictionary has unique keys
d = {
'one1': [1, 2, 3, 4, 5],
'one2': [9, 8, 7, 6, 5],
'three': ['a', 'b', 'c', 'd', 'e']
}
df = pd.DataFrame(d)
df.columns = ['one'] * 2 + ['three']
print (df)
one one three
0 1 9 a
1 2 8 b
2 3 7 c
3 4 6 d
4 5 5 e
df.columns = np.arange(len(df.columns))
#alternative
#df.columns = range(len(df.columns))
print (df)
0 1 2
0 1 9 a
1 2 8 b
2 3 7 c
3 4 6 d
4 5 5 e
Then select by name:
print (df[1])
0 9
1 8
2 7
3 6
4 5
Name: 1, dtype: int64
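If the goal is only to rename one column by its position, as in the R snippet from the question, a minimal sketch is to edit a copy of the column labels and assign it back. The small frame below is made up for illustration; the variable name cols is arbitrary.

```python
import pandas as pd

# Duplicate column names are created here via the columns argument,
# since a dict cannot hold two 'one' keys
df = pd.DataFrame([[1, 9, 'a'], [2, 8, 'b']], columns=['one', 'one', 'three'])

cols = df.columns.tolist()
cols[1] = 'two'   # position 1 is the second column (R's colnames(df)[2])
df.columns = cols
print(df)
```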
I have several hundred dataframes with same column names, like this:
df1
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28269 0.07365 22.16080 4050.311360
1 4208.98 5 0.48122 0.08765 44.90035 4208.972962
2 4374.94 9 0.71483 0.11429 86.96497 4374.927110
3 4379.74 9 0.31404 0.09107 30.44271 4379.760601
4 4398.01 14 0.50415 0.09845 52.83236 4398.007473
5 5520.50 1 0.06148 0.12556 8.21685 5520.484742
df2
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28616 0.07521 22.91064 4050.327388
1 4208.98 6 0.48781 0.08573 44.51609 4208.990029
2 4374.94 9 0.71548 0.11437 87.10152 4374.944513
3 4379.74 10 0.31338 0.09098 30.34791 4379.778009
4 4398.01 15 0.49950 0.08612 45.78707 4398.020367
5 4502.21 9 0.56362 0.10114 60.67868 4502.223123
6 4508.28 3 0.69554 0.11600 85.88428 4508.291777
7 4512.99 2 0.20486 0.08891 19.38745 4512.999332
8 5520.50 1 0.06148 0.12556 8.21685 5520.484742
This is how I'm reading them:
path_to_files = '/home/Desktop/computed_2d/'
lst = []
for filen in dir1:
    df = pd.read_table(path_to_files + filen, skiprows=0, usecols=(0, 1, 2, 3, 4, 8),
                       names=['wave', 'num', 'stlines', 'fwhm', 'EWs', 'MeasredWave'],
                       delimiter=r'\s+')
    lst.append(df)
The desired result should look like this:
wave num stlines fwhm EWs MeasredWave
0 4050.32 3.0 0.284425 0.074430 22.535720 4050.319374
1 4208.98 5.5 0.484515 0.086690 44.708220 4208.981496
2 4374.94 9.0 0.715155 0.114330 87.033245 4374.935812
3 4379.74 9.5 0.313710 0.091025 30.395310 4379.769305
4 4398.01 14.5 0.501825 0.092285 49.309715 4398.013920
5 4502.21 9 0.56362 0.10114 60.67868 4502.223123
6 4508.28 3 0.69554 0.11600 85.88428 4508.291777
7 4512.99 2 0.20486 0.08891 19.38745 4512.999332
8 5520.50 1.0 0.061480 0.125560 8.216850 5520.484742
As you can see, the number of rows is not the same. Now I want to take the average of all the dataframes based on the first column, wave, and I want to make sure that each value in column wave of df1 is matched with the corresponding row of df2.
You can combine the dataframes on wave with an outer pd.merge and then take the average of the columns that share a name:
df3 = pd.merge(df1,df2,on=['wave'],how ='outer',)
df4 = df3.rename(columns = lambda x: x.split('_')[0]).T
df4.groupby(df4.index).mean().T
Out:
EWs MeasredWave fwhm num stlines wave
0 22.535720 4050.319374 0.074430 3.0 0.284425 4050.32
1 44.708220 4208.981496 0.086690 5.5 0.484515 4208.98
2 87.033245 4374.935812 0.114330 9.0 0.715155 4374.94
3 30.395310 4379.769305 0.091025 9.5 0.313710 4379.74
4 49.309715 4398.013920 0.092285 14.5 0.501825 4398.01
5 8.216850 5520.484742 0.125560 1.0 0.061480 5520.50
6 60.678680 4502.223123 0.101140 9.0 0.563620 4502.21
7 85.884280 4508.291777 0.116000 3.0 0.695540 4508.28
8 19.387450 4512.999332 0.088910 2.0 0.204860 4512.99
Here is an example to do what you need:
import pandas as pd
df1 = pd.DataFrame({'A': [0, 1, 2, 3],
'B': [0, 1, 2, 3],
'C': [0, 1, 2, 3],
'D': [0, 1, 2, 3]},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': [4, 5, 6, 7],
'B': [4, 5, 6, 7],
'C': [4, 5, 6, 7],
'D': [4, 5, 6, 7]},
index=[0, 1, 2, 3])
df3 = pd.DataFrame({'A': [8, 9, 10, 11],
'B': [8, 9, 10, 11],
'C': [8, 9, 10, 11],
'D': [8, 9, 10, 11]},
index=[0, 1, 2, 3])
df4 = pd.concat([df1, df2, df3])
df5 = pd.concat([df1, df2, df3], ignore_index=True)
print(df4)
print('\n\n')
print(df5)
print(f"Average of column A = {df4['A'].mean()}")
You will have
A B C D
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
0 4 4 4 4
1 5 5 5 5
2 6 6 6 6
3 7 7 7 7
0 8 8 8 8
1 9 9 9 9
2 10 10 10 10
3 11 11 11 11
A B C D
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
10 10 10 10 10
11 11 11 11 11
Average of column A = 5.5
Answer from #Naga Kiran is great. I updated the whole solution here:
import pandas as pd
df1 = pd.DataFrame(
{'wave' : [4050.32, 4208.98, 4374.94, 4379.74, 4398.01, 5520.50],
'num' : [3, 5, 9, 9, 14, 1],
'stlines' : [0.28269, 0.48122, 0.71483, 0.31404, 0.50415, 0.06148],
'fwhm' : [0.07365, 0.08765, 0.11429, 0.09107, 0.09845, 0.12556],
'EWs' : [22.16080, 44.90035, 86.96497, 30.44271, 52.83236, 8.21685],
'MeasredWave' : [4050.311360, 4208.972962, 4374.927110, 4379.760601, 4398.007473, 5520.484742]},
index=[0, 1, 2, 3, 4, 5])
df2 = pd.DataFrame(
{'wave' : [4050.32, 4208.98, 4374.94, 4379.74, 4398.01, 4502.21, 4508.28, 4512.99, 5520.50],
'num' : [3, 6, 9, 10, 15, 9, 3, 2, 1],
'stlines' : [0.28616, 0.48781, 0.71548, 0.31338, 0.49950, 0.56362, 0.69554, 0.20486, 0.06148],
'fwhm' : [0.07521, 0.08573, 0.11437, 0.09098, 0.08612, 0.10114, 0.11600, 0.08891, 0.12556],
'EWs' : [22.91064, 44.51609, 87.10152, 30.34791, 45.78707, 60.67868, 85.88428, 19.38745, 8.21685],
'MeasredWave' : [4050.327388, 4208.990029, 4374.944513, 4379.778009, 4398.020367, 4502.223123, 4508.291777, 4512.999332, 5520.484742]},
index=[0, 1, 2, 3, 4, 5, 6, 7, 8])
df3 = pd.merge(df1, df2, on='wave', how='outer')
df4 = df3.rename(columns = lambda x: x.split('_')[0]).T
df5 = df4.groupby(df4.index).mean().T
df6 = df5[['wave', 'num', 'stlines', 'fwhm', 'EWs', 'MeasredWave']]
df7 = df6.sort_values('wave', ascending = True).reset_index(drop=True)
df7
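Since the question mentions several hundred dataframes collected in a list, a different route from the pairwise merge above is to stack the whole list and group by wave. This is a sketch under the assumption that every frame has the same columns; the two small frames a and b are made-up stand-ins for the real files.

```python
import pandas as pd

# Two small hypothetical frames standing in for the several hundred files
a = pd.DataFrame({'wave': [4050.32, 4208.98, 5520.50],
                  'num': [3, 5, 1]})
b = pd.DataFrame({'wave': [4050.32, 4208.98, 4502.21, 5520.50],
                  'num': [3, 6, 9, 1]})
lst = [a, b]

# Stack all frames row-wise, then average every column per wave value
avg = (pd.concat(lst, ignore_index=True)
         .groupby('wave', as_index=False)
         .mean())
print(avg)
```

Rows are matched on the exact value of wave, so a wave that appears in only some frames is averaged over just those frames, and the result is sorted by wave.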