Add multiple dataframes based on one column - python

I have several hundred dataframes with the same column names, like this:
df1
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28269 0.07365 22.16080 4050.311360
1 4208.98 5 0.48122 0.08765 44.90035 4208.972962
2 4374.94 9 0.71483 0.11429 86.96497 4374.927110
3 4379.74 9 0.31404 0.09107 30.44271 4379.760601
4 4398.01 14 0.50415 0.09845 52.83236 4398.007473
5 5520.50 1 0.06148 0.12556 8.21685 5520.484742
df2
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28616 0.07521 22.91064 4050.327388
1 4208.98 6 0.48781 0.08573 44.51609 4208.990029
2 4374.94 9 0.71548 0.11437 87.10152 4374.944513
3 4379.74 10 0.31338 0.09098 30.34791 4379.778009
4 4398.01 15 0.49950 0.08612 45.78707 4398.020367
5 4502.21 9 0.56362 0.10114 60.67868 4502.223123
6 4508.28 3 0.69554 0.11600 85.88428 4508.291777
7 4512.99 2 0.20486 0.08891 19.38745 4512.999332
8 5520.50 1 0.06148 0.12556 8.21685 5520.484742
This is how I'm reading them:
path_to_files = '/home/Desktop/computed_2d/'
lst = []
# dir1 holds the file names to read (its definition is not shown)
for filen in dir1:
    df = pd.read_table(path_to_files + filen, skiprows=0, usecols=(0, 1, 2, 3, 4, 8),
                       names=['wave', 'num', 'stlines', 'fwhm', 'EWs', 'MeasredWave'],
                       delimiter=r'\s+')
    lst.append(df)
The desired result should look like this:
wave num stlines fwhm EWs MeasredWave
0 4050.32 3.0 0.284425 0.074430 22.535720 4050.319374
1 4208.98 5.5 0.484515 0.086690 44.708220 4208.981496
2 4374.94 9.0 0.715155 0.114330 87.033245 4374.935812
3 4379.74 9.5 0.313710 0.091025 30.395310 4379.769305
4 4398.01 14.5 0.501825 0.092285 49.309715 4398.013920
5 4502.21 9 0.56362 0.10114 60.67868 4502.223123
6 4508.28 3 0.69554 0.11600 85.88428 4508.291777
7 4512.99 2 0.20486 0.08891 19.38745 4512.999332
8 5520.50 1.0 0.061480 0.125560 8.216850 5520.484742
As you can see, the number of rows is not the same. I want to take the average of all the dataframes based on the first column, wave, and I want to make sure that each value of wave in df1 is matched with the same wave value in df2.

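You can combine the dataframes with an outer pd.merge on wave and then average the columns that share a name (merge adds the suffixes _x and _y to the overlapping columns):
df3 = pd.merge(df1, df2, on=['wave'], how='outer')
# strip the _x/_y suffixes so both copies of a column share one label, then transpose
df4 = df3.rename(columns=lambda x: x.split('_')[0]).T
# average the rows that share a label (NaN values are skipped), then transpose back
df4.groupby(df4.index).mean().T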
Out:
EWs MeasredWave fwhm num stlines wave
0 22.535720 4050.319374 0.074430 3.0 0.284425 4050.32
1 44.708220 4208.981496 0.086690 5.5 0.484515 4208.98
2 87.033245 4374.935812 0.114330 9.0 0.715155 4374.94
3 30.395310 4379.769305 0.091025 9.5 0.313710 4379.74
4 49.309715 4398.013920 0.092285 14.5 0.501825 4398.01
5 8.216850 5520.484742 0.125560 1.0 0.061480 5520.50
6 60.678680 4502.223123 0.101140 9.0 0.563620 4502.21
7 85.884280 4508.291777 0.116000 3.0 0.695540 4508.28
8 19.387450 4512.999332 0.088910 2.0 0.204860 4512.99

Here is an example to do what you need:
import pandas as pd
df1 = pd.DataFrame({'A': [0, 1, 2, 3],
'B': [0, 1, 2, 3],
'C': [0, 1, 2, 3],
'D': [0, 1, 2, 3]},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': [4, 5, 6, 7],
'B': [4, 5, 6, 7],
'C': [4, 5, 6, 7],
'D': [4, 5, 6, 7]},
index=[0, 1, 2, 3])
df3 = pd.DataFrame({'A': [8, 9, 10, 11],
'B': [8, 9, 10, 11],
'C': [8, 9, 10, 11],
'D': [8, 9, 10, 11]},
index=[0, 1, 2, 3])
df4 = pd.concat([df1, df2, df3])
df5 = pd.concat([df1, df2, df3], ignore_index=True)
print(df4)
print('\n\n')
print(df5)
print(f"Average of column A = {df4['A'].mean()}")
You will have
A B C D
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
0 4 4 4 4
1 5 5 5 5
2 6 6 6 6
3 7 7 7 7
0 8 8 8 8
1 9 9 9 9
2 10 10 10 10
3 11 11 11 11
A B C D
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
10 10 10 10 10
11 11 11 11 11
Average of column A = 5.5

The answer from @Naga Kiran is great. Here is the complete solution, with the columns restored to their original order and the rows sorted by wave:
import pandas as pd
df1 = pd.DataFrame(
{'wave' : [4050.32, 4208.98, 4374.94, 4379.74, 4398.01, 5520.50],
'num' : [3, 5, 9, 9, 14, 1],
'stlines' : [0.28269, 0.48122, 0.71483, 0.31404, 0.50415, 0.06148],
'fwhm' : [0.07365, 0.08765, 0.11429, 0.09107, 0.09845, 0.12556],
'EWs' : [22.16080, 44.90035, 86.96497, 30.44271, 52.83236, 8.21685],
'MeasredWave' : [4050.311360, 4208.972962, 4374.927110, 4379.760601, 4398.007473, 5520.484742]},
index=[0, 1, 2, 3, 4, 5])
df2 = pd.DataFrame(
{'wave' : [4050.32, 4208.98, 4374.94, 4379.74, 4398.01, 4502.21, 4508.28, 4512.99, 5520.50],
'num' : [3, 6, 9, 10, 15, 9, 3, 2, 1],
'stlines' : [0.28616, 0.48781, 0.71548, 0.31338, 0.49950, 0.56362, 0.69554, 0.20486, 0.06148],
'fwhm' : [0.07521, 0.08573, 0.11437, 0.09098, 0.08612, 0.10114, 0.11600, 0.08891, 0.12556],
'EWs' : [22.91064, 44.51609, 87.10152, 30.34791, 45.78707, 60.67868, 85.88428, 19.38745, 8.21685],
'MeasredWave' : [4050.327388, 4208.990029, 4374.944513, 4379.778009, 4398.020367, 4502.223123, 4508.291777, 4512.999332, 5520.484742]},
index=[0, 1, 2, 3, 4, 5, 6, 7, 8])
df3 = pd.merge(df1, df2, on='wave', how='outer')
df4 = df3.rename(columns = lambda x: x.split('_')[0]).T
df5 = df4.groupby(df4.index).mean().T
df6 = df5[['wave', 'num', 'stlines', 'fwhm', 'EWs', 'MeasredWave']]
df7 = df6.sort_values('wave', ascending = True).reset_index(drop=True)
df7
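The merge above handles two frames. For several hundred, one possible approach is to stack them all with pd.concat and group by wave. This is only a sketch under assumptions: the two small frames stand in for the real data (only three columns are kept for brevity), and lst plays the role of the list built while reading the files.

```python
import pandas as pd

# Small stand-ins for the real frames; only three columns are kept for brevity.
df1 = pd.DataFrame({'wave': [4050.32, 4208.98, 5520.50],
                    'num': [3, 5, 1],
                    'EWs': [22.16080, 44.90035, 8.21685]})
df2 = pd.DataFrame({'wave': [4050.32, 4208.98, 4502.21, 5520.50],
                    'num': [3, 6, 9, 1],
                    'EWs': [22.91064, 44.51609, 60.67868, 8.21685]})
lst = [df1, df2]  # in practice: the list filled in the file-reading loop

# Stack every frame vertically, then average all rows that share a wave value.
stacked = pd.concat(lst, ignore_index=True)
result = stacked.groupby('wave', as_index=False).mean()
print(result)
```

Rows whose wave appears in only one frame keep their values unchanged, and the result comes back sorted by wave. Note that grouping relies on the wave values being exactly equal; if the files round them differently, rounding the column first may be necessary.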

Related

How to create lists from pandas columns

I have created a pandas dataframe using this code:
import numpy as np
import pandas as pd
ds = {'col1': [1,2,3,3,3,6,7,8,9,10]}
df = pd.DataFrame(data=ds)
The dataframe looks like this:
print(df)
col1
0 1
1 2
2 3
3 3
4 3
5 6
6 7
7 8
8 9
9 10
I need to create a field called col2 that contains in a list (for each record) the last 3 elements of col1 while iterating through each record. So, the resulting dataframe would look like this:
Does anyone know how to do it by any chance?
Here is a solution using rolling and list comprehension
df['col2'] = [x.tolist() for x in df['col1'].rolling(3)]
col1 col2
0 1 [1]
1 2 [1, 2]
2 3 [1, 2, 3]
3 3 [2, 3, 3]
4 3 [3, 3, 3]
5 6 [3, 3, 6]
6 7 [3, 6, 7]
7 8 [6, 7, 8]
8 9 [7, 8, 9]
9 10 [8, 9, 10]
Use a list comprehension:
N = 3
l = df['col1'].tolist()
df['col2'] = [l[max(0,i-N+1):i+1] for i in range(df.shape[0])]
Output:
col1 col2
0 1 [1]
1 2 [1, 2]
2 3 [1, 2, 3]
3 3 [2, 3, 3]
4 3 [3, 3, 3]
5 6 [3, 3, 6]
6 7 [3, 6, 7]
7 8 [6, 7, 8]
8 9 [7, 8, 9]
9 10 [8, 9, 10]
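If the window size needs to vary, the slicing idea can be wrapped in a small helper. The function name trailing_lists is made up for this sketch and is not part of pandas.

```python
import pandas as pd


def trailing_lists(series, n):
    """Return, for each position, a list of up to n values ending at that position."""
    values = series.tolist()
    # max(0, ...) keeps the first windows short instead of letting the slice start go negative
    return [values[max(0, i - n + 1):i + 1] for i in range(len(values))]


df = pd.DataFrame({'col1': [1, 2, 3, 3, 3, 6, 7, 8, 9, 10]})
df['col2'] = trailing_lists(df['col1'], 3)
print(df)
```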
Having seen the other answers, I realize mine is rather clumsy.
Anyway, here it is.
import pandas as pd
ds = {'col1': [1,2,3,3,3,6,7,8,9,10]}
df = pd.DataFrame(data=ds)
df['col2'] = df['col1'].shift(1)
df['col3'] = df['col2'].shift(1)
df['col4'] = (df[['col3','col2','col1']]
.apply(lambda x:','.join(x.dropna().astype(str)),axis=1)
)
The last column contains the result as a comma-separated string rather than a list (the values are floats because shift introduces NaN).
col1 col4
0 1 1.0
1 2 1.0,2.0
2 3 1.0,2.0,3.0
3 3 2.0,3.0,3.0
4 3 3.0,3.0,3.0
5 6 3.0,3.0,6.0
6 7 3.0,6.0,7.0
7 8 6.0,7.0,8.0
8 9 7.0,8.0,9.0
9 10 8.0,9.0,10.0
# Plain loop; max(0, ...) stops negative positions from wrapping around to the end
lastThree = []
for x in range(len(df)):
    lastThree.append(df['col1'].iloc[max(0, x - 2):x + 1].tolist())
df['col2'] = lastThree

Pandas replace with dict and condition

In Pandas in Python you have the function df.replace(), which you can give a dict to change the values in a column:
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
'B': [5, 6, 7, 8, 9],
'C': ['a', 'b', 'c', 'd', 'e']})
df.replace({'A': {0: 10, 3: 100}})
Is it possible to add a condition to this? For example that it will only replace the values in the A column if the value in the B column is smaller than 8.
Using where:
df['A'] = df['A'].replace({0: 10, 3: 100}).where(df['B'].lt(8), df['A'])
output:
A B C
0 10 5 a
1 1 6 b
2 2 7 c
3 3 8 d
4 4 9 e
Try this:
df.update(df['A'][df['B'] < 8].replace({0: 10, 3: 100}))
Output:
>>> df
A B C
0 10.0 5 a
1 1.0 6 b
2 2.0 7 c
3 3.0 8 d
4 4.0 9 e
Notice how A at row 3 is not 100, but 3.0 (the old value). That is because B at row 3 is 8, which, per your condition, is not less than 8.
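Another way to express the same condition is a boolean mask with .loc, which only touches the selected rows and leaves column A as integers. This is a sketch of an alternative, not a correction of the answers above.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})

# Rows where the condition on B holds
mask = df['B'] < 8
# Apply the dict replacement only to those rows of A
df.loc[mask, 'A'] = df.loc[mask, 'A'].replace({0: 10, 3: 100})
print(df)
```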

Add new column in the Dataset with condition?

I am new to pandas and Python, and I want to add a new column that collects all the data in a row and joins it into a single value, for example:
df_final = pd.read_csv('df_final.csv')
House_No = df_final['House_No_'].copy()
Street = df_final['Street'].copy()
City = df_final['City'].copy()
District = df_final['District'].copy()
Postl_Code = df_final['Postl_Code'].copy()
df_final['Full_Address'] = (House_No + ' , ' + Street + ' , ' + City + ' , ' + District + ' , ' + Postl_Code.astype(str))
In the output, when House_No_ is null the new cell becomes null as well (rows 7, 8 and 9 in the original screenshot).
How can I ignore the null cell and just take the rest of the row?
Thank you in advance.
import numpy as np
import pandas as pd
# sample data
np.random.seed(1)
df = pd.DataFrame(np.random.randint(1,10, size=(5,5)), columns=list('abcde'))
df.iloc[2, 0] = np.nan
# use np.where with join
df['new_col'] = np.where(df['a'].isna(), df.iloc[:, 1:].astype(str).apply(', '.join, axis=1),
df.astype(str).apply(', '.join, axis=1))
a b c d e new_col
0 6.0 9 6 1 1 6.0, 9, 6, 1, 1
1 2.0 8 7 3 5 2.0, 8, 7, 3, 5
2 NaN 3 5 3 5 3, 5, 3, 5
3 8.0 8 2 8 1 8.0, 8, 2, 8, 1
4 7.0 8 7 2 1 7.0, 8, 7, 2, 1
or if you do not care if the nan is in the final output simply do:
df['new_col1'] = df.astype(str).apply(', '.join, axis=1)
a b c d e new_col1
0 6.0 9 6 1 1 6.0, 9, 6, 1, 1
1 2.0 8 7 3 5 2.0, 8, 7, 3, 5
2 NaN 3 5 3 5 nan, 3, 5, 3, 5
3 8.0 8 2 8 1 8.0, 8, 2, 8, 1
4 7.0 8 7 2 1 7.0, 8, 7, 2, 1
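The np.where version above only handles a missing value in the first column. A more general sketch drops whatever is null in each row before joining. The column names come from the question; the sample values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented sample rows using the column names from the question
df_final = pd.DataFrame({
    'House_No_': ['12', np.nan, '7'],
    'Street': ['Main St', 'Oak Ave', 'Elm Rd'],
    'City': ['Springfield', 'Shelbyville', 'Ogdenville'],
    'District': ['North', 'East', np.nan],
    'Postl_Code': [12345, 23456, 34567],
})

cols = ['House_No_', 'Street', 'City', 'District', 'Postl_Code']
# For each row: drop the nulls, turn the rest into strings, and join them
df_final['Full_Address'] = df_final[cols].apply(
    lambda row: ' , '.join(row.dropna().astype(str)), axis=1)
print(df_final['Full_Address'])
```

A row-wise apply is slower than vectorised string concatenation, so for very large tables a different approach may be preferable.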

Divide numbers in same column in python

In Python, I have a dataframe and want to divide numbers in the same column by each other (for example, divide 6 by 9 in column c).
import numpy as np
import pandas as pd
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
df2
a b c
0 1 2 3
1 4 5 6
2 7 8 9
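One possible reading of the question, sketched below, is either dividing two specific cells of a column or dividing every value by the one in the next row. Which of the two is wanted is an assumption, since the question does not say.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])

# Divide two specific cells: row 1 of c (6) by row 2 of c (9)
single = df2.loc[1, 'c'] / df2.loc[2, 'c']

# Divide every value in c by the value in the next row; the last row has no successor
ratios = df2['c'] / df2['c'].shift(-1)

print(single)
print(ratios)
```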

create a dataframe from 3 other dataframes in python

I am trying to create a new df which summarises my key information, by taking that information from 3 (say) other dataframes.
dfdate = {'x1': [2, 4, 7, 5, 6],
'x2': [2, 2, 2, 6, 7],
'y1': [3, 1, 4, 5, 9]}
dfdate = pd.DataFrame(dfdate, index=range(5))
dfqty = {'x1': [1, 2, 6, 6, 8],
'x2': [3, 1, 1, 7, 5],
'y1': [2, 4, 3, 2, 8]}
dfqty = pd.DataFrame(dfqty, index=range(5))
dfprices = {'x1': [0, 2, 2, 4, 4],
'x2': [2, 0, 0, 3, 4],
'y1': [1, 3, 2, 1, 3]}
dfprices = pd.DataFrame(dfprices, index=range(5))
Let us say the above 3 dataframes are my data. Say, some dates, qty, and prices of goods. My new df is to be constructed from the above data:
rng = len(dfprices.columns)*len(dfprices.index) # This is the len of new df
dfnew = pd.DataFrame(np.nan, index=range(0, rng), columns=['Letter', 'Number', 'date', 'qty', 'price'])
Now, this is where I struggle to piece things together. I am trying to take all the data in dfdate and put it into one column of the new df, and the same for dfqty and dfprices (so each 5x3 frame is essentially flattened into a 15-element vector and placed into the new df).
As well as that, I need a couple of columns in dfnew as identifiers, taken from the column names of the old dfs.
I've tried for loops, to no avail, and I don't know how to convert a df to a series. My desired output is:
dfnew:
'Lettercol','Numbercol', 'date', 'qty', 'price'
0 X 1 2 1 0
1 X 1 4 2 2
2 X 1 7 6 2
3 X 1 5 6 4
4 X 1 6 8 4
5 X 2 2 3 2
6 X 2 2 1 0
7 X 2 2 1 0
8 X 2 6 7 3
9 X 2 7 5 4
10 Y 1 3 2 1
11 Y 1 1 4 3
12 Y 1 4 3 2
13 Y 1 5 2 1
14 Y 1 9 8 3
where the numbers 0-14 are the index.
letter = letter from col header in DFs
number = number from col header in DFs
next 3 columns are data from the orig df's
(don't ask why the original data is in that funny format :)
thanks so much. my last Q wasn't well received so have tried to make this one better, thanks
Use:
#list of DataFrames
dfs = [dfdate, dfqty, dfprices]
#list comprehension with reshape
comb = [x.unstack() for x in dfs]
#join together
df = pd.concat(comb, axis=1, keys=['date', 'qty', 'price'])
#remove second level of MultiIndex and index to column
df = df.reset_index(level=1, drop=True).reset_index().rename(columns={'index':'col'})
#extract all values without first by indexing [1:] and first letter by [0]
df['Number'] = df['col'].str[1:]
df['Letter'] = df['col'].str[0]
cols = ['Letter', 'Number', 'date', 'qty', 'price']
#change order of columns
df = df.reindex(columns=cols)
print (df)
Letter Number date qty price
0 x 1 2 1 0
1 x 1 4 2 2
2 x 1 7 6 2
3 x 1 5 6 4
4 x 1 6 8 4
5 x 2 2 3 2
6 x 2 2 1 0
7 x 2 2 1 0
8 x 2 6 7 3
9 x 2 7 5 4
10 y 1 3 2 1
11 y 1 1 4 3
12 y 1 4 3 2
13 y 1 5 2 1
14 y 1 9 8 3
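An alternative sketch uses melt instead of unstack. It assumes the three frames have identical shapes and column order, so that their flattened values line up row by row. It also upper-cases the letter and converts the number to an integer to match the desired output in the question.

```python
import pandas as pd

dfdate = pd.DataFrame({'x1': [2, 4, 7, 5, 6],
                       'x2': [2, 2, 2, 6, 7],
                       'y1': [3, 1, 4, 5, 9]})
dfqty = pd.DataFrame({'x1': [1, 2, 6, 6, 8],
                      'x2': [3, 1, 1, 7, 5],
                      'y1': [2, 4, 3, 2, 8]})
dfprices = pd.DataFrame({'x1': [0, 2, 2, 4, 4],
                         'x2': [2, 0, 0, 3, 4],
                         'y1': [1, 3, 2, 1, 3]})

# melt stacks the columns one after another: all of x1, then x2, then y1
out = dfdate.melt(var_name='col', value_name='date')
out['qty'] = dfqty.melt()['value']
out['price'] = dfprices.melt()['value']

# Split the original column name into its letter and its number
out['Letter'] = out['col'].str[0].str.upper()
out['Number'] = out['col'].str[1:].astype(int)
out = out[['Letter', 'Number', 'date', 'qty', 'price']]
print(out)
```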
