Creating a new column based on other columns from another dataframe - python

I have 2 dataframes:
df1
Name Apples Pears Grapes Peachs
James 3 5 5 2
Harry 1 0 2 9
Will 20 2 7 3
df2
Class User Factor
A Harry 3
A Will 2
A James 5
B NaN 4
I want to create a new column in df2 called Total which is a list of all the columns for each user in df1, multiplied by the Factor for that user - this should only be done if they are in Class A.
This is how the final df should look
df2
Class User Factor Total
A Harry 3 [3,0,6,27]
A Will 2 [40,4,14,6]
A James 5 [15,25,25,10]
B NaN 4
This is what I tried:
df2['Total'] = list(df1.Name.isin((df2.User) and (df2.Class==A)) * df2.Factor)

This will do what your question asks (df here is the question's df1, as in the full test code below):
df2 = df2[df2.Class=='A'].join(df.set_index('Name'), on='User').set_index(['Class','User'])
df2['Total'] = df2.apply(lambda x: list(x * x.Factor)[1:], axis=1)
df2 = df2.reset_index()[['Class','User','Factor','Total']]
Full test code:
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=[
x.strip() for x in 'Name Apples Pears Grapes Peachs'.split()], data =[
['James', 3, 5, 5, 2],
['Harry', 1, 0, 2, 9],
['Will', 20, 2, 7, 3]])
print(df)
df2 = pd.DataFrame(columns=[
x.strip() for x in 'Class User Factor'.split()], data =[
['A', 'Harry', 3],
['A', 'Will', 2],
['A', 'James', 5],
['B', np.nan, 4]])
print(df2)
df2 = df2[df2.Class=='A'].join(df.set_index('Name'), on='User').set_index(['Class','User'])
df2['Total'] = df2.apply(lambda x: list(x * x.Factor)[1:], axis=1)
df2 = df2.reset_index()[['Class','User','Factor','Total']]
print(df2)
Input:
Name Apples Pears Grapes Peachs
0 James 3 5 5 2
1 Harry 1 0 2 9
2 Will 20 2 7 3
Class User Factor
0 A Harry 3
1 A Will 2
2 A James 5
3 B NaN 4
Output
Class User Factor Total
0 A Harry 3 [3, 0, 6, 27]
1 A Will 2 [40, 4, 14, 6]
2 A James 5 [15, 25, 25, 10]

You can use:
# Fruit columns of df1 to be scaled
cols = ['Apples', 'Pears', 'Grapes', 'Peachs']
# First lookup
factor = df2[df2['Class'] == 'A'].set_index('User')['Factor']
df1['Total'] = df1[cols].mul(df1['Name'].map(factor), axis=0).agg(list, axis=1)
# Second lookup
df2['Total'] = df2['User'].map(df1.set_index('Name')['Total'])
Output:
>>> df2
Class User Factor Total
0 A Harry 3 [3, 0, 6, 27]
1 A Will 2 [40, 4, 14, 6]
2 A James 5 [15, 25, 25, 10]
3 B NaN 4 NaN
>>> df1
Name Apples Pears Grapes Peachs Total
0 James 3 5 5 2 [15, 25, 25, 10]
1 Harry 1 0 2 9 [3, 0, 6, 27]
2 Will 20 2 7 3 [40, 4, 14, 6]

One-liner masochists, greetings ;)
df2['Total'] = pd.Series(df1.sort_values(by='Name').reset_index(drop=True).iloc[:,1:5]\
.mul(df2[df2.Class == 'A'].sort_values(by='User')['Factor'].reset_index(drop=True), axis=0)\
.values.tolist())
df2
Output:
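Class User Factor Total
0 A Harry 3 [3, 0, 6, 27]
1 A Will 2 [15, 25, 25, 10]
2 A James 5 [40, 4, 14, 6]
3 B NaN 4 NaN
Note that the Totals for Will and James are swapped relative to the expected result: both frames are sorted by name before multiplying, but the resulting lists are then assigned back to df2 by position, in its original row order.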
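The one-liner pairs rows by sorted position, which only matches df2 if df2 is itself sorted by User. A minimal sketch of a variant that looks rows up by name instead is shown here; the helper names fruit, in_class_a and scaled are illustrative and not part of the original answers.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Name': ['James', 'Harry', 'Will'],
                    'Apples': [3, 1, 20], 'Pears': [5, 0, 2],
                    'Grapes': [5, 2, 7], 'Peachs': [2, 9, 3]})
df2 = pd.DataFrame({'Class': ['A', 'A', 'A', 'B'],
                    'User': ['Harry', 'Will', 'James', np.nan],
                    'Factor': [3, 2, 5, 4]})

# Index the fruit counts by name so rows can be fetched by label
fruit = df1.set_index('Name')

# Rows of df2 that are in Class A and have a matching name in df1
in_class_a = df2['Class'].eq('A') & df2['User'].isin(fruit.index)

# Fetch the fruit rows in df2's own order and scale each row by its Factor
scaled = fruit.loc[df2.loc[in_class_a, 'User']].mul(
    df2.loc[in_class_a, 'Factor'].to_numpy(), axis=0)

# Assign one list per selected row; rows outside Class A stay NaN
df2['Total'] = pd.Series(scaled.to_numpy().tolist(),
                         index=df2.index[in_class_a.to_numpy()])
print(df2)
```

Because the lookup is by label, the result does not depend on the row order of either frame.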

Related

Grouping by ID choosing highest values in columns from same ID

I have a problem trying to calculate some final test marks. I need to group by student, keeping only the highest value in each column for each student.
With df being the dataframe:
data = {'Students': ['Student1', 'Student1', 'Student1', 'Student2','Student2','Student3'],
'Result1': [2, 4, 5, 8, 2, 5],
'Result2': [5, 3, 2, 8, 5, 5],
'Result3': [7, 5, 7, 3, 8, 9]}
df = pd.DataFrame(data)
Students Result1 Result2 Result3
0 Student1 2 5 7
1 Student1 4 3 5
2 Student1 5 2 7
3 Student2 8 8 3
4 Student2 2 5 8
5 Student3 5 5 9
I need to generate a DF choosing the highest mark, for each student, in each Result column.
So, the final DF should look like:
Students Result1 Result2 Result3
0 Student1 5 5 7
1 Student2 8 8 8
2 Student3 5 5 9
Any help?
The dataframe can be generated by simply iterating over the groups:
# collect one row per student, then build the frame once
# (DataFrame.append was removed in pandas 2.0)
rows = []
for s in df.Students.unique():
    stdf = df[df["Students"] == s]
    rows.append({'Student': s, 'res1': stdf.Result1.max(), 'res2': stdf.Result2.max(),
                 'res3': stdf.Result3.max()})
df2 = pd.DataFrame(rows, columns=['Student', 'res1', 'res2', 'res3'])
Calling groupby('Students').max() works:
>>> df.groupby('Students').max()
Result1 Result2 Result3
Students
Student1 5 5 7
Student2 8 8 8
Student3 5 5 9
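To get Students back as a regular column with a 0..n-1 index, as in the desired output, one option is to pass as_index=False. A minimal sketch, assuming the 'Studen3' label in the sample data is a typo for 'Student3'; the name best is illustrative:

```python
import pandas as pd

data = {'Students': ['Student1', 'Student1', 'Student1',
                     'Student2', 'Student2', 'Student3'],
        'Result1': [2, 4, 5, 8, 2, 5],
        'Result2': [5, 3, 2, 8, 5, 5],
        'Result3': [7, 5, 7, 3, 8, 9]}
df = pd.DataFrame(data)

# as_index=False keeps the grouping key as an ordinary column
best = df.groupby('Students', as_index=False).max()
print(best)
```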

Pandas groupby cumcount starting on row with a certain column value

I'd like to create two cumcount columns, depending on the values of two columns.
In the example below, I'd like one cumcount starting when colA is at least 100, and another cumcount starting when colB is at least 10.
columns = ['ID', 'colA', 'colB', 'cumcountA', 'cumountB']
data = [['A', 3, 1, '',''],
['A', 20, 4, '',''],
['A', 102, 8, 1, ''],
['A', 117, 10, 2, 1],
['B', 75, 0, '',''],
['B', 170, 12, 1, 1],
['B', 200, 13, 2, 2],
['B', 300, 20, 3, 3],
]
pd.DataFrame(columns=columns, data=data)
ID colA colB cumcountA cumountB
0 A 3 1
1 A 20 4
2 A 102 8 1
3 A 117 10 2 1
4 B 75 0
5 B 170 12 1 1
6 B 200 13 2 2
7 B 300 20 3 3
How would I calculate cumcountA and cumcountB?
You can clip each column from below at its threshold with df.clip (here 100 and 10), compare the original values with the clipped ones, then group by ID and take the cumulative sum:
col_list = ['colA','colB']
val_list = [100,10]
df[['cumcountA','cumountB']] = (df[col_list].ge(df[col_list].clip(lower=val_list,axis=1))
.groupby(df['ID']).cumsum().replace(0,''))
print(df)
Or, maybe even better, compare directly:
df[['cumcountA','cumountB']] = (df[['colA','colB']].ge([100,10])
.groupby(df['ID']).cumsum().replace(0,''))
print(df)
ID colA colB cumcountA cumountB
0 A 3 1
1 A 20 4
2 A 102 8 1
3 A 117 10 2 1
4 B 75 0
5 B 170 12 1 1
6 B 200 13 2 2
7 B 300 20 3 3
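Both versions count the rows at or above the threshold, which equals "count from the first row that reaches the threshold" only as long as the values never drop back below it, as in the sample data. If they can drop back, one possible variant is a per-group cummax before the cumsum. This is a sketch on hypothetical data; the column names count_ge and count_since are made up for illustration:

```python
import pandas as pd

# Hypothetical data: colA reaches 100 in row 1, then dips below it in row 2
df = pd.DataFrame({'ID': ['A', 'A', 'A', 'A'],
                   'colA': [50, 120, 80, 130]})

# 1 where the value is at least the threshold, else 0
reached = df[['colA']].ge(100).astype(int)

# cummax keeps the flag at 1 once the threshold has been reached in the group
started = reached.groupby(df['ID']).cummax()

# Counts only the rows that are themselves >= 100
df['count_ge'] = reached.groupby(df['ID']).cumsum()['colA']
# Counts every row from the first one that reached 100
df['count_since'] = started.groupby(df['ID']).cumsum()['colA']
print(df)
```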

Add new column in the Dataset with condition?

I am new to pandas and Python, and I want to add a new column that collects all of a row's data and joins it into one value, for example:
df_final = pd.read_csv('df_final.csv')
House_No = df_final['House_No_'].copy()
Street = df_final['Street'].copy()
City = df_final['City'].copy()
District = df_final['District'].copy()
Postl_Code = df_final['Postl_Code'].copy()
df_final['Full_Address']=(House_No +' , '+ Street +' , '+ City +' , '+ District +' , '+ str(Postl_Code))
In the output, when House_No is null the new cell becomes null (see rows 7, 8 and 9 in the image).
How can I ignore the null cell and just take the rest of the row?
Thank you in advance.
import numpy as np
import pandas as pd
# sample data
np.random.seed(1)
df = pd.DataFrame(np.random.randint(1,10, size=(5,5)), columns=list('abcde'))
df.iloc[2, 0] = np.nan
# use np.where with join
df['new_col'] = np.where(df['a'].isna(), df.iloc[:, 1:].astype(str).apply(', '.join, axis=1),
df.astype(str).apply(', '.join, axis=1))
a b c d e new_col
0 6.0 9 6 1 1 6.0, 9, 6, 1, 1
1 2.0 8 7 3 5 2.0, 8, 7, 3, 5
2 NaN 3 5 3 5 3, 5, 3, 5
3 8.0 8 2 8 1 8.0, 8, 2, 8, 1
4 7.0 8 7 2 1 7.0, 8, 7, 2, 1
or if you do not care if the nan is in the final output simply do:
df['new_col1'] = df.astype(str).apply(', '.join, axis=1)
a b c d e new_col1
0 6.0 9 6 1 1 6.0, 9, 6, 1, 1
1 2.0 8 7 3 5 2.0, 8, 7, 3, 5
2 NaN 3 5 3 5 nan, 3, 5, 3, 5
3 8.0 8 2 8 1 8.0, 8, 2, 8, 1
4 7.0 8 7 2 1 7.0, 8, 7, 2, 1
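The np.where version only handles a missing value in column a. If any column may be null, one option is to drop the nulls row by row before joining. A minimal sketch using column names from the question and made-up address values:

```python
import numpy as np
import pandas as pd

# Made-up sample rows; only the column names come from the question
df_final = pd.DataFrame({
    'House_No_': ['12', np.nan, '7'],
    'Street': ['Main St', 'Oak Ave', np.nan],
    'City': ['Springfield', 'Shelbyville', 'Ogdenville'],
})

address_cols = ['House_No_', 'Street', 'City']

# For each row, drop the missing parts and join whatever is left
df_final['Full_Address'] = df_final[address_cols].apply(
    lambda row: ' , '.join(row.dropna().astype(str)), axis=1)
print(df_final['Full_Address'].tolist())
```

A row-wise apply is slower than vectorised string concatenation, so this trades speed for handling nulls in any position.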

Add multiple dataframes based on one column

I have several hundred dataframes with same column names, like this:
df1
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28269 0.07365 22.16080 4050.311360
1 4208.98 5 0.48122 0.08765 44.90035 4208.972962
2 4374.94 9 0.71483 0.11429 86.96497 4374.927110
3 4379.74 9 0.31404 0.09107 30.44271 4379.760601
4 4398.01 14 0.50415 0.09845 52.83236 4398.007473
5 5520.50 1 0.06148 0.12556 8.21685 5520.484742
df2
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28616 0.07521 22.91064 4050.327388
1 4208.98 6 0.48781 0.08573 44.51609 4208.990029
2 4374.94 9 0.71548 0.11437 87.10152 4374.944513
3 4379.74 10 0.31338 0.09098 30.34791 4379.778009
4 4398.01 15 0.49950 0.08612 45.78707 4398.020367
5 4502.21 9 0.56362 0.10114 60.67868 4502.223123
6 4508.28 3 0.69554 0.11600 85.88428 4508.291777
7 4512.99 2 0.20486 0.08891 19.38745 4512.999332
8 5520.50 1 0.06148 0.12556 8.21685 5520.484742
This is how I'm reading them:
path_to_files = '/home/Desktop/computed_2d/'
lst = []
for filen in dir1:
    df = pd.read_table(path_to_files+filen, skiprows=0, usecols=(0,1,2,3,4,8),names=['wave','num','stlines','fwhm','EWs','MeasredWave'],delimiter=r'\s+')
    lst.append(df)
The desired result should look like this:
wave num stlines fwhm EWs MeasredWave
0 4050.32 3.0 0.284425 0.074430 22.535720 4050.319374
1 4208.98 5.5 0.484515 0.086690 44.708220 4208.981496
2 4374.94 9.0 0.715155 0.114330 87.033245 4374.935812
3 4379.74 9.5 0.313710 0.091025 30.395310 4379.769305
4 4398.01 14.5 0.501825 0.092285 49.309715 4398.013920
5 4502.21 9 0.56362 0.10114 60.67868 4502.223123
6 4508.28 3 0.69554 0.11600 85.88428 4508.291777
7 4512.99 2 0.20486 0.08891 19.38745 4512.999332
8 5520.50 1.0 0.061480 0.125560 8.216850 5520.484742
As you can see, the numbers of rows are not the same. Now I want to take the average of all the dataframes based on the first column, wave, and I want to make sure that each wave value in df1 is matched with the same wave value in df2.
You can combine the dataframes with an outer pd.merge on wave and then take the average of the columns that share a name:
df3 = pd.merge(df1,df2,on=['wave'],how ='outer',)
df4 = df3.rename(columns = lambda x: x.split('_')[0]).T
df4.groupby(df4.index).mean().T
Out:
EWs MeasredWave fwhm num stlines wave
0 22.535720 4050.319374 0.074430 3.0 0.284425 4050.32
1 44.708220 4208.981496 0.086690 5.5 0.484515 4208.98
2 87.033245 4374.935812 0.114330 9.0 0.715155 4374.94
3 30.395310 4379.769305 0.091025 9.5 0.313710 4379.74
4 49.309715 4398.013920 0.092285 14.5 0.501825 4398.01
5 8.216850 5520.484742 0.125560 1.0 0.061480 5520.50
6 60.678680 4502.223123 0.101140 9.0 0.563620 4502.21
7 85.884280 4508.291777 0.116000 3.0 0.695540 4508.28
8 19.387450 4512.999332 0.088910 2.0 0.204860 4512.99
Here is an example to do what you need:
import pandas as pd
df1 = pd.DataFrame({'A': [0, 1, 2, 3],
'B': [0, 1, 2, 3],
'C': [0, 1, 2, 3],
'D': [0, 1, 2, 3]},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': [4, 5, 6, 7],
'B': [4, 5, 6, 7],
'C': [4, 5, 6, 7],
'D': [4, 5, 6, 7]},
index=[0, 1, 2, 3])
df3 = pd.DataFrame({'A': [8, 9, 10, 11],
'B': [8, 9, 10, 11],
'C': [8, 9, 10, 11],
'D': [8, 9, 10, 11]},
index=[0, 1, 2, 3])
df4 = pd.concat([df1, df2, df3])
df5 = pd.concat([df1, df2, df3], ignore_index=True)
print(df4)
print('\n\n')
print(df5)
print(f"Average of column A = {df4['A'].mean()}")
You will have
A B C D
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
0 4 4 4 4
1 5 5 5 5
2 6 6 6 6
3 7 7 7 7
0 8 8 8 8
1 9 9 9 9
2 10 10 10 10
3 11 11 11 11
A B C D
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
10 10 10 10 10
11 11 11 11 11
Average of column A = 5.5
The answer from @Naga Kiran is great. I updated the whole solution here:
import pandas as pd
df1 = pd.DataFrame(
{'wave' : [4050.32, 4208.98, 4374.94, 4379.74, 4398.01, 5520.50],
'num' : [3, 5, 9, 9, 14, 1],
'stlines' : [0.28269, 0.48122, 0.71483, 0.31404, 0.50415, 0.06148],
'fwhm' : [0.07365, 0.08765, 0.11429, 0.09107, 0.09845, 0.12556],
'EWs' : [22.16080, 44.90035, 86.96497, 30.44271, 52.83236, 8.21685],
'MeasredWave' : [4050.311360, 4208.972962, 4374.927110, 4379.760601, 4398.007473, 5520.484742]},
index=[0, 1, 2, 3, 4, 5])
df2 = pd.DataFrame(
{'wave' : [4050.32, 4208.98, 4374.94, 4379.74, 4398.01, 4502.21, 4508.28, 4512.99, 5520.50],
'num' : [3, 6, 9, 10, 15, 9, 3, 2, 1],
'stlines' : [0.28616, 0.48781, 0.71548, 0.31338, 0.49950, 0.56362, 0.69554, 0.20486, 0.06148],
'fwhm' : [0.07521, 0.08573, 0.11437, 0.09098, 0.08612, 0.10114, 0.11600, 0.08891, 0.12556],
'EWs' : [22.91064, 44.51609, 87.10152, 30.34791, 45.78707, 60.67868, 85.88428, 19.38745, 8.21685],
'MeasredWave' : [4050.327388, 4208.990029, 4374.944513, 4379.778009, 4398.020367, 4502.223123, 4508.291777, 4512.999332, 5520.484742]},
index=[0, 1, 2, 3, 4, 5, 6, 7, 8])
df3 = pd.merge(df1, df2, on='wave', how='outer')
df4 = df3.rename(columns = lambda x: x.split('_')[0]).T
df5 = df4.groupby(df4.index).mean().T
df6 = df5[['wave', 'num', 'stlines', 'fwhm', 'EWs', 'MeasredWave']]
df7 = df6.sort_values('wave', ascending = True).reset_index(drop=True)
df7
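The merge-based solution handles two frames. For the several hundred frames collected in lst, one possible alternative is to stack them all with pd.concat and average per wave with groupby. A sketch with two shortened, illustrative frames (df_a, df_b and averaged are hypothetical names); it assumes the wave values are exactly equal across files, since grouping on floats matches only identical values:

```python
import pandas as pd

# Shortened stand-ins for the frames read from disk
df_a = pd.DataFrame({'wave': [4050.32, 4208.98, 5520.50],
                     'num': [3, 5, 1],
                     'EWs': [22.16080, 44.90035, 8.21685]})
df_b = pd.DataFrame({'wave': [4050.32, 4208.98, 4502.21, 5520.50],
                     'num': [3, 6, 9, 1],
                     'EWs': [22.91064, 44.51609, 60.67868, 8.21685]})
lst = [df_a, df_b]

# Stack all frames vertically, then average every column per wave value;
# a wave present in only some frames is averaged over those frames only
averaged = (pd.concat(lst, ignore_index=True)
              .groupby('wave', as_index=False)
              .mean())
print(averaged)
```

The result comes out sorted by wave, so no separate sort_values step is needed.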

create a dataframe from 3 other dataframes in python

I am trying to create a new df which summarises my key information, by taking that information from 3 (say) other dataframes.
dfdate = {'x1': [2, 4, 7, 5, 6],
          'x2': [2, 2, 2, 6, 7],
          'y1': [3, 1, 4, 5, 9]}
dfdate = pd.DataFrame(dfdate, index=range(5))
dfqty = {'x1': [1, 2, 6, 6, 8],
         'x2': [3, 1, 1, 7, 5],
         'y1': [2, 4, 3, 2, 8]}
dfqty = pd.DataFrame(dfqty, index=range(5))
dfprices = {'x1': [0, 2, 2, 4, 4],
            'x2': [2, 0, 0, 3, 4],
            'y1': [1, 3, 2, 1, 3]}
dfprices = pd.DataFrame(dfprices, index=range(5))
Let us say the above 3 dataframes are my data. Say, some dates, qty, and prices of goods. My new df is to be constructed from the above data:
rng = len(dfprices.columns)*len(dfprices.index) # This is the len of new df
dfnew = pd.DataFrame(np.nan,index=range(0,rng),columns=['Letter', 'Number', 'date', 'qty', 'price'])
Now, this is where I struggle to piece things together. I am trying to take all the data in dfdate and put it into a single column in the new df, and the same with dfqty and dfprices (so each 5x3 matrix is essentially flattened into a 15-element vector and placed into the new df).
As well as that, I need a couple of columns in dfnew as identifiers, taken from the column names of the old dfs.
I've tried for loops but to no avail, and I don't know how to convert a df to a series. My desired output is:
dfnew:
'Lettercol','Numbercol', 'date', 'qty', 'price'
0 X 1 2 1 0
1 X 1 4 2 2
2 X 1 7 6 2
3 X 1 5 6 4
4 X 1 6 8 4
5 X 2 2 3 2
6 X 2 2 1 0
7 X 2 2 1 0
8 X 2 6 7 3
9 X 2 7 5 4
10 Y 1 3 2 1
11 Y 1 1 4 3
12 Y 1 4 3 2
13 Y 1 5 2 1
14 Y 1 9 8 3
where the numbers 0-14 are the index.
letter = letter from col header in DFs
number = number from col header in DFs
next 3 columns are data from the orig df's
(don't ask why the original data is in that funny format :)
thanks so much. my last Q wasn't well received so have tried to make this one better, thanks
Use:
#list of DataFrames
dfs = [dfdate, dfqty, dfprices]
#list comprehension with reshape
comb = [x.unstack() for x in dfs]
#join together
df = pd.concat(comb, axis=1, keys=['date', 'qty', 'price'])
#remove second level of MultiIndex and index to column
df = df.reset_index(level=1, drop=True).reset_index().rename(columns={'index':'col'})
#extract all values without first by indexing [1:] and first letter by [0]
df['Number'] = df['col'].str[1:]
df['Letter'] = df['col'].str[0]
cols = ['Letter', 'Number', 'date', 'qty', 'price']
#change order of columns
df = df.reindex(columns=cols)
print (df)
Letter Number date qty price
0 x 1 2 1 0
1 x 1 4 2 2
2 x 1 7 6 2
3 x 1 5 6 4
4 x 1 6 8 4
5 x 2 2 3 2
6 x 2 2 1 0
7 x 2 2 1 0
8 x 2 6 7 3
9 x 2 7 5 4
10 y 1 3 2 1
11 y 1 1 4 3
12 y 1 4 3 2
13 y 1 5 2 1
14 y 1 9 8 3
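The desired output has upper-case letters, and Number above is still a string. One way to get both, sketched here on the column labels alone, is str.extract with named groups. The pattern assumes every label is letters followed by digits, and the name parts is illustrative:

```python
import pandas as pd

# The column labels as they appear in the 'col' column
cols = pd.Series(['x1', 'x2', 'y1'], name='col')

# Named groups become the column names of the extracted frame
parts = cols.str.extract(r'(?P<Letter>[A-Za-z]+)(?P<Number>\d+)')
parts['Letter'] = parts['Letter'].str.upper()     # 'x' -> 'X'
parts['Number'] = parts['Number'].astype(int)     # '1' -> 1
print(parts)
```

Unlike slicing with .str[0] and .str[1:], this also copes with multi-letter prefixes.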
