new column containing header of another column based on conditionals - python

I'm relatively new to Python and I feel this is a complex task.
From dfa below, I'm trying to return the smallest and second-smallest values from a range of columns (dist1 through dist5) and the name of the column each value came from (e.g. "dist3"), placing this information into four new columns. A given distX column will hold a mix of numbers and missing values, stored either as the string 'NaN' or as np.nan.
dfa = pd.DataFrame({'date': ['09-03-1988', '10-03-1988', '11-03-1988', '12-03-1988', '13-03-1988'],
                    'dist1': ['NaN', 2, 'NaN', 'NaN', 30],
                    'dist2': [20, 21, 22, 23, 'NaN'],
                    'dist3': [120, 'NaN', 122, 123, 11],
                    'dist4': [40, 'NaN', 42, 43, 'NaN'],
                    'dist5': ['NaN', 1, 'NaN', 'NaN', 70]})
Task 1) I want to add two new columns, "fir_closest" and "fir_closest_dist".
fir_closest_dist should contain the smallest value from columns dist1 through dist5 (i.e. 20 for row 1, 11 for row 5).
fir_closest should contain the name of the column that the value in fir_closest_dist came from (i.e. "dist2" for the first row).
Task 2) Repeat the above for the second-smallest value, creating two new columns "sec_closest" and "sec_closest_dist".
The output table needs to look like dfb:
dfb = pd.DataFrame({'date': ['09-03-1988', '10-03-1988', '11-03-1988', '12-03-1988', '13-03-1988'],
                    'dist1': ['NaN', 2, 'NaN', 'NaN', 30],
                    'dist2': [20, 21, 22, 23, 'NaN'],
                    'dist3': [120, 'NaN', 122, 123, 11],
                    'dist4': [40, 'NaN', 42, 43, 'NaN'],
                    'dist5': ['NaN', 1, 'NaN', 'NaN', 70],
                    'fir_closest': ['dist2', 'dist5', 'dist2', 'dist2', 'dist3'],
                    'fir_closest_dist': [20, 1, 22, 23, 11],
                    'sec_closest': ['dist4', 'dist1', 'dist4', 'dist4', 'dist1'],
                    'sec_closest_dist': [40, 2, 42, 43, 30]})
Please can you show code or explain how best to approach this. What is the name for this method of populating new columns?
Thanks in advance

I think this may do what you need.
import pandas as pd
import numpy as np

# Reproducibility and data generation for the example
np.random.seed(0)
X = np.random.randint(low=0, high=10, size=(5, 5))
# Your data
df = pd.DataFrame(X, columns=[f'dist{j}' for j in range(5)])
# Row indices (df.shape[0] is the number of rows)
ix = range(df.shape[0])
col_names = df.columns.values
# Grab the raw values before adding new columns
vals = df.values
# argsort along each row gives the column positions of the
# smallest, second-smallest, ... values
arg_row_min, arg_row_min2, *rest = np.argsort(vals, axis=1).T
df['dist_min'] = col_names[arg_row_min]
df['num_min'] = vals[ix, arg_row_min]
df['dist_min2'] = col_names[arg_row_min2]
df['num_min2'] = vals[ix, arg_row_min2]
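The snippet above assumes purely numeric data. With the question's dfa, where missing values appear as the string 'NaN', one possible adaptation (my sketch, not part of the original answer; the pd.to_numeric coercion and the inf trick are my assumptions) is:

```python
import numpy as np
import pandas as pd

dfa = pd.DataFrame({'date': ['09-03-1988', '10-03-1988', '11-03-1988', '12-03-1988', '13-03-1988'],
                    'dist1': ['NaN', 2, 'NaN', 'NaN', 30],
                    'dist2': [20, 21, 22, 23, 'NaN'],
                    'dist3': [120, 'NaN', 122, 123, 11],
                    'dist4': [40, 'NaN', 42, 43, 'NaN'],
                    'dist5': ['NaN', 1, 'NaN', 'NaN', 70]})

dist_cols = [c for c in dfa.columns if c.startswith('dist')]
# Coerce the string 'NaN' entries to real NaN, giving float columns.
num = dfa[dist_cols].apply(pd.to_numeric, errors='coerce')
# Replace NaN with +inf so argsort never picks a missing distance.
order = np.argsort(num.fillna(np.inf).values, axis=1)
rows = np.arange(len(num))
dfa['fir_closest'] = num.columns[order[:, 0]]
dfa['fir_closest_dist'] = num.values[rows, order[:, 0]]
dfa['sec_closest'] = num.columns[order[:, 1]]
dfa['sec_closest_dist'] = num.values[rows, order[:, 1]]
```

This reproduces the four new columns of dfb from the question.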

Assuming your DataFrame is named df, and you have run import pandas as pd and import numpy as np:
# Example data
df = pd.DataFrame({'date': pd.date_range('2017-04-15', periods=5),
                   'name': ['Mullion']*5,
                   'dist1': [np.nan, np.nan, 30, 20, 15],
                   'dist2': [40, 30, 20, 15, 16],
                   'dist3': [101, 100, 98, 72, 11]})
df
date dist1 dist2 dist3 name
0 2017-04-15 NaN 40 101 Mullion
1 2017-04-16 NaN 30 100 Mullion
2 2017-04-17 30.0 20 98 Mullion
3 2017-04-18 20.0 15 72 Mullion
4 2017-04-19 15.0 16 11 Mullion
# Select only those columns with numeric data types. In your case, this is
# the same as:
# df_num = df[['dist1', 'dist2', ...]].copy()
df_num = df.select_dtypes(np.number)
# Get the column index of each row's minimum distance. First, fill NaN with
# numpy's infinity placeholder to ensure that NaN distances are never chosen.
idxs = df_num.fillna(np.inf).values.argsort(axis=1)
# The 1st column of idxs (which is idxs[:, 0]) contains the column index of
# each row's smallest distance.
# The 2nd column of idxs (which is idxs[:, 1]) contains the column index of
# each row's second-smallest distance.
# Convert the index of each row's closest distance to a column name.
# (df.columns is a list-like that holds the column names of df.)
df['closest_name'] = df_num.columns[idxs[:, 0]]
# Now get the distances themselves by indexing the underlying numpy array
# of values. There may be a more pandas-specific way of doing this, but
# this should be very fast.
df['closest_dist'] = df_num.values[np.arange(len(df_num)), idxs[:, 0]]
# Same idea for the second-closest distances.
df['second_closest_name'] = df_num.columns[idxs[:, 1]]
df['second_closest_dist'] = df_num.values[np.arange(len(df_num)), idxs[:, 1]]
df
date dist1 dist2 dist3 name closest_name closest_dist \
0 2017-04-15 NaN 40 101 Mullion dist2 40.0
1 2017-04-16 NaN 30 100 Mullion dist2 30.0
2 2017-04-17 30.0 20 98 Mullion dist2 20.0
3 2017-04-18 20.0 15 72 Mullion dist2 15.0
4 2017-04-19 15.0 16 11 Mullion dist3 11.0
second_closest_name second_closest_dist
0 dist3 101.0
1 dist3 100.0
2 dist1 30.0
3 dist1 20.0
4 dist1 15.0
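A more pandas-native variant (my own sketch, not from the answer above): idxmin/min skip NaN on their own, and the second-smallest value can be found by blanking out each row's minimum and repeating:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'dist1': [np.nan, np.nan, 30, 20, 15],
                   'dist2': [40, 30, 20, 15, 16],
                   'dist3': [101, 100, 98, 72, 11]})

dist = df[['dist1', 'dist2', 'dist3']]
# idxmin/min skip NaN by default, so no fillna is needed here.
df['closest_name'] = dist.idxmin(axis=1)
df['closest_dist'] = dist.min(axis=1)
# For the second closest, mask out each row's minimum and repeat.
masked = dist.copy()
for i, c in df['closest_name'].items():
    masked.at[i, c] = np.nan
df['second_closest_name'] = masked.idxmin(axis=1)
df['second_closest_dist'] = masked.min(axis=1)
```

This assumes each row has at least two non-NaN distances; idxmin raises on an all-NaN row.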


Finding the summation of values from two pandas dataframe columns

I have a pandas dataframe like below
import pandas as pd
data = [[5, 10], [4, 20], [15, 30], [20, 15], [12, 14], [5, 5]]
df = pd.DataFrame(data, columns=['x', 'y'])
I am trying to compute the value of the expression 0.5 * sum(x_i * y_{i+1} - x_{i+1} * y_i).
I haven't got an idea how to multiply the first value in a column with the 2nd value in another column, as in the expression.
Try pd.DataFrame.shift(), but I think you need to pass -1 to shift, judging by the summation notation you posted. i + 1 implies using the next x or y, so shift needs a negative integer to shift one step ahead; positive integers in shift go backwards.
Can you confirm -202.5 is the right answer?
0.5 * ((df.x * df.y.shift(-1)) - (df.x.shift(-1) * df.y)).sum()
>>> -202.5
I think the code below ends up with the correct value in expression_end:
import pandas as pd
data = [[5, 10], [4, 20], [15, 30], [20, 15], [12, 14], [5, 5]]
df = pd.DataFrame(data, columns=['x', 'y'])
df["x+1"] = df["x"].shift(periods=-1)
df["y+1"] = df["y"].shift(periods=-1)
df["exp"] = df["x"]*df["y+1"] - df["x+1"]*df["y"]
expression_end = 0.5*df["exp"].sum()
You can use pandas.DataFrame.shift(). You can compute shift(-1) once and use the result for both 'x' and 'y'.
>>> df_tmp = df.shift(-1)
>>> (df['x']*df_tmp['y'] - df_tmp['x']*df['y']).sum() * 0.5
-202.5
# Explanation
>>> df[['x+1', 'y+1']] = df.shift(-1)
>>> df
x y x+1 y+1
0 5 10 4.0 20.0 # x*(y+1) - y*(x+1) = 5*20 - 10*4
1 4 20 15.0 30.0
2 15 30 20.0 15.0
3 20 15 12.0 14.0
4 12 14 5.0 5.0
5 5 5 NaN NaN
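As a cross-check of the answers above, a plain-Python loop over consecutive row pairs (my own verification sketch) gives the same value as the shift(-1) approach:

```python
import pandas as pd

data = [[5, 10], [4, 20], [15, 30], [20, 15], [12, 14], [5, 5]]
df = pd.DataFrame(data, columns=['x', 'y'])

# Explicit loop over consecutive row pairs i and i+1.
total = 0.5 * sum(df.x[i] * df.y[i + 1] - df.x[i + 1] * df.y[i]
                  for i in range(len(df) - 1))

# Vectorized equivalent using shift(-1); the NaN products in the
# last row are skipped by .sum().
vec = 0.5 * (df.x * df.y.shift(-1) - df.x.shift(-1) * df.y).sum()
```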

Check if a column value is in a list and report to a new column

After this discussion, I have the following dataframe:
data = {'Item': ['1', '2', '3', '4', '5'],
        'Len': [142, 11, 50, 60, 12],
        'Hei': [55, 65, 130, 14, 69],
        'C': [68, -18, 65, 16, 17],
        'Thick': [60, 0, -150, 170, 130],
        'Vol': [230, 200, -500, 10, 160],
        'Fail': [['Len', 'Thick'], ['Thick'], ['Hei', 'Thick', 'Vol'], ['Vol'], ""]}
df = pd.DataFrame(data)
representing different items and the corresponding values of some of their parameters (Len, Hei, C, ...). The column Fail reports the parameters that failed, e.g. item 1 fails for parameters Len and Thick, item 3 fails for parameters Hei, Thick and Vol, while item 5 shows no failure.
For each item I need a new column reporting each failed parameter together with its value, in the following format: failed parameter = value. So, for the first item I should get Len=142 and Thick=60.
So far, I have exploded the Fail column into multiple columns:
failed_param = df['Fail'].apply(pd.Series)
failed_param = failed_param.rename(columns=lambda x: 'Failed_param_' + str(x + 1))
df2_list = failed_param.columns.values.tolist()
df2 = pd.concat([df[:], failed_param[:]], axis=1)
Then, if I do the following:
for name in df2_list:
    df2.loc[df2[f"{name}"] == "D", "new"] = "D" + "=" + df2["D"].map(str)
I can get what I need, but only for one parameter ("D" in this case). How can I obtain the same for all the parameters at once?
As mentioned in the question, you need to insert a new column (e.g., FailParams) that contains one string per row. Each string represents that item's failures (e.g., Len=142,Thick=60). A quick solution can be:
import pandas as pd
data = {
    'Item' : ['1', '2', '3', '4', '5'],
    'Len'  : [142, 11, 50, 60, 12],
    'Hei'  : [55, 65, 130, 14, 69],
    'C'    : [68, -18, 65, 16, 17],
    'Thick': [60, 0, -150, 170, 130],
    'Vol'  : [230, 200, -500, 10, 160],
    'Fail' : [['Len', 'Thick'], ['Thick'], ['Hei', 'Thick', 'Vol'], ['Vol'], []]
}
# Convert the dictionary into a DataFrame.
df = pd.DataFrame(data)
# The first solution: using a list comprehension.
column = [
    ",".join(  # Add commas between the list items.
        # Find the target items and their values.
        [el + "=" + str(df.loc[int(L[0]) - 1, el]) for el in L[1]]
    )
    if len(L[1]) > 0 else ""  # If the Fail entry is empty, return an empty string.
    for L in zip(df['Item'].values, df['Fail'].values)  # Loop over the Fail items.
]
# Insert the new column.
df['FailParams'] = column
# Print the DF after insertion.
print(df)
The previous solution uses a list comprehension. Another solution, using a plain loop:
# The second solution: using loops.
records = []
for L in zip(df['Item'].values, df['Fail'].values):
    if len(L[1]) <= 0:
        record = ""
    else:
        record = ",".join([el + "=" + str(df.loc[int(L[0]) - 1, el]) for el in L[1]])
    records.append(record)
print(records)
# Insert the new column.
df['FailParams'] = records
# Print the DF after insertion.
print(df)
A sample output should be:
Item Len Hei C Thick Vol Fail FailParams
0 1 142 55 68 60 230 [Len, Thick] Len=142,Thick=60
1 2 11 65 -18 0 200 [Thick] Thick=0
2 3 50 130 65 -150 -500 [Hei, Thick, Vol] Hei=130,Thick=-150,Vol=-500
3 4 60 14 16 170 10 [Vol] Vol=10
4 5 12 69 17 130 160 []
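If Fail is guaranteed to hold a list in every row (with [] for no failures, which is my assumption here, not something the question states), a single row-wise apply is also enough, since each row already carries both the values and the failed names. A sketch of my own:

```python
import pandas as pd

df = pd.DataFrame({
    'Item': ['1', '2', '3', '4', '5'],
    'Len': [142, 11, 50, 60, 12],
    'Hei': [55, 65, 130, 14, 69],
    'C': [68, -18, 65, 16, 17],
    'Thick': [60, 0, -150, 170, 130],
    'Vol': [230, 200, -500, 10, 160],
    'Fail': [['Len', 'Thick'], ['Thick'], ['Hei', 'Thick', 'Vol'], ['Vol'], []],
})

# Each row carries both the parameter values and the failed names,
# so one pass with axis=1 builds the string directly.
df['FailParams'] = df.apply(
    lambda row: ",".join(f"{p}={row[p]}" for p in row['Fail']), axis=1)
```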
It might be a good idea to build an intermediate representation first, something like this (I am assuming the empty cell in the Fail column is an empty list [] so as to match the datatype of the other values):
# create a Boolean mask to filter failed values
m = df.apply(lambda row: row.index.isin(row.Fail),
             axis=1,
             result_type='broadcast')
>>> df[m]
Item Len Hei C Thick Vol Fail
0 NaN 142.0 NaN NaN 60.0 NaN NaN
1 NaN NaN NaN NaN 0.0 NaN NaN
2 NaN NaN 130.0 NaN -150.0 -500.0 NaN
3 NaN NaN NaN NaN NaN 10.0 NaN
4 NaN NaN NaN NaN NaN NaN NaN
This allows you to actually do something with the failed values, too.
With that in place, generating the value list could be done by something similar to Hossam Magdy Balaha's answer, perhaps with a little function:
def join_params(row):
    row = row.dropna().to_dict()
    return ', '.join(f'{k}={v}' for k, v in row.items())
>>> df[m].apply(join_params, axis=1)
0 Len=142.0, Thick=60.0
1 Thick=0.0
2 Hei=130.0, Thick=-150.0, Vol=-500.0
3 Vol=10.0
4
dtype: object
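Put end to end (again assuming the empty Fail cell is an empty list []), the mask and the helper combine into a self-contained script, which is my assembly of the pieces above rather than part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    'Item': ['1', '2', '3', '4', '5'],
    'Len': [142, 11, 50, 60, 12],
    'Hei': [55, 65, 130, 14, 69],
    'C': [68, -18, 65, 16, 17],
    'Thick': [60, 0, -150, 170, 130],
    'Vol': [230, 200, -500, 10, 160],
    'Fail': [['Len', 'Thick'], ['Thick'], ['Hei', 'Thick', 'Vol'], ['Vol'], []],
})

# Boolean mask: True only where the column name appears in that row's Fail list.
m = df.apply(lambda row: row.index.isin(row.Fail),
             axis=1, result_type='broadcast')

def join_params(row):
    # Non-failed cells are NaN after masking, so dropna keeps only failures.
    row = row.dropna().to_dict()
    return ', '.join(f'{k}={v}' for k, v in row.items())

df['FailParams'] = df[m].apply(join_params, axis=1)
```

Note that the masked integer columns become floats, so the strings read Len=142.0 rather than Len=142.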

Multiply Columns from Multiple DataFrames, Keeping Non-Overlapping Columns, and Creating New DataFrames

For the sake of simplicity, let's say I have two dataframes as shown below:
import pandas as pd
df1 = {'B36': {'A44': 0.218, 'A45': 0.062, 'A46': 0.035, 'plt': 0.450, 'rs': 0.878},
       'B43': {'A44': 0.018, 'A45': 0.427, 'A46': 0.100, 'plt': 0.450, 'rs': 0.878}}
df1 = pd.DataFrame(df1)
df1
df2 = {'lID': [26, 26, 26, 26, 26, 12, 12, 12, 12, 12],
       'lCTY': [18, 18, 18, 18, 18, 18, 18, 18, 18, 18],
       'A44': [77, 37, 51, 55, 57, 10, 10, 10, 10, 10],
       'A45': [77, 37, 51, 55, 57, 10, 10, 10, 10, 10],
       'A46': [78, 36, 49, 53, 50, 99, 99, 99, 10, 99]}
df2 = pd.DataFrame(df2)
df2
What I want to do is multiply rows in df1 with similarly-named columns in df2, once for each column in df1, and output the results as separate dataframes. Additionally, I want to keep the non-overlapping rows/columns from both df1 and df2.
My attempt at doing both is shown below. What I was hoping for was a more concise way of going about this.
df1_index = set(df1.index)
df2_cols = set(df2.columns)
col = list(df1_index.intersection(df2_cols))
# multiply df1 B36 items with df2 columns
df2[col] = df1['B36'][col].mul(df2[col], axis=0, fill_value=1)
df2['rs'] = df1.loc['rs'][0]
df2['plt'] = df1.loc['plt'][0]
df2
# multiply df1 B43 items with df2 columns
df2[col] = df1['B43'][col].mul(df2[col], axis=0, fill_value=1)
df2['rs'] = df1.loc['rs'][0]
df2['plt'] = df1.loc['plt'][0]
df2
One option is to melt the columns before multiplying; I'd like to think that your approach may be more efficient (timing tests are a sure way to confirm or refute that):
# melt df2:
# keeping the index helps ensure unique index
# which we need later when unstacking
(df2
.melt(["lID", "lCTY"], ignore_index = False)
.set_index(["variable", "lID", "lCTY"], append = True)
# move variable to level 0, so we can use it for multiplying
.reorder_levels(['variable', 'lID', 'lCTY', None])
.value # not really necessary to convert to a series here
# multiply independently, since the index of df1 is unique;
# it is easier this way
.mul(df1.B36, level=0)
.mul(df1.B43, level=0)
.unstack('variable')
# sorting can be ignored if it is not that important
.sort_index(level = -1)
# assign the remaining rows that do not match
.assign(rs=df1.at["rs", "B36"], plt=df1.at["plt", "B36"])
.rename_axis(columns = None)
.reset_index(level = ['lID', 'lCTY'])
)
lID lCTY A44 A45 A46 rs plt
0 26 18 0.302148 2.038498 0.2730 0.878 0.45
1 26 18 0.145188 0.979538 0.1260 0.878 0.45
2 26 18 0.200124 1.350174 0.1715 0.878 0.45
3 26 18 0.215820 1.456070 0.1855 0.878 0.45
4 26 18 0.223668 1.509018 0.1750 0.878 0.45
5 12 18 0.039240 0.264740 0.3465 0.878 0.45
6 12 18 0.039240 0.264740 0.3465 0.878 0.45
7 12 18 0.039240 0.264740 0.3465 0.878 0.45
8 12 18 0.039240 0.264740 0.0350 0.878 0.45
9 12 18 0.039240 0.264740 0.3465 0.878 0.45
This matches the example shared. Depending on the complexity, you may have to break it up, put the pieces into separate variables and recombine them, or use the pipe method (I prefer the first, as pipe, when combined with anonymous functions, can lead to incomprehensible code).
@sammywemmy, based on your answer I used the following steps, which I think work well for what I want to do.
1. The multiplication code
This code excludes all non-matching indices/columns from df1 and df2:
d1idx = set(df1.index)
d2cols = set(df2.columns)
col = list(d1idx.intersection(d2cols))
collt = []
for idx, colname in enumerate(df1, start=1):
    collt.append(df1[colname][col].mul(df2[col], axis=0, fill_value=1))
print(collt)
2. Collect the non-matching indices/columns from df1 and df2
This combines the non-matching indices/columns from df1 and df2. I chose to use df1.loc['plt'][0] instead of df1.at[] because the variable name at that index might change, so this approach makes it a bit easier.
d2 = df2.loc[:, ['lID', 'lCTY']]
d2 = d2.join(pd.DataFrame({'plt': df1.loc['plt'][0],
                           'rs': df1.loc['rs'][0]}, index=d2.index))
print(d2)
3. Merge the non-matching columns into the multiplied dataframes
Finally, add the non-matching columns back into each dataframe:
lstdf = {}
for idx, dif in enumerate(collt, start=1):
    lstdf[idx] = dif.merge(d2, left_index=True, right_index=True)
print(lstdf)
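For comparison, steps 1-3 can be collapsed into a single dict comprehension keyed by df1's columns. This is my own sketch, not from the answers above; the name out and the trimmed three-row df2 are my assumptions:

```python
import pandas as pd

df1 = pd.DataFrame({'B36': {'A44': 0.218, 'A45': 0.062, 'A46': 0.035, 'plt': 0.450, 'rs': 0.878},
                    'B43': {'A44': 0.018, 'A45': 0.427, 'A46': 0.100, 'plt': 0.450, 'rs': 0.878}})
df2 = pd.DataFrame({'lID': [26, 26, 12],
                    'lCTY': [18, 18, 18],
                    'A44': [77, 37, 10],
                    'A45': [77, 37, 10],
                    'A46': [78, 36, 99]})

col = list(set(df1.index) & set(df2.columns))
# One output frame per df1 column: the shared columns are multiplied,
# the non-shared df1 rows (plt, rs) become constant columns.
out = {
    c: pd.concat([df2[['lID', 'lCTY']],
                  df2[col].mul(df1.loc[col, c], axis=1)], axis=1)
         .assign(plt=df1.at['plt', c], rs=df1.at['rs', c])
    for c in df1.columns
}
```

out['B36'] and out['B43'] then play the role of the lstdf entries above.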

How to iterate over column values to create transformed variables in Python?

I have 10 columns and have to create 10 new columns containing the sin of the original columns. How can I do this in Python? I tried a for loop, but it gives me an error.
My data frame is:
d = {'col1': [0, 15, 30, 45, 60], 'col2': [0, 60, 180, 240, 300]}
df = pd.DataFrame(data=d)
I have created a function transform, but how do I specify all the columns using var? I get an error with the command below.
df = df.pipe(transform, var=[0, 60])
sample output:
col1_sin col2_sin col1_cos col2_cos
0 0.000000 0.304811 1.000000 -0.952413
1 0.650288 0.000000 -0.759688 1.000000
2 -0.988032 0.580611 0.154251 0.814181
3 0.850904 -0.801153 0.525322 -0.598460
4 -0.304811 0.945445 -0.952413 0.325781
You have not shown what your function is doing; from the sample output, I infer sin() and cos().
Below shows how pipe() can be used with parameters, in this case the functions you want to apply to all columns in the dataframe.
import math
import pandas as pd
d = {'col1': [0, 15, 30, 45, 60], 'col2': [0, 60, 180, 240, 300]}
df = pd.DataFrame(data=d)
def pipefoo(dfa, ops=[math.sin]):
    return dfa.assign(**{f"{c}_{f.__name__}": dfa[c].apply(f) for f in ops for c in dfa.columns})
df.pipe(pipefoo, ops=[math.sin, math.cos])
   col1  col2  col1_sin  col2_sin  col1_cos  col2_cos
0     0     0  0.000000  0.000000  1.000000  1.000000
1    15    60  0.650288 -0.304811 -0.759688 -0.952413
2    30   180 -0.988032 -0.801153  0.154251 -0.598460
3    45   240  0.850904  0.945445  0.525322  0.325781
4    60   300 -0.304811 -0.999756 -0.952413 -0.022097
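math.sin via apply works element by element; numpy's sin/cos ufuncs accept a whole Series, so the same pipe pattern can be vectorized. This is a variant sketch of mine (the name add_trig is an assumption, not from the answer above):

```python
import numpy as np
import pandas as pd

d = {'col1': [0, 15, 30, 45, 60], 'col2': [0, 60, 180, 240, 300]}
df = pd.DataFrame(data=d)

def add_trig(frame, ops=(np.sin, np.cos)):
    # np.sin/np.cos operate on the whole Series at once,
    # so no per-element apply is needed.
    return frame.assign(**{f"{c}_{f.__name__}": f(frame[c])
                           for f in ops for c in frame.columns})

res = df.pipe(add_trig)
```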

Within a dataframe, copy selected cells based on filter critera to another row within the same dataframe

I would like to copy the values of certain cells, selected by a filter on another cell, to specific rows.
import pandas as pd
sales = {'Flight Number': ['LX2104', 'LX2104', 'LX2104', 'LX2105', 'LX2105', 'LX2105', 'LX2106', 'LX2106', 'LX2106'],
         'STD Departure': [0, 1, 2, 0, 1, 2, 0, 1, 2],
         'Bircher': [200, 210, 90, 40, 20, 10, 10, 30, 20],
         'Carac': [140, 215, 95, 40, 50, 30, 40, 30, 50]}
df = pd.DataFrame.from_dict(sales)
I would like to copy the cells "Bircher" and "Carac" from the rows with Flight Number LX2104 to the rows with Flight Number LX2105. The values in "STD Departure" should stay unchanged.
You can do this; it may be visually clearer:
df.loc[df['Flight Number'] == 'LX2104', 'Bircher'] = df[df['Flight Number'] == 'LX2105'].Bircher.values
df.loc[df['Flight Number'] == 'LX2104', 'Carac'] = df[df['Flight Number'] == 'LX2105'].Carac.values
Output:
Flight Number STD Departure Bircher Carac
0 LX2104 0 40.0 40
1 LX2104 1 20.0 50
2 LX2104 2 10.0 30
3 LX2105 0 40.0 40
4 LX2105 1 20.0 50
5 LX2105 2 10.0 30
6 LX2106 0 10.0 40
7 LX2106 1 30.0 30
8 LX2106 2 20.0 50
You can also use the following, but I think it is less clear:
df.loc[df['Flight Number'] == 'LX2104', ['Bircher', 'Carac']] = df[df['Flight Number'] == 'LX2105'][['Bircher', 'Carac']].values
I will try to explain this code. I use df.loc[row_index, column_index] to get a slice (the right rows and columns). df['Flight Number'] == 'LX2104' returns a Boolean array with True values where the flight number is LX2104, which selects the needed rows, and then I just pass the column names to select the needed columns. On the right-hand side I do the same, but with the other flight number. Be careful: if the two slices do not have the same length (number of rows), it won't work.
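The .values assignment relies on the two slices having the same length and row order. A variant of my own (not from the answer above, and shown on a frame trimmed of the LX2106 rows) that aligns on STD Departure instead of row position:

```python
import pandas as pd

sales = {'Flight Number': ['LX2104', 'LX2104', 'LX2104', 'LX2105', 'LX2105', 'LX2105'],
         'STD Departure': [0, 1, 2, 0, 1, 2],
         'Bircher': [200, 210, 90, 40, 20, 10],
         'Carac': [140, 215, 95, 40, 50, 30]}
df = pd.DataFrame.from_dict(sales)

# Look up LX2105's values by STD Departure rather than by row position.
src = df[df['Flight Number'] == 'LX2105'].set_index('STD Departure')[['Bircher', 'Carac']]
mask = df['Flight Number'] == 'LX2104'
df.loc[mask, ['Bircher', 'Carac']] = src.loc[df.loc[mask, 'STD Departure']].values
```

This way the copy stays correct even if the two flights' rows are stored in different orders.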
