Substituting variable in a dataframe row based on other row's value - python

I have a dataframe with ID, Formula, and Dependent ID columns; the Dependent ID values were extracted from the Formula column.
Now I need to substitute every dependent ID in each formula with that ID's own formula, as defined in the dataframe.
My approach is to run a nested loop over the rows and substitute each dependent ID in the formula using the replace function, stopping when no more substitutions are possible. However, I don't know where to begin, and I'm not sure this is the correct approach.
Is there a function that would make this process easier?
Here is the code to create the current dataframe:
import pandas as pd

data = pd.DataFrame({'ID': ['A1', 'A3', 'B2', 'C2', 'D3', 'E3'],
                     'Formula': ['C2/500', 'If B2 >10 then (B2*D3) + 100 else D3+10', 'E3/2 +20', 'E3/2 +20', 'var_i', 'var_x'],
                     'Dependent ID': ['C2', 'B2, D3', 'E3', 'D3, E3', '', '']})
Here are examples of my current dataframe and my desired end result (images not included).

Recursively replace each dependent ID inside a formula with that ID's own formula:
import pandas as pd

df = pd.DataFrame({'ID': ['A1', 'A3', 'B2', 'C2', 'D3', 'E3'],
                   'Formula': ['C2/500', 'If B2 >10 then (B2*D3) + 100 else D3+10', 'E3/2 +20', 'D3+E3', 'var_i', 'var_x'],
                   'Dependent ID': ['C2', 'B2,D3', 'E3', 'D3,E3', '', '']})

def find_formula(formula: str, ids: str):
    # Replace all the IDs inside formula with the correct formula
    if ids == '':
        return formula
    ids = ids.split(',')
    for x in ids:
        # Look up the formula and the dependencies of the referenced ID
        sub_formula = df.loc[df['ID'] == x, 'Formula'].values[0]
        sub_id = df.loc[df['ID'] == x, 'Dependent ID'].values[0]
        formula = formula.replace(x, find_formula(sub_formula, sub_id))
    return formula

df['new_formula'] = df.apply(lambda x: find_formula(x['Formula'], x['Dependent ID']), axis=1)
Output:
   ID  Formula                       Dependent ID  new_formula
0  A1  C2/500                        C2            var_i+var_x/500
1  A3  If B2 >10 then (B2*D3) + ...  B2,D3         If var_x/2 +20 >10 then (var_x/2 +20*var_i) + ...
2  B2  E3/2 +20                      E3            var_x/2 +20
3  C2  D3+E3                         D3,E3         var_i+var_x
4  D3  var_i                                       var_i
5  E3  var_x                                       var_x
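One thing to watch in the output above: plain text substitution ignores operator precedence, so A1 becomes var_i+var_x/500 rather than (var_i+var_x)/500. Below is a minimal sketch of one way around that, assuming the formulas only reference IDs that exist in the ID column. The names formulas, id_pattern and expand are illustrative, not part of the original answer, and the sketch finds IDs with a regular expression instead of reading the Dependent ID column.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'ID': ['A1', 'A3', 'B2', 'C2', 'D3', 'E3'],
    'Formula': ['C2/500', 'If B2 >10 then (B2*D3) + 100 else D3+10',
                'E3/2 +20', 'D3+E3', 'var_i', 'var_x'],
})

# Hypothetical helper structures: ID -> formula lookup and a pattern
# that matches any known ID as a whole token (so 'A1' is not hit inside 'A10').
formulas = dict(zip(df['ID'], df['Formula']))
id_pattern = re.compile(r'\b(' + '|'.join(map(re.escape, formulas)) + r')\b')

def expand(formula, seen=()):
    """Recursively substitute IDs, wrapping compound formulas in parentheses."""
    def substitute(match):
        ref = match.group(1)
        if ref in seen:
            raise ValueError(f'circular reference involving {ref}')
        inner = expand(formulas[ref], seen + (ref,))
        # A single bare name needs no parentheses; anything else gets them.
        return inner if re.fullmatch(r'\w+', inner) else f'({inner})'
    return id_pattern.sub(substitute, formula)

df['new_formula'] = df['Formula'].map(expand)
print(df[['ID', 'new_formula']])
```

With this version A1 expands to (var_i+var_x)/500, and a formula that refers back to itself raises an error instead of recursing forever.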

Related

Create a new column using a condition from other two columns in a dataframe

I'm trying to create a column as shown in the code below: wherever a row has a zero in Qu1, the new column should hold a 'Sin Valor' value for that row.
import pandas as pd

data = pd.DataFrame({
    'Qu1': [12, 34, 0, 45, 0],
    'Qu2': ['A1', 'A2', "B0", 'C2', 'B0'],
    'Control': ['A1', 'A2', 'Sin Valor/ -' + "B0", 'C2', 'Sin Valor/ -' + "B0"]})
In Excel, what I am trying to do would look something like the attached picture.
I tried to write a function for this and apply it with a lambda, but it isn't working.
def fill_df(x):
    if data["Qu1"] == 0:
        return x.zfill('Sin Valor/ -')
    else:
        return ' '

data['Control'] = data.apply(fill_df)
Is it possible to do that? Any help is welcome. Thanks.
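Use np.where to accomplish it:
import numpy as np

df = data  # assumes df refers to the question's dataframe (named data there)
df['ctrl'] = np.where(df['Qu1'] == 0,
                      'Sin Valor/-' + df['Qu2'],
                      df['Qu2'])
df
I introduced a 'ctrl' column that meets your requirement and matches the desired 'Control' column, apart from the space after the slash: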
   Qu1  Qu2  Control         ctrl
0  12   A1   A1              A1
1  34   A2   A2              A2
2  0    B0   Sin Valor/ -B0  Sin Valor/-B0
3  45   C2   C2              C2
4  0    B0   Sin Valor/ -B0  Sin Valor/-B0
Alternatively, assign through a boolean mask with .loc:
mask = df['Qu1'] == 0
df.loc[mask, 'Control'] = 'Sin Valor/ -' + df.loc[mask, 'Qu2']
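For completeness, the question's apply attempt fails for three reasons: data["Qu1"] == 0 compares the whole column, so its truth value is ambiguous; apply without axis=1 passes columns rather than rows; and str.zfill pads with zeros up to an integer width, so it cannot prepend text. Here is a minimal row-wise sketch, assuming the non-zero rows should simply keep their Qu2 value as in the desired Control column. The function name fill_row is illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'Qu1': [12, 34, 0, 45, 0],
    'Qu2': ['A1', 'A2', 'B0', 'C2', 'B0'],
})

def fill_row(row):
    # `row` is a single row, so row['Qu1'] is a scalar that can be tested.
    if row['Qu1'] == 0:
        return 'Sin Valor/ -' + row['Qu2']
    return row['Qu2']

# axis=1 makes apply call fill_row once per row instead of once per column.
data['Control'] = data.apply(fill_row, axis=1)
print(data)
```

On large frames the np.where and mask versions are usually faster, because they avoid a Python-level call per row.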

How to remove outliers in a text dataframe?

I'm writing a program that reads a text file and sorts the data into name, job, company and location fields in the form of a pandas dataframe. The location field is the same for all of the rows except for one or two outliers. I want to remove these rows from the df and put them in a separate list.
Example:
    Name  Job  Company  Location
1.  n1    j1   c1       l
2.  n2    j2   c2       l
3.  n3    j3   c3       x
4.  n4    j4   c4       l
Is there a way to remove only the row with location 'x' (row 3)?
I would extract the two groups into separate DataFrames:
same_df = df.query('location == "<onethatisthesame>"')
Then I would repeat this using != to get the others:
other_df = df.query('location != "<onethatisthesame>"')
You can use:
import pandas as pd
# df = df[df['location'] == yourRepeatedValue]
df = pd.DataFrame(columns = ['location'] )
df.at[1, 'location'] = 'mars'
df.at[2, 'location'] = 'pluto'
df.at[3, 'location'] = 'mars'
print(df)
df = df[df['location'] == 'mars']
print(df)
This will create a new DataFrame that only contains the rows whose location equals yourRepeatedValue.
In the example, the new df won't contain rows whose location differs from 'mars'.
The output would be:
location
1 mars
2 pluto
3 mars
location
1 mars
3 mars
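If the repeated value is not known in advance, it can be looked up first. The sketch below assumes the most frequent Location is the "normal" one and treats every other row as an outlier. The sample frame mirrors the question's example, and the names common, is_outlier and outliers are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['n1', 'n2', 'n3', 'n4'],
    'Job': ['j1', 'j2', 'j3', 'j4'],
    'Company': ['c1', 'c2', 'c3', 'c4'],
    'Location': ['l', 'l', 'x', 'l'],
})

# mode() returns the most frequent value(s); take the first one.
common = df['Location'].mode().iloc[0]
is_outlier = df['Location'] != common

# Move the outlier rows into a separate list of dicts ...
outliers = df[is_outlier].to_dict('records')
# ... and keep only the rows that share the common location.
df = df[~is_outlier].reset_index(drop=True)

print(outliers)
print(df)
```

If two locations are equally frequent, mode() returns both and this sketch simply keeps the first, so a tie would need its own rule.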

Create a new column based on previous row value and delete the current row

I have an input dataframe that can be generated with the code below:
df = pd.DataFrame({'subjectID': [1, 1, 2, 2],
                   'keys': ['H1Date', 'H1', 'H2Date', 'H2'],
                   'Values': ['10/30/2006', 4, '8/21/2006', 6.4]})
The input dataframe looks as shown below.
This is what I did:
s1 = df.set_index('subjectID').stack().reset_index()
s1.rename(columns={0: 'values'}, inplace=True)
d1 = s1[s1['level_1'].str.contains('Date')]
d2 = s1[~s1['level_1'].str.contains('Date')]
d1['g'] = d1.groupby('subjectID').cumcount()
d2['g'] = d2.groupby('subjectID').cumcount()
d3 = pd.merge(d1,d2,on=["subjectID", 'g'],how='left').drop(['g','level_1_x','level_1_y'], axis=1)
Though it works, I'm afraid this may not be the best approach, as we might have more than 200 columns and 50k records. Any help improving my code would be appreciated.
I expect my output dataframe to look as shown below.
Maybe something like:
s = df.groupby(df['keys'].str.contains('Date').cumsum()).cumcount() + 1
final = (df.assign(s=s.astype(str))
           .set_index(['subjectID', 's'])
           .unstack()
           .sort_values(by='s', axis=1))
final.columns = final.columns.map(''.join)
print(final)
           keys1   Values1     keys2  Values2
subjectID
1          H1Date  10/30/2006  H1     4
2          H2Date  8/21/2006   H2     6.4

Two different excel file to match their rows having same name

Using Python pandas, I am trying to write a condition that compares two columns with the same name from two different Excel files; the columns hold different numerical values. Each column has 2000 rows to match.
The condition for the final value:
if File1(column1value) - File2(column1value) = 0, update the value with 1;
if File1(column1value) - File2(column1value) is less than or equal to 0.2, keep File1Column1Value;
if File1(column1value) - File2(column1value) is greater than 0.2, update the value with 0.
https://i.stack.imgur.com/Nx3WA.jpg
import pandas as pd

df1 = pd.read_excel('file_name1')  # get input from excel files
df2 = pd.read_excel('file_name2')
p1 = df1['p1'].values
p11 = df2['p11'].values
new_col = []  # we will store desired values here
for i in range(len(p1)):
    if p1[i] - p11[i] == 0:
        new_col.append(1)
    elif abs(p1[i] - p11[i]) > 0.2:
        new_col.append(0)
    else:
        new_col.append(p1[i])
df1['new_column'] = new_col  # we add new column with our values
You can also remove the old column with df.drop('column', axis=1), which returns a new DataFrame without it.
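The same three-way rule can be written without a Python loop. This is a minimal sketch using np.select, assuming both files list their rows in the same order and keeping the answer's column names p1 and p11. The small frames are made-up stand-ins for the pd.read_excel results.

```python
import numpy as np
import pandas as pd

# Stand-in data (illustrative values); in practice these come from pd.read_excel(...).
df1 = pd.DataFrame({'p1': [1.0, 2.0, 3.0, 4.0]})
df2 = pd.DataFrame({'p11': [1.0, 1.9, 3.5, 4.1]})

# Absolute difference, as in the loop version's elif branch.
diff = (df1['p1'] - df2['p11']).abs()

# Conditions are checked in order; rows matching none keep the File1 value.
df1['new_column'] = np.select(
    [diff == 0, diff > 0.2],
    [1, 0],
    default=df1['p1'],
)
print(df1)
```

Note that diff == 0 is an exact floating-point comparison, just like in the loop; if the Excel values carry rounding noise, np.isclose(diff, 0) may be the safer test.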

Reading excel and storing data with xlrd

I have this data in an Excel sheet:
FT_NAME   FC_NAME  C_NAME
FT_NAME1  FC1      C1
FT_NAME2  FC21     C21
          FC22     C22
FT_NAME3  FC31     C31
          FC32     C32
FT_NAME4  FC4      C4
where the column names are FT_NAME, FC_NAME, C_NAME,
and I want to store these values in a data structure for further use. Currently I am trying to store them in a list of lists, but could not do so with the following code:
i = 4
oc = sheet.cell(i, 8).value
fcl, ocl = [], []
while oc:
    ft = sheet.cell(i, 6).value
    fc = sheet.cell(i, 7).value
    oc = sheet.cell(i, 8).value
    if ft:
        self.foreign_tables.append(ft)
        fcl.append(fc)
        ocl.append(oc)
        self.foreign_col.append(fcl)
        self.own_col.append(ocl)
        fcl, ocl = [], []
    else:
        fcl.append(fc)
        ocl.append(oc)
    i += 1
I expect the output to be:
ft=[FT_NAME1,FT_NAME2,FT_NAME3,FT_NAME4]
fc=[FC1, [FC21,FC22],[FC31,FC32],FC4]
oc=[C1,[C21,C22],[C31,C32],C4]
Could anyone please help with a more Pythonic solution?
You can use pandas. It reads the data into a DataFrame, which is essentially a big dictionary of columns.
import pandas as pd

data = pd.read_excel('file.xlsx', 'Sheet1')
# Forward-fill the blank FT_NAME cells; fillna(method='pad') is deprecated in recent pandas
data = data.ffill()
print(data)
It gives the following output:
   FT_NAME   FC_NAME  C_NAME
0  FT_NAME1  FC1      C1
1  FT_NAME2  FC21     C21
2  FT_NAME2  FC22     C22
3  FT_NAME3  FC31     C31
4  FT_NAME3  FC32     C32
5  FT_NAME4  FC4      C4
To get the sublist structure try using this function:
def group(data):
    output = []
    # Unique table names, sorted
    names = list(set(data['FT_NAME'].values))
    names.sort()
    output.append(names)
    # All columns except the first (FT_NAME)
    headernames = list(data.columns)
    headernames.pop(0)
    for ci in list(headernames):
        column_group = []
        column_data = data[ci].values
        for name in names:
            # Collect this column's values for the rows belonging to `name`
            column_group.append(list(column_data[data['FT_NAME'].values == name]))
        output.append(column_group)
    return output
If you call it like this:
ft, fc, oc = group(data)
print(ft)
print(fc)
print(oc)
you get the following output:
['FT_NAME1', 'FT_NAME2', 'FT_NAME3', 'FT_NAME4']
[['FC1'], ['FC21', 'FC22'], ['FC31', 'FC32'], ['FC4']]
[['C1'], ['C21', 'C22'], ['C31', 'C32'], ['C4']]
which is what you want except for the single element now also being in a list.
It is not the cleanest method but it gets the job done.
Hope it helps.
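A shorter variant of the same grouping is possible with groupby. This is a minimal sketch, assuming the sheet has already been read into a DataFrame with blank FT_NAME cells coming through as missing values. The inline frame below is a stand-in for the pd.read_excel result.

```python
import pandas as pd

# Stand-in for the sheet contents; None marks the blank FT_NAME cells.
data = pd.DataFrame({
    'FT_NAME': ['FT_NAME1', 'FT_NAME2', None, 'FT_NAME3', None, 'FT_NAME4'],
    'FC_NAME': ['FC1', 'FC21', 'FC22', 'FC31', 'FC32', 'FC4'],
    'C_NAME': ['C1', 'C21', 'C22', 'C31', 'C32', 'C4'],
})

# Carry each table name down over the blank cells below it.
data['FT_NAME'] = data['FT_NAME'].ffill()

# One list per table name and column; sort=False keeps the sheet order.
grouped = data.groupby('FT_NAME', sort=False).agg(list)

ft = grouped.index.tolist()
fc = grouped['FC_NAME'].tolist()
oc = grouped['C_NAME'].tolist()
print(ft)
print(fc)
print(oc)
```

As with the group function, single elements come back wrapped in a list, which keeps the structure uniform for whatever consumes it.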
