import pandas as pd

d = {'col1': [25, 20, 30],
     'col2': [25, 20, 30],
     'col3': [25, 20, 30],
     'col4': [25, 39, 11]}
df = pd.DataFrame(data=d)
How would I loop over this dataframe, add col1 + col2 + col3 + col4 for each row, and, if the sum is not equal to 100, replace each value in that row with value / (col1 + col2 + col3 + col4)? This way every such row ends up normalized by its own total.
So, for example, at index 0, col1 + col2 + col3 + col4 equals 100, so move on to the next index; at index 1 it adds up to 99, so 20 becomes 20/99 in that position, and so on.
expected output:
d = {'col1': [25, 20/99, 30/101],
     'col2': [25, 20/99, 30/101],
     'col3': [25, 20/99, 30/101],
     'col4': [25, 39/99, 11/101]}
df = pd.DataFrame(data=d)
here is a vectorized version:
import numpy as np
c = df.sum(axis=1).ne(100)  # rows whose total is not 100
vals = np.where(c.to_numpy()[:, None], df.div(df.sum(axis=1), axis=0), df)
new_df = pd.DataFrame(vals, index=df.index, columns=df.columns)
# for overwriting the original df, use: df[:] = vals
print(new_df)
col1 col2 col3 col4
0 25.00000 25.00000 25.00000 25.000000
1 0.20202 0.20202 0.20202 0.393939
2 0.29703 0.29703 0.29703 0.108911
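A pandas-only variant of the same idea rescales just the offending rows in place, with no NumPy round-trip; a sketch (the frame is built as float so the fractions fit):

```python
import pandas as pd

df = pd.DataFrame({'col1': [25, 20, 30], 'col2': [25, 20, 30],
                   'col3': [25, 20, 30], 'col4': [25, 39, 11]}, dtype=float)

s = df.sum(axis=1)    # row totals: 100, 99, 101
mask = s.ne(100)      # rows that need rescaling
# divide only the masked rows by their own total
df.loc[mask] = df.loc[mask].div(s[mask], axis=0)
print(df)
```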
This achieves what you want by first generating each column as a list of its own:
col = [d[row][i] for row in d]
The process you describe is then applied:
if sum(col) != 100:
    newcol = [n/sum(col) for n in col]
and then the column can be re-inserted. Final product:
for i in range(0, 3):
    col = [d[row][i] for row in d]
    if sum(col) != 100:
        newcol = [n/sum(col) for n in col]
    else:
        newcol = col.copy()
    for row in d:
        d[row][i] = newcol[int(row[-1:]) - 1]
I ended up using this method to resolve my question:
for i in range(len(df)):
    x = df.loc[i, 'col1'] + df.loc[i, 'col2'] + df.loc[i, 'col3'] + df.loc[i, 'col4']
    if x != 100:  # leave rows that already total 100 untouched
        for j in range(0, 4):
            df.iloc[i, j] = df.iloc[i, j] / x
I have a dataframe, and if a col1 string value contains ":" I am trying to split it, take the part after the colon, and put it into col2 like this:
df['col1'] = df['col1'].astype(str)
df['col2'] = df['col1'].astype(str)
for i, row in df.iterrows():
    if ":" in row['col1']:
        row['col2'] = row['col1'].split(":")[1] + " " + "in Person"
        row['col1'] = 'j'
It appears to work on a sample dataframe like this, but it doesn't change the result in the original dataframe:
import pandas as pd
d = {'col1': ['a:b', 'ac'], 'col2': ['z 26', 'y 25']}
df = pd.DataFrame(data=d)
print(df)
col1 col2
j b in Person
ac y 25
What am I doing wrong, and what are the alternatives for this condition?
For the extracting part, try:
df['col2'] = df.col1.str.extract(r':(.+)', expand=False).add(' ').add(df.col2, fill_value='')
# Output
col1 col2
0 a:b b z 26
1 ac y 25
I'm not sure if I understand the replacing correctly, but here is a try:
df.loc[df.col1.str.contains(':'), 'col1'] = 'j'
# Output
col1 col2
0 j b z 26
1 ac y 25
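As for why the original loop changed nothing: iterrows yields copies, so assignments to row never reach the dataframe; writing back through .at does. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a:b', 'ac'], 'col2': ['z 26', 'y 25']})
for i, row in df.iterrows():
    if ":" in row['col1']:
        # write back through the dataframe, not the row copy
        df.at[i, 'col2'] = row['col1'].split(":")[1] + " " + "in Person"
        df.at[i, 'col1'] = 'j'
print(df)
```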
The code below creates multiple empty dataframes named after the report2 list. They are then populated with a filtered version of an existing dataframe called dfsource.
With a nested for loop, I'd like to filter each of these dataframes using a list of values, but the inner loop does not work as shown.
import pandas as pd
report=['A','B','C']
suffix='_US'
report2=[s + suffix for s in report]
print (report2) #result: ['A_US', 'B_US', 'C_US']
source = {'COL1': ['A','B','C'], 'COL2': ['D','E','F']}
dfsource=pd.DataFrame(source)
print(dfsource)
df_dict = {}
for i in report2:
    df_dict[i] = pd.DataFrame()
    for x in report:
        df_dict[i] = dfsource.query('COL1==x')
        #df_dict[i]=dfsource.query('COL1=="A"') #Example, this works filtering for value A but not what I need.
print(df_dict['A_US'])
print(df_dict['B_US'])
print(df_dict['C_US'])
You can reference a Python variable in a query by prefixing it with @:
df_dict[i] = dfsource.query('COL1 == @x')
So the total code looks like this
import pandas as pd
report=['A','B','C']
suffix='_US'
report2=[s + suffix for s in report]
print (report2) #result: ['A_US', 'B_US', 'C_US']
source = {'COL1': ['A','B','C'], 'COL2': ['D','E','F']}
dfsource=pd.DataFrame(source)
print(dfsource)
df_dict = {}
for i in report2:
    df_dict[i] = pd.DataFrame()
    for x in report:
        df_dict[i] = dfsource.query('COL1 == @x')
        #df_dict[i]=dfsource.query('COL1=="A"') #Example, this works filtering for value A but not what I need.
print(df_dict['A_US'])
print(df_dict['B_US'])
print(df_dict['C_US'])
which outputs
COL1 COL2
0 A D
1 B E
2 C F
COL1 COL2
2 C F
COL1 COL2
2 C F
COL1 COL2
2 C F
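If the intent was one filtered frame per report name, keyed by its _US alias, the nested loop can collapse to a dict comprehension over zipped lists, using plain boolean indexing instead of query; a sketch:

```python
import pandas as pd

report = ['A', 'B', 'C']
report2 = [s + '_US' for s in report]
dfsource = pd.DataFrame({'COL1': ['A', 'B', 'C'], 'COL2': ['D', 'E', 'F']})

# pair each _US key with its own filter value
df_dict = {key: dfsource[dfsource['COL1'] == x] for key, x in zip(report2, report)}
print(df_dict['B_US'])
```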
However, I think you want a new dictionary entry for each combination of i and x; you can move the creation of each dataframe into the inner for loop and create a new key on every iteration.
import pandas as pd
report=['A','B','C']
suffix='_US'
report2=[s + suffix for s in report]
print (report2) #result: ['A_US', 'B_US', 'C_US']
source = {'COL1': ['A','B','C'], 'COL2': ['D','E','F']}
dfsource=pd.DataFrame(source)
print(dfsource)
df_dict = {}
for i in report2:
    for x in report:
        new_key = x + i
        df_dict[new_key] = dfsource.query('COL1 == @x')
for item in df_dict.items():
    print(item)
This outputs 9 dataframes, each filtered on whatever x value was passed:
('AA_US', COL1 COL2
0 A D)
('BA_US', COL1 COL2
1 B E)
('CA_US', COL1 COL2
2 C F)
('AB_US', COL1 COL2
0 A D)
('BB_US', COL1 COL2
1 B E)
('CB_US', COL1 COL2
2 C F)
('AC_US', COL1 COL2
0 A D)
('BC_US', COL1 COL2
1 B E)
('CC_US', COL1 COL2
2 C F)
Let's say I have a list as follows:
l1 = ['SAP_QGF_126151_HFM_1_MOB_T_GFG_XZY_S7_L001_R1_001_MM_1.gz',
'SAP_QGF_126151_HFM_1_MOB_T_GFG_XZY_S7_L001_R2_001_MM_1.gz',
'SAP_QGF_126151_HFM_2_MOB_T_GFG_XZY_S7_L002_R1_001_MM_1.gz',
'SAP_QGF_126151_HFM_2_MOB_T_GFG_XZY_S7_L002_R2_001_MM_1.gz']
And I want to convert the above list into a dataframe with four columns.
First I want to split each entry on _ and use the field at index 5 as the first column and the field at index 4 as the second column, and then place the whole strings of the first and second elements of the list into the third and fourth columns based on an if condition.
And I tried to generate them form lists,
col1 = [x.split('_')[5] for x in l1]
col2 = [x.split('_')[4] for x in l1]
col3 = [x.split('_')[10] for x in l1 if x == "L001"]
col4 = [x.split('_')[10] for x in l1 if x == "L002"]
However, col3 and col4 come back empty with that if condition.
I then tried to build the dataframe from all the lists with the following one-liner:
pd.DataFrame( {'col1': col1, 'col2': col2, 'col3': col3, 'col4':col4 })
In the end, I aim to have a dataframe like this. My desired output:
col1 col2 col3 col4
MOB 1 SAP_QGF_126151_HFM_1_MOB_T_GFG_XZY_S7_L001_R1_001_MM_1.gz SAP_QGF_126151_HFM_1_MOB_T_GFG_XZY_S7_L001_R2_001_MM_1.gz
MOB 2 SAP_QGF_126151_HFM_1_MOB_T_GFG_XZY_S7_L002_R1_001_MM_1.gz SAP_QGF_126151_HFM_1_MOB_T_GFG_XZY_S7_L002_R2_001_MM_1.gz
So I need the first element of l1, as it is, in col3 and the second element in col4 of the first row; likewise, the third element goes in col3 and the fourth element in col4 of the second row.
Any suggestions or pointers are appreciated.
The if condition compares x, the whole filename, against "L001", so it never matches; test the split token instead. To line up with the desired output (one row per lane, with the R1 file in col3 and the R2 file in col4), filter on the read token at index 11 and derive col1 and col2 from the col3 entries:
col3 = [x for x in l1 if x.split('_')[11] == "R1"]
col4 = [x for x in l1 if x.split('_')[11] == "R2"]
col1 = [x.split('_')[5] for x in col3]
col2 = [x.split('_')[4] for x in col3]
pd.DataFrame({'col1': col1, 'col2': col2, 'col3': col3, 'col4': col4})
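Since the list alternates R1/R2 within each lane, an equivalent sketch (assuming that ordering holds) simply pairs the even- and odd-indexed entries:

```python
import pandas as pd

l1 = ['SAP_QGF_126151_HFM_1_MOB_T_GFG_XZY_S7_L001_R1_001_MM_1.gz',
      'SAP_QGF_126151_HFM_1_MOB_T_GFG_XZY_S7_L001_R2_001_MM_1.gz',
      'SAP_QGF_126151_HFM_2_MOB_T_GFG_XZY_S7_L002_R1_001_MM_1.gz',
      'SAP_QGF_126151_HFM_2_MOB_T_GFG_XZY_S7_L002_R2_001_MM_1.gz']

r1, r2 = l1[0::2], l1[1::2]  # R1 files, R2 files
df = pd.DataFrame({'col1': [x.split('_')[5] for x in r1],
                   'col2': [x.split('_')[4] for x in r1],
                   'col3': r1,
                   'col4': r2})
print(df)
```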
I have the following dataframes:
df1 = pd.DataFrame({'col1': ['A','M','C'],
'col2': ['B','N','O'],
# plus many more
})
df2 = pd.DataFrame({'col3': ['A','A','A','B','B','B'],
'col4': ['M','P','Q','J','P','M'],
# plus many more
})
Which look like these:
df1:
col1 col2
A B
M N
C O
#...plus many more
df2:
col3 col4
A M
A P
A Q
B J
B P
B M
#...plus many more
The objective is to create a dataframe containing all elements of col4 shared by the col3 values that occur together in one row of df1. For example, look at row 1 of df1: A is in col1 and B is in col2. Then we go to df2 and check col4 for df2[df2['col3'] == 'A'] and df2[df2['col3'] == 'B']. We get ['M','P','Q'] for A and ['J','P','M'] for B. The intersection of these is ['M', 'P'], so what I want is something like this:
col1 col2 col4
A B M
A B P
....(and so on for the other rows)
The naive way to go about this is to iterate over rows and then get the intersection, but I was wondering if it's possible to solve this via merging techniques or other faster methods. So far, I can't think of any way how.
This should achieve what you want, using a combination of merge, groupby and set intersection:
# Getting tuple of all col1=col3 values in col4
df3 = pd.merge(df1, df2, left_on='col1', right_on='col3')
df3 = df3.groupby(['col1', 'col2'])['col4'].apply(tuple)
df3 = df3.reset_index()
# Getting tuple of all col2=col3 values in col4
df3 = pd.merge(df3, df2, left_on='col2', right_on='col3')
df3 = df3.groupby(['col1', 'col2', 'col4_x'])['col4_y'].apply(tuple)
df3 = df3.reset_index()
# Taking set intersection of our two tuples
df3['col4'] = df3.apply(lambda row: set(row['col4_x']) & set(row['col4_y']), axis=1)
# Dropping unnecessary columns
df3 = df3.drop(['col4_x', 'col4_y'], axis=1)
print(df3)
col1 col2 col4
0 A B {P, M}
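An alternative sketch without the tuple/groupby round-trip: merge df2 onto each key column separately, then inner-merge the two results so only col4 values present on both sides survive (this assumes df2 has no duplicate (col3, col4) pairs; add drop_duplicates if it does):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['A', 'M', 'C'], 'col2': ['B', 'N', 'O']})
df2 = pd.DataFrame({'col3': ['A', 'A', 'A', 'B', 'B', 'B'],
                    'col4': ['M', 'P', 'Q', 'J', 'P', 'M']})

m1 = df1.merge(df2, left_on='col1', right_on='col3')  # col4 values for col1
m2 = df1.merge(df2, left_on='col2', right_on='col3')  # col4 values for col2
# inner merge keeps only the col4 values shared by both sides
out = m1.merge(m2, on=['col1', 'col2', 'col4'])[['col1', 'col2', 'col4']]
print(out)
```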
If required, see this answer for examples of how to 'melt' col4.
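For that 'melt' step, DataFrame.explode turns each collection in col4 into its own row; a sketch with hand-built data and the sets stored as lists:

```python
import pandas as pd

# shaped like the df3 result above: one row per pair, col4 holding a collection
df3 = pd.DataFrame({'col1': ['A'], 'col2': ['B'], 'col4': [['M', 'P']]})
out = df3.explode('col4', ignore_index=True)  # one row per col4 element
print(out)
```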
This question is an extension of Pandas conditional creation of a series/dataframe column.
If we had this dataframe:
Col1 Col2
1 A Z
2 B Z
3 B X
4 C Y
5 C W
and we wanted to do the equivalent of:
if Col2 in ('Z','X') then Col3 = 'J'
else if Col2 = 'Y' then Col3 = 'K'
else Col3 = {value of Col1}
How could I do that?
You can use loc with isin, then fillna at the end (this assumes Col3 does not already exist, so the unmatched rows are still NaN when fillna runs):
df.loc[df.Col2.isin(['Z','X']), 'Col3'] = 'J'
df.loc[df.Col2 == 'Y', 'Col3'] = 'K'
df['Col3'] = df.Col3.fillna(df.Col1)
print (df)
Col1 Col2 Col3
1 A Z J
2 B Z J
3 B X J
4 C Y K
5 C W C
Try np.where, whose pattern is outcome = np.where(condition, value_if_true, value_if_false):
df["Col3"] = np.where(df['Col2'].isin(['Z','X']), "J", np.where(df['Col2'].isin(['Y']), 'K', df['Col1']))
Col1 Col2 Col3
1 A Z J
2 B Z J
3 B X J
4 C Y K
5 C W C
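If more branches get added, np.select scales the same idea past two conditions; a sketch over the sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': list('ABBCC'), 'Col2': list('ZZXYW')})

# conditions are checked in order; the first match wins, like an if/elif chain
conditions = [df['Col2'].isin(['Z', 'X']), df['Col2'].eq('Y')]
choices = ['J', 'K']
df['Col3'] = np.select(conditions, choices, default=df['Col1'])
print(df)
```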
A simple (but likely inefficient) way can be useful when you have multiple if conditions, like putting values into (say) four buckets based on quartiles.
df holds your data, col1 has the values, col2 will receive the bucket labels (1, 2, 3, 4),
and quart has the 25%, 50% and 75% bounds.
try this:
create a dummy list: dummy = []
iterate through the dataframe with: for index, row in df.iterrows():
set up the if conditions like: if row['col1'] <= quart[0]: # 25%
append the proper value to dummy under the if: dummy.append(1)
a nested if-elif chain can take care of the remaining buckets, appending the matching value to dummy each time.
add dummy as a column: df['col2'] = dummy
You can find the quartiles via A = df.describe() and then print(A['col1'])
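A sketch of those steps with made-up data; the values and column names are placeholders:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 5, 10, 20, 50, 100]})
quart = df['col1'].quantile([0.25, 0.5, 0.75]).tolist()  # 25%/50%/75% bounds

dummy = []
for index, row in df.iterrows():
    if row['col1'] <= quart[0]:    # bottom quartile
        dummy.append(1)
    elif row['col1'] <= quart[1]:
        dummy.append(2)
    elif row['col1'] <= quart[2]:
        dummy.append(3)
    else:                          # top quartile
        dummy.append(4)
df['col2'] = dummy
print(df)
```

If the loop itself is not a requirement, pd.qcut(df['col1'], 4, labels=[1, 2, 3, 4]) performs the same quartile bucketing in one line.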