Pandas dataframe: how to replace a single column with multiple columns - python

For example I have a dataframe like:
col1 col2 col3
0 2 1
and I want to replace each value according to this mapping:
{0: [a,b], 1: [c,d], 2: [e, f]}
so that I end up with a dataframe like this:
col1 col1b col2 col2b col3 col3b
a b e f c d
I want to feed this data into TensorFlow after transforming it, so the output below might also be acceptable, if TensorFlow would accept it:
col1 col2 col3
[a,b] [e,f] [c,d]
Below is my current code:
import pandas as pd
field_names = ["elo", "map", "c1", "c2", "c3", "c4", "c5", "e1", "e2", "e3", "e4", "e5", "result"]
df_train = pd.read_csv('input/match_results.csv', names=field_names, skiprows=1, usecols=range(2, 13))
# champ_dict is the mapping dictionary, defined elsewhere
for count in range(1, 6):
    str_count = str(count)
    df_train['c' + str_count] = df_train['c' + str_count].map(champ_dict)

IIUC, you can use .stack, .map, .explode and .cumcount to reshape your dataframe and rebuild the column index.
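import pandas as pd
from string import ascii_lowercase
# Sample frame from the question
df = pd.DataFrame({'col1': [0], 'col2': [2], 'col3': [1]})
col_dict = dict(enumerate(ascii_lowercase))  # 0 -> 'a', 1 -> 'b', ...
map_dict = {0: ['a','b'], 1: ['c','d'], 2: ['e', 'f']}
# One row per (row label, column name, list element)
s = df.stack().map(map_dict).explode().reset_index()
# Append a letter to each column name: col1 -> col1a, col1b
s['level_1'] = s['level_1'] + s.groupby(['level_1','level_0']).cumcount().map(col_dict)
df_new = s.set_index(['level_0','level_1']).unstack(1).droplevel(0,1).reset_index(drop=True)
print(df_new)
Output:
level_1 col1a col1b col2a col2b col3a col3b
0 a b e f c d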
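A column-by-column alternative may be easier to follow than the stack/unstack chain. The sketch below is not part of the original answer. It assumes every code in the frame appears in map_dict and that every mapped list has exactly two elements. The names parts and expanded are illustrative only.

```python
import pandas as pd

# Sample frame and mapping from the question
df = pd.DataFrame({"col1": [0], "col2": [2], "col3": [1]})
map_dict = {0: ["a", "b"], 1: ["c", "d"], 2: ["e", "f"]}

parts = []
for col in df.columns:
    # Map each code to its list, then spread the list over two columns.
    expanded = pd.DataFrame(df[col].map(map_dict).tolist(), index=df.index)
    expanded.columns = [f"{col}{suffix}" for suffix in ("a", "b")]
    parts.append(expanded)

# Place the expanded pieces side by side, keeping the original column order.
df_new = pd.concat(parts, axis=1)
print(df_new)
```

Unlike the answer above, this keeps the columns in their original order instead of the sorted order that unstack produces.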

Related

How to convert unlabeled key-value pair dictionary data into 2 columns in Python

I have a dictionary of key-value pairs as input:
test = {'a':32, 'b':21, 'c':92}
I have to get the result in a new data frame as follows:
col1 col2
a 32
b 21
c 92
Try:
import pandas as pd
test = {"a": 32, "b": 21, "c": 92}
df = pd.DataFrame({"col1": list(test.keys()), "col2": list(test.values())})
print(df)
Prints:
col1 col2
0 a 32
1 b 21
2 c 92
Using unpacking:
df = pd.DataFrame([*test.items()], columns=['col1', 'col2'])
Using a Series:
df = pd.Series(test, name='col2').rename_axis('col1').reset_index()
output:
col1 col2
0 a 32
1 b 21
2 c 92
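There may be shorter or better ways of doing this, but here goes:
test = {'a':32, 'b':21, 'c':92}
df = pd.DataFrame(test, index=[0])
# stack() moves the keys into the index; reset_index(-1) turns them back into a column
df = df.stack().reset_index(-1).iloc[:, ::-1]
df.columns = ['col2', 'col1']
# Drop the repeated 0 index and keep the result
df = df.reset_index(drop=True)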

df.iterrows() if condition not working on a dataframe?

I have a dataframe. Where a value in col1 contains ":", I am trying to split the string, take the element after the colon, and put it into col2, like this:
df['col1'] = df['col1'].astype(str)
df['col2'] = df['col1'].astype(str)
for i, row in df.iterrows():
    if ":" in row['col1']:
        row['col2'] = row['col1'].split(":")[1] + " " + "in Person"
        row['col1'] = 'j'
It works on a sample dataframe like this, but it doesn't change the result in the original dataframe:
import pandas as pd
d = {'col1': ['a:b', 'ac'], 'col2': ['z 26', 'y 25']}
df = pd.DataFrame(data=d)
print(df)
col1 col2
j b in Person
ac y 25
What am I doing wrong, and what are the alternatives for this condition?
For the extracting part, try:
df['col2'] = df.col1.str.extract(r':(.+)', expand=False).add(' ').add(df.col2, fill_value='')
# Output
col1 col2
0 a:b b z 26
1 ac y 25
I'm not sure if I understand the replacing correctly, but here is a try:
df.loc[df.col1.str.contains(':'), 'col1'] = 'j'
# Output
col1 col2
0 j b z 26
1 ac y 25
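On the "what am I doing wrong" part: iterrows() yields each row as a separate Series, which is generally a copy, so assigning to row[...] is not guaranteed to write back to the dataframe. A minimal sketch, assuming the goal is the output shown in the question ("b in Person" in col2 and 'j' in col1), writes through df.loc with a boolean mask instead. The name has_colon is illustrative.

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({"col1": ["a:b", "ac"], "col2": ["z 26", "y 25"]})

# True for rows whose col1 contains a literal colon
has_colon = df["col1"].str.contains(":", regex=False)

# Write through df.loc so the original frame is modified.
# col2 must be set first, while col1 still holds the "a:b" value.
df.loc[has_colon, "col2"] = df.loc[has_colon, "col1"].str.split(":").str[1] + " in Person"
df.loc[has_colon, "col1"] = "j"
print(df)
```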

Filter multiple dataframes with criteria from list using loop

The code below creates multiple empty dataframes named from the report2 list. They are then populated with a filtered existing dataframe called dfsource.
With a nested for loop, I'd like to filter each of these dataframes using a list of values but the sub loop does not work as shown.
import pandas as pd
report=['A','B','C']
suffix='_US'
report2=[s + suffix for s in report]
print (report2) #result: ['A_US', 'B_US', 'C_US']
source = {'COL1': ['A','B','C'], 'COL2': ['D','E','F']}
dfsource=pd.DataFrame(source)
print(dfsource)
df_dict = {}
for i in report2:
    df_dict[i]=pd.DataFrame()
    for x in report:
        df_dict[i]=dfsource.query('COL1==x')
        #df_dict[i]=dfsource.query('COL1=="A"') #Example, this works filtering for value A but not what I need.
print(df_dict['A_US'])
print(df_dict['B_US'])
print(df_dict['C_US'])
You can reference a Python variable in a query by prefixing it with @:
df_dict[i]=dfsource.query('COL1==@x')
So the complete code looks like this:
import pandas as pd
report=['A','B','C']
suffix='_US'
report2=[s + suffix for s in report]
print (report2) #result: ['A_US', 'B_US', 'C_US']
source = {'COL1': ['A','B','C'], 'COL2': ['D','E','F']}
dfsource=pd.DataFrame(source)
print(dfsource)
df_dict = {}
for i in report2:
    df_dict[i]=pd.DataFrame()
    for x in report:
        df_dict[i]=dfsource.query('COL1==@x')
        #df_dict[i]=dfsource.query('COL1=="A"') #Example, this works filtering for value A but not what I need.
print(df_dict['A_US'])
print(df_dict['B_US'])
print(df_dict['C_US'])
which outputs
COL1 COL2
0 A D
1 B E
2 C F
COL1 COL2
2 C F
COL1 COL2
2 C F
COL1 COL2
2 C F
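Every entry ends up holding the row for 'C' because the inner loop overwrites df_dict[i] on each pass, so only the last value of x survives. I think you want a separate dictionary entry for each combination of i and x: move the creation of the dataframe into the second for loop and create a new key on each iteration.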
import pandas as pd
report=['A','B','C']
suffix='_US'
report2=[s + suffix for s in report]
print (report2) #result: ['A_US', 'B_US', 'C_US']
source = {'COL1': ['A','B','C'], 'COL2': ['D','E','F']}
dfsource=pd.DataFrame(source)
print(dfsource)
df_dict = {}
for i in report2:
    for x in report:
        new_key = x + i
        df_dict[new_key]=pd.DataFrame()
        df_dict[new_key]=dfsource.query('COL1==@x')
for item in df_dict.items():
    print(item)
This outputs 9 dataframes, each filtered on whichever x value was passed:
('AA_US', COL1 COL2
0 A D)
('BA_US', COL1 COL2
1 B E)
('CA_US', COL1 COL2
2 C F)
('AB_US', COL1 COL2
0 A D)
('BB_US', COL1 COL2
1 B E)
('CB_US', COL1 COL2
2 C F)
('AC_US', COL1 COL2
0 A D)
('BC_US', COL1 COL2
1 B E)
('CC_US', COL1 COL2
2 C F)
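If the intent was instead one dataframe per report, so that 'A_US' holds only the rows where COL1 is 'A', pairing the two lists avoids the nested loop entirely. This is a sketch under that assumption, not something the question confirms. The names key and value are illustrative.

```python
import pandas as pd

report = ["A", "B", "C"]
suffix = "_US"
report2 = [s + suffix for s in report]
dfsource = pd.DataFrame({"COL1": ["A", "B", "C"], "COL2": ["D", "E", "F"]})

df_dict = {}
# zip pairs 'A_US' with 'A', 'B_US' with 'B', and so on.
for key, value in zip(report2, report):
    # @value refers to the Python variable named value
    df_dict[key] = dfsource.query("COL1 == @value")

for key, frame in df_dict.items():
    print(key)
    print(frame)
```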

Summing Columns and replacing individual values if met specific condition

import pandas
d = {'col1': [25,20,30],
'col2': [25,20,30],
'col3': [25,20,30],
'col4': [25,39,11]
}
df = pandas.DataFrame(data=d)
How would I loop over this data frame and add col1 + col2 + col3 + col4 and, if the sum is not equal to 100, take the value at that index, compute col1/(col1+col2+col3+col4), and make that the new value for that spot? This way, when you sum col1 + col2 + col3 + col4, it will add up to 100 for that index.
So, for example, for index 0, col1 + col2 + col3 + col4 equals 100, so move on to the next index. For index 1 it adds up to 99, so take 20/99 and make it the new value at that position, and so on.
expected output:
d = {'col1': [25,20/99,30/101],
'col2': [25,20/99,30/101],
'col3': [25,20/99,30/101],
'col4': [25,39/99,11/101]
}
df = pandas.DataFrame(data=d)
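Here is a vectorized version:
import numpy as np
import pandas as pd
c = df.sum(1).ne(100)  # True for rows whose total is not 100
# Convert the mask to a NumPy column vector so it broadcasts across the columns
vals = np.where(c.to_numpy()[:, None], df.div(df.sum(1), axis=0), df)
new_df = pd.DataFrame(vals,index=df.index,columns=df.columns)
# for overwriting the original df , use: df[:] = vals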
print(new_df)
col1 col2 col3 col4
0 25.00000 25.00000 25.00000 25.000000
1 0.20202 0.20202 0.20202 0.393939
2 0.29703 0.29703 0.29703 0.108911
This achieves what you want by first collecting the values at index i from every column into a list of their own (here row iterates over the keys of d):
col = [d[row][i] for row in d]
The process you describe is then applied:
if sum(col) != 100:
    newcol = [n/sum(col) for n in col]
and then the values can be re-inserted. Final product:
for i in range(0, 3):
    col = [d[row][i] for row in d]
    if sum(col) != 100:
        newcol = [n/sum(col) for n in col]
    else:
        newcol = col.copy()
    for row in d:
        # the trailing digit of the key ('col1' -> 1) selects the position in newcol
        d[row][i] = newcol[int(row[-1:])-1]
I ended up using this method to resolve my question:
# note: this divides every row by its total, including rows that already sum to 100
for i in range(len(df)):
    x = (df.loc[i,'col1']+df.loc[i,'col2']+df.loc[i,'col3']+df.loc[i,'col4'])
    for j in range(0,4):
        df.iloc[i,j] = (df.iloc[i,j])/(x)
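The loop above rescales every row. A minimal sketch, assuming the goal is the expected output from the question (rows that already total 100 stay untouched), restricts the division to the other rows with a boolean mask. The names total and needs_fix are illustrative, and the frame is cast to float first so the fractional results fit the column dtype. Note that the rescaled rows then sum to 1 rather than 100.

```python
import pandas as pd

d = {"col1": [25, 20, 30],
     "col2": [25, 20, 30],
     "col3": [25, 20, 30],
     "col4": [25, 39, 11]}
# Cast to float up front because the results are fractions.
df = pd.DataFrame(data=d).astype(float)

total = df.sum(axis=1)        # row totals: 100, 99, 101
needs_fix = total.ne(100)     # True where the row does not sum to 100

# Divide only the flagged rows by their own totals.
df.loc[needs_fix] = df.loc[needs_fix].div(total[needs_fix], axis=0)
print(df)
```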

Pivoting string into more columns using Pandas

My table looks like the following:
import pandas as pd
d = {'col1': ['a>b>c']}
df = pd.DataFrame(data=d)
print(df)
"""
col1
0 a>b>c
"""
and my desired output needs to look like this:
d1 = {'col1': ['a>b>c'],'col11': ['a'],'col12': ['b'],'col13': ['c']}
d1 = pd.DataFrame(data=d1)
print(d1)
"""
col1 col11 col12 col13
0 a>b>c a b c
"""
I have to run the .split('>') method, but then I don't know how to go on. Any help?
You can simply split using str.split('>') and expand the result into a dataframe:
import pandas as pd
d = {'col1': ['a>b>c'],'col2':['a>b>c']}
df = pd.DataFrame(data=d)
print(df)
col='col1'
#temp = df[col].str.split('>',expand=True).add_prefix(col)
temp = df[col].str.split('>',expand=True).rename(columns=lambda x: col + str(int(x)+1))
temp.merge(df,left_index=True,right_index=True,how='outer')
Out:
col11 col12 col13 col1 col2
0 a b c a>b>c a>b>c
In case you want to do it on multiple columns, you can loop over them:
for col in df.columns:
    temp = df[col].str.split('>',expand=True).rename(columns=lambda x: col + str(int(x)+1))
    df = temp.merge(df,left_index=True,right_index=True,how='outer')
Out:
col21 col22 col23 col11 col12 col13 col1 col2
0 a b c a b c a>b>c a>b>c
Using split:
d = {'col1': ['a>b>c']}
df = pd.DataFrame(data=d)
df = pd.concat([df, df.col1.str.split('>', expand=True)], axis=1)
df.columns = ['col1', 'col11', 'col12', 'col13']
df
Output:
col1 col11 col12 col13
0 a>b>c a b c
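The answer above hard-codes the new column names. A small variant, sketched here on the assumption that the number of parts is not known in advance, derives the names from however many columns the split produces and attaches them with join. The names parts and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a>b>c"]})
col = "col1"

# expand=True returns one column per split part, labelled 0, 1, 2, ...
parts = df[col].str.split(">", expand=True)
# Build col11, col12, ... from the actual number of parts.
parts.columns = [f"{col}{i + 1}" for i in range(parts.shape[1])]

# join aligns on the index and keeps the original column first.
result = df.join(parts)
print(result)
```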
