Pivoting string into more columns using Pandas - python

My table looks like the following:
import pandas as pd
d = {'col1': ['a>b>c']}
df = pd.DataFrame(data=d)
print(df)
"""
col1
0 a>b>c
"""
and my desired output should look like this:
d1 = {'col1': ['a>b>c'],'col11': ['a'],'col12': ['b'],'col13': ['c']}
d1 = pd.DataFrame(data=d1)
print(d1)
"""
col1 col11 col12 col13
0 a>b>c a b c
"""
I know I have to use the .split('>') method, but I don't know how to go on from there. Any help?

You can simply split using str.split('>', expand=True) and merge the result back into the dataframe:
import pandas as pd
d = {'col1': ['a>b>c'],'col2':['a>b>c']}
df = pd.DataFrame(data=d)
print(df)
col = 'col1'
# temp = df[col].str.split('>', expand=True).add_prefix(col)  # would number the new columns from 0
temp = df[col].str.split('>', expand=True).rename(columns=lambda x: col + str(int(x) + 1))
temp.merge(df, left_index=True, right_index=True, how='outer')
Out:
col11 col12 col13 col1 col2
0 a b c a>b>c a>b>c
In case you want to do this for multiple columns, you can loop over them:
for col in df.columns:
    temp = df[col].str.split('>', expand=True).rename(columns=lambda x: col + str(int(x) + 1))
    df = temp.merge(df, left_index=True, right_index=True, how='outer')
Out:
col21 col22 col23 col11 col12 col13 col1 col2
0 a b c a b c a>b>c a>b>c

Using split:
d = {'col1': ['a>b>c']}
df = pd.DataFrame(data=d)
df = pd.concat([df, df.col1.str.split('>', expand=True)], axis=1)
df.columns = ['col1', 'col11', 'col12', 'col13']
df
Output:
col1 col11 col12 col13
0 a>b>c a b c
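If the number of parts is not known in advance, hard-coding the column names as above stops working. Here is a minimal sketch of a reusable helper, assuming the new columns should be named after the source column plus a 1-based position. The name split_into_columns is made up for illustration and is not a pandas function:

```python
import pandas as pd

def split_into_columns(df, col, sep='>'):
    """Return a copy of df with df[col] split on sep into extra columns."""
    parts = df[col].str.split(sep, expand=True)
    # Name the pieces col1, col2, ... relative to the source column name
    parts.columns = [f"{col}{i + 1}" for i in range(parts.shape[1])]
    return df.join(parts)

df = pd.DataFrame({'col1': ['a>b>c', 'x>y']})
out = split_into_columns(df, 'col1')
print(out)
```

Rows with fewer parts than the longest row get missing values in the trailing columns.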

Related

How to convert string date column to timestamp in a new column in Python Pandas

I have the following example dataframe:
d = {'col1': ["2022-05-16T12:31:00Z", "2021-01-11T11:32:00Z"]}
df = pd.DataFrame(data=d)
df
col1
0 2022-05-16T12:31:00Z
1 2021-01-11T11:32:00Z
I need a second column (say col2) that holds the corresponding timestamp value for each date string in col1.
How can I do that without using a for loop?
Maybe try this?
import pandas as pd
import numpy as np
d = {'col1': ["2022-05-16T12:31:00Z", "2021-01-11T11:32:00Z"]}
df = pd.DataFrame(data=d)
df['col2'] = pd.to_datetime(df['col1'])
df['col2'] = df.col2.values.astype(np.int64) // 10 ** 9
df
Let us try to_datetime
df['col2'] = pd.to_datetime(df['col1'])
df
Out[614]:
col1 col2
0 2022-05-16T12:31:00Z 2022-05-16 12:31:00+00:00
1 2021-01-11T11:32:00Z 2021-01-11 11:32:00+00:00
Update
st = pd.to_datetime('1970-01-01T00:00:00Z')
df['unix'] = (pd.to_datetime(df['col1'])- st).dt.total_seconds()
Out[632]:
0 1.652704e+09
1 1.610365e+09
Name: col1, dtype: float64
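If whole seconds as integers are preferred over floats, floor-dividing the timedelta by one second is another option. A small sketch, assuming every string in col1 is in UTC (trailing Z); the column name unix_int is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'col1': ["2022-05-16T12:31:00Z", "2021-01-11T11:32:00Z"]})

# Timezone-aware epoch, so it can be subtracted from the UTC datetimes
epoch = pd.Timestamp('1970-01-01', tz='UTC')

# Timedelta // Timedelta gives an integer number of whole seconds
df['unix_int'] = (pd.to_datetime(df['col1'], utc=True) - epoch) // pd.Timedelta(seconds=1)
print(df)
```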

df.iterrows() if condition not working on a dataframe?

I have a dataframe. If the string value in col1 contains ":", I am trying to split it, take the element after the ":" and put it into col2, like this:
df['col1'] = df['col1'].astype(str)
df['col2'] = df['col1'].astype(str)
for i, row in df.iterrows():
    if ":" in row['col1']:
        row['col2'] = row['col1'].split(":")[1] + " " + "in Person"
        row['col1'] = 'j'
It works on a sample dataframe like this, but it doesn't change the result in the original dataframe:
import pandas as pd
d = {'col1': ['a:b', 'ac'], 'col2': ['z 26', 'y 25']}
df = pd.DataFrame(data=d)
print(df)
col1 col2
0 j b in Person
1 ac y 25
What am I doing wrong, and what are the alternatives for this condition?
For the extracting part, try:
df['col2'] = df.col1.str.extract(r':(.+)', expand=False).add(' ').add(df.col2, fill_value='')
# Output
col1 col2
0 a:b b z 26
1 ac y 25
I'm not sure if I understand the replacing correctly, but here is a try:
df.loc[df.col1.str.contains(':'), 'col1'] = 'j'
# Output
col1 col2
0 j b z 26
1 ac y 25
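As for why the loop may not change the original dataframe: iterrows() can hand back a copy of each row rather than a view, so writing to row is not guaranteed to reach df. A rough sketch of the same logic without a loop, using a boolean mask with .loc and assuming the goal is exactly what the loop in the question does:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a:b', 'ac'], 'col2': ['z 26', 'y 25']})

# Rows whose col1 contains a literal ':'
mask = df['col1'].str.contains(':', regex=False)

# Fill col2 first, because it depends on the old col1 values
df.loc[mask, 'col2'] = df.loc[mask, 'col1'].str.split(':').str[1] + ' in Person'
df.loc[mask, 'col1'] = 'j'
print(df)
```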

pd.merge and check changed data

If I have these dataframes:
df1 = pd.DataFrame({'index': [1,2,3,4],
                    'col1': ['a','b','c','d'],
                    'col2': ['h','e','l','p']})
df2 = pd.DataFrame({'index': [1,2,3,4],
                    'col1': ['a','e','f','d'],
                    'col2': ['h','e','lp','p']})
df1
index col1 col2
0 1 a h
1 2 b e
2 3 c l
3 4 d p
df2
index col1 col2
0 1 a h
1 2 e e
2 3 f lp
3 4 d p
I want to merge them and see whether or not the rows are different and get an output like this
index col1 col1_validation col2 col2_validation
0 1 a True h True
1 2 b False e True
2 3 c False l False
3 4 d True p True
How can I achieve that?
It looks like col1 and col2 from your "merged" dataframe are just taken from df1. In that case, you can simply compare the col1, col2 between the original data frames and add those as columns:
cols = ["col1", "col2"]
val_cols = ["col1_validation", "col2_validation"]
# (optional) new dataframe, so you don't mutate df1
df = df1.copy()
new_cols = (df1[cols] == df2[cols])
df[val_cols] = new_cols
You can merge and compare the two data frames with something similar to the following:
df1 = pd.DataFrame({'index': [1,2,3,4],
                    'col1': ['a','b','c','d'],
                    'col2': ['h','e','l','p']})
df2 = pd.DataFrame({'index': [1,2,3,4],
                    'col1': ['a','e','f','d'],
                    'col2': ['h','e','lp','p']})
# give columns unique name when merging
df1.columns = df1.columns + '_df1'
df2.columns = df2.columns + '_df2'
# merge/combine data frames
combined = pd.concat([df1, df2], axis = 1)
# add calculated columns
combined['col1_validation'] = combined['col1_df1'] == combined['col1_df2']
combined['col2_validation'] = combined['col2_df1'] == combined['col2_df2']
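Both approaches above line the rows up by position. If the rows might be in a different order in the two frames, merging on the index column first is safer. A sketch under that assumption; the _df2 suffix and the final column order are my own choices to match the desired output:

```python
import pandas as pd

df1 = pd.DataFrame({'index': [1, 2, 3, 4],
                    'col1': ['a', 'b', 'c', 'd'],
                    'col2': ['h', 'e', 'l', 'p']})
df2 = pd.DataFrame({'index': [1, 2, 3, 4],
                    'col1': ['a', 'e', 'f', 'd'],
                    'col2': ['h', 'e', 'lp', 'p']})

# Align rows on 'index'; df1 columns keep their names, df2 columns get '_df2'
merged = df1.merge(df2, on='index', suffixes=('', '_df2'))

for col in ['col1', 'col2']:
    merged[col + '_validation'] = merged[col] == merged[col + '_df2']

result = merged[['index', 'col1', 'col1_validation', 'col2', 'col2_validation']]
print(result)
```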

Pandas dataframe how to replace single column with multiple

For example I have a dataframe like:
col1 col2 col3
0 2 1
and I want to replace each value using a mapping like this:
{0: [a,b], 1: [c,d], 2: [e, f]}
So I want to end up with a dataframe like this:
col1 col1b col2 col2b col3 col3b
a b e f c d
I want to feed this data into TensorFlow after transforming it, so the output below might also be acceptable, if TensorFlow would accept it:
col1 col2 col3
[a,b] [e,f] [c,d]
Below is my current code:
field_names = ["elo", "map", "c1", "c2", "c3", "c4", "c5", "e1", "e2", "e3", "e4", "e5", "result"]
df_train = pd.read_csv('input/match_results.csv', names=field_names, skiprows=1, usecols=range(2, 13))
for count in range(1, 6):
    str_count = str(count)
    df_train['c' + str_count] = df_train['c' + str_count].map(champ_dict)
IIUC, you can use .stack, .map, .explode and .cumcount to reshape your dataframe and index.
import pandas as pd
from string import ascii_lowercase
col_dict = dict(enumerate(ascii_lowercase))
map_dict = {0: ['a','b'], 1: ['c','d'], 2: ['e', 'f']}
s = df.stack().map(map_dict).explode().reset_index()
s['level_1'] = s['level_1'] + s.groupby(['level_1','level_0']).cumcount().map(col_dict)
df_new = s.set_index(['level_0','level_1']).unstack(1).droplevel(0,1).reset_index(drop=True)
print(df_new)
level_1 col1a col1b col2a col2b col3a col3b
0 a b e f c d
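If the stack/unstack chain is hard to follow, the same result can be built one column at a time. A sketch only, assuming every value in the frame is a key of map_dict and every list has the same length; the sample df is reconstructed from the question:

```python
import pandas as pd
from string import ascii_lowercase

df = pd.DataFrame({'col1': [0], 'col2': [2], 'col3': [1]})
map_dict = {0: ['a', 'b'], 1: ['c', 'd'], 2: ['e', 'f']}

parts = []
for col in df.columns:
    # Turn each mapped list into its own set of columns
    expanded = pd.DataFrame(df[col].map(map_dict).tolist(), index=df.index)
    # Name them col1a, col1b, ... after the source column
    expanded.columns = [col + ascii_lowercase[i] for i in expanded.columns]
    parts.append(expanded)

df_new = pd.concat(parts, axis=1)
print(df_new)
```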

Pasting selected columns into new column with separator in Pandas

I have the following DF:
import pandas as pd
df = pd.DataFrame({'col1' : ["a","b"],
'col2' : ["ab","XX"], 'col3' : ["w","e"], 'col4':["foo","bar"]})
Which looks like this:
In [8]: df
Out[8]:
col1 col2 col3 col4
0 a ab w foo
1 b XX e bar
What I want to do is combine col2, col3 and col4 into a new column called ID:
col1 col2 col3 col4 ID
0 a ab w foo ab.w.foo
1 b XX e bar XX.e.bar
How can I achieve that?
I tried this but failed:
df["ID"] = df.apply(lambda x: '.'.join(["col2","col3","col4"]),axis=1)
In [10]: df
Out[10]:
col1 col2 col3 col4 ID
0 a ab w foo col2.col3.col4
1 b XX e bar col2.col3.col4
Use x[['col2', 'col3', 'col4']]
In [54]: df.apply(lambda x: '.'.join(x[['col2', 'col3', 'col4']]),axis=1)
Out[54]:
0 ab.w.foo
1 XX.e.bar
dtype: object
There is a small mistake in your code: you should use the x that is passed into the lambda function to access those values:
In [29]: df["ID"] = df.apply(lambda x: '.'.join([x['col2'],x['col3'],x['col4']]),axis=1)
In [30]: df
Out[30]:
col1 col2 col3 col4 ID
0 a ab w foo ab.w.foo
1 b XX e bar XX.e.bar
A slightly simpler approach, which also runs faster:
df['id'] = df.col2 + '.' + df.col3 + '.' + df.col4
Illustrative timing with 10000 rows:
>>> t1 = timeit.timeit("df['id'] = df.col2 + '.' + df.col3 +'.' + df.col4", "from __main__ import pd,df", number=100)
Yields 0.00221121072769s per loop
>>> t2 = timeit.timeit("df.apply(lambda x: '.'.join(x[['col2', 'col3', 'col4']]), axis=1)","from __main__ import pd,df", number=100)
Yields 3.32903954983s per loop
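With more than a handful of columns, chaining + gets tedious. Series.str.cat is another vectorized option. A short sketch, assuming all selected columns hold strings with no missing values; the cols list is just for illustration:

```python
import pandas as pd

df = pd.DataFrame({'col1': ["a", "b"],
                   'col2': ["ab", "XX"], 'col3': ["w", "e"], 'col4': ["foo", "bar"]})

cols = ['col2', 'col3', 'col4']
# Concatenate the first column with the remaining ones, separated by '.'
df['ID'] = df[cols[0]].str.cat([df[c] for c in cols[1:]], sep='.')
print(df)
```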
