transforming multiple columns in data frame at once - python

I have some data that I'm trying to clean up. That involves modifying some columns, combining other columns into new ones, etc. I am wondering whether there is a succinct way to do this in pandas or whether each operation needs to be a separate line of code. Here is an example:
ex_df = pd.DataFrame(data = {"a": [1,2,3,4], "b": ["a-b", "c-d", "e-f", "g-h"]})
Say I want to create a new column called c which will be the first letter in each row of b, transform b by removing the "-", and create another column called d which will be the first letter of b concatenated with the entry in a in the same row. Right now I would have to do something like this:
ex_df["b"] = ex_df["b"].map(lambda x: "".join(x.split(sep="-")))
ex_df["c"] = ex_df["b"].map(lambda x: x[0])
ex_df["d"] = ex_df.apply(func=lambda s: s["c"] + str(s["a"]), axis=1)
ex_df
#    a   b  c   d
# 0  1  ab  a  a1
# 1  2  cd  c  c2
# 2  3  ef  e  e3
# 3  4  gh  g  g4
Coming from an R data.table background (which would combine all these operations into a single statement), I'm wondering how things are done in pandas.

You can use:
In [12]: ex_df.assign(
...: b=ex_df.b.str.replace('-', ''),
...: c=ex_df.b.str[0],
...: d=ex_df.b.str[0] + ex_df.a.astype(str)
...: )
Out[12]:
a b c d
0 1 ab a a1
1 2 cd c c2
2 3 ef e e3
3 4 gh g g4
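A caveat worth noting: with plain Series arguments as above, every expression is evaluated against the original ex_df, which only works here because stripping the "-" does not change b's first character. If you want later columns to build on earlier ones within the same statement, assign also accepts callables that receive the intermediate result; a hedged sketch:

ex_df.assign(
    b=lambda d: d.b.str.replace('-', '', regex=False),
    c=lambda d: d.b.str[0],             # sees the cleaned-up b
    d=lambda d: d.c + d.a.astype(str),  # sees the newly created c
)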

This is one approach.
Demo:
import pandas as pd
ex_df = pd.DataFrame(data = {"a": [1,2,3,4], "b": ["a-b", "c-d", "e-f", "g-h"]})
ex_df["c"] = ex_df["b"].str[0]
ex_df["b"] = ex_df["b"].str.replace("-", "")
ex_df["d"] = ex_df.apply(lambda s: s["c"] + str(s["a"])), axis=1)
print(ex_df)
Output:
a b c d
0 1 ab a a1
1 2 cd c c2
2 3 ef e e3
3 4 gh g g4
You can use the built-in str methods to produce the required output.


How to shift a dataframe element-wise to fill NaNs?

I have a DataFrame like this:
>>> df = pd.DataFrame({'a': list('ABCD'), 'b': ['E',np.nan,np.nan,'F']})
a b
0 A E
1 B NaN
2 C NaN
3 D F
I am trying to fill NaN with values of the previous column in the next row and dropping this second row. In other words, I want to combine the two rows with NaNs to form a single row without NaNs like this:
a b
0 A E
1 B C
2 D F
I have tried various flavors of df.fillna(method="<bfill/ffill>") but this didn't give me the expected output.
I haven't found any other question quite like this; here's one that comes close. The DataFrame is actually built from a list of DataFrames via .concat(), as you may notice from the indexes; I mention this because it may be easier to handle there in a single step rather than across multiple rows afterwards.
I have found suggestions to use shift and combine_first, but none of them worked for me. You may try those too.
I also found a whole article about filling NaN values, but it doesn't cover a problem like mine.
OK, I misunderstood what you wanted to do the first time; the dummy example was a bit ambiguous.
Here is another:
>>> df = pd.DataFrame({'a': list('ABCD'), 'b': ['E',np.nan,np.nan,'F']})
a b
0 A E
1 B NaN
2 C NaN
3 D F
To my knowledge, this operation does not exist in pandas, so we will use numpy to do the work.
First transform the DataFrame to a numpy array and flatten it to one dimension. Then drop the NaNs using pandas.isna, which works on a wider range of types than numpy.isnan, and reshape the array back to its original number of columns before converting it back to a DataFrame:
array = df.to_numpy().flatten()
pd.DataFrame(array[~pd.isna(array)].reshape(-1,df.shape[1]), columns=df.columns)
output:
a b
0 A E
1 B C
2 D F
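For convenience, the same sketch as a self-contained script (the sample frame is copied from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': list('ABCD'), 'b': ['E', np.nan, np.nan, 'F']})

# flatten row-wise, drop the NaNs, then restore the original column count
array = df.to_numpy().flatten()
result = pd.DataFrame(array[~pd.isna(array)].reshape(-1, df.shape[1]),
                      columns=df.columns)
print(result)
#    a  b
# 0  A  E
# 1  B  C
# 2  D  F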
It also works for more complex examples, as long as the NaN pattern is consistent across the columns that contain NaNs:
In:
a b c d
0 A H A2 H2
1 B NaN B2 NaN
2 C NaN C2 NaN
3 D I D2 I2
4 E NaN E2 NaN
5 F NaN F2 NaN
6 G J G2 J2
Out:
a b c d
0 A H A2 H2
1 B B2 C C2
2 D I D2 I2
3 E E2 F F2
4 G J G2 J2
In:
a b c
0 A F H
1 B NaN NaN
2 C NaN NaN
3 D NaN NaN
4 E G I
Out:
a b c
0 A F H
1 B C D
2 E G I
In case the NaN columns do not share the same pattern, such as:
a b c d
0 A H A2 NaN
1 B NaN B2 NaN
2 C NaN C2 H2
3 D I D2 I2
4 E NaN E2 NaN
5 F NaN F2 NaN
6 G J G2 J2
You can apply the operation per group of two columns:
def elementwise_shift(df):
    array = df.to_numpy().flatten()
    return pd.DataFrame(array[~pd.isna(array)].reshape(-1, df.shape[1]), columns=df.columns)

# group the columns in consecutive pairs and apply the shift per pair
# (note: groupby(..., axis=1) is deprecated in pandas 2.x)
(df.groupby(np.repeat(np.arange(df.shape[1] // 2), 2), axis=1)
   .apply(elementwise_shift)
)
output:
a b c d
0 A H A2 B2
1 B C C2 H2
2 D I D2 I2
3 E F E2 F2
4 G J G2 J2
You can do this in two steps with a placeholder column. First you fill all the NaNs in column b with the a values from the next row. Then you apply the filtering. In this example I use ffill with a limit of 1 to filter out all NaN values after the first; there's probably a better method.
import pandas as pd
import numpy as np
df=pd.DataFrame({"a":[1,2,3,3,4],"b":[1,2,np.nan,np.nan,4]})
# Fill all nans:
df['new_b'] = df['b'].fillna(df['a'].shift(-1))
df = df[df['b'].ffill(limit=1).notna()].copy()  # .copy() to avoid SettingWithCopyWarning on the filtered frame
df = df.drop('b', axis=1).rename(columns={'new_b': 'b'})
print(df)
# output:
#    a    b
# 0  1  1.0
# 1  2  2.0
# 2  3  3.0
# 4  4  4.0

efficient solution for reshaping the dataframe in pandas

I have a dataframe like
id col1 col2 col3 ......col25
1 a b c d ...........
2 d e f NA ........
3 a NA NA NA .......
What I want is:
id start end
1 a b
1 b c
1 c d
2 d e
2 e f
for names, row in data_final.iterrows():
    for i in range(0, 26):
        try:
            x = pd.Series([row["id"], row[i], row[i+1]], index=['id', 'start', 'end'])
            df1 = df1.append(x, ignore_index=True)
        except:
            break
This works, but it is definitely not the best solution, as its time complexity is too high.
I need a more efficient solution for this.
One way could be to stack to remove missing values, then groupby and zip to pair each element with the one succeeding it. Then we just need to flatten the result with itertools.chain and create a DataFrame (note this assumes id is the index; with an id column you would first call df.set_index('id')):
from itertools import chain
l = [list(zip(v.values[:-1], v.values[1:])) for _,v in df.stack().groupby(level=0)]
pd.DataFrame(chain.from_iterable(l), columns=['start', 'end'])
start end
0 a b
1 b c
2 c d
3 d e
4 e f
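For reference, a self-contained version of the above; the sample frame is hypothetical and assumes id has been set as the index:

from itertools import chain
import numpy as np
import pandas as pd

# hypothetical data mirroring the question, with 'id' as the index
df = pd.DataFrame({'col1': ['a', 'd', 'a'],
                   'col2': ['b', 'e', np.nan],
                   'col3': ['c', 'f', np.nan],
                   'col4': ['d', np.nan, np.nan]},
                  index=pd.Index([1, 2, 3], name='id'))

# stack() drops the NaNs; zip pairs each value with its successor per row
l = [list(zip(v.values[:-1], v.values[1:])) for _, v in df.stack().groupby(level=0)]
print(pd.DataFrame(chain.from_iterable(l), columns=['start', 'end']))
#   start end
# 0     a   b
# 1     b   c
# 2     c   d
# 3     d   e
# 4     e   f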

Trouble with ignore_index and concat()

I'm new to Python. I have 2 dataframes each with a single column. I want to join them together and keep the values based on their respective positions in each of the tables.
My code looks something like this:
huh = pd.DataFrame(columns=['result'], data=['a','b','c','d'])
huh2 = pd.DataFrame(columns=['result2'], data=['aa','bb','cc','dd'])
huh2 = huh2.sort_values('result2', ascending=False)
tmp = pd.concat([huh,huh2], ignore_index=True, axis=1)
tmp
From the documentation it looks like the ignore_index flag and axis=1 should be sufficient to achieve this but the results obviously disagree.
Current Output:
0 1
0 a aa
1 b bb
2 c cc
3 d dd
Desired Output:
result result2
0 a dd
1 b cc
2 c bb
3 d aa
If you concatenate the DataFrames horizontally, then the column names are ignored. If you concatenate vertically, the indexes are ignored. You can only ignore one or the other, not both.
In your case, I would recommend setting the index of "huh2" to be the same as that of "huh".
pd.concat([huh, huh2.set_index(huh.index)], axis=1)
result result2
0 a dd
1 b cc
2 c bb
3 d aa
If you aren't dealing with custom indices, reset_index will suffice.
pd.concat([huh, huh2.reset_index(drop=True)], axis=1)
result result2
0 a dd
1 b cc
2 c bb
3 d aa
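As a side note (not from the original answers): if the join is purely positional, you can sidestep index alignment entirely by extracting the underlying array, since .to_numpy() strips the index:

# positional assignment: values are matched by position, not by index
huh['result2'] = huh2['result2'].to_numpy()
huh
  result result2
0      a      dd
1      b      cc
2      c      bb
3      d      aa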

How to convert rows to columns in pandas?

I want to convert every three rows of a DataFrame into columns.
Input:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,11,12,13],'b':['a','b','c','aa','bb','cc']})
print(df)
Output:
a b
0 1 a
1 2 b
2 3 c
3 11 aa
4 12 bb
5 13 cc
Expected:
a1 a2 a3 b1 b2 b3
0 1 2 3 a b c
1 11 12 13 aa bb cc
Use set_index with floor division and modulo by 3, then unstack and flatten the MultiIndex:
import numpy as np

a = np.arange(len(df))
# if the index is the default RangeIndex, it can be used directly:
# a = df.index
df1 = df.set_index([a // 3, a % 3]).unstack()
# Python 3.6+ solution
df1.columns = [f'{i}{j + 1}' for i, j in df1.columns]
# Python below 3.6
# df1.columns = ['{}{}'.format(i, j + 1) for i, j in df1.columns]
print(df1)
a1 a2 a3 b1 b2 b3
0 1 2 3 a b c
1 11 12 13 aa bb cc
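An alternative not in the original answer, so treat it as a sketch: the same reshape can be done directly in numpy by viewing the values as blocks of three rows and swapping the last two axes:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 11, 12, 13],
                   'b': ['a', 'b', 'c', 'aa', 'bb', 'cc']})

# (rows, cols) -> (groups, 3, cols) -> (groups, cols, 3) -> (groups, cols*3)
vals = df.to_numpy().reshape(-1, 3, df.shape[1]).transpose(0, 2, 1)
cols = [f'{c}{i}' for c in df.columns for i in range(1, 4)]
print(pd.DataFrame(vals.reshape(len(vals), -1), columns=cols))
#    a1  a2  a3  b1  b2  b3
# 0   1   2   3   a   b   c
# 1  11  12  13  aa  bb  cc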
I'm adding a different approach with group -> apply.
df is first grouped by df.index//3 and then the munge function is applied to each group.
def munge(group):
    g = group.T.stack()
    g.index = ['{}{}'.format(c, i+1) for i, (c, _) in enumerate(g.index)]
    return g
result = df.groupby(df.index//3).apply(munge)
Output:
>>> df.groupby(df.index//3).apply(munge)
a1 a2 a3 b4 b5 b6
0 1 2 3 a b c
1 11 12 13 aa bb cc
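Note that enumerate numbers the flattened index continuously, which is why the second block comes out as b4 b5 b6 rather than the b1 b2 b3 the question asked for. A hedged fix (mine, not part of the original answer) is to restart the counter for each source column:

def munge(group):
    g = group.T.stack()
    # i % len(group) restarts the numbering for each original column
    g.index = ['{}{}'.format(c, i % len(group) + 1)
               for i, (c, _) in enumerate(g.index)]
    return g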

How to insert in every second row in pandas? [duplicate]

This question already has answers here:
Python Pandas - Combining Multiple Columns into one Staggered Column
(2 answers)
Closed last year.
Basically, I have a DataFrame which looks like this:
c1 c2
0 a b
1 c d
2 e f
3 g h
I need to convert it to this one:
c1
0 a
1 b
2 c
3 d
4 e
...
I know how to get all the values from the second column:
second_col_items = [df['c2'].iloc[i] for i in range(len(df.index))]
But I'm stuck on inserting. I need to insert rows in loop, and, moreover, I need to insert new rows between the existing ones. Is it even possible?
So, my question is: how do I iterate through the list (second_col_items in my case) and insert its values into every second row of the DataFrame? Thanks in advance!
You can use the stack() method:
source DF
In [2]: df
Out[2]:
c1 c2
0 a b
1 c d
2 e f
3 g h
stacked
In [3]: df.stack()
Out[3]:
0 c1 a
c2 b
1 c1 c
c2 d
2 c1 e
c2 f
3 c1 g
c2 h
dtype: object
stacked + reset_index
In [4]: df.stack().reset_index(drop=True)
Out[4]:
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
dtype: object
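If you need the result back as a single-column DataFrame named c1, as in the question, one extra step works (my addition, not part of the original answer):

df.stack().reset_index(drop=True).to_frame('c1')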
You can unwind the values with ravel or flatten. Both are numpy methods that can be applied to the values attribute of a pd.DataFrame or pd.Series; ravel returns a view where possible, while flatten always returns a copy.
solution
pd.Series(df.values.ravel(), name='c1')
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
Name: c1, dtype: object
Or
pd.DataFrame(dict(c1=df.values.ravel()))
c1
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
