Pandas DataFrame efficiently split one column into multiple - python

I have a dataframe similar to this:
data = {"col_1": [0, 1, 2],
"col_2": ["abc", "defg", "hi"]}
df = pd.DataFrame(data)
Visually:
   col_1 col_2
0      0   abc
1      1  defg
2      2    hi
What I'd like to do is split col_2 into its individual characters and append them as new columns to the dataframe.
Example iterative method:
def get_chars(string):
    chars = []
    for char in string:
        chars.append(char)
    return chars

char_df = pd.DataFrame()
for i in range(len(df)):
    char_arr = get_chars(df.loc[i, "col_2"])
    temp_df = pd.DataFrame(char_arr).T
    char_df = pd.concat([char_df, temp_df], ignore_index=True, axis=0)

df = pd.concat([df, char_df], ignore_index=True, axis=1)
Which results in the correct form:
   0     1  2  3    4    5
0  0   abc  a  b    c  NaN
1  1  defg  d  e    f    g
2  2    hi  h  i  NaN  NaN
But I believe iterating through the dataframe like this is very inefficient, so I want to find a faster (ideally vectorised) solution.
In reality, I'm not really splitting up strings; the point of this question is to find a way to efficiently process one column and return many.

If you need performance, use the DataFrame constructor with the values converted to lists:
df = df.join(pd.DataFrame([list(x) for x in df['col_2']], index=df.index))
Or:
df = df.join(pd.DataFrame(df['col_2'].apply(list).tolist(), index=df.index))
print (df)
   col_1 col_2  0  1     2     3
0      0   abc  a  b     c  None
1      1  defg  d  e     f     g
2      2    hi  h  i  None  None
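Since the real goal is "process one column, return many" rather than string splitting, the same pattern generalises: build a list of per-row lists with a plain comprehension, hand it to the DataFrame constructor once, and join. A minimal sketch, where process() is a hypothetical stand-in for your actual row transformation:

import pandas as pd

df = pd.DataFrame({"col_1": [0, 1, 2],
                   "col_2": ["abc", "defg", "hi"]})

def process(value):
    # hypothetical per-row transformation returning a list of new values;
    # replace with whatever "one in, many out" logic you actually have
    return list(value)

# one comprehension + one constructor call, instead of concatenating
# inside a loop (which copies the whole frame on every iteration)
new_cols = pd.DataFrame([process(x) for x in df["col_2"]], index=df.index)
df = df.join(new_cols)
print(df)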

Related

Filter Pandas Dataframe using list, but making sure count of elements matches count in list

So I have a list:
my_list = [1, 1, 2, 3, 4, 4]
I have a dataframe that looks like this:
col_1  col_2
a      1
b      1
c      2
d      3
e      3
f      4
g      4
h      4
I basically want a final dataframe like:
col_1  col_2
a      1
b      1
c      2
d      3
f      4
g      4
Basically I can't use
my_df[my_df['col_2'].isin(my_list)]
since this will include all the rows. I want the rows matching each item in the list, but limited to the number of times that value appears in the list.
Use GroupBy.cumcount to add a counter to both the original DataFrame and a helper DataFrame built from the list, then filter by inner join with DataFrame.merge:
import pandas as pd

my_list = [1, 1, 2, 3, 4, 4]
df1 = pd.DataFrame({'col_2': my_list})
df1['g'] = df1.groupby('col_2').cumcount()

my_df['g'] = my_df.groupby('col_2').cumcount()
df = my_df.merge(df1).drop('g', axis=1)
print(df)
  col_1  col_2
0     a      1
1     b      1
2     c      2
3     d      3
4     f      4
5     g      4
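To see why this works, look at the helper counter: GroupBy.cumcount numbers repeated values 0, 1, 2, ... within each group, so merging on ('col_2', 'g') pairs the n-th occurrence of a value in my_df with the n-th occurrence in the list. A minimal sketch:

import pandas as pd

my_df = pd.DataFrame({'col_1': list('abcdefgh'),
                      'col_2': [1, 1, 2, 3, 3, 4, 4, 4]})
print(my_df.groupby('col_2').cumcount().tolist())
# [0, 1, 0, 0, 1, 0, 1, 2] -> 'h' has counter 2, but the list holds only
# two 4s (counters 0 and 1), so 'h' (and likewise 'e') drops out of the join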

Python - lookup value in dataframe and return random corresponding value

df1 has a lot of NaN values.
I have compiled df2 with all unique values for code and name.
I need to replace the NaN code values in df1 with a random code value from df2 where df1 and df2 match on name.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(columns=['ID', 'name', 'code'])
df1.ID = [1, 2, 3, 4]
df1.name = ['A', 'A', 'B', 'B']
df1.code = [np.nan, np.nan, np.nan, np.nan]

df2 = pd.DataFrame(columns=['name', 'code'])
df2.name = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
df2.code = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
You could use random.sample and pass 2 after joining the values for each group into a list. Then merge back into the initial dataframe, explode the list, and drop_duplicates():
import random
import numpy as np

df2 = df2.groupby('name')['code'].apply(lambda x: random.sample(list(x), 2)).reset_index()
df3 = df1[['ID', 'name']].merge(df2).explode('code').drop_duplicates(['name', 'code']).reset_index(drop=True)
# all IDs are non-zero, so flatnonzero simply renumbers the rows 1..n
df3['ID'] = np.flatnonzero(df3['ID']) + 1
Out[1]:
   ID name code
0   1    A    d
1   2    A    a
2   3    B    h
3   4    B    f
You could create a dictionary where the keys are names and the values are the possible codes, then for each name in df1 sample from the corresponding list:
import random

lookup = df2.groupby('name')['code'].apply(list).to_dict()
df1['code'] = df1['code'].fillna(pd.Series([random.choice(lookup[name]) for name in df1['name']],
                                           index=df1.index))
print(df1)
Output
   ID name code
0   1    A    b
1   2    A    b
2   3    B    g
3   4    B    g
If sampling without replacement is needed, you could do:
lst = [s for k, g in df1.groupby('name', as_index=False) for s in random.sample(lookup[k], len(g))]
df1['code'] = df1['code'].fillna(pd.Series(lst, index=df1.index))
print(df1)
Output
   ID name code
0   1    A    d
1   2    A    a
2   3    B    e
3   4    B    h
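One caveat not shown above: both approaches draw from Python's random module, so the codes change on every run. If you need reproducible output, seed the generator before sampling; a minimal sketch:

import random

random.seed(42)  # fixed seed: subsequent random.choice / random.sample calls repeat across runs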

Appending duplicates as columns and removing the other rows

I have a df with some repeated IDs, like this:
index  ID name surname
    1   1    a       x
    2   2    b       y
    3   1    c       z
    4   3    d       j
I'd like to append the columns of the repeated rows to the right and to remove the "single" rows, like this:
index  ID name surname second_name second_surname
    1   1    a       x           c              z
What is the most efficient way to do it? (I have many millions of rows)
Try using drop_duplicates, merge and query like so:
df['second_name'] = (df.drop_duplicates(subset='ID')
                       .reset_index()
                       .merge(df, on='ID', how='inner', suffixes=('', '_'))
                       .query("name != name_")
                       .set_index('level_0')['name_'])
[out]
   index  ID name second_name
0      1   1    a           c
1      2   2    b         NaN
2      3   1    c         NaN
3      4   3    d         NaN
If you only need the rows that actually have a second name, use dropna:
df.dropna(subset=['second_name'])
[out]
   index  ID name second_name
0      1   1    a           c
My suggestion involves groupby and should work for an arbitrary number of "additional" names:
df_in = pd.DataFrame({'ID': [1, 2, 1, 3], 'name': ['a', 'b', 'c', 'd']})

grp = df_in.groupby('ID', as_index=True)
df_a = grp.first()
df_b = (grp['name'].unique()
                   .apply(pd.Series)
                   .rename(columns=lambda x: 'name_{:.0f}'.format(x + 1))
                   .drop('name_1', axis=1))
df_out = df_a.merge(df_b, how='inner', left_index=True, right_index=True).reset_index(drop=False)
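For the small df_in above, tracing the steps by hand should give one extra column (IDs 2 and 3 have no second name), though it is worth verifying on your data:

print(df_out)
#    ID name name_2
# 0   1    a      c
# 1   2    b    NaN
# 2   3    d    NaN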
I would try to pivot the dataframe. For that, I will first add a rank column giving the rank of each name within its ID:
df['rank'] = df.groupby('ID').cumcount()
pivoted = df.pivot(index='ID', columns='rank', values='name')
giving:
rank  0    1
ID
1     a    c
2     b  NaN
3     d  NaN
Let us just format it:
pivoted = pivoted.rename_axis(None, axis=1).rename(lambda x: 'name_{}'.format(x),
                                                   axis=1).reset_index()
   ID name_0 name_1
0   1      a      c
1   2      b    NaN
2   3      d    NaN
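The question also asks to remove the "single" rows; those end up with NaN in every name_* column beyond the first, so one way to finish (assuming the pivoted frame above) is:

# keep only IDs that actually have a second name
repeated = pivoted[pivoted['name_1'].notna()]
print(repeated)
#    ID name_0 name_1
# 0   1      a      c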
Numpy / Pandas
# r: the unique IDs; i: for each row, the position of its ID within r
r, i = np.unique(df.ID, return_inverse=True)
# j: occurrence counter per ID (0 for the first appearance, 1 for the second, ...)
j = df.groupby('ID').cumcount()

# empty object array with one row per unique ID, one column per occurrence
names = np.empty((len(r), j.max() + 1), object)
names.fill(np.nan)
names[i, j] = df.name

pd.DataFrame(names, r).rename_axis('ID').add_prefix('name_')
   name_0 name_1
ID
1       a      c
2       b    NaN
3       d    NaN
Loopy
from itertools import count
from collections import defaultdict

c = defaultdict(count)
d = defaultdict(dict)

for i, n in zip(df.ID, df.name):
    d[f'name_{next(c[i])}'][i] = n

pd.DataFrame(d).rename_axis('ID')
   name_0 name_1
ID
1       a      c
2       b    NaN
3       d    NaN

How to drop columns from a dataframe that contain specific values in any row

In a pandas dataframe, I need to find columns that contain a zero in any row, and drop that whole column.
For example, if my dataframe looks like this:
   A  B  C  D  E  F  G  H
0  1  0  1  0  1  1  1  1
1  0  1  1  1  1  0  1  1
I need to drop columns A, B, D, and F. I know how to drop the columns, but identifying the ones with zeros programmatically is eluding me.
You can use .loc to slice the dataframe and perform boolean indexing on the columns, checking which contain any 0:
df.loc[:,~(df==0).any()]
   C  E  G  H
0  1  1  1  1
1  1  1  1  1
Or equivalently you can do:
df.loc[:,(df!=0).all()]
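If you prefer an explicit drop over slicing, the same boolean mask can feed df.drop; an equivalent sketch:

# compute the offending column labels, then drop them by name
cols_with_zero = df.columns[(df == 0).any()]
df = df.drop(columns=cols_with_zero)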
Try this:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1], 'B': [1, 0, 1]})
for col in df.columns:
    if 0 in df[col].tolist():
        df = df.drop(columns=col)
df

Pandas: set the value of a column in a row to be the value stored in a different df at the index of its other rows

>>> df
   0  1
0  0  0
1  1  1
2  2  1
>>> df1
   0  1  2
0  A  B  C
1  D  E  F
>>> crazy_magic()
>>> df
   0  1  3
0  0  0  A  # df1[0][0]
1  1  1  E  # df1[1][1]
2  2  1  F  # df1[2][1]
Is there a way to achieve this without a for loop?
import pandas as pd

df = pd.DataFrame([[0, 0], [1, 1], [2, 1]])
df1 = pd.DataFrame([['A', 'B', 'C'], ['D', 'E', 'F']])

df2 = df1.reset_index(drop=False)
#    index  0  1  2
# 0      0  A  B  C
# 1      1  D  E  F

df3 = pd.melt(df2, id_vars=['index'])
#    index variable value
# 0      0        0     A
# 1      1        0     D
# 2      0        1     B
# 3      1        1     E
# 4      0        2     C
# 5      1        2     F

result = pd.merge(df, df3, left_on=[0, 1], right_on=['variable', 'index'])
result = result[[0, 1, 'value']]
print(result)
yields
   0  1 value
0  0  0     A
1  1  1     E
2  2  1     F
My reasoning goes as follows:
We want to use two columns of df as coordinates. The word "coordinates" reminds me of pivot, since if you have two columns whose values represent "coordinates" and a third column representing values, and you want to convert that to a grid, then pivot is the tool to use.
But df does not have a third column of values. The values are in df1. In fact, df1 looks like the result of a pivot operation. So instead of pivoting df, we want to unpivot df1. pd.melt is the function to use when you want to unpivot.
So I tried melting df1. Comparison with other uses of pd.melt led me to conclude df1 needed the index as a column. That's the reason for defining df2. So we melt df2. Once you get that far, visually comparing df3 to df leads you naturally to the use of pd.merge.
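As an alternative sketch (not part of the answer above): since the two columns of df are literally (column, row) coordinates into df1, NumPy fancy indexing can do the lookup directly, assuming df1 is dense and the coordinates are all in range:

import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 0], [1, 1], [2, 1]])
df1 = pd.DataFrame([['A', 'B', 'C'], ['D', 'E', 'F']])

# df1[c][r] in pandas means (row r, column c), so index rows with df[1]
# and columns with df[0]
df['value'] = df1.to_numpy()[df[1].to_numpy(), df[0].to_numpy()]
print(df)
#    0  1 value
# 0  0  0     A
# 1  1  1     E
# 2  2  1     F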
