My DataFrame has some columns where each value can be "1", "2", "3" or "any". Here is an example:
>>> df = pd.DataFrame({'a': ['1', '2', 'any', '3'], 'b': ['any', 'any', '3', '1']})
>>> df
a b
0 1 any
1 2 any
2 any 3
3 3 1
In my case, "any" means that the value can be "1", "2" or "3". I would like to generate all possible rows using only values "1", "2" and "3" (or, in general, any list of values that I might have). Here is the expected output for the example above:
a b
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 3 3
7 3 1
I got this output with this kind of ugly and complicated approach:
a = df['a'].replace('any', '1,2,3').apply(lambda x: eval(f'[{str(x)}]')).explode()
result = pd.merge(df.drop(columns=['a']), a, left_index=True, right_index=True)
b = result['b'].replace('any', '1,2,3').apply(lambda x: eval(f'[{str(x)}]')).explode()
result = pd.merge(result.drop(columns=['b']), b, left_index=True, right_index=True)
result = result.drop_duplicates().reset_index(drop=True)
Is there any simpler and/or nicer approach?
You can replace the string any with, e.g. '1,2,3', then split and explode:
(df.replace('any', '1,2,3')
.apply(lambda x: x.str.split(',') if x.name in ['a','b'] else x)
.explode('a').explode('b')
.drop_duplicates(['a','b'])
)
Output:
a b c
0 1 1 1
0 1 2 1
0 1 3 1
1 2 1 1
1 2 2 1
1 2 3 1
2 3 3 1
3 3 1 1
I would not use eval and string manipulation; just replace 'any' with a set of values:
import pandas as pd
df = pd.DataFrame({'a': ['1', '2', 'any', '3'], 'b': ['any', 'any', '3', '1']})
df['c'] = '1'
df[df == 'any'] = {'1', '2', '3'}
for col in df:
    df = df.explode(col)
df = df.drop_duplicates().reset_index(drop=True)
print(df)
This gives the result
a b c
0 1 2 1
1 1 3 1
2 1 1 1
3 2 2 1
4 2 3 1
5 2 1 1
6 3 3 1
7 3 1 1
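As a minimal sketch of the same replace-and-explode idea, the steps can be wrapped in a helper that takes an arbitrary list of allowed values. The name `expand_any` and its `wildcard` parameter are hypothetical, not part of pandas. Using a list instead of a set keeps the row order deterministic.

```python
import pandas as pd

def expand_any(df, values, wildcard="any"):
    """Expand every wildcard cell into one row per allowed value (hypothetical helper)."""
    out = df.copy()
    for col in out.columns:
        # Wildcard cells become the full list of values; other cells are kept as-is.
        out[col] = out[col].map(lambda v: list(values) if v == wildcard else v)
        # explode() turns each list element into its own row; scalars pass through.
        out = out.explode(col, ignore_index=True)
    return out.drop_duplicates().reset_index(drop=True)

df = pd.DataFrame({'a': ['1', '2', 'any', '3'], 'b': ['any', 'any', '3', '1']})
result = expand_any(df, ['1', '2', '3'])
print(result)
```

For the example frame this should reproduce the eight rows of the expected output, in the same order.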
Related
I have a data frame as:
a b c d......
1 1
3 3 3 5
4 1 1 4 6
1 0
I want to select number of columns based on value given in column "a". In this case for first row it would only select column b.
How can I achieve something like:
df.iloc[:,column b:number of columns corresponding to value in column a]
My expected output would be:
a b c d e
1 1 0 0 1 # 'e' contains the value in column b because column a = 1
3 3 3 5 335 # 'e' contains the values of columns b, c, d because column a = 3
4 1 1 4 1
1 0 NAN
Define a little function for this:
def select(df, r):
    return df.iloc[r, 1:1 + df.iat[r, 0]]
The function uses iat to query the a column for that row, and iloc to select columns from the same row.
Call it as such:
select(df, 0)
b 1.0
Name: 0, dtype: float64
And,
select(df, 1)
b 3.0
c 3.0
d 5.0
Name: 1, dtype: float64
Based on your edit, consider this -
df
a b c d e
0 1 1 0 0 0
1 3 3 3 5 0
2 4 1 1 4 6
3 1 0 0 0 0
Use where/mask (with numpy broadcasting) + agg here -
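# Newer pandas versions reject df.a[:, None], so broadcast on the NumPy array instead.
df['e'] = df.iloc[:, 1:]\
    .astype(str)\
    .where(np.arange(df.shape[1] - 1) < df['a'].to_numpy()[:, None], '')\
    .agg(''.join, axis=1)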
df
a b c d e
0 1 1 0 0 1
1 3 3 3 5 335
2 4 1 1 4 1146
3 1 0 0 0 0
If nothing matches, then those entries in e will have an empty string. Just use replace -
df['e'] = df['e'].replace('', np.nan)
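To make the broadcasting step in the where/mask answer easier to follow, here is a small self-contained sketch that builds the boolean mask separately before joining. The names `mask` and `joined` are illustrative only, and the frame is rebuilt from the values shown in the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 4, 1],
                   'b': [1, 3, 1, 0],
                   'c': [0, 3, 1, 0],
                   'd': [0, 5, 4, 0],
                   'e': [0, 0, 6, 0]})

# Column positions 0..3 (shape (4,)) compared against a as a column vector
# (shape (4, 1)) broadcast to a (4, 4) mask: True where position < a.
mask = np.arange(df.shape[1] - 1) < df['a'].to_numpy()[:, None]

# Keep only the masked cells as strings, blank out the rest, then join per row.
joined = df.iloc[:, 1:].astype(str).where(mask, '').agg(''.join, axis=1)
print(mask)
print(joined)
```

Each row of `mask` has as many leading `True` entries as the value in column a, which is what limits the join to the first a columns after a.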
A numpy slicing approach
v = df.to_numpy()  # assumed: v is the underlying array of df
a = v[:, 0]
b = v[:, 1:]
n, m = b.shape
b = b.ravel()
b = np.where(b == 0, '', b.astype(str))
r = np.arange(n) * m
f = lambda t: b[t[0]:t[1]]
df.assign(g=list(map(''.join, map(f, zip(r, r + a)))))
a b c d e g
0 1 1 0 0 0 1
1 3 3 3 5 0 335
2 4 1 1 4 6 1146
3 1 0 0 0 0
Edit: one line solution with slicing.
df["f"] = df.astype(str).apply(lambda r: "".join(r[1:int(r["a"])+1]), axis=1)
# df["f"] = df["f"].astype(int) if you need `f` to be integer
df
a b c d e f
0 1 1 X X X 1
1 3 3 3 5 X 335
2 4 1 1 4 6 1146
3 1 0 X X X 0
Dataset used:
df = pd.DataFrame({'a': {0: 1, 1: 3, 2: 4, 3: 1},
'b': {0: 1, 1: 3, 2: 1, 3: 0},
'c': {0: 'X', 1: '3', 2: '1', 3: 'X'},
'd': {0: 'X', 1: '5', 2: '4', 3: 'X'},
'e': {0: 'X', 1: 'X', 2: '6', 3: 'X'}})
Suggestion for improvement would be appreciated!
I have the following df
list_columns = ['A', 'B', 'C']
list_data = [
[1, '2', 3],
[4, '4', 5],
[1, '2', 3],
[4, '4', 6]
]
df = pd.DataFrame(columns=list_columns, data=list_data)
I want to check if multiple columns exist, and if not to create them.
Example:
If B, C, D do not exist, create them (for the above df, only column D would be created).
I know how to do this with one column:
if 'D' not in df:
    df['D']=0
Is there a way to test whether all my columns exist and, if not, create the ones that are missing, without writing an if for each column?
A loop is not necessary here; use DataFrame.reindex with Index.union:
cols = ['B','C','D']
df = df.reindex(df.columns.union(cols, sort=False), axis=1, fill_value=0)
print (df)
A B C D
0 1 2 3 0
1 4 4 5 0
2 1 2 3 0
3 4 4 6 0
Just to add, you can unpack the set diff between your columns and the list with an assign and ** unpacking.
import numpy as np
cols = ['B','C','D','E']
df.assign(**{col : 0 for col in np.setdiff1d(cols,df.columns.values)})
A B C D E
0 1 2 3 0 0
1 4 4 5 0 0
2 1 2 3 0 0
3 4 4 6 0 0
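Both answers can be folded into a small reusable function. The sketch below is one possible way to do it; `ensure_columns` is a hypothetical name, and it appends the missing columns after the existing ones rather than sorting them.

```python
import pandas as pd

def ensure_columns(df, required, fill_value=0):
    """Return df with every column in `required` present (hypothetical helper)."""
    # Columns that are requested but not yet in the frame, in the requested order.
    missing = [c for c in required if c not in df.columns]
    # reindex keeps existing columns untouched and fills the new ones with fill_value.
    return df.reindex(columns=[*df.columns, *missing], fill_value=fill_value)

df = pd.DataFrame(columns=['A', 'B', 'C'],
                  data=[[1, '2', 3], [4, '4', 5], [1, '2', 3], [4, '4', 6]])
out = ensure_columns(df, ['B', 'C', 'D'])
print(out)
```

Because only the missing names are appended, existing columns keep their position and values.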
I have data like this
ID INFO
1 A=2;B=2;C=5
2 A=3;B=4;C=1
3 A=1;B=3;C=2
I want to split the Info columns into
ID A B C
1 2 2 5
2 3 4 1
3 1 3 2
I can split columns with one delimiter by using
df['A'], df['B'], df['C'] = df['INFO'].str.split(';').str
and then split again by =, but this does not seem efficient when I have many rows, especially when there are so many fields that they cannot be hard-coded beforehand.
Any suggestion would be greatly welcome.
You could use named groups together with Series.str.extract. In the end concat back the 'ID'. This assumes you always have A=;B=;and C= in a line.
pd.concat([df['ID'],
           df['INFO'].str.extract(r'A=(?P<A>\d);B=(?P<B>\d);C=(?P<C>\d)')], axis=1)
# ID A B C
#0 1 2 2 5
#1 2 3 4 1
#2 3 1 3 2
If you want a more flexible solution that can deal with cases where a single line might be 'A=1;C=2' then we can split on ';' and partition on '='. pivot in the end to get to your desired output.
### Starting Data
#ID INFO
#1 A=2;B=2;C=5
#2 A=3;B=4;C=1
#3 A=1;B=3;C=2
#4 A=1;C=2
(df.set_index('ID')['INFO']
.str.split(';', expand=True)
.stack()
.str.partition('=')
.reset_index(-1, drop=True)
.pivot(columns=0, values=2)
)
# A B C
#ID
#1 2 2 5
#2 3 4 1
#3 1 3 2
#4 1 NaN 2
Iterating over a Series is much faster than iterating across the rows of a dataframe.
So I would do:
pd.DataFrame([dict([x.split('=') for x in t.split(';')]) for t in df['INFO']], index=df['ID']).reset_index()
It gives as expected:
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2
It should be faster than splitting the dataframe columns twice.
values = [dict(item.split("=") for item in value.split(";")) for value in df.INFO]
df[['a', 'b', 'c']] = pd.DataFrame(values)
This will give you the desired output:
ID INFO a b c
1 a=1;b=2;c=3 1 2 3
2 a=4;b=5;c=6 4 5 6
3 a=7;b=8;c=9 7 8 9
Explanation:
The first line converts every value to a dictionary.
e.g.
x = 'a=1;b=2;c=3'
dict(item.split("=") for item in x.split(";"))
results in :
{'a': '1', 'b': '2', 'c': '3'}
DataFrame can take a list of dicts as an input and turn it into a dataframe.
Then you only need to assign the dataframe to the columns you want:
df[['a', 'b', 'c']] = pd.DataFrame(values)
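Since the question says the field names cannot be hard-coded, here is a hedged sketch of the same list-of-dicts idea that joins the parsed frame back instead of naming the columns. The sample data, including the row with a missing B field, is an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'INFO': ['A=2;B=2;C=5', 'A=3;B=4;C=1', 'A=1;C=2']})

# One dict per row; split('=', 1) splits only on the first '=' in each item.
records = [dict(item.split('=', 1) for item in value.split(';'))
           for value in df['INFO']]

# Column names come from the dict keys; missing keys become NaN.
parsed = pd.DataFrame(records, index=df.index)
result = df.drop(columns='INFO').join(parsed)
print(result)
```

The parsed values stay strings; convert them with `pd.to_numeric` if numbers are needed.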
Another solution is Series.str.findall to extract values and then apply(pd.Series):
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
df = df.drop(columns="INFO")
Details:
df = pd.DataFrame([[1, "A=2;B=2;C=5"],
[2, "A=3;B=4;C=1"],
[3, "A=1;B=3;C=2"]],
columns=["ID", "INFO"])
print(df.INFO.str.findall(r'=(\d+)'))
# 0 [2, 2, 5]
# 1 [3, 4, 1]
# 2 [1, 3, 2]
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
print(df)
# ID INFO A B C
# 0 1 A=2;B=2;C=5 2 2 5
# 1 2 A=3;B=4;C=1 3 4 1
# 2 3 A=1;B=3;C=2 1 3 2
# Remove INFO column
df = df.drop(columns="INFO")
print(df)
# ID A B C
# 0 1 2 2 5
# 1 2 3 4 1
# 2 3 1 3 2
Another solution :
#split on ';'
#explode
#then split on '='
#and pivot
df_INFO = (df.INFO
.str.split(';')
.explode()
.str.split('=',expand=True)
.pivot(columns=0,values=1)
)
pd.concat([df.ID,df_INFO],axis=1)
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2
I have a dataframe, with the following columns, in this order;
'2','4','9','A','1','B','C'
I want the first 3 columns to be ABC but the rest it doesn't matter.
Output:
'A','B','C','3','2','9'... and so on
Is this possible?
(There are hundreds of columns, so I can't put them all in a list.)
You can try to reorder like this:
first_cols = ['A','B','C']
last_cols = [col for col in df.columns if col not in first_cols]
df = df[first_cols+last_cols]
Setup
cols = ['2','4','9','A','1','B','C']
df = pd.DataFrame(1, range(3), cols)
df
2 4 9 A 1 B C
0 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1
sorted with key
key = lambda x: (x != 'A', x != 'B', x != 'C')
df[sorted(df, key=key)]
A B C 2 4 9 1
0 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1
A version better suited to a longer list of leading columns:
first_cols = ['A', 'B', 'C']
key = lambda x: tuple(y != x for y in first_cols)
df[sorted(df, key=key)]
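To see why the sort key works, the sketch below prints the tuple each column name maps to. It relies only on `False` sorting before `True` and on `sorted` being stable; the variable names `keys` and `ordered` are illustrative.

```python
import pandas as pd

cols = ['2', '4', '9', 'A', '1', 'B', 'C']
df = pd.DataFrame(1, range(3), cols)

first_cols = ['A', 'B', 'C']
key = lambda x: tuple(y != x for y in first_cols)

# 'A' -> (False, True, True), 'B' -> (True, False, True),
# 'C' -> (True, True, False), anything else -> (True, True, True).
keys = {c: key(c) for c in df.columns}
print(keys)

# Tuples compare element by element, so A, B, C come first in that order;
# the remaining columns tie and keep their original relative order.
ordered = sorted(df.columns, key=key)
print(ordered)
```

Note that the key costs one comparison per entry of `first_cols` for every column, so the list-comprehension approach may be simpler when the leading list is long.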