How do I separate rows and form a new dataframe from the series?
Suppose I have a dataframe df, and I am iterating over df with the following, trying to append each row to an empty dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),
                  columns=['a', 'b', 'c', 'd', 'e'])
df1 = pd.DataFrame()
df2 = pd.DataFrame()
for index, row in df.iterrows():
    if (few conditions go here):
        df1.append(row)
    else:
        df2.append(row)
The type of each row over the iteration is a Series, but if I append it to an empty dataframe it appends rows as columns and columns as rows. Is there a fix for this?
I think the best is to avoid iterating and use boolean indexing, with conditions chained by & for AND, | for OR, ~ for NOT and ^ for XOR:
# define all conditions
mask = (df['a'] > 2) & (df['b'] > 3)
# filter
df1 = df[mask]
# invert the condition with ~
df2 = df[~mask]
Sample:
np.random.seed(125)
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),
                  columns=['a', 'b', 'c', 'd', 'e'])
print (df)
a b c d e
0 2 7 3 6 0
1 5 6 2 5 0
2 4 2 9 0 7
3 2 7 9 5 3
4 5 7 9 9 1
mask = (df['a'] > 2) & (df['b'] > 3)
print (mask)
0 False
1 True
2 False
3 False
4 True
dtype: bool
df1 = df[mask]
print (df1)
a b c d e
1 5 6 2 5 0
4 5 7 9 9 1
df2 = df[~mask]
print (df2)
a b c d e
0 2 7 3 6 0
2 4 2 9 0 7
3 2 7 9 5 3
EDIT:
Loop version; if possible don't use it, because it is slow:
df1 = pd.DataFrame(columns=df.columns)
df2 = pd.DataFrame(columns=df.columns)
for index, row in df.iterrows():
    if (row['a'] > 2) and (row['b'] > 3):
        df1.loc[index] = row
    else:
        df2.loc[index] = row
print (df1)
a b c d e
1 5 6 2 5 0
4 5 7 9 9 1
print (df2)
a b c d e
0 2 7 3 6 0
2 4 2 9 0 7
3 2 7 9 5 3
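As an aside, the original attempt failed because DataFrame.append never modified the frame in place (the returned object was discarded), and it has since been removed in pandas 2.0. If you really must build the frames row by row, a minimal sketch that avoids the transposition:
rows1, rows2 = [], []
for index, row in df.iterrows():
    if (row['a'] > 2) and (row['b'] > 3):
        rows1.append(row)
    else:
        rows2.append(row)

# a list of Series becomes one row per Series, so nothing ends up transposed
df1 = pd.DataFrame(rows1, columns=df.columns)
df2 = pd.DataFrame(rows2, columns=df.columns)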
Try the query method:
df2 = df1.query('conditions go here')
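For example, with the conditions used above (my concrete filling of the 'conditions go here' placeholder, not part of the original answer):
df1 = df.query('a > 2 and b > 3')
df2 = df.query('not (a > 2 and b > 3)')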
I am trying to modify specific values in a column, where the modification uses values from another column. For example, say I have a df:
A B C
1 3 8
1 6 8
2 2 9
2 6 1
3 4 5
3 6 7
Where I want df['B'] = df['B'] + df['C'] only for the subset df.loc[df['A'] == 2]
Producing:
A B C
1 3 8
1 6 8
2 11 9
2 7 1
3 4 5
3 6 7
I have tried
df.loc[(df['A']==2), 'B'].apply(lambda x: x + df['C'])
but get:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
You are close, but apply is not necessary: inside the lambda, x + df['C'] adds the entire column to each scalar x, which is not what you want. Plain .loc alignment does the job:
m = df['A'] == 2
# short way
df.loc[m, 'B'] += df.loc[m, 'C']
# long way
df.loc[m, 'B'] = df.loc[m, 'B'] + df.loc[m, 'C']
Or:
df.loc[df['A'] == 2, 'B'] += df['C']
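A minimal end-to-end sketch with the question's data, assuming the default integer index:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3],
                   'B': [3, 6, 2, 6, 4, 6],
                   'C': [8, 8, 9, 1, 5, 7]})

m = df['A'] == 2
df.loc[m, 'B'] += df.loc[m, 'C']
print(df)   # rows with A == 2 now have B = 11 and 7, as desired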
If you don't mind using numpy, I find it very simple for tasks like yours:
import numpy as np
df['B'] = np.where(df['A'] == 2, df['B'] + df['C'], df['B'])
prints:
A B C
0 1 3 8
1 1 6 8
2 2 11 9
3 2 7 1
4 3 4 5
5 3 6 7
This is a tricky question. I have a dataframe like this, and I want to create 3 columns with conditional sums, such as:
If the id = A, then A = A1, and B and C = B1
If the id = B, then B = B1, and A and C = A1
Example data:
id A1 B1 A B C
A 5 4 5 4 4
B 6 1 6 1 6
A 7 2 7 2 2
B 6 8 6 8 6
C 2 1 2 1 0
I'm trying to come up with a general solution so I don't need a lot of sums by axis.
Your condition can be reduced to:
if id == A, then column A = column A1, column C = column B1
if id == B, then column B = column B1, column C = column A1
So it translates to pandas code as:
df = pd.DataFrame([[5, 4], [6, 1], [7, 2], [6, 8], [2, 1]],
                  index=['A', 'B', 'A', 'B', 'C'], columns=['A1', 'B1'])
df['A'] = df['A1']
df['B'] = df['B1']
df['C'] = (df.index == 'B') * df['A1'] + (df.index == 'A') * df['B1']
# or a faster method from @user3483203
# df['id'] = df.index
# df['C'] = np.select([df.id.eq('A'), df.id.eq('B')], [df.B1, df.A1], 0)
# >>> df
# A1 B1 A B C
# A 5 4 5 4 4
# B 6 1 6 1 6
# A 7 2 7 2 2
# B 6 8 6 8 6
# C 2 1 2 1 0
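For completeness, a runnable sketch of the np.select variant from the comment above, with the import it needs:
import numpy as np
import pandas as pd

df = pd.DataFrame([[5, 4], [6, 1], [7, 2], [6, 8], [2, 1]],
                  index=['A', 'B', 'A', 'B', 'C'], columns=['A1', 'B1'])
df['A'] = df['A1']
df['B'] = df['B1']
# choose B1 where id is A, A1 where id is B, 0 otherwise
df['C'] = np.select([df.index == 'A', df.index == 'B'],
                    [df['B1'], df['A1']], default=0)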
I have a pandas DataFrame with about 200 columns. Roughly, I want to do this
for col in df.columns:
    if col begins with a number:
        df.drop(col)
I'm not sure what the best practices are when it comes to handling pandas DataFrames; how should I handle this? Will my pseudocode work, or is it not recommended to modify a pandas dataframe in a for loop?
I think the simplest is to select all columns which do not start with a number, by filter with a regex, where ^ matches the start of the string and \D matches a non-digit:
df1 = df.filter(regex=r'^\D')
Similar alternative:
df1 = df.loc[:, df.columns.str.contains(r'^\D')]
Or invert the condition and match columns starting with a digit:
df1 = df.loc[:, ~df.columns.str.contains(r'^\d')]
df1 = df.loc[:, ~df.columns.str[0].str.isnumeric()]
If you want to use your pseudocode:
for col in df.columns:
    if col[0].isnumeric():
        df = df.drop(col, axis=1)
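The same drop without the loop, as a one-liner sketch:
df = df.drop(columns=[col for col in df.columns if col[0].isnumeric()])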
Sample:
df = pd.DataFrame({'2A': list('abcdef'),
                   '1B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D3': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbb')})
print (df)
1B 2A C D3 E F
0 4 a 7 1 5 a
1 5 b 8 3 3 a
2 4 c 9 5 6 a
3 5 d 4 7 9 b
4 5 e 2 1 2 b
5 4 f 3 0 4 b
df1 = df.filter(regex=r'^\D')
print (df1)
C D3 E F
0 7 1 5 a
1 8 3 3 a
2 9 5 6 a
3 4 7 9 b
4 2 1 2 b
5 3 0 4 b
An alternative can be this:
columns = [x for x in df.columns if not x[0].isdigit()]
df = df[columns]
I have a list l = ['a', 'b', 'c']
and a dataframe with columns d, e, f whose values are all numbers.
How can I insert list l into my dataframe just below the column headers?
Setup
df = pd.DataFrame(np.ones((2, 3), dtype=int), columns=list('def'))
l = list('abc')
df
d e f
0 1 1 1
1 1 1 1
Option 1
I'd accomplish this task by adding a level to the columns object
df.columns = pd.MultiIndex.from_tuples(list(zip(df.columns, l)))
df
d e f
a b c
0 1 1 1
1 1 1 1
Option 2
Use a dictionary comprehension passed to the dataframe constructor
pd.DataFrame({(i, j): df[i] for i, j in zip(df, l)})
d e f
a b c
0 1 1 1
1 1 1 1
But if you insist on putting it in the dataframe proper... (keep in mind, this turns the dataframe into dtype object and we lose significant computational efficiency.)
Alternative 1
pd.DataFrame([l], columns=df.columns).append(df, ignore_index=True)
d e f
0 a b c
1 1 1 1
2 1 1 1
Alternative 2
pd.DataFrame([l] + df.values.tolist(), columns=df.columns)
d e f
0 a b c
1 1 1 1
2 1 1 1
Use pd.concat
In [1112]: df
Out[1112]:
d e f
0 0.517243 0.731847 0.259034
1 0.318821 0.551298 0.773115
2 0.194192 0.707525 0.804102
3 0.945842 0.614033 0.757389
In [1113]: pd.concat([pd.DataFrame([l], columns=df.columns), df], ignore_index=True)
Out[1113]:
d e f
0 a b c
1 0.517243 0.731847 0.259034
2 0.318821 0.551298 0.773115
3 0.194192 0.707525 0.804102
4 0.945842 0.614033 0.757389
Are you looking for append, i.e.:
df = pd.DataFrame([[1, 2, 3]], columns=list('def'))
I = ['a', 'b', 'c']
ndf = df.append(pd.Series(I, index=df.columns.tolist()), ignore_index=True)
Output:
d e f
0 1 2 3
1 a b c
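Note that DataFrame.append was removed in pandas 2.0; on current versions the same row can be appended with pd.concat, e.g.:
# equivalent on pandas >= 2.0, where DataFrame.append no longer exists
ndf = pd.concat([df, pd.DataFrame([I], columns=df.columns)], ignore_index=True)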
If you want to add the list to the columns as a MultiIndex:
df.columns = [df.columns, l]
print (df)
d e f
a b c
0 4 7 1
1 5 8 3
2 4 9 5
3 5 4 7
4 5 2 1
5 4 3 0
print (df.columns)
MultiIndex(levels=[['d', 'e', 'f'], ['a', 'b', 'c']],
labels=[[0, 1, 2], [0, 1, 2]])
If you want to add the list at a specific position pos:
pos = 0
df1 = pd.DataFrame([l], columns=df.columns)
print (df1)
d e f
0 a b c
df = pd.concat([df.iloc[:pos], df1, df.iloc[pos:]], ignore_index=True)
print (df)
d e f
0 a b c
1 4 7 1
2 5 8 3
3 4 9 5
4 5 4 7
5 5 2 1
6 4 3 0
But if you append this list to a numeric dataframe, you get mixed types (numeric with strings), so some pandas functions may fail.
Setup:
df = pd.DataFrame({'d': [4, 5, 4, 5, 5, 4],
                   'e': [7, 8, 9, 4, 2, 3],
                   'f': [1, 3, 5, 7, 1, 0]})
print (df)
I have a large DataFrame of observations, i.e.:
value 1,value 2
a,1
a,1
a,2
b,3
a,3
I now have an external DataFrame of values
_ ,a,b
1 ,10,20
2 ,30,40
3 ,50,60
What would be an efficient way to add the values from the indexed table to the first DataFrame? I.e.:
value 1,value 2, new value
a,1,10
a,1,10
a,2,30
b,3,60
a,3,50
An alternative solution using .lookup(). It's a one-line, vectorized solution, suitable for large datasets.
import pandas as pd
import numpy as np
# generate some artificial data
# ================================
np.random.seed(0)
df1 = pd.DataFrame(dict(value1=np.random.choice('a b'.split(), 10), value2=np.random.randint(1, 10, 10)))
df2 = pd.DataFrame(dict(a=np.random.randn(10), b=np.random.randn(10)), columns=['a', 'b'], index=np.arange(1, 11))
df1
Out[178]:
value1 value2
0 a 6
1 b 3
2 b 5
3 a 8
4 b 7
5 b 9
6 b 9
7 b 2
8 b 7
9 b 8
df2
Out[179]:
a b
1 2.5452 0.0334
2 1.0808 0.6806
3 0.4843 -1.5635
4 0.5791 -0.5667
5 -0.1816 -0.2421
6 1.4102 1.5144
7 -0.3745 -0.3331
8 0.2752 0.0474
9 -0.9608 1.4627
10 0.3769 1.5350
# processing: one liner lookup function
# =======================================================
# df1.value2 is the index and df1.value1 is the column
df1['new_values'] = df2.lookup(df1.value2, df1.value1)
df1
Out[181]:
value1 value2 new_values
0 a 6 1.4102
1 b 3 -1.5635
2 b 5 -0.2421
3 a 8 0.2752
4 b 7 -0.3331
5 b 9 1.4627
6 b 9 1.4627
7 b 2 0.6806
8 b 7 -0.3331
9 b 8 0.0474
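Note: DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0. A sketch of an equivalent on recent versions, using positional indexing:
# translate the label-based lookup into positional row/column indices
rows = df2.index.get_indexer(df1['value2'])
cols = df2.columns.get_indexer(df1['value1'])
df1['new_values'] = df2.to_numpy()[rows, cols]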
Assuming your first and second dfs are df and df1 respectively, you can merge on the matching columns and then mask the 'a' and 'b' conditions:
In [9]:
df = df.merge(df1, left_on=['value 2'], right_on=['_'])
a_mask = (df['value 2'] == df['_']) & (df['value 1'] == 'a')
b_mask = (df['value 2'] == df['_']) & (df['value 1'] == 'b')
df.loc[a_mask, 'new value'] = df['a'].where(a_mask)
df.loc[b_mask, 'new value'] = df['b'].where(b_mask)
df
Out[9]:
value 1 value 2 _ a b new value
0 a 1 1 10 20 10
1 a 1 1 10 20 10
2 a 2 2 30 40 30
3 b 3 3 50 60 60
4 a 3 3 50 60 50
You can then drop the additional columns:
In [11]:
df = df.drop(['_','a','b'], axis=1)
df
Out[11]:
value 1 value 2 new value
0 a 1 10
1 a 1 10
2 a 2 30
3 b 3 60
4 a 3 50
Another way is to define a func to perform the lookup (note that this row-wise apply will be slow on large frames):
In [15]:
def func(x):
    row = df1[df1['_'] == x['value 2']]
    return row[x['value 1']].values[0]

df['new value'] = df.apply(func, axis=1)
df
Out[15]:
value 1 value 2 new value
0 a 1 10
1 a 1 10
2 a 2 30
3 b 3 60
4 a 3 50
EDIT
Using @Jianxun Li's lookup works, but you have to offset the index as your index is 0-based:
In [20]:
df['new value'] = df1.lookup(df['value 2'] - 1, df['value 1'])
df
Out[20]:
value 1 value 2 new value
0 a 1 10
1 a 1 10
2 a 2 30
3 b 3 60
4 a 3 50
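Since DataFrame.lookup is likewise gone on pandas >= 2.0, a sketch of the same offset lookup with positional indexing:
# 'value 2' is 1-based, so subtracting 1 gives 0-based row positions
rows = (df['value 2'] - 1).to_numpy()
cols = df1.columns.get_indexer(df['value 1'])
df['new value'] = df1.to_numpy()[rows, cols]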