I'm a Python beginner and I'm trying to do some operations with dataframes that I usually do in the R language.
I have a large dataframe with 2592 rows and 205 columns, and I want to replace the 0.0 values with half the minimum value of their column.
An example with a random dataframe would be:
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(1)
>>> df = pd.DataFrame(np.random.randint(0,10, size=(3,5)), columns = ['A', 'B', 'C', 'D', 'E'])
>>> print(df)
   A  B  C  D  E
0  5  8  9  5  0
1  0  1  7  6  9
2  2  4  5  2  4
And the result I'm looking for is:
   A  B  C  D  E
0  5  8  9  5  2
1  1  1  7  6  9
2  2  4  5  2  4
Intuitively I would do it like this:
>>> for column in df:
...     for element in column:
...         if element == 0:
...             element = df[column].min()/2
But it doesn't work... any help?
Thank you!
Use DataFrame.mask: replace the 0s with NaN so they are ignored, take each column's minimum, and divide it by 2:
df1 = df.mask(df.eq(0), df.replace(0, np.nan).min().div(2), axis=1)
print(df1)
   A  B  C  D  E
0  5  8  9  5  2
1  1  1  7  6  9
2  2  4  5  2  4
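To see where the fill values come from, you can inspect the intermediate Series on its own; with the example df above it holds each column's smallest non-zero value halved:
fill = df.replace(0, np.nan).min().div(2)
print(fill)
A    1.0
B    0.5
C    2.5
D    1.0
E    2.0
dtype: float64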
A more efficient solution is possible (thanks @mozway); df[~m] turns the zeros into NaN, so min() skips them:
m = df.eq(0)
df1 = df.mask(m, df[~m].min().div(2), axis=1)
To make your "intuitive" approach work, use a function to perform the logic you need.
Pandas' .apply function is optimised, so it should be sufficiently fast anyway.
import pandas as pd
import numpy as np
np.random.seed(1)
df = pd.DataFrame(np.random.randint(0,10, size=(3,5)), columns = ['A', 'B', 'C', 'D', 'E'])
def make_half_minimum(value, dataseries):
    if value == 0:
        # minimum of the column excluding zeros, halved
        dataseries_ = dataseries[dataseries != 0]
        return dataseries_.min() / 2
    else:
        return value

for column_name in df.columns:
    df[column_name] = df[column_name].apply(lambda x: make_half_minimum(x, df[column_name]))
print(df)
     A  B  C  D    E
0  5.0  8  9  5  2.0
1  1.0  1  7  6  9.0
2  2.0  4  5  2  4.0
Note that columns A and E were upcast to float, because min()/2 returns a float.
I would like to reshape the following dataframe
   A   B  C  D
0  a  aa  X  1
1  a  aa  Y  5
2  b  bb  X  6
3  b  bb  Y  2
4  c  cc  X  3
5  c  cc  Y  7
6  d  dd  X  8
7  d  dd  Y  4
into
   A   B  X  Y
0  a  aa  1  5
1  b  bb  6  2
2  c  cc  3  7
3  d  dd  8  4
Could somebody help me with that?
Have you tried df.pivot() or pd.pivot()? The values in column C will become column headers. After that, flatten the multi-index columns, and rename them.
import pandas as pd
#df = df.pivot(['A', 'B'], columns='C').reset_index() #this also works
df = pd.pivot(data=df, index=['A', 'B'], columns='C').reset_index()
df.columns = ['A', 'B', 'X', 'Y']
print(df)
Output
   A   B  X  Y
0  a  aa  1  5
1  b  bb  6  2
2  c  cc  3  7
3  d  dd  8  4
Sometimes there might be repeated records with the same index; in that case you'd have to use pd.pivot_table() instead. The param aggfunc=np.mean takes the mean of these repeated records, and the result becomes type float, as you can see from the output.
import pandas as pd
import numpy as np
df = pd.pivot_table(data=df, index=['A', 'B'], columns='C', aggfunc=np.mean).reset_index()
df.columns = ['A', 'B', 'X', 'Y']
print(df)
Output
   A   B    X    Y
0  a  aa  1.0  5.0
1  b  bb  6.0  2.0
2  c  cc  3.0  7.0
3  d  dd  8.0  4.0
You can try
out = df.pivot(index=['A', 'B'], columns='C', values='D').reset_index()
print(out)
C  A   B  X  Y
0  a  aa  1  5
1  b  bb  6  2
2  c  cc  3  7
3  d  dd  8  4
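Note the stray C at the left of the printed header: it is the name of the columns axis left over from the pivot, not a real column. If it bothers you, one way to clear it is rename_axis:
out = out.rename_axis(columns=None)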
I am trying to modify specific values in a column, where the modification uses values from another column. For example say I have a df:
A  B  C
1  3  8
1  6  8
2  2  9
2  6  1
3  4  5
3  6  7
Where I want df['B'] = df['B'] + df['C'] only for the subset df.loc[df['A'] == 2]
Producing:
A   B  C
1   3  8
1   6  8
2  11  9
2   7  1
3   4  5
3   6  7
I have tried
df.loc[(df['A']==2), 'B'].apply(lambda x: x + df['C'])
but get:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
You are close; apply is not necessary:
m = df['A'] == 2
#short way
df.loc[m, 'B'] += df.loc[m, 'C']
#long way
df.loc[m, 'B'] = df.loc[m, 'B'] + df.loc[m, 'C']
Or:
df.loc[df['A'] == 2, 'B'] += df['C']
If you don't mind using numpy, I find it very simple for tasks like yours:
import numpy as np
df['B'] = np.where(df['A'] == 2, df['B'] + df['C'], df['B'])
prints:
   A   B  C
0  1   3  8
1  1   6  8
2  2  11  9
3  2   7  1
4  3   4  5
5  3   6  7
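If you prefer to stay within pandas, Series.mask does the same conditional replacement (a sketch equivalent to the numpy one-liner above, not taken from the original answer):
df['B'] = df['B'].mask(df['A'] == 2, df['B'] + df['C'])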
So my dataframe has multiple columns; one of them, named "multiple", contains booleans: only 1s and 0s. Now I want to replicate every row of df.loc[df.multiple==1] 4 extra times, so each of those rows appears 5 times in total. How can I do that? (I don't want to replicate indexes.)
example input:
df=
index  strings  multiple
0      A        0
1      B        1
2      C        1
3      D        0
4      E        1
Expected output:
index  strings  multiple
0      A        0
1      B        1
2      B        1
3      B        1
4      B        1
5      B        1
6      C        1
7      C        1
8      C        1
9      C        1
10     C        1
11     D        0
12     E        1
13     E        1
14     E        1
15     E        1
16     E        1
Here is another alternative, based on @Vinzent's answer.
It uses the same approach to construct the repeats, but doesn't require rebuilding the full dataframe; it relies on indexing instead. This solution is ~30% faster on the provided dataset and on larger ones.
df.loc[np.repeat(df.multiple, df.multiple.values*4+1).index].reset_index(drop=True)
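Unpacked, the one-liner does the following (a step-by-step sketch using the example df from the question):
import numpy as np

repeats = df.multiple.values * 4 + 1        # 1 copy for 0s, 5 copies for 1s
repeated = np.repeat(df.multiple, repeats)  # Series whose index labels are repeated
df1 = df.loc[repeated.index].reset_index(drop=True)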
This is what numpy.repeat is for:
import pandas as pd
import numpy as np
df = pd.DataFrame([['A', 0],
                   ['B', 1],
                   ['C', 1],
                   ['D', 0],
                   ['E', 1]],
                  columns=['strings', 'multiple'])
df = pd.DataFrame(np.repeat(df.values, df['multiple']*4+1, axis=0), columns=df.columns)
print(df)
#    strings multiple
# 0        A        0
# 1        B        1
# 2        B        1
# 3        B        1
# 4        B        1
# 5        B        1
# 6        C        1
# 7        C        1
# 8        C        1
# 9        C        1
# 10       C        1
# 11       D        0
# 12       E        1
# 13       E        1
# 14       E        1
# 15       E        1
# 16       E        1
You can do it with pandas alone; x.name is the group key (0 or 1), so only the multiple == 1 group is concatenated five times, meaning each of its rows appears five times in the result:
(df.groupby('multiple')
   .apply(lambda x: pd.concat([x]*5) if x.name else x)
   .droplevel(level=0)
   .sort_index()
   .reset_index(drop=True)
)
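As a side note (not from the original answers), if the index is unique you can get the same result more directly with Index.repeat:
df1 = df.loc[df.index.repeat(df['multiple']*4 + 1)].reset_index(drop=True)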
I have a pandas DataFrame with about 200 columns. Roughly, I want to do this
for col in df.columns:
    if col begins with a number:
        df.drop(col)
I'm not sure what the best practices are when it comes to handling pandas DataFrames. How should I handle this? Will my pseudocode work, or is it not recommended to modify a pandas dataframe in a for loop?
I think the simplest approach is to select all columns which do not start with a number, using filter with a regex: ^ matches the start of the string and \D matches a non-digit:
df1 = df.filter(regex=r'^\D')
Similar alternative:
df1 = df.loc[:, df.columns.str.contains(r'^\D')]
Or invert the condition and exclude columns that start with a number:
df1 = df.loc[:, ~df.columns.str.contains(r'^\d')]
df1 = df.loc[:, ~df.columns.str[0].str.isnumeric()]
If you want to use your pseudocode:
for col in df.columns:
    if col[0].isnumeric():
        df = df.drop(col, axis=1)
Sample:
df = pd.DataFrame({'2A':list('abcdef'),
                   '1B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D3':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print(df)
  1B 2A  C  D3  E  F
0  4  a  7   1  5  a
1  5  b  8   3  3  a
2  4  c  9   5  6  a
3  5  d  4   7  9  b
4  5  e  2   1  2  b
5  4  f  3   0  4  b
df1 = df.filter(regex=r'^\D')
print(df1)
   C  D3  E  F
0  7   1  5  a
1  8   3  3  a
2  9   5  6  a
3  4   7  9  b
4  2   1  2  b
5  3   0  4  b
An alternative can be this:
columns = [x for x in df.columns if not x[0].isdigit()]
df = df[columns]
I have a large pandas dataframe that contains many columns.
I would like to change the order of the columns so that only a subset of them appears first. I don't care about the ordering of the rest (and there are too many variables to list them all).
For instance, if my dataframe is like this
a  b  c  d  e  f  g  h  i
5  8  7  2  1  4  1  2  3
1  4  2  2  3  4  1  5  3
I would like to specify a subset of the columns, say mysubset=['d','f'], and reorder the dataframe such that the order of the columns is now d, f, a, b, c, e, g, h, i.
Is there a way to do that in a panda-esque way?
You could use a column mask:
>>> mysubset = ["d","f"]
>>> mask = df.columns.isin(mysubset)
>>> pd.concat([df.loc[:,mask], df.loc[:,~mask]], axis=1)
   d  f  a  b  c  e  g  h  i
0  2  4  5  8  7  1  1  2  3
1  2  4  1  4  2  3  1  5  3
or use sorted:
>>> mysubset = ["d","f"]
>>> df[sorted(df, key=lambda x: x not in mysubset)]
   d  f  a  b  c  e  g  h  i
0  2  4  5  8  7  1  1  2  3
1  2  4  1  4  2  3  1  5  3
which works because x not in mysubset will be False for d and f, and False < True.
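You can check the resulting order directly; iterating over a DataFrame yields its column labels, and sorted is stable, so the remaining columns keep their original order:
>>> sorted(df, key=lambda x: x not in mysubset)
['d', 'f', 'a', 'b', 'c', 'e', 'g', 'h', 'i']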
I usually do something like this:
mysubset = ['d', 'f']
othercols = [c for c in df.columns if c not in mysubset]
df = df[mysubset+othercols]
Use a MultiIndex to do that:
priority = [0 if x in {'d','f'} else 1 for x in df.columns]
newdf = df.T.set_index([priority, df.columns]).sort_index().T
Then you have:
In [3]: newdf
Out[3]:
   0     1
   d  f  a  b  c  e  g  h  i
0  2  4  5  8  7  1  1  2  3
1  2  4  1  4  2  3  1  5  3
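The helper priority level (the 0/1 row in the header) stays in the result; if you don't want it, you can drop it afterwards:
newdf.columns = newdf.columns.droplevel(0)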
To move an entire subset of columns, you could do this:
#!/usr/bin/python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
print(df)

cols = df.columns.tolist()
print(cols)

mysubset = ['B','D']
# move each subset column to the front, preserving the subset's order
for idx, item in enumerate(mysubset):
    cols.remove(item)
    cols.insert(idx, item)
print(cols)

df = df[cols]
print(df)
Here I moved B and D to the front and left the others trailing. Output:
                   A         B         C         D
2013-01-01  0.905122 -0.004839 -0.697663 -1.307550
2013-01-02  0.651998 -1.092546  0.594493  0.341066
2013-01-03  0.355832 -0.840057  0.016989  0.377502
2013-01-04 -0.544407  0.826708 -0.889118  0.871769
2013-01-05  0.190630  0.717418  1.325479 -0.882652
2013-01-06  2.730582  0.195908 -0.657642  1.606263
['A', 'B', 'C', 'D']
['B', 'D', 'A', 'C']
                   B         D         A         C
2013-01-01 -0.004839 -1.307550  0.905122 -0.697663
2013-01-02 -1.092546  0.341066  0.651998  0.594493
2013-01-03 -0.840057  0.377502  0.355832  0.016989
2013-01-04  0.826708  0.871769 -0.544407 -0.889118
2013-01-05  0.717418 -0.882652  0.190630  1.325479
2013-01-06  0.195908  1.606263  2.730582 -0.657642
For more, read this answer.
a = list('abcdefghi')
b = list('dfabceghi')
ind = pd.Series(range(9), index=b).reindex(a)
df.sort_index(axis=1, inplace=True, key=lambda x: ind)
The benefit of the above approach is inplace=True, which costs less memory and time when df is a large dataframe.
If your dataframe is of a common (not especially large) shape:
df.filter(b)
may be more pythonic.
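For reference, a quick check of the filter approach on the example frame from the question (a sketch; it just confirms the resulting column order):
import pandas as pd

df = pd.DataFrame([[5, 8, 7, 2, 1, 4, 1, 2, 3],
                   [1, 4, 2, 2, 3, 4, 1, 5, 3]],
                  columns=list('abcdefghi'))
print(df.filter(list('dfabceghi')).columns.tolist())
# ['d', 'f', 'a', 'b', 'c', 'e', 'g', 'h', 'i']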