Converting multiple columns to categories in Pandas. apply? - python

Consider a DataFrame. I want to convert a set of columns to_convert to categories.
I can certainly do the following:
for col in to_convert:
    df[col] = df[col].astype('category')
but I was surprised that the following does not return a dataframe:
df[to_convert].apply(lambda x: x.astype('category'), axis=0)
which of course makes the following not work:
df[to_convert] = df[to_convert].apply(lambda x: x.astype('category'), axis=0)
Why does apply (axis=0) return a Series even though it is supposed to act on the columns one by one?

This was just fixed in master, and so will be in 0.17.0, see the issue here
In [7]: df = DataFrame({'A' : list('aabbcd'), 'B' : list('ffghhe')})
In [8]: df
Out[8]:
A B
0 a f
1 a f
2 b g
3 b h
4 c h
5 d e
In [9]: df.dtypes
Out[9]:
A object
B object
dtype: object
In [10]: df.apply(lambda x: x.astype('category'))
Out[10]:
A B
0 a f
1 a f
2 b g
3 b h
4 c h
5 d e
In [11]: df.apply(lambda x: x.astype('category')).dtypes
Out[11]:
A category
B category
dtype: object

Note that since pandas 0.23.0 you no longer need apply to convert multiple columns to categorical data types. Now you can simply do df[to_convert].astype('category') instead (where to_convert is the set of columns defined in the question).
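As a minimal sketch on pandas >= 0.23.0 (assuming to_convert is a list of column labels present in df), the whole conversion reduces to a single assignment:

df[to_convert] = df[to_convert].astype('category')
print(df[to_convert].dtypes)  # every listed column now reports dtype 'category'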

Related

Use dataframe column containing "column name strings", to return values from dataframe based on column name and index without using .apply()

I have a dataframe as follows:
import pandas
import numpy

df = pandas.DataFrame()
df['A'] = numpy.random.random(10)
df['B'] = numpy.random.random(10)
df['C'] = numpy.random.random(10)
df['Col_name'] = numpy.random.choice(['A','B','C'], size=10)
I want to obtain an output that uses 'Col_name' and the respective index of the dataframe row to lookup the value in the dataframe.
I can get the desired output with .apply() as follows:
df['output'] = df.apply(lambda x: x[ x['Col_name'] ], axis=1)
However, .apply() is slow over a large dataframe because it iterates row by row. Is there an obvious solution in pandas that is faster/vectorised?
You can also pick each column name (or give a list of possible names), use it as a mask to filter your dataframe, then take the values from the matching column and assign them to all rows matching the mask. Repeat this for each remaining column:
for column_name in df:  # or: for column_name in ['A', 'B', 'C']
    df.loc[df['Col_name'] == column_name, 'output'] = df[column_name]
Rows that do not match any mask will keep NaN values.
PS. According to my test with 10,000,000 random rows, the method with .apply() takes 2 min 24 s to finish, while this method takes only 4.3 s.
Use melt to flatten your dataframe and keep the rows where Col_name equals the variable column:
df['output'] = df.melt('Col_name', ignore_index=False).query('Col_name == variable')['value']
print(df)
# Output
A B C Col_name output
0 0.202197 0.430735 0.093551 B 0.430735
1 0.344753 0.979453 0.999160 C 0.999160
2 0.500904 0.778715 0.074786 A 0.500904
3 0.050951 0.317732 0.363027 B 0.317732
4 0.722624 0.026065 0.424639 C 0.424639
5 0.578185 0.626698 0.376692 C 0.376692
6 0.540849 0.805722 0.528886 A 0.540849
7 0.918618 0.869893 0.825991 C 0.825991
8 0.688967 0.203809 0.734467 B 0.203809
9 0.811571 0.010081 0.372657 B 0.010081
Transformation after melt:
>>> df.melt('Col_name', ignore_index=False)
Col_name variable value
0 B A 0.202197
1 C A 0.344753
2 A A 0.500904 # keep
3 B A 0.050951
4 C A 0.722624
5 C A 0.578185
6 A A 0.540849 # keep
7 C A 0.918618
8 B A 0.688967
9 B A 0.811571
0 B B 0.430735 # keep
1 C B 0.979453
2 A B 0.778715
3 B B 0.317732 # keep
4 C B 0.026065
5 C B 0.626698
6 A B 0.805722
7 C B 0.869893
8 B B 0.203809 # keep
9 B B 0.010081 # keep
0 B C 0.093551
1 C C 0.999160 # keep
2 A C 0.074786
3 B C 0.363027
4 C C 0.424639 # keep
5 C C 0.376692 # keep
6 A C 0.528886
7 C C 0.825991 # keep
8 B C 0.734467
9 B C 0.372657
Update
Alternative with set_index and stack, for @Rabinzel:
df['output'] = (
    df.set_index('Col_name', append=True).stack()
      .loc[lambda x: x.index.get_level_values(1) == x.index.get_level_values(2)]
      .droplevel([1, 2])
)
print(df)
# Output
A B C Col_name output
0 0.209953 0.332294 0.812476 C 0.812476
1 0.284225 0.566939 0.087084 A 0.284225
2 0.815874 0.185154 0.155454 A 0.815874
3 0.017548 0.733474 0.766972 A 0.017548
4 0.494323 0.433719 0.979399 C 0.979399
5 0.875071 0.789891 0.319870 B 0.789891
6 0.475554 0.229837 0.338032 B 0.229837
7 0.123904 0.397463 0.288614 C 0.288614
8 0.288249 0.631578 0.393521 A 0.288249
9 0.107245 0.006969 0.367748 C 0.367748
import pandas as pd
import numpy as np
df=pd.DataFrame()
df['A'] = np.random.random(10)
df['B'] = np.random.random(10)
df['C'] = np.random.random(10)
df['Col_name'] = np.random.choice(['A','B','C'],size=10)
df["output"] = np.nan
Even though you do not like going row by row, I still routinely use loops to go through each row, just to know where it breaks when it breaks. Here are two loops, just to satisfy myself. The output column is created ahead of time with NaN values because the loops need it to exist.
# each row, by index
for i in range(len(df)):
    df.loc[i, 'output'] = df.loc[i, df.loc[i, 'Col_name']]

# each row, but by column name (mask on Col_name so only matching rows are set)
for col in df['Col_name'].unique():
    df.loc[df['Col_name'] == col, 'output'] = df.loc[df['Col_name'] == col, col]
Here are some "non-loop" ways to do so.
df["output"] = df.lookup(df.index, df.Col_name)  # note: DataFrame.lookup was deprecated in pandas 1.2
df['output'] = np.where(df['output'].isna(), df.lookup(df.index, df.Col_name), df['output'])
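Since DataFrame.lookup is gone in recent pandas, here is a hedged, vectorized sketch of the same per-row lookup using NumPy fancy indexing (it continues from the pd/np/df setup above and assumes every Col_name value is one of 'A', 'B', 'C'):

value_cols = ['A', 'B', 'C']
rows = np.arange(len(df))
cols = pd.Index(value_cols).get_indexer(df['Col_name'])  # column position chosen for each row
df['output'] = df[value_cols].to_numpy()[rows, cols]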

Is there a way to merge 2 rows of a df into 1?

I have a df that has plenty of row pairs that need to be condensed into 1. Column B identifies the pairs. All column values except one are identical. Is there a way to accomplish this in pandas?
Existing df:
A B C D E
x c v 2 w
x c v 2 r
Desired Output:
A B C D E
x c v 2 w,r
It's a little bit unintuitive to read but works:
df2 = (
    df.groupby('B', as_index=False)
      .agg({**dict.fromkeys(df.columns, 'first'), 'E': ','.join})
)
What we're doing here is grouping by column B and taking the first occurring value in every column for each value of B, but overriding the aggregation for column E so that the E values sharing the same B are joined with a comma.
Hence you get:
A B C D E
0 x c v 2 w,r
This doesn't make assumptions about data types and leaves alone the columns that aren't being joined, but of course it will error out if your E column contains non-string values (or types that can't logically support joining).
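If E might hold non-string values, one hedged workaround (a sketch that differs from the answer above only in the E aggregation) is to cast to string inside the aggregation:

df2 = (
    df.groupby('B', as_index=False)
      .agg({**dict.fromkeys(df.columns, 'first'),
            'E': lambda s: ','.join(s.astype(str))})
)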
Like this:
df = df.apply(lambda x: ','.join(x), axis=0)
To use specific cols
df = df[['A','B']] ....

Changing multiple column names

Let's say I have a data frame with such column names:
['a','b','c','d','e','f','g']
And I would like to change the names from 'c' to 'f' (actually, add a string to each of those column names), so the whole data frame's column names would look like this:
['a','b','var_c_equal','var_d_equal','var_e_equal','var_f_equal','g']
Well, first I made a function that changes all column names with the string I want:
df.rename(columns=lambda x: 'or_'+x+'_no', inplace=True)
But now I really want to understand how to implement something like this:
df.loc[:,'c':'f'].rename(columns=lambda x: 'var_'+x+'_equal', inplace=True)
You can use a list comprehension for that, like:
Code:
new_columns = ['var_{}_equal'.format(c) if c in 'cdef' else c for c in df.columns]
Test Code:
import pandas as pd
df = pd.DataFrame({'a':(1,2), 'b':(1,2), 'c':(1,2), 'd':(1,2)})
print(df)
df.columns = ['var_{}_equal'.format(c) if c in 'cdef' else c
              for c in df.columns]
print(df)
Results:
a b c d
0 1 1 1 1
1 2 2 2 2

a b var_c_equal var_d_equal
0 1 1 1 1
1 2 2 2 2
One way is to use a dictionary instead of an anonymous function. The first two variations below assume the columns you need to rename are contiguous.
Contiguous columns by position
d = {k: 'var_'+k+'_equal' for k in df.columns[2:6]}
df = df.rename(columns=d)
Contiguous columns by name
If you need to calculate the numerical indices:
cols = df.columns.get_loc
d = {k: 'var_'+k+'_equal' for k in df.columns[cols('c'):cols('f')+1]}
df = df.rename(columns=d)
Specifically identified columns
If you want to provide the columns explicitly:
d = {k: 'var_'+k+'_equal' for k in 'cdef'}
df = df.rename(columns=d)
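To tie this back to the question's df.loc[:, 'c':'f'] idea: renaming a slice in place does not propagate to the original frame, but you can take the labels from the slice and feed them to rename on df itself. A minimal sketch (assuming df really has contiguous columns 'c' through 'f'):

to_rename = df.loc[:, 'c':'f'].columns
df = df.rename(columns={c: 'var_' + c + '_equal' for c in to_rename})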

Summing columns to form a new dataframe

I have a DataFrame
A B C D
2015-07-18 4.534390e+05 2.990611e+05 5.706540e+05 4.554383e+05
2015-07-22 3.991351e+05 2.606576e+05 3.876394e+05 4.019723e+05
2015-08-07 1.085791e+05 8.215599e+04 1.356295e+05 1.096541e+05
2015-08-19 1.397305e+06 8.681048e+05 1.672141e+06 1.403100e+06
...
I simply want to sum all columns to get a new dataframe
A B C D
sum s s s s
with the column-wise sums, and then write it out with to_csv(). But when I use
print(df.sum(axis=0))
A 9.099377e+06
B 5.897003e+06
C 1.049932e+07
D 9.208681e+06
dtype: float64
You can transform df.sum() to DataFrame and transpose it:
In [39]: df.sum().to_frame('sum').T
Out[39]:
A B C D
sum 2358458.2 1509979.49 2766063.9 2370164.7
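Since the question also wants to write the result with to_csv(), a short hedged sketch (the filename is purely illustrative):

df.sum().to_frame('sum').T.to_csv('column_sums.csv')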
A slightly shorter version of pd.DataFrame is (with credit to jezrael for simplification):
In [120]: pd.DataFrame([df.sum()], index=['sum'])
Out[120]:
A B C D
sum 2358458.2 1509979.49 2766063.9 2370164.7
Use DataFrame constructor:
df = pd.DataFrame(df.sum().values.reshape(-1, len(df.columns)),
                  columns=df.columns,
                  index=['sum'])
print (df)
A B C D
sum 2358458.2 1509979.49 2766063.9 2370164.7
I think the simplest is df.agg([sum])
df.agg([sum])
Out[40]:
A B C D
sum 2358458.2 1509979.49 2766063.9 2370164.7

Pandas, filter rows which column contain another column

How can I filter rows where one column contains another column's value?
For example, if we have a DataFrame with two columns A and B, can we keep the rows where B.contains(A)? Not rows where B contains any value of A from the whole column, but only where B contains the A value from the same row.
A B
'lol' 'lolec'
'ram' 'rambo'
'ki' 'pio'
Result:
A B
'lol' 'lolec'
'ram' 'rambo'
You can use boolean indexing with a mask created by apply and the in operator, if you need to compare columns A and B row by row:
#if necessary strip ' in all values
df = df.apply(lambda x: x.str.strip("'"))
#df = df.applymap(lambda x: x.strip("'"))
print (df.apply(lambda x: x.A in x.B, axis=1))
0 True
1 True
2 False
dtype: bool
df = df[df.apply(lambda x: x.A in x.B, axis=1)]
print (df)
A B
0 lol lolec
1 ram rambo
To show the difference between the solutions, here is a modified input DataFrame:
print (df)
A B
0 lol pio
1 ram rambo
2 ki lolec
print (df[df.apply(lambda x: x.A in x.B, axis=1)])
A B
1 ram rambo
print (df[df['B'].str.contains("|".join(df['A']))])
A B
1 ram rambo
2 ki lolec
To improve performance, use a list comprehension:
df = df[[a in b for a, b in zip(df.A, df.B)]]
You can use str.contains to match each of the substrings by using the regex | character which implies an OR selection from the contents of the other series:
df[df['B'].str.contains("|".join(df['A']))]
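Note that joining the values of A directly into a pattern assumes they contain no regex metacharacters; a hedged variant escapes them first (and, as shown above, this still matches any A value anywhere in B rather than only the A from the same row):

import re
pattern = "|".join(map(re.escape, df['A']))
df[df['B'].str.contains(pattern)]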
