I have a Series of Labels
pd.Series(['L1', 'L2', 'L3'], ['A', 'B', 'A'])
and a dataframe
pd.DataFrame([[1,2], [3,4]], ['I1', 'I2'], ['A', 'B'])
I'd like to have a dataframe with columns ['L1', 'L2', 'L3'] with the column data from 'A', 'B', 'A' respectively. Like so...
pd.DataFrame([[1,2,1], [3,4,3]], ['I1', 'I2'], ['L1', 'L2', 'L3'])
in a nice pandas way.
Since you mention reindex
#s=pd.Series(['L1', 'L2', 'L3'], ['A', 'B', 'A'])
#df=pd.DataFrame([[1,2], [3,4]], ['I1', 'I2'], ['A', 'B'])
# rename(columns=s.to_dict()) does not work here: the duplicate key 'A' collapses to 'L3'
df.reindex(s.index, axis=1).set_axis(s.values, axis=1)
Out[598]:
    L1  L2  L3
I1   1   2   1
I2   3   4   3
This will produce the dataframe you described:
import pandas as pd
import numpy as np
data = [['A','B','A','A','B','B'],
['B','B','B','A','B','B'],
['A','B','A','B','B','B']]
columns = ['L1', 'L2', 'L3', 'L4', 'L5', 'L6']
pd.DataFrame(data, columns = columns)
You can use loc accessor:
s = pd.Series(['L1', 'L2', 'L3'], ['A', 'B', 'A'])
df = pd.DataFrame([[1,2], [3,4]], ['I1', 'I2'], ['A', 'B'])
res = df.loc[:, s.index]
print(res)
A B A
I1 1 2 1
I2 3 4 3
Or the iloc accessor with columns.get_loc:
res = df.iloc[:, s.index.map(df.columns.get_loc)]
Both methods allow access to duplicate labels / locations, in the same vein as NumPy arrays.
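The loc selection above keeps the original labels (A, B, A). As a minimal sketch, assuming the goal is still the L1/L2/L3 labelling from the question, the selected columns can be relabelled by position with the values of s:

```python
import pandas as pd

s = pd.Series(['L1', 'L2', 'L3'], ['A', 'B', 'A'])
df = pd.DataFrame([[1, 2], [3, 4]], ['I1', 'I2'], ['A', 'B'])

# Select columns by the (possibly duplicated) labels in s.index ...
res = df.loc[:, s.index]
# ... then relabel them positionally with the values of s.
res.columns = s.to_numpy()
print(res)
```

Assigning to res.columns is positional, so duplicate source labels are not a problem here.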
Problem:
Given a large data set (3 million rows x 6 columns) what's the fastest way to join values of columns in a single pandas data frame, based on the rows where the mask is true?
My current solution:
import pandas as pd
import numpy as np
# Note: Real data will be 3 million rows x 6 columns
df = pd.DataFrame({'time': ['0', '1', '2', '3'],
'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
'd0': ['a', 'x', 'a', '1'],
'd1': ['b', 'x', 'b', '2'],
'd2': ['c', 'x', np.nan, '3']})
#print(df)
msg_text_filter = ['msg0', 'msg2']
columns = df.columns.drop(df.columns[:3])
column_join = ["d0"]
mask = df['msg'].isin(msg_text_filter)
df.replace(np.nan,'',inplace=True)
# THIS IS SLOW, HOW TO SPEED UP?
df['d0'] = np.where(
mask,
df[['d0','d1','d2']].agg(''.join, axis=1),
df['d0']
)
df.loc[mask, columns] = np.nan
print(df)
IMHO you can save a lot of time by using
df[['d0', 'd1', 'd2']].sum(axis=1)
instead of
df[['d0', 'd1', 'd2']].agg(''.join, axis=1)
And I think instead of using np.where you could just do:
df.loc[mask, 'd0'] = df.loc[mask, ['d0', 'd1', 'd2']].sum(axis=1)
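Another option that may be worth timing is plain string concatenation with +, applied only to the masked rows. This is a sketch on the sample frame from the question, not a benchmarked result; the helper name sub is just for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})
mask = df['msg'].isin(['msg0', 'msg2'])

# Work only on the masked rows; fillna('') stops NaN from propagating
# through the concatenation.
sub = df.loc[mask, ['d0', 'd1', 'd2']].fillna('')
df.loc[mask, 'd0'] = sub['d0'] + sub['d1'] + sub['d2']

# Blank out the columns that were merged into d0.
df.loc[mask, ['d1', 'd2']] = np.nan
print(df)
```

Unlike the original snippet, this leaves NaN in the unmasked rows untouched instead of replacing it across the whole frame.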
I have a pandas Dataframe where one of the columns is full of lists:
import pandas
df = pandas.DataFrame([[1, ['a', 'b', 'c']],
                       [2, ['d', 'e', 'f']],
                       [3, ['a', 'b', 'c']]])
And I'd like to make a pivot table that shows the list and a count of occurrences
List Count
[a,b,c] 2
[d,e,f] 1
Because list is a non-hashable type, what aggregate functions could do this?
You can zip a list of rows and a list of counts, then make a dataframe from the zip object:
import pandas
df = pandas.DataFrame([[1, ['a', 'b', 'c']],
[2, ['d', 'e', 'f']],
[3, ['a', 'b', 'c']]])
rows = []
counts = []
for index,row in df.iterrows():
    if row[1] not in rows:
        rows.append(row[1])
        counts.append(1)
    else:
        counts[rows.index(row[1])] += 1
df = pandas.DataFrame(zip(rows, counts))
print(df)
The solution I ended up using was:
import pandas
df = pandas.DataFrame([[1, ['a', 'b', 'c']],
[2, ['d','e', 'f']],
[3, ['a', 'b', 'c']]])
print(df[1])
df[1] = df[1].map(tuple)
#Thanks Ch3steR
df2 = pandas.pivot_table(df,index=df[1], aggfunc='count')
print(df2)
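Once the lists are converted to tuples they are hashable, so a pivot table is not strictly needed. As a small sketch of the same idea, value_counts can count the occurrences directly:

```python
import pandas as pd

df = pd.DataFrame([[1, ['a', 'b', 'c']],
                   [2, ['d', 'e', 'f']],
                   [3, ['a', 'b', 'c']]])

# Tuples are hashable, so they can be counted like any other value.
counts = df[1].map(tuple).value_counts()
print(counts)
```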
I just came across this question: how do I use each value in one column as the str.join separator for the matching value in another column? Here is my DataFrame:
>>> df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
a b
0 a hello
1 b good
2 c great
3 d nice
I would like the a column to join the values in the b column, so my desired output is:
a b
0 a haealalao
1 b gbobobd
2 c gcrcecact
3 d ndidcde
How would I go about that?
To show what I mean, here is the first row done in plain Python:
>>> 'a'.join('hello')
'haealalao'
>>>
Just like in the desired output.
I think it is useful to know how two columns can interact. join might not be the best example, but the same idea applies to other string methods, for instance splitting one column on the values of another, or replacing characters in one column with values from another.
P.S. I have a self-answer below.
TL;DR
The code below is the fastest answer I could come up with for this question:
it = iter(df['a'])
df['b'] = [next(it).join(i) for i in df['b']]
The code above first creates an iterator over the a column. Each call to next then returns the following value of a, and the list comprehension uses it to join the characters of the matching string in b.
Long answer:
Going to show my solutions:
Solution 1:
Use a list comprehension and an iterator:
it = iter(df['a'])
df['b'] = [next(it).join(i) for i in df['b']]
print(df)
Solution 2:
Group by the index, then apply str.join to the two columns' values:
df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))
print(df)
Solution 3:
Use a list comprehension that iterates through both columns and str.joins:
df['b'] = [x.join(y) for x, y in df.values.tolist()]
print(df)
All three snippets output:
a b
0 a haealalao
1 b gbobobd
2 c gcrcecact
3 d ndidcde
Timing:
Now on to timing with the timeit module. Here is the code used:
from timeit import timeit
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
def u11_1():
    it = iter(df['a'])
    df['b'] = [next(it).join(i) for i in df['b']]
def u11_2():
    df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))
def u11_3():
    df['b'] = [x.join(y) for x, y in df.values.tolist()]
print('Solution 1:', timeit(u11_1, number=5))
print('Solution 2:', timeit(u11_2, number=5))
print('Solution 3:', timeit(u11_3, number=5))
Output:
Solution 1: 0.007374127670871819
Solution 2: 0.05485127553865618
Solution 3: 0.05787154087587698
So in this measurement the first solution, which uses an iterator, is the quickest.
I tried achieving the output using df.apply
>>> df.apply(lambda x: x['a'].join(x['b']), axis=1)
0 haealalao
1 gbobobd
2 gcrcecact
3 ndidcde
dtype: object
Timing it for a performance comparison:
from timeit import timeit
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
def u11_1():
    it = iter(df['a'])
    df['b'] = [next(it).join(i) for i in df['b']]
def u11_2():
    df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))
def u11_3():
    df['b'] = [x.join(y) for x, y in df.values.tolist()]
def u11_4():
    df['c'] = df.apply(lambda x: x['a'].join(x['b']), axis=1)
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 1:', timeit(u11_1, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 2:', timeit(u11_2, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 3:', timeit(u11_3, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 4:', timeit(u11_4, number=5))
Note that I am reinitializing df before every line so that all the functions process the same dataframe. It can also be done by passing the df as a parameter to the function.
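The parameter-passing variant mentioned above could look roughly like this. It is only a sketch: the function names (make_df, join_iter, join_zip, join_apply) are made up for illustration, and each function returns the joined values instead of writing back to df, so repeated timeit runs always see the same input:

```python
from timeit import timeit

import pandas as pd


def make_df():
    # Fresh copy of the sample data from the question.
    return pd.DataFrame({'a': ['a', 'b', 'c', 'd'],
                         'b': ['hello', 'good', 'great', 'nice']})


def join_iter(df):
    it = iter(df['a'])
    return [next(it).join(i) for i in df['b']]


def join_zip(df):
    return [x.join(y) for x, y in zip(df['a'], df['b'])]


def join_apply(df):
    return df.apply(lambda x: x['a'].join(x['b']), axis=1).tolist()


df = make_df()
for func in (join_iter, join_zip, join_apply):
    # df is never mutated, so every run processes identical data.
    print(func.__name__, timeit(lambda: func(df), number=5))
```

With only four rows the numbers mostly reflect fixed overhead, so a larger frame would be needed for a meaningful comparison.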
Here's another solution using zip and a list comprehension. It should be faster than df.apply:
In [1576]: df.b = [i.join(j) for i,j in zip(df.a, df.b)]
In [1578]: df
Out[1578]:
a b
0 a haealalao
1 b gbobobd
2 c gcrcecact
3 d ndidcde
I have a pandas data frame that looks like:
col11 col12
X ['A']
Y ['A', 'B', 'C']
Z ['C', 'A']
And another one that looks like:
col21 col22
'A' 'alpha'
'B' 'beta'
'C' 'gamma'
I would like to replace the values in col12 based on col22 in an efficient way and get, as a result:
col31 col32
X ['alpha']
Y ['alpha', 'beta', 'gamma']
Z ['gamma', 'alpha']
One solution is to use an indexed series as a mapper with a list comprehension:
import pandas as pd
df1 = pd.DataFrame({'col1': ['X', 'Y', 'Z'],
'col2': [['A'], ['A', 'B', 'C'], ['C', 'A']]})
df2 = pd.DataFrame({'col21': ['A', 'B', 'C'],
'col22': ['alpha', 'beta', 'gamma']})
s = df2.set_index('col21')['col22']
df1['col2'] = [list(map(s.get, i)) for i in df1['col2']]
Result:
col1 col2
0 X [alpha]
1 Y [alpha, beta, gamma]
2 Z [gamma, alpha]
I'm not sure it's the most efficient way, but you can turn your second DataFrame into a dict and then use apply to map the keys to the values:
Assuming your first DataFrame is df1 and the second is df2:
df_dict = dict(zip(df2['col21'], df2['col22']))
df3 = pd.DataFrame({"31":df1['col11'], "32": df1['col12'].apply(lambda x: [df_dict[y] for y in x])})
or, as @jezrael suggested, with a nested list comprehension:
df3 = pd.DataFrame({"31":df1['col11'], "32": [[df_dict[y] for y in x] for x in df1['col12']]})
Note: df3 has a default index.
31 32
0 X [alpha]
1 Y [alpha, beta, gamma]
2 Z [gamma, alpha]
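For comparison, here is a sketch of a more pandas-native route using Series.explode (available from pandas 0.25), assuming the column names from the question. Whether it beats the list comprehensions above would need to be measured on real data:

```python
import pandas as pd

df1 = pd.DataFrame({'col11': ['X', 'Y', 'Z'],
                    'col12': [['A'], ['A', 'B', 'C'], ['C', 'A']]})
df2 = pd.DataFrame({'col21': ['A', 'B', 'C'],
                    'col22': ['alpha', 'beta', 'gamma']})

mapping = df2.set_index('col21')['col22']

# explode gives one row per list element and repeats the original index,
# map translates each element, and groupby(level=0) rebuilds the lists.
df1['col12'] = (df1['col12'].explode()
                            .map(mapping)
                            .groupby(level=0)
                            .agg(list))
print(df1)
```

One caveat: an empty list explodes to NaN, so it would come back as a one-element list containing NaN rather than an empty list.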
I have a dataframe containing strings and NaNs. I want to str.lower() certain columns by name to_lower = ['b', 'd', 'e']. Ideally I could do it with a method on the whole dataframe, rather than with a method on df[to_lower]. I have
df[to_lower] = df[to_lower].apply(lambda x: x.astype(str).str.lower())
but I would like a way to do it without assigning to the selected columns.
df = pd.DataFrame({'a': ['A', 'a'], 'b': ['B', 'b']})
to_lower = ['a']
df2 = df.copy()
df2[to_lower] = df2[to_lower].apply(lambda x: x.astype(str).str.lower())
You can use the assign method and unpack the result as keyword arguments:
df = pd.DataFrame({'a': ['A', 'a'], 'b': ['B', 'b'], 'c': ['C', 'c']})
to_lower = ['a', 'b']
df.assign(**df[to_lower].apply(lambda x: x.astype(str).str.lower()))
# a b c
#0 a b C
#1 a b c
You want this:
for column in to_lower:
    df[column] = df[column].str.lower()
This is far more efficient assuming you have more rows than columns.
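The two answers can be combined. A minimal sketch, assuming the columns hold strings and NaN as in the question: build the lowered columns in a dict comprehension and pass them to assign, which returns a new frame and leaves df untouched. Using .str.lower() directly, without astype(str), also keeps NaN as a missing value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['A', 'a'], 'b': ['B', np.nan], 'c': ['C', 'c']})
to_lower = ['a', 'b']

# assign returns a new DataFrame; .str.lower() passes NaN through unchanged.
out = df.assign(**{col: df[col].str.lower() for col in to_lower})
print(out)
```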