I just came across this question: how do I use the values of one column to str.join the strings in another column? Here is my DataFrame:
>>> df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
   a      b
0  a  hello
1  b   good
2  c  great
3  d   nice
I would like the a column to join the values in the b column, so my desired output is:
   a          b
0  a  haealalao
1  b    gbobobd
2  c  gcrcecact
3  d    ndidcde
How would I go about that?
Hopefully you can see the pattern; here is an example with the first row in plain Python:
>>> 'a'.join('hello')
'haealalao'
>>>
Just like in the desired output.
I think it is useful to know how two columns can interact like this. join might not be the best example, but the same pattern works for other string methods: you could split on the values of the other column, or replace characters from one column inside the other; a sketch of that follows.
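A minimal sketch of the replace variant, with hypothetical data chosen so the character from one column actually occurs in the other (the sample df above has no such overlaps):

import pandas as pd

# Hypothetical frame: each character in 'a' occurs in the matching 'b' string
df2 = pd.DataFrame({'a': ['l', 'o'], 'b': ['hello', 'good']})
# Pair the columns row by row and replace the character from 'a' within 'b'
df2['b'] = [y.replace(x, '-') for x, y in zip(df2['a'], df2['b'])]
print(df2['b'].tolist())  # ['he--o', 'g--d']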
P.S. I have a self-answer below.
TL;DR
The code below is the fastest solution I could come up with for this question:
it = iter(df['a'])
df['b'] = [next(it).join(i) for i in df['b']]
The above code first creates an iterator over the a column; each call to next then yields the following value, so the list comprehension joins each string in b with the corresponding value from a.
Long answer:
Here are my solutions:
Solution 1:
Use a list comprehension with an iterator:
it = iter(df['a'])
df['b'] = [next(it).join(i) for i in df['b']]
print(df)
Solution 2:
Group by the index, then apply a function that str.joins the two columns' values:
df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))
print(df)
Solution 3:
Use a list comprehension that iterates through both columns and str.joins:
df['b'] = [x.join(y) for x, y in df.values.tolist()]
print(df)
All of these solutions output:
   a          b
0  a  haealalao
1  b    gbobobd
2  c  gcrcecact
3  d    ndidcde
Timing:
Now it's time to move on to timing with the timeit module; here is the code we use:
from timeit import timeit
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})

def u11_1():
    it = iter(df['a'])
    df['b'] = [next(it).join(i) for i in df['b']]

def u11_2():
    df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))

def u11_3():
    df['b'] = [x.join(y) for x, y in df.values.tolist()]

print('Solution 1:', timeit(u11_1, number=5))
print('Solution 2:', timeit(u11_2, number=5))
print('Solution 3:', timeit(u11_3, number=5))
Output:
Solution 1: 0.007374127670871819
Solution 2: 0.05485127553865618
Solution 3: 0.05787154087587698
So the first solution, using an iterator, is the quickest.
I tried achieving the output using df.apply:
>>> df.apply(lambda x: x['a'].join(x['b']), axis=1)
0 haealalao
1 gbobobd
2 gcrcecact
3 ndidcde
dtype: object
Timing it for performance comparison:
from timeit import timeit
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})

def u11_1():
    it = iter(df['a'])
    df['b'] = [next(it).join(i) for i in df['b']]

def u11_2():
    df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))

def u11_3():
    df['b'] = [x.join(y) for x, y in df.values.tolist()]

def u11_4():
    df['c'] = df.apply(lambda x: x['a'].join(x['b']), axis=1)

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 1:', timeit(u11_1, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 2:', timeit(u11_2, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 3:', timeit(u11_3, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 4:', timeit(u11_4, number=5))
Note that I am reinitializing df before every timing call so that all the functions process the same dataframe. It could also be done by passing the df as a parameter to the function; a sketch of that variant follows.
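A minimal sketch of the parameter-passing variant (the copy keeps every run on fresh input, at the cost of the copy itself being included in the timing):

from timeit import timeit
import pandas as pd

def u11_1(frame):
    it = iter(frame['a'])
    frame['b'] = [next(it).join(i) for i in frame['b']]

base = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
# timeit accepts any callable, so a lambda can hand each run a fresh copy
print('Solution 1:', timeit(lambda: u11_1(base.copy()), number=5))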
Here's another solution using zip and a list comprehension. It should be faster than df.apply:
In [1576]: df.b = [i.join(j) for i,j in zip(df.a, df.b)]
In [1578]: df
Out[1578]:
   a          b
0  a  haealalao
1  b    gbobobd
2  c  gcrcecact
3  d    ndidcde
Related
I am grouping and counting a set of data.
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A'],
                   'data': np.ones(3)})
df.groupby('key').count()
outputs
     data
key
A       2
B       1
The code above works, but I wonder if there is a simpler way.
'data': np.ones(3) seems to be only a placeholder, yet indispensable.
pd.DataFrame(['A', 'B', 'A']).groupby(0).count()
outputs only the bare index, with no count column:
A
B
My question is: is there a simpler way to produce the counts of 'A' and 'B' without a placeholder like 'data': np.ones(3)?
It doesn't have to be a pandas method; numpy or native Python functions are also appreciated.
Use a Series instead.
>>> import pandas as pd
>>>
>>> data = ['A', 'A', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'D', 'D']
>>>
>>> pd.Series(data).value_counts()
D 5
A 3
C 2
B 1
dtype: int64
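Since the question says numpy is also welcome, here is a minimal numpy sketch of the same counting, assuming data is a flat list as above:

import numpy as np

data = ['A', 'A', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'D', 'D']
# return_counts=True yields the number of occurrences of each distinct value
values, counts = np.unique(data, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))  # {'A': 3, 'B': 1, 'C': 2, 'D': 5}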
Use a defaultdict:
from collections import defaultdict
data = ['A', 'A', 'B', 'A', 'C', 'C', 'A']
d = defaultdict(int)
for element in data:
    d[element] += 1
d  # output: defaultdict(int, {'A': 4, 'B': 1, 'C': 2})
There isn't any grouping here, just counting, so you can use collections.Counter:
from collections import Counter
Counter(['A', 'B', 'A'])  # Counter({'A': 2, 'B': 1})
All, I have successfully written a list comprehension that tests for non-ASCII characters in a column of a dataframe.
I am now trying to write a nested list comprehension to check all of the columns in the data frame.
I have researched this by searching for nested list comprehensions over dataframes and several other variations, and while the results are close, I can't get them to fit my problem.
Here is my code:
import pandas as pd
import numpy as np
data = {'X1': ['A', 'B', 'C', 'D', 'E'],
        'X2': ['meow', 'bark', 'moo', 'squeak', '120°']}
data2 = {'X1': ['A', 'B', 'F', 'D', 'E'],
         'X3': ['cat', 'dog', 'frog', 'mouse®', 'chick']}
df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
dfAsc = pd.merge(df, df2, how ='inner', on = 'X1')
dfAsc['X2'] = [row.encode('ascii', 'ignore').decode('ascii') for row in dfAsc['X2'] if type(row) is str]
dfAsc
which correctly returns:
  X1      X2      X3
0  A    meow     cat
1  B    bark     dog
2  D  squeak  mouse®
3  E     120   chick
I have tried to create a nested comprehension to check all of the columns instead of just X2. The attempt below tries to create a new df that contains the answer. If this continues to be a source of confusion I will delete it, as it is only one of my attempts to obtain the answer, so please don't get hung up on it:
df3 = pd.DataFrame([dfAsc.loc[idx]
for idx in dfAsc.index
[row.encode('ascii', 'ignore').decode('ascii') for row in
dfAsc[idx] if type(row) is str]
df3
which doesn't work.
I know I'm close, but I'm still having trouble getting my head around comprehensions.
You don't need a list comprehension here; you can use df.applymap directly.
This will be a lot faster than using comprehensions.
data = {'X1': ['A', 'B', 'C', 'D', 'E'],
        'X2': ['meow', 'bark', 'moo', 'squeak', '120°']}
data2 = {'X1': ['A', 'B', 'F', 'D', 'E'],
         'X3': ['cat', 'dog', 'frog', 'mouse®', 'chick']}
df1 = pd.DataFrame(data, index=data['X1'], columns=['X2'])
df2 = pd.DataFrame(data2, index=data2['X1'], columns=['X3'])
dfAsc = pd.merge(df1, df2, how ='inner', left_index=True, right_index=True)
dfAsc = dfAsc.applymap(lambda x: x.encode('ascii', 'ignore').decode('ascii') if isinstance(x, str) else x)
>>> dfAsc
       X2     X3
A    meow    cat
B    bark    dog
D  squeak  mouse
E     120  chick
As a follow-up for comments:
def clean(x):
    try:
        return x.encode('ascii', 'ignore').decode('ascii')
    except AttributeError:
        return x

dfAsc = dfAsc.applymap(clean)
lambda is the usual way of defining your transformation in .apply(), but a named def is often preferred for readability.
As for the type check: all the elements in the dfAsc dataframe are strings, including '120°' and, later, '120':
dfAsc.applymap(lambda x: isinstance(x, str))
#Out[37]:
# X1 X2 X3
#0 True True True
#1 True True True
#2 True True True
#3 True True True
On import with pd.read_csv() the dtype may be specified per column. Some diagnostics can be done with dfAsc.dtypes, and types can be changed with the .astype() method.
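A minimal sketch of the per-column dtype idea, assuming a hypothetical data.csv with these columns:

import pandas as pd

# dtype pins the parser's type per column, so 'X2' stays a string column
df = pd.read_csv('data.csv', dtype={'X1': str, 'X2': str, 'X3': str})
print(df.dtypes)  # object for all three columns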
I have a pandas data frame that looks like:
col11  col12
X      ['A']
Y      ['A', 'B', 'C']
Z      ['C', 'A']
And another one that looks like:
col21  col22
'A'    'alpha'
'B'    'beta'
'C'    'gamma'
I would like to replace the values of col12 based on col22, in an efficient way, and get as a result:
col31  col32
X      ['alpha']
Y      ['alpha', 'beta', 'gamma']
Z      ['gamma', 'alpha']
One solution is to use an indexed series as a mapper with a list comprehension:
import pandas as pd
df1 = pd.DataFrame({'col1': ['X', 'Y', 'Z'],
                    'col2': [['A'], ['A', 'B', 'C'], ['C', 'A']]})
df2 = pd.DataFrame({'col21': ['A', 'B', 'C'],
                    'col22': ['alpha', 'beta', 'gamma']})
s = df2.set_index('col21')['col22']
df1['col2'] = [list(map(s.get, i)) for i in df1['col2']]
Result:
  col1                  col2
0    X               [alpha]
1    Y  [alpha, beta, gamma]
2    Z        [gamma, alpha]
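One point worth knowing about this mapper: Series.get returns None for a key missing from the index, so an unmapped code becomes None in the list instead of raising a KeyError:

s.get('Z') is None  # True: 'Z' does not appear in df2['col21']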
I'm not sure it's the most efficient way, but you can turn your DataFrame into a dict and then use apply to map the keys to the values.
Assuming your first DataFrame is df1 and the second is df2:
df_dict = dict(zip(df2['col21'], df2['col22']))
df3 = pd.DataFrame({"31":df1['col11'], "32": df1['col12'].apply(lambda x: [df_dict[y] for y in x])})
or, as @jezrael suggested, with a nested list comprehension:
df3 = pd.DataFrame({"31":df1['col11'], "32": [[df_dict[y] for y in x] for x in df1['col12']]})
Note: df3 has a default index.
  31                    32
0  X               [alpha]
1  Y  [alpha, beta, gamma]
2  Z        [gamma, alpha]
I have a dataframe containing strings and NaNs. I want to str.lower() certain columns by name, to_lower = ['b', 'd', 'e']. Ideally I could do it with a method on the whole dataframe rather than with a method on df[to_lower]. I have
df[to_lower] = df[to_lower].apply(lambda x: x.astype(str).str.lower())
but I would like a way to do it without assigning to the selected columns.
df = pd.DataFrame({'a': ['A', 'a'], 'b': ['B', 'b']})
to_lower = ['a']
df2 = df.copy()
df2[to_lower] = df2[to_lower].apply(lambda x: x.astype(str).str.lower())
You can use the assign method and unpack the result as keyword arguments; since assign returns a new DataFrame, the original is left untouched:
df = pd.DataFrame({'a': ['A', 'a'], 'b': ['B', 'b'], 'c': ['C', 'c']})
to_lower = ['a', 'b']
df.assign(**df[to_lower].apply(lambda x: x.astype(str).str.lower()))
#   a  b  c
#0  a  b  C
#1  a  b  c
You want this:
for column in to_lower:
    df[column] = df[column].str.lower()
This is far more efficient than a row-wise apply, assuming you have more rows than columns, since str.lower is vectorized within each column.
I'm trying to simplify pandas and python syntax when executing a basic Pandas operation.
I have 4 columns:
a_id
a_score
b_id
b_score
I create a new label called doc_type based on the following:
a >= b, doc_type: a
b > a, doc_type: b
I'm struggling with how to calculate, in pandas, the case where a exists but b doesn't; in that case a needs to be the label. Right now it falls through to the else branch and returns b.
I needed to create two additional comparisons, which at scale may not be efficient since I already compare the data earlier. I'm looking for ways to improve it.
df = pd.DataFrame({
    'a_id': ['A', 'B', 'C', 'D', '', 'F', 'G'],
    'a_score': [1, 2, 3, 4, '', 6, 7],
    'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
    'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, None],
})
print df
# Replace empty string with NaN
m_score = r['a_score'] >= r['b_score']
m_doc = (r['a_id'].isnull() & r['b_id'].isnull())
df = df.apply(lambda x: x.str.strip() if isinstance(x, str) else x).replace('', np.nan)
# Calculate higher score
df['doc_id'] = df.apply(lambda df: df['a_id'] if df['a_score'] >= df['b_score'] else df['b_id'], axis=1)
# Select type based on higher score
r['doc_type'] = numpy.where(m_score, 'a',
                            numpy.where(m_doc, numpy.nan, 'b'))
# Additional lines looking for improvement:
df['doc_type'].loc[(df['a_id'].isnull() & df['b_id'].notnull())] = 'b'
df['doc_type'].loc[(df['a_id'].notnull() & df['b_id'].isnull())] = 'a'
print df
Use numpy.where, assuming your logic is:
Both exist, the doc_type will be the one with higher score;
One missing, the doc_type will be the one not null;
Both missing, the doc_type will be null.
Added an extra edge case at the last line:
import numpy as np
df = df.replace('', np.nan)
df['doc_type'] = np.where(df.b_id.isnull() | (df.a_score >= df.b_score),
                          np.where(df.a_id.isnull(), None, 'a'), 'b')
df
I'm not sure I fully understand all the conditions, or whether there are any particular edge cases, but I think you can just take an np.argmax over the score columns and swap the resulting positions for 'a' or 'b' when you're done:
In [21]: import numpy as np
In [22]: df['doc_type'] = pd.Series(np.argmax(df[["a_score", "b_score"]].values, axis=1)).replace({0: 'a', 1: 'b'})
In [23]: df
Out[23]:
  a_id a_score b_id  b_score doc_type
0    A       1    a     0.10        a
1    B       2    b     0.20        a
2    C       3    c     3.10        b
3    D       4    d     4.10        b
4                 e     5.00        b
5    F       6    f     5.99        a
6    G       7          NaN         a
Use the apply method in pandas with a custom function. Trying it out on your dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'a_id': ['A', 'B', 'C', 'D', '', 'F', 'G'],
    'a_score': [1, 2, 3, 4, '', 6, 7],
    'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
    'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, None],
})
df = df.replace('', np.nan)

def func(row):
    if np.isnan(row.a_score) and np.isnan(row.b_score):
        return np.nan
    elif np.isnan(row.b_score) and not np.isnan(row.a_score):
        return 'a'
    elif not np.isnan(row.b_score) and np.isnan(row.a_score):
        return 'b'  # b exists while a is missing, so the label is b
    elif row.a_score >= row.b_score:
        return 'a'
    elif row.b_score > row.a_score:
        return 'b'

df['doc_type'] = df.apply(func, axis=1)
You can make the function as complicated as you need, include any number of comparisons, and add more conditions later if necessary.