I want the most common letter for each number. I've tried a variety of things; not sure what's the right way.
import pandas as pd
from pandas import DataFrame, Series
original = DataFrame({
    'letter': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B'},
    'number': {0: '01', 1: '01', 2: '02', 3: '02', 4: '02'}
})
expected = DataFrame({'most_common_letter': {'01': 'A', '02': 'B'}})
Ideally I'm looking to maximize readability.
We can use the DataFrame.mode() method:
In [43]: original.groupby('number')[['letter']] \
            .apply(lambda x: x.mode()) \
            .reset_index(level=1, drop=True)
Out[43]:
       letter
number
01          A
02          B
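To match the expected frame exactly, you could additionally rename the column; a small sketch of the same chain:
out = (original.groupby('number')[['letter']]
               .apply(lambda x: x.mode())
               .reset_index(level=1, drop=True)
               .rename(columns={'letter': 'most_common_letter'}))
print(out)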
Use groupby + apply with value_counts and take the first index value, because value_counts sorts by count in descending order.
Then convert the Series with to_frame and remove the index name with rename_axis:
df = original.groupby('number')['letter'] \
             .apply(lambda x: x.value_counts().index[0]) \
             .to_frame('most_common_letter') \
             .rename_axis(None)
print (df)
   most_common_letter
01                  A
02                  B
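The reason index[0] works: value_counts sorts by count in descending order, so the first index label is the most frequent value. A quick illustration:
s = pd.Series(['B', 'A', 'B'])
print(s.value_counts().index[0])  # 'B', because B has the highest count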
Similar solution:
from collections import Counter
df = original.groupby('number')['letter'] \
             .apply(lambda x: Counter(x).most_common(1)[0][0]) \
             .to_frame('most_common_letter') \
             .rename_axis(None)
print (df)
   most_common_letter
01                  A
02                  B
Or use Series.mode:
df = original.groupby('number')['letter'] \
             .apply(lambda x: x.mode()[0]) \
             .to_frame('most_common_letter') \
             .rename_axis(None)
print (df)
   most_common_letter
01                  A
02                  B
>>> df = pd.DataFrame({
...     'letter': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B'},
...     'number': {0: '01', 1: '01', 2: '02', 3: '02', 4: '02'}})
>>> df['most_common_letter'] = df.groupby('number')['letter'].transform(max)
>>> df = df.iloc[:,1:].drop_duplicates().set_index('number')
>>> df.index.name = None
>>> df
   most_common_letter
01                  A
02                  B
Or this way if it helps readability:
>>> df['most_common_letter'] = df.groupby('number')['letter'].transform(max)
>>> df = df.drop('letter', axis=1).drop_duplicates().set_index('number').rename_axis(None)
>>> df
   most_common_letter
01                  A
02                  B
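Note that transform(max) picks the alphabetically largest letter per group, which happens to coincide with the most common one in this sample. For the general case, a hedged sketch that swaps in the mode (in place of the transform(max) line above):
df['most_common_letter'] = df.groupby('number')['letter'].transform(lambda x: x.mode()[0])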
I have the following df:
import pandas as pd
d = {
    'Group': ['Month', 'Sport'],
    'col1': [1, 'A'],
    'col2': [4, 'B'],
    'col3': [9, 'C']
}
df = pd.DataFrame(d)
I would like to convert all of the values in row index[0] excluding 'Month' to actual months. I've tried the following:
import datetime as dt
m_lst = []
for i in df.iloc[0]:
    if type(i) != str:
        x = dt.date(1900, i, 1).strftime('%B')
        m_lst.append(x)
df.iloc[0][1:] = m_lst #(doesn't work)
So the for loop creates a list of months that correlate to the value in the dataframe. I just can't seem to figure out how to replace the original values with the values from the list. If there's an easier way of doing this, that would be great as well.
You can convert those values to datetime using pandas.to_datetime and then use the Series.dt.month_name() method:
import pandas as pd
d = {
    'Group': ['Month', 'Sport'],
    'col1': [1, 'A'],
    'col2': [4, 'B'],
    'col3': [9, 'C']
}
df = pd.DataFrame(d)
df.iloc[0, 1:] = pd.to_datetime(df.iloc[0, 1:], format='%m').dt.month_name()
Output:
>>> df
   Group     col1   col2       col3
0  Month  January  April  September
1  Sport        A      B          C
Assuming your month numbers are always in the same position, row 0, I'd use iloc and apply a lambda like this:
import datetime as dt
import pandas as pd
def month_number_to_str(m: int):
    return dt.datetime.strptime(str(m), '%m').strftime('%B')
d = {
    'Group': ['Month', 'Sport'],
    'col1': [1, 'A'],
    'col2': [4, 'B'],
    'col3': [9, 'C']
}
df = pd.DataFrame(d)
df.iloc[0, 1:] = df.iloc[0, 1:].apply(lambda x: month_number_to_str(x))
print(df)
Output:
   Group     col1   col2       col3
0  Month  January  April  September
1  Sport        A      B          C
Another way is to use Series.map.
It can translate values for you, e.g., based on a dictionary like this
(where you get it is up to you):
months = {1: 'January',
          2: 'February',
          3: 'March',
          4: 'April',
          5: 'May',
          6: 'June',
          7: 'July',
          8: 'August',
          9: 'September',
          10: 'October',
          11: 'November',
          12: 'December'}
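For instance, one way to build such a dictionary is with the standard calendar module; a small sketch:
import calendar

# calendar.month_name[0] is an empty string, so start at 1
months = {i: calendar.month_name[i] for i in range(1, 13)}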
Then it's just a matter of selecting the right part of df and mapping the values:
>>> df.iloc[0, 1:] = df.iloc[0, 1:].map(months)
>>> df
   Group     col1   col2       col3
0  Month  January  April  September
1  Sport        A      B          C
I just came across this question: how do I use str.join with one column to join the other? Here is my DataFrame:
>>> df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
>>> df
   a      b
0  a  hello
1  b   good
2  c  great
3  d   nice
I would like the a column to join the values in the b column, so my desired output is:
   a          b
0  a  haealalao
1  b    gbobobd
2  c  gcrcecact
3  d    ndidcde
How would I go about that?
Hopefully you can see the correspondence; here is one example with the first row that you can run in plain Python:
>>> 'a'.join('hello')
'haealalao'
Just like in the desired output.
I think it might be useful to know how two columns can interact. join may not be the best example, but there are other operations you could use the same way: for instance, using one column with split to split the other, or replacing characters in the other column with something else.
P.S. I have a self-answer below.
TL;DR
The below code is the fastest answer I could figure out from this question:
it = iter(df['a'])
df['b'] = [next(it).join(i) for i in df['b']]
The above code first creates an iterator over the a column; next then fetches the following value each time, and the list comprehension joins the two strings.
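In case the iterator part is unclear, iter and next on their own behave like this:
it = iter(['a', 'b'])
print(next(it))  # 'a'
print(next(it))  # 'b'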
Long answer:
Going to show my solutions:
Solution 1:
To use a list comprehension and a generator:
it = iter(df['a'])
df['b'] = [next(it).join(i) for i in df['b']]
print(df)
Solution 2:
Group by the index, then apply and str.join the two columns' values:
df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))
print(df)
Solution 3:
Use a list comprehension that iterates through both columns and str.joins:
df['b'] = [x.join(y) for x, y in df.values.tolist()]
print(df)
These codes all output:
   a          b
0  a  haealalao
1  b    gbobobd
2  c  gcrcecact
3  d    ndidcde
Timing:
Now it's time to move on to timing with the timeit module; here is the code we use:
from timeit import timeit
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
def u11_1():
    it = iter(df['a'])
    df['b'] = [next(it).join(i) for i in df['b']]

def u11_2():
    df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))

def u11_3():
    df['b'] = [x.join(y) for x, y in df.values.tolist()]
print('Solution 1:', timeit(u11_1, number=5))
print('Solution 2:', timeit(u11_2, number=5))
print('Solution 3:', timeit(u11_3, number=5))
Output:
Solution 1: 0.007374127670871819
Solution 2: 0.05485127553865618
Solution 3: 0.05787154087587698
So the first solution is the quickest, using a generator.
I tried achieving the output using df.apply:
>>> df.apply(lambda x: x['a'].join(x['b']), axis=1)
0    haealalao
1      gbobobd
2    gcrcecact
3      ndidcde
dtype: object
Timing it for performance comparison:
from timeit import timeit
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
def u11_1():
    it = iter(df['a'])
    df['b'] = [next(it).join(i) for i in df['b']]

def u11_2():
    df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))

def u11_3():
    df['b'] = [x.join(y) for x, y in df.values.tolist()]

def u11_4():
    df['c'] = df.apply(lambda x: x['a'].join(x['b']), axis=1)
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 1:', timeit(u11_1, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 2:', timeit(u11_2, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 3:', timeit(u11_3, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 4:', timeit(u11_4, number=5))
Note that I am reinitializing df before every line so that all the functions process the same dataframe. It can also be done by passing the df as a parameter to the function.
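A sketch of that parameter-passing variant (the name base and the copy-per-call pattern are just one way to do it):
from timeit import timeit
import pandas as pd

base = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})

def u11_4(df):
    df['c'] = df.apply(lambda x: x['a'].join(x['b']), axis=1)

# base.copy() hands each timed call a fresh frame, so no run sees another run's mutations
print('Solution 4:', timeit(lambda: u11_4(base.copy()), number=5))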
Here's another solution using zip and a list comprehension; it should be faster than df.apply:
In [1576]: df.b = [i.join(j) for i, j in zip(df.a, df.b)]
In [1578]: df
Out[1578]:
   a          b
0  a  haealalao
1  b    gbobobd
2  c  gcrcecact
3  d    ndidcde
I want to delete the last character if it is a number.
From the current dataframe:
import numpy as np
import pandas as pd

data = {'d': ['AAA2', 'BB 2', 'C', 'DDD ', 'EEEEEEE)', 'FFF ()', np.nan, '123456']}
df = pd.DataFrame(data)
to a new dataframe:
data = {'d': ['AAA2', 'BB 2', 'C', 'DDD ', 'EEEEEEE)', 'FFF ()', np.nan, '123456'],
        'expected': ['AAA', 'BB', 'C', 'DDD', 'EEEEEEE)', 'FFF (', np.nan, '12345']}
df = pd.DataFrame(data)
df
Using .str.replace:
df['d'] = df['d'].str.replace(r'(\d)$', '', regex=True)
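A minimal runnable sketch on the sample data; note that .str.replace leaves the NaN row untouched:
import numpy as np
import pandas as pd

df = pd.DataFrame({'d': ['AAA2', 'BB 2', 'C', 'DDD ', 'EEEEEEE)', 'FFF ()', np.nan, '123456']})
df['d'] = df['d'].str.replace(r'(\d)$', '', regex=True)
print(df)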
I am trying the following:
import pandas as pd
df = pd.DataFrame({'Col1': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'B'},
                   'Col2': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c'},
                   'Col3': {0: 42, 1: 28, 2: 56, 3: 62, 4: 48}})
ii = 1
for idx, row in df.iterrows():
    print(row)
    df.loc[:, 'Col2'] = 'asd{}'.format(ii)
    ii += 1
But the print statement above doesn't reflect the change made by df.loc[:, 'Col2'] = 'asd{}'.format(ii). I need the print statements to reflect that change.
Edit: since I am updating all rows of df, I was expecting idx and row to pick up the new values from the dataframe.
If this is not the right way to get updated values from df through idx and row, then what is the correct approach? I need idx and row to reflect the new values.
Expected output:
Col1     A
Col2     a
Col3    42
Name: 0, dtype: object
Col1     A
Col2  asd1
Col3    28
Name: 1, dtype: object
Col1     B
Col2  asd2
Col3    56
.....
From the iterrows documentation:
You should never modify something you are iterating over. This is not
guaranteed to work in all cases. Depending on the data types, the
iterator returns a copy and not a view, and writing to it will have no
effect.
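A quick illustration of that caveat (a sketch; as the docs say, the exact behavior depends on dtypes):
import pandas as pd

df = pd.DataFrame({'Col2': ['a', 'b']})
for idx, row in df.iterrows():
    row['Col2'] = 'changed'  # writes to a copy of the row, not to df

print(df['Col2'].tolist())  # ['a', 'b'], the frame is unchanged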
As per your request for an alternative solution, here is one using DataFrame.apply:
df['Col2'] = df.apply(lambda row: 'asd{}'.format(row.name), axis=1)
Other examples (also using Series.apply) that may be useful for your eventual goal (it's not yet clear what that is):
df['Col2'] = df['Col2'].apply(lambda x: 'asd{}'.format(x))
df['Col2'] = df.apply(lambda row: 'asd{}'.format(row['Col3']), axis=1)
Here is something you can try:
import pandas as pd
df = pd.DataFrame({'Col1': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'B'},
                   'Col2': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c'},
                   'Col3': {0: 42, 1: 28, 2: 56, 3: 62, 4: 48}})
print(
    df.assign(idx=df.index)[['idx', 'Col2']]
      .apply(lambda x: x['Col2'] if x['idx'] == 0 else f"asd{x['idx']}", axis=1)
)
0       a
1    asd1
2    asd2
3    asd3
4    asd4
dtype: object
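For comparison, a hedged vectorized sketch that builds the labels straight from the index and keeps the original value in row 0 (the name out is arbitrary):
out = pd.Series(['asd{}'.format(i) for i in df.index], index=df.index)
out.iloc[0] = df['Col2'].iloc[0]
print(out)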
I'm trying to simplify pandas and python syntax when executing a basic Pandas operation.
I have 4 columns:
a_id
a_score
b_id
b_score
I create a new label called doc_type based on the following:
a >= b, doc_type: a
b > a, doc_type: b
I'm struggling with how to handle, in pandas, the case where a exists but b doesn't; in that case a needs to be the label. Right now it falls through to the else branch and returns b.
I needed to create two additional comparisons, which at scale may be inefficient since I already compare the data earlier. I'm looking for ways to improve it.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a_id': ['A', 'B', 'C', 'D', '', 'F', 'G'],
    'a_score': [1, 2, 3, 4, '', 6, 7],
    'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
    'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, None],
})
print(df)

# Replace empty strings with NaN
df = df.replace('', np.nan)

m_score = df['a_score'] >= df['b_score']
m_doc = df['a_id'].isnull() & df['b_id'].isnull()

# Calculate higher score
df['doc_id'] = df.apply(lambda row: row['a_id'] if row['a_score'] >= row['b_score'] else row['b_id'], axis=1)

# Select type based on higher score
df['doc_type'] = np.where(m_score, 'a', np.where(m_doc, np.nan, 'b'))

# Additional lines looking for improvement:
df.loc[df['a_id'].isnull() & df['b_id'].notnull(), 'doc_type'] = 'b'
df.loc[df['a_id'].notnull() & df['b_id'].isnull(), 'doc_type'] = 'a'
print(df)
Use numpy.where, assuming your logic is:
Both exist, the doc_type will be the one with higher score;
One missing, the doc_type will be the one not null;
Both missing, the doc_type will be null;
Added an extra edge case at the last line:
import numpy as np
df = df.replace('', np.nan)
df['doc_type'] = np.where(df.b_id.isnull() | (df.a_score >= df.b_score),
                          np.where(df.a_id.isnull(), None, 'a'), 'b')
df
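On the sample frame this should yield the following (derived by hand from the three rules above):
print(df['doc_type'].tolist())
# ['a', 'a', 'b', 'b', 'b', 'a', 'a']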
Not sure I fully understand all conditions or if this has any particular edge cases, but I think you can just do an np.argmax on the columns and swap the values for 'a' or 'b' when you're done:
In [21]: import numpy as np
In [22]: df['doc_type'] = pd.Series(np.argmax(df[["a_score", "b_score"]].values, axis=1)).replace({0: 'a', 1: 'b'})
In [23]: df
Out[23]:
  a_id a_score b_id  b_score doc_type
0    A       1    a     0.10        a
1    B       2    b     0.20        a
2    C       3    c     3.10        b
3    D       4    d     4.10        b
4                 e     5.00        b
5    F       6    f     5.99        a
6    G       7          NaN        a
Use the apply method in pandas with a custom function, tried out here on your dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'a_id': ['A', 'B', 'C', 'D', '', 'F', 'G'],
    'a_score': [1, 2, 3, 4, '', 6, 7],
    'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
    'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, None],
})
df = df.replace('', np.nan)

def func(row):
    if np.isnan(row.a_score) and np.isnan(row.b_score):
        return np.nan
    elif np.isnan(row.b_score) and not np.isnan(row.a_score):
        return 'a'
    elif not np.isnan(row.b_score) and np.isnan(row.a_score):
        return 'b'
    elif row.a_score >= row.b_score:
        return 'a'
    elif row.b_score > row.a_score:
        return 'b'

df['doc_type'] = df.apply(func, axis=1)
You can make the function as complicated as you need, include any number of comparisons, and add more conditions later if necessary.