I have a tuple-valued score column, and I'd like to get the row corresponding to its maximum value within each group. A toy example of what I'd like to do:
import pandas as pd
df = pd.DataFrame({'id': ['a', 'a', 'b', 'b'],
                   'score': [(1, 1, 1), (1, 1, 2), (0, 0, 100), (8, 8, 8)],
                   'numeric_score': [1, 2, 3, 4],
                   'value': ['foo', 'bar', 'baz', 'qux']})
# Works, gives correct result:
correct_df = df.loc[df.groupby('id')['numeric_score'].idxmax(), :]
# Fails with a TypeError
goal_df = df.loc[df.groupby('id')['score'].idxmax(), :]
correct_df has the result I'd like to see in goal_df. The goal_df line throws a chain of errors, the core of which seems to be:
TypeError: reduction operation 'argmax' not allowed for this dtype
A working, but ugly solution is:
best_scores = df.groupby('id')['score'].max().reset_index()[['id', 'score']]
goal_df = (pd.merge(df, best_scores, on=['id', 'score'])
           .groupby(['id'])
           .first()
           .reset_index())
Is there a slick version of this?
I understand your question to be:
"NumPy's .argmax() does not work for tuples. For a Series of tuples, how do I determine the index for the maximum valued tuple?"
IIUC, this will return the desired outcome:
df.loc[df.score == df.score.max()]
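If what you actually want is the per-id maximum (the correct_df behavior), here is a hedged sketch of the same boolean-mask idea applied per group. It assumes pandas can compare your tuples, which the merge-based workaround above already relies on (Python tuples order lexicographically):
import pandas as pd

df = pd.DataFrame({'id': ['a', 'a', 'b', 'b'],
                   'score': [(1, 1, 1), (1, 1, 2), (0, 0, 100), (8, 8, 8)],
                   'value': ['foo', 'bar', 'baz', 'qux']})
# Keep each row whose score equals its group's maximum score.
goal_df = df.loc[df['score'] == df.groupby('id')['score'].transform('max')]
# id 'a' -> (1, 1, 2) 'bar'; id 'b' -> (8, 8, 8) 'qux'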
Related
I want to fill missing values like this:
data = pd.read_csv("E:\\SPEED.csv")
Data - [DataFrame screenshot omitted]
Case - 1
If fclass is "motorway", "motorway_link", "trunk" or "trunk_link", I want to replace the NaN values with 110.
Case - 2
If fclass is "primary", "primary_link", "secondary" or "secondary_link", I want to replace the NaN values with 70.
Case - 3
If fclass is any other value, I want to change the NaN values to 40.
I would be grateful for any help.
Two ways in pandas:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, np.nan, 4],
        "B": [1, 4, 9, np.nan],
        "C": [1, 2, 3, 5],
        "D": list("abcd"),
    }
)
fillna lets you fill NAs (NaNs) with a fixed value:
df['B'].fillna(12)
# values: [1.0, 4.0, 9.0, 12.0]
interpolate fills NaNs by interpolation, linear by default (the non-linear methods use scipy's interpolation routines):
df = df.interpolate()
df['A']
# values: [1.0, 2.0, 3.0, 4.0]
Thank you all for your answers. However, as there are 6812 rows and 16 columns (containing nan values) in the data, it seems that different solutions are required.
You can try this:
import pandas as pd
import math

def valuesMapper(data, valuesDict, columns_to_update):
    # Assumes the CSV has an 'fclass' column, as described in the question.
    for i in columns_to_update:
        # Fill each NaN with the speed mapped from that row's fclass;
        # anything not in the mapping falls back to 40 (case 3).
        data[i] = data.apply(
            lambda row: valuesDict.get(row['fclass'], 40)
            if isinstance(row[i], float) and math.isnan(row[i])
            else row[i],
            axis=1)
    return data

data = pd.read_csv("E:\\SPEED.csv")
valuesDict = {"motorway": 110, "motorway_link": 110, "trunk": 110, "trunk_link": 110,
              "primary": 70, "primary_link": 70, "secondary": 70, "secondary_link": 70}
columns_to_update = ['AGU_PZR_07_10']  # the list of columns to be updated; you can build it programmatically, but it's hard-coded here since I don't have your data
print(valuesMapper(data, valuesDict, columns_to_update))
With the below example:
import pandas as pd

data = pd.DataFrame({
    'fclass': ['a', 'b', 'c', 'a'],
    'AGU': [float('nan'), float('nan'), float('nan'), 9]
})
You can update it using numpy conditionals, iterating over your columns starting from the 2nd ([1:]) or the 5th ([4:]), depending on where the speed columns start in your data:
import numpy as np

for column in data.columns[1:]:
    data[column] = np.where((data['fclass'] == 'b') & (data[column].isna()), 110, data[column])
Or pandas apply:
data['AGU'] = data.apply(
    lambda row: 110 if np.isnan(row['AGU']) and row['fclass'] in ("b", "a") else row['AGU'],
    axis=1,
)
where you can replace ("b", "a") with e.g. ("motorway", "motorway_link", "trunk", "trunk_link")
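For completeness, a compact sketch that handles all three cases at once; it assumes the speed columns are every column except fclass and that the missing entries are real NaN values rather than the string "nan":
import pandas as pd

data = pd.read_csv("E:\\SPEED.csv")
speed_by_class = {"motorway": 110, "motorway_link": 110, "trunk": 110, "trunk_link": 110,
                  "primary": 70, "primary_link": 70, "secondary": 70, "secondary_link": 70}
# Per-row fill value: mapped speed for cases 1 and 2, default 40 for case 3.
fill = data['fclass'].map(speed_by_class).fillna(40)
for col in data.columns.drop('fclass'):
    data[col] = data[col].fillna(fill)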
I have a list/tuple containing uneven sublists (query result from SQLAlchemy), which looks like this:
x = [('x', [1,2], [3,4])]
I want to unnest/explode x as the following:
x = [('x',[1],[3]),('x',[2],[4])]
Or
x = [('x',1,3),('x',2,4)]
I can achieve this using pandas dataframes with the following:
df = pd.DataFrame(x, columns=['X','A','B'])
df = df.apply(lambda x: x.explode() if x.name in ['A', 'B'] else x)
print([tuple(i) for i in df.values.tolist()])
which generates the following output:
[('x', 1, 3), ('x', 2, 4)]
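Incidentally, on pandas 1.3 or newer, DataFrame.explode accepts a list of columns, so the apply step can be dropped; a minimal sketch of the same pipeline:
df = pd.DataFrame(x, columns=['X', 'A', 'B'])
print([tuple(i) for i in df.explode(['A', 'B']).values.tolist()])
# [('x', 1, 3), ('x', 2, 4)]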
However, I would love to know if there is any pure-Python solution possible. I have been playing around with list comprehensions based on the following answer, with no luck.
[item for sublist in x for item in sublist]
Any help would be appreciated.
Edit:
My input looks like this:
[(u'x.x#gmail.com', u'Contact Info Refused/Not provided was Documented', 0L,
[None, None, None], [1447748, 1447751, 1447750], 3L, [1491930], 'nce', 1, 2037)]
Expected output:
Just unpack the two sublists and keep everything else the same.
[(u'x.x#gmail.com', u'Contact Info Refused/Not provided was Documented', 0L, None, 1447748, 3L, [1491930], 'nce', 1, 2037),
 (u'x.x#gmail.com', u'Contact Info Refused/Not provided was Documented', 0L, None, 1447751, 3L, [1491930], 'nce', 1, 2037),
 (u'x.x#gmail.com', u'Contact Info Refused/Not provided was Documented', 0L, None, 1447750, 3L, [1491930], 'nce', 1, 2037)]
From itertools:
import itertools
list(itertools.zip_longest(*x[0],fillvalue=x[0][0]))
Out[25]: [('x', 1, 3), ('x', 2, 4)]
# for a list of many rows: [list(itertools.zip_longest(*row, fillvalue=row[0])) for row in rows]
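If you need pure Python for the general case (including the edited input), one approach is a small helper that explodes only the list-valued positions of equal length and repeats everything else unchanged. The explode_row name and the hard-coded position sets below are illustrative, not from the original post:
def explode_row(row, explode_idx):
    # Number of output tuples = length of the (equal-length) exploded fields.
    n = len(row[next(iter(explode_idx))])
    return [tuple(row[j][i] if j in explode_idx else row[j] for j in range(len(row)))
            for i in range(n)]

x = [('x', [1, 2], [3, 4])]
print([t for row in x for t in explode_row(row, {1, 2})])
# [('x', 1, 3), ('x', 2, 4)]
For the edited input, the two fields to unpack sit at positions 3 and 4, so explode_row(row, {3, 4}) yields the three expected tuples while leaving [1491930] intact.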
Here is my dataframe
ord_datetime
2019-05-01 22.483871
2019-05-02 27.228070
2019-05-03 30.140625
2019-05-04 32.581633
2019-05-05 30.259259
If I run code like this:
b=[]
b.append((df.iloc[2]-df.iloc[1])/(df.iloc[1]))
print(b)
output is
[Ordered Items 0.106969
dtype: float64]
I want the output to be just 0.106969.
How can I do that?
You are working with Series here, which is why you get this result.
Your iloc call returns a Series of one element, and the arithmetic operators also return a Series.
If you want to get the scalar value, you can simply use my_series.iloc[0].
So for your example:
from datetime import datetime
import pandas as pd

data = {datetime(2019, 5, 1): 22.483871, datetime(2019, 5, 2): 27.228070,
        datetime(2019, 5, 3): 30.140625, datetime(2019, 5, 4): 32.581633,
        datetime(2019, 5, 5): 30.259259}
df = pd.DataFrame.from_dict(data, orient="index")
series_result = (df.iloc[2] - df.iloc[1]) / df.iloc[1]
scalar_result = series_result.iloc[0]
# you can append the result to your list if you want
You can do something like the following
import pandas as pd
data = {
    "ord_datetime": ["2019-05-01", "2019-05-02", "2019-05-03", "2019-05-04", "2019-05-05"],
    "value": [22.483871, 27.228070, 30.140625, 32.581633, 30.259259]
}
df = pd.DataFrame(data=data)
res = [(df.iloc[ridx + 1, 1] - df.iloc[ridx, 1]) / df.iloc[ridx, 1] for ridx in range(df.shape[0] - 1)]
res # [0.2110045463256749, 0.10696883767376833, 0.08098730533955406, -0.0712786249848188]
Hope it helps.
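As a side note, pandas' built-in pct_change computes exactly these step-over-step changes; a minimal sketch on the df defined above:
res = df['value'].pct_change().dropna().tolist()
# [0.2110045463256749, 0.10696883767376833, 0.08098730533955406, -0.0712786249848188]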
If you want to just get the values from the output, you can use df.values, which returns a NumPy array. If you want a list from that NumPy array, you can then use np_array.tolist().
So
b = ((df.iloc[2] - df.iloc[1]) / df.iloc[1]).values  # returns a numpy array
b.tolist()  # returns a python list
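Alternatively, when the Series is known to hold exactly one element, Series.item() returns it directly as a Python scalar:
b = ((df.iloc[2] - df.iloc[1]) / df.iloc[1]).item()  # 0.106969...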
I do a split-apply-merge type of workflow with pandas, where the 'apply' part returns a DataFrame. When the DataFrame I run groupby on has first been sorted, simply returning a DataFrame from apply raises ValueError: cannot reindex from a duplicate axis. However, I have found it to work properly when I return pd.concat([df]) (instead of just return df). If I don't sort the DataFrame, both ways of merging the results work correctly. I expect sorting must be doing something to the index, yet I don't understand what. Can someone please explain?
import pandas as pd
import numpy as np
def fill_out_ids(df, filling_function, sort=False, sort_col='sort_col',
                 group_by='group_col', to_fill=['id1', 'id2']):
    df = df.copy()
    df.set_index(group_by, inplace=True)
    if sort:
        df.sort_values(by=sort_col, inplace=True)
    g = df.groupby(df.index, sort=False, group_keys=False)
    df = g.apply(filling_function, to_fill)
    df.reset_index(inplace=True)
    return df

def _fill_ids_concat(df, to_fill):
    df[to_fill] = df[to_fill].fillna(method='ffill')
    df[to_fill] = df[to_fill].fillna(method='bfill')
    return pd.concat([df])

def _fill_ids_plain(df, to_fill):
    df[to_fill] = df[to_fill].fillna(method='ffill')
    df[to_fill] = df[to_fill].fillna(method='bfill')
    return df

def test_fill_out_ids():
    input_df = pd.DataFrame(
        [
            ['a', None, 1.0, 1],
            ['a', None, 1.0, 3],
            ['a', 'name1', np.nan, 2],
            ['b', None, 2.0, 3],
            ['b', 'name1', np.nan, 2],
            ['b', 'name2', np.nan, 1],
        ],
        columns=['group_col', 'id1', 'id2', 'sort_col']
    )
    # this works
    fill_out_ids(input_df, _fill_ids_plain, sort=False)
    # this raises: ValueError: cannot reindex from a duplicate axis
    fill_out_ids(input_df, _fill_ids_plain, sort=True)
    # this works
    fill_out_ids(input_df, _fill_ids_concat, sort=True)
    # this works
    fill_out_ids(input_df, _fill_ids_concat, sort=False)

if __name__ == "__main__":
    test_fill_out_ids()
I need to be able to rank an array based on a single column, then rank it again using a second column as a tie-breaker, and then save those two ranks into the database.
Array:
import numpy as np

array = np.array(
    [(70, 3, 100),
     (72, 3, 101),
     (70, 2, 102)],
    dtype=[
        ('score', 'int8'),
        ('tiebreaker', 'int8'),
        ('row_id', 'int8')])
# array['score'] -> array([70, 72, 70], dtype=int8)
The first ranking, using only the 'score' column, would return:
(1, 3, 1)
The second ranking, using the 'score' and 'tiebreaker' columns, would return:
(2, 3, 1)
Then I want to save those two ranks to the database, for example:
result1 = Result.objects.get(id=array[0]['row_id'])
result1.relative_rank = 1
result1.absolute_rank = 2
result1.save()
You can use scipy.stats.rankdata, as follows:
In [10]: a
Out[10]:
array([(70, 3, 100), (72, 3, 101), (70, 2, 102)],
dtype=[('score', 'i1'), ('tiebreaker', 'i1'), ('row_id', 'i1')])
In [11]: from scipy.stats import rankdata
First rank:
In [12]: rankdata(a['score'], method='min').astype(int)
Out[12]: array([1, 3, 1])
Second rank:
In [13]: rankdata(256*a['score'] + a['tiebreaker'], method='min').astype(int)
Out[13]: array([2, 3, 1])
The value used in the second rank (256*a['score'] + a['tiebreaker']) packs both fields into a single integer key; the factor 256 suffices because the data has type int8, so the tiebreaker can never span more than 256 values. Note that on recent NumPy (2.x, with NEP 50 promotion rules) multiplying an int8 array by the Python scalar 256 can overflow, so cast first: 256*a['score'].astype(int) + a['tiebreaker'].
Check the docstring to see if a different method would be more appropriate for the second rank. If you know there will be no ties in the second rank, the method doesn't matter.
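If you'd rather not pack both fields into one integer key, a hedged alternative is numpy.lexsort, which sorts lexicographically (keys are listed least-significant first). This sketch assumes ordinal ranks are acceptable, i.e. that no ties remain once the tiebreaker is applied:
import numpy as np

a = np.array([(70, 3, 100), (72, 3, 101), (70, 2, 102)],
             dtype=[('score', 'i1'), ('tiebreaker', 'i1'), ('row_id', 'i1')])
order = np.lexsort((a['tiebreaker'], a['score']))  # sort by score, then tiebreaker
second_rank = np.empty(len(a), dtype=int)
second_rank[order] = np.arange(1, len(a) + 1)
# second_rank -> array([2, 3, 1])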