An example dataset I'm working with:
df = pd.DataFrame({"competitorname": ["3 Musketeers", "Almond Joy"], "winpercent": [67.602936, 50.347546] }, index = [1, 2])
I am trying to see whether 3 Musketeers or Almond Joy has a higher winpercent. The code I wrote is:
more_popular = '3 Musketeers' if df.loc[df["competitorname"] == '3 Musketeers', 'winpercent'].values[0] > df.loc[df["competitorname"] == 'Almond Joy', 'winpercent'].values[0] else 'Almond Joy'
My question is:
Can I select the values I am interested in without pandas returning a Series? Is there a way to just do
df[df["competitorname"] == 'Almond Joy', 'winpercent']
and then it would return a simple
50.347546
?
I know this doesn't make my code significantly shorter but I feel like I am missing something about getting values from pandas that would help me avoid constantly adding
.values[0]
The underlying issue is that there could be multiple matches, so we will always need to extract the match(es) at some point in the pipeline:
Use Series.idxmax on the boolean mask
Since False is 0 and True is 1, using Series.idxmax on the boolean mask will give you the index of the first True:
df.loc[df['competitorname'].eq('Almond Joy').idxmax(), 'winpercent']
# 50.347546
This assumes there is at least 1 True match, otherwise it will return the first False.
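If you want to guard the no-match case explicitly, a minimal sketch (the mask.any() check is my addition, not part of the answer above):
mask = df['competitorname'].eq('Almond Joy')
value = df.loc[mask.idxmax(), 'winpercent'] if mask.any() else None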
Or use Series.item on the result
This is essentially Series.values[0] plus a check that there is exactly one value:
df.loc[df['competitorname'].eq('Almond Joy'), 'winpercent'].item()
# 50.347546
This assumes there is exactly 1 True match, otherwise it will throw a ValueError.
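If the number of matches is not guaranteed, a hedged way to keep the .item() convenience is to catch the error (a sketch, not part of the answer above):
matches = df.loc[df['competitorname'].eq('Almond Joy'), 'winpercent']
try:
    value = matches.item()   # exactly one match
except ValueError:
    value = None             # zero or multiple matches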
How about simply sorting the dataframe by "winpercent" and then taking the top row?
df.sort_values(by="winpercent", ascending=False, inplace=True)
then, to see the winner's row:
df.head(1)
or, to get the value itself:
df.iloc[0]["winpercent"]
If you're sure that the returned Series has a single element, you can simply use .item() to get it:
import pandas as pd
df = pd.DataFrame({
"competitorname": ["3 Musketeers", "Almond Joy"],
"winpercent": [67.602936, 50.347546]
}, index = [1, 2])
s = df.loc[df["competitorname"] == 'Almond Joy', 'winpercent'] # a pandas Series
print(s)
# output
# 2 50.347546
# Name: winpercent, dtype: float64
v = df.loc[df["competitorname"] == 'Almond Joy', 'winpercent'].item() # a scalar value
print(v)
# output
# 50.347546
I have a dataframe that looks like this
dict = {'trade_date': {1350: 20151201,
6175: 20151201,
3100: 20151201,
5650: 20151201,
3575: 20151201,
1: 20170301,
2: 20170301},
'comId': {1350: '257762',
6175: '1038328',
3100: '315476',
5650: '658776',
3575: '329376',
1: '123456',
2: '987654'},
'return': {1350: -0.0018,
6175: 0.0023,
3100: -0.0413,
5650: 0.1266,
3575: 0.0221,
1: '0.9',
2: '0.01'}}
df = pd.DataFrame(dict)
the expected output should be like this:
dict2 = {'trade_date': {5650: 20151201,
1: 20170301},
'comId': {5650: '658776',
1: '123456'},
'return': {5650: 0.1266,
1: '0.9'}}
I need to filter it based on the following condition: for each trade_date value, I want to keep only the top 20% of entries, based on the value in column return. So for this example it would keep only comId 658776 (return 0.1266) for 20151201 and comId 123456 (return 0.9) for 20170301, as shown in the expected output above.
Bear in mind there might be trade_dates with more companies associated with them. In that case the 20% count should be rounded up or down to the nearest integer. For example, if there are 9 companies associated with a date, 20% * 9 = 1.8, so it should only keep the top two based on the values in column return.
Any ideas on how best to approach this? I'm a bit lost.
I think this should work:
df\
.groupby("trade_date")\
.apply(lambda x: x[x["return"] >
x["return"].quantile(0.8, interpolation="nearest")])\
.reset_index(drop=True)
You can use groupby().transform to get the threshold for each row. This would be a bit faster than groupby().apply:
thresholds = df.groupby('trade_date')['return'].transform('quantile',q=.8)
df[df['return'] > thresholds]
Output:
trade_date comId return
5650 20151201 658776 0.1266
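If you need the literal "round 20% of the group size to the nearest integer" behaviour from the question, a sketch along these lines might work (coercing return to numeric and the max(1, ...) floor are my assumptions):
df['return'] = pd.to_numeric(df['return'])
top = (df.groupby('trade_date', group_keys=False)
         .apply(lambda g: g.nlargest(max(1, round(0.2 * len(g))), 'return')))
On the sample data this keeps the 0.1266 row for 20151201 and the 0.9 row for 20170301.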
Create a temporary variable storing only the rows with the same trade_date. Then use this:
df.sort_values(by='return', ascending=False)
and then remove the bottom 80%.
Loop through all the dates and, every time you have the top 20% for a date, append those rows to a new dataframe.
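A minimal sketch of that loop-and-append idea (the 20% rounding and the final concat are my assumptions about the details):
kept = []
for date in df['trade_date'].unique():
    day = df[df['trade_date'] == date].sort_values(by='return', ascending=False)
    n_keep = max(1, round(0.2 * len(day)))   # 20%, rounded to the nearest integer
    kept.append(day.head(n_keep))
result = pd.concat(kept)
This assumes the return column is comparable within each date (in the sample it mixes floats and strings, so you may want pd.to_numeric first).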
I have a dataframe as shown below:
>>> import pandas as pd
>>> df = pd.DataFrame(data = [['app;',1,2,3],['app; web;',4,5,6],['web;',7,8,9],['',1,4,5]],columns = ['a','b','c','d'])
>>> df
a b c d
0 app; 1 2 3
1 app; web; 4 5 6
2 web; 7 8 9
3 1 4 5
I have an input array that looks like this: ["app","web"]
For each of these values I want to check against a specific column of a dataframe and return a decision as shown below:
>>> df.a.str.contains("app")
0 True
1 True
2 False
3 False
Since str.contains only allows me to look for an individual value, I was wondering if there's some other direct way to do the same for multiple values, something like:
df.a.str.contains(["app","web"]) # Returns TypeError: unhashable type: 'list'
My end goal is not to do an absolute match (df.a.isin(["app", "web"])) but rather 'contains' logic that returns True if those substrings appear anywhere in that cell of the data frame.
Note: I can of course use the apply method to create my own function for the same logic, such as:
elementsToLookFor = ["app","web"]
df[header] = df[header].apply(lambda element: all(a in element for a in elementsToLookFor))
But I am more interested in the most efficient approach, so I would prefer a native pandas function, or failing that, the next most optimized custom solution.
This should work too:
l = ["app","web"]
df['a'].str.findall('|'.join(l)).map(lambda x: len(set(x)) == len(l))
also this should work as well:
pd.concat([df['a'].str.contains(i) for i in l],axis=1).all(axis = 1)
So many solutions; which one is the most efficient?
The str.contains-based answers are generally fastest, though str.findall is also very fast on smaller dfs:
values = ['app', 'web']
pattern = ''.join(f'(?=.*{value})' for value in values)
def replace_dummies_all(df):
    return df.a.str.replace(' ', '').str.get_dummies(';')[values].all(1)

def findall_map(df):
    return df.a.str.findall('|'.join(values)).map(lambda x: len(set(x)) == len(values))

def lower_contains(df):
    return df.a.astype(str).str.lower().str.contains(pattern)

def contains_concat_all(df):
    return pd.concat([df.a.str.contains(l) for l in values], axis=1).all(1)

def contains(df):
    return df.a.str.contains(pattern)
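A minimal timing harness for those candidates might look like this (the scale-up factor and the use of timeit are my assumptions, not part of the original benchmark):
import timeit

big = pd.concat([df] * 10_000, ignore_index=True)   # scale the 4-row sample up
for func in (replace_dummies_all, findall_map, lower_contains, contains_concat_all, contains):
    elapsed = timeit.timeit(lambda: func(big), number=10)
    print(f'{func.__name__:>20}: {elapsed:.3f}s')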
Try with str.get_dummies
df.a.str.replace(' ','').str.get_dummies(';')[['web','app']].all(1)
0 False
1 True
2 False
3 False
dtype: bool
Update
df['a'].str.contains(r'^(?=.*web)(?=.*app)')
Update 2 (to make the match case-insensitive and to ensure the column dtype is str, without which the logic may fail):
elementList = ['app','web']
valueString = ''
for eachValue in elementList:
    valueString += f'(?=.*{eachValue})'
df[header] = df[header].astype(str).str.lower()  # ensure the dtype is string and the match is case-insensitive
result = df[header].str.contains(valueString)
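On the sample frame above, with header = 'a', result would look like this (shown purely as a usage illustration):
print(result)
# 0    False
# 1     True
# 2    False
# 3    False
# Name: a, dtype: bool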
Let's say I have the following data of a match in a CSV file:
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4
I'm writing a Python program. Somewhere in my program I have scores collected for a match stored in a list, say x = [1,0,4]. I have found where in the data these scores exist using pandas, and I can print "found" or "not found". However, I want my code to print out which name these scores correspond to. In this case the program should output "Charlie", since Charlie has all these values [1,0,4]. How can I do that?
I will have a large set of data so I must be able to tell which name corresponds to the numbers I pass to the program.
Yes, here's how to compare entire rows in a dataframe:
df[(df == x).all(axis=1)].index # where x is the pd.Series we're comparing to
Also, it makes life easiest if you directly set name as the index column when you read in the CSV.
import pandas as pd
from io import StringIO
df = """\
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4"""
df = pd.read_csv(StringIO(df), index_col='name')
x = pd.Series({'match1':1, 'match2':0, 'match3':4})
Now you can see that doing df == x, or equivalently df.eq(x), is not quite what you want because it does an element-wise comparison and returns a True/False value for every cell. So you need to aggregate those per-row results with .all(axis=1), which finds the rows where all comparison results were True...
df.eq(x).all(axis=1)
df[ (df == x).all(axis=1) ]
# match1 match2 match3
# name
# Charlie 1 0 4
...and then finally since you only want the name of such rows:
df[ (df == x).all(axis=1) ].index
# Index(['Charlie'], dtype='object', name='name')
df[ (df == x).all(axis=1) ].index.tolist()
# ['Charlie']
which is what you wanted. (I only added the spaces inside the expression for clarity).
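And since the question also mentions printing "found" or "not found", a small hedged variant of the same expression could be:
matches = df[(df == x).all(axis=1)].index.tolist()
print(matches[0] if matches else 'not found')
# Charlie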
You need to use DataFrame.loc which would work like this:
print(df.loc[(df.match1 == 1) & (df.match2 == 0) & (df.match3 == 4), 'name'])
Maybe try something like this:
import pandas as pd
import numpy as np
# Makes sample data
match1 = np.array([2,2,1])
match2 = np.array([4,4,0])
match3 = np.array([3,3,4])
name = np.array(['Alice','Bob','Charlie'])
df = pd.DataFrame({'name': name, 'match1': match1, 'match2': match2, 'match3': match3})
df
# example of the list you want to get the data from
x=[1,0,4]
#x=[2,4,3]
# should return the name Charlie as well as the index (based on the values in the list x)
df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] ==x[2])]
# Makes a new dataframe out of the above
mydf = pd.DataFrame(df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] ==x[2])])
# Loop that prints out the name based on the index of mydf
# (Assuming there is more than one matching name it will print all of them; if there is only one, it will print just that one.)
for i in range(0, len(mydf)):
    print(mydf['name'].iloc[i])
You can use this. Here data is your DataFrame; change the name to match your own. Assuming [1,0,4] is of int type:
data = data[(data['match1'] == 1) & (data['match2'] == 0) & (data['match3'] == 4)].index
print(data[0])
If data is of object (string) type, then use this:
data = data[(data['match1'] == "1") & (data['match2'] == "0") & (data['match3'] == "4")].index
print(data[0])
I have the following example of my dataframe:
df = pd.DataFrame({'first_date': ['01-07-2017', '01-07-2017', '01-08-2017'],
'end_date': ['01-08-2017', '01-08-2017', '15-08-2017'],
'second_date': ['01-09-2017', '01-08-2017', '15-07-2017'],
'cust_num': [1, 2, 1],
'Title': ['philips', 'samsung', 'philips']})
The conditions are:
the cust_num is equal in the column,
the Title is equal for both rows in the dataframe, and
the second_date in one row is <= the end_date in another row.
If all these requirements are met, the value True should be appended to a new column in the original row.
Because I'm working with a big dataset I'm looking for an efficient way to do this.
In this case only the first record should get a True value.
I have looked at apply with lambda and at groupby in pandas but couldn't find a way to make these work.
Try this (off the top of my head I cannot come up with a faster method):
import pandas as pd
import numpy as np

df["second_date"] = pd.to_datetime(df["second_date"], format='%d-%m-%Y')
df["end_date"] = pd.to_datetime(df["end_date"], format='%d-%m-%Y')
df["new col"] = False

for cust in set(df["cust_num"]):
    indices = df.index[df["cust_num"] == cust].tolist()
    if len(indices) > 1:
        sub_df = df.loc[indices]
        for title in set(sub_df["Title"]):
            indices_title = sub_df.index[sub_df["Title"] == title]
            if len(indices_title) > 1:
                for i in indices_title:
                    # mark the first row in this group where the date condition holds
                    if sub_df.loc[i, "second_date"] <= sub_df.loc[i, "end_date"]:
                        df.loc[i, "new col"] = True
                        break
First you need to make all date columns comparable with each other by casting them to datetime. Then create the additional column you want.
Now create a set of all unique customer numbers and iterate through them. For each customer number, get a list of all row indices with this customer number. If this list is longer than 1, that customer number occurs several times. In that case, create a sub-DataFrame with all rows for that customer number and iterate through the set of its titles. For each title, check whether the same title occurs more than once in the sub-DataFrame (len > 1). If so, iterate through those rows and write True into the additional column of the first row where the date condition is met.
This should work. Also while reading comments, I am assuming that all cust_num is unique.
import pandas as pd
df = pd.DataFrame({'first_date': ['01-07-2017', '01-07-2017', '01-08-2017'],
'end_date': ['01-08-2017', '01-08-2017', '15-08-2017'],
'second_date': ['01-09-2017', '01-08-2017', '15-07-2017'],
'cust_num': [1, 2, 1],
'Title': ['philips', 'samsung', 'philips']})
df["second_date"]=pd.to_datetime(df["second_date"])
df["end_date"]=pd.to_datetime(df["end_date"])
df['Value'] = False
for i in range(len(df)):
    for j in range(len(df)):
        if i != j:
            if df.loc[j, 'end_date'] >= df.loc[i, 'second_date']:
                if df.loc[i, 'cust_num'] == df.loc[j, 'cust_num']:
                    if df.loc[i, 'Title'] == df.loc[j, 'Title']:
                        df.loc[i, 'Value'] = True
Let me know if this code works, and report any errors!
I am using an aggregation function that I have used in my work for a long time now. The idea is that if the Series passed to the function is of length 1 (i.e. the group only has one observation) then that observation is returned. If the length of the Series passed is greater than one, then the observations are returned in a list.
This may seem odd to some, but this is not an X/Y problem; I have a good reason for wanting to do this that is not relevant to this question.
This is the function that I have been using:
def MakeList(x):
    """ This function is used to aggregate data that needs to be kept distinct within multi-day
    observations for later use and transformation. It makes a list of the data and if the list is of length 1
    then there is only one line/day observation in that group so the single element of the list is returned.
    If the list is longer than one then there are multiple line/day observations and the list itself is
    returned."""
    L = x.tolist()
    if len(L) > 1:
        return L
    else:
        return L[0]
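For example, called on a plain Series outside of agg, the function behaves as described (this standalone demo is mine, not part of the original post):
print(MakeList(pd.Series([7.76])))          # 7.76 -> one observation, the scalar is returned
print(MakeList(pd.Series([7.76, 25.564])))  # [7.76, 25.564] -> multiple observations, the list is returned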
Now for some reason, with the current data set I am working on I get a ValueError stating that the function does not reduce. Here is some test data and the remaining steps I am using:
import pandas as pd
DF = pd.DataFrame({'date': ['2013-04-02',
'2013-04-02',
'2013-04-02',
'2013-04-02',
'2013-04-02',
'2013-04-02',
'2013-04-02',
'2013-04-02',
'2013-04-02',
'2013-04-02'],
'line_code': ['401101',
'401101',
'401102',
'401103',
'401104',
'401105',
'401105',
'401106',
'401106',
'401107'],
's.m.v.': [ 7.760,
25.564,
25.564,
9.550,
4.870,
7.760,
25.564,
5.282,
25.564,
5.282]})
DFGrouped = DF.groupby(['date', 'line_code'], as_index = False)
DF_Agg = DFGrouped.agg({'s.m.v.' : MakeList})
In trying to debug this, I put print statements to the effect of print L and print x.index, and the output was as follows:
[7.7599999999999998, 25.564]
Int64Index([0, 1], dtype='int64')
[7.7599999999999998, 25.564]
Int64Index([0, 1], dtype='int64')
For some reason it appears that agg is passing the Series twice to the function. This as far as I know is not normal at all, and is presumably the reason why my function is not reducing.
For example if I write a function like this:
def test_func(x):
    print x.index
    return x.iloc[0]
This runs without problem and the print statements are:
DF_Agg = DFGrouped.agg({'s.m.v.' : test_func})
Int64Index([0, 1], dtype='int64')
Int64Index([2], dtype='int64')
Int64Index([3], dtype='int64')
Int64Index([4], dtype='int64')
Int64Index([5, 6], dtype='int64')
Int64Index([7, 8], dtype='int64')
Int64Index([9], dtype='int64')
Which indicates that each group is only being passed once as a Series to the function.
Can anyone help me understand why this is failing? I have used this function with success in many many data sets I work with....
Thanks
I can't really explain why, but in my experience lists inside a pandas DataFrame don't work all that well.
I usually use tuples instead.
That will work:
def MakeList(x):
    T = tuple(x)
    if len(T) > 1:
        return T
    else:
        return T[0]
DF_Agg = DFGrouped.agg({'s.m.v.' : MakeList})
date line_code s.m.v.
0 2013-04-02 401101 (7.76, 25.564)
1 2013-04-02 401102 25.564
2 2013-04-02 401103 9.55
3 2013-04-02 401104 4.87
4 2013-04-02 401105 (7.76, 25.564)
5 2013-04-02 401106 (5.282, 25.564)
6 2013-04-02 401107 5.282
This is a misfeature in DataFrame. If the aggregator returns a list for the first group, it will fail with the error you mention; if it returns a non-list (non-Series) for the first group, it will work fine. The broken code is in groupby.py:
def _aggregate_series_pure_python(self, obj, func):
    group_index, _, ngroups = self.group_info
    counts = np.zeros(ngroups, dtype=int)
    result = None
    splitter = get_splitter(obj, group_index, ngroups, axis=self.axis)
    for label, group in splitter:
        res = func(group)
        if result is None:
            if (isinstance(res, (Series, Index, np.ndarray)) or
                    isinstance(res, list)):
                raise ValueError('Function does not reduce')
            result = np.empty(ngroups, dtype='O')
        counts[label] = group.shape[0]
        result[label] = res
Notice the combination of checks: if result is None (i.e. this is the first group processed) and isinstance(res, list) is true, the ValueError is raised.
Your options are:
Fake out groupby().agg(), so it doesn't see a list for the first group, or
Do the aggregation yourself, using code like that above but without the erroneous test.
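A minimal sketch of the second option, reusing MakeList and the grouping keys from the question (building the result frame by hand is my assumption of what "do the aggregation yourself" looks like):
rows = []
for (date, line_code), group in DF.groupby(['date', 'line_code']):
    rows.append({'date': date,
                 'line_code': line_code,
                 's.m.v.': MakeList(group['s.m.v.'])})
DF_Agg = pd.DataFrame(rows)
This avoids the reduction check in agg() entirely, at the cost of an explicit Python loop over the groups.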