Why does the groupby command in Pandas produce non-existent ids? - python

I use the pandas groupby command on my dataframe like this:
df.groupby('courier_id').type_of_vehicle.size()
but this code produces some 'courier_id' values that are not in my dataframe:
courier_id
00aecd42-472f-11ec-94e0-77812be296a5 4
011da6a6-eb0b-11ec-97e1-179dc13cdf87 1
0140f63c-02e0-11ed-b314-9b2e7e4f7e5c 1
0188d572-7228-11ec-ab3b-07d470cb404d 7
01cef7ba-e32e-11ec-bb21-67c7079055d4 0
..
c98fc418-7b51-11ec-a81c-77139d6dd889 0
d98a4b9a-d056-11ec-9e3c-0b80c11ec04b 1
dae54c80-d1f8-11ec-bbb0-b71d7b2c4e1a 1
f7925664-0ac1-11ed-ab40-df16023f78cb 0
f857cb84-371c-11ec-9af6-ffeaeea4b0f1 4
Name: type_of_vehicle, Length: 268, dtype: int64
I checked it with '01cef7ba-e32e-11ec-bb21-67c7079055d4' in df.courier_id.values and the result was False.
I used df.groupby('courier_id').get_group('01cef7ba-e32e-11ec-bb21-67c7079055d4') and it raised a KeyError, yet when I loop over the groups with a for loop, that id turns up with an empty DataFrame.
Note: when I slice my dataframe as new_df = df[['courier_id', 'type_of_vehicle']] the result becomes correct!

If you could provide some reproducible code/data it would be appreciated; that way we can give you the best possible answer.
However, I think the problem is due to the following:
when you use groupby(), the original courier_id values become the index of the result. Try .reset_index() and your problem should be solved:
df.groupby('courier_id').type_of_vehicle.size().reset_index()
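For illustration, a minimal sketch with made-up ids showing what reset_index() does to the grouped result:
import pandas as pd

df = pd.DataFrame({'courier_id': ['a', 'a', 'b'],
                   'type_of_vehicle': ['bike', 'car', 'bike']})

# size() returns a Series indexed by courier_id; reset_index()
# turns that index back into a regular column
counts = df.groupby('courier_id').type_of_vehicle.size().reset_index()
print(counts)
#   courier_id  type_of_vehicle
# 0          a                2
# 1          b                1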

Related

Trying to compare two values in a pandas dataframe for max value

I've got a pandas dataframe, and I'm trying to fill a new column with the maximum of each pair of adjacent values from another column, iteratively. I'm trying to build a loop to do this and save on computation, as I realise I could probably do it with more lines of code.
for x in ((jac_input.index)):
    jac_output['Max Load'][x] = jac_input[['load'][x],['load'][x+1]].max()
However, I keep getting this error during the comparison
IndexError: list index out of range
Any ideas as to where I'm going wrong here? Any help would be appreciated!
Many things are wrong with your current code.
When you do ['abc'][x], x can only take the value 0, and this returns 'abc' because you are indexing a list literal. That is not at all what you expect it to do (I imagine you meant to slice the Series).
For your code to be valid, you should do something like:
jac_input = pd.DataFrame({'load': [1,0,3,2,5,4]})
for x in jac_input.index:
    print(jac_input['load'].loc[x:x+1].max())
output:
1
3
3
5
5
4
Also, when assigning, if you use jac_output['Max Load'][x] = ... you will likely encounter a SettingWithCopyWarning. You should rather use loc: jac_output.loc[x, 'Max Load'] = ...
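For example, a minimal sketch of the loop written with .loc-based assignment (reusing the jac_input/jac_output names from the question; jac_output is assumed to share jac_input's index):
import pandas as pd

jac_input = pd.DataFrame({'load': [1, 0, 3, 2, 5, 4]})
jac_output = pd.DataFrame(index=jac_input.index)

for x in jac_input.index:
    # .loc[row, column] writes directly into jac_output -- no chained
    # indexing, so no SettingWithCopyWarning
    jac_output.loc[x, 'Max Load'] = jac_input['load'].loc[x:x+1].max()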
But you do not need all that, use vectorized code instead!
You can perform rolling on the reversed dataframe:
jac_output['Max Load'] = jac_input['load'][::-1].rolling(2, min_periods=1).max()[::-1]
Or using concat:
jac_output['Max Load'] = pd.concat([jac_input['load'], jac_input['load'].shift(-1)], axis=1).max(1)
output (without assignment):
0 1.0
1 3.0
2 3.0
3 5.0
4 5.0
5 4.0
dtype: float64

Creating a new column, but it creates a copy of the dataframe

I would like to check the value of the row above and see if it is the same as the current row. I found a great answer here: df['match'] = df.col1.eq(df.col1.shift()), where col1 is the column being compared.
However, when I tried it, I received a SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. warning. My col1 is a string. I know you can suppress warnings but how would I check the same row above and make sure that I am not creating a copy of the dataframe? Even with the warning I do get my desired output, but was curious if there exists a better way.
import pandas as pd
data = {'col1': ['a','a','a','b','b','c','c','c','d','d'],
        'week': [1,1,1,1,1,2,2,2,2,2]}
df = pd.DataFrame(data, columns=['col1','week'])
df['check_condition'] = 1

while sum(df.check_condition) != 0:
    for week in df.week:
        wk = df.loc[df.week == week]
        wk['match'] = wk.col1.eq(wk.col1.shift())  # <-- where the warning occurs
        # fix the repetitive value...which I have not done yet
        # for now just exit out of the while loop
        df.loc[df.week == week, 'check_condition'] = 0
You can't ignore a pandas SettingWithCopyWarning!
It's 100% telling you that your code is not going to work as intended, if at all. Stop, investigate and fix it. (It's not an ignorable warning you can filter out, like a pandas FutureWarning nagging about a deprecation.)
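A minimal reproduction of what the warning is telling you, assuming classic (pre copy-on-write) pandas semantics:
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b'], 'week': [1, 1, 2]})

wk = df.loc[df.week == 1]      # wk holds a copy of the slice of df
wk['match'] = True             # SettingWithCopyWarning: the write lands on the copy
print('match' in df.columns)   # False -- df itself was never modified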
Multiple issues with your code:
You're trying to iterate over a dataframe (but not with groupby()), take slices of it (in the subdataframe wk, which yes is a copy of a slice)...
then assign to the (nonexistent) new column wk['match']. This is bad, you shouldn't do this. (You could initialize df['match'] = np.nan, but it'd still be wrong to try to assign to the copy in wk)...
SettingWithCopyWarning is triggered when you try to assign to wk['match']. It's telling you that wk is a copy of a slice from dataframe df, not df itself; hence the message: A value is trying to be set on a copy of a slice from a DataFrame. That assignment would get thrown away every time wk is overwritten by your loop, so even if you could force it to work on wk it would still be wrong. That's why SettingWithCopyWarning is a code smell: you shouldn't be making a copy of a slice of df in the first place.
Later on, you also try to assign to column df['check_condition'] while iterating over the df, that's also bad.
Solution:
df['check_condition'] = df['col1'].eq(df['col1'].shift()).astype(int)
df
col1 week check_condition
0 a 1 0
1 a 1 1
2 a 1 1
3 b 1 0
4 b 1 1
5 c 2 0
6 c 2 1
7 c 2 1
8 d 2 0
9 d 2 1
More generally, for more complicated code where you want to iterate over each group of a dataframe according to some grouping criterion, you'd use groupby() and split-apply-combine instead (see the sketch after this list):
you're grouping by wk.col1.eq(wk.col1.shift()), i.e. rows where col1 value doesn't change from the preceding row
and you want to set check_condition to 0 on those rows
and 1 on rows where col1 value did change from the preceding row
But in this simpler case you can skip groupby() and do a direct assignment.
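Still, for reference, a minimal sketch of the groupby() version on the same toy data, computing the comparison within each week group:
import pandas as pd

df = pd.DataFrame({'col1': ['a','a','a','b','b','c','c','c','d','d'],
                   'week': [1,1,1,1,1,2,2,2,2,2]})

# transform() applies the shift-and-compare per group, so the first row
# of each week is never compared against the previous week's last row
df['check_condition'] = (df.groupby('week')['col1']
                           .transform(lambda s: s.eq(s.shift()))
                           .astype(int))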

Pandas - find occurrence within a subset

I'm stripping values from unformatted summary sheets in a for loop, and I need to dynamically find the index location of a string value after the occurrence of another specific string value. I used this question as my starting point. Example dataframe:
import pandas as pd
df = pd.DataFrame([['Small'],['Total',4],['Medium'],['Total',12],['Large'],['Total',7]])
>>>df
0 1
0 Small NaN
1 Total 4.0
2 Medium NaN
3 Total 12.0
4 Large NaN
5 Total 7.0
Say I want to find the 'Total' after 'Medium.' I can find the location of 'Medium' with the following:
MedInd = df[df.iloc[:,0]=='Medium'].first_valid_index()
>>>MedInd
2
After this, I run into issues placing a subset limitation on the query:
>>>MedTotal = df[df.iloc[MedInd:,0]=='Total'].first_valid_index()
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
Still very new to programming and could use some direction with this error. Searching the error itself it seems like it's an issue of the ordering in which I should define the subset, but I've been unable to fix it thus far. Any assistance would be greatly appreciated.
EDIT:
So I ended up resolving this by moving the subset limitation to the front, outside the first_valid_index clause as follows (suggestion obtained from this reddit comment):
MedTotal = df.iloc[MedInd:][df.iloc[:,0]=='Total'].first_valid_index()
This does throw the following warning:
UserWarning: Boolean Series key will be reindexed to match DataFrame index.
But the output was as desired, which was just the index number for the value being sought.
I don't know if this will always produce desired results given the warning, so I'll continue to scan the answers for other solutions.
You may want to use shift:
df[df.iloc[:,0].shift().eq('Medium') & df.iloc[:,0].eq('Total')]
Output:
0 1
3 Total 12.0
This would work:
def find_idx(df, first_str, second_str):
    first_idx = df[0].eq(first_str).idxmax()
    rest_of_df = df.iloc[first_idx:]
    return rest_of_df[0].eq(second_str).idxmax()
find_idx(df, 'Medium', 'Total')
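With the example dataframe above, find_idx(df, 'Medium', 'Total') returns 3. One caveat: eq(...).idxmax() silently returns the first index label when the string never occurs (an all-False mask has its maximum at the first position), so add an explicit check if the labels aren't guaranteed to be present.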

Filtering data in Pandas returns error 'method' object is not iterable

I have a dataset as follows:
I want to filter the rows where the count value equals 1:
index count
1 4
2 5
3 1
4 1
This is my code:
booleans = []
for number in df1.count:
    if number == 1:
        booleans.append(True)
    else:
        booleans.append(False)
but it raises this error:
'method' object is not iterable
I also tried this:
df[df.count==1]
but I had the following error:
KeyError: False
Any suggestions?
In your code the problem is with the part df1.count. Pandas has a method count(), which counts the number of non-NA/null observations across the given axis.
So in your code it returns something like this:
<bound method DataFrame.count of index count
0 1 4
1 2 5
2 3 1
3 4 1>
Instead, you can use df[df['count']=='1'] to get what you were looking for (note that the counts are stored as strings in the example below):
import pandas as pd
data = {"index":['1','2','3','4'],
"count":['4','5','1','1']}
df = pd.DataFrame(data)
indexes = df[df['count']=='1']
print(indexes)
Output
index count
2 3 1
3 4 1
count is also a method of the pandas DataFrame.
When you do df.count, pandas understands you are calling the count() method, not fetching the column that happens to have the same name. Using df["count"] instead solves your issue.
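A quick sketch of the difference, using toy data with integer counts (the example above stores them as strings):
import pandas as pd

df = pd.DataFrame({'count': [4, 5, 1, 1]})

print(type(df.count))        # <class 'method'> -- DataFrame.count, not the column
print(type(df['count']))     # <class 'pandas.core.series.Series'> -- the column
print(df[df['count'] == 1])  # filters the rows where the column equals 1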
The standard way to do this is to do the following:
Solution 1
df1[df1["count"]=='1']
Solution 2
However, if you really do want to get a list of booleans you might want to use lambdas:
booleans = list(df1['count'].apply(lambda x:x=='1').values)
You can then use this list to get the result you want like so:
df1[booleans]
This is basically the same thing as solution 1.

How to do columnwise operations in pandas?

I have a dataframe that looks something like:
sample parameter1 parameter2 parameter3
A 9 6 3
B 4 5 7
C 1 5 8
and I want to do an operation that does something like:
for sample in dataframe:
    df['new parameter'] = df[sample, parameter1] / df[sample, parameter2]
so far I have tried:
df2.loc['ratio'] = df2.loc['reads mapped']/df2.loc['raw total sequences']
but I get the error:
KeyError: 'the label [reads mapped] is not in the [index]'
when I know well that it is in the index, so I figure I am missing some concept somewhere. Any help is much appreciated!
I should add that the parameter values are floats, just in case that is a problem as well!
The method .loc first expects row indices, then column indices, so the following should work, since you wanted to do column-wise operations:
df2['ratio'] = df2.loc[:, 'reads mapped'] / df2.loc[:, 'raw total sequences']
You can find more info in the documentation.
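For illustration, a small sketch assuming 'reads mapped' and 'raw total sequences' are columns of df2 (the numbers are made up):
import pandas as pd

df2 = pd.DataFrame({'reads mapped': [90.0, 40.0],
                    'raw total sequences': [100.0, 50.0]})

# .loc[:, col] selects every row of one column; dividing two columns
# aligns on the index and produces the new ratio column
df2['ratio'] = df2.loc[:, 'reads mapped'] / df2.loc[:, 'raw total sequences']
print(df2)
#    reads mapped  raw total sequences  ratio
# 0          90.0                100.0    0.9
# 1          40.0                 50.0    0.8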
