Finding which row of the DataFrame - python

I have a DataFrame, called weights:
| person | age | weight_at_time_1 | weight_at_time_2 |
| Joe | 23 | 280 | 240 |
| Mary | 19 | 111 | 90 |
| Tom | 34 | 150 | 100 |
I want to find the highest weight loss (essentially, where the difference in weight is the most negative), along with the weight_at_time_1 and weight_at_time_2 values that produced it, to see the significance of the weight loss. I also want the name of the person who lost it.
weights['delta_weight'] = weights['weight_at_time_2'] - weights['weight_at_time_1']
weights['delta_weight'].min()
This tells me that the most negative change in weight (highest weight loss) was -50.
I want to report back the weight_at_time_1 and weight_at_time_2 which yielded this min().
Is there a way to perhaps retrieve the index for the row at which min() is found? Or do I have to loop through the DataFrame and keep track of that?

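Here is one way, using idxmin on the change in weight (time 2 minus time 1), so that the most negative value is the largest loss:
df.loc[[(df.weight_at_time_2 - df.weight_at_time_1).idxmin()], :]
person age weight_at_time_1 weight_at_time_2
2 Tom 34 150 100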
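As a minimal, self-contained sketch of the idxmin approach: the frame below is rebuilt from the table in the question, and the names delta, row_label, and best_row are only illustrative.

```python
import pandas as pd

# Data copied from the table in the question.
weights = pd.DataFrame({
    "person": ["Joe", "Mary", "Tom"],
    "age": [23, 19, 34],
    "weight_at_time_1": [280, 111, 150],
    "weight_at_time_2": [240, 90, 100],
})

# Negative values mean weight was lost between the two measurements.
delta = weights["weight_at_time_2"] - weights["weight_at_time_1"]

# idxmin returns the index label of the first occurrence of the minimum.
row_label = delta.idxmin()
best_row = weights.loc[row_label]

print(best_row["person"], best_row["weight_at_time_1"],
      best_row["weight_at_time_2"], delta[row_label])
```

Note that idxmin reports only the first row when several rows tie for the minimum.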

If several rows tie for the max/min, you can also use this:
delta = df.weight_at_time_2 - df.weight_at_time_1
df.loc[delta == delta.min()]
To answer your comment:
In [3]: delta = df.weight_at_time_2 - df.weight_at_time_1
In [4]: bool_idx = delta == delta.min()
# In this way, we are actually using *Boolean indexing*:
# a boolean vector filters the rows of a DataFrame
In [5]: bool_idx
Out[5]:
0 False
1 False
2 True
dtype: bool
# These two lines are equivalent. The result is a DataFrame
# that contains all the rows whose position in `bool_idx`
# holds True
# In [6]: df.loc[bool_idx]
In [6]: df.loc[bool_idx, :]
Out[6]:
person age weight_at_time_1 weight_at_time_2
2 Tom 34 150 100
# By also specifying a column label, we get a Series out of the
# filtered DataFrame
In [7]: df.loc[bool_idx, 'person']
Out[7]:
2 Tom
Name: person, dtype: object
# To drop the Series data structure
# - use `.values` property to get a `numpy.ndarray`
# - use `.to_list()` method to get a list
In [8]: df.loc[bool_idx, 'person'].values
Out[8]: array(['Tom'], dtype=object)
In [9]: df.loc[bool_idx, 'person'].to_list()
Out[9]: ['Tom']
# Now, at this time I think you must know many ways
# to get only a string 'Tom' out of above results :)
By the way, @WeNYoBen's great answer uses selection by label, while this answer uses selection by Boolean indexing.
For a better understanding, I would also suggest reading through the official pandas documentation on indexing and selecting data.

Related

Subsetting data with a column condition

I have a dataframe which contains Dates, Visitor_ID and Pages columns. The Pages column has a separate row-wise entry for each page visited on a given date. Please refer to the table below to understand the data.
| Dates | Visitor_ID | Pages |
|:------ |:---------:| -----: |
| 10/1/2021 | 1 | xy |
| 10/1/2021 | 1 | step2 |
| 10/1/2021 | 1 | xx |
| 10/1/2021 | 1 | NetBanking |
| 10/1/2021 | 2 | step1 |
| 10/1/2021 | 2 | xy |
| 10/1/2021 | 3 | step1 |
| 10/1/2021 | 3 | NetBanking |
| 11/1/2021 | 4 | step1 |
| 12/1/2021 | 4 | NetBanking |
Desired output:
| Date | Visitor_ID |
| 10/1/2021 | 1 |
| 10/1/2021 | 3 |
The output should be a subset of the actual data: if, for the same Visitor_ID, a page containing the string "step" comes before a page containing the string "NetBanking" on the same date, then return that Visitor_ID.
To initialise your dataframe you could do:
import pandas as pd
columns = ["Dates", "Visitor_ID", "Pages"]
records = [
["10/1/2021", 1, "xy"],
["10/1/2021", 1, "step2"],
["10/1/2021", 1, "NetBanking"],
["10/1/2021", 2, "step1"],
["10/1/2021", 2, "xy"],
["10/1/2021", 3, "step1"],
["10/1/2021", 3, "NetBanking"],
["11/1/2021", 4, "step1"],
["12/1/2021", 4, "NetBanking"]]
data = pd.DataFrame().from_records(records, columns=columns)
data["Dates"] = pd.DatetimeIndex(data["Dates"])
index_names = columns[:2]
data.set_index(index_names, drop=True, inplace=True)
Note that I have left out your third line in the records, otherwise I cannot reproduce your desired output. I have made this a multi-index data frame in order to easily loop over the groups 'date/visitor'. The structure of the dataframe looks like:
print(data)
                            Pages
Dates      Visitor_ID
2021-10-01 1                   xy
           1                step2
           1           NetBanking
           2                step1
           2                   xy
           3                step1
           3           NetBanking
2021-11-01 4                step1
2021-12-01 4           NetBanking
Now to select the customers from the same date and from the same group, I am going to loop over these groups and use 2 masks to select the required records:
for date_time, data_per_date in data.groupby(level=0):
    # level=1 groups by Visitor_ID within the current date
    for visitor, data_per_visitor in data_per_date.groupby(level=1):
        # select the column with the Pages
        pages = data_per_visitor["Pages"].str
        # make 2 boolean masks, for the records with step and netbanking
        has_step = pages.contains("step")
        has_netbanking = pages.contains("NetBanking")
        # to get the records after each 'step' record, apply a diff on 'has_step'
        # Convert to int first for the correct result
        # each diff with outcome -1 fulfills this requirement. Make a
        # mask based on this requirement
        diff_step = has_step.astype(int).diff()
        records_after_step = diff_step == -1
        # combine the 2 masks to create your final mask to make a selection
        mask = records_after_step & has_netbanking
        # select the records and print to screen
        selection = data_per_visitor[mask]
        if not selection.empty:
            print(selection.reset_index()[index_names])
This gives the following output (one block per matching date/visitor group):
       Dates  Visitor_ID
0 2021-10-01           1
       Dates  Visitor_ID
0 2021-10-01           3
EDIT:
I was reading your question again. The solution above assumes that only a 'NetBanking' record directly following a 'step' record is valid. That is why I thought your example input did not correspond to your desired output. However, if you allow rows between an occurrence of 'step' and the first 'NetBanking', the solution does not work. In that case, it is better to explicitly iterate over the rows of your dataframe per date and visitor ID. An example would be:
for date_time, data_per_date in data.groupby(level=0):
    # level=1 groups by Visitor_ID within the current date
    for visitor, data_per_visitor in data_per_date.groupby(level=1):
        after_step = False
        index_selection = list()
        # turn the index levels into columns; rows get positions 0..n-1
        data_per_visitor = data_per_visitor.reset_index()
        for index, records in data_per_visitor.iterrows():
            page = records["Pages"]
            if "step" in page and not after_step:
                after_step = True
            if "NetBanking" in page and after_step:
                index_selection.append(index)
                after_step = False
        selection = data_per_visitor.reindex(index_selection)
        if not selection.empty:
            print(selection.reset_index()[index_names])
Normally I would not recommend using 'iterrows' as it is really slow, but in this case I don't see another easy solution. The second algorithm selects the same records as the first for my data. If you do include the third line from your example data, the second algorithm still gives the same output.
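One possible way to avoid the explicit row loop, offered only as a sketch: take a cumulative maximum of the "step" mask within each date/visitor group, so that every row knows whether a "step" page has already been seen. This assumes that any "NetBanking" row preceded by a "step" row in the same group qualifies, and it works on a flat frame instead of the multi-index. The names keys, seen_step, and result are illustrative.

```python
import pandas as pd

# Example data from the question, including the 'xx' row.
records = [
    ["10/1/2021", 1, "xy"],
    ["10/1/2021", 1, "step2"],
    ["10/1/2021", 1, "xx"],
    ["10/1/2021", 1, "NetBanking"],
    ["10/1/2021", 2, "step1"],
    ["10/1/2021", 2, "xy"],
    ["10/1/2021", 3, "step1"],
    ["10/1/2021", 3, "NetBanking"],
    ["11/1/2021", 4, "step1"],
    ["12/1/2021", 4, "NetBanking"],
]
data = pd.DataFrame(records, columns=["Dates", "Visitor_ID", "Pages"])

has_step = data["Pages"].str.contains("step")
has_netbanking = data["Pages"].str.contains("NetBanking")

# Within each (date, visitor) group, mark every row at or after the first 'step'.
keys = [data["Dates"], data["Visitor_ID"]]
seen_step = has_step.astype(int).groupby(keys).cummax().astype(bool)

# Keep 'NetBanking' rows that come after a 'step' row in the same group.
mask = has_netbanking & seen_step
result = (data.loc[mask, ["Dates", "Visitor_ID"]]
              .drop_duplicates()
              .reset_index(drop=True))
print(result)
```

The row order of the frame is taken as the visiting order, as in the loop-based versions.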

create a column that is the sum of previous X rows where x is a parm given by a different column row

I'm trying to create a column where I sum the previous x rows of a column, with x given by a parameter in a different column of the same row.
I have a solution, but it's really slow, so I was wondering if anyone could help do this a lot faster.
| time | price |parm |
|--------------------------|------------|-----|
|2020-11-04 00:00:00+00:00 | 1.17600 | 1 |
|2020-11-04 00:01:00+00:00 | 1.17503 | 2 |
|2020-11-04 00:02:00+00:00 | 1.17341 | 3 |
|2020-11-04 00:03:00+00:00 | 1.17352 | 2 |
|2020-11-04 00:04:00+00:00 | 1.17422 | 3 |
and the slow code:
#jit
def rolling_sum(x,w):
    return np.convolve(x,np.ones(w,dtype=int),'valid')
#jit
def rol(x,y):
    for i in range(len(x)):
        res[i] = rolling_sum(x, y[i])[0]
    return res
dfa = df[:500000]
res = np.empty(len(dfa))
r = rol(dfa.l_x.values, abs(dfa.mb).values+1)
r
Maybe something like this could work. I have made up an example with to_be_summed being the column of values that should be summed up and lookback holding the number of rows to look back:
df = pd.DataFrame({"to_be_summed": range(10), "lookback":[0,1,2,3,2,1,4,2,1,2]})
summed = df.to_be_summed.cumsum()
result = [summed[i] - (summed[i - lookback - 1] if i - lookback - 1 >= 0 else 0) for i, lookback in enumerate(df.lookback)]
What I did here is to first do a cumsum over the column that should be summed up. Now, for the i-th entry I can take the entry of this cumsum and subtract the one lookback + 1 steps back. Note that this includes the i-th value in the sum. If you don't want to include it, you just have to change summed[i] to summed[i - 1]. Also note that the check i - lookback - 1 >= 0 prevents you from accidentally looking back past the first row: in that case nothing is subtracted, so the sum starts at row 0.
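The same cumulative-sum idea can be written without a Python-level loop. The following is only a sketch using the made-up example frame: it prepends a zero to the cumulative sum so that windows reaching back to the first row need no special case. The names prefix, start, end, and window_sum are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"to_be_summed": range(10),
                   "lookback": [0, 1, 2, 3, 2, 1, 4, 2, 1, 2]})

values = df["to_be_summed"].to_numpy()
lookback = df["lookback"].to_numpy()

# prefix[k] holds the sum of values[:k], so prefix[0] is 0.
prefix = np.concatenate(([0], np.cumsum(values)))

# Window for row i covers rows max(0, i - lookback[i]) .. i inclusive.
end = np.arange(len(values)) + 1
start = np.clip(end - 1 - lookback, 0, None)

df["window_sum"] = prefix[end] - prefix[start]
print(df)
```

Because both lookups are plain array indexing, the cost is linear in the number of rows regardless of the window sizes.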

Get Index Minimum Value in Column When String - Pandas Dataframe

I have done some research on this, but couldn't find a concise method when the index is of type 'string'.
Given the following Pandas dataframe:
Platform | Action | RPG | Fighting
----------------------------------------
PC | 4 | 6 | 9
Playstat | 6 | 7 | 5
Xbox | 9 | 4 | 6
Wii | 8 | 8 | 7
I was trying to get the index (Platform) of the smallest value in the 'RPG' column, which would return 'Xbox'. I managed to make it work but it's not efficient, and looking for a better/quicker/condensed approach. Here is what I got:
# Return a series of all column values for RPG
series1 = ign_data.loc['RPG']
# Find the lowest value in the series
minim = min(ign_data.loc['RPG'])
# Get the index of that value using boolean indexing
result = series1[series1 == minim].index
# Format that index to a list, and return the first (and only) element
str_result = result.format()[0]
Use Series.idxmin:
df.set_index('Platform')['RPG'].idxmin()
#'Xbox'
or what @Quang Hoang suggests in the comments:
df.loc[df['RPG'].idxmin(), 'Platform']
If Platform is already the index:
df['RPG'].idxmin()
EDIT
df.set_index('Platform').loc['Playstat'].idxmin()
#'Fighting'
df.set_index('Platform').idxmin(axis=1)['Playstat']
#'Fighting'
If Platform is already the index:
df.loc['Playstat'].idxmin()
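Put together as a runnable sketch, with the frame rebuilt from the table in the question (the variable names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "Platform": ["PC", "Playstat", "Xbox", "Wii"],
    "Action": [4, 6, 9, 8],
    "RPG": [6, 7, 4, 8],
    "Fighting": [9, 5, 6, 7],
})
indexed = df.set_index("Platform")

# Column-wise: which platform has the lowest RPG value?
lowest_rpg_platform = indexed["RPG"].idxmin()

# Row-wise: which genre has the lowest value for one platform?
lowest_genre_for_playstat = indexed.loc["Playstat"].idxmin()

# Row-wise for every platform at once.
lowest_genre_per_platform = indexed.idxmin(axis=1)

print(lowest_rpg_platform)
print(lowest_genre_for_playstat)
print(lowest_genre_per_platform)
```

Since idxmin returns index labels, it works the same way for string labels as for integer ones.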

How to add new column in pandas dataframe? [duplicate]

I have an existing dataframe which I need to add an additional column to which will contain the same value for every row.
Existing df:
Date, Open, High, Low, Close
01-01-2015, 565, 600, 400, 450
New df:
Name, Date, Open, High, Low, Close
abc, 01-01-2015, 565, 600, 400, 450
I know how to append an existing series / dataframe column. But this is a different situation, because all I need is to add the 'Name' column and set every row to the same value, in this case 'abc'.
df['Name']='abc' will add the new column and set all rows to that value:
In [79]:
df
Out[79]:
Date, Open, High, Low, Close
0 01-01-2015, 565, 600, 400, 450
In [80]:
df['Name'] = 'abc'
df
Out[80]:
Date, Open, High, Low, Close Name
0 01-01-2015, 565, 600, 400, 450 abc
You can use insert to specify where you want the new column to be. In this case, I use 0 to place the new column at the left.
df.insert(0, 'Name', 'abc')
Name Date Open High Low Close
0 abc 01-01-2015 565 600 400 450
Summing up what the others have suggested, and adding a third way, you can:
Use assign(**kwargs):
df.assign(Name='abc')
Access the new column series (it will be created) and set it:
df['Name'] = 'abc'
Use insert(loc, column, value, allow_duplicates=False):
df.insert(0, 'Name', 'abc')
where the argument loc ( 0 <= loc <= len(columns) ) allows you to insert the column where you want.
'loc' gives you the index that your column will be at after the insertion. For example, the code above inserts the column Name as the 0-th column, i.e. it will be inserted before the first column, becoming the new first column. (Indexing starts from 0).
All these methods allow you to add a new column from a Series as well (just substitute the 'abc' default argument above with the series).
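The three options can be compared side by side in a short sketch. The one-row frame mirrors the example in the question, and the names with_assign, with_setitem, and with_insert are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["01-01-2015"],
    "Open": [565],
    "High": [600],
    "Low": [400],
    "Close": [450],
})

# assign returns a new DataFrame and leaves df untouched; the column goes last.
with_assign = df.assign(Name="abc")

# Item assignment modifies the frame in place; the column goes last.
with_setitem = df.copy()
with_setitem["Name"] = "abc"

# insert modifies the frame in place and places the column at position loc.
with_insert = df.copy()
with_insert.insert(0, "Name", "abc")

print(list(with_assign.columns))
print(list(with_insert.columns))
```

The copies are used here only so that each variant starts from the same original frame.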
A one-liner works:
df['Name'] = 'abc'
This creates a Name column and sets all rows to the value abc.
I want to draw more attention to a portion of @michele-piccolini's answer.
I strongly believe that .assign is the best solution here. In the real world, these operations are not in isolation, but in a chain of operations. And if you want to support a chain of operations, you should probably use the .assign method.
Here is an example using snowfall data at a ski resort (but the same principles would apply to say ... financial data).
This code reads like a recipe of steps. Both assignment (with =) and .insert make this much harder:
raw = pd.read_csv('https://github.com/mattharrison/datasets/raw/master/data/alta-noaa-1980-2019.csv',
                  parse_dates=['DATE'])
def clean_alta(df):
    return (df
        .loc[:, ['STATION', 'NAME', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'DATE',
                 'PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'TOBS']]
        .groupby(pd.Grouper(key='DATE', freq='W'))
        .agg({'PRCP': 'sum', 'TMAX': 'max', 'TMIN': 'min', 'SNOW': 'sum', 'SNWD': 'mean'})
        .assign(LOCATION='Alta',
                T_RANGE=lambda w_df: w_df.TMAX-w_df.TMIN)
    )
clean_alta(raw)
Notice the line .assign(LOCATION='Alta', that creates a column with a single value in the middle of the rest of the operations.
One Line did the job for me.
df['New Column'] = 'Constant Value'
df['New Column'] = 123
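You can simply do the following (passing index=df.index keeps the new values aligned with the rows even when df does not have the default 0..n-1 index):
df['New Col'] = pd.Series(["abc" for x in range(len(df.index))], index=df.index)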
This single line will work.
df['name'] = 'abc'
The append method has been deprecated since pandas 1.4.0.
So instead, use the method above, but only when working with an actual pandas DataFrame object:
df["column"] = "value"
Or, if you are setting a value on a view or a copy of a DataFrame, use concat() or assign().
This way, the new Series is created with the same index as the original DataFrame, and so it matches on the exact rows:
# adds a new column in view `where_there_is_one` named
# `client` with value `display_name`
# `df` remains unchanged
df = pd.DataFrame({"number": ([1]*5 + [0]*5 )})
where_there_is_one = df[ df["number"] == 1]
where_there_is_one = pd.concat([
        where_there_is_one,
        pd.Series(["display_name"]*df.shape[0],
                  index=df.index,
                  name="client")
    ],
    join="inner", axis=1)
# Or use assign
where_there_is_one = where_there_is_one.assign(client = "display_name")
Output:
where_there_is_one:
| | number | client |
| --- | --- | --- |
| 0 | 1 | display_name |
| 1 | 1 | display_name |
| 2 | 1 | display_name |
| 3 | 1 | display_name |
| 4 | 1 | display_name |
df:
| | number |
| --- | --- |
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 0 |
| 6 | 0 |
| 7 | 0 |
| 8 | 0 |
| 9 | 0 |
OK, all, I have a similar situation here, but if I use this code: df['Name']='abc'
then instead of 'abc', I want to take the name for the new column from somewhere else in the csv file.
As you can see from the picture, df is not cleaned yet, but I want to create 2 columns: one with the name "ADI dms rivoli", repeated for every row, and the same for "December 2019". I hope this is clear; it was hard to explain, sorry.

Python - Groupby a DataFrameGroupBy object

I have a pandas dataframe in Python to which I am applying a groupby. I then want to apply a new groupby + sum on the previous result. To be more specific, first I am doing:
check_df = data_df.groupby(['hotel_code', 'dp_id', 'market', 'number_of_rooms'])[['market', 'number_of_rooms']]
And then I want to do:
check_df = check_df.groupby(['market'])['number_of_rooms'].sum()
So, I am getting the following error:
AttributeError: Cannot access callable attribute 'groupby' of 'DataFrameGroupBy'
objects, try using the 'apply' method
My initial data look like that:
hotel_code | market | number_of_rooms | ....
---------------------------------------------
001 | a | 200 | ...
001 | a | 200 |
002 | a | 300 | ...
Notice that I may have duplicate pairs like (a - 200); that's why I need the first groupby.
What I want in the end is something like that:
Market | Rooms
--------------
a | 3000
b | 250
I'm just trying to translate the following sql query into python:
select a.market, sum(a.number_of_rooms)
from (
select market, number_of_rooms
from opinmind_dev..cg_mm_booking_dataset_full
group by hotel_code, market, number_of_rooms
) as a
group by market ;
Any ideas how I can fix that? If you need any more info, let me know.
ps. I am new to Python and data science
IIUC, instead of:
check_df = data_df.groupby(['hotel_code', 'dp_id', 'market', 'number_of_rooms'])[['market', 'number_of_rooms']]
You should simply do:
check_df = data_df.drop_duplicates(subset=['hotel_code', 'dp_id', 'market', 'number_of_rooms'])\
.loc[:, ['market', 'number_of_rooms']]\
.groupby('market')\
.sum()
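To illustrate the deduplicate-then-aggregate pattern on a small frame, here is a sketch that mirrors the SQL in the question by deduplicating on hotel_code, market, and number_of_rooms. The sample rows, including hotel 003 in market b, are hypothetical and exist only to make the example runnable.

```python
import pandas as pd

# Hypothetical sample data; the first two rows are exact duplicates.
data_df = pd.DataFrame({
    "hotel_code": ["001", "001", "002", "003"],
    "market": ["a", "a", "a", "b"],
    "number_of_rooms": [200, 200, 300, 250],
})

# Inner query: one row per (hotel_code, market, number_of_rooms).
unique_hotels = data_df.drop_duplicates(
    subset=["hotel_code", "market", "number_of_rooms"]
)

# Outer query: total rooms per market, with market kept as a column.
rooms_per_market = (unique_hotels
                    .groupby("market", as_index=False)["number_of_rooms"]
                    .sum())
print(rooms_per_market)
```

drop_duplicates plays the role of the inner GROUP BY, which in the SQL only serves to collapse repeated rows.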
df = pd.DataFrame({'Market': [1,1,1,2,2,2,3,3], 'Rooms':range(8), 'C':np.random.rand(8)})
Market Rooms C
0 1 0 0.187793
1 1 1 0.325284
2 1 2 0.095147
3 2 3 0.296781
4 2 4 0.022262
5 2 5 0.201078
6 3 6 0.160082
7 3 7 0.683151
You need to move the column selection away from the grouped DataFrame. Either of the following should work.
df.groupby('Market').sum()[['Rooms']]
df[['Rooms']].groupby(df['Market']).sum()
Rooms
Market
1 3
2 12
3 13
If you select using ['Rooms'] instead of [['Rooms']] you will get a Series instead of a DataFrame.
The dataframes produced use Market as their index. If you want to convert it to a normal data column, call reset_index() on the result:
df.groupby('Market').sum()[['Rooms']].reset_index()
Market Rooms
0 1 3
1 2 12
2 3 13
If I understand your question correctly, you could simply do one of the following:
data_df.groupby('Market').agg({'Rooms': 'sum'})
data_df.groupby(['Market'], as_index=False).agg({'Rooms': 'sum'})
For example:
data_df = pd.DataFrame({'Market' : ['A','B','C','B'],
                        'Hotel' : ['H1','H2','H4','H5'],
                        'Rooms' : [20,40,50,34]
                        })
data_df.groupby('Market').agg({'Rooms': 'sum'})
