I have a simple data frame which might look like this:
| Label | Average BR_1 | Average BR_2 | Average BR_3 | Average BR_4 |
| ------- | ------------ | ------------ | ------------ | ------------ |
| Label 1 | 50 | 30 | 50 | 50 |
| Label 2 | 60 | 20 | 50 | 50 |
| Label 3 | 65 | 50 | 50 | 50 |
What I would like to do is append a % symbol to every value in each of these columns.
I know that I can do something like this for every column:
df['Average BR_1'] = df['Average BR_1'].astype(str) + '%'
However, the problem is that I read the data from a CSV file which might contain more of these columns; instead of Average BR_1 to Average BR_4, it might contain, say, Average BR_1 to Average BR_10.
So I would like this change to happen automatically for every column whose name contains Average BR_.
I have been reading about .loc, but I only managed to set the column values to an entirely new value, like so:
df.loc[:, ['Average BR_1', 'Average BR_2']] = "Hello"
Also, I haven't yet been able to implement regex here.
I tried with a list:
colsArr = [c for c in df.columns if 'Average BR_' in c]
print(colsArr)
But I did not manage to implement this with .loc.
I suppose I could do this using a loop, but I feel like there must be a better pandas solution that I just cannot figure out.
Could you help and point me in the right direction?
Thank you
# extract the column names that need to be updated
cols = df.columns[df.columns.str.startswith('Average BR')]
# update the columns
df[cols] = df[cols].astype(str).add('%')
print(df)
Label Average BR_1 Average BR_2 Average BR_3 Average BR_4
0 Label 1 50% 30% 50% 50%
1 Label 2 60% 20% 50% 50%
2 Label 3 65% 50% 50% 50%
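If you would rather keep the regex / .loc route from the question, an equivalent sketch (reusing the colsArr idea, and assuming the columns are literally named Average BR_ followed by a number) could be:
# select the columns whose names match a regex on the column Index
cols = df.columns[df.columns.str.contains(r'^Average BR_\d+$')]
# assigning whole columns replaces their dtype cleanly; df.loc[:, cols] = ... works the same way
df[cols] = df[cols].astype(str) + '%'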
You can use df.update and df.filter
df.update(df.filter(like='Average BR_').astype('str').add('%'))
df
Out:
Label Average BR_1 Average BR_2 Average BR_3 Average BR_4
0 Label 1 50% 30% 50% 50%
1 Label 2 60% 20% 50% 50%
2 Label 3 65% 50% 50% 50%
Related
I'm trying to create a column where I sum the previous x rows of a column, with the number of rows to sum given by the parm value in the same row.
I have a solution, but it's really slow, so I was wondering if anyone could help make this a lot faster.
| time | price |parm |
|--------------------------|------------|-----|
|2020-11-04 00:00:00+00:00 | 1.17600 | 1 |
|2020-11-04 00:01:00+00:00 | 1.17503 | 2 |
|2020-11-04 00:02:00+00:00 | 1.17341 | 3 |
|2020-11-04 00:03:00+00:00 | 1.17352 | 2 |
|2020-11-04 00:04:00+00:00 | 1.17422 | 3 |
And here is the (very) slow code:
import numpy as np
from numba import jit  # assuming the decorators shown as #jit were numba's @jit

@jit
def rolling_sum(x, w):
    # all sliding sums of width w over x
    return np.convolve(x, np.ones(w, dtype=int), 'valid')

@jit
def rol(x, y):
    # note: res is the globally defined output array below
    for i in range(len(x)):
        res[i] = rolling_sum(x, y[i])[0]
    return res

dfa = df[:500000]
res = np.empty(len(dfa))
r = rol(dfa.l_x.values, abs(dfa.mb).values + 1)
r
Maybe something like this could work. I have made up an example with to_be_summed being the column whose values should be summed up and lookback holding the number of rows to look back:
df = pd.DataFrame({"to_be_summed": range(10), "lookback":[0,1,2,3,2,1,4,2,1,2]})
summed = df.to_be_summed.cumsum()
result = [summed[i] - (summed[i - lookback - 1] if i - lookback - 1 >= 0 else 0) for i, lookback in enumerate(df.lookback)]
What I did here is to first take a cumsum over the column that should be summed up. Then, for the i-th entry, I take that cumulative sum and subtract the one lookback + 1 steps back. Note that this includes the i-th value in the sum; if you don't want to include it, just change summed[i] to summed[i - 1]. Also note that the guard for i - lookback - 1 < 0 prevents you from accidentally looking back past the start of the column.
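Applied to the frame from the question, a minimal sketch might look like this (assuming price is the column to sum and parm the number of previous rows to include, as the table suggests):
import pandas as pd

df = pd.DataFrame({
    "price": [1.17600, 1.17503, 1.17341, 1.17352, 1.17422],
    "parm": [1, 2, 3, 2, 3],
})

# one cumulative sum, then each row's window sum is a difference of two of its entries
summed = df["price"].cumsum()
df["window_sum"] = [
    summed[i] - (summed[i - n - 1] if i - n - 1 >= 0 else 0)  # current row plus the previous n rows
    for i, n in enumerate(df["parm"])
]
print(df)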
I have a DataFrame that keeps track the item_id with its price over time period:
item_id | item_price | day
1 | 10 | 09-02-2000 # DD-MM-YYYY format
2 | 24 | 10-02-2000
1 | 10 | 20-02-2000
...
As you can see, the price of the item 1 does not change over time. How do I select all of the items that the specific column item_price does not change over time? I tried groupby(), but it does not seem to work right.
EDIT: the desired output is all the item_ids whose price does not change over time, for example item_id_list = [1, ...]. It can be a list or a pandas Series.
Here you go:
df.groupby('item_id').item_price.nunique()
Then keep the ones with exactly one unique price.
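For instance, a small follow-up sketch to get the actual list of IDs (filtering the groups whose number of unique prices equals 1) could be:
# number of distinct prices per item
n_prices = df.groupby('item_id')['item_price'].nunique()
# items whose price never changes
item_id_list = n_prices[n_prices == 1].index.tolist()
print(item_id_list)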
I am working in Python and have a DataFrame like the one below. It contains subject_id (referring to a patient ID), hour_measure from 1 to 22, and other patient measurements:
subject_id | hour_measure | heart rate | urine color | blood pressure
-----------|--------------|------------|-------------|---------------
3          | 1            | 40         | red         | high
3          | 2            | 60         | red         | high
3          | ..           | ..         | ..          | ..
3          | 22           | 90         | red         | high
4          | 3            | 60         | yellow      | low
4          | 3            | 60         | yellow      | low
4          | 22           | 90         | red         | high
I want to aggregate the measurements per subject by max, min, skew, etc. for the numeric features, and by first and last value for the categorical ones.
I wrote the following code:
df = pd.read_csv(path)

# numeric aggregations
df1 = (df.groupby(['subject_id', 'hour_measure'])
         .agg(['sum', 'min', 'max', 'median', 'var', 'skew']))

# mode of the categorical columns
f = lambda x: next(iter(x.mode()), None)
cols = df.select_dtypes(object).columns
df2 = df.groupby(['subject_id', 'hour_measure'])[cols].agg(f)
df2.columns = pd.MultiIndex.from_product([df2.columns, ['mode']])
print(df2)

# combine and write out
df3 = pd.concat([df1, df2], axis=1).unstack().reorder_levels([0, 2, 1], axis=1)
print(df3)
df3.to_csv("newfile.csv")
It gives me the grouping for every hour. I tried to make it group by subject_id only:
df1 = (df.groupby(['subject_id'])
         .agg(['sum', 'min', 'max', 'median', 'var', 'skew']))
But it also gives me the same kind of output and calculates the statistics for every hour, as follows:
subject_id | heart rate_1     | heart rate_2     | ...
           | min | max | mean | min | max | mean | ...
-----------|------------------|------------------|----
3          |                  |                  |
4          |                  |                  |
I want the output to be like the following:
subject_id | heart rate       | respiratory rate | urine color
           | min | max | mean | min | max | mean | first | last
-----------|------------------|------------------|----------------
3          | 50  | 60  | 55   | 40  | 65  | 20   | yellow | red
Can anyone tell me how to edit the code to give the wanted output?
Any help will be appreciated.
Let me know if this gets you close to what you're looking for. I did not run into your issue with grouping by every hour, so I'm not sure I understood your question completely.
# sample dataframe
df = pd.DataFrame(
{
"subject_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
"hour_measure": [1, 22, 12, 5, 18, 21, 8, 18, 4],
"blood_pressure": [
"high",
"high",
"high",
"high",
"low",
"low",
"low",
"low",
"high",
],
}
)
# sort out numeric columns before aggregating them
numeric_result = (
df.select_dtypes(include="number")
.groupby(["subject_id"])
.agg(["min", "max", "mean"])
)
# sort out categorical columns before aggregating them
categorical_result = (
df.set_index(["subject_id"])
.select_dtypes(include="object")
.groupby(["subject_id"])
.agg(["first", "last"])
)
# combine numeric and categorical results
result = numeric_result.join(categorical_result)
hour_measure blood_pressure
min max mean first last
subject_id
1 1 22 11.666667 high high
2 5 21 14.666667 high low
3 4 18 10.000000 low high
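If you want the flat layout from the question (and a CSV file), one possible follow-up, assuming the result frame built above, is to flatten the MultiIndex columns before writing (the file name here is just an example):
# turn the (column, statistic) MultiIndex into single names like 'hour_measure_min'
result.columns = ['_'.join(col) for col in result.columns]
result.to_csv("grouped_by_subject.csv")
print(result.columns.tolist())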
I have a DataFrame, called weights:
| person | age | weight_at_time_1 | weight_at_time_2 |
| Joe | 23 | 280 | 240 |
| Mary | 19 | 111 | 90 |
| Tom | 34 | 150 | 100 |
I want to find the highest weight loss (essentially, the row where the difference in weight is most negative), along with the weight_at_time_1 and weight_at_time_2 values that yielded it, to see the significance of the weight loss, as well as the name of the person who lost it.
weights['delta_weight'] = weights['weight_at_time_2'] - weights['weight_at_time_1']
weights['delta_weight'].min()
This tells me that the most negative change in weight (highest weight loss) was -50.
I want to report back the weight_at_time_1 and weight_at_time_2 which yielded this min().
Is there a way to perhaps retrieve the index for the row at which min() is found? Or do I have to loop through the DataFrame and keep track of that?
Here is one way, using idxmin on the weight difference (weight_at_time_2 - weight_at_time_1, so the largest loss is the smallest value):
df.loc[[(df.weight_at_time_2 - df.weight_at_time_1).idxmin()], :]
person age weight_at_time_1 weight_at_time_2
2 Tom 34 150 100
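To report just the columns the question asks about, a small extension of the same idea (the column list is only an example) might be:
# row label of the biggest weight loss, then select the columns of interest
idx = (df.weight_at_time_2 - df.weight_at_time_1).idxmin()
df.loc[idx, ['person', 'weight_at_time_1', 'weight_at_time_2']]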
If you have multiple max/min you can also use this:
delta = df.weight_at_time_2 - df.weight_at_time_1
df.loc[delta == delta.min()]
To answer your comment:
In [3]: delta = df.weight_at_time_2 - df.weight_at_time_1
In [4]: bool_idx = delta == delta.min()
# In this way, we are actually using the *Boolean indexing*,
# a boolean vectors to filter the data out of a DataFrame
In [5]: bool_idx
Out[5]:
0 False
1 False
2 True
dtype: bool
# These two lines are equivalent, the result is a DataFrame,
# contains all the rows that match the True/False in the
# same position of `bool_idx`
# In [6]: df.loc[bool_idx]
In [6]: df.loc[bool_idx, :]
Out[6]:
person age weight_at_time_1 weight_at_time_2
2 Tom 34 150 100
# To specify the column label, we can get a Series out the
# filtered DataFrame
In [7]: df.loc[bool_idx, 'person']
Out[7]:
2 Tom
Name: person, dtype: object
# To drop the Series data structure
# - use `.values` property to get a `numpy.ndarray`
# - use `.to_list()` method to get a list
In [8]: df.loc[bool_idx, 'person'].values
Out[8]: array(['Tom'], dtype=object)
In [9]: df.loc[bool_idx, 'person'].to_list()
Out[9]: ['Tom']
# Now, at this time I think you must know many ways
# to get only a string 'Tom' out of above results :)
By the way, @WeNYoBen's great answer uses selection by label, while this answer uses selection by boolean indexing.
For a better understanding, I would also suggest reading through the official pandas documentation on Indexing and Selecting Data.
I am new to pandas and would like to know how best to use time-bounded sliding windows and rolling statistics calculations. I process a continuous stream with different rolling calculations (weighted average, mean, sum, max, oldest...) over different time windows (1 hr, 4 hrs, 1 day, 1 week...), also grouped by item ID.
An output stream is produced for each item with its own rolling statistics, but also with statistics from similar items (rows are linked on the closest variable time spans and on previously identified similar items).
I currently have custom code that does not use pandas; its large speed advantage comes from computing rolling statistics differentially (i.e., only applying the difference from new data entering and old data leaving the sliding window) and from linking the variable timespans of similar items as they arrive in the stream. I would like to switch to pandas, but I want to be sure of the expected performance first.
Is there a way to achieve similar (or better) performance with pandas? Specifically:
Does pandas compute each rolling statistic over all of the sliding window's values, or does it update differentially from the new/old values? When creating "custom functions" for rolling statistics, could I also compute differentially to avoid the huge cost of re-processing all the values?
What is the most efficient way to declare multiple rolling statistics over several time windows? If I also want to group this by item, I assume I should just add something like my_stream.groupby(item_key); would that still be efficient?
Output: for each item I output its own rolling statistics and statistics from similar items, but the timespans are variable (from 10 min to 40 min). How could I link each row of one item to the other item's row with the closest older timestamp (i.e., if the time is 02:00 for Item 1, and Item 2 has data at 02:01 and 01:50, I should link to the data from 01:50)? Would that strongly impact performance?
I tried to create a quick illustration, though it is not very easy to show:
Input:
Item | Price | Date
------- | ----- | --------------
1 | 10 | 2014 01:01:01
2 | 20 | 2014 01:01:02
1 | 20 | 2014 01:21:00
1 | 20 | 2014 01:31:01
Output:
Item | Date | Price | Mean1hr | Mean4hr | Mean24hr | Sum1hr | Sum4hr | Sum24hr | SimilarMean1hr | SimilarMean4hr | Similar24hr |
-------|------|--------|-------|-------------|-----------|-------|--------|-------|----------|----------|--------|
1 | 2014 01:21:00 | 15 | 8 | 3 | 30 | 30 | 35 | 16 | 14 | 10 |
Thanks a lot,
Xavier
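A rough sketch of how the time-window rolling and the closest-older-timestamp link might look in pandas, using the Item/Price/Date columns from the illustration above (the 1-hour window and the two-item link are assumptions for illustration; custom functions passed to rolling(...).apply are evaluated per window rather than differentially, so performance would need benchmarking against the existing custom code):
import pandas as pd

# toy data shaped like the input illustration above
df = pd.DataFrame({
    "Item":  [1, 2, 1, 1],
    "Price": [10, 20, 20, 20],
    "Date":  pd.to_datetime([
        "2014-01-01 01:01:01", "2014-01-01 01:01:02",
        "2014-01-01 01:21:00", "2014-01-01 01:31:01",
    ]),
})

# time-based rolling statistics per item: sort by time, group by Item,
# and roll over a time-offset window on the Date index
df = df.sort_values("Date")
rolled = (
    df.set_index("Date")
      .groupby("Item")["Price"]
      .rolling("1h")
      .agg(["mean", "sum"])
      .rename(columns={"mean": "Mean1hr", "sum": "Sum1hr"})
      .reset_index()
)

# closest strictly-older timestamp between two items: merge_asof with
# direction="backward" and allow_exact_matches=False
item1 = rolled[rolled["Item"] == 1].sort_values("Date")
item2 = rolled[rolled["Item"] == 2].sort_values("Date")
linked = pd.merge_asof(
    item1, item2, on="Date",
    suffixes=("", "_similar"),
    direction="backward",
    allow_exact_matches=False,
)
print(linked)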