My DataFrame looks like this:
What I would like to do is: if a person's weight is ever less than 70, drop all rows with that name. So, if Thomas' weight was below 70 at any point, drop all of his rows, and repeat this for every other name.
So in my case the result would be:
Code to rebuild data:
import pandas as pd
from pandas import Timestamp

data = {'date': {0: Timestamp('2014-01-01 00:00:00'),
1: Timestamp('2014-01-02 00:00:00'),
2: Timestamp('2014-01-03 00:00:00'),
3: Timestamp('2014-01-04 00:00:00'),
4: Timestamp('2014-01-05 00:00:00'),
5: Timestamp('2014-01-06 00:00:00'),
6: Timestamp('2014-01-07 00:00:00'),
7: Timestamp('2014-01-08 00:00:00')},
'name': {0: 'Thomas', 1: 'Thomas', 2: 'Thomas', 3: 'Max',
4: 'Max', 5: 'Paul', 6: 'Paul', 7: 'Paul'},
'size': {0: 130, 1: 132, 2: 132, 3: 143, 4: 150, 5: 140,
6: 140, 7: 141},
'weight': {0: 60, 1: 65, 2: 80, 3: 75, 4: 56, 5: 75, 6: 76, 7: 74}}
df = pd.DataFrame(data)
Try as follows:
Select the name column from the df for rows where Series.lt is True, and turn it into a list with Series.tolist. Feed the resulting list to Series.isin and invert the result with the unary operator (~) to select the remaining rows from the df.
res = df[~df.name.isin(df[df.weight.lt(70)].name.tolist())]
print(res)
date name size weight
5 2014-01-06 Paul 140 75
6 2014-01-07 Paul 140 76
7 2014-01-08 Paul 141 74
Or as a variant on this answer to a similar question, try as follows:
Use df.groupby on column name and apply DataFrameGroupBy.filter with a lambda function, keeping a group only if Series.ge is True for all of its values.
res = df.groupby('name').filter(lambda x: x.weight.ge(70).all())
# Same result, written without the method-chaining helpers:
names = list(df[df['weight'] < 70]['name'])
df_new = df[~df['name'].isin(names)]
I'm trying to plot the maximum value per day of a dataframe column (ext_temp):
import pandas as pd
data = {'vin': {0: 'VF1AG0000KF908155', 1: 'VF1AG0000KF908155', 2: 'VF1AG0000KF908155', 3: 'VF1AG0000KF908155', 4: 'VF1AG0000KF908155', 5: 'VF1AG0000KF908155', 6: 'VF1AG0000KF908155', 7: 'VF1AG0000KF908155', 8: 'VF1AG0000KF908155', 9: 'VF1AG0000KF908155'}, 'date': {0: pd.Timestamp('2019-09-27 07:07:02'), 1: pd.Timestamp('2019-09-27 09:23:08'), 2: pd.Timestamp('2019-09-27 09:39:08'), 3: pd.Timestamp('2020-07-15 11:46:41'), 4: pd.Timestamp('2020-07-16 07:17:52'), 5: pd.Timestamp('2020-07-16 09:23:47'), 6: pd.Timestamp('2020-09-11 07:43:05'), 7: pd.Timestamp('2020-09-17 15:00:33'), 8: pd.Timestamp('2020-10-21 06:49:58'), 9: pd.Timestamp('2020-10-21 14:47:33')}, 'sohe': {0: 101, 1: 101, 2: 101, 3: 96, 4: 96, 5: 96, 6: 96, 7: 96, 8: 96, 9: 96}, 'soc': {0: 60, 1: 63, 2: 99, 3: 66, 4: 68, 5: 69, 6: 86, 7: 58, 8: 9, 9: 9}, 'ext_temp': {0: 27, 1: 30, 2: 31, 3: 30, 4: 26, 5: 29, 6: 26, 7: 29, 8: 28, 9: 27}, 'battery_temp': {0: 27, 1: 33, 2: 32, 3: 26, 4: 26, 5: 26, 6: 26, 7: 30, 8: 27, 9: 29}}
df = pd.DataFrame(data)
Unfortunately, when trying to use
nd = "VF1AG0000KF908155"
df = charge[charge.vin==gop]
df = df.groupby(pd.Grouper(key = 'date', freq = 'D'))
fig,ax = plt.subplots()
ax.plot(df.date, df['ext_temp'].max())
I get the following error message:
VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
The warning comes from passing grouped, un-aggregated data to ax.plot; aggregate first, then plot. Note that using pd.Grouper will fill missing days with NaN.
If you don't want missing days filled in, groupby the date component of 'date' by using the .dt extractor.
Use pandas.DataFrame.plot for plotting the dataframe
kind='bar' was used, since there's not much data. For a line plot, use kind='line'.
pd.Grouper
Note the need to use .dropna(), at least to plot the bar plot.
dfg = df.groupby(pd.Grouper(key='date', freq='D'))['ext_temp'].max().dropna()
ax = dfg.plot(kind='bar')
dfg = df.groupby(pd.Grouper(key='date', freq='D'))['ext_temp'].max().dropna()
ax = dfg.plot(kind='line')
.dt.date
Groupby only the date component of the 'date' column
dfg = df.groupby(df.date.dt.date)['ext_temp'].max()
ax = dfg.plot(kind='bar')
My goal is to summarise a person's heart rate over time and visualise it with a boxplot.
Now I have a dataframe with raw data like this:
{'Minute': {0: Timestamp('2015-02-24 00:00:00'),
1: Timestamp('2015-02-24 00:00:30'),
2: Timestamp('2015-02-24 00:01:00'),
3: Timestamp('2015-02-24 00:01:30'),
4: Timestamp('2015-02-24 00:02:00'),
5: Timestamp('2015-02-24 00:02:30')},
'heartrate': {0: 66, 1: 68, 2: 70, 3: 72, 4: 75, 5: 79}}
And I wanted to create a new dataframe summarising the heart rate statistics per minute; here is what I want to have:
{'Hour': {0: '00', 1: '00', 2: '00'},
'Minute': {0: 0, 1: 1, 2: 2},
'Max heart rate': {0: 68, 1: 72, 2: 79},
'Min heart rate': {0: 66, 1: 70, 2: 75},
'Avg heart rate': {0: 67, 1: 71, 2: 77}}
Eventually, I want to use the new dataframe above to plot the heart rate with boxplots and x-axis as time series, like the following one with x-axis as time (minute) and y-axis as heart rate bpm:
And for the datetime part: if there are data from different days, like 1 Feb, 2 Feb, and 3 Feb, all with the hour 9pm and minute 01, how do I differentiate them?
A big thank you to all people who helped!
You could try this:
import pandas as pd
df = pd.DataFrame({'Timestamp': {0: pd.Timestamp('2015-02-24 00:00:00'),
                                 1: pd.Timestamp('2015-02-24 00:00:30'),
                                 2: pd.Timestamp('2015-02-24 00:01:00'),
                                 3: pd.Timestamp('2015-02-24 00:01:30'),
                                 4: pd.Timestamp('2015-02-24 00:02:00'),
                                 5: pd.Timestamp('2015-02-24 00:02:30')},
                   'heartrate': {0: 66, 1: 68, 2: 70, 3: 72, 4: 75, 5: 79}})
# extract the datetime components with the .dt accessor
df['Minute'] = df['Timestamp'].dt.minute
df['Hour'] = df['Timestamp'].dt.hour
df['Day'] = df['Timestamp'].dt.day
df['Month'] = df['Timestamp'].dt.month
df['Year'] = df['Timestamp'].dt.year
df.groupby(['Year', 'Month', 'Day', 'Hour', 'Minute']).agg(
    {'heartrate': ['mean', 'min', 'max']})
I am trying to add one column at the end of another column. I have included a picture that kind of demonstrates what I want to achieve. How can this be done?
For example, in this case I added the age column under the name column
Dummy data:
{'Unnamed: 0': {0: nan, 1: nan, 2: nan, 3: nan},
'age ': {0: 35, 1: 56, 2: 22, 3: 16},
'name': {0: 'andrea', 1: 'juan', 2: 'jose ', 3: 'manuel'},
'sex': {0: 'female', 1: 'male ', 2: 'male ', 3: 'male '}}
One way is to use .append. If your data is in the DataFrame df:
# Split out the relevant parts of your DataFrame
# (note the trailing space in the 'age ' column of the dummy data)
top_df = df[['name', 'sex']]
bottom_df = df[['age ', 'sex']]
# Make the column names match
bottom_df.columns = ['name', 'sex']
# Append the two together
full_df = top_df.append(bottom_df)
You might have to decide on what kind of indexing you want. This method above will have non-unique indexing in full_df, which could be fixed by running the following line:
full_df.reset_index(drop=True, inplace=True)
You can use pd.melt and then drop the variable column with df.drop:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Unnamed: 0': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan},
                   'age ': {0: 35, 1: 56, 2: 22, 3: 16},
                   'name': {0: 'andrea', 1: 'juan', 2: 'jose ', 3: 'manuel'},
                   'sex': {0: 'female', 1: 'male ', 2: 'male ', 3: 'male '}})
# note the trailing space in the 'age ' column name
df.melt(id_vars=['sex'], value_vars=['name', 'age ']).drop(columns='variable')
sex value
0 female andrea
1 male juan
2 male jose
3 male manuel
4 female 35
5 male 56
6 male 22
7 male 16
I have a dataframe with multiple time/date columns:
{'city': {0: 'HOUSTON', 1: 'HOUSTON', 2: 'HOUSTON', 3: 'HOUSTON', 4: 'HOUSTON'}, 'timeDate_1': {0: Timestamp('2017-07-01 08:00:00'), 1: Timestamp('2017-07-01 08:00:00'), 2: Timestamp('2017-07-01 08:00:00'), 3: Timestamp('2017-07-01 08:00:00'), 4: Timestamp('2017-07-01 08:00:00')}, 'hour': {0: 2, 1: 2, 2: 3, 3: 4, 4: 4}, 'timeDate_2': {0: Timestamp('2017-01-07 00:00:00'), 1: Timestamp('2017-01-07 00:00:00'), 2: Timestamp('2017-01-07 00:00:00'), 3: Timestamp('2017-01-07 00:00:00'), 4: Timestamp('2017-01-07 00:00:00')}}
I need to match across these columns: if timeDate_1 equals timeDate_2 (or the hour column), keep the row, and drop all rows where the date and time don't match up. Obviously the easiest way would be to have two different tables and just join on date/time, but I'm in too deep at this point.
The dtypes of each column are:
timeDate_1 datetime64[ns]
hour int64
timeDate_2 datetime64[ns]
Which spits out an error when I do an isin operation:
df[df['timeDate_1'].isin(['timeDate_2', 'hour']) ]
ValueError: ('Unknown string format:', 'timeDate_2')
What's the easiest way to do this? (Besides decoupling all the columns and doing a simple join)
The error occurs because isin treats the strings 'timeDate_2' and 'hour' as literal values and tries to parse them as dates. Pass the columns themselves instead:
df[
df['timeDate_1'].isin(df['timeDate_2'])
| df['timeDate_1'].dt.hour.isin(df['hour'])
]
I have df:
df = pd.DataFrame({'period': {0: pd.Timestamp('2016-05-01 00:00:00'),
1: pd.Timestamp('2017-05-01 00:00:00'),
2: pd.Timestamp('2018-03-01 00:00:00'),
3: pd.Timestamp('2018-04-01 00:00:00'),
4: pd.Timestamp('2016-05-01 00:00:00'),
5: pd.Timestamp('2017-05-01 00:00:00'),
6: pd.Timestamp('2016-03-01 00:00:00'),
7: pd.Timestamp('2016-04-01 00:00:00')},
'cost2': {0: 15,
1: 144,
2: 44,
3: 34,
4: 13,
5: 11,
6: 12,
7: 13},
'rev2': {0: 154,
1: 13,
2: 33,
3: 37,
4: 15,
5: 11,
6: 12,
7: 13},
'cost1': {0: 19,
1: 39,
2: 53,
3: 16,
4: 19,
5: 11,
6: 12,
7: 13},
'rev1': {0: 34,
1: 34,
2: 74,
3: 22,
4: 34,
5: 11,
6: 12,
7: 13},
'destination': {0: 'YYZ',
1: 'YYZ',
2: 'YYZ',
3: 'YYZ',
4: 'DFW',
5: 'DFW',
6: 'DFW',
7: 'DFW'},
'source': {0: 'SFO',
1: 'SFO',
2: 'SFO',
3: 'SFO',
4: 'MIA',
5: 'MIA',
6: 'MIA',
7: 'MIA'}})
df = df[['source','destination','period','rev1','rev2','cost1','cost2']]
which looks like:
I want the final df to have the following columns:
2017-05-01 2016-05-01
source, destination, rev1, rev2, cost1, cost2, rev1, rev2, cost1, cost2...
So essentially, for every source/destination pair, I want revenue and cost numbers for each date in a single row.
I've been tinkering with stack and unstack but haven't been able to achieve my objective.
You can use set_index + unstack to reshape from long to wide, then swaplevel to put the column MultiIndex in the order you need:
(df.set_index(['destination', 'source', 'period'])
   .unstack()
   .swaplevel(0, 1, axis=1)
   .sort_index(level=0, axis=1))
An alternative to .set_index + .unstack is .pivot_table:
df.pivot_table(
    index=['source', 'destination'],
    columns=['period'],
    values=['rev1', 'rev2', 'cost1', 'cost2']
).swaplevel(axis=1).sort_index(axis=1, level=0)
# period 2016-03-01 2016-04-01 ...
# cost1 cost2 rev1 rev2 cost1 cost2 rev1 rev2
# source destination
# MIA DFW 12.0 12.0 12.0 12.0 13.0 13.0 13.0 13.0
# SFO YYZ NaN NaN NaN NaN NaN NaN NaN NaN