I have data which looks like this:
month day
1 1 NaN
2 NaN
3 39.529999
4 40.570000
5 40.099998
...
12 27 NaN
28 NaN
29 NaN
30 NaN
31 39.049999
df55.iloc[df55.index.get_level_values('month') == 3]
month day
3 1 37.099998
2 38.060001
3 37.939999
4 37.230000
5 NaN
6 NaN
7 35.869999
8 35.660000
9 36.970001
10 36.660000
11 36.400002
12 NaN
13 NaN
14 36.860001
15 37.380001
16 38.430000
17 38.910000
18 39.000000
19 NaN
20 NaN
21 38.810001
22 39.439999
23 38.709999
24 39.020000
25 39.520000
26 NaN
27 NaN
28 NaN
29 NaN
30 NaN
31 NaN
I want to interpolate() the missing data, but only from month 1 day 1 up to today (month 3, day 26), and leave all the other NaN as is. Could you please advise how to interpolate() only the data in that range?
Your idea to use iloc is good, and since your dataframe appears to be well ordered, you can use dayofyear to slice it:
today = pd.to_datetime('today')
df.iloc[:today.dayofyear] = df.iloc[:today.dayofyear].interpolate()
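For a self-contained illustration, here is a minimal sketch with a reconstructed (month, day) index; it assumes one row per calendar day, in order, so that the iloc position matches the day of year (a leap-day row can shift the cut-off by one in non-leap years):
import numpy as np
import pandas as pd

days = pd.date_range('2020-01-01', '2020-12-31')  # hypothetical full-year calendar
idx = pd.MultiIndex.from_arrays([days.month, days.day], names=['month', 'day'])
df = pd.DataFrame({'value': np.random.rand(len(idx))}, index=idx)
df.iloc[::5, 0] = np.nan  # scatter some NaNs to interpolate

today = pd.to_datetime('today')
# Interpolate only the rows up to today's day of year; later NaNs stay as-is.
df.iloc[:today.dayofyear] = df.iloc[:today.dayofyear].interpolate()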
It seems easiest to temporarily reset the index so you can use a query:
today = pd.to_datetime('today')
idx = df.reset_index().query('month in [1, 2] or (month == @today.month and day < @today.day)').index.max()
df.iloc[:idx + 1] = df.iloc[:idx + 1].interpolate()
Now all values from 1-1 (inclusive) to 3-25 (inclusive) will be non-NaN.
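(reset_index() gives the frame a fresh RangeIndex, so the label found by .index.max() is also a position; since iloc's end point is exclusive, the + 1 keeps the last matching day, 3-25, inside the interpolated slice.)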
I have a data frame like this:
df:
number score
12 NaN
13 NaN
14 NaN
15 NaN
16 10
17 NaN
18 NaN
I want to filter this data frame from the start down to the row where a number first appears in the score column.
So, after filtering the data frame should look like this:
new_df:
number score
12 NaN
13 NaN
14 NaN
15 NaN
16 10
Separately, I also want to filter the data frame from the row where a number first appears in the score column to the end of the data frame.
So, after filtering the data frame should look like this:
new_df:
number score
16 10
17 NaN
18 NaN
How do I filter this data frame?
Kindly help
You can use pd.Series.last_valid_index and pd.Series.first_valid_index like this:
df.loc[df['score'].first_valid_index():]
Output:
number score
4 16 10.0
5 17 NaN
6 18 NaN
And,
df.loc[:df['score'].last_valid_index()]
Output:
number score
0 12 NaN
1 13 NaN
2 14 NaN
3 15 NaN
4 16 10.0
And if you want to clip both leading and trailing NaN, you can combine the two:
df.loc[df['score'].first_valid_index():df['score'].last_valid_index()]
Output:
number score
4 16 10.0
For the first case, you can use a reversed cummax and boolean indexing:
new_df = df[df['score'].notna()[::-1].cummax()]
Output:
number score
0 12 NaN
1 13 NaN
2 14 NaN
3 15 NaN
4 16 10.0
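This works even though the mask comes back in reverse order: boolean indexing aligns on index labels rather than position, and the reversed cummax marks every row from the start down to the last non-NaN score as True.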
For the second one, a simple cummax:
new_df = df[df['score'].notna().cummax()]
Output:
number score
4 16 10.0
5 17 NaN
6 18 NaN
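Both answers can be verified on a small reconstruction of the frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'number': range(12, 19),
                   'score': [np.nan] * 4 + [10] + [np.nan] * 2})

# Slicing with the valid-index helpers:
print(df.loc[:df['score'].last_valid_index()])   # start -> last valid score
print(df.loc[df['score'].first_valid_index():])  # first valid score -> end

# The boolean-mask equivalents:
print(df[df['score'].notna()[::-1].cummax()])    # start -> last valid score
print(df[df['score'].notna().cummax()])          # first valid score -> end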
I have a dataframe, df, with a MultiIndex.
df.columns
Index(['all', 'month', 'day', 'year'], dtype='object')
all month day year
match
7 0 10/24/89 10 24 89
8 0 3/7/86 3 7 86
1 10 NaN NaN 10
9 0 4/10/71 4 10 71
10 0 5/11/85 5 11 85
1 96 NaN NaN 96
2 26 NaN NaN 26
11 0 10 NaN NaN 10
1 4/09/75 4 09 75
12 0 8/01/98 8 01 98
How can I select the rows with more than one entry at the second level of the MultiIndex?
For example, here I need the rows under 8, 10 and 11.
You can use groupby.transform on the first level of the index with len, then take True wherever the length is greater than or equal (ge) to the count you want (here 2). That gives the boolean mask you need to select the rows.
print(df[df.groupby(level=0)['month'].transform(len).ge(2)])
all month day year
match
8 0 3/7/86 3.0 7.0 86
1 10 NaN NaN 10
10 0 5/11/85 5.0 11.0 85
1 96 NaN NaN 96
2 26 NaN NaN 26
11 0 10 NaN NaN 10
1 4/09/75 4.0 9.0 75
Here I use 'month' as the column after the groupby operation, but any column in your dataframe would work.
You can also use groupby.filter and get the same result with:
print(df.groupby(level=0).filter(lambda x: len(x) >= 2))
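A minimal reproducible sketch (index and values loosely reconstructed from the example; only the group sizes matter here):
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(7, 0), (8, 0), (8, 1), (9, 0), (10, 0), (10, 1), (10, 2), (11, 0), (11, 1), (12, 0)],
    names=['match', None])
df = pd.DataFrame({'month': [10, 3, np.nan, 4, 5, np.nan, np.nan, np.nan, 4, 8]}, index=idx)

mask = df.groupby(level=0)['month'].transform(len).ge(2)
print(df[mask])                                           # groups 8, 10 and 11
print(df.groupby(level=0).filter(lambda x: len(x) >= 2))  # same rows
Note that transform('size') is an equivalent, slightly more idiomatic spelling of transform(len).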
My dataframe with Quarter and Week as MultiIndex:
Quarter Week X Y Z
Q1 Q1-W01 1 1 1
Q1-W02 2 2 2
Q1-W03 3 3 3
Q1-W04 4 4 4
Q2 Q2-W15 15 15 15
Q2-W16 16 16 16
Q2-W17 17 17 17
Q2-W18 18 18 18
I am trying to add the last row in Q1 (Q1-W04) to all the rows in Q2 (Q2-W15 through Q2-W18). This is what I would like the dataframe to look like:
Quarter Week X Y Z
Q1 Q1-W01 1 1 1
Q1-W02 2 2 2
Q1-W03 3 3 3
Q1-W04 4 4 4
Q2 Q2-W15 19 19 19
Q2-W16 20 20 20
Q2-W17 21 21 21
Q2-W18 22 22 22
When I try to only specify the level 0 index and sum the specific row, all Q2 values go to NaN.
df.loc['Q2'] += df.loc['Q1','Q1-W04']
Quarter Week X Y Z
Q1 Q1-W01 1 1 1
Q1-W02 2 2 2
Q1-W03 3 3 3
Q1-W04 4 4 4
Q2 Q2-W15 NaN NaN NaN
Q2-W16 NaN NaN NaN
Q2-W17 NaN NaN NaN
Q2-W18 NaN NaN NaN
I have figured out that if I specify both the level 0 and level 1 index, there is no problem.
df.loc['Q2','Q2-W15'] += df.loc['Q1','Q1-W04']
Quarter Week X Y Z
Q1 Q1-W01 1 1 1
Q1-W02 2 2 2
Q1-W03 3 3 3
Q1-W04 4 4 4
Q2 Q2-W15 19 19 19
Q2-W16 16 16 16
Q2-W17 17 17 17
Q2-W18 18 18 18
Is there a way to sum the specific row to all the rows within the Q2 Level 0 index without having to call out each row individually by its level 1 index?
Any insight/guidance would be greatly appreciated.
Thank you.
Try this:
df.loc['Q2'] = (df.loc['Q2'] + df.loc['Q1', 'Q1-W04']).values.tolist()
df.loc returns a DataFrame, and assignment through it aligns on the index; converting the result to a plain list (or array) strips the index, hence the above.
In your case, you need to remove the impact of index alignment:
df.loc['Q2'] += df.loc['Q1', 'Q1-W04'].values
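To see the alignment problem and the fix concretely, here is a minimal sketch with the frame reconstructed from the example above:
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('Q1', f'Q1-W0{i}') for i in range(1, 5)] +
    [('Q2', f'Q2-W{i}') for i in range(15, 19)],
    names=['Quarter', 'Week'])
vals = [1, 2, 3, 4, 15, 16, 17, 18]
df = pd.DataFrame({'X': vals, 'Y': vals, 'Z': vals}, index=idx)

# A plain += would align Q1-W04 against Q2's week labels and produce NaN;
# .values strips the index so the row broadcasts across all Q2 rows.
df.loc['Q2'] += df.loc['Q1', 'Q1-W04'].values
print(df.loc['Q2'])  # X, Y, Z now run 19, 20, 21, 22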
I have a Pandas dataframe, df. Here are the first five rows:
Id StartDate EndDate
0 0 2015-08-11 2018-07-13
1 1 2014-02-15 2016-01-25
2 2 2014-12-20 NaT
3 3 2015-01-09 2015-01-14
4 4 2014-07-20 NaT
I want to construct a new dataframe, df2. df2 should have a row for each month between StartDate and EndDate, inclusive, for each Id in df. For example, since the first row of df has StartDate in August 2015 and EndDate in July 2018, df2 should have rows corresponding to August 2015, September 2015, ..., July 2018. If an Id in df has no EndDate, we will take it to be June 2019.
I would like df2 to use a MultiIndex with the first level being the corresponding Id in df, the second level being the year, and the third level being the month. For example, if the above five rows were all of df, then df2 should look like:
Id Year Month
0 2015 8
9
10
11
12
2016 1
2
3
4
5
6
7
8
9
10
11
12
2017 1
2
3
4
5
6
7
8
9
10
11
12
2018 1
... ... ...
4 2017 1
2
3
4
5
6
7
8
9
10
11
12
2018 1
2
3
4
5
6
7
8
9
10
11
12
2019 1
2
3
4
5
6
The following code does the trick, but takes about 20 seconds on my decent laptop for 10k Ids. Can I be more efficient somehow?
import numpy as np
import pandas as pd

def build_multiindex_for_id_(id_, enroll_month, enroll_year, cancel_month, cancel_year):
    # Given id_ and start/end dates, returns a 2d array to be
    # converted to a multiindex. Each column of the returned array
    # (one per month) becomes a row of the final multiindex, covering
    # the enroll date to the cancel date inclusive.
    year = enroll_year
    month = enroll_month
    multiindex_array = [[], [], []]
    while (month != cancel_month) or (year != cancel_year):
        multiindex_array[0].append(id_)
        multiindex_array[1].append(year)
        multiindex_array[2].append(month)
        month += 1
        if month == 13:
            month = 1
            year += 1
    multiindex_array[0].append(id_)
    multiindex_array[1].append(year)
    multiindex_array[2].append(month)
    return np.array(multiindex_array)

# Begin by constructing the array for the first id.
array_for_multiindex = build_multiindex_for_id_(0, 8, 2015, 7, 2018)
# Append the multiindices for the remaining ids.
for _, row in df.loc[1:].fillna(pd.to_datetime('2019-06-30')).iterrows():
    current_id_array = build_multiindex_for_id_(
        row['Id'],
        row['StartDate'].month,
        row['StartDate'].year,
        row['EndDate'].month,
        row['EndDate'].year)
    array_for_multiindex = np.append(array_for_multiindex, current_id_array, axis=1)

df2_index = pd.MultiIndex.from_arrays(array_for_multiindex).rename(['Id', 'Year', 'Month'])
df2 = pd.DataFrame(index=df2_index)
Here's my approach after some trial and error:
(df.melt(id_vars='Id')
   .fillna(pd.to_datetime('June 2019'))
   .set_index('value')
   .groupby('Id').apply(lambda x: x.asfreq('M').ffill())
   .reset_index('value')
   .assign(year=lambda x: x['value'].dt.year,
           month=lambda x: x['value'].dt.month)
   .set_index(['year', 'month'], append=True)
)
Output:
value Id variable
Id year month
0 2015 8 2015-08-31 NaN NaN
9 2015-09-30 NaN NaN
10 2015-10-31 NaN NaN
11 2015-11-30 NaN NaN
12 2015-12-31 NaN NaN
2016 1 2016-01-31 NaN NaN
2 2016-02-29 NaN NaN
3 2016-03-31 NaN NaN
4 2016-04-30 NaN NaN
5 2016-05-31 NaN NaN
6 2016-06-30 NaN NaN
7 2016-07-31 NaN NaN
8 2016-08-31 NaN NaN
9 2016-09-30 NaN NaN
10 2016-10-31 NaN NaN
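In outline: melt turns each Id's StartDate and EndDate into two timestamped rows (missing end dates filled as June 2019), set_index('value') puts those timestamps on the index, asfreq('M') within each Id group inserts a month-end row for every month in between, and the final assign/set_index derives the year and month index levels from those timestamps. The leftover value and variable columns can be dropped afterwards if only the index is needed.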
I would like to set a value in a pandas dataframe based on the values of another column. In a nutshell, if I wanted to set the values of a column my_column of a dataframe df wherever another column, my_interesting_column, is between 10 and 30, I would like to do something like:
start_index = df.find_closest_index_where["my_interesting_column"].is_closest_to(10)
end_index = df.find_closest_index_where["my_interesting_column"].is_closest_to(30)
df["my_column"].between(start_index, end_index) = some_value
As a simple illustration, suppose I have the following dataframe
df = pd.DataFrame(np.arange(10, 20), columns=list('A'))
df["B"]=np.nan
>>> df
A B
0 10 NaN
1 11 NaN
2 12 NaN
3 13 NaN
4 14 NaN
5 15 NaN
6 16 NaN
7 17 NaN
8 18 NaN
9 19 NaN
How can I do something like
df.where(df["A"].is_between(13,16))= 5
So that the end result looks like:
>>> df
A B
0 10 NaN
1 11 NaN
2 12 NaN
3 13 5
4 14 5
5 15 5
6 16 5
7 17 NaN
8 18 NaN
9 19 NaN
df.loc[start_index:end_index, 'my_column'] = some_value
I think this is what you are looking for:
df.loc[(df['A'] >= 13) & (df['A'] <= 16), 'B'] = 5
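Equivalently, Series.between (inclusive on both ends by default) expresses the same mask:
df.loc[df['A'].between(13, 16), 'B'] = 5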