I have a pandas dataframe, shown below, with a Month-Year column. I need to get a continuous dataframe that includes a count of 0 for any month with no rows. The expected output is shown below.
Input dataframe
Month | Count
--------------
Jan-15 | 10
Feb-15 | 100
Mar-15 | 20
Jul-15 | 10
Sep-15 | 11
Oct-15 | 1
Dec-15 | 15
Expected Output
Month | Count
--------------
Jan-15 | 10
Feb-15 | 100
Mar-15 | 20
Apr-15 | 0
May-15 | 0
Jun-15 | 0
Jul-15 | 10
Aug-15 | 0
Sep-15 | 11
Oct-15 | 1
Nov-15 | 0
Dec-15 | 15
You can set the Month column as the index. It looks like Excel input; if so, each month will be parsed as the first of the month (e.g. 01.01.2015), so you can resample it as follows:
df.set_index('Month').resample('MS').asfreq().fillna(0)
Out:
Count
Month
2015-01-01 10.0
2015-02-01 100.0
2015-03-01 20.0
2015-04-01 0.0
2015-05-01 0.0
2015-06-01 0.0
2015-07-01 10.0
2015-08-01 0.0
2015-09-01 11.0
2015-10-01 1.0
2015-11-01 0.0
2015-12-01 15.0
If the Month column is not recognized as a date, you need to convert it first:
df['Month'] = pd.to_datetime(df['Month'], format='%b-%y')
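For reference, a minimal end-to-end sketch of the whole approach (the DataFrame construction is assumed from the sample above):
import pandas as pd

df = pd.DataFrame({'Month': ['Jan-15', 'Feb-15', 'Mar-15', 'Jul-15',
                             'Sep-15', 'Oct-15', 'Dec-15'],
                   'Count': [10, 100, 20, 10, 11, 1, 15]})

# Parse 'Jan-15' style strings into timestamps (first of each month)
df['Month'] = pd.to_datetime(df['Month'], format='%b-%y')

# Resample to month starts, inserting missing months with a count of 0
out = df.set_index('Month').resample('MS').asfreq().fillna(0)

# Optionally restore the original Mon-YY labels and integer counts
out.index = out.index.strftime('%b-%y')
out['Count'] = out['Count'].astype(int)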
Let's say I have the following model:
class DateRange(models.Model):
    start = models.DateTimeField()
    end = models.DateTimeField()
Is there a way to get all pairs of DateRange instances with any overlap in their start to end range? For example, if I had:
id | start | end
-----+---------------------+---------------------
1 | 2020-01-02 12:00:00 | 2020-01-02 16:00:00 # overlap with 2, 3 and 5
2 | 2020-01-02 13:00:00 | 2020-01-02 14:00:00 # overlap with 1 and 3
3 | 2020-01-02 13:30:00 | 2020-01-02 17:00:00 # overlap with 1 and 2
4 | 2020-01-02 10:00:00 | 2020-01-02 10:30:00 # no overlap
5 | 2020-01-02 12:00:00 | 2020-01-02 12:30:00 # overlap with 1
I'd want:
id_1 | id_2
------+-----
1 | 2
1 | 3
1 | 5
2 | 3
Any thoughts on the best way to do this? The order of id_1 and id_2 doesn't matter, but I do need the pairs to be distinct (e.g. id_1=1, id_2=2 is the same as id_1=2, id_2=1 and should not be repeated).
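No answer is included here, but a minimal pure-Python sketch against the model above (the helper name overlapping_pairs is made up for illustration) could look like this:
from itertools import combinations

def overlapping_pairs():
    # Sorting by id up front means combinations() always yields a.id < b.id,
    # so each pair appears exactly once
    ranges = list(DateRange.objects.order_by('id'))
    for a, b in combinations(ranges, 2):
        # Two ranges overlap when each one starts before the other ends
        if a.start < b.end and b.start < a.end:
            yield (a.id, b.id)
This pulls every row into Python and compares all pairs, so it is O(n^2); for large tables you would want to push the overlap predicate into the database instead.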
I have a pandas dataframe with three levels of row indexing, the last of which is a datetime index. There are NaN values, and I am trying to fill them with the average of each row at the datetime level. How can I go about doing this?
data_df
Level 0 | Level 1 | Level 2 |
A 123 2019-01-28 17:00:00 | 3 | 1 | nan
2019-01-28 18:00:00 | 2 | nan | 1
2019-01-28 19:00:00 | nan | nan | 5
234 2019-01-28 05:00:00 | 1 | 1 | 3
2019-01-28 06:00:00 | nan | nan | nan
Some rows may be all NaN values; in that case I want to fill the row with 0s. Some rows may have all values filled in, so imputing with the average isn't needed.
I want the following result:
Level 0 | Level 1 | Level 2 |
A 123 2019-01-28 17:00:00 | 3 | 1 | 2
2019-01-28 18:00:00 | 2 | 1.5 | 1
2019-01-28 19:00:00 | 5 | 5 | 5
234 2019-01-28 05:00:00 | 1 | 1 | 3
2019-01-28 06:00:00 | 0 | 0 | 0
Use DataFrame.mask with the mean per row, then convert the remaining all-NaN rows with DataFrame.fillna:
df = df.mask(df.isna(), df.mean(axis=1), axis=0).fillna(0)
print (df)
a b c
Level 0 Level 1 Level 2
A 123 2019-01-28 17:00:00 3.0 1.0 2.0
2019-01-28 18:00:00 2.0 1.5 1.0
2019-01-28 19:00:00 5.0 5.0 5.0
234 2019-01-28 05:00:00 1.0 1.0 3.0
2019-01-28 06:00:00 0.0 0.0 0.0
Another solution is to use DataFrame.fillna for the replacement, but because df.fillna(df.mean(axis=1), axis=1) is not implemented, a double transpose is necessary:
df = df.T.fillna(df.mean(axis=1)).fillna(0).T
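For reference, a self-contained sketch of the first approach; the column names a, b, c are assumed, since the question leaves them unnamed:
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('A', 123, '2019-01-28 17:00:00'),
     ('A', 123, '2019-01-28 18:00:00'),
     ('A', 123, '2019-01-28 19:00:00'),
     ('A', 234, '2019-01-28 05:00:00'),
     ('A', 234, '2019-01-28 06:00:00')],
    names=['Level 0', 'Level 1', 'Level 2'])
df = pd.DataFrame({'a': [3, 2, np.nan, 1, np.nan],
                   'b': [1, np.nan, np.nan, 1, np.nan],
                   'c': [np.nan, 1, 5, 3, np.nan]}, index=idx)

# Row means skip NaNs, so partially filled rows impute correctly;
# all-NaN rows have a NaN mean and are then zeroed by fillna(0)
df = df.mask(df.isna(), df.mean(axis=1), axis=0).fillna(0)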
I have a Pandas dataframe as shown below:
Voice_Usage | Data_Usage | Revenue | Age | Segment
--------------------------------------------------
300 | 20 | 400 | 35 | 1
700 | 10 | 300 | 40 | 1
100 | 15 | 200 | 32 | 3
150 | 30 | 100 | 20 | 2
450 | 12 | 450 | 54 | 1
900 | 18 | 800 | 17 | 3
... ... ... ... ...
I want to derive a dataframe from the one above in which each Segment type has every variable of the dataframe together with its statistical measures (min, max, mean).
The derived dataframe should be like:
Segment | Variables | Min | Max | Mean |
----------------------------------------
1 Voice_Usage 5 100 50
1 Data_Usage 0 50 30
1 Revenue 50 1500 300
1 Age 10 80 35
2 Voice_Usage 10 200 70
2 Data_Usage 10 90 50
2 Revenue 30 500 200
2 Age 15 60 25
3 Voice_Usage 5 100 500
3 Data_Usage 0 50 30
3 Revenue 50 1500 300
3 Age 10 80 35
...and so on.
How can I derive the second dataframe from the first one? I tried grouping by segment value and aggregating over the other variables, but that did not work. It needs to be generic for any number of variables in the dataframe.
Use melt with DataFrameGroupBy.agg:
df = (df.melt('Segment', var_name='a')
.groupby(['Segment','a'])['value']
.agg(['min','max','mean'])
.reset_index())
print (df)
Segment a min max mean
0 1 Age 35 54 43.000000
1 1 Data_Usage 10 20 14.000000
2 1 Revenue 300 450 383.333333
3 1 Voice_Usage 300 700 483.333333
4 2 Age 20 20 20.000000
5 2 Data_Usage 30 30 30.000000
6 2 Revenue 100 100 100.000000
7 2 Voice_Usage 150 150 150.000000
8 3 Age 17 32 24.500000
9 3 Data_Usage 15 18 16.500000
10 3 Revenue 200 800 500.000000
11 3 Voice_Usage 100 900 500.000000
If you want multiple statistics, use DataFrameGroupBy.describe:
df = (df.melt('Segment', var_name='a')
.groupby(['Segment','a'])['value']
.describe()
.reset_index())
print (df)
Segment a count mean std min 25% 50% \
0 1 Age 3.0 43.000000 9.848858 35.0 37.50 40.0
1 1 Data_Usage 3.0 14.000000 5.291503 10.0 11.00 12.0
2 1 Revenue 3.0 383.333333 76.376262 300.0 350.00 400.0
3 1 Voice_Usage 3.0 483.333333 202.072594 300.0 375.00 450.0
4 2 Age 1.0 20.000000 NaN 20.0 20.00 20.0
5 2 Data_Usage 1.0 30.000000 NaN 30.0 30.00 30.0
6 2 Revenue 1.0 100.000000 NaN 100.0 100.00 100.0
7 2 Voice_Usage 1.0 150.000000 NaN 150.0 150.00 150.0
8 3 Age 2.0 24.500000 10.606602 17.0 20.75 24.5
9 3 Data_Usage 2.0 16.500000 2.121320 15.0 15.75 16.5
10 3 Revenue 2.0 500.000000 424.264069 200.0 350.00 500.0
11 3 Voice_Usage 2.0 500.000000 565.685425 100.0 300.00 500.0
75% max
0 47.00 54.0
1 16.00 20.0
2 425.00 450.0
3 575.00 700.0
4 20.00 20.0
5 30.00 30.0
6 100.00 100.0
7 150.00 150.0
8 28.25 32.0
9 17.25 18.0
10 650.00 800.0
11 700.00 900.0
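For reference, both printed results above can be reproduced with the question's sample data:
import pandas as pd

df = pd.DataFrame({'Voice_Usage': [300, 700, 100, 150, 450, 900],
                   'Data_Usage': [20, 10, 15, 30, 12, 18],
                   'Revenue': [400, 300, 200, 100, 450, 800],
                   'Age': [35, 40, 32, 20, 54, 17],
                   'Segment': [1, 1, 3, 2, 1, 3]})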
Let's say I have purchase records with two fields, Buy and Time.
What I want is a third column holding the time elapsed since the first not-buy, so it looks like:
buy| time | time difference
1 | 8:00 | NULL
0 | 9:01 | NULL
0 | 9:10 | NULL
0 | 9:21 | NULL
1 | 9:31 | 0:30
0 | 9:41 | NULL
0 | 9:42 | NULL
1 | 9:53 | 0:12
How can I achieve this? It seems to me that it's a mix of pd.groupby() and pd.shift(), but I can't seem to work that out in my head.
IIUC
df.time=pd.to_datetime(df.time)
df.loc[df.buy==1,'DIFF']=df.groupby(df.buy.cumsum().shift().fillna(0)).time.transform(lambda x : x.iloc[-1]-x.iloc[0])
df
Out[19]:
buy time timedifference DIFF
0 1 2018-02-26 08:00:00 NaN 00:00:00
1 0 2018-02-26 09:01:00 NaN NaT
2 0 2018-02-26 09:10:00 NaN NaT
3 0 2018-02-26 09:21:00 NaN NaT
4 1 2018-02-26 09:31:00 0:30 00:30:00
5 0 2018-02-26 09:41:00 NaN NaT
6 0 2018-02-26 09:42:00 NaN NaT
7 1 2018-02-26 09:53:00 0:12 00:12:00
# df.buy.cumsum().shift().fillna(0) creates the key for the groupby
# time.transform(lambda x: x.iloc[-1] - x.iloc[0]) computes the difference for each group
# df.loc[df.buy==1, 'DIFF'] assigns the values only at the positions where buy equals 1
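For reference, a runnable version of the same idea, with the question's sample typed in by hand (the DIFF column name follows the answer above):
import pandas as pd

df = pd.DataFrame({'buy': [1, 0, 0, 0, 1, 0, 0, 1],
                   'time': ['8:00', '9:01', '9:10', '9:21',
                            '9:31', '9:41', '9:42', '9:53']})
df['time'] = pd.to_datetime(df['time'])

# Each group starts right after a buy and ends at the next buy,
# so last minus first is the elapsed time since the first not-buy
key = df.buy.cumsum().shift().fillna(0)
df.loc[df.buy == 1, 'DIFF'] = (
    df.groupby(key).time.transform(lambda x: x.iloc[-1] - x.iloc[0]))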
I am querying my database to show records from the past week. I am then aggregating the data and transposing it in Python and pandas into a DataFrame.
In this table I am attempting to show what occurred on each day of the past week; however, on some days no events occur, and in those cases the date is missing altogether. I am looking for an approach to append the dates that are not present (but are part of the date range specified in the query) so that I can then fillna with any value I wish for the other missing columns.
In some trials I have the dates as the index of the DataFrame, and in others the dates are a column. I would prefer to have the dates as the top index - so group by name, stack purchase and send_back, and have the dates as the 'columns'.
Here is an example of how the dataframe looks now and what I am looking for:
Dates set in the query: 01.08.2016 - 08.08.2016. The dataframe looks like so:
| dates | name | purchase | send_back
0 01.08.2016 Michael 120 0
1 02.08.2016 Sarah 100 40
2 04.08.2016 Sarah 55 0
3 05.08.2016 Michael 80 20
4 07.08.2016 Sarah 130 0
After:
| dates | name | purchase | send_back
0 01.08.2016 Michael 120 0
1 02.08.2016 Sarah 100 40
2 03.08.2016 - 0 0
3 04.08.2016 Sarah 55 0
4 05.08.2016 Michael 80 20
5 06.08.2016 - 0 0
6 07.08.2016 Sarah 130 0
7 08.08.2016 Sarah 0 35
8 08.08.2016 Michael 20 0
Printing the following:
df.index
gives:
Index([u'dates', u'name', u'purchase', u'send_back'], dtype='object')
RangeIndex(start=0, stop=1, step=1)
I appreciate any guidance.
Assuming you have the following DF:
In [93]: df
Out[93]:
name purchase send_back
dates
2016-08-01 Michael 120 0
2016-08-02 Sarah 100 40
2016-08-04 Sarah 55 0
2016-08-05 Michael 80 20
2016-08-07 Sarah 130 0
you can resample and replace:
In [94]: df.resample('D').first().replace({'name':{np.nan:'-'}}).fillna(0)
Out[94]:
name purchase send_back
dates
2016-08-01 Michael 120.0 0.0
2016-08-02 Sarah 100.0 40.0
2016-08-03 - 0.0 0.0
2016-08-04 Sarah 55.0 0.0
2016-08-05 Michael 80.0 20.0
2016-08-06 - 0.0 0.0
2016-08-07 Sarah 130.0 0.0
Your index is of object type and you must convert it to datetime format.
from datetime import datetime

# Converting the object dates to datetime
df['dates'] = df['dates'].apply(lambda x: datetime.strptime(x, "%d.%m.%Y"))
# Setting the index column
df.set_index(['dates'], inplace=True)
# Choosing a date range extending from the first date to the last date with daily frequency
new_index = pd.date_range(start=df.index[0], end=df.index[-1], freq='D')
new_index.name = df.index.name
# Setting the new index
df = df.reindex(new_index)
# Making the required modifications (.ix is deprecated, so use .iloc)
df.iloc[:, 0] = df.iloc[:, 0].fillna('-')
df.iloc[:, 1:] = df.iloc[:, 1:].fillna(0)
print (df)
name purchase send_back
dates
2016-08-01 Michael 120.0 0.0
2016-08-02 Sarah 100.0 40.0
2016-08-03 - 0.0 0.0
2016-08-04 Sarah 55.0 0.0
2016-08-05 Michael 80.0 20.0
2016-08-06 - 0.0 0.0
2016-08-07 Sarah 130.0 0.0
Let's suppose you have data for a single day (as mentioned in the comments section) and you would like to fill the other days of the week with null values:
Data Setup:
df = pd.DataFrame({'dates':['01.08.2016'], 'name':['Michael'],
'purchase':[120], 'send_back':[0]})
print (df)
dates name purchase send_back
0 01.08.2016 Michael 120 0
Operations:
df['dates'] = df['dates'].apply(lambda x: datetime.strptime(x, "%d.%m.%Y"))
df.set_index(['dates'], inplace=True)
# Setting periods as 7 to account for the end of the week
new_index = pd.date_range(start=df.index[0], periods=7, freq='D')
new_index.name = df.index.name
# Setting the new index
df = df.reindex(new_index)
print (df)
name purchase send_back
dates
2016-08-01 Michael 120.0 0.0
2016-08-02 NaN NaN NaN
2016-08-03 NaN NaN NaN
2016-08-04 NaN NaN NaN
2016-08-05 NaN NaN NaN
2016-08-06 NaN NaN NaN
2016-08-07 NaN NaN NaN
In case you want to fill the null values with 0s, you could do:
df.fillna(0, inplace=True)
print (df)
name purchase send_back
dates
2016-08-01 Michael 120.0 0.0
2016-08-02 0 0.0 0.0
2016-08-03 0 0.0 0.0
2016-08-04 0 0.0 0.0
2016-08-05 0 0.0 0.0
2016-08-06 0 0.0 0.0
2016-08-07 0 0.0 0.0
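If you would rather keep the '-' placeholder in the name column (as in the earlier outputs) instead of 0, fill that column separately before the blanket fillna, for example:
df['name'] = df['name'].fillna('-')
df = df.fillna(0)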