Summing over months with pandas - python

I know there is a simple implementation for this but I cannot remember the syntax. I have a simple pandas time series and I want to summarize the data by month. Specifically, I want to sum the data over months and years to get a summary of it. I can write it with slicing, but I remember seeing syntax that does it automatically.
import pandas as pd
from numpy.random import randn

df = pd.Series(randn(100), index=pd.date_range('2012-01-01', periods=100))
A MultiIndexed Series with years and sub-indexed months would be first prize.
Partial Answer:
df.resample('M').sum()  # calendar monthly
df.resample('A').sum()  # calendar yearly
Any idea how to elegantly get to sums MultiIndexed by year?

In [1]: import pandas as pd
   ...: from numpy.random import randn

In [2]: df = pd.Series(randn(500), index=pd.date_range('2012-01-01', periods=500))
In [3]: s2 = df.groupby([lambda x: x.year, lambda x: x.month]).sum()
In [4]: s2
Out[4]:
2012  1      3.853775
      2      4.259941
      3      4.629546
      4    -10.812505
      5    -16.383818
      6     -5.255475
      7      5.901344
      8     13.375258
      9      1.758670
      10     6.570200
      11     6.299812
      12     7.237049
2013  1     -1.331835
      2      3.399223
      3      2.011031
      4      7.905396
      5      1.127362
dtype: float64
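For reference, a minimal sketch of the same result on recent pandas versions, where resample(...).sum() replaced the how= argument (the 'M'/'A' aliases used here were renamed to 'ME'/'YE' in pandas 2.2):

import pandas as pd
from numpy.random import randn

s = pd.Series(randn(500), index=pd.date_range('2012-01-01', periods=500))

# Calendar-month and calendar-year totals with the modern API
monthly = s.resample('M').sum()
yearly = s.resample('A').sum()

# Year/month MultiIndex, equivalent to the two-lambda groupby above
by_year_month = s.groupby([s.index.year, s.index.month]).sum()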

Related

Python: Count Unique Value over rolling past 3 days

I have a df that is a time series of user access data
UserID Access Date
a 10/01/2019
b 10/01/2019
c 10/01/2019
a 10/02/2019
b 10/02/2019
d 10/02/2019
e 10/03/2019
f 10/03/2019
a 10/03/2019
b 10/03/2019
a 10/04/2019
b 10/04/2019
c 10/05/2019
I have another df that lists out the dates, and I want to aggregate the count of unique UserIDs over the rolling past 3 days. The expected output would look like below:
Date        Past_3_days_unique_count
10/01/2019  NaN
10/02/2019  NaN
10/03/2019  6
10/04/2019  5
10/05/2019  5
How would I be able to achieve this?
It's quite straightforward - let me walk you through it via the following snippet and its comments.
import pandas as pd
import numpy as np
# Generate some dates
dates = pd.date_range("01-01-2016", "01-10-2016", freq="6H")
# Generate some user ids
ids = np.random.randint(1, 5, len(dates))
df = pd.DataFrame({"id": ids, "date": dates})
# Count the unique IDs for each day
q = df.groupby(df["date"].dt.to_period("D"))["id"].nunique()
# Rolling 3-day sum of those daily unique counts. Note this double-counts
# users who appear on more than one day inside the window.
q.rolling(3).sum()
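If you need the number of distinct users across the whole 3-day window, matching the expected output above, one option is to collect a set of IDs per day and union the sets over a trailing window. A minimal sketch, assuming every calendar day appears in the data so that positional windows line up with day windows:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "UserID": list("abcabdefababc"),
    "Access Date": pd.to_datetime(
        ["10/01/2019"] * 3 + ["10/02/2019"] * 3
        + ["10/03/2019"] * 4 + ["10/04/2019"] * 2 + ["10/05/2019"]
    ),
})

# One set of user IDs per calendar day
daily_sets = df.groupby("Access Date")["UserID"].agg(set)

# Union the sets over a trailing 3-day window and count the members
counts = pd.Series(
    [len(set().union(*daily_sets.iloc[i - 2 : i + 1])) if i >= 2 else np.nan
     for i in range(len(daily_sets))],
    index=daily_sets.index,
    name="Past_3_days_unique_count",
)
print(counts)  # NaN, NaN, 6, 5, 5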
Use pandas groupby; the documentation is very good.

Add all rows from pandas dataframe

My data looks like this:
>>> df
   Jan  Feb  March  April
0    4    6      6      8
1    3    6      8      9
I am working with tslearn. Based on the documentation, the data can be made into a tslearn object as
from tslearn.utils import to_time_series_dataset
ts = to_time_series_dataset([df.iloc[0],df.iloc[1]])
which would be okay if I only had a small number of rows. However, I have about a thousand. I tried
for index, row in df.iterrows():
    ts = to_time_series_dataset(row)
but the ts from this only contains the last row of the dataframe, because each iteration overwrites it.
Try using:
from tslearn.utils import to_time_series_dataset
ts = to_time_series_dataset([i for _,i in df.iterrows()])
Or use:
from tslearn.utils import to_time_series_dataset, load_time_series_txt
ts = load_time_series_txt('filename.txt')
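Since each DataFrame row is already one univariate series, you can also pass the whole 2-D array at once and skip the Python-level loop entirely; a minimal sketch, assuming df is purely numeric:

from tslearn.utils import to_time_series_dataset

# Each row becomes one series; the result has shape (n_rows, n_timestamps, 1)
ts = to_time_series_dataset(df.to_numpy())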

How to delete many columns in python with one line of code? [duplicate]

This question already has answers here:
Deleting multiple columns based on column names in Pandas
(11 answers)
Closed 3 years ago.
I am trying to delete the following columns from my dataframe: 1, 2, 101:117, 121:124, 126.
So far, the two ways I have found to delete columns are:
df.drop(df.columns[2:6],axis=1)
df.drop(df.columns[[0,3,5]],axis=1)
however if I try
df.drop(df.columns[1,2,101:117,121:124],axis=1)
I get a "too many indices" error
I also tried this
a=df.drop(df.columns[[1,2]],axis=1)
b=a.drop(a.columns[99:115],axis=1)
c=b.drop(b.columns[102:105],axis=1)
d=c.drop(c.columns[103],axis=1)
but this isn't deleting the columns I want; the positions shift after each drop, so the later indices no longer line up.
Use np.r_ to build the indexer (a bare tuple of scalars and slices is interpreted as multi-dimensional indexing, hence the "too many indices" error):
import numpy as np
df.drop(columns=df.columns[np.r_[1, 2, 101:117, 121:124, 126]])
For example:
import pandas as pd
df = pd.DataFrame(np.random.randint(1, 10, (2, 130)))
df.drop(columns=df.columns[np.r_[1, 2, 101:117, 121:124, 126]])
# 0 3 4 5 6 ... 120 124 125 127
#0 6 1 3 7 2 ... 8 7 2 6
#1 1 9 2 5 3 ... 7 3 9 4
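For intuition, np.r_ simply concatenates the scalars and slices into one integer position array, which you can inspect directly (shortened ranges for brevity):

import numpy as np

print(np.r_[1, 2, 101:104, 121:124, 126])
# [  1   2 101 102 103 121 122 123 126]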
This should work:
df.drop(df.columns[[indexes_of_columns_you_want_to_delete]], axis=1, inplace=True)
Please try this:
import numpy as np
import pandas as pd
input_df.drop(input_df.columns[np.r_[0, 2:4]], axis=1, inplace=True)

How to get rid of nested column names in Pandas from group by aggregation?

I have the following code that finds the total and unique sales for each employee using a group by with Employee_id and aggregation with Customer_id.
Sales.groupby('Employee_id').agg({
    'Customer_id': [
        ('total_sales', 'count'),
        ('unique_sales', 'nunique')
    ]})
It is important to know that I will perform aggregations with other columns as well, but so far this is all I have written. So if you have a proposed solution, I ask that you please consider that in case it makes a difference.
While this does exactly what I want in terms of computing total and unique sales for each employee and creating two columns, it creates nested column names. So the column names look like, [('Customer_id', 'total_sales'), ('Customer_id', 'unique_sales')], which I don't want. Is there any way to easily get rid of the nested part to only include ['total_sales', 'unique_sales'], or is the easiest thing to just rename the columns once I have finished everything?
Thanks!
You could simply rename the columns:
import numpy as np
import pandas as pd
np.random.seed(2018)
df = pd.DataFrame(np.random.randint(10, size=(100, 3)), columns=['A','B','C'])
result = df.groupby('A').agg({'B': [('D','count'),('E','nunique')],
                              'C': [('F','first'),('G','max')]})
result.columns = result.columns.get_level_values(1)
print(result)
Alternatively, you could save the groupby object, and use grouped[col].agg(...)
to produce sub-DataFrames which can then be pd.concat'ed together:
import numpy as np
import pandas as pd
np.random.seed(2018)
df = pd.DataFrame(np.random.randint(10, size=(100, 3)), columns=['A','B','C'])
grouped = df.groupby('A')
result = pd.concat([grouped['B'].agg([('D','count'),('E','nunique')]),
                    grouped['C'].agg([('F','first'),('G','max')])], axis=1)
print(result)
both code snippets yield the following (though with columns perhaps in a different order):
    D  E  F  G
A
0  18  8  8  9
1  12  8  6  6
2  14  8  0  8
3  10  9  8  9
4   7  6  3  5
5   8  5  6  7
6   9  7  9  9
7   8  6  4  7
8   8  7  2  9
9   6  5  7  9
Overall, I think renaming the columns after-the-fact is the easiest and more readable option.
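As a side note, pandas 0.25+ supports named aggregation, which produces flat column names directly and extends naturally to the other columns mentioned in the question. A sketch using the question's own names:

result = Sales.groupby('Employee_id').agg(
    total_sales=('Customer_id', 'count'),
    unique_sales=('Customer_id', 'nunique'),
    # further aggregations follow the same keyword pattern, e.g. with a
    # hypothetical Region column: first_region=('Region', 'first'),
)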

Python: combining two columns [duplicate]

This question already has answers here:
Combine two columns of text in pandas dataframe
(21 answers)
Closed 5 years ago.
I have two columns, one with the year and another with the month, and I am trying to combine them into one column containing both year and month.
Example:
click_year
-----------
2016
click_month
-----------
11
I want to have
YearMonth
-----------
201611
I tried
date['YearMonth'] = pd.concat((date.click_year, date.click_month))
but it gave me a "cannot reindex from a duplicate axis" error: with the default axis=0, concat stacks the two columns vertically, so the result carries duplicate index labels and cannot be assigned back as a column.
Bill's answer on the post might be what you are looking for.
import pandas as pd
df = pd.DataFrame({'click_year': ['2014', '2015'], 'click_month': ['10', '11']})
>>> df
click_month click_year
0 10 2014
1 11 2015
df['YearMonth'] = df[['click_year','click_month']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
>>> df
click_month click_year YearMonth
0 10 2014 201410
1 11 2015 201511
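Since both columns hold strings in this example, plain concatenation avoids the row-wise apply; a minimal sketch (zfill pads single-digit months, in case they come unpadded):

# String columns concatenate element-wise; zero-pad months so '1' -> '01'
df['YearMonth'] = df['click_year'] + df['click_month'].str.zfill(2)

# If the columns are numeric instead, cast to string first:
# df['YearMonth'] = df['click_year'].astype(str) + df['click_month'].astype(str).str.zfill(2)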
