Here is some test data:
import numpy as np
import pandas as pd
import datetime
# multi-indexed dataframe via cartesian join
df1 = pd.DataFrame([1, 2, 3])
df2 = pd.DataFrame(pd.date_range(start='2016', end='2018', freq='M'))
df1['key'] = 0
df2['key'] = 0
df = df1.merge(df2, how='outer', on='key')
del df1, df2
del df['key']
df.columns = ['id','date']
df['value'] = pd.DataFrame(np.random.randn(len(df)))
df.set_index(['date', 'id'], inplace=True)
df.sort_index(inplace=True)
df.head()
output:
value
date id
2016-01-31 1 0.245029
2 -2.141292
3 1.521566
2016-02-29 1 0.870639
2 1.407977
There is probably a better way to generate the cartesian join, but I'm new and this is the best I could find to generate panel data that looks like mine. Anyway, my goal is to create a quick table showing the pattern of observations over time, to see if any are missing.
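As an aside, one likely simpler way to build the same panel (a minimal sketch, using the imports above) is pd.MultiIndex.from_product, which computes the cartesian product directly:
idx = pd.MultiIndex.from_product(
    [pd.date_range(start='2016', end='2018', freq='M'), [1, 2, 3]],
    names=['date', 'id'])
df = pd.DataFrame({'value': np.random.randn(len(idx))}, index=idx)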
My goal is to create a year by month table of frequency observations. This is close to what I want:
df.groupby(pd.Grouper(level='date',freq='M')).count()
But it gives a vertical list. My data is much bigger than this small MWE so I'd like to fit it more compactly, as well as see if there are seasonal patterns (i.e. lots of observations in December or June).
It seems to me that this should work but it doesn't:
df.groupby([df.index.levels[0].month, df.index.levels[0].year]).count()
It raises ValueError: Grouper and axis must be same length.
This gives what I'm looking for but it seems to me that it should be easier with the time index:
df.reset_index(inplace=True)
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df.groupby(['month', 'year'])['value'].count().unstack().T
output:
month 1 2 3 4 5 6 7 8 9 10 11 12
year
2016 3 3 3 3 3 3 3 3 3 3 3 3
2017 3 3 3 3 3 3 3 3 3 3 3 3
Also, since this is just a quick validation, I'd rather not reset the index, then re-establish the index (and delete month and year) each time just to see this table.
I think you need Index.get_level_values to select the first level of the MultiIndex (df.index.levels[0] holds only the unique level values, so its length does not match the number of rows, hence the ValueError):
idx = df.index.get_level_values(0)
df1 = df.groupby([idx.year, idx.month])['value'].count().unstack()
Or:
df1 = df.groupby([idx.year, idx.month]).size().unstack()
The difference between count and size is that count omits NaNs while size does not.
print (df1)
date 1 2 3 4 5 6 7 8 9 10 11 12
date
2016 3 3 3 3 3 3 3 3 3 3 3 3
2017 3 3 3 3 3 3 3 3 3 3 3 3
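To make the count versus size distinction concrete, here is a tiny hypothetical example (the column and group names are made up):
s = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1.0, float('nan'), 2.0]})
print(s.groupby('g')['v'].count())   # a: 1, b: 1 -> the NaN is dropped
print(s.groupby('g')['v'].size())    # a: 2, b: 1 -> the NaN is counted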
Related
I have a dataframe that looks like this:
   Column A  Column B Category
0         1         7        A
1         2         8        A
2         3         9        B
3         4        10        B
4         5        11        C
5         6        12        C
I would like to write code to produce the following dataframe:
  Category A          Category B          Category C
  Column A  Column B  Column A  Column B  Column A  Column B
0        1         7         3         9         5        11
1        2         8         4        10         6        12
I've tried pd.pivot_table, but am not able to figure it out. Can someone help me with this please? Thanks!
You can create a dummy index to use pivot table with:
out = df.pivot_table(
columns="Category",
index=df.groupby("Category").cumcount()
)
which has output:
         Column A       Column B
Category        A  B  C        A   B   C
0               1  3  5        7   9  11
1               2  4  6        8  10  12
I don't know if there's any simple way to rearrange the columns to be in your format within pivot_table itself. Here is a way by doing some post processing:
final = out.swaplevel(axis=1).sort_index(axis=1, level=0)
final:
Category        A                   B                   C
         Column A  Column B  Column A  Column B  Column A  Column B
0               1         7         3         9         5        11
1               2         8         4        10         6        12
The issue is that you cannot identify each row uniquely to be able to apply pivot. To this end, create a "within-group" index as follows.
from io import StringIO
import pandas as pd
# setup sample data
data = StringIO("""
Column A;Column B;Category
1;7;A
2;8;A
3;9;B
4;10;B
5;11;C
6;12;C
"""
)
df = pd.read_csv(data, sep=";")
# assign a within-group index
df['id'] = df.groupby('Category').cumcount()
# now apply pivot
df = df.pivot(index='id', columns='Category', values=['Column A', 'Column B'])
Now, you can apply swaplevel and sort_index to match the desired result:
df.swaplevel(axis=1).sort_index(axis=1)
I have a dataset that looks like this:
ID date
1 01-01-2012
1 05-02-2012
1 25-06-2013
1 14-12-2013
1 10-04-2014
2 19-05-2012
2 07-08-2014
2 10-09-2014
2 27-11-2015
2 01-12-2015
3 15-04-2013
3 17-05-2015
3 22-05-2015
3 30-10-2016
3 02-11-2016
I am working with Python and I would like to select the 3 last dates for each ID. Here is the dataset I would like to have:
ID date
1 25-06-2013
1 14-12-2013
1 10-04-2014
2 10-09-2014
2 27-11-2015
2 01-12-2015
3 22-05-2015
3 30-10-2016
3 02-11-2016
I used this code to select the very last date for each ID:
df_2=df.sort_values(by=['date']).drop_duplicates(subset='ID',keep='last')
But how can I select more than one date (for example the 3 last dates, or 4 last dates, etc)?
You might use groupby and tail in the following way to get the last 2 items from each group:
import pandas as pd
df = pd.DataFrame({'ID':[1,1,1,2,2,2,3,3,3],'value':['A','B','C','D','E','F','G','H','I']})
df2 = df.groupby('ID').tail(2)
print(df2)
Output:
ID value
1 1 B
2 1 C
4 2 E
5 2 F
7 3 H
8 3 I
Note that for simplicity's sake I used different (already sorted) data for building df.
You can try this:
df.sort_values(by=['date']).groupby('ID').tail(3).sort_values(['ID', 'date'])
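One caveat, assuming the dates really are DD-MM-YYYY strings as shown: sorting the raw strings will not give chronological order, so it is probably safer to parse them first, for example:
# Parse DD-MM-YYYY strings before sorting (the format is an assumption from the sample)
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')
df_last3 = (df.sort_values('date')
              .groupby('ID')
              .tail(3)
              .sort_values(['ID', 'date']))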
I tried this with a non-datetime data type:
import pandas as pd
import numpy as np
a = [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3]
b = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o']
df = pd.DataFrame(np.array([a, b]).T, columns=['ID', 'Date'])
# the tail would give you the last n number of elements you are interested in
df_ = df.groupby('ID').tail(3)
df_
output:
ID Date
2 1 c
3 1 d
4 1 e
7 2 h
8 2 i
9 2 j
12 3 m
13 3 n
14 3 o
I am trying to add a column to index duplicate rows and order by another column.
Here's the example dataset:
df = pd.DataFrame({'Name': ['A','A','A','B','B','B','B'],
                   'Score': [9,10,10,8,7,8,8],
                   'Year': [2019,2018,2017,2019,2018,2017,2016]})
I want to use ['Name', 'Score'] to identify duplicates, then number the duplicates in order of Year to get the following result:
Here rows 2 and 3 are duplicate rows because they have the same name and score, so I order them by year and assign an index.
Does anyone have a good idea of how to do this in Python? Thank you so much!
You are looking for cumcount:
df['Index'] = (df.sort_values('Year', ascending=False)
.groupby(['Name','Score'])
.cumcount() + 1
)
Output:
Name Score Year Index
0 A 9 2019 1
1 A 10 2018 1
2 A 10 2017 2
3 B 8 2019 1
4 B 7 2018 1
5 B 8 2017 2
6 B 8 2016 3
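An alternative sketch that avoids the explicit sort, assuming the years within each (Name, Score) group are distinct: rank Year in descending order within each group.
df['Index'] = (df.groupby(['Name', 'Score'])['Year']
                 .rank(ascending=False)
                 .astype(int))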
I have the following DataFrame (in reality I'm working with around 20 million rows):
shop month day sale
1 7 1 10
1 6 1 8
1 5 1 9
2 7 1 10
2 6 1 8
2 5 1 9
I want another column, "prev month sale", holding the sale of the previous month for the same shop and day, e.g.
shop month day sale prev month sale
1 7 1 10 8
1 6 1 8 9
1 5 1 9 9
2 7 1 10 8
2 6 1 8 9
2 5 1 9 9
One solution using .concat(), set_index(), and .loc[]:
# Get index of (shop, previous month, day).
# This will serve as a unique index to look up prev. month sale.
prev = pd.concat((df.shop, df.month - 1, df.day), axis=1)
# Unfortunately need to convert to list of tuples for MultiIndexing
prev = pd.MultiIndex.from_arrays(prev.values.T)
# old: [tuple(i) for i in prev.values]
# Now call .loc on df to look up each prev. month sale.
sale_prev_month = df.set_index(['shop', 'month', 'day']).loc[prev]
# And finally just concat rather than merge/join operation
# because we want to ignore index & mimic a left join.
df = pd.concat((df, sale_prev_month.reset_index(drop=True)), axis=1)
shop month day sale sale
0 1 7 1 10 8.0
1 1 6 1 8 9.0
2 1 5 1 9 NaN
3 2 7 1 10 8.0
4 2 6 1 8 9.0
5 2 5 1 9 NaN
Your new column will be float, not int, because of the presence of NaNs.
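If an integer dtype is preferred despite the missing values, one option (assuming a recent pandas, roughly 1.0+, where a float column with NaNs can be cast to the nullable integer dtype) is to convert the looked-up column before concatenating:
# 'Int64' (capital I) is pandas' nullable integer dtype, so the NaNs survive the cast
sale_prev_month = sale_prev_month.astype('Int64')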
Update - an attempt with dask
I don't use dask day to day, so this is probably woefully sub-par. I'm trying to work around the fact that dask does not implement pandas' MultiIndex, so you can concatenate your three existing key columns into a string column and look up on that.
import dask.dataframe as dd
import numpy as np
# Key columns used for the lookups below
on = ['shop', 'month', 'day']
# Play around with npartitions or chunksize here!
df2 = dd.from_pandas(df, npartitions=10)
# Get a *single* index of unique (shop, month, day) IDs
# Dask doesn't support MultiIndex
empty = pd.Series(np.empty(len(df), dtype='object'))  # Passed to `meta`
current = df2.loc[:, on].apply(lambda col: '_'.join(col.astype(str)), axis=1,
                               meta=empty)
prev = df2.loc[:, on].assign(month=df2['month'] - 1)\
.apply(lambda col: '_'.join(col.astype(str)), axis=1, meta=empty)
df2 = df2.set_index(current)
# We now have two dask.Series, `current` and `prev`, in the
# concatenated format "shop_month_day".
# We also have a dask.DataFrame, df2, which is indexed by `current`
# I would think we could just call df2.loc[prev].compute(), but
# that's throwing a KeyError for me, so slightly more expensive:
sale_prev_month = df2.compute().loc[prev.compute()][['sale']]\
.reset_index(drop=True)
# Now just concat as before
# Could re-break into dask objects here if you really needed to
df = pd.concat((df, sale_prev_month.reset_index(drop=True)), axis=1)
I have a pandas DataFrame (df) with many columns. For the sake of simplicity, I am posting three columns with dummy data here.
Timestamp Source Length
0 1 5
1 1 5
2 1 5
3 2 5
4 2 5
5 3 5
6 1 5
7 3 5
8 2 5
9 1 5
Using pandas functions, I first set Timestamp as the index of the df.
index = pd.DatetimeIndex(data[data.columns[1]]*10**9) # Convert timestamp
df = df.set_index(index) # Set Timestamp as index
Next I can use the groupby and pd.TimeGrouper functions to group the data into 5-second bins and compute the cumulative length for each bin as follows:
df_length = data[data.columns[5]].groupby(pd.TimeGrouper('5S')).sum()
So the df_length dataframe should look like:
Timestamp Length
0 25
5 25
Now the problem is: I want to keep the same 5-second bins, but compute the cumulative length for each Source (1, 2 and 3) in separate columns, in the following format:
Timestamp 1 2 3
0 15 10 0
5 10 5 10
I think I can use df.groupby with some conditions to get it, but I'm confused and tired now :(
I'd appreciate a solution using pandas functions only.
You can also group by the Source column, which produces a Series with a MultiIndex, and then reshape it by unstacking the last level of the MultiIndex into columns:
print (df[df.columns[2]].groupby([pd.TimeGrouper('5S'), df['Source']]).sum())
Timestamp Source
1970-01-01 00:00:00 1 15
2 10
1970-01-01 00:00:05 1 10
2 5
3 10
Name: Length, dtype: int64
df1 = (df[df.columns[2]].groupby([pd.TimeGrouper('5S'), df['Source']])
                        .sum()
                        .unstack(fill_value=0))
print (df1)
Source 1 2 3
Timestamp
1970-01-01 00:00:00 15 10 0
1970-01-01 00:00:05 10 5 10
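A side note for newer environments (an assumption, since the answer above targets an older pandas): pd.TimeGrouper has since been removed, and pd.Grouper is the replacement, e.g.:
# Equivalent with pd.Grouper; 'Length' is the column name from the sample data
df1 = (df['Length'].groupby([pd.Grouper(freq='5S'), df['Source']])
                   .sum()
                   .unstack(fill_value=0))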