Groupby a certain number of rows in pandas - Python

I have a dataframe with let's say 2 columns: dates and doubles
2017-05-01 2.5
2017-05-02 3.5
... ...
2017-05-17 0.2
2017-05-18 2.5
Now I would like to do a groupby and sum over every x rows. For example, with 6 rows it would return:
2017-05-06 15.6
2017-05-12 13.4
2017-05-18 18.0
Is there a clean solution to do this without running it through a for-loop with something like this:
temp = pd.DataFrame()
for i in range(0, len(df.index), 6):
    temp[df.ix[i]['date']] = df.ix[i:i+6]['value'].sum()

I guess you are looking for resample. Consider this dataframe:
import numpy as np
import pandas as pd

rng = pd.date_range('2017-05-01', periods=18, freq='D')
num = np.random.randint(5, size=18)
df = pd.DataFrame({'date': rng, 'val': num})
df.resample('6D', on='date').sum().reset_index()
will return
date val
0 2017-05-01 14
1 2017-05-07 11
2 2017-05-13 16

This is an alternative solution, grouping by a range over the length of the dataframe.
For two columns, using agg:
df.groupby(np.arange(len(df))//6).agg(lambda x: {'date': x.date.iloc[0],
                                                 'value': x.value.sum()})
For multiple columns, you can use first (or last) for the date and sum for the other columns.
group = df.groupby(np.arange(len(df))//6)
pd.concat((group['date'].first(),
           group[[c for c in df.columns if c != 'date']].sum()), axis=1)
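For reference, a minimal runnable sketch of this grouping approach on the sample dataframe from the resample answer above (the column names date/val come from that sample, and the values here are only illustrative):

import numpy as np
import pandas as pd

rng = pd.date_range('2017-05-01', periods=18, freq='D')
df = pd.DataFrame({'date': rng, 'val': np.arange(18)})

group = df.groupby(np.arange(len(df)) // 6)   # group labels 0,0,...,1,1,...,2,2,...
out = pd.concat((group['date'].first(), group['val'].sum()), axis=1)
print(out)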

Unpivot in pandas using a column that has multiple value columns

So I have a dataframe like this:
| index | type_of_product | dt_of_product | value_of_product | size_of_product |
|-------|-----------------|---------------|------------------|-----------------|
| 1     | A               | 01/02/22      | 23.1             | 1               |
| 1     | B               | 01/03/22      | 23.2             | 2               |
| 1     | C               | 01/04/22      | 23.3             | 2               |
And I need to unpivot the column type_of_product with the values of dt_of_product, value_of_product and size_of_product.
I tried to use
pd.pivot(df, index="index", column="type_of_product", values=["dt_of_product", "value_of_product", "size_of_product"])
and I want to get this desired output:
| index | A_dt_of_product | B_dt_of_product | C_dt_of_product | A_value_of_product | B_value_of_product | C_value_of_product | A_size_of_product | B_size_of_product | C_size_of_product |
|-------|-----------------|-----------------|-----------------|--------------------|--------------------|--------------------|-------------------|-------------------|-------------------|
| 1     | 01/02/22        | 01/03/22        | 01/04/22        | 23.1               | 23.2               | 23.3               | 1                 | 2                 | 2                 |
Is there a way to do this in pandas with one pivot, or do I have to do 3 pivots and merge all of them?
You can do:
df = df.pivot(index='index',
              values=['dt_of_product', 'value_of_product', 'size_of_product'],
              columns=['type_of_product'])
df.columns = df.columns.swaplevel(0).map('_'.join)
Try:
df = (
    df.set_index(["index", "type_of_product"])
      .unstack(level=1)
      .swaplevel(axis=1)
)
df.columns = map("_".join, df.columns)
print(df.reset_index().to_markdown(index=False))
Prints:
| index | A_dt_of_product | B_dt_of_product | C_dt_of_product | A_value_of_product | B_value_of_product | C_value_of_product | A_size_of_product | B_size_of_product | C_size_of_product |
|-------|-----------------|-----------------|-----------------|--------------------|--------------------|--------------------|-------------------|-------------------|-------------------|
| 1     | 01/02/22        | 01/03/22        | 01/04/22        | 23.1               | 23.2               | 23.3               | 1                 | 2                 | 2                 |
You could try set_index with unstack
s = df.set_index(['index','type_of_product']).unstack().sort_index(level=1,axis=1)
s.columns = s.columns.map('{0[1]}_{0[0]}'.format)
s
Out[30]:
A_dt_of_product A_size_of_product ... C_size_of_product C_value_of_product
index ...
1 01/02/22 1 ... 2 23.3
[1 rows x 9 columns]

Pandas groupby diff removes column

I have a dataframe like this:
d = {'id': ['101_i','101_e','102_i','102_e'], 1: [3, 4, 5, 7], 2: [5,9,10,11], 3: [8,4,3,7]}
df = pd.DataFrame(data=d)
I want to subtract rows which have the same id prefix, i.e. subtract the values of row 101_i from those of 101_e (or vice versa). The code I use for that is:
df['new_identifier'] = [x.upper().replace('E', '').replace('I','').replace('_','') for x in df['id']]
df = df.groupby('new_identifier')[df.columns[1:-1]].diff().dropna()
I get the diff output, but I see that I lose the new column that I create, new_identifier. Is there a way I can retain it?
You can define a specific aggregation function (in this case np.diff for columns 1, 2, and 3) for the columns whose types you know (int or float in this case).
import numpy as np
df.groupby('new_identifier').agg({i: np.diff for i in range(1, 4)}).dropna()
Result:
1 2 3
new_identifier
101 1 4 -4
102 2 1 4
Use Series.str.split to get the groups; you need DataFrame.set_axis() before the GroupBy, after which we use GroupBy.diff:
cols = df.columns.difference(['id'])
groups = df['id'].str.split('_').str[0]
new_df = (
    df.set_axis(groups, axis=0)
      .groupby(level=0)[cols]
      .diff()
      .dropna()
)
print(new_df)
1 2 3
id
101 1.0 4.0 -4.0
102 2.0 1.0 4.0
Detail of the groups:
df['id'].str.split('_').str[0]
0 101
1 101
2 102
3 102
Name: id, dtype: object
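In both answers above, new_identifier ends up as the index rather than a column. If you want it back as an ordinary column, appending a reset_index() to either result should restore it; a minimal sketch building on the second answer (the rename is only needed because the split series keeps the original name id):

new_df = (
    df.set_axis(groups, axis=0)
      .groupby(level=0)[cols]
      .diff()
      .dropna()
      .reset_index()                              # turn the group label back into a column
      .rename(columns={'id': 'new_identifier'})
)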

Pandas - All Columns to 1 New Column

My data frame has 6 columns of dates which I want combined into 1 column.
DATA FRAME IMAGE HERE
The code to make another column is as below:
df['Mega'] = df['Mega'].append(df['RsWeeks','RsMonths','RsDays','PsWeeks','PsMonths','PsDays'])
I am new to Python and pandas and would like to learn more, so please point me to sources as well; I am really bad at debugging since I have no programming background.
The pandas documentation is a great source of good examples, with pages containing a lot of examples and visuals.
For your particular case:
We construct a sample DataFrame:
import pandas as pd
df = pd.DataFrame([
    {"RsWeeks": "2015-11-10", "RsMonths": "2016-08-01"},
    {"RsWeeks": "2015-11-11", "RsMonths": "2015-12-30"}
])
print("DataFrame preview:")
print(df)
Output:
DataFrame preview:
RsWeeks RsMonths
0 2015-11-10 2016-08-01
1 2015-11-11 2015-12-30
We concatenate the columns RsWeeks and RsMonths to create a Series:
my_series = pd.concat([df["RsWeeks"], df["RsMonths"]], ignore_index=True)
print("\nSeries preview:")
print(my_series)
Output:
Series preview:
0 2015-11-10
1 2015-11-11
2 2016-08-01
3 2015-12-30
Edit
If you really need to add the new Series as a column to your DataFrame, you can do the following:
df2 = pd.DataFrame({"Mega": my_series})
df = pd.concat([df, df2], axis=1)
print("\nDataFrame preview:")
print(df)
Output:
DataFrame preview:
RsWeeks RsMonths Mega
0 2015-11-10 2016-08-01 2015-11-10
1 2015-11-11 2015-12-30 2015-11-11
2 NaN NaN 2016-08-01
3 NaN NaN 2015-12-30
Data:
df = pd.DataFrame({"name" : 'Dav Las Oms'.split(),
'age' : [25, 50, 70]})
df['Name'] = list(['a', 'M', 'm'])
df:
name age Name
0 Dav 25 a
1 Las 50 M
2 Oms 70 m
df = pd.DataFrame(df.astype(str).apply('|'.join, axis=1))
df:
0
0 Dav|25|a
1 Las|50|M
2 Oms|70|m
You can use pd.melt(), which reshapes your dataframe from wide to long:
df_reshaped = pd.melt(df, id_vars = ['id_1','id_2','id_3'], var_name = 'new_name', value_name = 'Mega')
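As a rough illustration with the two-column sample frame from the first answer (no id columns here, so id_vars is omitted; the value_name 'Mega' comes from the question and 'source_column' is just an illustrative name):

import pandas as pd

df = pd.DataFrame([
    {"RsWeeks": "2015-11-10", "RsMonths": "2016-08-01"},
    {"RsWeeks": "2015-11-11", "RsMonths": "2015-12-30"},
])

# wide -> long: the listed columns are stacked into one 'Mega' column
df_long = pd.melt(df, value_vars=["RsWeeks", "RsMonths"],
                  var_name="source_column", value_name="Mega")
print(df_long)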

Add a new row to a Pandas DataFrame with specific index name

I'm trying to add a new row to the DataFrame with a specific index name 'e'.
number variable values
a NaN bank true
b 3.0 shop false
c 0.5 market true
d NaN government true
I have tried the following but it's creating a new column instead of a new row.
new_row = [1.0, 'hotel', 'true']
df = df.append(new_row)
Still don't understand how to insert the row with a specific index. Will be grateful for any suggestions.
You can use df.loc[_not_yet_existing_index_label_] = new_row.
Demo:
In [3]: df.loc['e'] = [1.0, 'hotel', 'true']
In [4]: df
Out[4]:
number variable values
a NaN bank True
b 3.0 shop False
c 0.5 market True
d NaN government True
e 1.0 hotel true
PS: using this method you can't add a row with an already existing (duplicate) index value (label); a row with that index label will be updated in this case.
UPDATE:
This might not work in recent pandas/Python 3 if the index is a DatetimeIndex and the new row's index doesn't exist.
It will work if we specify the correct index value(s).
Demo (using pandas: 0.23.4):
In [17]: ix = pd.date_range('2018-11-10 00:00:00', periods=4, freq='30min')
In [18]: df = pd.DataFrame(np.random.randint(100, size=(4,3)), columns=list('abc'), index=ix)
In [19]: df
Out[19]:
a b c
2018-11-10 00:00:00 77 64 90
2018-11-10 00:30:00 9 39 26
2018-11-10 01:00:00 63 93 72
2018-11-10 01:30:00 59 75 37
In [20]: df.loc[pd.to_datetime('2018-11-10 02:00:00')] = [100,100,100]
In [21]: df
Out[21]:
a b c
2018-11-10 00:00:00 77 64 90
2018-11-10 00:30:00 9 39 26
2018-11-10 01:00:00 63 93 72
2018-11-10 01:30:00 59 75 37
2018-11-10 02:00:00 100 100 100
In [22]: df.index
Out[22]: DatetimeIndex(['2018-11-10 00:00:00', '2018-11-10 00:30:00', '2018-11-10 01:00:00',
                        '2018-11-10 01:30:00', '2018-11-10 02:00:00'],
                       dtype='datetime64[ns]', freq=None)
Use append by converting the list to a dataframe, in case you want to add multiple rows at once, i.e.
df = df.append(pd.DataFrame([new_row],index=['e'],columns=df.columns))
Or for single row (Thanks #Zero)
df = df.append(pd.Series(new_row, index=df.columns, name='e'))
Output:
number variable values
a NaN bank True
b 3.0 shop False
c 0.5 market True
d NaN government True
e 1.0 hotel true
If it's the first row you need:
df = pd.DataFrame(columns=['number', 'variable', 'values'])
df.loc['e', ['number', 'variable', 'values']] = [1.0, 'hotel', 'true']
df.loc['e', :] = [1.0, 'hotel', 'true']
should be the correct implementation in case of conflicting index and column names.
DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=False) has been deprecated and will be removed in a future version of pandas.
Source: pandas documentation
The documentation recommends using pd.concat() instead.
It would look like this (if you wanted an empty row with only the added index name):
df = pd.concat([df, pd.Series(index=['New index label'], dtype=str)])
If you wanted to add data use this:
df = pd.concat([df, pd.Series(data, index=['New index label'], dtype=str)])
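For the row from the original question, a concat-based equivalent of the earlier append calls might look like this (a sketch; wrapping the new row in a one-row DataFrame keeps the original column names and lets you set the index label 'e'):

new_row = pd.DataFrame([[1.0, 'hotel', 'true']], index=['e'], columns=df.columns)
df = pd.concat([df, new_row])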
Hope that helps!

Counting the business days between two series

Is there a better way than bdate_range() to measure business days between two columns of dates via pandas?
import pandas as pd

df = pd.DataFrame({'A': ['1/1/2013', '2/2/2013', '3/3/2013'],
                   'B': ['1/12/2013', '4/4/2013', '3/3/2013']})
print(df)
df['A'] = pd.to_datetime(df['A'])
df['B'] = pd.to_datetime(df['B'])
f = lambda x: len(pd.bdate_range(x['A'], x['B']))
df['DIFF'] = df.apply(f, axis=1)
print(df)
With output of:
A B
0 1/1/2013 1/12/2013
1 2/2/2013 4/4/2013
2 3/3/2013 3/3/2013
A B DIFF
0 2013-01-01 00:00:00 2013-01-12 00:00:00 9
1 2013-02-02 00:00:00 2013-04-04 00:00:00 44
2 2013-03-03 00:00:00 2013-03-03 00:00:00 0
Thanks!
brian_the_bungler was onto the most efficient way of doing this using numpy's busday_count:
import numpy as np
A = [d.date() for d in df['A']]
B = [d.date() for d in df['B']]
df['DIFF'] = np.busday_count(A, B)
print(df)
On my machine this is 300x faster on your test case, and thousands of times faster on much larger arrays of dates.
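Note that np.busday_count counts business days over a half-open interval (the end date itself is not counted), while len(pd.bdate_range(start, end)) includes the end date when it falls on a business day, so the two can differ by one. For instance, for the second row of the example above:

import numpy as np
import pandas as pd

np.busday_count('2013-02-02', '2013-04-04')      # 43 -- 2013-04-04 itself not counted
len(pd.bdate_range('2013-02-02', '2013-04-04'))  # 44 -- end date included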
You can use pandas' BDay offset to step through business days between two dates like this:
new_column = some_date - pd.tseries.offsets.BDay(15)
Read more in this conversation: https://stackoverflow.com/a/44288696
It also works if some_date is a single date value, not a series.
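A small sketch of that offset applied to both a single timestamp and a datetime Series (the 15 is just the example count from above, and the dates are arbitrary):

import pandas as pd

ts = pd.Timestamp('2013-01-01')
print(ts - pd.tseries.offsets.BDay(15))           # single value

dates = pd.to_datetime(pd.Series(['2013-01-01', '2013-02-02']))
print(dates - pd.tseries.offsets.BDay(15))        # elementwise on a Series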
