Python DataFrame conditional group

I have a dataframe that looks like the following.
How do I restructure this table and sum the rows where length is 2 or less, so that the output df looks like the following?
Any suggestion would be greatly appreciated.
Thanks,
Shei

Add a new column to indicate the length group:
df['len_group'] = df['length'].astype(str)
df.loc[df['length']<=2, 'len_group'] = '<=2'
Then groupby on the new column:
df.groupby('len_group')
Test case:
df = pd.DataFrame({'length': [1, 2, 3, 4, 5], 'val': [2, 3, 4, 5, 6]})
   length  val
0       1    2
1       2    3
2       3    4
3       4    5
4       5    6
df['len_group'] = df['length'].astype(str)
df.loc[df['length']<=2, 'len_group'] = '<=2'
df_result = df.groupby('len_group')[['val']].sum()
           val
len_group
3            4
4            5
5            6
<=2          5
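A one-step alternative (just a sketch, not from the answer above, assuming the same df) builds the label column with numpy.where and groups the same way:
import numpy as np
import pandas as pd

df = pd.DataFrame({'length': [1, 2, 3, 4, 5], 'val': [2, 3, 4, 5, 6]})
# bucket lengths of 2 or less together, keep the rest as string labels
df['len_group'] = np.where(df['length'] <= 2, '<=2', df['length'].astype(str))
df_result = df.groupby('len_group')[['val']].sum()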

Related

How to transpose a pandas dataframe using only the first x values according to id?

Initial dataframe looks as follows:
>>> df
id  param
 1      4
 1     15
 1      3
 2      2
 2      7
 4      8
 4      6
 4     11
How do I achieve the following scheme by putting only the first 2 values of each id into a new row? The resulting df should look as follows:
>>> df
col_a  col_b
    4     15
    2      7
    8      6
I tried to achieve this using transpose and iloc but did not succeed.
The column names are just for clarification; it is sufficient if only the index is displayed (e.g. 0, 1, 2, ...).
You can use a double groupby on 'id': first take the first two rows of each group, then aggregate each group's 'param' values into a list and expand that list into new columns. Lastly, rename accordingly:
new = df.groupby('id').head(2).groupby('id',as_index=False).agg({'param':list}).param.apply(pd.Series)
new.columns = ['col_a', 'col_b']
Prints:
   col_a  col_b
0      4     15
1      2      7
2      8      6
You can first take groupby with head(2) and then split the resulting list into chunks of 2 elements:
a = df.groupby("id")['param'].head(2).tolist()
out = pd.DataFrame([a[i:i + 2] for i in range(0, len(a), 2)],columns=['col_a','col_b'])
print(out)
   col_a  col_b
0      4     15
1      2      7
2      8      6
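A third option (a sketch, not from the original answers) avoids manual chunking by numbering the rows within each id with cumcount and pivoting on that number:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 4, 4, 4],
                   'param': [4, 15, 3, 2, 7, 8, 6, 11]})
first2 = df.groupby('id').head(2).copy()
first2['n'] = first2.groupby('id').cumcount()  # 0 or 1 within each id
out = first2.pivot(index='id', columns='n', values='param').reset_index(drop=True)
out.columns = ['col_a', 'col_b']
print(out)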

How to select the last 3 dates in Python

I have a dataset that looks like this:
ID  date
1   01-01-2012
1   05-02-2012
1   25-06-2013
1   14-12-2013
1   10-04-2014
2   19-05-2012
2   07-08-2014
2   10-09-2014
2   27-11-2015
2   01-12-2015
3   15-04-2013
3   17-05-2015
3   22-05-2015
3   30-10-2016
3   02-11-2016
I am working with Python and I would like to select the last 3 dates for each ID. Here is the dataset I would like to get:
ID  date
1   25-06-2013
1   14-12-2013
1   10-04-2014
2   10-09-2014
2   27-11-2015
2   01-12-2015
3   22-05-2015
3   30-10-2016
3   02-11-2016
I used this code to select the very last date for each ID:
df_2=df.sort_values(by=['date']).drop_duplicates(subset='ID',keep='last')
But how can I select more than one date (for example the 3 last dates, or 4 last dates, etc)?
You might use groupby and tail in the following way to get the last 2 items from each group:
import pandas as pd
df = pd.DataFrame({'ID':[1,1,1,2,2,2,3,3,3],'value':['A','B','C','D','E','F','G','H','I']})
df2 = df.groupby('ID').tail(2)
print(df2)
Output:
   ID value
1   1     B
2   1     C
4   2     E
5   2     F
7   3     H
8   3     I
Note that for simplicity's sake I used different (already sorted) data to build df.
You can try this:
df.sort_values(by=['date']).groupby('ID').tail(3).sort_values(['ID', 'date'])
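One caveat worth a sketch: the dates in the question are strings in what looks like day-month-year form, and sorting them lexically puts 01-12-2015 before 10-09-2014. Converting with pd.to_datetime first avoids that (the format string is an assumption based on the sample):
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2],
                   'date': ['05-02-2012', '01-01-2012', '01-12-2015', '10-09-2014']})
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')  # assumed day-month-year
last3 = df.sort_values('date').groupby('ID').tail(3).sort_values(['ID', 'date'])
print(last3)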
I tried this, but with a non-datetime data type:
import pandas as pd
import numpy as np

a = [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3]
b = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o']
a = np.array([a, b])
df = pd.DataFrame(a.T, columns=['ID', 'Date'])
# tail gives you the last n elements of each group
df_ = df.groupby('ID').tail(3)
df_
output:
   ID Date
2   1    c
3   1    d
4   1    e
7   2    h
8   2    i
9   2    j
12  3    m
13  3    n
14  3    o

Converting rows of the same group to a single row with Dask DataFrames

I have a dask dataframe that looks like this:
group  index  col1  col2  col3
1      1      5     3     4
1      2      4     3     7
1      3      1     2     9
-----------------------------
2      2      4     3     7
2      3      1     2     9
2      4      7     4     3
-----------------------------
3      3      1     2     9
3      4      7     4     3
3      5      6     3     2
It's basically a rolling window where each group contains its own row plus x more rows of the dataset. I need to change it to something like this:
group  col1_1  col2_1  col3_1  col1_2  col2_2  col3_2  col1_3  col2_3  col3_3
1      5       3       4       4       3       7       1       2       9
2      4       3       7       1       2       9       7       4       3
3      1       2       9       7       4       3       6       3       2
So for each group I get one row that contains all the values in that group. The number of rows per group is constant across a dataset but can vary between datasets, meaning it could be 10, but then it would be 10 for the whole dataset. In pandas I found a way to do it using this code, which I found on this page: link.
indexCol = dff.index.name
dff.reset_index(inplace=True)
colNames = dff.columns
df = pd.pivot_table(dff, index=[indexCol],
                    columns=dff.groupby(indexCol).cumcount().add(1),
                    values=colNames, aggfunc='sum')
df.columns = df.columns.map('{0[0]}{0[1]}'.format)
The problem is that dask's pivot_table does not work like pandas', and from what I have read it does not support a multiindex, so this code does not work with dask dataframes. I can't call compute() on the dask dataframe either, because the dataset is too big for my memory, so I have to keep it in dask.
Thank you very much for your help.
Well, I figured it out in the end, so I am posting it here:
def series(x):
    di = {}
    for y in x.columns:
        di.update({y + str(i + 1): t for i, t in enumerate(x[y])})
    return pd.Series(di)

dictMeta = {}
for y in colNames:
    # dtypes of the original columns (taken from dff, the source frame)
    dictMeta.update({y + str(i + 1): dff[y].dtype for i in range(0, int(window))})
lista = [(k, dictMeta[k]) for k in dictMeta.keys()]

# We create the 2d dataset for the model
df = dff.groupby(indexCol).apply(lambda x: series(x[colNames]), meta=dictMeta)
where colNames are the columns of the original dataset (col1, col2 and col3 in the question) and indexCol is the name of the groupby column (group in the question). Basically, we create a dictionary for each group and append it to the dataframe as a row. dictMeta supplies the meta, since errors sometimes happen without it.
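To make this concrete, here is a minimal end-to-end sketch with made-up sample data mirroring the question (dd.from_pandas only fakes a dask frame for the demo, and compute() is safe here because the demo frame is tiny):
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'group': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                    'col1': [5, 4, 1, 4, 1, 7, 1, 7, 6],
                    'col2': [3, 3, 2, 3, 2, 4, 2, 4, 3],
                    'col3': [4, 7, 9, 7, 9, 3, 9, 3, 2]})
dff = dd.from_pandas(pdf, npartitions=2)
colNames = ['col1', 'col2', 'col3']
indexCol = 'group'
window = 3  # rows per group

def series(x):
    di = {}
    for y in x.columns:
        di.update({y + str(i + 1): t for i, t in enumerate(x[y])})
    return pd.Series(di)

dictMeta = {y + str(i + 1): pdf[y].dtype for y in colNames for i in range(window)}
wide = dff.groupby(indexCol).apply(lambda x: series(x[colNames]), meta=dictMeta)
print(wide.compute())  # one row per group: col11 ... col33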

How to write a for loop to check whether a particular number exists in multiple columns for each row in Python?

I have a data frame like this:
abc = {'p1': [1,2,3,4,5,6,7,8,9,1],
       'p2': [2,3,4,5,6,7,8,9,1,2],
       'p3': [3,4,5,6,7,8,9,1,2,3]}
I want to add another column that indicates, for each row, whether the number 1 exists in those 3 columns (1 = yes, 0 = no). I have tried this, but got nothing but errors:
is_1st_exist = []
for p in abc['p1'], abc['p2'], abc['p3']:
    if (p[0] | p[1] | p[2] == 1)
        is_1st_exist.append(1)
    else is_1st_exist.append(0)
What should I do to get the is_1st_exist column below?
abc = {'p1': [1,2,3,4,5,6,7,8,9,1],
       'p2': [2,3,4,5,6,7,8,9,1,2],
       'p3': [3,4,5,6,7,8,9,1,2,3],
       'is_1st_exist?': [1,0,0,0,0,0,0,1,1,1]}
First compare all values with DataFrame.eq, then test whether at least one value per row is True with DataFrame.any, and lastly convert to integers:
df = pd.DataFrame(abc)
df['is_1st_exist?'] = df.eq(1).any(axis=1).astype(int)
#alternative
#df['is_1st_exist?'] = np.where(df.eq(1).any(axis=1), 1, 0)
print (df)
   p1  p2  p3  is_1st_exist?
0   1   2   3              1
1   2   3   4              0
2   3   4   5              0
3   4   5   6              0
4   5   6   7              0
5   6   7   8              0
6   7   8   9              0
7   8   9   1              1
8   9   1   2              1
9   1   2   3              1
If you want to specify the columns to test with a list:
cols = ['p1','p2','p3']
df['is_1st_exist?'] = df[cols].eq(1).any(axis=1).astype(int)
You can iterate over 'columns' like this:
is_1st_exist = [0 for i in range(len(abc['p1']))]
for i in range(len(abc['p1'])):
    for k, v in abc.items():
        if v[i] == 1:
            is_1st_exist[i] = 1
abc['is_1st_exist'] = is_1st_exist
But if you have lots of problems like this to solve, you may be better off using the pandas or numpy modules; pandas is good for tabular data of any kind, like Excel, and numpy is something of a MATLAB replacement.
The len(abc['p1']) is just the length of your 'rows'.
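Since pandas and numpy come up, here is a sketch of the same check done on the dict from the question (equivalent to the eq/any approach in the first answer):
import pandas as pd

abc = {'p1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 1],
       'p2': [2, 3, 4, 5, 6, 7, 8, 9, 1, 2],
       'p3': [3, 4, 5, 6, 7, 8, 9, 1, 2, 3]}
df = pd.DataFrame(abc)
# compare against 1 on the underlying numpy array, then reduce across columns
df['is_1st_exist'] = (df[['p1', 'p2', 'p3']].to_numpy() == 1).any(axis=1).astype(int)
print(df)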

Sum all columns with a wildcard name search using Python Pandas

I have a dataframe in python pandas with several columns taken from a CSV file.
For instance, data is:
Day  P1S1  P1S2  P1S3  P2S1  P2S2  P2S3
1    1     2     2     3     1     2
2    2     2     3     5     4     2
And what I need is to get the sum of all columns whose names start with P1... something like P1* with a wildcard.
Something like the following, which gives an error:
P1Sum = data["P1*"]
Is there any way to do this with pandas?
I found the answer. Using the data dataframe from the question:
from pandas import *
P1Channels = data.filter(regex="P1")
P1Sum = P1Channels.sum(axis=1)
A list comprehension over the columns allows more complex filters in the if condition:
In [1]: df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=['P1S1', 'P1S2', 'P2S1'])
In [2]: df
Out[2]:
   P1S1  P1S2  P2S1
0     0     1     2
1     3     4     5
2     6     7     8
3     9    10    11
4    12    13    14
In [3]: df.loc[:, [x for x in df.columns if x.startswith('P1')]].sum(axis=1)
Out[3]:
0     1
1     7
2    13
3    19
4    25
dtype: int64
Thanks for the tip, jbssm. For anyone else looking for a grand total, I ended up adding .sum() at the end:
P1Sum = P1Channels.sum(axis=1).sum()
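One more variant, sketched on made-up data: filter(regex="P1") matches 'P1' anywhere in the name (it would also pick up a hypothetical column like 'XP1S1'), so anchoring with str.startswith is slightly stricter:
import pandas as pd

data = pd.DataFrame({'Day': [1, 2],
                     'P1S1': [1, 2], 'P1S2': [2, 2], 'P1S3': [2, 3],
                     'P2S1': [3, 5], 'P2S2': [1, 4], 'P2S3': [2, 2]})
p1_cols = [c for c in data.columns if c.startswith('P1')]  # anchored at the start
data['P1Sum'] = data[p1_cols].sum(axis=1)
print(data)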
