Pandas: join dataframe composed of different iterations - python

I have a dataframe containing multiple data series, each with 2 columns (0, 1). The data is composed of different iterations of a measurement and is structured like so:
df = pd.DataFrame({
0: ['user', 'x', 1, 4, 7, 10, 'user', 'x', 1, 4, 7, 10, 'user', 'x', 1, 4, 7, 10],
1: ['iteration=0', 'y',5, 7, 9, 12, 'iteration=1', 'y',20, 8, 12, 12, 'iteration=2', 'y',3, 17, 19, 112]
})
0 user iteration=0
1 x y
2 1 5
3 4 7
4 7 9
5 10 12
6 user iteration=1
7 x y
8 1 20
9 4 8
10 7 12
11 10 12
12 user iteration=2
13 x y
14 1 3
15 4 17
16 7 19
17 10 112
I want to plot x vs y grouped by iteration.
I am trying to do this by first creating a single dataframe with the iteration as a column to perform the groupby on:
1 x y iteration
2 1 5 0
3 4 7 0
4 7 9 0
5 10 12 0
8 1 20 1
9 4 8 1
10 7 12 1
11 10 12 1
14 1 3 2
15 4 17 2
16 7 19 2
17 10 112 2
To create this joined dataframe, I implemented this code:
meta = df.loc[df[0] == 'user']
lst = []
ind = 0
for index, row in meta.iterrows():
    if index == 0:  # skip the first marker so each slice runs up to the next one
        continue
    splitvalue = meta.loc[ind][1].split('=')[1]
    temp = df.iloc[ind + 2:index].copy()  # skip the 'user' and 'x'/'y' header rows
    temp['iteration'] = splitvalue
    ind = index
    lst.append(temp)
# the last block runs from the final marker to the end of the frame
temp = df.iloc[ind + 2:].copy()
temp['iteration'] = meta.loc[ind][1].split('=')[1]
lst.append(temp)
joined = pd.concat(lst)
Is there a way to create this joined dataframe without creating lists of sub-dataframes? Or is there a way to plot directly from the original dataframe?

You can use:
numeric = ~pd.Series([isinstance(key, str) for key in df[0]])  # rows holding actual data
iterations = df[1].where(df[1].str.contains('=').fillna(False)).ffill()  # forward-fill the marker rows
iterations = [int(key.replace('iteration=', '')) for key in iterations]
df['iterations'] = iterations
df = df.loc[numeric]
df.columns = ['x', 'y', 'iteration']
df.reset_index(drop=True, inplace=True)
print(df)
x y iteration
0 1 5 0
1 4 7 0
2 7 9 0
3 10 12 0
4 1 20 1
5 4 8 1
6 7 12 1
7 10 12 1
8 1 3 2
9 4 17 2
10 7 19 2
11 10 112 2
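With the tidy frame in hand, the plot the question asks for ("x vs y grouped by iteration") is one groupby loop away. A minimal sketch, assuming matplotlib is available (the column values below are copied from the output above):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# the tidy frame produced by the answer above
tidy = pd.DataFrame({
    'x': [1, 4, 7, 10] * 3,
    'y': [5, 7, 9, 12, 20, 8, 12, 12, 3, 17, 19, 112],
    'iteration': [0] * 4 + [1] * 4 + [2] * 4,
})

fig, ax = plt.subplots()
for it, grp in tidy.groupby('iteration'):
    ax.plot(grp['x'], grp['y'], marker='o', label=f'iteration={it}')
ax.legend()
```

Each pass of the loop draws one line per iteration, which sidesteps building df1/df2/df3 style sub-frames entirely.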

Related

Advanced Slicing with pandas or numpy for a 2pair and 3 pair, in a 5 group

I have a pandas table below which can be copy/pasted and read in with pd.read_clipboard(). I need to take slices of 2 consecutive values followed by 3 consecutive values: each group of 5 rows splits into a pair (the first 2) and a triple (the next 3), as you can see from column y1. So 0,1 form the pair and 2,3,4 form the triple, and the pattern repeats for every group of 5. In the what column, 14,1 is the pair and 4,10,8 is the triple, and the same holds for every group of 5.
what W1 W2 W8 W9 W0 y Y x y4 y1
0 14 4 14 12 14 2 15 4 7 1 1
1 1 11 1 3 1 13 0 14 8 10 1
2 4 14 4 6 4 8 5 5 13 13 1
3 10 0 10 8 10 6 11 9 3 8 1
4 8 2 8 10 8 4 9 12 1 8 1
5 15 15 13 11 15 0 4 15 4 11 11
6 11 11 9 15 11 4 0 9 0 2 11
7 9 9 11 13 9 6 2 0 2 10 11
8 2 2 0 6 2 13 9 9 9 0 11
9 0 0 2 4 0 15 11 15 11 10 11
10 4 6 4 13 4 12 13 6 7 9 9
11 9 11 9 0 9 1 0 1 10 2 9
12 3 1 3 10 3 11 10 10 0 7 9
13 2 0 2 11 2 10 11 3 1 10 9
14 10 8 10 3 10 2 3 12 9 14 9
15 13 13 5 14 13 2 6 13 2 11 11
16 11 11 3 8 11 4 0 4 4 8 11
17 4 4 12 7 4 11 15 7 11 4 11
18 8 8 0 11 8 7 3 7 7 4 11
19 4 4 12 7 4 11 15 9 11 7 11
I have tried this, which gives the right results, but it doesn't repeat.
In [1540]: df['what'][:].to_numpy()[0:2:]
Out[1540]: array([14, 1], dtype=int8)
In [1538]: df['what'][2:].to_numpy()[0:3:]
Out[1538]: array([ 4, 10, 8], dtype=int8)
which is exactly what I want, but it doesn't continue slicing to the end of the list. What I want is for it to continue slicing so I get all the pairs, like below:
array([ 4, 10, 8, 9, 2, 0, 3, 2, 10, 4, 8, 4]) and, on the flip side, array([14, 1, 15, 11, 4, 9, 13, 11])
How do I change my code, or use pandas .loc/.iloc or numpy slicing, to continue slicing like my examples for the entire set?
The reason I need this is that I need to XOR the 2-value groups by one number and the 3-value groups by a separate number, setting each result in its own column at the correct index locations.
Thanks in advance.
Convert the data into numpy, then use a boolean array of True and False values to index the array.
numpy's resize helps in matching the boolean mask to the size of the what array.
#create array
what = df.what.to_numpy()
what
array([14, 1, 4, 10, 8, 15, 11, 9, 2, 0, 4, 9, 3, 2, 10, 13, 11,
4, 8, 4], dtype=int64)
#create array of boolean
#ignore first two entries, gimme the next three entries
index = np.array([False,False,True,True,True])
#resize index to match size of what array
index = np.resize(index,what.shape[0])
what[index]
array([ 4, 10, 8, 9, 2, 0, 3, 2, 10, 4, 8, 4], dtype=int64)
#reverse the direction of the boolean
#keep first two entries, ignore next three
what[~index]
array([14, 1, 15, 11, 4, 9, 13, 11], dtype=int64)
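For the XOR step mentioned in the question, the same mask can route each result to its own column. A sketch, where KEY2 and KEY3 are hypothetical keys (the question does not give the actual numbers):

```python
import numpy as np
import pandas as pd

# the 'what' column from the example
what = np.array([14, 1, 4, 10, 8, 15, 11, 9, 2, 0,
                 4, 9, 3, 2, 10, 13, 11, 4, 8, 4])
df = pd.DataFrame({'what': what})

# False,False,True,True,True resized to the column length, as in the answer
mask = np.resize(np.array([False, False, True, True, True]), len(df))

KEY2, KEY3 = 5, 10  # hypothetical XOR keys

df['xor2'] = np.where(~mask, df['what'] ^ KEY2, np.nan)  # the 2-value groups
df['xor3'] = np.where(mask, df['what'] ^ KEY3, np.nan)   # the 3-value groups
```

np.where keeps every result aligned with its original index, leaving NaN in the rows that belong to the other group.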

Python Group BY Cumsum

I have this DataFrame :
Value Month
0 1
1 2
8 3
11 4
12 5
17 6
0 7
0 8
0 9
0 10
1 11
2 12
7 1
3 2
1 3
0 4
0 5
And I want to create a new variable "Cumsum" like this:
Value Month Cumsum
0 1 0
1 2 1
8 3 9
11 4 20
12 5 32
17 6
0 7
0 8 ...
0 9
0 10
1 11
2 12
7 1 7
3 2 10
1 3 11
0 4 11
0 5 11
Sorry if my code is not clean; I failed to include my dataframe...
My problem is that I do not have only 12 lines (1 line per month) but many more.
On the other hand, I know that my table is tidy, and I want the cumulative sum up to the 12th month, restarting whenever month 1 appears.
Thank you for your help.
Try:
df['Cumsum'] = df.groupby((df.Month == 1).cumsum())['Value'].cumsum()
print(df)
Value Month Cumsum
0 0 1 0
1 1 2 1
2 8 3 9
3 11 4 20
4 12 5 32
5 17 6 49
6 0 7 49
7 0 8 49
8 0 9 49
9 0 10 49
10 1 11 50
11 2 12 52
12 7 1 7
13 3 2 10
14 1 3 11
15 0 4 11
16 0 5 11
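The grouping key above works because (df.Month == 1).cumsum() increments every time a month-1 row appears, so each 12-month run gets its own group id. A small illustration:

```python
import pandas as pd

month = pd.Series([1, 2, 3, 12, 1, 2])
group_id = (month == 1).cumsum()
print(group_id.tolist())  # → [1, 1, 1, 1, 2, 2]
```

Passing that series to groupby then restarts the cumulative sum at every new id, regardless of how many rows each run contains.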
code:
df = pd.DataFrame({'value': [0, 1, 8, 11, 12, 17, 0, 0, 0, 0, 1, 2, 7, 3, 1, 0, 0],
'month': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5]})
temp = int(len(df) / 12)
for i in range(temp + 1):
    start = i * 12
    if i < temp:
        end = (i + 1) * 12 - 1
        df.loc[start:end, 'cumsum'] = df.loc[start:end, 'value'].cumsum()
    else:
        df.loc[start:, 'cumsum'] = df.loc[start:, 'value'].cumsum()
print(df)
output:
value month cumsum
0 0 1 0.0
1 1 2 1.0
2 8 3 9.0
3 11 4 20.0
4 12 5 32.0
5 17 6 49.0
6 0 7 49.0
7 0 8 49.0
8 0 9 49.0
9 0 10 49.0
10 1 11 50.0
11 2 12 52.0
12 7 1 7.0
13 3 2 10.0
14 1 3 11.0
15 0 4 11.0
16 0 5 11.0

I want to split dataframe into train and test set with ranges

import pandas as pd
import numpy as np
data=[]
columns = ['A', 'B', 'C']
data = [[0, 10, 5], [0, 12, 5], [2, 34, 13], [2, 3, 13], [4, 5, 8], [2, 4, 8], [1, 2, 4], [1, 3, 4], [3, 8, 12],[4,10,12],[6,7,12]]
df = pd.DataFrame(data, columns=columns)
print(df)
# A B C
# 0 0 10 5
# 1 0 12 5
# 2 2 34 13
# 3 2 3 13
# 4 4 5 8
# 5 2 4 8
# 6 1 2 4
# 7 1 3 4
# 8 3 8 12
# 9 4 10 12
# 10 6 7 12
Now I want to create two dataframes, df_train and df_test, such that no value of column 'C' appears in both sets. E.g., the element 5 in column C should be entirely in either the training set or the testing set, so the rows [0, 10, 5] and [0, 12, 5] will go together into one of the two sets, but not both. This choice of column-C values should be made randomly.
I am stuck on this step and cannot proceed.
First shuffle your df with sample, then use groupby on C with cumcount to number the duplicated values within each group.
s=df.sample(len(df)).groupby('C').cumcount()
s
Out[481]:
5 0
7 0
2 0
1 0
0 1
6 1
10 0
4 1
3 1
8 1
9 2
dtype: int64
test=df.loc[s[s==1].index]
train=df.loc[s[s==0].index]
test
Out[483]:
A B C
0 0 10 5
6 1 2 4
4 4 5 8
3 2 3 13
8 3 8 12
train
Out[484]:
A B C
5 2 4 8
7 1 3 4
2 2 34 13
1 0 12 5
10 6 7 12
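If the requirement is read strictly (every row sharing a C value must land in exactly one set), another option is to sample the distinct C values themselves and assign whole groups at once. A sketch, where the 0.5 fraction and the random_state are assumptions:

```python
import pandas as pd

columns = ['A', 'B', 'C']
data = [[0, 10, 5], [0, 12, 5], [2, 34, 13], [2, 3, 13], [4, 5, 8],
        [2, 4, 8], [1, 2, 4], [1, 3, 4], [3, 8, 12], [4, 10, 12], [6, 7, 12]]
df = pd.DataFrame(data, columns=columns)

# randomly pick a subset of the distinct C values for the training set
train_c = pd.Series(df['C'].unique()).sample(frac=0.5, random_state=0)
df_train = df[df['C'].isin(train_c)]
df_test = df[~df['C'].isin(train_c)]
```

Because rows are selected by membership of their C value, the two frames can never share a C value, though their sizes will vary with the draw.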
The question is not so clear about what the expected output of the train and test set dataframes should look like.
Anyway, I will try to answer.
I think you can first sort the dataframe values:
df_sorted = df.sort_values(['C'], ascending=[True])
print(df_sorted)
Out[1]:
A B C
6 1 2 4
7 1 3 4
0 0 10 5
1 0 12 5
4 4 5 8
5 2 4 8
8 3 8 12
9 4 10 12
10 6 7 12
2 2 34 13
3 2 3 13
Then split the sorted dataframe:
unique_c = df_sorted['C'].unique().tolist()
print(unique_c)
Out[2]:
[4, 5, 8, 12, 13]
train_set = df[df['C'] <= unique_c[2]]
val_set = df[df['C'] > unique_c[2]]
print(train_set)
# Train set dataframe
Out[3]:
A B C
0 0 10 5
1 0 12 5
4 4 5 8
5 2 4 8
6 1 2 4
7 1 3 4
print(val_set)
# Test set dataframe
Out[4]:
A B C
2 2 34 13
3 2 3 13
8 3 8 12
9 4 10 12
10 6 7 12
From 11 samples, after the split, 6 go to the train set and 5 go to the validation set, so no samples are missing from the two combined dataframes.

Multiple pandas columns

If a have pandas dataframe with 4 columns like this:
A B C D
0 2 4 1 9
1 3 2 9 7
2 1 6 9 2
3 8 6 5 4
is it possible to apply df.cumsum() in some way to get the results in a new column next to existing column like this:
A AA B BB C CC D DD
0 2 2 4 4 1 1 9 9
1 3 5 2 6 9 10 7 16
2 1 6 6 12 9 19 2 18
3 8 14 6 18 5 24 4 22
You can create new columns using assign:
result = df.assign(**{col*2:df[col].cumsum() for col in df})
and order the columns with sort_index:
result.sort_index(axis=1)
# A AA B BB C CC D DD
# 0 2 2 4 4 1 1 9 9
# 1 3 5 2 6 9 10 7 16
# 2 1 6 6 12 9 19 2 18
# 3 8 14 6 18 5 24 4 22
Note that depending on the column names, sorting may not produce the desired order. In that case, using reindex is a more robust way of ensuring you obtain the desired column order:
result = df.assign(**{col*2:df[col].cumsum() for col in df})
result = result.reindex(columns=[item for col in df for item in (col, col*2)])
Here is an example which demonstrates the difference:
import pandas as pd
df = pd.DataFrame({'A': [2, 3, 1, 8], 'A A': [4, 2, 6, 6], 'C': [1, 9, 9, 5], 'D': [9, 7, 2, 4]})
result = df.assign(**{col*2:df[col].cumsum() for col in df})
print(result.sort_index(axis=1))
# A A A A AA A AA C CC D DD
# 0 2 4 4 2 1 1 9 9
# 1 3 2 6 5 9 10 7 16
# 2 1 6 12 6 9 19 2 18
# 3 8 6 18 14 5 24 4 22
result = result.reindex(columns=[item for col in df for item in (col, col*2)])
print(result)
# A AA A A A AA A C CC D DD
# 0 2 2 4 4 1 1 9 9
# 1 3 5 2 6 9 10 7 16
# 2 1 6 6 12 9 19 2 18
# 3 8 14 6 18 5 24 4 22
@unutbu's way certainly works, but using insert reads better to me. Plus, you don't need to worry about sorting/reindexing!
for i, col_name in enumerate(df):
    df.insert(i * 2 + 1, col_name * 2, df[col_name].cumsum())
df
returns
A AA B BB C CC D DD
0 2 2 4 4 1 1 9 9
1 3 5 2 6 9 10 7 16
2 1 6 6 12 9 19 2 18
3 8 14 6 18 5 24 4 22

Split dataframe by certain values in first column?

I have a dataframe like this one:
A C1 C2 Total
PRODUCT1 8 11 19
rs1 5 9 14
rs2 2 2 4
rs3 1 0 1
PRODUCT2 21 12 33
rs7 11 7 18
rs2 7 3 10
rs1 3 1 4
rs9 0 1 1
PRODUCT3 2 11 13
rs9 1 6 7
rs5 1 5 6
Column A is made of strings. I want to split my dataframe by the values in this column, specifically at every uppercase word, like this:
df1 =
PRODUCT1 8 11 19
rs1 5 9 14
rs2 2 2 4
rs3 1 0 1
df2 =
PRODUCT2 21 12 33
rs7 11 7 18
rs2 7 3 10
rs1 3 1 4
rs9 0 1 1
df3 =
PRODUCT3 2 11 13
rs9 1 6 7
rs5 1 5 6
Is there an easy way to achieve this?
import pandas as pd
df = pd.DataFrame({'A': ['PRODUCT1', 'rs1', 'rs2', 'rs3', 'PRODUCT2', 'rs7', 'rs2', 'rs1', 'rs9', 'PRODUCT3', 'rs9', 'rs5'], 'C1': [8, 5, 2, 1, 21, 11, 7, 3, 0, 2, 1, 1], 'C2': [11, 9, 2, 0, 12, 7, 3, 1, 1, 11, 6, 5], 'Total': [19, 14, 4, 1, 33, 18, 10, 4, 1, 13, 7, 6]})
for key, group in df.groupby(df['A'].str.isupper().cumsum()):
    print(group)
prints
A C1 C2 Total
0 PRODUCT1 8 11 19
1 rs1 5 9 14
2 rs2 2 2 4
3 rs3 1 0 1
A C1 C2 Total
4 PRODUCT2 21 12 33
5 rs7 11 7 18
6 rs2 7 3 10
7 rs1 3 1 4
8 rs9 0 1 1
A C1 C2 Total
9 PRODUCT3 2 11 13
10 rs9 1 6 7
11 rs5 1 5 6
The idea here is to identify rows which are uppercase:
In [95]: df['A'].str.isupper()
Out[95]:
0 True
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 False
9 True
10 False
11 False
Name: A, dtype: bool
then use cumsum to take a cumulative sum, where True is treated as 1 and False is treated as 0:
In [96]: df['A'].str.isupper().cumsum()
Out[96]:
0 1
1 1
2 1
3 1
4 2
5 2
6 2
7 2
8 2
9 3
10 3
11 3
Name: A, dtype: int64
These values can be used as group numbers. Pass them to df.groupby to group the DataFrame according to these group numbers. df.groupby(...) returns an iterable, which lets you loop through the sub-groups.
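If you actually need the named sub-frames (the df1, df2, df3 from the question), the groups can be collected into a dict keyed by the PRODUCT row that starts each block. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': ['PRODUCT1', 'rs1', 'rs2', 'rs3', 'PRODUCT2', 'rs7',
                         'rs2', 'rs1', 'rs9', 'PRODUCT3', 'rs9', 'rs5'],
                   'C1': [8, 5, 2, 1, 21, 11, 7, 3, 0, 2, 1, 1],
                   'C2': [11, 9, 2, 0, 12, 7, 3, 1, 1, 11, 6, 5],
                   'Total': [19, 14, 4, 1, 33, 18, 10, 4, 1, 13, 7, 6]})

# key each sub-frame by the uppercase PRODUCT name that starts its block
frames = {grp.iloc[0]['A']: grp.reset_index(drop=True)
          for _, grp in df.groupby(df['A'].str.isupper().cumsum())}
```

A dict scales to any number of blocks, which is usually preferable to binding separate df1/df2/df3 variables.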
