I have two dataframes, df1:
dt AAPL AMC AMZN ASO ATH ... SPCE SRNE TH TSLA VIAC WKHS
0 2021-04-12 36 28 6 20 1 ... 5 0 0 50 23 0
1 2021-04-13 46 15 5 16 6 ... 5 0 0 122 12 1
2 2021-04-14 12 4 1 5 2 ... 2 0 0 39 1 0
3 2021-04-15 30 23 3 14 2 ... 15 0 0 101 9 0
and df2:
dt AAPL AMC AMZN ASO ATH ... SPCE SRNE TH TSLA VIAC WKHS
0 2021-04-12 41 28 4 33 10 ... 5 0 0 56 14 3
1 2021-04-13 76 22 7 12 29 ... 4 0 0 134 8 2
2 2021-04-14 21 15 2 7 16 ... 2 0 0 61 3 0
3 2021-04-15 54 43 9 2 31 ... 16 0 0 83 13 1
I want to remove numbers lower than 10 from both dataframes. If a value is removed from one dataframe, the same cell should be removed from the other dataframe, and vice versa.
Appreciate your help.
Use a mask:
## pre-requisite
df1 = df1.set_index('dt')
df2 = df2.set_index('dt')
## processing
mask = df1.lt(10) | df2.lt(10)
df1 = df1.mask(mask)
df2 = df2.mask(mask)
output:
>>> df1
AAPL AMC AMZN ASO ATH SPCE SRNE TH TSLA VIAC WKHS
dt
2021-04-12 36 28.0 NaN 20.0 NaN NaN NaN NaN 50 23.0 NaN
2021-04-13 46 15.0 NaN 16.0 NaN NaN NaN NaN 122 NaN NaN
2021-04-14 12 NaN NaN NaN NaN NaN NaN NaN 39 NaN NaN
2021-04-15 30 23.0 NaN NaN NaN 15.0 NaN NaN 101 NaN NaN
>>> df2
AAPL AMC AMZN ASO ATH SPCE SRNE TH TSLA VIAC WKHS
dt
2021-04-12 41 28.0 NaN 33.0 NaN NaN NaN NaN 56 14.0 NaN
2021-04-13 76 22.0 NaN 12.0 NaN NaN NaN NaN 134 NaN NaN
2021-04-14 21 NaN NaN NaN NaN NaN NaN NaN 61 NaN NaN
2021-04-15 54 43.0 NaN NaN NaN 16.0 NaN NaN 83 NaN NaN
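Note that mask writes NaN wherever the condition is True, which is why the masked columns come out as floats above. As a quick self-contained check of the recipe, here is a minimal two-column sketch built from the values in the example:

import pandas as pd

# two tiny stand-ins for the real frames
df1 = pd.DataFrame({'dt': ['2021-04-12', '2021-04-13'],
                    'AAPL': [36, 46], 'AMZN': [6, 5]}).set_index('dt')
df2 = pd.DataFrame({'dt': ['2021-04-12', '2021-04-13'],
                    'AAPL': [41, 76], 'AMZN': [4, 7]}).set_index('dt')

# a cell is blanked in BOTH frames if it is < 10 in EITHER frame
mask = df1.lt(10) | df2.lt(10)
df1, df2 = df1.mask(mask), df2.mask(mask)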
Suppose that I have 2 dataframes, with indexes populated so that the elements in each column are unique (because in the real data they are):
vals = pd.DataFrame(np.random.randint(0,100,(10, 3)), columns=list('ABC'))
indexes = pd.DataFrame(np.argsort(np.random.randint(0,10,(10, 3)), axis=0)[:5], columns=list('ABC'))
>>> vals
A B C
0 64 20 48
1 28 60 81
2 5 73 77
3 74 66 86
4 41 39 21
5 65 37 98
6 10 20 73
7 6 70 3
8 36 29 28
9 43 13 12
>>> indexes
A B C
0 4 2 3
1 3 3 8
2 5 1 7
3 9 8 9
4 2 4 0
I would like to retain only those values in vals whose indexes are listed in indexes. I don't care about row integrity or NAs, as I'll use the columns as Series later.
This is what I came up with:
vals_indexes = pd.DataFrame()
for i in range(vals.shape[1]):
    vals_indexes = pd.concat([vals_indexes, vals.iloc[[e for e in indexes.iloc[:, i] if e in vals.index], i]], axis=1)
>>> vals_indexes
A B C
0 NaN NaN 48.0
1 NaN 60.0 NaN
2 5.0 73.0 NaN
3 74.0 66.0 86.0
4 41.0 39.0 NaN
5 65.0 NaN NaN
7 NaN NaN 3.0
8 NaN 29.0 28.0
9 43.0 NaN 12.0
Which is a bit ugly, but works for me. Question: is there a more effective way to do this?
Use .loc within a loop to replace the rows whose index is not listed in indexes with NaN:
for col in vals.columns:
    vals.loc[~vals.index.isin(indexes[col]), col] = np.nan
print(vals)
      A     B     C
0   NaN   NaN  48.0
1   NaN  60.0   NaN
2   5.0  73.0   NaN
3  74.0  66.0  86.0
4  41.0  39.0   NaN
5  65.0   NaN   NaN
6   NaN   NaN   NaN
7   NaN   NaN   3.0
8   NaN  29.0  28.0
9  43.0   NaN  12.0
Row 6 survives as an all-NaN row because .loc keeps every row, which is fine since you don't care about NAs.
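If the loop bothers you, the same selection can be built in one shot with a NumPy boolean mask and fancy indexing (a sketch assuming, as in the example, that indexes holds positional row labels):

import numpy as np

# mark True at every (row, column) position named by `indexes`
mask = np.zeros(vals.shape, dtype=bool)
mask[indexes.to_numpy(), np.arange(vals.shape[1])] = True
vals_indexes = vals.where(mask)  # everything else becomes NaN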
How can I separate each dataframe with an empty row?
I've combined them using this snippet:
frames1 = [df4, df5, df6]
Summary = pd.concat(frames1)
So how can I split them with an empty row?
You can use the example below.
Create test dfs:
df1 = pd.DataFrame(np.random.randint(0,20,20).reshape(5,4),columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randint(0,20,20).reshape(5,4),columns=list('ABCD'))
df3 = pd.DataFrame(np.random.randint(0,20,20).reshape(5,4),columns=list('ABCD'))
dfs=[df1,df2,df3]
Solution:
# append an all-NaN spacer row to each frame, then stack
# (DataFrame.append was removed in pandas 2.0, so build the row explicitly)
pd.concat([pd.concat([df, pd.DataFrame([[np.nan] * df.shape[1]], columns=df.columns)],
                     ignore_index=True) for df in dfs])
A B C D
0 17.0 16.0 15.0 7.0
1 13.0 6.0 12.0 18.0
2 0.0 2.0 10.0 17.0
3 8.0 13.0 10.0 17.0
4 4.0 18.0 8.0 19.0
5 NaN NaN NaN NaN
0 14.0 0.0 13.0 12.0
1 10.0 3.0 6.0 3.0
2 15.0 10.0 15.0 3.0
3 9.0 16.0 11.0 4.0
4 5.0 7.0 6.0 2.0
5 NaN NaN NaN NaN
0 10.0 18.0 13.0 12.0
1 1.0 6.0 10.0 0.0
2 2.0 19.0 4.0 18.0
3 4.0 3.0 9.0 16.0
4 16.0 6.0 5.0 6.0
5 NaN NaN NaN NaN
For horizontal stack:
pd.concat([df.assign(test=np.nan) for df in dfs],axis=1)
A B C D test A B C D test A B C D test
0 17 16 15 7 NaN 14 0 13 12 NaN 10 18 13 12 NaN
1 13 6 12 18 NaN 10 3 6 3 NaN 1 6 10 0 NaN
2 0 2 10 17 NaN 15 10 15 3 NaN 2 19 4 18 NaN
3 8 13 10 17 NaN 9 16 11 4 NaN 4 3 9 16 NaN
4 4 18 8 19 NaN 5 7 6 2 NaN 16 6 5 6 NaN
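If the literal test header on the spacer columns bothers you, the same trick works with a blank column name via **-unpacking (a small variation on the snippet above; the CSV answer below uses the same pattern):
pd.concat([df.assign(**{' ': np.nan}) for df in dfs], axis=1)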
Is this what you want?
fname = 'test2.csv'
frames1 = [df4, df5, df6]
with open(fname, mode='a+') as f:
    for df in frames1:
        # write through the same handle so f.tell() reflects what to_csv wrote
        df.to_csv(f, header=f.tell() == 0)
        f.write('\n')
test2.csv:
,a,b,c
0,0,1,2
1,3,4,5
2,6,7,8

0,0,1,2
1,3,4,5
2,6,7,8

0,0,1,2
1,3,4,5
2,6,7,8
f.tell() == 0 checks whether the file handle is at the beginning of the file, i.e. at position 0; if it is, the header is written, otherwise it is skipped.
NOTE: I have used the same values for all the dfs; that's why all the results look the same.
For columns:
fname = 'test3.csv'
frames1 = [df1, df2, df3]
Summary = pd.concat([df.assign(**{' ':' '}) for df in frames1], axis=1)
Summary.to_csv(fname)
test3.csv:
,a,b,c, ,a,b,c, ,a,b,c,
0,0,1,2, ,0,1,2, ,0,1,2,
1,3,4,5, ,3,4,5, ,3,4,5,
2,6,7,8, ,6,7,8, ,6,7,8,
But the columns will not be equally spaced. If you save with header=False:
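Summary.to_csv(fname, header=False)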
test3.csv:
0,0,1,2, ,0,1,2, ,0,1,2,
1,3,4,5, ,3,4,5, ,3,4,5,
2,6,7,8, ,6,7,8, ,6,7,8,
Editing my original post to hopefully simplify my question... I'm merging multiple DataFrames into one, SomeData.DataFrame, which gives me the following:
Key 2019-02-17 2019-02-24_x 2019-02-24_y 2019-03-03
0 A 80 NaN NaN 80
1 B NaN NaN 45 36
2 C 44 NaN 39 NaN
3 D 80 NaN NaN 12
4 E 49 2 NaN NaN
What I'm trying to do now is efficiently merge the columns ending in "_x" and "_y" while keeping everything else in place so that I get:
Key 2019-02-17 2019-02-24 2019-03-03
0 A 80 NaN 80
1 B NaN 45 36
2 C 44 39 NaN
3 D 80 NaN 12
4 E 49 2 NaN
The other issue I'm trying to account for is that the data contained in SomeData.DataFrame changes weekly, so my column headers are unpredictable. Meaning, some weeks I may not have the above issue at all, and other weeks there may be multiple instances, for example:
Key 2019-02-17 2019-02-24_x 2019-02-24_y 2019-03-10_x 2019-03-10_y
0 A 80 NaN NaN 80 NaN
1 B NaN NaN 45 36 NaN
2 C 44 NaN 39 NaN 12
3 D 80 NaN NaN 12 NaN
4 E 49 2 NaN NaN 17
So that again the desired result would be:
Key 2019-02-17 2019-02-24 2019-03-10
0 A 80 NaN 80
1 B NaN 45 36
2 C 44 39 12
3 D 80 NaN 12
4 E 49 2 17
Is what I'm asking reasonable, or am I venturing outside Pandas' limits? I can't find anyone trying to do anything similar, so I'm not sure anymore. Thank you in advance!
Edited answer to updated question:
df = df.set_index('Key')
df.groupby(df.columns.str.split('_').str[0], axis=1).sum()
Output:
2019-02-17 2019-02-24 2019-03-03
Key
A 80.0 0.0 80.0
B 0.0 45.0 36.0
C 44.0 39.0 0.0
D 80.0 0.0 12.0
E 49.0 2.0 0.0
Output for the second dataframe:
df.groupby(df.columns.str.split('_').str[0], axis=1).sum()
Output:
2019-02-17 2019-02-24 2019-03-10
Key
A 80.0 0.0 80.0
B 0.0 45.0 36.0
C 44.0 39.0 12.0
D 80.0 0.0 12.0
E 49.0 2.0 17.0
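Two possible refinements: sum turns an all-NaN group into 0.0, so sum(min_count=1) should keep the NaN cells from your desired output; and on newer pandas, where groupby(..., axis=1) is deprecated, the same merge can be written by transposing. A sketch of both:

# keep NaN where every merged cell was missing, instead of 0.0
df.groupby(df.columns.str.split('_').str[0], axis=1).sum(min_count=1)

# equivalent without axis=1, for pandas versions that deprecate it
df.T.groupby(df.columns.str.split('_').str[0]).sum(min_count=1).T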
You could try something like this:
df_t = df.T
(df_t.set_index(df_t.groupby(level=0).cumcount(), append=True)  # number repeated labels 0, 1, 2, ...
     .unstack()  # spread the duplicates into separate columns
     .T
     .sort_values(df.columns[0])[df.columns.unique()]
     .reset_index(drop=True))
Output:
val03-20 03-20 val03-24 03-24
0 a 1 d 5
1 b 6 e 7
2 c 4 f 10
3 NaN NaN g 5
4 NaN NaN h 6
5 NaN NaN i 1
We have a dataframe 'A' with 5 columns, and we want to add the rolling mean of each column. We could do:
A = pd.DataFrame(np.random.randint(100, size=(5, 5)))
for i in range(0,5):
    A[i+6] = A[i].rolling(3).mean()
If however 'A' has columns named 'A', 'B', ..., 'E':
A = pd.DataFrame(np.random.randint(100, size=(5, 5)),
                 columns=['A', 'B', 'C', 'D', 'E'])
How could we neatly add 5 columns with the rolling means, named ['A_mean', 'B_mean', ..., 'E_mean']?
Try this:
for col in A:
    A[col + '_mean'] = A[col].rolling(3).mean()
Output with your way:
0 1 2 3 4 6 7 8 9 10
0 16 53 9 16 67 NaN NaN NaN NaN NaN
1 55 37 93 92 21 NaN NaN NaN NaN NaN
2 10 5 93 99 27 27.0 31.666667 65.000000 69.000000 38.333333
3 94 32 81 91 34 53.0 24.666667 89.000000 94.000000 27.333333
4 37 46 20 18 10 47.0 27.666667 64.666667 69.333333 23.666667
and Output with mine:
A B C D E A_mean B_mean C_mean D_mean E_mean
0 16 53 9 16 67 NaN NaN NaN NaN NaN
1 55 37 93 92 21 NaN NaN NaN NaN NaN
2 10 5 93 99 27 27.0 31.666667 65.000000 69.000000 38.333333
3 94 32 81 91 34 53.0 24.666667 89.000000 94.000000 27.333333
4 37 46 20 18 10 47.0 27.666667 64.666667 69.333333 23.666667
Without loops:
pd.concat([A, A.apply(lambda x: x.rolling(3).mean()).rename(
    columns={col: str(col) + '_mean' for col in A})], axis=1)
A B C D E A_mean B_mean C_mean D_mean E_mean
0 67 54 85 61 62 NaN NaN NaN NaN NaN
1 44 53 30 80 58 NaN NaN NaN NaN NaN
2 10 59 14 39 12 40.333333 55.333333 43.0 60.000000 44.000000
3 47 25 58 93 38 33.666667 45.666667 34.0 70.666667 36.000000
4 73 80 30 51 77 43.333333 54.666667 34.0 61.000000 42.333333
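For what it's worth, rolling also works directly on the whole DataFrame (column-wise), so add_suffix can handle the renaming without the lambda:
pd.concat([A, A.rolling(3).mean().add_suffix('_mean')], axis=1)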