I have a df that looks like
   L.1  L.2  G.1  G.2
0    1    5    9   13
1    2    6   10   14
2    3    7   11   15
3    4    8   12   16
This is just an arbitrary example, but the structure of my df is exactly the same: 4 column titles with numbers under them. I would like to stack my columns so that the result looks like
   L   G
0  1   9
1  2  10
2  3  11
3  4  12
4  5  13
5  6  14
6  7  15
7  8  16
If someone could help me in solving this, it would be great as I am having a really hard time doing this.
Use wide_to_long, then remove the MultiIndex with DataFrame.reset_index and drop=True:
df = (pd.wide_to_long(df.reset_index(), stubnames=['L', 'G'], i='index', j='tmp', sep='.')
        .reset_index(drop=True))
print(df)
L G
0 1 9
1 2 10
2 3 11
3 4 12
4 5 13
5 6 14
6 7 15
7 8 16
Or split the columns by str.split, reshape with DataFrame.stack, sort the MultiIndex with DataFrame.sort_index, and finally remove the MultiIndex:
df.columns = df.columns.str.split('.', expand=True)
df = df.stack().sort_index(level=[1,0]).reset_index(drop=True)
print(df)
G L
0 9 1
1 10 2
2 11 3
3 12 4
4 13 5
5 14 6
6 15 7
7 16 8
You can convert each column to a list, concatenate the lists, and create a new dataframe from the result:
import pandas as pd
df = pd.DataFrame({'L.1': [1, 2, 3, 4], 'L.2': [5, 6, 7, 8], 'G.1':[9, 10, 11, 12], 'G.2': [13, 14, 15, 16]})
new_df = pd.DataFrame({'L': df['L.1'].tolist() + df['L.2'].tolist(),
                       'G': df['G.1'].tolist() + df['G.2'].tolist()})
Printing new_df will give you:
L G
0 1 9
1 2 10
2 3 11
3 4 12
4 5 13
5 6 14
6 7 15
7 8 16
The columns have a pattern: some start with L, others with G. We can use pivot_longer from pyjanitor to abstract the process; simply pass a list of new column names, and a list of regular expressions to match the patterns:
df.pivot_longer(index=None,
                names_to=['L', 'G'],
                names_pattern=['^L', '^G'])
L G
0 1 9
1 2 10
2 3 11
3 4 12
4 5 13
5 6 14
6 7 15
7 8 16
Still with pivot_longer, you can use the .value approach along with a regular expression that contains groups; the grouped part is retained as a column header:
df.pivot_longer(index=None,
                names_to=".value",
                names_pattern=r"(.).")
L G
0 1 9
1 2 10
2 3 11
3 4 12
4 5 13
5 6 14
6 7 15
7 8 16
I would like to replace values in a column, but only to the values seen after an specific value
for example, I have the following dataset:
In [108]: df = pd.DataFrame([[12,13,14,15,16,17],[4,10,5,6,1,3],[1,3,5,4,9,1],[2,4,1,8,3,4],[4,2,6,7,1,8]], index=['ID','time','A','B','C']).T
In [109]: df
Out[109]:
ID time A B C
0 12 4 1 2 4
1 13 10 3 4 2
2 14 5 5 1 6
3 15 6 4 8 7
4 16 1 9 3 1
5 17 3 1 4 8
and I want to change, for column "A", all the values that come after the 5 to 1; for column "B", all the values that come after the 1 to 6; and for column "C", all the values after the 7 to 5, so it will look like this:
ID time A B C
0 12 4 1 2 4
1 13 10 3 4 2
2 14 5 5 1 6
3 15 6 1 6 7
4 16 1 1 6 5
5 17 3 1 6 5
I know that I could use where to get a similar effect, but a condition like df["A"] = np.where(x != 5, 1, x) will obviously change the values before the 5 as well. I can't think of anything else at the moment.
Thanks for the help.
Use DataFrame.mask with values shifted by DataFrame.shift and compared against a dictionary; DataFrame.cummax then marks every following row True:
df = pd.DataFrame([[12,13,14,15,16,17],[4,10,5,6,1,3],
                   [1,3,5,4,9,1],[2,4,1,8,3,4],[4,2,6,7,1,8]],
                  index=['ID','time','A','B','C']).T
after = {'A':5, 'B':1, 'C': 7}
new = {'A':1, 'B':6, 'C': 5}
cols = list(after.keys())
s = pd.Series(new)
df[cols] = df[cols].mask(df[cols].shift().eq(after).cummax(), s, axis=1)
print(df)
ID time A B C
0 12 4 1 2 4
1 13 10 3 4 2
2 14 5 5 1 6
3 15 6 1 6 7
4 16 1 1 6 5
5 17 3 1 6 5
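To see why this works, here is the same logic spelled out step by step on the sample data, building the frame directly from its columns (the intermediate boolean mask is the key piece):

```python
import pandas as pd

df = pd.DataFrame({'ID': [12, 13, 14, 15, 16, 17],
                   'time': [4, 10, 5, 6, 1, 3],
                   'A': [1, 3, 5, 4, 9, 1],
                   'B': [2, 4, 1, 8, 3, 4],
                   'C': [4, 2, 6, 7, 1, 8]})

after = {'A': 5, 'B': 1, 'C': 7}   # trigger value per column
new = {'A': 1, 'B': 6, 'C': 5}     # replacement per column
cols = list(after)

# True exactly on the rows directly after a trigger value...
hit = df[cols].shift().eq(pd.Series(after))
# ...and cummax propagates the first True down to every following row
mask = hit.cummax()

df[cols] = df[cols].mask(mask, pd.Series(new), axis=1)
```

Each column is shifted down by one row so the comparison marks the row after the trigger, and cummax turns that single True into "this row and everything below it".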
I am currently starting to learn Pandas and am struggling with a task. What I am trying to do is to augment the data stored in a dataframe by combining two successive rows with an increasing overlap between them, just like a rolling window.
I believe the question can be exemplified with this small dataframe:
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], columns=['A', 'B', 'C', 'D'])
which gives:
A B C D
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
With it, what I want to accomplish, but don't know how to, is a dataframe like this one:
A B C D
0 1 2 3 4
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
5 6 7 8 9
6 7 8 9 10
7 8 9 10 11
8 9 10 11 12
As if we were using multiple rolling windows between each pair of rows of the initial dataframe. Note that I am not actually using this specific dataframe (my values are not ordered like 1, 2, 3, 4...); I am using a general dataframe imported from a csv.
Is this possible? Thanks in advance!
Edit
Thanks for all the responses. Both answers, given by anky and Shubham Sharma, work perfectly. Here are the results obtained by using them with my real dataframe:
Initial dataframe
After adding multiple rolling windows as my question needed
Maybe not as elegant, but try:
def fun(dataframe, n):
    l = dataframe.stack().tolist()
    return (pd.DataFrame([l[e:e + n] for e, i in enumerate(l)],
                         columns=dataframe.columns).dropna().astype(dataframe.dtypes))

fun(df, df.shape[1])
A B C D
0 1 2 3 4
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
5 6 7 8 9
6 7 8 9 10
7 8 9 10 11
8 9 10 11 12
Let's try rolling with numpy:
def rolling(a, w=4):
    s = a.strides[-1]
    return np.lib.stride_tricks.as_strided(a, (len(a) - w + 1, w), (s, s))

pd.DataFrame(rolling(df.values.reshape(-1)), columns=df.columns)
A B C D
0 1 2 3 4
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
5 6 7 8 9
6 7 8 9 10
7 8 9 10 11
8 9 10 11 12
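On NumPy 1.20+, sliding_window_view gives the same windows without the sharp edges of as_strided (a sketch on the sample frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
                  columns=['A', 'B', 'C', 'D'])

# flatten row-wise, then take every length-4 window
windows = np.lib.stride_tricks.sliding_window_view(df.to_numpy().ravel(), 4)
out = pd.DataFrame(windows, columns=df.columns)
```

Unlike as_strided, sliding_window_view checks its bounds for you, so there is no risk of reading past the end of the buffer.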
You can do all the heavy lifting with numpy and then drop the resulting matrix into a dataframe.
import numpy as np
import pandas as pd
n_columns = 4
n_rows = 9
aux = np.tile(
    np.arange(1, n_columns + 1),  # base row
    (n_rows, 1)                   # replicate it as many times as needed
)
# use broadcasting to add a per-row offset to each row
aux = aux + np.arange(n_rows)[:, np.newaxis]
# put everything into a dataframe
pd.DataFrame(aux)
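To tie this back to the example frame, you can pass the original column labels when building the result (labels assumed from the question):

```python
import numpy as np
import pandas as pd

n_columns, n_rows = 4, 9
aux = np.tile(np.arange(1, n_columns + 1), (n_rows, 1))  # base row, repeated
aux = aux + np.arange(n_rows)[:, np.newaxis]             # per-row offset via broadcasting
out = pd.DataFrame(aux, columns=['A', 'B', 'C', 'D'])
```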
I have rand_df1:
np.random.seed(1)
rand_df1 = pd.DataFrame(np.random.randint(0, 40, size=(3, 2)), columns=list('AB'))
print(rand_df1, '\n')
A B
0 37 12
1 8 9
2 11 5
Also, rand_df2:
rand_df2 = pd.DataFrame(np.random.randint(0, 40, size=(3, 2)), columns=list('AB'))
rand_df2 = rand_df2.loc[rand_df2.index.repeat(rand_df2['B'])]
print(rand_df2, '\n')
A B
1 16 1
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
I need to reassign the values in col 'A' of the first dataframe with the values in 'A' of the second dataframe, matched by index. Desired output for rand_df1:
A B
0 37 12
1 16 1
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
If I've interpreted your question correctly, you are looking to append new rows onto rand_df2. These rows are to be selected from rand_df1 where they have an index which does not appear in rand_df2. Is that correct?
This will do the trick:
rand_df2_new = rand_df2.append(rand_df1[~rand_df1.index.isin(rand_df2.index)]).sort_index()
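Note that DataFrame.append was deprecated and later removed (pandas 2.0); an equivalent with pd.concat would be something like:

```python
import numpy as np
import pandas as pd

np.random.seed(1)
rand_df1 = pd.DataFrame(np.random.randint(0, 40, size=(3, 2)), columns=list('AB'))
rand_df2 = pd.DataFrame(np.random.randint(0, 40, size=(3, 2)), columns=list('AB'))
rand_df2 = rand_df2.loc[rand_df2.index.repeat(rand_df2['B'])]

# rows of rand_df1 whose index is absent from rand_df2, appended via concat
missing = rand_df1[~rand_df1.index.isin(rand_df2.index)]
rand_df2_new = pd.concat([rand_df2, missing]).sort_index()
```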
Thanks to Henry Yik for his solution:
rand_df2.combine_first(rand_df1)
A B
0 37 12
1 16 1
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
I also tested this with an extra column in one dataframe that doesn't appear in the other, and the reverse situation. It works well.
If I have a pandas dataframe with 4 columns like this:
A B C D
0 2 4 1 9
1 3 2 9 7
2 1 6 9 2
3 8 6 5 4
is it possible to apply df.cumsum() in some way to get the results in a new column next to each existing column, like this:
A AA B BB C CC D DD
0 2 2 4 4 1 1 9 9
1 3 5 2 6 9 10 7 16
2 1 6 6 12 9 19 2 18
3 8 14 6 18 5 24 4 22
You can create new columns using assign:
result = df.assign(**{col*2:df[col].cumsum() for col in df})
and order the columns with sort_index:
result.sort_index(axis=1)
# A AA B BB C CC D DD
# 0 2 2 4 4 1 1 9 9
# 1 3 5 2 6 9 10 7 16
# 2 1 6 6 12 9 19 2 18
# 3 8 14 6 18 5 24 4 22
Note that depending on the column names, sorting may not produce the desired order. In that case, using reindex is a more robust way of ensuring you obtain the desired column order:
result = df.assign(**{col*2:df[col].cumsum() for col in df})
result = result.reindex(columns=[item for col in df for item in (col, col*2)])
Here is an example which demonstrates the difference:
import pandas as pd
df = pd.DataFrame({'A': [2, 3, 1, 8], 'A A': [4, 2, 6, 6], 'C': [1, 9, 9, 5], 'D': [9, 7, 2, 4]})
result = df.assign(**{col*2:df[col].cumsum() for col in df})
print(result.sort_index(axis=1))
#    A  A A  A AA A  AA  C  CC  D  DD
# 0  2    4       4   2  1   1  9   9
# 1  3    2       6   5  9  10  7  16
# 2  1    6      12   6  9  19  2  18
# 3  8    6      18  14  5  24  4  22
result = result.reindex(columns=[item for col in df for item in (col, col*2)])
print(result)
#    A  AA  A A  A AA A  C  CC  D  DD
# 0  2   2    4       4  1   1  9   9
# 1  3   5    2       6  9  10  7  16
# 2  1   6    6      12  9  19  2  18
# 3  8  14    6      18  5  24  4  22
@unutbu's way certainly works, but using insert reads better to me. Plus, you don't need to worry about sorting/reindexing!
for i, col_name in enumerate(df):
    df.insert(i * 2 + 1, col_name * 2, df[col_name].cumsum())
df
returns
A AA B BB C CC D DD
0 2 2 4 4 1 1 9 9
1 3 5 2 6 9 10 7 16
2 1 6 6 12 9 19 2 18
3 8 14 6 18 5 24 4 22
I have a dataframe in which some values are split in different columns:
ch1a ch1b ch1c ch2
0 0 4 10
0 0 5 9
0 6 0 8
0 7 0 7
8 0 0 6
9 0 0 5
I want to sum those columns and keep the normal ones (like ch2).
The desired result should be something like:
ch1a ch2
4 10
5 9
6 8
7 7
8 6
9 5
I took a look at the pandas functions merge and join, but I could not find the right one for my case.
This was my first try:
df = pd.DataFrame({'ch1a': [0, 0, 0, 0, 8, 9],'ch1b': [0, 0, 6, 7, 0, 0],'ch1c': [4, 5, 0, 0, 0, 0],'ch2': [10, 9, 8, 7, 6, 5]})
df['ch1a'] = df.sum(axis=1)
del df['ch1b']
del df['ch1c']
However the result is not what I want:
ch1a ch2
0 14 10
1 14 9
2 14 8
3 14 7
4 14 6
5 14 5
I have two questions:
How can I get my desired result?
Is there a way to merge some columns by summing their values and not have to delete the remaining columns afterwards?
This would get you the desired result:
cols_to_sum = ['ch1a', 'ch1b', 'ch1c']
df['ch1'] = df.loc[:, cols_to_sum].sum(axis=1)
df = df.drop(cols_to_sum, axis=1)
Your problem was that you were summing over all columns; here we restrict the sum to the relevant ones. I don't know how to avoid the drop, though.
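One way to sidestep the separate drop is DataFrame.pop, which removes each column from the frame as it is read (a sketch on the question's data):

```python
import pandas as pd

df = pd.DataFrame({'ch1a': [0, 0, 0, 0, 8, 9],
                   'ch1b': [0, 0, 6, 7, 0, 0],
                   'ch1c': [4, 5, 0, 0, 0, 0],
                   'ch2': [10, 9, 8, 7, 6, 5]})

# pop returns the column and drops it from df in one step
df['ch1a'] = df['ch1a'] + df.pop('ch1b') + df.pop('ch1c')
```

After this, only ch1a (now holding the sum) and ch2 remain, with no explicit drop needed.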
You can do a horizontal (column-wise) groupby using axis=1:
>>> df.groupby(df.columns.str[:3], axis=1).sum()
ch1 ch2
0 4 10
1 5 9
2 6 8
3 7 7
4 8 6
5 9 5
Here I used the first three letters of the column names to determine the destination groups, but you can use functions, dictionaries, or lists instead:
>>> df.groupby(lambda x: x[:3], axis=1).sum()
ch1 ch2
0 4 10
1 5 9
2 6 8
3 7 7
4 8 6
5 9 5
>>> df.groupby(['a','b','b','c'], axis=1).sum()
a b c
0 0 4 10
1 0 5 9
2 0 6 8
3 0 7 7
4 8 0 6
5 9 0 5
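Note that groupby(..., axis=1) is deprecated in recent pandas (2.x); grouping the transposed frame gives the same result:

```python
import pandas as pd

df = pd.DataFrame({'ch1a': [0, 0, 0, 0, 8, 9],
                   'ch1b': [0, 0, 6, 7, 0, 0],
                   'ch1c': [4, 5, 0, 0, 0, 0],
                   'ch2': [10, 9, 8, 7, 6, 5]})

# group the transposed frame by the first three letters of each column name,
# sum within each group, then transpose back
out = df.T.groupby(df.columns.str[:3]).sum().T
```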