This question has been posted before, but the requirements were not properly conveyed. I have a CSV file with more than 1000 columns:
A B C D .... X Y Z
1 0 0.5 5 .... 1 7 6
2 0 0.6 4 .... 0 7 6
3 0 0.7 3 .... 1 7 6
4 0 0.8 2 .... 0 7 6
Here X, Y and Z are the 999th, 1000th and 1001st columns, and A, B, C and D are the 1st, 2nd, 3rd and 4th. I need to reorder the columns in such a way that it gives me the following:
D Y Z A B C ....X
5 7 6 1 0 0.5 ....1
4 7 6 2 0 0.6 ....0
3 7 6 3 0 0.7 ....1
2 7 6 4 0 0.8 ....0
That is, the 4th column becomes the 1st, the 1000th and 1001st columns become the 2nd and 3rd, and the other columns are shifted right accordingly.
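For the real 1000-plus-column file you wouldn't type every name out; a minimal sketch that builds the new order from 0-based positions (assuming D, Y and Z sit at positions 3, 999 and 1000):
cols = df.columns.tolist()
front = [cols[3], cols[999], cols[1000]]  # D, Y, Z by position
rest = [c for i, c in enumerate(cols) if i not in (3, 999, 1000)]
df = df[front + rest]
The answers below show the same idea on a small example frame.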
So the question is how to reorder your columns in a custom way.
For example, you have the following DF and you want to reorder your columns in the following way (indices):
5, 3, rest...
DF
In [82]: df
Out[82]:
A B C D E F G
0 1 0 0.5 5 1 7 6
1 2 0 0.6 4 0 7 6
2 3 0 0.7 3 1 7 6
3 4 0 0.8 2 0 7 6
columns
In [83]: cols = df.columns.tolist()
In [84]: cols
Out[84]: ['A', 'B', 'C', 'D', 'E', 'F', 'G']
reordered:
In [88]: cols = [cols.pop(5)] + [cols.pop(3)] + cols
In [89]: cols
Out[89]: ['F', 'D', 'A', 'B', 'C', 'E', 'G']
In [90]: df[cols]
Out[90]:
F D A B C E G
0 7 5 1 0 0.5 1 6
1 7 4 2 0 0.6 0 6
2 7 3 3 0 0.7 1 6
3 7 2 4 0 0.8 0 6
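Note that the pops run left to right against a shrinking list: pop(5) removes 'F' first, so pop(3) still finds 'D' where expected. Popping the lower index first would shift the later positions:
cols = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
cols = [cols.pop(3)] + [cols.pop(5)] + cols
# cols is now ['D', 'G', 'A', 'B', 'C', 'E', 'F'] -- pop(5) grabbed 'G', not 'F'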
In [4]: df
Out[4]:
A B C D X Y Z
0 1 0 0.5 5 1 7 6
1 2 0 0.6 4 0 7 6
2 3 0 0.7 3 1 7 6
3 4 0 0.8 2 0 7 6
In [5]: df.reindex(columns=['D','Y','Z','A','B','C','X'])
Out[5]:
D Y Z A B C X
0 5 7 6 1 0 0.5 1
1 4 7 6 2 0 0.6 0
2 3 7 6 3 0 0.7 1
3 2 7 6 4 0 0.8 0
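One difference between the two approaches worth knowing: in recent pandas, df[cols] raises a KeyError for a column name that doesn't exist, while reindex silently inserts it as an all-NaN column (hypothetical typo shown):
df.reindex(columns=['D', 'Y', 'Z', 'TYPO'])  # 'TYPO' silently becomes a NaN column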
Related
I want to split a dataframe into chunks with uneven numbers of rows, using row indices.
The code below:
groups = df.groupby((np.arange(len(df.index))/l[1]).astype(int))
works only for a uniform number of rows.
df
a b c
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
l = [2, 5, 7]
df1
1 1 1
2 2 2
df2
3 3 3
4 4 4
5 5 5
df3
6 6 6
7 7 7
df4
8 8 8
You could use a list comprehension, with a small modification to your list l first.
print(df)
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
l = [2,5,7]
l_mod = [0] + l + [max(l)+1]
list_of_dfs = [df.iloc[l_mod[n]:l_mod[n+1]] for n in range(len(l_mod)-1)]
Output:
list_of_dfs[0]
a b c
0 1 1 1
1 2 2 2
list_of_dfs[1]
a b c
2 3 3 3
3 4 4 4
4 5 5 5
list_of_dfs[2]
a b c
5 6 6 6
6 7 7 7
list_of_dfs[3]
a b c
7 8 8 8
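Two asides on this approach. First, max(l)+1 matches the frame length here only because the data happens to have 8 rows; deriving the final boundary from the frame itself is safer in general:
l_mod = [0] + l + [len(df)]
Second, NumPy can perform the same positional split in one call, since np.split accepts a DataFrame and a list of split positions:
import numpy as np
list_of_dfs = np.split(df, l)  # splits before rows 2, 5 and 7 -> four frames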
I think this is what you need:
df = pd.DataFrame({'a': np.arange(1, 8),
'b': np.arange(1, 8),
'c': np.arange(1, 8)})
df
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
last_check = 0
dfs = []
for ind in [2, 5, 7]:
    dfs.append(df.loc[last_check:ind-1])
    last_check = ind
Although a list comprehension can be more concise and often faster, the explicit last_check bookkeeping is handy if you don't have a pattern in your list of indices.
dfs[0]
a b c
0 1 1 1
1 2 2 2
dfs[2]
a b c
5 6 6 6
6 7 7 7
I think this is what you are looking for.
l = [2, 5, 7]
dfs = []
for i, val in enumerate(l):
    if i == 0:
        # first chunk: everything before the first split index
        temp = df.iloc[:val]
    else:
        # later chunks: from the previous split index up to this one
        temp = df.iloc[l[i-1]:val]
    dfs.append(temp)
Output:
a b c
0 1 1 1
1 2 2 2
a b c
2 3 3 3
3 4 4 4
4 5 5 5
a b c
5 6 6 6
6 7 7 7
Another Solution:
l = [2, 5, 7]
t = np.arange(l[-1])  # one label slot per row
l.reverse()
for val in l:
    t[:val] = val     # stamp each chunk with its end index
temp = pd.DataFrame(t)
temp = pd.concat([df, temp], axis=1)
for u, v in temp.groupby(0):
    print(v)
Output:
a b c 0
0 1 1 1 2
1 2 2 2 2
a b c 0
2 3 3 3 5
3 4 4 4 5
4 5 5 5 5
a b c 0
5 6 6 6 7
6 7 7 7 7
You can create an array to use for indexing via NumPy:
import pandas as pd, numpy as np
df = pd.DataFrame(np.arange(24).reshape((8, 3)), columns=list('abc'))
L = [2, 5, 7]
idx = np.cumsum(np.in1d(np.arange(len(df.index)), L))
for _, chunk in df.groupby(idx):
    print(chunk, '\n')
a b c
0 0 1 2
1 3 4 5
a b c
2 6 7 8
3 9 10 11
4 12 13 14
a b c
5 15 16 17
6 18 19 20
a b c
7 21 22 23
Instead of defining a new variable for each dataframe, you can use a dictionary:
d = dict(tuple(df.groupby(idx)))
print(d[1]) # print second groupby value
a b c
2 6 7 8
3 9 10 11
4 12 13 14
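A side note: np.isin is the modern spelling of np.in1d (the latter is deprecated in recent NumPy releases), so the index array can equivalently be built as:
idx = np.cumsum(np.isin(np.arange(len(df.index)), L))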
I have a dataframe like this
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'crit_1': [0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
                   'crit_2': ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'a', 'a', 'a'],
                   'value': [3, 4, 3, 5, 1, 2, 4, 6, 2, 3]},
                  columns=['id', 'crit_1', 'crit_2', 'value'])
df
Out[41]:
id crit_1 crit_2 value
0 1 0 a 3
1 1 0 a 4
2 1 1 b 3
3 1 0 b 5
4 2 0 a 1
5 2 0 b 2
6 2 1 a 4
7 3 0 a 6
8 3 0 a 2
9 3 1 a 3
I pull a subset out of this frame based on crit_1:
df_subset = df[(df['crit_1']==1)]
Then I perform a complex operation (the nature of which is unimportant for this question) on that subset, producing a new column:
df_subset['some_new_val'] = [1, 4, 2]
df_subset
Out[42]:
id crit_1 crit_2 value some_new_val
2 1 1 b 3 1
6 2 1 a 4 4
9 3 1 a 3 2
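As an aside, assigning a new column to a slice like this can raise pandas' SettingWithCopyWarning; taking an explicit copy of the subset avoids it:
df_subset = df[df['crit_1'] == 1].copy()  # explicit copy, no SettingWithCopyWarning
df_subset['some_new_val'] = [1, 4, 2]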
Now I want to add some_new_val back into my original dataframe, onto the column value. However, I only want to add it where there is a match on id and crit_2.
The result should look like this
id crit_1 crit_2 value new_value
0 1 0 a 3 3
1 1 0 a 4 4
2 1 1 b 3 4
3 1 0 b 5 6
4 2 0 a 1 1
5 2 0 b 2 6
6 2 1 a 4 4
7 3 0 a 6 8
8 3 0 a 2 4
9 3 1 a 3 5
You can use merge with left join and then add:
# keep only the columns needed for the join and for the addition
cols = ['id','crit_2', 'some_new_val']
df = pd.merge(df, df_subset[cols], on=['id','crit_2'], how='left')
print (df)
id crit_1 crit_2 value some_new_val
0 1 0 a 3 NaN
1 1 0 a 4 NaN
2 1 1 b 3 1.0
3 1 0 b 5 1.0
4 2 0 a 1 4.0
5 2 0 b 2 NaN
6 2 1 a 4 4.0
7 3 0 a 6 2.0
8 3 0 a 2 2.0
9 3 1 a 3 2.0
df['some_new_val'] = df['some_new_val'].add(df['value'], fill_value=0)
print (df)
id crit_1 crit_2 value some_new_val
0 1 0 a 3 3.0
1 1 0 a 4 4.0
2 1 1 b 3 4.0
3 1 0 b 5 6.0
4 2 0 a 1 5.0
5 2 0 b 2 2.0
6 2 1 a 4 8.0
7 3 0 a 6 8.0
8 3 0 a 2 4.0
9 3 1 a 3 5.0
Suppose I have two dataframes X and Y:
import pandas as pd
X = pd.DataFrame({'A':[1,4,7],'B':[2,5,8],'C':[3,6,9]})
Y = pd.DataFrame({'D':[1],'E':[11]})
In [4]: X
Out[4]:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
In [6]: Y
Out[6]:
D E
0 1 11
and then, I want to get the following result dataframe:
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
how?
Assuming that Y contains only one row:
In [9]: X.assign(**Y.to_dict('r')[0])
Out[9]:
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
or a much nicer alternative from @piRSquared:
In [27]: X.assign(**Y.iloc[0])
Out[27]:
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
Helper dict:
In [10]: Y.to_dict('r')[0]
Out[10]: {'D': 1, 'E': 11}
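The 'r' orient is shorthand for 'records' (later pandas deprecates the abbreviation), so a more explicit spelling of the same helper is:
Y.to_dict('records')[0]  # {'D': 1, 'E': 11}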
Here is another way:
Y2 = pd.concat([Y]*3, ignore_index=True)  # this duplicates the row three times
Which produces:
D E
0 1 11
1 1 11
2 1 11
Then concat once again:
pd.concat([X,Y2], axis =1)
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
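Hardcoding the 3 works for this example; deriving the repeat count from X keeps it general:
Y2 = pd.concat([Y] * len(X), ignore_index=True)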
I have an existing pandas DataFrame, and I want to add a new column, where the value of each row will depend on the previous row.
For example:
df1 = pd.DataFrame(np.random.randint(10, size=(4, 4)), columns=['a', 'b', 'c', 'd'])
df1
Out[31]:
a b c d
0 9 3 3 0
1 3 9 5 1
2 1 7 5 6
3 8 0 1 7
and now I want to create column e, where for each row i the value of df1['e'][i] would be: df1['e'][i] = df1['d'][i] - df1['d'][i-1]
desired output:
df1:
a b c d e
0 9 3 3 0 0
1 3 9 5 1 1
2 1 7 5 6 5
3 8 0 1 7 1
how can I achieve this?
You can use sub with shift:
df['e'] = df.d.sub(df.d.shift(), fill_value=0)
print (df)
a b c d e
0 9 3 3 0 0.0
1 3 9 5 1 1.0
2 1 7 5 6 5.0
3 8 0 1 7 1.0
If you need to convert to int:
df['e'] = df.d.sub(df.d.shift(), fill_value=0).astype(int)
print (df)
a b c d e
0 9 3 3 0 0
1 3 9 5 1 1
2 1 7 5 6 5
3 8 0 1 7 1
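A close alternative uses diff, which computes exactly this row-minus-previous-row difference (the first row becomes NaN, hence the fillna). One subtlety: diff().fillna(0) puts 0 in the first row, while sub with fill_value=0 puts d[0] there; in this example d[0] is 0, so the two agree:
df['e'] = df.d.diff().fillna(0).astype(int)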
My question is related to this question
import pandas as pd
df = pd.DataFrame(
[['A', 'X', 3], ['A', 'X', 5], ['A', 'Y', 7], ['A', 'Y', 1],
['B', 'X', 3], ['B', 'X', 1], ['B', 'X', 3], ['B', 'Y', 1],
['C', 'X', 7], ['C', 'Y', 4], ['C', 'Y', 1], ['C', 'Y', 6]],
columns=['c1', 'c2', 'v1'])
df['CNT'] = df.groupby(['c1', 'c2']).cumcount()+1
This gives me the column 'CNT', but I'd like to break it apart according to group 'c2' to obtain the cumulative counts of 'X' and 'Y' respectively.
c1 c2 v1 CNT Xcnt Ycnt
0 A X 3 1 1 0
1 A X 5 2 2 0
2 A Y 7 1 2 1
3 A Y 1 2 2 2
4 B X 3 1 1 0
5 B X 1 2 2 0
6 B X 3 3 3 0
7 B Y 1 1 3 1
8 C X 7 1 1 0
9 C Y 4 1 1 1
10 C Y 1 2 1 2
11 C Y 6 3 1 3
Any suggestions? I am just starting to explore Pandas and appreciate your help.
I don't know a way to do this directly, but starting from the calculated CNT column, you can do it as follows:
Make the Xcnt and Ycnt columns:
In [13]: df['Xcnt'] = df['CNT'][df['c2']=='X']
In [14]: df['Ycnt'] = df['CNT'][df['c2']=='Y']
In [15]: df
Out[15]:
c1 c2 v1 CNT Xcnt Ycnt
0 A X 3 1 1 NaN
1 A X 5 2 2 NaN
2 A Y 7 1 NaN 1
3 A Y 1 2 NaN 2
4 B X 3 1 1 NaN
5 B X 1 2 2 NaN
6 B X 3 3 3 NaN
7 B Y 1 1 NaN 1
8 C X 7 1 1 NaN
9 C Y 4 1 NaN 1
10 C Y 1 2 NaN 2
11 C Y 6 3 NaN 3
Next, we want to fill the NaN's per group of c1 by forward filling:
In [23]: df['Xcnt'] = df.groupby('c1')['Xcnt'].fillna(method='ffill')
In [24]: df['Ycnt'] = df.groupby('c1')['Ycnt'].fillna(method='ffill').fillna(0)
In [25]: df
Out[25]:
c1 c2 v1 CNT Xcnt Ycnt
0 A X 3 1 1 0
1 A X 5 2 2 0
2 A Y 7 1 2 1
3 A Y 1 2 2 2
4 B X 3 1 1 0
5 B X 1 2 2 0
6 B X 3 3 3 0
7 B Y 1 1 3 1
8 C X 7 1 1 0
9 C Y 4 1 1 1
10 C Y 1 2 1 2
11 C Y 6 3 1 3
For Ycnt, an extra fillna was needed to convert the NaN's to 0's where the group started with NaN's (there was nothing to fill forward from).
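For what it's worth, a more compact sketch of the same mask-then-forward-fill idea (same logic as the steps above, fewer intermediate assignments):
df['Xcnt'] = df['CNT'].where(df['c2'] == 'X')  # CNT where c2 is 'X', NaN elsewhere
df['Ycnt'] = df['CNT'].where(df['c2'] == 'Y')  # CNT where c2 is 'Y', NaN elsewhere
# forward-fill within each c1 group, then turn the leading NaN's into 0's
df[['Xcnt', 'Ycnt']] = df.groupby('c1')[['Xcnt', 'Ycnt']].ffill().fillna(0).astype(int)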