Let's say I have a DataFrame that looks like this:
A B C D
X 5 6 5
Y 3 1 2
X 9 7 5
3 5 2 NaN
9 8 7 NaN
5 2 2 NaN
X 4 3 1
Y 3 2 1
6 8 0 NaN
Notice that Column A contains some letter values (i.e. X, Y), but it also contains numeric values that really belong in Column B. How do I tell pandas:
For every value that is not X, Y, or Empty, shift the specific row over by 1 column.
My desired output being:
A B C D
X 5 6 5
Y 3 1 2
X 9 7 5
3 5 2
9 8 7
5 2 2
X 4 3 1
Y 3 2 1
6 8 0
Even something like this would work for me:
A B C D E
X 5 6 5
Y 3 1 2
X 9 7 5
3 5 2 NaN
9 8 7 NaN
5 2 2 NaN
X 4 3 1
Y 3 2 1
6 8 0 NaN
There should be some sort of function to say: for every value that is X, Y, or empty, ignore it, and shift over the rows whose first value is a number... or vice versa.
Any tips would be greatly appreciated!
Basically: for anything that is not one of these specific values, how do I take that row and shift everything over?
My approach would be to convert the DataFrame into a list of lists and then insert an empty element into each row that doesn't have X, Y, or ''.
df = df.values.tolist()
for row in df:
    if row[0] not in ['X', 'Y', '']:
        row.insert(0, '')
result = pd.DataFrame(df, columns=list('ABCDE'))
Output:
A B C D E
0 X 5 6 5.0 NaN
1 Y 3 1 2.0 NaN
2 X 9 7 5.0 NaN
3 3 5 2.0 NaN
4 9 8 7.0 NaN
5 5 2 2.0 NaN
6 X 4 3 1.0 NaN
7 Y 3 2 1.0 NaN
8 6 8 0.0 NaN
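A vectorized alternative stays in pandas: mask the rows whose A value is neither X, Y, nor missing, and push only those rows one column to the right with shift(axis=1). This is a sketch on a hypothetical reconstruction of the frame above, assuming the blank A cells arrive as NaN (adjust the mask if they are empty strings); it works here because D is already empty in exactly the rows that get shifted, so nothing falls off the end.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the example frame; blank cells are NaN here.
df = pd.DataFrame({
    'A': ['X', 'Y', 'X', 3, 9, 5, 'X', 'Y', 6],
    'B': [5, 3, 9, 5, 8, 2, 4, 3, 8],
    'C': [6, 1, 7, 2, 7, 2, 3, 2, 0],
    'D': [5, 2, 5, np.nan, np.nan, np.nan, 1, 1, np.nan],
})

# Rows whose A value is neither X, Y, nor missing get pushed one column right.
rows = ~(df['A'].isin(['X', 'Y']) | df['A'].isna())
out = df.copy()
out.loc[rows] = df.loc[rows].shift(axis=1)
print(out)
```

The untouched rows keep their values; in the shifted rows A becomes NaN and the numbers land one column over, matching the desired output.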
I have a DataFrame that looks like the following:
a b c
0 NaN 8 NaN
1 NaN 7 NaN
2 NaN 5 NaN
3 7.0 3 NaN
4 3.0 5 NaN
5 5.0 4 NaN
6 7.0 1 NaN
7 8.0 9 3.0
8 NaN 5 5.0
9 NaN 6 4.0
What I want to create is a new DataFrame where each cell contains the count of non-NaN values at or before it in the same column. The resulting new DataFrame would look like this:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
I have achieved it with the following code:
out = df.copy()
for i in range(len(df)):
    out.loc[i, 'a'] = df.loc[0:i, 'a'].notna().sum()
However, I can only do this one column at a time. My real DataFrame contains thousands of columns, so iterating over all of them is far too slow. What can I do? Maybe it should be something related to the pandas .apply() function.
There's no need for apply. It can be done much more efficiently using notna + cumsum (notna marks the non-NaN values, cumsum turns the marks into running counts):
out = df.notna().cumsum()
Output:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
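A runnable sketch of the notna + cumsum approach, rebuilding the frame from the question:

```python
import numpy as np
import pandas as pd

# The frame from the question.
df = pd.DataFrame({
    'a': [np.nan, np.nan, np.nan, 7.0, 3.0, 5.0, 7.0, 8.0, np.nan, np.nan],
    'b': [8, 7, 5, 3, 5, 4, 1, 9, 5, 6],
    'c': [np.nan] * 7 + [3.0, 5.0, 4.0],
})

# notna() marks every non-NaN cell as True; cumsum() turns those marks into
# running counts, column by column, in a single vectorized pass.
out = df.notna().cumsum()
print(out)
```

Because both operations work on the whole frame at once, the column count doesn't matter: thousands of columns are handled in the same call.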
I have the following dataframe:
a = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]], columns=['a','b','c'])
a
Out[234]:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
I want to add a column d in which only the last row is filled, holding the mean of the last two values of column c. Something like:
a b c d
0 1 2 3 NaN
1 4 5 6 NaN
2 7 8 9 NaN
3 10 11 12 mean(9,12)
I tried this but the first part gives an error:
a['d'].iloc[-1] = a.c.iloc[-2:].values.mean()
You can use .at to assign at a single row/column label pair:
ix = a.shape[0]
a.at[ix-1,'d'] = a.loc[ix-2:ix, 'c'].values.mean()
a b c d
0 1 2 3 NaN
1 4 5 6 NaN
2 7 8 9 NaN
3 10 11 12 10.5
Also note that chained indexing (what you're doing with a.c.iloc[-2:]) is explicitly discouraged in the docs, given that pandas sees these operations as separate events, namely two separate calls to __getitem__, rather than a single call using a nested tuple of slices.
You may set the d column beforehand (so the assignment has somewhere to land):
In [100]: a['d'] = np.nan
In [101]: a['d'].iloc[-1] = a.c.iloc[-2:].mean()
In [102]: a
Out[102]:
a b c d
0 1 2 3 NaN
1 4 5 6 NaN
2 7 8 9 NaN
3 10 11 12 10.5
We can use .loc, .iloc & np.mean
a.loc[a.index.max(), 'd'] = np.mean(a.iloc[-2:, 2])
a b c d
0 1 2 3 NaN
1 4 5 6 NaN
2 7 8 9 NaN
3 10 11 12 10.5
Or just using .loc and np.mean:
a.loc[a.index.max(), 'd'] = np.mean(a.loc[a.index.max()-1:, 'c'])
a b c d
0 1 2 3 NaN
1 4 5 6 NaN
2 7 8 9 NaN
3 10 11 12 10.5
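All three answers come down to writing a single cell by label. A self-contained sketch using .loc, which, like .at, creates the new column on assignment and fills the other rows with NaN:

```python
import pandas as pd

a = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
                 columns=['a', 'b', 'c'])

# Writing one labeled cell of a not-yet-existing column creates the column,
# with NaN everywhere except the targeted row.
a.loc[a.index[-1], 'd'] = a['c'].iloc[-2:].mean()
print(a)
```

Using a.index[-1] rather than a hard-coded position keeps this working for any index labels.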
Say I have a dataframe:
x y
0 1 5
1 2 4
2 3 3
3 4 2
4 5 1
I want to run something like df.y.shift(3) and have it not only shift y down 3 rows but also create three new rows with NaN for x, so that the dataframe becomes
x y
0 1 N
1 2 N
2 3 N
3 4 5
4 5 4
5 N 3
6 N 2
7 N 1
Where N=NaN
The shift operator only seems to shift a column down n rows, and it also "knocks" the last n items off the column. I want to keep those n items in my new dataframe.
You can reindex first, and then shift only one of the columns:
df = df.reindex(range(8))
df["y"] = df.y.shift(3)
x y
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
3 4.0 5.0
4 5.0 4.0
5 NaN 3.0
6 NaN 2.0
7 NaN 1.0
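The range(8) above is hard-coded for this example. Assuming the default RangeIndex, it generalizes to len(df) plus the shift size:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [5, 4, 3, 2, 1]})

periods = 3
# Extend the default RangeIndex by `periods` all-NaN rows, then shift only y
# down into them; x keeps its values in the original rows.
df = df.reindex(range(len(df) + periods))
df['y'] = df['y'].shift(periods)
print(df)
```

Note that reindexing introduces NaN into x, so both columns end up as floats.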
I have a single row of a pandas DataFrame, i.e.
x p y q z
---------
1 4 2 5 3
I want to append only some columns ('x','y','z') of it to another dataframe as new columns with names 'a','b','c'.
Before:
A B
---
7 8
9 6
8 5
After
A B a b c
---------
7 8 1 2 3
9 6 1 2 3
8 5 1 2 3
try this,
df1=pd.DataFrame({'x':[1],'y':[2],'z':[3]})
df2=pd.DataFrame({'A':[7,9,8],'B':[8,6,5]})
print(pd.concat([df2, df1], axis=1).ffill().rename(columns={'x': 'a', 'y': 'b', 'z': 'c'}))
A B a b c
0 7 8 1.0 2.0 3.0
1 9 6 1.0 2.0 3.0
2 8 5 1.0 2.0 3.0
Use assign with a Series created by selecting the first row of df1:
cols = ['x','y','z']
new_cols = ['a','b','c']
df = df2.assign(**pd.Series(df1[cols].iloc[0].values, index=new_cols))
print (df)
A B a b c
0 7 8 1 2 3
1 9 6 1 2 3
2 8 5 1 2 3
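An alternative sketch: assign each source value as a scalar, which pandas broadcasts down every row of df2. The zip simply pairs the new names with the old columns to pull from:

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1], 'p': [4], 'y': [2], 'q': [5], 'z': [3]})
df2 = pd.DataFrame({'A': [7, 9, 8], 'B': [8, 6, 5]})

# Assigning a scalar to a new column broadcasts it to every row.
for new_col, old_col in zip(['a', 'b', 'c'], ['x', 'y', 'z']):
    df2[new_col] = df1.at[0, old_col]
print(df2)
```

Unlike the concat + ffill route, this keeps the integer dtypes, since no NaN is ever introduced.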
I have a dataframe
C V S D LOC
1 2 3 4 X
5 6 7 8
1 2 3 4
5 6 7 8 Y
9 10 11 12
How can I select the rows from the X in LOC through the Y in LOC and export them to another CSV?
Use idxmax to get the first index label where each condition is True:
df = df.loc[(df['LOC'] == 'X').idxmax():(df['LOC'] == 'Y').idxmax()]
print (df)
C V S D LOC
0 1 2 3 4 X
1 5 6 7 8 NaN
2 1 2 3 4 NaN
3 5 6 7 8 Y
In [133]: df.loc[df.index[df.LOC=='X'][0]:df.index[df.LOC=='Y'][0]]
Out[133]:
C V S D LOC
0 1 2 3 4 X
1 5 6 7 8 NaN
2 1 2 3 4 NaN
3 5 6 7 8 Y
PS: this will select all rows between the first occurrence of X and the first occurrence of Y.
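Putting the idxmax answer together with the CSV export (a sketch; to_csv is rendered to a string here, but passing a filename works the same way):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'C': [1, 5, 1, 5, 9],
    'V': [2, 6, 2, 6, 10],
    'S': [3, 7, 3, 7, 11],
    'D': [4, 8, 4, 8, 12],
    'LOC': ['X', np.nan, np.nan, 'Y', np.nan],
})

# idxmax on a boolean Series returns the index label of the first True,
# i.e. the first X and the first Y; .loc slicing includes both endpoints.
start = (df['LOC'] == 'X').idxmax()
stop = (df['LOC'] == 'Y').idxmax()
subset = df.loc[start:stop]

csv_text = subset.to_csv(index=False)  # or subset.to_csv('subset.csv', index=False)
print(csv_text)
```
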