I am trying to get new columns B and C with the condition that a value goes into the 'positive' column B if the 'A' of one day is bigger than the 'A' of the day before; otherwise, the value goes into the 'negative' column C.
Here is an example of what I am trying to get:
   A       B       C
0  167765
1  235353  235353
2   89260          89260
3  188382  188382
4  104677          104677
5  207723  207723
I notice that this will cause an index error, because the number of values in columns B and C will be different from the original column A.
Currently, I am testing with the following to move specific data to column B, and it raises a "Length of values does not match length of index" error:
df['B'] = np.where(df['A'] <= 250000)
How do I accomplish the desired output, where the first row is NA or empty?
Desired output:
   B       C
0
1  235353
2          89260
3  188382
4          104677
5  207723
I'm not able to understand how you got to your final result by the method you're describing.
In my understanding, a value should be placed in column B if it is greater than the value of the day before, and in column C otherwise.
You may need to correct me or adapt this answer if you meant differently.
The trick is to use .where on a pandas Series object, which inserts the NaNs automatically.
import pandas as pd

df = pd.DataFrame({'A': [167765, 235353, 89260, 188382, 104677, 207723]})
diffs = df['A'].diff()
df['B'] = df['A'].where(diffs >= 0)
df['C'] = df['A'].where(diffs < 0)
diffs is going to be the following Series which also comes with a handy NaN in the first row.
0 NaN
1 67588.0
2 -146093.0
3 99122.0
4 -83705.0
5 103046.0
Name: A, dtype: float64
Comparing with NaN always returns False. Therefore we can leave the first row empty by testing for the positive and the negative case separately.
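For instance, a quick check of that behaviour (a minimal illustration):
import numpy as np

# Any ordered comparison against NaN evaluates to False
print(np.nan >= 0)  # False
print(np.nan < 0)   # False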
The table resulting from the code above looks like this:
A B C
0 167765 NaN NaN
1 235353 235353.0 NaN
2 89260 NaN 89260.0
3 188382 188382.0 NaN
4 104677 NaN 104677.0
5 207723 207723.0 NaN
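As a side note, if you prefer np.where, here is a sketch of the same idea. Note that np.where with only a condition, as in the original attempt, returns a tuple of row indices, which is what triggered the length-mismatch error; the value and fallback arguments are needed here:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [167765, 235353, 89260, 188382, 104677, 207723]})
diffs = df['A'].diff()

# Three-argument form: take the value where the condition holds, NaN otherwise
df['B'] = np.where(diffs >= 0, df['A'], np.nan)
df['C'] = np.where(diffs < 0, df['A'], np.nan)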
You can try giving an explicit list of indices:
df['B'] = np.where(df.index.isin([1, 2, 3]), df['A'], np.nan)
df['C'] = np.where(df.index.isin([4, 5]), df['A'], np.nan)
I wish to set the values of a dataframe that lie between an index range and a value range to be NaN values. For example, say I have n columns, I want for every numeric data point in these columns to be set to NaN if they meet the following conditions:
The value is between -1 and 1
The index of this value is between 1 and 3
Below I have some code that tries to do what I'm describing above, and it almost works; the problem is that it sets these values on a copy of the original dataframe, and trying to use .loc throws the following error:
KeyError: "None of [Index([('a',), ('b',), ('c',)], dtype='object')]
are in the [columns]"
import numpy as np
import pandas as pd
np.random.seed(398)
df = pd.DataFrame(np.random.randn(5, 3), columns=['a', 'b', 'c'])
row_indexer = (df.index > 0) & (df.index < 4)
col_indexer = (df > -1) & (df < 1)
df[row_indexer][col_indexer] = np.nan
I'm sure there's a really simple solution, I just can't figure out the correct syntax.
(Additionally, I want to "extract" these filtered values (the ones I'm setting to NaN) into a second dataframe, but I'm fairly sure any solution that solves the primary question will solve this additional issue)
Any help would be appreciated
Try broadcasting with numpy:
df[row_indexer[:,None] & col_indexer] = np.nan
Output:
a b c
0 -1.810802 -0.776590 -0.495147
1 1.381038 NaN 2.334671
2 NaN -1.571401 1.011139
3 -1.200217 -1.013983 NaN
4 1.261759 0.863896 0.228914
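For the additional part of the question (extracting the filtered values into a second dataframe), one possible sketch reusing the indexers from above; it has to be computed before the cells are overwritten:
combined = row_indexer[:, None] & col_indexer

# Keep only the masked cells (NaN everywhere else), then blank them in df
extracted = df.where(combined)
df[combined] = np.nan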
I would do mul, since True * True = True; multiplying the boolean frame by the boolean row mask with axis=0 broadcasts the row condition across the columns:
out = df.mask(col_indexer.mul(row_indexer, axis=0))
Out[81]:
a b c
0 -1.810802 -0.776590 -0.495147
1 1.381038 NaN 2.334671
2 NaN -1.571401 1.011139
3 -1.200217 -1.013983 NaN
4 1.261759 0.863896 0.228914
import pandas as pd
data = [['a',1],['b',2],['c',3]]
df = pd.DataFrame(data, columns=['letter', 'number'])
exclude_list = [2, 4, 6]
I want to change row 2 in df, where "number" == 2, to empty/nan. I want to do this by comparing the "number" column to the exclude list, and if there is a match, exclude that row.
Check these useful functions:
mask: lets you replace values where a condition is met; we use the inplace=True flag to perform the operation without using any auxiliary data structure (or, in simple words, without significant extra storage).
isin: checks whether values are in another list.
df.number.mask(df.number.isin(exclude_list), inplace=True)
df
Out[200]:
letter number
0 a 1.0
1 b NaN
2 c 3.0
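If you prefer to avoid inplace=True, an equivalent assignment (a small sketch of the same idea):
df['number'] = df['number'].mask(df['number'].isin(exclude_list))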
df.loc[df['number'].isin(exclude_list), 'number'] = None
letter number
0 a 1.0
1 b NaN
2 c 3.0
For the whole row:
df.loc[df['number'].isin(exclude_list), :] = None
letter number
0 a 1.0
1 None NaN
2 c 3.0
I have a CSV that is read by my Python code, and a dataframe is created using pandas.
The CSV file is in the following format:
1 1.0
2 99.0
3 20.0
7 63
My code calculates the percentile and then tries to find all rows whose value in the 2nd column is greater than 60.
df = pd.read_csv(io.BytesIO(body), error_bad_lines=False, header=None, encoding='latin1', sep=',')
percentile = df.iloc[:, 1:2].quantile(0.99) # Selecting 2nd column and calculating percentile
criteria = df[df.iloc[:, 1:2] >= 60.0]
While my percentile code works fine, the criteria expression that should find all rows whose 2nd-column value is greater than 60 returns:
NaN NaN
NaN NaN
NaN NaN
NaN NaN
Can you please help me find the error?
Just correct the condition inside criteria. Since the second column is the one with index 1, you should write df.iloc[:, 1].
Example:
import pandas as pd
import numpy as np
b = np.array([[1, 2, 3, 7], [1, 99, 20, 63]])
df = pd.DataFrame(b.T)  # just creating the dataframe
criteria = df[df.iloc[:, 1] >= 60]
print(criteria)
Why?
The cause lies in the type of the condition. Let's inspect:
Case 1:
type(df.iloc[:, 1] >= 60)
Returns pandas.core.series.Series, so it gives
df[df.iloc[:, 1] >= 60]
#out:
0 1
1 2 99
3 7 63
Case 2:
type(df.iloc[:, 1:2] >= 60)
Returns a pandas.core.frame.DataFrame, and gives
df[df.iloc[:, 1:2] >= 60]
#out:
0 1
0 NaN NaN
1 NaN 99.0
2 NaN NaN
3 NaN 63.0
This changes the way the boolean indexing is processed: a boolean Series filters rows, while a boolean DataFrame masks the frame cell by cell, leaving NaN where the condition is False.
Always keep in mind that 3 is a scalar, while 3:4 is a slice.
For more info, it is always good to take a look at the official documentation on pandas indexing.
Your indexing is a bit off: you only have two columns, [0, 1], and you are interested in selecting just the one with index 1. As @applesoup mentioned, the following is enough:
criteria = df[df.iloc[:, 1] >= 60.0]
However, I would consider naming the columns and referencing them by name. This helps you avoid mistakes in case your df structure changes, e.g.:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 7], 'b': [1.0, 99.0, 20.0, 63.]})
criteria = df[df['b'] >= 60.0]
People here seem to be more interested in coming up with alternative solutions instead of digging into his code in order to find out what's really wrong. I will adopt a diametrically opposed strategy!
The problem with your code is that you are indexing your DataFrame df by another DataFrame. Why? Because you use slices instead of integer indexing.
df.iloc[:, 1:2] >= 60.0 # Return a DataFrame with one boolean column
df.iloc[:, 1] >= 60.0 # Return a Series
df.iloc[:, [1]] >= 60.0 # Return a DataFrame with one boolean column
So correct your code by using:
criteria = df[df.iloc[:, 1] >= 60.0]  # Don't slice!
I have a DataFrame which contains a lot of NA values. I want to write a query which returns rows where a particular column is not NA but all other columns are NA.
I can get a Dataframe where all the column values are not NA easily enough:
df[df.interesting_column.notna()]
However, I can't figure out how to then say "from that DataFrame return only rows where every column that is not 'interesting_column' is NA". I can't use .dropna as all rows and columns will contain at least one NA value.
I realise this is probably embarrassingly simple. I have tried lots of .loc variations, join/merges in various configurations and I am not getting anywhere.
Any pointers before I just do a for loop over this thing would be appreciated.
You can simply use a conjunction of the conditions:
df[df.interesting_column.notna() & (df.isnull().sum(axis=1) == len(df.columns) - 1)]
df.interesting_column.notna() checks that the column is non-null.
df.isnull().sum(axis=1) == len(df.columns) - 1 checks that the number of nulls in the row is the number of columns minus 1.
Both conditions together mean that the entry in the column is the only one that is non-null.
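A quick sketch of that expression on a toy frame (here b stands in for interesting_column, purely for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1.0], 'b': [2.0, 3.0], 'c': [np.nan, np.nan]})
mask = df.b.notna() & (df.isnull().sum(axis=1) == len(df.columns) - 1)
print(df[mask])  # keeps row 0 only: b is the sole non-null value there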
The & operator lets you "and" together two boolean columns row by row. Right now, you are using df.interesting_column.notna() to give you a column of True or False values. You could repeat this for all columns, using notna() or isna() as desired, and use the & operator to combine the results.
For example, if you have columns a, b, and c, and you want to find rows where the value in columns a is not NaN and the values in the other columns are NaN, then do the following:
df[df.a.notna() & df.b.isna() & df.c.isna()]
This is clear and simple when you have a small number of columns that you know about ahead of time. But if you have many columns, or if you don't know the column names, you would want a solution that loops over all columns and checks notna() for the interesting_column and isna() for the other columns. The solution by @AmiTavory is a clever way to achieve this. But if you didn't know about that solution, here is a simpler approach.
for colName in df.columns:
    if colName == "interesting_column":
        # keep only rows where the interesting column is non-null
        df = df[df[colName].notna()]
    else:
        # keep only rows where every other column is null
        df = df[df[colName].isna()]
You can use:
rows = df.drop('interesting_column', axis=1).isna().all(1) & df['interesting_column'].notna()
Example (suppose c is the interesting column):
In [99]: df = pd.DataFrame({'a': [1, np.nan, 2], 'b': [1, np.nan, 3], 'c':[4, 5, np.nan]})
In [100]: df
Out[100]:
a b c
0 1.0 1.0 4.0
1 NaN NaN 5.0
2 2.0 3.0 NaN
In [101]: rows = df.drop('c', axis=1).isna().all(1) & df.c.notna()
In [102]: rows
Out[102]:
0 False
1 True
2 False
dtype: bool
In [103]: df[rows]
Out[103]:
a b c
1 NaN NaN 5.0
So I created two dataframes from existing CSV files, both consisting entirely of numbers. The second dataframe has an index from 0 to 8783 and one column of numbers, and I want to add it as a new column to the first dataframe, whose index consists of a month, day and hour. I tried using append, merge and concat and none of them worked, and then I tried simply using:
x1GBaverage['Power'] = x2_cut
where x1GBaverage is the first dataframe and x2_cut is the second. When I did this it added the column, but all the values came out as NaN instead of the numerical values they should be. How should I be approaching this?
x1GBaverage['Power'] = x2_cut.values
problem solved :)
The thing about pandas is that values are implicitly linked to their indices unless you deliberately specify that you only need the values to be transferred over.
If the row counts are the same and you just want to tack it on the end, the indexes either need to match, or you need to pass just the underlying values. In the example below, columns 3 and 5 are the index-matching and values versions, and column 4 is what you're running into now:
In [58]: df = pd.DataFrame(np.random.random((3,3)))
In [59]: df
Out[59]:
0 1 2
0 0.670812 0.500688 0.136661
1 0.185841 0.239175 0.542369
2 0.351280 0.451193 0.436108
In [61]: df2 = pd.DataFrame(np.random.random((3,1)))
In [62]: df2
Out[62]:
0
0 0.638216
1 0.477159
2 0.205981
In [64]: df[3] = df2
In [66]: df.index = ['a', 'b', 'c']
In [68]: df[4] = df2
In [70]: df[5] = df2.values
In [71]: df
Out[71]:
0 1 2 3 4 5
a 0.670812 0.500688 0.136661 0.638216 NaN 0.638216
b 0.185841 0.239175 0.542369 0.477159 NaN 0.477159
c 0.351280 0.451193 0.436108 0.205981 NaN 0.205981
If the row counts differ, you'll need to use df.merge and let it know which columns it should be using to join the two frames.
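For instance, a minimal sketch of that merge, with made-up frames and a hypothetical shared key column named 'hour':
import pandas as pd

left = pd.DataFrame({'hour': [0, 1, 2], 'avg': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'hour': [1, 2, 3], 'power': [10.0, 20.0, 30.0]})

# 'hour' is the join key; rows with no match on the right get NaN in 'power'
merged = left.merge(right, on='hour', how='left')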