I have a dataframe 'DF', part of which looks like this:
I want to select only the values between 0 and 0.01 to form a new dataframe (with blanks where the value was over 0.01).
To do this, I tried:
similarity = []
for x in DF:
    similarity.append([DF[DF.between(0, 0.01).any(axis=1)]])
simdf = pd.DataFrame(similarity)
simdf.to_csv("similarity.csv")
However, I get the error AttributeError: 'DataFrame' object has no attribute 'between'
How do I select a range of values and create a new dataframe from them?
Just do the two comparisons:
df_new = df[(df>0) & (df<0.01)]
Example:
import pandas as pd
df = pd.DataFrame({"a":[0,2,4,54,56,4],"b":[4,5,7,12,3,4]})
print(df[(df>5) & (df<33)])
    a     b
0 NaN   NaN
1 NaN   NaN
2 NaN   7.0
3 NaN  12.0
4 NaN   NaN
5 NaN   NaN
If you want a blank string instead of NaN:
df[(df>5) & (df<33)].fillna("")
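As for the original error: between is a Series method, not a DataFrame method, so it has to be applied column by column. A minimal sketch, using a small made-up frame (the data and the names mask and selected are assumptions, not the asker's DF). Note that between includes both endpoints by default, unlike the strict comparisons above:

```python
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({"a": [0.0, 0.005, 0.02], "b": [0.01, 0.5, 0.003]})

# Series.between is applied to each column; the result is a boolean frame
mask = df.apply(lambda col: col.between(0, 0.01))

# DataFrame.where keeps values where the mask is True and puts NaN elsewhere
selected = df.where(mask)
print(selected)
```

Chaining .fillna("") onto selected gives blanks instead of NaN, as in the answer above.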
Related
I have a simple problem that probably has a simple solution, but I couldn't find it anywhere. I have the following DataFrame with MultiIndex columns:
mux = pd.MultiIndex.from_product([['A','B','C'], ['Datetime', 'Str', 'Ret']])
dfr = pd.DataFrame(columns=mux)
| A | B | C |
|Datetime|Str|Ret|Datetime|Str|Ret|Datetime|Str|Ret|
I need to add values one by one at the end of a specific sub-column. For example, add one value at the end of column A, sub-column Datetime, and leave the rest of the row as it is; then add another value to column B, sub-column Str, again leaving the rest of the values in that row untouched, and so on. So my questions are: Is it possible to target individual locations in this type of DataFrame, and how? And is it possible to append not a full row but an individual value, always after the previous value, without knowing where the end is? Thank you so much for your answers.
IIUC, you can use .loc:
idx = len(dfr) # get the index of the next row after the last one
dfr.loc[idx, ('A', 'Datetime')] = pd.to_datetime('2021-09-24')
dfr.loc[idx, ('B', 'Str')] = 'Hello'
dfr.loc[idx, ('C', 'Ret')] = 4.3
Output:
>>> dfr
A B C
Datetime Str Ret Datetime Str Ret Datetime Str Ret
0 2021-09-24 00:00:00 NaN NaN NaN Hello NaN NaN NaN 4.3
Update
I mean, for example, when different columns hold different numbers of values (say 6 values in column A-Str but only 4 in column B-Datetime) and I don't know the counts in advance. In that case I need to add the next value to a column right after its last one, so I need the index of the last non-NaN value of that particular column to use in your answer. If I use len(dfr) when adding a value to the column that has only 4 values, it ends up in the 7th row instead of the 5th, because one of the columns may have more values than the others.
You can do it easily using last_valid_index. Create a convenient function append_to_col to append values inplace in your dataframe:
def append_to_col(col, val):
    idx = dfr[col].last_valid_index()
    dfr.loc[idx+1 if idx is not None else 0, col] = val
# Fill your dataframe
append_to_col(('A', 'Datetime'), '2021-09-24')
append_to_col(('A', 'Datetime'), '2021-09-25')
append_to_col(('B', 'Str'), 'Hello')
append_to_col(('C', 'Ret'), 4.3)
append_to_col(('C', 'Ret'), 8.2)
append_to_col(('A', 'Datetime'), '2021-09-26')
Output:
>>> dfr
A B C
Datetime Str Ret Datetime Str Ret Datetime Str Ret
0 2021-09-24 NaN NaN NaN Hello NaN NaN NaN 4.3
1 2021-09-25 NaN NaN NaN NaN NaN NaN NaN 8.2
2 2021-09-26 NaN NaN NaN NaN NaN NaN NaN NaN
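The same idea can be sketched without relying on a global dataframe. This variant takes the frame as a parameter (the signature and the small flat-column frame are assumptions for illustration) and, like the function above, assumes a default integer index so that "last index + 1" is meaningful:

```python
import numpy as np
import pandas as pd

def append_to_col(frame, col, val):
    """Write val into the first row after the last non-NaN entry of col."""
    last = frame[col].last_valid_index()      # None if the column is all NaN
    next_idx = 0 if last is None else last + 1
    frame.loc[next_idx, col] = val            # enlarges the frame if needed

# Made-up frame: 'x' has two values, 'y' has none
df = pd.DataFrame({"x": [1.0, 2.0, np.nan], "y": [np.nan, np.nan, np.nan]})

append_to_col(df, "x", 9.0)    # fills the existing gap at row 2
append_to_col(df, "y", 7.0)    # column is empty, so the value goes to row 0
append_to_col(df, "x", 10.0)   # no room left, so a new row 3 is created
print(df)
```

The tuple column keys from the MultiIndex case, such as ('A', 'Datetime'), can be passed as col in the same way.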
I have a dataframe with values like
0 1 2
a 5 NaN 6
a NaN 2 NaN
Need the output by combining the two rows based on index 'a' which is same in both rows
Also need to add multiple columns and output as single column
Need the output as below. Value 13 since adding 5 2 6
0
a 13
Trying this using concat function but getting errors
How about using Pandas dataframe.sum() ?
import pandas as pd
import numpy as np
data = pd.DataFrame({"0":[5, np.nan], "1":[np.nan, 2], "2":[6,np.nan]})
row_total = data.sum(axis = 1, skipna = True)
row_total.sum(axis = 0)
result:
13.0
EDIT: @Chris's comment (I did not see it while writing my answer) shows how to do it in one line, if all rows have the same index.
data:
data = pd.DataFrame({"0":[5, np.nan],
                     "1":[np.nan, 2],
                     "2":[6,np.nan]},
                    index=['a', 'a'])
gives:
0 1 2
a 5.0 NaN 6.0
a NaN 2.0 NaN
Then
data.groupby(data.index).sum().sum(axis=1)
Returns
a    13.0
dtype: float64
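If the result is needed as a one-column dataframe, as in the question's desired output, one possible sketch is to convert the summed Series with to_frame. The integer column labels 0, 1, 2 and the name result are assumptions based on how the question displays its data:

```python
import numpy as np
import pandas as pd

# Same values as above, with integer column labels and a repeated index 'a'
data = pd.DataFrame({0: [5, np.nan], 1: [np.nan, 2], 2: [6, np.nan]},
                    index=["a", "a"])

# Collapse rows sharing an index label, then add the columns together;
# sum() skips NaN by default
result = data.groupby(level=0).sum().sum(axis=1).to_frame(name=0)
print(result)
```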
I tried to insert a new row into a dataframe named 'my_df1' using my_df1.loc, but in the result the new row has NaN values.
my_data = {'A':pd.Series([1,2,3]),'B':pd.Series([4,5,6]),'C':('a','b','c')}
my_df1 = pd.DataFrame(my_data)
print(my_df1)
my_df1.loc[3] = pd.Series([5,5,5])
Result displayed is as below
A B C
0 1.0 4.0 a
1 2.0 5.0 b
2 3.0 6.0 c
3 NaN NaN NaN
The new row is all NaN because my_df1.loc[3] has index (A, B, C) while pd.Series([5,5,5]) has index (0, 1, 2). When you assign one Series to another, pandas aligns on index labels and only copies values for labels the two share; here none match, hence the result.
To fix this, do as #anky_91 says, or if you already have a Series, use its values:
my_df1.loc[3] = my_series.values
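A short sketch of two ways to sidestep the alignment problem: give the Series the dataframe's column labels as its index, or assign a plain list, which carries no labels and is matched by position. The frame and values below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": ["a", "b", "c"]})

# Option 1: a Series whose index matches the column labels aligns correctly
aligned = pd.Series([5, 5, "z"], index=df.columns)
df.loc[3] = aligned

# Option 2: a plain list has no index, so values are assigned by position
df.loc[4] = [7, 8, "y"]
print(df)
```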
Finally I found out how to add a Series as a row or column to a dataframe
my_data = {'A':pd.Series([1,2,3]),'B':pd.Series([4,5,6]),'C':('a','b','c')}
my_df1 = pd.DataFrame(my_data)
print(my_df1)
Code1 adds a new column 'D' and values 5,5,5 to the dataframe
my_df1.loc[:,'D'] = pd.Series([5,5,5],index = my_df1.index)
print(my_df1)
Code2 adds a new row with index 3 and values 3,4,3,4 to the dataframe in code 1
my_df1.loc[3] = pd.Series([3,4,3,4],index = ('A','B','C','D'))
print(my_df1)
I need to select rows from a dataframe where at least one of the columns col1 and col2 is not null.
Right now I am trying the code below, but it doesn't work:
df=df.loc[(df['Cat1_L2'].isnull()) & (df['Cat2_L3'].isnull())==False]
Setup
(Modifying U8-Forward's data)
df = pd.DataFrame({'Cat1_L2':[1,np.nan,3, np.nan], 'Cat3_L3': [np.nan,3,4, np.nan]})
df
Cat1_L2 Cat3_L3
0 1.0 NaN
1 NaN 3.0
2 3.0 4.0
3 NaN NaN
Indexing with isna + sum
Fixing your code: ensure the number of True cases (corresponding to NaN in the columns) is less than 2.
df[df[['Cat1_L2', 'Cat3_L3']].isna().sum(axis=1) < 2]
Cat1_L2 Cat3_L3
0 1.0 NaN
1 NaN 3.0
2 3.0 4.0
dropna with thresh
df.dropna(subset=['Cat1_L2', 'Cat3_L3'], thresh=1)
Cat1_L2 Cat3_L3
0 1.0 NaN
1 NaN 3.0
2 3.0 4.0
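The condition "at least one column is not null" can also be written directly with notna and any, which some may find easier to read than counting NaNs. A minimal sketch using the same setup data (the names cols and kept are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Cat1_L2": [1, np.nan, 3, np.nan],
                   "Cat3_L3": [np.nan, 3, 4, np.nan]})

cols = ["Cat1_L2", "Cat3_L3"]

# True for rows where at least one of the chosen columns holds a value
kept = df[df[cols].notna().any(axis=1)]
print(kept)
```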
One way is to loop over every row using itertuples(). Be aware that this is computationally expensive.
1. Create a list that checks your condition for each row using itertuples()
condition_list = []
for row in df.itertuples():
    # pd.notna catches both NaN and None; "!= None" would treat NaN as present
    if pd.notna(row.Cat1_L2) or pd.notna(row.Cat2_L3):
        condition_list.append(1)
    else:
        condition_list.append(0)
2. Convert list to pandas series
condition_series = pd.Series(condition_list)
3. Append series to original df
df['condition_column'] = condition_series.values
4. Filter df
df_new = df[df.condition_column == 1]
del df_new['condition_column']
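A note on missing-value checks inside row loops: pandas usually stores missing numbers as NaN, and NaN is not equal to None, so a comparison against None does not detect it. A tiny sketch of the difference (the variable names are placeholders):

```python
import numpy as np
import pandas as pd

value = np.nan

# NaN is a float, so comparing it with None reports "different"
differs_from_none = value != None  # noqa: E711 (deliberate comparison)

# pd.notna treats both NaN and None as missing
is_present = pd.notna(value)
none_is_present = pd.notna(None)

print(differs_from_none, is_present, none_is_present)
```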
I have a dataframe which can be simplified like this:
df = pd.DataFrame(index = ['01/11/2017', '01/11/2017', '01/11/2017', '02/11/2017', '02/11/2017', '02/11/2017'],
columns = ['Period','_A', '_B', '_C'] )
df.Period = [1, 2, 3, 1, 2, 3]
df
which looks like:
Date Period _A _B _C
01/11/2017 1 NaN NaN NaN
01/11/2017 2 NaN NaN NaN
01/11/2017 3 NaN NaN NaN
02/11/2017 1 NaN NaN NaN
02/11/2017 2 NaN NaN NaN
02/11/2017 3 NaN NaN NaN
And I want to apply my function to each cell
Get_Y(Date, Period, Location)
(where _A, _B, _C, ... are the locations).
Get_Y is a complex function, that looks up data from other dataframes using the Date, Period and Location, and based on criteria gives a value for Y (a float between 0 and 1).
I have managed to make this work with iterrows:
for index, row in PeriodDF.iterrows():
    date = index
    Period = row.loc[row.index[0]]
    LocationList = row.index[1:]
    print(date, Period)
    for Location in LocationList:
        PeriodDF.loc[(PeriodDF.index == date)&(PeriodDF.Period == Period), Location] = Get_Y(date, Period, Location)
But this takes over 1 hour.
There must be a way to do this faster in pandas.
I have tried creating 3 dataframes, one an array of the Period, one an array of the Location, and one of the Date, but not sure how to apply Get_Y elementwise, using the value from each dataframe.
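One possible direction, sketched under assumptions: the loop above evaluates a boolean mask over the whole frame for every single cell, which is likely a large part of the cost. Building each location column in one pass with a list comprehension avoids that. The get_y function below is a hypothetical stand-in for Get_Y, which is not shown in the question; if Get_Y itself is slow, this will not remove that cost.

```python
import pandas as pd

def get_y(date, period, location):
    # Hypothetical stand-in for the real Get_Y lookup
    return (period + len(location)) / 10.0

df = pd.DataFrame(
    index=["01/11/2017", "01/11/2017", "01/11/2017",
           "02/11/2017", "02/11/2017", "02/11/2017"],
    columns=["Period", "_A", "_B", "_C"],
)
df["Period"] = [1, 2, 3, 1, 2, 3]

locations = [c for c in df.columns if c != "Period"]

# Fill one whole column at a time instead of assigning cell by cell
for location in locations:
    df[location] = [get_y(date, period, location)
                    for date, period in zip(df.index, df["Period"])]
print(df)
```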