Pandas dataframe trying to retrieve integer in dataframe - python

I have a pandas dataframe which is as follows:
s = index_df[(index_df['id2'].values == result[z][3])]
print s.iloc[:, [0]]
which will give me the result
id1
36 14559
I'm trying to store the value 14559 into a variable with the following:
value = s.iloc[:, [0]]
But it keeps giving me an error:
ValueError: Incompatible indexer with DataFrame
Any idea how I could solve this?
EDIT:
My data is declared as follows:
result:
result=[(fuzz.WRatio(n, n2),n2,sdf.index[x],bdf.index[y])
        for y, n2 in enumerate(Col2['CSGNE_NAME']) if fuzz.WRatio(n, n2)>80 and len(n2) >= 2
       ]
And this is how I declare and append to the dataframe:
index_df = pd.DataFrame(columns=['id1','id2', 'score'])
index_df = index_df.append({'id1':result[z][2], 'id2':result[z][3], 'score':result[z][0]}, ignore_index=True)

I believe you need:
s.iloc[:, 0]
Or:
s.iloc[0, 0]
Or convert the values to a list and use next to extract the first value:
L = index_df[(index_df['id2'].values == result[z][3])].values.tolist()
# default value returned by next() if nothing matches the condition
out = next(iter(L), 'no matched value')
Sample:
index_df = pd.DataFrame({'id2':[1,2,3,2],
'id1':[10,20,30,40]})
print (index_df)
id2 id1
0 1 10
1 2 20
2 3 30
3 2 40
# if possible, select the column by name with .loc (`id1`)
L = index_df.loc[index_df['id2'].values == 2, 'id1']
# default value returned by next() if nothing matches the condition
out = next(iter(L), 'no matched value')
print (out)
20
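To make the difference between these indexers concrete, here is a minimal sketch on the sample frame above. It assumes `id1` sits at position 1, as in the sample, and the names `as_frame`, `as_series`, `first_value` and `missing` are only illustrative.

```python
import pandas as pd

index_df = pd.DataFrame({"id2": [1, 2, 3, 2], "id1": [10, 20, 30, 40]})
s = index_df[index_df["id2"] == 2]

# A list of positions keeps two dimensions, so the result is still a DataFrame.
as_frame = s.iloc[:, [1]]
# A single position for the column gives a Series.
as_series = s.iloc[:, 1]
# A single position for both row and column gives the scalar itself.
first_value = s.iloc[0, 1]

# next() with a default avoids an exception when nothing matches.
out = next(iter(index_df.loc[index_df["id2"] == 2, "id1"]), "no matched value")
missing = next(iter(index_df.loc[index_df["id2"] == 99, "id1"]), "no matched value")

print(type(as_frame).__name__, type(as_series).__name__, first_value, out, missing)
```

Only the scalar form (or the `next` form) gives a plain value that can be stored in a variable and compared directly.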

Related

How to get last value of column from a data frame

I have a data frame like this
ntil ureach_x ureach_y awgt
0 1 1 34 2204.25
1 2 35 42 1700.25
2 3 43 48 898.75
3 4 49 53 160.25
and an array of values like this
ulist = [41,57]
For each value in the list [41,57] I am trying to find if the values fall in between ureach_x and ureach_y and return the awgt value.
awt=[]
for u in ulist:
    for index,rows in df.iterrows():
        if (u >= rows['ureach_x'] and u <= rows['ureach_y']):
            awt.append(rows['awgt'])
The above code works for values within the ranges of ureach_x and ureach_y. How do I check whether a value in the list is greater than ureach_y in the last row? My data frame has a dynamic shape with a varying number of rows.
For example, The desired output for value 57 in the list is 160.25
I tried the following:
for u in ulist:
    for index,rows in df.iterrows():
        if (u >= rows['ureach_x'] and u <= rows['ureach_y']):
            awt.append(rows['awgt'])
        elif (u >= rows['ureach_x'] and u > rows['ureach_y']):
            awt.append(rows['awgt'])
However, this returns multiple values for 41 in the list. How do I refer only to the last value in the ureach_y column inside an iterrows loop?
The expected output is as follows. For the values in the list
[41,57]
the corresponding values from df have to be returned:
[1700.25, 160.25]
If I've understood correctly, you can perform a merge_asof:
s = pd.Series([41,57], name='index')
(pd.merge_asof(s, df, left_on='index', right_on='ureach_x')
   .set_index('index')['awgt']
)
Output:
index
41 1700.25
57 160.25
Name: awgt, dtype: float64
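For reference, here is a self-contained sketch of the same idea. It rebuilds the question's frame by hand and uses a one-column DataFrame with a column called `value` (my naming, not from the question) for the lookup values. Two assumptions are worth stating: `merge_asof` needs both sides sorted on their keys, and because the match is on `ureach_x` only, it relies on the ranges being contiguous, as they are in the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "ntil": [1, 2, 3, 4],
    "ureach_x": [1, 35, 43, 49],
    "ureach_y": [34, 42, 48, 53],
    "awgt": [2204.25, 1700.25, 898.75, 160.25],
})
ulist = [41, 57]

# merge_asof requires sorted keys on both sides.
left = pd.DataFrame({"value": ulist}).sort_values("value")

# Default direction is 'backward': each value is paired with the last row
# whose ureach_x is less than or equal to it, so 57 falls back to the last row.
matched = pd.merge_asof(left, df, left_on="value", right_on="ureach_x")
awt = matched["awgt"].tolist()
print(awt)
```

A value smaller than the first `ureach_x` has no backward match and comes back as NaN.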
If you have 0 in the list and you want 2204.25 returned for it, you can add two lines to @mozway's code and perform merge_asof twice, once going backwards and once going forwards, then combine the two.
ulist = [0, 41, 57]
srs = pd.Series(ulist, name='num')
backward = pd.merge_asof(srs, df, left_on='num', right_on='ureach_x')
forward = pd.merge_asof(srs, df, left_on='num', right_on='ureach_x', direction='forward')
out = backward.combine_first(forward)['awgt']
Output:
0 2204.25
1 1700.25
2 160.25
Name: awgt, dtype: float64
Another option (an explicit loop over ulist):
out = []
for num in ulist:
    if ((df['ureach_x'] <= num) & (num <= df['ureach_y'])).any():
        x = df.loc[(df['ureach_x'] <= num) & (num <= df['ureach_y']), 'awgt'].iloc[-1]
    elif (df['ureach_x'] > num).any():
        x = df.loc[df['ureach_x'] > num, 'awgt'].iloc[0]
    else:
        x = df.loc[df['ureach_y'] < num, 'awgt'].iloc[-1]
    out.append(x)
Output:
[2204.25, 1700.25, 160.25]

Ignoring an invalid filter among multiple filters on a DataFrame

Problem Statement:
I have a DataFrame that has to be filtered with multiple conditions.
Each condition is optional: if the user enters an invalid value for a certain condition, that condition is skipped completely and the DataFrame is returned as if the condition had never been applied.
While I can implement this quite easily with multiple if-conditions, modifying the DataFrame sequentially, I am looking for something more elegant and scalable (as the number of input parameters grows), preferably using built-in pandas functionality.
Reproducible Example
Dummy dataframe -
df = pd.DataFrame({'One':['a','a','a','b'],
'Two':['x','y','y','y'],
'Three':['l','m','m','l']})
print(df)
One Two Three
0 a x l
1 a y m
2 a y m
3 b y l
Let's say that invalid values are the values that don't belong to the respective column. So, for column 'One' all values other than 'a' and 'b' are invalid. If the user inputs 'a', I should be able to filter the DataFrame as df[df['One']=='a']; however, if the user inputs an invalid value, no such filter should be applied, and the original dataframe df is returned.
My attempt (with multiple parameters):
def valid_filtering(df, inp):
    if inp[0] in df['One'].values:
        df = df[df['One']==inp[0]]
    if inp[1] in df['Two'].values:
        df = df[df['Two']==inp[1]]
    if inp[2] in df['Three'].values:
        df = df[df['Three']==inp[2]]
    return df
With all valid inputs -
inp = ['a','y','m'] #<- all filters valid so df is filtered before returning
print(valid_filtering(df, inp))
One Two Three
1 a y m
2 a y m
With some invalid inputs -
inp = ['a','NA','NA'] #<- only first filter is valid, so other 2 filters are ignored
print(valid_filtering(df, inp))
One Two Three
0 a x l
1 a y m
2 a y m
P.S. Additional question - is there a way to get DataFrame indexing to behave as -
df[df['One']=='valid'] -> returns filtered df
df[df['One']=='invalid'] -> returns original df
Because this would help me rewrite my filtering -
df[(df['One']=='valid') & (df['Two']=='invalid') & (df['Three']=='valid')] -> Filtered by col One and Three
EDIT: Solution -
An updated solution inspired by the code and logic provided by @corralien and @Ben.T
df.loc[(df.eq(inp) | ~df.eq(inp).any(axis=0)).all(axis=1)]
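Unpacked into steps, the one-liner above could look like the sketch below. The function name `filter_ignoring_invalid` and the intermediate variable names are mine, and it assumes `inp` holds exactly one value per column, in column order.

```python
import pandas as pd

def filter_ignoring_invalid(df, inp):
    """Filter df by inp (one value per column), skipping values that never occur."""
    # Boolean frame: True where a cell equals the input value for its column.
    matches = df.eq(inp)
    # Boolean Series indexed by column: True where the input value never
    # appears in that column, i.e. the filter is invalid.
    invalid = ~matches.any(axis=0)
    # Invalid columns are treated as matching everywhere; a row is kept
    # only if every column matches.
    keep = (matches | invalid).all(axis=1)
    return df.loc[keep]

df = pd.DataFrame({"One": ["a", "a", "a", "b"],
                   "Two": ["x", "y", "y", "y"],
                   "Three": ["l", "m", "m", "l"]})

print(filter_ignoring_invalid(df, ["a", "y", "m"]))
print(filter_ignoring_invalid(df, ["a", "NA", "NA"]))
```

Passing the axis by keyword keeps the call unambiguous across pandas versions.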
Here is one way: create a Boolean DataFrame that marks where each value of inp matches its column. Then use any down the rows to find the columns with at least one True, and use all across those selected columns to keep the rows where every one of them is True.
def valid_filtering(df, inp):
    # check where the inp values equal the values in df
    # (inp is repeated once per row so the shapes match)
    m = (df==pd.DataFrame(data=[inp]*len(df), index=df.index, columns=df.columns))
    # select the columns with at least one True
    cols = m.columns[m.any()]
    # select the rows that are all True among the wanted columns
    rows = m[cols].all(axis=1)
    # return df with the selected rows
    return df.loc[rows]
Note that if you have fewer filters than columns in your original df, a dictionary works too. In the example below, the column Three is ignored because its comparison is all False.
d = {'One': 'a', 'Two': 'y'}
m = (df==pd.DataFrame(d, index=df.index).reindex(columns=df.columns))
The key is that if a column comes back all False (~b.any(), i.e. an invalid filter), it is replaced with True so that all values of that column are accepted:
mask = df.eq(inp).apply(lambda b: np.where(~b.any(), True, b))
out = df.loc[mask.all(axis="columns")]
Case 1: inp = ['a','y','m'] (with all valid inputs)
>>> out
One Two Three
1 a y m
2 a y m
Case 2: inp = ['a','NA','NA'] (with some invalid inputs)
>>> out
One Two Three
0 a x l
1 a y m
2 a y m
Case 3: inp = ['NA','NA','NA'] (with no valid inputs)
>>> out
One Two Three
0 a x l
1 a y m
2 a y m
3 b y l
Case 4: inp = ['b','x','m'] (with all valid inputs but no results)
>>> out
Empty DataFrame
Columns: [One, Two, Three]
Index: []
Of course, you can increase input parameters:
df["Four"] = ['i','j','k','k']
inp = ['a','NA','m','k']
>>> out
One Two Three Four
2 a y m k
Another way with list comprehension:
def valid_filtering(df, inp):
    series = [df[column] == inp[i]
              for i, column in enumerate(df.columns) if len(df[df[column] == inp[i]].values) > 0]
    for s in series: df = df[s]
    return df
Output of print(valid_filtering(df, ['a','NA','NA'])):
One Two Three
0 a x l
1 a y m
2 a y m

Retrieve certain value located in dataframe in any row or column and keep it in separate column without forloop

I have a dataframe like below
df
A          B          C
0          1          TRANSIT_1
TRANSIT_3
0          TRANSIT_5
And I want to change it to below:
Resulting DF
A          B          C          D
0          1          TRANSIT_1  TRANSIT_1
TRANSIT_3                        TRANSIT_3
0          TRANSIT_5             TRANSIT_5
So I tried to use str.contains, and once I have the series of True/False values, I pass it to the eval function to somehow get the table I want.
Code I tried:
series_index = pd.DataFrame()
series_index = df.columns.str.contains("^TRANSIT_", case=True, regex=True)
print(type(series_index))
series_index.index[series_index].tolist()
I thought I could use the eval function to write it to a separate column, like
df = eval(df[result]=the index) # I don't know, but the eval function does the evaluation and puts it in a separate column
I couldn't find a simple one-liner, but this works:
idx = list(df1[df1.where(df1.applymap(lambda x: 'TRA' in x if isinstance(x, str) else False)).notnull()].stack().index)
a, b = [], []
for sublist in idx:
    a.append(sublist[0])
    b.append(sublist[1])
df1['ans'] = df1.lookup(a,b)
Output
A B C ans
0 0 1 TRANSIT_1 TRANSIT_1
1 TRANSIT_3 None None TRANSIT_3
2 0 TRANSIT_5 None TRANSIT_5
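`DataFrame.lookup` was removed in pandas 2.0, so the snippet above needs an older pandas. A possible alternative is sketched below. It rebuilds the frame from the output shown above, assumes the wanted values are strings starting with `TRANSIT_`, and takes the first such value in each row; `is_transit` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

df1 = pd.DataFrame({
    "A": [0, "TRANSIT_3", 0],
    "B": [1, None, "TRANSIT_5"],
    "C": ["TRANSIT_1", None, None],
})

def is_transit(value):
    # Hypothetical helper: True only for strings that start with 'TRANSIT_'.
    return isinstance(value, str) and value.startswith("TRANSIT_")

# Boolean frame marking the cells that hold a TRANSIT_ string.
mask = df1.apply(lambda col: col.map(is_transit))

# Blank out everything else, pull the remaining values to the left with a
# backward fill along each row, and keep the first column of the result.
df1["ans"] = df1.where(mask).bfill(axis=1).iloc[:, 0]
print(df1)
```

Rows without any matching cell would end up with a missing value in `ans`.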

using previous row value by looping through index conditioning

I have a dataframe with a column x.
I want to make a new column x_new, with the first row of this new column set to a specific number (let's say -2).
Then, from the 2nd row onwards, each value should be computed by passing the previous row's value through the cx function.
data = {'x':[1,2,3,4,5]}
df=pd.DataFrame(data)
def cx(x):
    if df.loc[1,'x_new']==0:
        df.loc[1,'x_new']= -2
    else:
        x_new = -10*x + 2
    return x_new
df['x_new']=(cx(df['x']))
The final dataframe
I am not sure on how to do this.
Thank you for your help
This is what I have so far:
data = {'depth':[1,2,3,4,5]}
df=pd.DataFrame(data)
df
# calculate equation
def depth_cal(d):
    z = -3*d+1 #d must be previous row
    return z
depth_cal=(depth_cal(df['depth'])) # how to set d as previous row
print (depth_cal)
depth_new =[]
for row in df['depth']:
    if row == 1:
        depth_new.append('-5.63')
    else:
        depth_new.append(depth_cal) #Does not put list in a column
df['Depth_correct']= depth_new
correct output:
There are still two problems with this:
1. it does not put the depth_cal list into the column properly
2. in the depth_cal function, I want d to be the value from the previous row
Thank you
I would do this by just using a loop to generate your new data. It might not be ideal for a particularly large frame, but it's a quick operation. Let me know how you get on with this:
data = {'depth':[1,2,3,4,5]}
df=pd.DataFrame(data)
res = data['depth']
res[0] = -5.63
for i in range(1, len(res)):
    res[i] = -3 * res[i-1] + 1
df['new_depth'] = res
print(df)
To get
depth new_depth
0 1 -5.63
1 2 17.89
2 3 -52.67
3 4 159.01
4 5 -476.03
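The same recurrence can also be wrapped in a small helper that builds a fresh list instead of overwriting `data['depth']`. This is only a sketch: the function name `recurrence` and its parameters `a` and `b` are my own, with defaults chosen to match z = -3*d + 1 from the question.

```python
import pandas as pd

def recurrence(first, n, a=-3.0, b=1.0):
    """Return n values where each one is a * previous + b, starting at first."""
    values = [first]
    for _ in range(n - 1):
        # Each new value depends only on the one just before it.
        values.append(a * values[-1] + b)
    return values

df = pd.DataFrame({"depth": [1, 2, 3, 4, 5]})
df["new_depth"] = recurrence(-5.63, len(df))
print(df)
```

For the x_new variant in the question, the call would presumably be `recurrence(-2, len(df), a=-10, b=2)`.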

Fill pandas data frame using .append()

I have a dataframe with a column containing comma separated strings. What I want to do is separate them by comma, count them and append the counted number to a new data frame. If the column contains a list with only one element, I want to differentiate whether it is a string or an integer. If it is an integer, I want to append the value 0 in that row to the new df.
My code looks as follows:
def decide(dataframe):
    df=pd.DataFrame()
    for liste in DataFrameX['Column']:
        x=liste.split(',')
        if len(x) > 1:
            df.append(pd.Series([len(x)]), ignore_index=True)
        else:
            #check if element in list is int
            for i in x:
                try:
                    int(i)
                    print i
                    x = []
                    df.append(pd.Series([int(len(x))]), ignore_index=True)
                except:
                    print i
                    x = [1]
                    df.append(pd.Series([len(x)]), ignore_index=True)
    return df
The Input data look like this:
C1
0 a,b,c
1 0
2 a
3 ab,x,j
If I now run the function with my original dataframe as input, it returns an empty dataframe. Through the print statement in the try/except statements I could see that everything works. The problem is appending the resulting values to the new dataframe. What do I have to change in my code? If possible, please do not give an entire different solution, but tell me what I am doing wrong in my code so I can learn.
******************UPDATE************************************
I edited the code so that it can be called as lambda function. It looks like this now:
def decide(x):
    for liste in DataFrameX['Column']:
        x=liste.split(',')
        if len(x) > 1:
            x = len(x)
            print x
        else:
            #check if element in list is int
            for i in x:
                try:
                    int(i)
                    x = []
                    x = len(x)
                    print x
                except:
                    x = [1]
                    x = len(x)
                    print x
And I call it like this:
df['Count']=df['C1'].apply(lambda x: decide(x))
It prints the right values, but the new column only contains None.
Any ideas why?
This is a good start, it could be simplified, but I think it works as expected.
#I have a dataframe with a column containing comma separated strings.
df = pd.DataFrame({'data': ['apple, peach', 'banana, peach, peach, cherry','peach','0']})
# What I want to do is separate them by comma, count them and append the counted number to a new data frame.
df['data'] = df['data'].str.split(',')
df['count'] = df['data'].apply(lambda row: len(row))
# If the column contains a list with only one element
df['first'] = df['data'].apply(lambda row: row[0])
# I want to differentiate whether it is a string or an integer
df['first'] = pd.to_numeric(df['first'], errors='coerce')
# if the element in x is an integer, len(x) should be set to zero
df.loc[pd.notnull(df['first']), 'count'] = 0
# Dropping temp column
df.drop(columns='first', inplace=True)
df
data count
0 [apple, peach] 2
1 [banana, peach, peach, cherry] 4
2 [peach] 1
3 [0] 0
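As for what goes wrong in the question's own code: in the first version, `df.append(...)` returned a new DataFrame instead of modifying `df` in place, and that return value was thrown away, so `df` stayed empty (`DataFrame.append` was also removed in pandas 2.0). In the updated version, `decide` prints but never returns, so `apply` collects `None` for every row. A minimal sketch that stays close to the original logic, using the sample input from the question:

```python
import pandas as pd

def decide(cell):
    """Count comma-separated items; a single integer-like item counts as 0."""
    parts = cell.split(",")
    if len(parts) > 1:
        return len(parts)
    try:
        int(parts[0])
    except ValueError:
        # Single element that is not an integer.
        return 1
    # Single element that parses as an integer.
    return 0

df = pd.DataFrame({"C1": ["a,b,c", "0", "a", "ab,x,j"]})

# apply works on one cell at a time, so decide must return a value.
df["Count"] = df["C1"].apply(decide)

# To fill a separate frame, collect the values first and build it once.
counts = pd.DataFrame({"Count": [decide(cell) for cell in df["C1"]]})
print(df)
```

The function receives one cell per call, so it no longer needs to loop over the whole column itself.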
