How to get last value of column from a data frame - python

I have a data frame like this
   ntil  ureach_x  ureach_y     awgt
0     1         1        34  2204.25
1     2        35        42  1700.25
2     3        43        48   898.75
3     4        49        53   160.25
and an array of values like this
ulist = [41,57]
For each value in the list [41,57], I am trying to find whether the value falls between ureach_x and ureach_y, and return the corresponding awgt value.
awt = []
for u in ulist:
    for index, rows in df.iterrows():
        if (u >= rows['ureach_x'] and u <= rows['ureach_y']):
            awt.append(rows['awgt'])
The above code works for values within the ranges of ureach_x and ureach_y. How do I check whether a value in the list is greater than the last row of ureach_y? My data frame has a dynamic shape with a varying number of rows.
For example, The desired output for value 57 in the list is 160.25
I tried the following:
for u in ulist:
    for index, rows in df.iterrows():
        if (u >= rows['ureach_x'] and u <= rows['ureach_y']):
            awt.append(rows['awgt'])
        elif (u >= rows['ureach_x'] and u > rows['ureach_y']):
            awt.append(rows['awgt'])
However, this returns multiple values for 41 in the list. How do I refer only to the last value of the ureach_y column inside an iterrows loop?
The expected output is as follows: for the values [41,57] in the list, the corresponding values from df have to be returned:
[1700.25, 160.25]

If I've understood correctly, you can perform a merge_asof:
s = pd.Series([41, 57], name='index')
(pd.merge_asof(s, df, left_on='index', right_on='ureach_x')
   .set_index('index')['awgt']
)
Output:
index
41    1700.25
57     160.25
Name: awgt, dtype: float64
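As a quick sanity check on why this works (a sketch assuming the example frame above): merge_asof with the default direction='backward' matches each left key to the last row whose ureach_x is less than or equal to it, which is why 57 picks up the final row; note that both key columns must be sorted.
pd.merge_asof(pd.Series([57], name='index'), df,
              left_on='index', right_on='ureach_x')[['index', 'ureach_x', 'awgt']]
#    index  ureach_x    awgt
# 0     57        49  160.25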

If you have 0 in the data and want 2204.25 returned for it, you can add two lines to @mozway's code and perform merge_asof twice, once going backward and once going forward; then combine the two.
ulist = [0, 41, 57]
srs = pd.Series(ulist, name='num')
backward = pd.merge_asof(srs, df, left_on='num', right_on='ureach_x')
forward = pd.merge_asof(srs, df, left_on='num', right_on='ureach_x', direction='forward')
out = backward.combine_first(forward)['awgt']
Output:
0    2204.25
1    1700.25
2     160.25
Name: awgt, dtype: float64
Another option (an explicit loop over ulist):
out = []
for num in ulist:
    if ((df['ureach_x'] <= num) & (num <= df['ureach_y'])).any():
        x = df.loc[(df['ureach_x'] <= num) & (num <= df['ureach_y']), 'awgt'].iloc[-1]
    elif (df['ureach_x'] > num).any():
        x = df.loc[df['ureach_x'] > num, 'awgt'].iloc[0]
    else:
        x = df.loc[df['ureach_y'] < num, 'awgt'].iloc[-1]
    out.append(x)
Output:
[2204.25, 1700.25, 160.25]
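A loop-free sketch of the same fallback logic, offered as an alternative; it assumes df is sorted by ureach_x ascending (as in the example frame) and uses ulist = [0, 41, 57] from above:
import numpy as np
# index of the last row whose ureach_x <= num, clipped into the frame
idx = np.searchsorted(df['ureach_x'].to_numpy(), ulist, side='right') - 1
idx = idx.clip(0, len(df) - 1)  # values below the first range fall back to row 0
out = df['awgt'].to_numpy()[idx].tolist()
# [2204.25, 1700.25, 160.25]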

Related

Pandas dataframe trying to retrieve integer in dataframe

I have a pandas dataframe which is as follows:
s = index_df[(index_df['id2'].values == result[z][3])]
print s.iloc[:, [0]]
which will give me the result
      id1
36  14559
I'm trying to store the value 14559 into a variable with the following:
value = s.iloc[:, [0]]
But it keeps giving me an error:
ValueError: Incompatible indexer with DataFrame
Any idea how I could solve this?
EDIT:
My dataframe are declared as follows:
result:
result = [(fuzz.WRatio(n, n2), n2, sdf.index[x], bdf.index[y])
          for y, n2 in enumerate(Col2['CSGNE_NAME'])
          if fuzz.WRatio(n, n2) > 80 and len(n2) >= 2
          ]
And this is how I declare and append to the dataframe:
index_df = pd.DataFrame(columns=['id1','id2', 'score'])
index_df = index_df.append({'id1':result[z][2], 'id2':result[z][3], 'score':result[z][0]}, ignore_index=True)
I believe you need:
s.iloc[:, 0]
Or:
s.iloc[0, 0]
Or convert the values to a list and use next to extract the first value:
L = index_df[(index_df['id2'].values == result[z][3])].values.tolist()
# use the default if the condition matched nothing and the list is empty
out = next(iter(L), 'no matched value')
Sample:
index_df = pd.DataFrame({'id2':[1,2,3,2],
                         'id1':[10,20,30,40]})
print (index_df)
   id2  id1
0    1   10
1    2   20
2    3   30
3    2   40
# if possible, specify the column name with .loc (`id1`)
L = index_df.loc[index_df['id2'].values == 2, 'id1']
# use the default if the condition matched nothing and the selection is empty
out = next(iter(L), 'no matched value')
print (out)
20
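A related sketch for pulling out the first matching scalar directly; it assumes at least one row matches (unlike next(iter(L), default), .iat raises IndexError on an empty selection):
value = index_df.loc[index_df['id2'] == 2, 'id1'].iat[0]
print(value)
# 20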

using previous row value by looping through index conditioning

If I have a dataframe with column x, I want to make a new column x_new, but I want the first row of this new column to be set to a specific number (let's say -2). Then, from the 2nd row on, use the previous row's value to iterate through the cx function:
data = {'x':[1,2,3,4,5]}
df = pd.DataFrame(data)
def cx(x):
    if df.loc[1,'x_new'] == 0:
        df.loc[1,'x_new'] = -2
    else:
        x_new = -10*x + 2
        return x_new
df['x_new'] = cx(df['x'])
The final dataframe
I am not sure how to do this. Thank you for your help.
This is what I have so far:
data = {'depth':[1,2,3,4,5]}
df = pd.DataFrame(data)
df
# calculate equation
def depth_cal(d):
    z = -3*d + 1  # d must be previous row
    return z
depth_cal = depth_cal(df['depth'])  # how to set d as previous row?
print (depth_cal)
depth_new = []
for row in df['depth']:
    if row == 1:
        depth_new.append('-5.63')
    else:
        depth_new.append(depth_cal)  # does not put list in a column
df['Depth_correct'] = depth_new
correct output:
There are still two problems with this:
1. it does not put the depth_cal list properly in a column
2. in the depth_cal function, I want d to be the previous row
Thank you
I would do this by just using a loop to generate your new data - might not be ideal if particularly huge but it's a quick operation. Let me know how you get on with this:
data = {'depth':[1,2,3,4,5]}
df = pd.DataFrame(data)
res = data['depth']  # note: this aliases the original list, not the df column
res[0] = -5.63
for i in range(1, len(res)):
    res[i] = -3 * res[i-1] + 1  # each value depends on the previous one
df['new_depth'] = res
print(df)
To get
   depth  new_depth
0      1      -5.63
1      2      17.89
2      3     -52.67
3      4     159.01
4      5    -476.03
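A variant of the same recurrence that avoids mutating the source list (a sketch under the same assumptions, with the seed -5.63 and the formula from the question):
prev = -5.63
new_depth = [prev]
for _ in range(1, len(df)):
    prev = -3 * prev + 1  # each value depends only on the previous one
    new_depth.append(prev)
df['new_depth'] = new_depth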

Find Sub range of values in defined ranges

I have 3 ranges of data values in series:
min_range:
27    893.151613
26    882.384516
20    817.781935
dtype: float64
max_range:
28    903.918710
27    893.151613
21    828.549032
dtype: float64
I have created a list of ranges:
range = zip(min_range, max_range)
output:
[(893.1516129032259, 903.91870967741943), (882.38451612903225, 893.1516129032259), (817.78193548387094, 828.54903225806447)]
I have got a sub-range:
sub_range1 = 824
sub_range2 = 825
I want to find the region in which the sub range lies.
for p,q in zip(min_range, max_range):
    if (sub_range1 > p) & (sub_range2 < q):
        print p,q
output:
817.781935484 828.549032258
I want to find the respective position from that defined "range".
Expected Output:
817.781935484 828.549032258
range = 2 (Position in the range list)
How can I achieve this? Any help would be appreciated.
Use enumerate to get the index, i.e.
for i, (p,q) in enumerate(zip(min_range, max_range)):
    if (sub_range1 > p) & (sub_range2 < q):
        print(i)
Output : 2
A simple approach using a counter:
cnt = 0
for p,q in zip(min_range, max_range):
    if (sub_range1 > p) & (sub_range2 < q):
        print p,q
        print cnt
    cnt = cnt + 1
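For a vectorized alternative, a sketch using an IntervalIndex (assuming min_range and max_range are the Series shown above; closed='neither' mirrors the strict > and < comparisons):
import numpy as np
intervals = pd.IntervalIndex.from_arrays(min_range, max_range, closed='neither')
mask = intervals.contains(824) & intervals.contains(825)
print(np.flatnonzero(mask))  # [2]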

Storing all values when creating a Pandas Pivot Table

Basically, I'm aggregating prices over three indices to determine: mean, std, as well as an upper/lower limit. So far so good. However, now I want to also find the lowest identified price which is still >= the computed lower limit.
My first idea was to use np.min to find the lowest price -> this obviously disregards the lower-limit and is not useful. Now I'm trying to store all the values the pivot table identified to find the price which still is >= lower-limit. Any ideas?
pivot = pd.pivot_table(temp, index=['A','B','C'], values=['price'],
                       aggfunc=[np.mean, np.std], fill_value=0)
pivot['lower_limit'] = pivot['mean'] - 2 * pivot['std']
pivot['upper_limit'] = pivot['mean'] + 2 * pivot['std']
First, merge pivoted['lower_limit'] back into temp (here ABC = ['A','B','C'], as in the full example below). Thus, for each price in temp there is also a lower_limit value.
temp = pd.merge(temp, pivoted['lower_limit'].reset_index(), on=ABC)
Then you can restrict your attention to those rows in temp for which the price is >= lower_limit:
temp.loc[temp['price'] >= temp['lower_limit']]
The desired result can be found by computing a groupby/min:
result = temp.loc[temp['price'] >= temp['lower_limit']].groupby(ABC)['price'].min()
For example,
import numpy as np
import pandas as pd
np.random.seed(2017)
N = 1000
ABC = list('ABC')
temp = pd.DataFrame(np.random.randint(2, size=(N,3)), columns=ABC)
temp['price'] = np.random.random(N)
pivoted = pd.pivot_table(temp, index=['A','B','C'], values=['price'],
                         aggfunc=[np.mean, np.std], fill_value=0)
pivoted['lower_limit'] = pivoted['mean'] - 2 * pivoted['std']
pivoted['upper_limit'] = pivoted['mean'] + 2 * pivoted['std']
temp = pd.merge(temp, pivoted['lower_limit'].reset_index(), on=ABC)
result = temp.loc[temp['price'] >= temp['lower_limit']].groupby(ABC)['price'].min()
print(result)
yields
A  B  C
0  0  0    0.003628
      1    0.000132
   1  0    0.005833
      1    0.000159
1  0  0    0.006203
      1    0.000536
   1  0    0.001745
      1    0.025713
Name: price, dtype: float64
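An equivalent route that skips the pivot/merge round-trip, via groupby/transform (a sketch assuming the same temp and ABC as above; note Series.std uses ddof=1, which may differ from what aggfunc=np.std produced, depending on the pandas version):
g = temp.groupby(ABC)['price']
lower_limit = g.transform('mean') - 2 * g.transform('std')
result = temp.loc[temp['price'] >= lower_limit].groupby(ABC)['price'].min()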

Pandas every nth row

Dataframe.resample() works only with timeseries data. I cannot find a way of getting every nth row from non-timeseries data. What is the best method?
I'd use iloc, which takes a row/column slice, both based on integer position and following normal python syntax. If you want every 5th row:
df.iloc[::5, :]
Though @chrisb's accepted answer does answer the question, I would like to add the following.
A simple method I use to get the nth data or drop the nth row is the following:
df1 = df[df.index % 3 != 0]  # Excludes every 3rd row starting from 0
df2 = df[df.index % 3 == 0]  # Selects every 3rd row starting from 0
This arithmetic-based sampling enables even more complex row selections.
This assumes, of course, that you have an index column of ordered, consecutive, integers starting at 0.
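The same arithmetic generalizes to an offset, a sketch under the same ordered-integer-index assumption:
df_offset = df[df.index % 3 == 1]  # every 3rd row, starting from row 1 instead of row 0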
There is an even simpler solution to the accepted answer that involves directly invoking df.__getitem__.
df = pd.DataFrame('x', index=range(5), columns=list('abc'))
df
a b c
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
For example, to get every 2 rows, you can do
df[::2]
a b c
0 x x x
2 x x x
4 x x x
There's also GroupBy.first/GroupBy.head; you group on the index:
df.index // 2
# Int64Index([0, 0, 1, 1, 2], dtype='int64')
df.groupby(df.index // 2).first()
# Alternatively,
# df.groupby(df.index // 2).head(1)
a b c
0 x x x
1 x x x
2 x x x
The index is floor-divved by the stride (2, in this case). If the index is non-numeric, instead do
# df.groupby(np.arange(len(df)) // 2).first()
df.groupby(pd.RangeIndex(len(df)) // 2).first()
a b c
0 x x x
1 x x x
2 x x x
Adding reset_index() to metastableB's answer means you only need to assume that the rows are ordered and consecutive.
df1 = df[df.reset_index().index % 3 != 0] # Excludes every 3rd row starting from 0
df2 = df[df.reset_index().index % 3 == 0] # Selects every 3rd row starting from 0
df.reset_index().index will create an index that starts at 0 and increments by 1, allowing you to use the modulo easily.
I had a similar requirement, but I wanted the n'th item in a particular group. This is how I solved it.
groups = data.groupby(['group_key'])
selection = groups['index_col'].apply(lambda x: x % 3 == 0)
subset = data[selection]
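A more idiomatic sketch for the per-group case, using GroupBy.cumcount (it assumes the same data and group_key names as the snippet above):
pos = data.groupby('group_key').cumcount()  # 0, 1, 2, ... within each group
subset = data[pos % 3 == 0]  # every 3rd row of each group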
A solution I came up with for when using the index was not viable (possibly the multi-gig .csv was too large, or I missed some technique that would allow me to reindex without crashing):
Walk through one row at a time and collect every nth row into a new dataframe.
import pandas as pd
from csv import DictReader

def make_downsampled_df(filename, interval):
    with open(filename, 'r') as read_obj:
        csv_dict_reader = DictReader(read_obj)
        column_names = csv_dict_reader.fieldnames
        rows = []
        for index, row in enumerate(csv_dict_reader):
            if index % interval == 0:
                rows.append(row)  # keep every interval-th row
    # build the frame once at the end (DataFrame.append was removed in pandas 2.0)
    return pd.DataFrame(rows, columns=column_names)
df.drop(labels=df[df.index % 3 != 0].index, axis=0) # every 3rd row (mod 3)
