I have the following output from a txt file. My goal is to find the difference between the values of Column 2 and Column 3 for as long as the value in Column 1 stays at or below 5. That means my expected output is the difference of the Column 2 and Column 3 values up to Row 5, since the Column 1 value in Row 6 is greater than 5.
1 4 5
2 6 7
3 8 8
4 4 7
5 3 2
6 8 4
I tried the following approach.
import pandas as pd
data= pd.read_table('/Users/Hrihaan/Desktop/A.txt', dtype=float, header=None, sep='\s+').values
x=data[:,0]
y=(data[:,1] for x<=5)
z=(data[:,2] for x<=5)
Diff=y-z
print(Diff)
I received this error (SyntaxError: invalid syntax). Any help on how to get it going would be really appreciated.
You can do this in one step with np.where, which keeps the original length and fills the excluded rows with NaN:
>>> import numpy as np
>>> np.where(data[:, 0] <= 5, data[:, 1] - data[:, 2], np.nan)
array([-1., -1.,  0., -3.,  1., nan])
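If you only want the rows that satisfy the condition, a boolean mask avoids the NaN filler entirely (a minimal sketch using the same data array):
mask = data[:, 0] <= 5
diff = data[mask, 1] - data[mask, 2]  # array([-1., -1.,  0., -3.,  1.])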
For your code, you can use a conditional list comprehension:
y = [i for x, i in zip(data[:, 0], data[:, 1]) if x <= 5]
z = [i for x, i in zip(data[:, 0], data[:, 2]) if x <= 5]
diff = [a - b for a, b in zip(y, z)]
Or, more compactly:
diff = [y - z for x, y, z in data if x <= 5]
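For reference, here is a minimal end-to-end sketch of the comprehension approach, with the example file's values hard-coded as a NumPy array so it runs on its own:
import numpy as np

# same values as the example file
data = np.array([[1, 4, 5], [2, 6, 7], [3, 8, 8],
                 [4, 4, 7], [5, 3, 2], [6, 8, 4]], dtype=float)

diff = [y - z for x, y, z in data if x <= 5]
print(diff)  # [-1.0, -1.0, 0.0, -3.0, 1.0]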
Or you can try this
(df2['v2'].subtract(df2['v3']))[(df2['v1']<=5)]
Out[856]:
0 -1
1 -1
2 0
3 -3
4 1
dtype: int64
Data input
df2
Out[857]:
v1 v2 v3
0 1 4 5
1 2 6 7
2 3 8 8
3 4 4 7
4 5 3 2
5 6 8 4
Assuming your column names are 'a', 'b', and 'c', just swap in your own column names.
Option 1
df.query('a <= 5').eval('b - c')
Option 2
df.b.sub(df.c)[df.a.le(5)]
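Both options should return the same Series. A quick check, assuming the question's data with the columns renamed to a, b, and c:
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [4, 6, 8, 4, 3, 8],
                   'c': [5, 7, 8, 7, 2, 4]})
print(df.query('a <= 5').eval('b - c'))
# 0   -1
# 1   -1
# 2    0
# 3   -3
# 4    1
# dtype: int64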
I think the SyntaxError is coming from your generator expression: (data[:, 1] for x <= 5) is not valid Python, because a generator expression needs the form (item for item in iterable). What is being iterated over?
Anyway, you can directly select the rows with column 0 <= 5 like so:
EDIT: You don't need to convert the DataFrame into a numpy array with .values.
import pandas as pd
data = pd.read_table('/Users/Hrihaan/Desktop/A.txt', dtype=float, header=None, sep='\s+') # note: no .values
idx = data[0] <= 5
Diff = data.loc[idx, 1] - data.loc[idx, 2]
print(Diff)
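On the example file this should print something like the following (floats because of dtype=float):
0   -1.0
1   -1.0
2    0.0
3   -3.0
4    1.0
dtype: float64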
I have the following dataframe called condition:
[0] [1] [2] [3]
1 0 0 1 0
2 0 1 0 0
3 0 0 0 1
4 0 0 0 1
For easier reproduction:
import numpy as np
import pandas as pd
n=4
t=3
condition = pd.DataFrame([[0,0,1,0], [0,1,0,0], [0,0,0, 1], [0,0,0, 1]], columns=['0','1', '2', '3'])
condition.index=np.arange(1,n+1)
Furthermore, I have several dataframes that should be filled in a for loop:
df = pd.DataFrame([],index = range(1,n+1),columns= range(t+1) ) #NaN DataFrame
df_2 = pd.DataFrame([],index = range(1,n+1),columns= range(t+1) )
df_3 = pd.DataFrame(3,index = range(1,n+1),columns= range(t+1) )
for i,t in range(t,-1,-1):
    if condition[t]==1:
        df.loc[:,t] = df_3.loc[:,t]**2
        df_2.loc[:,t]=0
    elif (condition == 0 and no 1 in any column after t)
        df.loc[:,t] = 2.5
        ....
    else:
        df.loc[:,t] = 5
        df_2.loc[:,t]= df.loc[:,t+1]
I am aware that this for loop is not correct, but what I wanted to do is check the condition elementwise (recursively), and if it is 1 (in condition), fill dataframe df with the squared values of df_3. If it is 0 in condition, I should distinguish two cases.
In the first case, no 1 comes after the 0 in that row (rows 1 and 2 in condition), and then df = 2.5.
In the second case, a 1 does come later in the row, and df is filled with 5 (rows 3 and 4).
So the dataframe df should look something like this
[0] [1] [2] [3]
1 5 5 9 2.5
2 5 9 2.5 2.5
3 5 5 5 9
4 5 5 5 9
The code should include a for loop.
Thanks!
I am not sure if this is what you want, but based on your desired output you can do this with masking operations alone (which is more efficient than looping over the rows anyway). Your code could look like this:
is_one = condition.astype(bool)
is_after_one = (condition.cumsum(axis=1) - condition).astype(bool)
df = pd.DataFrame(5, index=condition.index, columns=condition.columns)
df_2 = pd.DataFrame(2.5, index=condition.index, columns=condition.columns)
df_3 = pd.DataFrame(3, index=condition.index, columns=condition.columns)
df.where(~is_one, other=df_3 * df_3, inplace=True)
df.where(~is_after_one, other=df_2, inplace=True)
which yields:
0 1 2 3
1 5 5 9.0 2.5
2 5 9 2.5 2.5
3 5 5 5.0 9.0
4 5 5 5.0 9.0
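To see why the second mask works: cumsum along each row counts the 1s seen so far, and subtracting condition itself leaves the mask True only strictly after a 1. A quick check with the condition frame from the question:
print((condition.cumsum(axis=1) - condition).astype(bool))
#        0      1      2      3
# 1  False  False  False   True
# 2  False  False   True   True
# 3  False  False  False  False
# 4  False  False  False  False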
EDIT after comment:
If you really want to loop explicitly over the rows and columns, you could do it like this with the same result:
n_rows = condition.index.size
n_cols = condition.columns.size
for row_index in range(n_rows):
    for col_index in range(n_cols):
        cond = condition.iloc[row_index, col_index]
        if col_index < n_cols - 1:
            rest_row = condition.iloc[row_index, col_index + 1:].to_list()
        else:
            rest_row = []
        if cond == 1:
            df.iloc[row_index, col_index] = df_3.iloc[row_index, col_index] ** 2
        elif cond == 0 and 1 not in rest_row:
            # fill the whole rest of the row at once
            df.iloc[row_index, col_index:] = 2.5
            # stop iterating over the rest of this row
            break
        else:
            df.iloc[row_index, col_index] = 5
            # iloc here: the column labels are strings, so an integer .loc would create a new column
            df_2.iloc[:, col_index] = df.iloc[:, col_index + 1]
The result is the same, but this is much less efficient and uglier, so I would not recommend it.
I have a dataframe like the following.
i  j  element
0  0  1
0  1  2
0  2  3
1  0  4
1  1  5
1  2  6
2  0  7
2  1  8
2  2  9
How can I convert it to the 3*3 array below?
1  2  3
4  5  6
7  8  9
Assuming that the dataframe is called df, one can use pandas.DataFrame.pivot with .to_numpy() (recommended) or .values as follows:
array = df.pivot(index='i', columns='j', values='element').to_numpy()
# or
array = df.pivot(index='i', columns='j', values='element').values
[Out]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]], dtype=int64)
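If every (i, j) pair is present exactly once, a sort-and-reshape sketch works as well (assuming the 3x3 shape from the question):
arr = df.sort_values(['i', 'j'])['element'].to_numpy().reshape(3, 3)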
If you transform your dataframe into three lists, where the first contains the "i" values, the second the "j" values, and the third the data, you can create the NumPy array "manually":
import numpy as np

i, j, v = zip(*df.itertuples(index=False, name=None))
# size the array from the indices themselves; df.shape would give 9x3 here
arr = np.zeros((max(i) + 1, max(j) + 1))
arr[i, j] = v
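A quick check with the nine-row frame from the question (np.zeros yields floats unless a dtype is passed):
print(arr)
# [[1. 2. 3.]
#  [4. 5. 6.]
#  [7. 8. 9.]]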
I tried to make a kind of running average: out of 90 rows, every 3 rows in column A should produce an average that is written to those same rows in column B.
For example:
From this:
df =  A  B
      2  0
      3  0
      4  0
      7  0
      9  0
      8  0
to this:
df =  A  B
      2  3
      3  3
      4  3
      7  8
      9  8
      8  8
I tried running this code:
x = 0
for i in df['A']:
    if x < 90:
        y = (df['A'][x] + df['A'][x + 1] + df['A'][x + 2]) / 3
        df['B'][x] = y
        df['B'][x + 1] = y
        df['B'][x + 2] = y
        x = x + 3
        print(y)
It does print the correct y, but it does not change B.
I know there is a better way to do it, and if anyone knows one, it would be great if they shared it. But the more important thing for me is to understand why what I wrote has no effect on the df.
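As for why the loop prints y but never changes B: df['B'][x] = y is chained indexing. df['B'] hands back a Series, and since y is a float while B holds integers, the item assignment forces that Series to be upcast to float, allocating a new array that is no longer connected to df (pandas usually flags this with a SettingWithCopyWarning). A minimal sketch of the same loop written with a single .loc call, assuming the default 0..n-1 integer index:
df['B'] = df['B'].astype(float)  # let B hold the float averages up front
x = 0
while x + 2 < len(df):
    y = (df['A'][x] + df['A'][x + 1] + df['A'][x + 2]) / 3
    df.loc[x:x + 2, 'B'] = y  # one indexing operation, so the write lands in df itself
    x += 3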
You could group by the index divided by 3, then use transform to compute the mean of those values and assign to B:
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8], 'B': [0, 0, 0, 0, 0, 0]})
df['B'] = df.groupby(df.index // 3)['A'].transform('mean')
Output:
A B
0 2 3
1 3 3
2 4 3
3 7 8
4 9 8
5 8 8
Note that this relies on the index being of the form 0,1,2,3,4,.... If that is not the case, you could either reset the index (df.reset_index(drop=True)) or use np.arange(df.shape[0]) instead i.e.
df['B'] = df.groupby(np.arange(df.shape[0]) // 3)['A'].transform('mean')
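For example, the np.arange variant is indifferent to the index (a hypothetical frame just to illustrate; transform('mean') returns floats):
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8]}, index=list('abcdef'))
df['B'] = df.groupby(np.arange(df.shape[0]) // 3)['A'].transform('mean')
print(df)
#    A    B
# a  2  3.0
# b  3  3.0
# c  4  3.0
# d  7  8.0
# e  9  8.0
# f  8  8.0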
i = 0
batch_size = 3
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8, 9, 10], 'B': [-1] * 8})
while i < len(df):
    j = min(i + batch_size - 1, len(df) - 1)
    avg = sum(df.loc[i:j, 'A']) / (j - i + 1)
    df.loc[i:j, 'B'] = [avg] * (j - i + 1)
    i += batch_size
df
Corner case: when len(df) % batch_size != 0, this assumes we take the average of the leftover rows.
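For the eight-row frame above, the leftover batch is rows 6 and 7, so B should end with their two-row average (a sketch of the expected result, dtype details aside):
    A    B
0   2  3.0
1   3  3.0
2   4  3.0
3   7  8.0
4   9  8.0
5   8  8.0
6   9  9.5
7  10  9.5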
I have a DataFrame as below:
len scores
5 [0.45814112124905954, 0.34974337172257086, 0.042586941883761324, 0.042586941883761324, 0.33509446692807404, 0.01202741856859997, 0.01202741856859997, 0.031149023579740857, 0.031149023579740857, 0.9382029832667171]
4 [0.1289882974831455, 0.17069367229950574, 0.03518847270370917, 0.3283517918439753, 0.41119171582425107, 0.5057528742869354]
3 [0.22345885572316307, 0.1366147609256035, 0.09309687010700848]
2 [0.4049920770888036]
I want to split each row's scores list into chunks of decreasing length, based on the len column value, and get multiple rows:
len scores
5 [0.45814112124905954, 0.34974337172257086, 0.042586941883761324, 0.042586941883761324]
5 [0.33509446692807404, 0.01202741856859997, 0.01202741856859997]
5 [0.031149023579740857, 0.031149023579740857]
5 [0.9382029832667171]
5
4 [0.1289882974831455, 0.17069367229950574, 0.03518847270370917]
4 [0.3283517918439753, 0.41119171582425107]
4 [0.5057528742869354]
4
3 [0.22345885572316307, 0.1366147609256035]
3 [0.09309687010700848]
3
2 [0.4049920770888036]
2
I tried this
d = []
for x in df['len']:
    col = df['scores'][:(x - 1)]
    d.append(col)
but this just gives me the first chunk of each row:
len scores
5 [0.45814112124905954, 0.34974337172257086, 0.042586941883761324, 0.042586941883761324]
4 [0.1289882974831455, 0.17069367229950574, 0.03518847270370917]
3 [0.22345885572316307, 0.1366147609256035]
2 [0.4049920770888036]
How do I get the rest of the rows, as per my requirement?
Assuming that the column len matches the length of the list in the column scores row-wise, as in your example, you can use apply to reshape each list into a nested list of chunks with decreasing length, and then explode:
import numpy as np

# define function to create the nested list of chunks
def create_nested_list(x):
    l_idx = [0] + np.cumsum(np.arange(x['len'])[::-1]).tolist()
    return [x['scores'][i:j] for i, j in zip(l_idx[:-1], l_idx[1:])]

# apply row-wise
s = df.apply(create_nested_list, axis=1)
# change the index to keep the value in len
s.index = df['len']
# explode and reset_index
df_f = s.explode().reset_index(name='scores')
print(df_f)
len scores
0 5 [0.45814112124905954, 0.34974337172257086, 0.0...
1 5 [0.33509446692807404, 0.01202741856859997, 0.0...
2 5 [0.031149023579740857, 0.031149023579740857]
3 5 [0.9382029832667171]
4 5 []
5 4 [0.1289882974831455, 0.17069367229950574, 0.03...
6 4 [0.3283517918439753, 0.41119171582425107]
7 4 [0.5057528742869354]
8 4 []
9 3 [0.22345885572316307, 0.1366147609256035]
10 3 [0.09309687010700848]
11 3 []
12 2 [0.4049920770888036]
13 2 []
EDIT: if you can't use explode, try like this:
# define function to create a Series from the nested lists
def create_nested_list_s(x):
    l_idx = [0] + np.cumsum(np.arange(x['len'])[::-1]).tolist()
    return pd.Series([x['scores'][i:j] for i, j in zip(l_idx[:-1], l_idx[1:])])

df_f = (df.apply(create_nested_list_s, axis=1)
          .set_index(df['len'])
          .stack()
          .reset_index(name='scores')
          .drop('level_1', axis=1))
print(df_f)
df.explode() does the heavy lifting here: once each row's list has been split into the sub-lists you want, explode turns them into separate rows.
Example:
import pandas as pd
df = pd.DataFrame({'A': [[1, 2, 3], 'foo', [], [3, 4]], 'B': 1})
df.explode('A')
#Output
# A B
# 0 1 1
# 0 2 1
# 0 3 1
# 1 foo 1
# 2 NaN 1
# 3 3 1
# 3 4 1
I'd like to create a new dataframe using the same values from another dataframe, unless there is a 0 value. If there is a 0 value, I'd like to find the average of the entry before and after.
For Example:
df =  A  B  C
      5  2  1
      3  4  5
      2  1  0
      6  8  7
I'd like the result to look like the df below:
df_new =  A  B  C
          5  2  1
          3  4  5
          2  1  6
          6  8  7
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [5, 3, 2, 6], 'B': [2, 4, 1, 8], 'C': [1, 5, 0, 7]})
Nrows = len(df)

def run(col):
    originalValues = list(df[col])
    # positions where this column is 0
    values = list(np.where(np.array(originalValues) == 0)[0])
    # only replace zeros that have both a previous and a next entry
    indices2replace = filter(lambda x: 0 < x < Nrows - 1, values)
    for index in indices2replace:
        originalValues[index] = 0.5 * (originalValues[index + 1] + originalValues[index - 1])
    return originalValues

# build the new frame column by column, keeping the original column names
newDF = pd.DataFrame({col: run(col) for col in df.columns})
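For the example frame, the single 0 in C should come out as the mean of its neighbours (a sketch of the expected result; C becomes float because of the averaged value):
print(newDF)
#    A  B    C
# 0  5  2  1.0
# 1  3  4  5.0
# 2  2  1  6.0
# 3  6  8  7.0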