I have a df with a column containing floats (transaction values).
I would like to iterate through the column and only print the value if it is not NaN.
Right now I have the following condition:
if j > 0:
    print(j)
    i += 1
else:
    i += 1
where i is my iteration number.
I do this because I know that in my dataset there are no negative values, so that is my workaround, but I would like to know how it would be done correctly if I did have negative values.
So what would the if condition be?
I have tried j != None
and j != np.nan, but it still prints all the NaNs.
Why not use built-in pandas functionality?
Given some dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'a': [-1, -2, 0, np.nan, 5, 6, np.nan]
})
You can filter out all NaNs:
df[df['a'].notna()]
>>> a
0 -1.0
1 -2.0
2 0.0
4 5.0
5 6.0
or only positive numbers:
df[df['a'] > 0]
>>> a
4 5.0
5 6.0
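If you really do need to test value by value inside a loop: NaN compares unequal to everything, including itself, which is why j != np.nan never filters anything out. A minimal sketch using pd.notna with the df above:
for j in df['a']:
    if pd.notna(j):  # `j != np.nan` is always True, so test with pd.notna instead
        print(j)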
Suppose the datatype is int64 for the filled values in the column. Note that a type check such as type(array[i]) != int is unreliable here, because a numeric pandas column coerces the integers to float; checking for NaN directly works for any dtype (assuming array holds the column's values):
for i in range(len(array)):
    if not pd.isna(array[i]):
        print(array[i])
So I am trying to forward fill a column with the limit being the value in another column. This is the code I run and I get this error message.
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['NM'] = [0, 0, 1, np.nan, np.nan, np.nan, 0]
df['length'] = [0, 0, 2, 0, 0, 0, 0]
print(df)
NM length
0 0.0 0
1 0.0 0
2 1.0 2
3 NaN 0
4 NaN 0
5 NaN 0
6 0.0 0
df['NM'] = df['NM'].fillna(method='ffill', limit=df['length'])
print(df)
ValueError: Limit must be an integer
The dataframe I want looks like this:
NM length
0 0.0 0
1 0.0 0
2 1.0 2
3 1.0 0
4 1.0 0
5 NaN 0
6 0.0 0
Thanks in advance for any help you can provide!
I do not think you want to use ffill in this instance.
Rather, I would recommend filtering to the rows where length is greater than 0, then iterating through those rows to write that row's NM value into the next length rows.
for row in df.loc[df.length.gt(0)].reset_index().to_dict(orient='records'):
    df.loc[row['index']+1:row['index']+row['length'], 'NM'] = row['NM']
To better break this down:
Get the rows containing change information, being sure to include the index:
df.loc[df.length.gt(0)].reset_index().to_dict(orient='records')
Iterate through them; I prefer to_dict for performance reasons on large datasets. It is a habit.
Each pass sets the next length rows of NM to the NM value of the row with the defined length.
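Applied to the example frame above, this fills the two rows after the 1.0 and leaves the third NaN untouched, which is exactly the desired output:
print(df)
NM length
0 0.0 0
1 0.0 0
2 1.0 2
3 1.0 0
4 1.0 0
5 NaN 0
6 0.0 0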
You can first group the dataframe by the length column before filling. The only issue is that for the first group in your example the limit would be 0, which causes an error, so we make sure it is at least 1 with max. This might cause unexpected results if there are NaN values before the first non-zero value in length, but from the given data it's not clear whether that can happen.
# make groups
m = df.length.gt(0).cumsum()
# fill the column
df["NM"] = df.groupby(m).apply(
    lambda f: f.NM.fillna(
        method="ffill",
        limit=max(f.length.iloc[0], 1))
).values
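For the example data the grouping key m comes out as below, so the run of NaNs after the 1.0 falls into a single group that is forward filled with limit=2, reproducing the desired NM column:
print(m.tolist())
[0, 0, 1, 1, 1, 1, 1]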
I'm trying to figure out the average of increasing values in my table per column.
my table
A | B | C
----------------
0 | 5 | 10
100 | 2 | 20
50 | 2 | 30
100 | 0 | 40
function I'm trying to write for my problem:
def avergeIncreace(data, value):  # not complete, but what I have so far
    x = data[value].pct_change().fillna(0).gt(0)
    print(x)
pct_change() returns a table of the percentage change between the number at each index and the number in the row before it.
fillna(0) replaces the NaN in position 0 of the table that pct_change() creates with 0.
gt(0) returns a True/False table depending on whether the value at each index is greater than 0.
current output of this function
In[1]:avergeIncreace(df,'A')
Out[1]: 0 False
1 True
2 False
3 True
Name: BAL, dtype: bool
desired output
In[1]:avergeIncreace(df,'A')
Out[1]:75
In[2]:avergeIncreace(df,'B')
Out[2]:0
In[3]:avergeIncreace(df,'C')
Out[3]:10
From my limited understanding of pandas, there should be a way to return an array of all the indexes that are True and then use a for loop to go through the original data table, but I believe pandas should have a way to do this without a for loop.
Here is what I think the for-loop way would look like, with the missing code filled in so the indexes returned are only the ones that are True:
def avergeIncreace(data, value):
    mask = data[value].pct_change().fillna(0).gt(0)
    # keep only the indexes where the value increased
    indexes = mask[mask].index.values
    answer = 0
    times = 0
    for x in indexes:
        answer += (data[value][x] - data[value][x-1])
        times += 1
    print(answer / times)
How do I achieve my desired output without using a for loop in the function?
You can use mask() and diff():
df.diff().mask(df.diff()<=0, np.nan).mean().fillna(0)
Yields:
A 75.0
B 0.0
C 10.0
dtype: float64
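To break that one-liner down: df.diff() computes row-over-row differences, mask() replaces the non-positive ones with NaN so that mean() skips them, and fillna(0) handles columns with no increases at all. Note the diff is computed twice; a minor variant (my tweak, not the original answer) stores it first:
d = df.diff()
d.mask(d <= 0).mean().fillna(0)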
How about
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [0, 100, 50, 100],
                   'B': [5, 2, 2, 0],
                   'C': [10, 20, 30, 40]})

def averageIncrease(df, col_name):
    # Create array of deltas; replace nan and negative values with zero
    a = np.maximum(df[col_name] - df[col_name].shift(), 0).replace(np.nan, 0)
    # Count non-zero values
    count = np.count_nonzero(a)
    if count == 0:
        # If there are only zero values, there is no increase
        return 0
    else:
        return np.sum(a) / count
print(averageIncrease(df, 'A'))
print(averageIncrease(df, 'B'))
print(averageIncrease(df, 'C'))
75.0
0
10.0
I have a list of pandas Series objects obj and a Series of indices idx. What I want is a new Series out that, for each row in idx, contains the value obj[idx[row]][row] if idx[row] is not 255, and -1 otherwise.
The following code does what I want to achieve, but I'd like to know if there's a better way of doing this, especially without the overhead of first creating a Python list and then converting that into a pandas series.
>>> import pandas as pd
>>> obj = [pd.Series([1, 2, 3]), pd.Series([4, 5, 6]), pd.Series([7, 8, 9])]
>>> idx = pd.Series([0, 255, 2])
>>> out = pd.Series([obj[idx[row]][row] if idx[row] != 255 else -1 for row in range(len(idx))])
>>> out
0 1
1 -1
2 9
dtype: int64
>>>
Thanks in advance.
Using reindex + lookup:
pd.Series(pd.concat(obj, axis=1).reindex(idx).lookup(idx, idx.index)).fillna(-1)
Out[822]:
0 1.0
1 -1.0
2 9.0
dtype: float64
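Note that DataFrame.lookup was deprecated and later removed in modern pandas. A NumPy-based sketch that does the same thing without it, assuming all the Series share the same length and a default integer index (the intermediate names here are mine, not from the original answer); as a bonus it keeps the integer dtype:
import numpy as np
import pandas as pd

obj = [pd.Series([1, 2, 3]), pd.Series([4, 5, 6]), pd.Series([7, 8, 9])]
idx = pd.Series([0, 255, 2])

mat = np.stack([s.to_numpy() for s in obj])  # mat[i, row] == obj[i][row]
rows = np.arange(len(idx))
valid = idx.to_numpy() != 255
safe_idx = np.where(valid, idx, 0)           # clamp 255 so indexing stays in bounds
out = pd.Series(np.where(valid, mat[safe_idx, rows], -1))
print(out)
0    1
1   -1
2    9
dtype: int64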
I have a pandas Series:
0 1
1 5
2 20
3 -1
Let's say I want to apply mean() on every two elements, so I get something like this:
0 3.0
1 9.5
Is there an elegant way to do this?
You can use groupby with the index floor-divided by k=2:
k = 2
print (s.index // k)
Int64Index([0, 0, 1, 1], dtype='int64')
print (s.groupby([s.index // k]).mean())
name
0 3.0
1 9.5
You can do this:
(s.iloc[::2].values + s.iloc[1::2])/2
If you want, you can also reset the index afterwards, so you have 0, 1 as the index:
((s.iloc[::2].values + s.iloc[1::2])/2).reset_index(drop=True)
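Note this pairwise-slice approach assumes the Series has an even number of elements; for an odd length the two slices have different sizes and the addition fails. One hedge (my addition, not part of the original answer) is to trim to an even length first:
n = len(s) - len(s) % 2  # largest even length
(s.iloc[:n:2].values + s.iloc[1:n:2]) / 2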
If you are using this over large series and many times, you'll want to consider a fast approach. This solution uses all numpy functions and will be fast.
Use reshape and construct new pd.Series
consider the pd.Series s
s = pd.Series([1, 5, 20, -1])
generalized function
def mean_k(s, k):
    pad = (k - s.shape[0] % k) % k
    nan = np.repeat(np.nan, pad)
    val = np.concatenate([s.values, nan])
    return pd.Series(np.nanmean(val.reshape(-1, k), axis=1))
demonstration
mean_k(s, 2)
0 3.0
1 9.5
dtype: float64
mean_k(s, 3)
0 8.666667
1 -1.000000
dtype: float64
I have a pandas DataFrame like the following.
df = pandas.DataFrame(np.random.randn(5,5),columns=['1','2','3','4','5'])
1 2 3 4 5
0 0.877455 -1.215212 -0.453038 -1.825135 0.440646
1 1.640132 -0.031353 1.159319 -0.615796 0.763137
2 0.132355 -0.762932 -0.909496 -1.012265 -0.695623
3 -0.257547 -0.844019 0.143689 -2.079521 0.796985
4 2.536062 -0.730392 1.830385 0.694539 -0.654924
I need to get the row and column indexes for the following three groups. (In my original dataset there are no negative values.)
value is greater than 2.0
value is between 1.0 and 2.0
value is less than 1.0
For example, for "value is greater than 2.0" it should return [1,4]. I have tried using this, which gives a boolean result:
df.values > 2
You can use np.where on the boolean result to extract the indices:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5,5),columns=['1','2','3','4','5'])
condition = df.values > 2
print(np.column_stack(np.where(condition)))
For a df like this,
1 2 3 4 5
0 0.057347 0.722251 0.263292 -0.168865 -0.111831
1 -0.765375 1.040659 0.272883 -0.834273 -0.126997
2 -0.023589 0.046002 1.206445 0.381532 -1.219399
3 2.290187 2.362249 -0.748805 -1.217048 -0.973749
4 0.100084 0.671120 -0.211070 0.903264 -0.312815
Output:
[[3 0]
[3 1]]
Or get a list of row-column index pairs if necessary:
print(list(map(list, np.column_stack(np.where(condition)))))
Output:
[[3,0], [3,1]]
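The other two groups follow the same pattern; for instance, a sketch for the "between 1.0 and 2.0" case (whether the bounds are inclusive is my assumption, adjust to taste):
condition = (df.values >= 1.0) & (df.values <= 2.0)
print(np.column_stack(np.where(condition)))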