Given a dataframe df, I would like to generate a new variable/column for each row based on the values in the previous row. df is sorted so that the order of the rows is meaningful.
Normally, we can use either map or apply, but it seems that neither of them allows access to the values in the previous row.
For example, given existing rows a b c, I want to generate a new column d, which is based on some calculation using the value of c in the previous row.
How should I do it in pandas?
If you just want to do a calculation based on the previous row, you can calculate and then shift:
In [2]: df = pd.DataFrame({'a':[0,1,2], 'b':[0,10,20]})
In [3]: df
Out[3]:
   a   b
0  0   0
1  1  10
2  2  20
# a calculation based on other column
In [4]: df['c'] = df['b'] + 1
# shift the column
In [5]: df['c'] = df['c'].shift()
In [6]: df
Out[6]:
   a   b     c
0  0   0   NaN
1  1  10   1.0
2  2  20  11.0
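Equivalently, you can compute and shift in one step: df['c'] = (df['b'] + 1).shift().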
If you want to do a calculation based on multiple rows, you could look at rolling window calculations (http://pandas.pydata.org/pandas-docs/stable/computation.html#moving-rolling-statistics-moments and http://pandas.pydata.org/pandas-docs/stable/generated/pandas.rolling_apply.html#pandas.rolling_apply). Note that the top-level pandas.rolling_apply function referenced there has since been deprecated and removed; in modern pandas the same thing is spelled as the .rolling(...).apply(...) method.
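For example, a minimal sketch of the modern rolling API on the same toy frame, summing each value of 'b' with the one in the previous row:

import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2], 'b': [0, 10, 20]})

# window of size 2: each result uses the current row and the previous one
df['b_roll'] = df['b'].rolling(window=2).apply(sum)
print(df)
#    a   b  b_roll
# 0  0   0     NaN
# 1  1  10    10.0
# 2  2  20    30.0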
You can use the dataframe 'apply' function together with a decorator whose closure stores the previous row between calls.
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2], 'b': [0, 10, 20]})
new_col = 'c'

def apply_func_decorator(func):
    prev_row = {}
    def wrapper(curr_row, **kwargs):
        val = func(curr_row, prev_row)
        prev_row.update(curr_row)
        prev_row[new_col] = val
        return val
    return wrapper

@apply_func_decorator
def running_total(curr_row, prev_row):
    return curr_row['a'] + curr_row['b'] + prev_row.get('c', 0)

df[new_col] = df.apply(running_total, axis=1)
print(df)
# Output will be:
#    a   b   c
# 0  0   0   0
# 1  1  10  11
# 2  2  20  33
This example uses a decorator to store the previous row in a dictionary and then passes it to the function when pandas calls it on the next row.
Disclaimer 1: The 'prev_row' variable starts off empty for the first row so when using it in the apply function I had to supply a default value to avoid a 'KeyError'.
Disclaimer 2: I am fairly certain this will be slower than a vectorized operation, but I did not run any tests to figure out by how much.
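For this particular running total, the same result can also be obtained with a fully vectorized cumulative sum, since c[i] = a[i] + b[i] + c[i-1] is just the cumulative sum of (a + b):

import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2], 'b': [0, 10, 20]})

# running total: each row adds its (a + b) to everything accumulated so far
df['c'] = (df['a'] + df['b']).cumsum()
print(df)
#    a   b   c
# 0  0   0   0
# 1  1  10  11
# 2  2  20  33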
Related
I am applying a function on a dataframe df and that function returns a dataframe int_df, but the result is getting stored as a series.
df:
   limit
0      4

new_df:
   A       B
0  0  Number
1  1  Number
2  2  Number
3  3  Number
This is pseudocode of what I have done:
def foo(x):
    limit = x['limit']
    int_df = pd.DataFrame(columns=['A', 'B'])  # create an empty dataframe
    # append a new row to the dataframe for each value below the limit
    for i in range(0, limit):
        int_df.loc[len(int_df.index)] = [i, 'Number']
    return int_df  # this is a dataframe
new_df = df.apply(foo, axis=1)
new_df # This is a series but I need a dataframe
Is this the right way to do this?
IIUC, here's one way:
df = df.limit.apply(range).explode().to_frame('A').assign(B='Number').reset_index(drop=True)
OUTPUT:
   A       B
0  0  Number
1  1  Number
2  2  Number
3  3  Number
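Alternatively, if you want to keep a row-wise function like foo, one option (a sketch, restating foo more compactly) is to concatenate the DataFrames that apply returns:

import pandas as pd

df = pd.DataFrame({'limit': [4]})

def foo(x):
    # build one small dataframe per row; the scalar 'Number' is broadcast
    return pd.DataFrame({'A': range(x['limit']), 'B': 'Number'})

# df.apply(foo, axis=1) yields a Series whose elements are DataFrames;
# pd.concat stitches them back together into a single DataFrame
new_df = pd.concat(df.apply(foo, axis=1).tolist(), ignore_index=True)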
I want to filter a pandas dataframe by a function along the index. I can't seem to find a built-in way of performing this action.
So essentially, I have a function that through some arbitrarily complicated means determines whether a particular index should be included; I'll call it filter_func for this example. I wish to apply exactly what the below code does, but to the index:
new_index = filter(filter_func, df.index)
And only include the values that the filter_func allows. The index could also be any type.
This is a pretty fundamental data-manipulation operation, so I imagine there's a built-in way of doing it.
ETA:
I found that indexing the dataframe by a list of booleans will do what I want, but it still requires double the space of the index in order to apply the filter. So my question still remains: is there a built-in way of doing this that does not require twice the space?
Here's an example:
import pandas as pd
df = pd.DataFrame({"value":[12,34,2,23,6,23,7,2,35,657,1,324]})
def filter_func(ind, n=0):
    if n > 200: return False
    if ind % 79 == 0: return True
    return filter_func(ind + ind - 1, n + 1)

new_index = filter(filter_func, df.index)
And I want to do this:
mask = []
for i in df.index:
    mask.append(filter_func(i))
df = df[mask]
But in a way that doesn't take twice the space of the index to do so
You can use map instead of filter and then do boolean indexing:
df.loc[map(filter_func,df.index)]
   value
0     12
4      6
7      2
8     35
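A closely related option that stays inside pandas is Index.map, which also yields a boolean mask you can index with directly:

df[df.index.map(filter_func)]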
Have you tried using df.apply?
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])
>>> df
   a  b  c
0  0  1  2
1  3  4  5
2  6  7  8
>>> df[df.apply(lambda x: x['c'] % 2 == 0, axis=1)]
   a  b  c
0  0  1  2
2  6  7  8
You can customize the lambda function in any way you want; let me know if this isn't what you're looking for.
If you want to avoid referencing df explicitly inside the filtering condition, you can use the following:
import pandas as pd
df = pd.DataFrame({"value":[12,34,2,23,6,23,7,2,35,657,1,324]}, dtype=object)
df.apply(lambda x: x if filter_func(x.name) else None, axis=1, result_type='broadcast').dropna()
Imagine that I have a Dataframe with columns [A, B, C], each holding various values. I want to produce one more column D, which can be computed with the following function:
def produce_column(i):
    # extract the current row by index
    raw = df.loc[i]
    # extract the previous 3 values (up to and including i) from the sub-df
    # that has the same A and B values as this row
    df_same = df[
        (df['A'] == raw.A)
        & (df['B'] == raw.B)
    ].loc[:i].tail(3)
    # check that we have enough values
    if df_same.shape[0] != 3:
        return False
    # it doesn't matter which function is in use; I just need to apply it
    # on the column / columns
    diffs = df_same['C'].map(lambda x: x <= 10 and x > 0)
    return all(diffs)
df['D'] = df.index.map(lambda x: produce_column(x))
So at each step, I need to get the sub-Dataframe that has the same set of properties as the current row and perform some operations on its columns. I have a few hundred thousand rows, so this code takes a lot of time to execute. I think a good idea would be to vectorize the operation, but I don't know how to do that. Maybe there's another way to perform this?
Thanks in advance!
UPD Here's an example
df = pd.DataFrame([(1,2,3), (4,5,6), (7,8,9)], columns=['A','B','C'])
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
df['D'] = df.index.map(lambda x: produce_column(x))
   A  B  C      D
0  1  2  3   True
1  4  5  6   True
2  7  8  9  False
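One possible vectorized rewrite (a sketch, assuming the groupby(...).rolling(...) API of recent pandas): mark which rows satisfy the condition on C, then take a rolling window of 3 within each (A, B) group, so a row gets True only when it and the two preceding rows of its group all pass. Note that on the toy frame above every (A, B) pair is unique, so each group has fewer than 3 rows and every row comes out False.

import pandas as pd

df = pd.DataFrame([(1, 2, 3), (4, 5, 6), (7, 8, 9)], columns=['A', 'B', 'C'])

# per-row condition: 0 < C <= 10
ok = (df['C'].gt(0) & df['C'].le(10)).astype(int)

# within each (A, B) group, sum the condition over the last 3 rows;
# the sum equals 3 only when all 3 of those rows pass
roll = (
    ok.groupby([df['A'], df['B']])
      .rolling(3)
      .sum()
      .reset_index(level=[0, 1], drop=True)  # drop the group keys from the index
)

# groups with fewer than 3 rows produce NaN, and NaN != 3 gives False
df['D'] = roll.eq(3)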
I have a large dataset with columns labelled 1 - 65 (among other titled columns), and I want to find how many of those columns, per row, have a string (of any value) in them. For example, if all of columns 1 - 65 are filled in a row, the count for that row should be 65; if only 10 are filled, the count should be 10.
Is there any easy way to do this? I'm currently using the following code, which is taking very long as there are a large number of rows.
array = pd.read_csv(csvlocation, encoding="ISO-8859-1")
for i in range(0, lengthofarray):
    for k in range(1, 66):
        if array[k][i] != "":
            array["count"][i] = array["count"][i] + 1
From my understanding of the post and the subsequent comments, you are interested in the number of string values in each row for the columns labelled 1 through 65. There are two steps: first, subset your data down to columns 1 through 65; second, count the number of strings in each row. To do this:
import pandas as pd
import numpy as np

# create sample data
df = pd.DataFrame({'col1': list('abdecde'),
                   'col2': np.random.rand(7)})

# change one value of column two to a string for illustration purposes
df.loc[3, 'col2'] = 'b'

# to create the subset of columns, you could use
# subset = [str(num) for num in list(range(1, 66))]
# and then just use df[subset]

# for each row, count the number of columns that have a string value.
# applymap operates elementwise, so we are essentially creating
# a new representation of your data in place, where a 1 represents a
# string value and a 0 represents not a string.
# we then sum along the rows to get the final counts
col_str_counts = np.sum(df.applymap(lambda x: 1 if isinstance(x, str) else 0), axis=1)

# we changed the column two value above, so to check that the count is 2 for that row idx:
col_str_counts[3]
>>> 2

# and for the subset, it would simply become:
# col_str_counts = np.sum(df[subset].applymap(lambda x: 1 if isinstance(x, str) else 0), axis=1)
You should be able to adapt your problem to this example
Say we have this dataframe
df = pd.DataFrame([["","foo","bar"],["","","bar"],["","",""],["foo","bar","bar"]])
     0    1    2
0       foo  bar
1            bar
2
3  foo  bar  bar
Then we create a boolean mask where a cell != "" and sum those values
df['count'] = (df != "").sum(1)
print(df)
0 1 2 count
0 foo bar 2
1 bar 1
2 0
3 foo bar bar 3
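One caveat: pd.read_csv turns empty fields into NaN by default rather than "", in which case the != "" comparison won't catch them. Counting non-missing cells would then look like this instead:

df['count'] = df.notna().sum(axis=1)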
import pandas

df = pandas.DataFrame([["","foo","bar"],["","","bar"],["","",""],["foo","bar","bar"]])
total_cells = df.size

df['filled_cell_count'] = (df != "").sum(axis=1)
print(f"{df}")

     0    1    2  filled_cell_count
0       foo  bar                  2
1            bar                  1
2                                 0
3  foo  bar  bar                  3

filled_fraction = df['filled_cell_count'].sum() / total_cells
print()
print(f"Fraction of filled cells in dataframe: {filled_fraction}")

Fraction of filled cells in dataframe: 0.5
For a Dataframe, how do I write a conditional statement that assigns a new value based on the preexisting value in a column?
If the value in the column is a string of len > 0, assign 0.
If the value in the column is None (NoneType), assign 1.
I am trying to get a counter to check how many of the rows are missing a value, based on string length.
I can convert the series to a list and do the test there, but I would like to learn how this is possible within the dataframe itself.
df['old']    df['old'] (after)
String A     0
String B     0
String C     0
None         1
String D     0
String E     0
None         1
# so that I can sum df['old'] (after) to get the counter value
Sum          2
Are you just trying to see how many None values you have?
You could just do this:
import pandas as pd
df = pd.DataFrame(['a', 'b', None, 'q'], columns=['old'])
df['old'].isnull().sum()
Out[37]:
1
If you want to convert the strings to 0 and the None values to 1, as the question describes, you can apply a lambda function:
import pandas as pd
x = pd.DataFrame(['S', 'X', 'Z', None, 'B'])
x[0] = x[0].apply(lambda x: 0 if x else 1)
Then, to count the missing values, you could use sum:
x[0].sum()
For a fast vectorized solution, just use the isnull method and multiply by 1 to convert to an integer.
df = pd.DataFrame({'col' :['a','b',None, None, 'sdaf']})
df['count'] = df.col.isnull() * 1
output:
col count
0 a 0
1 b 0
2 None 1
3 None 1
4 sdaf 0
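The counter the question asks for is then just df['count'].sum(), which gives 2 for this example.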