How to assign a value to a column in Dask data frame - python

How to do the same as the below code for a Dask dataframe.
df['new_column'] = 0
for i in range(len(df)):
    if (condition):
        df.loc[i, 'new_column'] = '1'
    else:
        df.loc[i, 'new_column'] = '0'
I want to add a new column to a Dask dataframe and insert 0/1 into the new column.

In case you do not wish to compute as suggested by Rajnish kumar, you can also use something along the following lines:
import dask.dataframe as dd
import pandas as pd

my_df = [{"a": 1, "b": 2}, {"a": 2, "b": 3}]
df = pd.DataFrame(my_df)
dask_df = dd.from_pandas(df, npartitions=2)
dask_df["c"] = dask_df.apply(lambda x: x["a"] < 2,
                             axis=1,
                             meta=pd.Series(name="c", dtype=bool))  # np.bool was removed from NumPy
dask_df.compute()
Output:
   a  b      c
0  1  2   True
1  2  3  False
The condition (here, a check whether the entry in column "a" is less than 2) is applied on a row-by-row basis. Note that, depending on your condition and its dependencies, it might not be quite this straightforward; in that case, please share additional information on what your condition entails.
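As an aside (not part of the original answer): for a simple elementwise condition like this one, you may not need apply at all, since Dask supports vectorized comparisons directly. A minimal sketch, assuming the same dask_df as above:
# Vectorized sketch: a plain elementwise comparison, evaluated lazily by Dask,
# avoids calling a Python function once per row.
dask_df["c"] = dask_df["a"] < 2
print(dask_df.compute())
This sidesteps the per-row Python call, which usually matters once the dataframe gets large.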

You can't do that directly on a Dask DataFrame. You first need to compute it. Use this; it will work.
df = df.compute()
for i in range(len(df)):
    if (condition):
        df.loc[i, 'new_column'] = '1'
    else:
        df.loc[i, 'new_column'] = '0'
The reason behind this is that a Dask DataFrame is a lazy representation of a dataframe's schema, divided into dask-delayed tasks. Hope this helps you.

I was going through these answers for a similar problem I was facing.
This worked for me.
def extractAndFill(df, datetimeColumnName):
    # Add 4 new columns for weekday, hour, month and year
    df['pickup_date_weekday'] = 0
    df['pickup_date_hour'] = 0
    df['pickup_date_month'] = 0
    df['pickup_date_year'] = 0
    # Iterate through each row and update the values for weekday, hour, month and year
    for index, row in df.iterrows():
        # Get weekday, hour, month and year
        # (extractDateParts is a helper defined elsewhere)
        w, h, m, y = extractDateParts(row[datetimeColumnName])
        # Update the values via df.at: mutating the `row` yielded by
        # iterrows() would not write back to the dataframe
        df.at[index, 'pickup_date_weekday'] = w
        df.at[index, 'pickup_date_hour'] = h
        df.at[index, 'pickup_date_month'] = m
        df.at[index, 'pickup_date_year'] = y
    return df

df1 = df1.compute()  # assign the result; compute() does not work in place
df1 = extractAndFill(df1, 'pickup_datetime')
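A hedged alternative to the loop above: if pickup_datetime is already a datetime column, the .dt accessor derives all four parts vectorized, and it works on both pandas and Dask dataframes, so the compute() could even be skipped. A sketch:
# Sketch: vectorized extraction of the date parts via the .dt accessor.
df1['pickup_date_weekday'] = df1['pickup_datetime'].dt.weekday
df1['pickup_date_hour'] = df1['pickup_datetime'].dt.hour
df1['pickup_date_month'] = df1['pickup_datetime'].dt.month
df1['pickup_date_year'] = df1['pickup_datetime'].dt.year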

Related

Pandas include single row in df after filtering with .loc

So, in this function:
def filter_by_freq(df, frequency):
    filtered_df = df.copy()
    if frequency.upper() == 'DAY':
        pass
    else:
        date_obj = filtered_df['Date'].values[0]
        target_day = pd.to_datetime(date_obj).day
        target_month = pd.to_datetime(date_obj).month
        final_date_obj = filtered_df['Date'].values[-1]
        if frequency.upper() == 'MONTH':
            filtered_df = filtered_df.loc[filtered_df['Date'].dt.day.eq(target_day)]
        elif frequency.upper() == 'YEAR':
            filtered_df = filtered_df.loc[filtered_df['Date'].dt.day.eq(target_day)]
            filtered_df = filtered_df.loc[filtered_df['Date'].dt.month.eq(target_month)]
    return filtered_df
How can I also include in the .loc the very last row from the original df? For month frequency I tried: filtered_df = filtered_df.loc[(filtered_df['Date'].dt.day.eq(target_day)) | (filtered_df['Date'].dt.date.eq(final_date_obj))], but it didn't work.
Thanks for your time!
Here's one way you could do it. In this example I have a df and I want to filter out all rows that have c1 > 0.5, but I want to keep the last row no matter what. I create a boolean series called lte_half to track the first condition, then another boolean series (a list or array would work interchangeably) called end_ind which is True only for the last row. The filtered table keeps every row that passes either condition, combined with the | operator.
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({'c1':np.random.rand(20)})
lte_half = df['c1'].le(0.5)
end_ind = df.index == df.index[-1]
filt_df = df[lte_half | end_ind]
print(filt_df)
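Transferring that idea back to the filter_by_freq function from the question (a sketch, assuming the 'Date' column is already datetime64), the MONTH branch could become:
# Sketch: keep rows matching the target day OR the very last row.
is_last_row = filtered_df.index == filtered_df.index[-1]
filtered_df = filtered_df.loc[filtered_df['Date'].dt.day.eq(target_day) | is_last_row]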

How to insert a new column into a dataframe and access rows with different indices?

I have a dataframe with one column "Numbers" and I want to add a second column "Result". The values should be the sum of the previous two values in the "Numbers" column, otherwise NaN.
import pandas as pd
import numpy as np

data = {
    "Numbers": [100, 200, 400, 0]
}
df = pd.DataFrame(data, index=["whatever1", "whatever2", "whatever3", "whatever4"])

def add_prev_two_elems_to_DF(df):
    numbers = "Numbers"  # alias
    result = "Result"    # alias
    df[result] = np.nan  # empty column
    result_index = list(df.columns).index(result)
    for i in range(len(df)):
        #row = df.iloc[i]
        if i < 2:
            df.iloc[i, result_index] = np.nan
        else:
            df.iloc[i, result_index] = df.iloc[i-1][numbers] + df.iloc[i-2][numbers]

add_prev_two_elems_to_DF(df)
display(df)
The output is:
           Numbers  Result
whatever1      100     NaN
whatever2      200     NaN
whatever3      400   300.0
whatever4        0   600.0
But this looks quite complicated. Can this be done more easily and maybe faster? I am not looking for a solution with sum(). I want a general solution for any kind of function that can fill a column using values from other rows.
Edit 1: I forgot to import numpy.
Edit 2: I changed one line to this:
if i < 2: df.iloc[i,result_index] = np.nan
Looks like you could use rolling.sum together with shift. Since rolling.sum sums up to and including the current row, we have to shift the result down one row so that each row's value is the sum of the previous 2 rows:
df['Result'] = df['Numbers'].rolling(2).sum().shift()
Output:
           Numbers  Result
whatever1      100     NaN
whatever2      200     NaN
whatever3      400   300.0
whatever4        0   600.0
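Since the question asks for a general pattern for filling a column from other rows, a hedged aside: the same result can also be written with plain shift calls, which generalizes to arbitrary combinations of other rows:
# Equivalent sketch using shift only: each row receives the sum of the
# values one and two rows above it; the first two rows stay NaN.
df['Result'] = df['Numbers'].shift(1) + df['Numbers'].shift(2)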
This is the shortest code I could develop. It outputs exactly the same table.
import numpy as np
import pandas as pd
#import swifter  # apply() gets swifter

data = {
    "Numbers": [100, 200, 400, 0]
}
df = pd.DataFrame(data, index=["whatever1", "whatever2", "whatever3", "whatever4"])

def func(a: pd.Series) -> float:  # we expect 3 elements, but we don't check that
    a.reset_index(inplace=True, drop=True)  # the index now starts with 0, 1, ...
    return a[0] + a[1]  # we use the first two elements; the 3rd is unnecessary

df["Result"] = df["Numbers"].rolling(3).apply(func)
#df["Result"] = df["Numbers"].swifter.rolling(3).apply(func)
display(df)

Pandas - Incrementally add to DataFrame

I'm trying to add rows and columns to a pandas DataFrame incrementally. I have a lot of data stored across multiple datastores and a heuristic to determine a value. As I navigate across these datastores, I'd like to be able to incrementally update a dataframe, where in some cases either names or days will be missing.
def foo():
    df = pd.DataFrame()
    year = 2016
    names = ['Bill', 'Bob', 'Ryan']
    for day in range(1, 4, 1):
        for name in names:
            if random.choice([True, False]):  # sometimes a name will be missing
                continue
            value = random.randrange(0, 20, 1)  # random value from heuristic
            col = '{}_{}'.format(year, day)  # column name
            # note: DataFrame.append was removed in pandas 2.0; pd.concat replaces it
            df = df.append({col: value, 'name': name}, ignore_index=True)
    df.set_index('name', inplace=True, drop=True)
    print(df.loc['Bill'])
This produces the following results:
      2016_1  2016_2  2016_3
name
Bill    15.0     NaN     NaN
Bill     NaN    12.0     NaN
I've created a heatmap of the data and it's blocky due to duplicate names, so the output I'm looking for is:
      2016_1  2016_2  2016_3
name
Bill    15.0    12.0     NaN
How can I combine these rows?
Is there a more efficient means of creating this dataframe?
Try this:
df.groupby('name')[df.columns.values].sum()
Try this:
df.pivot_table(index='name', aggfunc='sum', dropna=False)
After you run your foo() function, you can use any aggregation function (if you have only one value per column and all the others are null) and groupby on df.
First, use reset_index to get back your name column.
Then use groupby and apply. Here I propose a custom function which checks that there is only one value per column, and raise a ValueError if not.
df.reset_index(inplace=True)

def aggdata(x):
    if all([i <= 1 for i in x.count()]):
        return x.mean()
    else:
        raise ValueError

ddf = df.groupby('name').apply(aggdata)
If all the values of the column are null but one, x.mean() will return that value (actually, you can use almost any aggregator, since there is only one value, that is the one returned).
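A hedged aside on the same assumption (at most one non-null value per column per name): pandas' built-in first aggregator performs the same collapse without a custom function. A sketch:
# Sketch: first() returns the first non-null value per column per group,
# merging the sparse duplicate rows into one row per name.
ddf = df.groupby('name').first()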
It would be easier to have the names as columns and the date as index instead. Plus, you can work with lists inside the loop and create the pd.DataFrame afterwards.
e.g.
import random

import numpy as np
import pandas as pd

year = 2016
names = ['Bill', 'Bob', 'Ryan']
index = []
valueBill = []
valueBob = []
valueRyan = []
for day in range(1, 4):
    if random.choice([True, False]):  # sometimes a name will be missing
        valueBill.append(random.randrange(0, 20))
        valueBob.append(random.randrange(0, 90))
        valueRyan.append(random.randrange(0, 200))
        index.append('{}-0{}'.format(year, day))  # index label
    else:
        valueBill.append(np.nan)
        valueBob.append(np.nan)
        valueRyan.append(np.nan)
        index.append(np.nan)

df = pd.DataFrame({})
for name, value in zip(names, [valueBill, valueBob, valueRyan]):
    df[name] = value
df = df.set_index(pd.to_datetime(index))  # assign back: set_index returns a new frame
You can append the entries whose names do not already exist, and then do an update to refresh the existing entries.
import pandas as pd
import random

def foo():
    df = pd.DataFrame()
    year = 2016
    names = ['Bill', 'Bob', 'Ryan']
    for day in range(1, 4, 1):
        for name in names:
            if random.choice([True, False]):  # sometimes a name will be missing
                continue
            value = random.randrange(0, 20, 1)  # random value from heuristic
            col = '{}_{}'.format(year, day)  # column name
            new_df = pd.DataFrame({col: value, 'name': name}, index=[1]).set_index('name')
            # add rows whose name is not yet in df, then update existing ones
            df = pd.concat([df, new_df[~new_df.index.isin(df.index)].dropna()])
            df.update(new_df)
    #df.set_index('name', inplace=True, drop=True)
    print(df)

Check for each row in several columns and append for each row if the requirement is met or not. python

I have the following example of my dataframe:
df = pd.DataFrame({'first_date': ['01-07-2017', '01-07-2017', '01-08-2017'],
                   'end_date': ['01-08-2017', '01-08-2017', '15-08-2017'],
                   'second_date': ['01-09-2017', '01-08-2017', '15-07-2017'],
                   'cust_num': [1, 2, 1],
                   'Title': ['philips', 'samsung', 'philips']})
The conditions are:
1. the cust_num is equal in both rows,
2. the Title is equal for both rows in the dataframe,
3. the second_date in a row <= the end_date in another row.
If all these requirements are met, the value True should be appended to a new column in the original row. Because I'm working with a big dataset, I'm looking for an efficient way to do this. In this case only the first record should get a True value.
I have looked at apply with lambda and the groupby function in Python but couldn't find a way to make these work.
Try this (spontaneously I cannot come up with a faster method):
import pandas as pd
import numpy as np

df["second_date"] = pd.to_datetime(df["second_date"], format='%d-%m-%Y')
df["end_date"] = pd.to_datetime(df["end_date"], format='%d-%m-%Y')
df["new col"] = False
for cust in set(df["cust_num"]):
    indices = df.index[df["cust_num"] == cust].tolist()
    if len(indices) > 1:
        sub_df = df.loc[indices]
        for title in set(df.loc[indices]["Title"]):
            indices_title = sub_df.index[sub_df["Title"] == title]
            if len(indices_title) > 1:
                for i in indices_title:
                    if sub_df.loc[i, "second_date"] <= sub_df.loc[i, "end_date"]:
                        # flag only the matching row, then stop
                        df.loc[i, "new col"] = True
                        break
First you need to make all date columns comparable with each other by casting them to datetime. Then create the additional column you want.
Now build a set of all unique customer numbers and iterate through them. For each customer number, get the list of all row indices with that customer number. If this list is longer than 1, you have several rows with the same customer number, so create a sub-df of your dataframe with just those rows. Then iterate through the set of titles in it. For each title, check whether the same title occurs more than once in the sub-df (len > 1). If so, iterate through those rows and write True in the additional column of the first row where the date condition is met.
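Since the question stresses efficiency on a big dataset, here is a hedged vectorized sketch using a self-merge instead of nested loops (my own variant, not the answer above; the comparison direction may need adjusting to the exact requirement):
# Sketch: pair each row with every other row sharing cust_num and Title,
# then flag rows whose second_date is on or before the other row's end_date.
pairs = df.reset_index().merge(df.reset_index(), on=['cust_num', 'Title'],
                               suffixes=('', '_other'))
pairs = pairs[pairs['index'] != pairs['index_other']]  # drop self-pairs
hits = pairs.loc[pairs['second_date'] <= pairs['end_date_other'], 'index']
df['new col'] = df.index.isin(hits)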
This should work. Also, from reading the comments, I am assuming that all cust_num values are unique.
import pandas as pd

df = pd.DataFrame({'first_date': ['01-07-2017', '01-07-2017', '01-08-2017'],
                   'end_date': ['01-08-2017', '01-08-2017', '15-08-2017'],
                   'second_date': ['01-09-2017', '01-08-2017', '15-07-2017'],
                   'cust_num': [1, 2, 1],
                   'Title': ['philips', 'samsung', 'philips']})
# pass an explicit format: these strings are day-first and would otherwise
# be parsed month-first
df["second_date"] = pd.to_datetime(df["second_date"], format='%d-%m-%Y')
df["end_date"] = pd.to_datetime(df["end_date"], format='%d-%m-%Y')
df['Value'] = False
# compare every pair of distinct rows
for i in range(len(df)):
    for j in range(len(df)):
        if i != j:
            if (df.loc[j, 'end_date'] >= df.loc[i, 'second_date']
                    and df.loc[i, 'cust_num'] == df.loc[j, 'cust_num']
                    and df.loc[i, 'Title'] == df.loc[j, 'Title']):
                df.loc[i, 'Value'] = True
Tell me if this code works, and report any errors!

pandas: setting last N rows of multi-index to Nan for speeding up groupby with shift

I am trying to speed up my groupby.apply + shift. Thanks to this previous question and answer (How to speed up Pandas multilevel dataframe shift by group?), I can confirm that it does indeed speed things up when you have many groups.
From that question I now have the following code to set the first entry of each multi-index group to NaN, so that I can do my shift globally rather than per group.
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
But I want to look forward, not backwards, and need to do calculations across N rows. So I am trying to use similar code to set the last N entries to NaN, but obviously I am missing some important indexing knowledge, as I just can't figure it out.
I figure I want to convert this so that every entry is a range rather than a single integer. How would I do that?
# the start of each group, ignoring the first entry
df.groupby(level=0).size().cumsum()[1:]
Test setup (for backwards shift) if you want to try it:
import numpy as np
import pandas as pd

length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in range(0, groups):  # xrange in the original Python 2 code
    tmpdf = pd.DataFrame({'date': rng1,
                          'category': int(10000000 * abs(np.random.randn())),
                          'colA': np.random.randn(length),
                          'colB': np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort_values(by=['category', 'date'], inplace=True)  # df.sort() is gone in modern pandas
df.set_index(['category', 'date'], inplace=True, drop=True)

df['tmpShift'] = df['colB'].shift(1)
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
# Yay, this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop(columns='tmpShift', inplace=True)  # the original dropped 'tmp', a typo
Thanks!
I ended up doing it using a groupby apply as follows (and coded to work forwards or backwards):
def replace_tail(grp, col, N, value):
    # positive N blanks the first N rows of the group, negative N the last |N|
    if N > 0:
        grp.loc[grp.index[:N], col] = value
    else:
        grp.loc[grp.index[N:], col] = value
    return grp

df = df.groupby(level=0).apply(replace_tail, 'tmpShift', 2, np.nan)
So the final code is:
import numpy as np
import pandas as pd

def replace_tail(grp, col, N, value):
    # positive N blanks the first N rows of the group, negative N the last |N|
    if N > 0:
        grp.loc[grp.index[:N], col] = value
    else:
        grp.loc[grp.index[N:], col] = value
    return grp

length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in range(0, groups):
    tmpdf = pd.DataFrame({'date': rng1,
                          'category': int(10000000 * abs(np.random.randn())),
                          'colA': np.random.randn(length),
                          'colB': np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort_values(by=['category', 'date'], inplace=True)
df.set_index(['category', 'date'], inplace=True, drop=True)

shiftBy = -1
df['tmpShift'] = df['colB'].shift(shiftBy)
df = df.groupby(level=0).apply(replace_tail, 'tmpShift', shiftBy, np.nan)
# Yay, this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop(columns='tmpShift', inplace=True)
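For reference, a hedged sketch of the purely positional approach the question originally asked about: expand each group's end position into a range of N positions and blank them in one vectorized assignment. This is my own sketch, applied right after creating tmpShift (before it is dropped), with N as an assumed parameter:
# Sketch: positional end of each group, expanded into the last N positions.
N = 1  # number of trailing rows to blank per group
ends = df.groupby(level=0).size().cumsum().values
tail_positions = np.concatenate([np.arange(e - N, e) for e in ends])
df.iloc[tail_positions, df.columns.get_loc('tmpShift')] = np.nan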
