I have a dictionary where each key (a date) contains a table: multiple lists of the form [day1, val11, val21], [day2, val12, val22], [day3, val13, val23], and so on. I want to transform it into a DataFrame; this is done with the following code:
import pandas as pd

df4 = pd.DataFrame(columns=sorted(set_days))
for date in dic.keys():
    days = [day for day, val1, val2 in dic[date]]
    val1 = [val1 for day, val1, val2 in dic[date]]
    df4.loc[date, days] = val1
This code works fine, but it takes more than two hours to run.
After some research, I've realized I could parallelize it via the multiprocessing library; the following is my intended parallel version:
import multiprocessing

def func(date):
    global df4, dic
    days = [day for day, val1, val2 in dic[date]]
    val1 = [val1 for day, val1, val2 in dic[date]]
    df4.loc[date, days] = val1

multiprocessing.Pool(processes=8).map(func, dic.keys())
The problem with this code is that, after the multiprocessing.Pool(...).map call executes, the df4 DataFrame is empty.
Any help would be much appreciated.
Example
Suppose the dictionary contains two days:
dic['20030812'][:4]
Out: [[1, 24.25, 0.0], [20, 23.54, 23.54], [30, 23.13, 24.36], [50, 22.85, 23.57]]
dic['20030813'][:4]
Out: [[1, 24.23, 0.0], [19, 23.4, 22.82], [30, 22.97, 24.19], [49, 22.74, 23.25]]
then the DataFrame should be of the form:
df4.loc[:, 1:50]
1 2 3 4 5 ... 46 47 48 49 50
20030812 24.25 NaN NaN NaN NaN ... NaN NaN NaN NaN 22.85
20030813 24.23 NaN NaN NaN NaN ... NaN NaN NaN 22.74 NaN
Also,
dic.keys()
Out[36]: dict_keys(['20030812', '20030813'])
df4.head().to_dict()
Out:
{1: {'20030812': 24.25, '20030813': 24.23},
2: {'20030812': nan, '20030813': nan},
3: {'20030812': nan, '20030813': nan},
4: {'20030812': nan, '20030813': nan},
5: {'20030812': nan, '20030813': nan},
6: {'20030812': nan, '20030813': nan},
7: {'20030812': nan, '20030813': nan},
8: {'20030812': nan, '20030813': nan},
9: {'20030812': nan, '20030813': nan},
10: {'20030812': nan, '20030813': nan},
11: {'20030812': nan, '20030813': nan},
12: {'20030812': nan, '20030813': nan},
13: {'20030812': nan, '20030813': nan},
14: {'20030812': nan, '20030813': nan},
15: {'20030812': nan, '20030813': nan},
16: {'20030812': nan, '20030813': nan},
17: {'20030812': nan, '20030813': nan},
18: {'20030812': nan, '20030813': nan},
19: {'20030812': nan, '20030813': 23.4},
20: {'20030812': 23.54, '20030813': nan},
21: {'20030812': nan, '20030813': nan},
22: {'20030812': nan, '20030813': nan},
23: {'20030812': nan, '20030813': nan},
24: {'20030812': nan, '20030813': nan},
25: {'20030812': nan, '20030813': nan},
26: {'20030812': nan, '20030813': nan},
27: {'20030812': nan, '20030813': nan},
28: {'20030812': nan, '20030813': nan},
29: {'20030812': nan, '20030813': nan},
30: {'20030812': 23.13, '20030813': 22.97},
31: {'20030812': nan, '20030813': nan},
32: {'20030812': nan, '20030813': nan},
...
To answer your original question (roughly: "Why is the df4 DataFrame empty?"): the reason this doesn't work is that when the Pool workers are launched, each worker inherits a private copy-on-write view of the parent's data (directly, if multiprocessing is running on a UNIX-like system with fork, or via a kludgy approach that simulates it on Windows).
Thus, when each worker does:
df4.loc[date, days] = val1
it's mutating the worker's personal copy of df4; the parent process's copy remains untouched.
In general, there are three ways to handle this:
Change your worker function to return something that can be used in the parent process. For example, instead of trying to perform in-place mutation with df4.loc[date, days] = val1, return what's necessary to do it in the parent, e.g. return date, days, val1, then change the parent to:
for date, days, val in multiprocessing.Pool(processes=8).map(func, dic.keys()):
    df4.loc[date, days] = val
The downside to this approach is that it requires each return value to be pickled (Python's serialization), piped from child to parent, and unpickled; if the worker task doesn't do very much work, and especially if the return values are large (which seems to be the case here), it can easily spend more time on serialization and IPC than it gains from parallelism.
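Putting the first approach together end to end, a minimal sketch might look like this (an illustration only; it assumes dic and set_days exist as in the question, and the __main__ guard matters on Windows, where workers are spawned rather than forked):

import multiprocessing

import pandas as pd

def func(date):
    # Worker: compute and *return* the pieces instead of mutating df4 in place.
    days = [day for day, val1, val2 in dic[date]]
    vals = [val1 for day, val1, val2 in dic[date]]
    return date, days, vals

if __name__ == '__main__':
    df4 = pd.DataFrame(columns=sorted(set_days))
    with multiprocessing.Pool(processes=8) as pool:
        # The parent applies each result; only here is df4 actually modified.
        for date, days, vals in pool.map(func, dic.keys()):
            df4.loc[date, days] = vals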
Using shared objects/memory (demonstrated in this answer to "Multiprocessing writing to pandas dataframe"). In practice this usually doesn't gain you much, since anything that isn't based on the more "raw" ctypes sharing via multiprocessing.sharedctypes still ultimately ends up needing to pipe data from one process to another; sharedctypes-based sharing can give a meaningful speed boost though, since, once mapped, shared raw C arrays are nearly as fast to access as local memory.
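To make the sharedctypes idea concrete, here is a toy sketch (not tied to the question's data; it assumes a fork-based start method, e.g. Linux, since with spawn the array would have to be handed to the workers via a Pool initializer):

import multiprocessing
from multiprocessing import sharedctypes

import numpy as np

n_rows, n_cols = 4, 3
# Raw shared memory: forked workers and the parent see the same buffer.
shared = sharedctypes.RawArray('d', n_rows * n_cols)

def fill_row(i):
    # View the shared buffer as a 2-D array; each worker writes its own row,
    # so no results need to be pickled back to the parent.
    grid = np.frombuffer(shared, dtype='float64').reshape(n_rows, n_cols)
    grid[i, :] = i

if __name__ == '__main__':
    with multiprocessing.Pool(processes=2) as pool:
        pool.map(fill_row, range(n_rows))
    print(np.frombuffer(shared, dtype='float64').reshape(n_rows, n_cols))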
If the work being parallelized is I/O bound, or uses third party C extensions for CPU bound work (e.g. numpy), you may be able to get the required speed boosts from threads, despite GIL interference, and threads do share the same memory. Your case doesn't appear to be either I/O bound or meaningfully dependent on third party C extensions which might release the GIL, so it probably won't help here, but in general, the simple way to switch from process-based parallelism to thread-based parallelism (when you're already using multiprocessing) is to change the import from:
import multiprocessing
to
import multiprocessing.dummy as multiprocessing
which imports the thread-backed version of multiprocessing under the expected name, so code seamlessly switches from using processes to threads.
As RafaelC hinted, it was an XY problem.
I've been able to reduce the execution time to 20 seconds without multiprocessing.
I created a list, lista, that replaces the dictionary and, rather than adding a row to the df4 DataFrame for each date, I transform lista into a DataFrame once it is full.
import numpy as np
import pandas as pd

# Returns the largest day across all the dates (each date has a different number of days)
def longest_series(dic):
    largest_series = 0
    for date in dic.keys():
        # get the day of the last row of this date's table
        current_series = dic[date][-1][0]
        if largest_series < current_series:
            largest_series = current_series
    return largest_series

ls = longest_series(dic)
l_total_days = list(range(1, ls + 1))
s_total_days = set(l_total_days)

# Creating the lista list; lista is similar to dic.
# The difference is that, in lista, every date has the same number of days
# (i.e. from 1 to ls), and it does not contain the dates.
# It takes 15 seconds.
lista = list()
for date in dic.keys():
    present_days = list()
    present_values = list()
    for day, val_252, _ in dic[date]:
        present_days.append(day)
        present_values.append(val_252)
    missing_days = list(s_total_days.difference(set(present_days)))  # extra days added to date
    missing_values = [None] * len(missing_days)  # extra values added to date
    all_days_index = list(np.argsort(present_days + missing_days))  # preserve the order between days and values
    all_day_values = present_values + missing_values
    lista.append(list(np.array(all_day_values)[all_days_index]))

# It takes 4 seconds
df = pd.DataFrame(lista, index=dic.keys(), columns=l_total_days)
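For reference, a shorter sketch of the same construction (an alternative, assuming dic and l_total_days as defined above): pandas aligns the inner-dict keys into columns by itself, so the manual padding with missing days isn't needed.

# Build a dict of {date: {day: val1}} and let pandas align the columns.
df_alt = pd.DataFrame.from_dict(
    {date: {day: val1 for day, val1, _ in rows} for date, rows in dic.items()},
    orient='index',
).reindex(columns=l_total_days)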
Related
Let's say I have the following DataFrame (shown as a dict):
{'docdb_family_id': {0: 569328,
1: 574660,
2: 1187498,
3: 1226468,
4: 1236571,
5: 1239098,
6: 1239277,
7: 1239483,
8: 1239622,
9: 1239624,
10: 1239749,
11: 1334477,
12: 1340405,
13: 1340418,
14: 1340462,
15: 1340471,
16: 1340485,
17: 1340488,
18: 1340508,
19: 1340519,
20: 1340541},
'newa_cited_docdb': {0: '[ 596005 4321416 5802640 6031690 6043910 8600475 8642629 9203255 9345445 10177065 10455451 13428248 22139349 22591458 24627241 24750476 26261826 26405611 27079105 27096884]',
1: '[ 5956195 11260528 22181831 22437920 22642946 23278096 23407037 23458128 24244657 24355363 25014714 25115774 25156886 27047688 27089078 27398716]',
2: '[ 5855196 7755392 11183886 22894980 24648618 27185399]',
3: '[ 3573464 6279285 6294985 6542463 6981930 7427770 10325811 14970234 16878329 17935009 21811002 22329817 23543436 23907898 24456108 25283772]',
4: '[ 2777078 2826073 5944733 10484188 11052747 14682645 15688752 22333410 22614097 22646501 22783765 22978728 23231683 24259740 24605606 24839432 25492752 27009992 27044704]',
5: '[ 5777407 10417156 23463145 23845079 24397163 24426379 24916732 25216234 25296619 27054560 27509152]',
6: '[ 4136523 12578497 21994155 22418792 22626616 22655464 22694825 22779403 23081767 23309829 23379411 23621952 24130698 24236071 24267003 24790872 24841797 25343500 27006578]',
7: '[21722194 23841261 23870348 24749080 26713455 26884023 26892256 27123571]',
8: '[ 3770167 9249538 20340153 21805004 21826650 23074051 23211424 23586695 23664858 24139881 24669345 24951262 25109266 25172355 25351735 26158421 27074633]',
9: '[ 3773931 10400885 23825854 24863945 24904226 25372210 26673422 27108903]',
10: '[ 6245732 6270984 6282047 6313094 6323632 6357314 12700997 14934415]',
11: '[1331950 5937719 5950928 6032897 6737094 8103287]',
12: '[22536768 23111794 23827356 24148953 24483064 24636228 26369896 26722884]',
13: '[ 4096597 6452385 9164095 19820980 22468583 23758517 24922228]',
14: '[ 6273193 6365448 9349940 10531948 13589721 20897840 21818345 22422049 23234586 23722349 24282964 24466601 25476838 26223504 26685774 26756449 26812104 26900843 27088150]',
15: '[ 3770297 6285357 21272262 21883292 22392025 23100861 23160290 23827496 24060758 25448672 26918320]',
16: '[21808322 25167492 25401922 26858065]',
17: '[ 6293130 12621423 12977043 14043576 14524083 22013480 23070753 23360636 23672818 24210016 24396413 24505095 25447453 26335550 27560125]',
18: '[21923978 23414619 23700077 23916998 23917011 23917023 24227869]',
19: '[ 3029629 3461742 8589904 10338953 10633369 16254362 22248316 22635394 24392987 25416705 26671842 27391491 27406148]',
20: None},
'paperid': {0: nan,
1: nan,
2: nan,
3: nan,
4: nan,
5: nan,
6: nan,
7: nan,
8: nan,
9: nan,
10: nan,
11: nan,
12: nan,
13: nan,
14: nan,
15: nan,
16: nan,
17: nan,
18: nan,
19: nan,
20: 1998988989.0},
'fronteer': {0: 0,
1: 0,
2: 0,
3: 0,
4: 0,
5: 0,
6: 0,
7: 0,
8: 0,
9: 0,
10: 0,
11: 0,
12: 0,
13: 0,
14: 0,
15: 0,
16: 0,
17: 0,
18: 0,
19: 0,
20: 1},
'distance': {0: nan,
1: nan,
2: nan,
3: nan,
4: nan,
5: nan,
6: nan,
7: nan,
8: nan,
9: nan,
10: nan,
11: nan,
12: nan,
13: nan,
14: nan,
15: nan,
16: nan,
17: nan,
18: nan,
19: nan,
20: 0.0},
'cited_docdb_ls': {0: '[ 596005 4321416 5802640 6031690 6043910 8600475 8642629 9203255 9345445 10177065 10455451 13428248 22139349 22591458 24627241 24750476 26261826 26405611 27079105 27096884]',
1: '[ 5956195 11260528 22181831 22437920 22642946 23278096 23407037 23458128 24244657 24355363 25014714 25115774 25156886 27047688 27089078 27398716]',
2: '[ 5855196 7755392 11183886 22894980 24648618 27185399]',
3: '[ 3573464 6279285 6294985 6542463 6981930 7427770 10325811 14970234 16878329 17935009 21811002 22329817 23543436 23907898 24456108 25283772]',
4: '[ 2777078 2826073 5944733 10484188 11052747 14682645 15688752 22333410 22614097 22646501 22783765 22978728 23231683 24259740 24605606 24839432 25492752 27009992 27044704]',
5: '[ 5777407 10417156 23463145 23845079 24397163 24426379 24916732 25216234 25296619 27054560 27509152]',
6: '[ 4136523 12578497 21994155 22418792 22626616 22655464 22694825 22779403 23081767 23309829 23379411 23621952 24130698 24236071 24267003 24790872 24841797 25343500 27006578]',
7: '[21722194 23841261 23870348 24749080 26713455 26884023 26892256 27123571]',
8: '[ 3770167 9249538 20340153 21805004 21826650 23074051 23211424 23586695 23664858 24139881 24669345 24951262 25109266 25172355 25351735 26158421 27074633]',
9: '[ 3773931 10400885 23825854 24863945 24904226 25372210 26673422 27108903]',
10: '[ 6245732 6270984 6282047 6313094 6323632 6357314 12700997 14934415]',
11: '[1331950 5937719 5950928 6032897 6737094 8103287]',
12: '[22536768 23111794 23827356 24148953 24483064 24636228 26369896 26722884]',
13: '[ 4096597 6452385 9164095 19820980 22468583 23758517 24922228]',
14: '[ 6273193 6365448 9349940 10531948 13589721 20897840 21818345 22422049 23234586 23722349 24282964 24466601 25476838 26223504 26685774 26756449 26812104 26900843 27088150]',
15: '[ 3770297 6285357 21272262 21883292 22392025 23100861 23160290 23827496 24060758 25448672 26918320]',
16: '[21808322 25167492 25401922 26858065]',
17: '[ 6293130 12621423 12977043 14043576 14524083 22013480 23070753 23360636 23672818 24210016 24396413 24505095 25447453 26335550 27560125]',
18: '[21923978 23414619 23700077 23916998 23917011 23917023 24227869]',
19: '[ 3029629 3461742 8589904 10338953 10633369 16254362 22248316 22635394 24392987 25416705 26671842 27391491 27406148]',
20: []}}
What I would like to do is explode the variable cited_docdb_ls, whose entries are lists separated by spaces rather than commas.
How can I do that? If it is not possible, is there a way to separate them by commas rather than spaces and then explode them?
The result should either contain cited_docdb_ls as traditional comma-separated lists or be the exploded DataFrame. I have checked the df.explode() documentation but could not find any hint on how to handle situations where the list elements are separated by spaces.
Thank you
I would use str.findall with a (\d+) regex for numbers to convert the strings to lists, then explode:
out = (df.assign(newa_cited_docdb=df['newa_cited_docdb'].str.findall(r'\d+'),
                 cited_docdb_ls=df['cited_docdb_ls'].str.findall(r'\d+'))
         .explode(['newa_cited_docdb', 'cited_docdb_ls'])
       )
NB: if you don't have only digits, a \w+ regex will be more generic; however, if the strings also contain [ or ] anywhere other than the first and last character (e.g. [abc 12]3 45d]), then jezrael's answer below will be an alternative.
output:
docdb_family_id newa_cited_docdb paperid fronteer distance \
0 569328 596005 NaN 0 NaN
0 569328 4321416 NaN 0 NaN
0 569328 5802640 NaN 0 NaN
0 569328 6031690 NaN 0 NaN
0 569328 6043910 NaN 0 NaN
.. ... ... ... ... ...
19 1340519 25416705 NaN 0 NaN
19 1340519 26671842 NaN 0 NaN
19 1340519 27391491 NaN 0 NaN
19 1340519 27406148 NaN 0 NaN
20 1340541 None 1.998989e+09 1 0.0
cited_docdb_ls
0 596005
0 4321416
0 5802640
0 6031690
0 6043910
.. ...
19 25416705
19 26671842
19 27391491
19 27406148
20 NaN
[239 rows x 6 columns]
Use Series.str.strip with Series.str.split for both columns and then DataFrame.explode:
df = (df.assign(newa_cited_docdb=df['newa_cited_docdb'].str.strip('[]').str.split(),
                cited_docdb_ls=df['cited_docdb_ls'].str.strip('[]').str.split())
        .explode(['newa_cited_docdb', 'cited_docdb_ls']))
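With either approach, the exploded entries are still strings (the source columns hold bracketed text); if numeric IDs are needed afterwards, a cast is one option (a sketch, assuming the out frame from the first answer):

out['newa_cited_docdb'] = pd.to_numeric(out['newa_cited_docdb'])
out['cited_docdb_ls'] = pd.to_numeric(out['cited_docdb_ls'])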
I'm reading multiple PDFs (using tabula) into data frames like this:
nan = float('nan')
DataFrame_as_dict = {'Unnamed: 0': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'Unnamed: 1': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'NOTA DE NEGOCIAÇÃO': {0: nan,
1: nan,
2: 'Rico Investimentos - Grupo XP',
3: 'Av. Presidente Juscelino Kubitschek - Torre Sul, 1909 - 25o ANDAR VILA OLIMPIA 4543-907',
4: 'Tel. 3003-5465Fax: (55 11) 4007-2465',
5: 'Internet: www.rico.com.vc SAC: 0800-774-0402 e-mail: atendimento#rico.com.vc'},
'Unnamed: 3': {0: 'Nr. nota Folha',
1: '330736 1',
2: nan,
3: 'SÃO PAULO - SP',
4: nan,
5: nan},
'Unnamed: 4': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'Unnamed: 5': {0: 'Data pregão',
1: '30/09/2015',
2: nan,
3: nan,
4: nan,
5: nan}}
df = pd.DataFrame(DataFrame_as_dict)
My intention is to store the value '330736 1' in the variable "number" and '30/09/2015' in a variable "date".
The issue is that, although these values will always be located in row 1, the columns vary in an unpredictable way across the multiple PDFs.
Therefore, I tried to loop over the columns of row 1 in order to extract these data regardless of which columns they are in:
list_columns = df.columns
for i in range(len(list_columns)):
    if isinstance(df.iloc[1:2, i], str):
        if df.iloc[1:2, i].str.contains("/", na=False, regex=False).any():
            date = str(df.iloc[1:2, i]).strip()
        else:
            n_nota = str(df.iloc[1:2, i]).strip()
However, without success... Any thoughts?
In your original code, if isinstance(df.iloc[1:2,i], str) will never evaluate to True for two reasons:
Strings inside DataFrames are of type object
df.iloc[1:2,i] will always be a pandas Series.
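A quick way to see both points on the sample df built above (just an illustrative check):

print(df.dtypes)                # columns holding strings show up as dtype object
print(type(df.iloc[1:2, 0]))    # <class 'pandas.core.series.Series'> -- a slice, never a str
print(type(df.iloc[1, 3]))      # <class 'str'> -- scalar access does return the string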
Since object is such a flexible type, it's not as useful as str for identifying the data you want. In the code below, I simply used a space character to differentiate the data you want for n_nota. If this doesn't work with your data, a regex pattern may be a good approach.
list_columns = df.columns
for i in range(len(list_columns)):
    if isinstance(df.iloc[1:2, i].values, object):
        if "/" in str(df.iloc[1:2, i].values):
            date = str(df.iloc[1:2, i].values[0]).strip()
        elif " " in str(df.iloc[1:2, i].values):
            n_nota = str(df.iloc[1:2, i].values[0]).strip()
Edit: As noted below, the original code in the question text would have worked if each df.iloc[1:2,i] were replaced with df.iloc[1,i] as in:
list_columns = df.columns
for i in range(len(list_columns)):
    if isinstance(df.iloc[1, i], str):
        if df.iloc[1:2, i].str.contains("/", na=False, regex=False).any():
            date = str(df.iloc[1, i]).strip()
        else:
            n_nota = str(df.iloc[1, i]).strip()
I'm using Dask to manipulate a dataframe (coming from a CSV file) and I'm looking for a way to improve this code using something like map or apply, since on large files it takes very long (I know that nested for loops with iterrows() are about the worst thing I can do):
NAN_VALUES = [-999, "INVALID", -9999]

_all_rows = list()
for index, row in df.iterrows():
    _row = list()
    for key, value in row.iteritems():
        if value in NAN_VALUES or pd.isnull(value):
            _row.append(None)
        else:
            _row.append(apply_transform(key, value))
    _all_rows.append(_row)
    rows_count += 1
How can I map this code using map_partitions or pandas.map ?!
EXTRA: a bit more context:
In order to be able to apply some functions, I'm replacing NaN values with a default value. Finally, I need to make a list for each row, replacing the default values with None.
1.- Original DF
"name" "age" "money"
---------------------------
"David" NaN 12.345
"Jhon" 22 NaN
"Charles" 30 123.45
NaN NaN NaN
2.- Passing NaN to Default value
"name" "age" "money"
------------------------------
"David" -999 12.345
"Jhon" 22 -9999
"Charles" 30 123.45
"INVALID" -999 -9999
3.- Parse to a list each row
"name" , "age", "money"
------------------------
["David", None, 12.345]
["Jhon", 22, None]
["Charles", 30, 123.45]
[None, None, None]
My suggestion here is to get it working with pandas first and then translate it into dask.
pandas
import pandas as pd
import numpy as np
nan = np.nan
df = {'name': {0: 'David', 1: 'John', 2: 'Charles', 3: nan},
'age': {0: nan, 1: 22.0, 2: 30.0, 3: nan},
'money': {0: 12.345, 1: nan, 2: 123.45, 3: nan}}
df = pd.DataFrame(df)
# These are your default values
diz = {"age": -999, "name": "INVALID", "money": -9999}
Passing NaN to Default value
for k, v in diz.items():
    df[k] = df[k].fillna(v)
Get a list for every row
df.apply(list, axis=1)
0 [David, nan, 12.345]
1 [John, 22.0, nan]
2 [Charles, 30.0, 123.45]
3 [nan, nan, nan]
dtype: object
dask
import pandas as pd
import dask.dataframe as dd
import numpy as np
nan = np.nan
df = {'name': {0: 'David', 1: 'John', 2: 'Charles', 3: nan},
'age': {0: nan, 1: 22.0, 2: 30.0, 3: nan},
'money': {0: 12.345, 1: nan, 2: 123.45, 3: nan}}
df = pd.DataFrame(df)
# These are your default values
diz = {"age": -999, "name": "INVALID", "money": -9999}
# transform to dask dataframe
df = dd.from_pandas(df, npartitions=2)
Passing NaN to Default value
This is exactly the same as before. Note that, as dask is lazy, you should run df.compute() if you want to see the effects.
for k, v in diz.items():
    df[k] = df[k].fillna(v)
Get a list for every row
Here things change a bit, as you are asked to state the dtype of your output explicitly:
df.apply(list, axis=1, meta=(None, 'object'))
In dask you can alternatively use map_partitions, as follows:
df.map_partitions(lambda x: x.apply(list, axis=1))
Remark: if your data fits in memory you don't need dask, and plain pandas could be faster.
I would like to return the first non-null value of the utm_source column from each group after running a groupby.
This is the code I have written:
file[file['steps'] == 'Sign-ups'].sort_values(by=['ts']).groupby('anonymous_id')['utm_source'].apply(lambda x: x.first_valid_index())
This seems to return this:
anonymous_id
00003df1-be12-47b8-b3b8-d01c84a22fdf NaN
00009cc0-279f-4ccf-aea4-f6af1f2bb75a NaN
0000a6a0-00bc-475f-a9e5-9dcbb4309e78 NaN
0000c906-7060-4521-8090-9cd600b08974 638.0
0000c924-5959-4e2d-8757-0d10f96ca462 NaN
0000dc27-292c-4676-8a1b-4977f2ad1577 275.0
0000df7e-2579-4071-8aa5-814ab294bf9a 419.0
I am not quite sure what the values associated with the anon_id's are.
Here is a sample of my data:
{'anonymous_id': {0: '0000f8ea-3aa6-4423-9247-1d9580d378e1',
1: '00015d49-2cd8-41b1-bbe7-6aedbefdb098',
2: '0002226e-26a4-4f55-9578-2eff2999de7e',
3: '00022b83-240e-4ef9-aaad-ac84064bb902',
4: '00022b83-240e-4ef9-aaad-ac84064bb902'},
'ts': {0: '2018-04-11 06:59:20.206000',
1: '2019-05-18 05:59:11.874000',
2: '2018-09-10 18:19:25.260000',
3: '2017-10-11 08:20:18.092000',
4: '2017-10-11 08:20:31.466000'},
'utm_source': {0: nan, 1: 'facebook', 2: 'facebook', 3: nan, 4: nan},
'rank': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2},
'steps': {0: 'Sign-ups', 1: nan, 2: nan, 3: nan, 4: nan}}
So for each anonymous_id I would return the first (chronological, sorted by the ts column) utm_source associated with the anon_id
IIUC you can first drop the null values and then groupby first:
df.sort_values('ts').dropna(subset=['utm_source']).groupby('anonymous_id')['utm_source'].first()
Output for your example data:
anonymous_id
00015d49-2cd8-41b1-bbe7-6aedbefdb098 facebook
0002226e-26a4-4f55-9578-2eff2999de7e facebook
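As an aside (not part of the answer above): the numbers in the original output are most likely the row labels returned by first_valid_index(), i.e. the index of the first non-null row in each group, not utm_source values. If you wanted to keep the apply-based approach, a sketch that maps those labels back to values might be:

(file[file['steps'] == 'Sign-ups']
 .sort_values(by=['ts'])
 .groupby('anonymous_id')['utm_source']
 .apply(lambda x: x[x.first_valid_index()] if x.first_valid_index() is not None else None))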
I have a df where every row looks something like this:
series = pd.Series({0: 1.0,
1: 1.0,
10: nan,
11: nan,
2: 1.0,
3: 1.0,
4: nan,
5: nan,
6: nan,
7: nan,
8: nan,
9: nan,
'B': 3.0,
'D': 1.0})
For every row, I want series[series.D] to be 2, and all values at numeric labels higher than series.D also to be 2. It's kind of a 'forward replace'.
So what I want is:
target = pd.Series({0: 1.0,
1: 2.0,
10: nan,
11: nan,
2: 2.0,
3: 2.0,
4: nan,
5: nan,
6: nan,
7: nan,
8: nan,
9: nan,
'B': 3.0,
'D': 1.0})
So far I've got:
def forward_replace(series):
    if pd.notnull(series['D']):
        cols = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
        target_cols = [x for x in cols if x > series.D]
        series.loc[target_cols].replace({1: 2}, inplace=True)
    return series
It seems like it's not possible to use label based indexing with numerical column labels?
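Not an authoritative answer, but label-based indexing with integer labels does work here; the likelier issue in the snippet above is that series.loc[target_cols] returns a copy, so replace(..., inplace=True) never writes back into series. A sketch that assigns the result instead (and uses >= so the label equal to series.D is included, to match the target shown above):

def forward_replace(series):
    if pd.notnull(series['D']):
        cols = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
        target_cols = [x for x in cols if x >= series['D']]
        # Assign the replaced values back instead of relying on inplace=True on a copy.
        series.loc[target_cols] = series.loc[target_cols].replace({1: 2})
    return series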