I'm using Dask to manipulate a dataframe (coming from a CSV file) and I'm looking for a way to improve this code with something like map or apply, since on large files it takes very long (I know that nested for loops with iterrows() is about the worst thing I can do).
NAN_VALUES = [-999, "INVALID", -9999]
_all_rows = list()
for index, row in df.iterrows():
    _row = list()
    for key, value in row.iteritems():
        if value in NAN_VALUES or pd.isnull(value):
            _row.append(None)
        else:
            _row.append(apply_transform(key, value))
    _all_rows.append(_row)
    rows_count += 1
How can I map this code using map_partitions or pandas map?
EXTRA: a bit more context:
In order to be able to apply some functions, I'm replacing NaN values with a default value. Finally, I need to build a list for each row, replacing the default values with None.
1.- Original DF
"name" "age" "money"
---------------------------
"David" NaN 12.345
"Jhon" 22 NaN
"Charles" 30 123.45
NaN NaN NaN
2.- Passing NaN to Default value
"name" "age" "money"
------------------------------
"David" -999 12.345
"Jhon" 22 -9999
"Charles" 30 123.45
"INVALID" -999 -9999
3.- Parse each row to a list
"name" , "age", "money"
------------------------
["David", None, 12.345]
["Jhon", 22, None]
["Charles", 30, 123.45]
[None, None, None]
My suggestion here is to first get it working with pandas and then translate it into dask.
pandas
import pandas as pd
import numpy as np
nan = np.nan
df = {'name': {0: 'David', 1: 'John', 2: 'Charles', 3: nan},
'age': {0: nan, 1: 22.0, 2: 30.0, 3: nan},
'money': {0: 12.345, 1: nan, 2: 123.45, 3: nan}}
df = pd.DataFrame(df)
# These are your default values
diz = {"age": -999, "name": "INVALID", "money": -9999}
Passing NaN to Default value
for k,v in diz.items():
df[k] = df[k].fillna(v)
Get a list for every row
df.apply(list, axis=1)
0 [David, nan, 12.345]
1 [John, 22.0, nan]
2 [Charles, 30.0, 123.45]
3 [nan, nan, nan]
dtype: object
dask
import pandas as pd
import dask.dataframe as dd
import numpy as np
nan = np.nan
df = {'name': {0: 'David', 1: 'John', 2: 'Charles', 3: nan},
'age': {0: nan, 1: 22.0, 2: 30.0, 3: nan},
'money': {0: 12.345, 1: nan, 2: 123.45, 3: nan}}
df = pd.DataFrame(df)
# These are your default values
diz = {"age": -999, "name": "INVALID", "money": -9999}
# transform to dask dataframe
df = dd.from_pandas(df, npartitions=2)
Passing NaN to Default value
This is exactly the same as before. Note that since dask is lazy, you need to run df.compute() if you want to see the effects.
for k,v in diz.items():
df[k] = df[k].fillna(v)
Get a list for every row
Here things change a bit, as you are required to state the dtype of your output explicitly:
df.apply(list, axis=1, meta=(None, 'object'))
Alternatively, in dask you can use map_partitions as follows
df.map_partitions(lambda x: x.apply(list, axis=1))
Remark: if your data fits in memory you don't need dask, and pandas will likely be faster.
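To tie this back to the original loop, here is a minimal sketch (assuming apply_transform, NAN_VALUES and the original dataframe df from the question, with ddf as its dask counterpart) that builds the per-row lists with a single apply, plus the equivalent map_partitions call:

import pandas as pd

NAN_VALUES = [-999, "INVALID", -9999]  # defaults from the question

def row_to_list(row):
    # None for NaN or default markers, apply_transform (from the question) otherwise
    return [None if pd.isnull(value) or value in NAN_VALUES
            else apply_transform(key, value)
            for key, value in row.items()]

# pandas
rows = df.apply(row_to_list, axis=1)

# dask: same logic applied per partition; meta declares the output dtype
rows = ddf.map_partitions(lambda pdf: pdf.apply(row_to_list, axis=1),
                          meta=(None, "object"))

rows is a series of lists; rows.tolist() (after rows.compute() in dask) gives the list of lists the loop was building.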
Related
I'm struggling with the following case. I have a dataframe whose columns contain either NaN or a string value. How could I merge all 3 columns (i.e. Q3_4_4, Q3_4_5, and Q3_4_6) into one new column (i.e. Q3_4) keeping only the string value? This column would have 'hi' in row 1, 'bye' in row 2, and 'hello' in row 3.
Thank you for your help
{'Q3_4_4': {'R_00RfS8OP6QrTNtL': nan,
'R_3JtmbtdPjxXZAwA': nan,
'R_3G2sp6TEXZmf2KI': 'hello'
},
'Q3_4_5': {'R_00RfS8OP6QrTNtL': 'hi',
'R_3JtmbtdPjxXZAwA': nan,
'R_3G2sp6TEXZmf2KI': nan},
'Q3_4_6': {'R_00RfS8OP6QrTNtL': nan,
'R_3JtmbtdPjxXZAwA': 'bye',
'R_3G2sp6TEXZmf2KI': nan},
}
If you need to join columns whose names differ only in the part after the last _, use GroupBy.first with axis=1 (i.e. group by columns) and a lambda that splits each column name once from the right on _ and keeps the prefix:
import pandas as pd
import numpy as np

nan = np.nan
df = pd.DataFrame({'Q3_4_4': {'R_00RfS8OP6QrTNtL': nan,
'R_3JtmbtdPjxXZAwA': nan,
'R_3G2sp6TEXZmf2KI': 'hello'
},
'Q3_4_5': {'R_00RfS8OP6QrTNtL': 'hi',
'R_3JtmbtdPjxXZAwA': nan,
'R_3G2sp6TEXZmf2KI': nan},
'Q3_4_6': {'R_00RfS8OP6QrTNtL': nan,
'R_3JtmbtdPjxXZAwA': 'bye',
'R_3G2sp6TEXZmf2KI': nan},
})
df = df.groupby(lambda x: x.rsplit('_', 1)[0], axis=1).first()
print (df)
Q3_4
R_00RfS8OP6QrTNtL hi
R_3JtmbtdPjxXZAwA bye
R_3G2sp6TEXZmf2KI hello
df.apply(lambda x: x[x.last_valid_index()], axis=1)  # take the value at the last non-null position of each row
More methods: https://stackoverflow.com/a/46520070/8170215
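Another option, not shown above, is a back-fill along the columns followed by taking the first column; a sketch, assuming the same three Q3_4_* columns:

df['Q3_4'] = df[['Q3_4_4', 'Q3_4_5', 'Q3_4_6']].bfill(axis=1).iloc[:, 0]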
In my dataframe, a value should be classed as missing if it is "nan", 0 or "missing".
I wrote a list
null_values = ["nan", 0, "missing"]
Then I checked which columns contained missing values
df.columns[df.isin(null_values).any()]
My df looks as follows:
{'Sick': {0: False, 1: False, 2: False, 3: False, 4: False, 5: False, 6: True},
'Name': {0: 'Liam',
1: 'Chloe',
2: 'Andy',
3: 'Max',
4: 'Anne',
5: nan,
6: 'Charlie'},
'Age': {0: 1.0, 1: 2.0, 2: 8.0, 3: 4.0, 4: 5.0, 5: 2.0, 6: 9.0}}
It flags the column 'Sick' as containing missing values even though it only contains FALSE/TRUE values. However, it correctly recognises that Age has no missing values. Why does it count FALSE as a missing value when I have not defined it as such in my list?
This happens because False == 0 in Python (bool is a subclass of int), so isin matches the boolean False against the 0 in your list. One idea is to exclude boolean columns with DataFrame.select_dtypes:
null_values = ["nan", 0, "missing"]
df1 = df.select_dtypes(exclude=bool)
print (df1.columns[df1.isin(null_values).any()])
Index([], dtype='object')
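A quick demonstration of why the boolean column matches in the first place (not part of the original answer):

import pandas as pd

print(False == 0)                           # True: bool is a subclass of int
print(pd.Series([True, False]).isin([0]))   # the False entry matches the 0 in the list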
If you also need to treat NaN as a missing value, add it to the list:
null_values = ["nan", 0, "missing", np.nan]
df1 = df.select_dtypes(exclude=bool)
print (df1.columns[df1.isin(null_values).any()])
Index(['Name'], dtype='object')
EDIT: Another trick is to convert all values to strings; then you need to compare against '0' for strings coming from integers and '0.0' for strings coming from floats:
null_values = ["nan", '0', '0.0', "missing"]
print (df.columns[df.astype(str).isin(null_values).any()])
Index(['Name'], dtype='object')
Convert the NaN values to zero
Add a row called diff with the difference between the minimum and maximum value in each column. Try solving it using a lambda function.
Add a column called diff with the difference between minimum and maximum value in each row.
The final df should look like df_final shown below
df = pd.DataFrame({'val1':[9,15,71,9,5], 'val2': [8,31,10, 14,np.nan]})
df
df_final = pd.DataFrame({'diff': {0: 1.0, 1: -16.0, 2: 61.0, 3: -5.0, 4: 5.0, 'diff': 35.0}, 'val1': {0: 9.0, 1: 15.0, 2: 71.0, 3: 9.0, 4: 5.0, 'diff': 66.0}, 'val2': {0: 8.0, 1: 31.0, 2: 10.0, 3: 14.0, 4: 0.0, 'diff': 31.0}})
df_final
Now I want to take the difference over the rows of column 'val1' and then of 'val2', and afterwards create a new row below showing the results (the differences). (If possible, suggest how I can do it using a lambda function.)
IIUC:
df.fillna(0, inplace=True)  # replace NaN with zero
df['diff'] = df.val1.sub(df.val2)  # subtract the two vals
# df.loc[:, 'diff'] = df.apply(lambda x: x.max() - x.min(), axis=1)  # if you had more columns and needed the difference between max and min across columns
df.loc['diff', :] = df.T.apply(lambda x: x.max() - x.min(), axis=1)  # difference between max and min in each column
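For completeness, a minimal self-contained sketch that does both steps with lambdas (assuming NaN is filled first, so the diff row also covers the new diff column):

import numpy as np
import pandas as pd

df = pd.DataFrame({'val1': [9, 15, 71, 9, 5],
                   'val2': [8, 31, 10, 14, np.nan]}).fillna(0)

# column 'diff': per-row difference val1 - val2, written as a lambda
df['diff'] = df.apply(lambda r: r['val1'] - r['val2'], axis=1)

# row 'diff': max - min of every column, also as a lambda
df.loc['diff'] = df.apply(lambda c: c.max() - c.min(), axis=0)
print(df)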
I would like to return the first non null value of the utm_source column from each group after running a group by function.
This is the code I have written:
file[file['steps'] == 'Sign-ups'].sort_values(by=['ts']).groupby('anonymous_id')['utm_source'].apply(lambda x: x.first_valid_index())
This seems to return this:
anonymous_id
00003df1-be12-47b8-b3b8-d01c84a22fdf NaN
00009cc0-279f-4ccf-aea4-f6af1f2bb75a NaN
0000a6a0-00bc-475f-a9e5-9dcbb4309e78 NaN
0000c906-7060-4521-8090-9cd600b08974 638.0
0000c924-5959-4e2d-8757-0d10f96ca462 NaN
0000dc27-292c-4676-8a1b-4977f2ad1577 275.0
0000df7e-2579-4071-8aa5-814ab294bf9a 419.0
I am not quite sure what the values associated with the anon_id's are.
Here is a sample of my data:
{'anonymous_id': {0: '0000f8ea-3aa6-4423-9247-1d9580d378e1',
1: '00015d49-2cd8-41b1-bbe7-6aedbefdb098',
2: '0002226e-26a4-4f55-9578-2eff2999de7e',
3: '00022b83-240e-4ef9-aaad-ac84064bb902',
4: '00022b83-240e-4ef9-aaad-ac84064bb902'},
'ts': {0: '2018-04-11 06:59:20.206000',
1: '2019-05-18 05:59:11.874000',
2: '2018-09-10 18:19:25.260000',
3: '2017-10-11 08:20:18.092000',
4: '2017-10-11 08:20:31.466000'},
'utm_source': {0: nan, 1: 'facebook', 2: 'facebook', 3: nan, 4: nan},
'rank': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2},
'steps': {0: 'Sign-ups', 1: nan, 2: nan, 3: nan, 4: nan}}
So for each anonymous_id I would return the first (chronological, sorted by the ts column) utm_source associated with the anon_id
IIUC you can first drop the null values and then groupby first:
df.sort_values('ts').dropna(subset=['utm_source']).groupby('anonymous_id')['utm_source'].first()
Output for your example data:
anonymous_id
00015d49-2cd8-41b1-bbe7-6aedbefdb098 facebook
0002226e-26a4-4f55-9578-2eff2999de7e facebook
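Note that dropna removes any anonymous_id whose utm_source is entirely NaN; if you prefer to keep those ids with a NaN result, GroupBy.first already skips nulls on its own:

df.sort_values('ts').groupby('anonymous_id')['utm_source'].first()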
In df1, each cell value is the index of the row I want from df2.
I would like to grab the value in df2's trial_ms column for that row, and then rename the column in df1 based on the df2 column that was grabbed.
Reproducible DF's:
# df1
import numpy as np
import pandas as pd

nan = np.nan
df1 = {'n1': {0: 1, 1: 2, 2: 8, 3: 2, 4: 8, 5: 8},
'n2': {0: nan, 1: 3.0, 2: 9.0, 3: nan, 4: 9.0, 5: nan},
'n3': {0: nan, 1: nan, 2: 10.0, 3: nan, 4: nan, 5: nan}}
df1 = pd.DataFrame().from_dict(df1)
# df2
df2 = {
'trial_ms': {1: -18963961, 2: 31992270, 3: -13028311},
'user_entries_error_no': {1: 2, 2: 6, 3: 2},
'user_entries_plybs': {1: 3, 2: 3, 3: 2},
'user_id': {1: 'seb', 2: 'seb', 3: 'seb'}}
df2 = pd.DataFrame().from_dict(df2)
Expected Output:
n1_trial_ms    n2_trial_ms    n3_trial_ms
31992270 NaN NaN
-13028311 -18934961 NaN
etc.
Attempt:
for index, row in ch.iterrows():
    print(row)
    b = df1.iloc[row]['trial_ms']
Gives me the error:
IndexError: positional indexers are out-of-bounds
I believe you need a dictionary built from the trial_ms column: its keys are the index of df2, and the values of df1 are replaced via dict.get, so anything without a match gets the missing value NaN:
d = df2['trial_ms'].to_dict()
df3 = df1.applymap(lambda x: d.get(x, np.nan)).add_suffix('_trial_ms')
print (df3)
n1_trial_ms n2_trial_ms n3_trial_ms
0 -18963961.0 NaN NaN
1 31992270.0 -13028311.0 NaN
2 NaN NaN NaN
3 31992270.0 NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
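If applymap warns or is unavailable in your pandas version (newer releases rename it to DataFrame.map), an equivalent sketch maps each column through the same dictionary; Series.map turns unmatched values into NaN automatically:

df3 = df1.apply(lambda col: col.map(d)).add_suffix('_trial_ms')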