I have a folder of parquet files that won't fit in memory, so I am using Dask to perform the data-cleansing operations. I have a function where I want to perform item assignment, but I can't find any solutions online that apply to this particular function. Below is the function that works in pandas. How do I get the same results with a Dask dataframe? I thought delayed might help, but none of the solutions I've tried to write have worked.
def item_assignment(df):
    new_col = np.bitwise_and(df['OtherCol'], 0b110)
    df['NewCol'] = 0
    df.loc[new_col == 0b010, 'NewCol'] = 1
    df.loc[new_col == 0b100, 'NewCol'] = -1
    return df
Running it on a Dask dataframe raises:
TypeError: '_LocIndexer' object does not support item assignment
You can replace your loc assignments with dask.dataframe.Series.mask:
df['NewCol'] = 0
df['NewCol'] = df['NewCol'].mask(new_col == 0b010, 1)
df['NewCol'] = df['NewCol'].mask(new_col == 0b100, -1)
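Putting it together, here is a minimal runnable sketch of the whole function rewritten with mask (the name item_assignment_dask and the three-row sample frame are mine):
import numpy as np
import pandas as pd
import dask.dataframe as dd

def item_assignment_dask(df):
    # same bitmask as before; np.bitwise_and dispatches fine on a dask Series
    new_col = np.bitwise_and(df['OtherCol'], 0b110)
    df['NewCol'] = 0
    df['NewCol'] = df['NewCol'].mask(new_col == 0b010, 1)
    df['NewCol'] = df['NewCol'].mask(new_col == 0b100, -1)
    return df

ddf = dd.from_pandas(pd.DataFrame({'OtherCol': [0b010, 0b110, 0b100]}), npartitions=2)
print(item_assignment_dask(ddf).compute())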
You can use map_partitions in this case, which lets you use raw pandas functionality. For example:
ddf.map_partitions(item_assignment)
This operates on the individual pandas dataframes that make up the dask dataframe:
df = pd.DataFrame({"OtherCol":[0b010, 0b110, 0b100, 0b110, 0b100, 0b010]})
ddf = dd.from_pandas(df, npartitions=2)
ddf.map_partitions(item_assignment).compute()
And we see the result as expected:
   OtherCol  NewCol
0         2       1
1         6       0
2         4      -1
3         6       0
4         4      -1
5         2       1
Related
I want to write a function that takes a dataframe and removes some of its rows:
import pandas as pd

a = pd.DataFrame([1, 2, 3, 3, 5])

def f(df):
    df = df[(df > 2)]
    print(df)

f(a)
print(a)
This outputs
   0
2  3
3  3
4  5
   0
0  1
1  2
2  3
3  3
4  5
So a was not updated here. Is this because the df inside the function body is actually a copy of a? If so, how can I rewrite the code to access the genuine dataframe inside the function? More generally, there are various other things I would like to do to dataframes within functions that require updating the objects passed as inputs to the function, so is there a general solution to do this?
I think this is what you're really asking: Pandas best way to subset a dataframe inplace, using a mask, i.e. how to filter in place.
So in your case it would be (I think):
import pandas as pd

a = pd.DataFrame([1, 2, 3, 3, 5])

def f(df):
    # df = df[(df > 2)]
    # drop the index of the rows that do NOT satisfy the filter, in place
    df.drop(df[df[0] <= 2].index, inplace=True)
    print(df)

f(a)
print(a)
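With this change both print calls show the same filtered frame, since a itself is modified. The expected output of each print (assuming pandas' default formatting):
   0
2  3
3  3
4  5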
You can also modify the global variable a:
import pandas as pd

a = pd.DataFrame([1, 2, 3, 3, 5])

def f(df_name):
    globals()[df_name] = eval(f"{df_name}[({df_name} > 2)]")
    print(globals()[df_name])

f("a")
Now the dataframe named "a" will be modified.
Please note that the argument of the function is the string "a", not the variable a itself.
This works, but it is not recommended, because the code relies on globals and eval (read more on why: Why is Global State so Evil?).
The easy way is:
def f(df):
    return df[(df > 2)]

a = f(a)
The function simply returns the changed dataframe.
I am trying to apply a function to multiple columns and, in turn, create multiple columns that count the length of each entry.
Basically I have 5 columns with indexes 5, 7, 9, 13 and 15, and each entry in those columns is a string of the form 'WrappedArray(|2008-11-12, |2008-11-12)'. In my function I try to strip the WrappedArray part, split the two values, and count the (length - 1), using the following:
def updates(row, num_col):
    strp = row[num_col].strip('WrappedAway')
    lis = list(strp.split(','))
    return len(lis) - 1
where num_col is the index of the column and can take the values 5, 7, 9, 13 and 15.
I have done this but only for 1 column:
fn = lambda row: updates(row,5)
col = df.apply(fn, axis=1)
df = df.assign(**{'count1':col.values})
I basically want to apply this function to ALL the columns with the indexes mentioned (not just 5 as above), and then create a separate column associated with each of columns 5, 7, 9, 13 and 15, all in compact code instead of doing it separately for each value.
I hope I made sense.
As regards finding the number of elements in the list, it looks like you could simply use str.count() to find the number of ',' in the strings. And in order to apply a defined function to a set of columns you could do something like:
cols = [5, 7, 9, 13, 15]
for col in cols:
    col_counts = {'{}_count'.format(col): df.iloc[:, col].apply(lambda x: x.count(','))}
    df = df.assign(**col_counts)
Alternatively, you can also use strip('WrappedAway').split(',') as you were using:
def count_elements(x):
    return len(x.strip('WrappedAway').split(',')) - 1

for col in cols:
    col_counts = {'{}_count'.format(col): df.iloc[:, col].apply(count_elements)}
    df = df.assign(**col_counts)
So for example with the following dataframe:
df = pd.DataFrame({'A': ['WrappedArray(|2008-11-12, |2008-11-12, |2008-10-11)', 'WrappedArray(|2008-11-12, |2008-11-12)'],
                   'B': ['WrappedArray(|2008-11-12,|2008-11-12)', 'WrappedArray(|2008-11-12, |2008-11-12)'],
                   'C': ['WrappedArray(|2008-11-12|2008-11-12)', 'WrappedArray(|2008-11-12|2008-11-12)']})
Redefining the set of columns on which we want to count the number of elements:
for col in [0, 1, 2]:
    col_counts = {'{}_count'.format(col): df.iloc[:, col].apply(count_elements)}
    df = df.assign(**col_counts)
Would yield:
                                                   A  \
0  WrappedArray(|2008-11-12, |2008-11-12, |2008-1...
1            WrappedArray(|2008-11-12, |2008-11-12)

                                         B  \
0   WrappedArray(|2008-11-12,|2008-11-12)
1  WrappedArray(|2008-11-12, |2008-11-12)

                                       C  0_count  1_count  2_count
0  WrappedArray(|2008-11-12|2008-11-12)        2        1        0
1  WrappedArray(|2008-11-12|2008-11-12)        1        1        0
You are confusing row-wise and column-wise operations by trying to do both in one function. Choose one or the other. Column-wise operations are usually more efficient and you can utilize Pandas str methods.
Setup
df = pd.DataFrame({'A': ['WrappedArray(|2008-11-12, |2008-11-12, |2008-10-11)', 'WrappedArray(|2008-11-12, |2008-11-12)'],
                   'B': ['WrappedArray(|2008-11-12,|2008-11-12)', 'WrappedArray(|2008-11-12|2008-11-12)']})
Logic
# perform operations on strings in a series
def calc_length(series):
    return series.str.strip('WrappedAway').str.split(',').str.len() - 1

# apply to each column and join to original dataframe
df = df.join(df.apply(calc_length).add_suffix('_Length'))
Result
print(df)
                                                   A  \
0  WrappedArray(|2008-11-12, |2008-11-12, |2008-1...
1            WrappedArray(|2008-11-12, |2008-11-12)

                                        B  A_Length  B_Length
0  WrappedArray(|2008-11-12,|2008-11-12)         2         1
1   WrappedArray(|2008-11-12|2008-11-12)         1         0
I think we can use pandas str.count():
df = pd.DataFrame({
    "col1": ['WrappedArray(|2008-11-12, |2008-11-12)',
             'WrappedArray(|2018-11-12, |2017-11-12, |2018-11-12)'],
    "col2": ['WrappedArray(|2008-11-12, |2008-11-12,|2008-11-12,|2008-11-12)',
             'WrappedArray(|2018-11-12, |2017-11-12, |2018-11-12)']})
df["col1"].str.count(',')
pandas DataFrame I'm starting with:
pandas DataFrame I'm trying to build:
I'm very new to computer science, so I wasn't quite sure how to word my question without providing images. Basically, I want to build a pandas DataFrame with one row whose column names run from -3 to 3, where the value under each column name is the maximum absolute value of the second column of the first DataFrame, taken over the rows whose first-column value equals that column name.
I also have the same data in a list, as shown here:
Here is what I've tried but I keep getting an error:
Here's a solution; looping over the dataframe to get what you want seems like overkill.
import pandas as pd

df = pd.DataFrame([[-1,1],[-2,2],[-2,1],[-2,2],[-1,6],[-1,2],[-1,1],[1,-2],[2,-2],[1,-2],[2,-1],[6,-1],[2,-1],[1,-1]])
y = df.groupby(0)[1].max()  # per-key maximum of the second column
z = df.groupby(0)[1].min()  # per-key minimum of the second column
x = dict()
for i in range(-3, 4):
    try:
        # keep whichever extreme has the larger absolute value
        if abs(z[i]) > abs(y[i]):
            x[i] = z[i]
        else:
            x[i] = y[i]
    except KeyError:
        # keys that never appear in column 0 get 0
        x[i] = 0
x = pd.DataFrame(x, index=[0])
which gives the result
   -3  -2  -1  0  1  2  3
0   0   2   6  0 -2 -2  0
This results in a dataframe that also has a column for 0; that should be easy to get rid of at any point.
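Dropping it afterwards is a one-liner, a sketch assuming the x built above:
x = x.drop(columns=[0])  # remove the bucket for key 0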
I need to fill a pandas dataframe column with empty numpy arrays; that is, every row has to contain an empty array. Something like
df['ColumnName'] = np.empty(0, dtype=float)
but this doesn't work, because pandas tries to use each value of the array and assign one value per row.
I then tried
for k in range(len(df)):
    df['ColumnName'].iloc[k] = np.empty(0, dtype=float)
but still no luck. Any advice?
You can repeat the np.empty array for the number of rows and then assign that list to the column. Since an array isn't a scalar, it can't be directly assigned the way df['x'] = some_scalar is.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2]})
df['c'] = [np.empty(0, dtype=float)] * len(df)
Output:
a c
0 0 []
1 1 []
2 2 []
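One caveat: the * len(df) trick makes every row reference the same array object, so an in-place mutation of one would show up in all rows. If that matters, a comprehension (a sketch) creates an independent array per row:
df['c'] = [np.empty(0, dtype=float) for _ in range(len(df))]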
You can also use a simple list comprehension with plain Python lists:
df = pd.DataFrame({'a':[0,1,2]})
df['c'] = [[] for i in range(len(df))]
Output
a c
0 0 []
1 1 []
2 2 []
So I'm working with Pandas and I have multiple words (i.e. strings) in one cell, and I need to put every word into a new row while keeping the coordinated data. I've found a method which could help me, but it works with numbers, not strings.
So what method do I need to use?
Simple example of my table:
id  name      method
1   adenosis  mammography, mri
And I need it to be:
id  name      method
1   adenosis  mammography
              mri
Thanks!
UPDATE:
This is what I'm trying to do, following @jezrael's proposal:
import pandas as pd
import numpy as np
xl = pd.ExcelFile("./dev/eyetoai/google_form_pure.xlsx")
xl.sheet_names
df = xl.parse("Form Responses 1")
df.groupby(['Name of condition','Condition description','Relevant Modality','Type of finding Mammography', 'Type of finding MRI', 'Type of finding US']).mean()
splitted = df['Relevant Modality'].str.split(',')
l = splitted.str.len()
df = pd.DataFrame({col: np.repeat(df[col], l) for col in ['Name of condition','Condition description']})
df['Relevant Modality'] = np.concatenate(splitted)
But I have this type of error:
TypeError: repeat() takes exactly 2 arguments (3 given)
You can use read_excel + split + stack + drop + join + reset_index:
#define columns which need to be split by ',' and then flattened
cols = ['Condition description','Relevant Modality']
#read the Excel file into a dataframe
df = pd.read_excel('Untitled 1.xlsx')
#print (df)
df1 = pd.DataFrame({col: df[col].str.split(',', expand=True).stack() for col in cols})
print (df1)
Condition description Relevant Modality
0 0 Fibroadenomas are the most common cause of a b... Mammography
1 NaN US
2 NaN MRI
1 0 Papillomas are benign neoplasms Mammography
1 arising in a duct US
2 either centrally or peripherally within the b... MRI
3 leading to a nipple discharge. As they are of... NaN
4 the discharge may be bloodstained. NaN
2 0 OK Mammography
3 0 breast cancer Mammography
1 NaN US
4 0 breast inflammation Mammography
1 NaN US
#remove original columns
df = df.drop(cols, axis=1)
#create Multiindex in original df for align rows
df.index = [df.index, [0]* len(df.index)]
#join original to flattened columns, remove Multiindex
df = df1.join(df).reset_index(drop=True)
#print (df)
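And a minimal sketch of the same split + stack idea applied to the small id/name/method table from the question (out is my name for the result):
import pandas as pd

df = pd.DataFrame({'id': [1], 'name': ['adenosis'], 'method': ['mammography, mri']})
s = df['method'].str.split(',', expand=True).stack().str.strip()
s.index = s.index.droplevel(-1)  # drop the inner level so it aligns with df
s.name = 'method'
out = df.drop('method', axis=1).join(s)
# out now has one row per method, with id and name repeated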
The previous answer is correct; I think you should use the id as the reference.
An easier way could possibly be to just parse the method string into a list:
method_list = method.split(',')
method_list = np.asarray(method_list)
If you have any trouble with indexing when initializing your DataFrame, just set the index explicitly:
pd.DataFrame(data, index=[0, 0])
df.set_index('id')
Passing the list as the value for your method key will automatically create a copy of both 'id' and 'name' for each element:
id method name
1 mammography adenosis
1 mri adenosis
I hope this helps, all the best
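For completeness, a sketch of what this answer hints at, using the one-row example from the question; with index=[0, 0] the scalar 'id' and 'name' values broadcast to both rows:
import pandas as pd

method_list = 'mammography, mri'.split(', ')
df = pd.DataFrame({'id': 1, 'name': 'adenosis', 'method': method_list}, index=[0, 0])
print(df.set_index('id'))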