Change column variable values depending on condition in dask dataframes - python

This question follows on from applying a lambda function to a dask dataframe. I am looking for a solution that does not require converting back to a pandas dataframe. The reason is that I have a larger-than-memory dataframe, so loading it fully into memory, as pandas does, will not work (pandas is great when the data fits in memory).
The solution to the linked question is below.
df = pd.DataFrame({'A':['ant','ant','cherry', 'bee', 'ant'], 'B':['cat','peach', 'cat', 'cat', 'peach'], 'C':['dog','dog','roo', 'emu', 'emu']}) # How can this sort of format be read directly into a dask dataframe?
ddf = dd.from_pandas(df, npartitions=2)  # dask conversion
list1 = ['A','B','C']  # list1 of header names
for c in list1:
    vc = ddf[c].value_counts().compute()
    vc /= vc.sum()
    print(vc)  # a table with the proportion of each unique value
    for i in range(vc.count()):
        if vc[i] < 0.5:  # checks whether the value has a proportion of less than 0.5
            ddf[c] = ddf[c].where(ddf[c] != vc.index[i], 'others')  # changes such values to 'others' (iterates through all columns in list1)
print(ddf.compute())  # shows how the changes have been applied column by column
However, the second for loop takes a very long time to compute on the actual (larger-than-memory) dataframe. Is there a more efficient way of getting the same output using dask?
The objective of the code is to change a column value to 'others' for labels that appear less than 50% of the time in that column. For example, if the value ant appears less than 50% of the time in a column, it is renamed to others.
Would anyone be able to help me with this regard.
Thanks
Michael

Here is a way to skip your nested loop:
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'A':['ant','ant','cherry', 'bee', 'ant'],
                   'B':['cat','peach', 'cat', 'cat', 'peach'],
                   'C':['dog','dog','roo', 'emu', 'emu']})
ddf = dd.from_pandas(df, npartitions=2)
l = len(ddf)
for col in ddf.columns:
    vc = ddf[col].value_counts() / l
    vc = vc[vc > .5].index.compute()
    ddf[col] = ddf[col].where(ddf[col].isin(vc), "other")
ddf = ddf.compute()
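For reference, on the five-row sample frame this should keep only the values that account for more than half of a column ('ant' in A, 'cat' in B) and map everything else, including all of column C, to "other":
print(ddf)
#        A      B      C
# 0    ant    cat  other
# 1    ant  other  other
# 2  other    cat  other
# 3  other    cat  other
# 4    ant  other  other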
If you have a really big dataframe stored in parquet format, you can try reading it column by column, saving each result to a separate file, and concatenating the files horizontally at the end.
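A minimal sketch of that column-by-column approach, assuming the data lives in a file called data.parquet and reusing the threshold logic above (the file names here are placeholders):
import dask.dataframe as dd

source = "data.parquet"    # hypothetical input file
for col in ["A", "B", "C"]:
    # read a single column at a time to keep memory usage low
    s = dd.read_parquet(source, columns=[col])
    vc = s[col].value_counts() / len(s)
    keep = vc[vc > .5].index.compute()
    s[col] = s[col].where(s[col].isin(keep), "other")
    s.to_parquet(f"out_{col}.parquet")  # one output file per column
# the per-column outputs can then be read back and concatenated along axis=1,
# e.g. dd.concat([...], axis=1), provided the row order is preserved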

Related

Concatenate two pandas dataframe and follow a sequence of uid

I have a pandas dataframe with the following data: (in csv)
#list1
poke_id,symbol
0,BTC
1,ETB
2,USDC
#list2
5,SOL
6,XRP
I am able to concatenate them into one dataframe using the following code:
df = pd.concat([df1, df2], ignore_index = True)
df = df.reset_index(drop = True)
df['poke_id'] = df.index
df = df[['poke_id','symbol']]
which gives me the output: (in csv)
poke_id,symbol
0,BTC
1,ETB
2,USDC
3,SOL
4,XRP
Is there any other way to do the same? I think re-indexing the whole dataframe of ~4000 entries just to add ~100 more seems a little pointless and cumbersome. How can I make it pick the highest poke_id from list 1 (dataframe 1) and simply continue with i + 1 for the later entries in list 2?
Your solution is good; it is possible to simplify it:
df = pd.concat([df1, df2], ignore_index = True).rename_axis('poke_id').reset_index()
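If you would rather not touch the existing ~4000 rows at all, here is a small sketch of the offset idea described in the question (assuming df1 already has a valid poke_id column and df2 only contributes new symbols):
start = df1['poke_id'].max() + 1                  # continue from the highest existing id
new = df2[['symbol']].copy()
new['poke_id'] = range(start, start + len(new))
df = pd.concat([df1, new[['poke_id', 'symbol']]], ignore_index=True)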
You can also use indexes to pull specific data out of the dataframe; this is not efficient if you need large amounts of data, but it lets you select exactly the rows you want.

Dask calculate groupby rolling mean over the last n days and assign to original dataframe

I'm trying to replicate the pandas groupby rolling-mean logic below in dask, but I am stuck on 1) how to specify the time period in days and 2) how to assign the result back to the original frame.
df['avg3d']=df.groupby('g')['v'].transform(lambda x: x.rolling('3D').mean())
Get errors like:
ValueError: index must be monotonic, ValueError: Not all divisions are known, can't align partitions
or ValueError: cannot reindex from a duplicate axis
Full example
import pandas as pd
import dask.dataframe
df1 = pd.DataFrame({'g':['a']*10,'v':range(10)},index=pd.date_range('2020-01-01',periods=10))
df2=df1.copy()
df2['g']='b'
df = pd.concat([df1,df2]).sort_index()
df['avg3d']=df.groupby('g')['v'].transform(lambda x: x.rolling('3D').mean())
ddf = dask.dataframe.from_pandas(df, npartitions=4)
# works
ddf.groupby('g')['v'].apply(lambda x: x.rolling(3).mean(), meta=('avg3d', 'f8')).compute()
# rolling time period fails
ddf.groupby('g')['v'].apply(lambda x: x.rolling('3D').mean(), meta=('avg3d', 'f8')).compute()
# how do I add it to the rest of the data??
# neither of these work
ddf['avg3d']=ddf.groupby('g')['v'].apply(lambda x: x.rolling('3D').mean(), meta=('x', 'f8'))
ddf['avg3d']=ddf.groupby('g')['v'].transform(lambda x: x.rolling(3).mean(), meta=('x', 'f8'))
ddft = ddf.merge(ddf3d)
ddf.assign(avg3d=ddf.groupby('g')['v'].transform(lambda x: x.rolling(3).mean(), meta=('x', 'f8')))
Looked at
dask groupby apply then merge back to dataframe
Dask rolling function by group syntax
Compute the rolling mean over the last n days in Dask
ValueError: Not all divisions are known, can't align partitions error on dask dataframe
This problem arises due to the current implementation of .groupby in dask. The answer below is not a complete solution, but will hopefully explain why the error is happening.
First, let's make sure we get a true_result against which we can compare the dask results:
import dask.dataframe
import pandas as pd
df1 = pd.DataFrame(
    {"g": ["a"] * 10, "v": range(10)}, index=pd.date_range("2020-01-01", periods=10)
)
df = pd.concat([df1, df1.assign(g="b")]).sort_index()
df["avg3d"] = df.groupby("g")["v"].transform(lambda x: x.rolling("3D").mean())
true_result = df["avg3d"].array
Now, running the code that is commented with #works is going to generate different values every time, even though the data or computations do not have a source of randomness:
ddf = dask.dataframe.from_pandas(df, npartitions=4)
# this doesn't work
dask_result_1 = ddf.groupby("g")["v"].apply(
    lambda x: x.rolling(3).mean(), meta=("avg3d", "f8")
).compute().array
# this will fail, every time for a different reason
assert all(dask_result_1 == true_result)
Why is this happening? Well, under the hood, dask will want to shuffle data around to make sure that all the values of the groupby variable are in a single partition. This shuffling seems to be random, so when the values are stitched back together they can be out of original order.
So a quick way to fix this is to add sorting before rolling computation:
# rolling time period works
avg3d_dask = (
    ddf.groupby("g")["v"]
    .apply(lambda x: x.sort_index().rolling("3D").mean(), meta=("avg3d", "f8"))
    .compute()
    .droplevel(0)
    .sort_index()
)
# this will always pass
assert all(avg3d_dask == true_result)
Now, how do we add this to the original dataframe? I don't know a simple way of doing this, but one of the hard ways would be to calculate the partitions of the original dask dataframe, split the computed data into matching chunks, and assign them. That approach is not very robust (or at least requires a lot of use-case-specific fine-tuning), so hopefully someone can provide a better solution for this part.
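One possible workaround, sketched here under the assumption that the computed result fits in memory: rely on the positional alignment that the assert above demonstrates, assign the values back to the pandas frame, and rebuild the dask dataframe.
# avg3d_dask lines up positionally with df (see the assert above),
# so assign by position and convert back to dask
df["avg3d"] = avg3d_dask.to_numpy()
ddf = dask.dataframe.from_pandas(df, npartitions=4)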

Pandas "A value is trying to be set on a copy of a slice from a DataFrame"

Having a bit of trouble understanding the documentation
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
dfbreed['x'] = dfbreed.apply(testbreed, axis=1)
C:/Users/erasmuss/PycharmProjects/Sarah/farmdata.py:38: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
The code is basically to re-arrange and clean some data to make analysis easier.
The data is given row by row per animal, but has repetitions, blanks, and some other sparse values.
The idea is to basically stack rows into columns and grab the useful data (weight by date and final BCS) per animal.
[Image: initial DF, a few snippets of the dataframe]
[Image: output format, the output DF/csv]
import pandas as pd
import numpy as np
#Function for cleaning up multiple entries of breeds
def testbreed(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]
#Read Data
df1 = pd.read_csv("farmdata.csv")
#Drop empty rows
df1.dropna(how='all', axis=1, inplace=True)
#Copy to extract Weights in DF2
df2 = df1.copy()
df2 = df2.drop(['BCS', 'Breed','Age'], axis=1)
#Pivot for ID names in DF1
df1 = df1.pivot(index='ID', columns='Date', values=['Breed','Weight', 'BCS'])
#Pivot for weights in DF2
df2 = df2.pivot(index='ID', columns='Date', values = 'Weight')
#Split out Breeds and BCS into individual dataframes w/Duplicate/missing data for each ID
df3 = df1.copy()
dfbreed = df3[['Breed']]
dfBCS = df3[['BCS']]
#Drop empty BCS columns
df1.dropna(how='all', axis=1, inplace=True)
#Shorten Breed and BCS to single Column by grabbing first value that is real. see function above
dfbreed['x'] = dfbreed.apply(testbreed, axis=1)
dfBCS['x'] = dfBCS.apply(testbreed, axis=1)
#Populate BCS and Breed into new DF
df5= pd.DataFrame(data=None)
df5['Breed'] = dfbreed['x']
df5['BCS'] = dfBCS['x']
#Join Weights
df5 = df5.join(df2)
#Write output
df5.to_csv(r'.\out1.csv')
I want to take the BCS and Breed dataframes, which are column multi-indexed by Breed or BCS and then by date, grab the first non-NaN value across the date columns in each row, and set that into a single column named Breed.
I had a lot of trouble getting the columns to pick the first valid values in situ on the DF.
I found a work-around with a 2015 answer:
2015 Answer
which defined the function at the top.
Reading through the docs on setting a value on a copy of a slice makes sense intuitively,
but I can't seem to think of a way to make it work as a direct replacement or index-based assignment.
Should I be looping through?
Trying the second answer here,
I get
dfbreed.loc[:,'Breed'] = dfbreed['Breed'].apply(testbreed, axis=1)
dfBCS.loc[:, 'BCS'] = dfBCS.apply['BCS'](testbreed, axis=1)
which returns
ValueError: Must have equal len keys and value when setting with an iterable
I'm thinking this has something to do with the multi-index
keys come up as:
MultiIndex([('Breed', '1/28/2021'),
('Breed', '2/12/2021'),
('Breed', '2/4/2021'),
('Breed', '3/18/2021'),
('Breed', '7/30/2021')],
names=[None, 'Date'])
MultiIndex([('BCS', '1/28/2021'),
('BCS', '2/12/2021'),
('BCS', '2/4/2021'),
('BCS', '3/18/2021'),
('BCS', '7/30/2021')],
names=[None, 'Date'])
Sorry for the long question(s?)
Can anyone help me out?
Thanks.
You created dfbreed as:
dfbreed = df3[['Breed']]
So it is a view of the original DataFrame (limited to just this one column).
Remember that a view does not have its own data buffer; it is only a tool to "view"
a fragment of the original DataFrame, with read-only access.
When you attempt to perform dfbreed['x'] = dfbreed.apply(...), you are
actually attempting to violate that read-only access.
To avoid this error, create dfbreed as an "independent" DataFrame:
dfbreed = df3[['Breed']].copy()
Now dfbreed has its own data buffer and you are free to change the data.
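A minimal sketch of the fix in context, assuming df3 is the pivoted frame from the question:
# independent copies, so the assignments below do not touch df3
dfbreed = df3[['Breed']].copy()
dfBCS = df3[['BCS']].copy()
# these assignments no longer raise SettingWithCopyWarning
dfbreed['x'] = dfbreed.apply(testbreed, axis=1)
dfBCS['x'] = dfBCS.apply(testbreed, axis=1)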

Create a dictionary from pandas empty dataframe with only column names

I have a pandas data frame with only two column names (a single row, which can also be considered the headers). I want to make a dictionary out of this, with the first column name as the key and the second as the value. I already tried the
to_dict() method, but it doesn't work because the dataframe is empty.
Example
df=|Land |Norway| to {'Land': Norway}
I can change the pandas data frame to some other type and find my way around it, but this question is mostly to learn the best/different/efficient approach for this problem.
For now I have this as the solution :
dict(zip(a.iloc[0:0,0:1],a.iloc[0:0,1:2]))
Is there any other way to do this?
Here's a simple way: convert the columns to a list and the list to a dictionary.
def list_to_dict(a):
    it = iter(a)
    ret_dict = dict(zip(it, it))
    return ret_dict

df = pd.DataFrame([], columns=['Land', 'Norway'])
dict_val = list_to_dict(df.columns.to_list())
dict_val  # {'Land': 'Norway'}
Very manual solution
df = pd.DataFrame(columns=['Land', 'Norway'])
df = pd.DataFrame({df.columns[0]: df.columns[1]}, index=[0])
If you have any number of columns and you want each sequential pair to have this transformation, try:
df = pd.DataFrame(dict(zip(df.columns[::2], df.columns[1::2])), index=[0])
Note: You will get an error if your DataFrame does not have at least two columns.
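For example, with extra (made-up) column names to illustrate the sequential pairing:
df = pd.DataFrame(columns=['Land', 'Norway', 'Capital', 'Oslo'])
pd.DataFrame(dict(zip(df.columns[::2], df.columns[1::2])), index=[0])
#      Land Capital
# 0  Norway    Oslo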

Pandas: Use iterrows on Dataframe subset

What is the best way to do iterrows with a subset of a DataFrame?
Let's take the following simple example:
import datetime as DT
import pandas as pd

df = pd.DataFrame({
    'Product': list('AAAABBAA'),
    'Quantity': [5, 2, 5, 10, 1, 5, 2, 3],
    'Start': [
        DT.datetime(2013,1,1,9,0),
        DT.datetime(2013,1,1,8,5),
        DT.datetime(2013,2,5,14,0),
        DT.datetime(2013,2,5,16,0),
        DT.datetime(2013,2,8,20,0),
        DT.datetime(2013,2,8,16,50),
        DT.datetime(2013,2,8,7,0),
        DT.datetime(2013,7,4,8,0)]})
df = df.set_index(['Start'])
Now I would like to modify a subset of this DataFrame using the iterrows function, e.g.:
for i, row_i in df[df.Product == 'A'].iterrows():
    row_i['Product'] = 'A1'  # actually a more complex calculation
However, the changes do not persist.
Is there any possibility (except a manual lookup using the index 'i') to make persistent changes on the original Dataframe ?
Why do you need iterrows() for this? I think it's always preferable to use vectorized operations in pandas (or numpy):
df.loc[df['Product'] == 'A', "Product"] = 'A1'
I guess the best way that comes to my mind is to generate a new vector with the desired result, where you can loop all you want and then reassign it back to the column
#make a copy of the column
P = df.Product.copy()
#do the operation or loop if you really must
P[ P=="A" ] = "A1"
#reassign to original df
df["Product"] = P
