DASK: Replace infinite (inf) values in single column - python

I have a dask dataframe in which a few inf values appear. I wish to replace these on a per-column basis, because where inf exists I can replace it with a value appropriate to the upper bound that can be expected for that column.
I'm having some trouble understanding the documentation, or rather translating it into something I can use to replace infinite values.
What I have been trying is roughly the below, replacing inf with 1000 - however the inf values seem to remain in place, unchanged.
Any advice on how to do this would be excellent. Because this is a huge dataframe (10m rows, 40 cols) I'd prefer to do it in a fashion that doesn't use lambdas or loops - which the below should basically achieve, but doesn't.
ddf['mycolumn'].replace(np.inf,1000)

Following @Enzo's comment, make sure you are assigning the replaced values back to the original column:
import numpy as np
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame([1, 2, np.inf], columns=['a'])
ddf = dd.from_pandas(df, npartitions=2)
ddf['a'] = ddf['a'].replace(np.inf, 1000)
# check results with: ddf.compute()
# a
# 0 1.0
# 1 2.0
# 2 1000.0
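
To apply a different cap per column, as the question describes, the same pattern extends naturally; a minimal sketch with made-up column names and caps (the per-column loop relies only on the pattern shown above; the nested-dict form is pandas' replace syntax, which dask's replace should accept as well):

import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'a': [1, 2, np.inf], 'b': [np.inf, 5, 6]})
ddf = dd.from_pandas(df, npartitions=2)

# Hypothetical per-column upper bounds
caps = {'a': 1000, 'b': 250}

# Reuse the single-column pattern for each column...
for col, cap in caps.items():
    ddf[col] = ddf[col].replace(np.inf, cap)

# ...or, if dask's replace accepts pandas' nested-dict form, in one call:
# ddf = ddf.replace({col: {np.inf: cap} for col, cap in caps.items()})

# check results with: ddf.compute()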

Related

Correctly using np.where() to output dataframe rather than a tuple

I am trying to return the Date (the index column, set with set_index()) where the measurement at one location is twice the measurement at another. I need to show the dates and the speeds at the first location.
Here is what I have so far...
c = np.where(data['Loc1']== 2*data['Loc9'])
c
which returns a tuple... How can I get it to show the dates and wind speeds?
Slowly learning Python here.
np.where will return the positional indices of the rows meeting the criteria, so you then need to use those indices to select the rows of the DataFrame, e.g. using .loc (or .iloc if the frame has a non-default index such as dates):
import pandas as pd
import numpy as np
# Setup:
data = pd.DataFrame({"Loc1":[4,2,3], "Loc9": [2, 3, 4]})
c = np.where(data['Loc1']== 2*data['Loc9'])
data.loc[c]
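
Since the goal in the question is to show the dates (in the index) and the speeds at the first location, a boolean mask is often simpler than np.where; a sketch assuming columns named 'Loc1' and 'Loc9' and a date index as described:

import pandas as pd

# Hypothetical data with a date index, mirroring the question's setup
data = pd.DataFrame({"Loc1": [4, 2, 3], "Loc9": [2, 3, 4]},
                    index=pd.date_range("2020-01-01", periods=3, name="Date"))

mask = data['Loc1'] == 2*data['Loc9']   # boolean Series aligned on the index
print(data.loc[mask, 'Loc1'])           # dates in the index, Loc1 speeds as values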

Mean of every 15 rows of a dataframe in python

I have a dataframe of shape (1500, 11). I have to take every 15 rows and compute the mean of each of the 11 columns separately, so my final dataframe should have dimensions 100x11. How do I do this in Python?
The following should work:
dfnew = df[:0]
for i in range(100):
    df2 = df.iloc[i*15:i*15+15, :]
    x = pd.Series(dict(df2.mean()))
    dfnew = dfnew.append(x, ignore_index=True)  # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
print(dfnew)
I don't know much about pandas, hence I've coded my solution in pure numpy, without any Python loops, so it is very efficient. The result is converted back to a pandas DataFrame. Try the following code:
import pandas as pd, numpy as np
df = pd.DataFrame([[i + j for j in range(11)] for i in range(1500)])
a = df.values
a = a.reshape((a.shape[0] // 15, 15, a.shape[1]))
a = np.mean(a, axis = 1)
df = pd.DataFrame(a)
print(df)
You can use pandas.DataFrame.
Use a for loop to compute the means and keep a counter that is reset every 15 entries, roughly like this pseudocode:
columns = [col1, col2, ..., col11]
for column, values in df.items():
    # compute the mean
    # save it every 15 entries
Also, using pd.DataFrame() you can create the new dataframe.
I'd recommend you to read the documentation.
https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
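
For completeness, a pandas-native sketch of the same idea: group the rows by their position divided by 15 and take the mean of each group (this assumes the rows should be averaged in blocks of 15 consecutive rows, as in the question):

import pandas as pd
import numpy as np

df = pd.DataFrame([[i + j for j in range(11)] for i in range(1500)])

# Rows 0-14 form group 0, rows 15-29 form group 1, and so on
block_means = df.groupby(np.arange(len(df)) // 15).mean()
print(block_means.shape)  # (100, 11)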

pandas rolling mean didn't work for my time data series

I'm writing a moving average function on time series:
def datedat_moving_mean(datedat, window):
    # window is the average length
    datedatF = pandas.DataFrame(datedat)
    return (datedatF.rolling(window).mean()).values
The code above is copied from Moving Average- Pandas
Then I apply this function to this time series:
datedat1 = numpy.array(
    [pandas.date_range(start=datetime.datetime(2015, 1, 30), periods=17),
     numpy.random.rand(17)]).T
However, datedat_moving_mean(datedat1, 4) just returns the original datedat1. It averaged nothing! What's wrong?
Your construction of the DataFrame has no index (defaults to ints) and has a column of Timestamp and a column of floats.
I imagine that you want to use the Timestamps as an index, but even if not, you will need to for the purpose of using .rolling() on the frame.
I would suggest that your initialisation of the original DataFrame should be more like this
import datetime
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.rand(17),
                  index=pd.date_range(start=datetime.datetime(2015, 1, 30), periods=17))
If, however, you are happy to leave the dataframe un-indexed, you can work around the rolling issue by temporarily setting the index to the Timestamp column:
import pandas as pd
import numpy as np
import datetime

datedat1 = np.array([pd.date_range(start=datetime.datetime(2015, 1, 30), periods=17),
                     np.random.rand(17)]).T
datedatF = pd.DataFrame(datedat1)
# We can temporarily set the index, compute the rolling mean, and then
# take the values of the entire DataFrame
vals = datedatF.set_index(0).rolling(5).mean().reset_index().values
print(vals)
I would suggest, however, that creating the DataFrame with an index is the better option (consider what happens if the datetimes are not sorted when you call rolling on the dataframe).
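
Putting that together, a minimal sketch with the DataFrame built on a DatetimeIndex and the rolling mean applied directly (window of 4, as in the question):

import datetime
import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.random.rand(17),
                  index=pd.date_range(start=datetime.datetime(2015, 1, 30), periods=17))

# The first window-1 rows are NaN, as expected for a trailing moving average
print(df.rolling(4).mean())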

Appending new column to dask dataframe

This is a follow up question to Shuffling data in dask.
I have an existing dask dataframe df where I wish to do the following:
df['rand_index'] = np.random.permutation(len(df))
However, this gives the error Column assignment doesn't support type ndarray. I tried to use df.assign(rand_index=np.random.permutation(len(df))), which gives the same error.
Here is a minimal (not) working sample:
import pandas as pd
import dask.dataframe as dd
import numpy as np
df = dd.from_pandas(pd.DataFrame({'A':[1,2,3]*10, 'B':[3,2,1]*10}), npartitions=10)
df['rand_index'] = np.random.permutation(len(df))
Note:
The previous question mentioned using df = df.map_partitions(add_random_column_to_pandas_dataframe, ...) but I'm not sure if that is relevant to this particular case.
Edit 1
I attempted
df['rand_index'] = dd.from_array(np.random.permutation(len(df)))
which executed without an issue. When I inspected df.head() it seems that the new column was created just fine. However, when I look at df.tail() the rand_index is a bunch of NaNs.
In fact just to confirm I checked df.rand_index.max().compute() which turned out to be smaller than len(df)-1. So this is probably where df.map_partitions comes into play as I suspect this is an issue with dask being partitioned. In my particular case I have 80 partitions (not referring to the sample case).
You would need to turn np.random.permutation(len(df)) into a type that dask understands:
permutations = dd.from_array(np.random.permutation(len(df)))
df['rand_index'] = permutations
df
This would yield:
Dask DataFrame Structure:
                    A      B rand_index
npartitions=10
0               int64  int64      int32
3                 ...    ...        ...
...               ...    ...        ...
27                ...    ...        ...
29                ...    ...        ...
Dask Name: assign, 61 tasks
So it is up to you now if you want to .compute() to calculate actual results.
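
Note that this still has the alignment pitfall described in Edit 1: the chunks of the converted array do not necessarily line up with the dataframe's partitions, which is where the NaN tail comes from. A sketch of a partition-aligned variant (assuming a reasonably recent dask that accepts a dask array for column assignment when its chunks match the partition lengths):

import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({'A': [1, 2, 3]*10, 'B': [3, 2, 1]*10}), npartitions=10)

chunks = tuple(df.map_partitions(len).compute())   # rows in each partition
perm = da.from_array(np.random.permutation(sum(chunks)), chunks=(chunks,))
df['rand_index'] = perm    # chunk sizes line up with the dataframe partitions
print(df.tail())           # no NaN tail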
To assign a column you should use df.assign
Got the same problem as in Edit 1.
My workaround is to take a unique column from the existing dataframe and feed it into the dataframe that is to be appended.
import dask.dataframe as dd
import dask.array as da
import numpy as np
import pandas as pd
df = dd.from_pandas(pd.DataFrame({'A':[1,2,3]*2, 'B':[3,2,1]*2, 'idx':[0,1,2,3,4,5]}), npartitions=10)
chunks = tuple(df.map_partitions(len).compute())
size = sum(chunks)
permutations = da.from_array(np.random.permutation(len(df)), chunks=chunks)
idx = da.from_array(df['idx'].compute(), chunks=chunks)
ddf = dd.concat([dd.from_dask_array(c) for c in [idx,permutations]], axis = 1)
ddf.columns = ['idx','rand_idx']
df = df.merge(ddf, on='idx')
df = df.set_index('rand_idx')
df.compute().head()

Reading values from Pandas dataframe rows into equations and entering result back into dataframe

I have a dataframe. For each row of the dataframe, I need to read values from two column indexes, pass these values to a set of equations, enter the result of each equation into its own column index in the same row, then go to the next row and repeat.
After reading the responses to similar questions I tried:
import pandas as pd
DF = pd.read_csv("...")
Equation_1 = f(x, y)
Equation_2 = g(x, y)
for index, row in DF.iterrows():
    a = DF[m]
    b = DF[n]
    DF[p] = Equation_1(a, b)
    DF[q] = Equation_2(a, b)
Rather than iterating over DF, reading and entering new values for each row, this code iterates over DF and enters the same values for every row. I am not sure what I am doing wrong here.
Also, from what I have read it is actually faster to treat the DF as a NumPy array and perform the calculation over the entire array at once rather than iterating. Not sure how I would go about this.
Thanks.
Turns out that this is extremely easy. All that must be done is to define two variables and assign the desired columns to them. Then set the column to be filled equal to the equation containing the variables.
Pandas already knows that it must apply the equation to every row and return each value to its proper index. I didn't realize it would be this easy and was looking for more explicit code.
e.g.,
import pandas as pd
df = pd.read_csv("...") # df is a large 2D array
A = df[0]
B = df[1]
def f(A, B): ...  # the equation goes here
df[3] = f(A,B)
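
A concrete version of that idea, as a sketch (the column names and the equations here are made up, since the question's actual equations are not shown):

import pandas as pd
import numpy as np

df = pd.DataFrame({'m': [1.0, 2.0, 3.0], 'n': [4.0, 5.0, 6.0]})

# Hypothetical equations; substitute the real ones
def equation_1(a, b):
    return a * b

def equation_2(a, b):
    return np.sqrt(a**2 + b**2)

# Operating on whole columns applies the equations row-wise without any loop
df['p'] = equation_1(df['m'], df['n'])
df['q'] = equation_2(df['m'], df['n'])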
# If your equations are simple enough, do operations column-wise in Pandas:
import pandas as pd
test = pd.DataFrame([[1,2],[3,4],[5,6]])
test # Default column names are 0, 1
test[0] # This is column 0
test.iloc[:, 0] # This is COLUMN 0 by position, returned as a Series (.icol() was removed from pandas)
test.columns=(['S','Q']) # Column names are easier to use
test #Column names! Use them column-wise:
test['result'] = test.S**2 + test.Q
test # results stored in DataFrame
# For more complicated stuff, try apply, as in Python pandas apply on more columns :
def toyfun(df):
    return df['S'] - df['Q']**2
test['out2'] = test[['S','Q']].apply(toyfun, axis=1)
# You can also define the column names when you generate the DataFrame:
test2 = pd.DataFrame([[1,2],[3,4],[5,6]],columns = (list('AB')))
