Python Dataframe for loop syntax - python

I'm having trouble correctly executing a for loop over my dataframe in Python.
Basically, for every row in the dataframe (df_weather), the code should select one value each from columns no. 13 and 14 and pass them to a function defined earlier in the code. Eventually, the calculated value from each row should be summed to give one final answer.
The error being returned is as follows: "string indices must be integers"
Could anyone help me with this step? The code is provided below.
Thanks!
stress_rate = 0
for i in df_weather:
    b = GetStressDampHeatParameterized(i[:,13], i[:,14])
    stress_rate = b + stress_rate
print(stress_rate)

This can be solved in a single line:
print(sum(df_weather.apply(lambda row: func(row.iloc[13], row.iloc[14]), axis=1)))
Where func is your desired function and axis=1 ensures that the function is applied on each row as opposed to each column (which is the default).
My solution first creates a temporary Series (picture an unattached column) by applying a function to each row in turn. The function actually being applied is an anonymous function, marked by the keyword lambda, which is fed one row at a time by the apply method. That anonymous function simply calls your function func and passes it the two column values from the row.
A Series can be summed using the sum function.
Note that column indexing starts at 0, so double-check that positions 13 and 14 really are the columns you want.
Also note that saying for x in df: will iterate over the column names, not the rows.
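Here is a minimal, self-contained sketch of the same pattern on a toy dataframe (the two-column frame and the stand-in calculation are invented; swap in your real columns and GetStressDampHeatParameterized):
import pandas as pd

# Toy stand-in for the real per-row calculation
def func(temp, humidity):
    return temp * 0.1 + humidity * 0.01

df_toy = pd.DataFrame({"temp": [20, 25, 30], "humidity": [40, 50, 60]})

# Apply func to each row (axis=1) via positional column access, then sum the results
stress_rate = df_toy.apply(lambda row: func(row.iloc[0], row.iloc[1]), axis=1).sum()
print(stress_rate)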

Your number one problem is the following line:
for i in df_weather:
This line actually yields the column titles, not the rows themselves. What you're looking for is:
for i in df_weather.values:
(.values is an attribute, not a method.) It returns a NumPy array you can iterate over; note, though, that i will then be a single row of that array, so you would index it as i[13] rather than i[:,13].
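If you prefer iterating explicitly, a rough sketch using itertuples (the column positions follow the indices in your own code and may need adjusting):
stress_rate = 0
for row in df_weather.itertuples(index=False):
    # row is a plain tuple of the row's values; positions 13 and 14 mirror i[:,13] and i[:,14]
    stress_rate += GetStressDampHeatParameterized(row[13], row[14])
print(stress_rate)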

Related

Can I iterate through my df within a defined function?

I have a function that takes two numbers and makes an output. I want to apply that function to a specific column in my df for the first two rows, and then the next two, etc.
function(df['foo'][0:2]) works, and the next group to be passed into the function would be [2:4]. How would I create a loop that would iterate through my entire df? Or am I not thinking about this correctly?
Use a for loop that iterates in steps of 2.
for i in range(0, len(df.index), 2):
    function(df['foo'][i:i+2])
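As a rough illustration (the frame and the summing stand-in below are made up), you can collect the per-chunk results in a list so they are available afterwards:
import pandas as pd

def function(values):
    # toy stand-in: just sum the two values in the chunk
    return values.sum()

df = pd.DataFrame({'foo': [1, 2, 3, 4, 5, 6]})

results = []
for i in range(0, len(df.index), 2):
    results.append(function(df['foo'][i:i+2]))

print(results)  # [3, 7, 11]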

For Python Pandas, how to implement a "running" check of 2 rows against the previous 2 rows?

[updated with expected outcome]
I'm trying to implement a "running" check where I need the sum and mean of two rows to be more than the previous 2 rows.
Referring to the dataframe (copied into a spreadsheet) below, I'm trying to code a function where, if the mean of those two orange cells is more than the blue cells, the function will return True for row 8, under a new column called 'Cond11'. The dataframe here is historical, so all rows are available.
Note that the Rows column was added in the spreadsheet to make it easier to reference the rows here.
I have been using .rolling to refer to the current row + whatever number of rows to refer to, or using shift(1) to refer to the previous row.
df.loc[:, ('Cond9')] = df.n.rolling(4).mean() >= 30
df.loc[:, ('Cond10')] = df.a > df.a.shift(1)
I'm stuck here... how do I compare these 2 rows against the previous 2 rows? Please advise!
The 2nd part of this question: I have another function that checks the latest rows in the dataframe for the same condition above. This function is meant to be used in real-time, when new data is streaming into the dataframe and the function is supposed to check the latest rows only.
Can I check if the following code works to detect the same conditions above?
cond11 = candles.n[-2:-1].sum() > candles.n[-4:-3].sum()
I believe this solves your problem:
df.rolling(4).apply(lambda rows: rows[0] + rows[1] < rows[2] + rows[3], raw=True)
The first 3 rows will be NaNs but you did not define what you would like to happen there.
As for the second part, to be able to produce this condition live for new data you just have to prepend the last 3 rows of your current data and then apply the same process to it:
pd.concat([df[-3:], df])
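A quick illustration on made-up numbers (the column name n follows the question; the comparison direction may need flipping to match your orange/blue cells):
import pandas as pd

df = pd.DataFrame({'n': [10, 20, 15, 30, 5, 40, 25, 35]})

# For each window of 4 rows: is the sum of the last 2 rows greater than the sum of the first 2?
df['Cond11'] = df['n'].rolling(4).apply(
    lambda w: w[2] + w[3] > w[0] + w[1], raw=True
)

print(df)  # the first 3 rows of Cond11 are NaN, the rest are 1.0/0.0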

How to select a column from a pandas dataframe to be plotted without addressing its name

I need to plot data from a column, and I want to do it without using its name.
The problem is, I want to have user input to make the analysis customised, which means the column will always have a different name, so I'd have to change the name manually for the plot. Any possible solutions to make it automatic?
I tried
stocks_ret.iloc[0,1].plot(figsize=(16,8), grid=True)
but got
AttributeError: 'numpy.float64' object has no attribute 'plot'
Try changing
stocks_ret.iloc[0,1].plot(figsize=(16,8), grid=True)
to
stocks_ret.iloc[:,1].plot(figsize=(16,8), grid=True)
Explanation:
iloc works by selecting row and column indices to extract from your dataframe, using the syntax my_dataframe.iloc[row_range, column_range]. For example, by writing my_dataframe.iloc[0:2, 0:4], you're asking to extract the values from the first two rows and the first four columns (the end of each range is exclusive, and indices start at 0 in Python).
Similarly, by writing my_dataframe.iloc[2, 3], you're asking what's the specific value inside my_dataframe at the third row, fourth column. This is what you've done in your code. Since it returns a single value, and not a pandas series/dataframe, it doesn't have a plot attribute, resulting in the error you see.
In order to select the whole column, you need to pass a range equivalent to the whole column's length, instead of a single index. The : notation can be used as a shorthand to do exactly that, so that my_dataframe.iloc[:, 3] returns the series of all values in the fourth column.
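A small, self-contained sketch of the pattern (the column names are made up, and matplotlib is assumed to be installed):
import pandas as pd
import matplotlib.pyplot as plt

stocks_ret = pd.DataFrame({
    'AAPL': [0.01, -0.02, 0.03, 0.015],
    'MSFT': [0.005, 0.01, -0.01, 0.02],
})

# Select the second column by position, regardless of its name, and plot it
stocks_ret.iloc[:, 1].plot(figsize=(16, 8), grid=True)
plt.show()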

How to change column values according to size

I have a dataframe df in a PySpark setting. I want to change a column, say it is called A, whose datatype is string, according to the length of its values. In particular, if a row contains only a single character, I want to concatenate "0" to the end of it; otherwise the value should be kept as-is. The name of the "modified" column must still be A. This is in a Jupyter Notebook using PySpark3.
This is what I have tried so far:
df = df.withColumn("A", when(size(df.col("A")) == 1, concat(df.col("A"), lit("0"))).otherwise(df.col("A")))
I also tried the same code without the df.col calls.
When I run this code, the interpreter complains that the syntax is invalid, but I can't see the error.
df.withColumn("temp", when(length(df.A) == 1, concat(df.A, lit("0"))).\
otherwise(df.A)).drop("A").withColumnRenamed('temp', 'A')
What I understood from your question is that you were ending up with one extra column A.
So you want the old column A replaced by the new column A. I therefore created a temp column with your required logic, dropped column A, then renamed the temp column to A.
Listen here child...
To reference a column of a DataFrame in PySpark, you must not use df.col("A"); that is the Scala/Java API. In PySpark, just reference the column by name on the DataFrame, df.colName, or use pyspark.sql.functions.col("A").
To get the length of your string, use the length function. The size function is for array and map columns.
And for the grand solution... (drums drums drums)
df.withColumn("A", when(length(df.A) == 1, concat(df.A, lit("0"))).otherwise(df.A))
Por favor!
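Putting it together, a hedged end-to-end sketch (the SparkSession setup and sample rows are invented; the transformation itself is the one above):
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, length, concat, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("bc",), ("d",)], ["A"])

# Append "0" to single-character values, leave everything else untouched
df = df.withColumn("A", when(length(df.A) == 1, concat(df.A, lit("0"))).otherwise(df.A))
df.show()  # expect: a0, bc, d0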

Returning a dataframe in Dask

Aim: to speed up applying a function row-wise across a large dataframe (~1.9 million rows).
Attempt: using Dask map_partitions, where the number of partitions equals the number of cores. I've written a function that is applied to each row and creates a dict containing a variable number of new values (between 1 and 55). This function works fine standalone.
Problem: I need a way to combine the output of each call into a final dataframe. I tried using df.append, appending each dict to a new dataframe and returning that dataframe. If I understand the Dask docs, Dask should then combine them into one big DF. Unfortunately this line trips an error (ValueError: could not broadcast input array from shape (56) into shape (1)), which leads me to believe it's something to do with the combine step in Dask?
# Function applied row-wise down the dataframe. Takes a column value (post) and a new empty df.
def func(post, New_DF):
    post = str(post)
    scores = OtherFUNC.countWords(post)
    scores['post'] = post
    New_DF = New_DF.append(scores, ignore_index=True)
    return(New_DF)

# Dask
dd.from_pandas(dataset, npartitions=nCores).\
    map_partitions(
        lambda df: df.apply(
            lambda x: func(x.post, New_DF), axis=1)).\
    compute(get=get)
I am not quite sure I completely understand your code in the absence of an MCVE, but I think there is a bit of a misunderstanding here.
In this piece of code you take a row and a DataFrame and append one row to that DataFrame.
# Function applied row-wise down the dataframe. Takes a column value (post) and a new empty df.
def func(post, New_DF):
    post = str(post)
    scores = OtherFUNC.countWords(post)
    scores['post'] = post
    New_DF = New_DF.append(scores, ignore_index=True)
    return(New_DF)
Instead of appending to New_DF, I would recommend just returning a pd.Series which df.apply concatenates into a DataFrame. That is because if you are appending to the same New_DF object in all nCores partitions, you are bound to run into trouble.
# Function applied row-wise down the dataframe. Takes a row and returns a row.
def tobsecret_func(row):
    post = str(row.post)
    scores = OtherFUNC.countWords(post)
    scores['post'] = post
    length_adjusted_series = pd.Series(scores).reindex(range(55))
    return(length_adjusted_series)
Your error also suggests that, as you wrote in your question, your function creates a variable number of values. If the pd.Series objects you return don't all have the same shape and index, then df.apply will fail to concatenate them into a pd.DataFrame. Therefore make sure you return a pd.Series of equal shape each time. This question shows you how to create pd.Series of equal length and index: Pandas: pad series on top or bottom
I don't know what kind of dict your OtherFUNC.countWords returns exactly, so you may want to adjust the line:
length_adjusted_series = pd.Series(scores).reindex(range(55))
As is, the line would return a Series with an index 0, 1, 2, ..., 54 and up to 55 values (if the dict originally had fewer than 55 keys, the remaining cells will contain NaN values).
This means that after it is applied to a DataFrame, the columns of that DataFrame will be named 0, 1, 2, ..., 54.
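To illustrate the padding behaviour (the scores dict below is invented and uses integer keys; if countWords returns string keys, reindex against a fixed list of those keys instead):
import pandas as pd

scores = {0: 3, 1: 7, 2: 1}                    # hypothetical counts for the first three slots
padded = pd.Series(scores).reindex(range(55))  # labels 0..54; missing labels become NaN
print(len(padded))  # 55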
Now you take your dataset and map your function to each partition and in each partition you apply it to the DataFrame using apply.
# Dask
dd.from_pandas(dataset, npartitions=nCores).\
    map_partitions(
        lambda df: df.apply(
            lambda x: func(x.post, New_DF), axis=1)).\
    compute(get=get)
map_partitions expects a function which takes a DataFrame as input and outputs a DataFrame. Your code does this with a lambda function that basically calls your other function and applies it to a DataFrame, which in turn returns a DataFrame. This works, but I highly recommend writing a named function which takes a DataFrame as input and outputs a DataFrame, because it makes your code easier to debug.
For example with a simple wrapper function like this:
def df_wise(df):
    return df.apply(tobsecret_func, axis=1)
This matters especially as your code gets more complex: if you avoid lambda functions that call non-trivial code like your custom func and instead use a simple named function, the traceback will not just lead you to a line with a bunch of lambda functions, but will point directly to the named function df_wise, so you will see exactly where the error is coming from.
# Dask
dd.from_pandas(dataset, npartitions=nCores).\
    map_partitions(df_wise,
                   meta=df_wise(dataset.head())
    ).\
    compute(get=get)
Notice that we just fed dataset.head() to df_wise to create our meta keyword, which is similar to what Dask would do under the hood.
You are using dask.get, the synchronous scheduler which is why the whole New_DF.append(...) code could work, since you append to the DataFrame for each consecutive partition.
This does not give you any parallelism and thus will not work if you use one of the other schedulers, all of which parallelise your code.
The documentation also mentions the meta keyword argument, which you should supply to your map_partitions call so Dask knows what columns your DataFrame will have. If you don't, Dask will first have to do a trial run of your function on one of the partitions to check the shape of the output before it can go ahead and process the other partitions. This can slow your code down by a ton if your partitions are large; supplying the meta keyword bypasses this unnecessary computation.
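For reference, a rough end-to-end sketch of the suggested pattern (the sample data, the countWords stand-in, and the fixed set of keys are all invented; note that recent Dask versions take scheduler= instead of get=):
import pandas as pd
import dask.dataframe as dd

N_KEYS = 5  # stand-in for the 55 possible keys

def count_words(post):
    # toy stand-in for OtherFUNC.countWords: word lengths keyed by position
    return {i: len(word) for i, word in enumerate(post.split()[:N_KEYS])}

def tobsecret_func(row):
    scores = count_words(str(row.post))
    scores['post'] = row.post
    # pad to a fixed index so every row yields the same columns
    return pd.Series(scores).reindex(list(range(N_KEYS)) + ['post'])

def df_wise(df):
    return df.apply(tobsecret_func, axis=1)

dataset = pd.DataFrame({'post': ['hello world', 'one two three four five six']})

result = (
    dd.from_pandas(dataset, npartitions=2)
      .map_partitions(df_wise, meta=df_wise(dataset.head(1)))
      .compute(scheduler='threads')
)
print(result)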
