Passing varying length variables to a PySpark groupby().agg function - python

I am passing lists of column names of varying lengths to PySpark's groupby().agg function. The code I have written checks the length of the list; for example, if it has length 1, it does a single .agg(count) on that one element. If the list has length 2, it does two separate count aggregations, producing two new aggregate columns.
Is there a more succinct way to write this than through an if statement? As the lists of column names become longer, I'll have to keep adding elif branches.
For example:
# agg_fields: list of column names
if len(agg_fields) == 1:
    df = df.groupBy(col1, col2).agg(count(agg_fields[0]))
elif len(agg_fields) == 2:
    df = df.groupBy(col1, col2).agg(count(agg_fields[0]),
                                    count(agg_fields[1]))

Yes, you can simply loop to create your aggregate statement:
agg_df = df.groupBy("col1","col2").agg(*[count(i).alias(i) for i in agg_fields])
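For context, here is a minimal self-contained sketch of that pattern; the SparkSession setup, data, and column names are invented for illustration:
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = SparkSession.builder.getOrCreate()

# Made-up data: two grouping columns and two columns to aggregate.
df = spark.createDataFrame(
    [("a", "x", 1, 2), ("a", "x", 3, 4), ("b", "y", 5, 6)],
    ["col1", "col2", "val1", "val2"],
)
agg_fields = ["val1", "val2"]

# The comprehension builds one count() expression per column;
# the * unpacks them as separate arguments to .agg().
agg_df = df.groupBy("col1", "col2").agg(*[count(c).alias(c) for c in agg_fields])
agg_df.show()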

Related

Within a for loop how to append index value to the end of the dataframe name

I am writing a 'for' loop in Python (pandas DataFrame) and would like to append the index value to the end of each dataframe name to differentiate them. How can I do it?
For example, I have a dataframe df with a column value taking the values 1 through 5, and I would like to split the dataset into 5 pieces, one per value: '1' / '2' / '3' / '4' / '5'.
I've tried the following, which gives a syntax error. How can I fix it? Thanks.
for i in range(1, 5):
    df_f'{i}' = df.loc[df['value'] == i]
Note: I'd like the resulting dataframe names to be df_1, df_2, df_3, df_4, df_5.
As @hbgoddard pointed out, it's bad practice to generate variable names dynamically (at least in production code). However, if you really want to do it, edit globals() like so (note range(1, 6), so that df_5 is created too):
for i in range(1, 6):
    globals()[f'df_{i}'] = df.loc[df['value'] == i]
It is not recommended to generate variable names at runtime; use a list or dictionary instead.
df_parts = {i: df.loc[df['value'] == i] for i in range(1, 6)}
A more concise version that handles all unique values of value instead of just 1 through 5:
df_parts = dict(list(df.groupby('value')))
You can then access each part as df_parts[1], df_parts[2], etc.
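For reference, a small runnable sketch with made-up data that shows both approaches side by side:
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 2, 3, 4, 5], 'x': range(6)})

# Explicit dictionary comprehension over the known values 1..5:
df_parts = {i: df.loc[df['value'] == i] for i in range(1, 6)}

# Equivalent via groupby, covering every unique value automatically:
df_parts = dict(list(df.groupby('value')))

print(df_parts[2])  # just the rows where value == 2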

dataframe overwritten when using list comprehension

I am attempting to create four new pandas dataframes via a list comprehension. Each new dataframe should be the original 'constituents_list' dataframe with two new columns. These two columns add a defined number of years to an existing column and return the value. The example code is below
def add_maturity(df, tenor):
    df['tenor'] = str(tenor) + 'Y'
    df['maturity'] = df['effectivedate'] + pd.DateOffset(years=tenor)
    return df

year_list = [3, 5, 7, 10]
new_dfs = [add_maturity(constituents_file, tenor) for tenor in year_list]
My expected output in the new_dfs list is four dataframes, each with a different value for 'tenor' and 'maturity'. In my results, all four dataframes have the same data, with a 'tenor' of '10Y' and a 'maturity' 10 years greater than the 'effectivedate' column.
I suspect that each time I iterate through the list comprehension each existing dataframe is overwritten with the latest call to the function. I just can't work out how to stop this happening.
Many thanks
When you assign to a column of the DataFrame, you're modifying it in place. And when you pass it as an argument to a function, what you're passing is a reference to the DataFrame object, in this case a reference to the same DataFrame object every time, so each call overwrites the previous results.
To solve this issue, you can either create a copy of the DataFrame at the start of the function:
def add_maturity(df, tenor):
    df = df.copy()
    df['tenor'] = str(tenor) + 'Y'
    df['maturity'] = df['effectivedate'] + pd.DateOffset(years=tenor)
    return df
(Or you could keep the function as is, and have the caller copy the DataFrame first when passing it as an argument...)
Or you can use the assign() method, which returns a new DataFrame with the modified columns:
def add_maturity(df, tenor):
    return df.assign(
        tenor=str(tenor) + 'Y',
        maturity=df['effectivedate'] + pd.DateOffset(years=tenor),
    )
(Personally, I'd go with the latter. It's similar to how most DataFrame methods work, in that they typically return a new DataFrame rather than modifying it in place.)
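To see the fix in action, here is a self-contained sketch; the effectivedate values are invented:
import pandas as pd

# Hypothetical stand-in for the constituents data:
constituents_file = pd.DataFrame(
    {'effectivedate': pd.to_datetime(['2020-01-01', '2020-06-01'])}
)

def add_maturity(df, tenor):
    return df.assign(
        tenor=str(tenor) + 'Y',
        maturity=df['effectivedate'] + pd.DateOffset(years=tenor),
    )

new_dfs = [add_maturity(constituents_file, tenor) for tenor in [3, 5, 7, 10]]
print(new_dfs[0]['tenor'].iloc[0], new_dfs[-1]['tenor'].iloc[0])  # 3Y 10Y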

Adding Dataframes from a List of Dataframes using another List

I am having trouble adding several dataframes in a list of dataframes. My goal is to add dataframes from a list of dataframes based on the criteria from another list.
Example: Suppose we have a list of 10 Dataframes, DfList and another list called OrderList.
Suppose OrderList = [3, 2, 1, 4].
Then I would like to obtain a new list of 4 DataFrames of the form [DfList[0] + DfList[1] + DfList[2], DfList[3] + DfList[4], DfList[5], DfList[6] + DfList[7] + DfList[8] + DfList[9]].
I have tried a few ways to do this, creating functions using DataFrame.add. Initially, my hope was that I could use the form sum(DfList[0], DfList[1], DfList[2]) to do this but quickly learned that sum() doesn't accept DataFrames as separate arguments.
I was hoping to use something like sum(DfList[0:2]) and making OrderList cumulative so I could just use sum(DfList[OrderList[i]:OrderList[i+1]]) but keep getting unsupported operand type errors.
Is there an easy way to do this that I am not considering or is there a different approach entirely that you would suggest?
EDIT: The output I am looking for is another list of DataFrames containing four summed DataFrames based on OrderList (across all columns.) Three DataFrames added together for the first, two for the second, one for the third, and four for the fourth.
If you have a list of DataFrames as you said, you can use the operation sum(DfList[0:2]) as long as the frames are numeric: sum() starts from the integer 0, and adding 0 to a string column raises a TypeError, which is likely the "unsupported operand type" error you kept getting. Also note that pandas aligns the addition on the column names and index rather than on column position, so any column that doesn't appear in every DataFrame comes out as NaN. If you need, the order of the columns can be changed as shown in this other question.
This example illustrates both points:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 23, 4], 'b': [10, 20, 30]})
df2 = pd.DataFrame({'b': [10, 20, 30], 'a': [1, 23, 4]})

# Alignment is by label, so the differing column order is fine:
df1 + df2

# But a non-numeric column breaks sum(), because it starts from 0:
df3 = pd.DataFrame({'a': [1, 23, 4], 'b': ['x', 'y', 'z']})
try:
    sum([df3, df3])
except TypeError:
    print("Error")  # 0 + 'x' is undefined
Also, the logic you used for the cumulative sum in sum(DfList[OrderList[i]:OrderList[i+1]]) is not correct as written. For that to work, OrderList would also need to be cumulative and have one extra element to start from zero: instead of OrderList = [3, 2, 1, 4], you would have OrderList = [0, 3, 5, 6, 10], as shown in the sketch below.
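A minimal sketch of that cumulative approach using itertools.accumulate; the DataFrames here are made-up numeric stand-ins for the question's list of ten:
from itertools import accumulate
import pandas as pd

# Hypothetical stand-ins for the question's ten numeric DataFrames:
DfList = [pd.DataFrame({'a': [i], 'b': [i * 10]}) for i in range(10)]
OrderList = [3, 2, 1, 4]

bounds = [0, *accumulate(OrderList)]  # [0, 3, 5, 6, 10]
summed = [sum(DfList[start:stop]) for start, stop in zip(bounds, bounds[1:])]
# summed[0] is DfList[0] + DfList[1] + DfList[2], and so on.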

I am wishing to produce a series of smaller dataframes from a single large dataframe in python while naming the dataframes with the filter

I have a large dataframe called dfe filled with scientific information. In my first column ('reaction') there are three different string values, say a, b, c. I wish to split this dataframe into three dataframes dfa, dfb, dfc. I have a list variable called react2 containing the values a, b, c.
Here is my code for the problem:
for i in react2:
    df{}.format(i) = dfe[dfe['reaction'] = i]
I then get an error of:
df{}.format(i) = dfe[dfe['reaction'] = i ]
^
SyntaxError: invalid syntax
The most sensible thing would be to store them in a dictionary:
df_dict = {}
for i in react2:
    df_dict[i] = dfe[dfe['reaction'] == i]
You can put this onto a single line using a dictionary comprehension:
df_dict = {i: dfe[dfe['reaction'] == i] for i in react2}
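For concreteness, a small runnable sketch with invented data that shows the lookup afterwards:
import pandas as pd

dfe = pd.DataFrame({'reaction': ['a', 'b', 'a', 'c'],
                    'rate': [1.0, 2.0, 3.0, 4.0]})
react2 = ['a', 'b', 'c']

df_dict = {i: dfe[dfe['reaction'] == i] for i in react2}
print(df_dict['a'])  # just the rows where reaction == 'a'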

compare list of dictionaries to dataframe, show missing values

I have a list of dictionaries
example_list = [{'email': 'myemail@email.com'}, {'email': 'another@email.com'}]
and a dataframe with an 'Email' column
I need to compare the list against the dataframe and return the values that are not in the dataframe.
I can certainly iterate over the list, check in the dataframe, but I was looking for a more pythonic way, perhaps using list comprehension or perhaps a map function in dataframes?
To return those values that are not in df['Email'], here are a couple of options involving set difference operations:
np.setdiff1d
import numpy as np

emails = [d['email'] for d in example_list]
diff = np.setdiff1d(emails, df['Email'])  # returns a sorted ndarray
set.difference
# returns a set
diff = set(d['email'] for d in example_list).difference(df['Email'])
One way is to take one set from another. For a functional solution you can use operator.itemgetter:
from operator import itemgetter
res = set(map(itemgetter('email'), example_list)) - set(df['Email'])
Note that - is syntactic sugar for set.difference.
I ended up converting the list into a dataframe, comparing the two dataframes by merging them on a column, and then creating a dataframe out of the missing values
so, for example
example_list = [{'email': 'myemail@email.com'}, {'email': 'another@email.com'}]
# Rename so the merge key matches df_one's 'Email' column:
df_two = pd.DataFrame(example_list).rename(columns={'email': 'Email'})
common = df_one.merge(df_two, on=['Email'])
df_diff = df_one[~df_one.Email.isin(common.Email)]
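As a sanity check, a self-contained sketch of that merge-based approach; df_one and the email addresses are invented here:
import pandas as pd

df_one = pd.DataFrame({'Email': ['myemail@email.com', 'third@email.com']})
example_list = [{'email': 'myemail@email.com'}, {'email': 'another@email.com'}]

df_two = pd.DataFrame(example_list).rename(columns={'email': 'Email'})
common = df_one.merge(df_two, on=['Email'])
df_diff = df_one[~df_one.Email.isin(common.Email)]
print(df_diff)  # third@email.com: present in df_one but not in the list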
