Append new column to a Snowpark DataFrame with simple string - python

I've started using Python Snowpark and am no doubt missing obvious answers because I'm unfamiliar with the syntax and documentation.
I would like to do a very simple operation: append a new column to an existing Snowpark DataFrame and assign it a simple string value.
Any pointers to the documentation to what I presume is readily achievable would be appreciated.

You can do this with the with_column function in combination with the lit function. with_column needs a Column expression, and for a literal value you can build one with lit. See the documentation here: https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.lit.html
from snowflake.snowpark.functions import lit
snowpark_df = snowpark_df.with_column('NEW_COL', lit('your_string'))
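For context, here is a runnable end-to-end sketch; the connection parameters are placeholders, and the small input DataFrame and column names are hypothetical:

from snowflake.snowpark import Session
from snowflake.snowpark.functions import lit

# placeholder credentials: fill in your own account/user/password etc.
connection_parameters = {"account": "<account>", "user": "<user>", "password": "<password>"}
session = Session.builder.configs(connection_parameters).create()

# hypothetical starting DataFrame
snowpark_df = session.create_dataframe([[1], [2], [3]], schema=["ID"])

# append a constant string column
snowpark_df = snowpark_df.with_column("NEW_COL", lit("your_string"))
snowpark_df.show()  # each row now carries the constant NEW_COL value

Note that with_column replaces the column if one with the same name already exists.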

Related

Replacing variable name with literal value in Python [duplicate]

This question already has answers here:
How to access (get or set) object attribute given string corresponding to name of that attribute (3 answers)
Closed 8 months ago.
I'm not quite sure how to phrase this question, so let me illustrate with an example.
Let's say you have a Pandas dataframe called store_df with a column called STORE_NUMBER. There are two ways to access a given column in a Pandas dataframe:
store_df['STORE_NUMBER']
and
store_df.STORE_NUMBER
Now let's say that you have a variable called column_name which contains the name of a column in store_df as a string. If you run
store_df[column_name]
All is well. But if you try to run
store_df.column_name
Python throws an AttributeError because it is looking for a literal column named "column_name" which doesn't exist in our hypothetical dataframe.
My question is: is there a way to look up columns dynamically using the second syntax (dot notation)? Not because there is anything wrong with the first syntax (bracket notation), but because I am curious whether there is some advanced feature of Python that allows users to replace a variable name with its value when accessing an attribute (in this case, a column of the dataframe). I know there is the exec function, but I was wondering if there was a more elegant solution. I tried
store_df.{column_name}
but received a SyntaxError.
Would getattr(df, 'column_name_as_str') be the kind of thing you're looking for, perhaps?
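For instance, a minimal sketch with hypothetical data:

import pandas as pd

store_df = pd.DataFrame({"STORE_NUMBER": [101, 102, 103]})
column_name = "STORE_NUMBER"

# getattr resolves an attribute name from a string at runtime,
# equivalent to store_df.STORE_NUMBER here
print(getattr(store_df, column_name))

For setting values, plain item assignment (store_df[column_name] = ...) remains the idiomatic route, since attribute assignment on a DataFrame can create an instance attribute rather than a column.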

Trouble translating from Pandas to PySpark

I'm having a lot of trouble translating a function that worked on a pandas DataFrame to a PySpark UDF. Mainly, PySpark is throwing errors that I don't really understand, because this is my first time using it. First, my dataset does contain some NaNs, which I didn't know would add some complexity to the task. That said, the dataset contains the standard data types, i.e. categories and integers. Finally, I am running my algorithm using the Pandas groupby() method, apply(), and a lambda function; I'm told that PySpark supports all of these methods.
Now let me tell you about the algorithm. It's pretty much a counting game that I run on one column, and it is written in vanilla Python. The reason I mention this is that it's a bit too long to post. It returns three lists, i.e. arrays, which from what I understand PySpark also supports. This is what a super short version of the algo looks like:
def algo(x, col):
    # you will be looking at a specific pandas column --- pd.Series
    x = x[col]
    # LOGIC GOES HERE...
    return list1, list2, list3
I'm running the algorithm using:
data = df.groupby("GROUPBY_THIS").apply(lambda x: algo(x, "COLUMN1"))
And everything works fine: I get back the three lists with the correct lengths. Now, when I try to run this algorithm using PySpark, I'm confused about whether to use UDFs or pandas UDFs, and I'm throwing an error that I can't quite understand. Can someone point me in the correct direction here? Thanks!
Error:
ValueError: Invalid udf: the udf argument must be a pandas_udf of type GROUPED_MAP.
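The error message itself points at the fix: in PySpark, df.groupby(...).apply(...) expects a pandas_udf declared with PandasUDFType.GROUPED_MAP, which receives each group as a pandas DataFrame and must return a pandas DataFrame matching a schema declared up front. Here is a minimal sketch under that assumption; the algo body, column names, and schema are hypothetical stand-ins:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("b", 3)], ["GROUPBY_THIS", "COLUMN1"]
)

def algo(pdf, col):
    # stand-in for the real counting algorithm
    s = pdf[col]
    return s.tolist(), (s * 2).tolist(), (s * 3).tolist()

# The output schema must be declared up front; array columns are supported.
schema = "GROUPBY_THIS string, list1 array<bigint>, list2 array<bigint>, list3 array<bigint>"

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def run_algo(pdf):
    l1, l2, l3 = algo(pdf, "COLUMN1")
    return pd.DataFrame({
        "GROUPBY_THIS": [pdf["GROUPBY_THIS"].iloc[0]],
        "list1": [l1], "list2": [l2], "list3": [l3],
    })

df.groupby("GROUPBY_THIS").apply(run_algo).show()

On newer Spark versions the same idea is spelled df.groupby(...).applyInPandas(func, schema), which avoids the deprecated PandasUDFType enum.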

Attempting to cycle through a Pandas df looking for a PU string; from this I wish to record the row number

for x in range(lenofdf):
    if df.iloc[0, x][5:7] == 'PU':
        print(df.iloc[0, x])
I get the following error:
'invalid index to scalar variable.'
I don't understand why it won't work this way, when I get a positive result for:
if df.iloc[0,2][5:7] == 'PU': print('Bruh')
First of all, it is highly inefficient to loop through the values in a dataframe with a regular Python loop; use the built-in DataFrame method iterrows() instead.
Secondly, Pandas advises avoiding loops altogether in favor of the dataframe's built-in vectorized search methods, which are much faster.
For example, you can search a specific column for a string using the Series methods:
df.iloc[:,0].str.contains('PU')
This will output a boolean pandas.Series where each index corresponds to a row in the original column.
Similarly, you can search the entire dataframe using the eq method:
df.eq('PU')
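Since the goal is to record the row numbers of the matches, you can turn the boolean mask into positional indices. A minimal sketch with hypothetical data; na=False guards against non-string cells such as NaN, and slicing such a numeric scalar is likely what raises 'invalid index to scalar variable' in the loop above:

import pandas as pd

df = pd.DataFrame({"code": ["ABC--PU1", "ABC--XX2", "DEF--PU3"]})

mask = df.iloc[:, 0].str.contains('PU', na=False)
row_numbers = mask.to_numpy().nonzero()[0]
print(row_numbers)  # [0 2]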

What is the best way to modify (e.g., perform math functions on) a column in a Dask DataFrame?

I'm a veteran of Pandas DataFrame objects, but I'm struggling to find a clean, convenient method for altering the values in a Dask DataFrame column. For a specific example, I'm trying to multiply positive values in a numpy.float column by -1, thereby making them negative. Here is my current method (I'm trying to change the last column in the DataFrame):
cols = df.columns
df[[cols[-1]]] = df[[cols[-1]]]*-1
This seems to work only if the column has a string header; otherwise it adds another column, using the index number as a string-type column name for the new column. Is there something akin to the Pandas method of, say, df.iloc[:, -1] = df.iloc[:, -1]*-1 that I can use with a Dask dataframe?
Edit: I've also tried implementing df = df.applymap(lambda x: x*-1). This, of course, applies the function to the entire dataframe, but is there a way to apply a function over just one column? Thank you.
first question
If something works for string columns and not for numeric-named columns then that is probably a bug. I recommend raising an issue at https://github.com/dask/dask/issues/new
second question
but is there a way to apply a function over just one column?
You can't directly apply a single Python function over a dask dataframe that is stored in many pieces, but methods like .map_partitions or .reduction may help you achieve the same result with some cleverness.
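For instance, a minimal sketch of negating only the positive values in one column with map_partitions; the data here is hypothetical, and plain column assignment also works lazily in Dask:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": [1.0, -2.0, 3.0], "b": [4.0, -5.0, 6.0]})
df = dd.from_pandas(pdf, npartitions=2)

# negate the positive values of the last column, partition by partition
last = df.columns[-1]
df[last] = df[last].map_partitions(lambda s: s.where(s <= 0, -s))

print(df.compute())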
In the future, we recommend asking separate questions separately on Stack Overflow.

Python - Use column name with int and string

I have imported an NBA statistical dataset, but some of my column names start with a digit, as in "3PP" or "2FG". Therefore, the following code won't work:
for team in nba.3PP:
When it runs, it gives an "invalid syntax" error. Is there a special way I can use 3PP, like .\3PP or something, to get it to work? Thanks!
EDIT: Using a Pandas DataFrame
You don't say what you've imported it into. If Pandas:
for team in nba['3PP']:
...
This uses item-oriented indexing rather than attribute-oriented indexing. In Python in general they are not equivalent, but in Pandas they can often be used interchangeably.
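A quick illustration of the difference, with hypothetical data: bracket access works for any column name, while attribute access on a name that starts with a digit is rejected by the parser itself.

import pandas as pd

nba = pd.DataFrame({"3PP": [0.35, 0.41], "TEAM": ["BOS", "LAL"]})

for pct in nba["3PP"]:  # item access handles digit-leading names
    print(pct)

# nba.3PP  # SyntaxError: an identifier cannot start with a digit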
Use the .get method:
nba.get("3PP")
Or:
nba['3PP']
Depending on whether the dataset is in Pandas or whatnot.
