Python - Use a column name that mixes numbers and letters

I have imported an NBA statistical dataset, but some of my column names combine a number and a string, as in "3PP" or "2FG". Therefore, the following code won't work:
for team in nba.3PP
Because when it runs, it gives an "invalid syntax" error. Is there a special way I can use 3PP, like .\3PP or something, to get it to work? Thanks!
EDIT: Using a Pandas DataFrame

You don't say what you've imported into. If Pandas:
for team in nba['3PP']:
...
This uses item-oriented indexing rather than attribute-oriented indexing. In Python in general they are not equivalent, but in Pandas they can often be used interchangeably.
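A minimal sketch of the difference, assuming nba is a pandas DataFrame with a "3PP" column (the frame below is a made-up stand-in for the imported dataset):
import pandas as pd

nba = pd.DataFrame({"Team": ["BOS", "LAL"], "3PP": [0.38, 0.35]})

# Item access works for any column label, including ones starting with a digit.
for pct in nba["3PP"]:
    print(pct)

# Attribute access works only for valid identifiers: nba.Team is fine,
# but nba.3PP is a SyntaxError because a name cannot start with a digit.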

Use the .get method:
nba.get("3PP")
Or:
nba['3PP']
Depending on whether the dataset is in Pandas or something else.

Related

Append new column to a Snowpark DataFrame with simple string

I've started using Python Snowpark and am no doubt missing obvious answers because I'm unfamiliar with the syntax and documentation.
I would like to do a very simple operation: append a new column to an existing Snowpark DataFrame and assign it a simple string.
Any pointers to the documentation for what I presume is readily achievable would be appreciated.
You can do this by using the with_column function in combination with the lit function. with_column needs a Column expression, and for a literal value that expression can be built with lit. See the documentation here: https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.lit.html
from snowflake.snowpark.functions import lit
snowpark_df = snowpark_df.with_column('NEW_COL', lit('your_string'))

Trouble translating from Pandas to PySpark

I'm having a lot of trouble translating a function that worked on a pandas DataFrame into a PySpark UDF. Mainly, PySpark is throwing errors that I don't really understand, because this is my first time using it. First, my dataset does contain some NaNs, which I didn't know would add some complexity to my task. With that said, the dataset contains the standard data types, i.e. categories and integers. Finally, I am running my algorithm with the Pandas groupby() method, applying a lambda function with apply(). I'm told that PySpark supports all these methods.
Now let me tell you about the algorithm. It's pretty much a counting game that I'm running on one column, and it is written in vanilla Python. The reason I'm saying this is that it's a bit too long to post. It returns three lists, i.e. arrays, which from what I understand PySpark also supports. This is what a super short version of the algo looks like:
def algo(x, col):
    # you will be looking at a specific pandas column --- pd.Series
    x = x[col]
    # LOGIC GOES HERE...
    return list1, list2, list3
I'm running the algorithm using:
data = df.groupby("GROUPBY_THIS").apply(lambda x: algo(x, "COLUMN1"))
And everything is working fine; I'm returning the three lists of the correct length. Now, when I try to run this algorithm using PySpark, I'm confused about whether to use UDFs or pandas UDFs. In addition, it's throwing an error that I can't quite understand. Can someone point me in the right direction here? Thanks!
Error:
ValueError: Invalid udf: the udf argument must be a pandas_udf of type GROUPED_MAP.
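As a rough sketch of the grouped-map pattern the error message refers to (shown with the newer applyInPandas API rather than a GROUPED_MAP pandas_udf; the column names, output schema, and wrapper function below are assumptions, not taken from the question):
import pandas as pd

def algo_wrapper(pdf):
    # pdf is one group, delivered as a pandas DataFrame, so the existing
    # pandas-based algo can be reused unchanged.
    list1, list2, list3 = algo(pdf, "COLUMN1")
    # The function must return a pandas DataFrame matching the declared schema;
    # here the three lists (assumed to be of equal length) become columns.
    return pd.DataFrame({"out1": list1, "out2": list2, "out3": list3})

result = (
    spark_df.groupby("GROUPBY_THIS")
            .applyInPandas(algo_wrapper, schema="out1 long, out2 long, out3 long")
)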

pandas cleaning 1+1 values in a column

I have a column that has the following data
column
------
1+1
2+3
4+5
How do I get pandas to sum these values so that the output is 2, 5, 9 instead of the above?
Many thanks
Your column obviously contains strings, so you must somehow evaluate them. Use the pd.eval function, e.g.
frame['column'].apply(pd.eval)
If you're interested in performance, you could use an alternative such as ast.literal_eval. Thanks to user Serge Ballesta for mentioning it.
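A minimal worked example of the pd.eval approach, using the values from the question (the frame and result column names are placeholders):
import pandas as pd

frame = pd.DataFrame({"column": ["1+1", "2+3", "4+5"]})

# pd.eval parses and evaluates each expression string.
frame["summed"] = frame["column"].apply(pd.eval)
print(frame["summed"].tolist())  # [2, 5, 9]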

Pandas DataFrame replace does not work with inplace=True

In a column of my data frame I have version numbers like 6.3.5, 1.8, 5.10.0 saved as objects, and thus likely as strings. I want to remove the dots so I get 635, 18, 5100. My code idea was this:
for row in dataset.ver:
    row.replace(".", "", inplace=True)
The thing is, it works if I don't set inplace to True, but we want to overwrite the column and save it.
You're iterating through the elements within the DataFrame, in which case I'm assuming they are of type str (or being coerced to str when you replace). str.replace doesn't have an inplace=... argument.
You should be doing this instead:
dataset['ver'] = dataset['ver'].str.replace('.', '', regex=False)
Sander van den Oord in the comments is quite correct to point out:
dataset['ver'].replace("[.]","", inplace=True, regex=True)
This is the way we do operations on a column in Pandas, because in general Pandas favours vectorized operations over for loops. The Pandas developers consider for loops among the least desirable patterns for row-wise operations in Python (see here).
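A small sketch tying this together with the values from the question (the dataset name and values are just a stand-in):
import pandas as pd

dataset = pd.DataFrame({"ver": ["6.3.5", "1.8", "5.10.0"]})

# Vectorised replacement over the whole column; regex=False treats '.' literally.
dataset["ver"] = dataset["ver"].str.replace(".", "", regex=False)
print(dataset["ver"].tolist())  # ['635', '18', '5100']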

Using Pandas DataFrame with Multi-name Columns

I'm using Pandas to store a large dataset that has systematically generated column names. Something like this:
import numpy as np
import pandas as pd
df = pd.DataFrame([[0,1,2],[10,11,12],[20,21,22]],columns=["r0","r1","r2"])
These systematic names also have more meaningful names that users would actually understand. So far, I've been mapping them using a dictionary like so:
altName = {"Objective 1":"r0", "Result 5":"r1", "Parameter 2":"r2"}
so that they could then be accessed like this:
print(df[altName["Objective 1"]])
This works, but it leads to very hard-to-read code (think of a plot command with multiple variables, etc.). I can't simply rename the columns to the friendly names because there are times when I need access to both, and I'm not sure how to support both simultaneously without a dictionary.
Is it possible to assign more than one name to a column, or do some sort of implicit mapping that would let me use both of these access methods:
print(df["r0"])
print(df["Objective 1])
I've thought of making my own subclass that would detect a KeyError, fall back to a secondary dictionary of alternate names, and try that, but I wasn't sure I'd be able to do that while preserving all other DataFrame functionality (I'd self-assess my Python as beginner bordering on intermediate).
Thanks very much for your suggestions.
Yes, you can. DataFrames are just wrappers around numpy arrays, so you can create multiple wrappers over the same array:
An example:
df = pd.DataFrame([[0, 1], [2, 3]], list('AB'), columns=list('CD'))
df2 = pd.DataFrame(df.values, df.index, columns=list('EF'))
df.loc['A', 'C'] = 999
Then df2 is also affected:
In [407]: df2['E']
Out[407]:
A 999
B 2
Name: E, dtype: int32
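Applied to the frame from the question, a minimal sketch might look like this (whether edits to one frame show up in the other depends on the pandas version and its copy-on-write behaviour, so the aliasing should be verified rather than assumed):
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [10, 11, 12], [20, 21, 22]],
                  columns=["r0", "r1", "r2"])

# A second frame over the same underlying values, labelled with the friendly names.
friendly = pd.DataFrame(df.values, df.index,
                        columns=["Objective 1", "Result 5", "Parameter 2"])

print(df["r0"])                 # access by systematic name
print(friendly["Objective 1"])  # access by friendly name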
