How can I manipulate a DataFrame name within a function? - python

How can I manipulate a DataFrame name within a function so that I can have a new DataFrame with a new name that is derived from the input DataFrame name in return?
Let's say I have this:
def some_func(df):
    df_copy = df.copy()
    # some operations
    return df_copy
Whatever df I pass to this function, it should return the new df named ..._copy, e.g. some_func(my_frame) should return my_frame_copy.
Things that I considered are as follows:
As in string operations;
new_df_name = "{}_copy".format(df) -- I know this will not work since the df refers to an object but it just helps to explain what I am trying to do.
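import pandas as pd

def date_timer(df):
    df_copy = df.copy()
    # Select the columns whose name contains 'date'
    dates = df_copy.columns[df_copy.columns.str.contains('date')]
    for col in dates:
        # Replace the ISO 'T' separator, then parse; unparseable values become NaT
        df_copy[col] = pd.to_datetime(df_copy[col].str.replace('T', ' '), errors='coerce')
    return df_copy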
Actually, this was the first thing I tried. If only DataFrame had a "name" attribute that let us manipulate the name, but that does not exist either:
df.name
Maybe an f-string or some other string operation could make it happen. If not, it might not be possible in Python.
I think this might be related to the variable name assignment rules in Python. In a sense, what I want is to reverse-engineer that, but it is probably not possible.
Please advise.

It looks like you're trying to access / dynamically set the global/local namespace of a variable from your program.
Unless your data object belongs to a more structured namespace object, I'd discourage you from dynamically setting names with such a method since a lot can go wrong. As the docs for locals() put it:
Changes may not affect the values of local and free variables used by the interpreter.
The name attribute of your df is not an ideal solution, since that attribute is not set by default. Nor is it particularly common. However, here is a solid SO answer which addresses this.
You might be better off storing your data objects in a dictionary, using dates or something meaningful as keys. Example:
my_data = {}
for my_date in dates:
    df_temp = df.copy(deep=True)  # deep copy ensures no changes propagate to the parent object
    # Modify your df here (not sure what you are trying to do exactly)
    df_temp[my_date] = "foo"
    # Now save that df
    my_data[my_date] = df_temp
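Applied to the original question, the same dictionary idea might look like the sketch below. The helper name add_copy, the frames dictionary, and the sample column are assumptions made up for illustration; the point is only that the derived name ("my_frame_copy") becomes a dictionary key rather than a variable name.

```python
import pandas as pd

def add_copy(frames, name):
    """Store a copy of frames[name] under '<name>_copy' and return the new key.

    `frames` is a plain dict mapping names to DataFrames (hypothetical helper).
    """
    new_name = f"{name}_copy"
    frames[new_name] = frames[name].copy()
    return new_name

# Hypothetical sample data, keyed by the name we want to derive from
frames = {"my_frame": pd.DataFrame({"start_date": ["2021-01-01T10:00:00"]})}
new_name = add_copy(frames, "my_frame")
print(new_name)  # my_frame_copy
```

The copy is then available as frames["my_frame_copy"], and the original frame is left untouched.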
Hope this answers your Q. Feel free to clarify in the comments.

Related

Problem transforming a variable in logs, python

I am using Python. I would like to create a new column which is the log transformation of column 'lights1992'.
I am using the following code:
log_lights1992 = np.log(lights1992)
I obtain the following error:
I have tried two things: adding 1 to each value, and converting the column 'lights1992' to numeric.
city_join['lights1992'] = pd.to_numeric(city_join['lights1992'])
city_join["lights1992"] = city_join["lights1992"] + 1
However, neither solution has worked. The variable 'lights1992' is of type float64. Do you know what the problem could be?
Edit:
The variable 'lights1992' comes from running zonal statistics on a raster 'junk1992'; maybe this affects it.
zs1 = zonal_stats(city_join, junk1992, stats=['mean'], nodata=np.nan)
city_join['lights1992'] = [x['mean'] for x in zs1]
the traceback states:
'DatasetReader' object has no attribute 'log'.
Did you re-assign numpy to something else at some point? I can't find much about 'DatasetReader'; is that a custom class?
EDIT:
I think you would need to pass the whole column because your edit doesn't show a variable named 'lights1992'
so instead of:
np.log(lights1992)
can you try passing the DataFrame's column to log instead?
np.log(city_join['lights1992'])
2ND EDIT:
Since you've reported back that it works I'll dive into the why a little bit.
In your original statement you called the log function and gave it an argument, then you assigned the result to a variable name:
log_lights1992 = np.log(lights1992)
The problem here is that when you give Python text without any quotes, it treats it as a variable name. (See how you have log_lights1992 on the left of the equals sign? You wanted to assign the result of the operation on the right-hand side to the variable name log_lights1992.) In this case, though, I don't think lights1992 referred to your column: a name with no value at all would raise a NameError, and the error message suggests it was bound to a DatasetReader object instead.
So there were two ways to make it work, either what I said earlier:
Instead of giving it a variable name, you give .log the column of the city_join dataframe directly (that's what city_join["lights1992"] is).
Or
You assign the value of that column to the variable name first then you pass it in to .log, like this:
lights1992 = city_join["lights1992"]
log_lights1992 = np.log(lights1992)
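Put together as a self-contained sketch (the values below are invented stand-ins for the real zonal statistics, and the new column name is an assumption), storing the result back on the DataFrame could look like this:

```python
import numpy as np
import pandas as pd

# Invented sample values standing in for the real 'lights1992' column
city_join = pd.DataFrame({"lights1992": [1.0, 2.5, 10.0]})

# Pass the column itself to np.log and store the result as a new column
city_join["log_lights1992"] = np.log(city_join["lights1992"])
```

np.log is applied element-wise, so the new column has one value per row of city_join.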
Hope that clears it up for you!

avoid writing df['column'] twice when doing df['column'] = df['column']

I don't even know how to phrase this but is there a way in Python to reference the text before the equals without having to actually write it again?
** EDIT - I'm using python3 in Jupyter
I seem to spend half my life writing:
df['column'] = df['column'].some_changes
Is there a way to tell Python that I'm referencing the part before the equals sign?
For example, I would write the following, where <% is just to represent the reference to the text before the = (df['column'])
df['column'] = <%.replace(np.nan)
You are looking for in-place methods.
I believe you can pass inplace=True as an argument to many methods in pandas,
so it would look something like
df['column'].replace(np.nan, inplace=True)
edit
You could also do
df["computed_column"] = df["original_column"].many_operations
so you still have access to the original data down the line.
And do all the needed operations at once instead of saving each step.
One of the advantages of inplace not being the default is if you are doing a batch of operations and it fails midway your data is not mangled.
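A hedged sketch of the two styles on a toy frame (the column names and the fill value 0 are assumptions; fillna is used for the in-place variant as a stand-in for replace). Note that calling an in-place method on a selected column, as in df['column'].method(..., inplace=True), is not guaranteed to update df itself in recent pandas versions with copy-on-write, so the in-place call below is made on the DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"column": [1.0, np.nan, 3.0], "other": [np.nan, 2.0, 3.0]})

# Style 1: plain reassignment, the column is named on both sides
df["column"] = df["column"].replace(np.nan, 0)

# Style 2: in-place call on the DataFrame, restricted to one column via a dict
df.fillna({"other": 0}, inplace=True)
```

Both columns end up with their missing values replaced by 0; only the second style avoids writing the column name twice.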

How to dynamically (i.e. through program) change name of a Data dictionary to a dynamically obtained variable

I am using python 3.9.6 in Windows 10.
Similar earlier questions at
(1) Creating a dynamic dictionary name
and
(2) How to obtain assembly name dynamically
were different and do not solve my problem.
My data dictionary(dynamically created):
pcm30={'ABB': '-0.92', 'ZYDUSWELL': 2.05}
Dynamically obtained new dictionary name "pCh0109" is in variable z
I have to create different dictionaries to create a data frame.
Now I want to dynamically (i.e through programming) change the name of the dictionary from
'pcm30'
to
'pCh0109'.
The digits in the new name of the dictionary ('pCh0109') indicate the time of creation of the particular dictionary.
How to do it?
Will be grateful for assistance and help.
I would strongly recommend you don't try this unless you absolutely have to, but here's the simplest approach to do that:
pcm30 = {'ABB': '-0.92', 'ZYDUSWELL': 2.05}
globals()['pCh0109'] = globals().pop('pcm30')
# Your IDE might glare at you here, but it'll work out without errors at runtime
print(pCh0109)
Instead, I suggest using a dictionary if possible; it is much safer. Example below:
def my_func():
    d = {}
    pcm30 = {'ABB': '-0.92', 'ZYDUSWELL': 2.05}
    # Store the dict under the dynamic name; no locals() lookup is needed
    d['pCh0109'] = pcm30
    print(d['pCh0109']['ABB'])
    # -0.92

my_func()
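Since the stated goal is to build a data frame from several such dictionaries, a possible sketch is to keep them in one outer dictionary keyed by the dynamically obtained name (z in the question). The container name snapshots, the second key, and its values are made up for illustration, and the numbers are stored as floats here rather than strings:

```python
import pandas as pd

snapshots = {}  # outer dict: dynamic name -> data dictionary

z = "pCh0109"  # dynamically obtained name, as in the question
snapshots[z] = {"ABB": -0.92, "ZYDUSWELL": 2.05}

# A second, invented snapshot to show how further dictionaries fit in
snapshots["pCh0110"] = {"ABB": -0.75, "ZYDUSWELL": 1.98}

# Outer keys become columns, inner keys become the row index
df = pd.DataFrame(snapshots)
print(df)
```

No variable ever needs renaming: each dictionary is looked up as snapshots[z].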

Pandas dataframe from dict, why?

I can create a pandas dataframe from dict as follows:
d = {'Key':['abc','def','xyz'], 'Value':[1,2,3]}
df = pd.DataFrame(d)
df.set_index('Key', inplace=True)
And also by first creating a series like this:
d = {'abc': 1, 'def': 2, 'xyz': 3}
a = pd.Series(d, name='Value')
df = pd.DataFrame(a)
But not directly like this:
d = {'abc': 1, 'def': 2, 'xyz': 3}
df = pd.DataFrame(d)
I'm aware of the from_dict method, and this also gives the desired result:
d = {'abc': 1, 'def': 2, 'xyz': 3}
pd.DataFrame.from_dict(d, orient='index')
but I don't see why:
(1) a separate method is needed to create a dataframe from dict when creating from series or list works without issue;
(2) how/why creating a dataframe from dict/list of lists works, but not creating from dict directly.
Have found several SE answers that offer solutions, but looking for the 'why' as this behavior seems inconsistent. Can anyone shed some light on what I may be missing here.
There's actually a lot happening here, so let's break it down.
The Problem
There are soooo many different ways to create a DataFrame (from a list of records, dict, csv, ndarray, etc ...) that even for python veterans it can take a long time to understand them all. Hell, within each of those ways, there are EVEN MORE ways to build a DataFrame by tweaking some parameters and whatnot.
For example, for dictionaries (where the values are equal length lists), here are two ways pandas can handle them:
Case 1:
You treat each key-value pair as a column title and its values at each row respectively. In this case, the rows don't have names, and so by default you might just name them by their row index.
Case 2:
You treat each key-value pair as the row's name and its values at each column respectively. In this case, the columns don't have names, and so by default you might just name them by their index.
The Solution
Python is a dynamically typed language (i.e. variables don't declare a type and functions don't declare a return type). As a result, it doesn't have function overloading. So, you basically have two philosophies when you want to create an object class that can have multiple ways of being constructed:
(1) Create only one constructor that checks the input and handles it accordingly, covering all possible options. This can get very bloated and complicated when certain inputs have their own options/parameters and when there's simply just too much variety.
(2) Separate each option into @classmethod's to handle each specific individual way of constructing the object.
The second is generally better, as it really enforces separation of concerns as a SE design principle; however, the user will need to know all the different @classmethod constructor calls as a result. Although, in my opinion, if your object class is complicated enough to have many different construction options, the user should be aware of that anyway.
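A minimal sketch of the second philosophy, using a toy class invented for illustration (Table and from_records are hypothetical names, not pandas API): the plain constructor handles one input shape, and a @classmethod covers another.

```python
class Table:
    """Toy column-oriented container (hypothetical, for illustration only)."""

    def __init__(self, columns):
        # Default constructor: a mapping of column name -> list of values
        self.columns = {name: list(values) for name, values in columns.items()}

    @classmethod
    def from_records(cls, records):
        # Alternate constructor: a list of row dicts that share the same keys
        names = list(records[0]) if records else []
        return cls({name: [rec[name] for rec in records] for name in names})


t1 = Table({"a": [1, 2], "b": [3, 4]})
t2 = Table.from_records([{"a": 1, "b": 3}, {"a": 2, "b": 4}])
```

Both calls produce the same internal layout, but each input shape has its own clearly named entry point instead of one constructor that inspects its argument.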
The Pandas Way
Pandas adopts a sort of mix between the two solutions. It'll use the default behaviour for each input type, and if you want any extra functionality you'll need to use the respective @classmethod constructor.
For example, for dicts, by default, if you pass a dict into the DataFrame constructor, it will handle it as Case 1. If you want to do the second case, you'll need to use DataFrame.from_dict and pass in orient='index' (without orient='index', it would use the default behaviour described in Case 1).
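The two cases can be sketched on a small dictionary as follows (the sample values and the variable names are illustrative assumptions). The last call shows the scalar-valued dict from the question, using the columns argument that from_dict accepts together with orient='index':

```python
import pandas as pd

d = {"abc": [1, 2], "def": [3, 4]}

# Case 1 (default): keys become column labels, rows get a default integer index
by_columns = pd.DataFrame(d)

# Case 2: keys become row labels, columns get default integer labels
by_rows = pd.DataFrame.from_dict(d, orient="index")

# Scalar values: each key becomes a row, with a single named column
scalars = {"abc": 1, "def": 2, "xyz": 3}
value_df = pd.DataFrame.from_dict(scalars, orient="index", columns=["Value"])
```

Here by_columns has columns 'abc' and 'def', while by_rows has 'abc' and 'def' as its index.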
In my opinion, I'm not a fan of this kind of implementation. Personally, it's more confusing than helpful. Honestly, a lot of pandas is designed like that. There's a reason why pandas is the topic of every other python tagged question on stackoverflow.

What is the purpose of the alias method in PySpark?

While learning Spark in Python, I'm having trouble understanding both the purpose of the alias method and its usage. The documentation shows it being used to create copies of an existing DataFrame with new names, then join them together:
>>> from pyspark.sql.functions import *
>>> df_as1 = df.alias("df_as1")
>>> df_as2 = df.alias("df_as2")
>>> joined_df = df_as1.join(df_as2, col("df_as1.name") == col("df_as2.name"), 'inner')
>>> joined_df.select("df_as1.name", "df_as2.name", "df_as2.age").collect()
[Row(name=u'Bob', name=u'Bob', age=5), Row(name=u'Alice', name=u'Alice', age=2)]
My question has two parts:
(1) What is the purpose of the alias input? It seems redundant to give the alias string "df_as1" when we are already assigning the new DataFrame to the variable df_as1. If we were to instead use df_as1 = df.alias("new_df"), where would "new_df" ever appear?
(2) In general, when is the alias function useful? The example above feels a bit artificial, but from exploring tutorials and examples it seems to be used regularly -- I'm just not clear on what value it provides.
Edit: some of my original confusion came from the fact that both DataFrame and Column have alias methods. Nevertheless, I'm still curious about both of the above questions, with question 2 now applying to Column.alias as well.
The variable name is irrelevant and can be whatever you like it to be. It's the alias that will be used in string column identifiers and printouts.
I think that the main purpose of aliases is to achieve better brevity and avoid possible confusion when having conflicting column names. For example what was simply 'age' could be aliased to 'max_age' for brevity after you searched for the biggest value in that column. Or you could have a data frame for employees in a company joined with itself and filter so that you have manager-subordinate pairs. It could be useful to use column names like "manager.name" in such context.
