Python module export Pandas DataFrame

I'm relatively new to Python, but my understanding of Python modules is that any object defined in a module can be exported. For example, if you had:
# my_module.py
obj1 = 4
obj2 = 8
you can import both these objects simply with from my_module import obj1, obj2.
While working with Pandas, it is common to have code which looks like this (not actual working code):
# pandas_module.py
import pandas as pd
df = pd.DataFrame(...)
df = df.drop()
df = df[df.col > 0]
where the same object (df) is redefined multiple times. If I want to export df, how should I handle this? My guess is that if I simply from pandas_module import df from elsewhere, all the pandas code will run first and I will get the final df as expected, but I'm not sure if this is good practice. Maybe it is better to do something like final_df = df.copy() and export final_df instead. This seems like it would be more understandable for someone who is not that familiar with Python.
So my question is, what is the proper way to handle this situation of exporting a df which is defined multiple times?

Personally, I usually create a function that returns a DataFrame object, such as:
# pandas_module.py
import pandas as pd
def clean_data():
    df = pd.DataFrame(...)
    df = df.drop()
    df = df[df.col > 0]
    return df
Then you can call the function from your main workflow and get the expected DataFrame:
from pandas_module import clean_data
df = clean_data()
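For what it's worth, importing df directly also works, because Python executes a module's top-level code once on first import and then caches the module. A minimal sketch (pandas_module.py is from the question; the importing file name is made up):
# pandas_module.py
import pandas as pd
df = pd.DataFrame({'col': [-1, 2, 3]})
df = df[df.col > 0]  # top-level code runs once, at first import

# main.py (hypothetical)
from pandas_module import df  # binds the final, fully transformed df
The function-based approach above is still preferable, though, since it makes the construction explicit and lets callers decide when (and how often) to build the DataFrame.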

Related

CSV file handling with Pandas for text comparison

I have this csv file called input.csv
KEY;Rate;BYld;DataAsOfDate
CH04;0.719;0.674;2020-01-29
CH03;1.5;0.148;2020-01-29
then I execute the following code:
import pandas as pd
input_df = pd.read_csv('input.csv', sep=";")
input_df.to_csv('output.csv', sep=";")
and get the following output.csv file
KEY;Rate;BYld;DataAsOfDate
CH04;0.7190000000000001;0.674;2020-01-29
CH03;1.5;0.14800000000000002;2020-01-29
I was hoping for and expecting an output like this, to be able to use a tool like winmerge.org to detect real differences on each row (my real code truly modifies the dataframe; this Stack Overflow example is for demonstration only):
KEY;Rate;BYld;DataAsOfDate
CH04;0.719;0.674;2020-01-29
CH03;1.5;0.148;2020-01-29
What is the idiomatic way to achieve such an unmodified output with Pandas?
Python does not use traditional rounding; it uses banker's rounding (round half to even) to prevent bias. However, if being close is not a problem, you could use the round function and replace the 2 with however many decimal places you would like to round to:
import pandas as pd

d = [['CH04', 0.719, 0.674, '2020-01-29']]
df = pd.DataFrame(d, columns=['KEY', 'Rate', 'BYld', 'DataAsOfDate'])
df['Rate'] = df['Rate'].apply(lambda x: round(x, 2))
df
Using @Prokos's idea, I changed the code like this:
import pandas as pd
input_df = pd.read_csv('input.csv', dtype='str', sep=";")
input_df.to_csv('str_output.csv', sep=";", index=False)
and that meets the requirement - all columns come out unchanged.
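If the columns need to stay numeric rather than become strings, to_csv's float_format parameter is another option. A minimal sketch (the '%g' format, which keeps six significant digits by default, and the output file name are my choices, not from the question):
import pandas as pd

input_df = pd.read_csv('input.csv', sep=";")
# '%g' suppresses the representation noise (0.7190000000000001 -> 0.719)
input_df.to_csv('fmt_output.csv', sep=";", index=False, float_format='%g')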

Action on one pandas dataframe does the same to the one it was copied from

I was using this bit of code (re-worked for my application) when I found that df_temp.drop(index=sample.index, inplace=True) performed the same action on df_input, i.e. it emptied it! I was not expecting that at all.
I solved it by changing df_temp = df_input to df_temp = df_input.copy(), but can someone enlighten me as to what is going on here?
import seaborn as sns
import pandas as pd
df_input = sns.load_dataset('diamonds')
df = df_input.loc[[]]
df_temp = df_input # this is where we're sampling from
n_samples = 1000
for _ in range(n_samples):
    sample = df_temp.sample(1)
    df_temp.drop(index=sample.index, inplace=True)
    df = df.append(sample)
assert((df.index.value_counts() > 1).sum() == 0)
df
Pandas does not copy the whole df if you simply assign it to a new variable. After executing df_temp = df_input you end up with two variables referring to the exact same df. It's not that each refers to an identical df; they point to the same one (think: you just gave one df two names). So no matter which variable (name) you use to alter the df, you also change it for the other variable. If you use .copy() you get what you intended: two variables holding two distinct dfs.
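A quick way to see the aliasing in action (a minimal sketch, not from the original post):
import pandas as pd

df_input = pd.DataFrame({'a': [1, 2, 3]})
df_temp = df_input          # alias: one df, two names
print(df_temp is df_input)  # True
df_copy = df_input.copy()   # an independent copy
print(df_copy is df_input)  # False

df_temp.drop(index=[0], inplace=True)
print(len(df_input))  # 2 -- the drop shows through both names
print(len(df_copy))   # 3 -- the copy is untouched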

Return DataFrame from a function in another file

I'm trying to create two files: one that will create a series of dataframes, and another that will import these dataframes into my main file.
It's something like this:
load_data.py
def data_mean():
    import pandas as pd
    global mean_re5200, mean_re2000
    mean_re5200 = pd.read_csv('mean_re5200.csv')
    mean_re2000 = pd.read_csv('mean_re2000.csv')
main_project.py
from load_data import data_mean
When I run the main_project file and type data_mean() in the terminal, all seems fine, but the dataframes aren't saved as variables that I can use. I saw other similar questions here on Stack Overflow, but none were about saving dataframes, only simple variables.
How can I proceed?
The global statement only creates the variables in load_data's own namespace, not in the importing script's, which is why you can't see them after calling data_mean(). Why don't you simply try something like
load_data.py
import pandas as pd
df = pd.DataFrame({"a":list(range(10))})
main.py
from load_data import *
print(df)
or alternatively
load_data.py
import pandas as pd
def data_mean():
    df0 = pd.DataFrame({"a":list(range(10))})
    df1 = pd.DataFrame({"b":list(range(10))})
    return df0, df1
main.py
from load_data import data_mean
df1, df2 = data_mean()
print(df1)

Convert pandas dataframe to json or dict and then back to df with non-unique columns

I need to send a dataframe from a backend to a frontend, so I first need to convert it either to an object that is JSON serialisable or directly to JSON. The problem is that some of my dataframes don't have unique columns. I've looked into the orient parameter, the to_json(), to_dict() and from_dict() methods, but still can't get it to work...
The goal is to be able to convert the df to something json serializable and then back to its initial self.
I'm also having a hard time copy-pasting it using pd.read_clipboard so I've included a sample df causing problems as an image (sorry!).
I found a way to make it work.
Here is a simple reproducible example:
import pandas as pd
import json
# create a simple df with two identically named columns
df = pd.DataFrame([[1, 2, 3, 4]], columns=['col1', 'col2', 'col1', 'col2'])
# orient='split' preserves column order, so duplicate names survive
jsonized_df = df.to_json(orient='split')
# suppose the df is part of a bigger data structure being sent to another app
random_dict = {'foo': 'bar'}
all_data = [random_dict, jsonized_df]
data_to_frontend = json.dumps(all_data)
# then, from the other app
all_data = json.loads(data_to_frontend)
final_df = pd.read_json(all_data[1], orient='split')  # remember to pass orient when reading the json df back as well
The final_df will be identical to the initial df, with order preserved!
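To verify the round trip programmatically, pandas' testing helper can compare the two frames (a small usage sketch, continuing from the code above):
import pandas as pd

# raises an AssertionError if the two frames differ in values, dtypes, or labels
pd.testing.assert_frame_equal(df, final_df)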

How to rename the index of a Dask Dataframe

How would I go about renaming the index on a dask dataframe? I tried it like so
df.index.name = 'foo'
but rechecking df.index.name shows it still being whatever it was previously.
This does not seem like an efficient way to do it, so I wouldn't be surprised if there is something more direct.
Suppose d.index.name starts off as 'foo':
def f(df, name):
    df.index.name = name
    return df
d.map_partitions(f, 'pow')
The output now has index name of 'pow'. If this is done with the threaded scheduler, I think you also change the index name of d in-place (in which case you don't really need the output of map_partitions).
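A self-contained sketch of that pattern (the example frame is made up for illustration; map_partitions passes the extra argument through to f):
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({'a': range(4)})
pdf.index.name = 'foo'
d = dd.from_pandas(pdf, npartitions=2)

def f(df, name):
    # runs once per pandas partition
    df.index.name = name
    return df

d2 = d.map_partitions(f, 'pow')
print(d2.compute().index.name)  # 'pow'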
A bit late, but the following works:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame().assign(si=[1, 2], o=[3, 4], p=[5, 6]).set_index("si")
ddf = dd.from_pandas(df, npartitions=2)
ddf.index = ddf.index.rename("si2")
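Continuing from the snippet above, a quick check that the rename took effect (compute() materializes the frame as pandas):
print(ddf.index.name)            # 'si2'
print(ddf.compute().index.name)  # 'si2'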
I hope this can help someone else out!
