How to pass a pandas dataframe without using global variables

I was taking an online test and couldn't figure out this question:
import pandas as pd
import numpy as np
def login_table(id_name_verified, id_password):
    """
    :param id_name_verified: (DataFrame) DataFrame with columns: Id, Login, Verified.
    :param id_password: (numpy.array) Two-dimensional NumPy array where each element
                        is an array that contains: Id and Password
    :returns: (None) The function should modify id_name_verified DataFrame in-place.
              It should not return anything.
    """
    pass
id_name_verified = pd.DataFrame([[1, "JohnDoe", True], [2, "AnnFranklin", False]], columns=["Id", "Login", "Verified"])
id_password = np.array([[1, 987340123], [2, 187031122]], np.int32)
login_table(id_name_verified, id_password)
print(id_name_verified)
The expected output of the last print statement is:
   Id        Login   Password
0   1      JohnDoe  987340123
1   2  AnnFranklin  187031122
The only way I could think of to solve this is to treat id_name_verified as a global variable and change the signature from login_table(id_name_verified, id_password) to login_table(id_password). However, this is the wrong solution. How can I solve this?
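No global is needed, because Python passes the DataFrame object itself to the function: any in-place mutation is visible to the caller, whereas rebinding the parameter name (for example id_name_verified = id_name_verified.drop(...)) is not. The following is a minimal sketch of one possible solution, assuming every Id in the DataFrame has a matching row in id_password; the lookup by Id (rather than by row position) is a choice made here, not something the question requires.

```python
import numpy as np
import pandas as pd

def login_table(id_name_verified, id_password):
    # Build an Id -> Password lookup from the two columns of the NumPy array.
    passwords = pd.Series(id_password[:, 1], index=id_password[:, 0])
    # Mutate the caller's DataFrame: drop one column in place, then add another.
    id_name_verified.drop(columns="Verified", inplace=True)
    id_name_verified["Password"] = id_name_verified["Id"].map(passwords)

id_name_verified = pd.DataFrame(
    [[1, "JohnDoe", True], [2, "AnnFranklin", False]],
    columns=["Id", "Login", "Verified"],
)
id_password = np.array([[1, 987340123], [2, 187031122]], np.int32)
login_table(id_name_verified, id_password)
print(id_name_verified)
```

Both drop(..., inplace=True) and column assignment modify the existing object, so the function can return None as the docstring demands.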

Related

Adding tooltip to folium.features.GeoJson from a geopandas dataframe

I am having issues adding tooltips to my folium.features.GeoJson. I can't get columns to display from the dataframe when I select them.
feature = folium.features.GeoJson(df.geometry,
                                  name='Location',
                                  style_function=style_function,
                                  tooltip=folium.GeoJsonTooltip(fields=[df.acquired], aliases=["Time"], labels=True))
ax.add_child(feature)
For some reason when I run the code above it responds with
Name: acquired, Length: 100, dtype: object is not available in the data. Choose from: ().
I can't seem to link the data to my tooltip.

I have made your code an MWE by including some data.
There are two key issues with your code:
You need to pass properties, not just geometry, to folium.features.GeoJson(). Hence I passed df instead of df.geometry.
folium.GeoJsonTooltip() takes a list of property (column) names, not an array of values. Hence I passed a column name such as ["acquired"] instead of the array of values from a DataFrame column.
There is also an implied issue with your code: all DataFrame columns need to contain values that can be serialised to JSON. Hence the conversion of acquired to a string column and the drop().
import geopandas as gpd
import pandas as pd
import shapely.wkt
import io
import folium
df = pd.read_csv(io.StringIO("""ref;lanes;highway;maxspeed;length;name;geometry
A3015;2;primary;40 mph;40.68;Rydon Lane;MULTILINESTRING ((-3.4851169 50.70864409999999, -3.4849879 50.7090007), (-3.4857269 50.70693379999999, -3.4853034 50.7081574), (-3.488620899999999 50.70365289999999, -3.4857269 50.70693379999999), (-3.4853034 50.7081574, -3.4851434 50.70856839999999), (-3.4851434 50.70856839999999, -3.4851169 50.70864409999999))
A379;3;primary;50 mph;177.963;Rydon Lane;MULTILINESTRING ((-3.4763853 50.70886769999999, -3.4786112 50.70811229999999), (-3.4746017 50.70944449999999, -3.4763853 50.70886769999999), (-3.470350900000001 50.71041779999999, -3.471219399999999 50.71028909999998), (-3.465049699999999 50.712158, -3.470350900000001 50.71041779999999), (-3.481215600000001 50.70762499999999, -3.4813909 50.70760109999999), (-3.4934747 50.70059599999998, -3.4930204 50.7007898), (-3.4930204 50.7007898, -3.4930048 50.7008015), (-3.4930048 50.7008015, -3.4919513 50.70168349999999), (-3.4919513 50.70168349999999, -3.49137 50.70213669999998), (-3.49137 50.70213669999998, -3.4911565 50.7023015), (-3.4911565 50.7023015, -3.4909108 50.70246919999999), (-3.4909108 50.70246919999999, -3.4902349 50.70291189999999), (-3.4902349 50.70291189999999, -3.4897693 50.70314579999999), (-3.4805021 50.7077218, -3.4806265 50.70770150000001), (-3.488620899999999 50.70365289999999, -3.4888806 50.70353719999999), (-3.4897693 50.70314579999999, -3.489176800000001 50.70340539999999), (-3.489176800000001 50.70340539999999, -3.4888806 50.70353719999999), (-3.4865751 50.70487679999999, -3.4882604 50.70375799999999), (-3.479841700000001 50.70784459999999, -3.4805021 50.7077218), (-3.4882604 50.70375799999999, -3.488620899999999 50.70365289999999), (-3.4806265 50.70770150000001, -3.481215600000001 50.70762499999999), (-3.4717096 50.71021009999998, -3.4746017 50.70944449999999), (-3.4786112 50.70811229999999, -3.479841700000001 50.70784459999999), (-3.471219399999999 50.71028909999998, -3.4717096 50.71021009999998))"""),
sep=";")
df = gpd.GeoDataFrame(df, geometry=df["geometry"].apply(shapely.wkt.loads), crs="epsg:4326")
df["acquired"] = pd.date_range("8-feb-2022", freq="1H", periods=len(df))
def style_function(x):
    return {"color": "blue", "weight": 3}
ax = folium.Map(
    location=[sum(df.total_bounds[[1, 3]]) / 2, sum(df.total_bounds[[0, 2]]) / 2],
    zoom_start=12,
)
# datetime is not JSON serializable, so format it as a string column
df["tt"] = df["acquired"].dt.strftime("%Y-%b-%d %H:%M")
feature = folium.features.GeoJson(df.drop(columns="acquired"),
                                  name='Location',
                                  style_function=style_function,
                                  tooltip=folium.GeoJsonTooltip(fields=["tt"], aliases=["Time"], labels=True))
ax.add_child(feature)

Implement df.groupby('user')['item'].apply(np.array) in cuDF

Is there any way to replicate this simple pandas functionality to cuDF?
Note that array lengths are varying.
An example of the expected output using pandas and NumPy (CuPy in the cuDF case) can be found below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'user':[0,1,0,2,1], 'item':[1,2,3,4,5]})
res = df.groupby('user')['item'].apply(np.array)
res
# Output:
# user
# 0 [1, 3]
# 1 [2, 5]
# 2 [4]
# Name: item, dtype: object

How to make my modified pandas/numpy .where function adaptable to different sizes of a list parameter?

I want to create my own function that scans a number of user-specified columns in a DataFrame and creates a new variable that is 1 if all the specified columns equal 1, and 0 otherwise.
The following code only handles the case where users pass exactly two columns to scan.
import numpy as np
class Tagger:
    def __init__(self):
        pass
    def summing_all_tagger(self, df, tag_var_list, tag_value=1):
        # This tagger creates a tag of 1 if all variables in tag_var_list equal tag_value; otherwise 0
        self.df = df
        self.tag_var_list = tag_var_list
        self.tag_value = tag_value
        self.df['temp'] = np.where((self.df[self.tag_var_list[0]]==self.tag_value) &
                                   (self.df[self.tag_var_list[1]]==self.tag_value), 1, 0)
        # Return the column from self.df (self.df_pin is never defined on the object)
        return self.df['temp']
Then I can call it in the main.py file:
import pandas as pd
import datetime
import feature_tagger.feature_tagger as ft
tagger_obj = ft.Tagger()
df_pin['PIN_RX&TIME_TAG'] = tagger_obj.summing_all_tagger(df_pin, tag_var_list=['PIN_RX_TAG', 'PIN_TIME_TAG'], tag_value=1)
How can I modify it so users can enter as many column names for tag_var_list as they want?
Such as
df_pin['PIN_RX&TIME_TAG'] = tagger_obj.summing_all_tagger(df_pin, tag_var_list=['PIN_RX_TAG', 'PIN_TIME_TAG', 'PIN_NAME_TAG'], tag_value=1)
# or
df_pin['PIN_RX&TIME_TAG'] = tagger_obj.summing_all_tagger(df_pin, tag_var_list=['PIN_RX_TAG'], tag_value=1)

np.all() is your friend.
self.df['temp'] = np.where(np.all(self.df[self.tag_var_list] == self.tag_value, axis=1), 1, 0)

I think you can build a list of boolean masks with a list comprehension, reduce the masks to a single one, and cast it to integers to get a 0/1 column:
L = [self.df[x]==self.tag_value for x in tag_var_list]
self.df['temp'] = np.logical_and.reduce(L).astype(int)
Or use DataFrame.all and cast the boolean mask to integers:
self.df['temp'] = (self.df[self.tag_var_list] == self.tag_value).all(axis=1).astype(int)
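As a quick check that the DataFrame.all approach handles one, two, or three columns alike, here is a small standalone sketch. The function name all_equal_tag and the sample values in df_pin are made up for illustration; only the column names come from the question.

```python
import pandas as pd

def all_equal_tag(df, tag_var_list, tag_value=1):
    # Compare every listed column with tag_value, then require all of them
    # to match within each row; cast the boolean result to 0/1.
    return (df[tag_var_list] == tag_value).all(axis=1).astype(int)

# Hypothetical sample data using the column names from the question.
df_pin = pd.DataFrame({
    "PIN_RX_TAG":   [1, 1, 0, 1],
    "PIN_TIME_TAG": [1, 0, 0, 1],
    "PIN_NAME_TAG": [1, 1, 1, 0],
})

one = all_equal_tag(df_pin, ["PIN_RX_TAG"])
two = all_equal_tag(df_pin, ["PIN_RX_TAG", "PIN_TIME_TAG"])
three = all_equal_tag(df_pin, ["PIN_RX_TAG", "PIN_TIME_TAG", "PIN_NAME_TAG"])
print(one.tolist(), two.tolist(), three.tolist())
```

Because the column list is used directly as a selector, the length of tag_var_list never appears in the code.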

Get the same hash value for a Pandas DataFrame each time

My goal is to get a unique hash value for a DataFrame that I load from a .csv file.
The whole point is to get the same hash each time I call hash() on it.
My idea was to create the function
def _get_array_hash(arr):
    arr_hashable = arr.values
    arr_hashable.flags.writeable = False
    hash_ = hash(arr_hashable.data)
    return hash_
which takes the underlying NumPy array, makes it immutable, and hashes its buffer.
INLINE UPD.
As of 08.11.2016, this version of the function doesn't work anymore. Instead, you should use
hash(df.values.tobytes())
See comments for the Most efficient property to hash for numpy array.
END OF INLINE UPD.
It works for a regular pandas DataFrame:
In [12]: data = pd.DataFrame({'A': [0], 'B': [1]})
In [13]: _get_array_hash(data)
Out[13]: -5522125492475424165
In [14]: _get_array_hash(data)
Out[14]: -5522125492475424165
But then I try to apply it to a DataFrame obtained from a .csv file:
In [15]: fpath = 'foo/bar.csv'
In [16]: data_from_file = pd.read_csv(fpath)
In [17]: _get_array_hash(data_from_file)
Out[17]: 6997017925422497085
In [18]: _get_array_hash(data_from_file)
Out[18]: -7524466731745902730
Can somebody explain how that's possible?
I can create a new DataFrame out of it, like
new_data = pd.DataFrame(data=data_from_file.values,
                        columns=data_from_file.columns,
                        index=data_from_file.index)
and it works again:
In [25]: _get_array_hash(new_data)
Out[25]: -3546154109803008241
In [26]: _get_array_hash(new_data)
Out[26]: -3546154109803008241
But my goal is to preserve the same hash value for a dataframe across application launches in order to retrieve some value from cache.

As of pandas 0.20.1, you can use the little-known (and poorly documented) hash_pandas_object, which was recently made public in pandas.util. It returns one hash value for each row of the DataFrame (and works on Series etc. too).
import pandas as pd
import numpy as np
np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,4))
df = pd.DataFrame(arr)
print(df)
# 0 1 2 3
# 0 42 foo 42 42
# 1 foo foo 42 bar
# 2 42 42 42 42
from pandas.util import hash_pandas_object
h = hash_pandas_object(df)
print(h)
# 0 5559921529589760079
# 1 16825627446701693880
# 2 7171023939017372657
# dtype: uint64
You can always do hash_pandas_object(df).sum() if you want an overall hash of all rows.
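If a single digest is needed that also depends on row order and survives application restarts, the row hashes can be fed into hashlib instead of being summed (Python's built-in hash() of bytes is randomized per process unless PYTHONHASHSEED is fixed). This is only a sketch: stable_df_hash is a hypothetical helper name, and it deliberately ignores column labels, which would have to be hashed separately if they matter.

```python
import hashlib
import pandas as pd
from pandas.util import hash_pandas_object

def stable_df_hash(df):
    # One uint64 per row (index included), computed with pandas' fixed hash key.
    row_hashes = hash_pandas_object(df, index=True).to_numpy()
    # Digest the raw bytes of the row hashes; unlike sum(), this is order-sensitive.
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()

a = pd.DataFrame({"A": [0, 1], "B": ["x", "y"]})
b = a.copy()                 # same content, different object
c = a.assign(A=[0, 2])       # one value changed

print(stable_df_hash(a) == stable_df_hash(b))
print(stable_df_hash(a) == stable_df_hash(c))
```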

Joblib provides a hashing function optimized for objects containing numpy arrays (e.g. pandas dataframes).
import joblib
joblib.hash(df)

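I had a similar problem (checking whether a DataFrame has changed) and solved it by hashing the msgpack serialization. This seems stable across reloads of the same data. Note that to_msgpack was deprecated in pandas 0.25 and removed in 1.0, so this only works on older versions.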
import pandas as pd
import hashlib
DATA_FILE = 'data.json'
data1 = pd.read_json(DATA_FILE)
data2 = pd.read_json(DATA_FILE)
assert hashlib.md5(data1.to_msgpack()).hexdigest() == hashlib.md5(data2.to_msgpack()).hexdigest()
assert hashlib.md5(data1.values.tobytes()).hexdigest() != hashlib.md5(data2.values.tobytes()).hexdigest()

This function seems to work fine:
from hashlib import sha256
def hash_df(df):
    s = str(df.columns) + str(df.index) + str(df.values)
    return sha256(s.encode()).hexdigest()
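One caveat with hashing str(df.values): NumPy abbreviates the string form of large arrays (by default, more than 1000 elements) with "...", so differences in the middle of a large frame do not reach the hash. The sketch below illustrates this and contrasts it with hashing the full CSV text; hash_df_full is a hypothetical name, and the CSV approach is just one possible alternative, not something proposed in the thread.

```python
from hashlib import sha256

import numpy as np
import pandas as pd

def hash_df(df):
    # Same approach as above: hash the printed representation.
    s = str(df.columns) + str(df.index) + str(df.values)
    return sha256(s.encode()).hexdigest()

def hash_df_full(df):
    # Hypothetical alternative: to_csv() serializes every value, nothing is elided.
    return sha256(df.to_csv().encode()).hexdigest()

df1 = pd.DataFrame({"x": np.arange(5000)})
df2 = df1.copy()
df2.loc[2500, "x"] = -1      # change a value that str() will abbreviate away

print(hash_df(df1) == hash_df(df2))            # printed form is identical
print(hash_df_full(df1) == hash_df_full(df2))  # full serialization differs
```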

Pandas Dataframe or Panel to 3d numpy array

Setup:
pdf = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
pdf['a'][2:]=pdf['a'][0]
pdf['a'][:2]=pdf['a'][1]
pdf.set_index(['a','b'])
output:
                          c         d         e
a        b
0.439502 0.115087  0.832546  0.760513  0.776555
         0.609107  0.247642  0.031650  0.727773
0.995370 0.299640  0.053523  0.565753  0.857235
         0.392132  0.832560  0.774653  0.213692
Each data series is grouped by the index level a, and b represents a time index for the other features of a. Is there a way to get pandas to produce a 3D NumPy array that reflects the a groupings? Currently it reads the data as two-dimensional, so pdf.shape outputs (4, 5). What I would like is for the array to have the following variable-length form:
array([[[-1.38655912, -0.90145951, -0.95106951, 0.76570984],
[-0.21004144, -2.66498267, -0.29255182, 1.43411576],
[-0.21004144, -2.66498267, -0.29255182, 1.43411576]],
[[ 0.0768149 , -0.7566995 , -2.57770951, 0.70834656],
[-0.99097395, -0.81592084, -1.21075386, 0.12361382]]])
Is there a native Pandas way to do this? Note that number of rows per a grouping in the actual data is variable, so I cannot just transpose or reshape pdf.values. If there isn't a native way, what's the best method for iteratively constructing the arrays from hundreds of thousands of rows and hundreds of columns?

I just had an extremely similar problem and solved it like this:
a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)))
output:
array([[[ 0.47780308, 0.93422319, 0.00526572, 0.41645868, 0.82089215],
[ 0.47780308, 0.15372096, 0.20948369, 0.76354447, 0.27743855]],
[[ 0.75146799, 0.39133973, 0.25182206, 0.78088926, 0.30276705],
[ 0.75146799, 0.42182369, 0.01166461, 0.00936464, 0.53208731]]])
verifying it is 3d, a3d.shape gives (2, 2, 5).
Lastly, to make the newly created dimension the last dimension (instead of the first) then use:
a3d = np.dstack(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)))
which has a shape of (2, 5, 2)
For cases where the data is ragged (as brought up by CharlesG in the comments) you can use something like the following if you want to stick to a numpy solution. But be aware that the best strategy to deal with missing data varies from case to case. In this example we simply add zeros for the missing rows.
Example setup with ragged shape:
pdf = pd.DataFrame(np.random.rand(5,5), columns = list('abcde'))
pdf['a'][2:]=pdf['a'][0]
pdf['a'][:2]=pdf['a'][1]
pdf.set_index(['a','b'])
dataframe:
                          c         d         e
a        b
0.460013 0.577535  0.299304  0.617103  0.378887
         0.167907  0.244972  0.615077  0.311497
0.318823 0.640575  0.768187  0.652760  0.822311
         0.424744  0.958405  0.659617  0.998765
         0.077048  0.407182  0.758903  0.273737
One possible solution:
n_max = pdf.groupby('a').size().max()
a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)
.apply(lambda x: np.pad(x, ((0, n_max-len(x)), (0, 0)), 'constant'))))
a3d.shape gives (2, 3, 5)
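Since DataFrame.as_matrix no longer exists in current pandas, the same zero-padding idea can be sketched with to_numpy(). The values in column a are fixed here (an assumption for reproducibility) so that one group has two rows and the other has three; the assignment also avoids the chained indexing used in the setup above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
pdf = pd.DataFrame(rng.random((5, 5)), columns=list("abcde"))
# Hypothetical group keys: two rows in one group, three in the other.
pdf["a"] = [0.25, 0.25, 0.75, 0.75, 0.75]

# One 2-D array per group (groupby yields the groups in sorted key order).
groups = [g.to_numpy() for _, g in pdf.groupby("a")]
n_max = max(len(g) for g in groups)

# Pad each group with zero rows at the bottom so all share the same shape.
a3d = np.stack([
    np.pad(g, ((0, n_max - len(g)), (0, 0)), mode="constant")
    for g in groups
])
print(a3d.shape)
```

Zeros are only one possible fill value; passing constant_values=np.nan to np.pad would mark the missing rows more explicitly.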

as_matrix is deprecated. Here we assume the first key is a and that the groups in a may have different lengths; this method handles both cases.
import pandas as pd
import numpy as np
from typing import List
def make_cube(df: pd.DataFrame, idx_cols: List[str]) -> np.ndarray:
    """Make an array cube from a DataFrame.
    Args:
        df: DataFrame
        idx_cols: columns defining the dimensions of the cube
    Returns:
        multi-dimensional array
    """
    assert len(set(idx_cols) & set(df.columns)) == len(idx_cols), 'idx_cols must be subset of columns'
    df = df.set_index(keys=idx_cols)  # set_index returns a new frame, so the caller's df is untouched
    # Index values are assumed to be 0-based integers, so each axis needs len(level) slots
    idx_dims = [len(level) for level in df.index.levels]
    idx_dims.append(len(df.columns))
    cube = np.empty(idx_dims)
    cube.fill(np.nan)  # positions with no matching row stay NaN
    cube[tuple(np.array(df.index.to_list()).T)] = df.values
    return cube
Test:
pdf = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
pdf['a'][2:]=pdf['a'][0]
pdf['a'][:2]=pdf['a'][1]
# a and b must be integers
pdf1 = (pdf.assign(a=lambda df: df.groupby(['a']).ngroup())
           .assign(b=lambda df: df.groupby(['a'])['b'].cumcount())
       )
make_cube(pdf1, ['a', 'b']).shape
gives (2, 2, 3).
pdf = pd.DataFrame(np.random.rand(5,5), columns = list('abcde'))
pdf['a'][2:]=pdf['a'][0]
pdf['a'][:2]=pdf['a'][1]
pdf1 = (pdf.assign(a=lambda df: df.groupby(['a']).ngroup())
           .assign(b=lambda df: df.groupby(['a'])['b'].cumcount())
       )
make_cube(pdf1, ['a', 'b']).shape
gives (2, 3, 3).

panel.values
will return a NumPy array directly. By necessity, it will have the highest acceptable dtype, as everything is smushed into a single 3-D NumPy array. It will be a new array and not a view of the pandas data (no matter the dtype).

Instead of the deprecated .as_matrix() or, alternatively, .values, the pandas documentation recommends using .to_numpy():
'Warning: We recommend using DataFrame.to_numpy() instead.'
