How to implement a select-like function - python

I have a dataset in Python structured like this:

Tree Species     Number of Trunks
---------------------------------
Acer rubrum      1
Quercus bicolor  1
Quercus bicolor  1
aabbccdd         0
My question is: can I implement a function in Python similar to this SQL?

Select sum(number of trunks)
from trees.data['Number of Trunks']
where x = trees.data["Tree Species"]
group by trees.data["Tree Species"]

Here x is an array containing five elements:

x = array(['Acer rubrum', 'Acer saccharum', 'Acer saccharinum',
           'Quercus rubra', 'Quercus bicolor'], dtype='<U16')

What I want to do is map each element in x to trees.data["Tree Species"] and calculate the sum of the number of trunks. It should return an array of

array = (sum_num(Acer rubrum), sum_num(Acer saccharum), sum_num(Acer saccharinum),
         sum_num(Quercus rubra), sum_num(Quercus bicolor))

Have a look at pandas. It will allow you to do something like

df.groupby('Tree Species')['Number of Trunks'].sum()

Note that df here is whatever variable name you read your data frame into. I would also recommend looking into pandas and lambda functions.

You can do something like this:

import pandas as pd

df = pd.DataFrame()
tree_species = ["Acer rubrum", "Quercus bicolor", "Quercus bicolor", "aabbccdd"]
no_of_trunks = [1, 1, 1, 0]
df["Tree Species"] = tree_species
df["Number of Trunks"] = no_of_trunks

df.groupby('Tree Species').sum()  # this will create a pandas DataFrame
df.groupby('Tree Species')['Number of Trunks'].sum()  # this will create a pandas Series

You can do the same thing with plain dictionaries too:

tree_species = ["Acer rubrum", "Quercus bicolor", "Quercus bicolor", "aabbccdd"]
no_of_trunks = [1, 1, 1, 0]
d = {}
for key, trunk in zip(tree_species, no_of_trunks):
    if key not in d:
        d[key] = 0
    d[key] += trunk
print(d)
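If you then need the sums in the exact order of your x array (as in the original question), you can reindex the grouped Series; fill_value=0 gives 0 for species with no rows. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Tree Species": ["Acer rubrum", "Quercus bicolor", "Quercus bicolor", "aabbccdd"],
    "Number of Trunks": [1, 1, 1, 0],
})

x = np.array(['Acer rubrum', 'Acer saccharum', 'Acer saccharinum',
              'Quercus rubra', 'Quercus bicolor'])

# group, sum, then align the result with the order of x
sums = df.groupby('Tree Species')['Number of Trunks'].sum()
result = sums.reindex(x, fill_value=0).to_numpy()
print(result)  # sums in the same order as x
```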


How to get keys from values in dictionary and use it in dataframe?

I have a dataframe df_x_encode:
x_test_encode
0 [0.1260023, -0.014597204, -0.079445906, -0.055...
1 [0.0083509395, 0.09799187, -0.05743032, -0.000...
2 [-0.05807189, 0.11802298, -0.031580053, -0.064...
3 [0.1260023, -0.014597204, -0.079445906, -0.055...
4 [0.121216424, -0.017603464, -0.090226464, -0.0...
I have a dict whose values are the arrays from the column x_test_encode and whose keys are text, as follows:
{'Strengthening the field is a must ': array([ 1.75993890e-02, 7.26785734e-02, -7.36519024e-02, -2.17226259e-02,
3.65523808e-02, -4.50823084e-03, 6.18522726e-02, 1.35725755e-02,
-1.65322982e-02, -1.93105303e-02, -6.45413473e-02, -1.43367276e-02,
3.43437083e-02, -5.04908897e-02, -7.43871846e-04, -2.44313944e-02,
2.88490783e-02, -2.72445306e-02, 5.23326918e-02, 4.61216345e-02,
2.41497066e-04, -8.29233676e-02, -9.53390170e-03, -7.67266843e-03,..],
.
.
.
I want to add a column x_test whose values are taken from the dict keys.
E.g.:
x_test_encode text
0 [0.1260023, -0.014597204, -0.079445906, -0.055... This is to be noted that..
1 [0.0083509395, 0.09799187, -0.05743032, -0.000... Strengthening the perfect..
2 [-0.05807189, 0.11802298, -0.031580053, -0.064...
3 [0.1260023, -0.014597204, -0.079445906, -0.055...
4 [0.121216424, -0.017603464, -0.090226464, -0.0...
I am unable to look up the keys from the values and then map them onto the rows of the dataframe.
Any help on this?
Make a new dictionary for use with Series.map, converting the (unhashable) arrays to tuples so they can serve as keys:
f = {tuple(value): key for key, value in d.items()}
Make a Series of tuples from the DataFrame column (of arrays):
s = df['a'].apply(tuple)
Map the dictionary to that Series and assign it to a new column:
df = df.assign(new=s.map(f))
Test data
import numpy as np
import pandas as pd
rng = np.random.default_rng()
w = rng.integers(-10,10,(4,))
x = rng.integers(-10,10,(4,))
y = rng.integers(-10,10,(4,))
z = rng.integers(-10,10,(4,))
q = pd.Series({0:w,1:x,2:y,3:z,4:x+1})
df = pd.DataFrame({'a':q})
d = {'q':w,'r':x,'s':y,'t':z}
The new column will contain NaN for any row that doesn't match the dictionary.
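Putting the pieces together as a self-contained sketch, with small made-up arrays standing in for the embedding vectors:

```python
import numpy as np
import pandas as pd

# toy arrays standing in for the embeddings
w = np.array([1.0, 2.0])
x = np.array([3.0, 4.0])

df = pd.DataFrame({'a': [w, x, w]})
d = {'first text': w, 'second text': x}

# tuples are hashable, so they can be dict keys and map targets
f = {tuple(value): key for key, value in d.items()}
s = df['a'].apply(tuple)
df = df.assign(text=s.map(f))
print(df['text'].tolist())  # ['first text', 'second text', 'first text']
```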

How to implement python custom function on dictionary of dataframes

I have a dictionary that contains 3 dataframes.
How do I apply a custom function to each dataframe in the dictionary?
In simpler terms, I want to apply the function find_outliers as seen below
# User-defined function: find_outliers
# (I)
import numpy as np
import pandas as pd
from scipy import stats

outlier_threshold = 1.5
ddof = 0

def find_outliers(s: pd.Series):
    outlier_mask = np.abs(stats.zscore(s, ddof=ddof)) > outlier_threshold
    # replace boolean values with corresponding strings
    return ['background-color:blue' if val else '' for val in outlier_mask]
To the dictionary of dataframes dict_of_dfs below
# the dataset
import numpy as np
import pandas as pd
df = {
'col_A':['A_1001', 'A_1001', 'A_1001', 'A_1001', 'B_1002','B_1002','B_1002','B_1002','D_1003','D_1003','D_1003','D_1003'],
'col_X':[110.21, 191.12, 190.21, 12.00, 245.09,4321.8,122.99,122.88,134.28,148.14,161.17,132.17],
'col_Y':[100.22,199.10, 191.13,199.99, 255.19,131.22,144.27,192.21,7005.15,12.02,185.42,198.00],
'col_Z':[140.29, 291.07, 390.22, 245.09, 4122.62,4004.52,395.17,149.19,288.91,123.93,913.17,1434.85]
}
df = pd.DataFrame(df)
df
#dictionary_of_dataframes
#(II)
dict_of_dfs=dict(tuple(df.groupby('col_A')))
and lastly, flag outliers in each df of the dict_of_dfs
# end goal is to have find/flag outliers in each `df` of the `dict_of_dfs`
#(III)
desired_cols = ['col_X','col_Y','col_Z']
dict_of_dfs.style.apply(find_outliers, subset=desired_cols)
In summary, I want to apply (I) to (II) and finally flag the outliers as in (III).
Thanks for your attempt. :)
Desired output should look like this, but for the three dataframes
This may not be exactly what you want, but here is how I'd approach it; you'll have to work out the details of the function, because you have it written to receive a Series rather than a DataFrame. groupby().apply() will send the subsets of rows, and then you can perform the actions on each subset and return the result.
For consideration:
Inside the function you may be able to handle all columns like so:
def find_outliers(x):
    for col in ['col_X', 'col_Y', 'col_Z']:
        outlier_mask = np.abs(stats.zscore(x[col], ddof=ddof)) > outlier_threshold
        x[col] = ['outlier' if val else '' for val in outlier_mask]
    return x

newdf = df.groupby('col_A').apply(find_outliers)
col_A col_X col_Y col_Z
0 A_1001 outlier
1 A_1001
2 A_1001
3 A_1001 outlier
4 B_1002 outlier
5 B_1002 outlier
6 B_1002
7 B_1002
8 D_1003 outlier
9 D_1003
10 D_1003

Split List of values to dataframe in Python

So I was trying to split a list of values into a dataframe in Python.
Here is a sample of my list:
ini_string1 = "Time=2014-11-07 00:00:00,strangeness=0.0001,p-value=0.19,deviation=0.78,D_Range=low'"
templist = []
for i in range(5):
    templist.append(ini_string1)
Now I was trying to create a dataframe with Time, Strangeness, P-Values, Deviation, D_Range as columns.
I was able to build a dataframe from a single ini_string value, but could not make it work for a list of values.
Below is the sample code I tried with a single ini_string value:
import pandas as pd

lst_dict = []
cols = ['Time', 'Strangeness', 'P-Values', 'Deviation', 'Is_Deviation']
# Initialising string
for i in range(5):
    ini_string1 = "Time=2014-11-07 00:00:00,strangeness=0.0001,p-value=0.19,deviation=0.78,D_Range=low'"
    tempstr = ini_string1
    res = dict(item.split("=") for item in tempstr.split(","))
    lst_dict.append({'Time': res['Time'],
                     'Strangeness': res['strangeness'],
                     'P-Values': res['p-value'],
                     'Deviation': res['deviation'],
                     'Is_Deviation': res['D_Range']})
print(lst_dict)
strdf = pd.DataFrame(lst_dict, columns=cols)
I could not figure out the implementation for a list of values.
The code below will do the job:
from collections import defaultdict
import pandas as pd

ini_string1 = "Time=2014-11-07 00:00:00,strangeness=0.0001,p-value=0.19,deviation=0.78,D_Range='low'"
ini_string2 = "Time=2015-12-07 00:00:00,strangeness=0.0005,p-value=0.31,deviation=0.01,D_Range='high'"
ini_strings = [ini_string1, ini_string2]

dd = defaultdict(list)
for ini_str in ini_strings:
    for key_val in ini_str.split(','):
        k, v = key_val.split('=')
        dd[k].append(v)

df = pd.DataFrame(dd)
Read more about defaultdict - How does collections.defaultdict work?
Python has other interesting data structures - https://docs.python.org/2/library/collections.html
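One caveat with key_val.split('='): it raises a ValueError if a value ever contains an '=' itself. Passing maxsplit=1 splits only on the first '=', which makes the parsing slightly more robust. A sketch:

```python
from collections import defaultdict
import pandas as pd

ini_strings = [
    "Time=2014-11-07 00:00:00,strangeness=0.0001,p-value=0.19,deviation=0.78,D_Range='low'",
    "Time=2015-12-07 00:00:00,strangeness=0.0005,p-value=0.31,deviation=0.01,D_Range='high'",
]

dd = defaultdict(list)
for ini_str in ini_strings:
    for key_val in ini_str.split(','):
        k, v = key_val.split('=', 1)  # split only on the first '='
        dd[k].append(v)

df = pd.DataFrame(dd)
print(df.columns.tolist())  # ['Time', 'strangeness', 'p-value', 'deviation', 'D_Range']
```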

Create dataframe in a loop

I would like to create dataframes in a loop and then use those dataframes in another loop. I tried the eval() function but it didn't work.
For example:
for i in range(5):
    df_i = df[(df.age == i)]
There I would like to create df_0, df_1, etc. And then concatenate these new dataframes after some calculations:
final_df = pd.concat([df_0, df_1])
for i in range(2, 5):
    final_df = pd.concat([final_df, df_i])
You can create a dict of DataFrames x and access each of them by its key:
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({'age': np.random.randint(0, 5, 20)})
x = {}
for i in range(5):
    x[i] = df[df['age'] == i]
final = pd.concat(x.values())
Then you can refer to individual DataFrames as:
x[1]
Output:
age
5 1
13 1
15 1
And concatenate all of them with:
pd.concat(x.values())
Output:
age
18 0
5 1
13 1
15 1
2 2
6 2
...
This way is weird and not recommended, but it can be done.
Answer
for i in range(5):
    exec(f"df_{i} = df[df['age']=={i}]")

def UDF(dfi):
    # do something in user-defined function
    ...

for i in range(5):
    exec(f"df_{i} = UDF(df_{i})")

final_df = pd.concat([df_0, df_1])
for i in range(2, 5):
    final_df = pd.concat([final_df, eval(f"df_{i}")])
Better Way 1
Using a list or a dict to store the dataframes is a better way, since you can access each dataframe by an index or a key.
Since another answer shows the way using a dict (@perl), I will show the way using a list:
def UDF(dfi):
    # do something in user-defined function
    ...

dfs = [df[df['age'] == i] for i in range(5)]
final_df = pd.concat(map(UDF, dfs))
Better Way 2
Since you are using pandas.DataFrame, the groupby function is the 'pandas' way to do what you want (maybe, I guess, since I don't know exactly what you want to do. LOL):
def UDF(dfi):
    # do something in user-defined function
    ...

final_df = df.groupby('age').apply(UDF)
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
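As a concrete sketch of Better Way 2, with a made-up UDF that computes each row's share of its group's total (the function name and columns are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [0, 0, 1, 1, 1],
                   'score': [10.0, 20.0, 1.0, 2.0, 3.0]})

def udf(g: pd.DataFrame) -> pd.DataFrame:
    # example transformation: each row's share of the group's total score
    g = g.copy()
    g['share'] = g['score'] / g['score'].sum()
    return g

# group_keys=False keeps the original index instead of adding a group level
out = df.groupby('age', group_keys=False).apply(udf)
print(out['share'].tolist())  # group-wise shares, aligned with the input rows
```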

create names of dataframes in a loop

I need to give names to previously defined dataframes.
I have a list of dataframes:
liste_verif = (dffreesurfer, total, qcschizo)
And I would like to give them names by doing something like:
for h in liste_verif:
    h.name = str(h)
Would that be possible?
When I test this code, it doesn't work: instead of treating h as a dataframe, Python considers each column of my dataframe.
I would like the names of my dataframes to be 'dffreesurfer', 'total', etc.
You can use a dict comprehension to map the DataFrames to their names in list L:
dffreesurfer = pd.DataFrame({'col1': [7,8]})
total = pd.DataFrame({'col2': [1,5]})
qcschizo = pd.DataFrame({'col2': [8,9]})
liste_verif = (dffreesurfer,total,qcschizo)
L = ['dffreesurfer','total','qcschizo']
dfs = {L[i]:x for i,x in enumerate(liste_verif)}
print (dfs['dffreesurfer'])
col1
0 7
1 8
print (dfs['total'])
col2
0 1
1 5
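Equivalently, zip pairs the names with the dataframes without the manual enumerate indexing (a stylistic alternative, not a correction):

```python
import pandas as pd

dffreesurfer = pd.DataFrame({'col1': [7, 8]})
total = pd.DataFrame({'col2': [1, 5]})
qcschizo = pd.DataFrame({'col2': [8, 9]})

liste_verif = (dffreesurfer, total, qcschizo)
L = ['dffreesurfer', 'total', 'qcschizo']

# zip stops at the shorter sequence, so keep L and liste_verif the same length
dfs = dict(zip(L, liste_verif))
print(sorted(dfs))  # ['dffreesurfer', 'qcschizo', 'total']
```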
