Create dataframe in a loop - python

I would like to create dataframes in a loop and then use those dataframes in another loop. I tried the eval() function but it didn't work.
For example:
for i in range(5):
    df_i = df[(df.age == i)]
Here I would like to create df_0, df_1, etc., and then concatenate these new dataframes after some calculations:
final_df = pd.concat([df_0, df_1])
for i in range(2, 5):
    final_df = pd.concat([final_df, df_i])

You can create a dict of DataFrames x and have i as the dict keys:
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({'age': np.random.randint(0, 5, 20)})
x = {}
for i in range(5):
    x[i] = df[df['age'] == i]
final = pd.concat(x.values())
Then you can refer to individual DataFrames as:
x[1]
Output:
age
5 1
13 1
15 1
And concatenate all of them with:
pd.concat(x.values())
Output:
age
18 0
5 1
13 1
15 1
2 2
6 2
...
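As a compact alternative, a dict comprehension builds the same mapping in one line:
x = {i: df[df['age'] == i] for i in range(5)}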

This way is weird and not recommended, but it can be done.
Answer
for i in range(5):
    exec(f"df_{i} = df[df['age']=={i}]")

def UDF(dfi):
    # do something in user-defined function
    return dfi

for i in range(5):
    exec(f"df_{i} = UDF(df_{i})")

final_df = pd.concat([df_0, df_1])
for i in range(2, 5):
    exec(f"final_df = pd.concat([final_df, df_{i}])")
Better Way 1
Using a list or a dict to store the dataframes is a better way, since you can access each dataframe by an index or a key.
Since another answer (@perl's) shows the way using a dict, I will show the way using a list.
def UDF(dfi):
    # do something in user-defined function
    return dfi

dfs = [df[df['age'] == i] for i in range(5)]
final_df = pd.concat(map(UDF, dfs))
Better Way 2
Since you are using pandas.DataFrame, the groupby function is the 'pandas' way to do what you want (maybe; I'm guessing, since I don't know exactly what you want to do).
def UDF(dfi):
    # do something in user-defined function
    return dfi

final_df = df.groupby('age').apply(UDF)
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
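For instance, with the toy df from the accepted answer and a pass-through UDF (a sketch; replace the body with your real calculation), group_keys=False keeps the group labels out of the index:
def UDF(dfi):
    # placeholder: return the group unchanged
    return dfi

final_df = df.groupby('age', group_keys=False).apply(UDF)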

Related

Check if a pandas Dataframe string column contains all the elements given in an array

I have a dataframe as shown below:
>>> import pandas as pd
>>> df = pd.DataFrame(data = [['app;',1,2,3],['app; web;',4,5,6],['web;',7,8,9],['',1,4,5]],columns = ['a','b','c','d'])
>>> df
a b c d
0 app; 1 2 3
1 app; web; 4 5 6
2 web; 7 8 9
3 1 4 5
I have an input array that looks like this: ["app","web"]
For each of these values I want to check against a specific column of a dataframe and return a decision as shown below:
>>> df.a.str.contains("app")
0 True
1 True
2 False
3 False
Since str.contains only allows me to look for an individual value, I was wondering if there's some other direct way to do the same, something like:
df.a.str.contains(["app","web"]) # Returns TypeError: unhashable type: 'list'
My end goal is not to do an absolute match (df.a.isin(["app", "web"])), but rather 'contains' logic that returns True if those characters are present anywhere in that cell of the data frame.
Note: I can of course use the apply method to create my own function for the same logic, such as:
elementsToLookFor = ["app", "web"]
df[header] = df[header].apply(lambda element: all(a in element for a in elementsToLookFor))
But I am more interested in the optimal algorithm for this and so prefer to use a native pandas function within pandas, or else the next most optimized custom solution.
This should work too:
l = ["app","web"]
df['a'].str.findall('|'.join(l)).map(lambda x: len(set(x)) == len(l))
also this should work as well:
pd.concat([df['a'].str.contains(i) for i in l],axis=1).all(axis = 1)
So many solutions; which one is the most efficient?
The str.contains-based answers are generally fastest, though str.findall is also very fast on smaller dfs:
values = ['app', 'web']
pattern = ''.join(f'(?=.*{value})' for value in values)

def replace_dummies_all(df):
    return df.a.str.replace(' ', '').str.get_dummies(';')[values].all(1)

def findall_map(df):
    return df.a.str.findall('|'.join(values)).map(lambda x: len(set(x)) == len(values))

def lower_contains(df):
    return df.a.astype(str).str.lower().str.contains(pattern)

def contains_concat_all(df):
    return pd.concat([df.a.str.contains(l) for l in values], axis=1).all(1)

def contains(df):
    return df.a.str.contains(pattern)
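A minimal timing harness for these candidates (a sketch, assuming the four-row sample df above; rankings can shift with larger frames):
import timeit

for func in [replace_dummies_all, findall_map, lower_contains, contains_concat_all, contains]:
    t = timeit.timeit(lambda: func(df), number=1000)
    print(f'{func.__name__}: {t:.3f}s per 1000 runs')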
Try with str.get_dummies
df.a.str.replace(' ','').str.get_dummies(';')[['web','app']].all(1)
0 False
1 True
2 False
3 False
dtype: bool
Update: a single regex with lookaheads checks for both substrings in one pass (each (?=.*word) asserts that word appears somewhere in the string):
df['a'].str.contains(r'^(?=.*web)(?=.*app)')
Update 2 (to handle case insensitivity and to ensure the column dtype is str, without which the logic may fail):
elementList = ['app', 'web']
valueString = ''
for eachValue in elementList:
    valueString += f'(?=.*{eachValue})'
df[header] = df[header].astype(str).str.lower()  # ensure case insensitivity and string dtype
result = df[header].str.contains(valueString)
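One caveat (my addition, not from the original answer): if the search terms can contain regex metacharacters, escape them before building the pattern:
import re

valueString = ''.join(f'(?=.*{re.escape(eachValue)})' for eachValue in elementList)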

How to implement python custom function on dictionary of dataframes

I have a dictionary that contains 3 dataframes.
How do I apply a custom function to each dataframe in the dictionary?
In simpler terms, I want to apply the function find_outliers as seen below
# User defined function : find_outliers
#(I)
from scipy import stats
outlier_threshold = 1.5
ddof = 0
def find_outliers(s: pd.Series):
    outlier_mask = np.abs(stats.zscore(s, ddof=ddof)) > outlier_threshold
    # replace boolean values with corresponding strings
    return ['background-color:blue' if val else '' for val in outlier_mask]
To the dictionary of dataframes dict_of_dfs below
# the dataset
import numpy as np
import pandas as pd
df = {
    'col_A': ['A_1001', 'A_1001', 'A_1001', 'A_1001', 'B_1002', 'B_1002', 'B_1002', 'B_1002', 'D_1003', 'D_1003', 'D_1003', 'D_1003'],
    'col_X': [110.21, 191.12, 190.21, 12.00, 245.09, 4321.8, 122.99, 122.88, 134.28, 148.14, 161.17, 132.17],
    'col_Y': [100.22, 199.10, 191.13, 199.99, 255.19, 131.22, 144.27, 192.21, 7005.15, 12.02, 185.42, 198.00],
    'col_Z': [140.29, 291.07, 390.22, 245.09, 4122.62, 4004.52, 395.17, 149.19, 288.91, 123.93, 913.17, 1434.85]
}
df = pd.DataFrame(df)
df
#dictionary_of_dataframes
#(II)
dict_of_dfs = dict(tuple(df.groupby('col_A')))
and lastly, flag outliers in each df of the dict_of_dfs
# end goal is to have find/flag outliers in each `df` of the `dict_of_dfs`
#(III)
desired_cols = ['col_X','col_Y','col_Z']
dict_of_dfs.style.apply(find_outliers, subset=desired_cols)
In summary, I want to apply (I) to (II) and finally flag outliers as in (III).
Thanks for your attempt. :)
Desired output should look like this, but for the three dataframes
This may not be what you want, but this is how I'd approach it; you'll have to work out the details of the function, because you have it written to receive a Series rather than a DataFrame. groupby().apply() will send the subsets of rows, and then you can perform the actions on each subset and return the result.
For consideration: inside the function, you may be able to handle all columns like so:
def find_outliers(x):
    for col in ['col_X', 'col_Y', 'col_Z']:
        outlier_mask = np.abs(stats.zscore(x[col], ddof=ddof)) > outlier_threshold
        x[col] = ['outlier' if val else '' for val in outlier_mask]
    return x

newdf = df.groupby('col_A').apply(find_outliers)
      col_A    col_X    col_Y  col_Z
0    A_1001           outlier
1    A_1001
2    A_1001
3    A_1001  outlier
4    B_1002           outlier
5    B_1002  outlier
6    B_1002
7    B_1002
8    D_1003           outlier
9    D_1003
10   D_1003
11   D_1003
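If the goal is specifically the colored Styler output for each dataframe in the dict, as in the question's (III), a hedged sketch that reuses the question's find_outliers and desired_cols (Styler.apply passes each column as a Series, which matches the function's signature):
styled = {key: d.style.apply(find_outliers, subset=desired_cols)
          for key, d in dict_of_dfs.items()}
styled['A_1001']  # e.g. render one styled dataframe in a notebook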

Pandas: how to add a dataframe inside a cell of another dataframe?

I have an empty dataframe like the following:
simReal2013 = pd.DataFrame(index = np.arange(0,1,1))
Then I read some .csv files as dataframes.
stat = np.arange(0, 5)
xv = [0.005, 0.01, 0.05]
br = [0.001, 0.005]
for i in xv:
    for j in br:
        I = 0
        for s in stat:
            string = 'results/2013/real/run_%d_%f_%f_15.0_10.0_T0_RealNet.csv' % (s, i, j)
            sim = pd.read_csv(string, sep=' ')
            I += np.array(sim.I)
        sim.I = I / 5
        col = '%f_%f' % (i, j)
        simReal2013.insert(0, col, sim)
I would like to add the dataframe that I read in a cell of simReal2013. In doing so I get the following error:
ValueError: Wrong number of items passed 9, placement implies 1
Yes, putting a dataframe inside of a dataframe is probably not the way you want to go, but if you must, this is one way to do it:
df_in = pd.DataFrame([[1, 2, 3]] * 2)
d = {}
d['a'] = df_in
df_out = pd.DataFrame([d])
type(df_out.loc[0, "a"])
>>> pandas.core.frame.DataFrame
Maybe a dictionary of dataframes would suffice for your use.
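Alternatively, wrapping the inner frame in a one-element list also works, since pandas then stores the object itself instead of trying to align it (a sketch; outer and inner are hypothetical names):
outer = pd.DataFrame(index=[0])
inner = pd.DataFrame([[1, 2, 3]] * 2)
outer['a'] = [inner]  # one-element list -> stored as a single object cell
type(outer.loc[0, 'a'])  # pandas.core.frame.DataFrame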

How to implement a select-like function

I have a dataset in Python and its structure is like:
Tree Species     number of trunks
---------------------------------
Acer rubrum      1
Quercus bicolor  1
Quercus bicolor  1
aabbccdd         0
and I have a question: can I implement a function similar to
Select sum(number of trunks)
from trees.data['Number of Trunks']
where x = trees.data["Tree Species"]
group by trees.data["Tree Species"]
in Python? x is an array containing five elements:
x = array(['Acer rubrum', 'Acer saccharum', 'Acer saccharinum',
'Quercus rubra', 'Quercus bicolor'], dtype='<U16')
What I want to do is map each element of x to trees.data["Tree Species"] and calculate the sum of the number of trunks; it should return an array of
array = (sum_num(Acer rubrum), sum_num(Acer saccharum), sum_num(Acer saccharinum),
sum_num(Acer Quercus rubra), sum_num(Quercus bicolor))
You may want to look at Python pandas. That will allow you to do something like:
df.groupby('Tree Species')['Number of Trunks'].sum()
Note that df here is whatever variable name you read your data frame into. I would recommend looking at pandas and lambda functions too.
You can do something like this:
import pandas as pd
df = pd.DataFrame()
tree_species = ["Acer rubrum", "Quercus bicolor", "Quercus bicolor", "aabbccdd"]
no_of_trunks = [1,1,1,0]
df["Tree Species"] = tree_species
df["Number of Trunks"] = no_of_trunks
df.groupby('Tree Species').sum() #This will create a pandas dataframe
df.groupby('Tree Species')['Number of Trunks'].sum() #This will create a pandas series.
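To get the totals in the exact order of the question's x array (with 0 for species that never appear in the data), you can reindex the grouped sum; a sketch:
import numpy as np

x = np.array(['Acer rubrum', 'Acer saccharum', 'Acer saccharinum',
              'Quercus rubra', 'Quercus bicolor'], dtype='<U16')
sums = df.groupby('Tree Species')['Number of Trunks'].sum()
result = sums.reindex(x, fill_value=0).to_numpy()
print(result)  # one sum per species in x, in order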
You can do the same thing by just using dictionaries too:
tree_species = ["Acer rubrum", "Quercus bicolor", "Quercus bicolor", "aabbccdd"]
no_of_trunks = [1,1,1,0]
d = {}
for key, trunk in zip(tree_species, no_of_trunks):
    if key not in d:
        d[key] = 0
    d[key] += trunk
print(d)
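The same accumulation is a bit shorter with collections.defaultdict:
from collections import defaultdict

d = defaultdict(int)
for key, trunk in zip(tree_species, no_of_trunks):
    d[key] += trunk
print(dict(d))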

create names of dataframes in a loop

I need to give names to previously defined dataframes.
I have a list of dataframes :
liste_verif = (dffreesurfer, total, qcschizo)
And I would like to give them a name by doing something like:
for h in liste_verif:
    h.name = str(h)
Would that be possible?
When I test this code, it doesn't work: instead of considering h as a dataframe, Python considers each column of my dataframe.
I would like the names of my dataframes to be 'dffreesurfer', 'total', etc.
You can use a dict comprehension and map the DataFrames to the names in list L:
dffreesurfer = pd.DataFrame({'col1': [7,8]})
total = pd.DataFrame({'col2': [1,5]})
qcschizo = pd.DataFrame({'col2': [8,9]})
liste_verif = (dffreesurfer, total, qcschizo)
L = ['dffreesurfer', 'total', 'qcschizo']
dfs = {L[i]: x for i, x in enumerate(liste_verif)}
print (dfs['dffreesurfer'])
col1
0 7
1 8
print (dfs['total'])
col2
0 1
1 5
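Equivalently, dict with zip builds the same mapping without manual indexing:
dfs = dict(zip(L, liste_verif))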
