create names of dataframes in a loop - python

I need to give names to previously defined dataframes.
I have a list of dataframes:
liste_verif = (dffreesurfer, total, qcschizo)
I would like to give each of them a name by doing something like:
for h in liste_verif:
    h.name = str(h)
Would that be possible?
When I test this code, it doesn't work: instead of treating h as a dataframe, Python seems to consider each column of my dataframe.
I would like the names of my dataframes to be 'dffreesurfer', 'total', etc.

You can use a dict comprehension to map each name in the list L to the DataFrame at the same position in liste_verif:
import pandas as pd
dffreesurfer = pd.DataFrame({'col1': [7,8]})
total = pd.DataFrame({'col2': [1,5]})
qcschizo = pd.DataFrame({'col2': [8,9]})
liste_verif = (dffreesurfer,total,qcschizo)
L = ['dffreesurfer','total','qcschizo']
dfs = {L[i]:x for i,x in enumerate(liste_verif)}
print (dfs['dffreesurfer'])
   col1
0     7
1     8
print (dfs['total'])
   col2
0     1
1     5
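A note on why the loop in the question cannot work: str(h) returns the printed contents of the dataframe, not the name of the variable that holds it, so the names have to be supplied separately. The sketch below pairs names and frames with zip and, as an optional extra that is not part of the answer above, also stores the name on each frame through DataFrame.attrs (which the pandas documentation marks as experimental).

```python
import pandas as pd

dffreesurfer = pd.DataFrame({"col1": [7, 8]})
total = pd.DataFrame({"col2": [1, 5]})
qcschizo = pd.DataFrame({"col2": [8, 9]})

# The names must be written out by hand: Python objects do not know
# which variable names refer to them.
names = ["dffreesurfer", "total", "qcschizo"]
frames = [dffreesurfer, total, qcschizo]

# Pair each name with the frame at the same position.
dfs = dict(zip(names, frames))

# Optional: keep the name on the frame itself (attrs is experimental).
for name, frame in dfs.items():
    frame.attrs["name"] = name

print(total.attrs["name"])
```

The dict lookup (dfs['total']) is the more robust of the two; attrs is only convenient when a frame is passed around without its key.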

Related

split dataframe into multiple dataframes using loop and lists

I'm trying to create a dataframe from lists. I have 2 lists, and I split each of them into multiple smaller lists. From those smaller lists I create a dataframe, and now I want to split the resulting dataframe.
Below is the code that creates the dataframe from the lists:
import itertools
import pandas as pd
origin_list = ['60.17202,24.91805','51.13747,1.33148','55.65348,22.94213','61.17202,24.91805','62.13747,1.33148','63.65348,22.94213']
Destination_list = ['51.07906,12.13216','52.96035,1.905025','53.05306,16.13416','54.07906,3.13216','55.03406,12.13216','56.07906,12.13216','57.96035,1.905025','58.05306,16.13416','59.07906,3.13216','60.03406,12.13216']
# Code for splitting list into multiple lists
origin_li = [origin_list[i:i + 3] for i in range(0, len(origin_list), 3)]
destination_li = [Destination_list[i:i + 4] for i in range(0, len(Destination_list), 4)]
# Output of above 2 lines
# origin_li = [['60.17202,24.91805', '51.13747,1.33148', '55.65348,22.94213'], ['61.17202,24.91805', '62.13747,1.33148', '63.65348,22.94213']]
# destination_li = [['51.07906,12.13216', '52.96035,1.905025', '53.05306,16.13416', '54.07906,3.13216'], ['55.03406,12.13216', '56.07906,12.13216', '57.96035,1.905025', '58.05306,16.13416'], ['59.07906,3.13216', '60.03406,12.13216']]
df1 = pd.DataFrame()
# loop over every chunk of origins
for i in origin_li:
    print(len(i))
    # pair every origin in this chunk with every destination in each chunk
    for j in destination_li:
        sub_df = pd.DataFrame(list(itertools.product(i,j)))
        df1 = pd.concat([df1,sub_df])
print(df1)
Running the above code gives me one combined dataframe (df1).
Now I want to split that output dataframe by origin_li.
How do I split the dataframe into multiple dataframes?
You can use groupby on the row position to create your dataframes, here in blocks of 4 consecutive rows:
import numpy as np
dfs = dict(list(df1.groupby(np.arange(len(df1)) // 4)))
Output:
>>> dfs[1]
                  0                  1
4  51.13747,1.33148  51.07906,12.13216
5  51.13747,1.33148  52.96035,1.905025
6  51.13747,1.33148  53.05306,16.13416
7  51.13747,1.33148   54.07906,3.13216
>>> dfs[5]
                    0                  1
8   55.65348,22.94213  55.03406,12.13216
9   55.65348,22.94213  56.07906,12.13216
10  55.65348,22.94213  57.96035,1.905025
11  55.65348,22.94213  58.05306,16.13416
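Fixed blocks of 4 rows only line up while every destination chunk has exactly 4 entries; the last chunk in the question has 2, so later blocks would mix origins. An alternative sketch, assuming the goal is one dataframe per chunk of origin_li: tag each row with the index of its origin chunk while building the frame, then group on that tag. The short labels (o1, d1, ...) stand in for the coordinate strings, and the column names origin, destination and origin_chunk are made up for this illustration.

```python
import itertools
import pandas as pd

# Stand-ins for the coordinate strings in the question.
origin_li = [["o1", "o2", "o3"], ["o4", "o5", "o6"]]
destination_li = [["d1", "d2", "d3", "d4"], ["d5", "d6", "d7", "d8"], ["d9", "d10"]]

frames = []
for chunk_id, origins in enumerate(origin_li):
    for destinations in destination_li:
        sub_df = pd.DataFrame(
            list(itertools.product(origins, destinations)),
            columns=["origin", "destination"],
        )
        # Remember which chunk of origin_li produced these rows.
        sub_df["origin_chunk"] = chunk_id
        frames.append(sub_df)

# Concatenate once at the end instead of growing df1 inside the loop.
df1 = pd.concat(frames, ignore_index=True)

# One dataframe per origin chunk, keyed by the chunk index.
by_chunk = {chunk_id: group for chunk_id, group in df1.groupby("origin_chunk")}

print(by_chunk[1].head())
```

Grouping on "origin" instead of "origin_chunk" would give one dataframe per individual origin.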

pandas create a subset according to a value in a column

I have this dataframe:
86,1/28/2004 0:00:00,16.9
86,5/25/2004 0:00:00,17.01
86,7/22/2004 0:00:00,17.06
87,11/15/2004 0:00:00,7.39
87,3/14/2005 0:00:00,7.59
86,11/15/2004 0:00:00,17.29
86,3/14/2005 0:00:00,17.38
86,4/19/2005 0:00:00,17.43
86,5/19/2005 0:00:00,17.28
87,1/22/2004 0:00:00,7.44
87,5/13/2004 0:00:00,7.36
I would like to work on two separate dataframes according to the value (id) in the first column. Ideally, I would like to have:
87,11/15/2004 0:00:00,7.39
87,3/14/2005 0:00:00,7.59
87,1/22/2004 0:00:00,7.44
87,5/13/2004 0:00:00,7.36
and
86,1/28/2004 0:00:00,16.9
86,5/25/2004 0:00:00,17.01
86,7/22/2004 0:00:00,17.06
86,11/15/2004 0:00:00,17.29
86,3/14/2005 0:00:00,17.38
86,4/19/2005 0:00:00,17.43
86,5/19/2005 0:00:00,17.28
As you can see, one dataframe has only 87 in the first column and the other has only 86.
This is how I read the dataframe:
dfr = pd.read_csv(fname,sep=',',index_col=False,header=None)
I don't think groupby is the right option, if I have understood the command correctly.
I was thinking about using query, like this:
aa = dfr.query(dfr.iloc[:,0]==86)
However, I get this error:
expr must be a string to be evaluated, <class 'pandas.core.series.Series'> given
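You can simply filter your dataframe with a boolean mask (replace 'ColName' with the label of your id column, which is 0 when the file is read with header=None):
df_86 = df.loc[df['ColName'] == 86,:]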
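The error in the question comes from passing a boolean Series to query, which expects the condition as a string. A minimal sketch of both forms on the question's data, assuming the columns are given names when reading the file (id, date and value are made-up names; the original file has no header):

```python
import io
import pandas as pd

raw = """86,1/28/2004 0:00:00,16.9
86,5/25/2004 0:00:00,17.01
86,7/22/2004 0:00:00,17.06
87,11/15/2004 0:00:00,7.39
87,3/14/2005 0:00:00,7.59
86,11/15/2004 0:00:00,17.29
86,3/14/2005 0:00:00,17.38
86,4/19/2005 0:00:00,17.43
86,5/19/2005 0:00:00,17.28
87,1/22/2004 0:00:00,7.44
87,5/13/2004 0:00:00,7.36
"""

# Hypothetical column names, supplied because the file has no header row.
dfr = pd.read_csv(io.StringIO(raw), header=None, names=["id", "date", "value"])

# Boolean mask: keep the rows whose id equals 86.
df_86 = dfr[dfr["id"] == 86]

# query takes the condition as a string, not as a Series.
df_87 = dfr.query("id == 87")

print(df_86)
print(df_87)
```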
Another way is to do it dynamically, without having to specify the groups beforehand.
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': np.repeat([1, 2, 3], 4), 'col2': np.repeat([10, 11, 12], 4)})
Get the unique groupings:
groups = df['ID'].unique()
Create an empty dict to store the new data frames:
new_dfs = {}
Loop through and create new data frames from the slices:
for group in groups:
    name = "ID" + str(group)
    new_dfs[name] = df[df['ID'] == group]
new_dfs['ID1']
Which gives:
   ID  col2
0   1    10
1   1    10
2   1    10
3   1    10
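The same dict can also be built with groupby, which the question set aside but which does this split directly. This is a sketch on the same toy data, not a replacement for the loop above; iterating over a groupby yields (key, sub-dataframe) pairs.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": np.repeat([1, 2, 3], 4), "col2": np.repeat([10, 11, 12], 4)})

# Each iteration yields the group key and the rows belonging to it.
new_dfs = {f"ID{key}": group for key, group in df.groupby("ID")}

print(new_dfs["ID2"])
```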

Python - Convert columns with specific base_name into rows

I have the following format of a csv file:
id a_mean_val_1 a_mean_val_2 a_var_val_1 a_var_val_2 b_mean_val_1 b_mean_val_2 b_var_val_1 b_var_val_2
I would like to melt the columns 1 and 2 for all a and b features into rows as follows:
id a_mean a_var b_mean b_var
1 val1 val1 val1 val1
1 val2 val2 val2 val2
I am unsure how to achieve this with the melt function in pandas. I would like an expression that keeps the base name (for example a_mean) as the root column and melts every column carrying a suffix for that variable into rows.
Is there another method I could use to specify these rules?
Thank you
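Like this:
import pandas
rows = []
for line in open('mycsv.csv'):
    # strip the trailing newline before splitting
    fields = line.strip().split(',')
    rows.append( fields[0::2] )
    rows.append( fields[1::2] )
# the DataFrame constructor takes column names via columns=, not fields=
df = pandas.DataFrame(rows, columns=['a_mean','a_var','b_mean','b_var'])
That doesn't provide an ID number. Is the ID part of the CSV file?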
I went through the columns and, if a column belonged to a base column, appended its values to a list. Finally, I converted those lists to a dataframe.
So this code works regardless of the order of the columns.
[UPDATED WITH ID]
Since we're adding entire columns one after the other, the ids always start from the top, go to the end, and then repeat. So we can take the "id" column of the original df and repeat it once per suffixed column (val_1, val_2, ...) to get the "id" for the new df.
Here's the CSV I used:
id,a_mean_val_1,a_mean_val_2,a_var_val_1,a_var_val_2,b_mean_val_1,b_mean_val_2,b_var_val_1,b_var_val_2
1,a_mean_val_1, a_mean_val_2, a_var_val_1, a_var_val_2, b_mean_val_1 ,b_mean_val_2, b_var_val_1, b_var_val_2
2,a_mean_val_5, a_mean_val_6, a_var_val_5, a_var_val_6, b_mean_val_5 ,b_mean_val_6, b_var_val_5, b_var_val_6
import pandas as pd
df = pd.read_csv('data_csv.csv')
# Ignore ID
columns = df.columns.tolist()[1:]
df_dict = {}
base = ['a_mean', 'a_var', 'b_mean', 'b_var']
for bas in base:
    df_dict[bas] = []
    for col in columns:
        # for example, "a_mean" in "a_mean_val_1" then append
        if bas in col:
            df_dict[bas] = df_dict[bas] + df[col].tolist()
ids = df['id'].tolist()
df_new = pd.DataFrame(df_dict)
# repeat the ids once per suffixed column of each base name
df_new['id'] = ids * (len(columns) // len(base))
a_mean        a_var        b_mean        b_var        id
a_mean_val_1  a_var_val_1  b_mean_val_1  b_var_val_1  1
a_mean_val_5  a_var_val_5  b_mean_val_5  b_var_val_5  2
a_mean_val_2  a_var_val_2  b_mean_val_2  b_var_val_2  1
a_mean_val_6  a_var_val_6  b_mean_val_6  b_var_val_6  2
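For the rule the question asks about (keep the base name, turn the numeric suffix into rows), pandas also ships pd.wide_to_long. The sketch below is an alternative to the loop above rather than either answerer's method; the numeric values are invented for the illustration, and n is a made-up name for the column that receives the suffix.

```python
import pandas as pd

# Toy data in the question's column layout; the values are made up.
df = pd.DataFrame({
    "id": [1, 2],
    "a_mean_val_1": [1.0, 5.0],
    "a_mean_val_2": [2.0, 6.0],
    "a_var_val_1": [0.1, 0.5],
    "a_var_val_2": [0.2, 0.6],
    "b_mean_val_1": [10.0, 50.0],
    "b_mean_val_2": [20.0, 60.0],
    "b_var_val_1": [1.1, 5.5],
    "b_var_val_2": [2.2, 6.6],
})

# Everything before the final "_<number>" is the stub (base) name.
stubs = ["a_mean_val", "a_var_val", "b_mean_val", "b_var_val"]

long_df = (
    pd.wide_to_long(df, stubnames=stubs, i="id", j="n", sep="_")
    .reset_index()
    # Drop the "_val" part so the columns read a_mean, a_var, ...
    .rename(columns=lambda c: c.replace("_val", ""))
    .sort_values(["id", "n"], ignore_index=True)
)

print(long_df)
```

Each id ends up with one row per suffix, and the suffix itself is kept in the n column so the origin of every row stays traceable.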

Create dataframe in a loop

I would like to create dataframes in a loop and then use these dataframes in another loop. I tried the eval() function but it didn't work.
For example:
for i in range(5):
    df_i = df[(df.age == i)]
Here I would like to create df_0, df_1, etc., and then concatenate these new dataframes after some calculations:
final_df = pd.concat(df_0,df_1)
for i in range(2:5):
    final_df = pd.concat(final_df, df_i)
You can create a dict of DataFrames x, with i as the dict keys:
import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame({'age': np.random.randint(0, 5, 20)})
x = {}
for i in range(5):
    x[i] = df[df['age']==i]
final = pd.concat(x.values())
Then you can refer to individual DataFrames as:
x[1]
Output:
    age
5     1
13    1
15    1
And concatenate all of them with:
pd.concat(x.values())
Output:
    age
18    0
5     1
13    1
15    1
2     2
6     2
...
This way is unusual and not recommended, but it can be done.
Answer
# works at module level only; exec cannot create local variables inside a function
for i in range(5):
    exec(f"df_{i} = df[df['age']=={i}]")
def UDF(dfi):
    # do something in user-defined function
    return dfi
for i in range(5):
    exec(f"df_{i} = UDF(df_{i})")
# pd.concat takes a list of dataframes
final_df = pd.concat([df_0, df_1])
for i in range(2, 5):
    final_df = pd.concat([final_df, eval(f"df_{i}")])
Better Way 1
Using a list or a dict to store the dataframes is a better way, since you can access each dataframe by an index or a key.
Since another answer shows the way using a dict (#perl), I will show you the way using a list.
def UDF(dfi):
    # do something in user-defined function
    return dfi
dfs = [df[df['age']==i] for i in range(5)]
final_df = pd.concat(map(UDF, dfs))
Better Way 2
Since you are using pandas.DataFrame, the groupby function is the 'pandas' way to do what you want (maybe; I'm guessing, since I don't know exactly what you want to do).
def UDF(dfi):
    # do something in user-defined function
    return dfi
final_df = df.groupby('age').apply(UDF)
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
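To make the split, calculate, concatenate pattern concrete, here is a small runnable sketch. The score column and the add_centered_score function are invented for the illustration (the question does not say what the calculation is), and the groups are iterated explicitly rather than passed through apply, which keeps the behaviour the same across pandas versions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"age": rng.integers(0, 5, 20), "score": rng.random(20)})


def add_centered_score(group):
    """Hypothetical per-group calculation: subtract the group's mean score."""
    out = group.copy()
    out["centered"] = out["score"] - out["score"].mean()
    return out


# Split by age, transform each piece, then glue the pieces back together.
pieces = [add_centered_score(group) for _, group in df.groupby("age")]
final_df = pd.concat(pieces)

print(final_df.head())
```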

Text To Column Function

I am trying to write my own function in Python 3.5, but I'm not having much luck.
I have a data frame with 17 columns and 1,200 rows (tiny).
One of the columns is called "Placement". Each row of this column contains text that follows this naming convention:
Campaign_Publisher_Site_AdType_AdSize_Device_Audience_Tactic_
The following code works perfectly and does exactly what I need; I just don't want to repeat it for every data set I have:
df_detailed = df['Placement'].str[0:-1].str.split('_', expand=True).astype(str)
df_detailed = df.join(df_detailed)
new_columns = [...]  # then I rename the columns labelled 0, 1, 2, etc.
df_detailed.columns = new_columns
df_detailed.head()
What I'm trying to do is build a function that takes any column with _ as the delimiter and splits it across new columns.
I have tried the following (but unfortunately defining my own functions is something I'm horrible at):
def text_to_column(df):
    df_detailed = df['Placement'].str[0:-1].str.split('_', expand=True).astype(str)
    headings = df_detailed.columns
    headings.replace(" ", "_")
    df_detailed = df.join(df_detailed)
    df_detailed.columns = headings
    return (df)
and I get the following error: "AttributeError: 'RangeIndex' object has no attribute 'replace'"
The end goal is to write a function where I can pass in the column name; it separates the values contained in that column into new columns and then joins them back to my original data frame.
If I'm being ridiculous, please let me know. If someone can help me, it would be greatly appreciated.
Thanks,
Adrian
You need the rename function to replace column names:
headings = df_detailed.columns
headings.replace(" ", "_")
change to (str(x) is needed because the labels created by split are integers):
df_detailed = df_detailed.rename(columns=lambda x: str(x).replace(" ", "_"))
Or convert the columns with to_series, because replace does not work on an Index (column names):
headings.replace(" ", "_")
change to:
headings = headings.to_series().replace(" ", "_")
Also:
df_detailed = df['Placement'].str[0:-1].str.split('_', expand=True).astype(str)
can be changed to:
df_detailed = df['Placement'].str.rstrip('_').str.split('_', expand=True).astype(str)
EDIT:
Sample:
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'Placement': ['Campaign_Publisher_Site_AdType_AdSize_Device_Audience_Tactic_', 'a_b_c_d_f_g_h_i_']})
print (df)
                                           Placement  a
0  Campaign_Publisher_Site_AdType_AdSize_Device_A...  1
1                                   a_b_c_d_f_g_h_i_  2
#input is DataFrame and column name
def text_to_column(df, col):
    df_detailed = df[col].str.rstrip('_').str.split('_', expand=True).astype(str)
    #replace columns names if necessary
    df_detailed.columns = df_detailed.columns.to_series().replace(" ", "_")
    #remove column and join new df
    df_detailed = df.drop(col, axis=1).join(df_detailed)
    return df_detailed
df = text_to_column(df, 'Placement')
print (df)
   a         0          1     2       3       4       5         6       7
0  1  Campaign  Publisher  Site  AdType  AdSize  Device  Audience  Tactic
1  2         a          b     c       d       f       g         h       i
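The question also mentions renaming the new columns (labelled 0, 1, 2, ...) afterwards. One possible extension of the function above, offered as a sketch rather than as part of the answer: accept the separator and an optional list of names. The function name split_column and its sep and names parameters are invented for this illustration.

```python
import pandas as pd


def split_column(df, col, sep="_", names=None):
    """Split df[col] on sep into new columns and join them back to df.

    names, if given, must contain one label per resulting column;
    otherwise the new columns are called <col>_0, <col>_1, ...
    """
    parts = df[col].str.rstrip(sep).str.split(sep, expand=True)
    if names is not None:
        parts.columns = list(names)
    else:
        parts.columns = [f"{col}_{i}" for i in parts.columns]
    # Drop the original text column and attach the split columns.
    return df.drop(columns=col).join(parts)


df = pd.DataFrame({"a": [1, 2], "Placement": ["Campaign_Publisher_Site_", "x_y_z_"]})
named = split_column(df, "Placement", names=["campaign", "publisher", "site"])
default = split_column(df, "Placement")

print(named)
print(default)
```

Passing the names in keeps the renaming step inside the function, so the same call works for every data set that follows the naming convention.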
