split dataframe into multiple dataframes using loop and lists

split dataframe into multiple dataframes using loop and lists - python

I'm attempting to create Dataframe using list. I have 2 List, I'm splitting list into multiple lists. using that multiple lists I'm creating dataframe and now I want to split that created dataframe.
below is the code of creating dataframe using list:
origin_list = ['60.17202,24.91805','51.13747,1.33148','55.65348,22.94213','61.17202,24.91805','62.13747,1.33148','63.65348,22.94213']
Destination_list = ['51.07906,12.13216','52.96035,1.905025','53.05306,16.13416','54.07906,3.13216','55.03406,12.13216','56.07906,12.13216','57.96035,1.905025','58.05306,16.13416','59.07906,3.13216','60.03406,12.13216']
# Code for splitting list into multiple lists
origin_li = [origin_list[i:i + 3] for i in range(0, len(origin_list), 3)]
destination_li = [Destination_list[i:i + 4] for i in range(0, len(Destination_list), 4)]
# Output of above 2 lines
# origing_li = [['60.17202,24.91805', '51.13747,1.33148', '55.65348,22.94213'], ['61.17202,24.91805', '62.13747,1.33148', '63.65348,22.94213']]
# destination_li = [['51.07906,12.13216', '52.96035,1.905025', '53.05306,16.13416', '54.07906,3.13216'], ['55.03406,12.13216', '56.07906,12.13216', '57.96035,1.905025', '58.05306,16.13416'], ['59.07906,3.13216', '60.03406,12.13216']]
df1 = pd.DataFrame()
# loop for every list
for i in origin_li:
print(len(i))
for j in destination_li:
sub_df = pd.DataFrame(list(itertools.product(i,j)))
df1 = pd.concat([df1,sub_df])
print(df1)
by running above code I'm getting an output like:
Now I want to split that output_dataframe by origin_li. For eg.
How do I split dataframe into multiple dataframes?

You can use groupby to create your dataframes:
dfs = dict(list(df1.groupby(np.arange(len(df1)) // 4)))
Output:
>>> dfs[1]
0 1
4 51.13747,1.33148 51.07906,12.13216
5 51.13747,1.33148 52.96035,1.905025
6 51.13747,1.33148 53.05306,16.13416
7 51.13747,1.33148 54.07906,3.13216
>>> dfs[5]
0 1
8 55.65348,22.94213 55.03406,12.13216
9 55.65348,22.94213 56.07906,12.13216
10 55.65348,22.94213 57.96035,1.905025
11 55.65348,22.94213 58.05306,16.13416

Related

Removing index which is not common between 2 dataframes

I have the following code:
import pandas.util.testing as testing
df = testing.makeDataFrame()
df
This this I have created 2 dataframes with one dataframe have 2 less lines than the original one.
This is df - Original
A B C D
OdhGFPa5Kw -0.686378 -1.210838 1.160708 0.903309
gelZFj4BG5 1.603112 1.852592 -0.065482 0.684566
mp3Aq5ueGD 0.254211 -0.788877 -0.626789 0.109116
pBtz9DHxUZ -0.970632 0.982661 -0.463984 -0.123727
K28pzbdYcX -1.311220 -2.121306 1.209484 -1.695901
71ZFgWaeDE 1.887420 0.337702 -0.176539 0.149089
alWOjkQ2eZ 1.997701 -0.354276 1.997802 -0.086803
This is df1 - with 2 less lines
A B C D
OdhGFPa5Kw -0.686378 -1.210838 1.160708 0.903309
gelZFj4BG5 1.603112 1.852592 -0.065482 0.684566
mp3Aq5ueGD 0.254211 -0.788877 -0.626789 0.109116
pBtz9DHxUZ -0.970632 0.982661 -0.463984 -0.123727
K28pzbdYcX -1.311220 -2.121306 1.209484 -1.695901
What I am trying to do is to remove all the rows which are not common between the two dataframes. To do this, we find the duplicate index in the two columns.
duplicates = set(df.index).intersection(df1.index)
Could you please advise how can I remove rows where index is not in the duplicates ?

If you want to remove the indices in place:
idx = df.index.difference(df1.index)
df.drop(idx, inplace=True)
If you want to create a new object:
idx = df.index.intersection(df1.index)
new_df = df.loc[idx]

Apply a function row by row using other dataframes' rows as list inputs in python

I'm trying to apply a function row-by-row which takes 5 inputs, 3 of which are lists. I want these lists to come from each row of 3 correspondings dataframes.
I've tried using 'apply' and 'lambda' as follows:
sol['tf_dd']=sol.apply(lambda tsol, rfsol, rbsol:
taurho_difdif(xy=xy,
l=l,
t=tsol,
rf=rfsol,
rb=rbsol),
axis=1)
However I get the error <lambda>() missing 2 required positional arguments: 'rfsol' and 'rbsol'
The DataFrame sol and the DataFrames tsol, rfsol and rbsol all have the same length. For each row, I want the entire row from tsol, rfsol and rbsol to be input as three lists.
Here is much simplified example (first with single lists, which I then want to replicate row by row with dataframes):
The output with single lists is a single value (120). With dataframes as inputs I want an output dataframe of length 10 where all values are 120.
t=[1,2,3,4,5]
rf=[6,7,8,9,10]
rb=[11,12,13,14,15]
def simple_func(t, rf, rb):
x=sum(t)
y=sum(rf)
z=sum(rb)
return x+y+z
out=simple_func(t,rf,rb)
# dataframe rows as lists
tsol=pd.DataFrame((t,t,t,t,t,t,t,t,t,t))
rfsol=pd.DataFrame((rf,rf,rf,rf,rf,rf,rf,rf,rf,rf))
rbsol=pd.DataFrame((rb,rb,rb,rb,rb,rb,rb,rb,rb,rb))
out2 = pd.DataFrame(index=range(len(tsol)), columns=['output'])
out2['output'] = out2.apply(lambda tsol, rfsol, rbsol:
simple_func(t=tsol.tolist(),
rf=rfsol.tolist(),
rb=rbsol.tolist()),
axis=1)

Try to use "name" field in Series Type to get index value, and then get the same index for the other DataFrame
import pandas as pd
import numpy as np
def postional_sum(inot, df1, df2, df3):
"""
Get input index and gather the same position for the other DataFrame collection
"""
position = inot.name
x = df1.iloc[position].sum()
y = df2.iloc[position].sum()
z = df3.iloc[position].sum()
return x + y + z
# dataframe rows as lists
tsol = pd.DataFrame(np.random.randn(10, 5), columns=range(5))
rfsol = pd.DataFrame(np.random.randn(10, 5), columns=range(5))
rbsol = pd.DataFrame(np.random.randn(10, 5), columns=range(5))
out2 = pd.DataFrame(index=range(len(tsol)), columns=["output"])
out2["output"] = out2.apply(lambda x: postional_sum(x, tsol, rfsol, rbsol), axis=1)
out2
Hope this helps!

When you run df.apply() with axis=1, it does not pass on the columns as individual arguments to the function, but as a Series object, as explained here. The correct way to do this would be
out2['output'] = out2.apply(lambda row:
simple_func(t=row["tsol"],
rf=row["rfsol"],
rb=row["rbsol"]),
axis=1)

You can eliminate the simple function using this:
out2["output"] = tsol.sum(axis=1) + rfsol.sum(axis=1) + rbsol.sum(axis=1)
Here is the complete code:
t=[1,2,3,4,5]
rf=[6,7,8,9,10]
rb=[11,12,13,14,15]
# dataframe rows as lists
tsol=pd.DataFrame((t,t,t,t,t,t,t,t,t,t))
rfsol=pd.DataFrame((rf,rf,rf,rf,rf,rf,rf,rf,rf,rf))
rbsol=pd.DataFrame((rb,rb,rb,rb,rb,rb,rb,rb,rb,rb))
out2 = pd.DataFrame(index=range(len(tsol)), columns=["output"])
out2["output"] = tsol.sum(axis=1) + rfsol.sum(axis=1) + rbsol.sum(axis=1)
print(out2)
OUTPUT:
output
0 120
1 120
2 120
3 120
4 120
5 120
6 120
7 120
8 120
9 120

pandas create a subset according to a value in a column

I have this dataframe:
86,1/28/2004 0:00:00,16.9
86,5/25/2004 0:00:00,17.01
86,7/22/2004 0:00:00,17.06
87,11/15/2004 0:00:00,7.39
87,3/14/2005 0:00:00,7.59
86,11/15/2004 0:00:00,17.29
86,3/14/2005 0:00:00,17.38
86,4/19/2005 0:00:00,17.43
86,5/19/2005 0:00:00,17.28
87,1/22/2004 0:00:00,7.44
87,5/13/2004 0:00:00,7.36
I would like to work on two separate dataframe according to the value (id) of the first column. Ideally, I would like to have:
87,11/15/2004 0:00:00,7.39
87,3/14/2005 0:00:00,7.59
87,1/22/2004 0:00:00,7.44
87,5/13/2004 0:00:00,7.36
and
86,1/28/2004 0:00:00,16.9
86,5/25/2004 0:00:00,17.01
86,7/22/2004 0:00:00,17.06
86,11/15/2004 0:00:00,17.29
86,3/14/2005 0:00:00,17.38
86,4/19/2005 0:00:00,17.43
86,5/19/2005 0:00:00,17.28
As you can see I have one dataframe with all 87 in the first column and another with 86.
This is how I read the dataframe:
dfr = pd.read_csv(fname,sep=',',index_col=False,header=None)
I think that groupby is not the right options, if I have understood correctly the command.
I was thinking about query as:
aa = dfr.query(dfr.iloc[:,0]==86)
However, I have this error:
expr must be a string to be evaluated, <class 'pandas.core.series.Series'> given

You can simply slice your dataframe:
df_86 = df.loc[df['ColName'] == 86,:]

Another way to do it dynamically without having to specify the group beforehand.
df = pd.DataFrame({'ID': np.repeat([1, 2, 3], 4), 'col2': np.repeat([10, 11, 12], 4)})
Get the unique groupings:
groups = df['ID'].unique()
Create an empty dict to store new data frames
new_dfs = {}
Loop through and create new data frames from the slice:
for group in groups:
name = "ID" + str(group)
new_dfs[name] = df[df['ID'] == group]
new_dfs['ID1']
Which gives:
ID col2
0 1 10
1 1 10
2 1 10
3 1 10

Create dataframe in a loop

I would like to create a dataframe in a loop and after use these dataframe in a loop. I tried eval() function but it didn't work.
For example :
for i in range(5):
df_i = df[(df.age == i)]
There I would like to create df_0,df_1 etc. And then concatenate these new dataframe after some calculations :
final_df = pd.concat(df_0,df_1)
for i in range(2:5):
final_df = pd.concat(final_df, df_i)

You can create a dict of DataFrames x and have is as dict keys:
np.random.seed(42)
df = pd.DataFrame({'age': np.random.randint(0, 5, 20)})
x = {}
for i in range(5):
x[i] = df[df['age']==i]
final = pd.concat(x.values())
Then you can refer to individual DataFrames as:
x[1]
Output:
age
5 1
13 1
15 1
And concatenate all of them with:
pd.concat(x.values())
Output:
age
18 0
5 1
13 1
15 1
2 2
6 2
...

The way is weird and not recommended, but it can be done.
Answer
for i in range(5):
exec("df_{i} = df[df['age']=={i}]")
def UDF(dfi):
# do something in user-defined function
for i in range(5):
exec("df_{i} = UDF(df_{i})")
final_df = pd.concat(df_0,df_1)
for i in range(2:5):
final_df = pd.concat(final_df, df_i)
Better Way 1
Using a list or a dict to store the dataframe should be a better way since you can access each dataframe by an index or a key.
Since another answer shows the way using dict (#perl), I will show you the way using list.
def UDF(dfi):
# do something in user-defined function
dfs = [df[df['age']==i] for i in range(i)]
final_df = pd.concat(map(UDF, dfs))
Better Way 2
Since you are using pandas.DataFrame, groupby function is a 'pandas' way to do what you want. (maybe, I guess, cause I don't know what you want to do. LOL)
def UDF(dfi):
# do something in user-defined function
final_df = df.groupby('age').apply(UDF)
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

create names of dataframes in a loop

I need to give names to previously defined dataframes.
I have a list of dataframes :
liste_verif = ( dffreesurfer,total,qcschizo)
And I would like to give them a name by doing something like:
for h in liste_verif:
h.name = str(h)
Would that be possible ?
When I'm testing this code, it's doesn't work : instead of considering h as a dataframe, python consider each column of my dataframe.
I would like the name of my dataframe to be 'dffreesurfer', 'total' etc...

You can use dict comprehension and map DataFrames by values in list L:
dffreesurfer = pd.DataFrame({'col1': [7,8]})
total = pd.DataFrame({'col2': [1,5]})
qcschizo = pd.DataFrame({'col2': [8,9]})
liste_verif = (dffreesurfer,total,qcschizo)
L = ['dffreesurfer','total','qcschizo']
dfs = {L[i]:x for i,x in enumerate(liste_verif)}
print (dfs['dffreesurfer'])
col1
0 7
1 8
print (dfs['total'])
col2
0 1
1 5

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

split dataframe into multiple dataframes using loop and lists - python

Related

Removing index which is not common between 2 dataframes

Apply a function row by row using other dataframes' rows as list inputs in python

pandas create a subset according to a value in a column

Create dataframe in a loop

create names of dataframes in a loop

Categories

Resources