I have the following code:
import pandas.util.testing as testing
df = testing.makeDataFrame()
df
With this, I have created two dataframes, one of which has two fewer rows than the original.
This is df - Original
A B C D
OdhGFPa5Kw -0.686378 -1.210838 1.160708 0.903309
gelZFj4BG5 1.603112 1.852592 -0.065482 0.684566
mp3Aq5ueGD 0.254211 -0.788877 -0.626789 0.109116
pBtz9DHxUZ -0.970632 0.982661 -0.463984 -0.123727
K28pzbdYcX -1.311220 -2.121306 1.209484 -1.695901
71ZFgWaeDE 1.887420 0.337702 -0.176539 0.149089
alWOjkQ2eZ 1.997701 -0.354276 1.997802 -0.086803
This is df1 - with 2 fewer rows
A B C D
OdhGFPa5Kw -0.686378 -1.210838 1.160708 0.903309
gelZFj4BG5 1.603112 1.852592 -0.065482 0.684566
mp3Aq5ueGD 0.254211 -0.788877 -0.626789 0.109116
pBtz9DHxUZ -0.970632 0.982661 -0.463984 -0.123727
K28pzbdYcX -1.311220 -2.121306 1.209484 -1.695901
What I am trying to do is to remove all the rows which are not common between the two dataframes. To do this, I find the indices that appear in both dataframes:
duplicates = set(df.index).intersection(df1.index)
Could you please advise how I can remove the rows whose index is not in duplicates?
If you want to remove the indices in place:
idx = df.index.difference(df1.index)
df.drop(idx, inplace=True)
If you want to create a new object:
idx = df.index.intersection(df1.index)
new_df = df.loc[idx]
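For reference, a minimal self-contained sketch of both approaches (using a small random frame in place of makeDataFrame(), with df1 built by dropping the last two rows, as in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(7, 4), columns=list("ABCD"))
df1 = df.iloc[:-2]                      # two fewer rows than df

# in place: drop the rows of df whose index does not appear in df1
df_inplace = df.copy()
df_inplace.drop(df_inplace.index.difference(df1.index), inplace=True)

# new object: keep only the rows whose index appears in both
new_df = df.loc[df.index.intersection(df1.index)]

print(df_inplace.shape, new_df.shape)   # both now contain only the common rows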
I'm trying to apply a function row by row which takes 5 inputs, 3 of which are lists. I want these lists to come from each row of 3 corresponding dataframes.
I've tried using 'apply' and 'lambda' as follows:
sol['tf_dd'] = sol.apply(lambda tsol, rfsol, rbsol:
                         taurho_difdif(xy=xy,
                                       l=l,
                                       t=tsol,
                                       rf=rfsol,
                                       rb=rbsol),
                         axis=1)
However, I get the error: <lambda>() missing 2 required positional arguments: 'rfsol' and 'rbsol'
The DataFrame sol and the DataFrames tsol, rfsol and rbsol all have the same length. For each row, I want the entire row from tsol, rfsol and rbsol to be input as three lists.
Here is a much simplified example (first with single lists, which I then want to replicate row by row with dataframes):
The output with single lists is a single value (120). With dataframes as inputs I want an output dataframe of length 10 where all values are 120.
import pandas as pd

t=[1,2,3,4,5]
rf=[6,7,8,9,10]
rb=[11,12,13,14,15]

def simple_func(t, rf, rb):
    x=sum(t)
    y=sum(rf)
    z=sum(rb)
    return x+y+z
out=simple_func(t,rf,rb)
# dataframe rows as lists
tsol=pd.DataFrame((t,t,t,t,t,t,t,t,t,t))
rfsol=pd.DataFrame((rf,rf,rf,rf,rf,rf,rf,rf,rf,rf))
rbsol=pd.DataFrame((rb,rb,rb,rb,rb,rb,rb,rb,rb,rb))
out2 = pd.DataFrame(index=range(len(tsol)), columns=['output'])
out2['output'] = out2.apply(lambda tsol, rfsol, rbsol:
                            simple_func(t=tsol.tolist(),
                                        rf=rfsol.tolist(),
                                        rb=rbsol.tolist()),
                            axis=1)
# raises: <lambda>() missing 2 required positional arguments: 'rfsol' and 'rbsol'
Try using the "name" attribute of the Series that apply passes in to get the index value, and then look up the same position in the other DataFrames:
import pandas as pd
import numpy as np
def positional_sum(row, df1, df2, df3):
    """
    Take the input row's index and gather the same position from the other DataFrames.
    """
    position = row.name
    x = df1.iloc[position].sum()
    y = df2.iloc[position].sum()
    z = df3.iloc[position].sum()
    return x + y + z
# dataframe rows as lists
tsol = pd.DataFrame(np.random.randn(10, 5), columns=range(5))
rfsol = pd.DataFrame(np.random.randn(10, 5), columns=range(5))
rbsol = pd.DataFrame(np.random.randn(10, 5), columns=range(5))
out2 = pd.DataFrame(index=range(len(tsol)), columns=["output"])
out2["output"] = out2.apply(lambda x: postional_sum(x, tsol, rfsol, rbsol), axis=1)
out2
Hope this helps!
When you run df.apply() with axis=1, it does not pass the columns to the function as individual arguments, but as a single Series object, as explained here. The correct way to do this would be:
out2['output'] = out2.apply(lambda row:
                            simple_func(t=row["tsol"],
                                        rf=row["rfsol"],
                                        rb=row["rbsol"]),
                            axis=1)
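Note that row["tsol"] and the others only exist if the frame you apply over actually has those columns. A minimal sketch of one way to arrange that, assuming tsol, rfsol, rbsol and simple_func as defined in the question (each cell of the combined frame holds one row as a list):
import pandas as pd

# tsol, rfsol, rbsol and simple_func as defined in the question above
combined = pd.DataFrame({
    "tsol": tsol.values.tolist(),    # each cell is one row of tsol as a list
    "rfsol": rfsol.values.tolist(),
    "rbsol": rbsol.values.tolist(),
})
combined["output"] = combined.apply(
    lambda row: simple_func(t=row["tsol"], rf=row["rfsol"], rb=row["rbsol"]),
    axis=1,
)
# every value in combined["output"] is 120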
You can eliminate the simple function using this:
out2["output"] = tsol.sum(axis=1) + rfsol.sum(axis=1) + rbsol.sum(axis=1)
Here is the complete code:
import pandas as pd

t=[1,2,3,4,5]
rf=[6,7,8,9,10]
rb=[11,12,13,14,15]
# dataframe rows as lists
tsol=pd.DataFrame((t,t,t,t,t,t,t,t,t,t))
rfsol=pd.DataFrame((rf,rf,rf,rf,rf,rf,rf,rf,rf,rf))
rbsol=pd.DataFrame((rb,rb,rb,rb,rb,rb,rb,rb,rb,rb))
out2 = pd.DataFrame(index=range(len(tsol)), columns=["output"])
out2["output"] = tsol.sum(axis=1) + rfsol.sum(axis=1) + rbsol.sum(axis=1)
print(out2)
OUTPUT:
output
0 120
1 120
2 120
3 120
4 120
5 120
6 120
7 120
8 120
9 120
I have this dataframe:
86,1/28/2004 0:00:00,16.9
86,5/25/2004 0:00:00,17.01
86,7/22/2004 0:00:00,17.06
87,11/15/2004 0:00:00,7.39
87,3/14/2005 0:00:00,7.59
86,11/15/2004 0:00:00,17.29
86,3/14/2005 0:00:00,17.38
86,4/19/2005 0:00:00,17.43
86,5/19/2005 0:00:00,17.28
87,1/22/2004 0:00:00,7.44
87,5/13/2004 0:00:00,7.36
I would like to work with two separate dataframes according to the value (id) in the first column. Ideally, I would like to have:
87,11/15/2004 0:00:00,7.39
87,3/14/2005 0:00:00,7.59
87,1/22/2004 0:00:00,7.44
87,5/13/2004 0:00:00,7.36
and
86,1/28/2004 0:00:00,16.9
86,5/25/2004 0:00:00,17.01
86,7/22/2004 0:00:00,17.06
86,11/15/2004 0:00:00,17.29
86,3/14/2005 0:00:00,17.38
86,4/19/2005 0:00:00,17.43
86,5/19/2005 0:00:00,17.28
As you can see, I have one dataframe with all the 87s in the first column and another with the 86s.
This is how I read the dataframe:
dfr = pd.read_csv(fname,sep=',',index_col=False,header=None)
I think that groupby is not the right option, if I have understood the command correctly.
I was thinking about query, as in:
aa = dfr.query(dfr.iloc[:,0]==86)
However, I have this error:
expr must be a string to be evaluated, <class 'pandas.core.series.Series'> given
You can simply slice your dataframe:
df_86 = df.loc[df['ColName'] == 86,:]
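Since the file was read with header=None, the columns are labelled 0, 1, 2, so here the filter would use column 0. A minimal sketch with a few of the rows from the question:
import io
import pandas as pd

data = """86,1/28/2004 0:00:00,16.9
87,11/15/2004 0:00:00,7.39
86,5/25/2004 0:00:00,17.01
87,3/14/2005 0:00:00,7.59"""

dfr = pd.read_csv(io.StringIO(data), sep=',', index_col=False, header=None)

df_86 = dfr.loc[dfr[0] == 86, :]   # rows whose first column is 86
df_87 = dfr.loc[dfr[0] == 87, :]   # rows whose first column is 87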
Another way is to do it dynamically, without having to specify the groups beforehand.
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': np.repeat([1, 2, 3], 4), 'col2': np.repeat([10, 11, 12], 4)})
Get the unique groupings:
groups = df['ID'].unique()
Create an empty dict to store the new data frames:
new_dfs = {}
Loop through and create new data frames from the slice:
for group in groups:
    name = "ID" + str(group)
    new_dfs[name] = df[df['ID'] == group]
new_dfs['ID1']
Which gives:
ID col2
0 1 10
1 1 10
2 1 10
3 1 10
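Because the frames are stored in a dict, you can also run the same calculation over every group in one loop; a small illustrative example:
for name, frame in new_dfs.items():
    print(name, frame['col2'].sum())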
I would like to create dataframes in a loop and afterwards use these dataframes in a loop. I tried the eval() function but it didn't work.
For example :
for i in range(5):
    df_i = df[(df.age == i)]
Here I would like to create df_0, df_1, etc., and then concatenate these new dataframes after some calculations:
final_df = pd.concat([df_0, df_1])
for i in range(2, 5):
    final_df = pd.concat([final_df, df_i])
You can create a dict of DataFrames, x, with i as the dict keys:
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({'age': np.random.randint(0, 5, 20)})

x = {}
for i in range(5):
    x[i] = df[df['age']==i]
final = pd.concat(x.values())
Then you can refer to individual DataFrames as:
x[1]
Output:
age
5 1
13 1
15 1
And concatenate all of them with:
pd.concat(x.values())
Output:
age
18 0
5 1
13 1
15 1
2 2
6 2
...
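The "some calculations" step from the question can then be done per piece before concatenating; a minimal sketch with a made-up calculation (adding a squared column):
for i in range(5):
    x[i] = x[i].assign(age_squared=x[i]['age'] ** 2)

final = pd.concat(x.values())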
This way is weird and not recommended, but it can be done.
Answer
for i in range(5):
    exec(f"df_{i} = df[df['age'] == {i}]")

def UDF(dfi):
    # do something in the user-defined function
    return dfi

for i in range(5):
    exec(f"df_{i} = UDF(df_{i})")

final_df = pd.concat([df_0, df_1])
for i in range(2, 5):
    final_df = pd.concat([final_df, eval(f"df_{i}")])
Better Way 1
Using a list or a dict to store the dataframe should be a better way since you can access each dataframe by an index or a key.
Since another answer shows the way using a dict (@perl), I will show you the way using a list.
def UDF(dfi):
    # do something in the user-defined function
    return dfi

dfs = [df[df['age'] == i] for i in range(5)]
final_df = pd.concat(map(UDF, dfs))
Better Way 2
Since you are using pandas.DataFrame, the groupby function is the 'pandas' way to do what you want (maybe; I'm guessing, since I don't know exactly what you want to do).
def UDF(dfi):
    # do something in the user-defined function
    return dfi

final_df = df.groupby('age').apply(UDF)
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
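For a concrete (made-up) illustration of this pattern, with a random frame and a UDF that computes a per-group mean:
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': np.random.randint(0, 5, 20),
                   'score': np.random.randn(20)})

def UDF(dfi):
    # example calculation: the mean score within each age group
    return dfi['score'].mean()

print(df.groupby('age').apply(UDF))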
I need to give names to previously defined dataframes.
I have a list of dataframes :
liste_verif = ( dffreesurfer,total,qcschizo)
And I would like to give them a name by doing something like:
for h in liste_verif:
    h.name = str(h)
Would that be possible ?
When I test this code, it doesn't work: instead of treating h as a dataframe, Python considers each column of my dataframe.
I would like the name of my dataframe to be 'dffreesurfer', 'total' etc...
You can use a dict comprehension and map the DataFrames to the names in the list L:
import pandas as pd

dffreesurfer = pd.DataFrame({'col1': [7,8]})
total = pd.DataFrame({'col2': [1,5]})
qcschizo = pd.DataFrame({'col2': [8,9]})
liste_verif = (dffreesurfer,total,qcschizo)
L = ['dffreesurfer','total','qcschizo']
dfs = {L[i]:x for i,x in enumerate(liste_verif)}
print (dfs['dffreesurfer'])
col1
0 7
1 8
print (dfs['total'])
col2
0 1
1 5
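Equivalently (the same idea, just a different way to build the mapping), zip can pair the names with the frames directly, and you can then loop over names and frames together:
dfs = dict(zip(L, liste_verif))

for name, frame in dfs.items():
    print(name, frame.shape)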