I have a DataFrame with two columns A and B.
I want to create a new column named C that identifies runs of consecutive A values sharing the same B value.
Here's an example:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,5,6,10,11,12,13,18], 'B':[1,1,2,2,3,3,3,3,4,4]})
I found a similar question, but that method only identifies consecutive A values, regardless of B:
df['C'] = df['A'].diff().ne(1).cumsum().sub(1)
I have tried to groupby B and apply the function like this:
df['C'] = df.groupby('B').apply(lambda x: x['A'].diff().ne(1).cumsum().sub(1))
However, it doesn't work: TypeError: incompatible index of inserted column with frame index. (The result of the groupby apply carries a MultiIndex of (B, original index), which cannot be aligned back to df's plain index.)
The expected output is
A B C
1 1 0
2 1 0
3 2 1
5 2 2
6 3 3
10 3 4
11 3 4
12 3 4
13 4 5
18 4 6
Create a sequential counter using groupby, diff and cumsum, then factorize to re-encode the counter as labels starting from 0:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().factorize()[0]
Result
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
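To see what each step produces, here's a quick sketch of the intermediates (a minimal, illustrative breakdown; the variable names are mine):
import pandas as pd

df = pd.DataFrame({'A':[1,2,3,5,6,10,11,12,13,18], 'B':[1,1,2,2,3,3,3,3,4,4]})

d = df.groupby('B')['A'].diff()  # NaN at each group start, otherwise the gap in A
breaks = d.ne(1)                 # True wherever a new run starts (NaN != 1, too)
runs = breaks.cumsum()           # run labels: [1, 1, 2, 3, 4, 5, 5, 5, 6, 7]
print(runs.factorize()[0])       # [0 0 1 2 3 4 4 4 5 6]
Here factorize simply re-encodes the labels to start at 0; it only differs from sub(1) when the cumsum labels are not dense, e.g. after rows have been filtered out.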
Use DataFrameGroupBy.diff, compare for inequality with 1 using ne, apply Series.cumsum, and finally subtract 1:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().sub(1)
print(df)
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
Let's say I have a pandas DataFrame:
import pandas as pd
df = pd.DataFrame({'a': [1,2,2,2,2,1,1,1,2,2]})
>>> df
a
0 1
1 2
2 2
3 2
4 2
5 1
6 1
7 1
8 2
9 2
I want to drop duplicate rows once they exceed a certain threshold n, i.e. cap each group of duplicates at n rows. Let's say that n=3. Then my target DataFrame is
>>> df
a
0 1
1 2
2 2
3 2
5 1
6 1
7 1
8 2
9 2
EDIT: Each set of consecutive repetitions is considered separately. In this example, rows 8 and 9 should be kept.
You can create a unique value for each consecutive group, then use groupby and head:
import numpy as np

group_value = np.cumsum(df.a.shift() != df.a)
df.groupby(group_value).head(3)
# result:
a
0 1
1 2
2 2
3 2
5 1
6 1
7 1
8 2
9 2
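If you want the last 3 of each run instead, the same grouping idea works with tail (a small sketch on the same data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1,2,2,2,2,1,1,1,2,2]})
group_value = np.cumsum(df.a.shift() != df.a)  # label each consecutive run
print(df.groupby(group_value).tail(3))         # keeps the last 3 rows of every run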
Use boolean indexing with groupby.cumcount:
N = 3
df[df.groupby('a').cumcount().lt(N)]
Output:
   a
0  1
1  2
2  2
3  2
5  1
6  1
Note that cumcount here counts duplicates globally across the whole column, so row 7 (the fourth 1 overall) and rows 8-9 (the fifth and sixth 2) are dropped; the consecutive variant below fixes this.
For the last N:
df[df.groupby('a').cumcount(ascending=False).lt(N)]
Applied to consecutive repetitions only:
df[df.groupby(df['a'].ne(df['a'].shift()).cumsum()).cumcount().lt(3)]
Output:
   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1  # this is the 3rd of its local run
8  2
9  2
Advantages of boolean indexing
You can use it for many other operations, such as setting values or masking:
group = df['a'].ne(df['a'].shift()).cumsum()
m = df.groupby(group).cumcount().lt(N)
df.where(m)
     a
0  1.0
1  2.0
2  2.0
3  2.0
4  NaN
5  1.0
6  1.0
7  1.0
8  2.0
9  2.0
df.loc[~m] = -1
   a
0  1
1  2
2  2
3  2
4 -1
5  1
6  1
7  1
8  2
9  2
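One caveat worth adding: df.loc[~m] = -1 overwrites every column of the masked rows, which is harmless here because there is only one column. To restrict the assignment to a single column:
df.loc[~m, 'a'] = -1  # only column 'a' is modified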
I'm having trouble finding the solution to a fairly simple problem.
I would like to alphabetically arrange certain columns of a pandas dataframe that has over 100 columns (i.e. so many that I don't want to list them manually).
Example df:
import pandas as pd
subject = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
c = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
d = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
a = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
b = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'subject':subject,
'timepoint':timepoint,
'c':c,
'd':d,
'a':a,
'b':b})
df.head()
subject timepoint c d a b
0 1 1 2 2 2 2
1 1 2 3 3 3 3
2 1 3 4 4 4 4
3 1 4 5 5 5 5
4 1 5 6 6 6 6
How could I rearrange the column names to generate a df.head() that looks like this:
subject timepoint a b c d
0 1 1 2 2 2 2
1 1 2 3 3 3 3
2 1 3 4 4 4 4
3 1 4 5 5 5 5
4 1 5 6 6 6 6
i.e. keep the first two columns where they are and then alphabetically arrange the remaining column names.
Thanks in advance.
You can split your DataFrame based on column names using the normal indexing operator [], sort the other columns alphabetically with sort_index(axis=1), and concat back together:
>>> pd.concat([df[['subject','timepoint']],
df[df.columns.difference(['subject', 'timepoint'])]\
.sort_index(axis=1)],ignore_index=False,axis=1)
subject timepoint a b c d
0 1 1 2 2 2 2
1 1 2 3 3 3 3
2 1 3 4 4 4 4
3 1 4 5 5 5 5
4 1 5 6 6 6 6
5 1 6 7 7 7 7
6 2 1 3 3 3 3
7 2 2 4 4 4 4
8 2 3 1 1 1 1
9 2 4 2 2 2 2
10 2 5 3 3 3 3
11 2 6 4 4 4 4
12 3 1 5 5 5 5
13 3 2 4 4 4 4
14 3 4 5 5 5 5
15 4 1 8 8 8 8
16 4 2 4 4 4 4
17 4 3 5 5 5 5
18 4 4 6 6 6 6
19 4 5 2 2 2 2
20 4 6 3 3 3 3
Specify the first two columns you want to keep (or determine them from the data), sort all of the other columns, then use .loc with the combined list to reorder the DataFrame.
import numpy as np
first_cols = ['subject', 'timepoint']
#first_cols = df.columns[0:2].tolist() # OR determine first two
other_cols = np.sort(df.columns.difference(first_cols)).tolist()
df = df.loc[:, first_cols+other_cols]
print(df.head())
subject timepoint a b c d
0 1 1 2 2 2 2
1 1 2 3 3 3 3
2 1 3 4 4 4 4
3 1 4 5 5 5 5
4 1 5 6 6 6 6
You can try getting the DataFrame columns as a list, rearranging it, and assigning it back to the DataFrame with df = df[cols]:
import pandas as pd
subject = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
c = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
d = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
a = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
b = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'subject':subject,
'timepoint':timepoint,
'c':c,
'd':d,
'a':a,
'b':b})
cols = df.columns.tolist()
cols = cols[:2] + sorted(cols[2:])
df = df[cols]
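As a variant of the same idea (my sketch, not from the answers above), DataFrame.reindex accepts the rearranged column list directly and returns a new frame:
cols = df.columns.tolist()
df = df.reindex(columns=cols[:2] + sorted(cols[2:]))  # same reordering via reindex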
I have a pandas data frame and I want to move the "F" column to after the "B" column. Is there a way to do that?
     A  B    C  D    E    F
0  7.0  1  NaN  8  1.0  6.0
1  8.0  2  5.0  8  5.0  8.0
2  9.0  3  6.0  8  NaN  5.0
3  1.0  8  1.0  3  4.0  NaN
4  6.0  8  2.0  5  0.0  9.0
5  NaN  2  NaN  1  3.0  8.0
df2:
     A  B    F    C  D    E
0  7.0  1  6.0  NaN  8  1.0
1  8.0  2  8.0  5.0  8  5.0
2  9.0  3  5.0  6.0  8  NaN
3  1.0  8  NaN  1.0  3  4.0
4  6.0  8  9.0  2.0  5  0.0
5  NaN  2  8.0  NaN  1  3.0
So it should finally look like df2.
Thanks in advance.
You can try df.insert + df.pop after getting the location of "B" with get_loc:
df.insert(df.columns.get_loc("B")+1,"F",df.pop("F"))
print(df)
A B F C D E
0 7.0 1 6.0 NaN 8 1.0
1 8.0 2 8.0 5.0 8 5.0
2 9.0 3 5.0 6.0 8 NaN
3 1.0 8 NaN 1.0 3 4.0
4 6.0 8 9.0 2.0 5 0.0
5 NaN 2 8.0 NaN 1 3.0
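If you need this more than once, the insert/pop trick generalizes to a small helper (a sketch; move_after is a hypothetical name, not a pandas function):
import pandas as pd

def move_after(df, col, anchor):
    """Return a copy of df with `col` repositioned directly after `anchor`."""
    out = df.copy()
    moved = out.pop(col)  # remove the column first so positions stay correct
    out.insert(out.columns.get_loc(anchor) + 1, col, moved)
    return out

# df2 = move_after(df, 'F', 'B')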
Another minimalist (and very specific!) approach:
df = df[list('ABFCDE')]
Here is a very simple answer to this (only one line), giving a little more explanation of the answer from @warped.
You can do this after you have added the 'n' column to your df, as follows.
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
df
l v n
0 a 1 0
1 b 2 0
2 c 1 0
3 d 2 0
# here you can add the below code and it should work.
df = df[list('nlv')]
df
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
However, if you have words in your column names instead of single letters, the names must go inside double brackets, i.e. list() around a tuple of the full names:
import pandas as pd
df = pd.DataFrame({'Upper':['a','b','c','d'], 'Lower':[1,2,1,2]})
df['Net'] = 0
df['Mid'] = 2
df['Zsore'] = 2
df
Upper Lower Net Mid Zsore
0 a 1 0 2 2
1 b 2 0 2 2
2 c 1 0 2 2
3 d 2 0 2 2
# here you can add below line and it should work
df = df[list(('Mid','Upper', 'Lower', 'Net','Zsore'))]
df
Mid Upper Lower Net Zsore
0 2 a 1 0 2
1 2 b 2 0 2
2 2 c 1 0 2
3 2 d 2 0 2
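For what it's worth, list(('Mid', ...)) just converts the tuple into a list, so a plain list literal does exactly the same thing:
df = df[['Mid', 'Upper', 'Lower', 'Net', 'Zsore']]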
Given 2 pandas tables, both with the 3 columns id, x and y: several rows with the same id represent a graph with its x-y values. How would I find paths that exist in the second table but not in the first, and append them to the first table? The key problem is that the order of the graphs can differ between the two tables.
Example:
df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3], 'x':[1,1,5,4,4,1,1,1], 'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4], 'x':[1,1,1,1,1,5,4,4,10,10,9], 'y':[4,5,6,1,2,4,4,3,1,2,2]})
(df1 intersect df2) ---------> df1

df1             df2              df1 (result)
id  x  y        id   x  y       id   x  y
 1  1  1         1   1  4        1   1  1
 1  1  2         1   1  5        1   1  2
 2  5  4         1   1  6        2   5  4
 2  4  4         2   1  1        2   4  4
 2  4  3         2   1  2        2   4  3
 3  1  4         3   5  4        3   1  4
 3  1  5         3   4  4        3   1  5
 3  1  6         3   4  3        3   1  6
                 4  10  1        4  10  1
                 4  10  2        4  10  2
                 4   9  2        4   9  2
Should become:
df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3,4,4,4], 'x':[1,1,5,4,4,1,1,1,10,10,9], 'y':[1,2,4,4,3,4,5,6,1,2,2]})
As you can see, up to id = 3 df1 and df2 contain the same graphs, but their order differs from one table to the other; for example, df1's first graph is df2's second graph. df2 also has a 4th path that is not in df1. That 4th path should be detected and appended to df1. In other words, I want the intersection of the two tables plus the paths that exist only in df2, appended to the first table, under the condition that the ids (i.e. the order of the paths) can differ between the tables.
Imports:
import pandas as pd
Set starting DataFrames:
df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3],
'x':[1,1,5,4,4,1,1,1],
'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4],
'x':[1,1,1,1,1,5,4,4,10,10,9],
'y':[4,5,6,1,2,4,4,3,1,2,2]})
Outer Merge:
df_merged = df1.merge(df2, on=['x', 'y'], how='outer')
produces:
df_merged =
id_x x y id_y
0 1.0 1 1 2
1 1.0 1 2 2
2 2.0 5 4 3
3 2.0 4 4 3
4 2.0 4 3 3
5 3.0 1 4 1
6 3.0 1 5 1
7 3.0 1 6 1
8 NaN 10 1 4
9 NaN 10 2 4
10 NaN 9 2 4
Note: id_x becomes float because the rows present only in df2 get NaN in id_x, and NaN cannot be stored in an integer column, so pandas upcasts it to float64.
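A minimal sketch of that coercion, and of the nullable Int64 dtype (available since pandas 0.24) that avoids it:
import pandas as pd

s = pd.Series([1, 2, None])
print(s.dtype)                  # float64 -- NaN cannot be stored in an int64 column
print(s.astype('Int64').dtype)  # Int64 -- nullable integer, missing values become <NA>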
Fill NaN:
df_merged.id_x = df_merged.id_x.fillna(df_merged.id_y).astype('int')
produces:
df_merged =
id_x x y id_y
0 1 1 1 2
1 1 1 2 2
2 2 5 4 3
3 2 4 4 3
4 2 4 3 3
5 3 1 4 1
6 3 1 5 1
7 3 1 6 1
8 4 10 1 4
9 4 10 2 4
10 4 9 2 4
Drop id_y:
df_merged = df_merged.drop(['id_y'], axis=1)
produces:
df_merged =
id_x x y
0 1 1 1
1 1 1 2
2 2 5 4
3 2 4 4
4 2 4 3
5 3 1 4
6 3 1 5
7 3 1 6
8 4 10 1
9 4 10 2
10 4 9 2
Rename id_x to id:
df_merged = df_merged.rename(columns={'id_x': 'id'})
produces:
df_merged =
id x y
0 1 1 1
1 1 1 2
2 2 5 4
3 2 4 4
4 2 4 3
5 3 1 4
6 3 1 5
7 3 1 6
8 4 10 1
9 4 10 2
10 4 9 2
The final program is 4 lines of code after the DataFrame definitions:
import pandas as pd
df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3],
'x':[1,1,5,4,4,1,1,1],
'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4],
'x':[1,1,1,1,1,5,4,4,10,10,9],
'y':[4,5,6,1,2,4,4,3,1,2,2]})
df_merged = df1.merge(df2, on=['x', 'y'], how='outer')
df_merged.id_x = df_merged.id_x.fillna(df_merged.id_y).astype('int')
df_merged = df_merged.drop(['id_y'], axis=1)
df_merged = df_merged.rename(columns={'id_x': 'id'})
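One caveat (my observation, not part of the original answer): fillna copies df2's ids verbatim, which works here only because the extra path happens to carry id 4, unused in df1. If ids could collide, you would want to remap the df2-only paths to fresh ids first, e.g. (a sketch, to run before the fillna step):
only_in_df2 = df_merged['id_x'].isna()        # rows that came only from df2
offset = int(df_merged['id_x'].max())         # highest id already used in df1
new_ids = df_merged.loc[only_in_df2, 'id_y'].rank(method='dense') + offset
df_merged.loc[only_in_df2, 'id_x'] = new_ids  # fresh ids above df1's range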
Mauritius, try this code:
df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3], 'x':[1,1,5,4,4,1,1,1], 'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4,5], 'x':[1,1,1,1,1,5,4,4,10,10,9,1], 'y':[4,5,6,1,2,4,4,3,1,2,2,2]})
# set of points for every path in df1
df1_s = [{(x, y) for x, y in df1[['x', 'y']][df1.id == i].values} for i in df1.id.unique()]

def f(g):
    # True if this df2 path does not appear among df1's paths
    data = {(x, y) for x, y in g[['x', 'y']].values}
    return data not in df1_s

check = df2.groupby('id').apply(f)
ids = check[check].index.values                  # ids of the df2-only paths
df2 = df2.set_index('id').loc[ids].reset_index()
df1 = pd.concat([df1, df2])  # DataFrame.append was removed in pandas 2.0
OUT:
id x y
0 1 1 1
1 1 1 2
2 2 5 4
3 2 4 4
4 2 4 3
5 3 1 4
6 3 1 5
7 3 1 6
0 4 10 1
1 4 10 2
2 4 9 2
3 5 1 2
I think it can be done in a simpler, more Pythonic way, but I have thought about it a lot and still don't know how =)
Also, the code should probably check that the ids in df2 do not collide with the ids already in df1 before appending one df to the other at the end. I might add this later.
Does this code do what you want?
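Regarding the wish for something simpler above, one possible variant (my sketch, assuming paths compare equal as unordered point sets): build a frozenset signature per id in both frames and append only the df2 groups whose signature is missing from df1:
import pandas as pd

df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3], 'x':[1,1,5,4,4,1,1,1], 'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4], 'x':[1,1,1,1,1,5,4,4,10,10,9], 'y':[4,5,6,1,2,4,4,3,1,2,2]})

sig1 = set(df1.groupby('id')[['x', 'y']].apply(lambda g: frozenset(zip(g.x, g.y))))
sigs2 = df2.groupby('id')[['x', 'y']].apply(lambda g: frozenset(zip(g.x, g.y)))
new_ids = sigs2[~sigs2.isin(sig1)].index         # ids of paths present only in df2
df1 = pd.concat([df1, df2[df2['id'].isin(new_ids)]], ignore_index=True)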