Get all dataframes based on a certain value in a dataframe column - python

I have a DataFrame that looks something like this:
import numpy as np
import pandas as pd
df = pd.DataFrame([['d',5,6],['a',6,6],['index',5,8],['b',3,1],['b',5,6],['index',6,7],
                   ['e',2,3],['c',5,6],['index',5,8]], columns=['A','B','C'])
I want to select, for every 'index' row, the lines around it and create many dataframes.
I want to obtain them all as:
dataframe1:
A B C
1 a 6 6
2 index 5 8
3 b 3 1
dataframe2:
A B C
4 b 5 6
5 index 6 7
6 e 2 3
dataframe3:
A B C
7 c 5 6
8 index 5 8
... and so on, one dataframe around every 'index' row.

index_list = df.index[df['A'] == 'index'].tolist()  # index labels where df['A'] == 'index'
new_df = []  # empty list for the dataframes
for i in index_list:
    # keep the 'index' row together with the row before and the row after it
    # (works here because the default RangeIndex matches the positional locations)
    new_df.append(df.iloc[i-1:i+2])
This creates a list of dataframes; you can access them as new_df[0], new_df[1], ... or use a loop to print them out:
for i in range(len(new_df)):
    print(f'{new_df[i]}\n')
A B C
1 a 6 6
2 index 5 8
3 b 3 1
A B C
4 b 5 6
5 index 6 7
6 e 2 3
A B C
7 c 5 6
8 index 5 8
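A slightly more defensive variant of the same idea, for what it's worth, works from positional indices, so it also behaves correctly when the DataFrame does not have a default RangeIndex (a sketch, not part of the original answer):

import numpy as np
import pandas as pd

df = pd.DataFrame([['d',5,6],['a',6,6],['index',5,8],['b',3,1],['b',5,6],['index',6,7],
                   ['e',2,3],['c',5,6],['index',5,8]], columns=['A','B','C'])

# positional indices of the rows where column A equals 'index'
pos = np.flatnonzero(df['A'].to_numpy() == 'index')

# one three-row slice centred on each 'index' row; max() guards against
# a negative start if an 'index' row happens to be the very first row
chunks = [df.iloc[max(p - 1, 0):p + 2] for p in pos]

for chunk in chunks:
    print(chunk, end='\n\n')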

Related

Construct a df such that every number within a range gets the value 'A' assigned, knowing the start and end of the range of values that belong to 'A'

Suppose I have the following Pandas dataframe:
In[285]: df = pd.DataFrame({'Name':['A','B'], 'Start': [1,6], 'End': [4,12]})
In [286]: df
Out[286]:
Name Start End
0 A 1 4
1 B 6 12
Now I would like to construct the dataframe as follows:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
My biggest struggle is in getting the 'Name' column right. Is there a smart way to do this in Python?
I would do pd.concat on a generator expression:
pd.concat(pd.DataFrame({'Number': np.arange(s,e+1)})
.assign(Name=n)
for n,s,e in zip(df['Name'], df['Start'], df['End']))
Output:
Number Name
0 1 A
1 2 A
2 3 A
3 4 A
0 6 B
1 7 B
2 8 B
3 9 B
4 10 B
5 11 B
6 12 B
Update: As commented by @rafaelc:
pd.concat(pd.DataFrame({'Number': np.arange(s,e+1), 'Name': n})
for n,s,e in zip(df['Name'], df['Start'], df['End']))
works just fine.
Let us do it with this example (with 3 names):
import pandas as pd
df = pd.DataFrame({'Name':['A','B','C'], 'Start': [1,6,18], 'End': [4,12,20]})
You may create the target columns first, using list comprehensions:
name = [row.Name for i, row in df.iterrows() for _ in range(row.End - row.Start + 1)]
number = [k for i, row in df.iterrows() for k in range(row.Start, row.End + 1)]
And then you can create the target DataFrame:
expanded = pd.DataFrame({"Name": name, "Number": number})
You get:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
11 C 18
12 C 19
13 C 20
I'd take advantage of loc and index.repeat for a vectorized solution.
base = df.loc[df.index.repeat(df['End'] - df['Start'] + 1), ['Name', 'Start']]
base['Start'] += base.groupby(level=0).cumcount()
Name Start
0 A 1
0 A 2
0 A 3
0 A 4
1 B 6
1 B 7
1 B 8
1 B 9
1 B 10
1 B 11
1 B 12
Of course we can rename the columns and reset the index at the end, for a nicer display.
base.rename(columns={'Start': 'Number'}).reset_index(drop=True)
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
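Another option, not covered by the answers above, is DataFrame.explode on a per-row list of numbers. A rough sketch with the same example data (the astype call is only there because explode produces an object column):

import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B'], 'Start': [1, 6], 'End': [4, 12]})

# build a list of numbers for each row, then explode it into one row per number
out = (df.assign(Number=[list(range(s, e + 1))
                         for s, e in zip(df['Start'], df['End'])])
         .explode('Number')[['Name', 'Number']]
         .astype({'Number': int})
         .reset_index(drop=True))
print(out)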

How to set a value in a cell filtered by rows in a pandas DataFrame?

import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11,12]],columns=['A','B','C'])
df[df['B']%2 ==0]['C'] = 5
I am expecting this code to change the value of column C to 5 wherever B is even, but it is not working.
It returns the table as follows:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
I am expecting it to return
A B C
0 1 2 5
1 4 5 6
2 7 8 5
3 10 11 12
If you need to change the values of a column in a DataFrame, use DataFrame.loc with the condition and the column name:
df.loc[df['B']%2 ==0, 'C'] = 5
print (df)
A B C
0 1 2 5
1 4 5 6
2 7 8 5
3 10 11 12
Your solution is a nice example of chained indexing - see the docs.
You could just change the order to:
df['C'][df['B']%2 == 0] = 5
And it also works, though it is still chained indexing and may stop working once pandas copy-on-write is enabled.
Using numpy.where:
df['C'] = np.where(df['B']%2 == 0, 5, df['C'])
Output
A B C
0 1 2 5
1 4 5 6
2 7 8 5
3 10 11 12
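For completeness, Series.mask does the same conditional replacement without any chained indexing; a small sketch, not from the original answers:

import pandas as pd

df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11,12]], columns=['A','B','C'])

# replace C with 5 wherever the condition on B is True
df['C'] = df['C'].mask(df['B'] % 2 == 0, 5)
print(df)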

How to join two dataframes by picking a couple of columns from each when one of the columns has the same data

There are two dataframes, df_one and df_two. I want to create a new dataframe with selected columns from each of them.
df_one
e b c d
1 2 3 4
5 6 7 8
6 2 4 8
9 2 5 6
and
df_two
e f g h
1 8 7 6
5 6 6 4
6 6 2 4
9 5 3 2
I want to create a new dataframe new_df
e b g h d
1 2 7 6 4
5 6 6 4 8
6 2 2 4 8
9 2 3 2 6
result = pd.merge(df_one, df_two, on='e')
result = result.loc[:, ["e", "b", "g", "h", "d"]]
Use:
pd.merge(df1[["e", "b", "d"]], df2[["e", "g", "h"]], on="e")
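Put together with the sample data from the question, this approach looks roughly like the following (the values are copied from the tables above):

import pandas as pd

df_one = pd.DataFrame({'e': [1, 5, 6, 9], 'b': [2, 6, 2, 2],
                       'c': [3, 7, 4, 5], 'd': [4, 8, 8, 6]})
df_two = pd.DataFrame({'e': [1, 5, 6, 9], 'f': [8, 6, 6, 5],
                       'g': [7, 6, 2, 3], 'h': [6, 4, 4, 2]})

# keep only the wanted columns of each frame, then merge on the shared key 'e'
new_df = pd.merge(df_one[['e', 'b', 'd']], df_two[['e', 'g', 'h']], on='e')
new_df = new_df[['e', 'b', 'g', 'h', 'd']]  # reorder to the desired layout
print(new_df)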

How to create a sub-DataFrame with a minimal count of values per class

I have a DataFrame of the form:
a b Class
0 1 10 A
1 2 12 A
2 3 2 A
3 12 5 B
4 5 7 A
5 6 8 B
6 7 17 A
7 1 1 B
8 5 0 B
From this DataFrame I want to get another DataFrame that has at least N rows for each of the values of column Class (here, at least N rows of class 'A' and N rows of class 'B').
The new DataFrame should include all the rows starting from the end of the original DataFrame and going back up to the row where that condition is met.
In the data above with N=2 I expect to get:
a b Class
4 5 7 A
5 6 8 B
6 7 17 A
7 1 1 B
8 5 0 B
Thanks.
You can extract the last 2 rows per Class and take the first index of the result.
Then slice your original dataframe from that point onwards.
idx = df.groupby('Class').tail(2).index[0]
res = df[idx:]
print(res)
a b Class
4 5 7 A
5 6 8 B
6 7 17 A
7 1 1 B
8 5 0 B
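Wrapped into a small helper so N becomes a parameter; a sketch that assumes the index is monotonically increasing, as in the example:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 12, 5, 6, 7, 1, 5],
                   'b': [10, 12, 2, 5, 7, 8, 17, 1, 0],
                   'Class': list('AAABABABB')})

def last_n_per_class(frame, n=2):
    # first index that still leaves at least n rows of every Class below it
    start = frame.groupby('Class').tail(n).index[0]
    return frame.loc[start:]

print(last_n_per_class(df, n=2))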

Pandas - Merge multiple columns and sum

I have a main df like so:
index A B C
5 1 5 8
6 2 4 1
7 8 3 4
8 3 9 5
and an auxiliary df2 that I want to add to the main df like so:
index A B
5 4 2
6 4 3
7 7 1
8 6 2
Columns A & B have the same names; however, the main df contains many columns that the secondary df2 does not. I want to sum the columns that are common and leave the others as is.
Output:
index A B C
5 5 7 8
6 6 7 1
7 15 4 4
8 9 11 5
I have tried variations of df.join, pd.merge and groupby but am having no luck at the moment.
Last Attempt:
df.groupby('index').sum().add(df2.groupby('index').sum())
But this does not preserve the columns that exist only in df (C ends up as NaN).
With pd.merge I get the _x and _y suffixes.
Use add only with the common columns, obtained via intersection:
c = df.columns.intersection(df2.columns)
df[c] = df[c].add(df2[c], fill_value=0)
print (df)
A B C
index
5 5 7 8
6 6 7 1
7 15 4 4
8 9 11 5
If you use add alone, the integer columns which are not matched are converted to floats:
df = df.add(df2, fill_value=0)
print (df)
A B C
index
5 5 7 8.0
6 6 7 1.0
7 15 4 4.0
8 9 11 5.0
EDIT:
If the common columns may include string columns:
print (df)
A B C D
index
5 1 5 8 a
6 2 4 1 e
7 8 3 4 r
8 3 9 5 w
print (df2)
A B D
index
5 4 2 a
6 4 3 e
7 7 1 r
8 6 2 w
Solution is similar, only filter first only numeric columns by select_dtypes:
c = df.select_dtypes(np.number).columns.intersection(df2.select_dtypes(np.number).columns)
df[c] = df[c].add(df2[c], fill_value=0)
print (df)
A B C D
index
5 5 7 8 a
6 6 7 1 e
7 15 4 4 r
8 9 11 5 w
Not the cleanest way but it might work.
df_new = pd.DataFrame()
df_new['A'] = df['A'] + df2['A']
df_new['B'] = df['B'] + df2['B']
df_new['C'] = df['C']
print(df_new)
A B C
0 5 7 8
1 6 7 1
2 15 4 4
3 9 11 5
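The intersection-plus-add idea also packs neatly into a reusable helper; a sketch using the frames from the question (the function name is made up here):

import numpy as np
import pandas as pd

def add_common_numeric(df, df2):
    # sum only the numeric columns the two frames share; leave the rest untouched
    out = df.copy()
    common = out.select_dtypes(np.number).columns.intersection(
        df2.select_dtypes(np.number).columns)
    out[common] = out[common].add(df2[common], fill_value=0)
    return out

df = pd.DataFrame({'A': [1, 2, 8, 3], 'B': [5, 4, 3, 9], 'C': [8, 1, 4, 5]},
                  index=pd.Index([5, 6, 7, 8], name='index'))
df2 = pd.DataFrame({'A': [4, 4, 7, 6], 'B': [2, 3, 1, 2]},
                   index=pd.Index([5, 6, 7, 8], name='index'))

print(add_common_numeric(df, df2))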
