I need to go through a large pandas DataFrame and select consecutive rows with similar values in a column, e.g. in the DataFrame below, selecting on column x:
col row x y
1 1 1 1
2 2 2 2
6 3 3 8
9 2 3 4
5 3 3 9
4 9 4 4
5 5 5 1
3 7 5 2
6 6 6 6
The results output would be:
col row x y
6 3 3 8
9 2 3 4
5 3 3 9
5 5 5 1
3 7 5 2
Not sure how to do this.
IIUC, use boolean indexing with a mask of the consecutive values:
m = df['x'].eq(df['x'].shift())
df[m|m.shift(-1, fill_value=False)]
Output:
col row x y
2 6 3 3 8
3 9 2 3 4
4 5 3 3 9
6 5 5 5 1
7 3 7 5 2
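For reference, the answer above as a self-contained sketch; the frame below re-creates the example table from the question:

```python
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 6, 9, 5, 4, 5, 3, 6],
                   'row': [1, 2, 3, 2, 3, 9, 5, 7, 6],
                   'x':   [1, 2, 3, 3, 3, 4, 5, 5, 6],
                   'y':   [1, 2, 8, 4, 9, 4, 1, 2, 6]})

# mark rows whose 'x' equals the previous row's 'x'
m = df['x'].eq(df['x'].shift())
# keep a row if it matches its predecessor OR its successor
out = df[m | m.shift(-1, fill_value=False)]
```

`m` flags the second and later rows of each run; OR-ing it with its own backward shift pulls in the first row of each run as well, so only runs of length >= 2 survive.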
I start with:
df
0 1 2 3 4
0 5 0 0 2 6
1 9 6 5 8 6
2 8 9 4 2 1
3 2 5 8 9 6
4 8 8 8 0 8
and want to end up with:
df
0 1 2 3 4
A B C
1 2 0 5 0 0 2 6
1 9 6 5 8 6
2 8 9 4 2 1
3 2 5 8 9 6
4 8 8 8 0 8
where A and B are known after df creation, and C is the original
index of the df.
MWE:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(10, size=(5, 5)))
df_a = 1
df_b = 2
breakpoint()
What I have in mind, but it raises an unhashable-type error:
df.reindex([df_a, df_b, df.index])
Try with pd.MultiIndex.from_product:
df.index = pd.MultiIndex.from_product(
[[df_a], [df_b], df.index], names=['A','B','C'])
df
Out[682]:
0 1 2 3 4
A B C
1 2 0 7 0 1 9 9
1 0 4 7 3 2
2 7 2 0 0 4
3 5 5 6 8 4
4 1 4 9 8 1
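Plugging that fix into the MWE from the question gives a runnable sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(5, 5)))
df_a = 1
df_b = 2

# three levels: the scalar A, the scalar B, and the original index as C
df.index = pd.MultiIndex.from_product([[df_a], [df_b], df.index],
                                      names=['A', 'B', 'C'])
```

from_product takes the Cartesian product of the iterables, so wrapping the scalars in one-element lists broadcasts them across every original row label.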
[dataframe shown as an image in the original post]
year = 2020 (the max column)
lastFifthYear = year - 4
input = '2001509-00'
I want to add all the values between year (2020) and lastFifthYear (2016) where INPUT_PARTNO == input,
so for this input value I should get 4+6+2+3+2 (2016+2017+2018+2019+2020), i.e. 17.
please give me some code
Here is some code that should work, but you definitely need to improve the way you ask questions here :-)
Considering df is the table you pasted as an image above.
>>> year = 2016
>>> df_new = df.query('INPUT_PARTNO == "2001509-00"').melt(
...     ['ACTUAL_GI_YEAR', 'INPUT_PARTNO'], var_name='year', value_name='number')
>>> df_new.year = df_new.year.astype(int)
>>> df_new[df_new.year >= year].groupby(['ACTUAL_GI_YEAR', 'INPUT_PARTNO']).agg({'number': 'sum'})
number
ACTUAL_GI_YEAR INPUT_PARTNO
0 2001509-00 17
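Since the original table only exists as an image, here is a hypothetical reconstruction of it (one year column per year, values for 2016-2020 taken from the question's 4+6+2+3+2 example) so the melt-and-sum approach can be run end to end:

```python
import pandas as pd

# hypothetical reconstruction of the imaged table; column names and the
# ACTUAL_GI_YEAR value are assumptions, the per-year numbers come from the question
df = pd.DataFrame({'ACTUAL_GI_YEAR': [0],
                   'INPUT_PARTNO': ['2001509-00'],
                   '2016': [4], '2017': [6], '2018': [2], '2019': [3], '2020': [2]})

lastFifthYear = 2016
# wide -> long: one row per (part, year), then filter and sum
df_new = df.query('INPUT_PARTNO == "2001509-00"').melt(
    ['ACTUAL_GI_YEAR', 'INPUT_PARTNO'], var_name='year', value_name='number')
df_new.year = df_new.year.astype(int)
total = (df_new[df_new.year >= lastFifthYear]
         .groupby(['ACTUAL_GI_YEAR', 'INPUT_PARTNO'])
         .agg({'number': 'sum'}))
```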
Example Setup
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 10, (10, 10)),
columns=list('ab')+list(range(2,10)))
Solved
# row-wise sum, for rows where a == 9, over the columns between 3 and 6
df['number'] = df.loc[df['a'].eq(9),
                      pd.to_numeric(df.columns, errors='coerce')
                        .to_series()
                        .between(3, 6)
                        .values].sum(axis=1)
print(df)
a b 2 3 4 5 6 7 8 9 number
0 1 9 9 2 6 0 6 1 4 2 NaN
1 2 3 4 8 7 2 4 0 0 6 NaN
2 2 2 7 4 9 6 7 1 0 0 NaN
3 0 3 5 3 0 4 2 7 2 6 NaN
4 7 7 1 4 7 7 9 7 4 2 NaN
5 9 9 9 0 3 3 3 8 7 7 9.0
6 9 0 5 5 7 9 6 6 5 7 27.0
7 2 1 9 1 9 3 3 4 4 9 NaN
8 4 0 5 9 6 7 3 9 1 6 NaN
9 5 5 0 8 6 4 5 4 7 4 NaN
I know how to remove the index using .to_string(index=False), but I'm not able to figure out how to remove the column names.
matrix = [
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9]
]
import pandas as pd

def print_sudoku(s):
    df = pd.DataFrame(s)
    dff = df.to_string(index=False)
    print(dff)

print_sudoku(matrix)
The result is this.
0 1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
I want to remove the first row, which is the row of column names.
You can use header=False when converting to string: df.to_string(index=False, header=False)
ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_string.html
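Putting the two flags together, print_sudoku becomes:

```python
import pandas as pd

matrix = [[1, 2, 3, 4, 5, 6, 7, 8, 9] for _ in range(9)]

def print_sudoku(s):
    df = pd.DataFrame(s)
    # index=False drops the row labels, header=False drops the column names
    print(df.to_string(index=False, header=False))

print_sudoku(matrix)
```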
I have a main df like so:
index A B C
5 1 5 8
6 2 4 1
7 8 3 4
8 3 9 5
and an auxiliary df2 that I want to add to the main df like so:
index A B
5 4 2
6 4 3
7 7 1
8 6 2
Columns A and B have the same names; however, the main df contains many columns that the secondary df2 does not. I want to sum the common columns and leave the others as-is.
Output:
index A B C
5 5 7 8
6 6 7 1
7 15 4 4
8 9 11 5
Have tried variations of df.join, pd.merge and groupby, but no luck at the moment.
Last attempt:
df.groupby('index').sum().add(df2.groupby('index').sum())
But this does not keep the columns the frames don't share (C comes back as NaN).
With pd.merge I am getting _x and _y suffixes.
Use add only on the columns common to both frames, obtained with intersection:
c = df.columns.intersection(df2.columns)
df[c] = df[c].add(df2[c], fill_value=0)
print (df)
A B C
index
5 5 7 8
6 6 7 1
7 15 4 4
8 9 11 5
If you use add alone, the integer columns with no match are converted to floats:
df = df.add(df2, fill_value=0)
print (df)
A B C
index
5 5 7 8.0
6 6 7 1.0
7 15 4 4.0
8 9 11 5.0
EDIT:
If string columns can be among the common columns:
print (df)
A B C D
index
5 1 5 8 a
6 2 4 1 e
7 8 3 4 r
8 3 9 5 w
print (df2)
       A  B  D
index
5      4  2  a
6      4  3  e
7      7  1  r
8      6  2  w
The solution is similar; just filter to only the numeric columns first with select_dtypes (this needs import numpy as np):
c = df.select_dtypes(np.number).columns.intersection(df2.select_dtypes(np.number).columns)
df[c] = df[c].add(df2[c], fill_value=0)
print (df)
A B C D
index
5 5 7 8 a
6 6 7 1 e
7 15 4 4 r
8 9 11 5 w
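A self-contained sketch of the intersection approach, using the two frames from the question:

```python
import pandas as pd

# the two frames from the question, with 'index' as the index
df = pd.DataFrame({'A': [1, 2, 8, 3], 'B': [5, 4, 3, 9], 'C': [8, 1, 4, 5]},
                  index=pd.Index([5, 6, 7, 8], name='index'))
df2 = pd.DataFrame({'A': [4, 4, 7, 6], 'B': [2, 3, 1, 2]},
                   index=pd.Index([5, 6, 7, 8], name='index'))

# add only the columns both frames share; 'C' is left untouched
c = df.columns.intersection(df2.columns)
df[c] = df[c].add(df2[c], fill_value=0)
```

Because only the shared columns are assigned back, `C` keeps its integer dtype instead of picking up NaNs or being cast to float.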
Not the cleanest way, but it might work:
df_new = pd.DataFrame()
df_new['A'] = df['A'] + df2['A']
df_new['B'] = df['B'] + df2['B']
df_new['C'] = df['C']
print(df_new)
A B C
0 5 7 8
1 6 7 1
2 15 4 4
3 9 11 5