I'm having trouble working out how to add the index value of a pandas dataframe to each value at that index. For example, if I have a dataframe of zeroes, the row with index 1 should have a value of 1 for all columns. The row at index 2 should have values of 2 for each column, and so on.
Can someone enlighten me please?
You can use pd.DataFrame.add with axis=0. Just remember, as below, to convert your index to a series first.
df = pd.DataFrame(np.random.randint(0, 10, (5, 5)))
print(df)
0 1 2 3 4
0 3 4 2 2 2
1 9 6 1 8 0
2 2 9 0 5 3
3 3 1 1 7 0
4 2 6 3 6 6
df = df.add(df.index.to_series(), axis=0)
print(df)
0 1 2 3 4
0 3 4 2 2 2
1 10 7 2 9 1
2 4 11 2 7 5
3 6 4 4 10 3
4 6 10 7 10 10
Related
I have a DataFrame with two columns A and B.
I want to create a new column named C to identify the continuous A with the same B value.
Here's an example
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,5,6,10,11,12,13,18], 'B':[1,1,2,2,3,3,3,3,4,4]})
I found a similar question, but that method only identifies the continuous A regardless of B.
df['C'] = df['A'].diff().ne(1).cumsum().sub(1)
I have tried to groupby B and apply the function like this:
df['C'] = df.groupby('B').apply(lambda x: x['A'].diff().ne(1).cumsum().sub(1))
However, it doesn't work: TypeError: incompatible index of inserted column with frame index.
The expected output is
A B C
1 1 0
2 1 0
3 2 1
5 2 2
6 3 3
10 3 4
11 3 4
12 3 4
13 4 5
18 4 6
Let's create a sequential counter using groupby, diff and cumsum then factorize to reencode the counter
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().factorize()[0]
Result
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
Use DataFrameGroupBy.diff with compare not equal 1 and Series.cumsum, last subtract 1:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().sub(1)
print (df)
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
I need to go through a large pd and select consecutive rows with similar values in a column. i.e. in the pd below and selecting column x: I want to specify consecutive values in column x? Say if I want consecutive values of 3 and 5 only
col row x y
1 1 1 1
5 7 3 0
2 2 2 2
6 3 3 8
9 2 3 4
5 3 3 9
4 9 4 4
5 5 5 1
3 7 5 2
6 6 6 6
5 8 6 2
3 7 6 0
The results output would be:
col row x y consecutive-count
6 3 3 8 1
9 2 3 4 1
5 3 3 9 1
5 5 5 1 2
3 7 5 2 2
I tried
m = df['x'].eq(df['x'].shift())
df[m|m.shift(-1, fill_value=False)]
But that includes the consecutive 6 that I don't want.
I also tried:
df.query( 'x in [3,5]')
That prints every row where x has 3 or 5.
IIUC use masks for boolean indexing. Check for 3 or 5, and use a cummax and reverse cummax to ensure having the order:
m1 = df['x'].eq(3)
m2 = df['x'].eq(5)
out = df[(m1|m2)&(m1.cummax()&m2[::-1].cummax())]
Output:
col row x y
2 6 3 3 8
3 9 2 3 4
4 5 3 3 9
6 5 5 5 1
7 3 7 5 2
you can create a group column for consecutive values, and filter by the group count and value of x:
# create unique ids for consecutive groups, then get group length:
group_num = (df.x.shift() != df.x).cumsum()
group_len = group_num.groupby(group_num).transform("count")
# filter main df:
df2 = df[(df.x.isin([3,5])) & (group_len > 1)]
# add new group num col
df2['consecutive-count'] = (df2.x != df2.x.shift()).cumsum()
output:
col row x y consecutive-count
3 6 3 3 8 1
4 9 2 3 4 1
5 5 3 3 9 1
7 5 5 5 1 2
8 3 7 5 2 2
I need to create a new dataframe from an existing one by selecting multiple columns, and appending those column values to a new column with it's corresponding index as a new column
So, lets say I have this as a dataframe:
A B C D E F
0 1 2 3 4 0
0 7 8 9 1 0
0 4 5 2 4 0
Transform into this by selecting columns B through E:
A index_value
1 1
7 1
4 1
2 2
8 2
5 2
3 3
9 3
2 3
4 4
1 4
4 4
So, for the new dataframe, column A would be all of the values from columns B through E in the old dataframe, and column index_value would correspond to the index value [starting from zero] of the selected columns.
I've been scratching my head for hours. Any help would be appreciated, thanks!
Python3, Using pandas & numpy libraries.
#Another way
A B C D E F
0 0 1 2 3 4 0
1 0 7 8 9 1 0
2 0 4 5 2 4 0
# Select columns to include
start_colum ='B'
end_column ='E'
index_column_name ='A'
#re-stack the dataframe
df = df.loc[:,start_colum:end_column].stack().sort_index(level=1).reset_index(level=0, drop=True).to_frame()
#Create the "index_value" column
df['index_value'] =pd.Categorical(df.index).codes+1
df.rename(columns={0:index_column_name}, inplace=True)
df.set_index(index_column_name, inplace=True)
df
index_value
A
1 1
7 1
4 1
2 2
8 2
5 2
3 3
9 3
2 3
4 4
1 4
4 4
This is just melt
df.columns = range(df.shape[1])
s = df.melt().loc[lambda x : x.value!=0]
s
variable value
3 1 1
4 1 7
5 1 4
6 2 2
7 2 8
8 2 5
9 3 3
10 3 9
11 3 2
12 4 4
13 4 1
14 4 4
Try using:
df = pd.melt(df[['B', 'C', 'D', 'E']])
# Or df['variable'] = df[['B', 'C', 'D', 'E']].melt()
df['variable'].shift().eq(df['variable'].shift(-1)).cumsum().shift(-1).ffill()
print(df)
Output:
variable value
0 1.0 1
1 1.0 7
2 1.0 4
3 2.0 2
4 2.0 8
5 2.0 5
6 3.0 3
7 3.0 9
8 3.0 2
9 4.0 4
10 4.0 1
11 4.0 4
I have one pandas dataframe, with one row and multiple columns.
I want to get the column number/index of the minimum value in the given row.
The code I found was: df.columns.get_loc('colname')
The above code asks for a column name.
My dataframe doesn't have column names. I want to get the column location of the minimum value.
Use argmin with converting DataFrame to array by values, only necessary only numeric data:
df = pd.DataFrame({
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[-5,3,6,9,2,-4]
})
print (df)
B C D E
0 4 7 1 -5
1 5 8 3 3
2 4 9 5 6
3 5 4 7 9
4 5 2 1 2
5 4 3 0 -4
df['col'] = df.values.argmin(axis=1)
print (df)
B C D E col
0 4 7 1 -5 3
1 5 8 3 3 2
2 4 9 5 6 0
3 5 4 7 9 1
4 5 2 1 2 2
5 4 3 0 -4 3
I have a dataframe that looks like the following:
index value
1 21.046091
2 52.400000
3 14.082153
4 1.859942
5 1.859942
6 2.331143
7 9.060000
8 0.789265
9 12967.7
The last value is much higher than the rest. I'm trying to bin all the values into 5 bins using pd.cut:
pd.cut(df['value'], 5, labels = [1,2,3,4,5])
But it only ends up returning the groups 1 and 5.
index value group
0 0.410000 1
1 21.046091 1
2 52.400000 1
3 14.082153 1
4 1.859942 1
5 1.859942 1
6 2.331143 1
7 9.060000 1
8 0.789265 1
9 12967.7 5
The higher value is clearly throwing it, but is there a way to ensure that all five bins are represented in the dataframe without getting rid of outlying values?
You could use qcut:
pd.qcut(df['value'],5,labels=[1,2,3,4,5])
Output:
index
1 4
2 5
3 4
4 1
5 1
6 2
7 3
8 1
9 5
Name: value, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
print(df.assign(group = pd.qcut(df['value'],5,labels=[1,2,3,4,5])))
value group
index
1 21.046091 4
2 52.400000 5
3 14.082153 4
4 1.859942 1
5 1.859942 1
6 2.331143 2
7 9.060000 3
8 0.789265 1
9 12967.700000 5