Would I need to use pivot to turn a row of columns into one column? I have a dataset like the one shown, where each row holds 8 separate values. I need to make each cell its own row.
This would be an example of what I would start with:
d = {'col1':[1,9,17],'col2':[2,10,18],'col3':[3,11,19],'col4':[4,12,20],'col5':[5,13,21],'col6':[6,14,22],'col7':[7,15,23],'col8':[8,16,24]}
import pandas as pd
df = pd.DataFrame(data=d)
And then would need to have a new df like this:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
You can stack the dataframe vertically:
new_df = pd.DataFrame({'col': df.stack().values})
or using the underlying raw array (.to_numpy().ravel()/.to_numpy().flatten()):
new_df = pd.DataFrame({'col': df.to_numpy().flatten()})
print(new_df)
col
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
11 12
12 13
13 14
14 15
15 16
16 17
17 18
18 19
19 20
20 21
21 22
22 23
23 24
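You can also try melt(). Note that melt() returns the values column by column, so the final sort_values('value') restores the desired order only because the sample values happen to increase in row-by-row order:
df1=df.melt().drop('variable', axis=1).sort_values('value')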
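To see why the ordering matters, here is a minimal sketch on a small made-up frame (the column names a and b and the values are assumptions, chosen so that row order and sorted order differ). It compares the row-by-row flattening with what melt() returns, and shows that melting the transposed frame is one way to get row order back:

```python
import pandas as pd

# Hypothetical 2x2 frame whose values are NOT sorted in row-by-row order
df = pd.DataFrame({'a': [3, 1], 'b': [2, 4]})

# Row-by-row flattening of the underlying array
row_major = df.to_numpy().ravel().tolist()

# melt() walks the frame column by column
melted = df.melt()['value'].tolist()

# Melting the transposed frame walks the original frame row by row
melted_row_major = df.T.melt()['value'].tolist()

print(row_major)         # [3, 2, 1, 4]
print(melted)            # [3, 1, 2, 4]
print(melted_row_major)  # [3, 2, 1, 4]
```

Sorting the melted values would give 1, 2, 3, 4 here, which is neither of the two orders, so sort_values is only safe when the data is already increasing row by row.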
I am looking to extract the column name and the index of elements from a dataframe
import numpy as np
import pandas as pd
import random
lst = list(range(30))
segments = np.repeat(lst, 3)
random.shuffle(segments)
segments = segments.reshape(10, 9)
col_names = ['lz'+str(i) for i in range(95,104)]
rows_names = ['con'+str(i) for i in range(0,10)]
df = pd.DataFrame(segments, index=rows_names, columns=col_names)
lz95 lz96 lz97 lz98 lz99 lz100 lz101 lz102 lz103
con0 6 9 11 7 9 24 18 10 1
con1 24 5 21 15 18 25 24 7 29
con2 17 27 2 0 11 11 18 23 0
con3 16 22 20 22 20 14 14 0 8
con4 10 10 3 13 25 14 9 17 16
con5 3 28 22 2 27 12 16 21 4
con6 26 1 19 7 19 6 29 15 26
con7 26 28 4 13 23 23 1 25 19
con8 28 8 3 6 5 8 4 5 29
con9 2 15 21 27 17 13 12 12 20
For value=12, I am able to extract the column names with lz_val = df.columns[df.isin([12]).any()]
But this does not work for the rows, as it returns every index label:
con_val = df[df==12].index
Index(['con0', 'con1', 'con2', 'con3', 'con4', 'con5', 'con6', 'con7', 'con8',
'con9'],
dtype='object')
What about using np.where?
rows, cols = np.where(df == 12)
rows = df.index[rows]
cols = df.columns[cols]
>>> rows
Index(['con5', 'con9', 'con9'], dtype='object')
>>> cols
Index(['lz100', 'lz101', 'lz102'], dtype='object')
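As a small follow-up sketch, the two position arrays can be zipped into (row, column) pairs. The frame below is a made-up two-row excerpt used only for illustration, not the full random data from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt with a few cells equal to 12
df = pd.DataFrame(
    [[3, 28, 12], [12, 12, 20]],
    index=['con5', 'con9'],
    columns=['lz100', 'lz101', 'lz102'],
)

# np.where returns the integer positions of the matching cells, row by row
rows, cols = np.where(df.to_numpy() == 12)

# Translate the positions into labels and pair them up
pairs = list(zip(df.index[rows], df.columns[cols]))
print(pairs)  # [('con5', 'lz102'), ('con9', 'lz100'), ('con9', 'lz101')]
```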
Disclaimer: This might be a possible duplicate, but I cannot find the exact solution. Please feel free to mark this question as a duplicate and provide a link to the duplicate question in the comments.
I am still learning Python dataframe operations, and this possibly has a very simple solution that I am not able to figure out.
I have a pandas dataframe with a single column. Now I want to change the value of each row to the value of the previous row if certain conditions are satisfied. I have created a loop solution to implement this, but I was hoping for a more efficient solution.
Creation of initial data:
import numpy as np
import pandas as pd
data = np.random.randint(5,30,size=20)
df = pd.DataFrame(data, columns=['random_numbers'])
print(df)
random_numbers
0 6
1 24
2 29
3 18
4 22
5 17
6 12
7 7
8 6
9 27
10 29
11 13
12 23
13 6
14 25
15 24
16 16
17 15
18 25
19 19
Now let's assume the two conditions are 1) value less than 10 and 2) value more than 20. In either case, set the row value to the previous row's value. This has been implemented as a loop as follows:
for index, row in df.iterrows():
    if index == 0:
        continue
    if row.random_numbers < 10:
        df.loc[index, 'random_numbers'] = df.loc[index-1, 'random_numbers']
    if row.random_numbers > 20:
        df.loc[index, 'random_numbers'] = df.loc[index-1, 'random_numbers']
random_numbers
0 6
1 6
2 6
3 18
4 18
5 17
6 12
7 12
8 12
9 12
10 12
11 13
12 13
13 13
14 13
15 13
16 16
17 15
18 15
19 19
Please suggest a more efficient way to implement this logic, as I am working with a large number of rows.
You can replace the values less than 10 and the values more than 20 with NaN, then use pandas.DataFrame.ffill() to fill each NaN with the previous row's value.
mask = (df['random_numbers'] < 10) | (df['random_numbers'] > 20)
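# Leave the first row untouched, since your loop skips it with `if index == 0:`
mask[df.index[0]] = False
df.loc[mask, 'random_numbers'] = np.nan
# Assign the result back instead of calling ffill(inplace=True) on the selected column,
# which is not guaranteed to update df in newer pandas versions
df['random_numbers'] = df['random_numbers'].ffill()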
# Original
random_numbers
0 7
1 28
2 8
3 14
4 12
5 20
6 21
7 11
8 16
9 27
10 19
11 23
12 18
13 5
14 6
15 11
16 6
17 8
18 17
19 8
# After replacement
random_numbers
0 7.0
1 7.0
2 7.0
3 14.0
4 12.0
5 20.0
6 20.0
7 11.0
8 16.0
9 16.0
10 19.0
11 19.0
12 18.0
13 18.0
14 18.0
15 11.0
16 11.0
17 11.0
18 17.0
19 17.0
We can also do it in a simpler way by using .mask() together with .ffill() and slicing on [1:] as follows:
df['random_numbers'][1:] = df['random_numbers'][1:].mask((df['random_numbers'] < 10) | (df['random_numbers'] > 20))
df['random_numbers'] = df['random_numbers'].ffill(downcast='infer')
.mask() tests the condition and replaces a value with NaN where the condition is true (NaN is the default replacement when the parameter other= is not supplied). It retains the original values for rows where the condition is not met.
Note that supplying downcast='infer' in the call to .ffill() keeps the resulting numbers as integers instead of letting them be converted to float. Recent pandas releases deprecate this parameter; calling .astype(int) after .ffill() is an alternative there.
We use [1:] on the first line to ensure the data in row 0 is left untouched.
# Original data: (reusing your sample data)
random_numbers
0 6
1 24
2 29
3 18
4 22
5 17
6 12
7 7
8 6
9 27
10 29
11 13
12 23
13 6
14 25
15 24
16 16
17 15
18 25
19 19
# After transformation:
random_numbers
0 6
1 6
2 6
3 18
4 18
5 17
6 12
7 12
8 12
9 12
10 12
11 13
12 13
13 13
14 13
15 13
16 16
17 15
18 15
19 19
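For reference, the mask-then-forward-fill idea can be wrapped in a small helper and checked against the loop's result on the sample data from the question. This is only a sketch: the function name fill_out_of_range and its low/high parameters are made up for illustration, and it restores the integer dtype with .astype() rather than the downcast argument.

```python
import pandas as pd

def fill_out_of_range(s, low=10, high=20):
    """Replace values below `low` or above `high` with the last kept value.

    The first row is always kept, mirroring the `if index == 0` skip in the loop.
    """
    cond = (s < low) | (s > high)
    cond.iloc[0] = False                 # never touch the first row
    # mask() turns flagged values into NaN, ffill() copies the previous value down
    return s.mask(cond).ffill().astype(s.dtype)

sample = pd.Series(
    [6, 24, 29, 18, 22, 17, 12, 7, 6, 27, 29, 13, 23, 6, 25, 24, 16, 15, 25, 19],
    name='random_numbers',
)
result = fill_out_of_range(sample)
print(result.tolist())
# [6, 6, 6, 18, 18, 17, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 16, 15, 15, 19]
```

Because the first row is never masked, the forward fill always has a value to copy, so no NaN remains and the cast back to the original integer dtype is safe.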
My dataframe looks like:
c1
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
I want to find the minimum for every 3 rows, which looks like:
c1 min
0 10 10
1 11 10
2 12 10
3 13 13
4 14 13
5 15 13
6 16 16
7 17 16
The number of rows might not be divisible by 3. I can't achieve this with the rolling function.
If the index has the default values (0, 1, 2, ...), use integer division by 3 and pass the result to GroupBy.transform with min:
df['min'] = df['c1'].groupby(df.index // 3).transform('min')
Or, for any index, generate a helper array with np.arange:
df['min'] = df['c1'].groupby(np.arange(len(df)) // 3).transform('min')
print (df)
c1 min
0 10 10
1 11 10
2 12 10
3 13 13
4 14 13
5 15 13
6 16 16
7 17 16
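As an illustration of the np.arange variant, here is a minimal sketch on made-up data: the string index and the unsorted values are assumptions chosen to show that the grouping depends only on row position, not on the index labels or the ordering of the values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: non-default index, values not sorted within each block
df = pd.DataFrame({'c1': [12, 10, 11, 15, 13, 14, 17, 16]}, index=list('abcdefgh'))

# Block id per row by position: 0,0,0,1,1,1,2,2 (the last block has only 2 rows)
block_ids = np.arange(len(df)) // 3

# transform('min') broadcasts each block's minimum back to all of its rows
df['min'] = df['c1'].groupby(block_ids).transform('min')
print(df['min'].tolist())  # [10, 10, 10, 13, 13, 13, 16, 16]
```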
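You can also do this, but only when c1 is sorted in ascending order, so that the first value of every block of 3 is also its minimum: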
>>> df['min'] = df['c1'][::3]
>>> df.ffill().astype(int)
c1 min
0 10 10
1 11 10
2 12 10
3 13 13
4 14 13
5 15 13
6 16 16
7 17 16
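The slicing shortcut takes the first value of each block rather than its minimum. A short sketch on made-up, unsorted values (an assumption for illustration) shows where the two results diverge:

```python
import pandas as pd

# Hypothetical data where the first value of a block is not its minimum
df = pd.DataFrame({'c1': [12, 10, 11, 15, 13, 14, 17, 16]})

# Shortcut: keep every third value and forward-fill the gaps
first_of_block = df['c1'][::3].reindex(df.index).ffill().astype(int)

# Grouped minimum per block of 3 rows
block_min = df['c1'].groupby(df.index // 3).transform('min')

print(first_of_block.tolist())  # [12, 12, 12, 15, 15, 15, 17, 17]
print(block_min.tolist())       # [10, 10, 10, 13, 13, 13, 16, 16]
```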
I have a pandas dataframe with around 15 columns. All I am trying to do is check, for each partition_num, whether the data in its first row equals the data in its last row; if they are not equal, add a new row at the end with the data from the first row.
Input:
row id partition_num lat long time
0 1 7333 24 26 9
1 2 7333 15 19 10
2 3 7333 24 25 9
3 1 8999 26 18 15
4 2 8999 15 17 45
5 3 8999 26 18 15
6 1 3455 12 14 18
7 2 3455 12 14 18
Desired output:
row id partition_num lat long time
0 1 7333 24 26 9
1 2 7333 15 19 10
2 3 7333 25 26 9
3 4 7333 24 26 9
4 1 8999 26 18 15
5 2 8999 15 17 45
6 3 8999 26 18 15
7 1 3455 12 14 18
8 2 3455 12 14 18
Since the data for partition_num 7333 in row 0 is not equal to the data in row 2, add a new row (row 3) with the same data as row 0.
Can we also add a new column, something like flag, to identify the new record:
row id partition_num lat long time flag
0 1 7333 24 26 9 old
1 2 7333 15 19 10 old
2 3 7333 25 26 9 old
3 4 7333 24 26 9 new
4 1 8999 26 18 15 old
5 2 8999 15 17 45 old
6 3 8999 26 18 15 old
7 1 3455 12 14 18 old
8 2 3455 12 14 18 old
groupby will easily build sub_dataframes per partition_num. From that point the processing is simple:
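for i, x in df.groupby('partition_num'):
    # Compare the first and last rows of the group, from 'partition_num' onward
    if (x.iloc[0]['partition_num':] != x.iloc[-1]['partition_num':]).any():
        s = x.iloc[0].copy()
        s['id'] = x.iloc[-1]['id'] + 1
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        df = pd.concat([df, s.to_frame().T], ignore_index=True).rename_axis('row')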
The following code compares the values of 'partition_num' in the first and last rows of the whole data frame and, if they don't match, appends the first row to the end of the data frame (pd.concat is used because DataFrame.append was removed in pandas 2.0):
if df.loc[0, 'partition_num'] != df.loc[len(df)-1, 'partition_num']:
    df = pd.concat([df, df.loc[[0]]], ignore_index=True)
df.index.name = 'row'
print(df)
id partition_num lat long time
row
0 1 7333 24 26 9
1 2 7333 15 19 10
2 3 7333 24 26 9
3 1 8999 26 18 15
4 2 8999 15 17 45
5 3 8999 26 18 15
6 1 3455 12 14 18
7 2 3455 12 14 18
8 1 7333 24 26 9
The index is reset so that the row numbers stay consecutive, and it is then named 'row'.
Added this piece to the above logic:
s['flag']= 'new_row'
and it worked!!
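Putting the pieces of this thread together, here is one possible sketch that places the extra row directly after its partition and fills the flag column. It uses the input table from the question; the function name add_closing_rows and the choice to compare only lat, long and time are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id':            [1, 2, 3, 1, 2, 3, 1, 2],
    'partition_num': [7333, 7333, 7333, 8999, 8999, 8999, 3455, 3455],
    'lat':           [24, 15, 24, 26, 15, 26, 12, 12],
    'long':          [26, 19, 25, 18, 17, 18, 14, 14],
    'time':          [9, 10, 9, 15, 45, 15, 18, 18],
})

def add_closing_rows(frame, key='partition_num', compare=('lat', 'long', 'time')):
    """Append a copy of each group's first row when it differs from the last row."""
    pieces = []
    # sort=False keeps the partitions in their order of first appearance
    for _, group in frame.groupby(key, sort=False):
        group = group.assign(flag='old')
        first, last = group.iloc[0], group.iloc[-1]
        if any(first[c] != last[c] for c in compare):
            # Copy the first row, give it the next id and mark it as new
            extra = group.iloc[[0]].assign(id=last['id'] + 1, flag='new')
            group = pd.concat([group, extra])
        pieces.append(group)
    return pd.concat(pieces, ignore_index=True).rename_axis('row')

result = add_closing_rows(df)
print(result)
```

Collecting the groups in a list and concatenating once at the end avoids growing the data frame inside the loop, which tends to matter when there are many partitions.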
I am new to programming and have taken up learning python in an attempt to make some tasks I run in my research more efficient. I am running a PCA in the pandas module (I found a tutorial online) and have the script for this, but need to subselect part of a dataframe prior to the pca.
So far I have the following (just as an example; in reality I am reading a .csv file with a larger matrix):
x = np.random.randint(30, size=(8,8))
df = pd.DataFrame(x)
0 1 2 3 4 5 6 7
0 9 0 23 13 2 5 14 6
1 20 17 11 10 25 23 20 23
2 15 14 22 25 11 15 5 15
3 9 27 15 27 7 15 17 23
4 12 6 11 13 27 11 26 20
5 27 13 5 16 5 5 2 18
6 3 18 22 0 7 10 11 11
7 25 18 10 11 29 29 1 25
What I want to do is sub-select columns that satisfy a certain criterion in any of the rows. Specifically, I want every column that has at least one number >=27 (just for example) to produce a new dataframe
0 1 3 4 5
0 9 0 13 2 5
1 20 17 10 25 23
2 15 14 25 11 15
3 9 27 27 7 15
4 12 6 13 27 11
5 27 13 16 5 5
6 3 18 0 7 10
7 25 18 11 29 29
I have looked into the various slicing methods in pandas but none seem to do what I want (.loc and .iloc etc.).
The actual script I am using to read in thus far is
filename = 'Data.csv'
data = pd.read_csv(filename,sep = ',')
x = data.iloc[:, 1:] # variables - species
y = data.iloc[:, 0] # cases - age
so a sub-dataframe of x is what I am after (as above).
Any advice is greatly appreciated.
Indexers like loc and iloc accept boolean arrays (as did ix before it was removed from pandas). For example, if you have three columns, df.loc[:, [True, False, True]] will return all the rows and columns 0 and 2 (where the corresponding value is True). You can check whether any element in a column is greater than or equal to 27 with (df>=27).any(). This returns True for the columns that have at least one value >=27. So you can slice the dataframe with:
df.loc[:, (df>=27).any()]
Out[34]:
0 1 3 4 5 7
0 8 2 28 9 14 21
1 24 26 23 17 0 0
2 3 24 7 15 4 28
3 29 17 12 7 7 6
4 5 3 10 24 29 14
5 23 21 0 16 23 13
6 22 10 27 1 7 24
7 9 27 2 27 17 12
And this is the initial dataframe:
df
Out[35]:
0 1 2 3 4 5 6 7
0 8 2 7 28 9 14 26 21
1 24 26 15 23 17 0 21 0
2 3 24 26 7 15 4 7 28
3 29 17 9 12 7 7 0 6
4 5 3 13 10 24 29 22 14
5 23 21 26 0 16 23 17 13
6 22 10 19 27 1 7 9 24
7 9 27 26 2 27 17 8 12
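For comparison, the same selection can be sketched on the 8x8 example from the question (the values are typed in by hand from the question's table; the variable names has_large and subset are made up for illustration):

```python
import pandas as pd

# The example matrix shown in the question
df = pd.DataFrame([
    [9, 0, 23, 13, 2, 5, 14, 6],
    [20, 17, 11, 10, 25, 23, 20, 23],
    [15, 14, 22, 25, 11, 15, 5, 15],
    [9, 27, 15, 27, 7, 15, 17, 23],
    [12, 6, 11, 13, 27, 11, 26, 20],
    [27, 13, 5, 16, 5, 5, 2, 18],
    [3, 18, 22, 0, 7, 10, 11, 11],
    [25, 18, 10, 11, 29, 29, 1, 25],
])

# One boolean per column: True if any value in that column is >= 27
has_large = (df >= 27).any()

# Keep all rows, and only the columns flagged True
subset = df.loc[:, has_large]
print(subset.columns.tolist())  # [0, 1, 3, 4, 5]
```

The result keeps the original column labels, which matches the desired output in the question.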