Pandas dataframe sub-selection - python

I am new to programming and have taken up learning Python to make some tasks in my research more efficient. I am running a PCA in the pandas module (I found a tutorial online) and have the script for this, but I need to sub-select part of a DataFrame prior to the PCA.
So far I have (just as an example; in reality I am reading a .csv file with a larger matrix):
import numpy as np
import pandas as pd

x = np.random.randint(30, size=(8, 8))
df = pd.DataFrame(x)
0 1 2 3 4 5 6 7
0 9 0 23 13 2 5 14 6
1 20 17 11 10 25 23 20 23
2 15 14 22 25 11 15 5 15
3 9 27 15 27 7 15 17 23
4 12 6 11 13 27 11 26 20
5 27 13 5 16 5 5 2 18
6 3 18 22 0 7 10 11 11
7 25 18 10 11 29 29 1 25
What I want to do is sub-select columns that satisfy a certain criterion in any of their rows. Specifically, I want every column that has at least one value >= 27 (just as an example), producing a new DataFrame:
0 1 3 4 5
0 9 0 13 2 5
1 20 17 10 25 23
2 15 14 25 11 15
3 9 27 27 7 15
4 12 6 13 27 11
5 27 13 16 5 5
6 3 18 0 7 10
7 25 18 11 29 29
I have looked into the various slicing methods in pandas (.loc, .iloc, etc.), but none seem to do what I want.
The actual script I am using to read in the data so far is:
filename = 'Data.csv'
data = pd.read_csv(filename, sep=',')
x = data.ix[:, 1:]  # variables - species
y = data.ix[:, 0]   # cases - age
So a sub-DataFrame of x is what I am after (as above).
Any advice is greatly appreciated.

Indexers like loc, iloc, and ix accept boolean arrays. For example, if you have three columns, df.loc[:, [True, False, True]] will return all the rows and columns 0 and 2 (the columns whose corresponding value is True). You can check whether any of the elements in a column is greater than or equal to 27 with (df >= 27).any(), which returns True for every column that has at least one value >= 27. So you can slice the DataFrame with:
df.loc[:, (df >= 27).any()]
Out[34]:
0 1 3 4 5 7
0 8 2 28 9 14 21
1 24 26 23 17 0 0
2 3 24 7 15 4 28
3 29 17 12 7 7 6
4 5 3 10 24 29 14
5 23 21 0 16 23 13
6 22 10 27 1 7 24
7 9 27 2 27 17 12
And this is the initial dataframe:
df
Out[35]:
0 1 2 3 4 5 6 7
0 8 2 7 28 9 14 26 21
1 24 26 15 23 17 0 21 0
2 3 24 26 7 15 4 7 28
3 29 17 9 12 7 7 0 6
4 5 3 13 10 24 29 22 14
5 23 21 26 0 16 23 17 13
6 22 10 19 27 1 7 9 24
7 9 27 26 2 27 17 8 12
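Applied to the asker's CSV workflow, a minimal sketch (note that .ix is deprecated and removed in modern pandas; .iloc is the positional equivalent, and 'Data.csv' is the file from the question):

import pandas as pd

data = pd.read_csv('Data.csv', sep=',')
x = data.iloc[:, 1:]  # variables - species (positional equivalent of .ix)
y = data.iloc[:, 0]   # cases - age

# keep only the columns of x that contain at least one value >= 27
x_sub = x.loc[:, (x >= 27).any()]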

Related

How do I set a cell value based on another cell in the same row in Python using pandas

I am getting my head around Python and the pandas library and trying out some basics, but I am getting lost in the documentation.
I have a pandas DataFrame:
A B C D
1 2 3 4
3 4 1 7
6 9 0 1
...
other 10k+ rows.
I now want to add a column, say 'E', that reads True/False depending on whether the value of 'D' is in the top 10% of the entire column.
One approach I tried is to sort descending by column 'D' and then update the top 10% of rows. This way I am able to sort, but I have not yet figured out how to update the top 10% of rows; it also alters the original order, which isn't desirable.
df = df.sort_values('D',ascending=False)
df.iloc[:0, :(df.shape[0]-1)/10, 5] = value  # --- this doesn't work
Just checking: is there a way to achieve this without sorting? If not, how do I update the top 10% of rows once they are sorted?
Thanks
If you need to flag exactly the top 10% of values without sorting (ties broken arbitrarily), use np.argsort:
import numpy as np
import pandas as pd

np.random.seed(2021)
df = pd.DataFrame(np.random.randint(30, size=(20, 5)), columns=list('ABCDE'))

n = 10
N = int(len(df.index) * (n / 100))
print(N)
2
df['mask'] = np.argsort(np.argsort(-df['E'].to_numpy())) < N
print (df)
A B C D E mask
0 20 21 25 0 13 False
1 22 12 27 29 21 True
2 29 24 12 22 6 False
3 6 6 1 5 7 False
4 1 14 1 28 5 False
5 26 2 16 3 17 False
6 16 18 22 27 20 False
7 29 24 5 17 6 False
8 10 14 7 21 6 False
9 9 21 22 25 18 False
10 10 4 13 10 19 False
11 25 18 26 15 8 False
12 10 12 21 11 19 False
13 1 14 17 25 18 False
14 7 21 19 27 12 False
15 23 19 9 4 9 False
16 7 25 7 7 20 False
17 27 29 11 27 19 False
18 18 14 25 27 18 False
19 21 18 26 0 20 True
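The double np.argsort computes the rank of each value: the inner call gives the descending sort order, and the outer call inverts that order into ranks. A minimal illustration with made-up numbers:

import numpy as np

a = np.array([30, 10, 20])
order = np.argsort(-a)     # indices that sort descending: [0, 2, 1]
ranks = np.argsort(order)  # rank of each element: [0, 2, 1]
print(ranks < 1)           # [ True False False] -> flags only the single largest value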
If you need all rows matching the top 2 values (including duplicates), you can compare with Series.nlargest and Series.isin:
df['mask'] = df['E'].isin(df['E'].nlargest(2))
print (df)
A B C D E mask
0 20 21 25 0 13 False
1 22 12 27 29 21 True
2 29 24 12 22 6 False
3 6 6 1 5 7 False
4 1 14 1 28 5 False
5 26 2 16 3 17 False
6 16 18 22 27 20 True
7 29 24 5 17 6 False
8 10 14 7 21 6 False
9 9 21 22 25 18 False
10 10 4 13 10 19 False
11 25 18 26 15 8 False
12 10 12 21 11 19 False
13 1 14 17 25 18 False
14 7 21 19 27 12 False
15 23 19 9 4 9 False
16 7 25 7 7 20 True
17 27 29 11 27 19 False
18 18 14 25 27 18 False
19 21 18 26 0 20 True
If you don't want to use the built-in quantile method, use sorting:
top_10_pc = int(len(df.index) * 0.1)
min_val = min(df.sort_values(by=['D'], ascending=False)[:top_10_pc]['D'])
df['E'] = df['D'] >= min_val
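For completeness, the quantile approach mentioned above is a one-liner; a sketch assuming the same df as in the first answer (note this flags ties at the threshold, so it may mark more than exactly 10% of rows):

df['mask'] = df['D'] >= df['D'].quantile(0.9)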

How can I split columns and values of whole dataframe?

I have a dataframe like this:
a\tb\tc d\te\tf g\th\ti
20\t21\t22 1\t2\t3 30\t31\t32
17\t18\t19 4\t5\t6 27\t28\t29
14\t15\t16 7\t8\t9 24\t25\t26
11\t12\t13 10\t11\t12 21\t22\t23
8\t9\t10 13\t14\t15 18\t19\t20
5\t6\t7 16\t17\t18 15\t16\t17
2\t3\t4 19\t20\t21 12\t13\t14
expected output:
a b c d e f g h i
0 20 21 22 1 2 3 30 31 32
1 17 18 19 4 5 6 27 28 29
2 14 15 16 7 8 9 24 25 26
3 11 12 13 10 11 12 21 22 23
4 8 9 10 13 14 15 18 19 20
5 5 6 7 16 17 18 15 16 17
6 2 3 4 19 20 21 12 13 14
My solution is:
l = list()
for column in df.columns:
    columns = column.split()
    d = df[column].str.split(expand=True)
    l.append(d.rename(columns=dict(zip(range(len(columns)), columns))))
pd.concat(l, axis=1)
But this looks rather complex.
Is there a simpler way of doing this?
Your approach looks good. You can simplify the rename part by assigning the new names directly to the .columns attribute:
def expand(col):
    _df = df[col].str.split(expand=True)
    _df.columns = col.split()
    return _df

pd.concat(map(expand, df.columns), axis=1)
a b c d e f g h i
0 20 21 22 1 2 3 30 31 32
1 17 18 19 4 5 6 27 28 29
2 14 15 16 7 8 9 24 25 26
3 11 12 13 10 11 12 21 22 23
4 8 9 10 13 14 15 18 19 20
5 5 6 7 16 17 18 15 16 17
6 2 3 4 19 20 21 12 13 14
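Equivalently, a one-liner sketch using a generator expression and set_axis (assuming a recent pandas, where set_axis returns a new object by default):

pd.concat(
    (df[c].str.split(expand=True).set_axis(c.split(), axis=1) for c in df.columns),
    axis=1,
)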

Fastest way to replace current value in dataframe based on last LARGEST value

Say I have a DataFrame that looks like this:
A
0 17
1 21
2 18
3 11
4 4
5 27
6 21
7 11
8 7
9 4
10 7
11 4
12 3
13 27
14 27
15 11
16 11
17 25
I'd like column B to keep the value of A only where it equals the largest value seen so far (the running maximum), and be 0 otherwise. The desired output is this:
A B
0 17 17
1 21 21
2 18 0
3 11 0
4 4 0
5 27 27
6 21 0
7 11 0
8 7 0
9 4 0
10 7 0
11 4 0
12 3 0
13 27 27
14 27 27
15 11 0
16 11 0
17 25 0
Currently I run an iterrows loop that looks like this:
df['B'] = df['A']
lastrow = -1
for i, row in df.iterrows():
    if lastrow > row['B']:
        row['B'] = 0
    else:
        lastrow = row['B']
But it's quite slow. Is there a way to improve the speed of this loop?
I timed it, and for 100,000 rows this is the output:
CPU times: user 10.3 s, sys: 4.5 ms, total: 10.3 s
Wall time: 10.4 s
Check with cummax:
df['B'] = df.A.where(df.A.eq(df.A.cummax()), 0)
df
Out[75]:
A B
0 17 17
1 21 21
2 18 0
3 11 0
4 4 0
5 27 27
6 21 0
7 11 0
8 7 0
9 4 0
10 7 0
11 4 0
12 3 0
13 27 27
14 27 27
15 11 0
16 11 0
17 25 0

Data in 1st row should be equal to the last row - using pandas

I have a pandas DataFrame with around 15 columns. All I am trying to do is check, for each partition_num, whether the data in its first row is equal to the data in its last row; if they are not equal, add a new row at the end with the data from the first row.
Input:
row id partition_num lat long time
0 1 7333 24 26 9
1 2 7333 15 19 10
2 3 7333 24 25 9
3 1 8999 26 18 15
4 2 8999 15 17 45
5 3 8999 26 18 15
6 1 3455 12 14 18
7 2 3455 12 14 18
Desired output:
row id partition_num lat long time
0 1 7333 24 26 9
1 2 7333 15 19 10
2 3 7333 25 26 9
3 4 7333 24 26 9
4 1 8999 26 18 15
5 2 8999 15 17 45
6 3 8999 26 18 15
7 1 3455 12 14 18
8 2 3455 12 14 18
Since the data for partition_num 7333 in row 0 is not equal to the data in row 2, a new row (row 3) is added with the same data as row 0.
Can we also add a new column to identify the new record, something like a flag?
row id partition_num lat long time flag
0 1 7333 24 26 9 old
1 2 7333 15 19 10 old
2 3 7333 25 26 9 old
3 4 7333 24 26 9 new
4 1 8999 26 18 15 old
5 2 8999 15 17 45 old
6 3 8999 26 18 15 old
7 1 3455 12 14 18 old
8 2 3455 12 14 18 old
groupby will easily build a sub-DataFrame per partition_num. From that point the processing is simple:
for i, x in df.groupby('partition_num'):
    if (x.iloc[0]['partition_num':] != x.iloc[-1]['partition_num':]).any():
        s = x.iloc[0].copy()
        s.id = x.iloc[-1].id + 1
        df = df.append(s).reset_index(drop=True).rename_axis('row')
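Note that DataFrame.append was deprecated and removed in pandas 2.0; a sketch of the same append step with pd.concat (same df and s as above):

df = pd.concat([df, s.to_frame().T], ignore_index=True).rename_axis('row')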
The following code compares the values of 'partition_num' in the first and last row, and if they don't match, appends the first row onto the end of the data frame:
if df.loc[0, 'partition_num'] != df.loc[len(df)-1, 'partition_num']:
    df = df.append(df.loc[0, :]).reset_index(drop=True)
df.index.name = 'row'
print(df)
id partition_num lat long time
row
0 1 7333 24 26 9
1 2 7333 15 19 10
2 3 7333 24 26 9
3 1 8999 26 18 15
4 2 8999 15 17 45
5 3 8999 26 18 15
6 1 3455 12 14 18
7 2 3455 12 14 18
8 1 7333 24 26 9
The index column is set to 'row', and it is reset and renamed to get the correct ordering.
I added this piece to the above logic:
s['flag'] = 'new_row'
and it worked!

python - replace last n columns with sum of all files

I am a novice in Python.
I have 8 CSV files, each with 26 columns and 600 rows. I want to take the last 4 columns of each file (columns 22 to 25), sum them element-wise across all the files, and then replace those 4 columns in each file with the sums. For example (showing some random data here):
new-1.csv:
a b c d e f g h i j k
1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8 8 8 8
9 9 9 9 9 9 9 9 9 9 9
new2.csv:
a b c d e f g h i j k
11 11 11 11 11 11 11 11 11 11 11
12 12 12 12 12 12 12 12 12 12 12
13 13 13 13 13 13 13 13 13 13 13
14 14 14 14 14 14 14 14 14 14 14
15 15 15 15 15 15 15 15 15 15 15
16 16 16 16 16 16 16 16 16 16 16
17 17 17 17 17 17 17 17 17 17 17
18 18 18 18 18 18 18 18 18 18 18
19 19 19 19 19 19 19 19 19 19 19
Now, I want to sum each element of columns h, i, j, k across these 2 files, then replace each file's last 4 columns with this new sum.
Modified new-1.csv:
a b c d e f g h i j k
1 1 1 1 1 1 1 12 12 12 12
2 2 2 2 2 2 2 14 14 14 14
3 3 3 3 3 3 3 16 16 16 16
4 4 4 4 4 4 4 18 18 18 18
5 5 5 5 5 5 5 20 20 20 20
6 6 6 6 6 6 6 22 22 22 22
7 7 7 7 7 7 7 24 24 24 24
8 8 8 8 8 8 8 26 26 26 26
9 9 9 9 9 9 9 28 28 28 28
Modified new-2.csv:
a b c d e f g h i j k
11 11 11 11 11 11 11 12 12 12 12
12 12 12 12 12 12 12 14 14 14 14
13 13 13 13 13 13 13 16 16 16 16
14 14 14 14 14 14 14 18 18 18 18
15 15 15 15 15 15 15 20 20 20 20
16 16 16 16 16 16 16 22 22 22 22
17 17 17 17 17 17 17 24 24 24 24
18 18 18 18 18 18 18 26 26 26 26
19 19 19 19 19 19 19 28 28 28 28
I am assuming I should use pandas or NumPy for this, but I am not sure how. Any suggestions/hints would be appreciated.
You can do this using just NumPy.
import numpy as np

# list of all the files
file_list = ['foo.csv', 'bar.csv', 'baz.csv']  # all 8 files
col_names = ['a', 'b', 'c', 'd', 'e', 'f']  # all the names up to z as the first row, if necessary; else skip this

# initialize a numpy array to hold the sum of the last 4 columns
add_cols = np.zeros((600, 4))

# iterate over all .csv files
for file in file_list:
    # skiprows skips the header row; usecols picks the last 4 columns
    temp = np.loadtxt(file, skiprows=1, delimiter=',', usecols=(22, 23, 24, 25))
    add_cols = np.add(temp, add_cols)

# now overwrite all the files, substituting the last 4 columns with the sum
for file in file_list:
    # load the content from the file into temp
    temp = np.loadtxt(file, skiprows=1, delimiter=',')
    temp[:, [22, 23, 24, 25]] = add_cols
    # write the column names first
    with open(file, 'w') as p:
        p.write(','.join(col_names) + '\n')
    # then append the final values in temp to the file as csv
    with open(file, 'a') as p:
        np.savetxt(p, temp, delimiter=",", fmt="%i")
Now, if your files are space-separated rather than comma-separated, remove the delimiter option from all the functions, as the delimiter is space by default, and adjust the join of the column names accordingly.
After loading your csvs using read_csv, you can add the last 4 columns together and then overwrite them:
In [10]:
total = df[df.columns[-4:]].values + df1[df1.columns[-4:]].values
total
Out[10]:
array([[12, 12, 12, 12],
[14, 14, 14, 14],
[16, 16, 16, 16],
[18, 18, 18, 18],
[20, 20, 20, 20],
[22, 22, 22, 22],
[24, 24, 24, 24],
[26, 26, 26, 26],
[28, 28, 28, 28]], dtype=int64)
In [12]:
df[df.columns[-4:]] = total
df1[df1.columns[-4:]] = total
df
Out[12]:
a b c d e f g h i j k
0 1 1 1 1 1 1 1 12 12 12 12
1 2 2 2 2 2 2 2 14 14 14 14
2 3 3 3 3 3 3 3 16 16 16 16
3 4 4 4 4 4 4 4 18 18 18 18
4 5 5 5 5 5 5 5 20 20 20 20
5 6 6 6 6 6 6 6 22 22 22 22
6 7 7 7 7 7 7 7 24 24 24 24
7 8 8 8 8 8 8 8 26 26 26 26
8 9 9 9 9 9 9 9 28 28 28 28
In [13]:
df1
Out[13]:
a b c d e f g h i j k
0 11 11 11 11 11 11 11 12 12 12 12
1 12 12 12 12 12 12 12 14 14 14 14
2 13 13 13 13 13 13 13 16 16 16 16
3 14 14 14 14 14 14 14 18 18 18 18
4 15 15 15 15 15 15 15 20 20 20 20
5 16 16 16 16 16 16 16 22 22 22 22
6 17 17 17 17 17 17 17 24 24 24 24
7 18 18 18 18 18 18 18 26 26 26 26
8 19 19 19 19 19 19 19 28 28 28 28
We need to access the .values attribute here to return a NumPy array, because otherwise pandas will try to align on the index, and in this case the indexes do not align.
Once you overwrite them call df.to_csv(file_path) and df1.to_csv(file_path)
In the case of your 8 dfs you can loop over them and aggregate whilst looping:
# take a copy of the first df's last 4 columns
total = df_list[0]
total = total[total.columns[-4:]].values.copy()  # .copy() so += below does not mutate the first df
for df in df_list[1:]:
    total += df[df.columns[-4:]].values
Then just loop over your dfs again to overwrite:
for df in df_list:
    df[df.columns[-4:]] = total
And then write out again using to_csv.
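Putting the pandas version together for all 8 files, a minimal sketch (filenames taken from the question; assumes comma-separated files with a header row, as in the NumPy answer):

import pandas as pd

file_list = ['new-1.csv', 'new2.csv']  # all 8 files

# read every file, then sum the last 4 columns across files
df_list = [pd.read_csv(f) for f in file_list]
total = df_list[0][df_list[0].columns[-4:]].values.copy()
for df in df_list[1:]:
    total += df[df.columns[-4:]].values

# overwrite the last 4 columns in each file and write it back out
for f, df in zip(file_list, df_list):
    df[df.columns[-4:]] = total
    df.to_csv(f, index=False)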
