How can I split columns and values of whole dataframe? - python

I have a dataframe like this:
a\tb\tc d\te\tf g\th\ti
20\t21\t22 1\t2\t3 30\t31\t32
17\t18\t19 4\t5\t6 27\t28\t29
14\t15\t16 7\t8\t9 24\t25\t26
11\t12\t13 10\t11\t12 21\t22\t23
8\t9\t10 13\t14\t15 18\t19\t20
5\t6\t7 16\t17\t18 15\t16\t17
2\t3\t4 19\t20\t21 12\t13\t14
expected output:
a b c d e f g h i
0 20 21 22 1 2 3 30 31 32
1 17 18 19 4 5 6 27 28 29
2 14 15 16 7 8 9 24 25 26
3 11 12 13 10 11 12 21 22 23
4 8 9 10 13 14 15 18 19 20
5 5 6 7 16 17 18 15 16 17
6 2 3 4 19 20 21 12 13 14
My solution is:
l = list()
for column in df.columns:
    columns = column.split()
    d = df[column].str.split(expand=True)
    l.append(d.rename(columns=dict(zip(range(len(columns)),columns))))
pd.concat(l,axis=1)
But this looks so complex.
Is there a simple way of doing this ?

Your approach looks good. You can simplify the rename part by just assigning the new names to .columns attribute:
def expand(col):
    _df = df[col].str.split(expand=True)
    _df.columns = col.split()
    return _df
pd.concat(map(expand, df.columns), axis=1)
a b c d e f g h i
0 20 21 22 1 2 3 30 31 32
1 17 18 19 4 5 6 27 28 29
2 14 15 16 7 8 9 24 25 26
3 11 12 13 10 11 12 21 22 23
4 8 9 10 13 14 15 18 19 20
5 5 6 7 16 17 18 15 16 17
6 2 3 4 19 20 21 12 13 14
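For anyone who wants to try this end to end, here is a minimal, self-contained sketch. It assumes the packed fields are separated by tab characters (as the question's sample suggests) and uses only two rows; the `astype(int)` step is an extra that is not part of the answer above.

```python
import pandas as pd

# Hypothetical reconstruction of the question's frame: each column label and
# each cell packs three tab-separated fields.
df = pd.DataFrame({
    "a\tb\tc": ["20\t21\t22", "17\t18\t19"],
    "d\te\tf": ["1\t2\t3", "4\t5\t6"],
    "g\th\ti": ["30\t31\t32", "27\t28\t29"],
})

def expand(col):
    # Split every value of one packed column into separate columns...
    part = df[col].str.split(expand=True)
    # ...and name them after the fields packed into the column label.
    part.columns = col.split()
    return part

# Concatenate the expanded pieces side by side; the split values are strings,
# so convert them to integers (an addition, not in the original answer).
result = pd.concat([expand(col) for col in df.columns], axis=1).astype(int)
print(result)
```

Calling `str.split()` without a pattern splits on any whitespace, so the same sketch should also work if the fields are separated by spaces instead of tabs.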

Related

How do I set a cell value based on another cell in same row in Python Using Pandas

I am getting my head around Python and Pandas library and trying out some basics but getting lost in documentation.
I have a Pandas DataFrame
A B C D
1 2 3 4
3 4 1 7
6 9 0 1
...
other 10k+ rows.
I now want to add a column say 'E' and that should read True/False if value of 'D' is in top 10% of the entire column.
One way I tried is to sort descending by column 'D' and then update top 10% rows, in this way I am able to sort but have not yet figured out how to update top 10% rows
also this way alters the original order which isn't desirable.
df = df.sort_values('D',ascending=False)
df.iloc[:0, :(df.shape[0]-1)/10, 5] = value --- this doesn't work.
Just checking if there is a way to achieve this without sorting ? if not, how do I update top 10% rows if they were sorted ?
Thanks
If you need the top 10% of values without sorting the DataFrame, and want exactly N rows flagged even when there are duplicates, use np.argsort:
np.random.seed(2021)
df = pd.DataFrame(np.random.randint(30, size=(20, 5)), columns=list('ABCDE'))
n = 10
N = int(len(df.index)*(n/100))
print (N)
2
df['mask'] = np.argsort(np.argsort(-df['E'].to_numpy())) < N
print (df)
A B C D E mask
0 20 21 25 0 13 False
1 22 12 27 29 21 True
2 29 24 12 22 6 False
3 6 6 1 5 7 False
4 1 14 1 28 5 False
5 26 2 16 3 17 False
6 16 18 22 27 20 False
7 29 24 5 17 6 False
8 10 14 7 21 6 False
9 9 21 22 25 18 False
10 10 4 13 10 19 False
11 25 18 26 15 8 False
12 10 12 21 11 19 False
13 1 14 17 25 18 False
14 7 21 19 27 12 False
15 23 19 9 4 9 False
16 7 25 7 7 20 False
17 27 29 11 27 19 False
18 18 14 25 27 18 False
19 21 18 26 0 20 True
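The double argsort can look opaque, so here is a small sketch of what each call contributes. The array values are made up for illustration, and `kind="stable"` is an assumption added here so that ties are broken by position in a reproducible way; the answer's code uses the default sort.

```python
import numpy as np

values = np.array([13, 21, 6, 20, 20])

# First argsort: the positions that would sort the values in descending order.
order = np.argsort(-values, kind="stable")
# Second argsort: the rank of each original position within that order
# (0 = largest value).
ranks = np.argsort(order, kind="stable")

N = 2
mask = ranks < N  # True for exactly N positions, even with duplicates
print(ranks)
print(mask)
```

Because ranks are unique, only one of the two 20s is flagged here, which is the difference from the isin approach.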
If you need every row whose value is among the top 2 values, including ties, compare against Series.nlargest with Series.isin:
df['mask'] = df['E'].isin(df['E'].nlargest(2))
print (df)
A B C D E mask
0 20 21 25 0 13 False
1 22 12 27 29 21 True
2 29 24 12 22 6 False
3 6 6 1 5 7 False
4 1 14 1 28 5 False
5 26 2 16 3 17 False
6 16 18 22 27 20 True
7 29 24 5 17 6 False
8 10 14 7 21 6 False
9 9 21 22 25 18 False
10 10 4 13 10 19 False
11 25 18 26 15 8 False
12 10 12 21 11 19 False
13 1 14 17 25 18 False
14 7 21 19 27 12 False
15 23 19 9 4 9 False
16 7 25 7 7 20 True
17 27 29 11 27 19 False
18 18 14 25 27 18 False
19 21 18 26 0 20 True
If you don't want to use the built-in quantile, use sorting to find the threshold:
top_10_pc = int(len(df.index) * 0.1)
min_val = min(df.sort_values(by=['D'], ascending=False)[:top_10_pc]['D'])
df['E'] = df['D'] >= min_val
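For comparison, the quantile-based variant mentioned above could look like the following sketch. The ten sample values are invented for illustration. Because `Series.quantile` interpolates linearly by default and ties can sit on the threshold, the number of flagged rows may not be exactly 10% of the rows.

```python
import pandas as pd

df = pd.DataFrame({"D": [4, 7, 1, 9, 3, 8, 2, 6, 5, 10]})

# Value at the 90th percentile of D (linear interpolation by default).
threshold = df["D"].quantile(0.9)

# True for rows whose D is at or above the threshold; row order is untouched.
df["E"] = df["D"] >= threshold
print(df)
```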

Print list of lists in matrix format

I have a list of lists:
[[15, 16, 18, 19, 12, 11], [13, 19, 23, 21, 16, 12], [12, 15, 17, 19, 20, 10], [10, 14, 16, 13, 9, 6]]
The length of each list in the list is the same.
I want to print out as rows and columns such as:
15 16 18 19 12 11
13 19 23 21 16 12
12 15 17 19 20 10
10 14 16 13  9  6
I know I can do it by using
lst = (' '.join(map(str,lst))),
But I want every integer to indent at the same level like the 9 should be indented below the 0 of 20, and 6 should be under 0 of 10.
Given an input (list of lists) ll:
'\n'.join(' '.join('%2d' % x for x in l) for l in ll)
Result:
15 16 18 19 12 11
13 19 23 21 16 12
12 15 17 19 20 10
10 14 16 13  9  6
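The '%2d' format hard-codes a width of 2. As a hedged variation (not part of the answer above), the width can be computed from the data so the columns still line up when the numbers have more digits:

```python
ll = [[15, 16, 18, 19, 12, 11],
      [13, 19, 23, 21, 16, 12],
      [12, 15, 17, 19, 20, 10],
      [10, 14, 16, 13, 9, 6]]

# Width of the widest number, so columns line up for any input.
width = max(len(str(x)) for row in ll for x in row)

# Right-align every number to that width and join the rows with newlines.
table = "\n".join(" ".join(f"{x:>{width}}" for x in row) for row in ll)
print(table)
```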

Operations in dataframe

I have CSV data. The dataset covers latitudes from 17 to 20, and each location has monthly data (months 1, 2, 3, 4, 5, 6, ...).
I would like to add a new column named N whose value depends on the latitude and the month: for each row, look up the number associated with that latitude and month.
Input data
lan/lon/year/month/prec
-17/18/1990/1/0.4
-17/18/1990/2/0.02
-17/18/1990/3/0.12
-17/18/1990/4/0.06
.
.
.
-17/18/2020/12/0.35
-17/20/1990/1/0.2
-17/20/1990/2/0.2
-17/20/1990/3/0.2
-17/20/1990/4/0.2
.
.
.
-17/20/2020/12/0.08
-18/20/1990/1/0.11
-18/20/1990/2/0.11
-18/20/1990/3/0.11
.
.
.
.
N values depend on lat and month
17 18 19 20 21
1 25 29 13 13 2
2 22 11 1 16 23
3 8 13 10 21 8
4 4 14 16 10 13
5 23 30 8 8 18
6 16 4 7 5 29
7 26 5 10 25 28
8 3 16 2 27 2
9 21 16 23 8 7
10 19 30 10 28 20
11 28 18 12 6 8
12 21 14 26 3 8
EXPECTED OUTPUT
lan/lon/year/month/prec/N
-17/18/1990/1/0.4/25
-17/18/1990/2/0.02/22
-17/18/1990/3/0.12/8
-17/18/1990/4/0.06/4
.
.
.
-17/18/2020/12/0.35/21
-17/20/1990/1/0.2/25
-17/20/1990/2/0.2/22
-17/20/1990/3/0.2/8
-17/20/1990/4/0.2/4
.
.
.
-17/20/2020/12/0.08/21
-18/20/1990/1/0.11/29
-18/20/1990/2/0.11/11
-18/20/1990/3/0.11/13
.
.
.
.
Use DataFrame.unstack to reshape the second DataFrame df2 into a Series with a MultiIndex, rename the index levels to lan1 and month, and then match them against the lan1 and month columns of df with DataFrame.join:
df = pd.read_csv(data, sep="/")
s = df2.rename(columns=int).unstack().rename_axis(['lan1','month'])
df['lan1'] = df['lan'].abs()
df2 = df.join(s.rename('N'), on=['lan1','month']).drop('lan1', axis=1)
print (df2)
lan lon year month prec N
0 -17 18 1990 1 0.40 25
1 -17 18 1990 2 0.02 22
2 -17 18 1990 3 0.12 8
3 -17 18 1990 4 0.06 4
4 -17 18 2020 12 0.35 21
5 -17 20 1990 1 0.20 25
6 -17 20 1990 2 0.20 22
7 -17 20 1990 3 0.20 8
8 -17 20 1990 4 0.20 4
9 -17 20 2020 12 0.08 21
10 -18 20 1990 1 0.11 29
11 -18 20 1990 2 0.11 11
12 -18 20 1990 3 0.11 13
print (df2.to_csv(sep='/', index=False))
lan/lon/year/month/prec/N
-17/18/1990/1/0.4/25
-17/18/1990/2/0.02/22
-17/18/1990/3/0.12/8
-17/18/1990/4/0.06/4
-17/18/2020/12/0.35/21
-17/20/1990/1/0.2/25
-17/20/1990/2/0.2/22
-17/20/1990/3/0.2/8
-17/20/1990/4/0.2/4
-17/20/2020/12/0.08/21
-18/20/1990/1/0.11/29
-18/20/1990/2/0.11/11
-18/20/1990/3/0.11/13
print (s)
lan1 month
17 1 25
2 22
3 8
4 4
5 23
6 16
7 26
8 3
9 21
10 19
11 28
12 21
18 1 29
2 11
3 13
4 14
5 30
6 4
7 5
8 16
9 16
10 30
11 18
12 14
19 1 13
2 1
3 10
4 16
5 8
6 7
7 10
8 2
9 23
10 10
11 12
12 26
20 1 13
2 16
3 21
4 10
5 8
6 5
7 25
8 27
9 8
10 28
11 6
12 3
21 1 2
2 23
3 8
4 13
5 18
6 29
7 28
8 2
9 7
10 20
11 8
12 8
dtype: int64
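To make the reshape-and-join step easier to follow, here is a reduced sketch with three rows and a 2x2 lookup table. The values are copied from the question; the names `lookup` and `out` are chosen here for illustration and do not appear in the answer.

```python
import pandas as pd

# Small stand-ins for the question's data.
df = pd.DataFrame({
    "lan": [-17, -17, -18],
    "lon": [18, 18, 20],
    "year": [1990, 1990, 1990],
    "month": [1, 2, 1],
    "prec": [0.4, 0.02, 0.11],
})
# Lookup table: rows are months, columns are absolute latitudes.
lookup = pd.DataFrame({17: [25, 22], 18: [29, 11]}, index=[1, 2])

# unstack() turns the table into a Series indexed by (column, row),
# which is (latitude, month) here.
s = lookup.unstack().rename_axis(["lan1", "month"]).rename("N")

# Join on the absolute latitude and the month, then drop the helper column.
out = (df.assign(lan1=df["lan"].abs())
         .join(s, on=["lan1", "month"])
         .drop(columns="lan1"))
print(out)
```

Rows whose latitude and month have no entry in the lookup table would get NaN in N, so it may be worth checking for missing values after the join.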

Pandas dataframe sub-selection

I am new to programming and have taken up learning python in an attempt to make some tasks I run in my research more efficient. I am running a PCA in the pandas module (I found a tutorial online) and have the script for this, but need to subselect part of a dataframe prior to the pca.
so far I have (just for example in reality I am reading a .csv file with a larger matrix)
x = np.random.randint(30, size=(8,8))
df = pd.DataFrame(x)
0 1 2 3 4 5 6 7
0 9 0 23 13 2 5 14 6
1 20 17 11 10 25 23 20 23
2 15 14 22 25 11 15 5 15
3 9 27 15 27 7 15 17 23
4 12 6 11 13 27 11 26 20
5 27 13 5 16 5 5 2 18
6 3 18 22 0 7 10 11 11
7 25 18 10 11 29 29 1 25
What I want to do is sub-select columns that satisfy a certain criterion in any of the rows. Specifically, I want every column that has at least one number >=27 (just for example), to produce a new dataframe
0 1 3 4 5
0 9 0 13 2 5
1 20 17 10 25 23
2 15 14 25 11 15
3 9 27 27 7 15
4 12 6 13 27 11
5 27 13 16 5 5
6 3 18 0 7 10
7 25 18 11 29 29
I have looked into the various slicing methods in pandas but none seem to do what I want (.loc and .iloc etc.).
The actual script I am using to read in thus far is
filename = 'Data.csv'
data = pd.read_csv(filename,sep = ',')
x = data.ix[:,1:] # variables - species
y = data.ix[:,0] # cases - age
so a sub-dataframe of x is what I am after (as above).
Any advice is greatly appreciated.
Indexers like loc, iloc, and ix accept boolean arrays (note that ix was removed in pandas 1.0, so use loc or iloc in current versions). For example, if you have three columns, df.loc[:, [True, False, True]] will return all the rows and columns 0 and 2 (the ones whose corresponding value is True). You can check whether any of the elements in a column is greater than or equal to 27 with (df>=27).any(). This returns True for the columns that have at least one value >=27. So you can slice the dataframe with:
df.loc[:, (df>=27).any()]
Out[34]:
0 1 3 4 5 7
0 8 2 28 9 14 21
1 24 26 23 17 0 0
2 3 24 7 15 4 28
3 29 17 12 7 7 6
4 5 3 10 24 29 14
5 23 21 0 16 23 13
6 22 10 27 1 7 24
7 9 27 2 27 17 12
And this is the initial dataframe:
df
Out[35]:
0 1 2 3 4 5 6 7
0 8 2 7 28 9 14 26 21
1 24 26 15 23 17 0 21 0
2 3 24 26 7 15 4 7 28
3 29 17 9 12 7 7 0 6
4 5 3 13 10 24 29 22 14
5 23 21 26 0 16 23 17 13
6 22 10 19 27 1 7 9 24
7 9 27 26 2 27 17 8 12
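Since the frames above are random, here is a small deterministic sketch of the same selection. The 3x4 values are made up for illustration; the names `keep` and `subset` are not from the answer.

```python
import pandas as pd

df = pd.DataFrame({
    0: [9, 20, 27],
    1: [0, 17, 13],
    2: [23, 11, 5],
    3: [13, 27, 16],
})

# One boolean per column: does any row hold a value >= 27?
keep = (df >= 27).any()

# Select all rows and only the columns flagged True.
subset = df.loc[:, keep]
print(subset)
```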

python - replace last n columns with sum of all files

I am novice in python.
I have 8 CSV files with 26 columns and 600 rows each. I want to take the last 4 columns of each file (column 22 to column 25), sum them across the files, and replace those 4 columns in every file with the sum. For example (I am showing some random data here):
new-1.csv:
a b c d e f g h i j k
1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8 8 8 8
9 9 9 9 9 9 9 9 9 9 9
new-2.csv:
a b c d e f g h i j k
11 11 11 11 11 11 11 11 11 11 11
12 12 12 12 12 12 12 12 12 12 12
13 13 13 13 13 13 13 13 13 13 13
14 14 14 14 14 14 14 14 14 14 14
15 15 15 15 15 15 15 15 15 15 15
16 16 16 16 16 16 16 16 16 16 16
17 17 17 17 17 17 17 17 17 17 17
18 18 18 18 18 18 18 18 18 18 18
19 19 19 19 19 19 19 19 19 19 19
Now, I want to sum each element of "h, i, j, k" from these 2 files, then replace the last 4 columns of each file with this new sum.
Modified new-1.csv:
a b c d e f g h i j k
1 1 1 1 1 1 1 12 12 12 12
2 2 2 2 2 2 2 14 14 14 14
3 3 3 3 3 3 3 16 16 16 16
4 4 4 4 4 4 4 18 18 18 18
5 5 5 5 5 5 5 20 20 20 20
6 6 6 6 6 6 6 22 22 22 22
7 7 7 7 7 7 7 24 24 24 24
8 8 8 8 8 8 8 26 26 26 26
9 9 9 9 9 9 9 28 28 28 28
Modified new-2.csv:
a b c d e f g h i j k
11 11 11 11 11 11 11 12 12 12 12
12 12 12 12 12 12 12 14 14 14 14
13 13 13 13 13 13 13 16 16 16 16
14 14 14 14 14 14 14 18 18 18 18
15 15 15 15 15 15 15 20 20 20 20
16 16 16 16 16 16 16 22 22 22 22
17 17 17 17 17 17 17 24 24 24 24
18 18 18 18 18 18 18 26 26 26 26
19 19 19 19 19 19 19 28 28 28 28
I am assuming I should use pandas or NumPy for this, but I am not sure how to do it. Any suggestions/hints would be appreciated.
You can do this by just using numpy.
import numpy as np
# list of all the files
file_list = ['foo.csv','bar.csv','baz.csv'] # all 8 files
col_names = ['a','b','c','d','e','f'] # all the names till z if necessary as the first row, else skip this
# initializing a numpy array, for containing sum from last 4 columns
add_cols = np.zeros((600,4))
# iterating over all .csv files
for file in file_list :
    # skiprows will skip the first row and usecols will get values in last 4 cols
    temp = np.loadtxt(file, skiprows=1, delimiter=',' , usecols = (22,23,24,25) )
    add_cols = np.add(temp,add_cols)
# now again overwriting all the files, substituting the last 4 columns with the sum
for file in file_list :
    # loading the content from file in temp
    temp = np.loadtxt(file, skiprows=1, delimiter=',')
    temp[:,[22,23,24,25]] = add_cols
    # writing the column names first
    with open(file,'w') as p:
        p.write(','.join(col_names)+'\n')
    # now appending final values in temp to the file as csv
    with open(file,'a') as p:
        np.savetxt(p,temp,delimiter=",",fmt="%i")
If your files are space-separated rather than comma-separated, remove the delimiter option from all the functions, since whitespace is the default delimiter. Also join the column names in the first row with a space accordingly.
After loading your csvs using read_csv, you can add the last 4 columns together and then overwrite them:
In [10]:
total = df[df.columns[-4:]].values + df1[df1.columns[-4:]].values
total
Out[10]:
array([[12, 12, 12, 12],
[14, 14, 14, 14],
[16, 16, 16, 16],
[18, 18, 18, 18],
[20, 20, 20, 20],
[22, 22, 22, 22],
[24, 24, 24, 24],
[26, 26, 26, 26],
[28, 28, 28, 28]], dtype=int64)
In [12]:
df[df.columns[-4:]] = total
df1[df1.columns[-4:]] = total
df
Out[12]:
a b c d e f g h i j k
0 1 1 1 1 1 1 1 12 12 12 12
1 2 2 2 2 2 2 2 14 14 14 14
2 3 3 3 3 3 3 3 16 16 16 16
3 4 4 4 4 4 4 4 18 18 18 18
4 5 5 5 5 5 5 5 20 20 20 20
5 6 6 6 6 6 6 6 22 22 22 22
6 7 7 7 7 7 7 7 24 24 24 24
7 8 8 8 8 8 8 8 26 26 26 26
8 9 9 9 9 9 9 9 28 28 28 28
In [13]:
df1
Out[13]:
a b c d e f g h i j k
0 11 11 11 11 11 11 11 12 12 12 12
1 12 12 12 12 12 12 12 14 14 14 14
2 13 13 13 13 13 13 13 16 16 16 16
3 14 14 14 14 14 14 14 18 18 18 18
4 15 15 15 15 15 15 15 20 20 20 20
5 16 16 16 16 16 16 16 22 22 22 22
6 17 17 17 17 17 17 17 24 24 24 24
7 18 18 18 18 18 18 18 26 26 26 26
8 19 19 19 19 19 19 19 28 28 28 28
We need to call the attribute .values here to return a np array because otherwise it will try to align on the index which in this case do not align.
Once you overwrite them call df.to_csv(file_path) and df1.to_csv(file_path)
In the case of your 8 dfs you can loop over them and aggregate whilst looping:
# take a copy of the first df's last 4 columns
total = df_list[0]
# .copy() so the in-place += below cannot modify the first df
total = total[total.columns[-4:]].values.copy()
for df in df_list[1:]:
    total += df[df.columns[-4:]].values
Then just loop over your dfs again to overwrite:
for df in df_list:
    df[df.columns[-4:]] = total
And then write out again using to_csv.
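Putting the read, sum, overwrite and write steps together, a self-contained sketch might look like this. It is an illustration under assumptions: it writes two tiny stand-in files with six columns to a temporary directory instead of using the real 8 files, and it uses iloc with to_numpy() where the answer uses column labels with .values.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Write two tiny stand-in CSV files to a temporary directory.
tmp = Path(tempfile.mkdtemp())
paths = [tmp / "new-1.csv", tmp / "new-2.csv"]
pd.DataFrame({c: [1, 2, 3] for c in "abcdef"}).to_csv(paths[0], index=False)
pd.DataFrame({c: [11, 12, 13] for c in "abcdef"}).to_csv(paths[1], index=False)

df_list = [pd.read_csv(p) for p in paths]

# Sum the last 4 columns across all files as plain arrays (no index alignment).
total = sum(d.iloc[:, -4:].to_numpy() for d in df_list)

# Overwrite the last 4 columns of every frame and write each file back.
for p, d in zip(paths, df_list):
    d.iloc[:, -4:] = total
    d.to_csv(p, index=False)

check = pd.read_csv(paths[0])
print(check)
```

Passing index=False to to_csv keeps the row index out of the file, so the files retain their original column layout.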
