Hi, I have the following data frames:
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['T1'] = ['A','B','C','D','E']
df['T2'] = ['G','H','I','J','K']
df['Match'] = df['T1'] +' Vs '+ df['T2']
Nsims = 5
df1 = pd.DataFrame(np.tile(df, (Nsims, 1)))
I created two new columns, T1_point and T2_point, by summing five random numbers. But when I do the following, it gives me the same number for all rows:
Ninit = 5
df1['T1_point'] = np.sum(np.random.uniform(size=Ninit))
df1['T2_point'] = np.sum(np.random.uniform(size=Ninit))
What I want is a different random value in each row. How could I do that?
Thanks,
Zep.
What you are basically asking for is a random number in each row. Just create a list of random numbers and assign it as a new column of your dataframe:
import random
df1['RAND'] = [random.randint(1, 10000000) for k in df1.index]
print(df1)
0 1 RAND
0 A G 6850189
1 B H 3692984
2 C I 8062507
3 D J 6156287
4 E K 7037728
5 A G 7641046
6 B H 1884503
7 C I 7887030
8 D J 4089507
9 E K 4253742
10 A G 8947290
11 B H 8634259
12 C I 7172269
13 D J 4906697
14 E K 7040624
15 A G 4702362
16 B H 5267067
17 C I 3282320
18 D J 6185152
19 E K 9335186
20 A G 3448703
21 B H 6039862
22 C I 9884632
23 D J 4846228
24 E K 5510052
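Since the question sums five uniform draws, another option is to keep that idea and vectorize it: draw a (len(df1), Ninit) matrix in one call and sum along axis 1, which gives an independent sum per row. A minimal sketch using the np, df1 and Ninit already defined above:
# one row of Ninit uniform draws per dataframe row, summed across axis 1
df1['T1_point'] = np.random.uniform(size=(len(df1), Ninit)).sum(axis=1)
df1['T2_point'] = np.random.uniform(size=(len(df1), Ninit)).sum(axis=1)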
I wish to extract a dataframe of (floating-point) numbers based on the first instance of the positions MrkA and Mrk1. I am not interested in the second instance of MrkA, because I already know which columns to extract in the line that builds df1.
Input:
df = pd.DataFrame({'A':['sdfg',23,'MrkA',34,0,56],'B':['jfgh',23,'sdfg','MrkB',0,56], 'C':['cvb',7,'dsfgA','ghks',47,3],'D':['rrb',7,'gfd',3,0,7],'E':['dfg',7,'gfd',5,12,1],'F':['dfg',7,'sdfA',5,0,4],'G':['dfg',7,'sdA',5,8,9],'H':['dfg',7,'gfA',5,0,8],'I':['dfg',7,'sdfA',5,7,23]})
A B C D E F G H I
0 sdfg jfgh cvb rrb dfg dfg dfg dfg dfg
1 23 23 7 7 7 7 7 7 7
2 MrkA sdfg dsfgA MrkA gfd sdfA sdA gfA sdfA
3 34 Mrk1 ghks 3 Mrk2 5 5 5 5
4 0 0 47 0 12 0 8 0 7
5 56 56 3 7 1 4 9 8 23
for i,j in range(df.shape[1]):
    for k,l in range(df.shape[0]):
        if df.iloc[k,i] == 'MrkA' and df.iloc[l,j] == 'Mrk1':
            col = i
            row = k
            df1 = df.iloc[row+2:, [col, col+1, col+2, col+4, col+5, col+7, col+8]]
            break
Output: cannot unpack non-iterable int object
Desired Output:
A B C E F H I
4 0 0 47 12 0 0 7
5 56 56 3 1 4 8 23
How shall I proceed?
Your problem is that df.shape[0] and df.shape[1] are single integers, so each element of range(value) is a single int, and trying to unpack it into two loop variables causes the error.
It should be:
for i in range(df.shape[1]):
    for j in range(df.shape[0]):
Then you can apply the desired logic to extract the rows.
Note that it's unclear why you ignore the second row, which is also all numeric. If that's only a typo, you can try the following to extract all the fully numeric rows and apply some logic there:
df[df.applymap(np.isreal).all(1)]
Edit
Although it is not clear from your specific example what the logic is:
In the example you gave there is no Mrk1, but rather MrkB.
Why does column D disappear?
A hard-coded example that gives the desired output should be something similar to the following:
import pandas as pd
df = pd.DataFrame({'A':['sdfg',23,'MrkA',34,0,56],'B':['jfgh',23,'sdfg','MrkB',0,56], 'C':['cvb',7,'dsfgA','ghks',47,3],'D':['rrb',7,'gfd',3,0,7],'E':['dfg',7,'gfd',5,12,1],'F':['dfg',7,'sdfA',5,0,4],'G':['dfg',7,'sdA',5,8,9],'H':['dfg',7,'gfA',5,0,8],'I':['dfg',7,'sdfA',5,7,23]})
for r in range(0, df.shape[0] - 1):
    for c in range(df.shape[1] - 1):
        if df.iloc[r, c] == "MrkA" and df.iloc[r + 1, c + 1] == "MrkB":
            print(df.iloc[r + 2:, :])
This gives:
A B C D E F G H I
4 0 0 47 0 12 0 8 0 7
5 56 56 3 7 1 4 9 8 23
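To also reproduce the desired output, which drops columns D and G, one sketch combines the fixed loops with the column offsets from the question's own iloc call; I'm assuming those offsets, taken relative to the matched column c, are what define the wanted columns:
# continues from the df defined above; stops at the first match
found = False
for r in range(df.shape[0] - 1):
    for c in range(df.shape[1] - 1):
        if df.iloc[r, c] == "MrkA" and df.iloc[r + 1, c + 1] == "MrkB":
            # offsets from the question's iloc call: skip c+3 and c+6
            cols = [c, c + 1, c + 2, c + 4, c + 5, c + 7, c + 8]
            df1 = df.iloc[r + 2:, cols]
            found = True
            break
    if found:
        break
print(df1)
With the sample frame the match is at r=2, c=0, so this prints the two numeric rows with columns A B C E F H I.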
I have a dataframe of 9,000 columns and 100 rows. I want to insert a column after every 3rd column such that its value is equal to 50 for all rows.
Existing DataFrame
0 1 2 3 4 5 6 7 8 9....9000
0 a b c d e f g h i j ....x
1 k l m n o p q r s t ....x
.
.
100 u v w x y z aa bb cc....x
Desired DataFrame
0 1 2 3 4 5 6 7 8 9....12000
0 a b c 50 d e f 50 g h i j ....x
1 k l m 50 n o p 50 q r s t ....x
.
.
100 u v w 50 x y z 50 aa bb cc....x
Create a new DataFrame by indexing every 3rd column, add .5 to the column labels so they sort into the right places, and join to the original with concat:
df.columns = np.arange(len(df.columns))
df1 = pd.DataFrame(50, index=df.index, columns= df.columns[2::3] + .5)
df2 = pd.concat([df, df1], axis=1).sort_index(axis=1)
df2.columns = np.arange(len(df2.columns))
print (df2)
0 1 2 3 4 5 6 7 8 9 10 11 12
0 a b c 50 d e f 50 g h i 50 j
1 k l m 50 n o p 50 q r s 50 t
Numpy
# How many columns to group
x = 3
# Get the shape of things
a = df.to_numpy()
m, n = a.shape
k = n // x
# Get only a multiple of x columns and reshape
b = a[:, :k * x].reshape(m, k, x)
# Get the other columns missed by b
c = a[:, k * x:]
# array of 50's that we'll append to the last dimension
_50 = np.ones((m, k, 1), np.int64) * 50
# append 50's and reshape back to 2D
d = np.append(b, _50, axis=2).reshape(m, k * (x + 1))
# Create DataFrame while appending the missing bit
pd.DataFrame(np.append(d, c, axis=1))
0 1 2 3 4 5 6 7 8 9 10 11 12
0 a b c 50 d e f 50 g h i 50 j
1 k l m 50 n o p 50 q r s 50 t
Setup
df = pd.DataFrame(np.reshape([*'abcdefghijklmnopqrst'], (2, -1)))
So here is one solution:
s = pd.concat([y.assign(new=50) for x, y in df.groupby(np.arange(df.shape[1]) // 3, axis=1)], axis=1)
s.columns = np.arange(s.shape[1])
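Another option, a straightforward sketch built on DataFrame.insert: walk the insertion positions from right to left so each insert does not shift the positions still to be processed. The new_{pos} labels are placeholders of my own, only there to keep column names unique before renumbering:
out = df.copy()
# a new column goes after every 3rd column, i.e. at positions 3, 6, 9, ...
for pos in range((out.shape[1] // 3) * 3, 0, -3):
    out.insert(pos, 'new_{}'.format(pos), 50)  # placeholder unique label
out.columns = np.arange(out.shape[1])
For the 9,000-column frame the concat-based answers above should be much faster, since every insert copies data.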
I have two functions applied to a dataframe:
res = df.apply(lambda x:pd.Series(list(x)))
res = res.applymap(lambda x: x.strip('"') if isinstance(x, str) else x)
Update: the dataframe has almost 700,000 rows, and this takes a long time to run.
How can I reduce the running time?
Sample data :
A
----------
0 [1,4,3,c]
1 [t,g,h,j]
2 [d,g,e,w]
3 [f,i,j,h]
4 [m,z,s,e]
5 [q,f,d,s]
output:
A B C D E
-------------------------
0 [1,4,3,c] 1 4 3 c
1 [t,g,h,j] t g h j
2 [d,g,e,w] d g e w
3 [f,i,j,h] f i j h
4 [m,z,s,e] m z s e
5 [q,f,d,s] q f d s
The line res = df.apply(lambda x: pd.Series(list(x))) takes the items from each list and fills them one by one into separate columns, as shown above. There will be almost 38 columns.
I think:
res = df.apply(lambda x:pd.Series(list(x)))
should be changed to:
df1 = pd.DataFrame(df['A'].values.tolist())
print (df1)
0 1 2 3
0 1 4 3 c
1 t g h j
2 d g e w
3 f i j h
4 m z s e
5 q f d s
And for the second step, if the columns do not mix numeric and string values:
cols = res.select_dtypes(object).columns
res[cols] = res[cols].apply(lambda x: x.str.strip('"'))
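Putting both steps together, a sketch of the full replacement; it assumes, as in the sample, that the lists live in column A, and the letter names B, C, D, ... for the new columns are only illustrative:
# expand the lists into columns positionally - much faster than apply with pd.Series
res = pd.DataFrame(df['A'].values.tolist(), index=df.index)
# strip the quotes only from string (object) columns
cols = res.select_dtypes(object).columns
res[cols] = res[cols].apply(lambda x: x.str.strip('"'))
# illustrative letter names for the sample's 4 columns, then join back
res.columns = [chr(ord('B') + i) for i in res.columns]
out = df.join(res)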
I am reading a file with pd.read_csv and removing all the values that are -1. Here's the code
import pandas as pd
import numpy as np
columns = ['A', 'B', 'C', 'D']
catalog = pd.read_csv('data.txt', sep=r'\s+', names=columns, skiprows=1)
a = catalog['A']
b = catalog['B']
c = catalog['C']
d = catalog['D']
print(len(b))  # answer is 700
# remove rows that are -1 in column b
idx = np.where(b != -1)[0]
a = a[idx]
b = b[idx]
c = c[idx]
d = d[idx]
print(len(b))  # answer is 612
So I am assuming that I have successfully managed to remove all the rows where the value in column b is -1.
In order to test this, I am doing the following naive way:
for i in range(len(b)):
    print(i, a[i], b[i])
It prints out the values until it reaches a row which was supposedly filtered out. But now it gives a KeyError.
You can filter by boolean indexing:
catalog = catalog[catalog['B'] != -1]
a = catalog['A']
b = catalog['B']
c = catalog['C']
d = catalog['D']
The KeyError is expected: after filtering, the index labels no longer match the positions 0 to len(b)-1.
One possible solution is to convert the Series to lists:
for i in range(len(b)):
    print(i, list(a)[i], list(b)[i])
Sample:
catalog = pd.DataFrame({'A':list('abcdef'),
'B':[-1,5,4,5,-1,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0]})
print (catalog)
A B C D
0 a -1 7 1
1 b 5 8 3
2 c 4 9 5
3 d 5 4 7
4 e -1 2 1
5 f 4 3 0
# the filtered DataFrame has no index 0 or 4
catalog = catalog[catalog['B'] != -1]
print (catalog)
A B C D
1 b 5 8 3
2 c 4 9 5
3 d 5 4 7
5 f 4 3 0
a = catalog['A']
b = catalog['B']
c = catalog['C']
d = catalog['D']
print (b)
1 5
2 4
3 5
5 4
Name: B, dtype: int64
# a[0] in the first loop iteration looks up index label 0, which no longer exists, so KeyError;
# same problem for b[0]
for i in range(len(b)):
    print(i, a[i], b[i])
KeyError: 0
# converting the Series to a list drops the index, so list(a)[0] returns the first value by position
for i in range(len(b)):
    print(i, list(a)[i], list(b)[i])
0 b 5
1 c 4
2 d 5
3 f 4
Another solution is to create a default index 0, 1, ... with reset_index(drop=True):
catalog = catalog[catalog['B'] != -1].reset_index(drop=True)
print (catalog)
A B C D
0 b 5 8 3
1 c 4 9 5
2 d 5 4 7
3 f 4 3 0
a = catalog['A']
b = catalog['B']
c = catalog['C']
d = catalog['D']
# the default index values now match, so a[0] and b[0] exist
for i in range(len(b)):
    print(i, a[i], b[i])
0 b 5
1 c 4
2 d 5
3 f 4
If you filter out indices, then
for i in range(len(b)):
    print(i, a[i], b[i])
will attempt to access erased indices. Instead, you can use the following:
for i, ae, be in zip(a.index, a.values, b.values):
print(i, ae, be)
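If you prefer not to think about index alignment at all, itertuples on the filtered frame is another option; a short sketch using the catalog frame from the answer above:
# each row comes back as a namedtuple; .Index carries the original label
for row in catalog.itertuples():
    print(row.Index, row.A, row.B)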
What's a smart way to convert a string with whitespace into a dataframe (a 'table') with desired dimensions (X columns and Y rows) in Python?
Say my string is string = 'A B C D E F G H I J K L' and I want to convert it into a 3 cols x 4 rows dataframe.
I guess there are useful pandas/numpy tools for that.
Use numpy.reshape():
import numpy as np
import pandas as pd
string = 'A B C D E F G H I J K L'
list1 = [char for char in string.split(' ') if char != '']
df = pd.DataFrame(np.reshape(list1,[3,4]))
Outputs:
0 1 2 3
0 A B C D
1 E F G H
2 I J K L
Whoops... here it is with 3 col x 4 rows:
pd.DataFrame(np.reshape(list1,[4,3]))
0 1 2
0 A B C
1 D E F
2 G H I
3 J K L
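A slightly more general sketch: fix the number of columns and let numpy infer the number of rows with -1. This assumes the token count is an exact multiple of the column count; otherwise reshape raises a ValueError:
tokens = string.split()  # split() with no argument also drops the empty strings
ncols = 3
df = pd.DataFrame(np.reshape(tokens, (-1, ncols)))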