Help me change only one of the two 'B' values to 'C'.
df = pd.DataFrame({'Student': ['A','B','B','D','E','F'],
                   'maths': [50,60,75,85,64,24],
                   'sci': [25,34,68,58,75,64],
                   'sco': [36,49,58,63,85,96]})
Student maths sci sco
0 A 50 25 36
1 B 60 34 49
2 B 75 68 58
3 D 85 58 63
4 E 64 75 85
5 F 24 64 96
df.replace('B','C') # it is changing both B values
Using replace, I want to change only the 'B' in row 2 to 'C'.
I would suggest using the df.at accessor. Try this code:
import pandas as pd
df = pd.DataFrame({'Student': ['A','B','B','D','E','F'],
                   'maths': [50,60,75,85,64,24],
                   'sci': [25,34,68,58,75,64],
                   'sco': [36,49,58,63,85,96]})
df.at[2, "Student"] = "C"
print(df)
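An equivalent label-based assignment uses df.loc (a minor variation, assuming the default RangeIndex, so that label 2 is the third row):
df.loc[2, "Student"] = "C"  # same single-cell assignment as df.at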
You can also use iloc to select the row to change:
df = pd.DataFrame({'Student': ['A','B','B','D','E','F'],
                   'maths': [50,60,75,85,64,24],
                   'sci': [25,34,68,58,75,64],
                   'sco': [36,49,58,63,85,96]})
df.iloc[2, :] = df.iloc[2, :].replace('B', 'C')
print(df)
# Student maths sci sco
#0 A 50 25 36
#1 B 60 34 49
#2 C 75 68 58
#3 D 85 58 63
#4 E 64 75 85
#5 F 24 64 96
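If the row position is not known in advance, one hedged sketch (assuming it is the repeated 'B' that should change, as in this example) is to build a boolean mask and target only the duplicate occurrence:
mask = df['Student'].eq('B')
# duplicated() flags repeated values, so mask & duplicated() selects only the second 'B'
df.loc[mask & df['Student'].duplicated(), 'Student'] = 'C'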
I have a dataset containing 3000 columns.
Every column name looks like 'abc_dummy0', 'dfg_dummy0', 'asd_dummy0', and so on for 130 columns before it moves on to 'dfg_dummy1'... all the way up to 'lkj_dummy39'.
I can use
cols = [col for col in df.columns if 'dummy1' in col]
And it lists all 130 columns whose names end with dummy1.
My question is: how can I create smaller chunks of data, each one containing the 'dummy0', 'dummy1', 'dummy2'... columns, and so on all the way to 'dummy39', without doing this 40 times?
I think it has something to do with
dummy{i} for i in range(0,39)
But I am not quite sure how to approach this in a memory-efficient and code-efficient way
(because otherwise I would just write 40 nearly identical lines, one for each 'dummy' group).
Here's what I can do, one at a time:
group_0 = [col for col in df.columns if 'dummy0' in col]
group_0 = df[group_0]
but how do I do this for all the other 39 groups (both the _i suffix in the variable name and the dummy{i} part)?
Try with groupby on axis=1 and create a dict entry for each group:
dfs = {group_name: frame
       for group_name, frame in
       df.groupby(df.columns.str.split('_').str[-1], axis=1)}
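Note that groupby(..., axis=1) is deprecated in newer pandas releases, so if that is a concern, a dictionary comprehension over the suffixes is a rough equivalent (a sketch assuming every column name ends with its dummy suffix):
suffixes = df.columns.str.split('_').str[-1].unique()
dfs = {name: df.loc[:, df.columns.str.endswith(name)] for name in suffixes}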
dfs:
{'dummy0': a_dummy0 b_dummy0 c_dummy0
0 79 62 17
1 28 31 81,
'dummy1': d_dummy1 e_dummy1 f_dummy1
0 74 9 63
1 8 77 16}
Sample Data:
import numpy as np
import pandas as pd
np.random.seed(5)
df = pd.DataFrame(np.random.randint(1, 100, (2, 6)),
                  columns=[f'{chr(3 * i + j + 97)}_dummy{i}'
                           for i in range(2)
                           for j in range(3)])
df:
a_dummy0 b_dummy0 c_dummy0 d_dummy1 e_dummy1 f_dummy1
0 79 62 17 74 9 63
1 28 31 81 8 77 16
groups based on values after the _:
df.columns.str.split('_').str[-1]
Index(['dummy0', 'dummy0', 'dummy0', 'dummy1', 'dummy1', 'dummy1'], dtype='object')
Then each group can be accessed like:
dfs['dummy0']
a_dummy0 b_dummy0 c_dummy0
0 79 62 17
1 28 31 81
Alternatively with a loop + filter:
for i in range(0, 2):
    print(f'dummy{i}')
    curr_df = df.filter(like=f'_dummy{i}')
    # Do something with curr_df
    print(curr_df)
Output:
dummy0
a_dummy0 b_dummy0 c_dummy0
0 79 62 17
1 28 31 81
dummy1
d_dummy1 e_dummy1 f_dummy1
0 74 9 63
1 8 77 16
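For the original 3000-column frame, the same idea scales to all 40 groups; using a regex with an end anchor ($) matters so that, for example, _dummy1 does not also pick up _dummy10 through _dummy19 (a sketch assuming the suffix always ends the column name):
groups = {f'dummy{i}': df.filter(regex=f'_dummy{i}$') for i in range(40)}
# e.g. groups['dummy0'] holds the 130 columns ending in _dummy0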
You can also collect all the matching columns in a single comprehension:
cols = [col for col in df.columns for i in range(40) if col.endswith(f'dummy{i}')]
I want to sample a Pandas dataframe using values in a certain column, but I want to keep all rows with values that are in the sample.
For example, in the dataframe below I want to randomly sample some fraction of the values in b, but keep all corresponding rows in a and c.
d = pd.DataFrame({'a': range(1, 101, 1),
                  'b': list(range(0, 100, 4)) * 4,
                  'c': list(range(0, 100, 2)) * 2})
Desired example output from a 16% sample:
Out[66]:
a b c
0 1 0 0
1 26 0 50
2 51 0 0
3 76 0 50
4 4 12 6
5 29 12 56
6 54 12 6
7 79 12 56
8 18 68 34
9 43 68 84
10 68 68 34
11 93 68 84
12 19 72 36
13 44 72 86
14 69 72 36
15 94 72 86
I've tried sampling the series and merging back to the main data, like this:
In [66]: pd.merge(d, d.b.sample(int(.16 * d.b.nunique())))
This creates the desired output, but it seems inefficient. My real dataset has millions of values in b and hundreds of millions of rows. I know I could also use some version of isin, but that is also slow.
Is there a more efficient way to do this?
I really doubt that isin is slow:
uniques = df.b.unique()
# this may be the bottleneck
samples = np.random.choice(uniques, replace=False, size=int(0.16 * len(uniques)))
# sampling here
df[df.b.isin(samples)]
You can profile the steps above. In case samples=... is slow, you can try:
idx = np.random.rand(len(uniques))
samples = uniques[idx<0.16]
Those took about 100 ms on my system on 10 million rows.
Note: d.b.sample(int(.16 * d.b.nunique())) does not sample 16% of the unique values in b; it samples rows from the column itself, which can contain duplicate values.
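Putting the pieces together on the example frame (a sketch; the generator seed is only for reproducibility):
import numpy as np
import pandas as pd

d = pd.DataFrame({'a': range(1, 101, 1),
                  'b': list(range(0, 100, 4)) * 4,
                  'c': list(range(0, 100, 2)) * 2})

rng = np.random.default_rng(0)
uniques = d['b'].unique()
samples = rng.choice(uniques, size=int(0.16 * len(uniques)), replace=False)
result = d[d['b'].isin(samples)]  # keeps every row whose b value was sampled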
I'm trying to make an ordinary loop under specific conditions.
I want to iterate over rows, checking conditions, and then iterate over columns, counting how many times the condition was met.
This count should generate a new column in my dataframe indicating the total count for each row.
I tried to use apply and applymap with no success.
I successfully generated the following code to reach my goal.
But I bet there are more efficient ways, or even built-in pandas functions, to do it.
Anyone know how?
sample code:
import pandas as pd
df = pd.DataFrame({'1column': [11, 22, 33, 44],
                   '2column': [32, 42, 15, 35],
                   '3column': [33, 77, 26, 64],
                   '4column': [99, 11, 110, 22],
                   '5column': [20, 64, 55, 33],
                   '6column': [10, 77, 77, 10]})
check_columns = ['3column','5column', '6column' ]
df1 = df.copy()
df1['bignum_count'] = 0
for column in check_columns:
    inner_loop_count = []
    bigseries = df[column] >= 50
    for big in bigseries:
        if big:
            inner_loop_count.append(1)
        else:
            inner_loop_count.append(0)
    df1['bignum_count'] += inner_loop_count
# View the dataframe
df1
results:
1column 2column 3column 4column 5column 6column bignum_count
0 11 32 33 99 20 10 0
1 22 42 77 11 64 77 3
2 33 15 26 110 55 77 2
3 44 35 64 22 33 10 1
Index the columns of interest and check which values are greater than or equal to (ge) a threshold:
df['bignum_count'] = df[check_columns].ge(50).sum(1)
print(df)
1column 2column 3column 4column 5column 6column bignum_count
0 11 32 33 99 20 10 0
1 22 42 77 11 64 77 3
2 33 15 26 110 55 77 2
3 44 35 64 22 33 10 1
Use DataFrame.ge for >= and count the True values with sum:
df['bignum_count'] = df[check_columns].ge(50).sum(axis=1)
#alternative
#df['bignum_count'] = (df[check_columns]>=50).sum(axis=1)
print(df)
1column 2column 3column 4column 5column 6column bignum_count
0 11 32 33 99 20 10 0
1 22 42 77 11 64 77 3
2 33 15 26 110 55 77 2
3 44 35 64 22 33 10 1
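Since the question mentions apply, the row-wise equivalent is shown below for comparison; it gives the same counts but is typically much slower than the vectorized ge(...).sum(axis=1) above:
df['bignum_count'] = df[check_columns].apply(lambda row: (row >= 50).sum(), axis=1)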
I need to check df.head() and df.tail() many times.
When using df.head() and df.tail() together, Jupyter Notebook displays the two outputs separately, which looks ugly.
Is there a single-line command to select only the first 5 and last 5 rows?
something like:
df.iloc[:5 | -5:] ?
Test example:
df = pd.DataFrame(np.random.rand(20,2))
df.iloc[:5]
Update
Ugly but working ways:
df.iloc[np.where((df.index < 5) | (df.index >= len(df) - 5))[0]]
or,
df.iloc[np.r_[np.arange(5), np.arange(df.shape[0]-5, df.shape[0])]]
Take a look at numpy.r_:
df.iloc[np.r_[0:5, -5:0]]
Out[358]:
0 1
0 0.899673 0.584707
1 0.443328 0.126370
2 0.203212 0.206542
3 0.562156 0.401226
4 0.085070 0.206960
15 0.082846 0.548997
16 0.435308 0.669673
17 0.426955 0.030303
18 0.327725 0.340572
19 0.250246 0.162993
head + tail is also not a bad solution:
df.head(5).append(df.tail(5))
Out[362]:
0 1
0 0.899673 0.584707
1 0.443328 0.126370
2 0.203212 0.206542
3 0.562156 0.401226
4 0.085070 0.206960
15 0.082846 0.548997
16 0.435308 0.669673
17 0.426955 0.030303
18 0.327725 0.340572
19 0.250246 0.162993
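Note that DataFrame.append was deprecated and then removed in recent pandas (2.0+), so on newer versions the same result comes from pd.concat:
pd.concat([df.head(5), df.tail(5)])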
df.query("index<5 | index>"+str(len(df)-5))
Here's a way to query the index. You can change the values to whatever you want.
Another approach (per this SO post)
uses only Pandas .isin()
Generate some dummy/demo data
df = pd.DataFrame({'a':range(10,100)})
print(df.head())
a
0 10
1 11
2 12
3 13
4 14
print(df.tail())
a
85 95
86 96
87 97
88 98
89 99
print(df.shape)
(90, 1)
Generate list of required indexes
ls = list(range(5)) + list(range(len(df)-5, len(df)))
print(ls)
[0, 1, 2, 3, 4, 85, 86, 87, 88, 89]
Slice DataFrame using list of indexes
df_first_last_5 = df[df.index.isin(ls)]
print(df_first_last_5)
a
0 10
1 11
2 12
3 13
4 14
85 95
86 96
87 97
88 98
89 99
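If you need this repeatedly, a small helper keeps it to one call per DataFrame (a sketch; the name peek is arbitrary):
import pandas as pd

def peek(frame, n=5):
    # return the first and last n rows as a single DataFrame
    return pd.concat([frame.head(n), frame.tail(n)])

peek(df)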
I'm trying to restructure a data-frame in R for k-means. Presently the data is structured like this:
Subject Posture s1 s2 s3....sn
1 45 45 43 42 ...
2 90 35 45 42 ..
3 0 3 56 98
4 45 ....
and so on. I'd like to collapse all the sn variables into a single column and create an additional variable with the s-number:
Subject Posture sn dv
1 45 1 45
2 90 2 35
3 0 3 31
4 45 4 45
Is this possible within R, or am I better off reshaping the CSV directly in Python?
Any help is greatly appreciated.
Here's the standard approach in base R (though using "reshape2" is probably the more common practice).
Assuming we're starting with "mydf", defined as:
mydf <- data.frame(Subject = 1:3, Posture = c(45, 90, 0),
                   s1 = c(45, 35, 3), s2 = c(43, 45, 56), s3 = c(42, 42, 98))
You can reshape with:
reshape(mydf, direction = "long", idvar = c("Subject", "Posture"),
        varying = 3:ncol(mydf), sep = "", timevar = "sn")
# Subject Posture sn s
# 1.45.1 1 45 1 45
# 2.90.1 2 90 1 35
# 3.0.1 3 0 1 3
# 1.45.2 1 45 2 43
# 2.90.2 2 90 2 45
# 3.0.2 3 0 2 56
# 1.45.3 1 45 3 42
# 2.90.3 2 90 3 42
# 3.0.3 3 0 3 98
require(reshape2)
melt(df, id.vars = c("Subject", "Posture"))
Where df is the data.frame you presented. Next time please use dput() to provide actual data.
I think this will work for you.
EDIT:
Make sure to install the reshape2 package first of course.
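Since the question also asks about doing this in Python instead, the pandas equivalent would be melt (a sketch assuming the same columns as mydf above; the column names sn and dv match the desired output):
import pandas as pd

mydf = pd.DataFrame({'Subject': [1, 2, 3], 'Posture': [45, 90, 0],
                     's1': [45, 35, 3], 's2': [43, 45, 56], 's3': [42, 42, 98]})

long_df = mydf.melt(id_vars=['Subject', 'Posture'], var_name='sn', value_name='dv')
long_df['sn'] = long_df['sn'].str.lstrip('s').astype(int)  # keep just the s-number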