I have a column of 10th-grade marks, but some specific rows are not scaled properly, i.e. they are out of 10 instead of out of 100. I want to create a function that detects which values are <= 10 and then scales them up to be out of 100. I tried creating a function, but it failed.
Following is the Column:
data['10th']
0 0
1 0
2 0
3 10.00
4 0
...
2163 0
2164 0
2165 0
2166 76.50
2167 64.60
Name: 10th, Length: 2168, dtype: object
I am not sure what you mean by "multiply to 100", but you should be able to use apply with a lambda, similar to this:
import pandas as pd

df = pd.DataFrame({"a": [1, 3, 5, 23, 76, 43, 12, 3, 5]})
df['a'] = df['a'].apply(lambda x: x * 100 if x < 10 else x)
print(df)
     a
0  100
1  300
2  500
3   23
4   76
5   43
6   12
7  300
8  500
If I have not understood you correctly, you can change the condition and the action in the lambda to suit your purpose.
Looks like you need to change the data type first: data["10th"] = pd.to_numeric(data["10th"])
I assume you want to multiply by 10, not 100, to scale it in line with the other out-of-100 scores. You can try np.where(data["10th"] < 10, data["10th"] * 10, data["10th"]),
assigning it back to the dataframe with data["10th"] = np.where(data["10th"] < 10, data["10th"] * 10, data["10th"]).
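A minimal sketch combining both steps, assuming the data frame from the question (errors='coerce' is an addition that turns unparseable entries into NaN; the question's condition is <= 10, so use < 10 instead if exact tens are already on the right scale):

import numpy as np
import pandas as pd

# convert the object-typed column to numbers; unparseable entries become NaN
data['10th'] = pd.to_numeric(data['10th'], errors='coerce')
# scale up anything still on the out-of-10 scale
data['10th'] = np.where(data['10th'] <= 10, data['10th'] * 10, data['10th'])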
I have looked everywhere for this answer which must exist. I am trying to find the smallest positive integer per row in a data frame.
Imagine a dataframe:
df = pd.DataFrame({'lat': [-120, -90, -100, -100],
                   'long': [20, 21, 19, 18],
                   'dist1': [2, 6, 8, 1],
                   'dist2': [1, 3, 10, 5]})
The following function gives me the minimum value, but includes negatives. i.e. the df['lat'] column.
df.min(axis = 1)
Obviously, I could drop the lat column, or convert to string or something, but I will need it later. The lat column is the only column with negative values. I am trying to return a new column such as
df['min_dist'] = [1,3,8,1]
I hope this all makes sense. Thanks in advance for any help.
In general you can use DataFrame.where to mask values that are not positive as NaN and exclude them from the min calculation:
df['min_dist'] = df.where(df > 0).min(1)
df
lat long dist1 dist2 min_dist
0 -120 20 2 1 1.0
1 -90 21 6 3 3.0
2 -100 19 8 10 8.0
3 -100 18 1 5 1.0
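DataFrame.where keeps the frame's shape and replaces every entry where the condition is False with NaN; since min skips NaN by default, the negatives never enter the comparison.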
Filter for just the dist columns and apply the minimum function:
df.assign(min_dist = df.iloc[:, -2:].min(1))
Out[205]:
lat long dist1 dist2 min_dist
0 -120 20 2 1 1
1 -90 21 6 3 3
2 -100 19 8 10 8
3 -100 18 1 5 1
Just use:
df['min_dist'] = df[df > 0].min(1)
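Indexing with a boolean frame like df[df > 0] is shorthand for df.where(df > 0), so this is equivalent to the first answer.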
I want to replicate the data from the same dataframe when a certain condition is fulfilled.
Dataframe:
Hour,Wage
1,15
2,17
4,20
10,25
15,26
16,30
17,40
19,15
I want to replicate rows of the dataframe when, going through it in a loop, there is a difference greater than 4 between consecutive row.Hour values.
Expected Output:
Hour,Wage
1,15
2,17
4,20
10,25
15,26
16,30
17,40
19,15
2,17
4,20
I want to replicate the rows where, iterating through all the rows, there is a difference greater than 4 in row.Hour.
row.Hour[0] = 1 and row.Hour[1] = 2: here the difference is 1. But between row.Hour[2] = 4 and row.Hour[3] = 10 the difference is 6, which is greater than 4. I want to replicate the data above the index where this condition (greater than 4) is fulfilled.
I can replicate the data with df = pd.concat([df]*2, ignore_index=False), but it does not replicate when I run it with an if statement.
I tried the code below, but nothing happens:
for i in range(0, len(df) - 1):
    if (df.iloc[i, 0] - df.iloc[i + 1, 0]) > 4:
        df = pd.concat([df] * 2, ignore_index=False)
My understanding is: you want to compare the 'Hour' values of two successive rows.
If the difference is > 4, you want to add the previous row to the DF.
If that is what you want, try this:
Create a DF:
j = pd.DataFrame({'Hour': [1, 2, 4, 10, 15, 16, 17, 19],
                  'Wage': [15, 17, 20, 25, 26, 30, 40, 15]})
Define a function:
def f1(d):
    dn = d.copy()
    for x in range(len(d) - 2):
        if abs(d.iloc[x + 1].Hour - d.iloc[x + 2].Hour) > 4:
            idx = x + 0.5  # fractional label so sort_index slots the copy right after row x
            dn.loc[idx] = d.iloc[x]['Hour'], d.iloc[x]['Wage']
    dn = dn.sort_index().reset_index(drop=True)
    return dn
Call the function passing your DF:
nd = f1(j)
Hour Wage
0 1 15
1 2 17
2 2 17
3 4 20
4 4 20
5 10 25
6 15 26
7 16 30
8 17 40
9 19 15
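For larger frames, here is a vectorized sketch of the same idea, assuming the DF j above and that Hour increases as in the example (jump and dup_idx are hypothetical helper names):

import pandas as pd

jump = j['Hour'].diff() > 4     # True at the row where a jump > 4 lands
dup_idx = jump[jump].index - 2  # the rows f1 duplicates: two before each jump's end
nd = pd.concat([j, j.loc[dup_idx]]).sort_index().reset_index(drop=True)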
In the line
if df.iloc[i,0] - df.iloc[i+1,0] > 4
you calculate 4 - 10 instead of 10 - 4, so you check -6 > 4 instead of 6 > 4.
You have to swap the operands:
if df.iloc[i+1,0] - df.iloc[i,0] > 4
or use abs() if you want to replicate in both situations, > 4 and < -4:
if abs(df.iloc[i+1,0] - df.iloc[i,0]) > 4
If you had used print(df.iloc[i,0] - df.iloc[i+1,0]) (or a debugger), you would have seen it.
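For completeness, the question's loop with the fix applied (a sketch; note it still duplicates the entire frame every time the condition fires, which is probably not what was intended):

for i in range(0, len(df) - 1):
    if abs(df.iloc[i + 1, 0] - df.iloc[i, 0]) > 4:
        df = pd.concat([df] * 2, ignore_index=False)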
df = df.groupby(['X', 'Y'])['STATUS'].sum()
The output:
X Y
1 41 0
42 0
43 0
44 0
45 0
Name: STATUS, dtype: int64
The next step is to look at these sums per X and Y group. Some X and Y groups will have a STATUS sum of 2 or more, which is correct so far, e.g.
X Y
2 41 0
42 1
43 2
44 0
45 1
See that X,Y = 2,43 has a sum of 2. I want the output to contain duplicate copies based on the sum (2 or greater), and to get rid of any group whose sum is zero. This is for mapping in the software I am using at the moment.
X Y
2 42 1
43 1
43 1
45 1
You can see 2,43 comes up twice, but each copy is counted as 1 individually, and there are no zeros since we removed them. Can you please advise me how to do this?
Let me know if you need me to elaborate further; I appreciate any help.
# sum STATUS per (X, Y) group
sum_vals = df.groupby(['X', 'Y'])['STATUS'].sum().reset_index()
# drop groups whose sum is zero
sum_vals = sum_vals[sum_vals['STATUS'] > 0]
# repeat each remaining row as many times as its sum
sum_vals = sum_vals.loc[sum_vals.index.repeat(sum_vals.STATUS)]
# each duplicated row now counts as 1
sum_vals['STATUS'] = 1
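Applied to the example above, this reproduces the desired output: rows 42, 43, 43, 45 for X = 2, each with STATUS 1.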
I have a dataframe that looks like this
initial year0 year1
0 0 12
1 1 13
2 2 14
3 3 15
Note that the number of year columns year0, year1, ... (year_count) varies between datasets but is constant throughout this code.
I first wanted to apply a function to each of the 'year' columns to generate 'mod' columns like so
def mod(year, scalar):
    return year * scalar
s = 5
year_count = 2
# Generate new columns
df[[f"mod{y}" for y in range (year_count)]] = df[[f"year{y}" for y in range(year_count)]].apply(mod, scalar=s)
initial year0 year1 mod0 mod1
0 0 12 0 60
1 1 13 5 65
2 2 14 10 70
3 3 15 15 75
All good so far. The problem is that I now want to apply another function to both the year column and its corresponding mod column to generate another set of val columns, so something like
def sum_and_scale(year_col, mod_col, scale):
    return (year_col + mod_col) * scale
Then I apply this to each of the columns (year0, mod0), (year1, mod1) etc to generate the next tranche of columns.
With scale = 10 I should end up with
initial year0 year1 mod0 mod1 val0 val1
0 0 12 0 60 0 720
1 1 13 5 65 60 780
2 2 14 10 70 120 840
3 3 15 15 75 180 900
This is where I'm stuck: I don't know how to feed two existing df columns into a function with the same structure as in the first example. If I do something like
df[['val0', 'val1']] = df[['col1', 'col2']].apply(lambda x: sum_and_scale('mod0', 'mod1', scale=10))
I don't know how to generalise it to arbitrary inputs and outputs, or how to apply the constant scale parameter. (I know the last piece of code won't work, but it's the other avenue to a solution I've seen.)
The reason I'm asking is because I believe the loop that I currently have working is creating performance issues with the number of columns and the length of each column.
Thanks
IMHO, it's better with a simple for loop:
for i in range(2):
    df[f'val{i}'] = sum_and_scale(df[f'year{i}'], df[f'mod{i}'], scale=10)
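The loop only iterates over the columns; each assignment is already vectorized down the rows. If it ever becomes a bottleneck, here is a block-vectorized sketch under the same year{i}/mod{i} naming assumption (year_cols, mod_cols and val_cols are hypothetical helper names):

year_count = 2
scale = 10
year_cols = [f"year{y}" for y in range(year_count)]
mod_cols = [f"mod{y}" for y in range(year_count)]
val_cols = [f"val{y}" for y in range(year_count)]

# .to_numpy() drops the column labels so the two blocks add element-wise
df[val_cols] = (df[year_cols].to_numpy() + df[mod_cols].to_numpy()) * scale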
I have the following single-column pandas DataFrame called y. The column is called 0 (zero).
y =
1
0
0
1
0
1
1
2
0
1
1
2
2
2
2
1
0
0
I want to select the row indices of N records per y value. In the above example, there are 6 records of 0, 7 records of 1 and 5 records of 2.
I need to select 4 records from each of these 3 groups.
Below I provide my code. However, this code always selects the first N (e.g. 4) records per class. I need the selection to be made randomly over the whole dataset.
How can I do it?
idx0 = []
idx1 = []
idx2 = []
for i in range(0, len(y[0])):
    if y[0].iloc[i] == 0 and len(idx0) < 4:
        idx0.append(i)
    if y[0].iloc[i] == 1 and len(idx1) < 4:
        idx1.append(i)
    if y[0].iloc[i] == 2 and len(idx2) < 4:
        idx2.append(i)
Update:
The expected outcome is a list of indices, not the filtered DataFrame y.
n = 4
a = y.groupby(0).apply(lambda x: x.sample(n)).reset_index(1).\
rename(columns={'level_1':'indices'}).reset_index(drop=True).groupby(0)['indices'].\
apply(list).reset_index()
cls = 0  # `class` is a reserved word in Python, so it cannot be a variable name
idx = a['indices'][cls]
y.values[idx]  # THIS RETURNS WRONG CLASSES IN SOME CASES
0
1   # <- WRONG
0
0
Use groupby() with df.sample():
n=4
df.groupby('Y').apply(lambda x: x.sample(n)).reset_index(drop=True)
Y
0 0
1 0
2 0
3 0
4 1
5 1
6 1
7 1
8 2
9 2
10 2
11 2
EDIT, for index:
df.groupby('Y').apply(lambda x: x.sample(n)).reset_index(1).\
rename(columns={'level_1':'indices'}).reset_index(drop=True).groupby('Y')['indices'].\
apply(list).reset_index()
Y indices
0 0 [4, 1, 17, 16]
1 1 [0, 6, 10, 5]
2 2 [13, 14, 7, 11]
Using:
idx0, idx1, idx2 = [np.random.choice(y.index.values, 4, replace=False).tolist() for _, y in df.groupby('0')]
idx0
Out[48]: [1, 2, 16, 8]
In more detail:
s = pd.Series([1, 0, 1, 0, 2], index=[1, 3, 4, 5, 9])
idx = [1, 4]  # both anky's answer and mine return index labels
s.loc[idx]    # using .loc with index labels is correct
Out[59]:
1 1
4 1
dtype: int64
s.values[idx]  # using .values slices by position, not by index label, which is wrong here
Out[60]: array([0, 2], dtype=int64)
Supposing column "y" belongs to a dataframe "df" and you want to select N = 4 random rows:
for i in np.unique(df.y).astype(int):
    print(df.y[np.random.choice(np.where(df.y == np.unique(df.y)[i])[0], 4)])
You will get:
10116 0
329 0
4709 0
5630 0
Name: y, dtype: int32
382 1
392 1
9124 1
383 1
Name: y, dtype: int32
221 2
443 2
4235 2
5322 2
Name: y, dtype: int32
Edited, to get index:
pd.concat([df.y[np.random.choice(np.where(df.y==np.unique(df.y)[i])[0],4)] for i in np.unique(df.y).astype(int)],axis=0)
You will get:
10116 0
329 0
4709 0
5630 0
382 1
392 1
9124 1
383 1
221 2
443 2
4235 2
5322 2
Name: y, dtype: int32
To get a nested list of indices:
[df.y[np.random.choice(np.where(df.y == np.unique(df.y)[i])[0], 4)].index.tolist() for i in np.unique(df.y).astype(int)]
You will get:
[[10116,329,4709,5630],[382,392,9124,383],[221,443,4235,5322]]
N = 4
y.loc[y[0]==0].sample(N)
y.loc[y[0]==1].sample(N)
y.loc[y[0]==2].sample(N)
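If only the row indices are needed, as the update asks, take .index of each sample, e.g. (a sketch):

idx0 = y.loc[y[0] == 0].sample(N).index.tolist()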