I am looking for the right approach to solve the following task (using Python):
I have a dataset which is a 2D matrix. Let's say:
1 2 3
5 4 7
8 3 9
0 7 2
From each row I need to pick one number which is not 0 (I can also make it NaN if that's easier).
I need to find the combination with the lowest total sum.
So far so easy. I take the lowest value of each row.
The solution would be:
1 x x
x 4 x
x 3 x
x x 2
Sum: 10
But: there is a variable minimum and maximum sum allowed for each column. So just choosing the minimum of each row may lead to an invalid combination.
Let's say the minimum is defined as 2 in this example, and no maximum is defined. Then the solution would be:
1 x x
5 x x
x 3 x
x x 2
Sum: 11
I need to choose 5 in row two, as otherwise column one would fall below the minimum (2).
I could use brute force and test all possible combinations, but due to the amount of data that needs to be analyzed (the number of datasets, not the size of each dataset), that's not feasible.
Is this a common problem with a known mathematical/statistical or other solution?
Thanks
Robert
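One way to frame this: it is an assignment-style problem with side constraints, which can be posed as a small integer linear program instead of brute force. Below is a minimal sketch using scipy.optimize.milp (this assumes SciPy >= 1.9; col_min = 2 comes from the example above and is applied as a hard lower bound on every column's picked sum):
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

data = np.array([[1, 2, 3],
                 [5, 4, 7],
                 [8, 3, 9],
                 [0, 7, 2]], dtype=float)
n_rows, n_cols = data.shape
col_min = 2  # per-column minimum sum, taken from the example

# One binary variable per cell: x[r, c] = 1 means "pick data[r, c]".
c = data.ravel()                        # objective: minimize total of picked values
ub = (data != 0).astype(float).ravel()  # forbid picking cells that are 0

# Each row must pick exactly one cell.
row_mat = np.zeros((n_rows, n_rows * n_cols))
for r in range(n_rows):
    row_mat[r, r * n_cols:(r + 1) * n_cols] = 1

# Each column's picked values must sum to at least col_min.
col_mat = np.zeros((n_cols, n_rows * n_cols))
for j in range(n_cols):
    col_mat[j, j::n_cols] = data[:, j]

res = milp(c,
           integrality=np.ones_like(c),
           bounds=Bounds(0, ub),
           constraints=[LinearConstraint(row_mat, 1, 1),
                        LinearConstraint(col_mat, col_min, np.inf)])
picked = res.x.reshape(n_rows, n_cols).round().astype(bool)
print(data[picked], data[picked].sum())  # picks 1, 5, 3, 2 -> sum 11
Solvers handle the combinatorial search internally, so this scales far better than enumerating all combinations.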
I need to calculate the percentile using a specific algorithm that is not available using either pandas.rank() or numpy.rank().
The ranking algorithm is calculated as follows for a series:
rank[i] = (# of values less than s[i] + 0.5 × # of other values equal to s[i]) / total # of values
so if I had the following series
s=pd.Series(data=[5,3,8,1,9,4,14,12,6,1,1,4,15])
For the first element, 5, there are 6 values less than 5 and no other values equal to 5. The rank would be (6 + 0×0.5)/13, or 6/13.
For the fourth element, 1, it would be (0 + 2×0.5)/13, or 1/13.
How could I calculate this without using a loop? I assume a combination of s.apply and/or s.where() but can't figure it out and have tried searching. I am looking to apply to the entire series at once, with the result being a series with the percentile ranks.
You could use numpy broadcasting. First convert s to a numpy column array. Then use broadcasting to count, for each value i, the number of items less than i, and the number of items equal to i (note that we need to subtract 1, since i is equal to itself). Finally add them, divide, and build a Series:
import numpy as np
import pandas as pd

tmp = s.to_numpy()
s_col = tmp[:, None]                                     # shape (n, 1) for broadcasting
less_than_i_count = (s_col > tmp).sum(axis=1)            # values strictly less than each element
eq_to_i_count = ((s_col == tmp).sum(axis=1) - 1) * 0.5   # equal values, excluding the element itself
ranks = pd.Series((less_than_i_count + eq_to_i_count) / len(s), index=s.index)
Output:
0 0.461538
1 0.230769
2 0.615385
3 0.076923
4 0.692308
5 0.346154
6 0.846154
7 0.769231
8 0.538462
9 0.076923
10 0.076923
11 0.346154
12 0.923077
dtype: float64
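As a side note, the same quantity can also be obtained from pandas' built-in average ranking, which avoids the O(n²) broadcast for long series. The 1-based average rank of a value equals (# less) + 0.5 × (# equal, excluding itself) + 1, so subtracting 1 and dividing by the length gives exactly the formula above:
ranks_alt = (s.rank(method="average") - 1) / len(s)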
I have a series of numbers of two different magnitudes in a dataframe column. They are
0 154480.429000
1 154.480844
2 154480.433000
3 154.480844
4 154480.433000
......
As seen above, I am not sure how to set up a condition to scale the small numbers (154.480844) to the same order of magnitude as the large ones (154480.433000) in the dataframe.
How can this be done efficiently with pandas?
Use np.log10 to determine the scaling factor required. Something like this:
import numpy as np

v = np.log10(ser).astype(int)   # integer order of magnitude of each value
ser * 10 ** (v.max() - v)       # scale everything up to the largest magnitude
0 154480.429
1 154480.844
2 154480.433
3 154480.844
4 154480.433
Name: 1, dtype: float64
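One caveat: np.log10(...).astype(int) truncates toward zero, so the snippet assumes all values are positive and at least 1. If values below 1 can occur, flooring is the safer magnitude estimate:
v = np.floor(np.log10(ser)).astype(int)   # handles values below 1 as well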
I have a source dataframe that needs to be looped through for all the values of Comments, grouped by the values in the corresponding Name field, and the result needs to be appended as a new column in the DF (it can be a new DataFrame as well).
Input Data :
Name Comments
0 N-1 Good
1 N-2 bad
2 N-3 ugly
3 N-1 very very good
4 N-3 what is this
5 N-4 pathetic
6 N-1 needs improvement
7 N-2 this is not right
8 Ano-5 It is average
[8 rows x 2 columns]
For example: for all values of Comments with Name N-1, run a loop and add the output as a new column alongside the two existing values (Name, Comments).
I tried the following and was able to group by Name, but I am unable to run through all the values of Comments to append the output:
gp = CommentsData.groupby(['Name'])
for g in gp.groups.items():
    Data1 = CommentsData.loc[g[1]]
    # print(Data1)
The data inside the group-by loop looks like:
Name Comments
0 N-1 good
3 N-1 very very good
6 N-1 needs improvement
1 N-2 bad
7 N-2 this is not right
I am unable to access the values in the 2nd column.
Using df.iloc[i] I am only able to access the first element, not all of them (the number of elements varies between Names).
Now I want to use the values in Comments and add the output as an additional column in the dataframe (it can be a new DF).
Expected Output :
Name Comments Result
0 N-1 Good A
1 N-2 bad B
2 N-3 ugly C
3 N-1 very very good A
4 N-3 what is this B
5 N-4 pathetic C
6 N-1 needs improvement C
7 N-2 this is not right B
8 Ano-5 It is average B
[8 rows x 3 columns]
You can use apply and reset_index:
df.groupby('Name').Comments.apply(lambda s: s.reset_index(drop=True)).unstack()
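If the eventual goal is to run per-group logic over the comments and attach the output as a Result column, a minimal sketch using groupby(...).transform may be closer; score_comments here is a hypothetical placeholder for whatever the real analysis is:
import pandas as pd

df = pd.DataFrame({
    "Name": ["N-1", "N-2", "N-3", "N-1", "N-3", "N-4", "N-1", "N-2", "Ano-5"],
    "Comments": ["Good", "bad", "ugly", "very very good", "what is this",
                 "pathetic", "needs improvement", "this is not right", "It is average"],
})

def score_comments(comments: pd.Series) -> pd.Series:
    # Hypothetical placeholder: replace with the real per-comment analysis.
    # It must return one value per input comment.
    return comments.str.len().astype(str)

df["Result"] = df.groupby("Name")["Comments"].transform(score_comments)
transform keeps the original row order and index, so the per-group output lines up with the source rows without any manual appending.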
I've been trying and failing to use iterrows with if/else statements to return calculated values from DataFrame columns. I'm starting to think it's the wrong method.
In this example I have two variables x and y, and a DataFrame:
category number
0 one 13
1 two 14
2 one 7
3 three 8
4 one 3
5 two 8
6 four 9
If the category is one or two, divide the corresponding number by 2 and assign half the value to variable x and half to variable y. But if the category is three or four, assign the whole corresponding number to just variable y. x and y would then be the summed result, as in:
x = 22.5
(Because: 13/2+14/2+7/2+3/2+8/2 = 22.5)
y = 39.5
(Because: 13/2+14/2+7/2+8+3/2+8/2+9 = 39.5)
I haven't found any example of iterrows being used like this. Are these types of calculations even possible using iterrows, or is there a better way?
You can use .loc to slice by each case you're looking at, and then aggregate as appropriate.
case1 = ['one', 'two']
case2 = ['three', 'four']
# half of the one/two total goes to each of x and y
x = df.loc[df.category.isin(case1), 'number'].sum() / 2
# y additionally gets the whole three/four total
y = x + df.loc[df.category.isin(case2), 'number'].sum()
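With the sample frame above this gives x = 22.5 and y = 39.5, matching the expected results, and it avoids iterrows entirely since the two boolean masks do the row selection in a vectorized way.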
I want to implement a calculation for a simple scenario:
the value is computed as the sum of the daily data over the previous N days (set N = 3 in the following example)
Dataframe df: (df.index is 'date')
date value
20140718 1
20140721 2
20140722 3
20140723 4
20140724 5
20140725 6
20140728 7
......
to compute something like:
date value new
20140718 1 0
20140721 2 0
20140722 3 0
20140723 4 6 (3+2+1)
20140724 5 9 (4+3+2)
20140725 6 12 (5+4+3)
20140728 7 15 (6+5+4)
......
Now I do this with a for loop:
N = 3
df['new'] = 0
for idx in df.index:
    loc = df.index.get_loc(idx)
    if loc - N >= 0:
        # sum the previous N values (rows loc-N .. loc-1)
        total = df['value'].iloc[loc - N:loc].sum()
    else:
        total = 0
    df.loc[idx, 'new'] = total
But when the dataframe is long or N is big, this calculation becomes very slow. How can I implement it faster, with a built-in function or some other way?
Also, what if the scenario is more complex? Thanks.
Since you want the sum of the previous three values excluding the current one, you can apply a rolling window of four and sum all but the last value (the old standalone rolling_apply has since become the .rolling(...).apply() method):
new = df['value'].rolling(4, min_periods=4).apply(lambda x: x[:-1].sum(), raw=True)
This is the same as using a window of three and shifting afterwards:
new = df['value'].rolling(3, min_periods=3).sum().shift()
Then
df['new'] = new.fillna(0)