I have a lot of data that I'm trying to do some basic machine learning on, kind of like the Titanic example that predicts whether a passenger survived or died (I learned this in an intro Python class) based on factors like their gender, age, fare class...
What I'm trying to predict is whether a screw fails depending on how it was made (referred to as Lot). The engineers just listed how many times a failure occurred. Here's how it's formatted.
Lot  Failed?
100  3
110  0
120  1
130  4
The values in the cells are the number of occurrences, so for example:
Lot 100 had three screws that failed
Lot 110 had 0 screws that failed
Lot 120 had one screw that failed
Lot 130 had four screws that failed
I plan on doing a logistic regression using scikit-learn, but first I need each row to be listed as a failure or not. What I'd like to see is a row for every observation, listed as either a 0 (did not occur) or 1 (did occur). Here's what it'd look like afterwards:
Lot  Failed?
100  1
100  1
100  1
110  0
120  1
130  1
130  1
130  1
130  1
Here's what I've tried and what I've gotten
import pandas as pd

df = pd.DataFrame({
    'Lot': ['100', '110', '120', '130'],
    'Failed?': [3, 0, 1, 4]
})
df.loc[df.index.repeat(df['Failed?'])].reset_index(drop=True)
When I do this it repeats the rows but keeps the same values in the Failed? column.
Lot  Failed?
100  3
100  3
100  3
110  0
120  1
130  4
130  4
130  4
130  4
Any ideas? Thank you!
You can use pandas.Index.repeat with reindex, but first you need to set aside the rows that have 0 failures, because a row repeated 0 times is dropped:
# "save" the rows with 0 as value; repeat() excludes them since they are repeated 0 times
s = df[df['Failed?'].eq(0)]
# repeat each row as many times as its value in 'Failed?'
df = df.reindex(df.index.repeat(df['Failed?']))
# set all values equal to 1
df['Failed?'] = 1
# bring back the 0 rows saved as 's' and sort by the index to restore the order
df = pd.concat([df, s]).sort_index()
df
# The above code as a one-liner (run it on the original df, not on the result above):
(pd.concat([df.reindex(df.index.repeat(df['Failed?'])).assign(**{'Failed?': 1}),
            df[df['Failed?'].eq(0)]])
   .sort_index())
Out[1]:
Lot Failed?
0 100 1
0 100 1
0 100 1
1 110 0
2 120 1
3 130 1
3 130 1
3 130 1
3 130 1
The line below turns each count into a failed / not failed flag without expanding the rows, but I suppose you are better served by the other answer.
df.loc[df['Failed?']>0,'Failed?'] = 1
Just as a comment: this is a bit of a strange data transformation. You might want to just keep a numerical target variable.
This might be a noob question, as I'm new to coding. I used the following code to categorize my data; it assigns the category only when all of my conditions are met. How can I assign the category when only some of the conditions are met, e.g., at least 4 out of 7? I really appreciate any help you can provide.
c1=df['Stroage Condition'].eq('refrigerate')
c2=df['Profit Per Unit'].between(100,150)
c3=df['Inventory Qty']<20
df['Restock Action']=np.where(c1&c2&c3,'Hold Current stock level','On Sale')
print(df)
Let's say this is your DataFrame:
   Stroage Condition  refrigerate  Profit Per Unit  Inventory Qty
0                  0            1                0             20
1                  1            1              102              1
2                  2            2                5              2
3                  3            0              100              8
and the conditions are the ones you defined:
c1=df['Stroage Condition'].eq(df['refrigerate'])
c2=df['Profit Per Unit'].between(100,150)
c3=df['Inventory Qty']<20
Then you can define a small helper function and pass its result to np.where(). Inside it you decide how many conditions have to be True. In this example I set the threshold to at least two.
def my_select(x, y, z):
    # stack the boolean conditions and count how many are True per row
    return np.array([x, y, z]).sum(axis=0) >= 2
Finally you run one more line:
df['Restock Action']=np.where(my_select(c1,c2,c3), 'Hold Current stock level', 'On Sale')
print(df)
This prints to the console:
   Stroage Condition  refrigerate  Profit Per Unit  Inventory Qty            Restock Action
0                  0            1                0             20                   On Sale
1                  1            1              102              1  Hold Current stock level
2                  2            2                5              2  Hold Current stock level
3                  3            0              100              8  Hold Current stock level
If you have more conditions or rules, you have to extend the function with one parameter per rule.
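A variant that avoids editing the function for every new rule could accept any number of conditions. This is only a sketch: the name at_least and its parameter k are made up for illustration, and the DataFrame is the small example from above.

```python
import numpy as np
import pandas as pd

def at_least(k, *conditions):
    """Hypothetical helper: True where at least k of the conditions hold."""
    # One row per condition, one column per DataFrame row; count True per column.
    stacked = np.array([np.asarray(c, dtype=bool) for c in conditions])
    return stacked.sum(axis=0) >= k

# The example DataFrame from this answer.
df = pd.DataFrame({
    'Stroage Condition': [0, 1, 2, 3],
    'refrigerate': [1, 1, 2, 0],
    'Profit Per Unit': [0, 102, 5, 100],
    'Inventory Qty': [20, 1, 2, 8],
})
c1 = df['Stroage Condition'].eq(df['refrigerate'])
c2 = df['Profit Per Unit'].between(100, 150)
c3 = df['Inventory Qty'] < 20

df['Restock Action'] = np.where(at_least(2, c1, c2, c3),
                                'Hold Current stock level', 'On Sale')
print(df)
```

With seven conditions and a threshold of four, the call would be at_least(4, c1, c2, c3, c4, c5, c6, c7).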
Imagine a small garden, divided into 8 equal parts, each a square foot. The garden is 4 ft x 2 ft, so the "bins" are in two rows. Let's number them as:
0 1 2 3
4 5 6 7
We want to arrange different plants in each one. Each plant has some buddies that they like to be near. For example, basil likes to be near tomatoes. I want to find an arrangement for the garden that maximizes the number of positive relationships.
Using python, it's easy to shove the different crops in a list. It's also easy to make a scoring function to find the total score for a particular arrangement. My problem is reducing the problem size. In this setup, there are 8! (40,320) possible permutations, different arrangements of plants in the garden. In the real one I'm trying to solve, I'm using a 16-bin garden, twice the size. That's 16! possible permutations to go through, over 20 trillion. It's taking too long. (I've described the problem here with 8 bins instead of 16 to simplify.)
I've used itertools.permutations to run through all the possible permutations of 8 items. However, it doesn't know enough to skip arrangements that are essentially duplicates. If I rotate a garden arrangement by 180 degrees, it's really the same solution. If I mirror left-to-right or up-and-down, they're also the same solutions. How can I set this up to reduce the total problem set?
In other problems, I've used lookups to check through a list of solutions already checked. With this large number of solutions, that would consume more time than simply going through all of them. Please help me reduce the problem set!
# maximize the number of good relationships in a garden
import itertools
# each crop has 2 items: the name of the crop and a list of all the good friends
crops = []
crops.append(['basil',['tomato','pepper','lettuce']]) # basil likes to be near tomato, pepper or lettuce
crops.append(['strawberry',['beans','lettuce']])
crops.append(['beans',['beet','marigold','cucumber','potato','strawberry','radish']])
crops.append(['beet',['beans']])
crops.append(['cucumber',['lettuce','radish','tomato','dill','marigold']])
crops.append(['marigold',['tomato','cucumber','potato','beans']])
crops.append(['tomato',['cucumber','chives','marigold','basil','dill']])
crops.append(['bok_choy',['dill']])
# 0 1 2 3 This is what the garden looks like, with 8 bins
# 4 5 6 7
mates = [ [0,1], [1,2], [2,3], [4,5], [5,6], [6,7], [0,4], [1,5], [2,6], [3,7] ] # these are the relationships that directly border one another
def score(c): # A scoring function that returns the number of good relationships
    s = 0
    for pair in mates:
        for j in c[pair[1]][1]:
            if c[pair[0]][0] == j:
                s = s + 1
        for j in c[pair[0]][1]: # and the reverse, 1-0
            if c[pair[1]][0] == j:
                s = s + 1
    return s

scoremax = 0
for x in itertools.permutations(crops,8):
    s = score(x)
    if s >= scoremax: # show the arrangement
        for i in range(0,4):
            print( x[i][0] + ' ' * (12-len(x[i][0])) + x[i+4][0] + ' ' * (12-len(x[i+4][0])) ) # print to screen
        print(s)
        print('')
    if s > scoremax:
        scoremax = s
EDIT: To clarify, these are the symmetry and rotation arrangements I'm trying to skip. For clarity, I'll use numbers instead of the plant name strings.
0 1 2 3   is the same when mirrored   3 2 1 0
4 5 6 7                               7 6 5 4

0 1 2 3   is the same when mirrored   4 5 6 7
4 5 6 7                               0 1 2 3

0 1 2 3   is the same when rotated    7 6 5 4
4 5 6 7                               3 2 1 0
In general, it is often very difficult to break symmetries efficiently for this kind of problem.
In this case, there seem to be just 2 independent symmetries (the 180-degree rotation is the combination of both):
right to left is the same as left to right
up to down is the same as down to up
We can break both of them if we add three conditions:
the crop at plot 0 should be smaller than the crop at plot 3, the one at plot 4 and the one at plot 7
For 'smaller' we can use any measure that gives a strict ordering. In this case we can simply compare the name strings.
The main loop would then look as follows. Only arrangements in canonical form are scored, so once the optimal scoremax is reached, no mirrored or rotated duplicates are printed. Every other solution can be obtained from a printed one by mirroring it horizontally and/or vertically. Note that itertools.permutations still generates every permutation; the check only skips scoring the non-canonical ones.
# maximize the number of good relationships in a garden
import itertools
# each crop has 2 items: the name of the crop and a list of all the good friends
crops = []
crops.append(['basil',['tomato','pepper','lettuce']]) # basil likes to be near tomato, pepper or lettuce
crops.append(['strawberry',['beans','lettuce']])
crops.append(['beans',['beet','marigold','cucumber','potato','strawberry','radish']])
crops.append(['beet',['beans']])
crops.append(['cucumber',['lettuce','radish','tomato','dill','marigold']])
crops.append(['marigold',['tomato','cucumber','potato','beans']])
crops.append(['tomato',['cucumber','chives','marigold','basil','dill']])
crops.append(['bok_choy',['dill']])
# 0 1 2 3 This is what the garden looks like, with 8 bins
# 4 5 6 7
mates = [ [0,1], [1,2], [2,3], [4,5], [5,6], [6,7], [0,4], [1,5], [2,6], [3,7] ] # these are the relationships that directly border one another
def score(c): # A scoring function that returns the number of good relationships
    s = 0
    for pair in mates:
        for j in c[pair[1]][1]:
            if c[pair[0]][0] == j:
                s = s + 1
        for j in c[pair[0]][1]: # and the reverse, 1-0
            if c[pair[1]][0] == j:
                s = s + 1
    return s

scoremax = 0
for x in itertools.permutations(crops,8):
    # symmetry breaking: the crop at plot 0 must be the smallest of the four corners
    if x[0][0] < x[3][0] and x[0][0] < x[4][0] and x[0][0] < x[7][0]:
        s = score(x)
        if s >= scoremax: # show the arrangement
            for i in range(0,4):
                print( x[i][0] + ' ' * (12-len(x[i][0])) + x[i+4][0] + ' ' * (12-len(x[i+4][0])) ) # print to screen
            print(s)
            print('')
        if s > scoremax:
            scoremax = s
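As a quick sanity check of the corner condition, the sketch below counts how many of the 8! arrangements pass it. The helper name is_canonical is made up for illustration, and only the crop names are used since the friend lists do not matter for the count.

```python
import itertools

def is_canonical(names):
    """Hypothetical helper: True when the name at plot 0 sorts before
    the names at the other three corners (plots 3, 4 and 7)."""
    first = names[0]
    return first < names[3] and first < names[4] and first < names[7]

names = ['basil', 'strawberry', 'beans', 'beet',
         'cucumber', 'marigold', 'tomato', 'bok_choy']

total = 0
kept = 0
for arrangement in itertools.permutations(names):
    total += 1
    if is_canonical(arrangement):
        kept += 1

print(total, kept)  # 40320 10080
```

Exactly one of the four mirrored or rotated variants of any arrangement has the smallest corner crop at plot 0, which is why a quarter of the arrangements remain.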
I am looking for the right approach to solve the following task (using Python):
I have a dataset which is a 2D matrix. Let's say:
1 2 3
5 4 7
8 3 9
0 7 2
From each row I need to pick one number which is not 0 (I can also make it NaN if that's easier).
I need to find the combination with the lowest total sum.
So far so easy. I take the lowest value of each row.
The solution would be:
1 x x
x 4 x
x 3 x
x x 2
Sum: 10
But there is a variable minimum and maximum sum allowed for each column, so just choosing the minimum of each row may lead to an invalid combination.
Let's say the minimum column sum is defined as 2 in this example and no maximum is defined. Then the solution would be:
1 x x
5 x x
x 3 x
x x 2
Sum: 11
I need to choose 5 in row two as otherwise column one would be below the minimum (2).
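The worked example can be checked with a small brute-force sketch. It assumes that the minimum applies to the sum of the picked values in every column (so a column with no picks counts as 0), and the variable names col_min and col_max are made up for illustration. It is only meant to pin down the problem statement, not to scale.

```python
import itertools

matrix = [
    [1, 2, 3],
    [5, 4, 7],
    [8, 3, 9],
    [0, 7, 2],
]
col_min = 2      # assumed: minimum allowed sum of picks in each column
col_max = None   # no maximum defined in this example

n_cols = len(matrix[0])
best_total, best_cols = None, None

# Try every way of picking one column index per row.
for cols in itertools.product(range(n_cols), repeat=len(matrix)):
    picks = [row[c] for row, c in zip(matrix, cols)]
    if 0 in picks:
        continue  # zeros may not be picked
    col_sums = [0] * n_cols
    for c, value in zip(cols, picks):
        col_sums[c] += value
    if any(s < col_min for s in col_sums):
        continue
    if col_max is not None and any(s > col_max for s in col_sums):
        continue
    total = sum(picks)
    if best_total is None or total < best_total:
        best_total, best_cols = total, cols

print(best_total, best_cols)  # 11 (0, 0, 1, 2)
```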
I could use brute force and test all possible combinations. But due to the amount of data which needs to be analyzed (amount of data sets, not size of each data set) that's not possible.
Is this a common problem with a known mathematical/statistical or other solution?
Thanks
Robert
I have three arrays as listed below:
users: contains the IDs of 50,000 users (all distinct)
pusers: contains the IDs of users who own some posts (IDs can repeat, since one user can own many posts) [50,000 values]
score: contains the score corresponding to each value in pusers [50,000 values]
Now I want to populate another array PScore based on the following calculation. For each value of users in pusers, I need to fetch the corresponding score and add it to the PScore array in the index corresponding to the user.
Example,
if users[5] = 23224
and pusers[6] = pusers[97] = 23224
then PScore[5] += score[6]+score[97]
Items of note:
score is related to pusers (e.g., pusers[5] has score[5])
PScore is expected to be related to users (e.g., cumulative score of users[5] is Pscore[5])
The ultimate aim is to assign a cumulative score of posts to the user who owns it.
The users who don't own any posts are assigned a score of 0.
Can anyone help me with this? I have tried many things, but whenever I run one of my attempts, the output screen stays blank until I press Ctrl+Z to get out.
I went through all of the following posts but I couldn't use them effectively for my scenario.
Compare values of two arrays in python
how to compare two arrays in python?
Checking if any elements in one list are in another
I am new to this forum and I'm a beginner in Python too. Any help is going to be really useful to me.
Additional Information
I'm working on a small project using StackOverflow data.
I'm using Orange tool and I'm in the process of learning the tool and python.
OK, I understand that something is wrong with my approach. So shouldn't I use lists for this scenario? Can anyone please tell me how I should proceed?
A sample of the data that I have arrived at is shown below.
PUsers Score
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
-1 0
13 0
77 1
77 4
77 3
77 0
77 2
77 2
77 3
102 2
105 0
108 2
108 2
117 2
Users
-1
1
2
3
4
5
8
9
10
11
13
16
17
19
20
22
23
24
25
26
27
29
30
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
48
49
50
All that I want is the total score associated with each user. Once again, the pusers list contains repetition while users list contains unique values. I need the total score associated with each user stored in such a way that, if I say PScore[6], it should refer to the total score associated with User[6].
Hope I answered the queries.
Thanks in advance.
From how you described your arrays and since you're using python, this looks like a perfect candidate for dictionaries.
Instead of having one array for post owner and another array for post score, you should be able to make a dictionary that maps a user id to a score. When you're taking in data, look in the dictionary to see if the user already exists. If so, add the score to the current score. If not, make a new entry. When you've looped through all the data, you should have a dictionary that maps from user id to total score.
http://docs.python.org/2/tutorial/datastructures.html#dictionaries
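A minimal sketch of that idea, using tiny made-up lists in place of the real 50,000-element arrays (the values below are illustrative only; the names users, pusers, score and PScore follow the question):

```python
from collections import defaultdict

# Illustrative data, not the real dataset.
users = [23224, 13, 77, 102]          # distinct user ids
pusers = [77, 23224, 77, 102, 23224]  # owner id of each post
score = [1, 5, 4, 2, 3]               # score of each post

# Single pass over the posts: add each score to its owner's running total.
totals = defaultdict(int)
for owner, s in zip(pusers, score):
    totals[owner] += s

# Line the totals up with users; users without posts get 0.
PScore = [totals.get(u, 0) for u in users]
print(PScore)  # [8, 0, 5, 2]
```

Each dictionary lookup takes roughly constant time, so the whole computation is a single pass over pusers plus a single pass over users.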
I think your algorithm is either wrong or broken.
Try to compute its complexity. If it's N^2 or more, you are likely using an inefficient algorithm. O(N^2) with 50,000 elements should take a few seconds. O(N^3) will probably take minutes.
If you're sure of your approach try running it with some small fake data to figure out if it does the right thing or if you accidentally added some infinite loop.
You can easily get it working in linear time with dictionaries.