I have two dataframes, let's call them df A and df B
A =
0 1
123 798
456 845
789 932
B =
0 1
321 593
546 603
937 205
Now I would like to combine them element-wise with an expression, as in A - 1/B^2 for each pair of values:
AB =
0 1
123-1/(321^2) 798-1/(593^2)
456-1/(546^2) 845-1/(603^2)
789-1/(937^2) 932-1/(205^2)
Now, I have figured I could loop through each row and each column and try some sort of
A[i][j]-1/(B[i][j]^2)
But when it grows to a 1000x1000 matrix, that would take quite some time.
Is there any operation in pandas or numpy that allows this sort of cross-matrix operation? Not just multiplying one matrix by the other, but doing an arbitrary math operation between them.
Maybe calculate the divisor first into a new df B?
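For reference, pandas applies arithmetic operators element-wise across same-shaped frames (aligning on index and columns), so the whole expression can be written directly with no Python-level loop. A sketch using the sample values above:

```python
import pandas as pd

A = pd.DataFrame({0: [123, 456, 789], 1: [798, 845, 932]})
B = pd.DataFrame({0: [321, 546, 937], 1: [593, 603, 205]})

# Operators on DataFrames are vectorized and applied element-wise,
# so this computes A[i][j] - 1/(B[i][j]^2) for every cell at once
AB = A - 1 / B**2
```

This scales to a 1000x1000 frame without any explicit iteration, since the work happens in NumPy's compiled loops.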
Having a DataFrame like so:
# comments are the equations that have to be done to calculate the given column
df = pd.DataFrame({
    'item_tolerance': [230, 115, 155],
    'item_intake': [250, 100, 100],
    'open_items_previous_day': 0,  # df.item_intake.shift() + df.open_items_previous_day.shift() - df.items_shipped.shift() + df.items_over_under_sla.shift()
    'total_items_to_process': 0,   # df.item_intake + df.open_items_previous_day
    'sla_relevant': 0,             # df.item_tolerance if df.open_items_previous_day + df.item_intake > df.item_tolerance else df.open_items_previous_day + df.item_intake
    'items_shipped': [230, 115, 50],
    'items_over_under_sla': 0      # df.items_shipped - df.sla_relevant
})
   item_tolerance  item_intake  open_items_previous_day  total_items_to_process  sla_relevant  items_shipped  items_over_under_sla
0             230          250                        0                       0             0            230                     0
1             115          100                        0                       0             0            115                     0
2             155          100                        0                       0             0             50                     0
I'd like to calculate all the columns that have comments next to them. I've tried using df.apply(some_method, axis=1) to perform row-wise calculations, but the problem is that I don't have access to the previous row inside some_method(row).
To give a little more explanation, what I'm trying to achieve is, for example: df.items_over_under_sla = df.items_shipped - df.sla_relevant, but df.sla_relevant is based on an equation which needs df.open_items_previous_day, which in turn needs the previous row to have been calculated. This is the problem: I need to calculate each row based on values from that row and the previous one.
What is the correct approach to such a problem?
If you are calculating each column with a different operation I suggest obtaining them individually:
import numpy as np

df['open_items_previous_day'] = (df['item_intake'].shift(fill_value=0)
                                 + df['open_items_previous_day'].shift(fill_value=0)
                                 - df['items_shipped'].shift(fill_value=0)
                                 + df['items_over_under_sla'].shift(fill_value=0))
df['total_items_to_process'] = df['item_intake'] + df['open_items_previous_day']
df = df.assign(sla_relevant=np.where(df['open_items_previous_day'] + df['item_intake'] > df['item_tolerance'], df['item_tolerance'], df['open_items_previous_day'] + df['item_intake']))
df['items_over_under_sla'] = df['items_shipped'] - df['sla_relevant']
df
Out[1]:
item_tolerance item_intake open_items_previous_day total_items_to_process sla_relevant items_shipped items_over_under_sla
0 230 250 0 250 230 230 0
1 115 100 20 120 115 115 0
2 155 100 -15 85 85 50 -35
The problem that you are facing is not about having to use the previous row (you are working around that just fine using the shift function). The real problem here is that all columns that you are trying to get (except for total_items_to_process) depend on each other, therefore you can't get the rest of the columns without having one of them first (or assuming it is zero initially).
That's why you are going to get different results depending on which column you've calculated first.
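If the dependency chain genuinely must be resolved row by row, an explicit sequential loop makes the evaluation order unambiguous. A sketch (not part of the answer above; because of the ordering issue just described, it produces yet another set of numbers than the shift-based version):

```python
import pandas as pd

df = pd.DataFrame({
    'item_tolerance': [230, 115, 155],
    'item_intake': [250, 100, 100],
    'items_shipped': [230, 115, 50],
})

open_prev, total, sla, over_under = [], [], [], []
prev = {'intake': 0, 'open': 0, 'shipped': 0, 'over': 0}
for row in df.itertuples(index=False):
    # each value uses the fully computed previous row
    o = prev['intake'] + prev['open'] - prev['shipped'] + prev['over']
    t = row.item_intake + o
    s = row.item_tolerance if t > row.item_tolerance else t
    u = row.items_shipped - s
    open_prev.append(o)
    total.append(t)
    sla.append(s)
    over_under.append(u)
    prev = {'intake': row.item_intake, 'open': o,
            'shipped': row.items_shipped, 'over': u}

df = df.assign(open_items_previous_day=open_prev,
               total_items_to_process=total,
               sla_relevant=sla,
               items_over_under_sla=over_under)
```

A plain loop is slower than vectorized code, but for recurrences where each row depends on fully computed prior rows it is often the clearest correct option.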
I have a lot of data that I'm trying to do some basic machine learning on, kind of like the Titanic example that predicts whether a passenger survived or died (I learned this in an intro Python class) based on factors like their gender, age, fare class...
What I'm trying to predict is whether a screw fails depending on how it was made (referred to as Lot). The engineers just listed how many times a failure occurred. Here's how it's formatted.
  Lot  Failed?
  100        3
  110        0
  120        1
  130        4
The values in the cells are the number of occurrences, so for example:
Lot 100 had three screws that failed
Lot 110 had 0 screws that failed
Lot 120 had one screw that failed
Lot 130 had four screws that failed
I plan on doing a logistic regression using scikit-learn, but first I need each row to be listed as a failure or not. What I'd like to see is a row for every observation, marked either 0 (did not occur) or 1 (did occur). Here's what it'd look like after:
  Lot  Failed?
  100        1
  100        1
  100        1
  110        0
  120        1
  130        1
  130        1
  130        1
  130        1
Here's what I've tried and what I've gotten
df = pd.DataFrame({
    'Lot': ['100', '110', '120', '130'],
    'Failed?': [3, 0, 1, 4]
})

df.loc[df.index.repeat(df['Failed?'])].reset_index(drop=True)
When I do this it repeats the rows but keeps the same values in the Failed? column.
  Lot  Failed?
  100        3
  100        3
  100        3
  110        0
  120        1
  130        4
  130        4
  130        4
  130        4
Any ideas? Thank you!
You can use pandas.Series.repeat with reindex, but first you need to differentiate between rows that have 0 and those that do not:
s = df[df['Failed?'].eq(0)]                      # save rows with 0, since they are repeated 0 times and dropped by repeat
df = df.reindex(df.index.repeat(df['Failed?']))  # repeat each row by its count
df['Failed?'] = 1                                # set all remaining values to 1
df = pd.concat([df, s]).sort_index()             # bring back the saved 0 rows and sort by index to restore order
df

# The above code as a one-liner:
(pd.concat([df.reindex(df.index.repeat(df['Failed?'])).assign(**{'Failed?': 1}),
            df[df['Failed?'].eq(0)]])
   .sort_index())
Out[1]:
Lot Failed?
0 100 1
0 100 1
0 100 1
1 110 0
2 120 1
3 130 1
3 130 1
3 130 1
3 130 1
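As a variation (a sketch, not part of the answer above), Series.clip can handle the zero-count rows without the separate concat: repeat every row at least once so the zero rows survive, then cap the counts at 1 to get the 0/1 indicator:

```python
import pandas as pd

df = pd.DataFrame({'Lot': ['100', '110', '120', '130'],
                   'Failed?': [3, 0, 1, 4]})

# clip(lower=1) keeps zero-count rows (repeated once);
# clip(upper=1) turns the remaining counts into a 0/1 flag
out = (df.loc[df.index.repeat(df['Failed?'].clip(lower=1))]
         .assign(**{'Failed?': lambda d: d['Failed?'].clip(upper=1)})
         .reset_index(drop=True))
```

The zero rows keep their 0 (clipping at an upper bound of 1 leaves 0 untouched), so no rows need to be set aside and merged back in.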
The below will give you failure-or-not directly, but I suppose you are better served by the other answer:
df.loc[df['Failed?'] > 0, 'Failed?'] = 1
Just as a comment: this is a bit of a strange data transformation; you might want to just keep a numerical target variable.
If we have the data set:
import pandas as pd
a = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
b = pd.DataFrame({"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]})
How does one create a correlation matrix, in which the y-axis represents "a" and the x-axis represents "b"?
The aim is to see correlations between the matching columns of the two datasets like this:
If you don't mind a NumPy-based vectorized solution, here's one based on this solution post to Computing the correlation coefficient between two multi-dimensional arrays -
corr2_coeff(a.values.T,b.values.T).T # func from linked solution post.
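For completeness, that helper computes the correlation of every row of one 2D array against every row of another; a sketch consistent with the linked post:

```python
import numpy as np

def corr2_coeff(A, B):
    # Mean-center each row
    A_mA = A - A.mean(axis=1, keepdims=True)
    B_mB = B - B.mean(axis=1, keepdims=True)
    # Row-wise sums of squares
    ssA = (A_mA ** 2).sum(axis=1)
    ssB = (B_mB ** 2).sum(axis=1)
    # Pairwise Pearson correlation between rows of A and rows of B
    return A_mA @ B_mB.T / np.sqrt(np.outer(ssA, ssB))
```

Passing `a.values.T` and `b.values.T` as above treats each DataFrame column as a row, giving the column-vs-column correlation matrix.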
Sample run -
In [621]: a
Out[621]:
A B C D E
0 34 54 56 0 78
1 12 87 78 23 12
2 78 35 0 72 31
3 84 25 14 56 0
4 26 82 13 14 34
In [622]: b
Out[622]:
A B C D E
0 45 45 98 0 24
1 24 87 52 23 12
2 65 65 32 1 65
3 65 52 32 365 3
4 65 12 12 53 65
In [623]: corr2_coeff(a.values.T,b.values.T).T
Out[623]:
array([[ 0.71318502, -0.5923714 , -0.9704441 , 0.48775228, -0.07401011],
[ 0.0306753 , -0.0705457 , 0.48801177, 0.34685977, -0.33942737],
[-0.26626431, -0.01983468, 0.66110713, -0.50872017, 0.68350413],
[ 0.58095645, -0.55231196, -0.32053858, 0.38416478, -0.62403866],
[ 0.01652716, 0.14000468, -0.58238879, 0.12936016, 0.28602349]])
This achieves exactly what you want:
from scipy.stats import pearsonr

# create a new DataFrame where the values for the indices and columns
# align on the diagonals
c = pd.DataFrame(columns=a.columns, index=a.columns)

# since we know set(a.columns) == set(b.columns), we can just iterate
# through the columns in a (although a more robust way would be to iterate
# through the intersection of the two sets of columns, in case your actual
# dataframes' columns don't match up)
for col in a.columns:
    correl_signif = pearsonr(a[col], b[col])  # correlation of those two Series
    correl = correl_signif[0]                 # grab the Pearson R value from the tuple above
    c.loc[col, col] = correl                  # locate the diagonal for that column and assign the coefficient
Edit: Well, it achieved exactly what you wanted, until the question was modified. Although this can easily be changed:
c = pd.DataFrame(columns=a.columns, index=a.columns)

for col in c.columns:
    for idx in c.index:
        correl_signif = pearsonr(a[col], b[idx])
        correl = correl_signif[0]
        c.loc[idx, col] = correl
c is now this:
Out[16]:
A B C D E
A 0.713185 -0.592371 -0.970444 0.487752 -0.0740101
B 0.0306753 -0.0705457 0.488012 0.34686 -0.339427
C -0.266264 -0.0198347 0.661107 -0.50872 0.683504
D 0.580956 -0.552312 -0.320539 0.384165 -0.624039
E 0.0165272 0.140005 -0.582389 0.12936 0.286023
I use this function that breaks it down with numpy:

import numpy as np
import pandas as pd

def corr_ab(a, b):
    a_ = a.values
    b_ = b.values
    ab = a_.T.dot(b_)
    n = len(a)
    sums_squared = np.outer(a_.sum(0), b_.sum(0))
    stds_squared = np.outer(a_.std(0), b_.std(0))
    return pd.DataFrame((ab - sums_squared / n) / stds_squared / n,
                        a.columns, b.columns)
demo
corr_ab(a, b)
Do you have to use Pandas? This can be done via numpy rather easily. Did I understand the task incorrectly?

import numpy

X = {"A": [34, 12, 78, 84, 26], "B": [54, 87, 35, 25, 82], "C": [56, 78, 0, 14, 13], "D": [0, 23, 72, 56, 14], "E": [78, 12, 31, 0, 34]}
Y = {"A": [45, 24, 65, 65, 65], "B": [45, 87, 65, 52, 12], "C": [98, 52, 32, 32, 12], "D": [0, 23, 1, 365, 53], "E": [24, 12, 65, 3, 65]}

for key, value in X.items():
    print("correlation stats for %s is %s" % (key, numpy.corrcoef(value, Y[key])))
In my DataFrame I have one column with numeric values, let's say distance. I want to find out which group of distances (range) has the biggest number of records (rows).
Doing a simple
df.distance.value_counts() returns:
74 1
90 1
94 1
893 1
889 1
885 1
877 1
833 1
122 1
545 1
What I want to achieve is something like buckets from histogram, so I am expecting output like this:
900 4 #all values < 900 and > 850
100 3
150 1
550 1
850 1
The one approach I've figured out so far, which I don't think is the best or most optimal one, is to find the max and min values, divide the range by my step (50 in this case), and then loop through all the values, assigning each to the appropriate group.
Is there any other, better approach for that?
I'd suggest doing the following, assuming your value column is labeled val
import numpy as np
df['bin'] = df['val'].apply(lambda x: 50*np.floor(x/50))
The result is the following:
df.groupby('bin')['val'].count()
Thanks to EdChum's suggestion, and based on this example, I've figured out that the best way (at least for me) is to do something like this:

import numpy as np
import pandas as pd

step = 50
# ...
max_val = df.distance.max()
bins = list(range(0, int(np.ceil(max_val / step)) * step + step, step))
clusters = pd.cut(df.distance, bins, labels=bins[1:])
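To then get the bucket counts the question asks for, the cut result can be tallied with value_counts. A sketch, assuming a distance column with the sample values from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'distance': [74, 90, 94, 893, 889, 885, 877, 833, 122, 545]})

step = 50
max_val = df.distance.max()
bins = list(range(0, int(np.ceil(max_val / step)) * step + step, step))
# label each interval by its right edge, e.g. (850, 900] -> 900
clusters = pd.cut(df.distance, bins, labels=bins[1:])

# records per bucket, largest first
counts = clusters.value_counts().sort_values(ascending=False)
```

With these values the top buckets come out as 900 (4 records) and 100 (3 records), matching the expected output in the question; empty buckets appear with a count of 0.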