Calculating DataFrame columns based on other columns - python

Having a DataFrame like so:
# comments are the equations that have to be done to calculate the given column
import pandas as pd

df = pd.DataFrame({
    'item_tolerance': [230, 115, 155],
    'item_intake': [250, 100, 100],
    'open_items_previous_day': 0,  # df.item_intake.shift() + df.open_items_previous_day.shift() - df.items_shipped.shift() + df.items_over_under_sla.shift()
    'total_items_to_process': 0,   # df.item_intake + df.open_items_previous_day
    'sla_relevant': 0,             # df.item_tolerance if df.open_items_previous_day + df.item_intake > df.item_tolerance else df.open_items_previous_day + df.item_intake
    'items_shipped': [230, 115, 50],
    'items_over_under_sla': 0      # df.items_shipped - df.sla_relevant
})
  item_tolerance item_intake open_items_previous_day total_items_to_process sla_relevant items_shipped items_over_under_sla
0 230 250 0 0 0 230 0
1 115 100 0 0 0 115 0
2 155 100 0 0 0 50 0
I'd like to calculate all the columns that have comments next to them. I've tried using df.apply(some_method, axis=1) to perform row-wise calculations, but the problem is that I don't have access to the previous row inside some_method(row).
To give a little more explanation, what I'm trying to achieve is, for example: df.items_over_under_sla = df.items_shipped - df.sla_relevant, but df.sla_relevant is based on an equation which needs df.open_items_previous_day, which in turn needs the previous row to be calculated. This is the problem: I need to calculate rows based on values from both the current row and the previous one.
What is the correct approach to such a problem?

If you are calculating each column with a different operation, I suggest obtaining them individually:
import numpy as np  # np.where is used below

df['open_items_previous_day'] = df['item_intake'].shift(fill_value=0) + df['open_items_previous_day'].shift(fill_value=0) - df['items_shipped'].shift(fill_value=0) + df['items_over_under_sla'].shift(fill_value=0)
df['total_items_to_process'] = df['item_intake'] + df['open_items_previous_day']
df = df.assign(sla_relevant=np.where(df['open_items_previous_day'] + df['item_intake'] > df['item_tolerance'], df['item_tolerance'], df['open_items_previous_day'] + df['item_intake']))
df['items_over_under_sla'] = df['items_shipped'] - df['sla_relevant']
df
Out[1]:
item_tolerance item_intake open_items_previous_day total_items_to_process sla_relevant items_shipped items_over_under_sla
0 230 250 0 250 230 230 0
1 115 100 20 120 115 115 0
2 155 100 -15 85 85 50 -35
The problem that you are facing is not about having to use the previous row (you are working around that just fine using the shift function). The real problem here is that all columns that you are trying to get (except for total_items_to_process) depend on each other, therefore you can't get the rest of the columns without having one of them first (or assuming it is zero initially).
That's why you are going to get different results depending on which column you've calculated first.
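If the dependency on the fully computed previous row really must be honoured, an explicit row-by-row loop is the straightforward fallback. The following is only a minimal sketch using the column formulas from the question; note that it yields different numbers than the shift-based version above, because each row sees the already computed previous row rather than the initial zeros.
prev = None
for i in df.index:
    if prev is not None:
        # open_items_previous_day comes from the fully computed previous row
        df.loc[i, 'open_items_previous_day'] = (prev['item_intake'] + prev['open_items_previous_day']
                                                - prev['items_shipped'] + prev['items_over_under_sla'])
    df.loc[i, 'total_items_to_process'] = df.loc[i, 'item_intake'] + df.loc[i, 'open_items_previous_day']
    df.loc[i, 'sla_relevant'] = min(df.loc[i, 'item_tolerance'],
                                    df.loc[i, 'open_items_previous_day'] + df.loc[i, 'item_intake'])
    df.loc[i, 'items_over_under_sla'] = df.loc[i, 'items_shipped'] - df.loc[i, 'sla_relevant']
    prev = df.loc[i]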

Related

Repeat rows based on numbers in multiple columns - Python

I have a lot of data that I'm trying to do some basic machine learning on, kind of like the Titanic example that predicts whether a passenger survived or died (I learned this in an intro Python class) based on factors like their gender, age, fare class...
What I'm trying to predict is whether a screw fails depending on how it was made (referred to as Lot). The engineers just listed how many times a failure occurred. Here's how it's formatted.
Lot  Failed?
100        3
110        0
120        1
130        4
The values in the cells are the number of occurrences, so for example:
Lot 100 had three screws that failed
Lot 110 had 0 screws that failed
Lot 120 had one screw that failed
Lot 130 had four screws that failed
I plan on doing a logistic regression using scikit-learn, but first I need each row to be listed as a failure or not. What I'd like to see is a row for every observation, and have them listed as either a 0 (did not occur) or 1 (did occur). Here's what it'd look like after
Lot  Failed?
100        1
100        1
100        1
110        0
120        1
140        1
140        1
140        1
140        1
Here's what I've tried and what I've gotten
df = pd.DataFrame({
    'Lot': ['100', '110', '120', '130'],
    'Failed?': [3, 0, 1, 4]
})
df.loc[df.index.repeat(df['Failed?'])].reset_index(drop = True)
When I do this it repeats the rows but keeps the same values in the Failed? column.
Lot  Failed?
100        3
100        3
100        3
110        0
120        1
140        4
140        4
140        4
140        4
Any ideas? Thank you!
You can use pandas.Series.repeat with reindex, but first you need to differentiate between rows that have 0 and those that do not:
s = df[df['Failed?'].eq(0)] # "save" rows with 0 as value as they will be excluded in repeat since they are repeated 0 times.
df = df.reindex(df.index.repeat(df['Failed?'])) #repeat each row depending on value
df['Failed?'] = 1 #set all values equal to 1
df = pd.concat([df,s]).sort_index() #bring in the 0 values that we saved as 's' earlier and sort by the index to put back in order
df
#The above code as a one-liner:
(pd.concat([df.reindex(df.index.repeat(df['Failed?'])).assign(**{'Failed?' : 1}),
df[df['Failed?'].eq(0)]])
.sort_index())
Out[1]:
Lot Failed?
0 100 1
0 100 1
0 100 1
1 110 0
2 120 1
3 130 1
3 130 1
3 130 1
3 130 1
The below will give you failure or not, but I suppose you are better served by the other answer.
df.loc[df['Failed?'] > 0, 'Failed?'] = 1
Just as a comment: this is a bit of a strange data transformation; you might want to just keep a numerical target variable.
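For completeness, here is a compact sketch that combines both ideas above (repeat, then binarize); it is not from either answer, just an alternative applied to the df defined in the question:
# repeat each row max(count, 1) times so zero-failure lots keep one row,
# then turn the count into a 0/1 failure flag
out = df.loc[df.index.repeat(df['Failed?'].clip(lower=1))].copy()
out['Failed?'] = (out['Failed?'] > 0).astype(int)
out = out.reset_index(drop=True)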

Pandas: get pairs of rows with similar (the difference being within some bound) column values

I have a Pandas dataframe with 1M rows and 3 columns (TrackP, TrackPt, NumLongTracks), and I want to find pairs of 'matching' rows, such that for two 'matching' rows the differences between their values in column 1 (TrackP), column 2 (TrackPt) and column 3 (NumLongTracks) are all within some bound, i.e. no more than ±1:
TrackP TrackPt NumLongTracks
1 2801 544 102
2 2805 407 65
3 2802 587 70
4 2807 251 145
5 2802 543 101
6 2800 545 111
For this particular case you would only retain the pair row 1 and row 5, because for this pair
TrackP(row 1) - TrackP(row 5) = -1,
TrackPt(row 1) - TrackPt(row 5) = +1,
NumLongTracks(row 1) - NumLongTracks(row 5) = +1
This is trivial when the values are exactly the same between rows, but I'm having trouble figuring out the best way to do this for this particular case.
I think it is easier to handle the columns as a single value for comparison.
# new dataframe: each row's three values concatenated into one string
tr = track.TrackP.astype(str) + track.TrackPt.astype(str) + track.NumLongTracks.astype(str)
# finding matching rows
# positions 0-3 hold TrackP, 4-6 hold TrackPt, 7 onwards NumLongTracks
matching = []
for r in tr:
    close = (int(r[0:4]) - 1, int(r[0:4]) + 1)    # TrackP range, 1 up/down
    ptRange = (int(r[4:7]) - 1, int(r[4:7]) + 1)  # TrackPt range
    nLRange = (int(r[7:]) - 1, int(r[7:]) + 1)    # NumLongTracks range
    # note: membership in these 2-tuples matches only exact ±1 differences, as in the example
    for r2 in tr:
        if int(r2[0:4]) in close:           # TrackP in range
            if int(r2[4:7]) in ptRange:     # TrackPt in range
                if int(r2[7:]) in nLRange:  # NumLongTracks in range
                    matching.append([r, r2])
# back to the format
# matching == [['2801544102', '2802543101'], ['2802543101', '2801544102']]
import collections
routes = collections.defaultdict(list)
for seq in matching:
    routes['TrackP'].append(int(seq[0][0:4]))
    routes['TrackPt'].append(int(seq[0][4:7]))
    routes['NumLongTracks'].append(int(seq[0][7:]))
Now you can easily decompress it back into a dataframe:
df = pd.DataFrame.from_dict(dict(routes))
print(df)
TrackP TrackPt NumLongTracks
0 2801 544 102
1 2802 543 101
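As a side note, the pairwise check can also be expressed without string slicing. This is only a sketch of the logic on a small frame (a full cross join of 1M rows is far too large to materialize, so a real solution would need to sort or block on one column first); track is assumed to be the dataframe from the question:
import pandas as pd

# cross join the frame with itself, keeping each unordered pair once
pairs = track.reset_index().merge(track.reset_index(), how='cross', suffixes=('_a', '_b'))
pairs = pairs[pairs['index_a'] < pairs['index_b']]

# keep pairs whose column-wise differences are all within ±1
mask = (
    (pairs['TrackP_a'] - pairs['TrackP_b']).abs().le(1)
    & (pairs['TrackPt_a'] - pairs['TrackPt_b']).abs().le(1)
    & (pairs['NumLongTracks_a'] - pairs['NumLongTracks_b']).abs().le(1)
)
matches = pairs[mask]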

Search for threshold values based on a key from three columns (or more)

I need help with dataset that looks like this:
Name1 Name2 Name3 Temp Height
Alon Walon Balon 105 34 ]
Alon Walon Balon 106 42 |
Alon Walon Balon 105 33 ]-- Samples of Spot: Alon-Walon-Balon
Alon Walon Kalon 101 11 ]
Alon Walon Kalon 102 32 ]-- Samples of Spot: Alon-Walon-Kalon
Alon Talon Balon 111 12 ]-- Samples of Spot: Alon-Talon-Balon
Alon Talon Calon 121 10 ]-- Samples of Spot: Alon-Talon-Calon
What do I want to achieve?
I have samples for one point in space; this point is described by three words, in this case let's take Alon-Walon-Balon:
I want to compare each value of Temp against a threshold, e.g. 105; if the value is higher than 105, then record this in another column.
The same goes for Height.
How am I doing this right now?
df = df.groupby(['Name1','Name2','Name3','Temp','Height']).size().reset_index()
visited = {}
cntSpot = 0
overValTemp = 0
overValHeight = 0
for i in range(len(df)):
    name1 = str(df.get_value(i,'Name1'))
    name2 = str(df.get_value(i,'Name2'))
    name3 = str(df.get_value(i,'Name3'))
    if str(name1+name2+name3) in visited:
        cntSpot += 1
        if df.get_value(i,'Temp') > 105:
            overValTemp += 1
        if df.get_value(i,'Height') < 13:
            overValHeight += 1
    a = str(name1+name2+name3)
    visited.update({a: (cntSpot, overValTemp, overValHeight)})
Now I have a set of dictionaries with information on how many times every spot is over certain values.
This is the information I need: how many times the case occurred for one Spot.
Where is the trick?
The csv files are more than 2 GB and I need to process them incredibly fast.
Here is a solution that uses pandas groupby and is definitely more efficient than the loop.
grouped = df.groupby(['Name1', 'Name2', 'Name3'])
count = grouped.size()
temp = grouped.apply(lambda x: x[x['Temp'] > 105].shape[0])
height = grouped.apply(lambda x: x[x['Height'] < 13].shape[0])
result = pd.concat([count, temp, height],
                   keys=['Count', 'overValTemp', 'overValHeight'],
                   axis=1)
result.index = ["-".join(x) for x in result.index.tolist()]
The result is the following:
Count overValTemp overValHeight
Alon-Talon-Balon 1 1 1
Alon-Talon-Calon 1 1 1
Alon-Walon-Balon 3 1 0
Alon-Walon-Kalon 2 0 1
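As a further note, groupby.apply with a Python lambda can still be slow on multi-gigabyte inputs. A sketch of an apply-free variant of the same aggregation (same column names and thresholds as above, using pandas named aggregation) might look like:
result = (
    df.assign(over_temp=df['Temp'].gt(105), over_height=df['Height'].lt(13))
      .groupby(['Name1', 'Name2', 'Name3'])
      .agg(Count=('Temp', 'size'),
           overValTemp=('over_temp', 'sum'),
           overValHeight=('over_height', 'sum'))
)
result.index = ['-'.join(x) for x in result.index]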

Sum data points from individual pandas dataframes in a summary dataframe based on custom (and possibly overlapping) bins

I have many dataframes with individual counts (e.g. df_boston below). Each row defines a data point that is uniquely identified by its marker and its point. I have a summary dataframe (df_inventory_master) that has custom bins (the points above map to the Begin-End coordinates in the master). I want to add a column to this dataframe for each individual city that sums the counts from that city in a new column. An example is shown.
Two quirks are that the bins in the master frame can be overlapping (the count should be added to both) and that some counts may not fall in the master (the count should be ignored).
I can do this in pure Python but since the data are in dataframes it would be helpful and likely faster to do the manipulations in pandas. I'd appreciate any tips here!
This is the master frame:
>>> df_inventory_master = pd.DataFrame({'Marker': [1, 1, 1, 2],
... 'Begin': [100, 300, 500, 100],
... 'End': [200, 600, 900, 250]})
>>> df_inventory_master
Begin End Marker
0 100 200 1
1 300 600 1
2 500 900 1
3 100 250 2
This is data for one city:
>>> df_boston = pd.DataFrame({'Marker': [1, 1, 1, 1],
... 'Point': [140, 180, 250, 500],
... 'Count': [14, 600, 1000, 700]})
>>> df_boston
Count Marker Point
0 14 1 140
1 600 1 180
2 1000 1 250
3 700 1 500
This is the desired output.
- Note that the count of 700 (Marker 1, Point 500) falls in 2 master bins and is counted for both.
- Note that the count of 1000 (Marker 1, Point 250) does not fall in a master bin and is not counted.
- Note that nothing maps to Marker 2 because df_boston does not have any Marker 2 data.
>>> desired_frame
Begin End Marker boston
0 100 200 1 614
1 300 600 1 700
2 500 900 1 700
3 100 250 2 0
What I've tried: I looked at the pd.cut() function, but with the nature of the bins overlapping, and in some cases absent, this does not seem to fit. I can add the column filled with 0 values to get part of the way there but then will need to find a way to sum the data in each frame, using bins defined in the master.
>>> df_inventory_master['boston'] = pd.Series([0 for x in range(len(df_inventory_master.index))], index=df_inventory_master.index)
>>> df_inventory_master
Begin End Marker boston
0 100 200 1 0
1 300 600 1 0
2 500 900 1 0
3 100 250 2 0
Here is how I approached it: basically a SQL-style left join using the pandas merge operation, then apply() across the row axis with a lambda to decide if the individual records are in the band or not, and finally groupby and sum:
df_merged = df_inventory_master.merge(df_boston, on=['Marker'], how='left')
# logical overwrite of count: zero out records that fall outside the band
df_merged['Count'] = df_merged.apply(lambda x: x['Count'] if x['Begin'] <= x['Point'] <= x['End'] else 0, axis=1)
df_agged = df_merged[['Begin','End','Marker','Count']].groupby(['Begin','End','Marker']).sum()
df_agged_resorted = df_agged.sort_index(level=['Marker','Begin','End'])
df_agged_resorted = df_agged_resorted.astype(int)
df_agged_resorted.columns = ['boston']  # rename the count column to boston
print(df_agged_resorted)
And the result is
boston
Begin End Marker
100 200 1 614
300 600 1 700
500 900 1 700
100 250 2 0
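Since there are many city frames, the same steps can be wrapped in a small helper and applied once per city. This is only a sketch following the approach above; add_city_counts is a hypothetical name and not part of the original answer:
def add_city_counts(master, city_df, name):
    # hypothetical helper: left join on Marker, zero out points outside a bin,
    # sum per bin, and attach the result as a new column named after the city
    merged = master.merge(city_df, on='Marker', how='left')
    in_bin = merged['Point'].between(merged['Begin'], merged['End'])
    merged['Count'] = merged['Count'].where(in_bin, 0)
    summed = (merged.groupby(['Begin', 'End', 'Marker'], as_index=False)['Count']
                    .sum()
                    .rename(columns={'Count': name}))
    return master.merge(summed, on=['Begin', 'End', 'Marker'], how='left')

df_inventory_master = add_city_counts(df_inventory_master, df_boston, 'boston')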

Filling in missing data in Python

I was hoping you would be able to help me solve a small problem.
I am using a small device that prints out two properties that I save to a file. The device rasters in X and Y direction to form a grid. I am interested in plotting the relative intensity of these two properties as a function of the X and Y dimensions. I record the data in 4 columns that are comma separated (X, Y, property 1, property 2).
The grid is examined in lines, so for each Y value, it will move from X1 to X2 which are separated several millimeters apart. Then it will move to the next line and over again.
I am able to process the data in python with pandas/numpy but it doesn't work too well when there are any missing rows (which unfortunately does happen).
I have attached a sample of the output (and annotated the problems):
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
44,12,50,5
45,12,100,6
46,12,1500,7
47,12,2500,8
Sometimes, however, a line or a few will be missing, making it impossible to process and plot. Currently I have not been able to fix it automatically and have to do it manually. The bad output looks like this:
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
45,12,100,5 << missing 44,12...
46,12,1500,6
47,12,2500,7
I know the number of lines I expect since I know my range of X and Y.
What would be the best way to deal with this? Currently I manually enter the missing X and Y values and populate property 1 and 2 with values of 0. This can be time consuming and I would like to automate it. I have two questions.
Question 1: How can I automatically fill in my missing data with the corresponding values of X and Y and two zeros? This could be obtained from a pre-generated array of X and Y values that correspond to the experimental range.
Question 2: Is there a better way to split the file into separate arrays for plotting (rather than using the 'New' line)? For instance, by having an 'if' function that will output each line between X(start) and X(end) to a separate array? I've tried doing that but with no success.
I've attached my current (crude) code:
df = pd.read_csv('FileName.csv', delimiter=',', skiprows=0)
rows = [-1] + np.where(df['X'] == 'New')[0].tolist() + [len(df.index)]
dff = {}
for i, r in enumerate(rows[:-1]):
    dff[i] = df[r + 1: rows[i + 1]]
maxY = len(dff)
data = []
data2 = []
for yaxes in range(0, maxY):
    data2.append(dff[yaxes].iloc[:, 2])
<data2 is then used for plotting using matplotlib>
To answer my Question 1, I was thinking about using the 'reindex' and 'reset_index' functions, however I haven't managed to make them work.
I would appreciate any suggestions.
Does this meet what you want?
Q1: fill X using reindex, and others using fillna
Q2: Passing separated StringIO to read_csv is easier (change if you use Python 3)
# read file and split the input
f = open('temp.csv', 'r')
chunks = f.read().split('New')
# read csv as separated dataframes, using first column as index
dfs = [pd.read_csv(StringIO(unicode(chunk)), header=None, index_col=0) for chunk in chunks]

def pad(df):
    # reindex, you should know the range of x
    df = df.reindex(np.arange(44, 48))
    # pad y from forward / backward, assuming y should have the single value
    df[1] = df[1].fillna(method='bfill')
    df[1] = df[1].fillna(method='ffill')
    # padding others
    df = df.fillna(0)
    # revert index to values
    return df.reset_index(drop=False)

dfs = [pad(df) for df in dfs]
dfs[0]
# 0 1 2 3
# 0 44 11 500 1
# 1 45 11 120 2
# 2 46 11 320 3
# 3 47 11 700 4
# dfs[1]
# 0 1 2 3
# 0 44 12 0 0
# 1 45 12 100 5
# 2 46 12 1500 6
# 3 47 12 2500 7
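On Python 3 the same split-and-parse step ("change if you use Python 3" above) could look like this sketch, since unicode() is gone and StringIO lives in io:
from io import StringIO
import pandas as pd

with open('temp.csv', 'r') as f:
    chunks = f.read().split('New')

# skip empty chunks that can appear around the 'New' separators
dfs = [pd.read_csv(StringIO(chunk), header=None, index_col=0)
       for chunk in chunks if chunk.strip()]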
First Question
I've included print statements inside the function to explain how it works.
In [89]:
def replace_missing(df, Ids):
    # check what the missing values are
    missing = np.setdiff1d(Ids, df[0])
    if len(missing) > 0:
        missing_df = pd.DataFrame(data=np.zeros((len(missing), 4)))
        #print('---missing df---')
        #print(missing_df)
        missing_df[0] = missing
        #print('---missing df---')
        #print(missing_df)
        missing_df[1].replace(0, df[1].iloc[0], inplace=True)
        #print('---missing df---')
        #print(missing_df)
        df = pd.concat([df, missing_df])
        #print('---final df---')
        #print(df)
    return df
In [91]:
Ids = np.arange(44, 48)
final_df = df1.groupby(df1[1], as_index=False).apply(replace_missing, Ids).reset_index(drop=True)
final_df
Out[91]:
0 1 2 3
44 11 500 1
45 11 120 2
46 11 320 3
47 11 700 4
45 12 100 5
46 12 1500 6
47 12 2500 7
44 12 0 0
Second question
In [92]:
group = final_df.groupby(final_df[1])
In [99]:
separate = [group.get_group(key) for key in group.groups.keys()]
separate[0]
Out[104]:
0 1 2 3
44 11 500 1
45 11 120 2
46 11 320 3
47 11 700 4
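A further compact sketch for Question 1, not from either answer: if the data were read into a single frame with labelled columns (X, Y and the two property columns are assumed names here), the full grid could be built up front and missing rows filled with zeros via a MultiIndex reindex:
import pandas as pd

# assumed column names; adjust to the real header of the csv
full_grid = pd.MultiIndex.from_product([range(44, 48), [11, 12]], names=['X', 'Y'])
filled = (df.set_index(['X', 'Y'])
            .reindex(full_grid, fill_value=0)
            .reset_index())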
