Finding thickness between two points based on meeting conditions - python

I have a dataframe with a list of surfaces and depths. Some of the surfaces are labeled with the suffix _top and _base.
How can I write a function that will create a column that calculates the thickness of only the surfaces that have the same name with the _top and _base suffix (e.g. red_top - red_base = thickness)?
Example:
import pandas as pd

df = pd.DataFrame({'Surface': ['red_top', 'red_base', 'blue_top',
                               'blue_base', 'green_top', 'pink'],
                   'Depth': [2, 6, 12, 45, 55, 145]})
I've tried to split the surface column to create one for the surfaces and one for the top/base, but I'm not sure if that is necessary and am still stuck on how to calculate the thickness based on meeting those conditions.
Many thanks

I would first split the "Surface" column into two parts - "Color" and "Level" - then pivot the table by "Color", and calculate the thickness as follows:
split = df.Surface.str.split("_", expand=True)   # "red_top" -> ["red", "top"]
split.columns = ["Color", "Level"]
df = pd.concat([df, split], axis=1)
df_pivoted = df.pivot(index="Color", columns="Level", values="Depth")
df_pivoted["Thickness"] = df_pivoted.base - df_pivoted.top
For your example, df_pivoted looks like this:
Level    NaN  base   top  Thickness
Color
blue NaN 45.0 12.0 33.0
green NaN NaN 55.0 NaN
pink 145.0 NaN NaN NaN
red NaN 6.0 2.0 4.0
The NaN column has non-empty values for Surfaces without the suffix.
The line below computes the thickness only for surfaces that have both a _top and a _base entry:
thickness = (df_pivoted.base-df_pivoted.top).dropna()
print(thickness)
results in
Color
blue 33.0
red 4.0
dtype: float64
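If you want the thickness attached back onto the original rows rather than kept as a separate series, one small extension (a sketch reusing the "Color" column created by the split above) is to map it by color:

df["Thickness"] = df["Color"].map(thickness)   # NaN for surfaces without both _top and _base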

Related

Pandas, how to calculate delta between one cell and another in different rows

I have the following frame:
USERID, EVENT1TIME, EVENT2TIME, MISC1, MISC2
123,45,,,
123,,46,,
123,,47,,
123,,48,,
123,,49,,
123,,51,,
124,45,,,
124,,46,,
124,,47,,
124,,48,,
124,,49,,
124,,51,,
I'd like to add a DELTA column that is (EVENT2TIME - EVENT1TIME):
USERID, EVENT1TIME, EVENT2TIME, MISC1, MISC2, DELTA
123,45,,,,
123,,46,,,1
123,,47,,,2
123,,48,,,3
123,,49,,,4
123,,51,,,6
124,45,,,,
124,,46,,,1
124,,47,,,2
124,,48,,,3
124,,49,,,4
124,,51,,,6
I think the first thing to do is to copy the value from the row where EVENT1TIME is populated into the other instances of that USERID. But I suspect there may be a better way.
I am making some assumptions:
- you want to calculate the difference between the EVENT2TIME column and the first row of EVENT1TIME
- you want to store the results in DELTA
You can do this as follows:
import pandas as pd
df = pd.read_csv('abc.txt')
print (df)
df['DELTA'] = df.iloc[:, 2] - df.iloc[0, 1]   # EVENT2TIME minus the first row's EVENT1TIME
print (df)
The output of this will be:
USERID EVENT1TIME EVENT2TIME MISC1 MISC2 DELTA
0 123 45.0 NaN NaN NaN NaN
1 123 NaN 46.0 NaN NaN 1.0
2 123 NaN 47.0 NaN NaN 2.0
3 123 NaN 48.0 NaN NaN 3.0
4 123 NaN 49.0 NaN NaN 4.0
5 123 NaN 51.0 NaN NaN 6.0
If you know EVENT1TIME is always and only in the first row, just store it as a variable and subtract it.
val = df.EVENT1TIME[0]
df['DELTA'] = df.EVENT2TIME - val
If EVENT1TIME has a value only every so often, back- or forward-fill the empty rows of EVENT1TIME before subtracting. The fill is not stored in the final output df.
df['DELTA'] = df.EVENT2TIME - df.EVENT1TIME.ffill() # forward fill (down) all nan values
# OR
df['DELTA'] = df.EVENT2TIME - df.EVENT1TIME.bfill() # back fill (up) all nan values
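If, as in the sample data, each USERID carries its own EVENT1TIME, a per-user variant of the same idea (a sketch, not part of the original answer) is to fill within each user before subtracting:

df['DELTA'] = df['EVENT2TIME'] - df.groupby('USERID')['EVENT1TIME'].ffill()   # fill EVENT1TIME down within each USERID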
EDIT: Keeping this for continuity despite how hacky it is.
import numpy as np

locations = list(df[~np.isnan(df.EVENT1TIME)].index)   # rows where EVENT1TIME is set
vals = df.EVENT1TIME.loc[locations]                    # all EVENT1TIME values
locations.append(df.EVENT1TIME.index[-1] + 1)          # last row index + 1, as a sentinel
last_loc = locations[0]
for next_loc in locations[1:]:
    df.loc[last_loc:next_loc - 1, 'DELTA'] = df.loc[last_loc:next_loc - 1, 'EVENT2TIME'] - vals[last_loc]
    last_loc = next_loc

Plotting by Index with different labels

I am using pandas and matplotlib to generate some charts.
My DataFrame:
Journal Papers per year in journal
0 Information and Software Technology 4
1 2012 International Conference on Cyber Securit... 4
2 Journal of Network and Computer Applications 4
3 IEEE Security & Privacy 5
4 Computers & Security 11
My DataFrame is the result of a groupby on a larger dataframe. What I want now is a simple bar chart, which in theory works fine with df_groupby_time.plot(kind='bar'). However, I get this:
What I want are different colored bars, and a legend which states which color corresponds to which paper.
Playing around with relabeling hasn't gotten me anywhere so far, and I no longer have any idea how to achieve what I want.
EDIT:
Resetting the index and plotting isn't what I want:
df_groupby_time.set_index("Journals").plot(kind='bar')
I found a solution, based on this question here.
So, the dataframe needs to be transformed into a matrix where the values exist only on the main diagonal.
First, I save the Journal column in a variable for later.
new_cols = df["Journal"].values
Secondly, I wrote a function that takes a series (the Papers per year in journal column) and the previously saved new column labels as input parameters, and returns a dataframe where the values sit only on the main diagonal:
def values_into_main_diagonal(some_series, new_cols):
    """Puts the values of a series onto the main diagonal of a new df.
    some_series - any series given
    new_cols - the new column labels as list or numpy.ndarray"""
    x = [{i: some_series[i]} for i in range(len(some_series))]
    main_diag_df = pd.DataFrame(x)
    main_diag_df.columns = new_cols
    return main_diag_df
Thirdly, feeding the function the Papers per year in journal column and our saved new column names returns the following dataframe:
new_df:
1_journal 2_journal 3_journal 4_journal 5_journal
0 4 NaN NaN NaN NaN
1 NaN 4 NaN NaN NaN
2 NaN NaN 4 NaN NaN
3 NaN NaN NaN 5 NaN
4 NaN NaN NaN NaN 11
Finally, plotting new_df via new_df.plot(kind='bar', stacked=True) gives me what I want: the journals in different colors in the legend and NOT on the axis.
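For completeness, the whole pipeline in one place (a sketch; "df" is the grouped frame shown above and the column names are taken from it):

import matplotlib.pyplot as plt
import pandas as pd

new_cols = df["Journal"].values
new_df = values_into_main_diagonal(df["Papers per year in journal"], new_cols)
new_df.plot(kind='bar', stacked=True)   # one color per journal, names in the legend
plt.show()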

How to populate a column in one dataframe by comparing it to another dataframe

I have a dataframe called res_df:
In [54]: res_df.head()
Out[54]:
Bldg_Sq_Ft GEOID CensusPop HU_Pop Pop_By_Area
0 753.026123 240010013002022 11.0 7.0 NaN
7 95.890495 240430003022003 17.0 8.0 NaN
8 1940.862793 240430003022021 86.0 33.0 NaN
24 2254.519775 245102801012021 27.0 13.0 NaN
25 11685.613281 245101503002000 152.0 74.0 NaN
I have a second dataframe made from the summarized information in res_df. It's grouped by the GEOID column and then summarized using aggregations to get the sum of the Bldg_Sq_Ft and the mean of the CensusPop columns for each unique GEOID. Let's call it geoid_sum:
In [55]:geoid_sum = geoid_sum.groupby('GEOID').agg({'GEOID': 'count', 'Bldg_Sq_Ft': 'sum', 'CensusPop': 'mean'})
In [56]: geoid_sum.head()
Out[56]:
GEOID Bldg_Sq_Ft CensusPop
GEOID
100010431001011 1 1154.915527 0.0
100030144041044 1 5443.207520 26.0
100050519001066 1 1164.390503 4.0
240010001001001 15 30923.517090 41.0
240010001001007 3 6651.656677 0.0
My goal is to find the GEOIDs in res_df that match the GEOIDs in geoid_sum and populate Pop_By_Area for those rows using the equation:
Pop_By_Area = (geoid_sum['CensusPop'] * res_df['Bldg_Sq_Ft']) / geoid_sum['Bldg_Sq_Ft']
I've created a simple function that takes those parameters, but I am unsure how to iterate through the dataframes and apply the function.
def popByArea(census_pop_mean, bldg_sqft, bldg_sqft_sum):
    x = (census_pop_mean * bldg_sqft) / bldg_sqft_sum
    return x
I've tried creating a series based on the GEOID matches: s = res_df.GEOID.isin(geoid_sum.GEOID.values) but that didn't seem to work (produced all false boolean values). How can I find the matches and apply my function to populate the Pop_By_Area column?
I think you need reindex here. After the groupby, GEOID is the index of geoid_sum (and the GEOID column now holds counts), which is also why your isin check returned all False:
geoid_sum = (geoid_sum.groupby('GEOID')
                      .agg({'GEOID': 'count', 'Bldg_Sq_Ft': 'sum', 'CensusPop': 'mean'})
                      .reindex(res_df['GEOID']))
res_df['Pop_By_Area'] = (geoid_sum['CensusPop'].values * res_df['Bldg_Sq_Ft']) / geoid_sum['Bldg_Sq_Ft'].values
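An equivalent way to line the values up, if you prefer an explicit merge over reindex (a sketch assuming the column names from the question; the *_sum/*_mean names and pandas >= 0.25 named aggregation are assumptions of the example):

geoid_stats = (res_df.groupby('GEOID', as_index=False)
                     .agg(Bldg_Sq_Ft_sum=('Bldg_Sq_Ft', 'sum'),
                          CensusPop_mean=('CensusPop', 'mean')))

merged = res_df.merge(geoid_stats, on='GEOID', how='left')
res_df['Pop_By_Area'] = (merged['CensusPop_mean'] * merged['Bldg_Sq_Ft']
                         / merged['Bldg_Sq_Ft_sum']).values   # .values: same row order, different index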

plotting a vbar_stack using a dataframe

I'm struggling to get a stacked vbar working.
With python/pandas and bokeh I want to plot several statistics about the players of a football team. The dataframe is nicely filled; the values are strings where they should be strings and numeric where they should be numeric.
I used the bokeh sample and tried to adjust it for my purpose, but I'm stuck on this error:
ValueError: Keyword argument sequences for broadcasting must be the same length as stackers
My code (without imports and scraping pieces) is:
source = ColumnDataSource(data=statsdfsource[['goals', 'assists', 'naam']])
p = figure(plot_height=250, title="Fruit Counts by Year",
           toolbar_location=None, tools="")
p.vbar_stack(['goals', 'assists'], x='naam', width=0.9, color=colors,
             source=source)
p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None
p.axis.minor_tick_line_color = None
p.outline_line_color = None
p.legend.location = "top_left"
p.legend.orientation = "horizontal"
show(p)
The dataframe I fill the columndatasource with is
goals assists naam
0 NaN NaN Miguel Santos
1 NaN NaN Aykut Özer
2 NaN NaN Job van de Walle
3 NaN NaN Rowen Koot
4 8.0 6.0 Perr Schuurs
5 4.0 2.0 Wessel Dammers
6 12.0 2.0 Stefan Askovski
7 1.0 NaN Mica Pinto
8 NaN NaN Christopher Braun
9 1.0 4.0 Marco Ospitalieri
10 NaN 1.0 Clint Esser
The result I want is a stacked bar chart where the x-axis shows the player's name, with two segments stacked above it: one for the goals the player made and one for the assists.
I think I'm messing up somewhere with how my dataframe is built, but I'm a bit lost as to how it should be formed (on the other hand, I can't really imagine that the dataframe doesn't fit the purpose).
When using categorical ranges, you have to tell figure what the categories for the axis are and what order you want them to show up, e.g. provide x_range something like:
# specify all the factors for the x-axis by passing x_range
p = figure(..., x_range=sorted(df.naam.unique()))
It's also possible the NaN values are a problem, since they are "contagious". I'd recommend changing them to zeros instead in any case.
Finally, the error message probably indicates that your colors list is the wrong length. You are stacking two bars in each column, so the list of colors also needs length two (one color for each "row" in the stack).
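Putting those three fixes together, a minimal sketch (column names from the question; the two hex colors are just placeholders):

from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show

stats = statsdfsource[['goals', 'assists', 'naam']].fillna(0)   # NaNs are "contagious" -> zeros
source = ColumnDataSource(stats)

p = figure(x_range=sorted(stats['naam'].unique()),   # tell figure the categorical factors
           plot_height=250, title="Goals and assists per player",
           toolbar_location=None, tools="")
p.vbar_stack(['goals', 'assists'], x='naam', width=0.9,
             color=['#718dbf', '#e84d60'],            # one color per stacker
             legend_label=['goals', 'assists'], source=source)
show(p)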

Pandas - finding anomaly in paired column values in large Dataframe

I've been banging my head against a wall on this for a couple of hours, and would appreciate any help I could get.
I'm working with a large data set (over 270,000 rows), and am trying to find an anomaly within two columns that should have paired values.
From the snippet of output below - I'm looking at the Alcohol_Category_ID and Alcohol_Category_Name columns. The ID column has a numeric string value that should pair up 1:1 with a string descriptor in the Name column (e.g., "1031100.0" == "100 PROOF VODKA").
As you can see, both columns have the same count of non-null values. However, there are 72 unique IDs and only 71 unique Names. I take this to mean that one Name is incorrectly associated with two different IDs.
County Alcohol_Category_ID Alcohol_Category_Name Vendor_Number \
count 269843 270288 270288 270920
unique 99 72 71 116
top Polk 1031080.0 VODKA 80 PROOF 260
freq 49092 35366 35366 46825
first NaN NaN NaN NaN
last NaN NaN NaN NaN
mean NaN NaN NaN NaN
std NaN NaN NaN NaN
min NaN NaN NaN NaN
25% NaN NaN NaN NaN
50% NaN NaN NaN NaN
75% NaN NaN NaN NaN
max NaN NaN NaN NaN
My trouble is in actually isolating out where this duplication is occurring so that I can hopefully replace the erroneous ID with its correct value. I am having a dog of a time with this.
My dataframe is named i_a.
I've been trying to examine the pairings of values between these two columns with groupby and count statements like this:
i_a.groupby(["Alcohol_Category_Name", "Alcohol_Category_ID"]).Alcohol_Category_ID.count()
However, I'm not sure how to whittle it down from there. And there are too many pairings to make this easy to do visually.
Can someone recommend a way to isolate out the Alcohol_Category_Name associated with more than one Alcohol_Category_ID?
Thank you so much for your consideration!
EDIT: After considering the advice of Dmitry, I found the solution by continually paring down duplicates until I homed in on the value of interest, like so:
#Finding all unique pairings of Category IDs and Names
subset = i_a.drop_duplicates(["Alcohol_Category_Name", "Alcohol_Category_ID"])
#Now, determine which of the category names appears more than once (thus paired with more than one ID)
subset[subset["Alcohol_Category_Name"].duplicated()]
Thank you so much for your help. It seems really obvious in retrospect, but I could not figure it out for the life of me.
I think this snippet meets your needs:
> df = pd.DataFrame({'a':[1,2,3,1,2,3], 'b':[1,2,1,1,2,1]})
So df.a has 3 unique values mapping to 2 uniques in df.b.
> df.groupby('b')['a'].nunique()
b
1 2
2 1
That shows that df.b=1 maps to 2 uniques in a (and that df.b=2 maps to only 1).
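Applied to the question's own columns (a sketch, assuming the dataframe is named i_a as above), the same idea pinpoints the offending pairing directly:

id_counts = i_a.groupby('Alcohol_Category_Name')['Alcohol_Category_ID'].nunique()
print(id_counts[id_counts > 1])   # the name(s) paired with more than one ID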
