intersection of two columns of pandas dataframe - python

I have 2 pandas dataframes: dataframe1 and dataframe2 that look like this:
mydataframe1
Out[15]:
   Start  End
0    100  200
1    300  450
2    500  700
mydataframe2
Out[16]:
   Start   End  Value
0      0   400      0
1    401   499     -1
2    500  1000      1
3   1001  1698      1
Each row corresponds to a segment (Start-End).
For each segment in dataframe1 I would like to assign a value depending on the values assigned to the segments in dataframe2.
For example:
the first segment in dataframe1, 100-200, is included in the first segment of dataframe2, 0-400, so I should assign the value 0;
the second segment in dataframe1, 300-450, is contained in both the first (0-400) and second (401-499) segments of dataframe2. In this case I need to split the segment in two and assign the two corresponding values, i.e. 300-400 -> value 0 and 401-450 -> value -1.
the final dataframe1 should look like:
mydataframe1
Out[15]:
   Start  End  Value
0    100  200      0
1    300  400      0
2    401  450     -1
3    500  700      1
I hope I was clear. Can you help me?

I doubt that there is a Pandas method you can use to solve this directly.
You have to calculate the intersections manually to get the result you want. The intervaltree library at least makes the interval-overlap calculation easier and more efficient.
IntervalTree.overlap() (called search() in intervaltree versions before 3.0) returns the (full) intervals that overlap the provided one but does not calculate their intersection. This is why I also apply the intersect() function I have defined.
import pandas as pd
from intervaltree import Interval, IntervalTree

def intersect(a, b):
    """Intersection of two intervals."""
    intersection = max(a[0], b[0]), min(a[1], b[1])
    if intersection[0] > intersection[1]:
        return None
    return intersection

def interval_df_intersection(df1, df2):
    """Calculate the intersection of two sets of intervals stored in DataFrames.

    The intervals are defined by the "Start" and "End" columns.
    The data in the rest of the columns of df1 is included with the resulting
    intervals."""
    tree = IntervalTree.from_tuples(zip(
        df1.Start.values,
        df1.End.values,
        # Carry the remaining df1 columns along as the interval's data
        df1.drop(["Start", "End"], axis=1).values.tolist()
    ))
    intersections = []
    for row in df2.itertuples():
        i1 = Interval(row.Start, row.End)
        # overlap() is search() in intervaltree < 3.0
        intersections += [list(intersect(i1, i2)) + i2.data
                          for i2 in tree.overlap(i1.begin, i1.end)]
    # Make sure the column names are in the correct order
    data_cols = list(df1.columns)
    data_cols.remove("Start")
    data_cols.remove("End")
    return pd.DataFrame(intersections, columns=["Start", "End"] + data_cols)

interval_df_intersection(mydataframe2, mydataframe1)
The result is identical to what you were after.
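For completeness, the same splitting can also be sketched with plain pandas, with no third-party tree structure: pair every segment of the two frames with a cross merge, keep the overlapping pairs, and clip each pair to its intersection. This assumes pandas >= 1.2 for `how="cross"` and rebuilds the sample frames so the snippet is self-contained:

```python
import pandas as pd

df1 = pd.DataFrame({"Start": [100, 300, 500], "End": [200, 450, 700]})
df2 = pd.DataFrame({"Start": [0, 401, 500, 1001],
                    "End": [400, 499, 1000, 1698],
                    "Value": [0, -1, 1, 1]})

# Pair every df1 segment with every df2 segment ...
m = df1.merge(df2, how="cross", suffixes=("", "_2"))
# ... keep only the overlapping pairs ...
m = m[(m["Start"] <= m["End_2"]) & (m["Start_2"] <= m["End"])]
# ... and clip each pair to its intersection.
m["Start"] = m[["Start", "Start_2"]].max(axis=1)
m["End"] = m[["End", "End_2"]].min(axis=1)
result = m[["Start", "End", "Value"]].reset_index(drop=True)
print(result)
```

For a handful of segments this is fine; for large frames the cross merge grows as len(df1) * len(df2), which is exactly what the interval tree avoids.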

Here is an answer using the NCLS library. It does not do the splitting, but rather answers the question in the title and does so really quickly.
Setup:
from ncls import NCLS
contents = """Start End
100 200
300 450
500 700"""
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(contents), sep=r"\s+")
contents2 = """Start End Value
0 400 0
401 499 -1
500 1000 1
1001 1698 1"""
df2 = pd.read_csv(StringIO(contents2), sep=r"\s+")
Execution:
n = NCLS(df.Start.values, df.End.values, df.index.values)
x, x2 = n.all_overlaps_both(df2.Start.values, df2.End.values, df2.index.values)
dfx = df.loc[x]
# Start End
# 0 100 200
# 0 100 200
# 1 300 450
# 2 500 700
df2x = df2.loc[x2]
# Start End Value
# 0 0 400 0
# 1 401 499 -1
# 1 401 499 -1
# 2 500 1000 1
dfx.insert(dfx.shape[1], "Value", df2x.Value.values)
# Start End Value
# 0 100 200 0
# 0 100 200 -1
# 1 300 450 -1
# 2 500 700 1

Related

How to randomly select a sample of data according to specified proportions in Python Pandas?

I have a DataFrame in Python Pandas like below:
  ID  TG
 111   0
 222   0
 333   1
 444   1
 555   0
 ...  ...
The above DataFrame has 5 000 000 rows, with:
99.40% -> 0
0.60% -> 1
And I need to randomly select a sample of this data so as to have 5% of '1' in column TG.
So as a result I need a DataFrame of observations where 5% are '1' and the rest (95% of '0') are randomly selected.
For example, I need 200 000 observations from my dataset where 5% will be 1 and the rest 0.
How can I do that in Python Pandas?
I'm sure there is a more performant way, but maybe this works using .sample? Based on a dataset of 5_000 rows.
zeros = df.query("TG.eq(0)")
# number of ones = 5% of the number of zeros
frac = int(round(.05 * len(zeros), 0))
ones = df.query("TG.ne(0)").sample(n=frac)
df = pd.concat([ones, zeros]).reset_index(drop=True)
print(df["TG"].value_counts())
0    4719
1     236
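Note that the approach above keeps all the zeros, so the share of ones lands slightly under 5% of the final frame. If the target is an exact sample size with an exact 5% share of ones (sampling the zeros too, as in the question's 200 000-row example), the counts can be derived first. A sketch on synthetic data; the column names follow the question, but the data itself is made up:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: ~0.6% ones
rng = np.random.default_rng(42)
n_rows = 1_000_000
df = pd.DataFrame({"ID": np.arange(n_rows),
                   "TG": (rng.random(n_rows) < 0.006).astype(int)})

n_total = 50_000                 # desired sample size
n_ones = int(n_total * 0.05)     # exactly 5% ones
n_zeros = n_total - n_ones       # the remaining 95% zeros

sample = pd.concat([
    df[df["TG"].eq(1)].sample(n=n_ones, random_state=1),
    df[df["TG"].eq(0)].sample(n=n_zeros, random_state=1),
]).sample(frac=1, random_state=1).reset_index(drop=True)  # shuffle rows

print(sample["TG"].value_counts(normalize=True))
```

With the real 5 000 000-row data there are roughly 30 000 ones, so sampling 10 000 of them plus 190 000 zeros gives the 200 000-row, 5%-ones frame directly.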

Merge_asof but only let the nearest merge on key

I am currently trying to merge two data frames using the merge_asof method. However, when using this method I stumbled upon the issue that if there is an empty gap in any of my data, there will be duplicate cells in the merged dataframe. For clarification, I have two dataframes that look like this:
df1:
index  Meter_Indication (km)  Fuel1 (l)
0      35493                  245
1      35975                  267
2      36000                  200
3      36303                  160
4      36567                  300
5      38653                  234
df2:
index  Meter_Indication (km)  Fuel2 (l)
0      35494                  300
1      35980                  203
2      36573                  323
3      38656                  233
These two dataframes contain data about refueling vehicles, where the fuel column is the refueled amount in liters and Meter_Indication indicates how many km the car has driven in total (something that cannot decrease over time, which is why it is a great key to merge on). However, as you can see there are fewer rows in df2 than in df1, which currently (in my case) makes the values merge on the nearest value, like this:
(merged df)
index  Meter_Indication (km)  Fuel1 (l)  Fuel2 (l)
0      35493                  245        300
1      35975                  267        203
2      36000                  200        203
3      36303                  160        323
4      36567                  300        323
5      38653                  234        233
As you can see there are duplicates of the values 203 and 323. My goal is that, instead of the dataframe containing all six rows, the ones that don't have a "nearest" match are excluded. I want only the actual nearest row to merge with each value. In other words, my desired data frame is:
index  Meter_Indication (km)  Fuel1 (l)  Fuel2 (l)
0      35493                  245        300
1      35975                  267        203
4      36567                  300        323
5      38653                  234        233
You can see here that the values that were not a "closest" match with another value were dropped.
I have tried looking for this everywhere but can't find anything that matches my desired outcome.
My current code is:
# READS PROVIDED DOCUMENTS.
df1 = pd.read_excel(
    filepathname1, "CWA107 Event", na_values=["NA"], skiprows=1,
    usecols="A, B, C, D, E, F")
df2 = pd.read_excel(
    filepathname2,
    na_values=["NA"],
    skiprows=1,
    usecols=["Fuel2 (l)", "Unnamed: 3", "Meter_Indication"],)
# Drop NaN rows.
df2.dropna(inplace=True)
df1.dropna(inplace=True)
# Filters out rows with the keywords listed in 'blacklist'.
df1.rename(columns={"Bränslenivå (%)": "Bränsle"}, inplace=True)
df1 = df1[~df1.Bränsle.isin(blacklist)]
df1.rename(columns={"Bränsle": "Bränslenivå (%)"}, inplace=True)
# Creates a new column for the difference in the fuel-level column.
df1["Difference (%)"] = df1.loc[:, "Bränslenivå (%)"].diff()
# Renames the time column so that they match.
df2.rename(columns={"Unnamed: 3": "Tid"}, inplace=True)
# Drops rows where the difference is equal to 0.
df1filt = df1[(df1["Difference (%)"] != 0)]
# Converts the time column to only year, month and date.
df1filt["Tid"] = pd.to_datetime(df1filt["Tid"]).dt.strftime("%Y%m%d").astype(str)
df1filt.reset_index(level=0, inplace=True)
# Renames the index column to "row" in order to later use the "row" column.
df1filt.rename(columns={"index": "row"}, inplace=True)
# Creates a new column for the difference in total driven kilometers (used for matching).
df1filt["Match"] = df1filt.loc[:, "Vägmätare (km)"].diff()
# Merges refuels that were previously separated because of the time intervals,
# for example when a refuel takes a lot of time and gets split into two refuels.
ROWRANGE = len(df1filt) + 1
thevalue = 0
for currentrow in range(ROWRANGE - 1):
    if df1filt.loc[currentrow, 'Difference (%)'] >= 0.0 and df1filt.loc[currentrow - 1, 'Difference (%)'] <= 0:
        thevalue = 0
        thevalue += df1filt.loc[currentrow, 'Difference (%)']
        df1filt.loc[currentrow, 'Match'] = "SUMMED"
    if df1filt.loc[currentrow, 'Difference (%)'] >= 0.0 and df1filt.loc[currentrow - 1, 'Difference (%)'] >= 0:
        thevalue += df1filt.loc[currentrow, 'Difference (%)']
    if df1filt.loc[currentrow, 'Difference (%)'] <= 0.0 and df1filt.loc[currentrow - 1, 'Difference (%)'] >= 0:
        df1filt.loc[currentrow - 1, 'Difference (%)'] = thevalue
        df1filt.loc[currentrow - 1, 'Match'] = "OFFICIAL"
        thevalue = 0
# Removes single "refuels" that are lower than 5.
df1filt = df1filt[(df1filt['Difference (%)'] > 5)]
# Creates a new dataframe for the summed values.
df1filt2 = df1filt[(df1filt['Match'] == "OFFICIAL")]
# Creates an estimated refueled-amount column for the automatic readings.
df1filt2["Fuel1 (l)"] = df1filt2.loc[:, "Difference (%)"] / 100 * fuelcapacity
# Renames the total-kilometer column so that the two documents can match.
df1filt2.rename(columns={"Vägmätare (km)": "Meter_Indication"}, inplace=True)
# Filters out rows where refuel and kilometer are NaN (manual readings).
# (Comparing with "!= NaN" is always True; notna() is needed here.)
df2filt = df2[df2['Fuel2 (l)'].notna() & df2['Meter_Indication'].notna()]
# Drops the first row.
df2filt.drop(df2filt.index[0], inplace=True)
# Adds a prefix to the time column so that they match (not used anymore because km is used to match).
df2filt['Tid'] = '20' + df2filt['Tid'].astype(str)
# Rounds numeric columns.
decimals = 0
df2filt['Meter_Indication'] = pd.to_numeric(df2filt['Meter_Indication'], errors='coerce')
df2filt['Fuel2 (l)'] = pd.to_numeric(df2filt['Fuel2 (l)'], errors='coerce')
df2filt['Meter_Indication'] = df2filt['Meter_Indication'].apply(lambda x: round(x, decimals))
df2filt['Fuel2 (l)'] = df2filt['Fuel2 (l)'].apply(lambda x: round(x, decimals))
# Removes the last digit (makes the two excels matchable).
df2filt['Meter_Indication'] //= 10
df1filt2['Meter_Indication'] //= 10
# Creates a merged dataframe from the two.
merged_df = df1filt2.merge(df2filt, on='Meter_Indication')
Hopefully this was enough information! Thank you in advance.
Try this:
# Assign a helper column to keep the meter indication from df2
df = pd.merge_asof(df1, df2.assign(meter_indication_2=df2['Meter_Indication (km)']),
                   on='Meter_Indication (km)', direction='nearest')
# Calculate the absolute difference between the two meter readings
df['meter_indication_diff'] = df['Meter_Indication (km)'].sub(df['meter_indication_2']).abs()
# Sort values, drop duplicates (keep the ones with the smallest diff) and clean up
df = (df.sort_values(by=['meter_indication_2', 'meter_indication_diff'])
        .drop_duplicates(subset=['meter_indication_2'])
        .sort_index()
        .drop(['meter_indication_2', 'meter_indication_diff'], axis=1))
# Output
   Meter_Indication (km)  Fuel1 (l)  Fuel2 (l)
0                  35493        245        300
1                  35975        267        203
4                  36567        300        323
5                  38653        234        233
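Run end to end on the sample frames from the question, the recipe can be checked directly (the frames are rebuilt here so the snippet is self-contained):

```python
import pandas as pd

df1 = pd.DataFrame({'Meter_Indication (km)': [35493, 35975, 36000, 36303, 36567, 38653],
                    'Fuel1 (l)': [245, 267, 200, 160, 300, 234]})
df2 = pd.DataFrame({'Meter_Indication (km)': [35494, 35980, 36573, 38656],
                    'Fuel2 (l)': [300, 203, 323, 233]})

# Keep df2's own meter reading so the match distance can be measured afterwards
df = pd.merge_asof(df1, df2.assign(meter_indication_2=df2['Meter_Indication (km)']),
                   on='Meter_Indication (km)', direction='nearest')
df['meter_indication_diff'] = df['Meter_Indication (km)'].sub(df['meter_indication_2']).abs()
# For each df2 row, keep only its closest df1 partner
df = (df.sort_values(by=['meter_indication_2', 'meter_indication_diff'])
        .drop_duplicates(subset=['meter_indication_2'])
        .sort_index()
        .drop(['meter_indication_2', 'meter_indication_diff'], axis=1))
print(df)
```

Note that merge_asof requires both frames to be sorted on the key, which the odometer readings already are.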

How to find the first row with min value of a column in dataframe

I have a data frame where I build a third column from the absolute difference of two columns, and I am trying to get the row holding the minimum value of that column. When I try to get the first or second minimum value of the data frame's column I get an error. Is there a better method to get the row with the minimum value of a column?
df2 = df[[2, 3]]
df2[4] = np.absolute(df[2] - df[3])
# lowest = df.iloc[df[6].min()]
     2    3    4
0 -111 -104    7
1 -130  110  240
2 -105 -112    7
3 -118 -100   18
4 -147  123  270
5  225 -278  503
6  102 -122  224
desired result:
     2    3    4
2 -105 -112    7
Get the difference as a Series, apply Series.abs, and then compare with the minimal value using boolean indexing:
s = (df[2] - df[3]).abs()
df = df[s == s.min()]
If you want a new column for the difference:
df['diff'] = (df[2] - df[3]).abs()
df = df[df['diff'] == df['diff'].min()]
Another idea is to get the index of the minimal value with Series.idxmin and then select with DataFrame.loc; for a one-row DataFrame the double brackets [[]] are necessary:
s = (df[2] - df[3]).abs()
df = df.loc[[s.idxmin()]]
EDIT:
For more dynamic code that converts to integers where possible, use:
def int_if_possible(x):
    try:
        return x.astype(int)
    except Exception:
        return x

df = df.apply(int_if_possible)
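One subtlety worth noting: with the sample data the minimal absolute difference 7 occurs twice (rows 0 and 2), so the boolean-indexing variant returns both rows while idxmin returns only the first one. A quick check, reconstructing the frame from the question:

```python
import pandas as pd

df = pd.DataFrame({2: [-111, -130, -105, -118, -147, 225, 102],
                   3: [-104, 110, -112, -100, 123, -278, -122]})

s = (df[2] - df[3]).abs()      # 7, 240, 7, 18, 270, 503, 224
both = df[s == s.min()]        # every row attaining the minimum
first = df.loc[[s.idxmin()]]   # only the first such row
print(both.index.tolist(), first.index.tolist())
```

So if the desired answer really is row 2 rather than row 0, some extra tie-breaking rule beyond "first minimum" is needed.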

Sum data points from individual pandas dataframes in a summary dataframe based on custom (and possibly overlapping) bins

I have many dataframes with individual counts (e.g. df_boston below). Each row defines a data point that is uniquely identified by its marker and its point. I have a summary dataframe (df_inventory_master) that has custom bins (the points above map to the Begin-End coordinates in the master). I want to add a column to this dataframe for each individual city that sums the counts from that city in a new column. An example is shown.
Two quirks are that the bins in the master frame can be overlapping (the count should be added to both) and that some counts may not fall in the master (those counts should be ignored).
I can do this in pure Python but since the data are in dataframes it would be helpful and likely faster to do the manipulations in pandas. I'd appreciate any tips here!
This is the master frame:
>>> df_inventory_master = pd.DataFrame({'Marker': [1, 1, 1, 2],
... 'Begin': [100, 300, 500, 100],
... 'End': [200, 600, 900, 250]})
>>> df_inventory_master
Begin End Marker
0 100 200 1
1 300 600 1
2 500 900 1
3 100 250 2
This is data for one city:
>>> df_boston = pd.DataFrame({'Marker': [1, 1, 1, 1],
... 'Point': [140, 180, 250, 500],
... 'Count': [14, 600, 1000, 700]})
>>> df_boston
Count Marker Point
0 14 1 140
1 600 1 180
2 1000 1 250
3 700 1 500
This is the desired output.
- Note that the count of 700 (Marker 1, Point 500) falls in 2 master bins and is counted for both.
- Note that the count of 1000 (Marker 1, Point 250) does not fall in a master bin and is not counted.
- Note that nothing maps to Marker 2 because df_boston does not have any Marker 2 data.
>>> desired_frame
Begin End Marker boston
0 100 200 1 614
1 300 600 1 700
2 500 900 1 700
3 100 250 2 0
What I've tried: I looked at the pd.cut() function, but with the nature of the bins overlapping, and in some cases absent, this does not seem to fit. I can add the column filled with 0 values to get part of the way there but then will need to find a way to sum the data in each frame, using bins defined in the master.
>>> df_inventory_master['boston'] = pd.Series([0 for x in range(len(df_inventory_master.index))], index=df_inventory_master.index)
>>> df_inventory_master
Begin End Marker boston
0 100 200 1 0
1 300 600 1 0
2 500 900 1 0
3 100 250 2 0
Here is how I approached it: basically a SQL-style left join using the pandas merge operation, then apply() across the row axis with a lambda to decide whether the individual records are in the band or not, and finally groupby and sum:
df_merged = df_inventory_master.merge(df_boston, on=['Marker'], how='left')
# logical overwrite of count: zero it out when the point falls outside the bin
df_merged['Count'] = df_merged.apply(
    lambda x: x['Count'] if x['Begin'] <= x['Point'] <= x['End'] else 0, axis=1)
df_agged = df_merged[['Begin', 'End', 'Marker', 'Count']].groupby(['Begin', 'End', 'Marker']).sum()
df_agged_resorted = df_agged.sort_index(level=['Marker', 'Begin', 'End'])
df_agged_resorted = df_agged_resorted.astype(int)
df_agged_resorted.columns = ['boston']  # rename the count column to boston
print(df_agged_resorted)
And the result is
boston
Begin End Marker
100 200 1 614
300 600 1 700
500 900 1 700
100 250 2 0
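The same left-join-then-mask idea can be condensed into a vectorized sketch using Series.between, which conveniently evaluates to False for the NaN Points of unmatched markers (here Marker 2), so those rows contribute 0:

```python
import pandas as pd

df_inventory_master = pd.DataFrame({'Marker': [1, 1, 1, 2],
                                    'Begin': [100, 300, 500, 100],
                                    'End': [200, 600, 900, 250]})
df_boston = pd.DataFrame({'Marker': [1, 1, 1, 1],
                          'Point': [140, 180, 250, 500],
                          'Count': [14, 600, 1000, 700]})

merged = df_inventory_master.merge(df_boston, on='Marker', how='left')
# between() is False both for out-of-bin points and for NaN Points
inside = merged['Point'].between(merged['Begin'], merged['End'])
merged['boston'] = merged['Count'].where(inside, 0)
out = merged.groupby(['Marker', 'Begin', 'End'], as_index=False)['boston'].sum()
print(out)
```

Replacing the row-wise apply() with between() and where() avoids Python-level iteration, which matters once the joined frame gets large.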

Pandas: concatenating conditioned on unique values

I am concatenating two Pandas dataframes as below.
part1 = pd.DataFrame({'id' :[100,200,300,400,500],
'amount': np.random.randn(5)
})
part2 = pd.DataFrame({'id' :[700,100,800,500,300],
'amount': np.random.randn(5)
})
concatenated = pd.concat([part1, part2], axis=0)
amount id
0 -0.458653 100
1 2.172348 200
2 0.072494 300
3 -0.253939 400
4 -0.061866 500
0 -1.187505 700
1 -0.810784 100
2 0.321881 800
3 -1.935284 500
4 -1.351507 300
How can I limit the operation so that a row in part2 is only included in concatenated if the row id does not already appear in part1? In a way, I want to treat the id column like a set.
Is it possible to do this during concat() or is this more a post-processing step?
Desired output for this example would be:
concatenated_desired
amount id
0 -0.458653 100
1 2.172348 200
2 0.072494 300
3 -0.253939 400
4 -0.061866 500
0 -1.187505 700
2 0.321881 800
Call drop_duplicates() after concat():
part1 = pd.DataFrame({'id': [100, 200, 300, 400, 500],
                      'amount': np.arange(5)
                      })
part2 = pd.DataFrame({'id': [700, 100, 800, 500, 300],
                      'amount': np.random.randn(5)
                      })
concatenated = pd.concat([part1, part2], axis=0)
print(concatenated.drop_duplicates(subset="id"))
Calculate the id's not in part1:
In [28]:
diff = part2.loc[~part2['id'].isin(part1['id'])]
diff
Out[28]:
amount id
0 -2.184038 700
2 -0.070749 800
now concat
In [29]:
concatenated = pd.concat([part1, diff], axis=0)
concatenated
Out[29]:
amount id
0 -2.240625 100
1 -0.348184 200
2 0.281050 300
3 0.082460 400
4 -0.045416 500
0 -2.184038 700
2 -0.070749 800
You can also put this in a one-liner:
concatenated = pd.concat([part1, part2.loc[~part2['id'].isin(part1['id'])]], axis=0)
If you have a column with an id, use it as an index. Performing manipulations with a real index will make things easier. Here you can use combine_first, which does what you are searching for:
part1 = part1.set_index('id')
part2 = part2.set_index('id')
part1.combine_first(part2)
Out[38]:
amount
id
100 1.685685
200 -1.895151
300 -0.804097
400 0.119948
500 -0.434062
700 0.215255
800 -0.031562
If you really do not want that index, reset it afterwards:
part1.combine_first(part2).reset_index()
Out[39]:
id amount
0 100 1.685685
1 200 -1.895151
2 300 -0.804097
3 400 0.119948
4 500 -0.434062
5 700 0.215255
6 800 -0.031562
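Both routes give the same frame here (up to row order), since combine_first also prefers part1's value whenever an id appears in both inputs. A quick cross-check with fixed numbers (the amounts are arbitrary):

```python
import pandas as pd

part1 = pd.DataFrame({'id': [100, 200, 300, 400, 500],
                      'amount': [0.0, 1.0, 2.0, 3.0, 4.0]})
part2 = pd.DataFrame({'id': [700, 100, 800, 500, 300],
                      'amount': [10.0, 11.0, 12.0, 13.0, 14.0]})

# Route 1: filter part2 to unseen ids, then concat
via_isin = pd.concat([part1, part2.loc[~part2['id'].isin(part1['id'])]], axis=0)
# Route 2: index on id and combine, part1 taking priority
via_combine = part1.set_index('id').combine_first(part2.set_index('id')).reset_index()

print(via_isin.sort_values('id')['amount'].tolist())
print(via_combine['amount'].tolist())
```

The one visible difference is ordering: combine_first returns the rows sorted by the id index, while the concat route preserves the original row order of each part.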
