Pandas: concatenating conditioned on unique values - python

I am concatenating two Pandas dataframes as below.
import numpy as np
import pandas as pd

part1 = pd.DataFrame({'id': [100, 200, 300, 400, 500],
                      'amount': np.random.randn(5)})
part2 = pd.DataFrame({'id': [700, 100, 800, 500, 300],
                      'amount': np.random.randn(5)})
concatenated = pd.concat([part1, part2], axis=0)
amount id
0 -0.458653 100
1 2.172348 200
2 0.072494 300
3 -0.253939 400
4 -0.061866 500
0 -1.187505 700
1 -0.810784 100
2 0.321881 800
3 -1.935284 500
4 -1.351507 300
How can I limit the operation so that a row in part2 is only included in concatenated if the row id does not already appear in part1? In a way, I want to treat the id column like a set.
Is it possible to do this during concat() or is this more a post-processing step?
Desired output for this example would be:
concatenated_desired
amount id
0 -0.458653 100
1 2.172348 200
2 0.072494 300
3 -0.253939 400
4 -0.061866 500
0 -1.187505 700
2 0.321881 800

Call drop_duplicates() after concat(). Since keep='first' is the default, the row from part1 is kept whenever an id appears in both parts:
part1 = pd.DataFrame({'id': [100, 200, 300, 400, 500],
                      'amount': np.arange(5)})
part2 = pd.DataFrame({'id': [700, 100, 800, 500, 300],
                      'amount': np.random.randn(5)})
concatenated = pd.concat([part1, part2], axis=0)
print(concatenated.drop_duplicates(subset="id"))

Calculate the ids not in part1:
In [28]:
diff = part2.loc[~part2['id'].isin(part1['id'])]
diff
Out[28]:
amount id
0 -2.184038 700
2 -0.070749 800
Now concat:
In [29]:
concatenated = pd.concat([part1, diff], axis=0)
concatenated
Out[29]:
amount id
0 -2.240625 100
1 -0.348184 200
2 0.281050 300
3 0.082460 400
4 -0.045416 500
0 -2.184038 700
2 -0.070749 800
You can also put this in a one-liner:
concatenated = pd.concat([part1, part2.loc[~part2['id'].isin(part1['id'])]], axis=0)
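If the repeated index labels from the two parts bother you, a small variation (assuming you don't need the original row positions) is to let concat renumber the result:
# Same filtering as above, but with a fresh 0..n-1 index on the result.
concatenated = pd.concat([part1, part2.loc[~part2['id'].isin(part1['id'])]],
                         axis=0, ignore_index=True)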

If you have a column with an id, use it as the index. Doing the manipulations with a real index makes things easier. Here you can use combine_first, which does what you are looking for:
part1 = part1.set_index('id')
part2 = part2.set_index('id')
part1.combine_first(part2)
Out[38]:
amount
id
100 1.685685
200 -1.895151
300 -0.804097
400 0.119948
500 -0.434062
700 0.215255
800 -0.031562
If you really don't want id as the index, reset it afterwards:
part1.combine_first(part2).reset_index()
Out[39]:
id amount
0 100 1.685685
1 200 -1.895151
2 300 -0.804097
3 400 0.119948
4 500 -0.434062
5 700 0.215255
6 800 -0.031562
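Note that combine_first gives priority to the calling frame, so for an id present in both parts the amount from part1 wins, which is the behaviour asked for. A minimal sketch with made-up deterministic values to illustrate:
# Hypothetical illustration: id 1 exists in both frames, so a's value is kept.
a = pd.DataFrame({'amount': [10.0]}, index=pd.Index([1], name='id'))
b = pd.DataFrame({'amount': [99.0, 5.0]}, index=pd.Index([1, 2], name='id'))
print(a.combine_first(b))  # id 1 -> 10.0 (from a), id 2 -> 5.0 (from b)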

Related

Create calculated field within dataset using Python

I have a dataset, df, where I would like to create columns that display the output of a subtraction calculation:
Data
count power id p_q122 p_q222 c_q122 c_q222
100 1000 aa 200 300 10 20
100 2000 bb 400 500 5 10
Desired
cnt pwr id p_q122 avail1 p_q222 avail2 c_q122 count1 c_q222 count2
100 1000 aa 200 800 300 700 10 90 20 80
100 2000 bb 400 1600 500 1500 5 95 10 90
Doing
a = df['avail1'] = + df['pwr'] - df['p_q122']
b = df['avail2'] = + df['pwr'] - df['p_q222']
I am looking for a more elegant way that produces the desired output. Any suggestion is appreciated.
We can perform 2D subtraction with numpy:
pd.DataFrame(
df['power'].to_numpy()[:, None] - df.filter(like='p_').to_numpy()
).rename(columns=lambda i: f'avail{i + 1}')
avail1 avail2
0 800 700
1 1600 1500
The benefit here is that, no matter how many p_ columns there are, all of them will be subtracted from the power column.
We can concat all of the computations with df like:
df = pd.concat([
df,
# power calculations
pd.DataFrame(
df['power'].to_numpy()[:, None] - df.filter(like='p_').to_numpy()
).rename(columns=lambda i: f'avail{i + 1}'),
# Count calculations
pd.DataFrame(
df['count'].to_numpy()[:, None] - df.filter(like='c_').to_numpy()
).rename(columns=lambda i: f'count{i + 1}'),
], axis=1)
which gives df:
count power id p_q122 p_q222 ... c_q222 avail1 avail2 count1 count2
0 100 1000 aa 200 300 ... 20 800 700 90 80
1 100 2000 bb 400 500 ... 10 1600 1500 95 90
[2 rows x 11 columns]
If we have many column groups to do, we can build the list of DataFrames programmatically as well:
df = pd.concat([df, *(
pd.DataFrame(
df[col].to_numpy()[:, None] - df.filter(like=filter_prefix).to_numpy()
).rename(columns=lambda i: f'{new_prefix}{i + 1}')
for col, filter_prefix, new_prefix in [
('power', 'p_', 'avail'),
('count', 'c_', 'count')
]
)], axis=1)
Setup and imports:
import pandas as pd
df = pd.DataFrame({
'count': [100, 100], 'power': [1000, 2000], 'id': ['aa', 'bb'],
'p_q122': [200, 400], 'p_q222': [300, 500], 'c_q122': [10, 5],
'c_q222': [20, 10]
})
Try:
df['avail1'] = df['power'].sub(df['p_q122'])
df['avail2'] = df['power'].sub(df['p_q222'])
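If more quarters are added later, the same idea can be written as a small loop instead of one line per column (a sketch assuming the p_/c_ naming pattern from the sample data):
# Hypothetical generalization: derive the avail/count columns for every quarter suffix.
for i, q in enumerate(['q122', 'q222'], start=1):
    df[f'avail{i}'] = df['power'] - df[f'p_{q}']
    df[f'count{i}'] = df['count'] - df[f'c_{q}']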

What is the right way to loop over a pandas dataframe and apply a condition?

I am trying to loop through a list of dictionaries, comparing a value to a pair of columns in a Pandas dataframe and adding a value to a third column under a certain condition.
My list of dictionaries that looks like this:
dict_list = [{'type': 'highlight', 'id': 0, 'page_number': 4, 'location_number': 40, 'content': 'Foo'}, {'type': 'highlight', 'id': 1, 'page_number': 12, 'location_number': 96, 'content': 'Bar'}, {'type': 'highlight', 'id': 2, 'page_number': 128, 'location_number': 898, 'content': 'Some stuff'}]
My dataframe looks like this:
start end note_count
1 1 100 0
2 101 200 0
3 201 300 0
For each dictionary, I want to pull the "page_number" value and compare it to the "start" and "end" columns in the dataframe rows. If page_number is within the range of those two values in a row, I want to +1 to the "note_count" column for that row. This is my current code:
for dict in dict_list:
    page_number = dict['page_number']
    for index, row in ventile_frame.iterrows():
        ventile_frame["note_count"][(ventile_frame["start"] <= page_number) & (ventile_frame["end"] >= page_number)] += 1
print(ventile_frame)
I would expect to see a result like this.
start end note_count
1 1 100 2
2 101 200 1
3 201 300 0
Instead, I am seeing this.
start end note_count
1 1 100 9
2 101 200 0
3 201 300 0
Thanks for any help!
You don't need to iterate on the rows of ventile_frame - and that's the beauty of it!
(ventile_frame["start"] <= page_number) & (ventile_frame["end"] >= page_number) will produce a boolean mask indicating whether page_number is within the range of each row. Try it with a fixed value for page_number to understand what's going on:
print((ventile_frame["start"] <= 4) & (ventile_frame["end"] >= 4))
Bottom line is, you just need to iterate over the dicts (using .loc here avoids the chained-indexing assignment, which may not update the frame under newer pandas copy-on-write behaviour):
for single_dict in dict_list:
    page_number = single_dict['page_number']
    mask = (ventile_frame["start"] <= page_number) & (ventile_frame["end"] >= page_number)
    ventile_frame.loc[mask, "note_count"] += 1
print(ventile_frame)
Note that I replaced dict by single_dict in the above code, it's best to avoid shadowing built-in python names.
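For completeness, a minimal setup matching the frames shown in the question (the index labels 1-3 are assumed here purely to reproduce the display above):
import pandas as pd

dict_list = [
    {'type': 'highlight', 'id': 0, 'page_number': 4, 'location_number': 40, 'content': 'Foo'},
    {'type': 'highlight', 'id': 1, 'page_number': 12, 'location_number': 96, 'content': 'Bar'},
    {'type': 'highlight', 'id': 2, 'page_number': 128, 'location_number': 898, 'content': 'Some stuff'},
]
ventile_frame = pd.DataFrame({'start': [1, 101, 201],
                              'end': [100, 200, 300],
                              'note_count': [0, 0, 0]},
                             index=[1, 2, 3])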
Here is a way using IntervalIndex (df below is the question's ventile_frame):
m = pd.DataFrame(dict_list)
s = pd.IntervalIndex.from_arrays(df.start, df.end, 'both')
# output -> IntervalIndex([[1, 100], [101, 200], [201, 300]],
#                         closed='both',
#                         dtype='interval[int64]')
n = m.set_index(s).loc[m['page_number']].groupby(level=0)['page_number'].count()
n.index = pd.MultiIndex.from_arrays([n.index])
final = df.set_index(['start', 'end']).assign(new_note_count=n).reset_index()
final['new_note_count'] = final['new_note_count'].fillna(0)
Output:
start end note_count new_note_count
0 1 100 0 2.0
1 101 200 0 1.0
2 201 300 0 0.0
Details:
Once we have the interval index, set it as the index of m and look up the page_number values with .loc[]:
print(m.set_index(s).loc[m['page_number']])
type id page_number location_number content
[1, 100] highlight 0 4 40 Foo
[1, 100] highlight 0 4 40 Foo
[101, 200] highlight 1 12 96 Bar
Then get the counts with groupby(), convert the index to a MultiIndex, and assign it back.
I would do this with DataFrame.apply:
First, create a Series with the page numbers contained in the dictionaries:
page_serie = pd.Series([dict_t['page_number'] for dict_t in dict_list])
print(page_serie)
0 4
1 12
2 128
dtype: int64
Then, for each row of your dataframe, determine whether each value of the series lies between 'start' and 'end', and sum across the row:
df['note_count'] = df.apply(lambda x: page_serie.between(x['start'], x['end']), axis=1).sum(axis=1)
print(df)
start end note_count
1 1 100 2
2 101 200 1
3 201 300 0

Using apply function to Pandas dataframe [duplicate]

This question already has an answer here:
Running sum in pandas (without loop)
(1 answer)
Closed 4 years ago.
I am trying to do an emulation of a loan with monthly payments in pandas.
The credit column contains the amount of money which I borrowed from the bank.
The debit column contains the amount of money which I paid back to the bank.
The total column should contain the amount which is left to pay to the bank (basically the running balance of the credit and debit columns).
I was able to write the following code:
import pandas as pd

# This function returns the subtraction result of credit and debit
def f(x):
    return x['credit'] - x['debit']

df = pd.DataFrame({'credit': [1000, 0, 0, 500],
                   'debit': [0, 100, 200, 0]})

for i in df:
    df['total'] = df.apply(f, axis=1)

print(df)
It works (it subtracts the debit from the credit), but it doesn't keep a running result in the total column. Please see Actual and Expected results below.
Actual result:
credit debit total
0 1000 0 1000
1 0 100 -100
2 0 200 -200
3 500 0 500
Expected result:
credit debit total
0 1000 0 1000
1 0 100 900
2 0 200 700
3 500 0 1200
You could use cumsum:
df['total'] = (df.credit - df.debit).cumsum()
print(df)
Output
credit debit total
0 1000 0 1000
1 0 100 900
2 0 200 700
3 500 0 1200
You don't need apply here.
import pandas as pd
df = pd.DataFrame({'credit': [1000, 0, 0, 500],
'debit': [0, 100, 200, 0]})
df['Total'] = (df['credit'] - df['debit']).cumsum()
print(df)
Output
credit debit Total
0 1000 0 1000
1 0 100 900
2 0 200 700
3 500 0 1200
The reason apply wasn't working is that apply executes on each row independently rather than keeping the running total after each subtraction. Chaining cumsum() onto the subtraction keeps the running total and gives the desired result.
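To make the difference concrete, here is a rough sketch of what keeping the state yourself would look like (for illustration only; cumsum() does exactly this for you):
# Hypothetical explicit running total: each row sees the balance carried
# over from the previous rows, which a per-row apply cannot do on its own.
total = 0
totals = []
for credit, debit in zip(df['credit'], df['debit']):
    total += credit - debit
    totals.append(total)
df['total'] = totals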

intersection of two columns of pandas dataframe

I have 2 pandas dataframes: dataframe1 and dataframe2 that look like this:
mydataframe1
Out[15]:
Start End
100 200
300 450
500 700
mydataframe2
Out[16]:
Start End Value
0 400 0
401 499 -1
500 1000 1
1001 1698 1
Each row correspond to a segment (start-end).
For each segment in dataframe1 I would like to assign a value depending on the values assigned to the segments in dataframe2.
For example:
the first segment in dataframe1, 100-200, is included in the first segment of dataframe2, 0-400, so I should assign the value 0
the second segment in dataframe1, 300-450, overlaps the first (0-400) and second (401-499) segments of dataframe2. In this case I need to split this segment in two and assign the two corresponding values, i.e. 300-400 -> value 0 and 401-450 -> value -1
the final dataframe1 should look like
mydataframe1
Out[15]:
Start End Value
100 200 0
300 400 0
401 450 -1
500 700 1
I hope that was clear. Can you help me?
I doubt that there is a Pandas method that you can use to solve this directly.
You have to calculate the intersections manually to get the result you want. The intervaltree library makes the interval overlap calculation easier and more efficient at least.
Querying the tree with an interval (tree[i1], IntervalTree's overlap search) returns the (full) intervals that overlap with the provided one, but it does not calculate their intersection. This is why I also apply the intersect() function defined below.
import pandas as pd
from intervaltree import Interval, IntervalTree
def intersect(a, b):
    """Intersection of two intervals."""
    intersection = max(a[0], b[0]), min(a[1], b[1])
    if intersection[0] > intersection[1]:
        return None
    return intersection

def interval_df_intersection(df1, df2):
    """Calculate the intersection of two sets of intervals stored in DataFrames.
    The intervals are defined by the "Start" and "End" columns.
    The data in the rest of the columns of df1 is included with the resulting
    intervals."""
    tree = IntervalTree.from_tuples(zip(
        df1.Start.values,
        df1.End.values,
        df1.drop(["Start", "End"], axis=1).values.tolist()
    ))
    intersections = []
    for row in df2.itertuples():
        i1 = Interval(row.Start, row.End)
        intersections += [list(intersect(i1, i2)) + i2.data for i2 in tree[i1]]
    # Make sure the column names are in the correct order
    data_cols = list(df1.columns)
    data_cols.remove("Start")
    data_cols.remove("End")
    return pd.DataFrame(intersections, columns=["Start", "End"] + data_cols)

interval_df_intersection(mydataframe2, mydataframe1)
The result is identical to what you were after.
Here is an answer using the NCLS library. It does not do the splitting, but rather answers the question in the title and does so really quickly.
Setup:
from ncls import NCLS
contents = """Start End
100 200
300 450
500 700"""
import pandas as pd
from io import StringIO
df = pd.read_table(StringIO(contents), sep="\s+")
contents2 = """Start End Value
0 400 0
401 499 -1
500 1000 1
1001 1698 1"""
df2 = pd.read_table(StringIO(contents2), sep="\s+")
Execution:
n = NCLS(df.Start.values, df.End.values, df.index.values)
x, x2 = n.all_overlaps_both(df2.Start.values, df2.End.values, df2.index.values)
dfx = df.loc[x]
# Start End
# 0 100 200
# 0 100 200
# 1 300 450
# 2 500 700
df2x = df2.loc[x2]
# Start End Value
# 0 0 400 0
# 1 401 499 -1
# 1 401 499 -1
# 2 500 1000 1
dfx.insert(dfx.shape[1], "Value", df2x.Value.values)
# Start End Value
# 0 100 200 0
# 0 100 200 0
# 1 300 450 -1
# 2 500 700 1
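Since the question also asks for the split segments, one way to finish the job on top of the overlap join above (a sketch, assuming dfx and df2x keep the row pairing produced by all_overlaps_both) is to clip each matched Start/End pair to the overlapping region:
import numpy as np
# Hypothetical post-processing: clip each matched pair to its intersection.
result = pd.DataFrame({"Start": np.maximum(dfx.Start.values, df2x.Start.values),
                       "End": np.minimum(dfx.End.values, df2x.End.values),
                       "Value": df2x.Value.values})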

Sum data points from individual pandas dataframes in a summary dataframe based on custom (and possibly overlapping) bins

I have many dataframes with individual counts (e.g. df_boston below). Each row defines a data point that is uniquely identified by its marker and its point. I have a summary dataframe (df_inventory_master) that has custom bins (the points above map to the Begin-End coordinates in the master). I want to add a column to this dataframe for each individual city that sums the counts from that city in a new column. An example is shown.
Two quirks are that the bins in the master frame can be overlapping (the count should be added to both) and that some counts may not fall in the master (the count should be ignored).
I can do this in pure Python but since the data are in dataframes it would be helpful and likely faster to do the manipulations in pandas. I'd appreciate any tips here!
This is the master frame:
>>> df_inventory_master = pd.DataFrame({'Marker': [1, 1, 1, 2],
... 'Begin': [100, 300, 500, 100],
... 'End': [200, 600, 900, 250]})
>>> df_inventory_master
Begin End Marker
0 100 200 1
1 300 600 1
2 500 900 1
3 100 250 2
This is data for one city:
>>> df_boston = pd.DataFrame({'Marker': [1, 1, 1, 1],
... 'Point': [140, 180, 250, 500],
... 'Count': [14, 600, 1000, 700]})
>>> df_boston
Count Marker Point
0 14 1 140
1 600 1 180
2 1000 1 250
3 700 1 500
This is the desired output.
- Note that the count of 700 (Marker 1, Point 500) falls in 2 master bins and is counted for both.
- Note that the count of 1000 (Marker 1, Point 250) does not fall in a master bin and is not counted.
- Note that nothing maps to Marker 2 because df_boston does not have any Marker 2 data.
>>> desired_frame
Begin End Marker boston
0 100 200 1 614
1 300 600 1 700
2 500 900 1 700
3 100 250 2 0
What I've tried: I looked at the pd.cut() function, but with the nature of the bins overlapping, and in some cases absent, this does not seem to fit. I can add the column filled with 0 values to get part of the way there but then will need to find a way to sum the data in each frame, using bins defined in the master.
>>> df_inventory_master['boston'] = pd.Series([0 for x in range(len(df_inventory_master.index))], index=df_inventory_master.index)
>>> df_inventory_master
Begin End Marker boston
0 100 200 1 0
1 300 600 1 0
2 500 900 1 0
3 100 250 2 0
Here is how I approached it: basically a SQL-style left join using the pandas merge operation, then apply() across the row axis with a lambda to decide whether each individual record is in the band or not, and finally groupby and sum:
df_merged = df_inventory_master.merge(df_boston, on=['Marker'], how='left')
# logical overwrite of count: zero out counts whose Point falls outside the bin
df_merged['Count'] = df_merged.apply(lambda x: x['Count'] if x['Begin'] <= x['Point'] <= x['End'] else 0, axis=1)
df_agged = df_merged[['Begin', 'End', 'Marker', 'Count']].groupby(['Begin', 'End', 'Marker']).sum()
df_agged_resorted = df_agged.sort_index(level=['Marker', 'Begin', 'End'])
df_agged_resorted = df_agged_resorted.astype(int)
df_agged_resorted.columns = ['boston']  # rename the count column to boston
print(df_agged_resorted)
And the result is
boston
Begin End Marker
100 200 1 614
300 600 1 700
500 900 1 700
100 250 2 0
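Since there are many city frames, the same steps can be wrapped in a small helper and applied per city. This is a sketch with a hypothetical add_city_counts helper, assuming every city frame has the same Marker/Point/Count layout as df_boston and that the master starts with just the Begin/End/Marker columns:
def add_city_counts(master, city_df, city_name):
    # Left-join the city points onto the master bins, zero out points that
    # fall outside their bin, then sum per bin and attach as a new column.
    merged = master.merge(city_df, on=['Marker'], how='left')
    merged['Count'] = merged.apply(
        lambda x: x['Count'] if x['Begin'] <= x['Point'] <= x['End'] else 0, axis=1)
    sums = merged.groupby(['Begin', 'End', 'Marker'])['Count'].sum().astype(int)
    return master.join(sums.rename(city_name), on=['Begin', 'End', 'Marker'])

for name, frame in {'boston': df_boston}.items():
    df_inventory_master = add_city_counts(df_inventory_master, frame, name)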
