Create pandas df from difference between two dfs - python

Set-up
I have two pandas data frames df1 and df2, each containing two columns with observations for id and its respective url:

df1

| id | url |
|----|-----|
| 1  | url |
| 2  | url |
| 3  | url |
| 4  | url |

df2

| id | url |
|----|-----|
| 2  | url |
| 4  | url |
| 3  | url |
| 5  | url |
| 6  | url |
Some observations appear in both dfs, which is clear from the id column, e.g. observation 2 and its url are in both dfs.
The position of those 'double' observations does not have to be the same in both dfs, e.g. observation 2 is in the first row of df1 and the second row of df2.
Lastly, the dfs do not necessarily have the same number of observations, e.g. df1 has four observations while df2 has five.
Problem
I want to extract all observations that are unique to df2 (i.e. not present in df1) and insert them in a new df (df3), i.e. I want to obtain:

| id | url |
|----|-----|
| 5  | url |
| 6  | url |
How do I go about this?
I've tried this answer but cannot get it to work for my two-column dataframes.
I've also tried this other answer, but this gives me an empty common dataframe.

Possibly something like this: df3 = df2[~df2.id.isin(df1.id.tolist())]

ID numbers make a good index:
df1.index = df1.id
df2.index = df2.id
Then use the very straightforward index.difference:
diff_index = df2.index.difference(df1.index)
df3 = df2.loc[diff_index]
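For reference, here is a minimal runnable sketch of both suggestions on the example data (the url values are just placeholder strings):
import pandas as pd

# example data from the question; 'url' is just a placeholder string
df1 = pd.DataFrame({'id': [1, 2, 3, 4], 'url': ['url'] * 4})
df2 = pd.DataFrame({'id': [2, 4, 3, 5, 6], 'url': ['url'] * 5})

# option 1: boolean mask with isin
df3 = df2[~df2.id.isin(df1.id)]

# option 2: id as index, then index.difference
df1.index = df1.id
df2.index = df2.id
df3_alt = df2.loc[df2.index.difference(df1.index)]

print(df3)      # rows with id 5 and 6
print(df3_alt)  # same rows, indexed by id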

Related

Comparing two Dataframes and creating a third one where certain conditions are met

I am trying to compare two different dataframes that have the same column names and indexes (not numerical), and I need to obtain a third df with the biggest value for each cell with the same index and column name.
Example
df1 =

|            | col_1 | col2 | col-3 |
|------------|-------|------|-------|
| rft_12312  | 4     | 7    | 4     |
| rft_321321 | 3     | 4    | 1     |

df2 =

|            | col_1 | col2 | col-3 |
|------------|-------|------|-------|
| rft_12312  | 7     | 3    | 4     |
| rft_321321 | 3     | 7    | 6     |
Required result

|            | col_1 | col2 | col-3 |
|------------|-------|------|-------|
| rft_12312  | 7     | 7    | 4     |
| rft_321321 | 3     | 7    | 6     |

(For rft_12312, col_1 is 7 because df2's value in that [row:column] is greater than df1's; for rft_321321 the values are equal, so it doesn't matter which df the value comes from.)
I've already tried pd.update with filter_func defined as:
def filtration_function(val1, val2):
    if val1 >= val2:
        return val1
    else:
        return val2

but it is not working. I need the check to run for each column with the same name.
I've also tried pd.compare, but it does not allow me to pick the right values.
Thank you in advance :)
I think one possibility would be to use "combine". This method combines the two dataframes column by column with a function, which here can return the element-wise maximum.
Example:
import pandas as pd

def filtration_function(col1, col2):
    # combine passes whole columns (Series), so compare element-wise
    return col1.where(col1 >= col2, col2)

result = df1.combine(df2, filtration_function)
I think the method "where" can work too:
import pandas as pd
result = df1.where(df1 >= df2, df2)
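A quick check of this on the example frames (a sketch, using the index labels from the question):
import pandas as pd

idx = ['rft_12312', 'rft_321321']
df1 = pd.DataFrame({'col_1': [4, 3], 'col2': [7, 4], 'col-3': [4, 1]}, index=idx)
df2 = pd.DataFrame({'col_1': [7, 3], 'col2': [3, 7], 'col-3': [4, 6]}, index=idx)

# keep df1's value where it is the larger one, otherwise take df2's value
result = df1.where(df1 >= df2, df2)
print(result)
#             col_1  col2  col-3
# rft_12312       7     7      4
# rft_321321      3     7      6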

Filtering by column - multiple pd.arrays

I have multiple DataFrames from the same measurements at different times. Now I want to find correlating measurements (which can be located by a string in one of the columns, called abmn). The problem is that I cannot simply iterate, since the datasets all have different lengths.
Would you know any solution?
example
df1

| res | abmn |
|-----|------|
| 3   | 1234 |
| 0   | 1245 |
| 2   | 1256 |

df2

| res | abmn |
|-----|------|
| 1   | 1234 |
| 0   | 1256 |
| 2   | 1267 |
I would want two dataframes containing only

df1

| res | abmn |
|-----|------|
| 3   | 1234 |
| 2   | 1256 |

df2

| res | abmn |
|-----|------|
| 1   | 1234 |
| 0   | 1256 |
I tried a loop, but that wouldn't work since they are of different lengths. I think I managed to build a list with all the abmn values that occur in all datasets, but I haven't really found a solution from there.
Here is a possible answer to your question. I first see which values are common to both, and then create two new dataframes based on the values to keep.
Here is the code:
import pandas as pd
# generate the dataframes as per the example
df1 = pd.DataFrame({"res":[3,0,2], "abmn":[1234,1245,1256]})
df2 = pd.DataFrame({"res":[1,0,2], "abmn":[1234,1256,1267]})
# identify the items to keep
items_to_keep = []
for item in df1['abmn']:
    if item in df2['abmn'].values:
        items_to_keep.append(item)
# create new dataframes based on the list of items to keep
new_df1 = df1[df1.abmn.isin(items_to_keep)]
new_df2 = df2[df2.abmn.isin(items_to_keep)]
# print the two new dataframes
print("new_df1")
print(new_df1)
print("==========")
print("new_df2")
print(new_df2)
OUTPUT:
new_df1
   res  abmn
0    3  1234
2    2  1256
==========
new_df2
   res  abmn
0    1  1234
1    0  1256
Please note that the for-loop can be replaced with a list comprehension:
# identify the items to keep
items_to_keep = [item for item in df1['abmn'] if item in df2['abmn'].values]
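Alternatively, the loop (or list comprehension) can be skipped entirely by filtering each frame against the other's abmn column; a sketch of the same idea, reusing df1 and df2 from above:
# keep only rows whose abmn value also appears in the other frame
new_df1 = df1[df1['abmn'].isin(df2['abmn'])]
new_df2 = df2[df2['abmn'].isin(df1['abmn'])]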

Creating new columns and filling them based on another column's values

Let's say I have a dataframe df looking like this:
|ColA |
|---------|
|B=7 |
|(no data)|
|C=5 |
|B=3,C=6 |
How do I extract the data into new columns, so it looks like this:
|ColA | B | C |
|------|---|---|
|True | 7 | |
|False | | |
|True | | 5 |
|True | 3 | 6 |
For filling the columns I know I can use regex .extract, as shown in this solution.
But how do I set the Column name at the same time? So far I use a loop over df.ColA.loc[df["ColA"].isna()].iteritems(), but that does not seem like the best option for a lot of data.
You could use str.extractall to get the data, then reshape the output and join to a derivative of the original dataframe:
# create the B/C columns
df2 = (df['ColA'].str.extractall('([^=]+)=([^=,]+),?')
       .set_index(0, append=True)
       .droplevel('match')[1]
       .unstack(0, fill_value='')
       )

# rework ColA and join previous output
df.notnull().join(df2).fillna('')

# or if several columns:
df.assign(ColA=df['ColA'].notnull()).join(df2).fillna('')
output:
ColA B C
0 True 7
1 False
2 True 5
3 True 3 6
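For reference, the snippet above can be reproduced with a frame like this (a sketch, assuming the "(no data)" cell is stored as NaN):
import numpy as np
import pandas as pd

# example data from the question; NaN stands for the "(no data)" row
df = pd.DataFrame({'ColA': ['B=7', np.nan, 'C=5', 'B=3,C=6']})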

How to apply a function on each group of data in a pandas group by

Suppose the data frame below:
|id |day | order |
|---|--- |-------|
| a | 2 | 6 |
| a | 4 | 0 |
| a | 7 | 4 |
| a | 8 | 8 |
| b | 11 | 10 |
| b | 15 | 15 |
I want to apply a function to the day and order columns of each group, grouping the rows by the id column.
The function is:
def mean_of_differences(my_list):
    return sum([my_list[i] - my_list[i-1] for i in range(1, len(my_list))]) / len(my_list)
This function calculates the mean of the differences between each element and the previous one. For example, for id=a, day would be (2+3+1) divided by 4. I know how to use lambda, but I didn't find a way to implement this in a pandas group by. Also, each column should be sorted independently to get my desired output, so apparently it is not possible to just sort by one column before the group by.
The output should be like this:
|id |day| order |
|---|---|-------|
| a |1.5| 2 |
| b | 2 | 2.5 |
Does anyone know how to do this in a group by?
First sort your data by day, then group by id, and finally compute your diff/mean.
df = df.sort_values('day') \
       .groupby('id') \
       .agg({'day': lambda x: x.diff().fillna(0).mean()}) \
       .reset_index()
Output:
>>> df
id day
0 a 1.5
1 b 2.0
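The snippet above only aggregates day; if order needs the same treatment (each column sorted independently within its group, as the question describes), a possible sketch is:
import pandas as pd

df = pd.DataFrame({
    'id':    ['a', 'a', 'a', 'a', 'b', 'b'],
    'day':   [2, 4, 7, 8, 11, 15],
    'order': [6, 0, 4, 8, 10, 15],
})

def mean_of_differences(col):
    # sort each column on its own within the group, then average the
    # consecutive differences (the first diff is NaN, filled with 0)
    return col.sort_values().diff().fillna(0).mean()

result = df.groupby('id')[['day', 'order']].agg(mean_of_differences).reset_index()
print(result)
#   id  day  order
# 0  a  1.5    2.0
# 1  b  2.0    2.5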

Pandas - applying groupings and counts to multiple columns in order to generate/change a dataframe

I'm sure what I'm trying to do is fairly simple for those with better knowledge of PD, but I'm simply stuck at transforming:
+---------+------------+-------+
| Trigger | Date | Value |
+---------+------------+-------+
| 1 | 01/01/2016 | a |
+---------+------------+-------+
| 2 | 01/01/2016 | b |
+---------+------------+-------+
| 3 | 01/01/2016 | c |
+---------+------------+-------+
...etc, into:
+------------+---------------------+---------+---------+---------+
| Date | #of triggers | count a | count b | count c |
+------------+---------------------+---------+---------+---------+
| 01/01/2016 | 3 | 1 | 1 | 1 |
+------------+---------------------+---------+---------+---------+
| 02/01/2016 | 5 | 2 | 1 | 2 |
+------------+---------------------+---------+---------+---------+
... and so on
The issue is, I've got no bloody idea of how to achieve this.
I've scoured SO, but I can't seem to find anything that applies to my specific case.
I presume I'd have to group it all by date, but then once that is done, what do I need to do to get the remaining columns?
The initial DF is loaded from an SQLAlchemy query object, and then I want to manipulate it to get the result I described above. How would one do this?
Thanks
df.groupby(['Date','Value']).count().unstack(level=-1)
You can use GroupBy.size with unstack, also parameter sort=False is helpful:
df1 = df.groupby(['Date','Value'])['Value'].size().unstack(fill_value=0)
df1['Total'] = df1.sum(axis=1)
cols = df1.columns[-1:].union(df1.columns[:-1])
df1 = df1[cols]
print(df1)
Value       Total  a  b  c
Date
01/01/2016      3  1  1  1
The difference between size and count is:
size counts NaN values, count does not.
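Another possible route is pd.crosstab, which builds the per-value counts directly; a sketch (the '# of triggers' column label is just illustrative):
import pandas as pd

# a tiny frame shaped like the question's first table
df = pd.DataFrame({
    'Trigger': [1, 2, 3],
    'Date':    ['01/01/2016'] * 3,
    'Value':   ['a', 'b', 'c'],
})

# count occurrences of each Value per Date, then prepend the per-date totals
out = pd.crosstab(df['Date'], df['Value'])
out.insert(0, '# of triggers', out.sum(axis=1))
print(out)
# Value       # of triggers  a  b  c
# Date
# 01/01/2016              3  1  1  1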
