I have multiple DataFrames from the same measurements taken at different times. I want to find the corresponding measurements, which can be identified by a string in one of the columns, called abmn. The problem is that I cannot simply iterate over the rows in parallel, since the datasets all have different lengths.
Would you know any solution?
Example:
df1
| res | abmn |
|-----|------|
| 3   | 1234 |
| 0   | 1245 |
| 2   | 1256 |
df2
| res | abmn |
|-----|------|
| 1   | 1234 |
| 0   | 1256 |
| 2   | 1267 |
I would want two DataFrames containing only the rows they have in common:
df1
| res | abmn |
|-----|------|
| 3   | 1234 |
| 2   | 1256 |
df2
| res | abmn |
|-----|------|
| 1   | 1234 |
| 0   | 1256 |
I tried a loop, but that doesn't work since they are of different lengths. I think I managed to build a list with all the abmn values that appear in all four datasets, but I haven't really found a solution from there.
Here is a possible answer to your question. I first determine which values are common to both DataFrames, and then create two new DataFrames keeping only those values.
Here is the code:
import pandas as pd
# generate the dataframes as per the example
df1 = pd.DataFrame({"res":[3,0,2], "abmn":[1234,1245,1256]})
df2 = pd.DataFrame({"res":[1,0,2], "abmn":[1234,1256,1267]})
# identify the items to keep
items_to_keep = []
for item in df1['abmn']:
    if item in df2['abmn'].values:
        items_to_keep.append(item)
# create new dataframes based on the list of items to keep
new_df1 = df1[df1.abmn.isin(items_to_keep)]
new_df2 = df2[df2.abmn.isin(items_to_keep)]
# print the two new dataframes
print("new_df1")
print(new_df1)
print("==========")
print("new_df2")
print(new_df2)
OUTPUT:
new_df1
res abmn
0 3 1234
2 2 1256
==========
new_df2
res abmn
0 1 1234
1 0 1256
Please note that the for loop can be replaced with a list comprehension:
# identify the items to keep
items_to_keep = [item for item in df1['abmn'] if item in df2['abmn'].values]
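For larger DataFrames, the intermediate list can be skipped entirely by filtering each DataFrame directly against the other's abmn column. A minimal sketch of that idea (same example data as above; not part of the original answer):
import pandas as pd
df1 = pd.DataFrame({"res": [3, 0, 2], "abmn": [1234, 1245, 1256]})
df2 = pd.DataFrame({"res": [1, 0, 2], "abmn": [1234, 1256, 1267]})
# keep only rows whose abmn value also appears in the other DataFrame
new_df1 = df1[df1["abmn"].isin(df2["abmn"])]
new_df2 = df2[df2["abmn"].isin(df1["abmn"])]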
I am trying to compare two different DataFrames that have the same column names and indexes (not numerical), and I need to obtain a third df containing, for each row and column, the biggest of the two values.
Example
df1 =
|            | col_1 | col2 | col-3 |
|------------|-------|------|-------|
| rft_12312  | 4     | 7    | 4     |
| rft_321321 | 3     | 4    | 1     |
df2 =
|            | col_1 | col2 | col-3 |
|------------|-------|------|-------|
| rft_12312  | 7     | 3    | 4     |
| rft_321321 | 3     | 7    | 6     |
Required result
|            | col_1 | col2 | col-3 |
|------------|-------|------|-------|
| rft_12312  | 7     | 7    | 4     |
| rft_321321 | 3     | 7    | 6     |
(col_1 of rft_12312 is 7 because df2's value at that [row : column] is greater than df1's; for rft_321321 the values are equal, so it doesn't matter which DataFrame the value comes from.)
I've already tried pd.update with filter_func defined as:
def filtration_function(val1, val2):
    if val1 >= val2:
        return val1
    else:
        return val2
but it is not working. I need the check to run for each pair of columns with the same name.
I also tried pd.compare, but it does not let me pick the right values.
Thank you in advance :)
I think one possibility would be to use combine. This method combines the two DataFrames column by column with a function of your choice; since the function receives whole columns (Series), it should return the element-wise maximum of the two.
Example:
import numpy as np
import pandas as pd

def filtration_function(val1, val2):
    # val1 and val2 are whole columns (Series), so take the element-wise maximum
    return np.maximum(val1, val2)

result = df1.combine(df2, filtration_function)
I think the method where can work too:
import pandas as pd
result = df1.where(df1 >= df2, df2)
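To make either answer easy to try, here is a small self-contained sketch that rebuilds the example DataFrames from the question and applies the where approach:
import pandas as pd
idx = ["rft_12312", "rft_321321"]
df1 = pd.DataFrame({"col_1": [4, 3], "col2": [7, 4], "col-3": [4, 1]}, index=idx)
df2 = pd.DataFrame({"col_1": [7, 3], "col2": [3, 7], "col-3": [4, 6]}, index=idx)
# keep df1's value where it is already the larger one, otherwise take df2's value
result = df1.where(df1 >= df2, df2)
print(result)
#             col_1  col2  col-3
# rft_12312       7     7      4
# rft_321321      3     7      6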
I have a table that looks like this; it is the stacked version of a crosstab, so each combination of item and period is unique:
+------+--------+-------+
| item | period | value |
+------+--------+-------+
| x | 1 | 6 |
| x | 2 | 4 |
| x | 3 | 5 |
| y | 1 | 9 |
| y | 2 | 10 |
| y | 3 | 100 |
+------+--------+-------+
For each item, I need to find the period with the lowest value, so the desired result is:
+------+--------+-------+
| item | period | value |
+------+--------+-------+
| x | 2 | 4 |
| y | 1 | 9 |
+------+--------+-------+
I have looked into pandas.DataFrame.idxmin() but it doesn't seem to be what I need.
I have found a way with groupby, min and merge but I was wondering if there is a more elegant solution?
I have found many similar questions related to R and SQL (my solution is in fact "SQLish"), but not to Python.
My solution is:
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['item'] = np.repeat(['x','y'],3)
df['period'] = np.tile( [1,2,3] ,2 )
df['value'] = [6,4,5,9,10,100]
min_value = df[['item','value']].groupby('item').min().reset_index(drop = False)
periods_with_min_value = pd.merge(min_value, df, how ='inner', on=['item','value'])
You can select the minimum-value row for each item directly with groupby and idxmin:
df.loc[df.groupby("item")["value"].idxmin()]
Out[12]:
item period value
1 x 2 4
3 y 1 9
Tested on pandas 1.1.3, python 3.7, debian 10 64-bit. No warning was emitted.
N.B. This solution won't work if there are repeated or corrupted index values. This can be resolved by calling .reset_index(drop=True) in advance.
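Another common idiom for this problem (offered here as an alternative sketch, not one of the answers above) is to sort by value and keep the first row per item, which also sidesteps the duplicate-index caveat:
import pandas as pd
df = pd.DataFrame({
    "item": ["x", "x", "x", "y", "y", "y"],
    "period": [1, 2, 3, 1, 2, 3],
    "value": [6, 4, 5, 9, 10, 100],
})
# sort so the smallest value per item comes first, then keep one row per item
result = df.sort_values("value").drop_duplicates("item").sort_values("item")
print(result)
#   item  period  value
# 1    x       2      4
# 3    y       1      9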
Set-up
I have two pandas data frames df1 and df2, each containing two columns with observations for id and its respective url,
df1:
| id | url |
|----|-----|
| 1  | url |
| 2  | url |
| 3  | url |
| 4  | url |
df2:
| id | url |
|----|-----|
| 2  | url |
| 4  | url |
| 3  | url |
| 5  | url |
| 6  | url |
Some observations are in both dfs, which is clear from the id column, e.g. observation 2 and its url are in both dfs.
The position of those 'double' observations does not have to be the same in both dfs, e.g. observation 2 is in the first row of df1 and the second row of df2.
Lastly, the dfs do not necessarily have the same number of observations, e.g. df1 has four observations while df2 has five.
Problem
I want to extract all observations that appear only in df2 and put them in a new df (df3), i.e. I want to obtain:
| id | url |
|----|-----|
| 5  | url |
| 6  | url |
How do I go about this?
I've tried this answer but cannot get it to work for my two-column dataframes.
I've also tried this other answer, but this gives me an empty common dataframe.
Possibly something like this: df3 = df2[~df2.id.isin(df1.id.tolist())]
ID numbers make a good index:
df1.index = df1.id
df2.index = df2.id
Then use the very straightforward index.difference:
diff_index = df2.index.difference(df1.index)
df3 = df2.loc[diff_index]
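Either approach can be checked against the example data. A small self-contained sketch (the id values come from the question; the url strings are placeholders):
import pandas as pd
df1 = pd.DataFrame({"id": [1, 2, 3, 4], "url": ["url"] * 4})
df2 = pd.DataFrame({"id": [2, 4, 3, 5, 6], "url": ["url"] * 5})
# rows of df2 whose id does not appear in df1
df3 = df2[~df2["id"].isin(df1["id"])]
print(df3)
#    id  url
# 3   5  url
# 4   6  url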
I'm sure what I'm trying to do is fairly simple for those with better knowledge of PD, but I'm simply stuck at transforming:
+---------+------------+-------+
| Trigger | Date | Value |
+---------+------------+-------+
| 1 | 01/01/2016 | a |
+---------+------------+-------+
| 2 | 01/01/2016 | b |
+---------+------------+-------+
| 3 | 01/01/2016 | c |
+---------+------------+-------+
...etc, into:
+------------+---------------------+---------+---------+---------+
| Date | #of triggers | count a | count b | count c |
+------------+---------------------+---------+---------+---------+
| 01/01/2016 | 3 | 1 | 1 | 1 |
+------------+---------------------+---------+---------+---------+
| 02/01/2016 | 5 | 2 | 1 | 2 |
+------------+---------------------+---------+---------+---------+
... and so on
The issue is, I've got no bloody idea of how to achieve this..
I've scoured SO, but I can't seem to find anything that applies to my specific case.
I presume I'd have to group it all by date, but then once that is done, what do I need to do to get the remaining columns?
The initial DF is loaded from an SQL Alchemy query object, and then I want to manipulate it to get the result I described above. How would one do this?
Thanks
df.groupby(['Date','Value']).count().unstack(level=-1)
You can use GroupBy.size with unstack; the groupby parameter sort=False can also be helpful:
df1 = df.groupby(['Date','Value'])['Value'].size().unstack(fill_value=0)
df1['Total'] = df1.sum(axis=1)
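# move the Total column to the front (Index.union sorts, and 'Total' sorts before the lowercase value names)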
cols = df1.columns[-1:].union(df1.columns[:-1])
df1 = df1[cols]
print (df1)
Value Total a b c
Date
01/01/2016 3 1 1 1
The difference between size and count is:
size counts NaN values, count does not.
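To get column labels matching the table in the question, the result can be renamed afterwards. A minimal sketch (the sample data and the "#of triggers" / "count a" names are taken from the question's tables):
import pandas as pd
df = pd.DataFrame({
    "Trigger": [1, 2, 3],
    "Date": ["01/01/2016", "01/01/2016", "01/01/2016"],
    "Value": ["a", "b", "c"],
})
out = df.groupby(["Date", "Value"]).size().unstack(fill_value=0)
out.insert(0, "#of triggers", out.sum(axis=1))  # total number of triggers per date
out = out.rename(columns={"a": "count a", "b": "count b", "c": "count c"}).reset_index()
out.columns.name = None
print(out)
#          Date  #of triggers  count a  count b  count c
# 0  01/01/2016             3        1        1        1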
So I have a dataframe with some values. This is my dataframe:
| in | x | y | z |
+----+---+---+---+
|  1 | a | a | b |
|  2 | a | b | b |
|  3 | a | b | c |
|  4 | b | b | c |
I would like to get the number of unique values in each row, and the number of values that are not equal to the value in column x. The result should look like this:
| in | x | y | z   | count of not x | unique |
+----+---+---+-----+----------------+--------+
|  1 | a | a | b   | 1              | 2      |
|  2 | a | b | b   | 2              | 2      |
|  3 | a | b | c   | 2              | 3      |
|  4 | b | b | nan | 0              | 1      |
I could come up with some dirty solutions here, but there must be a more elegant way of doing this. My mind is circling around drop_duplicates (which does not work on a Series); converting to an array and using .unique(); df.iterrows(), which I want to avoid; and .apply on each row.
Here are solutions using apply.
df['count of not x'] = df.apply(lambda x: (x[['y','z']] != x['x']).sum(), axis=1)
df['unique'] = df.apply(lambda x: x[['x','y','z']].nunique(), axis=1)
A non-apply solution for getting count of not x:
df['count of not x'] = (~df[['y','z']].isin(df['x'])).sum(axis=1)
Can't think of anything great for unique. This uses apply, but may be faster, depending on the shape of the data.
df['unique'] = df[['x','y','z']].T.apply(lambda x: x.nunique())
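In reasonably recent pandas versions, DataFrame.nunique also accepts axis=1, which gives a loop-free version of the unique column; by default it ignores NaN, matching the expected 1 for the last row. A minimal sketch (the last row's z is taken as NaN here, as in the result table above):
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "in": [1, 2, 3, 4],
    "x": ["a", "a", "a", "b"],
    "y": ["a", "b", "b", "b"],
    "z": ["b", "b", "c", np.nan],
})
# number of distinct non-NaN values across x, y and z in each row
df["unique"] = df[["x", "y", "z"]].nunique(axis=1)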