Python - pandas remove duplicate rows based on condition - python

I have a csv which has data that looks like this
id | code | date
-------------+-----------------------------
| 1 | 2 | 2022-10-05 07:22:39+00::00 |
| 1 | 0 | 2022-11-05 02:22:35+00::00 |
| 2 | 3 | 2021-01-05 10:10:15+00::00 |
| 2 | 0 | 2019-01-11 10:05:21+00::00 |
| 2 | 1 | 2022-01-11 10:05:22+00::00 |
| 3 | 2 | 2022-10-10 11:23:43+00::00 |
I want to remove duplicate id based on the following condition -
For code column, choose the value which is not equal to 0 and then choose one which is having latest timestamp.
Add another column prev_code, which contains list of all the remaining value of the code that's not present in code column.
Something like this -
id | code | prev_code
-------------+----------
| 1 | 2 | [0] |
| 2 | 1 | [0,2] |
| 3 | 2 | [] |

There is probably a sleeker solution but something along the following lines should work.
df = pd.read_csv('file.csv')
lastcode = df[df.code!=0].groupby('id').apply(lambda block: block[block['date'] == block['date'].max()]['code'])
prev_codes = df.groupby('id').agg(code=('code', lambda x: [val for val in x if val != lastcode[x.name].values[0]]))['code']
pd.DataFrame({'id': map(lambda x: x[0], lastcode.index.values), 'code': lastcode.values, 'prev_code': prev_codes.values})

Related

Comparing two Dataframes and creating a third one where certain contions are met

I am trying to compare two different dataframe that have the same column names and indexes(not numerical) and I need to obtain a third df with the biggest value for the row with the same column name.
Example
df1=
| | col_1 | col2 | col-3 |
| rft_12312 | 4 | 7 | 4 |
| rft_321321 | 3 | 4 | 1 |
df2=
| | col_1 | col2 | col-3 |
| rft_12312 | 7 | 3 | 4 |
| rft_321321 | 3 | 7 | 6 |
Required result
| | col_1 | col2 | col-3 |
| rft_12312 | 7 (because df2.value in this \[row :column\] \>df1.value) | 7 | 4 |
| rft_321321 | 3(when they are equal doesn't matter from which column is the value) | 7 | 6 |
I've already tried pd.update with filter_func defined as:
def filtration_function(val1,val2):
if val1 >= val2:
return val1
else:
return val2
but is not working. I need the check for each column with same name.
also pd.compare but does not allow me to pick the right values.
Thank you in advance :)
I think one possibility would be to use "combine". This method generates an element-wise comparsion between the two dataframes and returns the maximum value of each element.
Example:
import pandas as pd
def filtration_function(val1, val2):
return max(val1, val2)
result = df1.combine(df2, filtration_function)
I think method "where" can work to:
import pandas as pd
result = df1.where(df1 >= df2, df2)

How to apply a function on each group of data in a pandas group by

Suppose the data frame below:
|id |day | order |
|---|--- |-------|
| a | 2 | 6 |
| a | 4 | 0 |
| a | 7 | 4 |
| a | 8 | 8 |
| b | 11 | 10 |
| b | 15 | 15 |
I want to apply a function to day and order column of each group by rows on id column.
The function is:
def mean_of_differences(my_list):
return sum([ my_list[i] - my_list[i-1] for i in range(1, len(my_list))]) / len(my_list)
This function calculates mean of differences of each element and the next one. For example, for id=a, day would be 2+3+1 divided by 4. I know how to use lambda, but didn't find a way to implement this in a pandas group by. Also, each column should be ordered to get my desired output, so apparently it is not possible to sort by one column before group by
The output should be like this:
|id |day| order |
|---|---|-------|
| a |1.5| 2 |
| b | 2 | 2.5 |
Any one know how to do so in a group by?
First, sort your data by day then group by id and finally compute your diff/mean.
df = df.sort_values('day') \
.groupby('id') \
.agg({'day': lambda x: x.diff().fillna(0).mean()}) \
.reset_index()
Output:
>>> df
id day
0 a 1.5
1 b 2.0

For each category, how to find the value of a column corresponding to the minimum of another column?

I have a table that looks like this; it is the stacked version of a crosstab, so each combination of item and period is unique:
+------+--------+-------+
| item | period | value |
+------+--------+-------+
| x | 1 | 6 |
| x | 2 | 4 |
| x | 3 | 5 |
| y | 1 | 9 |
| y | 2 | 10 |
| y | 3 | 100 |
+------+--------+-------+
For each item, I need to find the period with the lowest value, so the desired result is:
+------+--------+-------+
| item | period | value |
+------+--------+-------+
| x | 2 | 4 |
| y | 1 | 9 |
+------+--------+-------+
I have looked into pandas.DataFrame.idxmin() but it doesn't seem to be what I need.
I have found a way with groupby, min and merge but I was wondering if there is a more elegant solution?
I have found many similar questions related to R and SQL (my solution is in fact "SQLish", but not to Python
My solution is:
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['item'] = np.repeat(['x','y'],3)
df['period'] = np.tile( [1,2,3] ,2 )
df['value'] = [6,4,5,9,10,100]
min_value = df[['item','value']].groupby('item').min().reset_index(drop = False)
periods_with_min_value = pd.merge(min_value, df, how ='inner', on=['item','value'])
df.loc[df.groupby("item")["value"].idxmin()]
Out[12]:
item period value
1 x 2 4
3 y 1 9
Tested on pandas 1.1.3, python 3.7, debian 10 64-bit. No warning was emitted.
N.B. This solution won't work if there were repeated or corrupted index values. This could be resolved by .reset_index(drop=True) in advance.

How to split a spark dataframe column string?

I have a dataframe which looks like this:
|--------------------------------------|---------|---------|
| path | content|
|------------------------------------------------|---------|
| /root/path/main_folder1/folder1/path1.txt | Val 1 |
|------------------------------------------------|---------|
| /root/path/main_folder1/folder2/path2.txt | Val 1 |
|------------------------------------------------|---------|
| /root/path/main_folder1/folder2/path3.txt | Val 1 |
|------------------------------------------------|---------|
I want to split the column values in path by "/" and get the values only until /root/path/mainfolder1
The Output that I want is
|--------------------------------------|---------|---------|---------------------------|
| path | content| root_path |
|------------------------------------------------|---------|---------------------------|
| /root/path/main_folder1/folder1/path1.txt | Val 1 | /root/path/main_folder1 |
|------------------------------------------------|---------|---------------------------|
| /root/path/main_folder1/folder2/path2.txt | Val 1 | /root/path/main_folder1 |
|------------------------------------------------|---------|---------------------------|
| /root/path/main_folder1/folder2/path3.txt | Val 1 | /root/path/main_folder1 |
|------------------------------------------------|---------|---------------------------|
I know that I have to use withColumn split and regexp_extract but I am not quiet getting how to limit the output of regexp_extract.
What is it that I have to do to get the desired output?
You can use a regular expression to extract the first three directory levels.
df.withColumn("root_path", F.regexp_extract(F.col("path"), "^((/\w*){3})",1))\
.show(truncate=False)
Output:
+-----------------------------------------+-------+-----------------------+
|path |content|root_path |
+-----------------------------------------+-------+-----------------------+
|/root/path/main_folder1/folder1/path1.txt|val 1 |/root/path/main_folder1|
|/root/path/main_folder1/folder2/path2.txt|val 2 |/root/path/main_folder1|
|/root/path/main_folder1/folder2/path3.txt|val 3 |/root/path/main_folder1|
+-----------------------------------------+-------+-----------------------+

Iterate pyspark dataframe rows and apply UDF

I have a dataframe that looks like this:
partitionCol orderCol valueCol
+--------------+----------+----------+
| partitionCol | orderCol | valueCol |
+--------------+----------+----------+
| A | 1 | 201 |
| A | 2 | 645 |
| A | 3 | 302 |
| B | 1 | 335 |
| B | 2 | 834 |
+--------------+----------+----------+
I want to group by the partitionCol, then within each partition to iterate over the rows, ordered by orderCol and apply some function to calculate a new column based on the valueCol and a cached value.
e.g.
def foo(col_value, cached_value):
tmp = <some value based on a condition between col_value and cached_value>
<update the cached_value using some logic>
return tmp
I understand I need to groupby the partitionCol and apply a UDF that will operate on each chink separately, but struggling to find a good way to iterate the rows and applying the logic I described, to get a desired output of:
+--------------+----------+----------+---------------+
| partitionCol | orderCol | valueCol | calculatedCol -
+--------------+----------+----------+---------------+
| A | 1 | 201 | C1 |
| A | 2 | 645 | C1 |
| A | 3 | 302 | C2 |
| B | 1 | 335 | C1 |
| B | 2 | 834 | C2 |
+--------------+----------+----------+---------------+
I think the best way for you to do that is to apply an UDF on the whole set of data :
# first, you create a struct with the order col and the valu col
df = df.withColumn("my_data", F.struct(F.col('orderCol'), F.col('valueCol'))
# then you create an array of that new column
df = df.groupBy("partitionCol").agg(F.collect_list('my_data').alias("my_data")
# finaly, you apply your function on that array
df = df.withColumn("calculatedCol", my_udf(F.col("my_data"))
But without knowing exactly what you want to do, that is all I can offer.

Categories

Resources