Pandas DataFrame drop consecutive duplicates - python

I want to modify drop_duplicates in such a way:
For example, I've got DataFrame with rows:
| A header | Another header |
| -------- | -------------- |
| First | el1 |
| Second | el2 |
| Second | el8 |
| First | el3 |
| Second | el4 |
| Second | el5 |
| First | el6 |
| Second | el9 |
I don't need to drop all duplicates, only consecutive ones. So as a result I want:
| A header | Another header |
| -------- | -------------- |
| First | el1 |
| Second | el2 |
| First | el3 |
| Second | el4 |
| First | el6 |
| Second | el9 |
I tried to do it with a for loop, but maybe there are better ways.

You can simply do it by using shift() as follows:
import pandas as pd
df = pd.DataFrame({
    'A header': ['First', 'Second', 'Second', 'First', 'Second', 'Second', 'First', 'Second'],
    'Another header': ['el1', 'el2', 'el8', 'el3', 'el4', 'el5', 'el6', 'el9'],
})
print(df)
"""
A header Another header
0 First el1
1 Second el2
2 Second el8
3 First el3
4 Second el4
5 Second el5
6 First el6
7 Second el9
"""
df2 = df[df['A header'] != df['A header'].shift(1)]
print(df2)
"""
A header Another header
0 First el1
1 Second el2
3 First el3
4 Second el4
6 First el6
7 Second el9
"""
Using shift(1), you can compare each row with the previous row.
For more information, see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html
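If a row should only count as a consecutive duplicate when every column repeats the previous row, a minimal sketch of the same shift() idea applied across all columns (it compares whole rows, not just 'A header'):
# keep rows that differ from the previous row in at least one column
df3 = df[(df != df.shift()).any(axis=1)]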

Extract the duplicates:
# collect the positions of rows that repeat the previous row's value in 'A header'
l = []
for i in range(len(df1) - 1):
    if df1['A header'][i] == df1['A header'][i + 1]:
        l.append(i + 1)
Drop the duplicates:
df1.drop(l, inplace=True)
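Note that drop(l) removes rows by index label, so this loop assumes the default RangeIndex (0 to n-1). If df1 was filtered beforehand, resetting the index first should keep labels and positions in sync, for example:
df1 = df1.reset_index(drop=True)  # align index labels with positions before building l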

Related

Check if PySpark column values exist in another dataframe's column values

I'm trying to figure out the condition to check whether the values of one PySpark dataframe exist in another PySpark dataframe, and if so, extract the value and compare again. I was thinking of using multiple withColumn() calls with a when() function.
For example my two dataframes can be something like:
df1
| id | value |
| ----- | ---- |
| hello | 1111 |
| world | 2222 |
df2
| id | value |
| ------ | ---- |
| hello | 1111 |
| world | 3333 |
| people | 2222 |
The result I wish to obtain is to first check whether the value of df1.id exists in df2.id and, if it does, return the df2.value. For example, I was trying something like:
df1 = df1.withColumn("df2_value", when(df1.id == df2.id, df2.value))
So I get something like:
df1
| id | value | df2_value |
| ----- | ---- | --------- |
| hello | 1111 | 1111 |
| world | 2222 | 3333 |
So that now I can do another check between these two value columns in the df1 dataframe and return a boolean column (1 or 0) in a new dataframe.
The result I wish to get would be something like:
df3
| id | value | df2_value | match |
| ----- | ---- | --------- | ----- |
| hello | 1111 | 1111 | 1 |
| world | 2222 | 3333 | 0 |
Left join df1 with df2 on id after prefixing all df2 columns except id with df2_*:
from pyspark.sql import functions as F
df1 = spark.createDataFrame([("hello", 1111), ("world", 2222)], ["id", "value"])
df2 = spark.createDataFrame([("hello", 1111), ("world", 3333), ("people", 2222)], ["id", "value"])
df = df1.join(
    df2.select("id", *[F.col(c).alias(f"df2_{c}") for c in df2.columns if c != 'id']),
    ["id"],
    "left"
)
Then, using functools.reduce, you can construct a boolean expression that checks whether the columns match in the two dataframes, like this:
from functools import reduce
check_expr = reduce(
    lambda acc, x: acc & (F.col(x) == F.col(f"df2_{x}")),
    [c for c in df1.columns if c != 'id'],
    F.lit(True)
)
df.withColumn("match", check_expr.cast("int")).show()
#+-----+-----+---------+-----+
#| id|value|df2_value|match|
#+-----+-----+---------+-----+
#|hello| 1111| 1111| 1|
#|world| 2222| 3333| 0|
#+-----+-----+---------+-----+
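Because this is a left join, an id present in df1 but missing from df2 ends up with a null df2_value, and the equality check then evaluates to null instead of 0. If unmatched ids should count as 0 (an assumption about the desired behaviour), one possible tweak is to wrap the expression in coalesce:
# treat rows with no df2 match as a non-match (0) rather than null
df.withColumn("match", F.coalesce(check_expr, F.lit(False)).cast("int")).show()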

PySpark: Get first character of each word in string

For an assignment I have been asked to shorten the names of clients to only the first letter of each name, where the names are separated by a space character.
I found a lot of solutions for this in Python, but I am not able to translate them to a dataframe.
The DF looks like this:
| ID | Name |
| -------- | -------------- |
| 1 | John Doe |
| 2 | Roy Lee Winters|
| 3 | Mary-Kate Baron|
My desired output would be:
| ID | Name | Shortened_name|
| -------- | -------- | -------------- |
| 1 | John Doe | JD |
| 2 | Roy Lee Winters | RLW |
| 3 | Mary-Kate Baron | MB |
I've had some results with the code below, but it does not work when there are more than 2 names. I would also like more 'flexible' code, as some people have 4 or 5 names while others have only 1.
df.withColumn("col1", F.substring(F.split(F.col("Name"), " ").getItem(0), 1, 1))\
.withColumn("col2", F.substring(F.split(F.col("Name"), " ").getItem(1), 1, 1))\
.withColumn('Shortened_name', F.concat('col1', 'col2'))
You can split the Name column, then use the transform function on the resulting array to get the first letter of each element:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1, "John Doe"), (2, "Roy Lee Winters"), (3, "Mary-Kate Baron")], ["ID", "Name"])
df1 = df.withColumn(
    "Shortened_name",
    F.array_join(F.expr("transform(split(Name, ' '), x -> left(x, 1))"), "")
)
df1.show()
# +---+---------------+--------------+
# | ID| Name|Shortened_name|
# +---+---------------+--------------+
# | 1| John Doe| JD|
# | 2|Roy Lee Winters| RLW|
# | 3|Mary-Kate Baron| MB|
# +---+---------------+--------------+
Or by using aggregate function:
df1 = df.withColumn(
    "Shortened_name",
    F.expr("aggregate(split(Name, ' '), '', (acc, x) -> acc || left(x, 1))")
)
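If you prefer to stay in the Python API and are on Spark 3.1+ (where pyspark.sql.functions.transform is available), a roughly equivalent sketch:
# take the first character of every space-separated word and join them
df1 = df.withColumn(
    "Shortened_name",
    F.array_join(F.transform(F.split("Name", " "), lambda x: x.substr(1, 1)), "")
)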

Python DataFrame - Select dataframe rows based on values in a column of same dataframe

I'm struggling with a dataframe-related problem.
columns = [desc[0] for desc in cursor.description]
data = cursor.fetchall()
df2 = pd.DataFrame(list(data), columns=columns)
df2 is as follows:
| Col1 | Col2 |
| -------- | -------------- |
| 2145779 | 2 |
| 8059234 | 3 |
| 2145779 | 3 |
| 4265093 | 2 |
| 2145779 | 2 |
| 1728234 | 5 |
I want to make a list of the values in Col1 where the value of Col2 is 3.
You can use boolean indexing:
out = df2.loc[df2.Col2.eq(3), "Col1"].agg(list)
print(out)
Prints:
[8059234, 2145779]
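An equivalent spelling uses tolist() on the filtered column; note that if Col2 comes back from the database as strings, you would compare against "3" instead of 3:
out = df2.loc[df2["Col2"] == 3, "Col1"].tolist()  # use == "3" if Col2 holds strings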

How to split a spark dataframe column string?

I have a dataframe which looks like this:
| path | content |
| ---- | ------- |
| /root/path/main_folder1/folder1/path1.txt | Val 1 |
| /root/path/main_folder1/folder2/path2.txt | Val 1 |
| /root/path/main_folder1/folder2/path3.txt | Val 1 |
I want to split the column values in path by "/" and keep only the part up to /root/path/main_folder1.
The output that I want is:
| path | content | root_path |
| ---- | ------- | --------- |
| /root/path/main_folder1/folder1/path1.txt | Val 1 | /root/path/main_folder1 |
| /root/path/main_folder1/folder2/path2.txt | Val 1 | /root/path/main_folder1 |
| /root/path/main_folder1/folder2/path3.txt | Val 1 | /root/path/main_folder1 |
I know that I have to use withColumn, split, and regexp_extract, but I am not quite getting how to limit the output of regexp_extract.
What is it that I have to do to get the desired output?
You can use a regular expression to extract the first three directory levels.
df.withColumn("root_path", F.regexp_extract(F.col("path"), r"^((/\w*){3})", 1))\
    .show(truncate=False)
Output:
+-----------------------------------------+-------+-----------------------+
|path |content|root_path |
+-----------------------------------------+-------+-----------------------+
|/root/path/main_folder1/folder1/path1.txt|val 1 |/root/path/main_folder1|
|/root/path/main_folder1/folder2/path2.txt|val 2 |/root/path/main_folder1|
|/root/path/main_folder1/folder2/path3.txt|val 3 |/root/path/main_folder1|
+-----------------------------------------+-------+-----------------------+
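If the prefix you need is always the first three path segments, substring_index is a possible alternative that avoids the regular expression; a sketch (the count is 4 because the leading slash creates an empty first segment):
# keep everything before the 4th "/" -> "/root/path/main_folder1"
df.withColumn("root_path", F.substring_index(F.col("path"), "/", 4)).show(truncate=False)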

Concat column values until the second-last valid transaction (non-zero or not null)

I want to concat the transactions up to the second-last valid transaction.
Suppose I have columns up to the 4th transaction, and I want to generate a sequence like the output below.
Note: Values in trans columns are categorical.
Input Data:
| Cust_id | trans_1 | trans_2 | trans_3 | trans_4 |
|------------|---------|---------|---------|---------|
| 1000026037 | 'a' | 'b' | 'd' | NaN |
| 1000026048 | 'm' | 'c' | NaN | NaN |
| 1000026081 | 'x' | 't' | 'y' | NaN |
| 1000026451 | 'r' | 'p' | NaN | 'u' |
Desired Output:
| Sequence |
|----------|
| 'a b' |
| 'm' |
| 'x t' |
| 'r p' |
Select the transaction columns, take the data up to the second-last non-zero value, and concatenate.
(df.filter(regex='trans_')
   .apply(lambda x: x.iloc[x.nonzero()].iloc[:-1], axis=1)
   .add(' ')
   .sum(axis=1)
   .str.strip())
OR
(df.filter(regex='trans_')
   .apply(lambda x: ' '.join(x.iloc[x.nonzero()].iloc[:-1]), axis=1))
Note: ensure all zeros are integer zeros and not string zeros, i.e. 0 and not '0':
df = df.replace({'0': 0})
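If the missing values are actually NaN, as in the sample data, rather than zeros, a minimal sketch of the same idea using dropna instead of nonzero:
# drop NaNs per row, discard the last valid transaction, join the rest with spaces
df["Sequence"] = (
    df.filter(regex="trans_")
      .apply(lambda row: " ".join(row.dropna().iloc[:-1]), axis=1)
)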
