I want to modify the behaviour of drop_duplicates in the following way:
For example, I've got a DataFrame with these rows:
| A header | Another header |
| -------- | -------------- |
| First | el1 |
| Second | el2 |
| Second | el8 |
| First | el3 |
| Second | el4 |
| Second | el5 |
| First | el6 |
| Second | el9 |
And I don't need to drop all duplicates, only consecutive ones. So as a result I want:
| A header | Another header |
| -------- | -------------- |
| First | el1 |
| Second | el2 |
| First | el3 |
| Second | el4 |
| First | el6 |
| Second | el9 |
I tried to do it with a for loop, but maybe there are better ways.
You can simply do it by using shift() as follows:
import pandas as pd
df = pd.DataFrame({
'A header': ['First', 'Second', 'Second', 'First', 'Second', 'Second', 'First', 'Second'],
'Another header': ['el1', 'el2', 'el8', 'el3', 'el4', 'el5', 'el6', 'el9'],
})
print(df)
"""
A header Another header
0 First el1
1 Second el2
2 Second el8
3 First el3
4 Second el4
5 Second el5
6 First el6
7 Second el9
"""
df2 = df[df['A header'] != df['A header'].shift(1)]
print(df2)
"""
A header Another header
0 First el1
1 Second el2
3 First el3
4 Second el4
6 First el6
7 Second el9
"""
Using shift(1), you can compare each row with the previous row.
For more information, see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html
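If it helps to see what shift(1) is doing, the comparison above builds a boolean mask that is False exactly at the consecutive duplicates; a small sketch using the same df:
mask = df['A header'] != df['A header'].shift(1)
print(mask.tolist())
# [True, True, False, True, True, False, True, True]
# the first row is compared against NaN, so it is always kept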
Alternatively, you can collect the positions of the consecutive duplicates in a loop and then drop them.
Extract the duplicates:
l = []
for i in range(len(df1) - 1):
    # record the index of the second of two consecutive equal values
    if df1['A header'][i] == df1['A header'][i + 1]:
        l.append(i + 1)
Drop the duplicates:
df1.drop(l, inplace=True)
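Applied to the example frame from the previous answer (with df1 = df.copy()), this yields the same six rows as the shift() solution. One caveat, which is my addition rather than part of the original answer: the loop uses positional labels, so it assumes df1 has a default RangeIndex (0, 1, 2, ...). If it does not, reset the index first:
df1 = df1.reset_index(drop=True)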
I'm trying to figure out the condition to check whether the values of one PySpark dataframe exist in another PySpark dataframe, and if so, extract the value and compare again. I was thinking of using multiple withColumn() calls with a when() function.
For example my two dataframes can be something like:
df1
| id | value |
| ----- | ---- |
| hello | 1111 |
| world | 2222 |
df2
| id | value |
| ------ | ---- |
| hello | 1111 |
| world | 3333 |
| people | 2222 |
The result I wish to obtain is to first check whether df1.id exists in df2.id and, if it does, return the corresponding df2.value. For example, I was trying something like:
df1 = df1.withColumn("df2_value", when(df1.id == df2.id, df2.value))
So I get something like:
df1
| id | value | df2_value |
| ----- | ---- | --------- |
| hello | 1111 | 1111 |
| world | 2222 | 3333 |
So that now I can do another check between these two value columns in the df1 dataframe, and return a boolean column (1 or 0) in a new dataframe.
The result I wish to get would be something like:
df3
| id | value | df2_value | match |
| ----- | ---- | --------- | ----- |
| hello | 1111 | 1111 | 1 |
| world | 2222 | 3333 | 0 |
Left join df1 with df2 on id after prefixing all df2 columns except id with df2_*:
from pyspark.sql import functions as F
df1 = spark.createDataFrame([("hello", 1111), ("world", 2222)], ["id", "value"])
df2 = spark.createDataFrame([("hello", 1111), ("world", 3333), ("people", 2222)], ["id", "value"])
df = df1.join(
df2.select("id", *[F.col(c).alias(f"df2_{c}") for c in df2.columns if c != 'id']),
["id"],
"left"
)
Then using functools.reduce you can construct a boolean expression to check if columns match in the 2 dataframes like this:
from functools import reduce
check_expr = reduce(
lambda acc, x: acc & (F.col(x) == F.col(f"df2_{x}")),
[c for c in df1.columns if c != 'id'],
F.lit(True)
)
df.withColumn("match", check_expr.cast("int")).show()
#+-----+-----+---------+-----+
#| id|value|df2_value|match|
#+-----+-----+---------+-----+
#|hello| 1111| 1111| 1|
#|world| 2222| 3333| 0|
#+-----+-----+---------+-----+
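One detail worth noting (my addition, not part of the answer above): with a left join, an id that has no match in df2 gets a null df2_value, so the equality check, and therefore match, is also null. If you would rather have a plain 0 in that case, you can wrap the expression in coalesce:
df.withColumn("match", F.coalesce(check_expr.cast("int"), F.lit(0))).show()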
For an assignment I have been asked to shorten clients' names to only the first letter of each part of the name, where the parts are separated by a space character.
I found a lot of solutions for this in Python, but I am not able to translate this to a dataframe.
The DF looks like this:
| ID | Name |
| -------- | -------------- |
| 1 | John Doe |
| 2 | Roy Lee Winters|
| 3 | Mary-Kate Baron|
My desired output would be:
| ID | Name | Shortened_name|
| -------- | -------- | -------------- |
| 1 | John Doe | JD |
| 2 | Roy Lee Winters | RLW |
| 3 | Mary-Kate Baron | MB |
I've had some results with the code below, but it does not work when there are more than 2 names. I would also like more 'flexible' code, as some people have 4 or 5 names while others only have 1.
df.withColumn("col1", F.substring(F.split(F.col("Name"), " ").getItem(0), 1, 1))\
.withColumn("col2", F.substring(F.split(F.col("Name"), " ").getItem(1), 1, 1))\
.withColumn('Shortened_name', F.concat('col1', 'col2'))
You can split the Name column, then use the transform function on the resulting array to get the first letter of each element:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1, "John Doe"), (2, "Roy Lee Winters"), (3, "Mary-Kate Baron")], ["ID", "Name"])
df1 = df.withColumn(
"Shortened_name",
F.array_join(F.expr("transform(split(Name, ' '), x -> left(x, 1))"), "")
)
df1.show()
# +---+---------------+--------------+
# | ID| Name|Shortened_name|
# +---+---------------+--------------+
# | 1| John Doe| JD|
# | 2|Roy Lee Winters| RLW|
# | 3|Mary-Kate Baron| MB|
# +---+---------------+--------------+
Or by using the aggregate function:
df1 = df.withColumn(
"Shortened_name",
F.expr("aggregate(split(Name, ' '), '', (acc, x) -> acc || left(x, 1))")
)
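If you prefer to stay in the Python DataFrame API instead of a SQL expression string, the same idea can be written with pyspark.sql.functions.transform (available since Spark 3.1); a sketch assuming the same df as above:
df1 = df.withColumn(
    "Shortened_name",
    F.array_join(
        # take the first character of every space-separated part of the name
        F.transform(F.split("Name", " "), lambda x: F.substring(x, 1, 1)),
        ""
    )
)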
I'm struggling with a dataframe-related problem.
columns = [desc[0] for desc in cursor.description]
data = cursor.fetchall()
df2 = pd.DataFrame(list(data), columns=columns)
df2 is as follows:
| Col1 | Col2 |
| -------- | -------------- |
| 2145779 | 2 |
| 8059234 | 3 |
| 2145779 | 3 |
| 4265093 | 2 |
| 2145779 | 2 |
| 1728234 | 5 |
I want to make a list of the values in Col1 where the value of Col2 is 3.
You can use boolean indexing:
out = df2.loc[df2.Col2.eq(3), "Col1"].agg(list)
print(out)
Prints:
[8059234, 2145779]
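out is already a plain Python list; .tolist() is an equivalent spelling. If Col2 happens to be stored as strings rather than integers (the question writes "3" in quotes), compare against the string instead:
out = df2.loc[df2.Col2.eq(3), "Col1"].tolist()
# if Col2 holds strings: df2.loc[df2.Col2.eq("3"), "Col1"].tolist()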
I have a dataframe which looks like this:
| path | content |
| ---- | ------- |
| /root/path/main_folder1/folder1/path1.txt | Val 1 |
| /root/path/main_folder1/folder2/path2.txt | Val 1 |
| /root/path/main_folder1/folder2/path3.txt | Val 1 |
I want to split the column values in path by "/" and keep only the part up to /root/path/main_folder1.
The output that I want is:
| path | content | root_path |
| ---- | ------- | --------- |
| /root/path/main_folder1/folder1/path1.txt | Val 1 | /root/path/main_folder1 |
| /root/path/main_folder1/folder2/path2.txt | Val 1 | /root/path/main_folder1 |
| /root/path/main_folder1/folder2/path3.txt | Val 1 | /root/path/main_folder1 |
I know that I have to use withColumn, split and regexp_extract, but I am not quite getting how to limit the output of regexp_extract.
What is it that I have to do to get the desired output?
You can use a regular expression to extract the first three directory levels.
df.withColumn("root_path", F.regexp_extract(F.col("path"), "^((/\w*){3})",1))\
.show(truncate=False)
Output:
+-----------------------------------------+-------+-----------------------+
|path |content|root_path |
+-----------------------------------------+-------+-----------------------+
|/root/path/main_folder1/folder1/path1.txt|val 1 |/root/path/main_folder1|
|/root/path/main_folder1/folder2/path2.txt|val 2 |/root/path/main_folder1|
|/root/path/main_folder1/folder2/path3.txt|val 3 |/root/path/main_folder1|
+-----------------------------------------+-------+-----------------------+
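If you would rather not use a regular expression, the same column can be built from split, slice and array_join (both array functions need Spark 2.4+); this is an alternative sketch, not part of the answer above:
df.withColumn(
    "root_path",
    # split("/root/path/main_folder1/...", "/") starts with an empty string,
    # so re-joining the first 4 elements gives "/root/path/main_folder1"
    F.array_join(F.slice(F.split("path", "/"), 1, 4), "/")
).show(truncate=False)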
I want to concatenate each customer's transactions up to their second last valid transaction.
Suppose I have columns up to the 4th transaction and I want to generate a sequence like the output below.
Note: Values in trans columns are categorical.
Input Data:
| Cust_id | trans_1 | trans_2 | trans_3 | trans_4 |
|------------|---------|---------|---------|---------|
| 1000026037 | 'a' | 'b' | 'd' | NaN |
| 1000026048 | 'm' | 'c' | NaN | NaN |
| 1000026081 | 'x' | 't' | 'y' | NaN |
| 1000026451 | 'r' | 'p' | NaN | 'u' |
Desired Output:
| Sequence |
|----------|
| 'a b' |
| 'm' |
| 'x t' |
| 'r p' |
Select the transaction columns, drop the missing values in each row, keep everything up to the second last valid transaction, and concatenate.
(df.filter(regex='trans_')
   .apply(lambda x: x.dropna().iloc[:-1], axis=1)
   .fillna('')
   .add(' ')
   .sum(axis=1)
   .str.strip())
OR
df['Sequence'] = df.filter(regex='trans_')\
    .apply(lambda x: ' '.join(x.dropna().iloc[:-1]), axis=1)
NOTE
This assumes the missing transactions are NaN, as in the sample data. If they are stored as zeros instead, convert them first, making sure string zeros ('0') are converted as well:
df = df.replace(['0', 0], np.nan)
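For reference, a runnable sketch with the sample rows above, assuming the missing transactions are NaN:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Cust_id': [1000026037, 1000026048, 1000026081, 1000026451],
    'trans_1': ['a', 'm', 'x', 'r'],
    'trans_2': ['b', 'c', 't', 'p'],
    'trans_3': ['d', np.nan, 'y', np.nan],
    'trans_4': [np.nan, np.nan, np.nan, 'u'],
})

# for each row: keep the valid transactions, drop the last one, join with spaces
df['Sequence'] = df.filter(regex='trans_')\
    .apply(lambda x: ' '.join(x.dropna().iloc[:-1]), axis=1)

print(df['Sequence'].tolist())
# ['a b', 'm', 'x t', 'r p']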