remove whitespace from end of dataframe - python

I want to remove trailing whitespace from all data in an Excel table using pandas in a Jupyter notebook.
For example:
| A header | Another header |
| -------- | -------------- |
| First**whitespace** | row |
| Second | row**whitespace** |
Expected output:
| A header | Another header |
| -------- | -------------- |
| First | row |
| Second | row |

If all columns are strings, use rstrip in DataFrame.applymap:
df = df.applymap(lambda x: x.rstrip())
Or use Series.str.rstrip per column in DataFrame.apply:
df = df.apply(lambda x: x.str.rstrip())
If some columns may be non-strings (non-object dtype), filter the column names first:
cols = df.select_dtypes(object).columns
df[cols] = df[cols].apply(lambda x: x.str.rstrip())
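A minimal runnable sketch tying these together (the data here is made up for illustration; note that on pandas 2.1+, DataFrame.map is the non-deprecated replacement for applymap):
import pandas as pd
# Sample frame with trailing whitespace in the string columns
df = pd.DataFrame({
    'A header': ['First   ', 'Second'],
    'Another header': ['row', 'row   '],
    'n': [1, 2],  # non-string column, must be skipped
})
# Strip trailing whitespace from the object (string) columns only
cols = df.select_dtypes(object).columns
df[cols] = df[cols].apply(lambda x: x.str.rstrip())
print(df.to_dict('list'))
# {'A header': ['First', 'Second'], 'Another header': ['row', 'row'], 'n': [1, 2]}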

Pandas DataFrame drop consecutive duplicates

I want to modify drop_duplicates in such a way:
For example, I've got DataFrame with rows:
| A header | Another header |
| -------- | -------------- |
| First | el1 |
| Second | el2 |
| Second | el8 |
| First | el3 |
| Second | el4 |
| Second | el5 |
| First | el6 |
| Second | el9 |
And I don't need to drop all duplicates, only consecutive ones. So as a result I want:
| A header | Another header |
| -------- | -------------- |
| First | el1 |
| Second | el2 |
| First | el3 |
| Second | el4 |
| First | el6 |
| Second | el9 |
I tried to do it with a for loop, but maybe there are better ways.
You can simply do it by using shift() as follows:
import pandas as pd
df = pd.DataFrame({
    'A header': ['First', 'Second', 'Second', 'First', 'Second', 'Second', 'First', 'Second'],
    'Another header': ['el1', 'el2', 'el8', 'el3', 'el4', 'el5', 'el6', 'el9'],
})
print(df)
"""
A header Another header
0 First el1
1 Second el2
2 Second el8
3 First el3
4 Second el4
5 Second el5
6 First el6
7 Second el9
"""
df2 = df[df['A header'] != df['A header'].shift(1)]
print(df2)
"""
A header Another header
0 First el1
1 Second el2
3 First el3
4 Second el4
6 First el6
7 Second el9
"""
Using shift(1), you can compare each row with the previous row.
For more information, see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html
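If rows should count as duplicates only when every column repeats, a hedged variation of the same idea compares the whole frame against its shifted self:
# Keep a row if it differs from the previous row in any column
df2 = df[df.ne(df.shift()).any(axis=1)]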
Another approach is to collect the indices of consecutive duplicates with a loop (this assumes a default RangeIndex). Extract the duplicates:
l = []
for i in range(len(df1) - 1):
    if df1['A header'][i] == df1['A header'][i + 1]:
        l.append(i + 1)
Then drop them:
df1.drop(l, inplace=True)

Check if PySpark column values exist in another dataframe column values

I'm trying to figure out how to check whether the values of one PySpark dataframe exist in another PySpark dataframe and, if so, extract the value and compare again. I was thinking of doing multiple withColumn() calls with a when() function.
For example my two dataframes can be something like:
df1
| id | value |
| ----- | ---- |
| hello | 1111 |
| world | 2222 |
df2
| id | value |
| ------ | ---- |
| hello | 1111 |
| world | 3333 |
| people | 2222 |
The result I wish to obtain: first check whether df1.id exists in df2.id and, if it does, return the corresponding df2.value. For example, I was trying something like:
df1 = df1.withColumn("df2_value", when(df1.id == df2.id, df2.value))
So I get something like:
df1
| id | value | df2_value |
| ----- | ---- | --------- |
| hello | 1111 | 1111 |
| world | 2222 | 3333 |
So that now I can do another check between these two value columns in the df1 dataframe, and return a boolean column (1 or 0) in a new dataframe.
The result I wish to get would be something like:
df3
| id | value | df2_value | match |
| ----- | ---- | --------- | ----- |
| hello | 1111 | 1111 | 1 |
| world | 2222 | 3333 | 0 |
Left join df1 with df2 on id after prefixing all df2 columns except id with df2_*:
from pyspark.sql import functions as F
df1 = spark.createDataFrame([("hello", 1111), ("world", 2222)], ["id", "value"])
df2 = spark.createDataFrame([("hello", 1111), ("world", 3333), ("people", 2222)], ["id", "value"])
df = df1.join(
    df2.select("id", *[F.col(c).alias(f"df2_{c}") for c in df2.columns if c != 'id']),
    ["id"],
    "left"
)
Then, using functools.reduce, you can construct a boolean expression that checks whether the columns match across the two dataframes:
from functools import reduce
check_expr = reduce(
    lambda acc, x: acc & (F.col(x) == F.col(f"df2_{x}")),
    [c for c in df1.columns if c != 'id'],
    F.lit(True)
)
df.withColumn("match", check_expr.cast("int")).show()
#+-----+-----+---------+-----+
#| id|value|df2_value|match|
#+-----+-----+---------+-----+
#|hello| 1111| 1111| 1|
#|world| 2222| 3333| 0|
#+-----+-----+---------+-----+
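If you only need to compare the single value column, a simpler sketch skips functools.reduce (note that ids with no match in df2 produce a null df2_value, so the comparison yields null rather than 0 for them):
df3 = df.withColumn("match", (F.col("value") == F.col("df2_value")).cast("int"))
df3.show()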

replacing string with a different string in pandas depending on value

I am practicing pandas and I have an exercise with which I have a problem.
I have an Excel file that has a column where two types of URLs are stored.
df = pd.DataFrame({'id': [None] * 4,
                   'url': ['www.something/12312', 'www.something/12343', 'www.somethingelse/42312', 'www.somethingelse/62343']})
| id | url |
| -------- | -------------- |
| | 'www.something/12312' |
| | 'www.something/12343' |
| | 'www.somethingelse/42312' |
| | 'www.somethingelse/62343' |
I am supposed to transform this into ids, but only the number should be part of the id. The new id column should look like this:
df = pd.DataFrame({'id': ['id_12312', 'id_12343', 'diffid_42312', 'diffid_62343'], 'url': ['www.something/12312', 'www.something/12343', 'www.somethingelse/42312', 'www.somethingelse/62343']})
| id | url |
| -------- | -------------- |
| id_12312 | 'www.something/12312' |
| id_12343 | 'www.something/12343' |
| diffid_42312 | 'www.somethingelse/42312' |
| diffid_62343 | 'www.somethingelse/62343' |
My problem is how to extract only the numbers and build that kind of id from them.
I have tried the replace and extract functions in pandas:
id_replaced = df.replace(regex={re.search('something', df['url']): 'id_' + str(re.search(r'\d+', i).group()), re.search('somethingelse', df['url']): 'diffid_' + str(re.search(r'\d+', i).group())})
df['id'] = df['url'].str.extract(re.search(r'\d+', df['url']).group())
However, they throw TypeError: expected string or bytes-like object.
Here is one solution, though I don't quite understand when you want the id_ prefix and when the diffid_ prefix:
>>> df.id = 'id_'+df.url.str.split('/', n=1, expand=True)[1]
>>> df
id url
0 id_12312 www.something/12312
1 id_12343 www.something/12343
2 id_42312 www.somethingelse/42312
3 id_62343 www.somethingelse/62343
Or using str.extract
>>> df.id = 'id_' + df.url.str.extract(r'/(\d+)$', expand=False)
(expand=False makes str.extract return a Series instead of a one-column DataFrame.)
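If the prefix should depend on the domain, as the expected output suggests, here is a hedged sketch using numpy.where (assuming 'somethingelse' in the URL is what marks the diffid_ rows):
import numpy as np
nums = df.url.str.extract(r'/(\d+)$', expand=False)
df['id'] = np.where(df.url.str.contains('somethingelse'), 'diffid_', 'id_') + nums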

Python DataFrame - Select dataframe rows based on values in a column of same dataframe

I'm struggling with a dataframe related problem.
columns = [desc[0] for desc in cursor.description]
data = cursor.fetchall()
df2 = pd.DataFrame(list(data), columns=columns)
df2 is as follows:
| Col1 | Col2 |
| -------- | -------------- |
| 2145779 | 2 |
| 8059234 | 3 |
| 2145779 | 3 |
| 4265093 | 2 |
| 2145779 | 2 |
| 1728234 | 5 |
I want to make a list of the values in Col1 where the value of Col2 is 3.
You can use boolean indexing:
out = df2.loc[df2.Col2.eq(3), "Col1"].agg(list)
print(out)
Prints:
[8059234, 2145779]
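Equivalently, .tolist() on the selected Series does the same thing:
out = df2.loc[df2["Col2"].eq(3), "Col1"].tolist()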

Reading csv file in Python Pandas with backslash and quotes as delimiter

I have a csv file with data which looks like below when seen in a notepad:
| Column A | Column B | Column C | Column D |
| -------- | -------- | -------- | -------- |
| "100053 | \"253\" | \"Apple\" | \"2020-01-01\" |
| "100056 | \"254\" | \"Apple\" | \"2020-01-01\" |
| "100063 | \"255\" | \"Choco\" | \"2020-01-01\" |
I tried this:
df = pd.read_csv("file_name.csv", sep='\t', low_memory=False)
But the output I'm getting is
| Column A | Column B | Column C | Column D |
| -------- | -------- | -------- | -------- |
| 100053\t\"253\" | \"Apple\" | \"2020-01-01\" | |
| 100056\t\"254\" | \"Apple\" | \"2020-01-01\" | |
| 100063\t\"255\" | \"Choco\" | \"2020-01-01\" | |
How can I get the output in proper format in the respective columns with all the extra characters removed?
I have tried different variations of delimiter and escapechar, but no luck. Maybe I'm missing something?
Edit: I figured out how to get rid of the extra characters:
df["ColumnB"] = df["ColumnB"].map(lambda x: str(x)[2:-2])
The above strips the leading two characters and the trailing two characters.
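For reference, a hedged alternative is to handle the escaping at read time and then strip any leftover quotes; whether this fits depends on the exact file, which isn't shown in full:
import pandas as pd
# Treat backslash as the escape character so \" collapses to "
df = pd.read_csv("file_name.csv", sep="\t", escapechar="\\", low_memory=False)
# Strip stray double quotes left in the string columns
cols = df.select_dtypes(object).columns
df[cols] = df[cols].apply(lambda s: s.str.strip('"'))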
