I want to remove whitespace after all data in an Excel table using pandas in a Jupyter notebook.
For example:
| A header | Another header |
| -------- | -------------- |
| First**whitespace** | row |
| Second | row**whitespace** |
output:
| A header | Another header |
| -------- | -------------- |
| First | row |
| Second | row |
If all columns are strings, use rstrip in DataFrame.applymap (renamed DataFrame.map in pandas 2.1+):
df = df.applymap(lambda x: x.rstrip())
Or use Series.str.rstrip per column via DataFrame.apply:
df = df.apply(lambda x: x.str.rstrip())
If some columns may be non-string (non-object), it is possible to filter the column names first:
cols = df.select_dtypes(object).columns
df[cols] = df[cols].apply(lambda x: x.str.rstrip())
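A minimal sketch on the example data (the trailing spaces and column names are taken from the question's table, everything else is assumed):

import pandas as pd
# hypothetical frame mirroring the question, with trailing whitespace in the values
df = pd.DataFrame({'A header': ['First ', 'Second'],
                   'Another header': ['row', 'row  ']})
# strip trailing whitespace only in the object (string) columns
cols = df.select_dtypes(object).columns
df[cols] = df[cols].apply(lambda x: x.str.rstrip())
print(df.to_dict('list'))
# {'A header': ['First', 'Second'], 'Another header': ['row', 'row']}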
I want to modify drop_duplicates in such a way:
For example, I've got a DataFrame with these rows:
| A header | Another header |
| -------- | -------------- |
| First | el1 |
| Second | el2 |
| Second | el8 |
| First | el3 |
| Second | el4 |
| Second | el5 |
| First | el6 |
| Second | el9 |
I need to drop not all duplicates, but only the consecutive ones. So as a result I want:
| A header | Another header |
| -------- | -------------- |
| First | el1 |
| Second | el2 |
| First | el3 |
| Second | el4 |
| First | el6 |
| Second | el9 |
I tried to do it with a for loop, but maybe there are better ways.
You can simply do it by using shift() as follows:
import pandas as pd
df = pd.DataFrame({
'A header': ['First', 'Second', 'Second', 'First', 'Second', 'Second', 'First', 'Second'],
'Another header': ['el1', 'el2', 'el8', 'el3', 'el4', 'el5', 'el6', 'el9'],
})
print(df)
"""
A header Another header
0 First el1
1 Second el2
2 Second el8
3 First el3
4 Second el4
5 Second el5
6 First el6
7 Second el9
"""
df2 = df[df['A header'] != df['A header'].shift(1)]
print(df2)
"""
A header Another header
0 First el1
1 Second el2
3 First el3
4 Second el4
6 First el6
7 Second el9
"""
Using shift(1), you can compare each row with the previous row.
For more information, see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html
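If consecutive duplicates should be judged on more than one key column, the same shift idea generalizes; a hedged sketch (the list of key columns here is hypothetical):

keys = ['A header']  # extend with more column names if needed
df2 = df[df[keys].ne(df[keys].shift()).any(axis=1)]
# a row is kept when at least one key column differs from the previous row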
Another option is to collect the index positions of consecutive duplicates with a loop and then drop them (this assumes df1 has the default 0..n-1 index, since the collected positions are dropped as index labels).
Extract the duplicates:
l = []
for i in range(len(df1) - 1):
    if df1['A header'][i] == df1['A header'][i + 1]:
        l.append(i + 1)
Drop them:
df1.drop(l, inplace=True)
I'm trying to figure out the condition to check whether the values of one PySpark dataframe exist in another PySpark dataframe, and if so, extract the value and compare it again. I was thinking of doing multiple withColumn() calls with a when() function.
For example my two dataframes can be something like:
df1
| id | value |
| ----- | ---- |
| hello | 1111 |
| world | 2222 |
df2
| id | value |
| ------ | ---- |
| hello | 1111 |
| world | 3333 |
| people | 2222 |
The result I wish to obtain is to first check whether the value of df1.id exists in df2.id and, if it does, return the corresponding df2.value. For example, I was trying something like:
df1 = df1.withColumn("df2_value", when(df1.id == df2.id, df2.value))
So I get something like:
df1
| id | value | df2_value |
| ----- | ---- | --------- |
| hello | 1111 | 1111 |
| world | 2222 | 3333 |
So that now I can do another check between these two value columns in the df1 dataframe, and return a boolean column (1 or 0) in a new dataframe.
The result I wish to get would be something like:
df3
| id | value | df2_value | match |
| ----- | ---- | --------- | ----- |
| hello | 1111 | 1111 | 1 |
| world | 2222 | 3333 | 0 |
Left join df1 with df2 on id after prefixing all df2 columns except id with df2_*:
from pyspark.sql import functions as F
df1 = spark.createDataFrame([("hello", 1111), ("world", 2222)], ["id", "value"])
df2 = spark.createDataFrame([("hello", 1111), ("world", 3333), ("people", 2222)], ["id", "value"])
df = df1.join(
df2.select("id", *[F.col(c).alias(f"df2_{c}") for c in df2.columns if c != 'id']),
["id"],
"left"
)
Then, using functools.reduce, you can construct a boolean expression that checks whether the value columns match between the two DataFrames:
from functools import reduce
check_expr = reduce(
lambda acc, x: acc & (F.col(x) == F.col(f"df2_{x}")),
[c for c in df1.columns if c != 'id'],
F.lit(True)
)
df.withColumn("match", check_expr.cast("int")).show()
#+-----+-----+---------+-----+
#| id|value|df2_value|match|
#+-----+-----+---------+-----+
#|hello| 1111| 1111| 1|
#|world| 2222| 3333| 0|
#+-----+-----+---------+-----+
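If only the single value column needs comparing, as in this example, a simpler hedged variant without functools.reduce would be:

df3 = df.withColumn("match", (F.col("value") == F.col("df2_value")).cast("int"))
df3.show()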
I am practicing pandas and I have an exercise I'm having a problem with.
I have an Excel file that has a column where two types of URLs are stored.
df = pd.DataFrame({'id': [None, None, None, None],
                   'url': ['www.something/12312', 'www.something/12343', 'www.somethingelse/42312', 'www.somethingelse/62343']})
| id | url |
| -------- | -------------- |
| | 'www.something/12312' |
| | 'www.something/12343' |
| | 'www.somethingelse/42312' |
| | 'www.somethingelse/62343' |
I am supposed to transform these into ids, but only the number should be part of the id. The new id column should look like this:
df = pd.DataFrame({'id': ['id_12312', 'id_12343', 'diffid_42312', 'diffid_62343'], 'url': ['www.something/12312', 'www.something/12343', 'www.somethingelse/42312', 'www.somethingelse/62343']})
| id | url |
| -------- | -------------- |
| id_12312 | 'www.something/12312' |
| id_12343 | 'www.something/12343' |
| diffid_42312 | 'www.somethingelse/42312' |
| diffid_62343 | 'www.somethingelse/62343' |
My problem is how to extract only the numbers and build that kind of id from them.
I have tried the replace and extract functions in pandas:
id_replaced = df.replace(regex={re.search('something', df['url']): 'id_' + str(re.search(r'\d+', i).group()), re.search('somethingelse', df['url']): 'diffid_' + str(re.search(r'\d+', i).group())})
df['id'] = df['url'].str.extract(re.search(r'\d+', df['url']).group())
However, they are throwing an error TypeError: expected string or bytes-like object.
Sorry for the tables in codeblock. The page was screaming that I have code that is not properly formatted when it was not in a codeblock.
Here is one solution, but I don't quite understand when you are supposed to use the id_ prefix and when to use diffid_:
>>> df.id = 'id_'+df.url.str.split('/', n=1, expand=True)[1]
>>> df
id url
0 id_12312 www.something/12312
1 id_12343 www.something/12343
2 id_42312 www.somethingelse/42312
3 id_62343 www.somethingelse/62343
Or using str.extract (expand=False returns a Series instead of a one-column DataFrame):
>>> df.id = 'id_' + df.url.str.extract(r'/(\d+)$', expand=False)
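If the prefix is supposed to depend on which kind of URL it is (a guess based on the expected output, where the 'somethingelse' rows get diffid_), a minimal sketch using numpy.where:

import numpy as np
num = df.url.str.extract(r'/(\d+)$', expand=False)
df['id'] = np.where(df.url.str.contains('somethingelse'), 'diffid_' + num, 'id_' + num)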
I'm struggling with a dataframe related problem.
columns = [desc[0] for desc in cursor.description]
data = cursor.fetchall()
df2 = pd.DataFrame(list(data), columns=columns)
df2 is as follows:
| Col1 | Col2 |
| -------- | -------------- |
| 2145779 | 2 |
| 8059234 | 3 |
| 2145779 | 3 |
| 4265093 | 2 |
| 2145779 | 2 |
| 1728234 | 5 |
I want to make a list of the values in Col1 where the value of Col2 is 3.
You can use boolean indexing:
out = df2.loc[df2.Col2.eq(3), "Col1"].agg(list)
print(out)
Prints:
[8059234, 2145779]
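An equivalent form uses tolist(); note that if Col2 is stored as strings rather than integers, compare with eq("3") instead:

out = df2.loc[df2["Col2"].eq(3), "Col1"].tolist()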
I have a CSV file with data that looks like the following when viewed in Notepad:
| Column A | Column B | Column C | Column D |
---------------------------------------------------
| "100053 | \"253\" | \"Apple\"| \"2020-01-01\" |
| "100056 | \"254\" | \"Apple\"| \"2020-01-01\" |
| "100063 | \"255\" | \"Choco\"| \"2020-01-01\" |
I tried this:
df = pd.read_csv("file_name.csv", sep='\t', low_memory=False)
But the output I'm getting is
| Column A | Column B | Column C | Column D |
-------------------------------------------------------------
| 100053\t\253\" | \"Apple\" | \"2020-01-01\"| |
| 100056\t\254\" | \"Apple\" | \"2020-01-01\"| |
| 100063\t\255\" | \"Choco\" | \"2020-01-01\"| |
How can I get the output in the proper format, in the respective columns, with all the extra characters removed?
I have tried different variations of delimiter and escapechar, but no luck. Maybe I'm missing something?
Edit: I figured out how to get rid of the extraneous characters:
df["ColumnB"] = df["ColumnB"].map(lambda x: str(x)[2:-2])
This gets rid of the leading two characters and the trailing two characters.
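If the other string columns carry the same stray backslashes and quotes, a hedged generalization of that workaround (assuming the junk only appears at the start and end of each value) is to strip those characters from every object column:

str_cols = df.select_dtypes(object).columns
df[str_cols] = df[str_cols].apply(lambda s: s.str.strip('\\"'))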