Multiple special character transformations on dataframe using Pandas [duplicate] - python

This question already has answers here:
Faster method of extracting characters for multiple columns in dataframe
(2 answers)
How to extract part of a string in Pandas column and make a new column
(3 answers)
Reference - What does this regex mean?
(1 answer)
Closed 2 months ago.
I wish to keep everything before the hyphen in one column, and keep everything before the colon in another column using Pandas.
Data
ID Type Stat
AA - type2 AAB:AB33:77:000 Y
CC - type3 CCC:AB33:77:000 N
Desired
ID Type
AA AAB
CC CCC
Doing
separator = '-'
result_1 = my_str.split(separator, 1)[0]
Any suggestion is appreciated

We can try using str.extract here:
df["ID"] = df["ID"].str.extract(r'(\w+)')
df["Type"] = df["Type"].str.extract(r'(\w+)')
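A runnable sketch of this answer on the question's sample data (passing expand=False so extract returns a Series):

```python
import pandas as pd

# Sample frame mirroring the question's data
df = pd.DataFrame({
    "ID": ["AA - type2", "CC - type3"],
    "Type": ["AAB:AB33:77:000", "CCC:AB33:77:000"],
})

# (\w+) captures the first run of word characters, i.e. everything
# before the first " - " in ID and the first ":" in Type
df["ID"] = df["ID"].str.extract(r'(\w+)', expand=False)
df["Type"] = df["Type"].str.extract(r'(\w+)', expand=False)
print(df)
```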

I would say (note: with assign the lambda receives the whole DataFrame, so the vectorized .str methods are needed rather than plain string split):
func1 = lambda d: d['ID'].str.split(' - ').str[0]
func2 = lambda d: d['Type'].str.split(':').str[0]
data\
    .assign(ID=func1)\
    .assign(Type=func2)
References
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html


Data frame renaming columns [duplicate]

This question already has answers here:
Remove or replace spaces in column names
(2 answers)
How can I make pandas dataframe column headers all lowercase?
(6 answers)
Closed 1 year ago.
data sample from CSV file
Model,Displ,Cyl,Trans,Drive,Fuel,Cert Region,Stnd,Stnd Description,Underhood ID,Veh Class,Air Pollution Score,City MPG,Hwy MPG,Cmb MPG,Greenhouse Gas Score,SmartWay,Comb CO2
ACURA RDX,3.5,6,SemiAuto-6,2WD,Gasoline,FA,T3B125,Federal Tier 3 Bin 125,JHNXT03.5GV3,small SUV,3,20,28,23,5,No,386
import pandas as pd
df_18 = pd.read_csv('file name')
request:
Rename all column labels to replace spaces with underscores and convert everything to lowercase.
the code below didn't work, and I don't know why:
df_18.rename(str.lower().str.strip().str.replace(" ","_"),axis=1,inplace=True)
You can assign a list of names directly to pandas.DataFrame.columns: perform the required operations (lower, strip, and replace) in a list comprehension over the column names, and assign the result back to df_18.columns:
df_18.columns = [col.lower().strip().replace(" ","_") for col in df_18]
OUTPUT:
model displ cyl ... greenhouse_gas_score smartway comb_co2
0 ACURA RDX 3.5 6 ... 5 No 386
[1 rows x 18 columns]
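For what it's worth, the original attempt fails because str.lower() is called with no argument; rename expects a function (or mapping) to apply to each label. A sketch passing a lambda instead, on an invented subset of the headers:

```python
import pandas as pd

# Toy frame with a few of the problematic headers
df_18 = pd.DataFrame(columns=["Cert Region", "Air Pollution Score", "Model"])

# Pass a function for rename to apply to every column label
# (note: a function object, not the result of calling one)
df_18.rename(columns=lambda c: c.strip().lower().replace(" ", "_"),
             inplace=True)
print(df_18.columns.tolist())  # ['cert_region', 'air_pollution_score', 'model']
```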
There are many ways to rename columns; see:
reference for renaming columns
reference for replace string
You can use the code below:
df_18.columns = [col.lower().replace(" ", "_") for col in df_18.columns]
Or rename one column at a time (note: this moves each renamed column to the end):
for column in list(df_18.columns):  # iterate over a copy, since the loop mutates the columns
    new_column_name = column.lower().strip().replace(" ", "_")
    if new_column_name != column:
        df_18[new_column_name] = df_18[column]
        del df_18[column]

Pandas: How to remove character that include non english characters? [duplicate]

This question already has answers here:
Remove non-ASCII characters from pandas column
(8 answers)
Closed 1 year ago.
In my DF there are values like الÙجيرة in different columns. How can I remove such values? I am reading the data from an excel file. So on reading, if we could do something then that will be great.
Also, I have some values like Battery ÁÁÁ so I want it to be Battery, So how can I delete these non-English characters but keep other content?
You can use regex to remove designated characters from your strings:
import re
import pandas as pd
records = [{'name':'Foo الÙجيرة'}, {'name':'Battery ÁÁÁ'}]
df = pd.DataFrame.from_records(records)
# Keep alphanumerics and spaces (add additional characters as needed);
# note [^A-Za-z0-9 ], not [^A-z0-9 ]: the range A-z also matches punctuation
pattern = re.compile('[^A-Za-z0-9 ]+')
def clean_text(string):
    # sub (not search) replaces every disallowed character with ''
    return pattern.sub('', string).strip()
# Apply to your df
df['clean_name'] = df['name'].apply(clean_text)
name clean_name
0 Foo الÙجيرة Foo
1 Battery ÁÁÁ Battery
For more solutions, you can read this SO Q: Python, remove all non-alphabet chars from string
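An alternative sketch that drops every non-ASCII byte instead of enumerating allowed characters, using pandas' vectorized .str accessor:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Foo الÙجيرة', 'Battery ÁÁÁ']})

# Encode to ASCII, silently dropping anything that can't be encoded,
# then decode back to str and trim leftover whitespace
df['clean_name'] = (df['name']
                    .str.encode('ascii', 'ignore')
                    .str.decode('ascii')
                    .str.strip())
print(df['clean_name'].tolist())  # ['Foo', 'Battery']
```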
You can slice each value with a lambda, or use the split method:
df[column_name] = df[column_name].apply(lambda value: value[start:stop])
# e.g. df['location'] = df['location'].apply(lambda location: location[0:4])
Split Method
df[column_name] = df[column_name].apply(lambda value: value.split(' ')[0])  # split on a space (or any separator) and keep the first piece

Overwriting values using .loc [duplicate]

This question already has answers here:
Try to replace a specific value in a dataframe, but it does not overwrite it
(1 answer)
Changing values in pandas dataframe does not work
(1 answer)
Closed 2 years ago.
I want to conditionally overwrite some values for a given column in my DataFrame using this command
enq.dropna().loc[q16.apply(lambda x: x[:3].lower()) == 'oui', q16_] = 'OUI' # q16 = enq[column_name].dropna()
which has the form
df.dropna().loc[something == something_else, column_name] = new_value
I don't get any error but when I check the result, I see that nothing has changed.
Thanks for reading and helping.
Your problem is that dropna() returns a new DataFrame, a copy of enq, so the assignment lands on that copy rather than on enq. Do it in two steps:
enq.dropna(inplace=True)
enq.loc[q16.apply(lambda x: x[:3].lower()) == 'oui', q16_] = 'OUI'
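A minimal, self-contained illustration of the two-step fix (column names invented for the sketch):

```python
import pandas as pd

# Toy frame standing in for enq
enq = pd.DataFrame({'q16': ['Oui, merci', None, 'non'],
                    'verdict': ['', '', '']})

# An assignment chained onto enq.dropna().loc[...] would hit a throwaway
# copy; mutating enq itself after dropna works as intended
enq = enq.dropna()
enq.loc[enq['q16'].str[:3].str.lower() == 'oui', 'verdict'] = 'OUI'
print(enq)
```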

How can I use multiple .contains() inside a .when() in pySpark? [duplicate]

This question already has answers here:
PySpark: multiple conditions in when clause
(4 answers)
Closed 3 years ago.
I am trying to create classes in a new column, based on existing words in another column. For that, I need to combine multiple .contains() conditions, but none of the variants I tried works.
def classes_creation(data):
    df = data.withColumn("classes", when(data.where(F.col("MISP_RFW_Title").like('galleys') | F.col("MISP_RFW_Title").like('coffee')), "galleys")).otherwise(lit(na))
    return df
# RETURNS ERROR

def classes_creation(data):
    df = data.withColumn("classes", when(col("MISP_RFW_Title").contains("galleys").contains("word"), 'galleys').otherwise(lit(na)))
    return df
# RETURNS COLUMN OF NA ONLY

def classes_creation(data):
    df = data.withColumn("classes", when(col("MISP_RFW_Title").contains("galleys" | "word"), 'galleys').otherwise(lit(na)))
    return df
# RETURNS COLUMN OF NA ONLY
If I understood your requirements correctly, you can use regex for matching with rlike
data.withColumn("classes", when(col("MISP_RFW_Title").rlike("galleys|word"), 'galleys').otherwise('a'))
or maybe if you have different columns, you can use something like this
data.withColumn("classes", when((col("MISP_RFW_Title").contains("galleys")|col("MISP_RFW_Title").contains("word")), 'galleys').otherwise('a'))
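For readers without a Spark session handy, the same rlike idea can be sketched in plain pandas, where str.contains with regex=True plays the role of rlike (sample data invented):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'MISP_RFW_Title': ['two galleys', 'coffee maker', 'engine']})

# str.contains with a regex alternation matches like Spark's rlike;
# np.where stands in for when(...).otherwise(...)
df['classes'] = np.where(
    df['MISP_RFW_Title'].str.contains('galleys|coffee', regex=True),
    'galleys', 'a')
print(df['classes'].tolist())  # ['galleys', 'galleys', 'a']
```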

Python data frames - how to select all columns that have a specific substring in their name [duplicate]

This question already has answers here:
Find column whose name contains a specific string
(8 answers)
Closed 7 years ago.
In Python I have a data frame (df) that contains columns named A_OPEN, A_CLOSE, B_OPEN, B_CLOSE, C_OPEN, C_CLOSE, D_ etc.
How can I easily select only the columns whose name contains _CLOSE? A, B, C, D, E, F etc. can be anything, so I do not want to rely on the specific column names.
In SQL this would be done with the like operator: df[like'%_CLOSE%']
What's the python way?
You could use a list comprehension, e.g.:
df[[x for x in df.columns if "_CLOSE" in x]]
Example:
df = pd.DataFrame(
columns = ['_CLOSE_A', '_CLOSE_B', 'C'],
data = [[2,3,4], [3,4,5]]
)
Then,
>>> print(df[[x for x in df.columns if "_CLOSE" in x]])
   _CLOSE_A  _CLOSE_B
0         2         3
1         3         4
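pandas also ships a built-in shortcut for exactly this: DataFrame.filter with the like (substring) or regex argument:

```python
import pandas as pd

df = pd.DataFrame(columns=['A_OPEN', 'A_CLOSE', 'B_OPEN', 'B_CLOSE'],
                  data=[[1, 2, 3, 4]])

# like= keeps every column whose name contains the substring
closes = df.filter(like='_CLOSE')
print(closes.columns.tolist())  # ['A_CLOSE', 'B_CLOSE']

# regex= is available for more complex patterns, e.g. df.filter(regex=r'_CLOSE$')
```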
