Python: how to check entries with white spaces in a dataframe?

I have a dataframe df containing car brand information. For instance,
df['Car_Brand'][1]
'HYUNDAI '
where every entry is padded to the same length: len(df['Car_Brand'][1]) == 30. I can also have entries containing only white spaces.
df['Car_Brand']
0 TOYOTA
1 HYUNDAI
2
3
4
5 OPEL
6
7 JAGUAR
where
df['Car_Brand'][2]
' '
I would like to drop all the whitespace-only entries from the dataframe and trim the others. In the end:
df['Car_Brand'][1]
'HYUNDAI '
becomes
df['Car_Brand'][1]
'HYUNDAI'
I started to remove the white spaces this way:
tmp = df['Car_Brand'].str.replace(" ","")

and tried using str.strip and converting to bool to filter out the empty ones:
df['Car_Brand'] = df['Car_Brand'].str.strip()
df[df['Car_Brand'].astype(bool)]

It seems you need:
s = df['Car_Brand']
s1 = s[s != ''].reset_index(drop=True)
#if multiple whitespaces
#s1 = s[s.str.strip() != ''].reset_index(drop=True)
print (s1)
0 TOYOTA
1 HYUNDAI
2 OPEL
3 JAGUAR
Name: Car_Brand, dtype: object
If there are multiple whitespaces:
s = df[~df['Car_Brand'].str.contains(r'^\s+$')]
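Putting both steps together, here is a minimal, self-contained sketch of the whole clean-up (the sample values are illustrative, space-padded like the data above):
import pandas as pd

df = pd.DataFrame({'Car_Brand': ['TOYOTA    ', 'HYUNDAI   ', '          ',
                                 'OPEL      ', '          ', 'JAGUAR    ']})

# Trim the padding, then keep only the rows that are not empty afterwards
stripped = df['Car_Brand'].str.strip()
df = df[stripped != ''].assign(Car_Brand=stripped).reset_index(drop=True)
print(df)
#   Car_Brand
# 0    TOYOTA
# 1   HYUNDAI
# 2      OPEL
# 3    JAGUAR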

Related

Python/Pandas:How to process a column of data when it meets certain criteria

I have a csv like this:
userlabel|country
SZ5GZTD_[56][13631808]|russia
YZ5GZTC-3_[51][13680735]|uk
XZ5GZTA_12-[51][13574893]|usa
testYZ5GZWC_11-[51][13632101]|cuba
I use pandas to read this csv. I'd like to add a new column ci whose value comes from userlabel, where the following conditions must be met:
convert values to lowercase
start with 'yz' or 'testyz'
The code is like this:
(df['userlabel'].str.lower()).str.extract(r"(test)?([a-z]+).*", expand=True)[1]
When it matches, ci is the number between the first "-" or "_" and the second "-" or "_" in userlabel.
The pseudo-code is like this:
ci = (userlabel, r'.*(\_|\-)(\d+)(\_|\-).*', 2)
Finally, the result should look like this:
userlabel ci country
SZ5GZTD_[56][13631808] russia
YZ5GZTC-3_[51][13680735] 3 uk
XZ5GZTA_12-[51][13574893] usa
testYZ5GZWC_11-[51][13632101] 11 cuba
You can use
import pandas as pd
df = pd.DataFrame({'userlabel':['SZ5GZTD_[56][13631808]','YZ5GZTC-3_[51][13680735]','XZ5GZTA_12-[51][13574893]','testYZ5GZWC_11-[51][13632101]'], 'country':['russia','uk','usa','cuba']})
df['ci'] = df['userlabel'].str.extract(r"(?i)^(?:yz|testyz)[^_-]*[_-](\d+)[-_]", expand=True)
>>> df['ci']
0 NaN
1 3
2 NaN
3 11
Name: ci, dtype: object
# To rearrange columns, add the following line:
df = df[['userlabel', 'ci', 'country']]
>>> df
userlabel ci country
0 SZ5GZTD_[56][13631808] NaN russia
1 YZ5GZTC-3_[51][13680735] 3 uk
2 XZ5GZTA_12-[51][13574893] NaN usa
3 testYZ5GZWC_11-[51][13632101] 11 cuba
See the regex demo.
Regex details:
(?i) - make the pattern case insensitive (no need to use str.lower())
^ - start of string
(?:yz|testyz) - a non-capturing group matching either yz or testyz
[^_-]* - zero or more chars other than _ and -
[_-] - the first _ or -
(\d+) - Group 1 (Series.str.extract requires a capturing group, since it returns only the captured substring): one or more digits
[-_] - a - or _.
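A quick standalone check of the pattern with the re module (a sketch using the same sample strings):
import re

pattern = re.compile(r'(?i)^(?:yz|testyz)[^_-]*[_-](\d+)[-_]')
for s in ['SZ5GZTD_[56][13631808]', 'YZ5GZTC-3_[51][13680735]',
          'XZ5GZTA_12-[51][13574893]', 'testYZ5GZWC_11-[51][13632101]']:
    m = pattern.search(s)
    print(s, '->', m.group(1) if m else None)
# SZ5GZTD_[56][13631808] -> None
# YZ5GZTC-3_[51][13680735] -> 3
# XZ5GZTA_12-[51][13574893] -> None
# testYZ5GZWC_11-[51][13632101] -> 11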
An alternative using re.findall with apply:
import re
def get_val(s):
    l = re.findall(r'^(YZ|testYZ).*[_-](\d+)[_-].*', s)
    return None if len(l) == 0 else l[0][1]
df['ci'] = df['userlabel'].apply(lambda x: get_val(x))
df = df[['userlabel', 'ci', 'country']]
userlabel ci country
0 SZ5GZTD_[56][13631808] None russia
1 YZ5GZTC-3_[51][13680735] 3 uk
2 XZ5GZTA_12-[51][13574893] None usa
3 testYZ5GZWC_11-[51][13632101] 11 cuba

How do I iterate through a Python data frame where I occasionally need to replace two rows with one row, in-place?

The "in-place" is the aspect I struggle with- creating a brand new data frame is solved (provided at the end). My specific issue is that my imported data occasionally splits one column's string into two substrings, placing the first substring on one row with its other columns of data and placing the second substring on the following row with NaN values for its other columns.
This is what the data frame should look like:
Actor Color Number
0 Amy Adams red 1
1 Bill Burr orange 2
2 Courtney Cox yellow 3
3 Danny DeVito green 4
4 Emilio Estevez blue 5
This is what my imported data frame initially looks like, where "Courtney Cox" and "Emilio Estevez" have been split into two rows. I provided the code to create this data frame. (Don't worry about the shift from integer to float; it's irrelevant.)
Actor Color Number
0 Amy Adams red 1.0
1 Bill Burr orange 2.0
2 Courtney yellow 3.0
3 Cox NaN NaN
4 Danny DeVito green 4.0
5 Emilio blue 5.0
6 Estevez NaN NaN
import numpy as np
import pandas as pd

bad_df = pd.DataFrame({'Actor': ['Amy Adams','Bill Burr','Courtney','Cox','Danny DeVito','Emilio','Estevez'],
                       'Color': ['red','orange','yellow',np.nan,'green','blue',np.nan],
                       'Number': [1,2,3,np.nan,4,5,np.nan]})
I do have access to the correct list for the Actor column.
actor_list = ['Amy Adams','Bill Burr','Courtney Cox','Danny DeVito','Emilio Estevez']
My data frames are actually pretty small, so copying the data frame or creating a separate data frame isn't a problem, but it seems like I should be able to perform my fix in-place.
Here's my current approach (iteratively creating a new data frame), but it seems sloppy. I iterate through a zip where each element consists of the index of a row, the row's Actor string, and the next row's Actor string. However, I have to do the last row outside of the loop so I don't look for a "next row" that doesn't exist.
new_df = pd.DataFrame()
for a1idx, a1, a2 in zip(bad_df.iloc[:-1,0].index, bad_df.iloc[:-1,0], bad_df.iloc[1:,0]):
    if a1 in actor_list:  # First and last name are in this row
        new_df = new_df.append(bad_df.iloc[a1idx,:])  # Add row
    elif a1 + ' ' + a2 in actor_list:  # First and last name are in consecutive rows
        new_df = new_df.append(bad_df.iloc[a1idx,:])  # Add row
        new_df.iloc[-1,0] = a1 + ' ' + a2  # Correct name in row
    # If neither of the above conditions is met, we're inefficiently looking at a
    # row with just a last name, which was dealt with in the previous iteration
if bad_df.iloc[-1,0] in actor_list:  # Check very last row of data frame
    new_df = new_df.append(bad_df.iloc[-1,:])  # Add row
Is there a way to do this in-place?
Would that be a better way?
import numpy as np
import pandas as pd

bad_df = pd.DataFrame({'Actor': ['Amy Adams','Bill Burr','Courtney','Cox','Danny DeVito','Emilio','Estevez'],
                       'Color': ['red','orange','yellow',np.nan,'green','blue',np.nan],
                       'Number': [1,2,3,np.nan,4,5,np.nan]})
actor_list = ['Amy Adams','Bill Burr','Courtney Cox','Danny DeVito','Emilio Estevez']
nan_index = bad_df['Color'].isna()
bad_df.loc[nan_index, 'last_names'] = bad_df['Actor'][nan_index]
bad_df['last_names'] = bad_df['last_names'].shift(-1)
mask = pd.Series(nan_index).shift(-1, fill_value=False)
bad_df.loc[mask, 'Actor'] = bad_df['Actor'].str.cat(bad_df['last_names'], sep=' ')
bad_df.drop('last_names', axis=1, inplace=True)
bad_df = bad_df[~nan_index]
print(bad_df)
Output:
Actor Color Number
0 Amy Adams red 1.0
1 Bill Burr orange 2.0
2 Courtney Cox yellow 3.0
4 Danny DeVito green 4.0
5 Emilio Estevez blue 5.0
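For comparison, here is a minimal sketch of the same repair without explicit iteration. It assumes, as in the data above, that every spill row carries NaN in the non-Actor columns and belongs to the row directly above it:
import numpy as np
import pandas as pd

bad_df = pd.DataFrame({'Actor': ['Amy Adams','Bill Burr','Courtney','Cox','Danny DeVito','Emilio','Estevez'],
                       'Color': ['red','orange','yellow',np.nan,'green','blue',np.nan],
                       'Number': [1,2,3,np.nan,4,5,np.nan]})

# Rows with NaN in Color hold the second half of the previous row's name.
# Give each spill row the same group id as the row above it, then merge.
is_spill = bad_df['Color'].isna()
group_id = (~is_spill).cumsum()
fixed = (bad_df.groupby(group_id)
               .agg({'Actor': ' '.join, 'Color': 'first', 'Number': 'first'})
               .reset_index(drop=True))
print(fixed)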

How to replace words with different case in PANDAS dataframe

I have a really big data frame and it contains a specific column "city" with multiple cities repeated in different cases, e.g.
City
Gurgaon
GURGAON
gurgaon
Chennai
CHENNAI
Banglore
Hydrabad
BANGLORE
HYDRABAD
...
Is there a way to replace all occurrences of the same city written in different cases with a single name?
There are about 3k rows in the column, so doing it manually is not possible.
Edit -
The city column of the DF also contains cities like
'Gurgaon'
'GURGAON'
'gurgaon ' #there is a white space at the end
I want all of them changed to the same name, with the trailing whitespace removed as well, so that the output is:
'Gurgaon'
'Gurgaon'
'Gurgaon' #no white space at the end
Thanks
Here is how you can use str.strip() to remove trailing whitespaces, and then use str.title():
import pandas as pd
df = pd.DataFrame({'City': ["Gurgaon",
                            "GURGAON",
                            "gurgaon",
                            "Chennai",
                            "CHENNAI",
                            "Banglore",
                            "Hydrabad",
                            "BANGLORE",
                            "HYDRABAD"]})
df['City'] = df['City'].str.strip()
df['City'] = df['City'].str.title()
print(df)
Output:
City
0 Gurgaon
1 Gurgaon
2 Gurgaon
3 Chennai
4 Chennai
5 Banglore
6 Hydrabad
7 Banglore
8 Hydrabad
First, change the cities to have the same format:
df.city = df.city.apply(lambda x: x.capitalize())
Then, remove duplicates (note that drop_duplicates returns a new frame, so assign the result back):
df = df.drop_duplicates()
(I assume the rest of the columns are equal.)
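If capitalization alone cannot unify the values (e.g. genuinely different spellings of the same city), one option is to normalize first and then map to canonical names. A sketch, where the mapping dictionary is hypothetical and would need to be filled in for your data:
import pandas as pd

df = pd.DataFrame({'City': ['Gurgaon', 'GURGAON', 'gurgaon ', 'Banglore', 'BANGLORE']})

# Hypothetical canonical-name map, keyed on the normalized spelling
canonical = {'banglore': 'Bangalore'}

normalized = df['City'].str.strip().str.lower()
df['City'] = normalized.map(canonical).fillna(normalized.str.title())
print(df)
#         City
# 0    Gurgaon
# 1    Gurgaon
# 2    Gurgaon
# 3  Bangalore
# 4  Bangalore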

How to condense text cleaning steps into single Python function?

Newer programmer here, deeply appreciate any help this knowledgeable community is willing to provide.
I have a column of 140,000 text strings (company names) in a pandas dataframe on which I want to strip all whitespace in and around the strings, remove all punctuation, substitute specific substrings, and uniformly transform to lowercase. I then want to take the first 10 characters (elements 0:10) of each string and store them in a new dataframe column.
Here is a reproducible example.
import string
import pandas as pd
data = ["West Georgia Co",
        "W.B. Carell Clockmakers",
        "Spine & Orthopedic LLC",
        "LRHS Saint Jose's Grocery",
        "Optitech#NYCityScape"]
df = pd.DataFrame(data, columns=['co_name'])

def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

# applying the remove_punctuations function
df['co_name_transform'] = df['co_name'].apply(remove_punctuations)
# this next step replaces 'Saint' with 'st' to standardize,
# and I may want to make other substitutions but this is a common one.
df['co_name_transform'] = df.co_name_transform.str.replace('Saint', 'st')
# replace whitespace
df['co_name_transform'] = df.co_name_transform.str.replace(' ', '')
# make lowercase
df['co_name_transform'] = df.co_name_transform.str.lower()
# select first 0:10 of strings
df['co_name_transform'] = df.co_name_transform.str[0:10]
print(df)
co_name co_name_transform
0 West Georgia Co westgeorgi
1 W.B. Carell Clockmakers wbcarellcl
2 Spine & Orthopedic LLC spineortho
3 LRHS Saint Jose's Grocery lrhsstjose
4 Optitech#NYCityScape optitechny
How can I put all these steps into a single function like this?
def clean_text(df[col]):
    for co in co_name:
        do_all_the_steps
    return df[new_col]
Thank you
You don't need a function to do this. Try the following one-liner (regex=True is needed in recent pandas versions, where str.replace no longer treats the pattern as a regex by default):
df['co_name_transform'] = df['co_name'].str.replace('[^A-Za-z0-9-]+', '', regex=True).str.replace('Saint', 'st').str.lower().str[0:10]
The final output will be:
co_name co_name_transform
0 West Georgia Co westgeorgi
1 W.B. Carell Clockmakers wbcarellcl
2 Spine & Orthopedic LLC spineortho
3 LRHS Saint Jose's Grocery lrhsstjose
4 Optitech#NYCityScape optitechny
You can do all the steps in the function you pass to the apply method:
import re
df['co_name_transform'] = df['co_name'].apply(lambda s: re.sub(r'[\W_]+', '', s).replace('Saint', 'st').lower()[:10])
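If you do want everything wrapped in a single named function, as the question asks, here is a minimal sketch (the function name, parameters, and default substitution dict are illustrative):
import pandas as pd

def clean_text(df, col, new_col, subs=None, n=10):
    # Strip non-alphanumerics, apply substitutions, lowercase, keep first n chars
    subs = subs or {'Saint': 'st'}
    s = df[col].str.replace(r'[\W_]+', '', regex=True)
    for old, new in subs.items():
        s = s.str.replace(old, new, regex=False)
    df[new_col] = s.str.lower().str[:n]
    return df

df = clean_text(df, 'co_name', 'co_name_transform')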
Another solution, similar to the one-liner above, but with the "to_replace" pairs collected in one dictionary, so you can add more items to replace. Note that this variant selects the first 10 rows, while the previous solutions keep the first 10 characters of each string.
data = ["West Georgia Co",
        "W.B. Carell Clockmakers",
        "Spine & Orthopedic LLC",
        "LRHS Saint Jose's Grocery",
        "Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape",
        "Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape",
        "Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape"]
df = pd.DataFrame(data, columns=['co_name'])
to_replace = {'[^A-Za-z0-9-]+': '', 'Saint': 'st'}
for i in to_replace:
    df['co_name'] = df['co_name'].str.replace(i, to_replace[i], regex=True).str.lower()
df['co_name'][0:10]
Result:
0 westgeorgiaco
1 wbcarellclockmakers
2 spineorthopedicllc
3 lrhssaintjosesgrocery
4 optitechnycityscape
5 optitechnycityscape
6 optitechnycityscape
7 optitechnycityscape
8 optitechnycityscape
9 optitechnycityscape
Name: co_name, dtype: object
Previous solution (keeps the first 10 characters of each string rather than the first 10 rows):
df['co_name_transform'] = df['co_name'].str.replace('[^A-Za-z0-9-]+', '', regex=True).str.replace('Saint', 'st').str.lower().str[0:10]
Result:
0 westgeorgi
1 wbcarellcl
2 spineortho
3 lrhssaintj
4 optitechny
5 optitechny
6 optitechny
7 optitechny
8 optitechny
9 optitechny
10 optitechny
11 optitechny
12 optitechny
Name: co_name_transform, dtype: object

Extract integers after double space with regex

I have a dataframe where I want to extract the part that comes after a double space. For all rows in column NAME there is a double white space after the company name, before the integers.
   NAME                                   INVESTMENT  PERCENT
0  APPLE COMPANY A  57 638 232 stocks     OIL LTD     0.12322
1  BANANA 1 COMPANY B  12 946 201 stocks  GOLD LTD    0.02768
2  ORANGE COMPANY C  8 354 229 stocks     GAS LTD     0.01786
df = pd.DataFrame({
    'NAME': ['APPLE COMPANY A  57 638 232 stocks', 'BANANA 1 COMPANY B  12 946 201 stocks', 'ORANGE COMPANY C  8 354 229 stocks'],
    'PERCENT': [0.12322, 0.02768, 0.01786]
})
I had this earlier, but it also picks up the integers inside the company names:
df['STOCKS']=df['NAME'].str.findall(r'\b\d+\b').apply(lambda x: ''.join(x))
Instead, I tried to split on the double spaces:
df['NAME'].str.split(r'(\s{2})')
which gives output:
0 [APPLE COMPANY A, , 57 638 232 stocks]
1 [BANANA 1 COMPANY B, , 12 946 201 stocks]
2 [ORANGE COMPANY C, , 8 354 229 stocks]
However, I want the integers that occur after double spaces to be joined/merged and put into a new column.
NAME PERCENT STOCKS
0 APPLE COMPANY A 0.12322 57638232
1 BANANA 1 COMPANY B 0.02768 12946201
2 ORANGE COMPANY C 0.01786 8354229
How can I modify my second function to do what I want?
Following the original logic you may use
df['STOCKS'] = df['NAME'].str.extract(r'\s{2,}(\d+(?:\s\d+)*)', expand=False).str.replace(r'\s+', '', regex=True)
df['NAME'] = df['NAME'].str.replace(r'\s{2,}\d+(?:\s\d+)*\s+stocks', '', regex=True)
Output:
NAME PERCENT STOCKS
0 APPLE COMPANY A 0.12322 57638232
1 BANANA 1 COMPANY B 0.02768 12946201
2 ORANGE COMPANY C 0.01786 8354229
Details
\s{2,}(\d+(?:\s\d+)*) extracts the first run of whitespace-separated digit chunks that follows two or more whitespaces; .replace(r'\s+', '') then removes the whitespace inside the extracted text.
.replace(r'\s{2,}\d+(?:\s\d+)*\s+stocks', '') updates the NAME column: it removes two or more whitespaces, the whitespace-separated digit chunks, one or more whitespaces, and the word stocks. The trailing \s+stocks may be replaced with .* if other words can follow.
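A quick standalone check of the extraction pattern (a sketch using the re module; note that the lone 1 in BANANA 1 is skipped because it follows a single space, not two):
import re

name = 'BANANA 1 COMPANY B  12 946 201 stocks'
m = re.search(r'\s{2,}(\d+(?:\s\d+)*)', name)
print(m.group(1).replace(' ', ''))  # 12946201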
Another pandas approach, which will cast STOCKS to a numeric type:
df_split = (df['NAME'].str.extractall(r'^(?P<NAME>.+)\s{2}(?P<STOCKS>[\d\s]+)')
                      .reset_index(level=1, drop=True))
df_split['STOCKS'] = pd.to_numeric(df_split.STOCKS.str.replace(r'\D', '', regex=True))
Assign these columns back into your original DataFrame:
df[['NAME', 'STOCKS']] = df_split[['NAME', 'STOCKS']]
NAME STOCKS PERCENT
0 APPLE COMPANY A 57638232 0.12322
1 BANANA 1 COMPANY B 12946201 0.02768
2 ORANGE COMPANY C 8354229 0.01786
You can use look-behind and look-ahead operators:
import re
# s is a single cell from the NAME column
''.join(re.findall(r'(?<=\s{2})(.*)(?=stocks)', s)).replace(' ', '')
This catches all characters between the double space and the word stocks, and then replaces the remaining spaces with the empty string.
Another solution using split:
df["NAME"].apply(lambda x: x[x.find('  ')+2:x.find('stocks')-1].replace(' ', ''))
Reference:
Look_behind
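Applied to the whole column rather than a single string (a sketch using apply):
import re

df['STOCKS'] = df['NAME'].apply(
    lambda s: ''.join(re.findall(r'(?<=\s{2})(.*)(?=stocks)', s)).replace(' ', ''))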
You can try
df['STOCKS'] = df['NAME'].str.split('  ').str[1].str.replace(' stocks', '').str.replace(' ', '')
df['NAME'] = df['NAME'].str.split('  ').str[0]
This can be done without regex by using split:
df['STOCKS'] = df['NAME'].apply(lambda x: ''.join(x.split('  ')[1].split(' ')[:-1]))
df['NAME'] = df['NAME'].str.replace(r'\s?\d+(?:\s\d+).*', '', regex=True)
