Clean column in data frame - python

I'm trying to clean a column that contains an ID number starting with S followed by 7 digits, e.g. 'S1234567', and save only this number into a new column. I started with the column named Remarks; this is an example of the data inside:
Remarks
0 S0252508 Shippment UK
1 S0255111 Shippment UK
2 S0256352 Shippment UK
3 S0259138 Shippment UK
4 S0260425 Shippment US
I've managed to separate the rows that have the format S1234567 + text using the code below:
# replace tabs and newlines with a common delimiter, then split rows on it
merged_out['Remarks'] = merged_out['Remarks'].str.replace("\t", "\r", regex=False)
merged_out['Remarks'] = merged_out['Remarks'].str.replace("\n", "\r", regex=False)
s = merged_out['Remarks'].str.split("\r", expand=True).stack()
s.index = s.index.droplevel(-1)  # drop the extra level added by stack()
s.name = 'Remarks'
del merged_out['Remarks']
merged_out = merged_out.join(s)
# split each remark into the leading token and the rest
merged_out[['Number', 'Remarks']] = merged_out.Remarks.str.split(" ", n=1, expand=True)
After creating the data frame I found a lot of mistakes in that column, because the data were entered manually. Here are some examples of the wrong records:
Number
0. Pallets:
1. S0246734/S0246735/S0246736
3. delivery
4. S0258780 31 cok
5. S0246732-
6. 2
7. ok
8. nan
And this is only the wrong data in the Number column. I need to clean this and keep only the records that have a correct number. If there is something like S0246732/S0246736/S0246738, then I need a separate row for each number, with the same data as the original record. Of the others, I need to keep those that contain a number; the rest should have a null value.

Here is a regex approach that will do what I think your question asks:
import pandas as pd

merged_out = pd.DataFrame({
    'Remarks': [
        'S0252508 Shippment UK',
        'S0255111 Shippment UK',
        'S0256352 Shippment UK',
        'S0259138/S0259139 Shippment UK',
        'S12345678 Shippment UK',
        'S0260425 Shippment US']
})

pat = r'(?:(\bS\d{7})/)*(\bS\d{7}\b)'
df = merged_out.Remarks.str.extractall(pat)
df = (pd.concat([
          pd.DataFrame(df.unstack().apply(lambda row: row.dropna().tolist(), axis=1),
                       columns=['Number']),
          merged_out],
          axis=1).explode('Number'))
df.Remarks = df.Remarks.str.replace(pat + r'\s*', '', regex=True)
Input:
Remarks
0 S0252508 Shippment UK
1 S0255111 Shippment UK
2 S0256352 Shippment UK
3 S0259138/S0259139 Shippment UK
4 S12345678 Shippment UK
5 S0260425 Shippment US
Output:
     Number                 Remarks
0  S0252508            Shippment UK
1  S0255111            Shippment UK
2  S0256352            Shippment UK
3  S0259138            Shippment UK
3  S0259139            Shippment UK
4       NaN  S12345678 Shippment UK
5  S0260425            Shippment US
Explanation:
with Series.str.extractall(), use a pattern that captures zero or more occurrences of a word boundary \b followed by S, 7 digits and a / delimiter, then exactly one occurrence of S followed by 7 digits flanked by word boundaries \b
use unstack() to eliminate multiple index levels
use apply() with dropna() and tolist() to create a new dataframe with a Number column containing a list of numbers for each row
use explode() to add new rows for lists with more than one Number item
with Series.str.replace(), filter out the number matches using the previous pattern, plus r'\s*' to match trailing whitespace characters, to obtain the residual Remarks
Notes:
all rows in the sample input contain one valid Number, except one row that contains multiple Number values separated by / delimiters and another row that contains no valid Number (it has S followed by 8 digits, more than the 7 that make a valid Number)
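If you don't also need to strip the matched IDs out of Remarks, a shorter variant is possible with str.findall(); a sketch under the same assumptions (the ids/out names are mine):
# findall() collects every valid S-number per row; explode() then gives each
# match its own row, and rows with no match get NaN
ids = merged_out['Remarks'].str.findall(r'\bS\d{7}\b')
out = merged_out.assign(Number=ids.where(ids.str.len() > 0)).explode('Number')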

I think the easiest solution is to use regular expressions and a list comprehension:
import re
import pandas as pd

merged_out['Remarks'] = [re.split(r'\s', i)[0] for i in merged_out['Remarks']]
Explanation:
The regular expression splits each value at whitespace, producing a list for each row i of the column Remarks; the [0] selects the first element of that list, which in this case is the number.
The list comprehension iterates through the whole column, so you obtain the corresponding number of each row in the new column Remarks.
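Note that this keeps whatever the first whitespace-separated token is, even when it isn't a valid ID ('Pallets:', 'delivery', '2', ...). Since the question wants a null value in those cases, a follow-up sketch (assuming the split above has already run):
# keep the first token only when it is exactly S followed by 7 digits;
# anything else becomes a missing value
first = merged_out['Remarks']
merged_out['Number'] = first.where(first.str.fullmatch(r'S\d{7}'))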

Related

Get the index of a particular element (a sentence string) present in a DataFrame column, with conditions

I have a table with numbers in the column 'Para'. I have to find the index of a particular sentence from the column 'Country_Title', in such a way that the value in column 'Para' is 2.
Main DataFrame 'df_countries' is shown below:
Index  Sequence  Para  Country_Title
0      5         4     India is seventh largest country
1      6         6     Australia is a continent country
2      7         2     Canada is the 2nd largest country
3      9         3     UAE is a country in Western Asia
4      10        2     China is a country in East Asia
5      11        1     Germany is in Central Europe
6      13        2     Russia is the largest country
7      14        3     Capital city of China is Beijing
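For anyone wanting to reproduce this, the frame above can be rebuilt with a minimal sketch:
import pandas as pd

df_countries = pd.DataFrame({
    'Sequence': [5, 6, 7, 9, 10, 11, 13, 14],
    'Para': [4, 6, 2, 3, 2, 1, 2, 3],
    'Country_Title': ['India is seventh largest country',
                      'Australia is a continent country',
                      'Canada is the 2nd largest country',
                      'UAE is a country in Western Asia',
                      'China is a country in East Asia',
                      'Germany is in Central Europe',
                      'Russia is the largest country',
                      'Capital city of China is Beijing']})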
Suppose my keyword is China, and I want to get the index of the sentence containing 'China', but only the one where 'Para' = 2.
Consider the rows at indexes 4 and 7: both mention China, but I want the index of the one with 'Para' = 2, i.e. the result must be index = 4.
My Approach:
I derived another DataFrame 'df_para2_countries' from above table as shown below:
Index  Para  Country_Title
2      2     Canada is the 2nd largest country
4      2     China is a country in East Asia
6      2     Russia is the largest country
Now I store the country titles as:
c = list(df_para2_countries['Country_Title'])
I used a for loop to go through the elements in 'c' and find the index of a particular country:
for i in c:
    if 'China' in i:
        print(i)
        ind = df_para2_countries.loc[df_para2_countries['Country_Title'] = i]
        print(ind)
The line assigning 'ind' gives an error. I want to get the index, but this doesn't work.
Please post your suggestions on how I can approach this.
You need two equals signs (==) in your condition.
If you need only the 'index', that is, the value from the Index column, you can take the index of the rows returned by .loc[], convert it to a list, and get the first value, for instance:
ind = df_para2_countries.loc[df_para2_countries['Country_Title'] == i].index.to_list()[0]
Hope it works :)
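For what it's worth, the lookup can also be done in one step on the original df_countries, without building the intermediate table. A sketch, assuming the column names from the question:
# combine both conditions in a single boolean mask
mask = (df_countries['Para'] == 2) & df_countries['Country_Title'].str.contains('China')
ind = df_countries.index[mask][0]
print(ind)  # 4 for the sample data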

DataFrame from variable and filtering data

I have a DataFrame and want to extract 3 columns from it, but one of them is an input from the user. I made a list, but need it to be iterable so I can run a for loop over it.
So far I've managed by making a dictionary with 2 of the columns, making a list of each and zipping them... but I really need all 3 columns...
My code:
Data = pd.read_csv(----------)
selec = input("What month would you want to show?")
NewData = [(Data['Country']), (Data['City']), (Data[selec].astype('int64'))]
# here I try to iterate:
iteration = [i for i in NewData if NewData[i] <= 25]
print(iteration)
TypeError: list indices must be integers or slices, not Series
My CSV has Country and City columns plus one column for each month, holding numeric values.
I want to be able to choose the month with the variable "selec" and filter the results of the month I've chosen, so for selec="Feb" the output would keep only the rows passing the filter on the Feb column.
I tried as well with loc/iloc, but no luck at all (unhashable type: 'list').
See the example below for how you can:
select specific columns from a DataFrame by providing a list of column names between the selection brackets
select specific rows from a DataFrame by providing a condition between the selection brackets
iterate over the rows of a DataFrame, although I don't suppose you need it - if you'd like to keep working with the DataFrame after filtering it, it's better to use the method mentioned above (you won't have to put the rows back together, and it will likely be more performant because pandas is optimized for bulk operations)
import pandas as pd

# this is just for testing, instead of pd.read_csv(...)
df = pd.DataFrame([
    dict(Country="Spain", City="Madrid", Jan="15", Feb="16", Mar="17", Apr="18", May=""),
    dict(Country="Spain", City="Galicia", Jan="1", Feb="2", Mar="3", Apr="4", May=""),
    dict(Country="France", City="Paris", Jan="0", Feb="2", Mar="3", Apr="4", May=""),
    dict(Country="Algeria", City="Argel", Jan="20", Feb="28", Mar="29", Apr="30", May=""),
])
print("---- Original df:")
print(df)

selec = "Feb"  # let's pretend this comes from input()

print("\n---- Just the 3 columns:")
df = df[["Country", "City", selec]]  # narrow down the df to just the 3 columns
df[selec] = df[selec].astype("int64")  # convert the selec column to the proper type
print(df)

print("\n---- Filtered dataframe:")
df1 = df[df[selec] <= 25]
print(df1)

print("\n---- Iterated & filtered rows:")
for row in df.itertuples():
    # we could also use row[3] instead of getattr(...)
    if getattr(row, selec) <= 25:
        print(row)
Output:
---- Original df:
   Country     City Jan Feb Mar Apr May
0    Spain   Madrid  15  16  17  18
1    Spain  Galicia   1   2   3   4
2   France    Paris   0   2   3   4
3  Algeria    Argel  20  28  29  30

---- Just the 3 columns:
   Country     City  Feb
0    Spain   Madrid   16
1    Spain  Galicia    2
2   France    Paris    2
3  Algeria    Argel   28

---- Filtered dataframe:
  Country     City  Feb
0   Spain   Madrid   16
1   Spain  Galicia    2
2  France    Paris    2

---- Iterated & filtered rows:
Pandas(Index=0, Country='Spain', City='Madrid', Feb=16)
Pandas(Index=1, Country='Spain', City='Galicia', Feb=2)
Pandas(Index=2, Country='France', City='Paris', Feb=2)
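As a side note, the same filter can be written with DataFrame.query(), which reads nicely when the column name lives in a variable; a sketch, assuming the chosen month name is a valid identifier such as "Feb":
df1 = df.query(f"{selec} <= 25")  # interpolate the chosen column into the expression
print(df1)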

How to replace words with different case in PANDAS dataframe

I have a really big data frame that contains a specific column "city" with many cities repeated in different cases, e.g.:
City
Gurgaon
GURGAON
gurgaon
Chennai
CHENNAI
Banglore
Hydrabad
BANGLORE
HYDRABAD
...
Is there a way to replace all occurrences of the same city, written in different cases, with a single name?
There are 3k rows in the column, so it's not possible manually.
Edit -
The city column of the DF also contains cities like
'Gurgaon'
'GURGAON'
'gurgaon ' #there is a white space at the end
I want something so that they all change to the same name and the trailing whitespace is also removed, so that the output is:
'Gurgaon'
'Gurgaon'
'Gurgaon' #no white space at the end
Thanks
Here is how you can use str.strip() to remove trailing whitespace and then str.title() to normalize the case:
import pandas as pd

df = pd.DataFrame({'City': ["Gurgaon",
                            "GURGAON",
                            "gurgaon",
                            "Chennai",
                            "CHENNAI",
                            "Banglore",
                            "Hydrabad",
                            "BANGLORE",
                            "HYDRABAD"]})
df['City'] = df['City'].str.strip()
df['City'] = df['City'].str.title()
print(df)
Output:
City
0 Gurgaon
1 Gurgaon
2 Gurgaon
3 Chennai
4 Chennai
5 Banglore
6 Hydrabad
7 Banglore
8 Hydrabad
First, change the cities to have the same format:
df['City'] = df['City'].apply(lambda x: x.capitalize())
Then, remove duplicates (drop_duplicates returns a new frame, so assign it back):
df = df.drop_duplicates()
(I assume the rest of the columns are equal)
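For reference, the two answers combine into a single chain; a minimal sketch (like the second answer, it assumes the remaining columns are equal, otherwise pass subset='City' to drop_duplicates()):
df['City'] = df['City'].str.strip().str.title()  # normalize whitespace and case
df = df.drop_duplicates()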

How to condense text cleaning steps into single Python function?

Newer programmer here, deeply appreciate any help this knowledgeable community is willing to provide.
I have a column of 140,000 text strings (company names) in a pandas dataframe. I want to strip all whitespace in and around the strings, remove all punctuation, substitute specific substrings, and uniformly transform to lowercase. I then want to take the first 10 characters (elements 0:10) of each string and store them in a new dataframe column.
Here is a reproducible example.
import string
import pandas as pd

data = ["West Georgia Co",
        "W.B. Carell Clockmakers",
        "Spine & Orthopedic LLC",
        "LRHS Saint Jose's Grocery",
        "Optitech#NYCityScape"]
df = pd.DataFrame(data, columns=['co_name'])

def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

# applying the remove_punctuations function
df['co_name_transform'] = df['co_name'].apply(remove_punctuations)

# this next step replaces 'Saint' with 'st' to standardize,
# and I may want to make other substitutions, but this is a common one.
df['co_name_transform'] = df.co_name_transform.str.replace('Saint', 'st')

# remove whitespace
df['co_name_transform'] = df.co_name_transform.str.replace(' ', '')

# make lowercase
df['co_name_transform'] = df.co_name_transform.str.lower()

# select the first 10 characters
df['co_name_transform'] = df.co_name_transform.str[0:10]
print(df)
co_name co_name_transform
0 West Georgia Co westgeorgi
1 W.B. Carell Clockmakers wbcarellcl
2 Spine & Orthopedic LLC spineortho
3 LRHS Saint Jose's Grocery lrhsstjose
4 Optitech#NYCityScape optitechny
How can I put all these steps into a single function like this?
def clean_text(df[col]):
    for co in co_name:
        do_all_the_steps
    return df[new_col]
Thank you
You don't need a function to do this. Try the following one-liner (note that recent pandas versions need regex=True for a regex pattern in str.replace()):
df['co_name_transform'] = df['co_name'].str.replace('[^A-Za-z0-9-]+', '', regex=True).str.replace('Saint', 'st').str.lower().str[0:10]
The final output will be:
co_name co_name_transform
0 West Georgia Co westgeorgi
1 W.B. Carell Clockmakers wbcarellcl
2 Spine & Orthopedic LLC spineortho
3 LRHS Saint Jose's Grocery lrhsstjose
4 Optitech#NYCityScape optitechny
You can do all the steps in the function you pass to the apply method:
import re
df['co_name_transform'] = df['co_name'].apply(lambda s: re.sub(r'[\W_]+', '', s).replace('Saint', 'st').lower()[:10])
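If you prefer the named function the question asks for, the same steps can be wrapped up and applied; a sketch of that shape:
import re

def clean_text(name):
    # drop punctuation and whitespace, standardize 'Saint' -> 'st',
    # lowercase, and keep the first 10 characters
    name = re.sub(r'[\W_]+', '', name)
    name = name.replace('Saint', 'st')
    return name.lower()[:10]

df['co_name_transform'] = df['co_name'].apply(clean_text)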
Another solution, similar to the one-liner above, but with the replacements collected in one dictionary, so you can add more items to replace. Also, the one-liner won't show only the first 10 rows.
data = ["West Georgia Co",
        "W.B. Carell Clockmakers",
        "Spine & Orthopedic LLC",
        "LRHS Saint Jose's Grocery",
        "Optitech#NYCityScape", "Optitech#NYCityScape", "Optitech#NYCityScape",
        "Optitech#NYCityScape", "Optitech#NYCityScape", "Optitech#NYCityScape",
        "Optitech#NYCityScape", "Optitech#NYCityScape", "Optitech#NYCityScape"]
df = pd.DataFrame(data, columns=['co_name'])
to_replace = {'[^A-Za-z0-9-]+': '', 'Saint': 'st'}
for i in to_replace:
    df['co_name'] = df['co_name'].str.replace(i, to_replace[i], regex=True).str.lower()
df['co_name'][0:10]
Result :
0 westgeorgiaco
1 wbcarellclockmakers
2 spineorthopedicllc
3 lrhssaintjosesgrocery
4 optitechnycityscape
5 optitechnycityscape
6 optitechnycityscape
7 optitechnycityscape
8 optitechnycityscape
9 optitechnycityscape
Name: co_name, dtype: object
Previous solution (won't show only the first 10 rows):
df['co_name_transform'] = df['co_name'].str.replace('[^A-Za-z0-9-]+', '', regex=True).str.replace('Saint', 'st').str.lower().str[0:10]
Result :
0 westgeorgi
1 wbcarellcl
2 spineortho
3 lrhssaintj
4 optitechny
5 optitechny
6 optitechny
7 optitechny
8 optitechny
9 optitechny
10 optitechny
11 optitechny
12 optitechny
Name: co_name_transform, dtype: object

Pandas: Delete rows of a DataFrame if total count of a particular column occurs only 1 time

I'm looking to delete the rows of a DataFrame in which the value of a particular column occurs only once.
Example of raw table (values are arbitrary for illustrative purposes):
print df
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
9 Bolivia #McDonalds 3456
10 Kenya #Schools 3455
11 Ukraine #Cars 3456
12 US #Tshirts 3456789
Intended outcome:
print df
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
I know that df.Series.value_counts()>1 will identify which df.Series values occur more than once, and that the returned result will look something like the following:
Population    True
GDP           True
#McDonalds    False
#Schools      False
#Cars         False
#Tshirts      False
I want to write something like the following, so that my new DataFrame drops the rows whose df.Series value occurs only once, but this doesn't work:
df.drop(df.Series.value_counts()==1,axis=1,inplace=True)
You can do this by creating a boolean list/array, either with a list comprehension or using the Series string methods.
The list comprehension approach is:
vc = df['Series'].value_counts()
u = [i not in set(vc[vc==1].index) for i in df['Series']]
df = df[u]
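A variant of the same idea that avoids the Python-level loop entirely is isin(); a sketch:
vc = df['Series'].value_counts()
df = df[df['Series'].isin(vc[vc > 1].index)]  # keep values occurring more than once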
The other approach is to use the str.contains method to check whether the values of the Series column contain a given string or match a given regular expression (used here, since we are matching against multiple strings):
vc = df['Series'].value_counts()
pat = r'|'.join(vc[vc==1].index) #Regular expression
df = df[~df['Series'].str.contains(pat)] #Tilde is to negate boolean
Using this regular expressions approach is a bit more hackish and may require some extra processing (character escaping, etc) on pat in case you have regex metacharacters in the strings you want to filter out (which requires some basic regex knowledge). However, it's worth noting this approach is about 4x faster than using the list comprehension approach (tested on the data provided in the question).
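If the values being filtered can contain regex metacharacters (the '#' prefix here is harmless, but '.' or '+' would not be), re.escape() handles the escaping mentioned above; a sketch:
import re

vc = df['Series'].value_counts()
pat = r'|'.join(map(re.escape, vc[vc == 1].index))  # escape metacharacters first
df = df[~df['Series'].str.contains(pat)]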
As a side note, I recommend avoiding using the word Series as a column name as that's the name of a pandas object.
This is an old question, but the approach above is far too slow for moderately large dataframes. A much faster and more "dataframe-native" way is to add a value-count column and filter on it.
Create the dataset:
df = pd.DataFrame({'Country': 'Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US'.split(),
                   'Series': 'Pop Pop Pop Pop GDP GDP GDP GDP McDonalds Schools Cars Tshirts'.split()})
Drop rows that have a count of 1 for the given column ('Series' in this case):
# Group values of Series and add a 'cnt' column holding the count
df['cnt'] = df.groupby(['Series'])['Country'].transform('count')
# Drop rows whose count value == 1, then drop the 'cnt' column
df.drop(df[df.cnt==1].index)[['Country','Series']]
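The count-and-filter can also be fused into one boolean mask with transform, avoiding the temporary cnt column entirely; a sketch:
# keep only rows whose Series value occurs more than once
df_filtered = df[df.groupby('Series')['Series'].transform('size') > 1]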
