Removing non-alphanumeric symbols in dataframe - python

How do I remove non-alphanumeric characters from the values in a dataframe? So far I have only managed to convert everything to lower case:
def doubleAwardList(self):
    dfwinList = pd.DataFrame()
    dfloseList = pd.DataFrame()
    dfwonandLost = pd.DataFrame()
    # self.dfWIN and self.dfLOSE hold the files chosen by the user
    groupby_name = self.dfWIN.groupby("name")
    groupby_nameList = self.dfLOSE.groupby("name _List")
    list4 = []
    list5 = []
    notAwarded = "na"
    for x, group in groupby_name:
        if x != notAwarded:
            list4.append(str.lower(str(x)))
    dfwinList = pd.DataFrame(list4)
    for x, group in groupby_nameList:
        list5.append(str.lower(str(x)))
    dfloseList = pd.DataFrame(list5)
Data sample: I mainly need to remove the full stops and hyphens, because I have to compare these values against another file. The naming isn't very consistent, so I had to strip the non-alphanumeric characters to get a much more accurate match.
creative-3
smart tech pte. ltd.
nutritive asia
asia's first
desired result:
creative 3
smart tech pte ltd
nutritive asia
asia s first

Use DataFrame.replace on its own and add whitespace to the pattern:
df = df.replace('[^a-zA-Z0-9 ]', '', regex=True)
If it is one column (a Series):
df = pd.DataFrame({'col': ['creative-3', 'smart tech pte. ltd.',
                           'nutritive asia', "asia's first"],
                   'col2': range(4)})
print (df)
                    col  col2
0            creative-3     0
1  smart tech pte. ltd.     1
2        nutritive asia     2
3          asia's first     3
df['col'] = df['col'].replace('[^a-zA-Z0-9 ]', '', regex=True)
print (df)
                  col  col2
0           creative3     0
1  smart tech pte ltd     1
2      nutritive asia     2
3         asias first     3
EDIT:
If there are multiple columns, it is possible to select only the object (i.e. string) columns and, if necessary, cast them to strings first:
cols = df.select_dtypes('object').columns
print (cols)
Index(['col'], dtype='object')
df[cols] = df[cols].astype(str).replace('[^a-zA-Z0-9 ]', '', regex=True)
print (df)
                  col  col2
0           creative3     0
1  smart tech pte ltd     1
2      nutritive asia     2
3         asias first     3
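Note that the outputs above drop the punctuation entirely ('creative3'), while the desired result in the question replaces it with a space ('creative 3'). A hedged variant that reproduces the question's sample exactly: substitute a space for each stray character, then collapse runs of whitespace:
df = pd.DataFrame({'col': ['creative-3', 'smart tech pte. ltd.',
                           'nutritive asia', "asia's first"]})
df['col'] = (df['col']
             .str.replace('[^a-zA-Z0-9 ]', ' ', regex=True)  # punctuation -> space
             .str.replace(r'\s+', ' ', regex=True)           # collapse repeated spaces
             .str.strip()
             .str.lower())
print(df['col'].tolist())
# ['creative 3', 'smart tech pte ltd', 'nutritive asia', 'asia s first']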

Why not just the following (I did make it lower case too, by the way):
df = df.replace('[^a-zA-Z0-9]', '', regex=True).apply(lambda c: c.str.lower())
Then:
print(df)
will show the desired dataframe.
Update:
try:
df = df.apply(lambda x: x.str.replace('[^a-zA-Z0-9]', '', regex=True).str.lower(), axis=0)
If it is only one column, do:
df['your col'] = df['your col'].str.replace('[^a-zA-Z0-9]', '', regex=True).str.lower()

Related

Python : Remodelling a dataframe and regrouping data from a specific column with predefined rows

Let's say that I have this dataframe with four columns: "Name", "Value", "Ccy" and "Group":
import pandas as pd
Name = ['ID', 'Country', 'IBAN','Dan_Age', 'Dan_city', 'Dan_country', 'Dan_sex', 'Dan_Age', 'Dan_country','Dan_sex' , 'Dan_city','Dan_country' ]
Value = ['TAMARA_CO', 'GERMANY','FR56','18', 'Berlin', 'GER', 'M', '22', 'FRA', 'M', 'Madrid', 'ESP']
Ccy = ['','','','EUR','EUR','USD','USD','','CHF', '','DKN','']
Group = ['0','0','0','1','1','1','1','2','2','2','3','3']
df = pd.DataFrame({'Name':Name, 'Value' : Value, 'Ccy' : Ccy,'Group':Group})
print(df)
           Name      Value  Ccy Group
0            ID  TAMARA_CO          0
1       Country    GERMANY          0
2          IBAN       FR56          0
3       Dan_Age         18  EUR     1
4      Dan_city     Berlin  EUR     1
5   Dan_country        GER  USD     1
6       Dan_sex          M  USD     1
7       Dan_Age         22          2
8   Dan_country        FRA  CHF     2
9       Dan_sex          M          2
10     Dan_city     Madrid  DKN     3
11  Dan_country        ESP          3
I want to represent this data differently before saving it to CSV. I would like to group the duplicates in the column "Name" with their associated values in "Value" and "Ccy", and store the data from "Value" and "Ccy" in the row (index) given by the column "Group", so that the data are not mixed.
Then, if a name is in "Group" 0, it is general data, so all the rows for that "Name" should be filled with the same value.
So I would like to get this result:
  ID_Value  Country_Value  IBAN_Value  Dan_age  Dan_age_Ccy  Dan_city_Value  Dan_city_Ccy  Dan_sex_Value
1   TAMARA            GER        FR56       18          EUR          Berlin           EUR              M
2   TAMARA            GER        FR56       22                                                        M
3   TAMARA            GER        FR56                                Madrid           DKN
I cannot work out how to do the first part. With the code below, I do not get what I want, even if I remove the empty columns:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
Can anyone help me? Thank you!
You can use the following. See comments in code for each step:
import numpy as np

s = df.loc[df['Group'] == '0', 'Name'].tolist()  # this variable will be used later according to Condition 2
df['Name'] = pd.Categorical(df['Name'], categories=df['Name'].unique(), ordered=True)  # this preserves order before pivoting
df = df.pivot(index='Group', columns='Name')  # transforms long-to-wide per expected output
for col in df.columns:
    if col[1] in s:
        df[col] = df[col].shift().ffill()  # Condition 2
df = df.iloc[1:].replace('', np.nan).dropna(axis=1, how='all').fillna('')  # dataframe cleanup
df.columns = ['_'.join(col) for col in df.columns.swaplevel()]  # column name cleanup
df
Out[1]:
        ID_Value Country_Value IBAN_Value Dan_Age_Value Dan_city_Value  \
Group
1      TAMARA_CO       GERMANY       FR56            18         Berlin
2      TAMARA_CO       GERMANY       FR56            22
3      TAMARA_CO       GERMANY       FR56                       Madrid

      Dan_country_Value Dan_sex_Value Dan_Age_Ccy Dan_city_Ccy  \
Group
1                   GER             M         EUR          EUR
2                   FRA             M
3                   ESP                                    DKN

      Dan_country_Ccy Dan_sex_Ccy
Group
1                 USD         USD
2                 CHF
3
From there, you can drop columns you don't want, change strings from "TAMARA_CO" to "TAMARA", "GERMANY" to "GER", use reset_index(drop=True), etc.
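For instance, a hedged sketch of that final cleanup (the shortened labels 'TAMARA' and 'GER' are copied from the question's expected output, not computed):
# map the long strings to the short labels shown in the expected output
df = df.replace({'TAMARA_CO': 'TAMARA', 'GERMANY': 'GER'}).reset_index(drop=True)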
You can do this quite easily with only 3 steps:
1. Split your data frame into 2 parts: the "general data" (which we want as a series) and the more specific data. Each data frame now contains the same kinds of information.
2. The key part of your problem: reorganizing the data. All you need is the pandas pivot function. It does exactly what you need!
3. Add the general information and the pivoted data back together.
# Split Data
general = df[df.Group == "0"].set_index("Name")["Value"].copy()
main_df = df[df.Group != "0"]

# Pivot Data
result = main_df.pivot(index="Group", columns=["Name"],
                       values=["Value", "Ccy"]).fillna("")
result.columns = [f"{c[1]}_{c[0]}" for c in result.columns]

# Create a data frame that has an identical row for each group
general_df = pd.DataFrame([general]*3, index=result.index)
general_df.columns = [c + "_Value" for c in general_df.columns]

# Merge the data back together
result = general_df.merge(result, on="Group")
The result given above does not give the exact column order you want, so you'd have to specify that manually:
final_cols = ["ID_Value", "Country_Value", "IBAN_Value",
              "Dan_Age_Value", "Dan_Age_Ccy", "Dan_city_Value",
              "Dan_city_Ccy", "Dan_sex_Value"]
result = result[final_cols]

Converting a list of pairs in a column into two columns using pandas dataframe

I have a dataframe df with two columns as follows:
ID Country_pairs
0 X [(France, USA), (USA, France)]
1 Y [(USA, UK), (UK, France), (USA, France)]
I want to output all possible pairs of countries in two columns as follows:
ID Country1 Country2
X France USA
X USA France
Y USA UK
Y UK France
Y USA France
Doing this gives me the output I want:
result = pd.DataFrame()
for index, row in df.iterrows():
    x = row['Country_pairs']
    temp = pd.DataFrame(data=x, columns=['Country1', 'Country2'])
    temp['ID'] = row['ID']
    result = result.append(temp)
print(result)
The dataframe has over 200 million rows, so this is very slow because of the looping. I was wondering if there is a faster solution?
import pandas as pd

# setup the dataframe
df = pd.DataFrame({'ID': ['X', 'Y'],
                   'Country_Pairs': [[('France', 'USA'), ('USA', 'France')],
                                     [('USA', 'UK'), ('UK', 'France'), ('USA', 'France')]]})

  ID                              Country_Pairs
0  X             [(France, USA), (USA, France)]
1  Y   [(USA, UK), (UK, France), (USA, France)]

# separate each tuple to its own row with explode
df2 = df.explode('Country_Pairs')

# separate each value in the tuple to its own column
df2[['Country1', 'Country2']] = pd.DataFrame(df2.Country_Pairs.tolist(), index=df2.index)

# delete Country_Pairs
df2.drop(columns=['Country_Pairs'], inplace=True)

  ID Country1 Country2
0  X   France      USA
0  X      USA   France
1  Y      USA       UK
1  Y       UK   France
1  Y      USA   France
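If you want a fresh 0..n-1 index like the question's expected output, a reset finishes the job (a small follow-up, not part of the original answer):
df2 = df2.reset_index(drop=True)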
Give this a trial:
explode_df = df.apply(pd.Series.explode)
split_country = explode_df.apply(lambda x: ' '.join(x['Country_pairs']), axis=1).str.split(expand=True)
# if you would like to combine the results
res = pd.concat([explode_df, split_country], axis=1)
You are looking for .explode()
result = df.explode('Country_pairs')
result["Country1"] = result.Country_pairs.apply(lambda t:t[0])
result["Country2"] = result.Country_pairs.apply(lambda t:t[1])
del result["Country_pairs"]
200 million rows is massive, so there is no point running this computation in Pandas. As suggested in the comments, use Apache Spark, or if the data sits in a database, you could possibly do the work there.
The solution I proffer works on small datasets suited to Pandas: take the data out of Pandas, use the itertools functions product and chain, and build the dataframe back. It should be reasonably fast, but certainly not for 200 million rows.
# using the data provided by @Trenton
df = pd.DataFrame({'ID': ['X', 'Y'],
                   'Country_Pairs': [[('France', 'USA'), ('USA', 'France')],
                                     [('USA', 'UK'), ('UK', 'France'), ('USA', 'France')]]})

from itertools import product, chain

step1 = chain.from_iterable(product(first, last)
                            for first, last in df.to_numpy())
res = pd.DataFrame(((first, *last) for first, last in step1),
                   columns=['ID', 'Country1', 'Country2'])
res
  ID Country1 Country2
0  X   France      USA
1  X      USA   France
2  Y      USA       UK
3  Y       UK   France
4  Y      USA   France

Extract integers after double space with regex

I have a dataframe from which I want to extract the values that come after a double space. For all rows in column NAME there is a double white space after the company name, before the integers.
                                     NAME INVESTMENT  PERCENT
0     APPLE COMPANY A  57 638 232 stocks     OIL LTD  0.12322
1  BANANA 1 COMPANY B  12 946 201 stocks    GOLD LTD  0.02768
2    ORANGE COMPANY C  8 354 229 stocks      GAS LTD  0.01786
df = pd.DataFrame({
    'NAME': ['APPLE COMPANY A  57 638 232 stocks',
             'BANANA 1 COMPANY B  12 946 201 stocks',
             'ORANGE COMPANY C  8 354 229 stocks'],
    'PERCENT': [0.12322, 0.02768, 0.01786]
})
I had this earlier, but it also picks up integers in the company name:
df['STOCKS']=df['NAME'].str.findall(r'\b\d+\b').apply(lambda x: ''.join(x))
Instead I tried to extract after double spaces
df['NAME'].str.split(r'(\s{2})')
which gives output:
0 [APPLE COMPANY A, , 57 638 232 stocks]
1 [BANANA 1 COMPANY B, , 12 946 201 stocks]
2 [ORANGE COMPANY C, , 8 354 229 stocks]
However, I want the integers that occur after double spaces to be joined/merged and put into a new column.
                 NAME  PERCENT    STOCKS
0     APPLE COMPANY A  0.12322  57638232
1  BANANA 1 COMPANY B  0.02768  12946201
2    ORANGE COMPANY C  0.01786   8354229
How can I modify my second function to do what I want?
Following the original logic you may use
df['STOCKS'] = df['NAME'].str.extract(r'\s{2,}(\d+(?:\s\d+)*)', expand=False).str.replace(r'\s+', '')
df['NAME'] = df['NAME'].str.replace(r'\s{2,}\d+(?:\s\d+)*\s+stocks', '')
Output:
                 NAME  PERCENT    STOCKS
0     APPLE COMPANY A  0.12322  57638232
1  BANANA 1 COMPANY B  0.02768  12946201
2    ORANGE COMPANY C  0.01786   8354229
Details
\s{2,}(\d+(?:\s\d+)*) is used to extract the first occurrence of whitespace-separated consecutive digit chunks after 2 or more whitespaces, and .replace(r'\s+', '') removes any whitespace in the extracted text afterwards.
.replace(r'\s{2,}\d+(?:\s\d+)*\s+stocks', '') updates the text in the NAME column: it removes 2 or more whitespaces, the consecutive whitespace-separated digit chunks, then 1+ whitespaces and stocks. Actually, the last \s+stocks may be replaced with .* if there are other words.
Another pandas approach, which will cast STOCKS to numeric type:
df_split = (df['NAME'].str.extractall(r'^(?P<NAME>.+)\s{2}(?P<STOCKS>[\d\s]+)')
                      .reset_index(level=1, drop=True))
df_split['STOCKS'] = pd.to_numeric(df_split.STOCKS.str.replace(r'\D', '', regex=True))
Assign these columns back into your original DataFrame:
df[['NAME', 'STOCKS']] = df_split[['NAME', 'STOCKS']]
                 NAME    STOCKS  PERCENT
0     APPLE COMPANY A  57638232  0.12322
1  BANANA 1 COMPANY B  12946201  0.02768
2    ORANGE COMPANY C   8354229  0.01786
You can use look-behind and look-ahead operators:
''.join(re.findall(r'(?<=\s{2})(.*)(?=stocks)', string)).replace(' ', '')
This catches all characters between the double space and the word stocks, then replaces the remaining spaces with the empty string.
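Applied to the sample frame, that per-string approach might look like the sketch below (the extract_stocks helper is mine, not from the answer):
import re

def extract_stocks(s):
    # digits (with their inner spaces) between the double space and 'stocks'
    return ''.join(re.findall(r'(?<=\s{2})(.*)(?=stocks)', s)).replace(' ', '')

df['STOCKS'] = df['NAME'].apply(extract_stocks)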
Another solution using split:
df["NAME"].apply(lambda x: x[x.find('  ') + 2:x.find('stocks') - 1].replace(' ', ''))
Reference: look-behind assertions in the re documentation.
You can try:
parts = df['NAME'].str.split('  ')
df['STOCKS'] = parts.str[1].str.replace('[^0-9]', '', regex=True)
df['NAME'] = parts.str[0]
This can be done without using regex for the extraction, by using split:
df['STOCKS'] = df['NAME'].apply(lambda x: ''.join(x.split('  ')[1].split(' ')[:-1]))
df['NAME'] = df['NAME'].str.replace(r'\s?\d+(?:\s\d+).*', '', regex=True)
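If STOCKS should end up numeric rather than as a string, a final cast is cheap (a follow-up suggestion, not part of the answers above):
df['STOCKS'] = pd.to_numeric(df['STOCKS'])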

Removing strings in a series of headers

I have a number of columns in a dataframe:
df = pd.DataFrame({'Date': [1990],
                   'State Income of Alabama': [1],
                   'State Income of Washington': [2],
                   'State Income of Arizona': [3]})
All headers have the same number of words and share the exact same prefix, with exactly one white space before each state's name.
I want to take out the string 'State Income of ' and leave the state intact as the new header for each column, so they all read:
Alabama Washington Arizona
1 2 3
I've tried replacing the text in the column names like so:
df.columns = df.columns.str.replace('State Income of ', '')
But this isn't giving me the desired output.
Here is another solution, not in place:
df.rename(columns=lambda x: x.split()[-1])
or in place:
df.rename(columns=lambda x: x.split()[-1], inplace = True)
Your way works for me, but there are alternatives:
One way is to split your column names and take the last word:
df.columns = [i.split()[-1] for i in df.columns]
>>> df
   Alabama  Arizona  Washington
0        1        3           2
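One caveat with taking the last word: a state name that itself contains a space would break ('State Income of New York' would become 'York'). Splitting on the known prefix is safer in that case (a hedged variant, not from the answer):
df.columns = [c.split('State Income of ')[-1] for c in df.columns]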
You can use the re module for this:
>>> import pandas as pd
>>> df = pd.DataFrame({'State Income of Alabama':[1],
... 'State Income of Washington':[2],
... 'State Income of Arizona':[3]})
>>>
>>> import re
>>> df.columns = [re.sub('State Income of ', '', col) for col in df]
>>> df
Alabama Washington Arizona
0 1 2 3
re.sub('State Income of', '', col) will replace any occurrence of 'State Income of' with an empty string (with "nothing," effectively) in the string col.
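On Python 3.9+ there is also str.removeprefix, which avoids regex semantics entirely (a side note, not from the answers above):
df.columns = [c.removeprefix('State Income of ') for c in df.columns]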

Pandas - how to remove spaces in each column in a dataframe?

I'm trying to remove spaces, apostrophes, and double quotes in each column's data using this for loop:
for c in data.columns:
    data[c] = data[c].str.strip().replace(',', '').replace('\'', '').replace('\"', '').strip()
but I keep getting this error:
AttributeError: 'Series' object has no attribute 'strip'
data is the dataframe and was obtained from an Excel file:
xl = pd.ExcelFile('test.xlsx')
data = xl.parse(sheetname='Sheet1')
Am I missing something? I added the str but that didn't help. Is there a better way to do this?
I don't want to address columns by label, like data['column label'], because the text can differ. I would like to iterate over each column and remove the characters mentioned above.
incoming data:
id city country
1 Ontario Canada
2 Calgary ' Canada'
3 'Vancouver Canada
desired output:
id city country
1 Ontario Canada
2 Calgary Canada
3 Vancouver Canada
UPDATE: using your sample DF:
In [80]: df
Out[80]:
id city country
0 1 Ontario Canada
1 2 Calgary ' Canada'
2 3 'Vancouver Canada
In [81]: df.replace(r'[,\"\']','', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)
Out[81]:
id city country
0 1 Ontario Canada
1 2 Calgary Canada
2 3 Vancouver Canada
OLD answer:
you can use DataFrame.replace() method:
In [75]: df.to_dict('r')
Out[75]:
[{'a': ' x,y ', 'b': 'a"b"c', 'c': 'zzz'},
{'a': "x'y'z", 'b': 'zzz', 'c': ' ,s,,'}]
In [76]: df
Out[76]:
a b c
0 x,y a"b"c zzz
1 x'y'z zzz ,s,,
In [77]: df.replace(r'[,\"\']','', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)
Out[77]:
a b c
0 xy abc zzz
1 xyz zzz s
r'\1' is a backreference to the first numbered capturing RegEx group.
data[c] does not return a single value; it returns a Series (a whole column of data), which is why the trailing .strip() fails. You can apply the strip operation to an entire column with the .str accessor or with df.apply.
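A minimal sketch of that per-column fix, assuming every column should be treated as text (the .str accessor keeps the operations vectorized):
for c in data.columns:
    data[c] = (data[c].astype(str)
                      .str.replace('[,\'"]', '', regex=True)  # drop commas and quotes
                      .str.strip())                           # trim surrounding whitespace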
