Pandas - how to remove spaces in each column in a dataframe? - python

I'm trying to remove spaces, apostrophes, and double quotes from each column's data using this for loop:
for c in data.columns:
    data[c] = data[c].str.strip().replace(',', '').replace('\'', '').replace('\"', '').strip()
but I keep getting this error:
AttributeError: 'Series' object has no attribute 'strip'
data is the dataframe and was obtained from an Excel file:
xl = pd.ExcelFile('test.xlsx');
data = xl.parse(sheetname='Sheet1')
Am I missing something? I added the .str but that didn't help. Is there a better way to do this?
I don't want to use the column labels, like data['column label'], because the text can be different. I would like to iterate over each column and remove the characters mentioned above.
incoming data:
id city country
1 Ontario Canada
2 Calgary ' Canada'
3 'Vancouver Canada
desired output:
id city country
1 Ontario Canada
2 Calgary Canada
3 Vancouver Canada

UPDATE: using your sample DF:
In [80]: df
Out[80]:
id city country
0 1 Ontario Canada
1 2 Calgary ' Canada'
2 3 'Vancouver Canada
In [81]: df.replace(r'[,\"\']','', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)
Out[81]:
id city country
0 1 Ontario Canada
1 2 Calgary Canada
2 3 Vancouver Canada
OLD answer:
you can use DataFrame.replace() method:
In [75]: df.to_dict('r')
Out[75]:
[{'a': ' x,y ', 'b': 'a"b"c', 'c': 'zzz'},
{'a': "x'y'z", 'b': 'zzz', 'c': ' ,s,,'}]
In [76]: df
Out[76]:
a b c
0 x,y a"b"c zzz
1 x'y'z zzz ,s,,
In [77]: df.replace(r'[,\"\']','', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)
Out[77]:
a b c
0 xy abc zzz
1 xyz zzz s
r'\1' refers to the numbered capturing RegEx group (here, the first group).

data[c] does not return a single value; it returns a Series (a whole column of data). You can apply the strip operation to an entire column with the .str accessor or with df.apply; a sketch follows.
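For example, a minimal sketch of the per-column loop using the .str accessor for every string operation (this assumes the cleaned columns hold strings; numeric columns such as id are skipped via select_dtypes):
import pandas as pd

df = pd.DataFrame({'city': ['Ontario', 'Calgary', "'Vancouver"],
                   'country': ['Canada', "' Canada'", 'Canada']})

# every string method has to go through the .str accessor of the Series
for c in df.select_dtypes('object').columns:
    df[c] = (df[c]
             .str.replace('[,\'"]', '', regex=True)  # drop commas, apostrophes, double quotes
             .str.strip())                           # then trim surrounding whitespace

print(df)
#         city country
# 0    Ontario  Canada
# 1    Calgary  Canada
# 2  Vancouver  Canada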

Related

How can I count the # of occurrences over more than one column (e.g. city & country)?

Given the following data ...
city country
0 London UK
1 Paris FR
2 Paris US
3 London UK
... I'd like a count of each city-country pair
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
The following works but feels like a hack:
df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')], columns=['city', 'country'])
df.assign(**{'n': 1}).groupby(['city', 'country']).count().reset_index()
I'm assigning an additional column n of all 1s, grouping on city&country, and then count()ing occurrences of this new 'all 1s' column. It works, but adding a column just to count it feels wrong.
Is there a cleaner solution?
There is a better way: use value_counts:
df.value_counts().reset_index(name='n')
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
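As a side note (not part of the original answer), the same counts can also come from groupby(...).size(), which likewise avoids the dummy column; a minimal sketch against the sample df:
import pandas as pd

df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')],
                  columns=['city', 'country'])

# size() counts the rows in each group, so no helper column is needed
out = df.groupby(['city', 'country']).size().reset_index(name='n')
print(out)
#      city country  n
# 0  London      UK  2
# 1   Paris      FR  1
# 2   Paris      US  1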

Transform one row to a data frame with multiple rows

I have a data frame containing one row:
df_1D = pd.DataFrame({'Day1': [5],
                      'Day2': [6],
                      'Day3': [7],
                      'ID': ['AB12'],
                      'Country': ['US'],
                      'Destination_A': ['Miami'],
                      'Destination_B': ['New York'],
                      'Destination_C': ['Chicago'],
                      'First_Agent': ['Jim'],
                      'Second_Agent': ['Ron'],
                      'Third_Agent': ['Cynthia']},
                     )
Day1 Day2 Day3 ID ... Destination_C First_Agent Second_Agent Third_Agent
0 5 6 7 AB12 ... Chicago Jim Ron Cynthia
I'm wondering if there's an easy way to transform it into a dataframe with three rows, as shown here:
Day ID Country Destination Agent
0 5 AB12 US Miami Jim
1 6 AB12 US New York Ron
2 7 AB12 US Chicago Cynthia
Have you tried to pivot it with the .pivot function? https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html
One option using reshaping, which only requires knowing the final columns:
# define final columns
cols = ['Day', 'ID', 'Destination', 'Country', 'Agent']
# the part below is automatic
# ------
# extract the keywords
pattern = f"({'|'.join(cols)})"
new = df_1D.columns.str.extract(pattern)[0]
# and reshape
out = (df_1D
       .set_axis(pd.MultiIndex.from_arrays([new, new.groupby(new).cumcount()]), axis=1)
       .loc[0].unstack(0).ffill()[cols]
       )
Output:
Day ID Destination Country Agent
0 5 AB12 Miami US Jim
1 6 AB12 New York US Ron
2 7 AB12 Chicago US Cynthia
Alternative, defining idx/cols separately:
idx = ['ID', 'Country']
cols = ['Day', 'Destination', 'Agent']
df2 = df_1D.set_index(idx)
pattern = f"({'|'.join(cols)})"
new = df2.columns.str.extract(pattern)[0]
out = (df2
       .set_axis(pd.MultiIndex.from_arrays([new, new.groupby(new).cumcount().astype(str)],
                                           names=[None, None]),
                 axis=1)
       .stack().reset_index(idx)
       )
columns_day = [col for col in df_1D if col.startswith('Day')]
columns_dest = [col for col in df_1D if col.startswith('Destination')]
columns_agent = [col for col in df_1D if 'Agent' in col]
new_df = pd.DataFrame()
new_df['Day'] = df_1D[columns_day].values.tolist()[0]
new_df['ID'] = list(df_1D['ID']) * len(new_df)
new_df['Country'] = list(df_1D['Country']) * len(new_df)
new_df['Destination'] = df_1D[columns_dest].values.tolist()[0]
new_df['Agent'] = df_1D[columns_agent].values.tolist()[0]
Out:
Day ID Country Destination Agent
0 5 AB12 US Miami Jim
1 6 AB12 US New York Ron
2 7 AB12 US Chicago Cynthia
You can use it however many Destination, Day, or Agent columns there are.
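As another alternative (a hedged sketch, not from the answers above), pandas also ships pd.lreshape, which takes an explicit mapping from each long-format column to the wide columns it should collect; this assumes the column names of the sample df_1D:
import pandas as pd

# each key becomes a long-format column built from the listed wide columns;
# columns not mentioned in groups (ID, Country) are repeated for every resulting row
groups = {
    'Day': ['Day1', 'Day2', 'Day3'],
    'Destination': ['Destination_A', 'Destination_B', 'Destination_C'],
    'Agent': ['First_Agent', 'Second_Agent', 'Third_Agent'],
}
out = pd.lreshape(df_1D, groups)
print(out)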
One option is with pivot_longer from pyjanitor, where for this case, you pass a list of regexes to names_pattern, and the new column names to names_to:
# pip install pyjanitor
import janitor
import pandas as pd
(df_1D
 .pivot_longer(
     index=['ID', 'Country'],
     names_to=['Day', 'Destination', 'Agent'],
     names_pattern=['Day', 'Destination', 'Agent'])
)
ID Country Day Destination Agent
0 AB12 US 5 Miami Jim
1 AB12 US 6 New York Ron
2 AB12 US 7 Chicago Cynthia
I don't think there is a way to do this fully automatically; it requires some manual manipulation. This is the shortest code that comes to my mind (d below is presumably the dictionary that df_1D was built from). Feel free to comment:
d1 = {}
for k in ['Day', 'Destination', 'Agent']:
    d1[k] = [d[i][0] for i in d.keys() if k in i]
for k in ['ID', 'Country']:
    d1[k] = d[k] * len(d1['Day'])
d1 = pd.DataFrame(d1)
Output:
   Day Destination    Agent    ID Country
0    5       Miami      Jim  AB12      US
1    6    New York      Ron  AB12      US
2    7     Chicago  Cynthia  AB12      US
Hope this helps.

How to iterate over pandas dataframe rows, find a string and separate it into columns?

So here is my issue: I have a dataframe df with a column "Info" like this:
0 US[edit]
1 Boston(B1)
2 Washington(W1)
3 Chicago(C1)
4 UK[edit]
5 London(L2)
6 Manchester(L2)
I would like to put all the strings containing "[edit]" into a separate column df['state']; the remaining strings should go into another column df['city']. I want to do some cleanup too and remove the things in [] and (). This is what I tried:
for ind, row in df.iterrows():
    if df['Info'].str.contains('[ed', regex=False):
        df['state'] = df['info'].str.split('\[|\(').str[0]
    else:
        df['city'] = df['info'].str.split('\[|\(').str[0]
At the end I would like to have something like this
US Boston
US Washington
US Chicago
UK London
UK Manchester
When I try this I always get "The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()"
Any help? Thanks!!
The error happens because df['Info'].str.contains(...) returns a whole boolean Series, which if cannot evaluate as a single truth value. Instead, work on full columns: use Series.where with forward filling of missing values for the state column, assign the Series s for the city column, and then filter by boolean indexing with the inverted mask ~:
m = df['Info'].str.contains('[ed', regex=False)
s = df['Info'].str.split(r'\[|\(').str[0]
df['state'] = s.where(m).ffill()
df['city'] = s
df = df[~m]
print (df)
Info state city
1 Boston(B1) US Boston
2 Washington(W1) US Washington
3 Chicago(C1) US Chicago
5 London(L2) UK London
6 Manchester(L2) UK Manchester
If you want, you can also remove the original column by using DataFrame.pop:
m = df['Info'].str.contains('[ed', regex=False)
s = df.pop('Info').str.split(r'\[|\(').str[0]
df['state'] = s.where(m).ffill()
df['city'] = s
df = df[~m]
print (df)
state city
1 US Boston
2 US Washington
3 US Chicago
5 UK London
6 UK Manchester
I would do:
s = df.Info.str.extract(r'([\w\s]+)(\[edit\])?')
df['city'] = s[0]
df['state'] = s[0].mask(s[1].isna()).ffill()
df = df[s[1].isna()]
Output:
             Info        city state
1      Boston(B1)      Boston    US
2  Washington(W1)  Washington    US
3     Chicago(C1)     Chicago    US
5      London(L2)      London    UK
6  Manchester(L2)  Manchester    UK

Listing unique value counts per groups in pandas dataframe

I am new to pandas and python.
I am trying to group items by one column and list the information from the data frame per group.
My dataframe:
B C D E F
1 Honda USA 2000 Washington New
2 Honda USA 2001 Salt Lake Used
3 Ford Canada 2005 Washington New
4 Toyota USA 2010 Ney York Used
5 Honda USA 2001 Salt Lake Used
6 Honda Canada 2011 Salt Lake Crashed
7 Ford Italy 2014 Rome New
I am trying to group my dataframe by column B and count how many of each C, D, E, F value occurs within each group. For example, we see that in column B there are 4 Hondas, which I group together. Then I want to list the following information: USA(3), Canada(1), 2000(1), 2001(2), 2011(1), Washington(1), Salt Lake(3), New(1), Used(2), Crashed(1), and do the same for every group (car make) in column B:
Car Country Year City Condition
1 Honda(4) USA(3) 2000(1) Washington(1) New(1)
Canada(1) 2001(2) Salt Lake(3) Used(2)
2011(1) Crashed(1)
2 Ford(2) Canada(1) 2005(5) Washington(1) New(2)
Italy(1) 2014(1) Rome(1)
...
What I've tried so far:
df.groupby(['B'])
Which gives me back <pandas.core.groupby.generic.DataFrameGroupBy object at 0x11d559080>
At this point, I am not sure how to proceed to get the desired results after grouping on column B.
Thank you for your suggestions.
You need a lambda with a custom function that processes each column separately with Series.value_counts and then joins the index values to the count values:
def f(x):
    x = x.value_counts()
    y = x.index.astype(str) + '(' + x.astype(str) + ')'
    return y.reset_index(drop=True)
df1 = df.groupby(['B']).apply(lambda x: x.apply(f)).reset_index(drop=True)
print (df1)
B C D E F
0 Ford(2) Italy(1) 2014(1) Washington(1) New(2)
1 NaN Canada(1) 2005(1) Rome(1) NaN
2 Honda(4) USA(3) 2001(2) Salt Lake(3) Used(2)
3 NaN Canada(1) 2011(1) Washington(1) Crashed(1)
4 NaN NaN 2000(1) NaN New(1)
5 Toyota(1) USA(1) 2010(1) Ney York(1) Used(1)
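To illustrate what f does to a single column (a standalone sketch, not part of the original answer), value_counts produces the per-value counts and the string join glues each value to its count:
import pandas as pd

s = pd.Series(['USA', 'USA', 'Canada', 'USA'], name='C')
counts = s.value_counts()

# pair every distinct value with how often it occurs, e.g. 'USA(3)'
labeled = [f'{value}({count})' for value, count in counts.items()]
print(labeled)  # ['USA(3)', 'Canada(1)']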

Removing non-alphanumeric symbols in dataframe

How do I remove non-alphanumeric characters from the values in the dataframe? I only managed to convert everything to lower case.
def doubleAwardList(self):
    dfwinList = pd.DataFrame()
    dfloseList = pd.DataFrame()
    dfwonandLost = pd.DataFrame()
    # self.dfWIN and self.dfLOSE are just the dataframes loaded from the files chosen by the user
    groupby_name = self.dfWIN.groupby("name")
    groupby_nameList = self.dfLOSE.groupby("name _List")
    list4 = []
    list5 = []
    notAwarded = "na"
    for x, group in groupby_name:
        if x != notAwarded:
            list4.append(str.lower(str(x)))
    dfwinList = pd.DataFrame(list4)
    for x, group in groupby_nameList:
        list5.append(str.lower(str(x)))
    dfloseList = pd.DataFrame(list5)
Data sample: basically I mainly need to remove the full stops and hyphens, as I will need to compare the values against another file; the naming isn't very consistent, so I have to remove the non-alphanumeric characters to get a much more accurate result.
creative-3
smart tech pte. ltd.
nutritive asia
asia's first
desired result:
creative 3
smart tech pte ltd
nutritive asia
asia s first
Use only DataFrame.replace and include the whitespace character in the pattern:
df = df.replace('[^a-zA-Z0-9 ]', '', regex=True)
If it is one column (a Series):
df = pd.DataFrame({'col': ['creative-3', 'smart tech pte. ltd.',
                           'nutritive asia', "asia's first"],
                   'col2': range(4)})
print (df)
col col2
0 creative-3 0
1 smart tech pte. ltd. 1
2 nutritive asia 2
3 asia's first 3
df['col'] = df['col'].replace('[^a-zA-Z0-9 ]', '', regex=True)
print (df)
col col2
0 creative3 0
1 smart tech pte ltd 1
2 nutritive asia 2
3 asias first 3
EDIT:
With multiple columns, it is possible to select only the object (i.e., string) columns and, if necessary, cast them to strings:
cols = df.select_dtypes('object').columns
print (cols)
Index(['col'], dtype='object')
df[cols] = df[cols].astype(str).replace('[^a-zA-Z0-9 ]', '', regex=True)
print (df)
col col2
0 creative3 0
1 smart tech pte ltd 1
2 nutritive asia 2
3 asias first 3
Why not just the below (I also converted to lower case, by the way):
df=df.replace('[^a-zA-Z0-9]', '',regex=True).str.lower()
Then:
print(df)
will show the desired dataframe.
Update:
try:
df = df.apply(lambda x: x.str.replace('[^a-zA-Z0-9]', '', regex=True).str.lower(), axis=0)
If only one column do:
df['your col'] = df['your col'].str.replace('[^a-zA-Z0-9]', '', regex=True).str.lower()
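If the desired result above really needs the removed symbols turned into spaces (e.g. 'creative 3', 'asia s first') rather than dropped entirely, here is a hedged sketch that substitutes a space and then collapses runs of whitespace:
import pandas as pd

df = pd.DataFrame({'col': ['creative-3', 'smart tech pte. ltd.',
                           'nutritive asia', "asia's first"]})

# replace each non-alphanumeric character with a space, squeeze repeats, trim, lower-case
df['col'] = (df['col']
             .str.replace('[^a-zA-Z0-9]', ' ', regex=True)
             .str.replace(r'\s+', ' ', regex=True)
             .str.strip()
             .str.lower())
print(df)
#                   col
# 0          creative 3
# 1  smart tech pte ltd
# 2      nutritive asia
# 3        asia s first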
