df1
ID Col
1 new york, london school of economics, america
2 california & washington, harvard university, america
Expected output is:
df1
ID Col
1 new york,london school of economics,america
2 california & washington,harvard university,america
My try is:
df1['Col'].apply(lambda x: x.str.replace(", ", ",", regex=True))
It is advisable to use the regular expression ,\s+, which captures one or more consecutive whitespace characters after the comma, as in washington, harvard:
df = pd.DataFrame({'ID': [1, 2], 'Col': ['new york, london school of economics, america',
'california & washington, harvard university, america']}).set_index('ID')
df.Col = df.Col.str.replace(r',\s+', ',', regex=True)
print(df)
Col
ID
1 new york,london school of economics,america
2 california & washington,harvard university,ame...
You can use str.replace(', ', ',') instead of a lambda function. However, this will only work if there is exactly one space after the comma.
As Алексей Р mentioned, str.replace(r',\s+', ',', regex=True) is needed to catch any extra spaces after the comma.
Reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html
Example:
import pandas as pd
data_ = ['new york, london school of economics, america', 'california & washington, harvard university, america']
df1 = pd.DataFrame(data_)
df1.columns = ['Col']
df1.index.name = 'ID'
df1.index = df1.index + 1
df1['Col'] = df1['Col'].str.replace(r',\s+', ",", regex=True)
print(df1)
Result:
Col
ID
1 new york,london school of economics,america
2 california & washington,harvard university,ame...
If you specify the axis, it will work:
df.apply(lambda x: x.str.replace(', ', ',', regex=True), axis=1)
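A minimal sketch on the question's frame (with axis=1 each x is a row Series, so the .str accessor is available; df1 is rebuilt here for completeness):
import pandas as pd
df1 = pd.DataFrame({'Col': ['new york, london school of economics, america',
                            'california & washington, harvard university, america']},
                   index=pd.Index([1, 2], name='ID'))
out = df1.apply(lambda x: x.str.replace(', ', ',', regex=True), axis=1)
print(out)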
You can split the string on ',', strip the extra whitespace, and join the list back together.
df1['Col'] = df1['Col'].apply(lambda x: ",".join(w.strip() for w in x.split(',')))
Hope this helps.
I am trying to do a conditional string assignment: if the cell contains the locations, assign the geo names into the cell next to it. I tried np.where and np.select, but they tend to work for single-value assignment rather than multiple-value assignment. Any suggestion on how I can do it through NumPy, or is there an easier way to do this?
Europe = ['London', 'Paris', 'Berlin']
North_America = ['New York', 'Toroto', 'Boston']
Asia = ['Hong Kong', 'Tokyo', 'Singapore']
data = {'location':["London, Paris", "Hong Kong", "London, New York", "Singapore, Toroto", "Boston"]}
df = pd.DataFrame(data)
location
0 London, Paris
1 Hong Kong
2 London, New York
3 Singapore, Toroto
4 Boston
# np.where approach
df['geo'] = np.where(df['location'].isin(Europe) | df['location'].isin(North_America), 'Europe', 'North America')
# np.select approach
conditions = [
df['location'].isin(Europe),
df['location'].isin(North_America)
]
choices = ['Europe', 'North America']
df['geo'] = np.select(conditions, choices, default=0)
Expected output:
location geo
0 London, Paris Europe, Europe
1 Hong Kong Asia
2 London, New York Europe, North America
3 Singapore, Toroto Asia, North America
4 Boston North America
Create a mapping of each city -> area, then use explode and map to apply the mapping and, finally, use groupby and apply to rebuild the comma-separated string:
geo = {'Europe': Europe, 'North_America': North_America, 'Asia': Asia}
mapping = {city: area for area, cities in geo.items() for city in cities}
df['geo'] = df['location'].str.split(', ').explode().map(mapping) \
.groupby(level=0).apply(', '.join)
Output:
>>> df
location geo
0 London, Paris Europe, Europe
1 Hong Kong Asia
2 London, New York Europe, North_America
3 Singapore, Toroto Asia, North_America
4 Boston North_America
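As a simpler per-row alternative, the same mapping dict can be applied with a plain comprehension (a sketch; it assumes every listed city has an entry in mapping):
df['geo'] = df['location'].apply(lambda s: ', '.join(mapping[city] for city in s.split(', ')))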
Using the NumPy library together with plain Python for loops, we can get the result. First we combine the continents' city lists into a single list, then create another list named continents whose length is the same as the combined city list:
import numpy as np
import pandas as pd
continents = ["Europe"] * len(Europe) + ["North_America"] * len(North_America) + ["Asia"] * len(Asia)
countries = Europe + North_America + Asia
locations = data['location']
Then, for each city, including each one inside a combination, we find its index in the combined city list. We also record the number of commas in each combination, for use later when rebuilding the comma-separated output:
corsp = []
comma_nums = []
for i in locations:
for j, k in enumerate(i.split(', ')):
corsp.append(np.where(np.array(countries) == k)[0][0])
comma_nums.append(j)
The continents list is then reordered using the index list we built. Its elements are grouped into sublists matching the combinations found in locations, and finally the sublists are joined into strings, as required for the output:
reordered_continents = [continents[i] for i in corsp]
mod_continents = []
iter = 0
f = 1
for i in comma_nums:
mod_continents.append(reordered_continents[iter:i + f])
iter = i + f
f = iter + 1
for i, j in enumerate(mod_continents):
    mod_continents[i] = ', '.join(j)  # ', '.join also handles one-element lists
df['geo'] = mod_continents
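A quick check on the sample data (a sketch; the geo strings carry the underscore names from the list variables, as in the previous answer's output):
print(df)
#            location                    geo
# 0     London, Paris         Europe, Europe
# 1         Hong Kong                   Asia
# 2  London, New York  Europe, North_America
# 3  Singapore, Toroto    Asia, North_America
# 4            Boston          North_America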
I have a dataframe like this:
Destinations
Paris,Oslo, Paris,Milan, Athens,Amsterdam
Boston,New York, Boston,London, Paris,New York
Nice,Paris, Milan,Paris, Nice,Milan
I want to get the following dataframe (without space between the cities):
Destinations_2 no_destinations
Paris,Oslo,Milan,Athens,Amsterdam 5
Boston,New York,London,Paris 4
Nice,Paris,Milan 3
How to remove duplicates within a cell?
You can use a list comprehension, which is generally faster than apply() (replace Col with your original column name):
df['no_destinations'] = [len(set(a.strip() for a in i.split(','))) for i in df['Col']]
print(df)
Col no_destinations
0 Paris,Oslo, Paris,Milan, Athens,Amsterdam 5
1 Boston,New York, Boston,London, Paris,New York 4
2 Nice,Paris, Milan,Paris, Nice,Milan 3
df['no_destinations'] = df.Destinations.str.split(',').apply(set).apply(len)
If there are spaces in between, use:
df.Destinations.str.split(',').apply(lambda x: list(map(str.strip,x))).apply(set).apply(len)
Output
Destinations no_destinations
0 Paris,Oslo, Paris,Milan, Athens,Amsterdam 5
1 Boston,New York, Boston,London, Paris,New York 4
2 Nice,Paris, Milan,Paris, Nice,Milan 3
# your data:
import pandas as pd
data = {'Destinations': ['Paris,Oslo, Paris,Milan, Athens,Amsterdam',
'Boston,New York, Boston,London, Paris,New York',
'Nice,Paris, Milan,Paris, Nice,Milan']}
df = pd.DataFrame(data)
>>>
Destinations
0 Paris,Oslo, Paris,Milan, Athens,Amsterdam
1 Boston,New York, Boston,London, Paris,New York
2 Nice,Paris, Milan,Paris, Nice,Milan
First: make every row of your column a list.
df.Destinations = df.Destinations.apply(lambda x: x.replace(', ', ',').split(','))
>>>
Destinations
0 [Paris, Oslo, Paris, Milan, Athens, Amsterdam]
1 [Boston, New York, Boston, London, Paris, New York]
2 [Nice, Paris, Milan, Paris, Nice, Milan]
Second: remove the dups from the lists
df.Destinations = df.Destinations.apply(lambda x: list(dict.fromkeys(x)))
# or (order not preserved): df.Destinations = df.Destinations.apply(lambda x: list(set(x)))
>>>
Destinations
0 [Paris, Oslo, Milan, Athens, Amsterdam]
1 [Boston, New York, London, Paris]
2 [Nice, Paris, Milan]
Finally, create the desired columns:
df['no_destinations'] = df.Destinations.apply(lambda x: len(x))
df['Destinations_2'] = df.Destinations.apply(lambda x: ','.join(x))
All steps use apply with lambda functions; you can chain or nest them together if you want, as sketched below.
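For instance, starting again from the raw strings, the three steps chained into one pass (a sketch):
dedup = df.Destinations.apply(lambda x: list(dict.fromkeys(x.replace(', ', ',').split(','))))
df['no_destinations'] = dedup.apply(len)
df['Destinations_2'] = dedup.apply(','.join)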
All the previous answers have addressed only one part of your problem, i.e. showing the unique count (no_destinations). Let me try to answer both of your queries.
The idea below is to apply a function to the Destinations column which returns two values as a Series, Destinations_2 and no_destinations: the unique elements joined by commas with no spaces, and the count of unique elements, respectively.
import pandas as pd
data = {'Destinations': ['Paris,Oslo, Paris,Milan, Athens,Amsterdam',
'Boston,New York, Boston,London, Paris,New York',
'Nice,Paris, Milan,Paris, Nice,Milan'
]}
def remove_dups(x):
    # strip the whitespace around each item, then deduplicate
    data = set(s.strip() for s in x.split(','))
    return pd.Series([','.join(data), len(data)], index=['Destinations_2', 'no_destinations'])
df = pd.DataFrame.from_dict(data)
df[['Destinations_2', 'no_destinations']] = df['Destinations'].apply(remove_dups)
print(df.head())
Note: as you are not concerned with the order, I have used set above. If you need to maintain the order, you will have to replace set with some other logic to remove duplicates, as sketched below.
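For example, dict.fromkeys preserves first-seen order and can stand in for set (a sketch of the same function with order kept):
def remove_dups_ordered(x):
    # dict.fromkeys keeps insertion order, unlike set
    data = list(dict.fromkeys(s.strip() for s in x.split(',')))
    return pd.Series([','.join(data), len(data)], index=['Destinations_2', 'no_destinations'])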
I have a number of columns in a dataframe:
df = pd.DataFrame({'Date':[1990],'State Income of Alabama':[1],
'State Income of Washington':[2],
'State Income of Arizona':[3]})
All headers have the same number of words, and all share the exact same prefix with exactly one space before the state's name.
I want to take out the string 'State Income of ' and leave the state intact as the new header, so they all just read:
Alabama Washington Arizona
1 2 3
I've tried using str.replace on the columns, like:
df.columns = df.columns.str.replace('State Income of ', '')
But this isn't giving me the desired output.
Here is another solution, not in place:
df.rename(columns=lambda x: x.split()[-1])
or in place:
df.rename(columns=lambda x: x.split()[-1], inplace=True)
Your way works for me, but there are alternatives:
One way is to split your column names and take the last word:
df.columns = [i.split()[-1] for i in df.columns]
>>> df
Alabama Arizona Washington
0 1 3 2
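The same idea can be written at the Index level with the .str accessor, without an explicit loop (a sketch):
df.columns = df.columns.str.split().str[-1]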
You can use the re module for this:
>>> import pandas as pd
>>> df = pd.DataFrame({'State Income of Alabama':[1],
... 'State Income of Washington':[2],
... 'State Income of Arizona':[3]})
>>>
>>> import re
>>> df.columns = [re.sub('State Income of ', '', col) for col in df]
>>> df
Alabama Washington Arizona
0 1 2 3
re.sub('State Income of ', '', col) will replace any occurrence of 'State Income of ' with an empty string (with "nothing," effectively) in the string col.
CSV file: (sample1.csv)
Location_City, Location_State, Name, hobbies
Los Angeles, CA, John, "['Music', 'Running']"
Texas, TX, Jack, "['Swimming', 'Trekking']"
I want to convert the hobbies column of the CSV into the following output:
Location_City, Location_State, Name, hobbies
Los Angeles, CA, John, Music
Los Angeles, CA, John, Running
Texas, TX, Jack, Swimming
Texas, TX, Jack, Trekking
I have read the CSV into a dataframe, but I don't know how to convert it:
data = pd.read_csv("sample1.csv")
df = pd.DataFrame(data)
df
You can use findall or extractall to get lists from the hobbies column, then flatten them with chain.from_iterable and repeat the other columns:
a = df['hobbies'].str.findall("'(.*?)'").astype(object)
lens = a.str.len()
from itertools import chain
df1 = pd.DataFrame({
'Location_City' : df['Location_City'].values.repeat(lens),
'Location_State' : df['Location_State'].values.repeat(lens),
'Name' : df['Name'].values.repeat(lens),
'hobbies' : list(chain.from_iterable(a.tolist())),
})
Or create a Series with extractall, remove its second index level, and join it to the original DataFrame:
df1 = (df.join(df.pop('hobbies').str.extractall("'(.*?)'")[0]
.reset_index(level=1, drop=True)
.rename('hobbies'))
.reset_index(drop=True))
print(df1)
Location_City Location_State Name hobbies
0 Los Angeles CA John Music
1 Los Angeles CA John Running
2 Texas TX Jack Swimming
3 Texas TX Jack Trekking
We can solve this using the pandas.DataFrame.explode function, which was introduced in version 0.25.0. If you have that version or higher, you can use the code below.
explode function reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html
import pandas as pd
import ast
data = {
'Location_City': ['Los Angeles','Texas'],
'Location_State': ['CA','TX'],
'Name': ['John','Jack'],
'hobbies': ["['Music', 'Running']", "['Swimming', 'Trekking']"]
}
df = pd.DataFrame(data)
# Convert the string representation of a list into an actual list object
df['hobbies'] = df['hobbies'].apply(ast.literal_eval)
# Exploding the list
df = df.explode('hobbies')
print(df)
Location_City Location_State Name hobbies
0 Los Angeles CA John Music
0 Los Angeles CA John Running
1 Texas TX Jack Swimming
1 Texas TX Jack Trekking
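Since the data comes from a CSV, the stringified lists can also be parsed at read time via read_csv's converters argument (a sketch; assumes sample1.csv as shown in the question, with skipinitialspace handling the spaces after the delimiters):
import ast
import pandas as pd
df = pd.read_csv("sample1.csv", skipinitialspace=True, converters={'hobbies': ast.literal_eval})
df = df.explode('hobbies')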
How do I remove non-alphabetic characters from the values in the dataframe? So far I have only managed to convert everything to lower case.
def doubleAwardList(self):
    dfwinList = pd.DataFrame()
    dfloseList = pd.DataFrame()
    dfwonandLost = pd.DataFrame()
    # self.dfWIN... and self.dfLOSE... are just the dataframes loaded from the files chosen by the user
    groupby_name = self.dfWIN.groupby("name")
    groupby_nameList = self.dfLOSE.groupby("name _List")
    list4 = []
    list5 = []
    notAwarded = "na"
    for x, group in groupby_name:
        if x != notAwarded:
            list4.append(str.lower(str(x)))
    dfwinList = pd.DataFrame(list4)
    for x, group in groupby_nameList:
        list5.append(str.lower(str(x)))
    dfloseList = pd.DataFrame(list5)
Data sample (basically I mainly need to remove the full stops and hyphens, since I need to compare the values against another file; the naming isn't very consistent, so I had to remove the non-alphanumeric characters for a much more accurate result):
creative-3
smart tech pte. ltd.
nutritive asia
asia's first
desired result:
creative 3
smart tech pte ltd
nutritive asia
asia s first
Use DataFrame.replace only, and include the space character in the pattern so existing spaces are kept:
df = df.replace('[^a-zA-Z0-9 ]', '', regex=True)
If it is one column (a Series):
df = pd.DataFrame({'col': ['creative-3', 'smart tech pte. ltd.',
'nutritive asia', "asia's first"],
'col2':range(4)})
print(df)
col col2
0 creative-3 0
1 smart tech pte. ltd. 1
2 nutritive asia 2
3 asia's first 3
df['col'] = df['col'].replace('[^a-zA-Z0-9 ]', '', regex=True)
print(df)
col col2
0 creative3 0
1 smart tech pte ltd 1
2 nutritive asia 2
3 asias first 3
EDIT:
If there are multiple columns, you can select only the object (typically string) columns and, if necessary, cast them to strings:
cols = df.select_dtypes('object').columns
print (cols)
Index(['col'], dtype='object')
df[cols] = df[cols].astype(str).replace('[^a-zA-Z0-9 ]', '', regex=True)
print(df)
col col2
0 creative3 0
1 smart tech pte ltd 1
2 nutritive asia 2
3 asias first 3
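Note that the desired result in the question keeps a space where the punctuation was ('creative 3', 'asia s first'). A sketch for that variant: replace non-alphanumerics with a space, then collapse runs of whitespace:
df['col'] = (df['col'].replace('[^a-zA-Z0-9 ]', ' ', regex=True)
                      .str.replace(r'\s+', ' ', regex=True)
                      .str.strip())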
Why not just the below? (I did make it lower case, btw; note this works when df is a single column, i.e. a Series.)
df = df.replace('[^a-zA-Z0-9]', '', regex=True).str.lower()
Then:
print(df)
will print the desired dataframe.
Update:
try:
df = df.apply(lambda x: x.str.replace('[^a-zA-Z0-9]', '', regex=True).str.lower(), axis=0)
If it is only one column, do:
df['your col'] = df['your col'].str.replace('[^a-zA-Z0-9]', '', regex=True).str.lower()