Removing strings in a series of headers - python

I have a number of columns in a dataframe:
df = pd.DataFrame({'Date': [1990],
                   'State Income of Alabama': [1],
                   'State Income of Washington': [2],
                   'State Income of Arizona': [3]})
All headers have the same wording and the same number of words, with exactly one space before the state's name.
I want to take out the string 'State Income of ' and leave the state name intact as the new header for each column, so they all just read:
Alabama Washington Arizona
1 2 3
I've tried replacing the column names in pandas like this:
df.columns = df.columns.str.replace('State Income of ', '')
But this isn't giving me the desired output.

Here is another solution, not in place:
df.rename(columns=lambda x: x.split()[-1])
or in place:
df.rename(columns=lambda x: x.split()[-1], inplace = True)
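Either variant should produce the shortened headers on the sample frame (Date contains no space, so split() leaves it unchanged). A quick check, assuming the df from the question:
# Not in place; assign the result back if you want to keep it.
renamed = df.rename(columns=lambda x: x.split()[-1])
print(list(renamed.columns))  # expected: ['Date', 'Alabama', 'Washington', 'Arizona']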

Your way works for me, but there are alternatives:
One way is to split your column names and take the last word:
df.columns = [i.split()[-1] for i in df.columns]
>>> df
Alabama Arizona Washington
0 1 3 2
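For what it's worth, the str.replace approach from the question produces the desired headers for me as well (a minimal check on the question's data; str.replace returns a new Index, so it has to be assigned back to df.columns, as in the question):
import pandas as pd

df = pd.DataFrame({'Date': [1990],
                   'State Income of Alabama': [1],
                   'State Income of Washington': [2],
                   'State Income of Arizona': [3]})

# str.replace returns a new Index; assigning it back renames the columns.
df.columns = df.columns.str.replace('State Income of ', '', regex=False)
print(list(df.columns))  # ['Date', 'Alabama', 'Washington', 'Arizona']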

You can use the re module for this:
>>> import pandas as pd
>>> df = pd.DataFrame({'State Income of Alabama':[1],
... 'State Income of Washington':[2],
... 'State Income of Arizona':[3]})
>>>
>>> import re
>>> df.columns = [re.sub('State Income of ', '', col) for col in df]
>>> df
Alabama Washington Arizona
0 1 2 3
re.sub('State Income of ', '', col) replaces every occurrence of 'State Income of ' in the string col with an empty string (effectively with nothing).

Remove space between string after comma in python dataframe column

df1
ID Col
1 new york, london school of economics, america
2 california & washington, harvard university, america
Expected output is:
df1
ID Col
1 new york,london school of economics,america
2 california & washington,harvard university,america
My try is:
df1['Col'].apply(lambda x : x.str.replace(", ","", regex=True))
It is advisable to use the regular expression ,\s+, which matches one or more consecutive whitespace characters after a comma, as in washington, harvard.
df = pd.DataFrame({'ID': [1, 2],
                   'Col': ['new york, london school of economics, america',
                           'california & washington, harvard university, america']}).set_index('ID')
df.Col = df.Col.str.replace(r',\s+', ',', regex=True)
print(df)
Col
ID
1 new york,london school of economics,america
2 california & washington,harvard university,ame...
You can use str.replace(', ', ",") instead of a lambda function. However, this will only work if there is only one space after ",".
As Алексей Р mentioned, (r',\s+', ",", regex=True) is needed to catch any extra spaces after ",".
Reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html
Example:
import pandas as pd
data_ = ['new york, london school of economics, america', 'california & washington, harvard university, america']
df1 = pd.DataFrame(data_)
df1.columns = ['Col']
df1.index.name = 'ID'
df1.index = df1.index + 1
df1['Col'] = df1['Col'].str.replace(r',\s+', ",", regex=True)
print(df1)
Result:
Col
ID
1 new york,london school of economics,america
2 california & washington,harvard university,ame...
If you specify the axis, it works:
df.apply(lambda x: x.str.replace(', ',',',regex=True),axis=1)
You can split the string on ',', strip the extra whitespace from each piece, and join the list back together:
df1['Col'] = df1['Col'].apply(lambda x: ",".join([w.strip() for w in x.split(',')]))
Hope this helps.

frequency of string (comma separated) in Python

I'm trying to find the frequency of strings from the field "Select Investors" on this website https://www.cbinsights.com/research-unicorn-companies
Is there a way to pull out the frequency of each of the comma separated strings?
For example, how frequent does the term "Sequoia Capital China" show up?
The solution provided by @Mazhar checks whether a certain term is a substring of a comma-delimited string. As a consequence, the number of occurrences of 'Sequoia Capital' returned by that approach is the sum of the occurrences of all the strings that contain 'Sequoia Capital', namely 'Sequoia Capital', 'Sequoia Capital China', 'Sequoia Capital India', 'Sequoia Capital Israel' and 'and Sequoia Capital China'. The following code avoids that issue:
import pandas as pd
from collections import defaultdict
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]
freqs = defaultdict(int)
for group in df['Select Investors']:
    if hasattr(group, 'lower'):
        for raw_investor in group.lower().split(','):
            investor = raw_investor.strip()
            # Ignore empty strings produced by wrong data like this:
            # 'B Capital Group,, GE Ventures, McKesson Ventures'
            if investor:
                freqs[investor] += 1
Demo
In [57]: freqs['sequoia capital']
Out[57]: 41
In [58]: freqs['sequoia capital china']
Out[58]: 46
In [59]: freqs['sequoia capital india']
Out[59]: 25
In [60]: freqs['sequoia capital israel']
Out[60]: 2
In [61]: freqs['and sequoia capital china']
Out[61]: 1
The sum of occurrences is 115, which coincides with the frequency returned for 'sequoia capital' by the currently accepted solution.
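For comparison, here is a minimal sketch of the substring-style count described above (an assumed reconstruction, not the accepted answer's exact code). It counts every comma-separated entry that merely contains the term, which is why its result equals the sum of the individual frequencies:
# Substring-based count: 'sequoia capital' also matches 'sequoia capital china',
# 'sequoia capital india', and so on.
term = 'sequoia capital'
substring_count = sum(
    term in investor
    for group in df['Select Investors']
    if hasattr(group, 'lower')
    for investor in (part.strip() for part in group.lower().split(','))
)
print(substring_count)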
Here is a correct, more Pythonic way:
import itertools
import collections
import pandas as pd
def fun(x):
    x = map(lambda y: y.strip().lower(), str(x).split(','))
    return filter(lambda y: y and y != 'nan', x)
# Extract data
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]
# Process
investor = first_df['Select Investors'].apply(lambda x: fun(x))
investor = investor.values.flatten()
investor = list(itertools.chain(*investor))
# Organize
final_data = collections.Counter(investor).items()
final_df = pd.DataFrame(final_data, columns=['Investor', 'Frequency'])
final_df
Output:
Investor Frequency
0 Sequoia Capital China 46
1 SIG Asia Investments 3
2 Sina Weibo 2
3 Softbank Group 9
4 Founders Fund 16
... ... ...
1187 Motive Partners. Apollo Global Management 1
1188 JBV Capital 1
1189 Array Ventures 1
1190 AWZ Ventures 1
1191 Endiya Partners 1
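If you want the most frequent investors at the top, the same final_df can simply be sorted:
# Sort so the most common investors come first.
final_df = final_df.sort_values('Frequency', ascending=False).reset_index(drop=True)
print(final_df.head(10))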

Python - Group dataframe based on certain string

I am trying to combine these strings and rows according to certain logic:
s1 = ['abc.txt','abc.txt','ert.txt','ert.txt','ert.txt']
s2 = [1,1,2,2,2]
s3 = ['Harry Potter','Vol 1','Lord of the Rings - Vol 1',np.nan,'Harry Potter']
df = pd.DataFrame(list(zip(s1, s2, s3)),
                  columns=['file', 'id', 'book'])
df
Data Preview:
file id book
abc.txt 1 Harry Potter
abc.txt 1 Vol 1
ert.txt 2 Lord of the Rings
ert.txt 2 NaN
ert.txt 2 Harry Potter
I have a bunch of file names in one column, with ids associated with them. In the 'book' column, 'Vol 1' has ended up in a separate row.
I know that this 'Vol 1' is only associated with 'Harry Potter' in the given dataset.
Grouping by 'file' and 'id', how do I combine 'Vol 1' into the same row where the 'Harry Potter' string appears?
Notice that not every row has a 'Vol 1' for Harry Potter; I only want to add 'Vol 1' when looking at the file & id groupby.
2 Tries:
1st: doesn't work
if (df['book'] == 'Harry Potter' and df['book'].str.contains('Vol 1',case=False) in df.groupby(['file','id'])):
    df.groupby(['file','id'],as_index=False).first()
2nd: this applies to every row (but I don't want it applied to every 'Harry Potter' string):
df.loc[df['book'].str.contains('Harry Potter',case=False,na=False), 'new_book'] = 'Harry Potter - Vol 1'
Here is the output I am looking for:
file id book
abc.txt 1 Harry Potter - Vol 1
ert.txt 2 Lord of the Rings - Vol 1
ert.txt 2 NaN
ert.txt 2 Harry Potter
Start from import re (you will use it).
Then create your DataFrame:
df = pd.DataFrame({
    'file': ['abc.txt', 'abc.txt', 'ert.txt', 'ert.txt', 'ert.txt'],
    'id': [1, 1, 2, 2, 2],
    'book': ['Harry Potter', 'Vol 1', 'Lord of the Rings - Vol 1',
             np.nan, 'Harry Potter']})
The first processing step is to add a column, let's call it book2,
containing the book value from the next row:
df["book2"] = df.book.shift(-1).fillna('')
I added fillna('') to replace NaN values with an empty string.
Then define a function to be applied to each row:
def fn(row):
    return f"{row.book} - {row.book2}" if row.book == 'Harry Potter' \
        and re.match(r'^Vol \d+$', row.book2) else row.book
This function checks whether book == "Harry Potter" and book2 matches
"Vol " + a sequence of digits.
If it does, it returns book + book2, otherwise it returns just book.
Then we apply this function and save the result back under book:
df["book"] = df.apply(fn, axis=1)
The only remaining steps are to drop the rows where book matches Vol \d+ and to drop the book2 column.
The code is:
df = df.drop(df[df.book.str.match(r'^Vol \d+$').fillna(False)].index)\
.drop(columns=['book2'])
fillna(False) is needed because str.match returns NaN when the source content is NaN.
Assuming that "Vol x" occurs on the row following the title, I would use an auxilliary Series obtained by shifting the book column by -1. It is then enough to combine that Series with the book column when it starts with "Vol " and drop lines where the books column starts with "Vol ". Code could be:
b2 = df.book.shift(-1).fillna('')
df['book'] = df.book + np.where(b2.str.match('Vol [0-9]+'), ' - ' + b2, '')
print(df.drop(df.loc[df.book.fillna('').str.match('Vol [0-9]+')].index))
If the order in the dataframe is not guaranteed, but a "Vol x" row can be matched to the other row in the dataframe with the same file and id, you can split the dataframe into two parts, one containing the "Vol x" rows and one containing the others, and update the latter from the former:
g = df.groupby(df.book.fillna('').str.match('Vol [0-9]+'))
for k, v in g:
    if k:
        df_vol = v
    else:
        df = v
for row in df_vol.iterrows():
    r = row[1]
    df.loc[(df.file == r.file) & (df.id == r.id), 'book'] += ' - ' + r['book']
Utilizing merge, apply, update, and drop_duplicates.
set_index on file and id, merge the 'Harry Potter' rows with the 'Vol 1' rows on that index, join the strings, and convert the result back to a dataframe:
df.set_index(['file', 'id'], inplace=True)
df1 = (df[df['book'] == 'Harry Potter']
       .merge(df[df['book'] == 'Vol 1'], left_index=True, right_index=True)
       .apply(' '.join, axis=1)
       .to_frame(name='book'))
Out[2059]:
book
file id
abc.txt 1 Harry Potter Vol 1
Update the original df, drop_duplicates, and reset_index:
df.update(df1)
df.drop_duplicates().reset_index()
Out[2065]:
file id book
0 abc.txt 1 Harry Potter Vol 1
1 ert.txt 2 Lord of the Rings - Vol 1
2 ert.txt 2 NaN
3 ert.txt 2 Harry Potter

Removing non-alphanumeric symbols in dataframe

How do I remove non-alphanumeric characters from the values in the dataframe? I have only managed to convert everything to lower case.
def doubleAwardList(self):
    dfwinList = pd.DataFrame()
    dfloseList = pd.DataFrame()
    dfwonandLost = pd.DataFrame()
    # self.dfWIN and self.dfLOSE are just the dataframes built from the files chosen by the user
    groupby_name = self.dfWIN.groupby("name")
    groupby_nameList = self.dfLOSE.groupby("name _List")
    list4 = []
    list5 = []
    notAwarded = "na"
    for x, group in groupby_name:
        if x != notAwarded:
            list4.append(str.lower(str(x)))
    dfwinList = pd.DataFrame(list4)
    for x, group in groupby_nameList:
        list5.append(str.lower(str(x)))
    dfloseList = pd.DataFrame(list5)
Data sample: basically I mainly need to remove the full stops and hyphens, because I will need to compare the values to another file, but the naming isn't very consistent, so I had to remove the non-alphanumeric characters to get a much more accurate result.
creative-3
smart tech pte. ltd.
nutritive asia
asia's first
desired result:
creative 3
smart tech pte ltd
nutritive asia
asia s first
Use DataFrame.replace only, and add whitespace to the pattern:
df = df.replace('[^a-zA-Z0-9 ]', '', regex=True)
If one column - Series:
df = pd.DataFrame({'col': ['creative-3', 'smart tech pte. ltd.',
                           'nutritive asia', "asia's first"],
                   'col2': range(4)})
print (df)
col col2
0 creative-3 0
1 smart tech pte. ltd. 1
2 nutritive asia 2
3 asia's first 3
df['col'] = df['col'].replace('[^a-zA-Z0-9 ]', '', regex=True)
print (df)
col col2
0 creative3 0
1 smart tech pte ltd 1
2 nutritive asia 2
3 asias first 3
EDIT:
If there are multiple columns, it is possible to select only object columns (obviously the string columns) and, if necessary, cast them to strings:
cols = df.select_dtypes('object').columns
print (cols)
Index(['col'], dtype='object')
df[cols] = df[cols].astype(str).replace('[^a-zA-Z0-9 ]', '', regex=True)
print (df)
col col2
0 creative3 0
1 smart tech pte ltd 1
2 nutritive asia 2
3 asias first 3
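Note that the desired result in the question keeps a space where the punctuation was ('creative-3' becomes 'creative 3', "asia's first" becomes 'asia s first'). If that is the behaviour you want, a small variant of the same idea replaces the unwanted characters with a space and then collapses repeated whitespace:
df['col'] = (df['col'].astype(str)
                      .replace('[^a-zA-Z0-9 ]', ' ', regex=True)   # punctuation -> space
                      .str.replace(r'\s+', ' ', regex=True)        # collapse runs of spaces
                      .str.strip())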
Why not just the below (I did convert to lower case, by the way):
df = df.replace('[^a-zA-Z0-9]', '', regex=True).apply(lambda x: x.str.lower())
Then now:
print(df)
Will get the desired data-frame
Update:
try:
df = df.apply(lambda x: x.str.replace('[^a-zA-Z0-9]', '', regex=True).str.lower(), axis=0)
If only one column do:
df['your col'] = df['your col'].str.replace('[^a-zA-Z0-9]', '', regex=True).str.lower()

How do I use a mapping variable to re-index a dataframe?

I have the following data frame:
population GDP
country
United Kingdom 4.5m 10m
Spain 3m 8m
France 2m 6m
I also have the following information in a two-column dataframe (happy for this to be made into another data structure if that would be more beneficial, as the plan is that it will be stored in a VARS file).
county code
Spain es
France fr
United Kingdom uk
The 'mapping' datastruct will be stored in a random order, as countries will be added/removed at random times.
What is the best way to re-index the data frame to its country code from its country name?
Is there a smart solution that would also work on other columns, so that, for example, if a data frame was indexed on date but had a df['country'] column, you could change df['country'] to its country code? Finally, is there a third option that would add an additional column containing the country code, selected based on the country name in another column?
I think you can use Series.map, but it works only with a Series, so you need Index.to_series. Last, use rename_axis (new in pandas 0.18.0):
df1.index = df1.index.to_series().map(df2.set_index('county').code)
df1 = df1.rename_axis('county')
#pandas below 0.18.0
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
It is the same as mapping by a dict:
d = df2.set_index('county').code.to_dict()
print (d)
{'France': 'fr', 'Spain': 'es', 'United Kingdom': 'uk'}
df1.index = df1.index.to_series().map(d)
df1 = df1.rename_axis('county')
#pandas below 0.18.0
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
EDIT:
Another solution with Index.map, so to_series is omitted:
d = df2.set_index('county').code.to_dict()
print (d)
{'France': 'fr', 'Spain': 'es', 'United Kingdom': 'uk'}
df1.index = df1.index.map(d.get)
df1 = df1.rename_axis('county')
#pandas below 0.18.0
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
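On reasonably recent pandas versions, Index.map also accepts a dict or Series directly, so the to_dict / d.get step can be skipped (a small sketch under that assumption):
# Index.map with a Series mapper: country names -> codes in one step.
df1.index = df1.index.map(df2.set_index('county').code)
df1 = df1.rename_axis('county')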
Here are some brief ways to approach your 3 questions. More details below:
1) How to change index based on mapping in separate df
Use df_with_mapping.to_dict("split") to create a dictionary, then use a list comprehension to change it into {"old1":"new1",...,"oldn":"newn"} form, then use df.index = df.base_column.map(dictionary) to get the changed index.
2) How to change index if the new column is in the same df:
df.index = df["column_you_want"]
3) Creating a new column by mapping on an old column:
df["new_column"] = df["old_column"].map({"old1":"new1",...,"oldn":"newn"})
1) The mapping for the current index exists in a separate dataframe, but you don't have the mapped column in the dataframe yet
This is essentially the same as question 2 with the additional step of creating a dictionary for the mapping you want.
#creating the mapping dictionary in the form of current index : future index
df2 = pd.DataFrame([["es"],["fr"]],index = ["spain","france"])
interm_dict = df2.to_dict("split") #Creates a dictionary split into column labels, data labels and data
mapping_dict = {country:data[0] for country,data in zip(interm_dict["index"],interm_dict['data'])}
#We only want the first column of the data and the index so we need to make a new dict with a list comprehension and zip
df["country"] = df.index #Create a new column if u want to save the index
df.index = pd.Series(df.index).map(mapping_dict) #change the index
df.index.name = "" #Blanks out index name
df = df.drop("county code",1) #Drops the county code column to avoid duplicate columns
Before:
county code language
spain es spanish
france fr french
After:
language country
es spanish spain
fr french france
2) Changing the current index to one of the columns already in the dataframe
df = pd.DataFrame([["es","spanish"],["fr","french"]], columns = ["county code","language"], index = ["spain", "french"])
df["country"] = df.index #if you want to save the original index
df.index = df["county code"] #The only step you actually need
df.index.name = "" #if you want a blank index name
df = df.drop("county code",1) #if you dont want the duplicate column
Before:
county code language
spain es spanish
french fr french
After:
language country
es spanish spain
fr french french
3) Creating an additional column based on another column
This is again essentially the same as step 2, except that we create an additional column instead of assigning the created series to .index.
df = pd.DataFrame([["es","spanish"],["fr","french"]], columns = ["county code","language"], index = ["spain", "france"])
df["city"] = df["county code"].map({"es":"barcelona","fr":"paris"})
Before:
county code language
spain es spanish
france fr french
After:
county code language city
spain es spanish barcelona
france fr french paris
