Frequency of string (comma separated) in Python

I'm trying to find the frequency of strings from the field "Select Investors" on this website: https://www.cbinsights.com/research-unicorn-companies
Is there a way to pull out the frequency of each of the comma-separated strings?
For example, how often does the term "Sequoia Capital China" show up?

The solution provided by @Mazhar checks whether a certain term is a substring of a string delimited by commas. As a consequence, the number of occurrences of 'Sequoia Capital' returned by that approach is the sum of the occurrences of all the strings that contain 'Sequoia Capital', namely 'Sequoia Capital', 'Sequoia Capital China', 'Sequoia Capital India', 'Sequoia Capital Israel' and 'and Sequoia Capital China'. The following code avoids that issue:
import pandas as pd
from collections import defaultdict

url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]

freqs = defaultdict(int)
for group in df['Select Investors']:
    # Skip NaN cells: only strings have a .lower method
    if hasattr(group, 'lower'):
        for raw_investor in group.lower().split(','):
            investor = raw_investor.strip()
            # Ignore empty strings produced by wrong data like this:
            # 'B Capital Group,, GE Ventures, McKesson Ventures'
            if investor:
                freqs[investor] += 1
Demo
In [57]: freqs['sequoia capital']
Out[57]: 41
In [58]: freqs['sequoia capital china']
Out[58]: 46
In [59]: freqs['sequoia capital india']
Out[59]: 25
In [60]: freqs['sequoia capital israel']
Out[60]: 2
In [61]: freqs['and sequoia capital china']
Out[61]: 1
The sum of occurrences is 115, which coincides with the frequency returned for 'sequoia capital' by the currently accepted solution.
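To double-check that claim, you can sum the counts of every key that contains the substring (a quick sketch using the freqs dict built above; the live table changes over time, so exact totals may drift):
substring_total = sum(count for name, count in freqs.items()
                      if 'sequoia capital' in name)
print(substring_total)  # 115 for the data shown in the demo above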

Here is a corrected, more Pythonic way:
import itertools
import collections
import pandas as pd

def fun(x):
    # Lower-case and strip each comma-separated investor name
    x = map(lambda y: y.strip().lower(), str(x).split(','))
    # Drop empty strings and the 'nan' produced by missing cells
    return filter(lambda y: y and y != 'nan', x)

# Extract data
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]

# Process
investor = first_df['Select Investors'].apply(fun)
investor = investor.values.flatten()
investor = list(itertools.chain(*investor))

# Organize
final_data = collections.Counter(investor).items()
final_df = pd.DataFrame(final_data, columns=['Investor', 'Frequency'])
final_df
Output:
Investor Frequency
0 Sequoia Capital China 46
1 SIG Asia Investments 3
2 Sina Weibo 2
3 Softbank Group 9
4 Founders Fund 16
... ... ...
1187 Motive Partners. Apollo Global Management 1
1188 JBV Capital 1
1189 Array Ventures 1
1190 AWZ Ventures 1
1191 Endiya Partners 1
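As a follow-up, the frequency table is easier to scan when sorted (a small usage sketch on the final_df built above):
# Most frequent investors first
final_df.sort_values('Frequency', ascending=False).head(10)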

Related

Clean column in data frame

I'm trying to clean one column which contains an ID number starting with S followed by 7 digits, e.g. 'S1234567', and save only this number into a new column. I started with the column named Remarks; this is an example of the data inside:
Remarks
0 S0252508 Shippment UK
1 S0255111 Shippment UK
2 S0256352 Shippment UK
3 S0259138 Shippment UK
4 S0260425 Shippment US
I've managed to separate the rows that have the format S1234567 + text using the code below:
merged_out['Remarks'] = merged_out['Remarks'].str.replace("\t", "\r", regex=False)
merged_out['Remarks'] = merged_out['Remarks'].str.replace("\n", "\r", regex=False)
s = merged_out['Remarks'].str.split("\r", expand=True).stack()
s.index = s.index.droplevel(-1)
s.name = 'Remarks'
del merged_out['Remarks']
merged_out = merged_out.join(s)
merged_out[['Number','Remarks']] = merged_out.Remarks.str.split(" ", n=1, expand=True)
After creating the data frame I found a lot of mistakes in that column, because the data is entered manually. Here are some examples of the wrong records:
Number
0. Pallets:
1. S0246734/S0246735/S0246736
3. delivery
4. S0258780 31 cok
5. S0246732-
6. 2
7. ok
8. nan
And this is only the wrong data in the Number column. I need to clean this and keep only the rows with a correct number. If there is something like S0246732/S0246736/S0246738, I need a separate row for each number, with the same data as the original record. For the others, I need to keep the ones that contain a number; the rest should get a null value.
Here is a regex approach that will do what I think your question asks:
import pandas as pd

merged_out = pd.DataFrame({
    'Remarks': [
        'S0252508 Shippment UK',
        'S0255111 Shippment UK',
        'S0256352 Shippment UK',
        'S0259138/S0259139 Shippment UK',
        'S12345678 Shippment UK',
        'S0260425 Shippment US']
})

pat = r'(?:(\bS\d{7})/)*(\bS\d{7}\b)'
df = merged_out.Remarks.str.extractall(pat)
df = (pd.concat([
          pd.DataFrame(df.unstack().apply(lambda row: row.dropna().tolist(), axis=1),
                       columns=['Number']),
          merged_out],
      axis=1).explode('Number'))
df.Remarks = df.Remarks.str.replace(pat + r'\s*', '', regex=True)
Input:
Remarks
0 S0252508 Shippment UK
1 S0255111 Shippment UK
2 S0256352 Shippment UK
3 S0259138/S0259139 Shippment UK
4 S12345678 Shippment UK
5 S0260425 Shippment US
Output:
Number Remarks
0 S0252508 Shippment UK
1 S0255111 Shippment UK
2 S0256352 Shippment UK
3 S0259138 Shippment UK
3 S0259139 Shippment UK
5 S0260425 Shippment US
4 NaN S12345678 Shippment UK
Explanation:
with Series.str.extractall(), use a pattern to obtain 0 or more occurrences of word boundary \b followed by S followed by 7 digits, and one occurrence of S followed by 7 digits (flanked by word boundaries \b); a sketch of this intermediate result follows the notes below
use unstack() to eliminate multiple index levels
use apply() with dropna() and tolist() to create a new dataframe with a Number column containing a list of numbers for each row
use explode() to add new rows for lists with more than one Number item
with Series.str.replace(), filter out the number matches using the previous pattern, plus r'\s*' to match trailing whitespace characters, to obtain the residual Remarks
Notes:
all rows in the sample input contain one valid Number except that one row contains multiple Number values separated by / delimiters, and another row contains no valid Number (it has S followed by 8 digits, more than the 7 that make a valid Number)
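For reference, this is roughly what the intermediate extractall result looks like for the sample data (a sketch; column 0 holds a number captured before a /, column 1 the final number of each match; the row with 8 digits produces no match at all):
matches = merged_out.Remarks.str.extractall(pat)
print(matches)
# Roughly:
#                 0         1
#   match
# 0 0           NaN  S0252508
# 1 0           NaN  S0255111
# 2 0           NaN  S0256352
# 3 0      S0259138  S0259139
# 5 0           NaN  S0260425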
I think the easiest solution is to use regular expressions and a list comprehension:
import re
import pandas as pd

merged_out['Remarks'] = [re.split(r'\s', i)[0] for i in merged_out['Remarks']]
Explanation:
The regular expression splits the data wherever there is a whitespace character, producing a list for each row i of the column Remarks. The 0 selects the first element of that list; in this case, it is the number.
The list comprehension applies this to every row of the column, so you obtain the corresponding number for each row in the updated Remarks column.
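Note that a plain split keeps invalid tokens such as 'Pallets:' or 'ok'. Here is a sketch of a variant (the helper first_valid_number is hypothetical, assuming merged_out as above) that also validates the S + 7 digits format and stores None otherwise:
import re

def first_valid_number(remark):
    # Take the first whitespace-separated token and keep it only if it
    # matches S followed by exactly 7 digits
    token = re.split(r'\s', str(remark))[0]
    return token if re.fullmatch(r'S\d{7}', token) else None

merged_out['Number'] = merged_out['Remarks'].apply(first_valid_number)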

Python/Pandas: How to process a column of data when it meets certain criteria

I have a CSV like this:
userlabel|country
SZ5GZTD_[56][13631808]|russia
YZ5GZTC-3_[51][13680735]|uk
XZ5GZTA_12-[51][13574893]|usa
testYZ5GZWC_11-[51][13632101]|cuba
I use pandas to read this CSV. I'd like to add a new column ci whose value comes from userlabel; the following conditions must be met:
convert values to lowercase
start with 'yz' or 'testyz'
The code so far is like this:
(df['userlabel'].str.lower()).str.extract(r"(test)?([a-z]+).*", expand=True)[1]
When it matches, ci is the number between the first '-' or '_' and the second '-' or '_' in userlabel.
The pseudocode is like this:
ci = (userlabel, r'.*(\_|\-)(\d+)(\_|\-).*', 2)
Finally, the result should look like this:
userlabel ci country
SZ5GZTD_[56][13631808] russia
YZ5GZTC-3_[51][13680735] 3 uk
XZ5GZTA_12-[51][13574893] usa
testYZ5GZWC_11-[51][13632101] 11 cuba
You can use
import pandas as pd
df = pd.DataFrame({'userlabel':['SZ5GZTD_[56][13631808]','YZ5GZTC-3_[51][13680735]','XZ5GZTA_12-[51][13574893]','testYZ5GZWC_11-[51][13632101]'], 'country':['russia','uk','usa','cuba']})
df['ci'] = df['userlabel'].str.extract(r"(?i)^(?:yz|testyz)[^_-]*[_-](\d+)[-_]", expand=True)
>>> df['ci']
0 NaN
1 3
2 NaN
3 11
Name: ci, dtype: object
# To rearrange columns, add the following line:
df = df[['userlabel', 'ci', 'country']]
>>> df
userlabel ci country
0 SZ5GZTD_[56][13631808] NaN russia
1 YZ5GZTC-3_[51][13680735] 3 uk
2 XZ5GZTA_12-[51][13574893] NaN usa
3 testYZ5GZWC_11-[51][13632101] 11 cuba
Regex details:
(?i) - make the pattern case insensitive (no need using str.lower())
^ - start of string
(?:yz|testyz) - a non-capturing group matching either yz or testyz
[^_-]* - zero or more chars other than _ and -
[_-] - the first _ or -
(\d+) - Group 1 (the Series.str.extract requires a capturing group since it only returns this captured substring): one or more digits
[-_] - a - or _.
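A quick standalone check of the same pattern with the plain re module (a sketch over two of the sample values):
import re

pat = re.compile(r'(?i)^(?:yz|testyz)[^_-]*[_-](\d+)[-_]')
for s in ['YZ5GZTC-3_[51][13680735]', 'SZ5GZTD_[56][13631808]']:
    m = pat.search(s)
    print(s, '->', m.group(1) if m else None)
# YZ5GZTC-3_[51][13680735] -> 3
# SZ5GZTD_[56][13631808] -> None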
import re

def get_val(s):
    l = re.findall(r'^(YZ|testYZ).*[_-](\d+)[_-].*', s)
    return None if len(l) == 0 else l[0][1]

df['ci'] = df['userlabel'].apply(get_val)
df = df[['userlabel', 'ci', 'country']]
userlabel ci country
0 SZ5GZTD_[56][13631808] None russia
1 YZ5GZTC-3_[51][13680735] 3 uk
2 XZ5GZTA_12-[51][13574893] None usa
3 testYZ5GZWC_11-[51][13632101] 11 cuba

Extract integers after double space with regex

I have a dataframe where I want to extract the text that comes after a double space. For all rows in column NAME there is a double white space after the company name, before the integers.
NAME INVESTMENT PERCENT
0 APPLE COMPANY A  57 638 232 stocks OIL LTD 0.12322
1 BANANA 1 COMPANY B  12 946 201 stocks GOLD LTD 0.02768
2 ORANGE COMPANY C  8 354 229 stocks GAS LTD 0.01786
import pandas as pd

df = pd.DataFrame({
    'NAME': ['APPLE COMPANY A  57 638 232 stocks',
             'BANANA 1 COMPANY B  12 946 201 stocks',
             'ORANGE COMPANY C  8 354 229 stocks'],
    'PERCENT': [0.12322, 0.02768, 0.01786]
})
I had this earlier, but it also includes integers in the company name:
df['STOCKS']=df['NAME'].str.findall(r'\b\d+\b').apply(lambda x: ''.join(x))
Instead I tried to extract after double spaces
df['NAME'].str.split(r'(\s{2})')
which gives output:
0 [APPLE COMPANY A, , 57 638 232 stocks]
1 [BANANA 1 COMPANY B, , 12 946 201 stocks]
2 [ORANGE COMPANY C, , 8 354 229 stocks]
However, I want the integers that occur after double spaces to be joined/merged and put into a new column.
NAME PERCENT STOCKS
0 APPLE COMPANY A 0.12322 57638232
1 BANANA 1 COMPANY B 0.02768 12946201
2 ORANGE COMPANY C 0.01786 8354229
How can I modify my second function to do what I want?
Following the original logic you may use
# regex=True is explicit here; pandas >= 2.0 defaults str.replace to literal matching
df['STOCKS'] = df['NAME'].str.extract(r'\s{2,}(\d+(?:\s\d+)*)', expand=False).str.replace(r'\s+', '', regex=True)
df['NAME'] = df['NAME'].str.replace(r'\s{2,}\d+(?:\s\d+)*\s+stocks', '', regex=True)
Output:
NAME PERCENT STOCKS
0 APPLE COMPANY A 0.12322 57638232
1 BANANA 1 COMPANY B 0.02768 12946201
2 ORANGE COMPANY C 0.01786 8354229
Details
\s{2,}(\d+(?:\s\d+)*) is used to extract the first occurrence of whitespace-separated consecutive digit chunks after 2 or more whitespaces, and .replace(r'\s+', '') then removes any whitespace in the extracted text
.replace(r'\s{2,}\d+(?:\s\d+)*\s+stocks', '') updates the text in the NAME column: it removes 2 or more whitespaces, the consecutive whitespace-separated digit chunks, then 1+ whitespaces and stocks. The last \s+stocks may be replaced with .* if there are other words.
Another pandas approach, which will cast STOCKS to numeric type:
df_split = (df['NAME'].str.extractall(r'^(?P<NAME>.+)\s{2}(?P<STOCKS>[\d\s]+)')
              .reset_index(level=1, drop=True))
df_split['STOCKS'] = pd.to_numeric(df_split.STOCKS.str.replace(r'\D', '', regex=True))
Assign these columns back into your original DataFrame:
df[['NAME', 'STOCKS']] = df_split[['NAME', 'STOCKS']]
NAME STOCKS PERCENT
0 APPLE COMPANY A 57638232 0.12322
1 BANANA 1 COMPANY B 12946201 0.02768
2 ORANGE COMPANY C 8354229 0.01786
You can use look-behind and look-ahead operators:
''.join(re.findall(r'(?<=\s{2})(.*)(?=stocks)', string)).replace(' ', '')
This captures all characters between the double space and the word stocks, then replaces the remaining spaces with nothing.
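A self-contained version of that one-liner (string here is a stand-in for a single NAME value; note that re look-behinds must be fixed-width, which \s{2} satisfies):
import re

string = 'APPLE COMPANY A  57 638 232 stocks'
digits = ''.join(re.findall(r'(?<=\s{2})(.*)(?=stocks)', string)).replace(' ', '')
print(digits)  # 57638232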
Another solution, using find and slicing:
df["NAME"].apply(lambda x: x[x.find('  ') + 2:x.find('stocks') - 1].replace(' ', ''))
You can try splitting on the double space:
df['STOCKS'] = df['NAME'].str.split('  ').str[1].str.replace(' stocks', '').str.replace(' ', '')
df['NAME'] = df['NAME'].str.split('  ').str[0]
The STOCKS column can also be extracted without regex by using split (the NAME cleanup below still uses one):
df['STOCKS'] = df['NAME'].apply(lambda x: ''.join(x.split('  ')[1].split(' ')[:-1]))
df['NAME'] = df['NAME'].str.replace(r'\s?\d+(?:\s\d+).*', '', regex=True)

Removing strings in a series of headers

I have a number of columns in a dataframe:
df = pd.DataFrame({'Date':[1990],'State Income of Alabama':[1],
'State Income of Washington':[2],
'State Income of Arizona':[3]})
All headers contain the same words and the exact same prefix, with exactly one space before the state's name.
I want to take out the string 'State Income of ' and leave the state intact as the new header for each column, so they all read:
Alabama Washington Arizona
1 2 3
I've tried using the replace columns function in Python like:
df.columns = df.columns.str.replace('State Income of ', '')
But this isn't giving me the desired output.
Here is another solution, not in place:
df.rename(columns=lambda x: x.split()[-1])
or in place:
df.rename(columns=lambda x: x.split()[-1], inplace = True)
Your way works for me, but there are alternatives:
One way is to split your column names and take the last word:
df.columns = [i.split()[-1] for i in df.columns]
>>> df
Alabama Arizona Washington
0 1 3 2
You can use the re module for this:
>>> import pandas as pd
>>> df = pd.DataFrame({'State Income of Alabama':[1],
... 'State Income of Washington':[2],
... 'State Income of Arizona':[3]})
>>>
>>> import re
>>> df.columns = [re.sub('State Income of ', '', col) for col in df]
>>> df
Alabama Washington Arizona
0 1 2 3
re.sub('State Income of ', '', col) will replace any occurrence of 'State Income of ' with an empty string (with "nothing," effectively) in the string col.
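On Python 3.9 and later, str.removeprefix is another option that avoids regex entirely (a sketch, assuming every header carries the same literal prefix):
df.columns = [col.removeprefix('State Income of ') for col in df.columns]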

Removing non-alphanumeric symbols in dataframe

How do I remove non-alphanumeric characters from the values in the dataframe? So far I have only managed to convert everything to lower case:
def doubleAwardList(self):
    dfwinList = pd.DataFrame()
    dfloseList = pd.DataFrame()
    dfwonandLost = pd.DataFrame()
    # self.dfWIN and self.dfLOSE hold the files chosen by the user
    groupby_name = self.dfWIN.groupby("name")
    groupby_nameList = self.dfLOSE.groupby("name _List")
    list4 = []
    list5 = []
    notAwarded = "na"
    for x, group in groupby_name:
        if x != notAwarded:
            list4.append(str.lower(str(x)))
    dfwinList = pd.DataFrame(list4)
    for x, group in groupby_nameList:
        list5.append(str.lower(str(x)))
    dfloseList = pd.DataFrame(list5)
Data sample: basically I need to remove the full stops and hyphens, because I have to compare the values against another file whose naming isn't very consistent, so removing the non-alphanumeric characters gives a much more accurate result:
creative-3
smart tech pte. ltd.
nutritive asia
asia's first
desired result:
creative 3
smart tech pte ltd
nutritive asia
asia s first
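Note that the desired result replaces each punctuation character with a space rather than deleting it outright. Here is a sketch that produces exactly that (replace each run of non-alphanumerics with a single space, then trim):
import pandas as pd

s = pd.Series(['creative-3', 'smart tech pte. ltd.', 'nutritive asia', "asia's first"])
cleaned = (s.str.lower()
            .str.replace(r'[^a-z0-9]+', ' ', regex=True)
            .str.strip())
print(cleaned.tolist())
# ['creative 3', 'smart tech pte ltd', 'nutritive asia', 'asia s first']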
Use DataFrame.replace with regex=True and include whitespace in the pattern:
df = df.replace('[^a-zA-Z0-9 ]', '', regex=True)
If it is one column (a Series):
df = pd.DataFrame({'col': ['creative-3', 'smart tech pte. ltd.',
                           'nutritive asia', "asia's first"],
                   'col2': range(4)})
print (df)
col col2
0 creative-3 0
1 smart tech pte. ltd. 1
2 nutritive asia 2
3 asia's first 3
df['col'] = df['col'].replace('[^a-zA-Z0-9 ]', '', regex=True)
print (df)
col col2
0 creative3 0
1 smart tech pte ltd 1
2 nutritive asia 2
3 asias first 3
EDIT:
If there are multiple columns, it is possible to select only the object (i.e. string) columns, and if necessary cast them to strings:
cols = df.select_dtypes('object').columns
print (cols)
Index(['col'], dtype='object')
df[cols] = df[cols].astype(str).replace('[^a-zA-Z0-9 ]', '', regex=True)
print (df)
col col2
0 creative3 0
1 smart tech pte ltd 1
2 nutritive asia 2
3 asias first 3
Why not just the following (I did make it lower case, btw):
df = df.replace('[^a-zA-Z0-9]', '', regex=True).apply(lambda col: col.str.lower())
(This assumes every column holds strings; .str.lower() fails on numeric columns.)
Then:
print(df)
will show the desired data frame.
Update:
try:
df = df.apply(lambda x: x.str.replace('[^a-zA-Z0-9]', '', regex=True).str.lower(), axis=0)
If it is only one column, do:
df['your col'] = df['your col'].str.replace('[^a-zA-Z0-9]', '', regex=True).str.lower()
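A quick check of the single-column form on two of the sample values (a sketch):
df = pd.DataFrame({'col': ['creative-3', "asia's first"]})
df['col'] = df['col'].str.replace('[^a-zA-Z0-9]', '', regex=True).str.lower()
print(df['col'].tolist())  # ['creative3', 'asias first']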
