Let's suppose I have a Series (or DataFrame) s1, for example a list of all universities and colleges in the USA:
University
0 Searcy Harding University
1 Angwin Pacific Union College
2 Fairbanks University of Alaska Fairbanks
3 Ann Arbor University of Michigan
And another Series (or DataFrame) s2, for example a list of all cities in the USA:
City
0 Searcy
1 Angwin
2 New York
3 Ann Arbor
And my desired output (basically an intersection of s1 and s2):
Uni City
0 Searcy
1 Angwin
2 Fairbanks
3 Ann Arbor
The thing is: I'd like to create a Series that consists of cities, but only those that have a university/college. My very first thought was to remove the "University" or "College" parts from s1, but it turns out that is not enough, as in the case of Angwin Pacific Union College. Then I thought of keeping only the first word, but that excludes Ann Arbor.
Finally, I got a series of all the cities s2 and now I'm trying to use it as a filter (something similar to .contains() or .isin()), so that if a string in s1 (a university name) contains any of the elements of s2 (a city name), only the city name is returned.
My question is: how to do it in a neat way?
I would try to build a list comprehension of cities that are contained in at least one university name:
pd.Series([i for i in s2 if s1.str.contains(i).any()], name='Uni City')
With your example data it gives:
0 Searcy
1 Angwin
2 Ann Arbor
Name: Uni City, dtype: object
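If you also want the result aligned row by row with s1 (NaN where no city is found), a vectorized variant of the same idea is to join the cities into one regex and let str.extract pull the matching city out of each university name. This is only a sketch using the example data above; re.escape guards against regex metacharacters in city names:
import re
import pandas as pd

s1 = pd.Series(['Searcy Harding University',
                'Angwin Pacific Union College',
                'Fairbanks University of Alaska Fairbanks',
                'Ann Arbor University of Michigan'], name='University')
s2 = pd.Series(['Searcy', 'Angwin', 'New York', 'Ann Arbor'], name='City')

# Build one alternation pattern from the city list and extract the match;
# rows where no city is contained in the name become NaN.
pattern = '(' + '|'.join(map(re.escape, s2)) + ')'
uni_city = s1.str.extract(pattern, expand=False).rename('Uni City')
Calling .dropna() on uni_city then leaves just Searcy, Angwin and Ann Arbor, matching the output above.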
Data Used
s = pd.DataFrame({'University': ['Searcy Harding University', 'Angwin Pacific Union College',
                                 'Fairbanks University of Alaska Fairbanks', 'Ann Arbor University of Michigan']})
s2 = pd.DataFrame({'City': ['Searcy', 'Angwin', 'Fairbanks', 'Ann Arbor']})
Convert s2.City to a set of unique city names (so membership tests are cheap):
st = set(s2.City.unique().tolist())
Calculate s['Uni City'] with next() on a generator expression, returning the first city from st that appears in each university name (np.nan, i.e. numpy.nan, if none). Note that the np.nan fallback belongs inside next(), not as a second argument to apply():
s['Uni City'] = s['University'].apply(lambda x: next((i for i in st if i in x), np.nan))
Outcome (with the data above):
                                 University   Uni City
0                 Searcy Harding University     Searcy
1              Angwin Pacific Union College     Angwin
2  Fairbanks University of Alaska Fairbanks  Fairbanks
3          Ann Arbor University of Michigan  Ann Arbor
I have a dataframe with a string column that contains a sequence of author names and their affiliations.
Address
'Smith, Jane (University of X); Doe, Betty (Institute of Y)'
'Walter, Bill (Z University); Albertson, John (Z University); Chen, Hilary (University of X)'
'Note, Joe (University of X); Cal, Stephanie (University of X)'
I want to create a new column with a Boolean TRUE/FALSE that tests whether all authors are from University of X. Note there can be any number of authors in the string.
Desired output:
T/F
FALSE
FALSE
TRUE
I think I can split the Address column using
df['Address_split'] = df['Address'].str.split(';', expand=False)
which then creates the list of names in the cell.
Address_split
['Smith, Jane (University of X)', 'Doe, Betty (Institute of Y)']
I even think I can use the all() function to get a Boolean for one cell at a time:
all([("University of X" in i) for i in df['Address_split'][2]]) returns True
But I am struggling to think through how I can do this on each cell's list individually. I think I need some combination of map and/or apply.
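In other words, I imagine something like this sketch (reusing the Address_split column created above), though I'm not sure it's the idiomatic way:
df['T/F'] = df['Address_split'].apply(
    lambda authors: all('University of X' in a for a in authors))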
You can split with expand=True so you can stack everything into one long Series, then use str.extract to pull out the name and location.
The check is then that all the locations are 'University of X', which can be done with an equality comparison plus all within a groupby. Since the grouping is based on the original index, you can assign the result straight back to the original DataFrame:
s = (df['Address'].str.split(';', expand=True).stack()
       .str.extract(r'(.*)\s\((.*)\)')
       .rename(columns={0: 'name', 1: 'location'}))
# name location
#0 0 Smith, Jane University of X
# 1 Doe, Betty Institute of Y
#1 0 Walter, Bill Z University
# 1 Albertson, John Z University
# 2 Chen, Hilary University of X
#2 0 Note, Joe University of X
# 1 Cal, Stephanie University of X
df['T/F'] = s['location'].eq('University of X').groupby(level=0).all()
print(df)
Address T/F
0 Smith, Jane (University of X); Doe, Betty (Ins... False
1 Walter, Bill (Z University); Albertson, John (... False
2 Note, Joe (University of X); Cal, Stephanie (U... True
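As a side note, named capture groups would let you skip the rename step; a sketch of the same extract:
s = (df['Address'].str.split(';', expand=True).stack()
       .str.extract(r'(?P<name>.*)\s\((?P<location>.*)\)'))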
You can use str.extractall to extract all the affiliations in parentheses and check whether each one equals 'University of X'.
df['T/F'] = df['Address'].str.extractall(r"\(([^)]*)\)").eq('University of X').groupby(level=0).all()
Address T/F
0 'Smith, Jane (University of X); Doe, Betty (In... False
1 'Walter, Bill (Z University); Albertson, John ... False
2 'Note, Joe (University of X); Cal, Stephanie (... True
Here are some other options:
u = 'University of X'
df['Address'].str.count(u).eq(df['Address'].str.count(';')+1)
or
df['Address'].str.findall(r'([\w ]+)(?=\))').map(lambda x: set(x) == {u})
Output:
0 False
1 False
2 True
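A small caveat on the first option: str.count treats its argument as a regular expression, so if the target string ever contains regex metacharacters it is safer to escape it first (a sketch):
import re
u = 'University of X'
df['T/F'] = df['Address'].str.count(re.escape(u)).eq(df['Address'].str.count(';') + 1)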
I have two dataframes:
df1:
locality
0 Chicago, IL
1 San Francisco, CA
2 Chic, TN
df2:
City County
0 San Francisco San Francisco County
1 Chic Dyer County
2 Chicago Cook County
I want to find all values (or their corresponding indices) in locality that start with each value in City so that I can eventually merge the two dataframes. The answer to this post is close to what I want to achieve, but is greedy and only extracts one value -- I want all matches, even if somewhat incorrect
For example, I would like to know that "Chic" in df2 matches both "Chicago, IL" and "Chic, TN" in df1 (and/or that "Chicago, IL" matches both "Chic" and "Chicago")
So far, I've accomplished this by using pandas apply and a custom function:
def getMatches(row):
    matches = df1[df1['locality'].str.startswith(row['City'])]
    return matches.index.tolist(), matches['locality']

df2.apply(getMatches, axis=1)
0 ([1], [San Francisco, CA, USA])
1 ([0, 2], [Chicago, IL, USA, Chic, TN, USA])
2 ([0], [Chicago, IL, USA])
This works fine until both df1 and df2 are large (100,000+ rows), where I run into time and memory issues (even using Dask's parallelized apply). Is there a better alternative?
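One possible alternative (a sketch, not benchmarked against Dask): sort the localities once and turn each prefix lookup into a binary search with numpy.searchsorted, since all localities that start with a given city form a contiguous block in sorted order. The helper name prefix_matches is only illustrative:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'locality': ['Chicago, IL', 'San Francisco, CA', 'Chic, TN']})
df2 = pd.DataFrame({'City': ['San Francisco', 'Chic', 'Chicago']})

# Sort the localities once, keeping the mapping back to df1's index.
order = np.argsort(df1['locality'].to_numpy())
sorted_loc = df1['locality'].to_numpy()[order]

def prefix_matches(city):
    # All strings starting with `city` sit between these two positions;
    # '\uffff' sorts after any ordinary character, so it bounds the prefix range.
    left = np.searchsorted(sorted_loc, city, side='left')
    right = np.searchsorted(sorted_loc, city + '\uffff', side='left')
    return df1.index[order[left:right]].tolist()

df2['matching_indices'] = df2['City'].map(prefix_matches)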
First DataFrame: housing. This DataFrame has a MultiIndex (State, RegionName) and relevant values in three other columns.
State RegionName 2008q3 2009q2 Ratio
New York New York 499766.666667 465833.333333 1.072844
California Los Angeles 469500.000000 413900.000000 1.134332
Illinois Chicago 232000.000000 219700.000000 1.055985
Pennsylvania Philadelphia 116933.333333 116166.666667 1.006600
Arizona Phoenix 193766.666667 168233.333333 1.151773
Second DataFrame: list_of_university_towns. It contains state names and some region names and has a default numeric index.
State RegionName
1 Alabama Auburn
2 Alabama Florence
3 Alabama Jacksonville
4 Arizona Phoenix
5 Illinois Chicago
Now the inner join of the two DataFrames:
uniHousingData = pd.merge(list_of_university_towns,housing,how="inner",on=["State","RegionName"])
This gives no rows in the resulting uniHousingData DataFrame, while it should contain the bottom two rows (indexes 4 and 5 of list_of_university_towns).
What am I doing wrong?
I found the issue. There was a trailing space at the end of the strings in the RegionName column of the second DataFrame. Using str.strip() to remove the space made the merge work like a charm.
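For reference, a minimal sketch of that fix (assuming the trailing spaces are in list_of_university_towns, as described; stripping State as well just in case):
list_of_university_towns['RegionName'] = list_of_university_towns['RegionName'].str.strip()
list_of_university_towns['State'] = list_of_university_towns['State'].str.strip()

uniHousingData = pd.merge(list_of_university_towns, housing, how="inner", on=["State", "RegionName"])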
I have a DataFrame with thousands of rows and two columns like so:
string state
0 the best new york cheesecake rochester ny ny
1 the best dallas bbq houston tx random str tx
2 la jolla fish shop of san diego san diego ca ca
3 nothing here dc
For each state, I have a regular expression of all city names (in lower case) structured like (city1|city2|city3|...) where the order of the cities is arbitrary (but can be changed if needed). For example, the regular expression for the state of New York contains both 'new york' and 'rochester' (and likewise 'dallas' and 'houston' for Texas, and 'san diego' and 'la jolla' for California).
I want to find the last city that appears in each string (for the four rows above, I'd want 'rochester', 'houston', 'san diego', and NaN (or whatever), respectively).
I started off with str.extract and was trying to think of things like reversing the string but have reached an impasse.
Thanks so much for any help!
You can use str.findall, but if there is no match it returns an empty list, so you need apply to substitute a placeholder. Then select the last item with .str[-1]:
cities = r"new york|dallas|rochester|houston|san diego"
print(df['string'].str.findall(cities)
        .apply(lambda x: x if len(x) >= 1 else ['no match val'])
        .str[-1])
0 rochester
1 houston
2 san diego
3 no match val
Name: string, dtype: object
Another solution is a bit of a hack: add a 'no match' string to the start of each string with radd, and add that string to the cities pattern too:
a = 'no match val'
cities = r"new york|dallas|rochester|houston|san diego" + '|' + a
print (df['string'].radd(a).str.findall(cities).str[-1])
0 rochester
1 houston
2 san diego
3 no match val
Name: string, dtype: object
cities = r"new york|dallas|..."
def last_match(s):
found = re.findall(cities, s)
return found[-1] if found else ""
df['string'].apply(last_match)
#0 rochester
#1 houston
#2 san diego
#3
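A fully vectorized variant of the str.extract idea mentioned in the question (only a sketch): a greedy leading .* forces the capture onto the last occurrence, and rows with no match come back as NaN:
cities = r"new york|dallas|rochester|houston|san diego"
df['last_city'] = df['string'].str.extract(r'.*(' + cities + r')', expand=False)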
I'm looking to delete rows of a DataFrame if the value in a particular column occurs only once.
Example of raw table (values are arbitrary for illustrative purposes):
print df
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
9 Bolivia #McDonalds 3456
10 Kenya #Schools 3455
11 Ukraine #Cars 3456
12 US #Tshirts 3456789
Intended outcome:
print df
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
I know that df.Series.value_counts()>1 will identify which df.Series values occur more than once, and that the result will look something like the following:
Population    True
GDP           True
#McDonalds    False
#Schools      False
#Cars         False
#Tshirts      False
I want to write something like the following so that my new DataFrame drops the rows whose df.Series value occurs only once, but this doesn't work:
df.drop(df.Series.value_counts()==1,axis=1,inplace=True)
You can do this by creating a boolean list/array, either with a list comprehension or with pandas' string methods.
The list comprehension approach is:
vc = df['Series'].value_counts()
u = [i not in set(vc[vc==1].index) for i in df['Series']]
df = df[u]
The other approach is to use the str.contains method to check whether the values of the Series column contain a given string or match a given regular expression (used in this case as we are using multiple strings):
vc = df['Series'].value_counts()
pat = r'|'.join(vc[vc==1].index) #Regular expression
df = df[~df['Series'].str.contains(pat)] #Tilde is to negate boolean
Using this regular expressions approach is a bit more hackish and may require some extra processing (character escaping, etc) on pat in case you have regex metacharacters in the strings you want to filter out (which requires some basic regex knowledge). However, it's worth noting this approach is about 4x faster than using the list comprehension approach (tested on the data provided in the question).
As a side note, I recommend avoiding using the word Series as a column name as that's the name of a pandas object.
This is an old question, but the current answer becomes very slow on moderately large DataFrames. A much faster and more "DataFrame-native" way is to add a value-count column and filter on that count.
Create the dataset:
df = pd.DataFrame({'Country': 'Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US'.split(),
'Series': 'Pop Pop Pop Pop GDP GDP GDP GDP McDonalds Schools Cars Tshirts'.split()})
Drop rows whose value in the 'Series' column has a count of 1:
# Group values for Series and add 'cnt' column with count
df['cnt'] = df.groupby(['Series'])['Country'].transform('count')
# Drop indexes for count value == 1, and dropping 'cnt' column
df.drop(df[df.cnt==1].index)[['Country','Series']]
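For what it's worth, the same count-and-filter idea can also be written without the helper column, for example with groupby.transform or isin (sketches, keeping all original columns of df):
df[df.groupby('Series')['Series'].transform('size') > 1]

vc = df['Series'].value_counts()
df[~df['Series'].isin(vc[vc == 1].index)]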