Python - Removing strings that follow a wildcard pattern

Initial Data (String datatype)
Los Gatos 50K
Las Palmas Canary Islands 25K
Roland Garros
Seoul 25K
Rome
Desired Result
Los Gatos
Las Palmas Canary Islands
Roland Garros
Seoul
Rome
I am looking for a way to remove any string pattern that is 2 digits and then a K, but it needs to be able to handle any 2 values before the K. I haven't seen any answers that use a wildcard for that part of the replace. It should be something like this (I know this is not valid) -
data.replace("**K", '')
Side note - this string will be a column in a dataframe, so an easy solution that works with that would be ideal. If not, I can iterate through each row and transform it that way.

Try
df = df.replace(r'\d{2}K', '', regex=True)
0
0 Los Gatos
1 Las Palmas Canary Islands
2 Roland Garros
3 Seoul
4 Rome
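If you only want to touch the one column rather than the whole frame, here is a minimal sketch with Series.str.replace (the column name 'tournament' is a placeholder, since the question doesn't name the column); the \s* in the pattern also removes the space left in front of the dropped token:
# Placeholder column name; removes "50K"-style tokens and the space before them
df['tournament'] = df['tournament'].str.replace(r'\s*\d{2}K', '', regex=True)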

Related

How to count the occurrences of a value in a data frame?

I have a data frame df that looks like this but much larger:
title of the novel author publishing year mentionned cities
0 Beasts and creatures Bruno Ivory 2021 New York
0 Monsters Renata Mcniar 2023 New York
0 At risk Charles Dobi 2020 London
0 Manuela and Ricardo Lucas Zacci 2022 Rio de Janeiro
0 War against the machine Angelina Trotter 1999 Rio de Janeiro
I would like to add another column with the objective of counting all the occurrences of the cities. The problem is that I want to maintain the year of each occurrence, as I work with history. In other words, it is important for me to be able to know when the city was mentioned.
The expected outcome would look like this:
title of the novel author publishing year mentionned cities Counter
0 Beasts and creatures Bruno Ivory 2021 New York 1
0 Monsters Renata Mcniar 2023 New York 2
0 At risk Charles Dobi 2020 London 1
0 Manuela and Ricardo Lucas Zacci 2022 Rio de Janeiro 1
0 War against the machine Angelina Trotter 1999 Rio de Janeiro 2
So far, I have just managed to count all the occurrences, but I could not relate it to the publishing years. The code I am using is:
df ['New York'] = df.eq('New York').sum().to_frame().T
Can someone help me, please?
edit:
I tried joining two dataframes and I got something interesting but not what I really wanted. The problem is that it does not keep the Publishing year on track.
df['counter'] = df.groupby('mentionned cities')['mentionned cities'].transform('count')
result = pd.concat([df['New York'], df], axis=1, join='inner')
display(result)
Output:
title of the novel author publishing year mentionned cities Counter
0 Beasts and creatures Bruno Ivory 2021 New York 2
0 Monsters Renata Mcniar 2023 New York 2
0 At risk Charles Dobi 2020 London 1
0 Manuela and Ricardo Lucas Zacci 2022 Rio de Janeiro 1
0 War against the machine Angelina Trotter 1999 Rio de Janeiro 2
The problem still lingers on.
df['Counter'] = df.groupby('mentionned cities').cumcount() + 1
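As a minimal sketch on the sample data (column names follow the question's spelling), this keeps every existing column, including the publishing year, and just appends a running per-city count:
import pandas as pd

df = pd.DataFrame({
    'title of the novel': ['Beasts and creatures', 'Monsters', 'At risk',
                           'Manuela and Ricardo', 'War against the machine'],
    'author': ['Bruno Ivory', 'Renata Mcniar', 'Charles Dobi',
               'Lucas Zacci', 'Angelina Trotter'],
    'publishing year': [2021, 2023, 2020, 2022, 1999],
    'mentionned cities': ['New York', 'New York', 'London',
                          'Rio de Janeiro', 'Rio de Janeiro'],
})
# cumcount numbers the rows within each city group starting at 0
df['Counter'] = df.groupby('mentionned cities').cumcount() + 1
# Counter is 1, 2, 1, 1, 2; the years are left untouched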
Perhaps you could use a for loop to iterate through the 'mentionned cities' column, and use a dict to count the occurrences of cities:
city_count = {}
count_column = []
for city in df['mentionned cities']:
    city_count[city] = city_count.get(city, 0) + 1
    count_column.append(city_count[city])
df['Counter'] = count_column
If I understand it right, you can just concatenate both columns and do the same thing as you did. You first need to convert the year into a string (e.g. with astype(str)); then it is easy to join both columns with +.
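A minimal sketch of that idea on the same dataframe, assuming the intent is to count occurrences of each city-and-year combination:
# Build a combined key from city and year, then count occurrences of that key
key = df['mentionned cities'] + ' ' + df['publishing year'].astype(str)
df['Counter'] = df.groupby(key)['mentionned cities'].transform('count')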

How to take a list/df of 27 cities and create a dataframe with 151 instances of the city name for EACH city (4077 rows total)

I've been trying various ways from all across the internet to take a list that has 27 cities and create a dataframe that has 151 instances of the city name for EACH of those 27 cities, grouped in the order I put them in the list (4077 rows total).
I've tried various ways of isolating the data I need by using .loc, but the problem there is that there are cities that share the same name in the Excel file I've imported. Those cities with the same name do have different state abbreviations, but I can't find anything (that I can understand) on whether I can drop rows based on both the state and the name. Is there a way to do that?
Another thing I tried was to create a list of the 27 city names and multiply it by 151. But that doesn't work for this because I need the data to read out as I've ordered it in the list and not just repeat the whole list 151 times.
I'm writing this from my phone so I don't have the code to paste in here but:
Let's say I needed this for just three cities and I wanted it to create a df with 5 instances (15 rows) for each of the 3 cities listed in their respective order as a small scale example:
City_name
- Philadelphia
- Boston
- Chicago
I'm trying to get something that looks like this:
City_name
- Philadelphia
- Philadelphia
- Philadelphia
- Philadelphia
- Philadelphia
- Boston
- Boston
- Boston
- Boston
- Boston
- Chicago
- Chicago
- Chicago
- Chicago
- Chicago
(Forgive me, I don't know how to format here)
How can I best achieve this without writing 27 dataframes (1 for each city), multiplying those dataframes individually to get the 151 instances/rows, and then appending them later?
I can do it that way but I'm sure there must be a cleaner method that I haven't been able to find/understand on the internet.
Thank you!
Have you tried np.repeat? Pass it your list with the 27 cities and change repeats to the number of repetitions you want.
import numpy as np
import pandas as pd

City_name = ["Philadelphia", "Boston", "Chicago"]
df = pd.DataFrame({"City_name": np.repeat(City_name, repeats=5)})
print(df)
City_name
0 Philadelphia
1 Philadelphia
2 Philadelphia
3 Philadelphia
4 Philadelphia
5 Boston
6 Boston
7 Boston
8 Boston
9 Boston
10 Chicago
11 Chicago
12 Chicago
13 Chicago
14 Chicago
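For the full case you would only change repeats to 151. On the side question about duplicate city names, a minimal sketch, assuming the imported Excel data is in a frame called cities_df with columns named 'City' and 'State' (all placeholder names), for dropping an unwanted duplicate by matching on both columns:
# Placeholder names and values; keep every row except the one whose
# city name AND state abbreviation both match the unwanted duplicate
mask = (cities_df['City'] == 'Springfield') & (cities_df['State'] == 'MO')
cities_df = cities_df[~mask]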

Pandas "Advanced" Merge on Substring

I have two dataframes:
df1:
locality
0 Chicago, IL
1 San Francisco, CA
2 Chic, TN
df2:
City County
0 San Francisco San Francisco County
1 Chic Dyer County
2 Chicago Cook County
I want to find all values (or their corresponding indices) in locality that start with each value in City so that I can eventually merge the two dataframes. The answer to this post is close to what I want to achieve, but is greedy and only extracts one value -- I want all matches, even if somewhat incorrect
For example, I would like to know that "Chic" in df2 matches both "Chicago, IL" and "Chic, TN" in df1 (and/or that "Chicago, IL" matches both "Chic" and "Chicago")
So far, I've accomplished this by using pandas apply and a custom function:
def getMatches(row):
    matches = df1[df1['locality'].str.startswith(row['City'])]
    return matches.index.tolist(), matches['locality']
df2.apply(getMatches, axis=1)
0 ([1], [San Francisco, CA, USA])
1 ([0, 2], [Chicago, IL, USA, Chic, TN, USA])
2 ([0], [Chicago, IL, USA])
This works fine until both df1 and df2 are large (100,000+ rows), where I run into time and memory issues (even using Dask's parallelized apply). Is there a better alternative?
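One way to cut the per-row overhead, sketched under the assumption that the text before the comma in locality is always a full city name, is to extract that prefix once and do an ordinary exact merge; note this deliberately gives up the fuzzy prefix matches (e.g. "Chic" against "Chicago, IL"), trading completeness for speed:
# Sketch: derive an exact join key from locality, then merge on it
df1['City'] = df1['locality'].str.split(',').str[0].str.strip()
merged = df1.reset_index().merge(df2, on='City', how='left')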

Pandas read_html returned column with NaN values in Python

I am trying to parse the table located here using the Pandas read_html function. I was able to parse the table; however, the Capacity column returned NaN. I am not sure what could be the reason. I would like to parse the entire table and use it for further research, so any help is appreciated. Below is my code so far:
wiki_url='Above url'
df1=pd.read_html(wiki_url,index_col=0)
Try something like this (include flavor as bs4):
df = pd.read_html(r'https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')
df = df[0]
print(df.head())
Image Stadium City State \
0 NaN Aggie Memorial Stadium Las Cruces NM
1 NaN Alamodome San Antonio TX
2 NaN Alaska Airlines Field at Husky Stadium Seattle WA
3 NaN Albertsons Stadium Boise ID
4 NaN Allen E. Paulson Stadium Statesboro GA
Team Conference Capacity \
0 New Mexico State Independent 30,343[1]
1 UTSA C-USA 65000
2 Washington Pac-12 70,500[2]
3 Boise State Mountain West 36,387[3]
4 Georgia Southern Sun Belt 25000
.............................
.............................
To remove anything inside square brackets use:
df.Capacity = df.Capacity.str.replace(r"\[.*\]", "", regex=True)
print(df.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Hope this helps.
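If the values are needed as numbers for further research, here is a follow-up sketch (assuming the table was parsed as above) that also drops the thousands separators and converts the cleaned column:
# Strip footnote markers and commas, then convert; unparseable values become NaN
df['Capacity'] = pd.to_numeric(
    df['Capacity'].str.replace(r'\[.*?\]', '', regex=True)
                  .str.replace(',', '', regex=False),
    errors='coerce'
)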
Pandas is only able to get the superscript (for whatever reason) rather than the actual value. If you print all of df1 and check the Capacity column, you will see that some of the values are [1], [2], etc. (if they have footnotes) and NaN otherwise.
You may want to look into alternative ways of fetching the data, or scrape the data yourself using BeautifulSoup, since Pandas is picking up, and therefore returning, the wrong data.
The answer posted by @anky_91 was correct. I wanted to try another approach without using regex; below is my solution.
df4=pd.read_html('https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')
df4 = df4[0]
The solution was to take out the "r" (raw-string) prefix used by @anky_91 in line 1 and line 4.
print(df4.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Name: Capacity, dtype: object

str.extract starting from the back in pandas DataFrame

I have a DataFrame with thousands of rows and two columns like so:
string state
0 the best new york cheesecake rochester ny ny
1 the best dallas bbq houston tx random str tx
2 la jolla fish shop of san diego san diego ca ca
3 nothing here dc
For each state, I have a regular expression of all city names (in lower case) structured like (city1|city2|city3|...) where the order of the cities is arbitrary (but can be changed if needed). For example, the regular expression for the state of New York contains both 'new york' and 'rochester' (and likewise 'dallas' and 'houston' for Texas, and 'san diego' and 'la jolla' for California).
I want to find out what the last appearing city in the string is (for observations 1, 2, 3, 4, I'd want 'rochester', 'houston', 'san diego', and NaN (or whatever), respectively).
I started off with str.extract and was trying to think of things like reversing the string but have reached an impasse.
Thanks so much for any help!
You can use str.findall, but if there is no match you get an empty list, so apply is needed to substitute a placeholder. Then select the last item of the list with .str[-1]:
cities = r"new york|dallas|rochester|houston|san diego"
print(df['string'].str.findall(cities)
                  .apply(lambda x: x if len(x) >= 1 else ['no match val'])
                  .str[-1])
0 rochester
1 houston
2 san diego
3 no match val
Name: string, dtype: object
(Corrected > 1 to >= 1.)
Another solution is a bit hacky - add the no-match string to the start of each string with radd, and add this string to cities too:
a = 'no match val'
cities = r"new york|dallas|rochester|houston|san diego" + '|' + a
print (df['string'].radd(a).str.findall(cities).str[-1])
0 rochester
1 houston
2 san diego
3 no match val
Name: string, dtype: object
cities = r"new york|dallas|..."
def last_match(s):
found = re.findall(cities, s)
return found[-1] if found else ""
df['string'].apply(last_match)
#0 rochester
#1 houston
#2 san diego
#3
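Since the title asks about str.extract specifically, here is a possible sketch using the same sample pattern: a leading greedy .* pushes the capture group to the rightmost city name, and rows with no city yield NaN:
pattern = r'.*\b(new york|dallas|rochester|houston|san diego)\b'
# expand=False returns a Series; the greedy .* makes the group match the last city
df['last_city'] = df['string'].str.extract(pattern, expand=False)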
