I have this data:
region gdp_per_capita
0 Coasts of USA 71 546
1 USA: New York, New Jersey 81 615
2 USA: California 74 205
3 USA: New England 74 000
I want to get this:
region gdp_per_capita
0 Coasts of USA 71546
1 USA: New York, New Jersey 81615
2 USA: California 74205
3 USA: New England 74000
I tried df.columns = df.columns.str.replace(' ', ''), but it did not work (that only renames the columns; it does not touch the values).
Just this should do:
import numpy as np

df['gdp_per_capita'] = df['gdp_per_capita'].astype(str).str.replace(r'\s+', '', regex=True).replace('nan', np.nan)
df['gdp_per_capita'] = pd.to_numeric(df['gdp_per_capita'])
print(df)
region gdp_per_capita
0 Coasts of USA 71546
1 USA: New York, New Jersey 81615
2 USA: California 74205
3 USA: New England 74000
Looks like you want to work with numbers rather than strings.
Hence, replacing ' ' with '' and using pd.to_numeric seems like an easy and solid approach.
Let me suggest another one which might or might not be good (it depends on your dataset).
If the thousands in your dataset are separated by a whitespace (' '), you can just read your df as
df = pd.read_csv(file, thousands=' ')
and every column value like 74 109 will be read as 74109, with an integer or float dtype.
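As a minimal sketch of this, using io.StringIO to stand in for a real file (the semicolon separator and column contents here are made up for illustration, since a comma separator would clash with the commas inside the region names):

```python
import io

import pandas as pd

# Hypothetical file contents; a real file path would work the same way.
csv_data = "region;gdp_per_capita\nUSA: California;74 205\nUSA: New England;74 000\n"

# thousands=' ' tells the parser to treat spaces inside numbers as
# thousands separators, so the column comes back numeric.
df = pd.read_csv(io.StringIO(csv_data), sep=';', thousands=' ')

print(df['gdp_per_capita'].tolist())  # [74205, 74000]
```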
import re

# Keep only the digits in each value, then convert to int.
df['gdp_per_capita'] = df['gdp_per_capita'].apply(lambda x: re.sub(r"[^0-9]", "", str(x))).astype(int)
I am not quite sure whether it will work, but try the following:
Trim leading space of column in pandas – lstrip()
Trim trailing space of column in pandas – rstrip()
Trim Both leading and trailing space of column in pandas – strip()
Strip all the whitespace of a column in pandas – replace(' ', '')
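A quick illustration of the three variants on a throwaway Series (note that these only remove leading and trailing whitespace, not the spaces inside the values):

```python
import pandas as pd

s = pd.Series(['  71 546 ', ' 81 615'])

print(s.str.lstrip().tolist())  # leading spaces removed:  ['71 546 ', '81 615']
print(s.str.rstrip().tolist())  # trailing spaces removed: ['  71 546', ' 81 615']
print(s.str.strip().tolist())   # both removed:            ['71 546', '81 615']
```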
Let me know if it works :)
Related
I have some address data like:
Address
Buffalo, NY, 14201
Stackoverflow Street, New York, NY, 99999
I'd like to split these into columns like:
Street City State Zip
NaN Buffalo NY 14201
Stackoverflow Street New York NY 99999
Essentially, I'd like to shift my strings over by one in each column in the result.
With Pandas I know I can split columns like:
import pandas as pd
df = pd.DataFrame(
    data={'Address': ['Buffalo, NY, 14201', 'Stackoverflow Street, New York, NY, 99999']}
)
df[['Street','City','State','Zip']] = (
    df['Address']
    .str.split(',', expand=True)
    .applymap(lambda col: col.strip() if col else col)
)
but I need to figure out how to conditionally shift columns when my result is only 3 columns.
First, create a function that splits each row and reverses the resulting parts. With a normal split, a row with fewer parts gets its NaN at the end; by reversing the parts first, the values fill in from the Zip side, so the NaN lands in what will become the Street position.
Then, apply it to all rows.
Then, rename the columns because they will be integers.
Finally, set them in the right order.
fn = lambda x: pd.Series([i.strip() for i in reversed(x.split(','))])
pad = df['Address'].apply(fn)
pad looks like this right now,
0 1 2 3
0 14201 NY Buffalo NaN
1 99999 NY New York Stackoverflow Street
Just need to rename the columns and flip the order back.
pad.rename(columns={0: 'Zip', 1: 'State', 2: 'City', 3: 'Street'}, inplace=True)
df = pad[['Street','City','State','Zip']]
Output:
Street City State Zip
0 NaN Buffalo NY 14201
1 Stackoverflow Street New York NY 99999
Use a bit of numpy magic to reorder the columns with None on the left:
import numpy as np

df2 = df['Address'].str.split(',', expand=True)
df[['Street','City','State','Zip']] = df2.to_numpy()[np.arange(len(df))[:, None], np.argsort(df2.notna())]
Output:
Address Street City State Zip
0 Buffalo, NY, 14201 None Buffalo NY 14201
1 Stackoverflow Street, New York, NY, 99999 Stackoverflow Street New York NY 99999
Another idea, add as many commas as needed to have n-1 (here 3) before splitting:
df[['Street','City','State','Zip']] = (
    df['Address'].str.count(',')
    .rsub(4-1).map(lambda x: ',' * x)
    .add(df['Address'])
    .str.split(',', expand=True)
)
Output:
Address Street City State Zip
0 Buffalo, NY, 14201 Buffalo NY 14201
1 Stackoverflow Street, New York, NY, 99999 Stackoverflow Street New York NY 99999
Well, I found a solution, but I am not sure if there is something more performant out there. Open to other ideas.
def split_shift(s: str) -> list[str]:
    split_str: list[str] = s.split(',')
    # If the split is only 3 items, shift things over by inserting an NA in front
    if len(split_str) == 3:
        split_str.insert(0, pd.NA)
    return split_str

df[['Street','City','State','Zip']] = pd.DataFrame(df['Address'].apply(split_shift).tolist())
I have a really big DataFrame with a column "city" in which the same cities repeat in different cases, e.g.:
City
Gurgaon
GURGAON
gurgaon
Chennai
CHENNAI
Banglore
Hydrabad
BANGLORE
HYDRABAD
...
Is there a way to replace all occurrences of the same city in different cases with a single name?
There are about 3k rows, so doing it manually is not possible.
Edit -
The city column of the DF also contains cities like
'Gurgaon'
'GURGAON'
'gurgaon ' #there is a white space at the end
I want them all changed to the same name, with the trailing whitespace removed as well, so that the output is →
'Gurgaon'
'Gurgaon'
'Gurgaon' #no white space at the end
Thanks
Here is how you can use str.strip() to remove trailing whitespaces, and then use str.title():
import pandas as pd
df = pd.DataFrame({'City': ["Gurgaon", "GURGAON", "gurgaon",
                            "Chennai", "CHENNAI",
                            "Banglore", "Hydrabad", "BANGLORE", "HYDRABAD"]})
df['City'] = df['City'].str.strip()
df['City'] = df['City'].str.title()
print(df)
Output:
City
0 Gurgaon
1 Gurgaon
2 Gurgaon
3 Chennai
4 Chennai
5 Banglore
6 Hydrabad
7 Banglore
8 Hydrabad
First, change the cities to have the same format:
df.city = df.city.str.capitalize()
Then, remove duplicates (drop_duplicates returns a new DataFrame, so assign the result back):
df = df.drop_duplicates()
(I assume the rest of the columns are equal)
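Putting both steps together on a small sample (the strip call also covers the trailing-whitespace case from the edit above):

```python
import pandas as pd

df = pd.DataFrame({'city': ['Gurgaon', 'GURGAON', 'gurgaon ', 'Chennai']})

# Normalize whitespace and case, then drop the now-identical rows.
df['city'] = df['city'].str.strip().str.capitalize()
df = df.drop_duplicates()

print(df['city'].tolist())  # ['Gurgaon', 'Chennai']
```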
I am trying to remove the numbers before "-" in the name column. But not all rows have numbers before the name. How do I remove the numbers in rows that have numbers and keep the rows that don't have numbers in front untouched?
Sample df:
country Name
UK 5413-Marcus
Russia 5841-Natasha
Hong Kong Keith
China 7777-Wang
Desired df
country Name
UK Marcus
Russia Natasha
Hong Kong Keith
China Wang
I appreciate any assistance! Thanks in advance!
Pandas has string accessors for Series. If you split on '-' and take the last element of the resulting list, then even for a row without the delimiter '-' you still get the last element of its one-element list.
df.Name = df.Name.str.split('-').str.get(-1)
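A self-contained run on the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'country': ['UK', 'Russia', 'Hong Kong', 'China'],
                   'Name': ['5413-Marcus', '5841-Natasha', 'Keith', '7777-Wang']})

# split('-') gives e.g. ['5413', 'Marcus'] or just ['Keith'];
# get(-1) always returns the last element either way.
df['Name'] = df['Name'].str.split('-').str.get(-1)
print(df['Name'].tolist())  # ['Marcus', 'Natasha', 'Keith', 'Wang']
```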
You might use str.lstrip for that task following way:
import pandas as pd
df = pd.DataFrame({'country':['UK','Russia','Hong Kong','China'],'Name':['5413-Marcus','5841-Natasha','Keith','7777-Wang']})
df['Name'] = df['Name'].str.lstrip('-0123456789')
print(df)
Output:
country Name
0 UK Marcus
1 Russia Natasha
2 Hong Kong Keith
3 China Wang
.lstrip removes leading characters, .rstrip removes trailing characters, and .strip removes both.
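One thing worth knowing: lstrip takes a set of characters, not a prefix string, so it keeps removing characters from the left as long as each one appears anywhere in the set (the '77-7-Wang' value here is a made-up example to show that):

```python
import pandas as pd

s = pd.Series(['5413-Marcus', 'Keith', '77-7-Wang'])

# '-0123456789' is a character set: digits and dashes are stripped from
# the left until a character outside the set ('M', 'K', 'W') is reached.
print(s.str.lstrip('-0123456789').tolist())  # ['Marcus', 'Keith', 'Wang']
```

The flip side is that a name which itself starts with a digit or dash would also lose those characters.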
This is my df:
df = pd.DataFrame([
    "33, BUffalo New York",
    "44, Charleston North Carolina "
], columns=['row'])
My intention is to split them by a comma followed by a space or just a space like this
33 Buffalo New York
44 Charleston North Carolina
My command is as follows:
df["row"].str.split("[,\s|\s]", n = 2, expand = True)
0 STD City State
1 33 Buffalo New York
2 44 Charleston North Carolina
As explained in the pandas docs, your split command does what it should if you just remove the square brackets. This command works:
new_df = df["row"].str.split(r",\s|\s", n=2, expand=True)
Note: if your cities have spaces in them, then this will fail. It works when the state has a space in it, because n=2 ensures that at most 3 columns result.
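To see the caveat concretely, here is a hypothetical row where the city itself contains a space; the split lands in the wrong place:

```python
import pandas as pd

df = pd.DataFrame(["55, San Francisco California"], columns=['row'])

# n=2 caps the result at 3 columns, but the city's internal space is
# consumed as a separator, so 'Francisco California' ends up in the
# last column instead of the state.
print(df["row"].str.split(r",\s|\s", n=2, expand=True).values.tolist())
# [['55', 'San', 'Francisco California']]
```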
The only part that you are missing is to set the first row as the header. As answered here, you can use pandas' iloc command:
new_df.columns = new_df.iloc[0]
new_df = new_df[1:]
print (new_df)
# 0 STD City State
# 1 33 BUffalo New York
# 2 44 Charleston North Carolina
I have a DataFrame with thousands of rows and two columns like so:
string state
0 the best new york cheesecake rochester ny ny
1 the best dallas bbq houston tx random str tx
2 la jolla fish shop of san diego san diego ca ca
3 nothing here dc
For each state, I have a regular expression of all city names (in lower case) structured like (city1|city2|city3|...) where the order of the cities is arbitrary (but can be changed if needed). For example, the regular expression for the state of New York contains both 'new york' and 'rochester' (and likewise 'dallas' and 'houston' for Texas, and 'san diego' and 'la jolla' for California).
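For concreteness, such a per-state pattern might be built from a dict like this (the state codes and city lists here are just the ones mentioned in the question):

```python
# Hypothetical mapping from state code to its city names (lower case).
state_cities = {
    'ny': ['new york', 'rochester'],
    'tx': ['dallas', 'houston'],
    'ca': ['san diego', 'la jolla'],
}

# One alternation pattern per state, e.g. '(new york|rochester)'.
state_patterns = {st: '(' + '|'.join(cities) + ')'
                  for st, cities in state_cities.items()}

print(state_patterns['ny'])  # (new york|rochester)
```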
I want to find out what the last appearing city in the string is (for rows 0, 1, 2, 3 I'd want 'rochester', 'houston', 'san diego', and NaN (or whatever), respectively).
I started off with str.extract and was trying to think of things like reversing the string but have reached an impasse.
Thanks so much for any help!
You can use str.findall, but when there is no match you get an empty list, so you need apply to substitute a placeholder. Then select the last item with .str[-1]:
cities = r"new york|dallas|rochester|houston|san diego"
print (df['string'].str.findall(cities)
        .apply(lambda x: x if len(x) >= 1 else ['no match val'])
        .str[-1])
0 rochester
1 houston
2 san diego
3 no match val
Name: string, dtype: object
Another solution is a bit of a hack: add a no-match string to the start of each string with radd, and add this string to cities too:
a = 'no match val'
cities = r"new york|dallas|rochester|houston|san diego" + '|' + a
print (df['string'].radd(a).str.findall(cities).str[-1])
0 rochester
1 houston
2 san diego
3 no match val
Name: string, dtype: object
import re

cities = r"new york|dallas|..."

def last_match(s):
    found = re.findall(cities, s)
    return found[-1] if found else ""

df['string'].apply(last_match)
#0 rochester
#1 houston
#2 san diego
#3