How do I replace similar-looking values in a pandas DataFrame? - python

I am new to Pandas. I have the following data types in my dataset. (The dataset is Indian Startup Funding downloaded from Kaggle.)
Date datetime64[ns]
StartupName object
IndustryVertical object
CityLocation object
InvestorsName object
InvestmentType object
AmountInUSD object
dtype: object
data['AmountInUSD'].groupby(data['CityLocation']).describe()
I did the above operation and found that many cities are similar, for example:
Bangalore
Bangalore / Palo Alto
Bangalore / SFO
Bangalore / San Mateo
Bangalore / USA
Bangalore/ Bangkok
I want to do the following operation, but I do not know the code for it.
In column CityLocation, find all cells that start with 'Bang' and replace them all with 'Bangalore'. Help will be appreciated.
I did this
data[data.CityLocation.str.startswith('Bang')]
and I do not know what to do after this.

You can use loc to find the values in your column that match the substring and replace them with the value of your choosing.
import pandas as pd
df = pd.DataFrame({'CityLocation': ['Bangalore', 'Dangerlore', 'Bangalore/USA'], 'Values': [1, 2, 3]})
print(df)
# CityLocation Values
# 0 Bangalore 1
# 1 Dangerlore 2
# 2 Bangalore/USA 3
df.loc[df.CityLocation.str.startswith('Bang'), 'CityLocation'] = 'Bangalore'
print(df)
# CityLocation Values
# 0 Bangalore 1
# 1 Dangerlore 2
# 2 Bangalore 3
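Applied to the question's DataFrame, the same pattern would look something like this (na=False just guards against missing city values, if there are any):
data.loc[data.CityLocation.str.startswith('Bang', na=False), 'CityLocation'] = 'Bangalore'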

pandas 0.23 has a nice way to handle text; see the Working with Text Data docs. You can use a regular expression to match and replace text.
import pandas as pd
df = pd.DataFrame({'CityLocation': ["Bangalore / Palo Alto", "Bangalore / SFO", "Other"]})
df['CityLocation'] = df['CityLocation'].str.replace("^Bang.*", "Bangalore")
print(df)
This will yield:
CityLocation
0 Bangalore
1 Bangalore
2 Other
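Applied back to the question's column, the same call looks like this; regex=True is spelled out because recent pandas versions default str.replace to literal matching:
data['CityLocation'] = data['CityLocation'].str.replace(r'^Bang.*', 'Bangalore', regex=True)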

Related

DataFrame from variable and filtering data

I have a DataFrame and want to extract 3 columns from it, but one of them is an input from the user. I made a list, but I need it to be iterable so I can run a for loop over it.
So far I managed by making a dictionary from 2 of the columns, making a list of each and zipping them... but I really need the 3 columns...
My code:
Data=pd.read_csv(----------)
selec=input("What month would you want to show?")
NewData=[(Data['Country']),(Data['City']),(Data[selec].astype('int64'))]
#here I try to iterate:
iteration=[i for i in NewData if NewData[i]<=25]
print (iteration)
*TypeError: list indices must be integers or slices, not Series*
My CSV has a Country column, a City column, and one column per month.
I want to be able to choose the month with the variable "selec" and filter the results for the month I've chosen, so for selec="Feb" the output would keep only the Country, City and Feb columns, for rows where Feb is at most 25.
I also tried with loc/iloc, but no luck at all (TypeError: unhashable type: 'list').
See the below example for how you can:
select specific columns from a DataFrame by providing a list of columns between the selection brackets (link to tutorial)
select specific rows from a DataFrame by providing a condition between the selection brackets (link to tutorial)
iterate rows of a DataFrame, although I don't suppose you need it - if you'd like to keep working with the DataFrame after filtering it, it's better to use the method mentioned above (you won't have to put the rows back together, and it will likely be more performant because pandas is optimized for bulk operations)
import pandas as pd
# this is just for testing, instead of pd.read_csv(...)
df = pd.DataFrame([
    dict(Country="Spain", City="Madrid", Jan="15", Feb="16", Mar="17", Apr="18", May=""),
    dict(Country="Spain", City="Galicia", Jan="1", Feb="2", Mar="3", Apr="4", May=""),
    dict(Country="France", City="Paris", Jan="0", Feb="2", Mar="3", Apr="4", May=""),
    dict(Country="Algeria", City="Argel", Jan="20", Feb="28", Mar="29", Apr="30", May=""),
])
print("---- Original df:")
print(df)
selec = "Feb" # let's pretend this comes from input()
print("\n---- Just the 3 columns:")
df = df[["Country", "City", selec]] # narrow down the df to just the 3 columns
df[selec] = df[selec].astype("int64") # convert the selec column to proper type
print(df)
print("\n---- Filtered dataframe:")
df1 = df[df[selec] <= 25]
print(df1)
print("\n---- Iterated & filtered rows:")
for row in df.itertuples():
    # we could also use row[3] instead of getattr(...)
    if getattr(row, selec) <= 25:
        print(row)
Output:
---- Original df:
Country City Jan Feb Mar Apr May
0 Spain Madrid 15 16 17 18
1 Spain Galicia 1 2 3 4
2 France Paris 0 2 3 4
3 Algeria Argel 20 28 29 30
---- Just the 3 columns:
Country City Feb
0 Spain Madrid 16
1 Spain Galicia 2
2 France Paris 2
3 Algeria Argel 28
---- Filtered dataframe:
Country City Feb
0 Spain Madrid 16
1 Spain Galicia 2
2 France Paris 2
---- Iterated & filtered rows:
Pandas(Index=0, Country='Spain', City='Madrid', Feb=16)
Pandas(Index=1, Country='Spain', City='Galicia', Feb=2)
Pandas(Index=2, Country='France', City='Paris', Feb=2)
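If selec really does come from input(), it can also be worth validating it before subsetting so that a typo fails with a clear message. A minimal sketch, using a small stand-in frame in place of the real CSV:
import pandas as pd
# stand-in for the real pd.read_csv(...) data
df = pd.DataFrame({"Country": ["Spain", "France"], "City": ["Madrid", "Paris"], "Feb": ["16", "2"]})
selec = input("What month would you want to show? ").strip()
if selec not in df.columns:
    raise SystemExit(f"{selec!r} is not a column; choose one of {list(df.columns)}")
result = df.loc[df[selec].astype("int64") <= 25, ["Country", "City", selec]]
print(result)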

Pandas read_html returned column with NaN values in Python

I am trying to parse the table located here using the pandas read_html function. I was able to parse the table; however, the Capacity column came back as NaN and I am not sure what the reason could be. I would like to parse the entire table and use it for further research, so any help is appreciated. Below is my code so far:
wiki_url='Above url'
df1=pd.read_html(wiki_url,index_col=0)
Try something like this (include flavor as bs4):
df = pd.read_html(r'https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')
df = df[0]
print(df.head())
Image Stadium City State \
0 NaN Aggie Memorial Stadium Las Cruces NM
1 NaN Alamodome San Antonio TX
2 NaN Alaska Airlines Field at Husky Stadium Seattle WA
3 NaN Albertsons Stadium Boise ID
4 NaN Allen E. Paulson Stadium Statesboro GA
Team Conference Capacity \
0 New Mexico State Independent 30,343[1]
1 UTSA C-USA 65000
2 Washington Pac-12 70,500[2]
3 Boise State Mountain West 36,387[3]
4 Georgia Southern Sun Belt 25000
.............................
.............................
To strip anything inside square brackets use:
df.Capacity = df.Capacity.str.replace(r"\[.*\]","")
print(df.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Hope this helps.
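On newer pandas (2.0 and later) str.replace defaults to literal matching, so regex=True has to be spelled out; the cleaned strings also still contain thousands separators. A small follow-up along these lines turns the column into real numbers (errors='coerce' leaves NaN for anything unparseable):
df.Capacity = df.Capacity.str.replace(r"\[.*?\]", "", regex=True)
df.Capacity = pd.to_numeric(df.Capacity.str.replace(",", "", regex=False), errors="coerce")
print(df.Capacity.head())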
Pandas is only able to get the superscript (for whatever reason) rather than the actual value. If you print all of df1 and check the Capacity column, you will see that some of the values are [1], [2], etc. (if they have footnotes) and NaN otherwise.
You may want to look into alternative ways of fetching the data, or scrape it yourself using BeautifulSoup, since pandas is picking up, and therefore returning, the wrong data.
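If you do go the scraping route, a rough sketch with requests and BeautifulSoup might look like the following; the wikitable class and table position are assumptions about the current page layout and may change as the article is edited:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
table = soup.find('table', class_='wikitable')  # assumed: the first wikitable is the stadium list
rows = []
for tr in table.find_all('tr')[1:]:  # skip the header row
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['td', 'th'])]
    if cells:
        rows.append(cells)
print(rows[:3])  # first few raw rows, capacity text included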
The answer posted by #anky_91 was correct. I wanted to try another approach without using regex; below is my solution.
df4=pd.read_html('https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')
df4 = df4[0]
The only change from #anky_91's code was taking out the r prefix in line 1 and line 4.
print(df4.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Name: Capacity, dtype: object

How to stop a Dataframe subset from "remembering" old values

Sorry for the weird phrasing, but I didn't know how to better describe it. I will be translating my problem to US terms to ease understanding. My problem is, I have a national database with States and Districts and I need to work only with Districts from Florida, so I do this:
df_fl=df.loc[df.state=='florida'].copy()
After some transformations I want to take mean values of every district from Florida, so I do this:
df_final=df_fl.groupby(['district']).mean()
But this brings back a dataframe with every district in the database; all rows for districts that are not in Florida are filled with NaNs. I suppose there's an easy solution to this, but I haven't been able to find it. It's kind of counter-intuitive that it works like that, too.
So, can you help me fix this?
Thanks in advance,
Edit:
my data looked like this:
District state Salary
1 Florida 1000
1 Florida 2000
2 Florida 2000
2 Florida 3000
3 California 3000
df_fl, then, looks like this:
District state Salary
1 Florida 1000
1 Florida 2000
2 Florida 2000
2 Florida 3000
And after applying
df_final=df_fl.groupby(['district']).mean()
I expected to get this:
District Salary
1 1500
2 2500
But I'm getting this:
District Salary
1 1500
2 2500
3 nan
Obviously a very simplified version, but the core remains.
It is because your 'District' column is a categorical type.
MCVE
df = pd.DataFrame(dict(
State=list('CCCCFFFF'),
District=list('WXWXYYZZ'),
Value=range(1, 9)
))
Without categorical
df.query('State == "F"').groupby('District').Value.mean()
District
Y 5.5
Z 7.5
Name: Value, dtype: float64
With categorical
df.assign(
District=pd.Categorical(df.District)
).query('State == "F"').groupby('District').Value.mean()
District
W NaN
X NaN
Y 5.5
Z 7.5
Name: Value, dtype: float64
Solution
There are many ways to do this. One way that preserves the categorical typing is to use the remove_unused_categories method:
df = df.assign(District=df.District.cat.remove_unused_categories())
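A minimal end-to-end sketch of that fix on the MCVE above (District made categorical explicitly, since that is what triggers the behaviour):
import pandas as pd

df = pd.DataFrame(dict(
    State=list('CCCCFFFF'),
    District=pd.Categorical(list('WXWXYYZZ')),
    Value=range(1, 9)
))
df_f = df.query('State == "F"')
df_f = df_f.assign(District=df_f.District.cat.remove_unused_categories())
print(df_f.groupby('District').Value.mean())
# District
# Y    5.5
# Z    7.5
# Name: Value, dtype: float64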
As piRSquared already explained, this only happens with categorical data. Starting from 0.23.0, groupby has a new observed argument which toggles this behavior. MCVE taken from piRSquared:
>>> df = pd.DataFrame(dict(
...     State=list('CCCCFFFF'),
...     District=list('WXWXYYZZ'),
...     Value=range(1, 9)
... ))
>>> df.assign(
...     District=pd.Categorical(df.District)
... ).query('State == "F"').groupby('District').Value.mean()
District
W NaN
X NaN
Y 5.5
Z 7.5
Name: Value, dtype: float64
>>> df.assign(
...     District=pd.Categorical(df.District)
... ).query('State == "F"').groupby('District', observed=True).Value.mean()
District
Y 5.5
Z 7.5
Name: Value, dtype: float64

Pandas: Delete rows of a DataFrame if total count of a particular column occurs only 1 time

I'm looking to delete rows of a DataFrame if the value in a particular column occurs only 1 time in total.
Example of raw table (values are arbitrary for illustrative purposes):
print df
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
9 Bolivia #McDonalds 3456
10 Kenya #Schools 3455
11 Ukraine #Cars 3456
12 US #Tshirts 3456789
Intended outcome:
print df
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
I know that df.Series.value_counts()>1 will identify which df.Series values occur more than 1 time, and that it returns something like the following:
Population    True
GDP           True
#McDonalds    False
#Schools      False
#Cars         False
#Tshirts      False
I want to write something like the following so that my new DataFrame drops the rows whose df.Series value occurs only 1 time, but this doesn't work:
df.drop(df.Series.value_counts()==1,axis=1,inplace=True)
You can do this by creating a boolean list/array, either with a list comprehension or with the Series string methods.
The list comprehension approach is:
vc = df['Series'].value_counts()
u = [i not in set(vc[vc==1].index) for i in df['Series']]
df = df[u]
The other approach is to use the str.contains method to check whether the values of the Series column contain a given string or match a given regular expression (used in this case as we are using multiple strings):
vc = df['Series'].value_counts()
pat = r'|'.join(vc[vc==1].index) #Regular expression
df = df[~df['Series'].str.contains(pat)] #Tilde is to negate boolean
Using this regular expressions approach is a bit more hackish and may require some extra processing (character escaping, etc) on pat in case you have regex metacharacters in the strings you want to filter out (which requires some basic regex knowledge). However, it's worth noting this approach is about 4x faster than using the list comprehension approach (tested on the data provided in the question).
As a side note, I recommend avoiding using the word Series as a column name as that's the name of a pandas object.
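Another option in the same spirit is a plain isin lookup on the values whose count is above 1; a minimal sketch, with column names as in the question:
vc = df['Series'].value_counts()
df = df[df['Series'].isin(vc[vc > 1].index)]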
This is an old question, but the current answer doesn't scale well to moderately large dataframes. A much faster and more "dataframe" way is to add a value-count column and filter on the count.
Create the dataset:
df = pd.DataFrame({'Country': 'Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US'.split(),
'Series': 'Pop Pop Pop Pop GDP GDP GDP GDP McDonalds Schools Cars Tshirts'.split()})
Drop rows whose value in the column ('Series' in this case) has a count of 1:
# Group values for Series and add 'cnt' column with count
df['cnt'] = df.groupby(['Series'])['Country'].transform('count')
# Drop indexes for count value == 1, and dropping 'cnt' column
df.drop(df[df.cnt==1].index)[['Country','Series']]
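For comparison, groupby.filter expresses the same intent directly on the test dataset above, although it tends to be slower than transform on large frames:
df[['Country', 'Series']].groupby('Series').filter(lambda g: len(g) > 1)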

Delete rows based on values in column in python

I am performing data cleaning on a .csv file for analytics. I am trying to delete the rows that have null values in any of their columns in Python.
Sample file:
Unnamed: 0 2012 2011 2010 2009 2008 2005
0 United States of America 760739 752423 781844 812514 843683 862220
1 Brazil 732913 717185 715702 651879 649996 NaN
2 Germany 520005 513458 515853 519010 518499 494329
3 United Kingdom (England and Wales) 310544 336997 367055 399869 419273 541455
4 Mexico 211921 212141 230687 244623 250932 239166
5 France 193081 192263 192906 193405 187937 148651
6 Sweden 87052 89457 87854 86281 84566 72645
7 Romania 17219 12299 12301 9072 9457 8898
8 Nigeria 15388 NaN 18093 14075 14692 NaN
So far I have used:
from pandas import read_csv
link = "https://docs.google.com/spreadsheets......csv"
data = read_csv(link)
data.head(100000)
How can I delete these rows?
Once you have your data loaded you just need to figure out which rows to remove:
bad_rows = data.isna().any(axis=1)  # isna() handles mixed dtypes; np.isnan would fail on the country-name column
Then:
data[~bad_rows].head(100)
You need to use the dropna method to remove these values. Passing in how='any' into the method as an argument will remove the row if any of the values is null and how='all' will only remove the row if all of the values are null.
cleaned_data = data.dropna(how='any')
Edit 1.
It's worth noting that you may not want to create a copy of your cleaned data (i.e. cleaned_data = data.dropna(how='any')).
To save memory you can pass in the inplace option that will modify your original DataFrame and return None.
data.dropna(how='any', inplace=True)
data.head(100)
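If only certain columns matter, dropna also accepts a subset argument, so rows are dropped only when those particular columns are missing. A small sketch using the year columns from the sample above (assuming the year headers are read as strings):
data.dropna(subset=['2011', '2005'], inplace=True)
data.head(100)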
