I am trying to analyze some NBA data. I have a CSV file with a field called "team" that contains the team name and the place the team finished concatenated together in the same field. For example, the Boston Celtics finishing in 2nd place show up as "Boston Celtics2"; another example would be "New York Knicks11". I wrote a for loop (below) that looks at the final character and, if it is a digit, slices it off.
for x in nba['team']:
    while x[-1].isdigit():
        print(x)
        x = x[:len(x)-1]
    print(x)
I included the print statements to make sure that it is slicing the data properly (which it is), but the result is never written back. Why isn't it being stored in my DataFrame?
>>> nba
                  team
0      Boston Celtics2
1    New York Knicks11
2  Los Angeles Lakers7
Replace any trailing digits with an empty string using a regular expression:
nba['team'] = nba['team'].str.replace(r'\d+$', '', regex=True)
>>> nba
                 team
0      Boston Celtics
1     New York Knicks
2  Los Angeles Lakers
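If you would rather avoid regular expressions, Series.str.rstrip does the same job here, stripping any trailing digit characters in one vectorized call (a one-line sketch against the same nba frame):
nba['team'] = nba['team'].str.rstrip('0123456789')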
It is not overwriting your DataFrame because in your for loop x only holds a copy of the value from the nba DataFrame, so reassigning x never touches the original. To change the value in your DataFrame you can loop with an index and write back at that specific index:
for i, row in nba.iterrows():
    x = row['team']
    while x[-1].isdigit():
        print(x)
        x = x[:len(x)-1]
    print(x)
    nba.at[i, 'team'] = x
I am having trouble with the pandas replace function. Let's say we have an example dataframe like this:
df = pd.DataFrame({'State': ['Georgia', 'Alabama', 'Tennessee'],
                   'Cities': [['Atlanta', 'Albany'],
                              ['Montgomery', 'Huntsville', 'Birmingham'],
                              ['Nashville', 'Knoxville']]})
>>> df
       State                                Cities
0    Georgia                     [Atlanta, Albany]
1    Alabama  [Montgomery, Huntsville, Birmingham]
2  Tennessee                [Nashville, Knoxville]
Now I want to replace the state names and city names all by abbreviations. I have two dictionaries that define the replacement values:
state_abbrv = {'Alabama': 'AL', 'Georgia': 'GA', 'Tennessee': 'TN'}
city_abbrv = {'Albany': 'Alb.', 'Atlanta': 'Atl.', 'Birmingham': 'Birm.',
              'Huntsville': 'Htsv.', 'Knoxville': 'Kxv.',
              'Montgomery': 'Mont.', 'Nashville': 'Nhv.'}
When using pd.DataFrame.replace() on the "State" column (which only contains one value per row) it works as expected and replaces all state names:
>>> df.replace({'State': state_abbrv})
  State                                Cities
0    GA                     [Atlanta, Albany]
1    AL  [Montgomery, Huntsville, Birmingham]
2    TN                [Nashville, Knoxville]
I was hoping that it would also individually replace all matching names within the lists of the "Cities" column, but unfortunately it does not seem to work as all cities remain unabbreviated:
>>> df.replace({'Cities': city_abbrv})
       State                                Cities
0    Georgia                     [Atlanta, Albany]
1    Alabama  [Montgomery, Huntsville, Birmingham]
2  Tennessee                [Nashville, Knoxville]
How do I get pd.DataFrame.replace() to cycle through all the list elements in the "Cities" column of each row individually and replace accordingly?
Try:
explode to split the list into individual rows
replace each column using the relevant dictionary
groupby and agg to get back the original structure
>>> output = (df.explode("Cities")
...             .replace({"State": state_abbrv, "Cities": city_abbrv})
...             .groupby("State", as_index=False)["Cities"]
...             .agg(list))
>>> output
  State                 Cities
0    AL  [Mont., Htsv., Birm.]
1    GA           [Atl., Alb.]
2    TN           [Nhv., Kxv.]
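Note that groupby sorts by state, so the rows come back in alphabetical order rather than the original one. If you need to preserve the original row order, mapping over each list directly is a simple alternative (a sketch reusing the same two dictionaries; dict.get leaves any city without an abbreviation untouched):
df['State'] = df['State'].replace(state_abbrv)
df['Cities'] = df['Cities'].apply(lambda cities: [city_abbrv.get(c, c) for c in cities])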
This is what I'm trying to accomplish with my code: I have a CSV file with tennis player names, and I want to add new players to it once they show up in the rankings. My script goes through the rankings and creates an array, then imports the names from the CSV file. It is supposed to find which names are not in the latter, and then extract online data for those names. Then I just want the new rows to be appended at the end of the old CSV file. My issue is that the new row is being indexed with the player's name rather than following the index of the old file. Any ideas why that's happening? Also, why is an unnamed column being added?
def get_all_players():
    # imports names of players currently in the atp rankings
    current_atp_ranking = check_atp_rankings()
    current_player_list = current_atp_ranking['Player']
    # clean up names in case of white spaces
    for i in range(0, len(current_player_list)):
        current_player_list[i] = current_player_list[i].strip()
    # reads the main file and makes a dataframe out of it
    current_file = 'ATP_stats_new.csv'
    df = pd.read_csv(current_file)
    # gets all the names within the main file to see which current ones aren't there
    names_on_file = list(df['Player'])
    # cleans up in case of any white spaces
    for i in range(0, len(names_on_file)):
        names_on_file[i] = names_on_file[i].strip()
    # Removing Nadal for testing purposes
    names_on_file.remove("Rafael Nadal")
    # creating a list of players in current_player_list but not in names_on_file
    new_player_list = [x for x in current_player_list if x not in names_on_file]
    # loop through new_player_list
    for player in new_player_list:
        # delay to avoid stopping
        time.sleep(2)
        # finding the player's atp link for profile based on their name
        atp_link = current_atp_ranking.loc[current_atp_ranking['Player'] == player, 'ATP_Link']
        atp_link = atp_link.iloc[0]
        # make a basic dictionary with just the player's name and link
        player_dict = [{'Name': player, 'ATP_Link': atp_link}]
        # enter the new dictionary into the existing main file
        df = df.append(player_dict, ignore_index=True)
    # print dataframe to see how it looks before exporting
    print(df)
    # export dataframe into current file
    df.to_csv(current_file)
This is what the file looks like at first:
      Unnamed: 0            Player  ...                         Coach  Turned_Pro
0              0    Novak Djokovic  ...                           NaN         NaN
1              1      Rafael Nadal  ...   Carlos Moya, Francisco Roig      2001.0
2              2     Roger Federer  ...  Ivan Ljubicic, Severin Luthi      1998.0
3              3   Daniil Medvedev  ...                           NaN         NaN
4              4     Dominic Thiem  ...                           NaN         NaN
...          ...               ...  ...                           ...         ...
1976        1976      Brian Bencic  ...                           NaN         NaN
1977        1977  Boruch Skierkier  ...                           NaN         NaN
1978        1978      Majed Kilani  ...                           NaN         NaN
1979        1979   Quentin Gueydan  ...                           NaN         NaN
1980        1980     Preston Brown  ...                           NaN         NaN
And this is what the new row looks like:
1977          1977.0  ...   NaN
1978          1978.0  ...   NaN
1979          1979.0  ...   NaN
1980          1980.0  ...   NaN
Rafael Nadal     NaN  ...  2001
There are critical parts of your code missing that are necessary to answer the question precisely. Two thoughts based on what you posted:
Importing Your CSV File
Your previous csv file was probably saved with the DataFrame index as its first column. Make sure the file does not contain the index; when you save, do the following:
file.to_csv('file.csv', index=False)
When you then load the file like this:
pandas.read_csv('file.csv')
pandas will automatically assign a fresh integer index and there won't be a duplicate column.
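If the stray column is already baked into an existing file, you can also deal with it at load time (a small sketch; "Unnamed: 0" is the name pandas gave the column in your printout):
df = pd.read_csv('file.csv', index_col=0)  # reuse the saved column as the index
# or drop it outright after loading
df = pd.read_csv('file.csv').drop(columns=['Unnamed: 0'])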
Misordering of Columns
Not sure what info, or in what order, atp_link is pulling in. From what you provided it looks like it is returning two columns: "Coach" and "Turned_Pro".
I would recommend creating a list (not a dict) for each new player you want to add after you pull the info from atp_link. For example, if you are adding Nadal, his info list would look like this:
info_list = ['Rafael Nadal', '', '2001']
Then you append the list to the dataframe like this:
df.loc[len(df), :] = info_list
Hope this helps.
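Also worth noting: df.append returns a new DataFrame rather than modifying df in place, and it has since been removed from pandas (2.0). A short sketch of the same step with pd.concat, its current replacement; the dict keys should match the existing column names ('Player' rather than 'Name'), otherwise pandas creates a brand-new column:
# build the new row as a one-row DataFrame; keys must match existing columns
new_row = pd.DataFrame([{'Player': player, 'ATP_Link': atp_link}])
df = pd.concat([df, new_row], ignore_index=True)  # concat returns a new frame
df.to_csv(current_file, index=False)              # avoids the Unnamed: 0 column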
I have a list of city names and a df with city, state, and zipcode columns. Some of the zipcodes are missing. When a zipcode is missing, I want to use a generic zipcode based on the city; for example, if the city is San Jose, the zipcode should be the generic 'SJ_zipcode'.
pattern_city = '|'.join(cities) #works
foundit = ( (df['cty_nm'].str.contains(pattern_city, flags=re.IGNORECASE)) & (df['zip_cd']==0) & (df['st_cd'].str.match('CA') ) ) #works--is this foundit a df?
df['zip_cd'] = foundit.replace( 'SJ_zipcode' ) #nope, error
Error: "Invalid dtype for pad_1d [bool]"
Implemented with where
df['zip_cd'].where( (df['cty_nm'].str.contains(pattern_city, flags=re.IGNORECASE)) & (df['zip_cd']==0) & (df['st_cd'].str.match('CA') ), "SJ_Zipcode", inplace = True) #nope, empty set; all set to nan?
Implemented with loc
df['zip_cd'].loc[ (df['cty_nm'].str.contains(pattern_city, flags=re.IGNORECASE)) & (df['zip_cd']==0) & (df['st_cd'].str.match('CA') ) ] = "SJ_Zipcode"
Some possible solutions that did not work
df.loc[df['First Season'] > 1990, 'First Season'] = 1, which I used as df.loc[foundit, 'zip_cd'] = 'SJ_zipcode' (from "Pandas DataFrame: replace all values in a column, based on condition", similar to "Conditional Replace Pandas")
df['c'] = df.apply(lambda row: row['a']*row['b'] if np.isnan(row['c']) else row['c'], axis=1); however, I am not multiplying values: https://datascience.stackexchange.com/questions/17769/how-to-fill-missing-value-based-on-other-columns-in-pandas-dataframe
I tried a solution using where, however, it seemed to replace the values where the condition was not met with nan--but the nan value was not helpful https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html
This conditional approach looked promising, but without looping over each value I was confused about how anything actually happens: "What should replace comparisons with False in python?"
An example using replace, which however does not have the multiple conditions and pattern: "Replacing few values in a pandas dataframe column with another value"
An additional 'want': I want to update the existing dataframe with these values, not create a new dataframe.
Try this:
import numpy as np
import pandas as pd

# sample data reconstructed from the printout below; 'zip' holds strings so the
# placeholder can be concatenated, with np.nan marking the missing entry
data = {'city': ['Burbank', 'Anaheim', 'El Cerrito', 'Los Angeles', 'san Fancisco'],
        'state': ['California'] * 5,
        'zip': ['44325', np.nan, '57643', '56734', '32819']}
df = pd.DataFrame(data)
df
           city       state    zip
0       Burbank  California  44325
1       Anaheim  California    NaN
2    El Cerrito  California  57643
3   Los Angeles  California  56734
4  san Fancisco  California  32819
def generate_placeholder_zip(row):
    if pd.isnull(row['zip']):
        row['zip'] = row['city'] + '_ZIPCODE'
    return row
df.apply(generate_placeholder_zip, axis=1)
           city       state              zip
0       Burbank  California            44325
1       Anaheim  California  Anaheim_ZIPCODE
2    El Cerrito  California            57643
3   Los Angeles  California            56734
4  san Fancisco  California            32819
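If you would rather update the existing dataframe in place (your additional 'want'), the .loc attempt in the question was nearly there; the usual pattern is to put the boolean mask inside a single df.loc indexer instead of chaining it after the column selection (a sketch reusing the mask from the question):
mask = (df['cty_nm'].str.contains(pattern_city, flags=re.IGNORECASE)
        & (df['zip_cd'] == 0)
        & df['st_cd'].str.match('CA'))
df.loc[mask, 'zip_cd'] = 'SJ_zipcode'  # writes into the original dataframe, no copy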
My pandas dataframe is very large, so I want to modify the textLower(frame) function below so that it executes in one command and I don't have to iterate over each row to perform a sequence of string manipulations on each element.
# Function iterates over all the values of a pandas dataframe
def textLower(frame):
    for index, row in frame.iterrows():
        row['Text'] = row['Text'].lower()
        # further modification on row['Text']
    return frame

def tryLower():
    cities = ['Chicago', 'New York', 'Portland', 'San Francisco',
              'Austin', 'Boston']
    dfCities = pd.DataFrame(cities, columns=['Text'])
    frame = textLower(dfCities)
    for index, row in frame.iterrows():
        print(row['Text'])

######################### main () #########################
def main():
    tryLower()
Try this:
dfCities["Text"] = dfCities["Text"].str.lower()
or this:
def textLower(x):
    return x.lower()

dfCities = dfCities["Text"].apply(textLower)
dfCities
# 0 chicago
# 1 new york
# 2 portland
# 3 san francisco
# 4 austin
# 5 boston
# Name: Text, dtype: object
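Since the goal is a sequence of string manipulations, note that the vectorized .str methods chain, so the whole pipeline can stay one statement (a sketch; .strip and .replace stand in for whatever further modifications you need):
dfCities["Text"] = (dfCities["Text"]
                    .str.lower()
                    .str.strip()
                    .str.replace(' ', '_', regex=False))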
I have a big csv file with the following format:
CSV FILE 1
id, person, city
1, John, NY
2, Lucy, Miami
3, Smith, Los Angeles
4, Mike, Chicago
5, David, Los Angeles
6, Daniel, NY
In another CSV file I have each city with a numerical code:
CSV FILE 2
city , code
NY , 100
Miami, 101
Los Angeles, 102
Chicago, 103
What I need to do is go through CSV File 1 in the city column, read the name of the city and get the numerical code for that city from CSV File 2. I could then just output that list of city codes to a text file. For this example I would get this result:
100
101
102
103
102
100
I used csv.DictReader to create dictionaries for each file but I am stuck trying to find a way to map each city to each code.
Any ideas or pointers in the right direction would be appreciated!
You have some extra whitespace there, and unlike some storage formats, CSV does care about it. If that is actually in your source data, you may have to strip it out before it will be processed as you expect (otherwise various fields will have leading and trailing whitespace).
Assuming that the whitespace is gone, however, it's fairly straightforward to do. You can just create a dictionary mapping names to codes, based on the contents of your second file.
from csv import DictReader

city_codes = {}
with open('file2.csv', newline='') as f:
    for row in DictReader(f):
        city_codes[row['city']] = row['code']

with open('file1.csv', newline='') as f:
    for row in DictReader(f):
        print(city_codes[row['city']])
Naturally, you can send this out to a text file as you wish, simply by redirecting the output of print as you usually would.
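For instance, a minimal variant that writes straight to a file instead of printing (codes.txt is just an assumed output name):
with open('file1.csv', newline='') as f, open('codes.txt', 'w') as out:
    for row in DictReader(f):
        out.write(city_codes[row['city']] + '\n')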
In addition to what Jeremy suggested, you could use the string method .strip() to remove the trailing and leading whitespace automatically.
Consider using sqlite3. You can then do efficient, simple and powerful joins.
If the files are really big, you can benefit from creating a proper index.
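A rough sketch of that approach, assuming the whitespace has been cleaned up as discussed above and the files are named file1.csv and file2.csv:
import csv
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE people (id INTEGER, person TEXT, city TEXT)')
conn.execute('CREATE TABLE codes (city TEXT, code INTEGER)')

with open('file1.csv', newline='') as f:
    conn.executemany('INSERT INTO people VALUES (?, ?, ?)',
                     ((r['id'], r['person'], r['city']) for r in csv.DictReader(f)))

with open('file2.csv', newline='') as f:
    conn.executemany('INSERT INTO codes VALUES (?, ?)',
                     ((r['city'], r['code']) for r in csv.DictReader(f)))

# an index on the join column is what pays off when the files are really big
conn.execute('CREATE INDEX idx_codes_city ON codes (city)')

for (code,) in conn.execute('SELECT codes.code FROM people '
                            'JOIN codes ON people.city = codes.city '
                            'ORDER BY people.id'):
    print(code)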