Pandas Append is adding new row without Index number - python

This is what I'm trying to accomplish with my code: I have a current csv file with tennis player names, and I want to add new players to it once they show up in the rankings. My script goes through the rankings and creates an array, then imports the names from the csv file. It is supposed to see which names are not in the latter, and then extract online data for those names. Then, I just want the new rows to be appended at the end of that old CSV file. My issue is that the new row is being indexed with the player's name rather than following the index of the old file. Any ideas why that's happening? Also, why is an unnamed column being added?
def get_all_players():
    # imports names of players currently in the atp rankings
    current_atp_ranking = check_atp_rankings()
    current_player_list = current_atp_ranking['Player']
    # clean up names in case of white spaces
    for i in range(0, len(current_player_list)):
        current_player_list[i] = current_player_list[i].strip()
    # reads the main file and makes a dataframe out of it
    current_file = 'ATP_stats_new.csv'
    df = pd.read_csv(current_file)
    # gets all the names within the main file to see which current ones aren't there
    names_on_file = list(df['Player'])
    # cleans up in case of any white spaces
    for i in range(0, len(names_on_file)):
        names_on_file[i] = names_on_file[i].strip()
    # Removing Nadal for testing purposes
    names_on_file.remove("Rafael Nadal")
    # creating a list of players in current_players_list but not in names_on_file
    new_player_list = [x for x in current_player_list if x not in names_on_file]
    # loop through new_player_list
    for player in new_player_list:
        # delay to avoid stopping
        time.sleep(2)
        # finding the player's atp link for profile based on their name
        atp_link = current_atp_ranking.loc[current_atp_ranking['Player'] == player, 'ATP_Link']
        atp_link = atp_link.iloc[0]
        # make a basic dictionary with just the player's name and link
        player_dict = [{'Name': player, 'ATP_Link': atp_link}]
        # enter the new dictionary into the existing main file
        df.append(player_dict, ignore_index=True)
    # print dataframe to see how it looks before exporting
    print(df)
    # export dataframe into current file
    df.to_csv(current_file)
This is what the file looks like at first:
Unnamed: 0 Player ... Coach Turned_Pro
0 0 Novak Djokovic ... NaN NaN
1 1 Rafael Nadal ... Carlos Moya, Francisco Roig 2001.0
2 2 Roger Federer ... Ivan Ljubicic, Severin Luthi 1998.0
3 3 Daniil Medvedev ... NaN NaN
4 4 Dominic Thiem ... NaN NaN
... ... ... ... ... ...
1976 1976 Brian Bencic ... NaN NaN
1977 1977 Boruch Skierkier ... NaN NaN
1978 1978 Majed Kilani ... NaN NaN
1979 1979 Quentin Gueydan ... NaN NaN
1980 1980 Preston Brown ... NaN NaN
And this is what the new row looks like:
1977 1977.0 ... NaN
1978 1978.0 ... NaN
1979 1979.0 ... NaN
1980 1980.0 ... NaN
Rafael Nadal NaN ... 2001

There are critical parts of your code missing that are necessary to answer the question precisely. Two thoughts based on what you posted:
Importing Your CSV File
Your previous csv file was probably saved with the index. Make sure the csv file does not contain the dataframe index in its first column from the last time you saved it. When you save, do the following:
file.to_csv('file.csv', index=False)
When you load the file like this:
pandas.read_csv('file.csv')
it will automatically assign the index numbers and there won't be a duplicate column.
Misordering of Columns
I'm not sure what info, or in what order, atp_link is pulling in. From what you provided it looks like it is returning two columns: "Coach" and "Turned_Pro".
I would recommend creating a list (not a dict) for each new player you want to add after you pull the info from atp_link. So if you are adding Nadal, you create an info list from the information pulled for him. Nadal's info list would look like this:
info_list = ['Rafael Nadal', '','2001']
Then you append the list to the dataframe like this:
df.loc[len(df),:] = info_list
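Tying both points together, a rough sketch (assuming current_atp_ranking and new_player_list are built as in your code, that the file has Player and ATP_Link columns, and filling every other column with an empty placeholder) could look like this:
import pandas as pd

df = pd.read_csv('ATP_stats_new.csv')

for player in new_player_list:
    atp_link = current_atp_ranking.loc[
        current_atp_ranking['Player'] == player, 'ATP_Link'].iloc[0]
    # one value per column of df, in column order
    info_list = [player if col == 'Player'
                 else atp_link if col == 'ATP_Link'
                 else '' for col in df.columns]
    # append as a new integer-indexed row instead of df.append(...)
    df.loc[len(df), :] = info_list

# save without the index so no 'Unnamed: 0' column shows up on the next read
df.to_csv('ATP_stats_new.csv', index=False)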
Hope this helps.

Related

Create multiple data frames from one - python

I have a large file (>500 rows) with multiple data points for each unique item in the list, something like :
cheese     weight   location
gouda      1.4      AL
gouda      2        TX
gouda      1.2      CA
cheddar    5.3      AL
cheddar    6        MN
chaddar    2        WA
Havarti    4        CA
Havarti    4.2      AL
I want to make data frames for each cheese to store the relevant data.
I have this:
main_cheese_file = pd.read_csv('CheeseMaster.csv')
cut_the_cheese = main_cheese_file.cheese.unique()
melted = {elem: pd.DataFrame() for elem in cut_the_cheese}
for slice in melted.keys():
    melted[slice] = main_cheese_file[:][main_cheese_file.cheese == slice]
to split it up on the unique thing I want.
What I want to do with it is make df's that can be exported for each cheese with the cheese name as the file name.
So far I can force it with
melted['Cheddar'].to_csv('Cheddar.csv')
and get the Cheddars ....
but I don't want to have to know and type out each type of cheese on the list of 500 rows...
Is there a way to add this to my loop?
You can just iterate over a groupby object:
import pandas as pd
df = pd.read_csv('CheeseMaster.csv')
for k, v in df.groupby('cheese'):
    v.to_csv(f'{k}.csv', index=False)
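Each iteration yields the group key (the cheese name) and the sub-DataFrame of rows for that cheese, so one file is written per cheese without typing out any names. If you still want the dictionary of per-cheese DataFrames from your question, groupby can build that directly as well, for example:
melted = {cheese: group for cheese, group in df.groupby('cheese')}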

For Loop not writing into DataFrame (Python)

I am trying to analyze some NBA data. I have a CSV file with data and a field called "team" that has the team names and the place they finished in concatenated together in the same field. I tried to write a for loop (below) to look at the final character and if it is a digit. For example, the Boston Celtics that finished in 2nd place would show up as "Boston Celtics2". Another example would be "New York Knicks11".
for x in nba['team']:
    while x[-1].isdigit():
        print(x)
        x = x[:len(x)-1]
        print(x)
I included the print statements to make sure that it is slicing the data properly (which it is). But it is not writing back into my dataframe. Why isn't it storing this back into my dataframe?
>>> nba
team
0 Boston Celtics2
1 New York Knicks11
2 Los Angeles Lakers7
Remove the trailing digits by replacing them with an empty string using a regular expression (on newer pandas versions you may need to pass regex=True explicitly):
nba['team'] = nba['team'].str.replace(r'\d+$', '', regex=True)
>>> nba
team
0 Boston Celtics
1 New York Knicks
2 Los Angeles Lakers
It is not overwriting your DataFrame because in your for loop x only stores a copy of the value in your nba DataFrame. To change the value in your DataFrame you could loop with an index and change nba at the specific index:
for i, row in nba.iterrows():
    x = row['team']
    while x[-1].isdigit():
        print(x)
        x = x[:len(x)-1]
        print(x)
    nba.at[i, 'team'] = x

How to construct new DataFrame based on data from for loops?

I have a data set (datacomplete2), where I have data for each country for two different years. I want to calculate the difference between these years for each country (for values life, health, and lifegdp) and create a new data frame with the results.
The code:
for i in datacomplete2['Country'].unique():
    life.append(datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2016), 'life'] - datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2000), 'life'])
    health.append(datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2016), 'health'] - datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2000), 'health'])
    lifegdp.append(datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2016), 'lifegdp'] - datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2000), 'lifegdp'])

newData = pd.DataFrame([life, health, lifegdp, datacomplete2['Country'].unique()], columns = ['life', 'health', 'lifegdp', 'country'])
newData
I think the for loop for calculating is correct, and the problem is in creating the new DataFrame. When I try to run the code, I get an error message: 4 columns passed, passed data had 210 columns.
I have 210 countries, so I assume it is somehow putting these values into the columns?
Here is also a link to a sneak peek of the data I'm using: https://i.imgur.com/jbGFPpk.png
The data as text would look like:
Country Code Year life health lifegdp
0 Algeria DZA 2000 70.292000 3.489033 20.146558
1 Algeria DZA 2016 76.078000 6.603844 11.520259
2 Angola AGO 2000 47.113000 1.908599 24.684593
3 Angola AGO 2016 61.547000 2.713149 22.684710
4 Antigua and Barbuda ATG 2000 73.541000 4.480701 16.412834
... ... ... ... ... ... ...
415 Vietnam VNM 2016 76.253000 5.659194 13.474181
416 World OWID_WRL 2000 67.684998 8.617628 7.854249
417 World OWID_WRL 2016 72.035337 9.978453 7.219088
418 Zambia ZMB 2000 44.702000 7.152371 6.249955
419 Zambia ZMB 2016 61.874000 4.477207 13.819775
Quick help required !!!
I started coding like two weeks ago so I'm very novice with this stuff.
Anurag Reddy's answer is a good concise solution if you know the dates in advance. To present an alternative and slightly more general answer - this problem is a good example use case for pandas.DataFrame.diff.
Note that you don't actually need to sort your example data, but I've included a sort_values() line below to account for unsorted DataFrames.
import pandas as pd
# Read the raw datafile in
df = pd.read_csv("example.csv")
# Sort the data if required
df.sort_values(by=["Country"], inplace=True)
# Remove columns where you don't need the difference
new_df = df.drop(["Code", "Year"], axis=1)
# Group the data by country, take the difference between the rows, remove NaN rows, and reset the index to sequential integers
new_df = new_df.groupby(["Country"], as_index=False).diff().dropna().reset_index(drop=True)
# Add back the country names and codes as columns in the new DataFrame
new_df.insert(loc=0, column="Country", value=df["Country"].unique())
new_df.insert(loc=1, column="Code", value=df["Code"].unique())
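For the sample rows above, the first two rows of new_df should come out roughly like this (each value is the 2016 figure minus the 2000 figure):
   Country Code    life    health   lifegdp
0  Algeria  DZA   5.786  3.114811 -8.626299
1   Angola  AGO  14.434  0.804550 -1.999883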
You could do this instead (note that drop returns a new DataFrame, so the result has to be assigned back):
country_list = df.Country.unique().tolist()
df = df.drop(columns=['Code'])
df_2016 = df.loc[(df['Country'].isin(country_list))&(df['Year']==2016)].reset_index()
df_2000 = df.loc[(df['Country'].isin(country_list))&(df['Year']==2000)].reset_index()
df_2016 = df_2016.drop(columns=['Year'])
df_2000 = df_2000.drop(columns=['Year'])
df_2016.set_index('Country').subtract(df_2000.set_index('Country'), fill_value=0)

How to read and modify a (.gct) file using python?

Which libraries would help me read a .gct file in Python and edit it, for example by removing the rows with NaN values? And how would the following code change if I applied it to a .gct file?
data = pd.read_csv('PAAD1.csv')
new_data = data.dropna(axis = 0, how ='any')
print("Old data frame length:", len(data), "\nNew data frame length:",
len(new_data), "\nNumber of rows with at least 1 NA value: ",
(len(data)-len(new_data)))
new_data.to_csv('EditedPAAD.csv')
You should use the cmapPy package for this. Compared to read_csv it gives you more freedom and domain-specific utilities. E.g. if your *.gct looks like this:
#1.2
22215 2
Name Description Tumor_One Normal_One
1007_s_at na -0.214548 -0.18069
1053_at "RFC2 : replication factor C (activator 1) 2, 40kDa |#RFC2|" 0.868853 -1.330921
117_at na 1.124814 0.933021
121_at PAX8 : paired box gene 8 |#PAX8| -0.825381 0.102078
1255_g_at GUCA1A : guanylate cyclase activator 1A (retina) |#GUCA1A| -0.734896 -0.184104
1294_at UBE1L : ubiquitin-activating enzyme E1-like |#UBE1L| -0.366741 -1.209838
1316_at "THRA : thyroid hormone receptor, alpha (erythroblastic leukemia viral (v-erb-a) oncogene homolog, avian) |#THRA|" -0.126108 1.486972
1320_at "PTPN21 : protein tyrosine phosphatase, non-receptor type 21 |#PTPN21|" 3.083681 -0.086705
...
You can extract only rows with a desired probeset id (row id), e.g. ['1007_s_at', '1053_at', '117_at', '121_at', '1255_g_at', '1294_at UBE1L']
So to read a file, remove the nan in the description and save it again, do:
from cmapPy.pandasGEXpress.parse_gct import parse
from cmapPy.pandasGEXpress.write_gct import write
data = parse('example.gct', rid=['1007_s_at', '1053_at',
                                 '117_at', '121_at',
                                 '1255_g_at', '1294_at UBE1L'])
# remove nan values from row_metadata (description column)
data.row_metadata_df.dropna(inplace=True)
# remove the entries of .data_df where nan values are in row_metadata
data.data_df = data.data_df.loc[data.row_metadata_df.index]
# Can only write GCT version 1.3
write(data, 'new_example.gct')
The new_example.gct looks then like this:
#1.3
3 2 1 0
id Description Tumor_One Normal_One
1053_at RFC2 : replication factor C (activator 1) 2, 40kDa |#RFC2| 0.8689 -1.3309
121_at PAX8 : paired box gene 8 |#PAX8| -0.8254 0.1021
1255_g_at GUCA1A : guanylate cyclase activator 1A (retina) |#GUCA1A| -0.7349 -0.1841
A quick search on Google will give you the following:
https://pypi.org/project/cmapPy/
Regarding the code: if you don't care about the metadata in the first two rows, it should work for your purpose, but you need to indicate that the delimiter is a tab and skip the first two rows: pandas.read_csv(PATH_TO_GCT_FILE, sep='\t', skiprows=2)
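For example, a rough sketch of your dropna workflow using plain pandas (assuming the file is named PAAD1.gct; note that writing the result back out with to_csv produces a plain tab-separated file rather than a valid .gct, which is one reason to prefer cmapPy) would be:
import pandas as pd

# .gct files are tab-separated with two metadata lines before the header
data = pd.read_csv('PAAD1.gct', sep='\t', skiprows=2)

# drop rows containing any NaN, as in the original CSV-based code
new_data = data.dropna(axis=0, how='any')

print("Old data frame length:", len(data),
      "\nNew data frame length:", len(new_data),
      "\nNumber of rows with at least 1 NA value:", len(data) - len(new_data))

new_data.to_csv('EditedPAAD.txt', sep='\t', index=False)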

need to sort .txt file into data frame with Pandas using Python 2.7

I'm new to using programming languages and I'm having trouble working this particular issue out. I'm a journalist trying to use Python to reorganize 911 data from a county dispatch office that has been provided in a .txt file.
This is how one call comes in the current format:
Incident Number: PD160010001
Incident Type: SUSPICIOUS PERSON(S)
EMS Blk: 186605 Fire Blk: 65005 Police Blk: 22145
Location: Location name,22
at XXXX Name RD ,22
Entered: 01/01/16 00:00
Dispatched: 01/01/16 00:00
Enroute: 01/01/16 00:00
On Scene: 01/01/16 00:00
Transport: / / :
Trans Complete: / / :
Closed: 01/01/16 00:04
01/01/16 00:00 OUTSRV
01/01/16 00:00 DISPOS 22H4
01/01/16 00:00 PREMPT 22H4
01/01/16 00:00 DISPOS 2212
01/01/16 00:00 EXCH 22H4
01/01/16 00:01 ADDER 22H4
01/01/16 00:04 CLEAR 2212
01/01/16 00:04 CLEAR 22H4
01/01/16 00:04 CLOSE 22H4
I was able to reorganize this in Excel using the Right and Left functions and a few other steps to get something like this:
Incident Number Incident Type EMS Blk: Closed
PD160010001 SUSPICIOUS PERSON(S) 186605 ... 01/01/16 00:04
The 9-10 rows of timestamped dispatch data at the bottom of each incident are redundant and not necessary.
What I'm having trouble with is finding a way to tell Pandas to take the name to the left of the colon and recognize that as a column head, while taking the info to the right of the colon and assigning it to the corresponding column, then repeat until after the Closed column and skip the redundant information.
One year's worth of data in the .txt file is approximately 6 million rows and is cut down to just over 501,000 once it's been reorganized. Doing it in Excel by hand is going to take about 4 hours per file, and I want to do an analysis of call times over 10 years.
I need to learn to do this in Python to make it a practical project to follow through with.
Thanks, everyone. First time posting a question here.
Your description of the layout of the data is ambiguous, so I'm making some assumptions. I'm guessing the .txt file looks somewhat like this:
header2 header3 header4 header5 header6 header7 header8 header9
index 1 data12 data13 data14 data15 data16 data17 data18 data19
index 2 data22 data23 data24 data25 data26 data27 data28 data29
Where each index corresponds to a certain call and each column corresponds to some attribute of the call with the headers denoting what the data in the column represents.
The following program turns the above .txt file into a pandas dataframe and prints it out.
import pandas as pd
import re

with open(filename) as file:
    rows = file.readlines()

columns = rows[0]                         # get the top row
columns = re.sub(' {2,}', ',', columns)   # replace runs of two or more spaces with commas
columns = columns.strip().split(',')      # turn the row into a list

content = rows[1:]                        # all but the first row
content = [re.sub(' {2,}', ',', row).strip() for row in content]  # again, whitespace to commas
content = [row.split(',') for row in content]   # turn rows into lists
index = [row[0] for row in content]             # take the first element of each row as the index
content = [row[1:] for row in content]          # remove the index from content

df = pd.DataFrame(data=content, index=index, columns=columns)  # combine into a dataframe
print(df)
Here we've assumed the columns have at least two spaces between them and that your data won't have any double spaces in it. If there's more space than that between the columns, you can change the regex to look for 3 or more consecutive spaces.
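For example, to require at least three consecutive spaces between columns you would just change the pattern to re.sub(' {3,}', ',', row).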
The output is
header2 header3 header4 header5 header6 header7 header8 header9
index 1 data12 data13 data14 data15 data16 data17 data18 data19
index 2 data22 data23 data24 data25 data26 data27 data28 data29
but you could do much more than print it out now that it's a dataframe.
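If the raw file actually looks like the colon-delimited blocks shown in the question rather than a space-aligned table, a rough sketch of the "split on the colon" approach could look like the following (the file name and field list are assumptions based on the sample; the trailing timestamped status lines are never matched, so they get skipped):
import re
import pandas as pd

# fields to pull from each incident block, in the order they appear in the sample
fields = ['Incident Number', 'Incident Type', 'EMS Blk', 'Fire Blk', 'Police Blk',
          'Location', 'Entered', 'Dispatched', 'Enroute', 'On Scene',
          'Transport', 'Trans Complete', 'Closed']

with open('dispatch.txt') as f:   # hypothetical file name
    text = f.read()

records = []
for chunk in text.split('Incident Number:')[1:]:   # one chunk per incident
    block = 'Incident Number:' + chunk
    record = {}
    for field in fields:
        # the three "Blk" fields share one line, so grab only the next token;
        # every other field takes the rest of its line
        if field.endswith('Blk'):
            pattern = re.escape(field) + r':\s*(\S+)'
        else:
            pattern = re.escape(field) + r':\s*(.*)'
        match = re.search(pattern, block)
        record[field] = match.group(1).strip() if match else ''
    records.append(record)

df = pd.DataFrame(records, columns=fields)
df.to_csv('dispatch_clean.csv', index=False)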
