I am new to data analysis in Python and have run into a problem while working on a project. Some of the values in my CSV file contain the delimiter inside double quotes, so pandas can't split the columns correctly.
import csv
import pandas as pd

top = pd.read_csv(r"C:\Users\User\Desktop\data analytics\Project\Analysis-Spotify-Top-2000\Spotify-2000.csv",
                  delimiter=",", encoding="UTF-8", doublequote=True,
                  engine="python", quotechar='"', quoting=csv.QUOTE_ALL)
I found which records cause the problem:
My teacher advised me to create a new dataframe with these values and the same columns, delete the records that have a delimiter inside double quotes from the original, and then merge the new dataframe back into it.
But honestly, I don't know how to do it properly (I tried some weird things - screen2).
import numpy as np
import pandas as pd

# Rows whose Title is NaN are the ones that were split incorrectly
is_title_null = pd.isnull(top["Title"])
missing_list = top[is_title_null]["Index"].tolist()

# Each broken record is one long comma-separated string; split it back into fields
list_of_missing_list = []
for i in missing_list:
    l = i.split(', ')
    list_of_missing_list.append(l)
list_of_missing_list

# Empty dataframe with the 15 column names from the csv header
missing_df = pd.DataFrame(np.empty((0, 15)))
missing_df.columns = ["Index", "Title", "Artist", "Top Genre", "Year",
                      "Beats Per Minute (BPM)", "Energy", "Danceability",
                      "Loudness (dB)", "Liveness", "Valence", "Length (Duration)",
                      "Acousticness", "Speechiness", "Popularity"]
# note: append returns a new dataframe (newer pandas versions replace it with pd.concat)
missing_df = missing_df.append(list_of_missing_list, ignore_index=True)
Here is the link to my project on GitHub (you can see the problem there): https://github.com/Sabina-Karenkina/Analysis-Spotify-Top-2000
OK. This is not a really elegant way to do things, but as I mentioned in my earlier comment, you will not fix the problem by creating the dataframe first, because the file is corrupt to begin with. I managed to find a way to solve it easily.
Open your Spotify-2000 file with Excel and use Text to Columns. When asked for the delimiter, choose , (comma). Save your file as a new csv file (Spotify2.csv), but make sure to use ; as the delimiter (this is because some titles may contain commas).
Now, use pandas to read this new file:
import pandas as pd

top = pd.read_csv(r"C:/Users/k_sego/spotify2.csv", delimiter=";",
                  encoding="iso-8859-1", doublequote=True, engine="python")
top.head(100)
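If you would rather stay in pandas and follow the approach your teacher described (rebuild the broken rows and merge them back into the original dataframe), a rough sketch could look like the following. It assumes, like the code in the question, that each broken record ends up as one long comma-separated string in the Index column and splits into exactly 15 fields; titles that themselves contain a comma would still need manual fixing.

import pandas as pd

# Column names as they appear in the Spotify-2000.csv header
cols = ["Index", "Title", "Artist", "Top Genre", "Year", "Beats Per Minute (BPM)",
        "Energy", "Danceability", "Loudness (dB)", "Liveness", "Valence",
        "Length (Duration)", "Acousticness", "Speechiness", "Popularity"]

# Rows whose Title is NaN are assumed to have the whole record crammed into 'Index'
bad_mask = top["Title"].isnull()

# Split each broken record back into its fields
bad_rows = top.loc[bad_mask, "Index"].str.split(", ", expand=True)
bad_rows.columns = cols  # raises if a record did not split into exactly 15 fields

# Drop the broken originals and append the repaired rows
fixed = pd.concat([top.loc[~bad_mask], bad_rows], ignore_index=True)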
I have a CSV file in which I need to change a date value in each row. The date to be changed appears in the same column in every row of the CSV.
import csv

firstfile = open('example.csv', "r")
firstReader = csv.reader(firstfile, delimiter='|')
firstData = list(firstReader)

DateToChange = firstData[1][25]
ChangedDate = '2018-09-30'

for row in firstReader:
    for column in row:
        print(column)
        if column == DateToChange:
            # Change the date
            outputFile = open("output.csv", "w")
            outputFile.writelines(firstfile)
            outputFile.close()
I am trying to grab a date that is already in the CSV and change it using a for loop, then output the original file with the changed dates. However, the code above doesn't seem to do anything at all. I am new to Python, so I might not be understanding how to use a for loop correctly.
Any help at all is greatly appreciated!
When you call list(firstReader), you read all of the CSV data into the firstData list. When you then, later, call for row in firstReader:, the firstReader is already exhausted, so nothing will be looped. Instead, change it to for row in firstData:.
Also, when you try to write to the file, you are writing firstfile into it rather than the altered row. I'll leave you to figure out how to update the date in the row, but after that you'll need to give the file a string to write. Since the file is pipe-delimited, that string should be '|'.join(row) plus a newline, so outputFile.write('|'.join(row) + '\n').
Finally, you should open your output file once, not on every pass through the loop. Move the open call above your loop and the close call after it. Then, when you have a moment, search Google for 'python context manager open file' for a better way to manage the open file.
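Putting those three points together, a corrected version could look roughly like this. It keeps the variable names and the column index 25 from the question, and uses csv.writer so the pipe delimiter is preserved in the output:

import csv

# Read everything up front; the reader object can only be iterated once
with open('example.csv', 'r', newline='') as firstfile:
    firstData = list(csv.reader(firstfile, delimiter='|'))

DateToChange = firstData[1][25]   # the date as it currently appears in the file
ChangedDate = '2018-09-30'

# Open the output file once, outside the loop
with open('output.csv', 'w', newline='') as outputFile:
    writer = csv.writer(outputFile, delimiter='|')
    for row in firstData:
        # Replace the old date wherever it matches in the row
        row = [ChangedDate if column == DateToChange else column for column in row]
        writer.writerow(row)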
You could use pandas and numpy. Here I create a dataframe from scratch, but you could load it directly from a CSV:
import pandas as pd
import numpy as np
date_df = pd.DataFrame(
{'col1' : ['12', '14', '14', '3412', '2'],
'col2' : ['2018-09-30', '2018-09-14', '2018-09-01', '2018-09-30', '2018-12-01']
})
date_to_change = '2018-09-30'
replacement_date = '2018-10-01'
date_df['col2'] = np.where(date_df['col2'] == date_to_change, replacement_date, date_df['col2'])
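If the data lives in a file rather than being built by hand, the same replacement works after read_csv and can be written back out. The file name, the pipe delimiter and the column name 'col2' are placeholders taken from the question and the example above, so adjust them to your data (same imports as above):

date_df = pd.read_csv('example.csv', sep='|', dtype=str)   # dtype=str keeps the dates as plain text
date_df['col2'] = np.where(date_df['col2'] == date_to_change, replacement_date, date_df['col2'])
date_df.to_csv('output.csv', sep='|', index=False)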
I have a csv file from this webpage.
I want to read some of the columns in the downloaded file (the csv version can be downloaded in the upper right corner).
Let's say I want 2 columns:
column 59, which in the header is star_name
column 60, which in the header is ra
However, for some reason the authors of the webpage sometimes decide to move the columns around.
In the end I want something like this, keeping in mind that values can be missing.
data = #read data in a clever way
names = data['star_name']
ras = data['ra']
This will prevent my program from malfunctioning when the columns are moved around again in the future, as long as the column names stay the same.
Until now I have tried various approaches using the csv module and, recently, the pandas module, both without any luck.
EDIT (added two lines plus the header of my data file. Sorry, but it's extremely long.)
# name, mass, mass_error_min, mass_error_max, radius, radius_error_min, radius_error_max, orbital_period, orbital_period_err_min, orbital_period_err_max, semi_major_axis, semi_major_axis_error_min, semi_major_axis_error_max, eccentricity, eccentricity_error_min, eccentricity_error_max, angular_distance, inclination, inclination_error_min, inclination_error_max, tzero_tr, tzero_tr_error_min, tzero_tr_error_max, tzero_tr_sec, tzero_tr_sec_error_min, tzero_tr_sec_error_max, lambda_angle, lambda_angle_error_min, lambda_angle_error_max, impact_parameter, impact_parameter_error_min, impact_parameter_error_max, tzero_vr, tzero_vr_error_min, tzero_vr_error_max, K, K_error_min, K_error_max, temp_calculated, temp_measured, hot_point_lon, albedo, albedo_error_min, albedo_error_max, log_g, publication_status, discovered, updated, omega, omega_error_min, omega_error_max, tperi, tperi_error_min, tperi_error_max, detection_type, mass_detection_type, radius_detection_type, alternate_names, molecules, star_name, ra, dec, mag_v, mag_i, mag_j, mag_h, mag_k, star_distance, star_metallicity, star_mass, star_radius, star_sp_type, star_age, star_teff, star_detected_disc, star_magnetic_field
11 Com b,19.4,1.5,1.5,,,,326.03,0.32,0.32,1.29,0.05,0.05,0.231,0.005,0.005,0.011664,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2008,2011-12-23,94.8,1.5,1.5,2452899.6,1.6,1.6,Radial Velocity,,,,,11 Com,185.1791667,17.7927778,4.74,,,,,110.6,-0.35,2.7,19.0,G8 III,,4742.0,,
11 UMi b,10.5,2.47,2.47,,,,516.22,3.25,3.25,1.54,0.07,0.07,0.08,0.03,0.03,0.012887,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2009,2009-08-13,117.63,21.06,21.06,2452861.05,2.06,2.06,Radial Velocity,,,,,11 UMi,229.275,71.8238889,5.02,,,,,119.5,0.04,1.8,24.08,K4III,1.56,4340.0,,
An easy way to do this is using the pandas library like this.
import pandas as pd
fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print df.keys()
# See content in 'star_name'
print df.star_name
The trick here is skipinitialspace, which removes the spaces after the delimiter, so the header field ' star_name' becomes 'star_name'.
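If you prefer not to rely on skipinitialspace, another option (not what the answer above does, just an alternative) is to read the file as-is and strip the column names afterwards:

df = pd.read_csv('data.csv')
df.columns = df.columns.str.strip()   # ' star_name' becomes 'star_name'
print(df['star_name'])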
According to the latest pandas documentation, you can read a CSV file and select only the columns you want to read.
import pandas as pd
df = pd.read_csv('some_data.csv', usecols = ['col1','col2'], low_memory = True)
Here usecols reads only the selected columns into the dataframe.
low_memory makes pandas process the file internally in chunks, which lowers memory use while parsing.
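If memory really is a concern, you can also read the file explicitly in chunks with the chunksize parameter; note that this is a separate pandas feature, not what low_memory does internally:

# Each chunk is a DataFrame containing only the selected columns
pieces = []
for chunk in pd.read_csv('some_data.csv', usecols=['col1', 'col2'], chunksize=10000):
    pieces.append(chunk)
df = pd.concat(pieces, ignore_index=True)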
The answers above are for Python 2, so for Python 3 users I am giving this answer. You can use the code below:
import pandas as pd
fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print(df.keys())
# See content in 'star_name'
print(df.star_name)
I got a solution to the above problem in a different way: I still read the entire CSV file, but tweak the display part to show only the desired columns.
import pandas as pd
df = pd.read_csv('data.csv', skipinitialspace=True)
print df[['star_name', 'ra']]
This can help in some scenarios when learning the basics and filtering data by column in a dataframe.
I think you need to try this method.
import pandas as pd
data_df = pd.read_csv('data.csv')
print(data_df['star_name'])
print(data_df['ra'])
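Since the question notes that values can be missing, keep in mind that pandas reads empty fields as NaN; dropna is an easy way to look at only the rows that actually have a value:

# Empty fields come in as NaN; drop them to see only rows with a value
print(data_df['star_name'].dropna())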