Read specific columns with pandas or other python module

I have a csv file from this webpage.
I want to read some of the columns in the downloaded file (the csv version can be downloaded in the upper right corner).
Let's say I want 2 columns:
59 which in the header is star_name
60 which in the header is ra.
However, for some reason the authors of the webpage sometimes decide to move the columns around.
In the end I want something like this, keeping in mind that values can be missing.
data = #read data in a clever way
names = data['star_name']
ras = data['ra']
This will prevent my program from malfunctioning when the columns are moved around again in the future, as long as the names stay the same.
Until now I have tried various ways using the csv module and, recently, the pandas module, both without any luck.
EDIT (added two lines + the header of my datafile. Sorry, but it's extremely long.)
# name, mass, mass_error_min, mass_error_max, radius, radius_error_min, radius_error_max, orbital_period, orbital_period_err_min, orbital_period_err_max, semi_major_axis, semi_major_axis_error_min, semi_major_axis_error_max, eccentricity, eccentricity_error_min, eccentricity_error_max, angular_distance, inclination, inclination_error_min, inclination_error_max, tzero_tr, tzero_tr_error_min, tzero_tr_error_max, tzero_tr_sec, tzero_tr_sec_error_min, tzero_tr_sec_error_max, lambda_angle, lambda_angle_error_min, lambda_angle_error_max, impact_parameter, impact_parameter_error_min, impact_parameter_error_max, tzero_vr, tzero_vr_error_min, tzero_vr_error_max, K, K_error_min, K_error_max, temp_calculated, temp_measured, hot_point_lon, albedo, albedo_error_min, albedo_error_max, log_g, publication_status, discovered, updated, omega, omega_error_min, omega_error_max, tperi, tperi_error_min, tperi_error_max, detection_type, mass_detection_type, radius_detection_type, alternate_names, molecules, star_name, ra, dec, mag_v, mag_i, mag_j, mag_h, mag_k, star_distance, star_metallicity, star_mass, star_radius, star_sp_type, star_age, star_teff, star_detected_disc, star_magnetic_field
11 Com b,19.4,1.5,1.5,,,,326.03,0.32,0.32,1.29,0.05,0.05,0.231,0.005,0.005,0.011664,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2008,2011-12-23,94.8,1.5,1.5,2452899.6,1.6,1.6,Radial Velocity,,,,,11 Com,185.1791667,17.7927778,4.74,,,,,110.6,-0.35,2.7,19.0,G8 III,,4742.0,,
11 UMi b,10.5,2.47,2.47,,,,516.22,3.25,3.25,1.54,0.07,0.07,0.08,0.03,0.03,0.012887,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2009,2009-08-13,117.63,21.06,21.06,2452861.05,2.06,2.06,Radial Velocity,,,,,11 UMi,229.275,71.8238889,5.02,,,,,119.5,0.04,1.8,24.08,K4III,1.56,4340.0,,
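For reference, a minimal sketch of the csv-module route mentioned above, assuming the downloaded file is saved as data.csv and keeps the header shown above:
import csv
# DictReader keys each row by header name, so the lookup keeps working even
# if the columns are moved around. skipinitialspace=True drops the space
# after each comma, so ' star_name' is seen as 'star_name'. Missing values
# simply come back as empty strings.
names, ras = [], []
with open('data.csv', newline='') as f:
    for row in csv.DictReader(f, skipinitialspace=True):
        names.append(row['star_name'])
        ras.append(row['ra'])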

An easy way to do this is using the pandas library like this.
import pandas as pd
fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print df.keys()
# See content in 'star_name'
print df.star_name
The key here is skipinitialspace, which removes the spaces after the delimiter in the header, so ' star_name' becomes 'star_name'.

According to the latest pandas documentation you can read a csv file selecting only the columns which you want to read.
import pandas as pd
df = pd.read_csv('some_data.csv', usecols=['col1', 'col2'], low_memory=True)
Here usecols reads only the selected columns into the dataframe.
We use low_memory=True so that the file is processed internally in chunks, which keeps memory usage down while parsing.
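As a variation, usecols also accepts a callable that is evaluated against each column name; a small sketch, assuming the same data.csv file, that sidesteps the leading-space issue in the question's header:
import pandas as pd
# The lambda strips each header name before checking it against the wanted
# set, so ' star_name' and 'star_name' are both matched.
wanted = {'star_name', 'ra'}
df = pd.read_csv('data.csv', usecols=lambda c: c.strip() in wanted,
                 skipinitialspace=True)
print(df.columns.tolist())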

The answers above use Python 2 print syntax; for Python 3 users, you can use the code below:
import pandas as pd
fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print(df.keys())
# See content in 'star_name'
print(df.star_name)

Another way to solve this: read the entire csv file, but tweak the display to show only the desired columns.
import pandas as pd
df = pd.read_csv('data.csv', skipinitialspace=True)
print(df[['star_name', 'ra']])
This could help in some scenarios when learning the basics and filtering a dataframe by its columns.
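Since the question notes that values can be missing, a short follow-up sketch that drops the incomplete rows after selecting the two columns:
import pandas as pd
# Select the two columns by name, then drop rows where either value is missing.
df = pd.read_csv('data.csv', skipinitialspace=True)
subset = df[['star_name', 'ra']].dropna()
print(subset.head())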

I think you need to try this method.
import pandas as pd
data_df = pd.read_csv('data.csv')
print(data_df['star_name'])
print(data_df['ra'])

Related

How to preserve complicated excel header formats when manipulating data using Pandas Python?

I am parsing a large excel data file into another one; however, the headers are very abnormal. I tried to use "read_excel skiprows" and that did not work. I also tried to include the header in
df = pd.read_excel(user_input, header= [1:3], sheet_name = 'PN Projection'), but then I get this error: "ValueError: cannot join with no overlapping index names." To get around this I tried to name the columns by location, and that did not work either.
When I run the code as shown below everything works fine, but past cell "U" the header titles come back as "unnamed1, 2, ...". I understand this is because pandas is treating the first row as the header (and those cells are empty), but how do I fix this? Is there a way to preserve the headers without manually typing in the format for each cell? Any and all help is appreciated, thank you!
small section of the excel file header
the code I am trying to run
#!/usr/bin/env python
import sys
import os
import pandas as pd
#load source excel file
user_input = input("Enter the path of your source excel file (omit 'C:'): ")
#reads the source excel file
df = pd.read_excel(user_input, sheet_name = 'PN Projection')
#Filtering dataframe
#Filters out rows with 'EOL' in column 'item status' and 'xcvr' in 'description'
df = df[~(df['Item Status'] == 'EOL')]
df = df[~(df['Description'].str.contains("XCVR", na=False))]
#Filters in rows with "XC" or "spartan" in 'description' column
df = df[(df['Description'].str.contains("XC", na=False) | df['Description'].str.contains("Spartan", na=False))]
print(df)
#Saving to a new spreadsheet called Filtered Data
df.to_excel('filtered_data.xlsx', sheet_name='filtered_data')
If you do not need the top 2 rows, then:
df = pd.read_excel(user_input, sheet_name='PN Projection', skiprows=range(0, 2))
This has worked for me when handling several strangely formatted files. Let me know if this isn't what you're looking for, or if there are additional issues.
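If the sheet really has a two-row header, another possible sketch (an assumption about the intent, with a placeholder path) is to hand read_excel a list of 0-based row numbers, which builds a MultiIndex instead of "Unnamed" columns:
import pandas as pd
# Placeholder path; the sheet name is taken from the question. header=[1, 2]
# tells pandas to build the column labels from rows 1 and 2 of the sheet.
df = pd.read_excel('source.xlsx', sheet_name='PN Projection', header=[1, 2])
print(df.columns[:5])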

Loading CSV into dataframe results in all records becoming "NaN"

I'm new to python (and posting on SO), and I'm trying to use some code I wrote that worked in another similar context to import data from a file into a MySQL table. To do that, I need to convert it to a dataframe. In this particular instance I'm using Federal Election Comission data that is pipe-delimited (It's the "Committee Master" data here). It looks like this.
C00000059|HALLMARK CARDS PAC|SARAH MOE|2501 MCGEE|MD #500|KANSAS CITY|MO|64108|U|Q|UNK|M|C||
C00000422|AMERICAN MEDICAL ASSOCIATION POLITICAL ACTION COMMITTEE|WALKER, KEVIN MR.|25 MASSACHUSETTS AVE, NW|SUITE 600|WASHINGTON|DC|200017400|B|Q||M|M|ALABAMA MEDICAL PAC|
C00000489|D R I V E POLITICAL FUND CHAPTER 886|JERRY SIMS JR|3528 W RENO||OKLAHOMA CITY|OK|73107|U|N||Q|L||
C00000547|KANSAS MEDICAL SOCIETY POLITICAL ACTION COMMITTEE|JERRY SLAUGHTER|623 SW 10TH AVE||TOPEKA|KS|666121627|U|Q|UNK|Q|M|KANSAS MEDICAL SOCIETY|
C00000729|AMERICAN DENTAL ASSOCIATION POLITICAL ACTION COMMITTEE|DI VINCENZO, GIORGIO T. DR.|1111 14TH STREET, NW|SUITE 1100|WASHINGTON|DC|200055627|B|Q|UNK|M|M|INDIANA DENTAL PAC|
When I run this code, all of the records come back "NaN."
import pandas as pd
import pymysql
print('convert CSV to dataframe')
data = pd.read_csv ('Desktop/Python/FECupdates/cm.txt', delimiter='|')
df = pd.DataFrame(data, columns=['CMTE_ID','CMTE_NM','TRES_NM','CMTE_ST1','CMTE_ST2','CMTE_CITY','CMTE_ST','CMTE_ZIP','CMTE_DSGN','CMTE_TP','CMTE_PTY_AFFILIATION','CMTE_FILING_FREQ','ORG_TP','CONNECTED_ORG_NM','CAND_ID'])
print(df.head(10))
If I remove the dataframe part and just do this, it displays the data, so it doesn't seem like it's a problem with the file itself (but what do I know?):
import pandas as pd
import pymysql
print('convert CSV to dataframe')
data = pd.read_csv ('Desktop/Python/FECupdates/cm.txt', delimiter='|')
print(data.head(10))
I've spent hours looking at different questions here that seem to address similar issues, where the problems apparently stemmed from things like the encoding or different kinds of delimiters, but each time I try to make the same changes to my code I get the same result. I've also converted the whole thing to a csv, by changing all the commas in fields to "$" and then changing the pipes to commas. It still shows up as all "NaN", even though the number of records is correct if I upload it to MySQL (they're just all empty).
The problem is the columns list you pass to pd.DataFrame: when those names don't match the columns pandas actually parsed, every value comes back as NaN. Pandas can recognize the columns automatically.
import pandas as pd
import pymysql
print('convert CSV to dataframe')
data = pd.read_csv('cm.txt', delimiter='|')
df = pd.DataFrame(data)
print(df.head(10))
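A tiny illustration of why the original code produced NaN: passing an existing DataFrame to pd.DataFrame together with columns= selects those names, and names that don't exist come back as all-NaN columns.
import pandas as pd
# 'A' and 'B' don't match the existing columns 'a' and 'b', so both columns
# in the result are filled with NaN.
data = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df = pd.DataFrame(data, columns=['A', 'B'])
print(df)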
Also, you can create an empty dataframe and concatenate it with the file you just read.
import pandas as pd
import pymysql
print('convert CSV to dataframe')
data = pd.read_csv('cm.txt', delimiter='|')
data2 = pd.DataFrame()
df = pd.concat([data,data2],ignore_index=True)
print(df.head(10))
Try this, it worked for me:
import pandas as pd

path = 'Desktop/Python/FECupdates/'
# The file has no header row, so read it with header=None and assign the
# column names afterwards.
df = pd.read_csv(path + 'cm.txt', encoding='unicode_escape', sep='|', header=None)
df = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
df.columns = ['CMTE_ID','CMTE_NM','TRES_NM','CMTE_ST1','CMTE_ST2','CMTE_CITY','CMTE_ST','CMTE_ZIP','CMTE_DSGN','CMTE_TP','CMTE_PTY_AFFILIATION','CMTE_FILING_FREQ','ORG_TP','CONNECTED_ORG_NM','CAND_ID']
df.head(200)

Python Datacompy library: how to save report string into a csv file?

I'm comparing two data frames using datacompy, but how can I save the final result as an Excel sheet or CSV file? I get a string as output; how can I save it as a CSV?
import pandas as pd
df1_1=pd.read_csv('G1-1.csv')
df1_2=pd.read_csv('G1-2.csv')
import datacompy
compare = datacompy.Compare(
    df1_1,
    df1_2,
    join_columns='SAMPLED CONTENT (URL to content)',
)
print(compare.report())
I have tried this, and it worked for me:
with open('//Path', 'w', encoding='utf-8') as report_file:
    report_file.write(compare.report())
If you are just using pandas, you can try its own way of writing to csv:
df = pd.DataFrame([['yy', 'rr'], ['tt', 'rr'], ['cc', 'rr']], index=range(3),
                  columns=['a', 'b'])
df.to_csv('compare.csv')
I haven't used datacompy myself, but I suggest putting your results into a dataframe; then you can use to_csv.
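A rough sketch of that idea, assuming compare is the datacompy.Compare object from the question: the report is just a string, so you can split it into lines, wrap it in a dataframe, and write that out with to_csv.
import pandas as pd
# compare.report() returns a plain string; one line of the report per row.
report_lines = compare.report().splitlines()
pd.DataFrame({'report': report_lines}).to_csv('compare_report.csv', index=False)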
This works fine for me as well. Full code:
compare = datacompy.Compare(
    Oracle_DF1, PostgreSQL_DF2,
    join_columns=['c_transaction_cd', 'c_anti_social_force_req_id'],  # you can also specify a list of columns
    abs_tol=0,
    rel_tol=0,
    df1_name='Oracle Source',
    df2_name='PostgrSQL Reference'
)
compare.matches(ignore_extra_columns=False)
Report = compare.report()
csvFileToWrite = r'D://Postgres_Problem_15Feb21//Oracle_PostgreSQLDataFiles//Sample//summary.csv'
with open(csvFileToWrite, mode='w', encoding='utf-8') as report_file:
    report_file.write(Report)

Not reading all rows while importing csv into pandas dataframe

I am trying the kaggle challenge here, and unfortunately I am stuck at a very basic step.
I am trying to read the datasets into a pandas dataframe by executing following command:
test = pd.DataFrame.from_csv("C:/Name/DataMining/hillary/data/output/emails.csv")
The problem is that this file, as you would find out, has over 300,000 records, but I am reading only 7945.
print (test.shape)
(7945, 21)
Now I have double checked the file and I cannot find anything special about line number 7945. Any pointers why this could be happening?
I think it is better to use the read_csv function with the parameters quoting=csv.QUOTE_NONE and error_bad_lines=False (link).
import pandas as pd
import csv
test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, error_bad_lines=False)
print (test.shape)
#(381422, 22)
But some (problematic) rows will be skipped.
If you want to skip the email body data, you can use:
import pandas as pd
import csv
test = pd.read_csv(
    "output/Emails.csv",
    quoting=csv.QUOTE_NONE,
    sep=',',
    error_bad_lines=False,
    header=None,
    names=[
        "Id", "DocNumber", "MetadataSubject", "MetadataTo", "MetadataFrom",
        "SenderPersonId", "MetadataDateSent", "MetadataDateReleased",
        "MetadataPdfLink", "MetadataCaseNumber", "MetadataDocumentClass",
        "ExtractedSubject", "ExtractedTo", "ExtractedFrom", "ExtractedCc",
        "ExtractedDateSent", "ExtractedCaseNumber", "ExtractedDocNumber",
        "ExtractedDateReleased", "ExtractedReleaseInPartOrFull",
        "ExtractedBodyText", "RawText"])
print (test.shape)
#delete row with NaN in column MetadataFrom
test = test.dropna(subset=['MetadataFrom'])
#delete headers in data
test = test[test.MetadataFrom != 'MetadataFrom']

Pandas DateTimeIndex

I need to have a DateTimeIndex for my dataframe. The problem is my source file. The Date header is Date(dd-mm-yy), but the actual date data has the format dd:mm:yy (24:06:1970) etc. I have lots of source files, so manually changing the header would be tedious and not good programming practice. How would one go about addressing this from within python?
Perhaps creating a copy of the source file, opening it, searching for the date header, changing it and then closing it? I'm new to python, so I'm not exactly sure if this is the best way to go about doing things, and if it is, how do I implement such code?
Currently I have this:
df = pd.read_csv('test.csv',
                 skiprows=4,
                 parse_dates={'stamp': [0, 1]},
                 na_values='NaN',
                 index_col='stamp'
                 )
Where column 0 is the date column in question and column 1 is the time column.
I don't get any error messages, just erroneous data.
Sorry, I should have added a snippet of the csv file in question. I've now provided it below:
some stuff I dont want
some stuff I dont want
some stuff I dont want
some stuff I dont want
Date(dd-mm-yy),Time(hh:mm:ss),Julian_Day
01:07:2013,05:40:41,182.236586,659,1638.400000
01:07:2013,05:44:03,182.238924,659,1638.400000
01:07:2013,05:47:48,182.241528,659,1638.400000
01:07:2013,05:52:21,182.244687,659,1638.400000
I think the main problem is that the header line Date(dd-mm-yy), Time(hh:mm:ss), Julian_Day only appears to specify some of the column names. Pandas cannot infer what to do with the other data.
Try skipping the file's column name line and passing pandas a list of column names and defining your own date_parser:
def my_parser(date, time):
    import datetime
    DATE_FORMAT = '%d:%m:%Y'
    TIME_FORMAT = '%H:%M:%S'
    date = datetime.datetime.strptime(date, DATE_FORMAT)
    time_weird_date = datetime.datetime.strptime(time, TIME_FORMAT)
    return datetime.datetime.combine(date, time_weird_date.time())
import pandas as pd
from cStringIO import StringIO
data = """\
some stuff I dont want
some stuff I dont want
some stuff I dont want
some stuff I dont want
Date(dd-mm-yy),Time(hh:mm:ss),Julian_Day
01:07:2013,05:40:41,182.236586,659,1638.400000
01:07:2013,05:44:03,182.238924,659,1638.400000
01:07:2013,05:47:48,182.241528,659,1638.400000
01:07:2013,05:52:21,182.244687,659,1638.400000
"""
pd.read_csv(StringIO(data), skiprows=5, index_col=0,
            parse_dates={'datetime': ['date', 'time']},
            names=['date', 'time', 'Julian_Day', 'col_2', 'col_3'],
            date_parser=my_parser)
This should give you what you want.
As you said you are new to python, I should add that the from cStringIO import StringIO, data = """..., and StringIO(data) parts are just there so I could include the data directly in this answer in a runnable form. You just need pd.read_csv(my_data_filename, ...) in your own code.
Your dates are really weird, you should just fix them all. If you really can't fix them on disk for some reason, I guess you can do it inline:
import re
import pandas as pd
from StringIO import StringIO

s = open('test.csv').read()

def rep(m):
    return '%s-%s-%sT' % (m.group('YY'), m.group('mm'), m.group('dd'))

s = re.sub(r'^(?P<dd>\d\d):(?P<mm>\d\d):(?P<YY>\d{4}),', rep, s, flags=re.M)
df = pd.read_csv(StringIO(s), skiprows=5, index_col=0,
                 names=['time', 'Julian_Day', 'col_2', 'col_3'])
This just takes the weird dates like 01:07:2013,05:40:41 and formats them ISO style like 2013-07-01T05:40:41. Then pandas can treat them normally. Bear in mind that these are going to be in UTC.
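A related sketch, if it helps: read the date and time columns as plain strings and build the index afterwards with an explicit format, which keeps the odd dd:mm:yyyy dates out of the parser entirely. The file name and skiprows follow the snippet above.
import pandas as pd

# Read the five columns with our own names, then combine date and time into
# a DatetimeIndex using the format seen in the file.
df = pd.read_csv('test.csv', skiprows=5, header=None,
                 names=['date', 'time', 'Julian_Day', 'col_2', 'col_3'])
df.index = pd.to_datetime(df['date'] + ' ' + df['time'],
                          format='%d:%m:%Y %H:%M:%S')
df = df.drop(columns=['date', 'time'])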
