Turning a Pandas column into a text file separated by line breaks - python

I would like to create a txt file where every line is a so-called "ticker symbol" (= the symbol for a stock). As a first step, I downloaded all the tickers I want via a Wikipedia API:
import pandas as pd
import wikipedia as wp
html1 = wp.page("List of S&P 500 companies").html().encode("UTF-8")
df = pd.read_html(html1,header =0)[0]
df = df.drop(['SEC filings','CIK', 'Headquarters Location', 'Date first added', 'Founded'], axis = 1)
df.columns = df.columns.str.replace('Symbol', 'Ticker')
Secondly, I would like to create a txt file, as mentioned above, with all the ticker names from the "Ticker" column of df. To do so, I probably have to do something similar to:
f = open("tickertest.txt","w+")
f.write("MMM\nABT\n...etc.")
f.close()
Now my problem: does anybody know how to bring the Ticker column of df into one big string where every ticker is separated by a \n, i.e. every ticker is on a new line?

You can use to_csv for this.
df.to_csv("test.txt", columns=["Ticker"], header=False, index=False)
This provides flexibility to include other columns, column names, and index values at some future point (should you need to do some sleuthing, or in case your boss asks for more information). You can even change the separator. This would be a simple modification (obvious changes, e.g.):
df.to_csv("test.txt", columns=["Ticker", "Symbol",], header=True, index=True, sep="\t")
I think the benefit of this method over jfaccioni's answer is flexibility and ease of adaptability. This also gets you away from explicitly opening a file. However, if you still want to explicitly open a file, you should consider using "with", which will automatically close the buffer when you break out of the current indentation, e.g.:
with open("test.txt", "w") as fid:
fid.write("MMM\nABT\n...etc.")

This should do the trick:
'\n'.join(df['Ticker'].astype(str).values)
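To get that string into the txt file, a minimal sketch (reusing the df from the question) could be:
# build one newline-separated string from the Ticker column
tickers = '\n'.join(df['Ticker'].astype(str))
# "with" closes the file automatically
with open("tickertest.txt", "w") as f:
    f.write(tickers + '\n')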

Related

How to preserve complicated excel header formats when manipulating data using Pandas Python?

I am parsing a large Excel data file to another one; however, the headers are very abnormal. I tried to use "read_excel skiprows" and that did not work. I also tried to include the header in
df = pd.read_excel(user_input, header=[1, 2], sheet_name='PN Projection'), but then I get this error: "ValueError: cannot join with no overlapping index names." To get around this I tried to name the columns by location, and that did not work either.
When I run the code shown below everything works fine, but past column "U" the header titles come out as "unnamed1, 2, ...". I understand this is because pandas is treating the first row as the header (which is empty there), but how do I fix this? Is there a way to preserve the headers without manually typing in the format for each cell? Any and all help is appreciated, thank you!
[image: a small section of the Excel file header]
The code I am trying to run:
#!/usr/bin/env python
import sys
import os
import pandas as pd
#load source excel file
user_input = input("Enter the path of your source excel file (omit 'C:'): ")
#reads the source excel file
df = pd.read_excel(user_input, sheet_name = 'PN Projection')
#Filtering dataframe
#Filters out rows with 'EOL' in column 'Item Status' and with 'XCVR' in 'Description'
df = df[~(df['Item Status'] == 'EOL')]
df = df[~(df['Description'].str.contains("XCVR", na=False))]
#Filters in rows with "XC" or "spartan" in 'description' column
df = df[(df['Description'].str.contains("XC", na=False) | df['Description'].str.contains("Spartan", na=False))]
print(df)
#Saving to a new spreadsheet called Filtered Data
df.to_excel('filtered_data.xlsx', sheet_name='filtered_data')
If you do not need the top 2 rows, then:
df = pd.read_excel(user_input, sheet_name='PN Projection', skiprows=range(0, 2))
This has worked for me when handling several strangely formatted files. Let me know if this isn't what you're looking for, or if there are additional issues.
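If those top rows are actually a multi-row header you want to keep, a hedged sketch (assuming the real header spans the second and third rows of the sheet) is to read them as a MultiIndex and flatten the names:
import pandas as pd
# read rows 1 and 2 (0-based) as a two-level header
df = pd.read_excel(user_input, sheet_name='PN Projection', header=[1, 2])
# flatten the MultiIndex into single strings, dropping 'Unnamed' fillers
df.columns = [
    ' '.join(str(part) for part in col if not str(part).startswith('Unnamed')).strip()
    for col in df.columns
]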

Pandas doesn't separate string in csv file to columns correctly

I am new to data analysis in Python and have run into a problem while working on a project. Some of the values in the csv file have the delimiter inside double quotes, so pandas can't separate them correctly:
import csv
import pandas as pd
top = pd.read_csv(r"C:\Users\User\Desktop\data analytics\Project\Analysis-Spotify-Top-2000\Spotify-2000.csv", delimiter=",",
                  encoding="UTF-8", doublequote=True, engine="python", quotechar='"', quoting=csv.QUOTE_ALL)
I found which records cause the problem:
My teacher advised me to create a new dataframe with these values and the same columns; the records that have a delimiter inside double quotes should be deleted, and then that df should be merged into the original.
But honestly, I don't know how to do it properly (I made some weird attempts - screen2):
is_title_null = pd.isnull(top["Title"])
missing_list = top[is_title_null]["Index"].tolist()
list_of_missing_list = []
for i in missing_list:
    l = i.split(', ')
    list_of_missing_list.append(l)
list_of_missing_list
missing_df = pd.DataFrame(np.empty((0, 15)))
missing_df.columns = ["Index", "Title", "Artist", "Top Genre", "Year",
                      "Beats Per Minute (BPM)", "Energy", "Danceability",
                      "Loudness (dB)", "Liveness", "Valence",
                      "Length (Duration)", "Acousticness", "Speechiness",
                      "Popularity"]
missing_df = missing_df.append(list_of_missing_list, ignore_index=True)
Here is my project link in GitHub (here you can see the problem): https://github.com/Sabina-Karenkina/Analysis-Spotify-Top-2000
Ok. This is not a really elegant way to do things, but as I mentioned in my previous comment, you will not fix the problem by first creating the dataframe, because the file is corrupt to begin with. I managed to find a way to solve it easily.
Open your Spotify-2000 file with Excel and use Text to Columns. When asked which delimiter, choose , (comma). Save your file as a new csv file (Spotify2.csv), but make sure to use ; as the delimiter (this is because you might have titles that include commas).
Now, use pandas to read this new file:
top = pd.read_csv(r"C:/Users/k_sego/spotify2.csv",delimiter = ";",
encoding = "iso-8859-1", doublequote=True, engine="python")
top.head(100)
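If you would rather stay in pandas, a hedged alternative (assuming pandas 1.3 or newer, where the on_bad_lines parameter of read_csv exists) is to flag or skip the malformed rows instead of repairing them in Excel:
import pandas as pd
# 'warn' reports the offending line numbers; 'skip' silently drops them
top = pd.read_csv("Spotify-2000.csv", delimiter=",", encoding="UTF-8",
                  on_bad_lines="warn")
Note that this discards the bad records rather than fixing them, so it only helps if you can afford to lose those rows.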

Changing date values in a .csv using a for loop

I have a csv file that I need to change the date value in each row. The date to be changed appears in the exact same column in each row of the csv.
import csv
firstfile = open('example.csv', "r")
firstReader = csv.reader(firstfile, delimiter='|')
firstData = list(firstReader)
DateToChange = firstData[1][25]
ChangedDate = '2018-09-30'
for row in firstReader:
    for column in row:
        print(column)
        if column == DateToChange:
            #Change the date
            outputFile = open("output.csv", "w")
            outputFile.writelines(firstfile)
            outputFile.close()
I am trying to grab and store a date that is already in the csv, change it using a for loop, and then output the original file with the changed dates. However, the code above doesn't seem to do anything at all. I am new to Python, so I might not be understanding how to use a for loop correctly.
Any help at all is greatly appreciated!
When you call list(firstReader), you read all of the CSV data into the firstData list. When you later call for row in firstReader:, firstReader is already exhausted, so nothing will be looped. Instead, change it to for row in firstData:.
Also, when you write to the file, you are writing firstfile into it rather than the altered row. I'll leave you to figure out how to update the date in the row, but after that you'll need to give the file a string to write. That string should be ', '.join(row), so outputFile.write(', '.join(row)).
Finally, you should open your output file once, not each time in the loop. Move the open call to above your loop, and the close call to after your loop. Then when you have a moment, search google for 'python context manager open file' for a better way to manage the open file.
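Putting those suggestions together, a minimal sketch (the match-and-replace logic on column 25 is an assumption based on the question's description) might look like:
import csv

ChangedDate = '2018-09-30'

with open('example.csv', 'r', newline='') as firstfile:
    firstData = list(csv.reader(firstfile, delimiter='|'))

DateToChange = firstData[1][25]  # the date already in the csv, as in the question

with open('output.csv', 'w', newline='') as outputFile:
    writer = csv.writer(outputFile, delimiter='|')
    for row in firstData:
        if len(row) > 25 and row[25] == DateToChange:
            row[25] = ChangedDate  # replace the date in column 25
        writer.writerow(row)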
You could use pandas and numpy. Here I create a dataframe from scratch, but you could load it directly from a .csv:
import pandas as pd
import numpy as np
date_df = pd.DataFrame(
{'col1' : ['12', '14', '14', '3412', '2'],
'col2' : ['2018-09-30', '2018-09-14', '2018-09-01', '2018-09-30', '2018-12-01']
})
date_to_change = '2018-09-30'
replacement_date = '2018-10-01'
date_df['col2'] = np.where(date_df['col2'] == date_to_change, replacement_date, date_df['col2'])
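The same idea applied to a file on disk might look like this (the file names and delimiter are placeholders, and the column name is borrowed from the scratch example above):
import pandas as pd
import numpy as np

date_df = pd.read_csv('example.csv', delimiter='|')  # hypothetical input file
# swap the old date for the new one wherever it appears in col2
date_df['col2'] = np.where(date_df['col2'] == '2018-09-30', '2018-10-01', date_df['col2'])
date_df.to_csv('output.csv', sep='|', index=False)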

Removing rows whose index match certain values

My code:
import numpy as np
import pandas as pd
import time
tic = time.time()
I read a long file with the headers [meter] [daycode] [meter reading in kWh] - a time series of over 6,000 meters.
consum = pd.read_csv("data/File1.txt", delim_whitespace=True, encoding = "utf-8", names =['meter', 'daycode', 'val'], engine='python')
consum.set_index('meter', inplace=True)
Because I in fact have 6 files of this humongous size in total, I want to filter out the meters with insufficient information: the time series whose [meter] values fall under code 3 by category. I can collect this category information from another file. The following is where I extract it.
id_total = pd.read_csv("data/meter_id_code.csv", header = 0, encoding="cp1252")
#print(len(id_total.index))
id_total.set_index('Code', inplace=True)
id_other = id_total.loc[3].copy()
print id_other
And this is where I write to csv to check whether the last line is correctly performed:
id_other.to_csv('data/id_other.csv', sep='\t', encoding='utf-8')
print consum[~consum.index.isin(id_other)]
Output (of print id_other): [screenshot]
Problem:
I get the following warning. The linked discussion says it shouldn't stop the code from working, but mine is affected. I checked the correct directory (earlier I had confused my remote connection to the GPU server with my local hardware) and the csv file was created. It turns out the meter IDs in the file are not filtered.
How can I fix the last line?
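A hedged sketch of a possible fix: Index.isin expects a list-like of values, and if id_other is a DataFrame, iterating it yields its column labels rather than the meter IDs, so the membership test never matches. Selecting the ID column explicitly (the column name 'ID' here is hypothetical) may behave as intended:
# pass the actual meter IDs, not the whole DataFrame
print consum[~consum.index.isin(id_other['ID'])]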

Read specific columns with pandas or other python module

I have a csv file from this webpage.
I want to read some of the columns in the downloaded file (the csv version can be downloaded in the upper right corner).
Let's say I want 2 columns:
59, which in the header is star_name
60, which in the header is ra.
However, for some reason the authors of the webpage sometimes decide to move the columns around.
In the end I want something like this, keeping in mind that values can be missing.
data = #read data in a clever way
names = data['star_name']
ras = data['ra']
This will prevent my program from malfunctioning when the columns are changed again in the future, as long as the column names stay correct.
Until now I have tried various approaches using the csv module and, recently, the pandas module, both without any luck.
EDIT (added two lines + the header of my datafile. Sorry, but it's extremely long.)
# name, mass, mass_error_min, mass_error_max, radius, radius_error_min, radius_error_max, orbital_period, orbital_period_err_min, orbital_period_err_max, semi_major_axis, semi_major_axis_error_min, semi_major_axis_error_max, eccentricity, eccentricity_error_min, eccentricity_error_max, angular_distance, inclination, inclination_error_min, inclination_error_max, tzero_tr, tzero_tr_error_min, tzero_tr_error_max, tzero_tr_sec, tzero_tr_sec_error_min, tzero_tr_sec_error_max, lambda_angle, lambda_angle_error_min, lambda_angle_error_max, impact_parameter, impact_parameter_error_min, impact_parameter_error_max, tzero_vr, tzero_vr_error_min, tzero_vr_error_max, K, K_error_min, K_error_max, temp_calculated, temp_measured, hot_point_lon, albedo, albedo_error_min, albedo_error_max, log_g, publication_status, discovered, updated, omega, omega_error_min, omega_error_max, tperi, tperi_error_min, tperi_error_max, detection_type, mass_detection_type, radius_detection_type, alternate_names, molecules, star_name, ra, dec, mag_v, mag_i, mag_j, mag_h, mag_k, star_distance, star_metallicity, star_mass, star_radius, star_sp_type, star_age, star_teff, star_detected_disc, star_magnetic_field
11 Com b,19.4,1.5,1.5,,,,326.03,0.32,0.32,1.29,0.05,0.05,0.231,0.005,0.005,0.011664,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2008,2011-12-23,94.8,1.5,1.5,2452899.6,1.6,1.6,Radial Velocity,,,,,11 Com,185.1791667,17.7927778,4.74,,,,,110.6,-0.35,2.7,19.0,G8 III,,4742.0,,
11 UMi b,10.5,2.47,2.47,,,,516.22,3.25,3.25,1.54,0.07,0.07,0.08,0.03,0.03,0.012887,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2009,2009-08-13,117.63,21.06,21.06,2452861.05,2.06,2.06,Radial Velocity,,,,,11 UMi,229.275,71.8238889,5.02,,,,,119.5,0.04,1.8,24.08,K4III,1.56,4340.0,,
An easy way to do this is with the pandas library, like this:
import pandas as pd
fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print df.keys()
# See content in 'star_name'
print df.star_name
The problem here was skipinitialspace, which removes the spaces after the delimiter in the header, so ' star_name' becomes 'star_name'.
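A hedged variant for headers with inconsistent spacing: usecols also accepts a callable, which pandas evaluates against each column name as read, so the names can be stripped before matching. This also stays robust when the columns move around, which was the original concern.
import pandas as pd
fields = ['star_name', 'ra']
# skipinitialspace cleans the final names; the strip() guards the match itself
df = pd.read_csv('data.csv', skipinitialspace=True,
                 usecols=lambda name: name.strip() in fields)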
According to the latest pandas documentation you can read a csv file selecting only the columns which you want to read.
import pandas as pd
df = pd.read_csv('some_data.csv', usecols = ['col1','col2'], low_memory = True)
Here we use usecols, which reads only the selected columns into the dataframe.
We use low_memory so that the file is processed internally in chunks.
The above answers are for Python 2, so for Python 3 users I am giving this answer. You can use the code below:
import pandas as pd
fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print(df.keys())
# See content in 'star_name'
print(df.star_name)
I got a solution to the above problem in a different way: although I read the entire csv file, I tweak the display part to show only the desired content.
import pandas as pd
df = pd.read_csv('data.csv', skipinitialspace=True)
print df[['star_name', 'ra']]
This can help in some scenarios when learning the basics and filtering data on the basis of dataframe columns.
I think you need to try this method.
import pandas as pd
data_df = pd.read_csv('data.csv')
print(data_df['star_name'])
print(data_df['ra'])
