How to throw away lines of text with specific characters?

How to throw away lines of text with specific characters? - python

I have multiple .log files that look like:
#Software: Microsoft Internet Information Services 10.0
#Version: 1.0
#Date: 2020-04-02 00:09:16
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken
2020-04-02 00:14:16 172.31.11.70 GET /ben_laptop_Apple.html - 443 - 156.154.81.54 curl/7.54.0 - 404 0 2 28
...
2020-04-02 00:19:16 172.31.11.70 GET /ben_laptop_Apple.html - 443 - 123.123.23.23 curl/7.54.0 - 404 0 2 47
I want to parse and concatenate the fields to get a nicely formatted Pandas table. To that end, I have the following working well:
# Match the extension pattern and save the list of file names in the ‘all_filenames’ variable.
extension = 'log'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
# Use pandas to concatenate all files in the list and export as CSV. The output file is named “combined_csv.csv” located in your working directory.
fields = 'date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken'.split(' ')
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f, sep=' ', header=None, skiprows=4, names=fields) for f in all_filenames ])
As you can see, I skip the first 4 rows to remove the header text. However, the problem is that a single log file will have the header text repeated throughout the .log file. So my files actually look like:
#Software: Microsoft Internet Information Services 10.0
#Version: 1.0
#Date: 2020-04-02 00:09:16
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken
2020-04-02 00:14:16 172.31.11.70 GET /ben_laptop_Apple.html - 443 - 156.154.81.54 curl/7.54.0 - 404 0 2 28
...
#Software: Microsoft Internet Information Services 10.0
#Version: 1.0
#Date: 2020-04-02 00:09:16
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken
...
2020-04-02 00:19:16 172.31.11.70 GET /ben_laptop_Apple.html - 443 - 123.123.23.23 curl/7.54.0 - 404 0 2 47
How do I filter out the repeating header text? I'm guessing I need a RegEx solution.

Instead of using the skiprows=4 you should use the comment='#'. This way your pd.read_csv will skip the rows that begins with #:
combined_csv = pd.concat([pd.read_csv(f, sep=' ', header=None, comment='#', names=fields) for f in all_filenames ])

I think this can help you:
df = pd.read_csv('sample_file.csv', comment='#')
From the documentation:
comment : str, default None
Indicates remainder of line should not be parsed. If found at the
beginning of a line, the line will be ignored altogether. This
parameter must be a single character. Like empty lines (as long as
skip_blank_lines=True), fully commented lines are ignored by the
parameter header but not by skiprows. For example, if comment=’#’,
parsing ‘#emptyna,b,cn1,2,3’ with header=0 will result in ‘a,b,c’
being treated as the header.

Related

Save forecast data to csv

Hello I have this code.
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']]
forecast.tail(365)
ds trend yhat_lower yhat_upper trend_lower trend_upper additive_terms additive_terms_lower additive_terms_upper weekly weekly_lower weekly_upper multiplicative_terms multiplicative_terms_lower multiplicative_terms_upper yhat
307 2022-12-30 01:00:00 8744.804921 4151.683644 19973.732090 8744.804921 8744.804921 3425.715807 3425.715807 3425.715807 3425.715807 3425.715807 3425.715807 0.0 0.0 0.0 12170.520728
308 2022-12-30 02:00:00 8743.882714 3948.935733 20003.308794 8743.882714 8743.882714 3691.081394 3691.081394 3691.081394
So I want to download/export this forecast data as csv file.
from google.colab import files
files.download() # ? what I need do here.
Thanks a lot

forecast.to_csv('forecast.csv')

I found the solution.
forecast = m.predict(future)
forecast_data = forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']]
#forecast.tail(365)
print (forecast_data)
forecast_data.to_csv('myCsv.csv')
f.close()

Why is python only printing one data set in the algorithm?

So I am trying to build a trading software and I am using the code from an online YouTuber. I am gathering all of the data for the companies on the S&P 500 in the get_data_from_yahoo() function. So when I run that code it says Already Have (then the given ticker) which is fine, but when I got to print the data for this in the following function, which is compile_data(), it only print one ticker which is ZTS.
Anyone have any ideas?
import bs4 as bs
import datetime as dt
import os
import pandas as pd
from pandas_datareader import data as pdr
import pickle
import requests
import fix_yahoo_finance as yf
def save_sp500_tickers():
resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
soup = bs.BeautifulSoup(resp.text, 'lxml')
table = soup.find('table', {'class': 'wikitable sortable'})
tickers = []
for row in table.findAll('tr')[1:]:
ticker = row.findAll('td')[0].text.replace('.', '-')
ticker = ticker[:-1]
tickers.append(ticker)
with open("sp500tickers.pickle", "wb") as f:
pickle.dump(tickers, f)
print(tickers)
return tickers
save_sp500_tickers()
def get_data_from_yahoo(reload_sp500=False):
if reload_sp500:
tickers = save_sp500_tickers()
else:
with open("sp500tickers.pickle", "rb") as f:
tickers = pickle.load(f)
if not os.path.exists('stock_dfs'):
os.makedirs('stock_dfs')
start = dt.datetime(2019, 6, 8)
end = dt.datetime.now()
for ticker in tickers:
print(ticker)
if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
df = pdr.get_data_yahoo(ticker, start, end)
df.reset_index(inplace=True)
df.set_index("Date", inplace=True)
df.to_csv('stock_dfs/{}.csv'.format(ticker))
else:
print('Already have {}'.format(ticker))
save_sp500_tickers()
get_data_from_yahoo()
def complied_data():
with open("sp500tickers.pickle","rb") as f:
tickers = pickle.load(f)
main_df = pd.DataFrame()
for count, ticker in enumerate(tickers):
df = pd.read_csv('stock_dfs/{}.csv'.format(ticker))
df.set_index('Date', inplace=True)
df.rename(columns = {'Adj Close':ticker}, inplace=True)
df.drop(['Open', 'High', 'Low','Close','Volume'], 1, inplace=True)
if main_df.empty:
main_df = df
else:
main_df = main_df.join(df, how='outer')
if count % 10 == 0:
print(count)
print(main_df.head())
main_df.to_csv('sp500_joined_closes.csv')
complied_data()
When I run this code this is what it says:
MMM
Already have MMM
ABT
Already have ABT
ABBV
Already have ABBV
ABMD
Already have ABMD
ACN
Already have ACN
ATVI
Already have ATVI
ADBE
Already have ADBE
AMD
Already have AMD
AAP
Already have AAP
AES
Already have AES
AMG
Already have AMG
AFL
Already have AFL
A
Already have A
APD
Already have APD
AKAM
Already have AKAM
ALK
Already have ALK
ALB
Already have ALB
It then continues to say that it already has all of the 500 companies(I did not show the hole thing because the list is very long). But when I run the compile_data()
function it only prints the data for one ticker:
ZTS
Date
2019-01-02 83.945038
2019-01-03 81.043526
2019-01-04 84.223267
2019-01-07 84.730026
2019-01-08 85.991997

The problem is in a for loop, specifically the one in complied_data.
The if-else and if blocks should be included in the for loop:
for count, ticker in enumerate(tickers):
df = pd.read_csv('stock_dfs/{}.csv'.format(ticker))
df.set_index('Date', inplace=True)
df.rename(columns = {'Adj Close':ticker}, inplace=True)
df.drop(['Open', 'High', 'Low','Close','Volume'], 1, inplace=True)
if main_df.empty:
main_df = df
else:
main_df = main_df.join(df, how='outer')
if count % 10 == 0:
print(count)
Otherwise they will be evaluated only after it is done looping and elaborate the last element.
The following is the output when changing to the above indentation:
(... omitted counting from 0)
470
480
490
500
MMM ABT ABBV ABMD ... YUM ZBH ZION ZTS
Date ...
2019-06-10 165.332672 80.643486 74.704918 272.429993 ... 107.794380 121.242027 43.187107 109.920105
2019-06-11 165.941788 80.494644 75.889320 262.029999 ... 106.722885 120.016762 43.758469 109.860268
2019-06-12 166.040024 81.318237 76.277657 254.539993 ... 108.082100 120.225945 43.512192 111.136780
2019-06-13 165.882843 81.655624 76.646561 255.529999 ... 108.121788 119.329407 44.063854 109.730621
2019-06-14 163.760803 81.586166 76.394157 250.960007 ... 108.925407 116.998398 44.211620 110.488556
[5 rows x 505 columns]

Text to PDF Positioning Lines

I have a text file that i am reading and writing line by line into a PDF. The lines are out of position on the PDF because the FPDF library is left aligning all my lines. I am using the property set x so i can position each line to my liking. I am trying to reposition the headers until "RATE CODE CY" the would like all the data under the columns to come after. Then another header appears. I would like to align all the headers that come after the data. I know a for loop needs to be done to bring rest of the data...the issue is a header will come again and there is where i have to make the change with set_x property.
pdf = FPDF("L", "mm", "A4")
pdf.add_page()
pdf.set_font('arial', style='', size=10.0)
lines = file.readlines()
header8 = lines[7]
header8_1 = " ".join(lines[8].split()[:4])
header8_2 = " ".join(lines[8].split()[4:])
header9_1 = " ".join(lines[9].split()[:5])
header9_2 = " ".join(lines[9].split()[5:])
pdf.cell(ln=0, h=5.0, align='L', w=0, txt=header8_1, border=0)
pdf.set_x(125)
pdf.cell(ln=1, h=5.0, align='L', w=0, txt=header8_2, border=0)
pdf.cell(ln=0, h=5.0, align='L', w=0, txt=header9_1, border=0)
pdf.set_x(125)
pdfcell(ln=1, h=5.0, align='L', w=0, txt=header9_2, border=0)
Current PDF file:
READ SVC B MAXIMUM TOTAL DUE METER NO REMARKS
ACCOUNT # SERVICE ADDRESS CITY DATE DAY C KWH KWD AMOUNT
RATE CODE CY CUSTOMER NAME MAILING ADDRESS
----------------------------------------------------------------------------------------------------
11211-22222 12345 TEST HWY #86 TITUSVIL 10/12/19 29 C 1,444 189.01 ABC1234
GS-1 3 Home & ASSOC INC 1234 Miami HWY APT49
22222-33333 12345 TEST HWY #88 TITUSVIL 10/04/19 29 C 256 41.50 ABC1235
GS-1 3 DGN & ASSOC INC 1234 Miami HWY APT49
READ SVC B MAXIMUM TOTAL DUE METER NO REMARKS
ACCOUNT # SERVICE ADDRESS CITY DATE DAY C KWH KWD AMOUNT
RATE CODE CY CUSTOMER NAME MAILING ADDRESS
----------------------------------------------------------------------------------------------------
11211-22222 12345 TEST HWY #86 TITUSVIL 10/12/19 29 C 1,444 189.01 ABC1234
GS-1 3 Home & ASSOC INC 1234 Miami HWY APT49
22222-33333 12345 TEST HWY #88 TITUSVIL 10/04/19 29 C 256 41.50 ABC1235
GS-1 3 DGN & ASSOC INC 1234 Miami HWY APT49

Selecting rows based on multiple conditions using Python pandas

Hi I am trying to find a row that satisfies multiple user inputs, I want the result to return a single line that matches the flight date and destination, with origin airport being Atlanta. If they input anything else, it gives back an error and quits.
The input data is a CSV that looks like this:
FL_DATE ORIGIN DEST DEP_TIME
5/1/2017 ATL IAD 1442
5/1/2017 MCO EWR 932
5/1/2017 IAH MIA 1011
5/1/2017 EWR TPA 1646
5/1/2017 RSW EWR 1054
5/1/2017 IAD RDU 2216
5/1/2017 IAD BDL 1755
5/1/2017 EWR RSW 1055
5/1/2017 MCO EWR 744
My current code:
import pandas as pd
df=pd.read_csv("flights.data.csv") #import data frame
input1 = input ('Enter your flight date in MM/DD/YYYY: ') #input flight date
try:
date = str(input1) #flight date is a string
except:
print('Invalid date') #error message if it isn't a string
quit()
input2 = input('Enter your destination airport code: ') #input airport code
try:
destination = str(input2) #destination is a string
except:
print('Invalid destination airport code') #error message if it isn't a string
quit()
df.loc[df['FL_DATE'] == date] & df[df['ORIGIN'] == 'ATL'] & df[df['DEST'] == destination]
#matches flight date, destination, and origin has to equal to GNV
Ideal output is just returning the first row, if I input 5/1/2017 as 'date' and 'IAD' as destination.

You should be able to resolve your issue with below example. The syntax of yours was wrong for multiple conditions
import pandas as pd
df=pd.DataFrame({'FL_DATE':['5/1/2017'],'ORIGIN':['ATL'],'DEST':['IAD'],'DEP_TIME':[1442]})
df.loc[(df['FL_DATE'] == '5/1/2017') & (df['ORIGIN'] == 'ATL') & (df['DEST'] == 'IAD')]
Gives
DEP_TIME DEST FL_DATE ORIGIN
1442 IAD 5/1/2017 ATL
You should change your code to something like this
df.loc[(df['FL_DATE'] == date) & (df['ORIGIN'] == 'ATL') & (df['DEST'] == destination)]

In your loc statement, you need to fix your brackets and add parentheses between conditions:
df.loc[(df['FL_DATE'] == input1) & (df['ORIGIN'] == 'ATL') & (df['DEST'] == input2)]
Then it works:
>>> df.loc[(df['FL_DATE'] == date) & (df['ORIGIN'] == 'ATL') & (df['DEST'] == destination)]
FL_DATE ORIGIN DEST DEP_TIME
0 5/1/2017 ATL IAD 1442

How can get ' USDJPY'(currency rates) with pandas and yahoo finance?

I am learning and using the pandas and python.
Today, I am trying to make a fx rate table,
but I got a trouble with getting the pricess of 'USDJPY'.
When I get a prices of 'EUR/USD', i code like this.
eur = web.DataReader('EURUSD=X','yahoo')['Adj Close']
it works.
But when I wrote
jpy = web.DataReader('USDJPY=X','yahoo')['Adj Close']
the error message comes like this:
--------------------------------------------------------------------------- IOError Traceback (most recent call
last) in ()
----> 1 jpy = web.DataReader('USDJPY=X','yahoo')['Adj Close']
C:\Anaconda\lib\site-packages\pandas\io\data.pyc in DataReader(name,
data_source, start, end, retry_count, pause)
70 return get_data_yahoo(symbols=name, start=start, end=end,
71 adjust_price=False, chunksize=25,
---> 72 retry_count=retry_count, pause=pause)
73 elif data_source == "google":
74 return get_data_google(symbols=name, start=start, end=end,
C:\Anaconda\lib\site-packages\pandas\io\data.pyc in
get_data_yahoo(symbols, start, end, retry_count, pause, adjust_price,
ret_index, chunksize, name)
388 """
389 return _get_data_from(symbols, start, end, retry_count, pause,
--> 390 adjust_price, ret_index, chunksize, 'yahoo', name)
391
392
C:\Anaconda\lib\site-packages\pandas\io\data.pyc in
_get_data_from(symbols, start, end, retry_count, pause, adjust_price, ret_index, chunksize, source, name)
334 # If a single symbol, (e.g., 'GOOG')
335 if isinstance(symbols, (basestring, int)):
--> 336 hist_data = src_fn(symbols, start, end, retry_count, pause)
337 # Or multiple symbols, (e.g., ['GOOG', 'AAPL', 'MSFT'])
338 elif isinstance(symbols, DataFrame):
C:\Anaconda\lib\site-packages\pandas\io\data.pyc in
_get_hist_yahoo(sym, start, end, retry_count, pause)
188 '&g=d' +
189 '&ignore=.csv')
--> 190 return _retry_read_url(url, retry_count, pause, 'Yahoo!')
191
192
C:\Anaconda\lib\site-packages\pandas\io\data.pyc in
_retry_read_url(url, retry_count, pause, name)
167
168 raise IOError("after %d tries, %s did not "
--> 169 "return a 200 for url %r" % (retry_count, name, url))
170
171
IOError: after 3 tries, Yahoo! did not return a 200 for url
'http://ichart.yahoo.com/table.csv?s=USDJPY=X&a=0&b=1&c=2010&d=1&e=1&f=2014&g=d&ignore=.csv'
Other currencies like 'GBPUSD' also have same problem.
Can you solve this problem?
Do you have any idea of getting 'USDJPY' from yahoo or google???

Yahoo Finance doesn't provide historical data on exchange rates (i.e. there's no "Historical Prices" link in the top left of the page like there would be for stocks, indices, etc...)
You can use FRED (Federal Reserve of St. Louis data) to get these exchange rates...
import pandas.io.data as web
jpy = web.DataReader('DEXJPUS', 'fred')
UPDATE: hase moved the pandas-datareader
from pandas_datareader import data
jpy = data.DataReader('DEXJPUS', 'fred')
or the more direct way...
jpy = web.get_data_fred('DEXJPUS')
A list of all of the exchange rate that FRED has daily data for can be found here: http://research.stlouisfed.org/fred2/categories/94

Yahoo Finance doesn't provide historical data on exchange rates
Yes it does but not on cross rates. All vs the USD
List of Yahoo USD Exchange Rates
a = web.DataReader("JPY=X", 'yahoo')

The free and easy way is Yahoo:
# get fx rates
# https://finance.yahoo.com/currencies
# example EUR/USD = EURUSD%3DX?p=EURUSD%3DX
import pandas as pd
import pandas_datareader as dr
# change date range here
start_date = '2021-02-26'
end_date = '2021-03-01'
# retrieve market data of current ticker symbol
print('This is the table with HLOC, Volume, Adj Close prices')
eurusd = dr.data.DataReader('EURUSD%3DX', data_source='yahoo', start=start_date, end=end_date)
print(eurusd)
# just get latest adjusted close for further use
print('This is the Adj Close prices only')
print(eurusd['Adj Close'])
and it also works with other crosses, contrary to the above statements:
# EURCHF%3DX
eurchf = dr.data.DataReader('EURCHF%3DX', data_source='yahoo', start=start_date, end=end_date)
print(eurchf)

Get the historical exchange rates from OANDA
http://pandas-datareader.readthedocs.io/en/latest/remote_data.html
In [1]: from pandas_datareader.oanda import get_oanda_currency_historical_rates
In [2]: start, end = "2016-01-01", "2016-06-01"
In [3]: quote_currency = "USD"
In [4]: base_currency = ["EUR", "GBP", "JPY"]
In [5]: df_rates = get_oanda_currency_historical_rates(
start, end,
quote_currency=quote_currency,
base_currency=base_currency
)
In [6]: print(df_rates)
Update: Oanda started charging for this lately
https://www.oanda.com/fx-for-business/exchange-rates-api

#!pip install yfinance
#!pip install mplfinance
from datetime import datetime
import yfinance as yf
import mplfinance as mpf
#import pandas as pd
#import pandas_datareader as dr
# change date range here
start_date = '2021-02-26'
end_date = '2021-03-01'
#This Does NOT WORK#
# retrieve market data of current ticker symbol
print('This is the table with HLOC, Volume, Adj Close prices')
eurusd = dr.data.DataReader('EURUSD%3DX', data_source='yahoo',
start=start_date, end=end_date)
print(eurusd)
#This Does#
data = yf.download('USDCAD=X', start=start_date, end=end_date)
#If someone can figure out how to get the S5,S30, M1, M3 etc. Please share

I think you can use custom intervals by passing it as an argument to the yf.download() function. For example:
data = yf.download('USDCAD=X', start=start_date, end=end_date, interval='1m')

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to throw away lines of text with specific characters? - python

Instead of using the skiprows=4 you should use the comment='#'. This way your pd.read_csv will skip the rows that begins with #: combined_csv = pd.concat([pd.read_csv(f, sep=' ', header=None, comment='#', names=fields) for f in all_filenames ])

Related

Save forecast data to csv

Why is python only printing one data set in the algorithm?

Text to PDF Positioning Lines

Selecting rows based on multiple conditions using Python pandas

How can get ' USDJPY'(currency rates) with pandas and yahoo finance?

Categories

Resources