How to write data to an existing Excel file using pandas? - python

I want to request some data from the Python module tushare.
With the code below, I get one line of data per call.
However, I want to send the server a request roughly every 5 seconds
and put all the data from a 4-hour window into one Excel file.
I notice that pandas is already built into tushare.
How do I put the data together and generate only one Excel file?
import tushare as ts
df=ts.get_realtime_quotes('000875')
df.to_excel(r'C:\Users\stockfile\000875.xlsx')

You can do it with, for example:
df = df.append(ts.get_realtime_quotes('000875'))
Given the number of calls (one request every 5 seconds for 4 hours is 2880 calls), it may be better to create a data frame up front and fill it with data rows as they arrive. Something like this:
import pandas as pd
from time import sleep

# Here, just getting column names:
columns = ts.get_realtime_quotes('000875').columns
# Choose the right number of calls, N (e.g. N = 2880 for the 4-hour window):
df = pd.DataFrame(index=range(N), columns=columns)
for i in range(N):
    df.iloc[i] = ts.get_realtime_quotes('000875').iloc[0]
    sleep(5)
Another way to do it (possibly simpler, and without preallocating the empty data frame) would be to store the answers from tushare in a list and then apply pd.concat:
list_of_dfs = []
for _ in range(N):
    list_of_dfs.append(ts.get_realtime_quotes('000875'))
    sleep(5)
full_df = pd.concat(list_of_dfs)
This way you don't need to know the number of requests in advance (for example, if you decide to write the loop without an explicit number of repetitions).
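Either way, the final step from the question stays the same: write the combined frame to a single Excel file once, after the loop finishes, for example:
full_df.to_excel(r'C:\Users\stockfile\000875.xlsx')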

Related

Is there a way to view my data frame in pandas without reading in the file every time?

Here is my code:
import pandas as pd
df = pd.read_parquet("file.parqet", engine='pyarrow')
df_set_index = df.set_index('column1')
row_count = df.shape[0]
column_count = df.shape[1]
print(df_set_index)
print(row_count)
print(column_count)
Can I run this without reading in the parquet file each time I want to do a row count, column count, etc.? It takes a while to read in the file because it's large, and I have already read it in once, but I'm not sure how to avoid doing so again.
pd.read_parquet reads the whole file from disk into memory, which is naturally slow with a lot of data. So you could engineer a solution like:
1.) column_count
import pyarrow.parquet as pq
column_count = len(pq.read_schema("file.parqet").names)
-> pd.read_parquet has no nrows argument, so read just the parquet schema with pyarrow instead; this gives you the column names, and therefore the number of columns, without reading in any data rows (note that a frame's saved index may appear as an extra column).
2.) row_count
cols_want = ['column1']  # put whatever column names you want here
row_count = pd.read_parquet("file.parqet", engine='pyarrow', columns=cols_want).shape[0]
-> This gives you the number of rows while only reading in "column1", without having to read in all the other columns (which is the reason your solution takes a while).
-> .shape returns a tuple with values (# rows, # columns), so grab the first item for the number of rows, as demonstrated above.
3.) It's not clear what you want to achieve with df.set_index(...); it returns a new DataFrame, so if you're just trying to see what is in the column, use #2 above and remove the .shape[0] call.
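If the counts are all you ever need, a parquet file already stores them in its footer metadata, so both are available without reading any data at all. A minimal sketch with pyarrow (same index caveat as above applies to the column count):
import pyarrow.parquet as pq

# row and column counts live in the parquet footer metadata,
# so this reads no data pages at all
meta = pq.ParquetFile("file.parqet").metadata
row_count = meta.num_rows
column_count = meta.num_columns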

Python (pandas): filtering a large dataframe and writing multiple CSV files

I have the following data frame, and I'm constructing a Python function (to use from LabVIEW) that basically only does two things: data pairing and data cleaning.
The data frame is like this (shown as a screenshot in the original post): a 'Date' column, a 'Pressure' column, and several temperature columns.
I need pandas to pick out each column (except 'Date') individually and pair it with 'Date' (the customized index) before writing each pair to its own CSV file. I also need to make sure the Pressure column does not contain any 0 values, and for each temperature column, values equal to 0 or bigger than 150 must be filtered out.
The following is my Python function; parameters x1 and x2 will be fed in from LabVIEW to specify a user-selected date range.
def data_slice(x1, x2):
    import pandas as pd
    df = pd.read_csv('exp_log.csv')
    df.set_index('Date', inplace=True)
    df_p = df.loc[x1:x2, 'Pressure']
    filt = (df_p['Pressure'] == 0)
    df_p = df_p.loc[~filt]
    df_p.to_csv('modified_pressure.csv', index=True)
    all_cols = list(df.columns)
    temp_cols = all_cols[1:]
    for i in temp_cols:
        df_i = df.loc[x1:x2, 'i']
        filt = (df_i > 150) | (df_i == 0)
        df_i = df_i.loc[~filt]
    df_i.to_csv(f'modified_temp{i}.csv', index=True)
My question is: will this piece of Python code actually work properly, i.e., write out the individual CSV files efficiently, given that the actual exp_log.csv is a very large file containing data logged over several days?
Your command df_i.to_csv(f'modified_temp{i}.csv', index=True) will work, except that this line sits outside your for-loop: it's missing indentation, so only the last column gets written.
Besides that, I would recommend separating responsibilities: split this function into multiple functions, each with its own purpose, like importing data, saving data, manipulating data, etc. Try to keep one level of abstraction per function.
Lastly, don't import libraries inside the function.
Not exactly. Your last loop will not work: df.loc[x1:x2, 'i'] looks up a literal column named 'i' rather than the loop variable, so df_i will not be evaluated the way you want. The first part, up to the first .to_csv(), should work fine.
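For reference, a sketch of how the loop could be rewritten so that each temperature column is actually selected and written out (assuming the same exp_log.csv layout as in the question):
for i in temp_cols:
    # use the loop variable itself, not the string 'i'
    df_i = df.loc[x1:x2, i]
    filt = (df_i > 150) | (df_i == 0)
    df_i = df_i.loc[~filt]
    # keep the write inside the loop so every column gets its own file
    df_i.to_csv(f'modified_temp{i}.csv', index=True)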

Adding whole lines of a dataframe via a for loop

I had code as follows to collect interesting rows into a new dataframe:
df = df1.iloc[[66,113,231,51,152,122,185,179,114,169,97][:]]
but I want to use a for loop to collect the data. I have read that I should combine the data in a list and then create the dataframe, but all the examples I have seen are for numbers, and I can't do the same for whole lines of a dataframe. At the moment I have the following:
data = ['A','B','C','D','E']
for n in range(10):
    data.append(dict(zip(df1.iloc[n, 4])))
df = pd.Dataframe(data)
(P.S. I have 4 in the code because I want the data selected via column E; the dataframe is already sorted, so I am just looking for the first 10 rows.)
Thanks in advance for your help.
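For what it's worth, a minimal sketch of the list-then-DataFrame pattern described above, assuming df1 is already sorted the way you want:
import pandas as pd

rows = []
for n in range(10):
    # .iloc[n] returns the whole row as a Series; .to_dict() keeps the column labels
    rows.append(df1.iloc[n].to_dict())
df = pd.DataFrame(rows)
If it is really just the first 10 rows you need, df1.head(10) or df1.iloc[:10] gives the same frame without any loop.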

Adding to pandas dataframe line by line

I'm making a dataframe and I need to add to it line by line. I created the df with
df = pd.DataFrame(columns=('date', 'daily_high', 'daily_low'))
then I'm reading data from an API, so I run
for api in api_list:
    with urllib.request.urlopen(api) as url:
        data = json.loads(url.read().decode())
and I need to put different attributes from data into the dataframe.
I tried putting
df = df.append({'date': datetime.fromtimestamp(data["currently"]["time"]).strftime("20%y%m%d"),
                'daily_high': data["daily"]["data"][0]["temperatureHigh"],
                'daily_low': data["daily"]["data"][0]["temperatureLow"]},
               ignore_index=True)
in the for loop, but it was taking a long time and I'm not sure if this is good practice. Is there a better way to do this? Maybe I could create three separate series and join them together?
pandas.DataFrame.append is inefficient for iterative approaches.
From the documentation:
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
As mentioned, concatenating the results will be more efficient, but in your case using pandas.DataFrame.from_dict is even more convenient.
Also, I would use the requests library for making the HTTP requests.
import requests
import pandas as pd
from datetime import datetime

d = {'date': [], 'daily_high': [], 'daily_low': []}
for api_url in api_list:
    data = requests.get(api_url).json()
    d['date'].append(datetime.fromtimestamp(data["currently"]["time"]).strftime("20%y%m%d"))
    d['daily_high'].append(data["daily"]["data"][0]["temperatureHigh"])
    d['daily_low'].append(data["daily"]["data"][0]["temperatureLow"])
df = pd.DataFrame.from_dict(d)

Add calculated columns to each DataFrame inside a Panel without for-loop

I have ~300 .csv files all with the same number of rows and columns for instrumentation data. Since each .csv file represents a day and the structure is the same, I figured it would be best to pull each .csv into a Pandas DataFrame and then throw them into a Panel object to perform faster calculations.
I would like to add additional calculated columns to each DataFrame inside the Panel, preferably without a for-loop. I'm attempting to use the apply function on the Panel and to name the new columns after the original column names with a 'p' appended (for easier indexing later). Below is the code I am currently using.
import pandas as pd
import numpy as np
import os.path

dir = "data/testsetup1/"
filelist = []

def initializeDataFrames():
    for f in os.listdir(dir):
        if ".csv" in f:
            filelist.append(dir + f)
    dd = {}
    for f in filelist:
        dd[f[len(dir):(len(f)-4)]] = pd.read_csv(f)
    return pd.Panel(dd)

def newCalculation(pointSeries):
    # test function, more complex functions to follow
    pointSeriesManipulated = pointSeries.copy()
    percentageMove = 1.0 / float(len(pointSeriesManipulated))
    return pointSeriesManipulated * percentageMove

myPanel = initializeDataFrames()
#calculatedPanel = myPanel.join(lambda x: myPanel[x,:,0:17].apply(lambda y:newCalculation(myPanel[x,:,0:17].ix[y])), rsuffix='p')
calculatedPanel = myPanel.ix[:,:,0:17].join(myPanel.ix[:,:,0:17].apply(lambda x: newCalculation(x), axis=2), rsuffix='p')
print calculatedPanel.values
The code above currently duplicates each DataFrame with the calculated columns instead of appending them to each DataFrame. The apply function I'm using operates on a Series object, which in this case would be a passed column. My question is how can I use the apply function on a Panel such that it calculates new columns and appends them to each DataFrame?
Thanks in advance.
If you want to add a new column via apply, simply assign the output of the apply operation to the column you desire:
myPanel['new_column_suffix_p'] = myPanel.apply(newCalculation)
If you want multiple columns, you can make a custom function for this:
def calc_new_columns(rowset):
    rowset['newcolumn1'] = calculation1(rowset.columnofinterest)
    rowset['newcolumn2'] = calculation2(rowset.columnofinterest2 + rowset.column3)
    return rowset

myPanel = myPanel.apply(calc_new_columns)
On a broader note: you are manually handling sections of your data frame when it looks like you could do the new-column operation all at once. I would suggest importing the first csv file into a data frame, then looping through the remaining 299 csvs and using DataFrame.append to add them to the original data frame. Then you would have one data frame for all the data that simply needs the calculated columns added (a sketch follows the nit below).
Nit: "dir" is a built-in function; you shouldn't use it as a variable name.
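A sketch of that single-frame approach, collecting the frames in a list and concatenating once rather than calling append repeatedly (the directory name comes from the question; data_dir avoids shadowing the built-in dir):
import os
import pandas as pd

data_dir = "data/testsetup1/"
frames = [pd.read_csv(os.path.join(data_dir, f))
          for f in os.listdir(data_dir) if f.endswith(".csv")]
# one concat for all ~300 daily files
big_df = pd.concat(frames, ignore_index=True)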
Try using a double transpose:
p = pd.Panel(np.random.rand(4,10,17),
             items=pd.date_range('2013/11/10', periods=4),
             major_axis=range(10),
             minor_axis=map(lambda x: "col%d" % x, range(17)))
pT = p.transpose(2,1,0)
pT = pT.join(pT.apply(newCalculation, axis='major'), rsuffix='p')
p = pT.transpose(2,1,0)
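(pd.Panel has since been removed from pandas. Purely as a hedged sketch of a modern equivalent, not part of the original answer, the same per-day column calculation can be written against a MultiIndex DataFrame; the names and stand-in calculation below are illustrative:)
import numpy as np
import pandas as pd

# stack per-day frames with the day as the outer index level
days = pd.date_range('2013-11-10', periods=4)
frames = {day: pd.DataFrame(np.random.rand(10, 17),
                            columns=["col%d" % i for i in range(17)])
          for day in days}
stacked = pd.concat(frames, names=['day', 'row'])

# per-day version of newCalculation: scale each column by 1/len within each day
calculated = stacked.groupby(level='day').transform(lambda s: s / len(s))
stacked = stacked.join(calculated, rsuffix='p')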
