Refresh a pandas DataFrame while printing in a loop - python

How can I print a new DataFrame and clear the previously printed one while using a loop, so the output shows only the latest DataFrame instead of all of them? Using print(df, end="\r") doesn't work.
import pandas as pd
import numpy as np

while True:
    df = pd.DataFrame(np.random.rand(10, 10))
    print(df)
If I get live data from an API to insert into the df, I'll use the while loop to constantly update the data. But how can I print only the newest DataFrame instead of printing all the DataFrames underneath each other in the output?
The snippet below does work, but I think there should be a more elegant solution.
import sys

import pandas as pd
import numpy as np

Height_DF = 10
Width_DF = 10

while True:
    df = pd.DataFrame(np.random.rand(10, 10))
    print(df)
    for i in range(Height_DF + 1):
        sys.stdout.write("\033[F")  # move the cursor up one line

Try this:
import pandas as pd
import numpy as np
import time
import sys

while True:
    df = pd.DataFrame(np.random.rand(10, 10))
    print(df)
    sys.stdout.write("\033[F")
    time.sleep(1)
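A slightly more general sketch of the same idea, assuming a terminal that honours ANSI escape codes: move the cursor up one line per printed row (len(df) data rows plus one header row), so no height needs to be hard-coded and the next frame overwrites the last one.

```python
import sys
import time

import numpy as np
import pandas as pd

def print_in_place(df):
    # print the frame, then move the cursor back up over it:
    # len(df) data rows + 1 header row
    print(df)
    sys.stdout.write("\033[F" * (len(df) + 1))

for _ in range(5):  # finite demo; use `while True:` for live data
    df = pd.DataFrame(np.random.rand(10, 10))
    print_in_place(df)
    time.sleep(0.2)
```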

Related

python: run a script every 2 mins, but it failed

import pandas as pd
import numpy as np
import datetime
import schedule
import time
ticks = api.ticks(api.Contracts.Stocks["2330"], "2022-08-09")
df = pd.DataFrame({**ticks})
df.ts = pd.to_datetime(df.ts)
df = df[df.volume>200]
df
The code above works fine; I get data.
The code below does not work. I get nothing: it just keeps running but no data comes in.
My goal is to run the code (receive data) every 2 minutes automatically.
I couldn't figure out where it goes wrong.
I would appreciate some help; I've tried many times and spent a lot of time on this.
import pandas as pd
import numpy as np
import datetime
import schedule
import time

def show_datafr():
    ticks = api.ticks(api.Contracts.Stocks["2330"], "2022-08-09")
    df = pd.DataFrame({**ticks})
    df.ts = pd.to_datetime(df.ts)
    df = df[df.volume > 200]
    df

schedule.every(4).seconds.do(show_datafr)

while 1:
    schedule.run_pending()
    time.sleep(1)
To display the df you can import display from IPython.display.
You might want to install it with pip install ipython in case you don't have it installed.
import pandas as pd
import numpy as np
import datetime
import schedule
import time
from IPython.display import display  # additional import

def show_datafr():
    ticks = api.ticks(api.Contracts.Stocks["2330"], "2022-08-09")
    df = pd.DataFrame({**ticks})
    df.ts = pd.to_datetime(df.ts)
    df = df[df.volume > 200]
    display(df)  # to display the dataframe

schedule.every(2).minutes.do(show_datafr)  # remember, you said every 2 minutes

while True:
    schedule.run_pending()
    time.sleep(1)
If you want to run it every 2 minutes, the schedule line is quite strange.
It should be:
schedule.every(2).minutes.do(show_datafr)
instead of:
schedule.every(4).seconds.do(show_datafr)
What you wrote runs every 4 seconds, and the operation possibly cannot finish in 4 seconds, which would explain the missing output.
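The other half of the problem generalises: a bare `df` at the end of a function is evaluated and thrown away. A minimal sketch with made-up data (no API involved) showing the difference:

```python
import io
import contextlib

import pandas as pd

# A bare `df` expression is only echoed by the interactive interpreter or a
# notebook cell; inside a function run by schedule it produces no output.
# print() (or IPython's display()) writes to stdout unconditionally.
def silent():
    df = pd.DataFrame({"a": [1, 2]})
    df  # evaluated, then discarded

def visible():
    df = pd.DataFrame({"a": [1, 2]})
    print(df)

silent()   # shows nothing
visible()  # prints the frame
```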

How to combine 2 columns in pandas DataFrame?

Hello! This is a CSV table. I was trying to combine CSV output with Python to create Gantt charts. Each column in the CSV file holds part of a date-time; for example, start1 is the hours and start2 the minutes. I then use pd.to_datetime(data["start1"], format="%H") for the proper formatting, and the same for start2.
And here is the thing: how can I combine both of these columns in the pandas DataFrame to get one column in "%H-%M" format, like data["start"]? Here is the data.head() output and code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import timedelta
#import data
data = pd.read_csv('TEST.csv')
#convert data str to "datetime" data
data["start1"] = pd.to_datetime(data["start1"], format="%H")
data["start2"] = pd.to_datetime(data["start2"], format="%M")
data["end1"] = pd.to_datetime(data["end1"], format="%H")
data["end2"] = pd.to_datetime(data["end2"], format="%M")
Try:
data["start"] = pd.to_datetime(data["start1"].astype(str).str.pad(2, fillchar="0") +
data["start2"].astype(str).str.pad(2, fillchar="0"),
format="%H%M")
data["end"] = pd.to_datetime(data["end1"].astype(str).str.pad(2, fillchar="0") +
data["end2"].astype(str).str.pad(2, fillchar="0"),
format="%H%M")
Before you change the data types to datetime you can add an additional column like this:
data["start"] = data["start1"] + '-' + data["start2"]
data["start"] = pd.to_datetime(data["start"], format="%H-%M")
# then do the other conversions.
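Either way, the combined column can be rendered back into the requested "%H-%M" form with dt.strftime. A small sketch with hypothetical hour/minute values standing in for TEST.csv:

```python
import pandas as pd

# Zero-pad hours and minutes, parse them together, then format the
# result as an "%H-%M" string.
data = pd.DataFrame({"start1": [9, 14], "start2": [5, 30]})
start = pd.to_datetime(
    data["start1"].astype(str).str.zfill(2)
    + data["start2"].astype(str).str.zfill(2),
    format="%H%M",
)
data["start"] = start.dt.strftime("%H-%M")
print(data["start"].tolist())  # ['09-05', '14-30']
```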

Downloading blocks of data from census, how to write to multiple CSVs to not exceed memory

Suppose I have a list of API variable keys I am downloading from the census data.
Example:
variable_list = [
    'B08006_017E',
    'B08016_002E',
    'B08016_003E',
    'B08016_004E',
    ...
]
Now, given memory constraints on putting this data into one CSV file, I want a way to place blocks of 100 variables from the variable list onto a number of CSV files. For example, if I have 200 variables then I would have 2 CSV files: one with the first 100 and one with the second 100 variables. I hope that is clear.
This is how I am currently downloading the data:
import pandas as pd
import censusdata
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.precision', 2)
#import statsmodels.formula.api as sm
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
import censusgeocode as cg
import numpy as np
from numbers import Number
import plotly
import matplotlib.pyplot as plt
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import requests
import pandas
import geopandas
import json
import math
from haversine import haversine
from ipfn import ipfn
import networkx
from matplotlib import pyplot
from matplotlib import patheffects
from shapely.geometry import LineString, MultiLineString
variable_list1 = [
    'B08006_017E',
    'B08016_002E',  # note: a missing comma here would silently concatenate the strings
    'B08016_003E',
    'B08016_004E'
]
all_variable_lists = [variable_list1]
print(len(all_variable_lists[0]))
#2) For each year, download the relevant variables for each tract
def download_year(year, variable_list, State, County, Tract):
    df = censusdata.download('acs5', year, censusdata.censusgeo([('state', State), ('county', County), ('tract', Tract)]), variable_list, key='e39a53c23358c749629da6f31d8f03878d4088d6')
    df['Year'] = str(year)
    return df
#3) Define function to download for a single year and state
def callback_arg(i, variable_list, year):
    try:
        print('Downloading - ', year, 'State', i, ' of 57')
        if i < 10:
            df = download_year(year, variable_list, '0' + str(i), '*', '*')
            return df
        if i == 51:
            df = download_year(year, variable_list, str(i), '*', '*')
            return df
        else:
            df = download_year(year, variable_list, str(i), '*', '*')
            return df
    except:
        pass
#3) Function to download for all states and all years, do some slight formatting
def download_all_data(variable_list, max_year):
    df = download_year(2012, variable_list, '01', '*', '*')
    for year in range(2012, max_year + 1):
        if year == 2012:
            for i in range(0, 57):
                df = df.append(callback_arg(i, variable_list, year))
        else:
            for i in range(0, 57):
                df = df.append(callback_arg(i, variable_list, year))
    df2 = df.reset_index()
    df2 = df2.rename(columns={"index": "Location+Type"}).astype(str)
    df2['state'] = df2["Location+Type"].str.split(':').str[0].str.split(', ').str[2]
    df2['Census_tract'] = df2["Location+Type"].str.split(':').str[0].str.split(',').str[0].str.split(' ').str[2][0]
    df2['County_name'] = df2["Location+Type"].str.split(':').str[0].str.split(', ').str[1]
    return df2
#4) Some slight formatting
def write_to_csv(df2, name='test'):
    df2.to_csv(name)

#5) The line below is commented out, but should run the entire download sequence
def write_to_csv(df, ide):
    df.to_csv('test' + str(ide) + '.csv')

list_of_dfs = []
for var_list in all_variable_lists:
    list_of_dfs.append(download_all_data(var_list, 2012))

x1 = list_of_dfs[0].reset_index()
# x3 = pd.merge(x1, x2, on=['index', 'Location+Type', 'Year', 'state', 'Census_tract', 'County_name'])
write_to_csv(x1, 1)
If anyone can give me some ideas on how to achieve this, it would greatly help me. Thank you.
It looks like you're already chunking the variable lists here:
for var_list in all_variable_lists:
    list_of_dfs.append(download_all_data(var_list, 2012))
Just make sure each var_list has only 100 items. Then chunk the CSV writing in the same way, using enumerate to increment the index for the filename:
for index, out_list in enumerate(list_of_dfs):
    write_to_csv(out_list.reset_index(), index)
If you're just looking to break up the final output at write time:
for index, out_list in enumerate(np.array_split(x1, 100)):
    write_to_csv(out_list, index)
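For the first step, producing the 100-variable blocks in the first place, a simple slicing helper is enough. A sketch with dummy variable codes (in the real script each block would be passed to download_all_data() and written to its own CSV):

```python
# Split a list into blocks of at most `size` items.
def chunk_list(items, size=100):
    return [items[i:i + size] for i in range(0, len(items), size)]

variable_list = [f"B08006_{i:03d}E" for i in range(1, 251)]  # 250 dummy codes
all_variable_lists = chunk_list(variable_list)
print([len(block) for block in all_variable_lists])  # [100, 100, 50]
```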

Python pandas - Filter a data frame based on a pre-defined array

I'm trying to filter a data frame based on the contents of a pre-defined array.
I've looked up several examples on Stack Overflow but simply get an empty output.
I can't figure out what I'm doing incorrectly. Could I please get some guidance here?
import pandas as pd
import numpy as np
csv_path = 'history.csv'
df = pd.read_csv(csv_path)
pre_defined_arr = ["A/B", "C/D", "E/F", "U/Y", "R/E", "D/F"]
distinct_count_column_headers = ['Entity']
distinct_elements= pd.DataFrame(df.drop_duplicates().Entity.value_counts(),columns=distinct_count_column_headers)
filtered_data= distinct_elements[distinct_elements['Entity'].isin(pre_defined_arr)]
print("Filtered data ... ")
print(filtered_data)
OUTPUT
Filtered data ...
Empty DataFrame
Columns: [Entity]
Index: []
Managed to do that using the filter function -> .filter(items=pre_defined_arr)
import pandas as pd
import numpy as np
csv_path = 'history.csv'
df = pd.read_csv(csv_path)
pre_defined_arr = ["A/B", "C/D", "E/F", "U/Y", "R/E", "D/F"]
distinct_count_column_headers = ['Entity']
distinct_elements_filtered= pd.DataFrame(df.drop_duplicates().Entity.value_counts().filter(items=pre_defined_arr),columns=distinct_count_column_headers)
It's strange that I bumped into just one answer suggesting the filter function. Almost 9 out of 10 out there talk about the .isin function, which didn't work in my case.
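For the record, .isin works too once it is pointed at the right axis. After value_counts() the entity names live in the index and the values are counts, which is why filtering the values came back empty. A sketch with made-up data:

```python
import pandas as pd

# After value_counts(), entity names are the index; the values are counts.
df = pd.DataFrame({"Entity": ["A/B", "C/D", "X/Y"]})
counts = df.drop_duplicates().Entity.value_counts()
pre_defined_arr = ["A/B", "C/D", "E/F"]

wrong = counts[counts.isin(pre_defined_arr)]        # compares counts to strings: empty
right = counts[counts.index.isin(pre_defined_arr)]  # compares names: matches
print(len(wrong), len(right))  # 0 2
```

filter(items=pre_defined_arr) selects by index labels as well, which is why it worked where the original .isin call did not.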

pandas - Joining CSV time series into a single dataframe

I'm trying to get 4 CSV files into one dataframe. I've looked around on the web for examples and tried a few but they all give errors. Finally I think I'm onto something, but it gives unexpected results. Can anybody tell me why this doesn't work?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
n = 24*365*4
dates = pd.date_range('20120101',periods=n,freq='h')
df = pd.DataFrame(np.random.randn(n,1),index=dates,columns=list('R'))
#df = pd.DataFrame(index=dates)
paths = ['./LAM DIV/10118218_JAN_LAM_DIV_1.csv',
         './LAM DIV/10118218_JAN-APR_LAM_DIV_1.csv',
         './LAM DIV/10118250_JAN_LAM_DIV_2.csv',
         './LAM DIV/10118250_JAN-APR_LAM_DIV_2.csv']

for i in range(len(paths)):
    data = pd.read_csv(paths[i], index_col=0, header=0, parse_dates=True)
    df.join(data['TempC'])

df.head()
Expected result:
Date Time R 0 1 2 3
Getting this:
Date Time R
You need to save the result of your join:
df = df.join(data['TempC'])
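Note also that once the assignment is saved, joining four files that all expose a TempC column will collide on the column name; renaming each series avoids that. A sketch with random stand-in data for the CSVs:

```python
import numpy as np
import pandas as pd

# join() returns a new DataFrame, so the result must be reassigned; renaming
# each TempC series avoids "columns overlap" errors across the four files.
dates = pd.date_range('20120101', periods=24, freq='h')
df = pd.DataFrame(np.random.randn(24, 1), index=dates, columns=list('R'))

for i in range(4):  # stands in for reading the four CSV files
    data = pd.DataFrame({'TempC': np.random.randn(24)}, index=dates)
    df = df.join(data['TempC'].rename(f'TempC_{i}'))

print(df.columns.tolist())  # ['R', 'TempC_0', 'TempC_1', 'TempC_2', 'TempC_3']
```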
