I am using the following code -
import pandas as pd
from mftool import Mftool
import os
import time

mf = Mftool()
data = mf.get_scheme_historical_nav('138564', as_Dataframe=True)
data = data.rename_axis("Date", index=False)
The above mentioned code gives me data in the following format -
(screenshot of the resulting DataFrame: Date is the index, with dates shown as dd-mm-yyyy)
Clearly, Date has been set as the index, but I want to:
1. keep 'Date' as a regular column in my df instead of using it as the index, and
2. change dd-mm-yyyy to yyyy-mm-dd.
Can anybody help? Thank you!
I tried the following, but it was not useful:
data = data.set_index(pd.to_datetime(data['Date']))
data['Date'] = pd.to_datetime(data['Date'])
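A minimal sketch of one way to do both, assuming the NAV dates come back as dd-mm-yyyy strings in the index (scheme code kept from the question):

import pandas as pd
from mftool import Mftool

mf = Mftool()
data = mf.get_scheme_historical_nav('138564', as_Dataframe=True)

# Turn the Date index into an ordinary column
data = data.rename_axis('Date').reset_index()

# Parse the dd-mm-yyyy strings and rewrite them as yyyy-mm-dd
data['Date'] = pd.to_datetime(data['Date'], format='%d-%m-%Y').dt.strftime('%Y-%m-%d')

print(data.head())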
How can I print a new dataframe and clear the previously printed dataframe while using a loop, so the output only shows the latest dataframe instead of all of them?
Using print(df, end="\r") doesn't work.
import pandas as pd
import numpy as np
while True:
    df = pd.DataFrame(np.random.rand(10, 10))
    print(df)
If I get live data from an API to insert into the df, I'll use the while loop to constantly update the data. But how can I print only the newest dataframe instead of printing all the dataframes underneath each other in the output?
If I use the snippet below it does work, but I think there should be a more elegant solution.
import pandas as pd
import numpy as np
import sys

Height_DF = 10
Width_DF = 10

while True:
    df = pd.DataFrame(np.random.rand(10, 10))
    print(df)
    for i in range(Height_DF + 1):
        sys.stdout.write("\033[F")
try this:
import pandas as pd
import numpy as np
import time
import sys

while True:
    df = pd.DataFrame(np.random.rand(10, 10))
    print(df)
    # move the cursor back up over the whole printed frame (10 data rows + header row)
    sys.stdout.write("\033[F" * (df.shape[0] + 1))
    time.sleep(1)
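If this runs in a Jupyter or Colab cell rather than a plain terminal, another option is to clear the cell output between prints; a minimal sketch, assuming IPython is available:

import time
import numpy as np
import pandas as pd
from IPython.display import clear_output

while True:
    df = pd.DataFrame(np.random.rand(10, 10))
    clear_output(wait=True)  # wipe the previous printout before showing the new one
    print(df)
    time.sleep(1)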
I am just starting out with data science, so apologies if this is a basic question with a simple answer, but I have been scanning Google for hours and have tried multiple solutions to no avail.
Basically, Excel has automatically converted some values in my dataset, such as 3-5, into dates like 03-May. I am not able to simply change the values back in Excel, so I need to clean the data in Python. My first thought was to use the replace tool, i.e. df = df.replace('2019-05-03 00:00:00', '3-5'), but it doesn't work, presumably because the dtype differs between the Timestamp and the str(?); it does work if I adjust the code to something like df = df.replace('0-2', '3-5').
I can't simply treat that data as a missing value either, as it is an error in formatting rather than a spurious entry.
Is there a simple way of doing this?
Listed below is an example snippet of the data I am working with:
GitHub public gist
Please see below for the code:
#Dependencies
import pytest
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
import numpy as np
from google.colab import drive
import io
#Import data
from google.colab import files
upload = files.upload()
df = pd.read_excel(io.BytesIO(upload['breast-cancer.xls']))
df
#Clean Data
df.dtypes
#Correcting tumor-size and inv-nodes values
'''def clean_data(dataset):
    for i in dataset:
        dataset = dataset.replace('2019-05-03 00:00:00','3-5')
        dataset = dataset.replace('2019-08-06 00:00:00','6-8')
        dataset = dataset.replace('2019-09-11 00:00:00','9-11')
        dataset = dataset.replace('2014-12-01 00:00:00','12-14')
        dataset = dataset.replace('2014-10-01 00:00:00','10-14')
        dataset = dataset.replace('2019-09-05 00:00:00','5-9')
    return dataset

cleaned_dataset = dataset.apply(clean_data)
cleaned_dataset'''
df = df.replace('2019-05-03 00:00:00', '3-5')
df
#Check for duplicates
df.duplicated()
df[['tumor-size', 'inv-nodes']] = df[['tumor-size', 'inv-nodes']].astype(str)
That line of code saved the day.
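For the remaining mangled values, a hedged sketch of the same idea: cast the affected columns to str first (a pandas Timestamp prints as e.g. '2019-05-03 00:00:00'), then map each Excel-converted date back to its original range. The timestamp strings are the ones listed in the question, and df is assumed to be the frame read from the Excel file above.

fixes = {
    '2019-05-03 00:00:00': '3-5',
    '2019-08-06 00:00:00': '6-8',
    '2019-09-11 00:00:00': '9-11',
    '2014-12-01 00:00:00': '12-14',
    '2014-10-01 00:00:00': '10-14',
    '2019-09-05 00:00:00': '5-9',
}

cols = ['tumor-size', 'inv-nodes']
df[cols] = df[cols].astype(str)      # Timestamps become plain strings
df[cols] = df[cols].replace(fixes)   # now the string-to-string replace works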
Suppose I have a list of census API variable codes that I am downloading from the census data
Example:
variable_list = [
'B08006_017E',
'B08016_002E',
'B08016_003E',
'B08016_004E',
...
]
Now, given memory constraints for putting this data onto one csv file, I want to create a way to place blocks of 100 variables from the variable list onto a number of csv files. For example, if I have 200 variables then I would have 2 csv files: one with the first 100 and one with the second 100 variables. I hope that is clear.
This is how I am currently downloading the data:
import pandas as pd
import censusdata
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.precision', 2)
#import statsmodels.formula.api as sm
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
import censusgeocode as cg
import numpy as np
from numbers import Number
import plotly
import matplotlib.pyplot as plt
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import requests
import pandas
import geopandas
import json
import math
from haversine import haversine
from ipfn import ipfn
import networkx
from matplotlib import pyplot
from matplotlib import patheffects
from shapely.geometry import LineString, MultiLineString
variable_list1 = [
    'B08006_017E',
    'B08016_002E',
    'B08016_003E',
    'B08016_004E'
]
all_variable_lists = [variable_list1]
print(len(all_variable_lists[0]))
#2) For each year, download the relevant variables for each tract
def download_year(year,variable_list,State,County,Tract):
    df = censusdata.download('acs5', year, censusdata.censusgeo([('state',State),('county',County),('tract',Tract)]), variable_list, key = 'e39a53c23358c749629da6f31d8f03878d4088d6')
    df['Year'] = str(year)
    return df
#3) Define function to download for a single year and state
def callback_arg(i,variable_list,year):
    try:
        print('Downloading - ',year,'State', i,' of 57')
        if i < 10:
            df = download_year(year,variable_list,'0'+str(i),'*','*')
            return df
        if i == 51:
            df = download_year(year,variable_list,str(i),'*','*')
            return df
        else:
            df = download_year(year,variable_list,str(i),'*','*')
            return df
    except:
        pass
#3) Function to download for all states and all years, do some slight formatting
def download_all_data(variable_list,max_year):
    df = download_year(2012,variable_list,'01','*','*')
    for year in range(2012,max_year+1):
        if year == 2012:
            for i in range(0,57):
                df = df.append(callback_arg(i,variable_list,year))
        else:
            for i in range(0,57):
                df = df.append(callback_arg(i,variable_list,year))
    df2 = df.reset_index()
    df2 = df2.rename(columns = {"index": "Location+Type"}).astype(str)
    df2['state'] = df2["Location+Type"].str.split(':').str[0].str.split(', ').str[2]
    df2['Census_tract'] = df2["Location+Type"].str.split(':').str[0].str.split(',').str[0].str.split(' ').str[2][0]
    df2['County_name'] = df2["Location+Type"].str.split(':').str[0].str.split(', ').str[1]
    return(df2)
#4) Some slight formatting
def write_to_csv(df2,name = 'test'):
    df2.to_csv(name)

#5) The line below is commented out, but should run the entire download sequence
def write_to_csv(df, ide):
    df.to_csv('test' + str(ide) + '.csv')
list_of_dfs = []
for var_list in all_variable_lists:
    list_of_dfs.append(download_all_data(var_list, 2012))

x1 = list_of_dfs[0].reset_index()
# x3 = pd.merge(x1,x2, on=['index','Location+Type','Year','state','Census_tract','County_name'])
write_to_csv(x1, 1)
If anyone can give me some ideas on how to achieve this, it would greatly help me. Thank you.
It looks like you're already chunking the variable_lists here:
for var_list in all_variable_lists:
    list_of_dfs.append(download_all_data(var_list, 2012))
Just make sure each var_list has only 100 items. Then chunk the csv writing in the same way, using enumerate to increment the index for filename:
for index, out_list in enumerate(list_of_dfs):
    write_to_csv(out_list.reset_index(), index)
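To get each var_list down to at most 100 items in the first place, a small sketch of one way to chunk the long variable_list (the chunk helper below is just an illustration, not part of censusdata):

def chunk(lst, size=100):
    # yield successive blocks of at most `size` items from lst
    for start in range(0, len(lst), size):
        yield lst[start:start + size]

all_variable_lists = list(chunk(variable_list))  # e.g. 200 variables -> two lists of 100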
If you're just looking to break up the final output at write time (np.array_split here splits the frame into 100 roughly equal pieces):
for index, out_list in enumerate(np.array_split(x1, 100)):
    write_to_csv(out_list, index)
I'm trying to filter a data frame based on the contents of a pre-defined array.
I've looked up several examples on StackOverflow but simply get an empty output.
I'm not able to figure out what I'm doing incorrectly. Could I please get some guidance here?
import pandas as pd
import numpy as np
csv_path = 'history.csv'
df = pd.read_csv(csv_path)
pre_defined_arr = ["A/B", "C/D", "E/F", "U/Y", "R/E", "D/F"]
distinct_count_column_headers = ['Entity']
distinct_elements= pd.DataFrame(df.drop_duplicates().Entity.value_counts(),columns=distinct_count_column_headers)
filtered_data= distinct_elements[distinct_elements['Entity'].isin(pre_defined_arr)]
print("Filtered data ... ")
print(filtered_data)
OUTPUT
Filtered data ...
Empty DataFrame
Columns: [Entity]
Index: []
Managed to do that using the filter function -> .filter(items=pre_defined_arr)
import pandas as pd
import numpy as np
csv_path = 'history.csv'
df = pd.read_csv(csv_path)
pre_defined_arr = ["A/B", "C/D", "E/F", "U/Y", "R/E", "D/F"]
distinct_count_column_headers = ['Entity']
distinct_elements_filtered= pd.DataFrame(df.drop_duplicates().Entity.value_counts().filter(items=pre_defined_arr),columns=distinct_count_column_headers)
It's strange that there's just one answer I bumped into that suggests the filter function. Almost 9 out of 10 out there talk about the .isin function, which didn't work in my case.
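For what it's worth, .isin probably came back empty because value_counts() puts the entity names into the index, so distinct_elements['Entity'] holds the counts rather than the names. A hedged sketch of how .isin can still be used by testing the index instead:

import pandas as pd

csv_path = 'history.csv'
df = pd.read_csv(csv_path)

pre_defined_arr = ["A/B", "C/D", "E/F", "U/Y", "R/E", "D/F"]

counts = df.drop_duplicates().Entity.value_counts()   # entity names end up in the index
distinct_elements = counts.to_frame(name='Entity')

# filter on the index, which is where the names actually live
filtered_data = distinct_elements[distinct_elements.index.isin(pre_defined_arr)]
print(filtered_data)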