Pandas concatenate dataframes with for loop - python

I am trying to get tables from a website. The website's URL contains a date, so I will have to iterate over dates in order to get historical data. I am generating the dates as follows:
import datetime
start = datetime.datetime.strptime("26-09-2016", "%d-%m-%Y")
end = datetime.datetime.strptime("30-09-2016", "%d-%m-%Y")
date_generated = [start + datetime.timedelta(days=x) for x in range(0, (end-start).days)]
dates_list = []
for date in date_generated:
    txt = str(str(date.day) + '.' + str(date.month) + '.' + str(date.year))
    dates_list.append(txt)
dates_list
After this, I am running the code below to concatenate all the tables:
for i in range(0, 3):
    allURL = 'https://www.uzse.uz/trade_results?date=' + dates_list[i] + '&locale=en&mkt_id=ALL&page=%d'
    ndf_list = []
    for i in range(1, 100):
        url = allURL % i
        if pd.read_html(url)[0].empty:
            break
        else:
            ndf_list.append(pd.read_html(url)[0])
    ndf = pd.concat(ndf_list)
    ndf.insert(0, 'Date', dates_list[i])
mdf = pd.concat(ndf, ignore_index=True)
mdf
However, this does not work and I get:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
I do not understand what I am doing wrong. I am expecting to have one table that comes from 26th, 27th, and 28th September.
Please help.

The TypeError comes from the last line: pd.concat expects an iterable of pandas objects (e.g. a list of DataFrames), but mdf = pd.concat(ndf, ignore_index=True) passes it a single DataFrame. Keeping the rest of your structure, I'd approach it this way:
import datetime
import pandas as pd

start = datetime.datetime.strptime("26-09-2016", "%d-%m-%Y")
end = datetime.datetime.strptime("30-09-2016", "%d-%m-%Y")
date_generated = [
    start + datetime.timedelta(days=x) for x in range(0, (end-start).days)]
dates_list = []
for date in date_generated:
    txt = str(str(date.day) + '.' + str(date.month) + '.' + str(date.year))
    dates_list.append(txt)

ndf = pd.DataFrame()  # create empty ndf
for i in range(0, 3):
    allURL = 'https://www.uzse.uz/trade_results?date=' + \
        dates_list[i] + '&locale=en&mkt_id=ALL&page=%d'
    # ndf_list = []
    for j in range(1, 100):
        url = allURL % j
        if pd.read_html(url)[0].empty:
            break
        else:
            # ndf_list.append(pd.read_html(url)[0])
            chunk = pd.read_html(url)[0]
            chunk['Date'] = dates_list[i]  # Date ends up as the last column, let's fix that
            # get a list of all the columns
            cols = chunk.columns.tolist()
            # rearrange the columns, move the last element (Date) to the first position
            cols = cols[-1:] + cols[:-1]
            # reorder the dataframe
            chunk = chunk[cols]
            ndf = pd.concat([ndf, chunk])
    # ndf = pd.concat(ndf_list)
    # ndf.insert(0, 'Date', dates_list[i])

print(ndf)
# mdf = pd.concat(ndf, ignore_index=True)
# mdf
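As a side note, growing ndf with pd.concat inside the loop copies the whole frame on every page, and pd.read_html ends up being called twice per page. A sketch of the same scrape that fetches each page once, collects the chunks in a list, and concatenates a single time at the end (same URL pattern and empty-page stop condition as above):

import pandas as pd

frames = []
for d in dates_list[:3]:
    for page in range(1, 100):
        url = ('https://www.uzse.uz/trade_results?date=' + d
               + '&locale=en&mkt_id=ALL&page=%d' % page)
        chunk = pd.read_html(url)[0]
        if chunk.empty:
            break
        chunk.insert(0, 'Date', d)  # put Date straight into the first column
        frames.append(chunk)

mdf = pd.concat(frames, ignore_index=True)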

Related

how to run a for loop for a list of data frames which converts a data.frame to matrix

I have a list called fdt_frozen, which is a list of multiple data frames.
I'm trying to generate a match score using pairwise combinations for each data frame in the list. I have tested the logic below and it runs fine when I subset the list down to a single data frame (e.g. fdt_frozen[6]).
But it does not work when I try to loop the match-score logic over all the data frames in the list: I get a "not callable" error when I use a for loop.
Could anyone please help me put this entire logic inside a for loop so that it runs over every data frame in the list?
a = np.asmatrix(fdt_frozen[6])
compare_all = []
for i in range(a.shape[1]):
    compare = [1 if a[x[0], i] == a[x[1], i] else 0 for x in combinations(range(a.shape[0]), 2)]
    compare_all.append(compare)
compare1 = pd.DataFrame(compare_all)
compare1 = compare1.T
compare1.columns = fdt_single.columns
compare_all1 = []
for i in range(a.shape[1]):
    compare = [1 if pd.isnull(a[x[0], i]) and pd.isnull(a[x[1], i]) else 0 for x in combinations(range(a.shape[0]), 2)]
    compare_all1.append(compare)
compare2 = pd.DataFrame(compare_all1)
compare2 = compare2.T
compare2.columns = fdt_single.columns
compare2[compare2 == 1] = np.nan
compare = compare1 + compare2
combinations = list(itertools.combinations(range(a.shape[0]), 2))
combinations = [a[x[0], 0] + '-' + a[x[1], 0] for x in combinations]
compare.index = combinations
compare = compare.drop("Material", axis=1)
combinations = compare.index
MFDETAILED = pd.DataFrame({'COMBINATION': combinations,
                           'MS': round(((compare.sum(axis=1, skipna=True)) / (~compare.isna()).sum(axis=1)) * 100, 0)})
MFDETAILED = pd.concat([MFDETAILED, compare], axis=1)
MF = MFDETAILED.iloc[:, [0, 1]]
feature_names = compare.columns
MF['MF'] = compare.apply(lambda x: ', '.join(feature_names[x == 1]), axis=1)
feature_names = compare.columns
MF['NMF'] = compare.apply(lambda x: ', '.join(feature_names[x == 0]), axis=1)
MF[['LOOKUP', 'MAT']] = MF['COMBINATION'].str.split("-", expand=True)
MF = MF.drop(['COMBINATION'], axis=1)
MF = MF[["LOOKUP", "MAT", "MS", "MF", "NMF"]]
MF.columns = ["LOOKUP MATERIAL", "MATERIAL", "MATCH SCORE", "MATCH FEATURES", "NON MATCH FEATURES"]
ja = MF[["LOOKUP MATERIAL", "MATERIAL", "MATCH SCORE", "MATCH FEATURES", "NON MATCH FEATURES"]]
ja[["LOOKUP MATERIAL", "MATERIAL"]] = ja[["MATERIAL", "LOOKUP MATERIAL"]].values
cols = ['Material', 'Material', '100', 'All', 'None']
MF_SELF = pd.DataFrame(columns=cols)
MF_SELF['Material'] = fdt_frozen[6]['Material']
MF_SELF['Material'] = fdt_frozen[6]['Material']
MF_SELF['100'] = 100
MF_SELF['All'] = 'All'
MF_SELF['None'] = 'None'
MF_SELF.columns = ["LOOKUP MATERIAL", "MATERIAL", "MATCH SCORE", "MATCH FEATURES", "NON MATCH FEATURES"]
MF = pd.concat([MF, ja, MF_SELF], axis=0)
MF = MF[MF['MATCH SCORE'] >= 70]
MF = MF.reset_index(drop=True)
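No answer is recorded here, but one likely culprit, judging from the code itself: combinations = list(itertools.combinations(...)) rebinds the name combinations, so on the second data frame the earlier call combinations(range(a.shape[0]), 2) finds a list instead of itertools.combinations and raises a "not callable" TypeError. A minimal demonstration of that shadowing:

from itertools import combinations

pairs = list(combinations(range(3), 2))   # fine on the first pass
combinations = pairs                      # rebinds the name to a list...
pairs = list(combinations(range(3), 2))   # TypeError: 'list' object is not callable

A sketch of a fix, then: rename the list (e.g. pair_labels), wrap the whole block above in a function such as def match_scores(fdt_single): ... return MF, and loop with results = [match_scores(df) for df in fdt_frozen].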

Accessing pyomo variables with two indices

I have started using pyomo to solve optimization problems. I have a bit of an issue with accessing variables that use two indices. I can easily print the solution, but I want to store the index-dependent variable values in a pd.DataFrame to further analyze the result. I have written the following code, but it takes forever to store the variables. Is there a faster way?
df_results = pd.DataFrame()
df_variables = pd.DataFrame()
results.write()
instance.solutions.load_from(results)
for v in instance.component_objects(Var, active=True):
    print("Variable", v)
    varobject = getattr(instance, str(v))
    frequency = np.empty([len(price_dict)])
    for index in varobject:
        exist = False
        two = False
        if index is not None:
            if type(index) is int:
                # For time index t (0:8760 hours of year)
                exists = True  # does an index exist
                frequency[index] = float(varobject[index].value)
            else:
                # For components (names)
                if type(index) is str:
                    print(index)
                    print(varobject[index].value)
                else:
                    # for all indices with two components
                    two = True  # is index of two indices
                    if index[1] in df_variables.columns:
                        df_variables[index[0], str(index[1]) + '_' + str(v)] = varobject[index].value
                    else:
                        df_variables[index[1]] = np.nan
                        df_variables[index[0], str(index[1]) + '_' + str(v)] = varobject[index].value
        else:
            # If no index exists, simply print the variable value
            print(varobject.value)
    if not(exists):
        if not(two):
            df_variable = pd.Series(frequency, name=str(v))
            df_results = pd.concat([df_results, df_variable], axis=1)
            df_variable.drop(df_variable.index, inplace=True)
        else:
            df_results = pd.concat([df_results, df_variable], axis=1)
            df_variable.drop(df_variable.index, inplace=True)
With some more work and fewer DataFrames, I have solved the issue with the following code. Thanks to BlackBear for the comment.
df_results = pd.DataFrame()
df_variables = pd.DataFrame()
results.write()
instance.solutions.load_from(results)
for v in instance.component_objects(Var, active=True):
    print("Variable", v)
    varobject = getattr(instance, str(v))
    frequency = np.empty([20, len(price_dict)])
    exist = False
    two = False
    list_index = []
    dict_position = {}
    count = 0
    for index in varobject:
        if index is not None:
            if type(index) is int:
                # For time index t (0:8760 hours of year)
                exist = True  # does an index exist
                frequency[0, index] = float(varobject[index].value)
            else:
                # For components (names)
                if type(index) is str:
                    print(index)
                    print(varobject[index].value)
                else:
                    # for all indices with two components
                    exist = True
                    two = True  # is index of two indices
                    if index[1] in list_index:
                        position = dict_position[index[1]]
                        frequency[position, index[0]] = varobject[index].value
                    else:
                        dict_position[index[1]] = count
                        list_index.append(index[1])
                        print(list_index)
                        frequency[count, index[0]] = varobject[index].value
                        count += 1
        else:
            # If no index exists, simply print the variable value
            print(varobject.value)
    if exist:
        if not(two):
            frequency = np.transpose(frequency)
            df_variable = pd.Series(frequency[:, 0], name=str(v))
            df_results = pd.concat([df_results, df_variable], axis=1)
            df_variable.drop(df_variable.index, inplace=True)
        else:
            for i in range(count):
                df_variable = pd.Series(frequency[i, :], name=str(v) + '_' + list_index[i])
                df_results = pd.concat([df_results, df_variable], axis=1)
                df_variable.drop(df_variable.index, inplace=True)
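For comparison, a much shorter pattern, sketched under the assumption that instance is the solved model from above: collect (variable, index, value) records in a plain list and build one DataFrame at the end, instead of concatenating inside the loop:

import pandas as pd
from pyomo.environ import Var, value

records = []
for v in instance.component_objects(Var, active=True):
    for index in v:  # index may be an int, a str, a tuple, or None for scalar variables
        records.append((v.name, index, value(v[index])))
df_records = pd.DataFrame(records, columns=['variable', 'index', 'value'])
# e.g. df_records.pivot(index='index', columns='variable', values='value') for a wide layout

Appending to a list is cheap per element, so this avoids the repeated pd.concat copies that made the original loop slow.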

How to store the results of each iteration of for loop in a dataframe

cols = Germandata.columns
percentage_list = [0.05, 0.01, 0.1]
for i in range(len(Germandata)):
    for percentage in percentage_list:
        columns_n = 3
        random_columns = np.random.choice(cols, columns_n, replace=False)
        local_data = Germandata.copy()
        remove_n = int(round(local_data.shape[0] * percentage, 0))
        for column_name in random_columns:
            drop_indices = np.random.choice(local_data.index, remove_n, replace=False)
            local_data.loc[drop_indices, column_name] = np.nan
The code selects columns at random, deletes a certain percentage of observations from the data, and replaces them with NaNs. The problem is that after the loop finishes I am left with only the dataframe for the last percentage in the list, because local_data is overwritten on each iteration. How do I store the dataframe with NaNs after each iteration? Ideally I should get three dataframes, each with a different percentage of data deleted.
Try this
df_list = []
cols = Germandata.columns
percentage_list = [0.05, 0.01, 0.1]
for percentage in percentage_list:
    columns_n = 3
    random_columns = np.random.choice(cols, columns_n, replace=False)
    local_data = Germandata.copy()
    remove_n = int(round(local_data.shape[0] * percentage, 0))
    for column_name in random_columns:
        drop_indices = np.random.choice(local_data.index, remove_n, replace=False)
        local_data.loc[drop_indices, column_name] = np.nan
    local_data['percentage'] = percentage  # optional
    df_list.append(local_data)

df_05 = df_list[0]
df_01 = df_list[1]
df_1 = df_list[2]
Alternatively, you can use a dictionary
df_dict = {}
cols = Germandata.columns
percentage_list = [0.05, 0.01, 0.1]
for percentage in percentage_list:
    columns_n = 3
    random_columns = np.random.choice(cols, columns_n, replace=False)
    local_data = Germandata.copy()
    remove_n = int(round(local_data.shape[0] * percentage, 0))
    for column_name in random_columns:
        drop_indices = np.random.choice(local_data.index, remove_n, replace=False)
        local_data.loc[drop_indices, column_name] = np.nan
    local_data['percentage'] = percentage  # optional
    df_dict[str(percentage)] = local_data

df_05 = df_dict['0.05']
df_01 = df_dict['0.01']
df_1 = df_dict['0.1']
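If you later need everything in a single frame, pd.concat with the keys argument labels each piece; a sketch using the df_list built above:

import pandas as pd

combined = pd.concat(df_list, keys=[str(p) for p in percentage_list],
                     names=['percentage', 'row'])
combined.loc['0.05']  # recovers the 5% frame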

Python pd.read_csv, .to_sql different length than actual data

I have a csv file with ~50 million rows. After reading it into a database I only get about 21,000 rows. What am I doing wrong? Thanks.
chunksize = 100000
csv_database = create_engine('sqlite:///csv_database.db', pool_pre_ping=True)
i = 0
j = 0
q = 0
for df in pd.read_csv(filename, chunksize=chunksize, iterator=False):
    # df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    df.index += j
    i += 1
    df.to_sql('table', csv_database, if_exists='append')
    j = df.index[-1] + 1
    q += 1
    print("q: " + repr(q))

columnx = df.iloc[:, 0]
columny = df.iloc[:, 1]
columnz = df.iloc[:, 2]
columnmass = df.iloc[:, 3]

out: [21739 rows x 1 columns] etc etc.

In [19]: len(df)
Out[19]: 21739
'df' doesn't contain the entire csv file: since you set the chunk size to 100000, df only ever holds one chunk at a time, and 21739 is just the number of rows inserted in the last iteration.
If you do a count(1) of your table, I bet you'll get the full row count (some multiple of the chunk size plus that final 21739).
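Something like this would confirm it (a sketch; it assumes the engine and table name from the question, with "table" quoted since it is a SQL keyword):

total = pd.read_sql('SELECT COUNT(1) AS n FROM "table"', csv_database)['n'].iloc[0]
print(total)  # should be close to the ~50 million rows in the csv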
The following code is working for me:
import numpy as np
import pandas as pd
import sqlite3
from sqlalchemy import create_engine

DIR = 'C:/Users/aslams/Desktop/checkpoint/'
FILE = 'SUBSCRIBER1.csv'
file = '{}{}'.format(DIR, FILE)

csv_database = create_engine('sqlite:///csv_database.db')
chunksize = 10000
i = 0
j = 0
for df in pd.read_csv(file, chunksize=chunksize, iterator=True):
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    df.index += j
    df.to_sql('data_use', csv_database, if_exists='append')
    j = df.index[-1] + 1
    print('| index: {}'.format(j))
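A cheap way to double-check that nothing is lost on the way in is to keep a running total while writing; a sketch reusing the file, chunksize, and engine defined above:

total = 0
for df in pd.read_csv(file, chunksize=chunksize, iterator=True):
    df.to_sql('data_use', csv_database, if_exists='append')
    total += len(df)
print('rows written:', total)  # should match the csv's row count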

Looping over lists Python, indexing (basic bootstrap)

Given the following two lists:
dates = [1,2,3,4,5]
rates = [0.0154, 0.0169, 0.0179, 0.0187, 0.0194]
I would like to generate a list
df = []
of the same length as dates and rates (indices 0 to 4, i.e. 5 elements) in 'pure' Python (without NumPy) as an exercise. df[i] would be equal to:
df[0] = 1 / (1 + rates[0])
df[1] = (1 - df[0] * rates[1]) / (1 + rates[1])
...
df[4] = (1 - (df[0] + df[1] + ... + df[3]) * rates[4]) / (1 + rates[4])
I was trying:
df = []
df.append(1 + rates[0])  # create df[0]
for date in enumerate(dates, start=1):
    running_sum_vec = 0
    for i in enumerate(rates, start=1):
        running_sum_vec += df[i] * rates[i]
    df[i] = (1 - running_sum_vec) / (1 + rates[i])
return df
but I am getting a TypeError: list indices must be integers. Thank you.
So, the enumerate method returns two values: the index and the value:
>>> x = ['a', 'b', 'a']
>>> for y_count, y in enumerate(x):
...     print('index: {}, value: {}'.format(y_count, y))
...
index: 0, value: a
index: 1, value: b
index: 2, value: a
It's because of for i in enumerate(rates, start = 1):. enumerate generates tuples of the index and the object in the list. You should do something like
for i, rate in enumerate(rates, start=1):
    running_sum_vec += df[i] * rate
You'll need to fix the other loop (for date in enumerate(...)) as well.
You also need to move df[i] = (1 - running_sum_vec) / (1 + rates[i]) back into the loop (currently it only sets the last value), and change it to an append, since as written it tries to assign to an index that is out of bounds. See the sketch below.
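Putting those fixes together, a sketch of the corrected loop (note that df[0] also needs to be 1 / (1 + rates[0]) to match the formulas in the question):

df = [1 / (1 + rates[0])]
for i in range(1, len(rates)):
    running_sum_vec = sum(df) * rates[i]  # (df[0] + ... + df[i-1]) * rates[i]
    df.append((1 - running_sum_vec) / (1 + rates[i]))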
Not sure if this is what you want:
df = []
sum = 0
for ind, val in enumerate(dates):
    df.append((1 - (sum * rates[ind])) / (1 + rates[ind]))
    sum += df[ind]
Enumerate returns both index and entry.
So assuming the lists contain ints, your code can be:
df = []
df.append(1 / (1 + rates[0]))  # create df[0]
for i, rate in enumerate(rates[1:], start=1):
    running_sum_vec = 0
    for j in range(i):
        running_sum_vec += df[j] * rate
    df.append((1 - running_sum_vec) / (1 + rate))
Although I'm almost positive there's a way with list comprehension. I'll have to think about it for a bit.
