Amending dataframe from a generator that reads multiple excel files - python

My question ultimately is: is it possible to amend, in place, each dataframe of a generator of dataframes?
I have a series of excel files in a folder that each have a table in the same format. Ultimately I want to concatenate every file into one large dataframe. They all have unique column headers but share the same indices (historical dates, though possibly across different time frames), so I want to concatenate the dataframes aligned by their date. I first created a generator expression to build dataframes from the 'Data1' worksheet of each excel file:
all_files = glob.glob(os.path.join(path, "*"))
df_from_each_file = (pd.read_excel(f,'Data1') for f in all_files) #generator comprehension
The below code is the formatting that needs to be done to each dataframe so that I can concatenate them correctly in my final line. I changed the index to the date column but there are also some rows that contain data that is not relevant.
def format_ABS(df):
    df.drop(labels=range(0, 9), axis=0, inplace=True)
    df.set_index(df.iloc[:, 0], inplace=True)
    df.drop(df.columns[0], axis=1, inplace=True)
However, this doesn't work when I place the function within a generator expression (as I am amending all the dataframes in place). The generator produced contains no objects. Why doesn't the line below work? Is it because the generator can only be looped through once?
format_df = (format_ABS(x) for x in df_from_each_file)
but
format_ABS(next(df_from_each_file))
does work on each individual dataframe
The final product is then the below
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
I have gotten what I wanted by assigning index_col=0 in the pd.read_excel line, but it got me thinking about generators and amending dataframes in general.
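For what it's worth, here is a minimal sketch of why the generator expression appears empty and one way around it (assuming the same folder layout, a 'Data1' sheet in every file, and that side-by-side alignment on the date index is the goal). format_ABS mutates its argument but returns None, so a generator of format_ABS(x) yields only None values; on top of that, df_from_each_file is single-pass, so any earlier next() calls have already consumed items. Returning the frame from the helper avoids both issues:
import glob
import os
import pandas as pd

path = "excel_folder"  # hypothetical folder of workbooks, for illustration
all_files = glob.glob(os.path.join(path, "*"))

def format_ABS(df):
    # Same clean-up as above, but returning the frame instead of mutating in place
    df = df.drop(labels=range(0, 9), axis=0)
    df = df.set_index(df.iloc[:, 0])
    df = df.drop(df.columns[0], axis=1)
    return df

# Read, format and yield each frame lazily, consuming each workbook exactly once
formatted = (format_ABS(pd.read_excel(f, 'Data1')) for f in all_files)

# axis=1 lines the frames up side by side, aligned on the shared date index
concatenated_df = pd.concat(formatted, axis=1)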

Related

Pandas. How to call data frame from the list of data frames?

I have a list of data frames:
all_df = ['df_0','df_1','df_2','df_3','df_4','df_5','df_6']
How can I call them from this list to do something like that:
for (df, names) in zip(all_df, names):
    df.to_csv('output/{}.csv'.format(names))
When executing it, I expectedly get the error 'str' object has no attribute 'to_csv', since I'm passing a string to to_csv.
How can I save several data-frames (or perform other actions on them) in the for loop?
Thanks!
Could you please also give an idea on how to create the 'right' list of data frames from this:
path = 'inp/multysheet_excel_01.xlsx'
xl = pd.ExcelFile(path)
sh_name = xl.sheet_names
n = 0
for i in sh_name:
    exec('df_{} = pd.read_excel("{}", sheet_name="{}")'.format(n, path, i))
    n += 1
so basically I'm trying to get the each sheet of an input excel as a separate dataframe, perform some actions on them, and save each output dataframe in separate excels.
You're quite close, but I see some mistakes in that for loop. Say you have a list of dataframes dfs and their corresponding names as a list of strings names; you can save those dataframes using the names like this:
dfs = [df_1, df_2, df_3]
names = ['df_0','df_1','df_2']
for df, name in zip(dfs, names):
    df.to_csv('output\\{}.csv'.format(name))
Though if you only had a list of names, you could also do something like:
names = ['df_0','df_1','df_2']
for name in names:
    globals()[name].to_csv('output\\{}.csv'.format(name))
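As a side note on the follow-up about building the 'right' collection of dataframes, here is a hedged sketch that avoids exec and globals() entirely: pd.read_excel with sheet_name=None returns a dict mapping each sheet name to its dataframe, which you can iterate over directly (assuming the same workbook path and that one output csv per sheet is wanted).
import pandas as pd

path = 'inp/multysheet_excel_01.xlsx'

# sheet_name=None reads every sheet and returns {sheet_name: DataFrame}
sheets = pd.read_excel(path, sheet_name=None)

for sheet_name, df in sheets.items():
    # ...perform whatever per-sheet processing is needed here...
    df.to_csv('output/{}.csv'.format(sheet_name), index=False)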

Appending DataFrames in Loop

Goal: Appending DataFrames in Loop to get combined dataframe.
df_base = pd.DataFrame(columns=df_col.columns)
file_path = 'DATA/'
filenames = ['savedrecs-23.txt', 'savedrecs-25.txt', 'savedrecs-24.txt']
For-Loop:
for file in filenames:
    path = file_path + file
    doc = codecs.open(path, 'rU', 'UTF-8')
    df_add = pd.read_csv(doc, sep='\t')
    res = df_base.append(df_add)
res.shape
Expected Outcome:
(15, 67) ; all three data frames merged into one dataframe
Current Outcome:
(5, 67) ; just returns the last dataframe in the loop.
res = df_base.append(df_add)
Pandas append function does not modify the object it is called on. It returns a new object that contains the rows from the added dataframe appended onto the rows of the original dataframe.
Since you never modified df_base, your output is just the frame from the last file, appended to the empty df_base dataframe.
Note that the pandas documentation doesn't recommend iteratively appending dataframes together. Instead, "a better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once." (with an example given)
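A minimal sketch of that recommendation applied to this case (assuming the same file_path, filenames and tab-separated, UTF-8 encoded files): collect the frames in a list and concatenate once at the end.
import pandas as pd

file_path = 'DATA/'
filenames = ['savedrecs-23.txt', 'savedrecs-25.txt', 'savedrecs-24.txt']

frames = []
for file in filenames:
    # read each tab-separated file into its own dataframe
    frames.append(pd.read_csv(file_path + file, sep='\t', encoding='UTF-8'))

# a single concatenation at the end instead of repeated appends
res = pd.concat(frames, ignore_index=True)
res.shape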

Merging data from python list into one dataframe

I have the files AAMC_K.txt, AAU.txt, ACU.txt and ACY.txt in a folder called AMEX. I am trying to merge these text files into one dataframe. I have tried to do so with pd.merge(), but I get an error that the merge function needs right and left parameters, and my data is in a python list. How can I merge the data in data_list into one pandas dataframe?
import pandas as pd
import os
textfile_names = os.listdir("AMEX")
textfile_names.sort()
data_list = []
for i in range(len(textfile_names)):
    data = pd.read_csv("AMEX/" + textfile_names[i], index_col=None, header=0)
    data_list.append(data)
frame = pd.merge(data_list, on='<DTYYYYMMDD>', how='outer')
"AE.txt"
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
AE,D,19970102,000000,12.6250,12.6250,11.7500,11.7500,144,0
AE,D,19970103,000000,11.8750,12.1250,11.8750,12.1250,25,0
AAU.txt
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
AAU,D,20020513,000000,0.4220,0.4220,0.4220,0.4220,0,0
AAU,D,20020514,000000,0.4177,0.4177,0.4177,0.4177,0,0
ACU.txt
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
ACU,D,19970102,000000,5.2500,5.3750,5.1250,5.1250,52,0
ACU,D,19970103,000000,5.1250,5.2500,5.0625,5.2500,12,0
ACY.txt
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
ACY,D,19980116,000000,9.7500,9.7500,8.8125,8.8125,289,0
ACY,D,19980120,000000,8.7500,8.7500,8.1250,8.1250,151,0
I want the output to be filtered with the DTYYYYMMDD and put into one dataframe frame.
OUTPUT
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>,<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
ACU,D,19970102,000000,5.2500,5.3750,5.1250,5.1250,52,0,AE,D,19970102,000000,12.6250,12.6250,11.7500,11.7500,144,0
ACU,D,19970103,000000,5.1250,5.2500,5.0625,5.2500,12,0,AE,D,19970103,000000,11.8750,12.1250,11.8750,12.1250,25,0
As #busybear says, pd.concat is the right tool for this job: frame = pd.concat(data_list).
merge is for when you're joining two dataframes which usually have some of the same columns and some different ones. You choose a column (or index or multiple) which identifies which rows in the two dataframes correspond to each other, and pandas handles making a dataframe whose rows are combinations of the corresponding rows in the two original dataframes. This function only works on 2 dataframes at a time; you'd have to do a loop to merge more in (it's uncommon to need to merge many dataframes this way).
concat is for when you have multiple dataframes and want to just append all of their rows or columns into one large dataframe. (Let's assume you're concatenating rows, as you want here.) It doesn't use an identifier to determine which rows correspond. All it does is create a new dataframe which has each row from each of the concatenated dataframes (all the rows from the first, then all from the second, etc.).
I think the above is a decent TLDR on merge vs concat but see here for a lengthy but much more comprehensive guide on using merge/join/concat with dataframes.
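If the date-aligned, side-by-side layout shown under OUTPUT above is really what is wanted, here is a hedged sketch (assuming every file carries a <DTYYYYMMDD> column): since merge only works on two frames at a time, fold it across the list with functools.reduce, outer-joining on the date.
from functools import reduce
import os
import pandas as pd

textfile_names = sorted(os.listdir("AMEX"))
data_list = [pd.read_csv(os.path.join("AMEX", name), header=0) for name in textfile_names]

# merge two frames at a time, outer-joining on the shared date column;
# overlapping column names get _x/_y style suffixes automatically
frame = reduce(
    lambda left, right: pd.merge(left, right, on='<DTYYYYMMDD>', how='outer'),
    data_list,
)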

How to make multiple json processing faster using pySpark?

I have a list of json files within Databricks, and what I am trying to do is read each json, extract the values needed and then append them as a row to an empty pandas dataframe. Each json file corresponds to one row of the final dataframe. The initial json file list is 50k files long. What I have built so far is the function below, which does the job, but it takes so much time that I have to split the file list into 5k-file bins and run each one separately, and each bin takes 30 minutes. I am limited to a 3-node cluster in Databricks.
Any chance that you could improve the efficiency of my function? Thanks in advance.
### Create a big dataframe including all json files ###
def jsons_to_pdf(all_paths):
    # Create an empty pandas dataframe (it is defined only with column names)
    pdf = create_initial_pdf(samplefile)
    # Append each row into the above dataframe
    for path in all_paths:
        # Create a spark dataframe
        sdf = sqlContext.read.json(path)
        # Create two extracted lists of values
        init_values = sdf.select("id", "logTimestamp", "otherTimestamp").rdd.flatMap(lambda x: x).collect()
        id_values = sdf.select(sdf["dataPoints"]["value"]).rdd.flatMap(lambda x: x).collect()[0]
        # Append the concatenated list as a row into the initial dataframe
        pdf.loc[len(pdf)] = init_values + id_values
    return pdf
One json file looks like the following:
And what I want to achieve is to have dataPoints['id'] as new columns and dataPoints['value'] as their values, so as to end up with this:
According to your example, what you want to perform is a pivot and then transform your data into a pandas dataframe.
The steps are:
Collect all your jsons into one big dataframe,
pivot your data,
transform the result into a pandas dataframe.
Try something like this:
from functools import reduce
from pyspark.sql import functions as F

def jsons_to_pdf(all_paths):
    # Create a big dataframe from all the jsons
    sdf = reduce(
        lambda a, b: a.union(b),
        [
            sqlContext.read.json(path)
            for path in all_paths
        ]
    )
    # select and pivot your data
    pivot_df = sdf.select(
        "imoNo",
        "logTimestamp",
        "payloadTimestamp",
        F.explode("datapoints").alias("datapoint")
    ).groupBy(
        "imoNo",
        "logTimestamp",
        "payloadTimestamp",
    ).pivot(
        "datapoint.id"
    ).sum("datapoint.value")
    # convert to a pandas dataframe
    pdf = pivot_df.toPandas()
    return pdf
According to your comment, you can replace the list of files all_paths with a generic path and change the way you create sdf:
all_paths = 'abc/*/*/*'  # 3x *, one for year, one for month, one for day

def jsons_to_pdf(all_paths):
    # Create a big dataframe from all the jsons in a single read
    sdf = sqlContext.read.json(all_paths)
This will surely increase performance.

open all csv in a folder one by one and append them to a dictionary or pandas dataframe

I have data saved as csv files in a folder. I would like to open them and create a single dictionary or dataframe to work with. The files have the same column names but different numbers of rows.
I have tried
big_data = {}
path = '/pathname'
files = glob.glob(path + "/*.csv")
for l in files:
    data = pd.read_csv(l, index_col=None, header=0)
    big_data.append(data)
df = pd.DataFrame.from_dict(big_data)
but the result is not good at all.
Can anyone give me a hint about what I am doing wrong?
You should use a list and concat:
big_data = []
path = '/pathname'
files = glob.glob(path + "/*.csv")
for l in files:
    data = pd.read_csv(l, index_col=None, header=0)
    big_data.append(data)
df = pd.concat(big_data)
The problem with the from_dict approach is that it expects the keys to be either indices or columns, but in your case they are DataFrame objects, which is incorrect.
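If a dictionary keyed by filename is genuinely what you want to keep around, a hedged sketch (assuming the csv files live under /pathname as above): pd.concat also accepts a mapping, and uses the keys as the outer level of the resulting MultiIndex.
import glob
import os
import pandas as pd

path = '/pathname'
files = glob.glob(path + "/*.csv")

# a dict of {filename: DataFrame}; pd.concat turns the keys into the outer index level
big_data = {os.path.basename(f): pd.read_csv(f, header=0) for f in files}
df = pd.concat(big_data)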
